I configured an active/active cluster of two PA-5220 firewalls in routed mode (dynamic routing with OSPF), located in different datacenters. The problem was that the firewalls did not synchronize their session tables with each other. I checked the HA links several times, and everything was configured correctly.
To debug the problem, I looked at the cluster state on the CLI:
admin@node1(active-primary)> show high-availability state-synchronization
--------------------------------------------------------------------------------
State Synchronization Status: Complete
--------------------------------------------------------------------------------
state synchronization to peer device enabled: yes
--------------------------------------------------------------------------------
state synchronization messages processed since system up
message                         enable  version  sent        received
--------------------------------------------------------------------------------
session setup                   yes     9        24036024    810298
session teardown                yes     9        24296403    822660
session update                  yes     9        117885229   5838204
predict session add             yes     9        33302       1581
predict session delete          yes     9        32947       1168
predict session update          yes     9        15960       284
ARP update                      no      1        0           0
ARP delete                      no      1        0           0
MAC update                      no      1        0           0
MAC delete                      no      1        0           0
IPSec sequence number update    yes     3        0           0
ND update                       no      1        0           0
ND delete                       no      1        0           0
DoS Aggregate entry update      yes     1        0           0
DoS Class Tbl IP update         yes     1        0           0
DoS Class Tbl IP delete         yes     1        0           0
DoS Block Tbl IP update         yes     1        0           0
DoS Block Tbl IP delete         yes     1        0           0
A/A session setup               yes     9        24038236    810298
A/A session statistics          yes     9        0           0
A/A packet forward using HA2    yes     9        0           0
Return MAC Update               yes     1        0           0
Return MAC Delete               yes     1        0           0
V6 Return MAC Update            yes     1        0           0
V6 Return MAC Delete            yes     1        0           0
HA2 monitor message             yes     1        489636      488960
predict session modify          yes     9        0           0
--------------------------------------------------------------------------------
You can see that the firewall sets up and updates sessions, and that it has both sent and received synchronization messages.
But if you look at the global counters, you will see the following:
admin@node2(active-secondary)> show counter global filter severity error
Global counters:
Elapsed time since last sampling: 49.87 seconds
name                                     value      rate  severity category aspect   description
--------------------------------------------------------------------------------
flow_rcv_dot1q_tag_err                   54         0     drop     flow     parse    Packets dropped: 802.1q tag not configured
flow_no_interface                        54         0     drop     flow     parse    Packets dropped: invalid interface
flow_policy_nofwd                        5          0     drop     flow     session  Session setup: no destination zone from forwarding
flow_tcp_non_syn_drop                    37977      0     drop     flow     session  Packets dropped: non-SYN TCP without session match
flow_fwd_l3_ttl_zero                     1953       0     drop     flow     forward  Packets dropped: IP TTL reaches zero
flow_fwd_l3_noroute                      35         0     drop     flow     forward  Packets dropped: no route
flow_fwd_l3_noarp                        7          0     drop     flow     forward  Packets dropped: no ARP
flow_fwd_zonechange                      1171       0     drop     flow     forward  Packets dropped: forwarded to different zone
flow_fwd_notopology                      6292       0     drop     flow     forward  Packets dropped: no forwarding configured on interface
flow_xmt_platform_encap_err              426        0     drop     flow     offload  Packets dropped: Platform encapsulation error
flow_predict_hash_insert_failure         1429       0     error    flow     pktproc  Predict session has insert failure
flow_host_decap_err                      15         0     drop     flow     mgmt     Packets dropped: decapsulation error from control plane
flow_host_service_deny                   9399       0     drop     flow     mgmt     Device management session denied
flow_fpga_ingress_exception_err          1314289    7     drop     flow     offload  Packets dropped: receive ingress exception error from offload processor
flow_fpga_egress_exception_err           1457       0     drop     flow     offload  Packets dropped: receive egress exception error from offload processor
flow_fpp_sess_bind_ack_flow_state_error  1850       0     drop     flow     offload  FPP Sess bind ACK flow state verification error
ctd_filter_decode_failure_zip            30         0     error    ctd      pktproc  Number of decode filter failure for zip
ctd_filter_decode_failure_qpdecode       4          0     error    ctd      pktproc  Number of decode filter failure for qpdecode
ha_err_xmt_l2                            52         0     error    ha       system   HA sync transmit error: link layer info unavailable
ha_err_state                             11536      0     error    ha       system   Packets dropped: invalid HA state
ha_err_decap                             463971     2     error    ha       system   Packets dropped: HA message decapsulation error
ha_err_decap_intf                        1816       0     error    ha       system   Packets dropped: HA message decapsulation error because interface not match
ha_err_decap_proto                       462155     2     error    ha       system   Packets dropped: HA message protocol decapsulation error
ha_err_msg_payload                       154264121  958   error    ha       system   Packets dropped: HA message payload processing error
ha_err_session_update                    51758863   201   error    ha       system   Packets dropped: HA session update error
ha_aa_pktfwd_err_rcv_no_interface        4169       0     drop     ha       aa       Active/Active: packets received on the non-configured local interface
--------------------------------------------------------------------------------
Total counters shown: 26
--------------------------------------------------------------------------------
The counter „ha_err_msg_payload“ had a rate of 958 per second:
ha_err_msg_payload 154264121 958 error ha system Packets dropped: HA message payload processing error
Description: „Packets dropped: HA message payload processing error“. I found nothing about this error on the Palo Alto Networks pages, so I had to research it on my own…
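Picking the noisy HA counters out of a long listing by eye is tedious. The following is a minimal Python sketch, not anything from Palo Alto, that parses saved text output of "show counter global" and lists the counters of category "ha" whose rate is above a threshold. It assumes one counter per line with the seven columns shown above; the names parse_counters, busy_ha_errors and the min_rate parameter are made up for this example.

```python
import re
from typing import NamedTuple


class Counter(NamedTuple):
    name: str
    value: int
    rate: int
    severity: str
    category: str
    aspect: str
    description: str


# One counter row: name, value, rate, severity, category, aspect, description.
# Header and separator lines do not match because value and rate must be digits.
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.*)$")


def parse_counters(text):
    """Parse counter rows from saved CLI text, skipping anything else."""
    counters = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            name, value, rate, sev, cat, aspect, desc = match.groups()
            counters.append(
                Counter(name, int(value), int(rate), sev, cat, aspect, desc)
            )
    return counters


def busy_ha_errors(counters, min_rate=1):
    """Return HA-category counters with rate >= min_rate, highest rate first."""
    hits = [c for c in counters if c.category == "ha" and c.rate >= min_rate]
    return sorted(hits, key=lambda c: c.rate, reverse=True)


# A few rows copied from the output above.
SAMPLE = """\
name value rate severity category aspect description
flow_tcp_non_syn_drop 37977 0 drop flow session Packets dropped: non-SYN TCP without session match
ha_err_decap 463971 2 error ha system Packets dropped: HA message decapsulation error
ha_err_msg_payload 154264121 958 error ha system Packets dropped: HA message payload processing error
ha_err_session_update 51758863 201 error ha system Packets dropped: HA session update error
"""

for c in busy_ha_errors(parse_counters(SAMPLE), min_rate=100):
    print(f"{c.name}: {c.rate}/s ({c.description})")
```

With the sample rows and a threshold of 100 per second, this flags ha_err_msg_payload and ha_err_session_update, the two counters that were climbing fastest on node2.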
The solution …
After a while I noticed that some sessions were being synchronized. What was the difference?
On both sites I have LACP port channels with subinterfaces for the different VLANs. As usual, I give each subinterface the same ID as its VLAN, e.g. VLAN 500 gets the interface ID 500 => AE1.500. But in my scenario I had different VLANs in the two datacenters for the same security zone. So on Node1 the interface name was AE1.500 (VLAN 500), and in the other datacenter on Node2 it was AE1.1500 (VLAN 1500). We have a VRF-lite setup for each security zone; VLAN 500 in datacenter1 is in the same VRF as VLAN 1500 in datacenter2. I thought that session table matching was based only on the security zone, but in active/active mode the interface name must be identical on both sites! The VLAN ID can differ, but the interface name must be the same. The network where session synchronization worked happened to have the same VLAN on both sites and therefore also the same subinterface ID. After changing the subinterface ID so that it matched on both nodes, the sync worked perfectly.
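To catch this kind of mismatch before it shows up as sync errors, you can compare the interface names per security zone on both nodes. Here is a minimal Python sketch under stated assumptions: the zone-to-interface mappings are typed in by hand (it does not query the firewalls), the zone names are hypothetical, and mismatched_interfaces is just a helper name chosen for this example.

```python
def mismatched_interfaces(node1, node2):
    """Return zones whose interface name differs between the two peers.

    node1 and node2 map a security zone name to the subinterface name
    used for that zone on the respective firewall. A zone missing on
    one side is reported with None for that side.
    """
    mismatches = {}
    for zone in sorted(node1.keys() | node2.keys()):
        name1 = node1.get(zone)
        name2 = node2.get(zone)
        if name1 != name2:
            mismatches[zone] = (name1, name2)
    return mismatches


# Hypothetical zone names; the interface names follow the scenario above.
node1 = {"zone-a": "ae1.500", "zone-b": "ae1.600"}
node2 = {"zone-a": "ae1.1500", "zone-b": "ae1.600"}

for zone, (name1, name2) in mismatched_interfaces(node1, node2).items():
    print(f"{zone}: node1 uses {name1}, node2 uses {name2}")
```

In this example only zone-a is reported, which corresponds to the AE1.500 / AE1.1500 pair that broke session sync in my setup.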
Hello Maximilian,
I know this is an old forum, but I’m facing a similar issue and was wondering how you managed to „not synchronise“ the VLAN ID on the same subinterface. According to Palo Alto Networks, all interface settings except the IP address are synced between both members in an A/A deployment (https://docs.paloaltonetworks.com/pan-os/8-1/pan-os-admin/high-availability/reference-ha-synchronization/what-settings-dont-sync-in-activeactive-ha)
It would be helpful, as we’re deploying a Cisco SD-Access solution where this VLAN ID „non-sync“ would be very helpful.
Thanks in advance,
Víctor.
Thank you for this post! I was having the same issues as you and only after reading your post, was I able to finally fix my issues 🙂