Palo Alto Networks – Active/Active HA Cluster not syncing sessions

I configured an active/active cluster of two PA-5220 firewalls in routed mode (dynamic routing with OSPF), located in different datacenters. The problem was that the firewalls did not sync their session tables with each other. I checked the HA links several times, and everything was configured correctly.

To debug the problem, I looked at the cluster state on the CLI:

admin@node1(active-primary)> show high-availability state-synchronization 

--------------------------------------------------------------------------------
State Synchronization Status: Complete
--------------------------------------------------------------------------------
state synchronization to peer device enabled: yes  
--------------------------------------------------------------------------------
state synchronization messages processed since system up

message                          enable   version  sent             received        
--------------------------------------------------------------------------------
session setup                    yes      9        24036024         810298          
session teardown                 yes      9        24296403         822660          
session update                   yes      9        117885229        5838204         
predict session add              yes      9        33302            1581            
predict session delete           yes      9        32947            1168            
predict session update           yes      9        15960            284             
ARP update                       no       1        0                0               
ARP delete                       no       1        0                0               
MAC update                       no       1        0                0               
MAC delete                       no       1        0                0               
IPSec sequence number update     yes      3        0                0               
ND update                        no       1        0                0               
ND delete                        no       1        0                0               
DoS Aggregate entry update       yes      1        0                0               
DoS Class Tbl IP update          yes      1        0                0               
DoS Class Tbl IP delete          yes      1        0                0               
DoS Block Tbl IP update          yes      1        0                0               
DoS Block Tbl IP delete          yes      1        0                0               
A/A session setup                yes      9        24038236         810298          
A/A session statistics           yes      9        0                0               
A/A packet forward using HA2     yes      9        0                0               
Return MAC Update                yes      1        0                0               
Return MAC Delete                yes      1        0                0               
V6 Return MAC Update             yes      1        0                0               
V6 Return MAC Delete             yes      1        0                0               
HA2 monitor message              yes      1        489636           488960          
predict session modify           yes      9        0                0               
--------------------------------------------------------------------------------

You can see that the firewall sets up and updates sessions, and that it has both sent and received synchronization messages.

But if you look at the global counters, you will see the following:

admin@node2(active-secondary)> show counter global filter severity error

Global counters:
Elapsed time since last sampling: 49.87  seconds

name                                   value     rate severity  category  aspect    description
--------------------------------------------------------------------------------
flow_rcv_dot1q_tag_err                    54        0 drop      flow      parse     Packets dropped: 802.1q tag not configured
flow_no_interface                         54        0 drop      flow      parse     Packets dropped: invalid interface
flow_policy_nofwd                          5        0 drop      flow      session   Session setup: no destination zone from forwarding
flow_tcp_non_syn_drop                  37977        0 drop      flow      session   Packets dropped: non-SYN TCP without session match
flow_fwd_l3_ttl_zero                    1953        0 drop      flow      forward   Packets dropped: IP TTL reaches zero
flow_fwd_l3_noroute                       35        0 drop      flow      forward   Packets dropped: no route
flow_fwd_l3_noarp                          7        0 drop      flow      forward   Packets dropped: no ARP
flow_fwd_zonechange                     1171        0 drop      flow      forward   Packets dropped: forwarded to different zone
flow_fwd_notopology                     6292        0 drop      flow      forward   Packets dropped: no forwarding configured on interface
flow_xmt_platform_encap_err              426        0 drop      flow      offload   Packets dropped: Platform encapsulation error
flow_predict_hash_insert_failure        1429        0 error     flow      pktproc   Predict session has insert failure
flow_host_decap_err                       15        0 drop      flow      mgmt      Packets dropped: decapsulation error from control plane
flow_host_service_deny                  9399        0 drop      flow      mgmt      Device management session denied
flow_fpga_ingress_exception_err      1314289        7 drop      flow      offload   Packets dropped: receive ingress exception error from offload processor
flow_fpga_egress_exception_err          1457        0 drop      flow      offload   Packets dropped: receive egress exception error from offload processor
flow_fpp_sess_bind_ack_flow_state_error      1850        0 drop      flow      offload   FPP Sess bind ACK flow state verification error
ctd_filter_decode_failure_zip             30        0 error     ctd       pktproc   Number of decode filter failure for zip
ctd_filter_decode_failure_qpdecode         4        0 error     ctd       pktproc   Number of decode filter failure for qpdecode
ha_err_xmt_l2                             52        0 error     ha        system    HA sync transmit error: link layer info unavailable
ha_err_state                           11536        0 error     ha        system    Packets dropped: invalid HA state
ha_err_decap                          463971        2 error     ha        system    Packets dropped: HA message decapsulation error
ha_err_decap_intf                       1816        0 error     ha        system    Packets dropped: HA message decapsulation error because interface not match
ha_err_decap_proto                    462155        2 error     ha        system    Packets dropped: HA message protocol decapsulation error
ha_err_msg_payload                 154264121      958 error     ha        system    Packets dropped: HA message payload processing error
ha_err_session_update               51758863      201 error     ha        system    Packets dropped: HA session update error
ha_aa_pktfwd_err_rcv_no_interface       4169        0 drop      ha        aa        Active/Active: packets received on the non-configured local interface
--------------------------------------------------------------------------------
Total counters shown: 26
--------------------------------------------------------------------------------

The counter "ha_err_msg_payload" had a rate of 958 per second.

ha_err_msg_payload                 154264121      958 error     ha        system    Packets dropped: HA message payload processing error

Description: "Packets dropped: HA message payload processing error". I found nothing about this error on the Palo Alto Networks pages, so I had to research it on my own…
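A listing this long is easier to scan with a small script. The following is only a sketch: it assumes the column layout shown in the output above (name, value, rate, severity, category, aspect, description), and the names `GlobalCounter`, `parse_counters` and `rising_ha_errors` are made up for this example and are not part of any Palo Alto tooling. It picks out the counters in the "ha" category whose rate is above zero, i.e. the ones that are still increasing.

```python
from typing import List, NamedTuple


class GlobalCounter(NamedTuple):
    name: str
    value: int
    rate: int
    severity: str
    category: str
    aspect: str
    description: str


def parse_counters(text: str) -> List[GlobalCounter]:
    """Parse pasted 'show counter global' output into records.

    Lines that do not look like counter rows (headers, separators,
    summary lines) are skipped.
    """
    counters = []
    for line in text.splitlines():
        parts = line.split(None, 6)  # the description may contain spaces
        if len(parts) < 7:
            continue
        name, value, rate, severity, category, aspect, description = parts
        if not (value.isdigit() and rate.isdigit()):
            continue
        counters.append(
            GlobalCounter(name, int(value), int(rate), severity,
                          category, aspect, description.strip())
        )
    return counters


def rising_ha_errors(counters: List[GlobalCounter]) -> List[GlobalCounter]:
    """Return HA counters with a non-zero rate, highest rate first."""
    rising = [c for c in counters if c.category == "ha" and c.rate > 0]
    return sorted(rising, key=lambda c: c.rate, reverse=True)


# A few rows copied from the output above.
SAMPLE = """\
name                                   value     rate severity  category  aspect    description
--------------------------------------------------------------------------------
flow_fpga_ingress_exception_err      1314289        7 drop      flow      offload   Packets dropped: receive ingress exception error from offload processor
ha_err_state                           11536        0 error     ha        system    Packets dropped: invalid HA state
ha_err_decap                          463971        2 error     ha        system    Packets dropped: HA message decapsulation error
ha_err_msg_payload                 154264121      958 error     ha        system    Packets dropped: HA message payload processing error
ha_err_session_update               51758863      201 error     ha        system    Packets dropped: HA session update error
"""

if __name__ == "__main__":
    for counter in rising_ha_errors(parse_counters(SAMPLE)):
        print(f"{counter.name:<28}{counter.rate:>6}/s  {counter.description}")
```

Sorting by rate rather than by total value puts the counters that are failing right now at the top, which is how "ha_err_msg_payload" stood out here.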

The solution …

After a while I noticed that some sessions were being synchronized. What was the difference?

At both sites I have LACP port channels with subinterfaces for the different VLANs. As usual, I give the subinterface for, e.g., VLAN 500 the interface ID 500, i.e. AE1.500. But in my scenario the two datacenters used different VLANs for the same security zone. So on Node1 the interface name was AE1.500 (VLAN 500), and in the other datacenter on Node2 it was AE1.1500 (VLAN 1500). We have a VRF-lite setup for each security zone; VLAN 500 in datacenter 1 is in the same VRF as VLAN 1500 in datacenter 2. I thought that session table matching was based only on the security zone, but in active/active mode the interface name must be identical on both sites! The VLAN ID can differ, but the interface name must be the same. The network where session synchronization worked happened to have the same VLAN on both sites and therefore also the same subinterface ID. After changing the subinterface ID, the sync works perfectly.
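The check I effectively did by hand can be sketched in a few lines. This is an illustration only: it assumes you have already collected, per node, which subinterface names belong to which security zone (for example from the running configuration). The zone names "dmz" and "inside", the interface AE1.100 and the function `find_name_mismatches` are invented for the example; only AE1.500 and AE1.1500 come from my setup.

```python
from typing import Dict, Iterable, List, Tuple


def find_name_mismatches(
    node1: Dict[str, Iterable[str]],
    node2: Dict[str, Iterable[str]],
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Compare the interface names per security zone on two HA peers.

    Returns, for each zone whose interface names differ, a tuple of
    (names only on node1, names only on node2).
    """
    mismatches = {}
    for zone in sorted(set(node1) | set(node2)):
        names1 = set(node1.get(zone, ()))
        names2 = set(node2.get(zone, ()))
        if names1 != names2:
            mismatches[zone] = (sorted(names1 - names2),
                                sorted(names2 - names1))
    return mismatches


# Hypothetical zone-to-interface mappings modelled on the setup above.
NODE1 = {"dmz": ["AE1.500"], "inside": ["AE1.100"]}
NODE2 = {"dmz": ["AE1.1500"], "inside": ["AE1.100"]}

if __name__ == "__main__":
    for zone, (only1, only2) in find_name_mismatches(NODE1, NODE2).items():
        print(f"zone {zone}: node1 has {only1}, node2 has {only2}")
```

The comparison deliberately ignores the VLAN tag and looks only at the interface name, because the name is what had to match in my case.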

2 thoughts on "Palo Alto Networks – Active/Active HA Cluster not syncing sessions"

Victor Mira
January 31, 2020 at 11:04
Hello Maximilian,
I know this is an old forum, but I’m facing a similar issue and I was wondering how you managed to "not synchronise" the VLAN ID on the same subinterface. According to Palo Alto, all info on the interface except the IP address is synced between both members in an A/A deployment (https://docs.paloaltonetworks.com/pan-os/8-1/pan-os-admin/high-availability/reference-ha-synchronization/what-settings-dont-sync-in-activeactive-ha)
It would be helpful, as we’re deploying a Cisco SD-Access solution where this VLAN ID "non-sync" would be very useful.
Thanks in advance,
Víctor.
Ibi
April 13, 2020 at 07:30
Thank you for this post! I was having the same issues as you, and only after reading your post was I finally able to fix them 🙂