vPC: the gotchas you need to know

Hi Guys

Having spent a lot of time with customers working on vPC deployments, I have found quite a few vPC gotchas that I want to share with you now. There are plenty of guides out there on the internet, including from Cisco themselves, but I have found a lot of them to be dated, as improvements are constantly being made to vPC.

This blog post addresses vPC considerations for the following version:

NX-OS 6.0(2) on Nexus 7000 Hardware


Now we have that out of the way :)

So, if you don't know what vPC is and have never even looked at the basics of how to configure it, this is not the blog post for you. This blog post assumes you have vPC enabled and are maybe experiencing strange behavior, or you have been through the basics of vPC and are about to deploy but just want to know the gotchas.

Let's start with one vPC design caveat, addressed very well by Brad Hedlund in his blog post.

Layer 3 Considerations

This particular vPC design caveat could end up causing you lots of grief if you are unaware of it.

To understand this caveat, you must understand the following rule:

vPC will not allow traffic that was RECEIVED over a vPC peer-link to be sent out a vPC member port.

This is a loop prevention method; keep that in the back of your head as you read this.


So, let's say you have two Nexus 7Ks. To keep things really simple, say you have two VLANs: VLAN 99 is your server/router VLAN and VLAN 100 is your user VLAN.

You have a router connected to the first Nexus. From a routing point of view it peers with both Nexus switches over VLAN 99.
Your router is not etherchanneled to the Nexus; it's just connected via a normal access port.


You then have a server on a vPC port channel, called vPC 1, which has a configuration like so:

int po1
 switchport access vlan 100
 switchport mode access
!
Pretty simple config, but it will do for what we are trying to show. It is connected to both Nexus switches.
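
As a reader points out in the comments, Po1 only counts as a vPC member port once it carries the vpc statement. A minimal sketch of the fuller configuration, applied on both Nexus switches, might look like this (the member interface eth1/1 and the LACP active mode are assumptions for illustration, not taken from the original setup):

int po1
 switchport
 switchport mode access
 switchport access vlan 100
 ! this statement is what makes Po1 a vPC member port
 vpc 1
!
! eth1/1 is a hypothetical member interface
int eth1/1
 switchport
 switchport mode access
 switchport access vlan 100
 channel-group 1 mode active
!

The vpc number has to match on both peers so that they treat the two port channels as one logical vPC.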

Now, for some reason your router, even though it is physically connected to the primary Nexus, decides to use the secondary Nexus as its next hop for the VLAN 100 subnet. Maybe something happened with the routing protocol on the first Nexus, or it was simply misconfigured from the start. Whatever the case, you have now broken the golden rule of vPC loop prevention I mentioned above.

Think about it: you have a router (let's say 99.1.1.254) trying to get to a server (let's say 10.1.1.2), but its next hop is the Nexus on the other side of the vPC peer-link. The frame arrives on the first Nexus, crosses the peer-link to the second Nexus, and the second Nexus would then need to route it down a vPC MEMBER PORT.


The traffic will be dropped by the loop prevention technology.


There are several solutions to this, most of which are well addressed in Brad Hedlund's document. You could create a VLAN for the router and the two Nexus switches to establish their peer relationship on, and make sure that VLAN is not trunked to any vPC member ports. You could create an entirely separate link between the two Nexus switches to carry the Layer 3 traffic. You could run the router into both chassis and use Layer 3 ports. Lots of options. But if you ever have problems and the routing is not working, go back to that golden rule: am I coming in over the vPC peer-link and then trying to go out a member port?
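
As a rough sketch of the second option, a dedicated routed link between the two Nexus switches could look like the following. The interface number and the 192.0.2.0/30 addressing are hypothetical, and the routing protocol configuration is left out because it depends on your design:

! Nexus 1 (hypothetical interface and addressing)
int eth1/10
 no switchport
 ip address 192.0.2.1/30
 no shutdown
!
! Nexus 2 (hypothetical interface and addressing)
int eth1/10
 no switchport
 ip address 192.0.2.2/30
 no shutdown
!

Because this is a routed port and not the peer-link, traffic that crosses it between the two chassis should not be subject to the vPC loop prevention rule.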

The next Layer 3 caveat is an odd one, but worth talking about. Apparently some SANs out there from EMC and NetApp implement something they call "fast routing", which basically means that whenever they receive a packet from an IP address, they store the source MAC address and IP address combination in their ARP table. By the end of it their ARP table would look something like this:

9.1.1.1 aaaa.bbbb.cccc
9.2.2.2 aaaa.bbbb.cccc
3.3.3.3 aaaa.bbbb.cccc

Here aaaa.bbbb.cccc is the MAC address of their default gateway. The idea is that the SAN does not have to perform a route lookup/ARP request, which should save it some time. In my humble opinion it would shave maybe a fraction of a millisecond on most modern SAN CPUs, and in return it horribly breaks the RFC (is it acceptable as part of the RFC? Am I dead wrong? It would not be the first time; leave a reply below or ping me on twitter @ccierants).

Anyway, regardless of the merits, this causes problems for the Nexus when used in combination with VRRP. The problem is that with VRRP the default gateway has a VRRP-defined virtual MAC, but the reply that comes back to the NetApp will actually be sourced from the burnt-in MAC address of the Nexus that routed it. This can cause problems! When the NetApp does its lookup in its ARP table, it will send the traffic to that burnt-in MAC. If that MAC belongs to the non-active neighbor (the non-VRRP master) and the frame arrives on the other Nexus, it is switched across the peer-link to reach the owner of the MAC. If the frame is then destined for a vPC member port... guess what, we just broke the golden rule again.

So in order to fix this, Cisco implemented the peer-gateway command. It tells each Nexus 7K to route a frame locally, rather than forwarding it over the vPC peer-link, when the frame is addressed to the MAC address of either Nexus 7K. Easy peasy!

Here is how to configure it. I can't see a single downside to configuring peer-gateway, so I recommend you always turn it on :)


Nexus(config)# vpc domain 1
Nexus(config-vpc-domain)# peer-gateway

Easy :)


OK, on to a few more caveats.

Making changes to your vPC's
This is not strictly an issue with the version of NX-OS we are running in our example, as the feature that stops this from causing problems is turned on by default. However, it is included here in case someone turned it off :)

Let's say you had a simple vPC that looked like this on both switches:

int po1
 mtu 9216
 switchport mode access 
 switchport access vlan 50
!

Simple, easy. But say for some reason you want to change the MTU. This is considered a type 1 mismatch, and as soon as you changed it on one switch, the vPC would be brought down across BOTH Nexus 7Ks!

"What the hell just happened? I was careful and I only changed one port, now my server has gone offline. Since it was etherchannel'd I should have been fine!" <- this is what you would have been saying to yourself prior to NX-OS 5.2, as a feature called "Graceful consistency check" did not exist. With it enabled, a type 1 mismatch only suspends the vPC member ports on the secondary peer, so the primary keeps forwarding. To see if you have graceful consistency check enabled:


Nexus# show vpc
Legend:
                (*) - local vPC is down, forwarding via vPC peer-link

vPC domain id                     : 1  
Peer status                       : peer adjacency formed ok     

... ...
Graceful Consistency Check        : Enabled

Auto-recovery status              : Enabled (timeout = 240 seconds)



If this is not shown as Enabled, trust me, enable it:


Nexus# conf t
Enter configuration commands, one per line.  End with CNTL/Z.

Nexus(config)# vpc domain 1
Nexus(config-vpc-domain)# graceful consistency-check
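
Before touching a parameter on a vPC, it may be worth checking how the two peers currently compare. A minimal sketch, assuming the port channel is Po1 as in the example above:

Nexus# show vpc consistency-parameters global
Nexus# show vpc consistency-parameters interface port-channel 1

The output lists each parameter with its type (1 or 2) alongside the local and peer values, which should make it easier to see in advance which changes would count as a type 1 mismatch.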

OK excellent let's keep going :)

The next thing to talk about quickly is the difference between the peer-link and the peer-keepalive link.

The peer-link is an important part of the vPC puzzle; the peer-keepalive link is actually not so important. You could unplug the peer-keepalive link and your vPC peers would continue to function quite happily. You would get messages that the peer keepalive had failed, but you would be able to continue working. In previous NX-OS releases you would have been unable to make configuration changes, but this is not the case anymore.

What the peer-keepalive does do, however, is prevent a split-brain scenario in the event that your peer-link fails. If your peer-links die but the peer chassis itself remains up, you will get a message like so:

 Nexus %ETHPORT-3-IF_ERROR_VLANS_SUSPENDED: VLANs 110, 99,on Interface port-channel10 are being suspended. (Reason: vPC peer is not reachable over cfs)


This is to prevent loops: all vPC member ports are shut down on the secondary vPC peer.
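
For reference, the keepalive and the primary/secondary roles are both set under the vPC domain. A minimal sketch, in which the addresses, the priority value and the use of the management VRF are assumptions for illustration:

vpc domain 1
 ! the peer with the lower role priority value is preferred as primary
 role priority 100
 ! hypothetical addresses; the keepalive runs over a path separate from the peer-link
 peer-keepalive destination 192.0.2.2 source 192.0.2.1 vrf management
!

To check the current state of the keepalive and which peer holds the primary role:

Nexus# show vpc peer-keepalive
Nexus# show vpc role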

OK, next it is time to talk about the auto-recovery command. This is NOT SET BY DEFAULT in this NX-OS release, although I would argue strongly that it should be.

Let's say for some reason both your Nexus switches have been turned off and you can only bring one of them back. Maybe you had a power outage and only have enough power to bring up one (a UPS on a particular feed has died), or maybe a power spike blew up one chassis and you're waiting for Cisco to deliver the spares in the meantime. Whatever the situation may be, if the end result is that you are turning on one chassis but not the other, you need the auto-recovery command. This command is NOT relevant if you had two Nexus switches up and the power failed to just one of them. When you restored power to that Nexus you would not need to worry about this command: the two Nexus switches would see each other and restore their relationship, and while one of them was offline the vPC would have kept working.

By default, if a Nexus boots with vPC configuration and vPC port channels configured and it cannot see its vPC peer, it will not bring the vPC port channels up!

You can tell the Nexus to wait a certain amount of time after bootup before deciding that, hey, the other Nexus involved in my vPC is not coming back any time soon (he is on an extended lunch break or something), so let's get those vPCs up so we can start forwarding traffic.

Here is how to turn it on:

Nexus(config)# vpc domain 1
Nexus(config-vpc-domain)# auto-recovery
Warning:
 Enables restoring of vPCs in a peer-detached state after reload, will wait for 240 seconds to determine if peer is un-reachable

 

As per the warning, the default time to wait before bringing up the vPCs if you can't see a peer is 240 seconds. This timer can be adjusted with a parameter to the auto-recovery command.
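
A sketch of adjusting the timer, assuming the reload-delay keyword is available in this NX-OS release and using a hypothetical value of 300 seconds:

Nexus(config)# vpc domain 1
Nexus(config-vpc-domain)# auto-recovery reload-delay 300

It is worth using the CLI help (auto-recovery ?) to confirm the exact keyword and the permitted range on your release.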


I will now bring you to one final caveat

Mis-configured port-channel on the end device.

 So you're probably used to the fact that if you enable two interfaces for a port channel using LACP and the other end doesn't have port-channel turned on, or there is some other problem, no worries, right? LACP will just place the port(s) into standalone mode and spanning tree will just choose an active path.

Unfortunately, with the Nexus there is no such thing as standalone by default: the port is either up in the port channel or it is suspended, as the following output shows:



Nexus# show port-channel sum
Flags:  D - Down        P - Up in port-channel (members)
        I - Individual  H - Hot-standby (LACP only)
        s - Suspended   r - Module-removed
        S - Switched    R - Routed
        U - Up (port-channel)
        M - Not in use. Min-links not met
--------------------------------------------------------------------------------
Group Port-       Type     Protocol  Member Ports
      Channel
--------------------------------------------------------------------------------
6     Po6(SD)     Eth      LACP      Eth1/1(s)  


Nexus# show int eth1/1
Ethernet1/1 is down (suspended(no LACP PDUs))

Easy to fix:

Nexus(config)# int po1
Nexus(config-if)# lacp ?
  max-bundle            Configure the port-channel max-bundle
  min-links             Configure the port-channel min-links
  suspend-individual    Configure lacp port-channel state. Disabling this will
                        cause lacp to put the port to individual state and not
                        suspend the port in case it does not get LACP BPDU from
                        the peer ports in the port-channel 


So if we enter:


Nexus(config-if)# shut
Nexus(config-if)# no lacp suspend-individual
Warning: !! Disable lacp suspend-individual only on port-channel with edge ports. Disabling this on network port port-channel could lead to loops.! 

Nexus(config-if)# no shut

As per the warning, guys, this could cause you HUGE problems if you enable it on a port that is part of a vPC, so I would only use no lacp suspend-individual on ports that are not part of a vPC port channel. (In which case, why are you port channeling? Why don't you just fix the fact that the other end is not doing port-channel, or just remove the port channel config from the Nexus?)

I hope this helps someone out there!






35 comments:

  1. Great Article. I think it just bailed me out of trouble.

    Jared Scrivener CCIE3 #16983

  2. This acts like a Mythbuster in my mind... god, I was thinking all wrong about vPC so far. Excellent post, Peter.

  3. Guys thanks for your kind words I am extremely happy to hear I helped!

  4. Hi Peter,

    Awesome article, nice to see other people are experiencing the same gotchas as me :). There are also some other nice-to-know items/gotchas:

    1. In a vPC setup there is a vPC primary and a vPC secondary. If the peer-link fails (but the keepalive is up), the vPC secondary will shut down all its vPC interfaces until the peer-link is available again. You can control which peer is primary and which is secondary with the "role priority" command under the vPC domain.

    2. If for some reason you configured "ip arp timeout" on an SVI to battle unicast flooding before migrating to the N7K, it can cause problems with the defaults on the N7K. There is a mechanism in place that is already configured with the best timers. The problem shows up as dropped packets in inter-VLAN routing.

    3. As we know, with an FHRP both vPC peers will forward traffic sent to the virtual MAC address, so the standby peer also acts as a gateway. You can also make the vPC pair act as one STP root bridge, rather than two, with the "peer-switch" command. Be sure that both vPC peers have the same priority configured.

    4. Be sure that each vPC pair uses a separate vPC domain ID. If you have multiple vPC pairs, for example in different DCs, the vPC domain ID can traverse your OTV/L2 extension and give a nagging error between the two sites.

    5. This is a very interesting one: though it is not recommended to run L3 over the peer-link, as you have pointed out, strictly speaking you can. There is a command to exclude a VLAN from the vPC peer-link, so the VLAN will traverse the link but will not partake in the vPC: "peer-gateway exclude-vlan "

    6. Last one: single-attached hosts on the N7K should be connected to the vPC primary device (as mentioned in 1.). The VLAN that the single-attached host is on should not be a vPC VLAN, because of the possibility of drops for packets traveling over the peer-link.

    That's my 50 cents :)

    Regards,

    Gustav

    Replies
    1. Hmm, I think number 5 is wrong. I believe it would only exclude that vlan from the peer-gateway functionality, but not from the vPC itself.

      Nexus5548-DC(config-vpc-domain)# peer-gateway ?

      exclude-vlan Specify VLANs to be excluded from peer-gateway functionality

  5. Hey Peter,

    Would you please elaborate about not using 'no lacp suspend-individual' on vPC interfaces?

    My reading of the warning is that it's only safe to use on "edge" ports, rather than switch-to-switch links.

    I don't see the risk of using it on vPC interfaces in general.

    I can think of several server vPC use cases:
    - migrating servers from active/standby catalysts to LACP Nexus - sometimes those server guys just don't keep up, why not let them run as individual ports?
    - servers that would like to boot via PXE (individual), then pull down a kernel that runs LACP.
    - New server installs often require network connectivity before running LACP is possible.

    So, what do you think? Configuring a vPC to a server, but allowing it to run as active/standby individuals doesn't jump out at me as a huge problem.

  6. I have a question in regards to the loop prevention mechanism.

    "vPC will not allow traffic that was RECEIVED over a VPC peer-link to be sent out a vPC member port."

    Will vPC send traffic that was RECEIVED over a VPC peer-link to be sent out of an ORPHAN port?

    Replies
    1. Yes it will as stated in http://www.cisco.com/en/US/docs/switches/datacenter/sw/design/vpc_design/vpc_best_practices_design_guide.pdf

  7. Hello Peter,
    can you please explain the exception to the "golden vPC rule", which reads as follows: "The only exception to this rule occurs when vPC member port goes down. vPC peer devices exchange member port states and reprogram in hardware the vPC loop avoidance logic for that particular vPC. The peer-link is then used as backup path for optimal resiliency" (taken from the January 2013 revision of http://www.cisco.com/en/US/docs/switches/datacenter/sw/design/vpc_design/vpc_best_practices_design_guide.pdf)? Let's assume that the egress port for a frame received via the vPC peer-link on the receiving vPC peer is neither an orphan port nor an L3 port/SVI. Then this port is a vPC member port. The question then arises why the vPC peer sending the frame over the peer-link didn't forward it via its own local vPC member port to the destination, right? So the exception should be extended as follows: "The only exception to this rule occurs when the vPC member port towards the source of the frame on the receiving vPC peer goes down while the vPC member port towards the destination on the sending vPC peer goes down as well". Is that correct, or are there other possible scenarios for this behaviour?
    Thank you

  8. Hi Peter, while trying to troubleshoot a Cisco network / HP blade chassis problem, I stumbled across this post and thought I would try and pick your brains.

    I am not a Cisco guy... We are having issues with vPC/LACP (I believe).

    We have a blade chassis connecting to two Nexus switches. There are two 10Gb uplinks from the blade chassis to the two Nexus switches (cross-switch vPC).

    1. When we try to connect from a Microsoft Lync client to the Lync server and the server and client are on the same VLAN, we have no issues (even with two uplinks configured in an LACP trunk).

    2. When we move the client to another VLAN (which the network team calls a routed VLAN), we see intermittent issues connecting to the Lync server when we have two links in the LACP trunk.

    As soon as we remove one uplink, or disable it from the Cisco switch, the client on a different VLAN starts to work (log in) absolutely fine with one uplink in the picture.

    I am very confused about where the problem could lie.

    Thanks

    Replies
    1. Hmm, that's interesting. One would have to look at the configurations of both your Nexus switches and know the IP and maybe the MAC of both the Lync client and the Lync server.

      I had a similar issue with a customer once, but in the end there was nothing wrong with the vPC config; instead, there were some static MAC entries in the CAM of the secondary switch, which were causing the 'intermittent' problems. Check your static mappings.

  9. Great Article Peter..Keep posting. :)

  10. Great article! Congrats on spending some of your time helping others!

  11. Hi Peter,

    Congrats, this is really good work!
    One thing to add as well, since I just stumbled upon it and it might help others...

    It is related to the golden vPC rule. Let's say we have a topology where a server (Server1) is connected via LACP to N7K1 (vPC primary) and N7K2 (vPC secondary) and uses its default gateway (an HSRP IP on an SVI of a vPC VLAN on both N7Ks) to reach Server2, which is reachable over another SVI (also of a vPC VLAN) on the N7Ks. If you administratively shut down the SVI on N7K2 and traffic from Server1 is sent to N7K2 (due to the LACP hashing result), the traffic has to be sent over the peer-link from N7K2 to N7K1. N7K1 will then drop the traffic, since the destination IP is not its own and N7K2 still has its own vPC member ports UP.

    Bottom line: I will never shut down an SVI again on only one of the vPC peers, at least not light-heartedly!
    Is there any way to overcome this? I can't think of any at the moment.

    Replies
    1. Confirmed bug: CSCtj94130

    2. Hi Gizas,
      Thanks for this point. This also relates to the vPC best practice of not using HSRP/VRRP object tracking with vPC. With object tracking, one vPC peer switch can shut down the SVI, and traffic blackholing can happen for the same reason you stated.

      And thanks to Anonymous for the bug ID. Cisco will not fix this bug since Cisco doesn't consider it a bug. There are also statements such as "both VPC peer should keep the same SVI operate states".

      Personally, I think Cisco should modify the loop avoidance mechanism.
      Current: a frame received from the peer-link with the loop avoidance bit set will not be forwarded to vPC member ports. It can be forwarded to orphan ports and to a single-attached vPC member port (where the vPC member port on the other switch is down).
      Suggested rule: the frame should also be forwarded if the destination MAC matches the switch's local MAC. Such frames are most likely frames that need to be routed. This would also resolve the "no routing across vPC" rule.

  12. Hi Peter,

    As usual, thanks for sharing your thoughts. I have one small, but important correction:

    Where you talk about:
    "You then have a server which is on a vPC port channel, called vPC 1, vPC 1 has a configuration like so:
    int po1
    switchport access vlan 100
    switchport mode access"

    You actually forgot to include the "vpc 1" command. Without it, the vPC check / loop prevention mechanism would not work as you explain, because Po1 would NOT be a vPC member port. It would be an orphan port, which is not suspended by default when a failure in the remote peer occurs.

  13. Great Article Peter, keep them coming :)

  14. Hi Peter, excellent article, I have the following question:

    Can a FEX have uplinks to different vPC domains? I have a FEX that has 2 parent switches (2 5548s that form a vPC domain). Can I form uplinks from the same FEX to another pair of 5Ks with a different vPC domain ID? Or will that create a parent-switch conflict?

    Thanks for your help in advance!

  15. Thanks Pete, you've just resolved my issue: packet loss when vPC'd from two Nexus 93120TX to Solaris T5-2s, probably due to packets traversing the peer-link. The peer-gateway command resolved the problem.

    Thanks again.

  16. Thanks Peter.

    Another wrinkle to consider: The "lacp suspend-individual" configuration is already the default on 7k (v 6.2(10)), YET, opposite on 5k (v 7.0(6)N1(1)), where “no lacp suspend-individual” is the default.

    I don't know why Cisco would make these parameters different on these platforms. I've experienced this myself too when configuring VPCs on 5ks. This applies to both VPCs and NON-VPC port-channels on 5ks.

  18. Great article.
    Maybe someone can solve a problem for me involving vPCs and private VLANs.
    At my company we use vPCs between Nexus 5k and ASR9k. The vPC port channel is configured as a trunk with VLANs allowed.
    Now we want to change it to switchport mode private-vlan trunk and switchport private-vlan trunk allowed vlan.
    I have done some testing: when removing the configuration and pasting the private-vlan config, there is an outage.
    When shutting the peer-link on the standby node there is downtime, and with shutting the peer-link and tracking the vPC there is downtime too.
    All the techniques like "Graceful Consistency Check" and "lacp suspend-individual" are in place.

    Is there a way to change this configuration without outage? Anybody? Thank you.

  20. One comment from my side...

    It looks like using PBR on an SVI removes the loop avoidance bit.

    I tested with two N5Ks which are connected to two N7Ks. vPC is configured on the N7Ks with two SVIs in VLANs 100 and 200, but without an FHRP. There is no vPC on the N5Ks; each N5K just has uplinks to the N7Ks (the vPC is on the N7Ks).

    N5k1 has an SVI in VLAN 100 with N7k1 as its default gateway, and N5k2 has an SVI in VLAN 200 with its default gateway on N7k2.

    Using sh port-channel load-balance forwarding.... I discovered which link N5k1 (from VLAN 100) uses when pinging N5k2 (VLAN 200), and shut down a few uplinks, but the vPC on the N7Ks was still up (there was still a live uplink to both N7Ks). So pings from N5k1 to N5k2 now go to N7k1.

    On N7k1, on SVI 100, I enabled PBR and created a route-map with set ip next-hop pointing to N7k2. I also enabled peer-gateway.

    And guess what?!

    When I did a traceroute from N5k1 to the IP of N5k2, traffic went to N7k1 (VLAN 100), then N7k2 (VLAN 100), and then to the N5k2 SVI (VLAN 200). It looks like the "golden rule" was broken...

  21. Back in 1998 IT certification organizations began a trend that changed the direction for employers, employees and training institutions. Employers started requesting these certs, employees began the almost chaotic chase of a myriad of acronyms and training institutions started the pocket lining process of selling, influencing and in some instances misleading potential students.

  22. Excellent Explanations! Thank you :)
