I recently spent hours trying to work out why certain traffic types were being dropped over VXLAN, particularly wifi traffic traversing the campus-wide network but also certain streaming services. On a hunch I worked out that what the failing traffic had in common was packet size: large packets were being silently dropped.
This was a problem because my testing plan on the PoC used a series of pings, SSH and file transfer to prove traffic went where it should. Everything tested out 100%. Only in production did problems occur. Even then, it was hours after the go-live when users complained. My tests in production proved connectivity using the same methods as the PoC.
The behaviour was strange. Certain Google suite sites worked 100% but some didn’t (e.g. Calendar versus Mail). Even stranger was that sites failed over wifi but were fine on wired. This last point was simply because the wifi uses tunnels between access points and controllers, which eat into the available MTU.
Eventually I worked out that packets above a certain size were being dropped. This could be demonstrated by using ping with various payload sizes. A ping with a payload of 3000 would fail but one with a payload of 300 would always succeed. While at this point I was a long way from success, at least I had a test and a strong clue as to what was going on.
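As a rough sketch of that test, the payload sweep can be scripted. The version below assumes a Linux host with the iputils `ping` (`-c`, `-W` and `-s` flags) and assumes that every size above the failure threshold fails; the function names are my own, not anything from a vendor tool.

```python
import subprocess


def ping_ok(host: str, payload: int) -> bool:
    """Send one echo request with the given payload size; True if a reply came back.

    Assumes Linux iputils ping: -c count, -W reply timeout (s), -s payload bytes.
    """
    cmd = ["ping", "-c", "1", "-W", "1", "-s", str(payload), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def largest_working_payload(probe, low: int = 0, high: int = 9000):
    """Binary search for the largest payload size for which probe(size) is True.

    Assumes the result is monotonic: once a size fails, every larger size fails.
    Returns None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

Usage would be something like `largest_working_payload(lambda size: ping_ok("192.0.2.10", size))`, where the address is a placeholder for a host on the far side of the VXLAN fabric.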
Another strong clue was that there were no issues at all over the underlay network (the basic wired network across the campus that all the UDP VXLAN packets fly across). This is what I expected, since a combination of jumbo frames permitted everywhere and fragmentation meant the network just did its thing. Typically the layer 2 MTU (the size of the Ethernet frame) defaults to jumbo on modern kit (10k on HP Comware) and can be set easily elsewhere (9128 on Aruba CX). That was the case here, with it set on all inter-router links. The second MTU setting in play is the layer 3 MTU, or “IP MTU” in Aruba CX parlance. This is the maximum size of the IPv4 packet and comes into play at subnet boundaries, such as when the initial echo request hits the SVI of the default gateway. On Comware this was fixed at 1500, but on Aruba CX it can be increased up to 9128 (it defaults to 1500).
There are two questions at this point. First is why large pings are not simply fragmented at the first hop and therefore the VXLAN network doesn’t even see large packets. Second is why on my production network this was an issue with jumbo enabled everywhere.
The first of these feels like a bug. Silently dropping packets should never happen. Silently dropping specific types of packets is simply discrimination. It turns out this is not only normal but written into the VXLAN RFC (RFC 7348). From section 4.3:
VTEPs MUST NOT fragment VXLAN packets. Intermediate routers may fragment encapsulated VXLAN packets due to the larger frame size. The destination VTEP MAY silently discard such VXLAN fragments.
OK, so at this point we have a test and a reason why the code would perform the drops. But there was still a problem. In the simple campus network (edge router with the SVI, core router with exit to internet, DC etc.) every uplink/downlink was set for jumbo frames at both L2 and L3. The ping of size 3000 should easily squeeze through without fragmentation.
On closer inspection, the first hop SVI (the default gateway for the laptop) was set with the default IP MTU of 1500, meaning that my large ping would be chopped into three IP fragments, each less than 20% of the MTU of the underlay.
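To see where "three" comes from, here is a small sketch of the fragment arithmetic. It assumes a plain 20-byte IPv4 header with no options, and relies on the fact that fragment data is carried in multiples of 8 bytes; the function name is just for illustration.

```python
IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8   # ICMP echo header


def fragment_count(icmp_payload: int, ip_mtu: int) -> int:
    """Number of IPv4 fragments needed to carry one echo request."""
    data = ICMP_HEADER + icmp_payload              # bytes carried inside the IP packet
    per_fragment = (ip_mtu - IP_HEADER) // 8 * 8   # fragment offsets count in 8-byte units
    return -(-data // per_fragment)                # ceiling division


print(fragment_count(3000, 1500))  # the 3000-byte ping through the default SVI
print(fragment_count(3000, 9100))  # the same ping if the SVI had a jumbo IP MTU
```

With an IP MTU of 1500 each fragment carries at most 1480 bytes of data, so the 3008 bytes of ICMP need three fragments; with an IP MTU of 9100 the same ping fits in one packet.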
It’s worth pointing out that the VXLAN tunnel eats 50 bytes from any MTU. This is on top of what a ping itself uses. For ping on the underlay:
1500 bytes (Ethernet MTU) - 20 bytes (IP header) - 8 bytes (ICMP header) = 1472 bytes
Therefore with VXLAN you would expect the ICMP echo request to work with a payload of 1422 but fail with 1423. This was the case. The 3000 size was too ambitious under failure conditions and even 1423 would cause the silent drops. Recall that by default most implementations of ping use a much smaller payload (32 bytes on Windows, 56 on Linux).
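The same maths as a few lines of Python, in case you want to plug in other MTU values. The 50-byte breakdown assumes an IPv4 underlay and an untagged inner Ethernet frame; a VLAN tag or an IPv6 underlay would change it.

```python
IPV4_HEADER = 20
ICMP_HEADER = 8
# Overhead counted against the underlay IP MTU (assumed IPv4, untagged inner frame):
# outer IPv4 (20) + UDP (8) + VXLAN (8) + inner Ethernet (14)
VXLAN_OVERHEAD = 20 + 8 + 8 + 14


def max_icmp_payload(ip_mtu: int, vxlan: bool = False) -> int:
    """Largest ping payload that fits in one unfragmented packet."""
    budget = ip_mtu - (VXLAN_OVERHEAD if vxlan else 0)
    return budget - IPV4_HEADER - ICMP_HEADER


print(max_icmp_payload(1500))              # 1472 on the plain underlay
print(max_icmp_payload(1500, vxlan=True))  # 1422 once VXLAN takes its 50 bytes
```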
This understanding helps: we can now see that fragmentation occurs at the initial SVI, that only VXLAN-encapsulated packets are dropped, and that this is normal when the encapsulated packet exceeds the MTU of the VXLAN path. However, the uplinks from the gateway have an IP MTU of 9100 and an MTU of 9128. So this must be a bug, right?
At this point I switched focus to the underlay and traced the path of the VXLAN UDP packet. It was here that I spotted something strange. Instead of going from edge router to core router (which held a default gateway to the rest of the world), the path went edge–core–other router–other router–other core–internet. There are two cores, and at this point the edge router had a single uplink. The calculated IPv4 path for the underlay was direct along its only uplink, but the calculated exit VTEP for the ping packets was the other core. So the VXLAN path took a long route before reaching the chosen exit VTEP. Not a problem: this is what redundancy is for, and it is what would happen in production when full uplinks were in place but an optic or fibre breaks.
The issue, and the root cause of the failure, is that one of the other VXLAN enabled edge routers was only partially configured and had default IP MTU settings. So the VXLAN packet itself would have to be fragmented in order to reach the final VTEP. As per the RFC, it was dropped.
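The failure can be modelled as a simple check along the underlay path: add the 50 bytes of VXLAN overhead to the inner packet and find the first hop whose IP MTU is too small. This is only an illustrative sketch; the hop names and the path list are made up, and in practice the path would come from a traceroute or the routing table.

```python
VXLAN_OVERHEAD = 50  # bytes added to the inner IP packet, as counted against the IP MTU


def first_undersized_hop(path, inner_packet_size: int):
    """Return the name of the first hop that cannot carry the encapsulated packet.

    path is a list of (hop_name, ip_mtu) tuples in forwarding order.
    Returns None if the packet fits everywhere.
    """
    outer_size = inner_packet_size + VXLAN_OVERHEAD
    for hop_name, ip_mtu in path:
        if ip_mtu < outer_size:
            return hop_name
    return None


# Hypothetical long path via a partially configured edge router
long_path = [("edge-a", 9100), ("core-1", 9100), ("edge-partial", 1500), ("core-2", 9100)]
print(first_undersized_hop(long_path, 1500))  # a full 1500-byte fragment from the SVI
print(first_undersized_hop(long_path, 1450))  # the largest inner packet that still fits
```

A VTEP must not fragment, so any packet for which this returns a hop name is one that would need fragmenting in transit and risks being silently discarded at the far end.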
It’s worth noting that Aruba CX has no counter to give you a clue that this is happening. Aruba support suggest that VXLAN debugging would show it, but recall this is production and debugging all traffic might not be wise. Had I been able to recreate the problem in the PoC lab, I would not have had angry users.
So there are two lessons here:
- VXLAN will silently drop packets too large to traverse the whole path to the exit VTEP
- Ensure the whole underlay is compliant with the above, as routes may not be what you expect once failures occur
I’ve created some network automation for this campus network that checks every uplink for undersized MTU or IP MTU settings. We also have an agreement that we don’t admin up any links that don’t have the full config.
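The core of such a check is small. The sketch below is not my actual automation, just the idea: it assumes the uplink settings have already been collected into a list of dictionaries (the field names are hypothetical) and uses the 9128/9100 values from this network as thresholds.

```python
MIN_L2_MTU = 9128  # layer 2 MTU expected on every uplink in this network
MIN_IP_MTU = 9100  # IP MTU expected on every uplink in this network


def undersized_uplinks(uplinks):
    """Return (device, interface, reason) for every setting below the thresholds."""
    problems = []
    for link in uplinks:
        if link["mtu"] < MIN_L2_MTU:
            problems.append((link["device"], link["interface"], f"mtu {link['mtu']}"))
        if link["ip_mtu"] < MIN_IP_MTU:
            problems.append((link["device"], link["interface"], f"ip mtu {link['ip_mtu']}"))
    return problems


# Hypothetical inventory: one good uplink, one left on the default IP MTU
inventory = [
    {"device": "edge-a", "interface": "1/1/49", "mtu": 9128, "ip_mtu": 9100},
    {"device": "edge-partial", "interface": "1/1/49", "mtu": 9128, "ip_mtu": 1500},
]
for device, interface, reason in undersized_uplinks(inventory):
    print(f"{device} {interface}: {reason}")
```

How the inventory gets populated (REST API, SSH scraping, config backups) is left open here; the point is that a default of 1500 on any one link is enough to cause the drops.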
For more VXLAN MTU maths and a better explanation, check out this blog post from Packet Pushers.