Where?
Bumping for visibility.
Not associated with Meta, but this piqued my interest. That being said, I found some parts confusing and hard to follow. For example, what does URPF (Unicast Reverse Path Forwarding) in the title of this submission have to do with the contents?
And is the packet loss supposedly happening only at specific times? It's not mentioned anywhere, but one screenshot highlights the time. I couldn't reproduce the packet loss using any of the looking glasses and destination IP addresses in the screenshots. At this point, if this were a report I had received about one of my services, I would probably have bumped the priority down to low and asked for a reproducible test, because in my experience even issues that affect a single path in an ECMP group are not this hard to reproduce. I think it's far more important to give the engineer who will process the report an easy way to check that there is indeed a problem than to start teaching how traceroute works.
TBF, there does seem to be an issue somewhere, because sticking 129.134.80.234, one of the Meta IP addresses from a screenshot, on ping.pe does definitely show significant packet loss from more locations than you'd expect to see for an address with no connectivity issues.
Packet loss is happening all the time, though it might be more noticeable during peak hours since a faulty interface will show a higher error rate under heavy load. You can replicate it using looking glasses; maybe you didn't see it five days ago but you do now. Since it’s an ECMP issue, it depends heavily on which source and destination servers you’re testing. It’s just a matter of iterating.
I’m glad you were able to replicate it on ping.pe; Meta, however, still has no clue.
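To make the ECMP point concrete, here is a minimal sketch of why only some source/destination combinations see the loss. It assumes a router that picks one of several equal-cost paths by hashing the flow 5-tuple. The hash (SHA-256), the number of paths (8), the faulty path index (3), and the source address 192.0.2.10 are all assumptions for illustration, not what Meta's or any vendor's gear actually does; only 129.134.80.234 comes from the screenshots.

```python
import hashlib


def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, n_paths):
    """Map a flow 5-tuple to a path index in [0, n_paths).

    Illustrative only: real routers use their own hash functions and seeds.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


def flows_hitting_path(src_ip, dst_ip, dst_port, proto, n_paths, bad_path, src_ports):
    """Return the source ports whose flows would be hashed onto bad_path."""
    return [
        port
        for port in src_ports
        if ecmp_path(src_ip, dst_ip, port, dst_port, proto, n_paths) == bad_path
    ]


if __name__ == "__main__":
    # Hypothetical setup: 8 equal-cost paths, path 3 has the faulty interface.
    ports = range(33434, 33434 + 256)
    hits = flows_hitting_path("192.0.2.10", "129.134.80.234", 443, "tcp", 8, 3, ports)
    print(f"{len(hits)} of {len(ports)} source ports land on the faulty path")
```

With a model like this, a single test from one source with one port can easily miss the bad path, which is why iterating over sources, destinations, and ports matters when trying to reproduce it.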