
Service provider story about tracking down TCP RSTs



Can you share the Arista model and EOS version of the devices on which
TTL hashing was enabled by default?

On Sat, Sep 1, 2018 at 2:51 PM, <frnkblk at iname.com> wrote:

> I want to share a little bit of our journey in tracking down the TCP RSTs
> that impacted some of our customers for almost ten weeks.
>
>
>
> Almost immediately after we turned up two new Arista border routers in
> late July we started receiving a trickle of complaints from customers
> regarding their inability to access certain websites (mostly B2B). All the
> packet captures showed the standard TCP SYN/SYN-ACK pair, then a TCP RST
> from the website after the client sent a TLS/SSL Client Hello. As the
> reports continued to come in, we built a Google Doc to keep track and it
> became clear that most of the sites were hosted by Incapsula/Imperva, but
> there were also a few by Sucuri and Fastly. Knowing that Incapsula provides
> DoS protection, we attempted to work with them (providing websites,
> source/destination IPs, traceroutes, and packet captures) to find out why
> their hosts were issuing our customers a TCP RST, but we made little
> progress. We moved some of the affected customers to different IP addresses
> but that didn't resolve the issue. We also asked our customer to work with
> the website to see if they would be willing to open a ticket with
> Incapsula. In the meantime, customers were getting frustrated! They
> couldn't visit Incapsula-hosted healthcare websites, financial firms,
> product dealers, etc. Over the weeks, a few of those customers
> purchased/borrowed different routers and some of those didn't have website
> issues anymore. And more than a few of them discovered that the websites
> worked fine from home or their mobile phone/hotspot, but not from their
> Internet connection with us. You can guess where they were applying
> pressure! That said, we didn't know why a small handful of companies, known
> for DoS protection, were issuing TCP RSTs to just some of our customers.
>
>
>
> Earlier this week we received four or five more websites from yet another
> affected customer, but most of those were with Fastly. By this time, we had
> been able to replicate the issue in our lab. Feeling desperate to make some
> tangible progress on this issue, I reached out to the Fastly NOC. In less
> than 12 hours they provided some helpful feedback, pointing out that a
> single traceroute to a Fastly site was hitting two of their POPs (they use
> anycast) and, because they don't sync state between POPs, the second POP
> would naturally issue a TCP RST (sidebar: fascinating blog article on
> Fastly's infrastructure here:
> https://www.fastly.com/blog/building-and-scaling-fastly-network-part-2-balancing-requests).
> In subsequent email exchanges, the Fastly NOC suggested that it appeared that
> we were "spraying flows" (that is, packets related to a single client session
> were egressing our network via different paths). Because Fastly is also
> present with us at an IX (though they weren't advertising their anycast IPs
> at the time), they suggested that we look at how our traffic egresses our
> network (IX versus transit) and our routers' outbound
> load-balancing/hashing schemes.
>
>
>
> The IX turned out to be a red herring, so I turned my attention to our
> transit. Each of our border routers has two BGP sessions over two circuits
> to transit provider POP A and two BGP sessions over two circuits to transit
> provider POP B, for a total of four BGP sessions per border router, a total
> of eight BGP sessions altogether. Starting with our core router, I
> confirmed that its ECMP hashing was consistent such that Fastly-bound
> traffic always went to border router 1 or border router 2. Then I looked at
> the ECMP hashing scheme on our border routers and noticed something unique.
> By default, Arista also uses TTL:
>
>
>
> IPv4 hash fields:
>    Source IPv4 Address is ON
>    Protocol is ON
>    Time-To-Live is ON
>    Destination IPv4 Address is ON
>
>
>
> Since the source and destination IPs and protocol weren't changing,
> perhaps the TTL was not consistent? I opened the first packet trace in
> Wireshark and jackpot: the TTL value was 128 on the SYN but 127 on the
> TLS/SSL Client Hello. I adjusted the Arista's load-balancing profile not to
> use TTL and immediately my MTR in the background changed and all the sites
> on the lab machine that couldn't load before ... were now loading.
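
To illustrate why a TTL change matters here, below is a minimal Python sketch of ECMP path selection from a hash of header fields. It is not Arista's actual hash: the SHA-256 digest, the function name ecmp_path, the four-path count, and the documentation-range IP addresses are all assumptions made for illustration. Only the field list (source IP, destination IP, protocol, TTL) comes from the output quoted above.

```python
import hashlib


def ecmp_path(src_ip, dst_ip, protocol, ttl, num_paths, include_ttl):
    """Return an egress path index in [0, num_paths) for one packet.

    Hypothetical model: real routers use a hardware hash, not SHA-256.
    """
    fields = [src_ip, dst_ip, str(protocol)]
    if include_ttl:
        # With TTL in the hash input, the path depends on a per-packet value.
        fields.append(str(ttl))
    digest = hashlib.sha256("|".join(fields).encode("ascii")).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


if __name__ == "__main__":
    # Hypothetical flow: same addresses and protocol (6 = TCP), two TTLs.
    flow = ("192.0.2.10", "198.51.100.20", 6)
    for include_ttl in (True, False):
        syn = ecmp_path(*flow, ttl=128, num_paths=4, include_ttl=include_ttl)
        hello = ecmp_path(*flow, ttl=127, num_paths=4, include_ttl=include_ttl)
        print(f"include_ttl={include_ttl}: SYN -> path {syn}, "
              f"Client Hello -> path {hello}")
```

With include_ttl=False, both packets always map to the same path because the hash input is identical. With include_ttl=True, they may or may not diverge for any given pair of TTLs, which would be consistent with only some flows being affected.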
>
>
>
> Fastly also pointed me to another article written by Joel Jaeggli (
> https://blog.apnic.net/2018/01/11/ipv6-flow-label-misuse-hashing/) that
> discusses IPv6 flow labels; we removed the flow label from the border routers' IPv6
> hash fields, too.
>
>
>
> I reviewed the packet traces today and noticed that TTL values remained
> consistent at 128 *behind* the router CPE. In packet captures on the
> WAN interface of the router CPE I see that the SYN remains at 128, but the
> TLS/SSL Client Hello is properly decremented to 127. So, it appears that some
> router CPE (and there were a variety of makes and models) are doing
> something special to certain packets and not decrementing the TTL.
>
> This explains why:
>
>    - our customers had issues with all their devices behind their router
>    CPE
>    - the issue remained regardless of what public IP address their router
>    CPE obtained via DHCP or was assigned
>    - some customers who changed their router CPE didn't have the issue
>    anymore; they got lucky with a router that doesn't adjust/reset the TTL
>    - customers who used our managed Wi-Fi router did not see the
>    issue, because that model apparently doesn't manipulate the TTL, at least
>    not in an inconsistent way.
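
The TTL inconsistency seen in the WAN-side captures could be checked for with something like the sketch below. It assumes the packets have already been reduced to (source IP, destination IP, protocol, TTL) tuples by some other tool. The function name flows_with_varying_ttl and the sample addresses are hypothetical; only the 128/127 TTL values come from the traces described above.

```python
from collections import defaultdict


def flows_with_varying_ttl(packets):
    """Group packets by (src, dst, protocol) and report flows whose
    packets carry more than one distinct TTL value.

    packets: iterable of (src_ip, dst_ip, protocol, ttl) tuples.
    Returns a dict mapping each affected flow to its sorted TTL values.
    """
    ttls_by_flow = defaultdict(set)
    for src_ip, dst_ip, protocol, ttl in packets:
        ttls_by_flow[(src_ip, dst_ip, protocol)].add(ttl)
    return {
        flow: sorted(ttls)
        for flow, ttls in ttls_by_flow.items()
        if len(ttls) > 1
    }


if __name__ == "__main__":
    # Hypothetical WAN-side capture: the SYN leaves at TTL 128, the
    # Client Hello at 127; a second flow keeps a consistent TTL.
    sample = [
        ("192.0.2.10", "198.51.100.20", 6, 128),  # SYN
        ("192.0.2.10", "198.51.100.20", 6, 127),  # TLS Client Hello
        ("192.0.2.10", "203.0.113.5", 6, 127),
        ("192.0.2.10", "203.0.113.5", 6, 127),
    ]
    for flow, ttls in flows_with_varying_ttl(sample).items():
        print(flow, "TTL values seen:", ttls)
```

Any flow reported here would be a candidate for being split across paths by a hash that includes TTL.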
>
>
>
> Lesson learned: review a device's hashing mechanism before going into
> production.
>
>
>
> For those interested, I have links to the packet traces below my
> signature, showing the inconsistent TTL values.
>
>
>
> Thanks again to the fantastic group of folk at the Fastly NOC who so ably
> pointed us in the right direction!
>
>
>
> Frank
>
>
>
> https://www.premieronline.net/~fbulk/example1.pcapng
>
> https://www.premieronline.net/~fbulk/example2.pcapng
>
> https://www.premieronline.net/~fbulk/example3.pcapng
>