
tools and techniques to pinpoint and respond to loss on a path



What I have done in the past (this presumes you have a /29 or bigger on the peering link to each of your upstreams) is to check with each direct upstream provider and get approval to put a Linux diagnostics server on the peering side of each BGP upstream connection you have, default-routed out to their BGP router(s).  This is typically not a problem for the upstream as long as they know it is for diagnostic purposes and will be taken down later.  It also lets the upstreams know you are seriously looking at the reliability that they and their competitors are giving you.

On each diagnostics box, run some quick & dirty tools to start isolating whether the problem is related to one upstream link or another, or to a combination of them.  Have each box monitor all the distant peer connections, and possibly even the other local peers, if you are uber-detailed.  The problem could be anywhere in between, but if you notice that one link has the issues and the other does not, and/or that it follows a particular src/dst combination, then you are in better shape to help your upstreams diagnose it as well.  A couple of tools like smokeping, plus traceroute and ping run on a scripted basis, are not perfect but are easy to set up.  Log it all so that when the problem impacts production systems you can go back through those logs and look for clues.  nettop is another handy tool for dumping stuff out, and it is also very handy in the nearly impossible case that you happen to be on the console when the problem occurs.
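
For the scripted ping part, something along these lines would do as a starting point.  It is only a rough sketch, not anything I have run in production: the target addresses are documentation placeholders, the names (TARGETS, parse_loss, probe, log_line, run) and the upstream label and log file name are made up for illustration, and it assumes a Linux iputils-style ping whose summary line contains "N% packet loss".

```python
import re
import subprocess
import time
from datetime import datetime, timezone

# Placeholder targets (documentation address space); swap in your distant peers.
TARGETS = ["192.0.2.1", "198.51.100.1"]

# Matches the summary line printed by ping, e.g. "20% packet loss".
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_loss(ping_output):
    """Return the packet loss percentage from ping output, or None if not found."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def probe(target, count=5):
    """Ping a target `count` times (10 s overall deadline) and return the loss percent."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-w", "10", target],
        capture_output=True,
        text=True,
    )
    return parse_loss(result.stdout)


def log_line(upstream, target, loss):
    """Format one timestamped log record; `upstream` labels which link this box sits on."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    loss_text = "unknown" if loss is None else f"{loss:.1f}"
    return f"{stamp} upstream={upstream} target={target} loss_pct={loss_text}"


def run(upstream, logfile="loss.log", interval=60, rounds=None):
    """Probe every target each `interval` seconds; loop forever when rounds is None."""
    done = 0
    while rounds is None or done < rounds:
        with open(logfile, "a") as fh:
            for target in TARGETS:
                fh.write(log_line(upstream, target, probe(target)) + "\n")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Call run("upstream-a") on the box behind the first upstream and run("upstream-b") on the other, then compare the two logs by timestamp and target when production gets hit.  A traceroute captured on the same schedule can be logged the same way.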