[ih] inter-network communication history
This is getting pretty long and we are probably the only ones interested in it.
You were lucky. We had customers who had configuration changes between day and night, end of month, holidays vs non-holidays (both for less traffic and much more), in Hong Kong race day was a killer, etc. There are also networks that have to reconfigure quickly when new conditions arise (some predictable, some not).
Database Performance: All of that did make database performance an issue. But a 20+ times slowdown can also make even ordinary things far too slow. SNMP's inherent dribble approach probably covered up the database performance issues. The idea that one couldn't snapshot a router table before it was updated was a pain.
General: One of the problems we saw was getting it across to operators that they were *managing* the network, not controlling it. There was a situation in the UK where the number of switch crashes dropped precipitously for 6 weeks and then came back up. No changes had been made. It was very strange until they realized it was the 6 weeks the operators were on strike. They had been trying to control the network and were making it worse.
I contend that network management is basically Monitor and Repair, but not control.
Multi-protocol: Yes this is where leveraging the commonality of structure paid off. We had to support all of those too. Given our approach to it, it was relatively straightforward.
Fault Management: yes, this is another place where the commonality paid off. Also recognizing that doing diagnosis was basically creating a small management domain for the equipment under test. I was never able to get as far on this problem as I thought we could go. But one could do a lot with common tools to generate test patterns, test dictionaries, etc.
Most of this was on the market by 1986.
> On Nov 10, 2019, at 01:26, Jack Haverty <jack at 3kitty.org> wrote:
> On 11/9/19 4:23 AM, John Day wrote:
>> As they say, Jack, ignorance is bliss! ;-) Were you doing configuration with it? Or was it just monitoring?
> As I recall, configuration wasn't a big deal. Nodes were typically
> routers with Ethernets facing toward the users at the site and several
> interfaces the other way for long-haul circuits. Our approach was to
> collect all the appropriate equipment for the next site in our
> California lab, configure it and test it out on the live network, and
> then ship it all to wherever it was to go. So, for example, New Zealand
> might have actually been in California at first, but when it got to NZ
> it worked the same.
> IIRC, there was lots of stuff that could be configured and tweaked in
> the routers. There was even a little documentation on what some of
> those "virtual knobs" affected. There was essentially nothing on why
> you might want to set some knob to any particular position, what
> information you needed to make such decisions, or how to predict
> results. Anything could happen. So there was strong incentive never
> to change the default configuration parameters after the site equipment
> left our lab.
> I don't remember any concerns about database performance. But we only
> had a hundred or so boxes out in our net. Perhaps the Network
> Management vendors had visions of customers with thousands of their
> boxes so we didn't see the same problems. Also, we only collected the
> specific data from sources like SNMP that we expected we could actually
> use. We thought our network was pretty big for the time, spanning 5
> continents and thousands of users and computers. The database we had
> worked fine for that. Compared to other situations, like processing
> credit card or bank transactions, it didn't seem like a big load. I
> think it all went into a Sparc. But there were bigger machines around
> if we needed one.
> The vendor-supplied tools did provide some monitoring. E.g., it was
> fairly easy to see problems like a dead router or line, and pick up the
> phone to call the right TelCo or local site tech to reboot the box.
> With alternate routing, often the users didn't even notice. Just like
> in the ARPANET...(Yay packet switching!)
> To make things extra interesting, that was the era of "multi-protocol
> routers", since TCP hadn't won the network wars quite yet. Our
> corporate product charter was to provide software that ran on any
> computer, over any kind of network. So our net carried not only TCP/IP,
> but also other stuff - e.g., DECNet, AppleTalk, SPX/IPX, and maybe one
> or two I don't remember. SNA/LU6.2 anyone...? Banyan Vines?
> Most of our more challenging "network management" work involved fault
> isolation and diagnosis, plus trend analysis and planning.
> A typical problem would start with an urgent call from some user who was
> having trouble doing something. It might be "The network is way too
> slow. It's broken." or "I can't get my quarterly report to go in".
> Often the vendor system would show that all routers were up and running
> fine, and all lines were up. But from the User's perspective, the
> network was broken.
> Figuring out what was happening was where the ad-hoc tools came in.
> Sometimes it was User Malfunction, but often there was a real issue in
> the network that just didn't appear in any obvious way to the
> operators. But the Users saw it.
> "You say the Network is running fine.....but it doesn't work!"
> To delve into Users' problems, we needed to go beyond just looking at
> the routers and circuits. Part of the problem might be in the Host
> computers where TCP lived, or in the Application, e.g., email.
> We ran the main data center in addition to the network. There wasn't
> anyone else for us to point the finger at.
> We used simple shell scripts and common Unix programs to gather
> SNMP-available data and stuff it into the database, parsed as much as we
> could into appropriate tables with useful columns like Time, Router#,
> ReportType, etc. That provided data about how the routers saw the
> network world, capturing status and behavior over whatever period of
> time we ran the collector.
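A minimal sketch of that kind of collector step, assuming a made-up input format of one "router report_type value" triple per line (the real SNMP output and the real schema are not given in the message). The table and column names are hypothetical, loosely following the Time, Router#, ReportType columns mentioned above, and SQLite stands in for whatever database was actually used.

```python
import sqlite3


def parse_line(line):
    """Split one 'router report_type value' line; return None if malformed."""
    parts = line.split()
    if len(parts) != 3:
        return None
    router, report_type, value = parts
    try:
        return router, report_type, float(value)
    except ValueError:
        return None


def store_samples(conn, timestamp, lines):
    """Insert every parseable line into a hypothetical 'samples' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples "
        "(time INTEGER, router TEXT, report_type TEXT, value REAL)"
    )
    parsed = (parse_line(line) for line in lines)
    rows = [(timestamp, *p) for p in parsed if p is not None]
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Illustrative data only; not taken from any real router.
conn = sqlite3.connect(":memory:")
stored = store_samples(
    conn, 1000, ["r1 ifInOctets 12345", "r2 ifInOctets 999", "garbage"]
)
```

Malformed lines are skipped rather than aborting the run, which seems a reasonable choice for a collector that runs unattended from cron.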
> Following the "Standard Node" approach, wherever we placed a network
> node we also made sure to have some well-understood machine on the User
> side that we could use remotely from the NOC. Typically it would be
> some kind of Unix workstation, attached to the site's Ethernet close to
> the router. Today, I'd probably just velcro a Raspberry Pi to the router.
> I used to call this an Anchor Host, since it provided a stable,
> well-understood (by us at the NOC) machine out in the network. This
> was really just copying the ARPANET approach from the early 70s, where a
> "Fake Host" inside the IMP could be used to do network management things
> like generate test traffic or snoop on regular network traffic. We
> couldn't change the router code to add a Fake Host, but we could put a
> Real Host next to it.
> From that Fake (Real) Host, we could run Ping tests across the network
> to measure RTT, measure bandwidth between 2 points during a test FTP,
> generate traffic, and such stuff, simply using the tools that commonly
> come in Unix boxes. The results similarly made their way into tables in
> the database. Some tests were run continuously, e.g., ping tests every
> 5 minutes. Others were enabled on demand to help figure out some
> problem, avoiding burdening the network (and database I guess) with
> extra unneeded traffic.
> Also from that Fake Host, we could run TCPDUMP, which captured traffic
> flowing across that Ethernet and produced reams of output with a melange
> of multi-protocol packet headers. Again, all of that could make its way
> into the database on demand, organized into useful Tables, delayed if
> necessary to avoid impacting the network misbehavior we were trying to
> debug. Give a Unix guru awk, sed, cron and friends and amazing things
> can happen.
> We could even run a Flakeway on that Anchor Host, to simulate network
> glitches for experimentation, but I can't recall ever having to do
> that. But perhaps the ops did and I never knew.
> Once all that stuff got into the database, it became data. Not a
> problem. I was a network guy afloat in an ocean of database gurus, and
> I was astonished at the way they could manipulate that data and turn it
> into Information.
> I didn't get involved much in everyday network operations, but when
> weird things happened I'd stick my nose in.
> Once there was an anomaly in a trans-pacific path, where there was a
> flaky circuit that would go down and up annoyingly often. The carrier
> was "working on it..."
> What the ops had noticed was that after such a glitch finished, the
> network would settle down as expected. But sometimes, the RTT delay and
> bandwidth measurements would settle down to a new stable level
> noticeably different from before the line glitch. They even had
> brought up a rolling real-time graph of the data, kind of like a
> hospital heart-monitor, that clearly showed the glitch and the change in
> Using our ad-hoc tools, we traced the problem down to a bug in some
> vendor's Unix system. That machine's TCP retransmission timer algorithm
> was reacting to the glitch, and adapting as the rerouting occurred. But
> after the glitch, the TCP had settled into a new stable pattern where
> the retransmission timer fired just a little too soon, and every packet
> was getting sent twice. The network anomaly would show up if a line
> glitch occurred, but only if that Unix user was in the middle of doing
> something like a file transfer across the Pacific at the time. The
> Hosts and TCPs were both happy, the Routers were blissfully ignorant,
> and half that expensive trans-pacific circuit was being wasted carrying
> duplicate packets.
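The arithmetic behind "every packet sent twice" and "half the circuit wasted" can be sketched with a deliberately simplified model. It assumes a fixed retransmission timer with no backoff and no packet loss. This is not the vendor's actual algorithm, which the message does not describe, and the function names and millisecond figures are illustrative only.

```python
import math


def transmissions_per_packet(rtt_ms, rto_ms):
    """Copies of a packet sent before its first ACK can arrive.

    The timer fires at rto, 2*rto, ...; each firing strictly before
    the round-trip time has elapsed triggers one needless resend.
    """
    if rtt_ms <= 0 or rto_ms <= 0:
        raise ValueError("rtt_ms and rto_ms must be positive")
    resends = math.ceil(rtt_ms / rto_ms) - 1
    return 1 + resends


def wasted_fraction(rtt_ms, rto_ms):
    """Share of the sender's traffic on the path that is duplicates."""
    n = transmissions_per_packet(rtt_ms, rto_ms)
    return 1 - 1 / n


# Timer fires "just a little too soon": 190 ms against a 200 ms RTT.
too_soon = transmissions_per_packet(200, 190)
healthy = transmissions_per_packet(200, 250)
waste = wasted_fraction(200, 190)
```

With the timer slightly shorter than the round trip, each packet goes out twice and half of that sender's traffic is duplicates, while neither end sees an error.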
> With the data all sitting in the database, we had the tools to figure
> that out. We reported the TCP bug to the Unix vendor. I've always
> wondered if it ever got fixed, since most customers would probably never
> Another weird thing was that "my quarterly report won't go" scenario.
> That turned out to be a consequence of the popularity of the "Global
> LAN" idea in the network industry at the time. IIRC, someone in some
> office in Europe had just finished putting together something like a
> library of graphics and photos for brochures et al, and decided to send
> it over to the colleagues who were waiting for it. Everybody was on
> the department "LAN", so all you had to do was drag this folder over
> there to those guys' icons and it would magically appear on their
> desktops. Of course it didn't matter that those other servers were in
> the US, Australia, and Asia - it's a Global LAN, right!
> The network groaned, but all the routers and lines stayed up, happily
> conveying many packets per second. For hours. Unfortunately too few of
> the packets were carrying that email traffic.
> We turned off "Global LAN" protocols in the routers ... but of course
> today such LAN-type services all run over TCP, so it might not be quite
> as easy.
> The other important but less urgent Network Management activity involved
> things like Capacity Planning. With the data in the database, it was
> pretty easy to get reports or graphs of trends over a month/quarter, and
> see the need to order more circuits or equipment.
> We could also run various tests like traffic generators and such and
> gather data when there were no problems in the network. That collected
> data provided a "baseline" of how things looked when everything was
> working. During problem times, it was straightforward to run similar
> tests and compare the results with the baselines to figure out where the
> source of a problem might be by highlighting significant differences.
> The ability to compare "working" and "broken" data is a powerful Network
> Management tool.
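One way such a baseline comparison could look, as a sketch only: flag any metric whose current mean is more than a few baseline standard deviations away from the baseline mean. The metric names, sample values, and the threshold of 3 are assumptions for illustration. The message does not say how "significant" was actually judged.

```python
from statistics import mean, stdev


def significant_differences(baseline, current, threshold=3.0):
    """Return {metric: (baseline_mean, current_mean)} for metrics that moved.

    A metric is flagged when its current mean lies more than `threshold`
    baseline standard deviations from the baseline mean.
    """
    flagged = {}
    for metric, base_samples in baseline.items():
        cur_samples = current.get(metric)
        if not cur_samples or len(base_samples) < 2:
            continue  # not enough data to compare
        mu, sigma = mean(base_samples), stdev(base_samples)
        cur = mean(cur_samples)
        if sigma == 0:
            deviates = cur != mu
        else:
            deviates = abs(cur - mu) / sigma > threshold
        if deviates:
            flagged[metric] = (mu, cur)
    return flagged


# Illustrative numbers: RTT unchanged, FTP throughput roughly halved.
baseline = {"rtt_ms": [180, 182, 179, 181], "ftp_kbps": [56, 55, 57, 56]}
current = {"rtt_ms": [181, 180], "ftp_kbps": [28, 27]}
flagged = significant_differences(baseline, current)
```

Here only the throughput metric is flagged, which is the sort of pointer that narrows down where to look next.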
> So that's what we did. I'm not sure I'd characterize all that kind of
> activity as either Configuration or Monitoring. I've always thought it
> was just Network Management.
> There's a lot of History of the Internet protocols, equipment, software,
> etc., but I haven't seen much of a historical account of how the various
> pieces of the Internet have been operated and managed, and how the tools
> and techniques have evolved over time.
> If anybody's up for it, it would be interesting to see how other people
> did such "Network Management" activities with their own ad-hoc tools as
> the Internet evolved.
> It would also be fascinating to see how today's expensive Network
> Management Systems tools would be useful in my scenarios above. I.e.,
> how effective would today's tools be if used by network operators to
> deal with my example network management scenarios - along the lines of
> RFC1109's observations about how to evaluate Network Management technology.
> BTW, everything I wrote above occurred in 1990-1991.