[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[ih] Detlef's TCP questions

> During the period of Van Jacobson's development of the algorithms that 
> bear his
> name, he wrote many lengthy, pithy, and informative messages to various 
> public
> mailing lists about the hazards of the Internet and how his algorithms cope.
> Maybe some of these lists are archived somewhere.
> Bob Braden

Hi Bob:

Turns out the late Rich Stevens had a small collection of notes he
considered Van's "greatest hits" from that time and was kind enough to
share it with me.  Here are the notes.


> To: markl at PTT.LCS.MIT.EDU
> Subject: Re: interpacket arrival variance and mean 
> In-reply-to: Your message of Mon, 08 Jun 87 12:31:33 EDT.
> Date: Mon, 15 Jun 87 06:08:01 PDT
> From: Van Jacobson <van at lbl-csam.arpa>
> Mark -
> I'm working on long paper about transport protocol timers (models,
> estimation, implementations, common implementation mistakes, etc.).
> This has me all primed on the subject so attached is a long-winded
> explanation and example C code for estimating the mean and variance
> of a series of measurements.
> Though I know your application isn't tcp, I'm going to use a new
> tcp retransmit timer algorithm as an example.  The algorithm
> illustrates the method and, based on 3 months of experiment, is
> much better than the algorithm in rfc793 (though it isn't as good
> as the algorithm I talked about at IETF and am currently debugging). 
> Theory
> ------
> You're probably familiar with the rfc793 algorithm for estimating
> the mean round trip time.  This is the simplest example of a
> class of estimators called "recursive prediction error" or
> "stochastic gradient" algorithms.  In the past 20 years these
> algorithms have revolutionized estimation and control theory so
> it's probably worth while to look at the rfc793 estimator in some
> detail. 
> Given a new measurement, Meas, of the rtt, tcp updates an
> estimate of the average rtt, Avg, by
> 	Avg <- (1 - g) Avg  +  g Meas
> where g is a "gain" (0 < g < 1) that should be related to the
> signal- to-noise ratio (or, equivalently, variance) of Meas. 
> This makes a bit more sense (and computes faster) if we rearrange
> and collect terms multiplied by g to get
> 	Avg <- Avg + g (Meas - Avg)
> We can think of "Avg" as a prediction of the next measurement. 
> So "Meas - Avg" is the error in that prediction and the
> expression above says we make a new prediction based on the old
> prediction plus some fraction of the prediction error.  The
> prediction error is the sum of two components: (1) error due to
> "noise" in the measurement (i.e., random, unpredictable effects
> like fluctuations in competing traffic) and (2) error due to a
> bad choice of "Avg".  Calling the random error RE and the
> predictor error PE,
> 	Avg <- Avg + g RE + g PE
> The "g PE" term gives Avg a kick in the right direction while the
> "g RE" term gives it a kick in a random direction.  Over a number
> of samples, the random kicks cancel each other out so this
> algorithm tends to converge to the correct average.  But g
> represents a compromise: We want a large g to get mileage out of
> the PE term but a small g to minimize the damage from the RE
> term.  Since the PE terms move Avg toward the real average no
> matter what value we use for g, it's almost always better to use
> a gain that's too small rather than one that's too large. 
> Typical gain choices are .1 - .2 (though it's always a good
> idea to take long look at your raw data before picking a gain). 
> It's probably obvious that Avg will oscillate randomly around
> the true average and the standard deviation of Avg will be
> something like g sdev(Meas).  Also that Avg converges to the
> true average exponentially with time constant 1/g.  So
> making g smaller gives a stabler Avg at the expense of taking
> a much longer time to get to the true average.
> [My paper makes the preceding hand-waving a bit more rigorous
> but I assume no one cares about rigor.  If you do care, check
> out almost any text on digital filtering or estimation theory.]
> If we want some measure of the variation in Meas, say to compute
> a good value for the tcp retransmit timer, there are lots of
> choices.  Statisticians love variance because it has some nice
> mathematical properties.  But variance requires squaring (Meas -
> Avg) so an estimator for it will contain two multiplies and a
> large chance of integer overflow.  Also, most of the applications
> I can think of want variation in the same units as Avg and Meas,
> so we'll be forced to take the square root of the variance to use
> it (i.e., probably at least a divide, multiply and two adds). 
> A variation measure that's easy to compute, and has a nice,
> intuitive justification, is the mean prediction error or mean
> deviation.  This is just the average of abs(Meas - Avg). 
> Intuitively, this is an estimate of how badly we've blown our
> recent predictions (which seems like just the thing we'd want to
> set up a retransmit timer).  Statistically, standard deviation
> (= sqrt variance) goes like sqrt(sum((Meas - Avg)^2)) while mean
> deviation goes like sum(sqrt((Meas - Avg)^2)).  Thus, by the
> triangle inequality, mean deviation should be a more "conservative"
> estimate of variation. 
> If you really want standard deviation for some obscure reason,
> there's usually a simple relation between mdev and sdev.  Eg.,
> if the prediction errors are normally distributed, mdev =
> sqrt(pi/2) sdev.  For most common distributions the factor
> to go from sdev to mdev is near one (sqrt(pi/2) ~= 1.25).  Ie.,
> mdev is a good approximation of sdev (and is much easier to
> compute).
> Practice
> --------
> So let's do estimators for Avg and mean deviation, Mdev.  Both
> estimators compute means so we get two instances of the rfc793
> algorithm:
> 	Err = abs (Meas - Avg)
> 	Avg <- Avg + g (Meas - Avg)
> 	Mdev <- Mdev + g (Err - Mdev)
> If we want to do the above fast, it should probably be done in
> integer arithmetic.  But the expressions contain fractions (g < 1)
> so we'll have to do some scaling to keep everything integer.  If
> we chose g so that 1/g is an integer, the scaling is easy.  A
> particularly good choice for g is a reciprocal power of 2 
> (ie., g = 1/2^n for some n).  Multiplying through by 1/g gives
> 	2^n Avg <- 2^n Avg + (Meas - Avg)
> 	2^n Mdev <- 2^n Mdev + (Err - Mdev)
> To minimize round-off error, we probably want to keep the scaled
> versions of Avg and Mdev rather than the unscaled versions.  If
> we pick g = .125 = 1/8 (close to the .1 that rfc793 suggests) and
> express all the above in C:
> 	Meas -= (Avg >> 3);
> 	Avg += Meas;
> 	if (Meas < 0)
> 		Meas = -Meas;
> 	Meas -= (Mdev >> 3);
> 	Mdev += Meas;
> I've been using a variant of this to compute the retransmit timer
> for my tcp.  It's clearly faster than the two floating point
> multiplies that 4.3bsd uses and gives you twice as much information.
> Since the variation is estimated dynamically rather than using
> the wired-in beta of rfc793, the retransmit performance is also
> much better: faster retransmissions on low variance links and
> fewer spurious retransmissions on high variance links. 
> It's not necessary to use the same gain for Avg and Mdev. 
> Empirically, it looks like a retransmit timer estimate works
> better if there's more gain (bigger g) in the Mdev estimator. 
> This forces the timer to go up quickly in response to changes in
> the rtt.  (Although it may not be obvious, the absolute value in
> the calculation of Mdev introduces an asymmetry in the timer:
> Because Mdev has the same sign as an increase and the opposite
> sign of a decrease, more gain in Mdev makes the timer go up fast
> and come down slow, "automatically" giving the behavior Dave
> Mills suggests in rfc889.)
> Using a gain of .25 on the deviation and computing the retransmit
> timer, rto, as Avg + 2 Mdev, my tcp timer code looks like:
> 	Meas -= (Avg >> 3);
> 	Avg += Meas;
> 	if (Meas < 0)
> 		Meas = -Meas;
> 	Meas -= (Mdev >> 2);
> 	Mdev += Meas;
> 	rto = (Avg >> 3) + (Mdev >> 1);
> Hope this helps.
>   - Van
> To: jain%erlang.DEC at decwrl.dec.com (Raj Jain, LKG1-2/A19, DTN: 226-7642)
> Cc: ietf at gateway.mitre.org
> Subject: Re: Your congestion scheme 
> In-Reply-To: Your message of 03 Nov 87 12:51:00 GMT.
> Date: Mon, 16 Nov 87 06:03:29 PST
> From: Van Jacobson <van at lbl-csam.arpa>
> Raj,
> Thanks for the note.  I hope you'll excuse my taking the liberty
> of cc'ing this reply to the ietf:  At the last meeting there was a
> great deal of interest in your congestion control scheme and our
> adaptation of it. 
> > I am curious to know what values of increase amount and decrease
> > factor did you use.  In our previous study on congestion using
> > timeout (CUTE scheme, IEEE JSAC, Oct 1986), we had found that the
> > decrease factor should be small since packet losses are
> > expensive.  In fact, we found that a decrease factor of zero
> > (decreasing to one) was the best. 
> We use .5 for the decrease factor and 1 for the increase factor. 
> We also use something very close to CUTE (Mike Karels and I were
> behind on our journal reading so we independently rediscovered
> the algorithm and called it slow-start).  Since we use a lost
> packet as the "congestion experienced" indicator, the CUTE
> algorithm and the congestion avoidance algorithm (BTW, have you
> picked a name for it yet?) get run together, even though they are
> logically distinct. 
> The sender keeps two state variables for congestion control: a
> congestion window, cwnd, and a threshhold size, ssthresh, to
> switch between the two algorithms.  The sender's output routine
> sends the minimum of cwnd and the window advertised by the
> receiver.  The rest of the congestion control sender code looks
> like:  On a timeout, record half the current window size as
> "ssthresh" (this is the multiplicative decrease part of the
> congestion avoidance algorithm), then set the congestion window
> to 1 packet and call the output routine (this initiates
> slowstart/CUTE).  When new data is acked, the sender does 
> something like
> 	if (cwnd < ssthresh)	  // if we're still doing slowstart
> 		cwnd += 1 packet  // open the window exponentially
> 	else
> 		cwnd += 1/cwnd    // otherwise do the Congestion
> 				  // Avoidance increment-by-1
> Notice that our variation of CUTE opens the window exponentially
> in time, as opposed to the linear scheme in your JSAC article. 
> We looked at a linear scheme but were concerned about the
> performance hit on links with a large bandwidth-delay product
> (ie., satellite links).  An exponential builds up fast enough to
> accomodate any bandwidth-delay and our testing showed no more
> packet drops with exponential opening that with linear opening. 
> (My model of what slowstart is doing -- starting an ack "clock"
> to meter out packets -- suggested that there should be no
> loss-rate difference between the two policies).
> The reason for the 1/2 decrease, as opposed to the 7/8 you use,
> was the following hand-waving:  When a packet is dropped, you're
> either starting (or restarting after a drop) or steady-state
> sending.  If you're starting, you know that half the current
> window size "worked", ie., that a window's worth of packets were
> exchanged with no drops (slowstart guarantees this).  Thus on
> congestion you set the window to the largest size that you know
> works then slowly increase the size.  If the connection is
> steady-state running and a packet is dropped, it's probably
> because a new connection started up and took some of your
> bandwidth.  We usually run our nets with rho <= .5 so it's
> probable that there are now exactly two conversations sharing the
> bandwidth.  Ie., that you should reduce your window by half
> because the bandwidth available to you has been reduced to half. 
> And, if there are more than two conversations sharing the
> bandwidth, halving your window is conservative -- and being
> conservative at high traffic intensities is probably wise. 
> Obviously, this large decrease term is accounting for the high
> cost of our "congestion experienced" indicator compared to yours --
> a dropped packet vs. a bit in the ack.  If the DEC bit were
> available, the preceding "logic" should be ignored.  But I wanted
> tcp congestion avoidance that we could deploy immediately and
> incrementally, without adding code to the hundreds of Internet
> gateways.  So using dropped packets seemed like the only choice. 
> And, in system terms, the cost isn't that high: Currently packets
> are dropped only when a large queue has formed.  If we had a bit
> to force senders to reduce their windows, we'd still be stuck
> with the queue since we'd still be running the bottleneck at 100%
> utilization so there's no excess bandwidth available to dissipate
> the queue.  If we toss a packet, a sender will shut up for 2 rtt,
> exactly the time we need to empty the queue (in the ususal case).
> If that sender restarts with the correct window size the queue
> won't reform.  Thus we've reduced the delay to minimum without
> the system losing any bottleneck bandwidth.
> The 1-packet increase has less justification that the .5
> decrease.  In fact, it's almost certainly too large.  If the
> algorithm converges to a window size of w, you get O(w^2) packets
> between drops with an additive increase policy.  We were shooting
> for an average drop rate of <1% and found that on the Arpanet
> (the worst case of the 4 networks we tested), windows converged
> to 8 - 12 packets.  This yields 1 packet increments for a 1%
> average drop rate. 
> But, since we've done nothing in the gateways, the window we
> converge to is the maximum the gateway can accept without dropping
> packets.  I.e., in the terms you used, we are just to the left of
> the cliff rather than just to the right of the knee.  If we
> now fix the gateways so they start dropping packets when the
> queue gets pushed past the knee, our increment will be much too
> agressive and we'll have to drop it by at least a factor of 4
> (since all my measurements on an unloaded Arpanet or Milnet
> place their "pipe size" at 4-5 packets).  Mike and I have talked
> about a second order control loop to adaptively determine the
> appropriate increment to use for a path (there doesn't seem to
> be much need to adjust the decrement).  It looks trivial to
> implement such a loop (since changing the increment affects
> the frequency of oscillation but not the mean window size,
> the loop would affect rate of convergence but not convergence
> and (first-order) stability).  But we think 2nd order stuff
> should wait until we've spent some time on the 1st order part
> of the algorithm for the gateways.
> I'm tired and probably no longer making sense.  I think I'll
> head home and get some sleep.  Hope to hear from you soon.
> Cheers.
>  - Van
> To: tcp-ip at sri-nic.arpa
> Subject: Dynamic Congestion Avoidance / Control (long message)
> Date: Thu, 11 Feb 88 22:17:04 PST
> From: Van Jacobson <van at lbl-csam.arpa>
> A dozen people forwarded Phil Karn's question about TCP
> congestion control to me, usually with pithy subject lines like
> "how much longer till you publish something?".  I do have three
> RFCs and two papers in various stages of preparation, but innate
> laziness combined with this semester's unexpectedly large
> teaching load means it will probably be late spring before
> anything gets finished.  In the interim, here's a sort-of
> overview of our congestion control work.
> I should point out these algorithms are joint work with Mike
> Karels of UC Berkeley (I guess my name got stuck on things
> because I make the presentations while Mike is off fighting the
> battles on EGP or routing or domains or ...).  I should also
> mention that there's not a whole lot that's original.  In
> particular, the two most important algorithms are quite close to
> (prior) work done by DEC's Raj Jain.  This was by accident in
> one case and deliberate in the other.
> This stuff has been discussed on the ietf and end2end lists
> (Phil participated in some of those discussions and was, in
> fact, one of the early beta testers for our new tcp --  I have
> this nagging suspicion I was set up).  I've appended a couple of
> those mail messages.
> Mike and I have put a number (six, actually) of new algorithms
> into the 4bsd tcp.  Our own measurements and the reports of our
> beta testers suggest that the result is quite good at dealing
> with current, congested conditions on the Internet.  The various
> algorithms were developed and presented at different times over
> the past year and it may not be clear to anyone but the
> developers how, or if, the pieces relate.
> Most of the changes spring from one observation:  The flow on a
> TCP connection (or ISO TP-4 or XNS SPP connection) should, by
> the design of the protocol, obey a `conservation of packets'
> principle.  And, if this principle were obeyed, congestion
> collapse would become the exception rather than the rule.
> Thus congestion control turns into finding places where
> conservation is violated and fixing them.
> By `conservation of packets' I mean the following:
> If you look at the connection "in equilibrium", i.e., running
> stably with a full window of data in transit, you notice that
> the flow is what a physicist would call "conservative":  A new
> packet isn't put into the network until an old packet leaves.
> Two scientific disciplines that deal with flow, hydrodynamics
> and thermodynamics, predict that systems with this conservation
> property should be extremely robust in the face of congestion.
> Observation of the Arpanet or NSFNet suggests that they were not
> particularly robust in the face of congestion.  From whence
> comes the discrepancy?
> [Someone asked me for a simple explanation of why a conservative
> flow system should be stable under load.  I've been trying to
> come up with one, thus far without success -- absolutely nothing
> in hydrodynamics admits to simple explanations.  The shortest
> explanation is that the inhomogenous forcing terms in the
> Fokker-Planck equation vanish and the remaining terms are highly
> damped.  But I don't suppose this means much to anyone (it
> doesn't mean much to me).  My intuition is that conservation
> means there's never an `energy difference' between the outside
> world and the system that the system could use to `pump up' an
> instability to a state of collapse.  Thus the only mechanism the
> system can use for self-destruction is to re-arrange its
> internal energy and create a difference large enough to break
> something.  But entropy (or diffusion) always trys to erase
> large internal energy differences so collapse via these
> mechanisms is improbable.  Possible, but improbable, at least
> until the load gets so outrageous that collective, non-ergodic
> effects start to dominate the overall behavior.]
> Packet conservation can fail three ways:
>   1) the connection doesn't get to equilibrium, or
>   2) a sender injects a packet before an old packet has exited, or
>   3) the equilibrium can't be reached because of resource
>      limits along the path.
> (1) has to be from a connection starting or restarting after a
> packet loss.  Another way to look at the conservation property
> is to say that the sender uses acks as a "clock" to strobe new
> packets into the network.  Since the receiver can generate acks
> no faster than data packets can get through the network, the
> protocol is "self clocking" (an ack can't go in until a packet
> comes out, a packet can't go in until an ack comes out).  Self
> clocking systems automatically adjust to bandwidth and delay
> variations and have a wide dynamic range (an important issue
> when you realize that TCP spans everything from 800 Mbit/sec
> Cray channels to 1200 bit/sec packet radio links).  But the same
> thing that makes a self-clocked system stable when it's running
> makes it hard to get started -- to get data flowing you want acks
> to clock packets out but to get acks you have to have data flowing.
> To get the `clock' started, we came up with an algorithm called
> slow-start that gradually increases the amount of data flowing.
> Although we flatter ourselves that the design of this algorithm
> is quite subtle, the implementation is trivial -- 3 lines of
> code and one new state variable in the sender:
>     1) whenever you're starting or restarting after a loss,
>        set your "congestion window" to 1 packet.
>     2) when you get an ack for new data, increase the
>        congestion window by one packet.
>     3) when you send, send the minimum of the receiver's
>        advertised window and the congestion window.
> (This is quite similar to Raj Jain's CUTE algorithm described in
> IEEE Transactions on Communications, Oct, '86, although we didn't
> know about CUTE at the time we were developing slowstart). 
> Actually, the slow-start window increase isn't all that gradual:
> The window opening takes time proportional to log2(W) where W is
> the window size in packets.  This opens the window fast enough
> to have a negligible effect on performance, even on links that
> require a very large window.  And the algorithm guarantees that
> a connection will source data at a rate at most twice the
> maximum possible on the path.  (Without slow-start, by contrast,
> when 10Mb ethernet hosts talk over the 56Kb Arpanet via IP
> gateways, the gateways see a window's worth of packets (8-16)
> delivered at 200 times the path bandwidth.)
> Once you can reliably start data flowing, problems (2) and (3)
> have to be addressed.  Assuming that the protocol implementation
> is correct, (2) must represent a failure of sender's retransmit
> timer.  A good round trip time estimator (the core of the
> retransmit timer) is the single most important feature of any
> protocol implementation that expects to survive heavy load.  And
> it almost always gets botched.
> One mistake seems to be not estimating the variance of the rtt.
> >From queuing theory we know that rtt and rtt variation increase
> very quickly with load.  Measuring the load by rho (the ratio of
> average arrival rate to average departure rate), the rtt goes up
> like 1/(1-rho) and the variation in rtt goes like 1/(1-rho)^2.
> To make this concrete, if the network is running at 75% of
> capacity (as the Arpanet was in last April's collapse), you
> should expect rtt to vary by a factor of 16 around its mean
> value.  The RFC793 parameter "beta" accounts for rtt variation.
> The suggested value of "2" can adapt to loads of at most 30%.
> Above this point, a connection will respond to load increases by
> retransmitting packets that have only been delayed in transit.
> This makes the network do useless work (wasting bandwidth on
> duplicates of packets that have been or will be delivered) at a
> time when it's known to be having trouble with useful work.
> I.e., this is the network equivalent of pouring gasoline on a
> fire.
> We came up a cheap method for estimating variation (see first of
> attached msgs) and the resulting retransmit timer essentially
> eliminates spurious retransmissions.  A pleasant side effect of
> estimating "beta" rather than using a fixed value is that low
> load as well as high load performance improves, particularly
> over high delay paths such as satellite links.
> Another timer mistake seems to be in the backoff after a
> retransmit:  If we have to retransmit a packet more than once,
> how should the retransmits be spaced?  It turns out there's only
> one scheme that's likely to work, exponential backoff, but
> proving this is a bit involved (one of the two papers alluded to
> in to opening goes through a proof).  We might finesse a proof
> by noting that a network is, to a very good approximation, a
> linear system.  (That is, it is composed of elements that behave
> like linear operators -- integrators, delays, gain stages,
> etc.).  Linear system theory says that if a system is stable,
> the stability is exponential.  This suggests that if we have a
> system that is unstable (a network subject to random load shocks
> and prone to congestive collapse), the way to stabilize it is to
> add some exponential damping (read: exponential timer backoff)
> to its primary excitation (read: senders, traffic sources).
> Once the timers are in good shape, you can state with some
> confidence that a timeout really means a lost packet and not a
> busted timer.  At this point you can do something about (3).
> Packets can get lost for two reasons: they are damaged in
> transit or the network is congested and somewhere on the path
> there was insufficient buffer capacity.  On most of the network
> paths we use, loss due to damage is rare (<<1%) so it is very
> probable that a packet loss is due to congestion in the network.
> Say we try to develop a `congestion avoidance' strategy like the
> one Raj Jain, et.al., propose in DEC TR506 and ISO 8473.  It
> must have two components:  The network must be able to signal
> the transport endpoints that congestion is occurring (or about
> to occur).  And the endpoints must have a policy that decreases
> utilization if this signal is received and increases utilization
> if the signal isn't received.
> If packet loss is (almost) always due to congestion and if a
> timeout is (almost) always due to a lost packet, we have a good
> candidate for the `network is congested' signal.  Particularly
> since this signal is delivered automatically by all existing
> networks, without special modification (as opposed to, say, ISO
> 8473 which requires a new bit in the packet headers and a mod to
> *all* existing gateways to set this bit).
> [As a (lengthy) aside:
> Using `lost packets' to communicate information seems to bother
> people.  The second of the two papers mentioned above is devoted
> to analytic and simulation investigations of our algorithm under
> various network conditions.  I'll briefly mention some of the
> objections we've heard and, I think, addressed.
> There have been claims that signaling congestion via dropping
> packets will adversely affect total throughput (it's easily
> proved that the opposite is true) or that this signal will be
> `too slow' compared to alternatives (The fundamental frequency
> of the system is set by its average round trip time.  From
> control theory and Nyquist's theorem, any control strategy that
> attempts to respond in less than two round trip times will be
> unstable.  A packet loss is detected in at most two rtt, the
> minimum stable response time).
> There have also been claims that this scheme will result in
> unfair or sub-optimal resource allocation (this is untrue if
> equivalent implementations of ISO and our schemes are compared)
> or that mistaking damage for congestion will unnecessarily
> throttle the endpoints on some paths with a high intrinsic loss
> rate (this is mostly untrue -- the scheme we propose is
> analytically tractable so we've worked out its behavior under
> random loss.  It turns out that window flow control schemes just
> don't handle high loss rates well and throughput of a vanilla
> TCP under high, random loss is abysmal.  Adding our congestion
> control scheme makes things slightly worse but the practical
> difference is negligible.  As an example (that we have calculated
> and Jon Crowcroft at UCL has measured), it takes 40 256-byte
> packets to fill the Satnet pipe.  Satnet currently shows a
> random, 1% average packet loss.  The maximum throughput a
> vanilla tcp could achieve at this loss rate is 56% of the 40kbs
> channel bandwidth.  The maximum throughput our TCP can achieve at
> this loss rate is 49% of the channel bandwidth.
> In theory, the use of dynamic congestion control should allow
> one to achieve much better throughput under high loss than is
> possible with normal TCP -- performance comparable to, say,
> NETBLT over the same lossy link.  The reason is that regular TCP
> tries to communicate two different things with one number (the
> window field in the packet): the amount of buffer the receiver
> has available and the delay-bandwidth product of the pipe.  Our
> congestion control scheme separates these two numbers.  The
> sender estimates the pipe size so the receiver window need only
> describe the receiver's buffer size.  As long as the receiver
> advertises a sufficiently large buffer, (twice the delay-
> bandwidth product) a 1% loss rate would translate into a 1%
> throughput effect, not a factor-of-two effect.  Of course, we
> have yet to put this theory to test.]
> The other part of a congestion avoidance strategy, the endnode
> action, is almost identical in Jain's DEC/ISO scheme and our
> TCP.  This is not an accident (we copied Jain's scheme after
> hearing his presentation at the Boston IETF & realizing that the
> scheme was, in a sense, universal):  The endnode strategy
> follows directly from a first-order time-series model of the
> network.  Say we measure network `load' by average queue length
> over fixed intervals of some appropriate length (something near
> the rtt).  If L(i) is the load at interval i, a network not
> subject to congestion can be modeled by saying L(i) changes
> slowly compared to the sampling time.  I.e.,
>     L(i) = N 
> (the average queue length doesn't change over time).  If the
> network is subject to congestion, this zero'th order model
> breaks down.  The average queue length becomes the sum of two
> terms, the N above that accounts for the average arrival rate
> of new traffic and a new term that accounts for the left-over
> traffic from the last time interval:
>     L(i) = N + a L(i-1)
> (As pragmatists, we extend the original model just enough to
> explain the new behavior.  What we're doing here is taking the
> first two terms in a taylor series expansion of L(t) when we
> find that just the first term is inadequate.  There is reason to
> believe one would eventually need a three term, second order
> model, but not until the Internet has grown to several times its
> current size.)
> When the network is congested, the `a' term must be large and
> the load (queues) will start increasing exponentially.  The only
> way to stabilize the system is if the traffic sources throttle
> back at least as fast as the queues are growing.  Since the way
> a source controls load in a window-based protocol is to adjust
> the size of the window, W, we end up with the sender policy
>     On congestion:   W(i) = d W(i-1)    (d < 1)
> I.e., a multiplicative decrease of the window size.  (This will
> turn into an exponential decrease over time if the congestion
> persists.)
> If there's no congestion, the `a' term must be near zero and the
> network load is about constant.  You have to try to increase the
> bandwidth you're using to find out the current limit (e.g., you
> could have been sharing the path with someone else and converged
> to a window that gives you each half the available bandwidth.  If
> he shuts down, 50% of the bandwidth will get wasted unless you
> make some effort to increase your window size.)  What should the
> increase policy be?
> The first thought is to use a symmetric, multiplicative increase,
> possibly with a longer time constant.  E.g.,  W(i) = b W(i-1),
> 1 < b <= 1/d.  This is a mistake.  The result will oscillate
> wildly and, on the average, deliver poor throughput.  The reason
> why is tedious to explain.  It has to do with that fact that it
> is easy to drive the net into saturation but hard for the net
> to recover (what Kleinrock, vol.2, calls the "rush-hour effect").
> Thus overestimating the available bandwidth is costly.  But an
> exponential, almost regardless of its time constant, increases
> so quickly that large overestimates are inevitable.
> Without justification, I'll simply state that the best increase
> policy is to make small, constant changes to the window size (if
> you've had a control theory course, you've seen the justification):
>     On no congestion:   W(i) = W(i-1) + u	(u << the path delay-
> 						bandwidth product)
> I.e., an additive increase policy.  This is exactly the policy that
> Jain, et.al., suggest in DEC TR-506 and exactly the policy we've
> implemented in TCP.  The only difference in our implementations is
> the choice of constants for `d' and `u'.  We picked .5 and 1 for
> reasons that are partially explained in the second of the attached
> messages.  A more complete analysis is in the second in-progress
> paper (and may be discussed at the upcoming IETF meeting).
> All of the preceding has probably made it sound as if the
> dynamic congestion algorithm is hairy but it's not.  Like
> slow-start, it turns out to be three lines of code and one new
> connection state variable (the combined slow-start/congestion
> control algorithm is described in the second of the attached
> msgs).
> That covers about everything that's been done to date.  It's
> about 1/3 of what we think clearly needs to be done.  The next
> big step is to do the gateway `congestion detection' algorithms
> so that a signal is sent to the endnodes as early as possible
> (but not so early that the gateway ends up starved for traffic).
> The way we anticipate doing these algorithms, gateway `self
> protection' from a mis-behaving host will fall-out for free (that
> host will simply have most of its packets dropped as the gateway
> trys to tell it that it's using more than its fair share).  Thus,
> like the endnode algorithm, the gateway algorithm will improve
> things even if no endnode is modified to do dynamic congestion
> avoidance.  And nodes that do implement congestion avoidance
> will get their fair share of bandwidth and a minimum number of
> packet drops.
> Since congestion grows exponentially, detecting it early is
> important.  (If it's detected early, small adjustments to the
> senders' windows will cure it.  Otherwise massive adjustments
> will be necessary to give the net enough spare capacity to pump
> out the backlog.)  But, given the bursty nature of traffic,
> reliable detection is a non-trivial problem.  There is a scheme
> proposed in DEC TR-506 but we think it might have convergence
> problems under high load and/or significant second-order dynamics
> in the traffic.  We are thinking of using some of our earlier
> work on ARMAX models for rtt/queue length prediction as the
> basis of the detection.  Preliminary results suggest that this
> approach works well at high load, is immune to second-order
> effects in the traffic and is cheap enough to compute that it
> wouldn't slow down thousand-packet-per-second gateways.
> In addition to the gateway algorithms, we want to apply the
> endnode algorithms to connectionless traffic (e.g., domain
> server queries, RPC requests).  We have the rate-based
> equivalent of the TCP window algorithm worked out (there should
> be a message describing it on the tcp list sometime in the near
> future).  Sun Microsystems has been very interested, and
> supportive, during the development of the TCP congestion control
> (I believe Sun OS 4.0 will ship with our new TCP) and Sun has
> offered to cooperate in developing the RPC dynamic congestion
> control algorithms, using NFS as a test-bed (since NFS is known
> to have congestion problems).
> The far future holds some work on controlling second-order, non-
> ergodic effects, adaptively determining the rate constants in
> the control algorithm, and some other things that are too vague
> to mention.
>  - Van
> From: van at HELIOS.EE.LBL.GOV (Van Jacobson)
> Newsgroups: comp.protocols.tcp-ip
> Subject: some interim notes on the bsd network speedups
> Message-ID: <8807200426.AA01221 at helios.ee.lbl.gov>
> Date: 20 Jul 88 04:26:17 GMT
> I told the potential beta-tests for our new 4bsd network code
> that I hoped to have a version of the code out by the end of
> July.  (BTW, we've got all the beta testers we can handle --
> please don't apply.)  It looks like that's going to turn into
> the end of August, in part because of SIGCOMM and in part
> because Sun puts sending source to academic customers on the
> bottom of its priority list.  I thought I'd flame about the
> latter and give a bit of a status report on the new code.
> I've been trying to put together a beta distribution for the
> "header prediction" bsd network code.  This effort has been
> stymied because it's impossible to get current source from Sun.
> The code involves major changes to the 4.3bsd kernel.  The only
> local machines I can use to develop new kernels are Suns --
> everything else is either multi-user or has pathetic ethernet
> hardware.  But the only Sun kernel source I've got is the doubly
> obsolete Sun OS 3.5/4.2 BSD.  It would be a massive waste of
> time to upgrade this system to 4.3 BSD just so I can develop
> the next BSD -- Bill Nowicki did the upgrade a year ago and
> binaries of the new system (totally worthless to me) are
> shipping as Sun OS 4.0.  [I'm not the only one suffering this
> problem -- I understand Craig Partridge's multicast work is
> suffering because he can't get 4.3-compatible Sun source.  I
> think he gave up & decided to do all his development on 4.3bsd
> Vaxen.  And I think I heard Chuck Hedrick say 4.0 has all the
> rlogin, URG and nameserver bugs that we fondly remember fixing
> in 3.x.  And he has to get source before the academic year
> starts or they won't be able to switch until a semester break.
> And Mike Karels is saying "I told you so" and suggesting I buy
> some CCIs.  Pity that Sun can't figure out that it's in their
> best interest to do TIMELY source distribution to the academic
> and research community -- their software development base gets
> expanded a few hundred-fold for the cost of making tapes.]
> Anyway, now that I've vented my spleen, there are some interim
> results to talk about.  While waiting for either useful source
> or a better hardware platform to show up, I've been cleaning up
> my original mods and backing out changes one and two at a time
> to gauge their individual effect.  Because network performance
> seems to rest on getting a lot of things happening in parallel,
> this leave-one-out testing doesn't give simple good/bad answers
> (I had one test case that went like
>     Basic system:    600 KB/s
>     add feature A:    520 KB/s
>     drop A, add B:    530 KB/s
>     add both A & B:    700 KB/s
> Obviously, any statement of the form "feature A/B is good/bad"
> is bogus.)  But, in spite of the ambiguity, some of the network
> design folklore I've heard seems to be clearly wrong.
> In the process of cleaning things up, they slowed down.  Task-
> to-task data throughput using TCP between two Sun 3/60s dropped
> from 1 MB/s (about 8.6 Mb/s on the wire) to 890 KB/s (about 7.6
> Mb/s on the wire).  I know where the 11% was lost (an
> interaction between "sosend" and the fact that an AMD LANCE chip
> requires at least 100 bytes in the first chunk of data if you
> ask it to chain -- massive braindamage on AMD's part) and how to
> get it back (re-do the way user data gets handed to the
> transport protocol) but need to talk with Mike Karels about the
> "right" way to do this.
> Still, 890 KB/s represents a non-trivial improvement over the
> stock Sun/4bsd system:  Under identical test conditions (same
> socket & user buffer sizes, same MSS and MTU, same machines),
> the best tcp throughput I could get with an out-the-box Sun OS
> 3.5 was 380 KB/s.  I wish I could say "make these two simple
> changes and your throughput will double" but I can't.  There
> were lots and lots of fiddley little changes, none made a huge
> difference and they all seemed to be necessary.
> The biggest single effect was a change to sosend (the routine
> between the user "write" syscall and tcp_output).  Its loop
> looked something like:
>     while there is user data & space in the socket buffer
>     copy from user space to socket
>     call the protocol "send" routine
> After hooking a scope to our ethernet cable & looking at the
> packet spacings, I changed this to
>     while there is user data & space in the socket buffer
>     copy up to 1K (one cluster's worth) from user space to socket
>     call the protocol "send" routine
> and the throughput jumped from 380 to 456 KB/s (+20%).  There's
> one school of thought that says the first loop was better
> because it minimized the "boundary crossings", the fixed costs
> of routine calls and context changes.  This same school is
> always lobbying for "bigger":  bigger packets, bigger windows,
> bigger buffers, for essentially the same reason: the bigger
> chunks are, the fewer boundary crossings you pay for.  The
> correct school, mine :-), says there's always a fixed cost and a
> variable cost (e.g., the cost of maintaining tcp state and
> tacking a tcp packet header on the front of some data is
> independent of the amount of data; the cost of filling in the
> checksum field in that header scales linearly with the amount of
> data).  If the size is large enough to make the fixed cost small
> compared to the variable cost, making things bigger LOWERS
> throughput because you throw away opportunities for parallelism.
> Looking at the ethernet, I saw a burst of packets, a long dead
> time, another burst of packets, ... .  It was clear that while
> we were copying 4 KB from the user, the processor in the LANCE
> chip and tcp_input on the destination machine were mostly
> sitting idle.
> To get good network performance, where there are guaranteed to
> be many processors that could be doing things in parallel, you
> want the "units of work" (loop sizes, packet sizes, etc.) to be
> the SMALLEST values that amortise the fixed cost.  In Berkeley
> Unix, the fixed costs of protocol processing are pretty low and
> sizes of 1 - 2 KB on a 68020 are as large as you'd want to get.
> (This is easy to determine.  Just do a throughput vs. size test
> and look for the knee in the graph.  Best performance is just to
> the right of the knee.)  And, obviously, on a faster machine
> you'd probably want to do things in even smaller units (if the
> overhead stays the same -- Crays are fast but hardware
> strangeness drives the fixed costs way up.  Just remember that
> if it takes large packets and large buffers to get good
> performance on a fast machine, that's because it's broken, not
> because it's fast.)
> Another big effect (difficult to quantify because several
> changes had to be made at once) was to remove lots of
> more-or-less hidden data copies from the protocol processing.
> It's a truism that you never copy data in network code.  Just
> lay the data down in a buffer & pass around pointers to
> appropriate places in that buffer.  But there are lots of places
> where you need to get from a pointer into the buffer back to a
> pointer to the buffer itself (e.g., you've got a pointer to the
> packet's IP header, there's a header checksum error, and you
> want to free the buffer holding the packet).  The routine "dtom"
> converts a data pointer back to a buffer pointer but it only
> works for small things that fit in a single mbuf (the basic
> storage allocation unit in the bsd network code).  Incoming
> packets are never in an mbuf; they're in a "cluster" which the
> mbuf points to.  There's no way to go from a pointer into a
> cluster to a pointer to the mbuf.  So, anywhere you might need
> to do a dtom (anywhere there's a detectable error), there had to
> be a call to "m_pullup" to make sure the dtom would work.
> (M_pullup works by allocating a fresh mbuf, copying a bunch of
> data from the cluster to this mbuf, then chaining the original
> cluster behind the new mbuf.)
> So, we were doing a bunch of copying to expedite error handling.
> But errors usually don't happen (if you're worried about
> performance, first you make sure there are very, very few
> errors), so we were doing a bunch of copying for nothing.  But,
> if you're sufficiently insomniac, in about a week you can track
> down all the pullup's associated with all the dtom's and change
> things to avoid both.  This requires massive recoding of both
> the TCP and IP re-assembly code.  But it was worth it:  TCP
> throughput climbed from around 600 KB/s to 750 KB/s and IP
> forwarding just screamed:  A 3/60 forwarding packets at the 9
> Mb/s effective ethernet bandwidth used less than 50% of the CPU.
> [BTW, in general I didn't find anything wrong with the BSD
> mbuf/cluster model.  In fact, I tried some other models (e.g.,
> do everything in packet sized chunks) and they were slower.
> There are many cases where knowing that you can grab an mbuf and
> chain it onto a chunk of data really simplifies the protocol
> code (simplicity == speed).  And the level of indirection and
> fast, reference counted allocation of clusters can really be a
> win on data transfers (a la kudp_fastsend in Sun OS).  The
> biggest problem I saw, other than the m_pullup's, was that
> clusters are too small:  They need to be at least big enough for
> an ethernet packet (2K) and making them page sized (8K on a Sun)
> doesn't hurt and would let you do some quick page swap tricks in
> the user-system data copies (I didn't do any of these tricks in
> the fast TCP.  I did use 2KB clusters to optimize things for the
> ethernet drivers).]
> An interesting result of the m_pullup removals was the death of
> another piece of folklore.  I'd always heard that the limiting
> CPU was the receiver.  Wrong.  After the pullup changes, the
> sender would be maxed out at 100% CPU utilization while the
> receiver loafed along at 65-70% utilization (utilizations
> measured with a microprocessor analyzer; I don't trust the
> system's stats).  In hindsight, this seems reasonable.  At the
> receiver, a packet comes in, wanders up to the tcp layer, gets
> stuck in the socket buffer and an ack is generated (i.e., the
> processing terminates with tcp_input at the socket buffer).  At
> the sender, the ack comes in, wanders up to the tcp layer, frees
> some space, then the higher level socket process has to be woken
> up to fill that space (i.e., the processing terminates with
> sosend, at the user socket layer).  The way Unix works, this
> forces a boundary crossing between the bottom half (interrupt
> service) and top half (process context) of the kernel.  On a
> Sun, and most of the other Unix boxes I know of, this is an
> expensive crossing.  [Of course, the user process on the
> receiver side has to eventually wake up and empty the socket
> buffer but it gets to do that asynchronously and the dynamics
> tend to arrange themselves so it processes several packets on
> each wakeup, minimizing the boundary crossings.]
> Talking about the bottom half of the kernel reminds me of
> another major effect.  There seemed to be a "sound barrier" at
> 550 KB/s.  No matter what I did, the performance stuck at 550
> KB/s.  Finally, I noticed that Sun's LANCE ethernet driver,
> if_le.c, would only queue one packet to the LANCE at a time.
> Picture what this means:  (1) driver hands packet to LANCE, (2)
> LANCE puts packet on wire, (3) end of packet, LANCE interrupts
> processor, (4) interrupt dispatched to driver, (5) go back to
> (1).  The time involved in (4) is non-trivial, more than 150us,
> and represents a lot of idle time for the LANCE and the wire.
> So, I rewrote the driver to queue an arbitrary number of packets
> to the LANCE, the sound barrier disappeared, and other changes
> started making the throughput climb (it's not clear whether this
> change had any effect on throughput or just allowed other
> changes to have an effect).
> [Shortly after making the if_le change, I learned why Sun might
> have written the driver the silly way they did:  Before the
> change, the 6 back-to-back IP fragments of an NFS write would
> each be separated by the 150us interrupt service time.  After
> the change, they were really back-to-back, separated by only the
> 9.6us minimum ethernet spacing (and, no, Sun does NOT violate
> the ethernet spec in any way, shape or form.  After my last
> message on this stuff, Apollo & DEC people kept telling me Sun
> was out of spec.  I've been watching packets on our ethernet for
> so long, I'm starting to learn the middle name of every bit.
> Sun bits look like DEC bits and Sun, or rather, the LANCE in the
> 3/50 & 3/60, completely complys with the timings in the blue
> book.)  Anyway, the brain-dead Intel 82586 ethernet chip Sun
> puts in all its 3/180, 3/280 and Sun-4 file servers can't hack
> back-to-back, minimum spacing packets.  Every now and again it
> drops some of the frags and wanders off to never-never land
> ("iebark reset").  Diskless workstations don't work well when
> they can't write to their file server and, until I hit on the
> idea of inserting "DELAY" macros in kudp_fastsend, it looked
> like I could have a fast TCP or a functional workstation but not
> both.]
> Probably 30% of the performance improvements came from fixing
> things in the Sun kernel.  I mean like, really, guys:  If the
> current task doesn't change, and it doesn't 80% of the time
> swtch is called, there's no need to do a full context save and
> restore.  Adding the two lines
>     cmpl    _masterprocp,a0
>     jeq     6f              ? restore of current proc is easy
> just before the call to "resume" in sun3/vax.s:swtch got me a
> quick 70 KB/s performance increase but felt more like a bug fix
> than progress.  And a kernel hacker that does 16-bit "movw"s and
> "addw"s on a 68020, or writes 2 instruction dbra loops designed
> to put a 68010 in loop mode, should be shot.  The alu takes the
> same 3 clocks for a 2 byte add or a 4 byte add so things will
> finish a lot quicker if you give it 4 bytes at a time.  And
> every branch stalls the pipe, so unrolling that loop to cut down
> on branches is a BIG win.
> Anyway, I recoded the checksum routine, ocsum.s (now oc_cksum.s
> because I found the old calling sequence wasn't the best way to
> do things) and its caller, in_cksum.c, and got the checksumming
> time down from 490us/KB to 130us/KB.  Unrolling the move loop in
> copyin/copyout (the routines that move data user to kernel and
> kernel to user), got them down from 200us/KB to 140us/KB.  (BTW,
> if you combine the move with the checksum, which I did in the
> kludged up, fast code that ran 1 MB/s on a 15MHz 3/50, it costs
> 200us/KB, not the 300us/KB you'd expect from adding the move
> and sum times.  Pipelined processors are strange and wonderful
> beasts.)
> From these times, you can work out most of the budgets and
> processing details:  I was using 1408 data byte packets (don't
> ask why 1408).  It takes 193us to copy a packet's worth of data
> and 184us to checksum the packet and its 40 byte header.  From
> the logic analyzer, the LANCE uses 300us of bus and memory
> bandwidth reading or writing the packet (I don't understand why,
> it should only take half this).   So, the variable costs are
> around 700us per packet.  When you add the 18 byte ethernet
> header and 12 byte interpacket gap, to run at 10 Mb/s I'd have
> to supply a new packet every 1200us.  I.e., there's a budget of
> 500us for all the fixed costs (protocol processing, interrupt
> dispatch, device setup, etc.).  This is do-able (I've done it,
> but not very cleanly) but what I'm getting today is a packet
> every 1500us.  I.e., 800us per packet fixed costs.  When I look
> with our analyzer, 30% of this is TCP, IP, ARP and ethernet
> protocol processing (this was 45% before the "header prediction"
> tcp mods), 15% is stalls (idle time that I don't currently
> understand but should be able to eventually get rid of) and 55%
> is device setup, interrupt service and task handling.  I.e.,
> protocol processing is negligible (240us per packet on this
> processor and this isn't a fast processor in today's terms).  To
> make the network go faster, it seems we just need to fix the
> operating system parts we've always needed to fix:  I/O service,
> interrupts, task switching and scheduling.  Gee, what a surprise.
>  - Van
> BBoard-ID: 7621
> BB-Posted: Tue, 25 Oct 88 2:06:08 EDT
> To: tcp-ip at sri-nic.ARPA
> Subject: 4BSD TCP Ethernet Throughput
> Date: Mon, 24 Oct 88 13:33:13 PDT
> From: Van Jacobson <van at helios.ee.lbl.gov>
> Many people have asked for the Ethernet throughput data I
> showed at Interop so it's probably easier to post it:
> These are some throughput results for an experimental version of
> the 4BSD (Berkeley Unix) network code running on a couple of
> different MC68020-based systems: Sun 3/60s (20MHz 68020 with AMD
> LANCE Ethernet chip) and Sun 3/280s (25MHz 68020 with Intel
> 82586 Ethernet chip) [note again the tests were done with Sun
> hardware but not Sun software -- I'm running 4.?BSD, not Sun
> OS].  There are lots and lots of interesting things in the data
> but the one thing that seems to have attracted people's
> attention is the big difference in performance between the two
> Ethernet chips.
> The test measured task-to-task data throughput over a TCP
> connection from a source (e.g., chargen) to a sink (e.g.,
> discard).  The tests were done between 2am and 6am on a fairly
> quiet Ethernet (~100Kb/s average background traffic).  The
> packets were all maximum size (1538 bytes on the wire or 1460
> bytes of user data per packet).  The free parameters for the
> tests were the sender and receiver socket buffer sizes (which
> control the amount of 'pipelining' possible between the sender,
> wire and receiver).  Each buffer size was independently varied
> from 1 to 17 packets in 1 packet steps.  Four tests were done at
> each of the 289 combinations.  Each test transferred 8MB of data
> then recorded the total time for the transfer and the send and
> receive socket buffer sizes (8MB was chosen so that the worst
> case error due to the system clock resolution was ~.1% -- 10ms
> in 10sec).  The 1,156 tests per machine pair were done in random
> order to prevent any effects from fixed patterns of resource
> allocation.
> In general, the maximum throughput was observed when the sender
> buffer equaled the receiver buffer (the reason why is complicated
> but has to do with collisions).  The following table gives the
> task-to-task data throughput (in KBytes/sec) and throughput on
> the wire (in MBits/sec) for (a) a 3/60 sending to a 3/60 and
> (b) a 3/280 sending to a 3/60.
> 	_________________________________________________
> 	|              3/60 to 3/60   |  3/280 to 3/60   |
> 	|            (LANCE to LANCE) | (Intel to LANCE) |
> 	| socket                      |                  |
> 	| buffer     task to          | task to          |
> 	|  size       task      wire  |  task      wire  |
> 	|(packets)   (KB/s)    (Mb/s) | (KB/s)    (Mb/s) |
> 	|    1         384      3.4   |   337      3.0   |
> 	|    2         606      5.4   |   575      5.1   |
> 	|    3         690      6.1   |   595      5.3   |
> 	|    4         784      6.9   |   709      6.3   |
> 	|    5         866      7.7   |   712      6.3   |
> 	|    6         904      8.0   |   708      6.3   |
> 	|    7         946      8.4   |   710      6.3   |
> 	|    8         954      8.4   |   718      6.4   |
> 	|    9         974      8.6   |   715      6.3   |
> 	|   10         983      8.7   |   712      6.3   |
> 	|   11         995      8.8   |   714      6.3   |
> 	|   12        1001      8.9   |   715      6.3   |
> 	|_____________________________|__________________|
> The theoretical maximum data throughput, after you take into
> account all the protocol overheads, is 1,104 KB/s (this
> task-to-task data rate would put 10Mb/s on the wire).  You can
> see that the 3/60s get 91% of the the theoretical max.  The
> 3/280, although a much faster processor (the CPU performance is
> really dominated by the speed of the memory system, not the
> processor clock rate, and the memory system in the 3/280 is
> almost twice the speed of the 3/60), gets only 65% of
> theoretical max.
> The low throughput of the 3/280 seems to be entirely due to the
> Intel Ethernet chip: at around 6Mb/s, it saturates.  (I put the
> board on an extender and watched the bus handshake lines on the
> 82586 to see if the chip or the Sun interface logic was pooping
> out.  It was the chip -- it just stopped asking for data.  (The
> CPU was loafing along with at least 35% idle time during all
> these tests so it wasn't the limit).
> [Just so you don't get confused:  Stuff above was measurements.
>  Stuff below includes opinions and interpretation and should
>  be viewed with appropriate suspicion.]
> If you graph the above, you'll see a large notch in the Intel
> data at 3 packets.  This is probably a clue to why it's dying:
> TCP delivers one ack for every two data packets.  At a buffer
> size of three packets, the collision rate increases dramatically
> since the sender's third packet will collide with the receiver's
> ack for the previous two packets (for buffer sizes of 1 and 2,
> there are effectively no collisions).  My suspicion is that the
> Intel is taking a long time to recover from collisions (remember
> that you're 64 bytes into the packet when you find out you've
> collided so the chip bus logic has to back up 64 bytes -- Intel
> spent their silicon making the chip "programmable", I doubt they
> invested as much as AMD in the bus interface).  This may or may
> not be what's going on:  life is too short to spend debugging
> Intel parts so I really don't care to investigate further.
> The one annoyance in all this is that Sun puts the fast Ethernet
> chip (the AMD LANCE) in their slow machines (3/50s and 3/60s)
> and the slow Ethernet chip (Intel 82586) in their fast machines
> (3/180s, 3/280s and Sun-4s, i.e., all their file servers).
> [I've had to put delay loops in the Ethernet driver on the 3/50s
> and 3/60s to slow them down enough for the 3/280 server to keep
> up.]  Sun's not to blame for anything here:  It costs a lot
> to design a new Ethernet interface; they had a design for the
> 3/180 board set (which was the basis of all the other VME
> machines--the [34]/280 and [34]/110); and no market pressure to
> change it.  If they hadn't ventured out in a new direction with
> the 3/[56]0 -- the LANCE -- I probably would have thought
> 700KB/s was great Ethernet throughput (at least until I saw
> Dave Boggs' DEC-Titan/Seeq-chip throughput data).
> But I think Sun is overdue in offering a high-performance VME
> Ethernet interface.  That may change though -- VME controllers
> like the Interphase 4207 Eagle are starting to appear which
> should either put pressure on Sun and/or offer a high
> performance 3rd party alternative (I haven't actually tried an
> Eagle yet but from the documentation it looks like they did a
> lot of things right).  I'd sure like to take the delay loops out
> of my LANCE driver...
>  - Van
> ps: I have data for Intel-to-Intel and LANCE-to-Intel as well as
>     the Intel-to-LANCE I listed above.  Using an Intel chip on the
>     receiver, the results are MUCH worse -- 420KB/s max.  I chose
>     the data that put the 82586 in its very best light.
>     I also have scope pictures taken at the transceivers during all
>     these tests.  I'm sure there'll be a chorus of "so-and-so violates
>     the Ethernet spec" but that's a lie -- NONE OF THESE CHIPS OR
>     I looked very carefully for violations and have the pictures to
>     prove there were none.
>     Finally, all of the above is Copyright (c) 1988 by Van Jacobson.
>     If you want to reproduce any part of it in print, you damn well
>     better ask me first -- I'm getting tired of being misquoted in
>     trade rags.
> From van at helios.ee.lbl.gov Mon Apr 30 01:44:05 1990
> To: end2end-interest at ISI.EDU
> Subject: modified TCP congestion avoidance algorithm
> Date: Mon, 30 Apr 90 01:40:59 PDT
> From: Van Jacobson <van at helios.ee.lbl.gov>
> Status: RO
> This is a description of the modified TCP congestion avoidance
> algorithm that I promised at the teleconference.
> BTW, on re-reading, I noticed there were several errors in
> Lixia's note besides the problem I noted at the teleconference.
> I don't know whether that's because I mis-communicated the
> algorithm at dinner (as I recall, I'd had some wine) or because
> she's convinced that TCP is ultimately irrelevant :).  Either
> way, you will probably be disappointed if you experiment with
> what's in that note.
> First, I should point out once again that there are two
> completely independent window adjustment algorithms running in
> the sender:  Slow-start is run when the pipe is empty (i.e.,
> when first starting or re-starting after a timeout).  Its goal
> is to get the "ack clock" started so packets will be metered
> into the network at a reasonable rate.  The other algorithm,
> congestion avoidance, is run any time *but* when (re-)starting
> and is responsible for estimating the (dynamically varying)
> pipesize.  You will cause yourself, or me, no end of confusion
> if you lump these separate algorithms (as Lixia's message did).
> The modifications described here are only to the congestion
> avoidance algorithm, not to slow-start, and they are intended to
> apply to large bandwidth-delay product paths (though they don't
> do any harm on other paths).  Remember that with regular TCP (or
> with slow-start/c-a TCP), throughput really starts to go to hell
> when the probability of packet loss is on the order of the
> bandwidth-delay product.  E.g., you might expect a 1% packet
> loss rate to translate into a 1% lower throughput but for, say,
> a TCP connection with a 100 packet b-d p. (= window), it results
> in a 50-75% throughput loss.  To make TCP effective on fat
> pipes, it would be nice if throughput degraded only as function
> of loss probability rather than as the product of the loss
> probabilty and the b-d p.  (Assuming, of course, that we can do
> this without sacrificing congestion avoidance.)
> These mods do two things: (1) prevent the pipe from going empty
> after a loss (if the pipe doesn't go empty, you won't have to
> waste round-trip times re-filling it) and (2) correctly account
> for the amount of data actually in the pipe (since that's what
> congestion avoidance is supposed to be estimating and adapting to).
> For (1), remember that we use a packet loss as a signal that the
> pipe is overfull (congested) and that packet loss can be
> detected one of two different ways:  (a) via a retransmit
> timeout or (b) when some small number (3-4) of consecutive
> duplicate acks has been received (the "fast retransmit"
> algorithm).  In case (a), the pipe is guaranteed to be empty so
> we must slow-start.  In case (b), if the duplicate ack
> threshhold is small compared to the bandwidth-delay product, we
> will detect the loss with the pipe almost full.  I.e., given a
> threshhold of 3 packets and an LBL-MIT bandwidth-delay of around
> 24KB or 16 packets (assuming 1500 byte MTUs), the pipe is 75%
> full when fast-retransmit detects a loss (actually, until
> gateways start doing some sort of congestion control, the pipe
> is overfull when the loss is detected so *at least* 75% of the
> packets needed for ack clocking are in transit when
> fast-retransmit happens).  Since the pipe is full, there's no
> need to slow-start after a fast-retransmit.
> For (2), consider what a duplicate ack means:  either the
> network duplicated a packet (i.e., the NSFNet braindead IBM
> token ring adapters) or the receiver got an out-of-order packet.
> The usual cause of out-of-order packets at the receiver is a
> missing packet.  I.e., if there are W packets in transit and one
> is dropped, the receiver will get W-1 out-of-order and
> (4.3-tahoe TCP will) generate W-1 duplicate acks.  If the
> `consecutive duplicates' threshhold is set high enough, we can
> reasonably assume that duplicate acks mean dropped packets.
> But there's more information in the ack:  The receiver can only
> generate one in response to a packet arrival.  I.e., a duplicate
> ack means that a packet has left the network (it is now cached
> at the receiver).  If the sender is limitted by the congestion
> window, a packet can now be sent.  (The congestion window is a
> count of how many packets will fit in the pipe.  The ack says a
> packet has left the pipe so a new one can be added to take its
> place.)  To put this another way, say the current congestion
> window is C (i.e, C packets will fit in the pipe) and D
> duplicate acks have been received.  Then only C-D packets are
> actually in the pipe and the sender wants to use a window of C+D
> packets to fill the pipe to its estimated capacity (C+D sent -
> D received = C in pipe).
> So, conceptually, the slow-start/cong.avoid/fast-rexmit changes
> are:
>   - The sender's input routine is changed to set `cwnd' to `ssthresh'
>     when the dup ack threshhold is reached.  [It used to set cwnd to
>     mss to force a slow-start.]  Everything else stays the same.
>   - The sender's output routine is changed to use an effective window
>     of min(snd_wnd, cwnd + dupacks*mss)  [the change is the addition
>     of the `dupacks*mss' term.]  `Dupacks' is zero until the rexmit
>     threshhold is reached and zero except when receiving a sequence
>     of duplicate acks.
> The actual implementation is slightly different than the above
> because I wanted to avoid the multiply in the output routine
> (multiplies are expensive on some risc machines).  A diff of the
> old and new fastrexmit code is attached (your line numbers will
> vary).
> Note that we still do congestion avoidance (i.e., the window is
> reduced by 50% when we detect the packet loss).  But, as long as
> the receiver's offered window is large enough (it needs to be at
> most twice the bandwidth-delay product), we continue sending
> packets (at exactly half the rate we were sending before the
> loss) even after the loss is detected so the pipe stays full at
> exactly the level we want and a slow-start isn't necessary.
> Some algebra might make this last clear:  Say U is the sequence
> number of the first un-acked packet and we are using a window
> size of W when packet U is dropped.  Packets [U..U+W) are in
> transit.  When the loss is detected, we send packet U and pull
> the window back to W/2.  But in the round-trip time it takes
> the U retransmit to fill the receiver's hole and an ack to get
> back, W-1 dup acks will arrive (one for each packet in transit).
> The window is effectively inflated by one packet for each of
> these acks so packets [U..U+W/2+W-1) are sent.  But we don't
> re-send packets unless we know they've been lost so the amount
> actually sent between the loss detection and the recovery ack is
> U+W/2+W-1 - U+W = W/2-1 which is exactly the amount congestion
> avoidance allows us to send (if we add in the rexmit of U).  The
> recovery ack is for packet U+W so when the effective window is
> pulled back from W/2+W-1 to W/2 (which happens because the
> recovery ack is `new' and sets dupack to zero), we are allowed
> to send up to packet U+W+W/2 which is exactly the first packet
> we haven't yet sent.  (I.e., there is no sudden burst of packets
> as the `hole' is filled.)  Also, when sending packets between
> the loss detection and the recovery ack, we do nothing for the
> first W/2 dup acks (because they only allow us to send packets
> we've already sent) and the bottleneck gateway is given W/2
> packet times to clean out its backlog.  Thus when we start
> sending our W/2-1 new packets, the bottleneck queue is as empty
> as it can be.
> [I don't know if you can get the flavor of what happens from
> this description -- it's hard to see without a picture.  But I
> was delighted by how beautifully it worked -- it was like
> watching the innards of an engine when all the separate motions
> of crank, pistons and valves suddenly fit together and
> everything appears in exactly the right place at just the right
> time.]
> Also note that this algorithm interoperates with old tcp's:  Most
> pre-tahoe tcp's don't generate the dup acks on out-of-order packets.
> If we don't get the dup acks, fast retransmit never fires and the
> window is never inflated so everything happens in the old way (via
> timeouts).  Everything works just as it did without the new algorithm
> (and just as slow).
> If you want to simulate this, the intended environment is:
>     - large bandwidth-delay product (say 20 or more packets)
>     - receiver advertising window of two b-d p (or, equivalently,
>       advertised window of the unloaded b-d p but two or more
>       connections simultaneously sharing the path).
>     - average loss rate (from congestion or other source) less than
>       one lost packet per round-trip-time per active connection.
>       (The algorithm works at higher loss rate but the TCP selective
>       ack option has to be implemented otherwise the pipe will go empty
>       waiting to fill the second hole and throughput will once again
>       degrade at the product of the loss rate and b-d p.  With selective
>       ack, throughput is insensitive to b-d p at any loss rate.)
> And, of course, we should always remember that good engineering
> practise suggests a b-d p worth of buffer at each bottleneck --
> less buffer and your simulation will exhibit the interesting
> pathologies of a poorly engineered network but will probably
> tell you little about the workings of the algorithm (unless the
> algorithm misbehaves badly under these conditions but my
> simulations and measurements say that it doesn't).  In these
> days of $100/megabyte memory, I dearly hope that this particular
> example of bad engineering is of historical interest only.
>  - Van
> [...code diffs deleted...]
> Received: from rx7.ee.lbl.gov by uu2.psi.com (5.65b/4.0.071791-PSI/PSINet) via SMTP;
> 	id AA12583 for popbbn; Wed, 8 Sep 93 01:29:46 -0400
> Received: by rx7.ee.lbl.gov for craig at aland.bbn.com (5.65/1.44r)
> 	id AA05271; Tue, 7 Sep 93 22:30:15 -0700
> Message-Id: <9309080530.AA05271 at rx7.ee.lbl.gov>
> To: Craig Partridge <craig at aland.bbn.com>
> Cc: David Clark <ddc at lcs.mit.edu>
> Subject: Re: query about TCP header on tcp-ip 
> In-Reply-To: Your message of Tue, 07 Sep 93 09:48:00 PDT.
> Date: Tue, 07 Sep 93 22:30:14 PDT
> From: Van Jacobson <van at ee.lbl.gov>
> Craig,
> As you probably remember from the "High Speed TCP" CNRI meeting,
> my kernel looks nothing at all like any version of BSD.  Mbufs
> no longer exist, for example, and `netipl' and all the protocol
> processing that used to be done at netipl interrupt level are
> gone.  TCP receive packet processing in the new kernel really is
> about 30 instructions on a RISC (33 on a sparc but three of
> those are compiler braindamage).  Attached is the C code & the
> associated sparc assembler.
> A brief recap of the architecture:  Packets go in 'pbufs' which
> are, in general, the property of a particular device.  There is
> exactly one, contiguous, packet per pbuf (none of that mbuf
> chain stupidity).  On the packet input interrupt, the device
> driver upcalls through the protocol stack (no more of that queue
> packet on the netipl software interrupt bs).  The upcalls
> percolate up the stack until the packet is completely serviced
> (e.g., NFS requests that can be satisfied from in-memory data &
> data structures) or they reach a routine that knows the rest of
> the processing must be done in some process's context in which
> case the packet is laid on some appropriate queue and the
> process is unblocked.  In the case of TCP, the upcalls almost
> always go two levels: IP finds the datagram is for this host &
> it upcalls a TCP demuxer which hashes the ports + SYN to find a
> PCB, lays the packet on the tail of the PCB's queue and wakes up
> any process sleeping on the PCB.  The IP processing is about 25
> instructions & the demuxer is about 10.
> As Dave noted, the two processing paths that need the most
> tuning are the data packet send & receive (since at most every
> other packet is acked, there will be at least twice as many data
> packets as ack packets).  In the new system, the receiving
> process calls 'tcp_usrrecv' (the protocol specific part of the
> 'recv' syscall) or is already blocked there waiting for new
> data.  So the following code is embedded in a loop at the start of
> tcp_usrrecv that spins taking packets off the pcb queue until
> there's no room for the next packet in the user buffer or the
> queue is empty.  The TCP protocol processing is done as we
> remove packets from the queue & copy their data to user space
> (and since we're in process context, it's possible to do a
> checksum-and-copy).
> Throughout this code, 'tp' points to the pcb and 'ti' points to
> the tcp header of the first packet on the queue (the ip header was
> stripped as part of interrupt level ip processing).  The header info
> (excluding the ports which are implicit in the pcb) are sucked out
> of the packet into registers [this is to minimize cache thrashing and
> possibly to take advantage of 64 bit or longer loads].  Then the
> header checksum is computed (tp->ph_sum is the precomputed pseudo-header
> checksum + src & dst ports).
> int tcp_usrrecv(struct uio* uio, struct socket* so)
> {
> 	struct tcpcb *tp = (struct tcpcb *)so->so_pcb;
> 	register struct pbuf* pb;
> 	while ((pb = tp->tp_inq) != 0) {
> 		register int len = pb->len;
> 		struct tcphdr *ti = (struct tcphdr *)pb->dat;
> 		u_long seq = ((u_long*)ti)[1];
> 		u_long ack = ((u_long*)ti)[2];
> 		u_long flg = ((u_long*)ti)[3];
> 		u_long sum = ((u_long*)ti)[4];
> 		u_long cksum = tp->ph_sum;
> 		/* NB - ocadd is an inline gcc assembler function */
> 		cksum = ocadd(ocadd(ocadd(ocadd(cksum, seq), ack), flg), sum);
> Next is the header prediction check which is probably the most
> opaque part of the code.  tp->pred_flags contains snd_wnd (the
> window we expect in incoming packets) in the bottom 16 bits and
> 0x4x10 in the top 16 bits.  The 'x' is normally 0 but will be
> set non-zero if header prediction shouldn't be done (e.g., if
> not in established state, if retransmitting, if hole in seq
> space, etc.).  So, the first term of the 'if' checks four
> different things simultaneously:
>  - that the window is what we expect
>  - that there are no tcp options
>  - that the packet has ACK set & doesn't have SYN, FIN, RST or URG set
>  - that the connection is in the right state
> and the 2nd term of the if checks that the packet is in sequence:
> #define FMASK (((0xf000 | TH_SYN|TH_FIN|TH_RST|TH_URG|TH_ACK) << 16) | 0xffff)
>         if ((flg & FMASK) == tp->pred_flags && seq == tp->rcv_nxt) {
> The next few lines are pretty obvious -- we subtract the header
> length from the total length and if it's less than zero the packet
> was malformed, if it's zero we must have a pure ack packet & we
> do the ack stuff otherwise if the ack field didn't move we have
> a pure data packet which we copy to the user's buffer, checksumming
> as we go, then update the pcb state if everything checks:
>                 len -= 20;
>                 if (len <= 0) {
>                         if (len < 0) {
>                                 /* packet malformed */
>                         } else {
>                                 /* do pure ack things */
>                         }
>                 } else if (ack == tp->snd_una) {
>                         cksum = in_uiomove((u_char*)ti + 20, len, uio, cksum);
>                         if (cksum != 0) {
>                                 /* packet or user buffer errors */
>                         }
>                         seq += len;
>                         tp->rcv_nxt = seq;
>                         if ((int)(seq - tp->rcv_acked) >= 0) {
>                                 /* send ack */
>                         } else {
>                                 /* free pbuf */
>                         }
> 			continue;
> 		}
> 	}
> 	/* header prediction failed -- take long path */
> 	...
> That's it.  On the normal receive data path we execute 16 lines of
> C which turn into 33 instructions on a sparc (it would be 30 if I
> could trick gcc into generating double word loads for the header
> & my carefully aligned pcb fields).  I think you could get it down
> around 25 on a cray or big-endian alpha since the loads, checksum calc
> and most of the compares can be done on 64 bit quantities (e.g.,
> you can combine the seq & ack tests into one).
> Attached is the sparc assembler for the above receive path.  Hope
> this explains Dave's '30 instruction' assertion.  Feel free to
> forward this to tcp-ip or anyone that might be interested.
>  - Van
>  ----------------
> 	ld [%i0+4],%l3			! load packet tcp header fields
> 	ld [%i0+8],%l4
> 	ld [%i0+12],%l2
> 	ld [%i0+16],%l0
> 	ld [%i1+72],%o0			! compute header checksum
> 	addcc %l3,%o0,%o3
> 	addxcc %l4,%o3,%o3
> 	addxcc %l2,%o3,%o3
> 	addxcc %l0,%o3,%o3
> 	sethi %hi(268369920),%o1	! check if hdr. pred possible
> 	andn %l2,%o1,%o1
> 	ld [%i1+60],%o2
> 	cmp %o1,%o2
> 	bne L1
> 	ld [%i1+68],%o0
> 	cmp %l3,%o0
> 	bne L1
> 	addcc %i2,-20,%i2
> 	bne,a L3
> 	ld [%i1+36],%o0
> 	  ! packet error or ack processing
> 	  ...
> L3:
> 	cmp %l4,%o0
> 	bne L1
> 	add %i0,20,%o0
> 	mov %i2,%o1
> 	call _in_uiomove,0
> 	mov %i3,%o2
> 	cmp %o0,0
> 	be L6
> 	add %l3,%i2,%l3
> 	  ! checksum error or user buffer error
> 	  ...
> L6:
> 	ld [%i1+96],%o0
> 	subcc %l3,%o0,%g0
> 	bneg L7
> 	st %l3,[%i1+68]
> 	  ! send ack
> 	  ...
> 	  br L8
> L7:
> 	  ! free pbuf
> 	  ...
> L8:	! done with this packet - continue
> 	...
> L1:	! hdr pred. failed - do it the hard way