Enhancements to Comer's TCP code

List of enhancements to Comer’s TCP code

Vsevolod (Simon) Ilyushchenko
simonf@simonf.com

October 8, 1998

This document describes the changes that were made to the C implementation of TCP in Comer in the course of translating it into Java. Here and below "Comer" refers to the book by D. Comer and D. Stevens, "Internetworking with TCP/IP", v. 2, second edition. Occasionally I mention "Stevens". This is, of course, the other Bible of TCP implementations: W. Richard Stevens, "TCP/IP Illustrated", v. 2.

Recently, I have implemented the TCP protocol in Java. This was a unique chance to create from scratch an object-oriented version of TCP. I am not aware of any publicly available Java or other OO TCP implementation. Since my work was done under a contract, I am not allowed to disclose the code. However, I was allowed to release my ideas on how to make the code in Comer's book better: more efficient and RFC-compliant.

These ideas are NOT purely theoretical; quite the reverse, they were prompted by observing and playing with a working TCP implementation. If you are going to write your own code starting from Comer's version, ignore my notes at your own peril. :)

This discussion is highly technical and is relevant only to those who know both the TCP specification and Comer's code well. I have tried to avoid code segments below. Where they are used, they are neither C nor Java, but a pseudocode resembling both.

I have contacted Mr Comer's secretary, but I have received no response yet. I would be happy to include any comments by the authors of the book.

There is a Xinu bug list page that deals mostly with TCP bugs. I have inserted links to that page in those places where we discuss the same issues.

The proper behaviour of TCP is NOT to respond when it receives a segment with no data and with the next expected sequence number. However, Comer's code WILL respond to such a segment (function tcpackit(), p. 303). Zero window probes usually have either a sequence number one less than expected (Linux) or the expected sequence number and only one byte of data (Windows, BSD).

Function tcpok(), p. 212, strips the data in the incoming segment if there is no place for the data in the local buffer. Then tcpackit() will be called, which does not have any means to decide if it received a one-byte zero window probe or an empty in-sequence segment. So tcpackit() acknowledges anything.

The approach I used was to put the code for clearing the data when the local buffer is full inside the else clause in tcpinp(), p. 206. This is the if-else statement that will call tcpackit() if the call to tcpok() returns false.

Then, inside tcpackit(), I added a check for the incoming segment being empty and in-sequence, that is, if its sequence number is the next expected, plus it has no data, plus it does not carry SYN or FIN flags. If this is true, no acknowledgment should be sent.
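The check described above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class AckFilter, its method names and the parameter rnext (standing for the next expected receive sequence number) are hypothetical, and dataLength is assumed to be the length of the segment as it arrived, before any data is stripped for lack of buffer space.

```java
// Hypothetical sketch; AckFilter and its parameters are not taken from Comer's code.
final class AckFilter {
    static final int FIN = 0x01;   // FIN bit of the TCP header flags
    static final int SYN = 0x02;   // SYN bit of the TCP header flags

    // True if the segment carries the next expected sequence number,
    // has no data, and has neither SYN nor FIN set.
    static boolean isEmptyInSequence(int seq, int dataLength, int flags, int rnext) {
        return seq == rnext && dataLength == 0 && (flags & (SYN | FIN)) == 0;
    }

    // The acknowledgment is suppressed only for empty in-sequence segments.
    static boolean shouldAck(int seq, int dataLength, int flags, int rnext) {
        return !isEmptyInSequence(seq, dataLength, flags, rnext);
    }
}
```

With these definitions a Linux-style probe (sequence number one less than expected) and a BSD-style probe (one byte of data) are both acknowledged, while a bare in-sequence ACK is not.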

In the course of processing an incoming segment in the Established state, more than one immediate-SEND event may be produced. Such events are generated by tcpkick(), which may be called in tcpswindow(), p. 290, tcpdata(), p. 229, and tcpostate(), p. 304. The first two are called directly from tcpestablished(), p. 228, and the third one is called in tcpacked(), p. 301, which is also called in tcpestablished().

The proposed solution is to extend the use of TCBF_NEEDOUT in tcpdata() to the whole period of processing an incoming segment. Instead of calling tcpkick() in each of the cases above, we will just set this flag. When the segment has been processed, we will check this flag and post an immediate SEND event if the flag is set.

Even then, the combination of incoming segment processing and user data sending may lead to posting too many immediate-SEND events where only one would suffice. This can be prevented via another flag called WILLOUT. This flag is set just before tcpkick() is called in every one of the cases above, and while it is set, no other immediate SEND can be posted. It is cleared after the processing of any SEND event in tcpout(), p. 250.

Finally, the last enhancement. It may happen that at the time when we process an incoming segment there are some unsent data, so there is a reason to post a SEND event. But by the time this event gets executed, other events may have already sent all the data. We need to distinguish between two reasons for sending a segment: the first is that we have some data to send, and the second covers the other causes (sending an ACK, for example). Thus, let tcpswindow() and tcpostate() set one flag, called MAYOUT, and let all the other functions that see a cause for output set the NEEDOUT flag. Then, when tcpxmit() finds that there is no data to send, it will check the NEEDOUT flag and send a no-data segment only if that flag is set. The MAYOUT flag alone will not cause a segment to be sent. Of course, both of those flags will be cleared in tcpsend().
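The interplay of the three flags can be illustrated with a small Java sketch. The class OutputFlags and its methods are hypothetical names introduced for this illustration. As a simplification, all three flags are cleared in one place here, whereas the scheme described above clears WILLOUT in tcpout() and the other two in tcpsend().

```java
// Hypothetical sketch of the three output flags; not Comer's code.
final class OutputFlags {
    boolean needOut;        // output is required (for example, an ACK is owed)
    boolean mayOut;         // output is worthwhile only if unsent data remains
    boolean willOut;        // an immediate SEND event has already been posted
    int sendEventsPosted;   // counts posted events, for demonstration only

    void setNeedOut() { needOut = true; }
    void setMayOut() { mayOut = true; }

    // Called once an incoming segment has been fully processed.
    void segmentProcessed() {
        if ((needOut || mayOut) && !willOut) {
            willOut = true;
            sendEventsPosted++;   // stands in for posting one immediate SEND
        }
    }

    // Called when the SEND event runs; returns true if a segment should go out.
    boolean sendEvent(int unsentBytes) {
        boolean send = unsentBytes > 0 || needOut;
        needOut = false;
        mayOut = false;
        willOut = false;
        return send;
    }
}
```

If only MAYOUT was set and the data has meanwhile been sent by another event, sendEvent(0) returns false and no empty segment goes on the wire.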

In the previous note, we talked about calling tcpkick() after processing a segment. Delayed ACKs are dealt with inside this function, and their treatment could be improved. First, not every call to tcpkick() is caused by segment processing, so the delayed ACK processing should be moved out of tcpkick() to the function that sets the NEEDOUT flag. Also, we should delay an ACK only for a full-sized segment, and we should ACK every other segment anyway. This is implemented via a DELACK flag (called TF_DELACK in Stevens). Finally, if we have some out-of-order segments stored in the input buffer, something is messing with the order of segments on the wire, and we would not want to delay the ACK, so that the other side can learn about our situation.

All this leads to the following expression:

if ((new_data_length == MSS) && (!getFlag (DELACK)) && (no out-of-order segments))
{
    //Delay the ACK for a full-sized segment.
    setFlag (DELACK);
    tmset (DELACK_TIME, "SEND");
}
else
    //Otherwise acknowledge without delay.
    setFlag (NEEDOUT);

Consider the case when the sending process is in the IDLE state (in regard to output), and a window update arrives because the receiver on the opposite side has read more data. There is no need to reply to this packet. To provide for this, it is reasonable to put an extra check into tcpswindow(), p. 290, between the check for window shrinking and the check for the PERSIST state. This extra code will just return from the function if the output state is IDLE, preventing scheduling an output event.

Also, in the same place, it makes sense to add a check for the REXMT state to the check for the PERSIST state that is already there. Since we are going to perform output via the tcpxmit() function that will pick up the transmission from the first unacknowledged data byte, that is, exactly where a pending RETRANSMIT would do it, there is no need to process this old RETRANSMIT. The code in this if block should also cancel the RETRANSMIT events in addition to the PERSIST events.
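The decision that tcpswindow() makes after these two changes can be summarised in a short Java sketch. The class WindowUpdate, the enum names and the Action values are hypothetical and only mirror the description above; the four output states correspond to the IDLE, transmit, PERSIST and REXMT states mentioned in the text.

```java
// Hypothetical sketch of the output-state check after a window update.
final class WindowUpdate {
    enum OutputState { IDLE, TRANSMIT, PERSIST, RETRANSMIT }
    enum Action { DO_NOTHING, SCHEDULE_SEND, CANCEL_TIMER_AND_SEND }

    static Action afterWindowUpdate(OutputState state) {
        switch (state) {
            case IDLE:
                // Nothing is waiting to be sent, so no output event is needed.
                return Action.DO_NOTHING;
            case PERSIST:
            case RETRANSMIT:
                // The pending PERSIST or RETRANSMIT event is made redundant
                // by the transmission that is about to start.
                return Action.CANCEL_TIMER_AND_SEND;
            default:
                return Action.SCHEDULE_SEND;
        }
    }
}
```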

As mentioned in note 1 (on empty in-sequence segments), the Xinu implementation of TCP will respond to empty in-sequence segments. The code for zero window probing relies on that. Indeed, in tcpsend(), p. 256, there is no special provision for a zero window probe, since its data length, as returned by tcpsndlen(), will automatically be zero.

In order for zero window probes to work with the changes described in note 1, we need to check specifically in tcpsend() whether swindow is zero and either set datalen to (say) one byte (the BSD way), or decrease the sequence number by one (the Linux way). The complete if clause for this case should look like this:

if (rexmt && swindow == 0 && !getFlag (RDONE | SNDFIN))
    datalen = 1;

Here rexmt should be set to true because in the course of normal transmission from tcpxmit() we should not hit the case of swindow being equal to zero, but this may be an extraneous check. The flags RDONE and SNDFIN are checked to catch the cases when we are about to send or re-send a FIN, in which case it does not make sense to send more data (there aren’t any) or decrease sequence numbers.

When we implement sender-side silly window avoidance (see the next note), we will need an estimate of the maximum send window. This requires another variable associated with a connection, called maxSwindow. Therefore, when we receive a window value from the peer and set the variable swindow (function tcpswindow(), p. 290), we will also update this variable:

if (segment.window > connection.maxSwindow)
    connection.maxSwindow = segment.window;

Function tcpwr(), p. 342, calls tcpkick() after new data has been written only when the connection is idle or the urgent bit is set. Actually, this is an attempt at sender-side silly window avoidance, which should be implemented fully elsewhere (see the next note).

The function tcpsndlen(), p. 259, can be improved in several ways.

Instead of

datalen = total_data_length - offset;
datalen = min (datalen, swindow);

it is more proper to have

datalen = min (total_data_length, swindow) - offset;

(offset is what is denoted by *poff in the C code).

This can be found in Stevens, p. 855.

The reason is that when offset is non-zero, we have already sent "offset" bytes of data into the window (stored in swindow), so the available window at that point is smaller.

This is noted in the Xinu bug list page as bug T.10. If the other end is a slow reader, this bug will cause extremely low throughput.
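A worked example may make the difference clearer. The Java sketch below puts the two formulas side by side; the class SendLength and its method names are hypothetical. Suppose 4000 bytes are buffered, 1500 of them have already been sent but not acknowledged (the offset), and the peer advertises a window of 2000 bytes. The original formula yields 2000 more bytes, which would put 3500 bytes in flight against a 2000-byte window, while the corrected one yields 500.

```java
// Hypothetical side-by-side comparison of the two length calculations.
final class SendLength {
    // Original calculation: the window is applied to the remaining data only.
    static int original(int totalDataLength, int offset, int swindow) {
        return Math.min(totalDataLength - offset, swindow);
    }

    // Corrected calculation: bytes already in flight count against the window.
    static int corrected(int totalDataLength, int offset, int swindow) {
        return Math.min(totalDataLength, swindow) - offset;
    }
}
```

When the offset is zero, the two calculations agree. The corrected value can become negative if the window has shrunk below the offset.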

Also, tcpsndlen() does not safeguard against window shrinking or silly window syndrome. Therefore, we should add the following after all the original calculations:

if (!getFlag (SNDSYN | SNDFIN))
    datalen = max (0, datalen);

This takes care of the case when, in the course of sending several segments, swindow suddenly becomes zero. The data length calculated by the original function would then be negative, and we should limit it to zero. There is only one legitimate case when the data length can be negative, namely, it can be equal to -1. This happens when we have a SYN or FIN pending transmission.

The last addition deals with silly window avoidance, sender-side, and is taken from Stevens, p. 859. The following lines should be added at the very end of the function.

//Always send a whole segment.
if (datalen == smss)
    return datalen;

//If the whole buffer can be sent, while we are idle
//or using no Nagle, do it.
if ((suna == snext || (Nagle algorithm disabled))
    && (datalen + offset == total_data_length))
    return datalen;

//Always send if a retransmission is going on.
if (rexmt)
    return datalen;

//If the receiver's window (ie, our estimate)
//is at least half open, send.
if (datalen >= maxSwindow / 2)
    return datalen;

//No reason to send found.
return 0;

These are the last lines of tcpsndlen().

Note that here we use variable maxSwindow that is absent in Comer’s text. See the previous note.

The Xinu bug list page mentions the poor SWS avoidance in Comer's book as bugs T.12 and T.13.

Function tcphowmuch(), p. 260, could also be improved. Its pseudocode rewrite follows. Note that now it takes a new parameter, rexmt, the same one as in tcpsndlen().

int tcphowmuch(int ptcb, boolean rexmt)
{
    int tosend = suna + total_data_length - snext;

    //SYN and FIN each take one seq number. Count them in.
    int specialFlags = 0;
    if (getFlag (SNDFIN))
        specialFlags++;
    if (getFlag (SNDSYN))
        specialFlags++;

    //If we have got any real data (not SYN/FIN),
    //but SWS prevents us from sending and no retry is going on,
    //pretend we haven't.
    if ((!rexmt) && (specialFlags == 0) && (tcpsndlen(false) == 0))
        tosend = 0;

    tosend += specialFlags;

    return tosend;
}

Since we have added SWS avoidance to tcpsndlen() (see previous note), we have to account for it in this function too. Basically, if tcpsndlen() returns zero, this function should also return zero, otherwise tcpxmit() will have a non-zero data length to send, but on each call to tcpsend() no data will be actually transmitted because of tcpsndlen().

This is reflected in the complex if clause in the middle of the new tcphowmuch(). Actually, there is only one case where we need to pretend that there is no data to send. First, we should not be doing a retransmission (a retransmission is indicated by rexmt set to true). Second, no flags should be pending transmission, since SWS avoidance does not apply to them. And third, tcpsndlen() should return zero when called with its parameter rexmt set to false, which indicates no retransmission.

In tcpxmit(), p. 254, the while loop will work better if rewritten like this:

while (tcphowmuch(ptcb, false) && pending < window)

The false parameter to tcphowmuch() is discussed in the previous note. It tells the function that no retransmission is going on, so silly window avoidance applies. Another change is in the inequality sign. Comer has pending <= window, but if we have already sent a window-ful of data, we do not want to send another segment.

Comer's code simply sets the PSH flag on every segment containing data (see tcpsend(), p. 256). This is not an issue of grave importance, but it is more proper to set PSH only on the last segment in a transmission of several segments. This can be achieved by replacing the simple check

if (datalen > 0)

with

if ((rexmt && datalen > 0) ||
    (!rexmt && datalen > 0 && tcpsndlen (false) == 0))

This says that we will set the PSH flag only in two cases: if we are doing a retransmission with a non-zero amount of data, or if we are sending data in the normal way and the next call to tcpsndlen() will yield zero, that is, the current segment is the last one to be sent.

It may happen that we are sending too many segments, and ACKs have already arrived while we are doing it. So it might seem that putting a signal(); wait(); pair inside the sending loop in tcpxmit() will make matters better. However, tests have shown that this does not increase throughput in all cases: of three typical scenarios (reading only, writing only, and both), it made the performance worse in one, better in another, and did not influence it in the third. Besides, the difference is only about 10%.

In tcppersist(), p. 253, Comer calls tcpsend() with the second parameter of TSF_REXMT, that is, non-zero. This parameter, when zero, causes tcpsend() to form a new segment with the seq number of snext (the first unsent byte), and when it is non-zero, the seq number will be suna (the first unacked byte). In my opinion, it makes more sense to start from the unacked data when sending the persist probe. Furthermore, this is what BSD does (Stevens, p. 855). When sending a persist probe, snext is set equal to suna.

When an active close happens (the user closes a connection), we should flush the incoming data that have not been read yet. This will, among other things, expand the window so that we can accept a FIN from the other side. Cf. Stevens, p. 1020: "Any pending data in the receive buffer is discarded by sbflush, since the process has closed the socket." To achieve the same effect, Comer's implementation has to flush the input buffer somewhere inside the tcpclose() function, for example, before changing the connection state. The flushing procedure consists of just setting the rbcount variable of a TCB to zero.
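The effect of the flush on the advertised window can be shown with a minimal Java sketch. It assumes, as a simplification, that the window is just the buffer size minus the number of unread bytes, with no receiver-side silly window avoidance. The class ReceiveBuffer and the field rbsize are hypothetical names; only rbcount comes from the text above.

```java
// Hypothetical model of a receive buffer; only rbcount is named in the text.
final class ReceiveBuffer {
    private final int rbsize;   // total buffer capacity (assumed name)
    private int rbcount;        // bytes received but not yet read by the user

    ReceiveBuffer(int rbsize, int rbcount) {
        this.rbsize = rbsize;
        this.rbcount = rbcount;
    }

    // Simplified advertised window: free space in the buffer.
    int window() {
        return rbsize - rbcount;
    }

    // On an active close the unread data is discarded.
    void flushOnClose() {
        rbcount = 0;
    }
}
```

A buffer that was full, and therefore advertised a zero window, advertises its whole size again after flushOnClose(), which leaves room for the peer's FIN.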

In the tcpsynsent() function, p. 239, there should be an additional check in the if clause that checks for the RST bit. The proper behavior is specified in RFC 793:

If the RST bit is set:

If the ACK was acceptable then signal the user "error: connection reset", drop the segment, enter CLOSED state, delete TCB, and return. Otherwise (no ACK) drop the segment and return.
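The quoted rule can be sketched in Java as follows. The class SynSentReset and its names are hypothetical. The acceptability test reflects my reading of RFC 793 for the SYN-SENT state, namely that the ACK number must be greater than the initial send sequence number (iss) and not greater than snext, compared in wraparound-safe sequence arithmetic.

```java
// Hypothetical sketch of RST handling in the SYN-SENT state.
final class SynSentReset {
    enum Result { DROP_SEGMENT, CONNECTION_RESET }

    // True if iss < ack <= snext in 32-bit sequence space.
    // Subtraction wraps around, so this stays correct across the 2^32 boundary.
    static boolean ackAcceptable(int ack, int iss, int snext) {
        return (ack - iss) > 0 && (ack - snext) <= 0;
    }

    // Decides what to do with a segment that has the RST bit set.
    static Result onRst(boolean ackFlag, int ack, int iss, int snext) {
        if (ackFlag && ackAcceptable(ack, iss, snext)) {
            // Signal "connection reset" to the user and delete the TCB.
            return Result.CONNECTION_RESET;
        }
        // No ACK, or an unacceptable one: silently drop the segment.
        return Result.DROP_SEGMENT;
    }
}
```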

In the function above, it is not clear why Comer clears the FIN bit in the segment. The code is capable of handling the SYN and FIN flags at the same time. To do this, we must put the following after changing the state to Established:

if (getFlag (RDONE))
    state = Closewait;

The function tcpacked(), p. 301, has a check for the ACK number of an incoming segment being greater than snext in the Synrcvd state. Since this is a very rare occasion, it makes sense to move this check into tcpsynrcvd() to save clock cycles in all other states.

According to Stevens, p. 930, we should not send a RST in response to an incoming segment when the connection's state is CLOSED. The call to tcpreset() should therefore be removed from tcpclosed(), p. 218.

Function tcptimewait(), p. 200, does not agree with RFC 1337, which describes the "Time-Wait Assassination" that may occur if we respond to RST segments in Time-Wait. Therefore, if an incoming segment has the RST bit set, the function should just return.

Also, after receiving a FIN, there should be no incoming data processing (RFC 793, p. 75). Function tcpdata() should not be called from tcptimewait().

As a minor performance optimisation, when calling tmclear() to remove an event in the timer's queue, we can refrain from calling the MKEVENT macro that will create another event. We can just provide the connection and the event id and look for events that have the same parameters.
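A minimal Java sketch of such a lookup is given below. The class TimerQueue, its Entry type and the method names are hypothetical; the sketch keeps the timer entries in a plain list and ignores the delta-time bookkeeping of the real timer queue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical timer queue that is searched by connection and event id.
final class TimerQueue {
    private static final class Entry {
        final int connection;
        final String eventId;
        final int delay;

        Entry(int connection, String eventId, int delay) {
            this.connection = connection;
            this.eventId = eventId;
            this.delay = delay;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void set(int connection, String eventId, int delay) {
        entries.add(new Entry(connection, eventId, delay));
    }

    // Removes matching timers without building a throwaway event object.
    // Returns true if at least one entry was removed.
    boolean clear(int connection, String eventId) {
        return entries.removeIf(
            e -> e.connection == connection && e.eventId.equals(eventId));
    }

    int size() {
        return entries.size();
    }
}
```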

Another related note: in tcpkick() we can also skip creating a new event that will be immediately posted in the output event queue. Instead, we can have a special "zero time event" object (or structure) with the name "SEND" that will be kept in a TCB. When the need comes, we can just post this event to the output queue. Keep in mind that we proposed moving the delayed ACK processing away from tcpkick() in note 3 (on delayed ACKs).

RFC 1122 says that if the TCP is processing a series of queued segments, it must process them all before sending any ACK segments. Function tcpinp(), p. 206, can be modified to do this. I am not giving the code here; it is straightforward but ugly. The idea is to save the pointer to the connection that receives a segment and not to signal its semaphore until the next incoming segment is obtained. If this next segment should be passed to the same connection, the lock on its semaphore is still held. Only if the segment queue is empty or the next segment is aimed at a different connection can we release the lock. In the latter case, we need to wait on the semaphore for the other connection and pass the segment to it, after which the whole thing repeats.
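A much simplified Java sketch of the locking idea follows. It is not the modification of tcpinp() itself: the classes SegmentBatcher, Connection and Segment are hypothetical, a ReentrantLock stands in for the connection's semaphore, the queue is drained without blocking, and incrementing a counter stands in for the real segment processing.

```java
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: consecutive segments for one connection share one lock hold.
final class SegmentBatcher {
    static final class Connection {
        final ReentrantLock lock = new ReentrantLock();
        int segmentsProcessed;
        int lockAcquisitions;
    }

    static final class Segment {
        final Connection target;

        Segment(Connection target) {
            this.target = target;
        }
    }

    // Processes every queued segment, switching locks only when the
    // target connection changes or the queue runs empty.
    static void drain(Deque<Segment> queue) {
        Connection held = null;
        Segment segment;
        while ((segment = queue.poll()) != null) {
            if (segment.target != held) {
                if (held != null) {
                    held.lock.unlock();
                }
                held = segment.target;
                held.lock.lock();
                held.lockAcquisitions++;
            }
            held.segmentsProcessed++;   // stand-in for real segment handling
        }
        if (held != null) {
            held.lock.unlock();
        }
    }
}
```

For a queue holding segments for connections A, A, B, A in that order, A is locked twice and B once, instead of four separate acquisitions.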

Same thing with events – in certain cases it may increase performance to deliver all the events to one connection without releasing the lock.

Function tcptimer(), p. 272, can be enhanced in the following way. Note that at the very top of p. 273 there is a check for

delta > TIMERGRAN * 100.

This condition can hold if delta, equal to now - lastrun, becomes too big, that is, when lastrun was set a long time ago. This, in turn, can happen if the timer process was suspended for too long. To avoid that, we can set lastrun just after the suspend call:

if (tqhead == 0)
{
    suspend (tqpid);
    lastrun = ctr100;
}