-
The proper behaviour of TCP is NOT to respond when it receives
a segment with no data and with the next expected sequence number. However,
Comer’s code WILL respond to such a segment (function tcpackit(), p. 303).
Zero window probes usually have the sequence number one less than expected
(Linux) or the expected sequence number but only one byte of data (Windows,
BSD).
Function tcpok(), p. 212, strips the data in the incoming
segment if there is no place for the data in the local buffer. Then tcpackit()
will be called, which does not have any means to decide if it received
a one-byte zero window probe or an empty in-sequence segment. So tcpackit()
acknowledges anything.
The approach I used was to put the code for clearing the
data when the local buffer is full inside the else clause in tcpinp(),
p. 206. This is the if-else statement that will call tcpackit()
if the call to tcpok() returns false.
Then, inside tcpackit(), I added a check for the incoming
segment being empty and in-sequence, that is, its sequence number is
the next expected one, it carries no data, and it has neither the SYN nor
the FIN flag set. If this is true, no acknowledgment should be sent.
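The check added to tcpackit() can be sketched roughly as follows. This is a minimal illustration, not Comer's code: the struct seg_info type, its field names, and the function name are hypothetical stand-ins for the real segment and TCB structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of an incoming segment; these field
 * names are illustrative and are not taken from Comer's structures. */
struct seg_info {
    uint32_t seq;      /* sequence number of the segment   */
    uint32_t datalen;  /* payload bytes carried by segment */
    bool     syn;      /* SYN flag set                     */
    bool     fin;      /* FIN flag set                     */
};

/* Returns true when the segment is empty and in sequence, that is,
 * when no acknowledgment should be sent for it. rnext is the next
 * expected sequence number. */
static bool is_empty_in_sequence(const struct seg_info *s, uint32_t rnext)
{
    return s->seq == rnext && s->datalen == 0 && !s->syn && !s->fin;
}
```

With this test, a Linux-style probe (sequence number one less than expected) and a BSD-style probe (one byte of data) are both still acknowledged, while a plain in-sequence ACK is not.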
-
In the course of processing an incoming segment in the Established
state, more than one immediate-SEND event may be produced. Such events
are generated by tcpkick(), which may be called in tcpswindow(), p. 290,
tcpdata(), p. 229 and tcpostate(), p. 304. The first two are called directly
from tcpestablished(), p. 228, and the third one is called in tcpacked(),
p. 301, which is also called in tcpestablished().
The proposed solution is to extend the use of TCBF_NEEDOUT
in tcpdata() to the whole period of processing an incoming segment. Instead
of calling tcpkick() in each of the cases above, we will just set this
flag. When the segment has been processed, we will check this flag and post an
immediate SEND event if the flag is set.
Even then, the combination of incoming segment processing
and user data sending may lead to posting too many immediate-SEND events
where only one would suffice. This can be prevented with another
flag called WILLOUT. This flag is set just before tcpkick() is called in
any of the cases above, and while it is set, no other immediate-SEND
can be posted. It is cleared after the processing of any SEND event in
tcpout(), p. 250.
Finally, the last enhancement. It may happen that at the
time when we process an incoming segment there are some unsent data, so
there is a reason to post a SEND event. But by the time this event gets
executed, other events may have already sent all the data. We need to distinguish
between two reasons for sending a segment: the first is that we have
some data to send, and the second covers all other causes (sending
an ack, for example). Thus, let tcpswindow() and tcpostate() set one flag,
called MAYOUT, and all the other functions that see a cause for output
set the NEEDOUT flag. Then, when in tcpxmit() we see the case that there
is no data to send, we will check the NEEDOUT flag. Only when it is set,
we will send a no-data segment. The MAYOUT flag will not cause a segment
to be sent. Of course, both of those flags will be cleared in tcpsend().
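The interplay of the three flags can be illustrated with a small model. This is only a sketch under simplifying assumptions: struct conn, the F_ constants, and both function names are hypothetical, and the chain tcpout(), tcpxmit(), tcpsend() is collapsed into a single function.

```c
/* Hypothetical flag bits; the names mirror NEEDOUT, MAYOUT, WILLOUT. */
enum {
    F_NEEDOUT = 1 << 0,  /* a segment must go out, even without data */
    F_MAYOUT  = 1 << 1,  /* output is worthwhile only if data remains */
    F_WILLOUT = 1 << 2   /* an immediate SEND event is already posted */
};

/* Hypothetical, much reduced connection state. */
struct conn {
    unsigned flags;
    int posted;         /* immediate SEND events posted so far */
    int unsent;         /* bytes of unsent data                */
    int segments_sent;  /* segments actually transmitted       */
};

/* Called once after an incoming segment has been fully processed:
 * posts at most one immediate SEND event. */
static void finish_segment(struct conn *c)
{
    if ((c->flags & (F_NEEDOUT | F_MAYOUT)) && !(c->flags & F_WILLOUT)) {
        c->flags |= F_WILLOUT;
        c->posted++;
    }
}

/* Stand-in for handling the SEND event. */
static void handle_send(struct conn *c)
{
    c->flags &= ~F_WILLOUT;
    if (c->unsent > 0) {
        c->segments_sent++;           /* data segment */
        c->unsent = 0;
    } else if (c->flags & F_NEEDOUT) {
        c->segments_sent++;           /* no-data segment, e.g. an ACK */
    }
    c->flags &= ~(F_NEEDOUT | F_MAYOUT);
}
```

In this model a MAYOUT-only event that finds no unsent data sends nothing, which is the behaviour described above.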
-
In the previous note, we talked about calling tcpkick() after
processing a segment. Delayed acks are dealt with inside this function,
and their treatment could be improved. First, not every call to tcpkick()
is caused by segment processing, so the delayed acks processing should
be moved out of tcpkick() to the function that will set the NEEDOUT flag.
Also, we should delay an ACK only for a full-sized segment, and even then we should
ack at least every second segment. This is implemented via a DELACK flag (called
TF_DELACK in Stevens). Finally, if we have some out-of-order segments stored
in the input buffer, something is disturbing the order of segments on
the wire, and we do not want to delay the ACK, so that the other side can
learn about our situation promptly.
All this leads to the following expression:
if ((new_data_length == MSS) && (!getFlag (DELACK)) && (no out-of-order segments))
{
    setFlag (DELACK);
    tmset (DELACK_TIME, "SEND");
}
else
    setFlag (NEEDOUT);
-
Consider the case when the sending process is in the IDLE
state (in regard to output), and a window update arrives because the receiver
on the opposite side has read more data. There is no need to reply to this
packet. To provide for this, it is reasonable to put an extra check into
tcpswindow(), p. 290, between the check for window shrinking and the check
for the PERSIST state. This extra code will just return from the function
if the output state is IDLE, preventing scheduling an output event.
Also, in the same place, it makes sense to add a check
for the REXMT state to the check for the PERSIST state that is already
there. Since we are going to perform output via the tcpxmit() function
that will pick up the transmission from the first unacknowledged data byte,
that is, exactly where a pending RETRANSMIT would do it, there is no need
to process this old RETRANSMIT. The code in this if block should
also cancel the RETRANSMIT events in addition to the PERSIST events.
-
As mentioned in note 1, the Xinu implementation of TCP will
respond to empty in-sequence segments. The code for zero window probing
relies on that. Indeed, in tcpsend(), p. 256, there is no special provision
for a zero window probe, since its data length will automatically be zero,
which is returned by tcpsndlen().
In order for zero window probes to work with the changes
described in note 1, we need to specifically check in tcpsend() for swindow's
equality to zero and either set datalength to (say) one byte (BSD way),
or decrease the sequence number by one (Linux way). The complete if
clause for this case should look like this:
if (rexmt && swindow == 0 && !getFlag (RDONE | SNDFIN))
    datalen = 1;
Here rexmt is required to be true because
in the course of normal transmission from tcpxmit() we should not hit the
case of swindow being equal to zero; this check may, however, be
redundant. The flags RDONE and SNDFIN are checked to catch the cases when we
are about to send or re-send a FIN, in which case it does not make sense
to send more data (there isn’t any) or to decrease the sequence number.
-
When we implement sender-side silly window avoidance
(see the note on tcpsndlen() below), we will need an estimate of the maximum send window. This
requires another variable associated with a connection, called maxSwindow.
Therefore, when we receive a window value from the peer and set the variable
swindow (function tcpswindow(), p. 290), we will also update this variable:
if (segment.window > connection.maxSwindow)
    connection.maxSwindow = segment.window;
-
Function tcpwr(), p. 342, calls tcpkick() after new data has been written
only when the connection is idle or the urgent
bit is set. This is effectively an attempt at sender-side silly window avoidance,
which should be implemented fully elsewhere (see the next note).
-
The function tcpsndlen(), p. 259, can be improved in several
ways.
Instead of
datalen = total_data_length - offset;
datalen = min (datalen, swindow);
it is more proper to have
datalen = min (total_data_length, swindow) - offset;
(offset is what is denoted by *poff in the C code).
This can be found in Stevens, p. 855.
The reason is that when offset is non-zero, we
have already sent "offset" bytes of data into the window (stored in swindow),
so the available window at that point is smaller.
This is noted in the Xinu
bug list page as bug T.10. If the other end is a slow reader, this
bug will cause extremely low throughput.
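A small worked example makes the difference concrete. The two helper functions below are hypothetical and only mirror the two formulas above. Assume 3000 bytes in the send buffer, a 2000-byte send window, and 1500 bytes already sent (offset).

```c
static int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Original calculation: the window is applied to the remaining data
 * only, ignoring the bytes already in flight. */
static int datalen_original(int total, int swindow, int offset)
{
    int datalen = total - offset;
    return min_int(datalen, swindow);
}

/* Corrected calculation: the window is measured from the first
 * unacknowledged byte, so bytes already sent count against it. */
static int datalen_corrected(int total, int swindow, int offset)
{
    return min_int(total, swindow) - offset;
}
```

For these numbers the original formula allows 1500 more bytes, which would put 3000 bytes in flight against a 2000-byte window, while the corrected formula allows only the remaining 500 bytes. With offset equal to zero both formulas give the same result.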
Also, tcpsndlen() does not safeguard against window shrinking
or silly window syndrome. Therefore, after all the original
calculations, we should add the following:
if (!getFlag (SNDSYN | SNDFIN))
    datalen = max (0, datalen);
This takes care of the case when, in the course of sending
several segments, swindow suddenly becomes zero. The
data length calculated by the original function would then be negative, and we should
limit it to zero. There is only one legitimate case when the data length
can be negative, namely, when it is equal to -1. This happens when we have
a SYN or FIN pending transmission.
The last addition deals with silly window avoidance, sender-side,
and is taken from Stevens, p. 859. The following lines should be added
at the very end of the function.
//Always send a whole segment.
if (datalen == smss)
    return datalen;
//If the whole buffer can be sent, while we are idle
//or using no Nagle, do it.
if ((suna == snext || (Nagle option disabled))
    && (datalen + offset == total_data_length))
    return datalen;
//Always send if a retransmission is going on.
if (rexmt)
    return datalen;
//If the receiver's window (ie, our estimate)
//is at least half open, send.
if (datalen >= maxSwindow / 2)
    return datalen;
//No reason to send found.
return 0;
These are the last lines of tcpsndlen().
Note that here we use the variable maxSwindow, which is absent
in Comer’s text. See the note on maxSwindow above.
The Xinu bug
list page mentions the poor SWS avoidance in Comer's book as bugs
T.12 and T.13.
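For experimenting with these rules outside the TCP code, the same decision can be written as a self-contained C function. This is only a sketch: struct sws_in and sws_filter() are hypothetical names, and the inputs that tcpsndlen() would read from the TCB are passed in explicitly.

```c
#include <stdbool.h>

/* Hypothetical bundle of the values the SWS check needs. */
struct sws_in {
    int  datalen;      /* candidate length computed earlier      */
    int  offset;       /* bytes already sent but not yet acked   */
    int  total;        /* bytes currently in the send buffer     */
    int  smss;         /* sender maximum segment size            */
    int  max_swindow;  /* largest send window seen from the peer */
    bool idle;         /* true when suna == snext                */
    bool nagle_off;    /* true when the Nagle algorithm is off   */
    bool rexmt;        /* true during a retransmission           */
};

/* Returns the length to send, or 0 when sender-side silly window
 * avoidance says the data should be held back for now. */
static int sws_filter(const struct sws_in *in)
{
    if (in->datalen == in->smss)
        return in->datalen;               /* full-sized segment */
    if ((in->idle || in->nagle_off) &&
        in->datalen + in->offset == in->total)
        return in->datalen;               /* empties the buffer */
    if (in->rexmt)
        return in->datalen;               /* retransmission */
    if (in->datalen >= in->max_swindow / 2)
        return in->datalen;               /* window at least half open */
    return 0;                             /* no reason to send found */
}
```

For example, a 100-byte tail is held back while earlier data is still unacknowledged and Nagle is on, but is sent at once when the connection is idle.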
-
Function tcphowmuch(), p. 260, could also be improved. Its
pseudocode rewrite follows. Note that now it takes a new parameter, rexmt,
the same one as in tcpsndlen().
int tcphowmuch(int ptcb, boolean rexmt)
{
    int tosend = suna + total_data_length - snext;
    //SYN and FIN each take one seq number. Count them in.
    int specialFlags = 0;
    if (getFlag (SNDFIN))
        specialFlags++;
    if (getFlag (SNDSYN))
        specialFlags++;
    //If we have got any real data (not SYN/FIN),
    //but SWS prevents us from sending and no retry is going on,
    //pretend we haven't.
    if ((!rexmt) && (specialFlags == 0) && (tcpsndlen(false) == 0))
        tosend = 0;
    tosend += specialFlags;
    return tosend;
}
Since we have added SWS avoidance to tcpsndlen() (see
previous note), we have to account for it in this function too. Basically,
if tcpsndlen() returns zero, this function should also return zero, otherwise
tcpxmit() will have a non-zero data length to send, but on each call to
tcpsend() no data will be actually transmitted because of tcpsndlen().
This is reflected in the complex if clause in
the middle of the new tcphowmuch(). There is only one case where
we need to pretend that there is no data to send, and it requires three
conditions. First, we should not be doing a retransmission, that is, rexmt must be false.
Second, no flags should be pending transmission, since SWS avoidance does not
apply to them. And third, tcpsndlen() should return zero when called with
its parameter rexmt set to false, which indicates no retransmission.
-
In tcpxmit(), p. 254, the while loop will work better if
rewritten like this:
while (tcphowmuch(ptcb, false) && pending < window)
The false parameter to tcphowmuch() is discussed
in the previous note. It tells the function that no retransmission is in
progress, so silly window avoidance applies. Another change is in the inequality sign. Comer has pending
<= window, but if we have already sent a window-ful of data, we
do not want to send another segment.
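The effect of the inequality sign can be seen in a toy model of the loop. The function below is hypothetical and assumes, for illustration only, that every pass through the loop sends exactly one mss-sized segment and that there is always more data queued.

```c
#include <stdbool.h>

/* Counts how many segments the loop would send before stopping.
 * strict selects "pending < window"; otherwise "pending <= window". */
static int segments_sent(int window, int mss, bool strict)
{
    int pending = 0;  /* bytes sent but not yet acknowledged */
    int count = 0;

    while (strict ? pending < window : pending <= window) {
        pending += mss;  /* stand-in for one call to tcpsend() */
        count++;
    }
    return count;
}
```

With a 2000-byte window and a 1000-byte segment size, the <= form makes three passes although the window is full after two, while the < form stops after two.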
-
Comer’s code simply sets the PSH flag on every segment containing
data (see tcpsend(), p. 256). This is not an issue of grave importance,
but it is more proper to set PSH only on the last segment in a transmission
of several segments. This can be achieved by replacing the simple check
if (datalen > 0)
with
if ((rexmt && datalen > 0) ||
    (!rexmt && datalen > 0 && tcpsndlen (false) == 0))
This says that we will set the PSH flag only in two cases
– if we are doing retransmission with non-zero amount of data, or if we
are sending data in a normal way, and the next call to tcpsndlen() will
yield zero, that is, the current segment is the last one to be sent.
-
It may happen that we are sending many segments in a row, and
ACKs have already arrived while we are doing it. So it might seem that
putting a signal();wait(); pair inside the sending loop of tcpxmit() will make matters
better. However, tests have shown that this does not increase throughput
in all cases: for three typical scenarios (reading only, writing only,
and both) it made the performance worse in one of them, made it better
in another and did not influence it in the third. Besides, the difference
is only about 10%.
-
In tcppersist(), p. 253, Comer calls tcpsend() with the second
parameter of TSF_REXMT, that is, non-zero. This parameter, when zero, causes
tcpsend() to form a new segment with the seq number of snext (the
first unsent byte), and when it is non-zero, the seq number will be suna
(the first unacked byte). In my opinion, it makes more sense to start from
the unacked data when sending the persist probe. Furthermore, this is what
BSD does (Stevens, p. 855). When sending a persist probe, snext
is set equal to suna.
-
When an active close happens (the user closes a connection),
we should flush the incoming data that have not been read yet. This will,
among other things, expand the window so that we can accept a FIN from
the other side. Cf Stevens, p. 1020: "Any pending data in the receive buffer
is discarded by sbflush, since the process has closed the socket."
To achieve the same effect, Comer’s implementation has to flush the
input buffer somewhere inside the tcpclose() function, for example, before
changing the connection state. The flushing procedure consists of just
setting the rbcount variable of a TCB to zero.
-
In the tcpsynsent() function, p. 239, there should be an additional
check in the if clause that checks for the RST bit. The proper behavior
is specified in RFC 793:
If the RST bit is set:
If the ACK was acceptable then signal the user "error:
connection reset", drop the segment, enter CLOSED state, delete TCB, and
return. Otherwise (no ACK) drop the segment and return.
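The quoted rule could be sketched as below. This is an illustration only, assuming the RFC 793 acceptability test for SYN-SENT (ISS < SEG.ACK <= SND.NXT); the enum and function names are hypothetical, and the casts assume the usual two's complement conversion for sequence-space comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sequence-space comparisons modulo 2^32. */
static bool seq_lt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

static bool seq_leq(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) <= 0;
}

enum rst_action {
    RST_DROP,   /* silently drop the segment                       */
    RST_ABORT   /* report "connection reset", close and delete TCB */
};

/* Decides what to do with a segment carrying RST in SYN-SENT.
 * has_ack: the ACK bit is set; seg_ack: its acknowledgment number;
 * iss: our initial send sequence number; snext: next to be sent. */
static enum rst_action synsent_on_rst(bool has_ack, uint32_t seg_ack,
                                      uint32_t iss, uint32_t snext)
{
    bool acceptable = has_ack && seq_lt(iss, seg_ack) &&
                      seq_leq(seg_ack, snext);
    return acceptable ? RST_ABORT : RST_DROP;
}
```

Only a reset that acknowledges our SYN tears the connection down; any other reset is ignored.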
-
In the function above, it is not clear why Comer clears the
FIN bit in the segment. The code is capable of handling the SYN and FIN
flags at the same time. To do this, we must put the following after changing
the state to Established:
if (getFlag (RDONE))
state = Closewait;
-
The function tcpacked(), p. 301, has a check for the ack
number of an incoming segment being greater than snext in the Synrcvd
state. Since this is a very rare occasion, it makes sense to move this
check into tcpsynrcvd() to save clock cycles in all other states.
-
According to Stevens, p. 930, we should not send a RST in
response to an incoming segment when the connection's state is CLOSED. The
call to tcpreset() should therefore be removed from tcpclosed(), p. 218.
-
Function tcptimewait(), p. 200, does not agree with RFC 1337,
which describes the "Time-Wait Assassination" that may occur if we respond
to RST segments in Time-Wait. Therefore, if an incoming segment has the
RST bit set, the function should just return.
Also, after receiving a FIN, there should be no incoming
data processing (RFC 793, p. 75). Function tcpdata() should not be called
from tcptimewait().
-
As a minor performance optimisation, when calling tmclear()
to remove an event in the timer's queue, we can refrain from calling the
MKEVENT macro that will create another event. We can just provide the connection
and the event id and look for events that have the same parameters.
-
Another related note: in tcpkick() we can also skip creating
a new event that will be immediately posted in the output event queue.
Instead, we can have a special "zero time event" object (or structure)
with the name "SEND" that will be kept in a TCB. When the need comes, we
can just post this event to the output queue. Keep in mind that we proposed
moving the delayed acks processing away from tcpkick() in note 3.
-
RFC 1122 says that if the TCP is processing a series
of queued segments, it must process them all before sending any ACK segments.
Function tcpinp(), p. 206, can be modified to do this. I am not giving
the code here, as it is straightforward but ugly. The idea is to save the pointer
to the connection that receives a segment and not to signal its semaphore
until the next incoming segment is obtained. If this next segment should
be passed to the same connection, the lock on its semaphore is still held.
If the segment queue is empty or if the next segment is aimed at a different
connection, only then we can release the lock. In the latter case, we need
to wait on the semaphore for another connection and pass the segment to
it, after which the whole thing repeats.
Same thing with events – in certain cases it may increase
performance to deliver all the events to one connection without releasing
the lock.
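The locking pattern alone can be illustrated without the real tcpinp() code. The sketch below is hypothetical throughout: segments are represented only by the index of the connection they belong to, and the semaphore operations are replaced by counters.

```c
#include <stddef.h>

/* Counts stand-in lock operations instead of calling wait()/signal(). */
struct lock_stats {
    int acquires;  /* would be wait() on a connection's semaphore   */
    int releases;  /* would be signal() on a connection's semaphore */
};

/* conn_of[i] is the connection that queued segment i is destined for.
 * The lock of a connection is kept while consecutive segments belong
 * to it and released only when the next segment goes elsewhere or
 * the queue runs empty. */
static void deliver_batched(const int *conn_of, size_t n,
                            struct lock_stats *st)
{
    int held = -1;  /* connection whose lock is held, -1 for none */

    for (size_t i = 0; i < n; i++) {
        if (conn_of[i] != held) {
            if (held != -1)
                st->releases++;  /* previous connection may now ACK */
            st->acquires++;      /* lock the new connection */
            held = conn_of[i];
        }
        /* segment i would be processed here for connection 'held' */
    }
    if (held != -1)
        st->releases++;          /* queue empty: release the lock */
}
```

For a queue of five segments destined for connections 1, 1, 1, 2, 1, this takes and releases a lock three times instead of five.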
-
Function tcptimer(), p. 272, can be enhanced in the following
way. Note that at the very top of p. 273 there is a check for
delta > TIMERGRAN * 100.
This condition can be hit if delta, equal to now - lastrun,
becomes too big, that is, when lastrun was set a long time ago.
This, in turn, can happen if the timer process was suspended for too
long. To avoid that, we can set lastrun just after the suspend
call:
if (tqhead == 0)
{
    suspend (tqpid);
    lastrun = ctr100;
}