The network and its underlying performance are fundamental to every application that relies on its services. As such, this subject has become the cornerstone of many strategies to optimize network behaviour. In this small series of blog posts I look at those strategies and their implementation. Fundamental to almost all of them is an understanding of how the underlying TCP (Transmission Control Protocol) algorithm works.
Post 1 Understanding the TCP Transmission Algorithm
TCP is a reliable transport protocol. That is, the TCP algorithm guarantees that data sent between two points in the network will be successfully delivered. If any data is lost along the way, the algorithm will retransmit it to ensure complete delivery of the required data.
Fundamental to this reliable flow of data is the use of positive acknowledgement from the receiver that data has been delivered; failing that, the data is retransmitted. The possibility that data must be retransmitted imposes buffering requirements on both sender and receiver: the sender must buffer data while it waits for positive acknowledgement from the receiver, and the receiver must at the very least buffer data until it has acknowledged receipt to the sender and passed it up the network stack for further processing.
Since buffers are limited by finite physical hardware resources, flow control mechanisms must be introduced for both parties. For the sender there is the concept of the 'congestion window', and for the receiver there is the 'TCP window'.
Sender's Congestion Window
The sender starts by transmitting one segment and waiting for its ACK. When that ACK is received, the congestion window is incremented from one to two, and two segments can then be sent. When each of those two segments is acknowledged, the congestion window is increased to four, and so on. This 'slow start' behaviour provides exponential growth in the speed at which data can be transmitted. However, it is inevitable that this growth must at some point be throttled.
A number of factors can throttle this growth. The first is network unreliability: if a packet is not acknowledged in a timely manner, the TCP algorithm assumes there is network congestion, 'backs off', and starts transmitting again from the initial slower speed. The second is the receiver's window size.
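To make that growth-and-backoff behaviour concrete, here is a minimal sketch. It is purely illustrative and not a faithful TCP implementation: it assumes the window simply doubles each round trip and collapses to one segment on an assumed timeout, ignoring ssthresh, congestion avoidance and fast retransmit.

# Illustrative sketch of slow-start growth: the congestion window doubles each
# round trip while every segment is ACKed, and collapses back to one segment
# when a loss is assumed.
def simulate_slow_start(rounds, loss_round=None):
    cwnd = 1                      # congestion window, in segments
    history = []
    for rtt in range(1, rounds + 1):
        history.append((rtt, cwnd))
        if rtt == loss_round:
            cwnd = 1              # assumed congestion: back off to the initial rate
        else:
            cwnd *= 2             # all segments ACKed: window doubles per round trip
    return history

for rtt, cwnd in simulate_slow_start(8, loss_round=5):
    print(f"RTT {rtt}: can send {cwnd} segment(s)")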
Receiver's TCP Window
The window size represents the maximum number of unacknowledged bytes that can be outstanding at any time. It is negotiated at TCP session establishment and prevents the receiver being overrun with data it cannot process in time. With each ACK it sends, the receiver informs the sender, via the window size field in the TCP header, of the amount of buffer space it has left.
So the TCP transmission rate is controlled by both the sender's congestion window and the receiver's TCP window. The actual rate of transmission will grow until it is limited by whichever of the two windows is smaller.
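As a rough back-of-the-envelope sketch of that statement (ignoring slow start, loss and header overheads, and using made-up window and RTT figures), the achievable throughput is roughly the smaller of the two windows drained once per round trip:

def approx_throughput_bps(cwnd_bytes, rwnd_bytes, rtt_seconds):
    # Rough upper bound on TCP throughput: the smaller window sent once per RTT.
    window = min(cwnd_bytes, rwnd_bytes)
    return window * 8 / rtt_seconds   # bits per second

# e.g. a 64 KB receive window on a 50 ms path caps out at roughly 10 Mbps,
# even if the congestion window has grown much larger.
print(approx_throughput_bps(cwnd_bytes=1_000_000, rwnd_bytes=65_536, rtt_seconds=0.050))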
Now, with an understanding of the protocol mechanisms at play within TCP, I look at the factors that come into play when considering the performance of the network, and what can be done to speed things up…
Post 2 Propagation, Processing and Serialisation
These three factors will almost always have a bearing on the performance of the network.
Propagation
The time taken for bits to travel across the wire. In a copper or fibre medium this is limited by the speed of light. Without delving into the speed = distance/time formula, a rule of thumb of 1 millisecond per 100 miles is the best-case scenario.
Processing
Any network equipment in the transmission
path will add a small incremental amount of processing delay to the overall
transmission time.
Serialisation
The time it takes to get the bits onto the
wire. This is influenced by the transmission speed of the line.
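To make the three components concrete, here is a small sketch that estimates each one for a hypothetical link; the distance, per-hop delay and line rate below are illustrative assumptions rather than measurements.

SPEED_RULE_OF_THUMB_MS_PER_100_MILES = 1.0    # best-case rule of thumb from above

def propagation_ms(distance_miles):
    # Propagation delay using the 1 ms per 100 miles rule of thumb.
    return distance_miles / 100 * SPEED_RULE_OF_THUMB_MS_PER_100_MILES

def serialisation_ms(frame_bytes, link_bps):
    # Time to clock one frame onto the wire at the line rate.
    return frame_bytes * 8 / link_bps * 1000

def processing_ms(hops, per_hop_ms=0.05):
    # Small incremental delay added by each device in the path (assumed figure).
    return hops * per_hop_ms

# Hypothetical example: a 1500-byte frame, 500 miles, a 10 Mbps link, 6 hops.
total = propagation_ms(500) + serialisation_ms(1500, 10_000_000) + processing_ms(6)
print(f"approximate one-way delay: {total:.2f} ms")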
Post 3 Bandwidth Delay Product
At any time during a data exchange between two points in the network there is likely to be data 'in flight' on the wire. In-flight data refers to bytes that have been sent but not yet acknowledged by the receiver. If the source and destination are far apart this figure will likely be higher, as the propagation delay described in Post 2 comes into play on such a link. At the same time, if the bandwidth of the link is high it allows greater amounts of data to be pushed down the pipe without acknowledgement.
By definition, BDP refers to the product of a data link's capacity (bits per second) and its end-to-end delay (seconds). End-to-end delay is also expressed as RTT, or Round Trip Time. By way of an example, consider a 10 Mbps link with an RTT of 50 milliseconds:
10,000,000 * 0.050 = 500,000 bits BDP, or 62,500 bytes.
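That arithmetic drops straight into a small helper; a minimal sketch:

def bdp_bytes(link_bps, rtt_seconds):
    # Bandwidth-Delay Product: link capacity multiplied by round-trip time,
    # converted from bits to bytes.
    return int(link_bps * rtt_seconds / 8)

# The worked example above: a 10 Mbps link with a 50 ms RTT.
print(bdp_bytes(10_000_000, 0.050))   # 62500 bytes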
So with the above example it is possible that 62,500 bytes of unacknowledged data could be in transit at any one time. It is probably appropriate to mention the term Long Fat Network (LFN) at this point. Strictly speaking, according to the official definition, any network with a BDP greater than 12,500 bytes is termed an LFN.
The standard TCP header contains the window size field, which is defined as 16 bits in length. Therefore the maximum amount of data that can be in transit using standard TCP attributes is 65,535 bytes, the largest value a 16-bit field can hold (just under 2 to the power 16). With the example given, if the link becomes any faster or the RTT slightly greater, an intrinsic problem with TCP appears: the BDP becomes greater than the TCP window size allows! At this point throughput on the link becomes throttled, not because of any physical limitation of the network but by the limitations of the TCP protocol. Today network capabilities have outgrown the limits envisaged when TCP and IPv4 were defined way back in 1981 (RFC 793 and RFC 791 respectively).
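A short sketch tying the two numbers together, using a faster hypothetical link: once the BDP exceeds what a 16-bit window can describe, the window, not the wire, becomes the bottleneck.

MAX_UNSCALED_WINDOW = 2**16 - 1     # 65,535 bytes: the largest value a 16-bit field can hold

def bdp_bytes(link_bps, rtt_seconds):
    return int(link_bps * rtt_seconds / 8)

# A hypothetical 100 Mbps link with the same 50 ms RTT needs far more data
# in flight than an unscaled 16-bit window can advertise.
needed = bdp_bytes(100_000_000, 0.050)
print(f"BDP = {needed} bytes, max unscaled window = {MAX_UNSCALED_WINDOW} bytes")
print("window-limited" if needed > MAX_UNSCALED_WINDOW else "not window-limited")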
Thankfully this inherent limitation was foreseen and has now been circumvented, which leads nicely on to Post 4: Window Scaling…
Post 4 Window Scaling
Following on from the previous post, window scaling becomes necessary once the BDP is greater than roughly 64 KB. Use of this feature enables the TCP window to be scaled from the header's limitation of 2 to the power 16 up to 2 to the power 30, equivalent to just over 1 gigabyte. It is implemented as a TCP header option field that specifies the window scaling factor (up to a maximum value of 14); the details are in RFC 1323. This approach was much preferable to increasing the size of the TCP window header field, which would have made the upgrade incompatible with previous versions of TCP.
One downside of window scaling is that it can only be negotiated as part of TCP session establishment. This puts a requirement on the TCP stack to establish the correct window scaling for the BDP of the link, which raises the question: how can it do this? That rather depends on the OS in use. Implementations may have a global setting that enables window scaling and sets a global value. This may be tuned or programmed on a per-application basis. Some OSs even deploy auto-tuning, whereby the optimum window size is calculated following network diagnostic tests.
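As an illustration of the mechanics only (a sketch of the arithmetic, not a packet parser): the advertised 16-bit window is simply shifted left by the negotiated scale factor. The commented-out lines show one way the feature is commonly exposed on Linux, via an assumed procfs path.

def effective_window_bytes(advertised_window, scale_factor):
    # Scaled receive window per RFC 1323: the 16-bit value shifted left
    # by the scale factor carried in the TCP header option.
    if not 0 <= scale_factor <= 14:
        raise ValueError("scale factor must be between 0 and 14")
    return advertised_window << scale_factor

# A full 16-bit window with the maximum shift of 14 gives just over 1 GB.
print(effective_window_bytes(65_535, 14))   # 1073725440 bytes

# On Linux the setting is typically visible via procfs (assumed path):
# with open("/proc/sys/net/ipv4/tcp_window_scaling") as f:
#     print("window scaling enabled:", f.read().strip() == "1")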
In the next post I look at SACK, another TCP mechanism to increase throughput…
Post 5 SACK (Selective Acknowledgement)
Under normal TCP processing rules the receiver is not obliged to acknowledge every packet individually, which lowers the overhead TCP adds to the transmission of data. One downside is that when the sender determines a packet has been lost (e.g. based on a timeout), it must retransmit all packets from the last positively acknowledged one onwards.
SACK (RFC 2018) is an option negotiated at TCP session establishment. Once agreed, the receiver can acknowledge non-contiguous blocks of data as they arrive. In the event that a packet is lost, only selective retransmission of the lost or unacknowledged packets is required.
This feature can increase throughput rates, especially on 'lossy' links where retransmissions may be a regular occurrence.
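A toy sketch of the difference, with made-up segment numbers (this models the bookkeeping only, not real sequence-number arithmetic): with only a cumulative ACK everything after the first gap is resent, whereas with SACK only the gaps are.

def retransmit_without_sack(sent, received):
    # Cumulative ACK only: resend everything from the first missing segment onwards.
    first_gap = next(seq for seq in sent if seq not in received)
    return [seq for seq in sent if seq >= first_gap]

def retransmit_with_sack(sent, received):
    # SACK: the receiver reports the blocks it already holds, so only gaps are resent.
    return [seq for seq in sent if seq not in received]

sent = list(range(1, 11))             # segments 1..10
received = {1, 2, 3, 5, 6, 8, 9, 10}  # segments 4 and 7 were lost

print(retransmit_without_sack(sent, received))   # [4, 5, 6, 7, 8, 9, 10]
print(retransmit_with_sack(sent, received))      # [4, 7]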
Post 6 Packet Size
When a TCP client initiates a connection it includes its Maximum Segment Size (MSS) as an option in its SYN packet. On a typical LAN this will be 1460 bytes. The default Maximum Transmission Unit (MTU) on a LAN is 1500 bytes, i.e. the maximum amount of data that can be carried in a single packet. The TCP/IP headers total 40 bytes, hence 1500 - 40 = 1460 bytes.
To summarize: the MSS is the largest amount of payload data per segment, while the MTU is the largest packet size including all TCP/IP headers (excluding the Ethernet header). If an attempt is made to transmit data over and above this limit it will likely be discarded by the network, and a Fragmentation Needed ICMP response may be sent back to the source depending on OS configuration.
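The relationship is trivial but worth writing down; a sketch assuming the standard 20-byte IPv4 and 20-byte TCP headers with no options:

IPV4_HEADER = 20    # bytes, no options
TCP_HEADER = 20     # bytes, no options

def mss_from_mtu(mtu_bytes):
    # Largest TCP payload per segment once the IP and TCP headers are accounted for.
    return mtu_bytes - IPV4_HEADER - TCP_HEADER

print(mss_from_mtu(1500))   # 1460 bytes on a standard Ethernet LAN
print(mss_from_mtu(9000))   # 8960 bytes with jumbo frames (see below)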
The limitation on Ethernet packet size introduces inefficiency when transporting large volumes of data. Every received Ethernet frame requires the routers and switches in the network path to perform processing on it. If the frame size can be increased, it enables data transfer with less effort, reducing network hardware resource utilization. The overall overhead of the TCP/IP protocol is also reduced, with fewer headers and acknowledgements to be transmitted.
To this end the concept of Jumbo Frames was introduced within Ethernet, allowing an MTU of 9000 bytes. This size was chosen because it is below the limit at which the Ethernet CRC loses effectiveness, which apparently lies at around 12,000 bytes, but above the common data block boundary of 8 KB (2 to the power 13). Many vendors of network equipment now support implementations of Jumbo Frames. The downside is that all network equipment in the path must be enabled for Jumbo Frames or it will not work.
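A rough sketch of the overhead saving, ignoring Ethernet framing and ACK traffic, assuming the same 40-byte TCP/IP headers as above, and applied to an arbitrary 10 MB transfer:

import math

HEADERS = 40    # bytes of TCP/IP headers per packet, no options

def packets_and_overhead(data_bytes, mtu_bytes):
    # How many packets, and how many header bytes, a transfer needs at a given MTU.
    payload_per_packet = mtu_bytes - HEADERS
    packets = math.ceil(data_bytes / payload_per_packet)
    return packets, packets * HEADERS

for mtu in (1500, 9000):
    packets, overhead = packets_and_overhead(10_000_000, mtu)
    print(f"MTU {mtu}: {packets} packets, {overhead} header bytes")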
Post 7 Nagle Algorithm
At the other end of the packet size spectrum, TCP/IP can perform badly when there are many small amounts of data to be transferred. For example, with a payload of 2 bytes, 40 bytes of TCP/IP headers are added to enable the transmission. In this scenario the overhead of TCP/IP in percentage terms is huge.
The Nagle algorithm, named after John Nagle, was created to optimize the performance of TCP/IP in situations like the above. Effectively, data is buffered by the sender while there is an outstanding ACK for data that has already been sent. The effect is to bundle small writes together so that data is transmitted in a more efficient manner.
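Applications that need to trade this efficiency for latency typically interact with the behaviour through the TCP_NODELAY socket option, which disables Nagle's buffering on a per-socket basis; a minimal sketch:

import socket

# Nagle's algorithm is on by default; latency-sensitive senders can disable it
# with TCP_NODELAY so that small writes go onto the wire immediately.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back: a non-zero value means Nagle is disabled on this socket.
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
sock.close()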
Post 8 WAN Accelerators
There are now many products on the market
that offer WAN optimization. They pull
together a number of techniques to improve performance over the WAN. In summary these are:
Caching: storing data locally knowing that
it will likely be referenced again, thereby reducing the need to reach out
across the WAN at all.
De-duplication: rather than sending complete blocks of data, a block that has been seen before is referred to by a small amount of reference data, essentially an index to the copy already stored locally.
Compression: use of compression technologies to reduce the amount of data that actually needs to be sent. Compression and decompression are effectively faster than transmitting the uncompressed data itself.
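As a small illustration of that trade-off, here is a sketch using Python's built-in zlib on some deliberately repetitive sample data (real accelerators use their own dictionaries and algorithms, so treat the ratio as indicative only):

import zlib

# Repetitive data, typical of the chatty traffic that compresses well across a WAN.
payload = b"GET /inventory/item?id=42 HTTP/1.1\r\nHost: example.internal\r\n\r\n" * 200

compressed = zlib.compress(payload, level=6)
print(f"original:   {len(payload)} bytes")
print(f"compressed: {len(compressed)} bytes "
      f"({len(compressed) / len(payload):.1%} of original)")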
In addition to the above, they will commonly make use of all the features already mentioned in this series: tweaking packet sizes, window scaling, SACK and so on. This is by no means a definitive list for WAN accelerators, just an overview of some of the techniques they deploy.
The downsides are obviously increased cost (at least two devices are required, and possibly four for resilience!) and added network complexity.