The network and its underlying performance are fundamental to the performance of every application that depends on it. As a result, many strategies have been developed to optimize its behaviour.
In this small series of blog posts I look at those strategies and their implementation. Fundamental to almost all of them is an understanding of the workings of the underlying TCP (Transmission Control Protocol) algorithm.
Post 1 Understanding the TCP Transmission Algorithms
TCP is a reliable transport protocol. That is, the TCP algorithm guarantees that data sent between two points in the network is delivered successfully. If any data is lost along the way, the algorithm retransmits it so that the required data is delivered in full.
Fundamental to this reliable flow of data is positive acknowledgement: the receiver confirms that data has been delivered, and if no confirmation arrives the sender retransmits. The possibility that data must be retransmitted imposes requirements on both sender and receiver to buffer data: the sender must buffer data while it waits for a positive acknowledgement from the receiver, and the receiver must, at the very least, buffer data until it has acknowledged receipt to the sender and passed it up the network stack for further processing.
Since both sender and receiver carry these responsibilities, flow control mechanisms must be introduced for both parties. Buffers are limited by physical hardware resources, which are finite!
Sender's Congestion Window
The sender starts by transmitting one segment and waiting for its ACK. When that ACK is received, the congestion window is incremented from one to two, and two segments can then be sent. When each of those two segments is acknowledged, the congestion window is increased to four, and so on. This phase, known as slow start, provides a growth mechanism for the speed at which data can be transmitted. However, it is inevitable that this growth must be throttled at some point.
A number of factors can throttle this growth. The first to consider is network unreliability: if a packet is not acknowledged in a timely manner, the TCP algorithm will assume there is network congestion, ‘back off’, and start transmitting again from the initial slower speed. The second to consider is the Receiver’s window size.
Receiver's TCP Window
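The growth and back-off described above can be sketched as a toy simulation. This is a simplified model under stated assumptions, not how any real TCP stack is written: the window is counted in whole segments, it doubles once per round trip, and a timeout resets it to the initial value. The function name and the `loss_at_round` parameter are hypothetical, and refinements such as a slow start threshold are left out.

```python
def slow_start(rounds, initial_cwnd=1, loss_at_round=None):
    """Return the congestion window (in segments) at the start of each round trip."""
    cwnd = initial_cwnd
    history = []
    for rtt in range(rounds):
        history.append(cwnd)
        if loss_at_round is not None and rtt == loss_at_round:
            # Assumed behaviour: a timeout is read as congestion, so back off
            # to the initial window.
            cwnd = initial_cwnd
        else:
            # Each ACK adds one segment, so the window doubles per round trip.
            cwnd *= 2
    return history


print(slow_start(5))                   # [1, 2, 4, 8, 16]
print(slow_start(6, loss_at_round=3))  # [1, 2, 4, 8, 1, 2]
```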
The window size represents the maximum number of unacknowledged bytes that can be outstanding at any time. An initial value is advertised at TCP session establishment, and the window prevents the receiver being overrun with data that it cannot process in time. Each time it sends an ACK, the Receiver informs the Sender, in the TCP header window size field, of the amount of buffer space it has left.
So the TCP transmission rate is controlled by both the sender’s ‘congestion window’ and the receiver’s ‘TCP window’. The amount of unacknowledged data will grow until it reaches either the TCP window or the congestion window, whichever is smaller.
Now, with an understanding of the protocol mechanisms at play within TCP, I look at the factors that come into play when considering the performance of the network, and what can be done to speed things up…
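As a minimal sketch of that rule, the amount a sender may still put on the wire is the smaller of the two windows, less whatever is already unacknowledged. The function and parameter names here are illustrative only.

```python
def usable_window(cwnd_bytes, rwnd_bytes, in_flight_bytes):
    """Bytes the sender may still transmit before it has to wait for an ACK."""
    # The effective limit is whichever window is smaller.
    limit = min(cwnd_bytes, rwnd_bytes)
    return max(0, limit - in_flight_bytes)


# The congestion window is the bottleneck here: 8000 - 3000 bytes remain.
print(usable_window(cwnd_bytes=8000, rwnd_bytes=65535, in_flight_bytes=3000))  # 5000
# The receiver window is the bottleneck and it is already full.
print(usable_window(cwnd_bytes=100000, rwnd_bytes=65535, in_flight_bytes=65535))  # 0
```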
Post 2 Propagation, Processing and Serialisation
These three factors will almost always have a bearing on the performance of the network:
Propagation: The time taken for bits to travel across the wire. In a copper or fibre medium this is limited by the speed of light. Without delving into the speed of light and the speed = distance/time formula, a useful rule of thumb is that 1 millisecond per 100 miles is the best case scenario.
Processing: Any network equipment in the transmission path will add a small incremental amount of processing delay to the overall transmission time.
Serialisation: The time it takes to get the bits onto the wire. This is determined by the packet size and the transmission speed of the line.
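To get a feel for how the three delays add up, here is a rough estimator. It uses the 1 millisecond per 100 miles rule of thumb from above; the per-hop processing figure is purely an assumed input, since real values depend on the equipment in the path.

```python
def propagation_delay_ms(distance_miles, ms_per_100_miles=1.0):
    """Best-case time for bits to cross the distance (rule of thumb)."""
    return distance_miles / 100.0 * ms_per_100_miles


def serialisation_delay_ms(packet_bytes, link_bps):
    """Time to clock one packet onto the wire at the given line speed."""
    return packet_bytes * 8 / link_bps * 1000.0


def one_way_delay_ms(distance_miles, packet_bytes, link_bps,
                     hops=0, per_hop_processing_ms=0.0):
    """Sum of propagation, serialisation and (assumed) per-hop processing delay."""
    return (propagation_delay_ms(distance_miles)
            + serialisation_delay_ms(packet_bytes, link_bps)
            + hops * per_hop_processing_ms)


# A 1500 byte packet over 3000 miles on a 10 Mbps line, through 10 devices
# that are each assumed to add 0.05 ms: roughly 30 + 1.2 + 0.5 ms.
print(one_way_delay_ms(3000, 1500, 10_000_000, hops=10, per_hop_processing_ms=0.05))
```

Note how propagation dominates on a long link, which is why distance matters so much in the posts that follow.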
Post 3 Bandwidth Delay Product
At any time during a data exchange between two points in the network there is likely to be data ‘in-flight’ on the wire. In-flight data refers to bytes that have been sent but not yet acknowledged to the sender by the receiver. If the source and destination are far apart then this figure will likely be higher, because the propagation delay described in Post 2 comes into play on such a link. At the same time, if the bandwidth of the link is high, greater amounts of data can be pushed down the pipe without acknowledgement.
By definition BDP (Bandwidth Delay Product) refers to the product of a data link's capacity (bits per second) and its end to end delay (seconds). The delay used is normally the RTT, or Round Trip Time. By way of an example, consider a 10 Mbps link with an RTT of 50 milliseconds.
10,000,000 * 0.050 = 500,000 bits BDP, or 62,500 bytes.
So with the above example it is possible that 62,500 bytes of unacknowledged data could be in transit at any one time. It’s probably appropriate to mention the term Long Fat Networks (LFNs) at this point. Strictly speaking, and according to the official definition, any network with a BDP significantly greater than 12,500 bytes (10 to the power 5 bits) is termed an LFN.
The standard TCP header contains the window size field, and this is defined as 16 bits in length. Therefore the maximum amount of data that can be in transit using standard TCP attributes is ‘2 to the power 16’ minus 1 = 65,535 bytes. With the example given, if the link were any faster or had a slightly greater RTT, this would introduce an intrinsic problem with TCP: the BDP would become greater than the TCP window size allows! At this point throughput on the link would become throttled, not because of any physical limitation of the network but by the limitations of the TCP protocol. Today network capabilities have outgrown the limits envisaged when TCP was defined way back in 1981 by RFC 793.
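The calculation is easy to script for other links. This sketch takes the RTT in milliseconds to keep the arithmetic exact, and uses the 12,500 byte figure quoted above as the LFN threshold; the function names are my own.

```python
LFN_THRESHOLD_BYTES = 12500  # threshold quoted in the text above


def bdp_bytes(link_bps, rtt_ms):
    """Bandwidth Delay Product in bytes for a link speed (bits/s) and RTT (ms)."""
    return link_bps * rtt_ms / 1000 / 8


def is_lfn(link_bps, rtt_ms):
    """True when the BDP exceeds the Long Fat Network threshold."""
    return bdp_bytes(link_bps, rtt_ms) > LFN_THRESHOLD_BYTES


print(bdp_bytes(10_000_000, 50))  # 62500.0, the worked example
print(is_lfn(10_000_000, 50))     # True
print(is_lfn(1_000_000, 20))      # False: 2500 bytes in flight
```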
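To see the throttling effect in numbers: a sender can have at most one window of data outstanding per round trip, so throughput is capped at window size divided by RTT. The sketch below assumes the largest value a 16-bit field can hold (65,535 bytes) and ignores every other limit such as loss or congestion.

```python
MAX_UNSCALED_WINDOW = 65535  # largest value a 16-bit window field can carry


def window_limited_throughput_bps(window_bytes, rtt_ms):
    """Upper bound on throughput when one window is sent per round trip."""
    return window_bytes * 8 * 1000 / rtt_ms


def achievable_throughput_bps(link_bps, window_bytes, rtt_ms):
    """The lower of the line rate and the window-imposed ceiling."""
    return min(link_bps, window_limited_throughput_bps(window_bytes, rtt_ms))


# A 100 Mbps link with a 50 ms RTT is held to about 10.5 Mbps by the window.
print(achievable_throughput_bps(100_000_000, MAX_UNSCALED_WINDOW, 50))  # 10485600.0
```

In other words, buying more bandwidth on a long link achieves nothing until the window can grow with it.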
Thankfully this inherent limitation was foreseen and has now been circumvented. This leads nicely onto Post 4, Window Scaling…
Post 4 Window Scaling
Following on from the previous post, Window Scaling becomes necessary once the BDP is greater than 64 KB. Use of this feature enables the TCP window to be scaled from the TCP header size limitation of 2 to the power 16 up to 2 to the power 30, or roughly 1 gigabyte. This has been implemented by creating a TCP header option field that specifies the window scaling factor (up to a maximum value of 14). The advertised window is multiplied by 2 to the power of this factor. Details are in RFC 1323. This approach was much preferable to increasing the size of the TCP window header field, which would have made the upgrade incompatible with previous versions of TCP.
One downside of window scaling is that it can only be negotiated as part of TCP session establishment. This puts a requirement on the TCP stack to establish the correct window scaling for the BDP of the link before any data has flowed. This raises the question: how can it do this? It rather depends on the OS in use. Implementations may have a global setting that enables window scaling and sets a global value. This may be tuned or programmed on a per application basis. Some OSs may even deploy auto-tuning, whereby the optimum window size is calculated from measurements of the network.
In the next post I look at SACK, another TCP mechanism to increase throughput…
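As an illustration of how the scaling factor relates to the BDP, the sketch below finds the smallest factor whose scaled window covers a target number of bytes. It assumes the factor acts as a left shift on the 16-bit window value, with 14 as the maximum; the function name is hypothetical.

```python
MAX_WINDOW_FIELD = 65535  # largest value of the 16-bit window field
MAX_SCALE = 14            # maximum scaling factor


def window_scale_for(target_window_bytes):
    """Smallest scaling factor whose scaled window covers the target."""
    for scale in range(MAX_SCALE + 1):
        if (MAX_WINDOW_FIELD << scale) >= target_window_bytes:
            return scale
    raise ValueError("target exceeds the largest scalable window")


print(window_scale_for(62500))    # 0: fits in an unscaled window
print(window_scale_for(625000))   # 4: e.g. a 100 Mbps link with a 50 ms RTT
print(MAX_WINDOW_FIELD << MAX_SCALE)  # 1073725440 bytes at the maximum factor
```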
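For per application tuning, one common route is to request larger socket buffers before the connection is made, since the window that can be offered is tied to the buffer space. The sketch below uses the standard socket options `SO_RCVBUF` and `SO_SNDBUF`; the helper name and the 256 KB figure are just examples, and the OS is free to adjust or cap whatever is requested.

```python
import socket


def make_socket_with_buffers(buffer_bytes):
    """Create a TCP socket and request send/receive buffers of the given size."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Request the buffers before connect(), because window scaling is only
    # negotiated while the session is being established.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    return sock


sock = make_socket_with_buffers(256 * 1024)
# What the OS actually granted may differ from what was asked for.
granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print("receive buffer granted:", granted)
sock.close()
```

Where the OS auto-tunes its buffers, setting them by hand like this may switch the auto-tuning off for that socket, so it is worth measuring before and after.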
Post 5 SACK Selective Acknowledgement
Under normal TCP processing rules the receiver is not obliged to acknowledge every received packet individually: an acknowledgement is cumulative and covers all data up to a given point. This lowers the overhead of TCP on the transmission of data. One downside is that when the sender determines a packet has been lost (e.g. based on a timeout), it may have to retransmit all packets from the last one positively acknowledged, even if later packets arrived safely.
SACK (RFC 2018) is a negotiated option on TCP session establishment. Once agreed, the receiver can acknowledge non-contiguous packets that it has received. In the event that a packet is lost, only the lost or unacknowledged packets need to be retransmitted.
This feature can increase throughput rates especially on ‘lossy’ links, where retransmissions may be a regular occurrence.
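The difference can be shown with a small model that works on byte ranges. This is a sketch of the idea only and does not reproduce the real option encoding; the function names are made up, and ranges are (start, end) pairs with the end exclusive.

```python
def build_ack(received, next_expected=0):
    """Return (cumulative_ack, sack_blocks) for the byte ranges received so far."""
    cumulative = next_expected
    blocks = []
    for start, end in sorted(received):
        if start <= cumulative:
            # Contiguous with what is already acknowledged.
            cumulative = max(cumulative, end)
        elif blocks and start <= blocks[-1][1]:
            # Extends the most recent out-of-order block.
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            blocks.append((start, end))
    return cumulative, blocks


def missing_ranges(cumulative, sack_blocks, highest_sent):
    """Byte ranges the sender still needs to retransmit."""
    gaps = []
    cursor = cumulative
    for start, end in sack_blocks:
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < highest_sent:
        gaps.append((cursor, highest_sent))
    return gaps


# Five 1000 byte segments were sent; the second one was lost.
received = [(0, 1000), (2000, 3000), (3000, 4000), (4000, 5000)]
ack, blocks = build_ack(received)
print(ack, blocks)                        # 1000 [(2000, 5000)]
print(missing_ranges(ack, blocks, 5000))  # with SACK: [(1000, 2000)]
print(missing_ranges(ack, [], 5000))      # without SACK: [(1000, 5000)]
```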
Post 6 Packet Size
When a TCP client initiates a connection it includes its Maximum Segment Size (MSS) as an option in its SYN packet. In a typical LAN this will be 1460 bytes. The default Maximum Transmission Unit (MTU) on a LAN is 1500 bytes, i.e. the maximum amount of data that can be transported by the protocol in one go. The combined TCP/IP headers are 40 bytes (20 bytes for IP plus 20 bytes for TCP, without options), hence 1500 – 40 = 1460 bytes.
To summarize, the MSS is the largest amount of payload data. The MTU is the largest packet size including all TCP/IP headers (excluding the Ethernet header). If an attempt is made to transmit a packet over and above this limit, it will likely be discarded by the network (or fragmented, where fragmentation is permitted). A Fragmentation Needed ICMP response may be sent back to the source, depending on configuration.
The limitation on Ethernet packet size increases inefficiencies when transporting large volumes of data. Each received Ethernet frame requires any routers and switches in the network path to perform processing on it. If the frame size can be increased, it enables data transfer with less effort, reducing network hardware resource utilization. The overall overhead of the TCP/IP protocol is also reduced, with fewer headers and acknowledgements to be transmitted.
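The relationship between the two is simple enough to express directly. The sketch below assumes 20 byte IP and 20 byte TCP headers with no options, and shows how a payload would be cut into MSS-sized pieces; the helper names are illustrative.

```python
IP_HEADER_BYTES = 20   # assumed: IPv4 header without options
TCP_HEADER_BYTES = 20  # assumed: TCP header without options


def mss_for_mtu(mtu_bytes):
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu_bytes - IP_HEADER_BYTES - TCP_HEADER_BYTES


def split_into_segments(data, mss):
    """Cut a payload into chunks no larger than the MSS."""
    return [data[i:i + mss] for i in range(0, len(data), mss)]


print(mss_for_mtu(1500))  # 1460
segments = split_into_segments(bytes(3000), mss_for_mtu(1500))
print([len(s) for s in segments])  # [1460, 1460, 80]
```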
To this end the concept of Jumbo Frames was introduced within Ethernet. This allows an MTU of 9000 bytes. This byte size was chosen as it is below the Ethernet CRC limit of effectiveness which apparently lies at around 12000 bytes, but is above the data block boundary of 8K (2 to the power 13). Many vendors of network equipment now support implementations of Jumbo Frames. The downside is that all network equipment in the path must be enabled for Jumbo Frames or it won’t work.
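A quick comparison gives a sense of the saving. This sketch counts the packets and TCP/IP header bytes needed to move a payload at a given MTU, assuming 40 bytes of headers per packet and ignoring Ethernet framing and acknowledgements.

```python
import math

HEADER_BYTES = 40  # assumed TCP/IP header size per packet, no options


def transfer_cost(payload_bytes, mtu_bytes):
    """Return (packets, header_bytes) needed to carry the payload."""
    mss = mtu_bytes - HEADER_BYTES
    packets = math.ceil(payload_bytes / mss)
    return packets, packets * HEADER_BYTES


print(transfer_cost(1_000_000, 1500))  # (685, 27400)
print(transfer_cost(1_000_000, 9000))  # (112, 4480)
```

For the same megabyte of data, every device in the path handles roughly a sixth as many packets with Jumbo Frames.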
Post 7 Nagle Algorithm
At the other end of the packet size spectrum, TCP/IP can perform badly when there are many small amounts of data to be transferred. For example, with a payload of 2 bytes there will be 40 bytes of TCP/IP headers added to enable the transmission. In this scenario the overhead of TCP/IP in percentage terms would be huge.
The Nagle algorithm was created to optimize the performance of TCP/IP in situations like the above. It is named after John Nagle. Effectively, small amounts of data are buffered by the sender if there is an outstanding ACK for data that has already been sent. The effect is to bundle bytes together so that data is transmitted in a more efficient manner.
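Putting a number on "huge": the header share of a packet is headers divided by headers plus payload. A small sketch, again assuming 40 bytes of headers:

```python
HEADER_BYTES = 40  # assumed TCP/IP header size, no options


def header_overhead_percent(payload_bytes):
    """Share of the packet taken up by TCP/IP headers, as a percentage."""
    return HEADER_BYTES / (HEADER_BYTES + payload_bytes) * 100


print(round(header_overhead_percent(2), 1))     # 95.2
print(round(header_overhead_percent(1460), 1))  # 2.7
```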
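The buffering rule can be modelled in a few lines. This is a simplified sketch rather than a faithful implementation: the class and method names are invented, a single ACK is assumed to cover everything outstanding, and full-sized segments are always sent straight away.

```python
class NagleSender:
    """Toy sender that holds back small writes while an ACK is outstanding."""

    def __init__(self, mss=1460):
        self.mss = mss
        self.buffer = b""
        self.unacked = False  # True while sent data awaits an ACK
        self.sent = []        # segments actually put on the wire

    def write(self, data):
        self.buffer += data
        self._try_send()

    def ack_received(self):
        # Simplification: one ACK acknowledges all outstanding data.
        self.unacked = False
        self._try_send()

    def _try_send(self):
        while self.buffer:
            if len(self.buffer) >= self.mss:
                # A full segment is always worth sending.
                segment, self.buffer = self.buffer[:self.mss], self.buffer[self.mss:]
            elif not self.unacked:
                # Nothing outstanding, so small data may go immediately.
                segment, self.buffer = self.buffer, b""
            else:
                # Small data with an ACK outstanding: keep buffering.
                break
            self.sent.append(segment)
            self.unacked = True


sender = NagleSender()
for byte in (b"a", b"b", b"c"):
    sender.write(byte)
sender.ack_received()
print(sender.sent)  # [b'a', b'bc']
```

Three one byte writes end up as two packets instead of three, at the cost of the second and third bytes waiting one round trip.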
Post 8 WAN accelerators
There are now many products on the market that offer WAN optimization. They pull together a number of techniques to improve performance over the WAN. In summary these are:
Caching: storing data locally knowing that it will likely be referenced again, thereby reducing the need to reach out across the WAN at all.
De-duplication: Rather than sending a complete block of data that the far end has already seen, the accelerator sends only a small reference to it, i.e. simply an index to the data stored locally at the other side.
Compression: Use of compression technologies to reduce the amount of data that actually needs to be sent. The time spent compressing and decompressing is typically less than the time saved by transmitting less data.
In addition to the above they will commonly make use of all the features already mentioned, tweaking packet sizes, window scaling, SACK, etc. This is by no means a definitive list for WAN accelerators, just an overview of some of the techniques that they deploy.
Downsides are obviously increased costs (at least 2 are required, and possibly 4 for resilience!) and network complexity.
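De-duplication and compression are easy to demonstrate together. The sketch below is a toy model of a pair of accelerators and not how any particular product works: the class name is invented, blocks are identified by a SHA-256 digest, and new blocks are compressed with zlib.

```python
import hashlib
import zlib


class DedupLink:
    """Toy model of two WAN accelerators sharing a block store."""

    def __init__(self):
        self.sender_seen = set()  # digests the sending side has already sent
        self.receiver_store = {}  # digest -> block held at the receiving side

    def encode(self, block):
        digest = hashlib.sha256(block).digest()
        if digest in self.sender_seen:
            # Already sent once: a short reference is enough.
            return ("ref", digest)
        self.sender_seen.add(digest)
        return ("data", zlib.compress(block))

    def decode(self, message):
        kind, body = message
        if kind == "ref":
            return self.receiver_store[body]
        block = zlib.decompress(body)
        self.receiver_store[hashlib.sha256(block).digest()] = block
        return block


link = DedupLink()
block = b"the same report, again and again " * 1000

first = link.encode(block)
second = link.encode(block)
print(len(block), len(first[1]), len(second[1]))  # original, compressed, reference sizes
assert link.decode(first) == block
assert link.decode(second) == block
```

The second transfer costs 32 bytes on the WAN regardless of how large the block is, which is where most of the gain on repetitive traffic comes from.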