
LPC: Making the net go faster

By Jonathan Corbet
September 13, 2011
Almost every service offered by Google is delivered over the Internet, so it makes sense that the company would have an interest in improving how the net performs. The networking session at the 2011 Linux Plumbers Conference featured presentations from three Google developers, each of whom had a proposal for a significant implementation change. Between the three, it seems, there is still a lot of room for improvement in how we do networking.

Proportional rate reduction

The "congestion window" is a TCP sender's idea of how much data it can have in flight to the other end before it starts to overload a link in the middle. Dropped packets are often a sign that the congestion window is too large, so TCP implementations normally reduce the window significantly when loss happens. Cutting the congestion window will reduce performance, though; if the packet loss was a one-time event, that slowdown will be entirely unnecessary. RFC 3517 describes an algorithm for bringing the connection up to speed quickly after a lost packet, but, Nandita Dukkipati says, we can do better.

According to Nandita, a large portion of the network sessions involving Google's servers experience losses at some point; the ones that do can take 7-10 times longer to complete. RFC 3517 is part of the problem. This algorithm responds to a packet loss by immediately cutting the congestion window in half; that means that the sending system must, if the congestion window had been full at the time of the loss, wait for ACKs for half of the in-transit packets before transmitting again. That causes the sender to go silent for an extended period of time. It works well enough in simple cases (a single packet lost in a long-lasting flow), but it tends to clog up the works when dealing with short flows or extended packet losses.

Linux does not use strict RFC 3517 now; it uses, instead, an enhancement called "rate halving." With this algorithm, the congestion window is not halved immediately. Once the connection goes into loss recovery, each incoming ACK (which will typically acknowledge the receipt of two packets at the other end) will cause the congestion window to be reduced by a single packet. Over the course of one full set of in-flight packets, the window will be cut in half, but the sending system will continue to transmit (at a lower rate) while that reduction is happening. The result is a smoother flow and reduced latency.
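As a rough illustration, the per-ACK step of rate halving could be sketched as follows; the names and structure here are illustrative, not the kernel's actual implementation:

    /* Rate halving, as described above: while in loss recovery, each
     * incoming ACK (which typically covers two segments, thanks to
     * delayed ACKs) sheds one segment from the congestion window, so
     * the window reaches half its old size over one round trip while
     * transmission continues at the reduced rate. */
    struct flow {
        unsigned int cwnd;      /* congestion window, in segments */
        unsigned int target;    /* half the window at loss time,
                                   set when recovery begins */
    };

    static void rate_halving_on_ack(struct flow *f)
    {
        if (f->cwnd > f->target)
            f->cwnd--;          /* give up one segment per ACK */
    }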

But rate halving can be improved upon. The ACKs it depends on are themselves subject to loss; an extended loss can cause significant reduction of the congestion window and slow recovery. This algorithm also does not even begin the process of raising the congestion window back to the highest workable value until the recovery process is complete. So it can take quite a while to get back up to full speed.

The proportional rate reduction algorithm takes a different approach. The first step is to calculate an estimate for the amount of data still in flight, followed by a calculation of what, according to the congestion control algorithm in use, the congestion window should now be. If the amount of data in the pipeline is less than the target congestion window, the system just goes directly into the TCP slow start algorithm to bring the congestion window back up. Thus, when the connection experiences a burst of losses, it will start trying to rebuild the congestion window right away instead of creeping along with a small window for an extended period.

If, instead, the amount of data in flight is at least as large as the new congestion window, an algorithm similar to rate halving is used. The actual reduction is calculated relative to the new congestion window, though, rather than being a strict one-half cut. For both large and small losses, the emphasis on using estimates of the amount of in-flight data instead of counting ACKs is said to make recovery go more smoothly and to avoid needless reductions in the congestion window.
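The draft RFC mentioned below spells the algorithm out in detail; the sketch that follows is a loose rendering of its per-ACK step, with illustrative names, and is not the code that was actually merged:

    #include <stdint.h>

    /* Proportional rate reduction, run once per ACK during recovery.
     * "pipe" is the estimate of data still in flight; "ssthresh" is
     * the window the congestion control algorithm now wants. */
    struct prr {
        int ssthresh;      /* target window, chosen at recovery start */
        int recover_fs;    /* data outstanding when recovery began    */
        int delivered;     /* data delivered to the receiver since    */
        int out;           /* data we have sent since recovery began  */
    };

    static int prr_sndcnt(struct prr *p, int pipe, int acked, int mss)
    {
        int sndcnt;

        p->delivered += acked;
        if (pipe > p->ssthresh) {
            /* Pace the reduction in proportion to deliveries, so the
               window converges on ssthresh without going silent. */
            sndcnt = (int)(((int64_t)p->delivered * p->ssthresh
                            + p->recover_fs - 1) / p->recover_fs)
                     - p->out;
        } else {
            /* The pipe has drained below the target: grow again,
               slow-start style, instead of creeping along with a
               small window. */
            int limit = p->delivered - p->out + mss;
            sndcnt = p->ssthresh - pipe;
            if (sndcnt > limit)
                sndcnt = limit;
        }
        if (sndcnt < 0)
            sndcnt = 0;
        p->out += sndcnt;
        return sndcnt;     /* how much the sender may transmit now */
    }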

How much better is it? Nandita said that Google has been running experiments on some of its systems; the result has been a 3-10% reduction in average latency, and recovery timeouts have been reduced by 5%. This code is being deployed more widely on Google's servers; it has also been accepted for merging during the 3.2 development cycle. More information can be found in this draft RFC.

TCP fast open

Opening a TCP connection requires a three-packet handshake: a SYN packet sent by the client, a SYN-ACK response from the server, and a final ACK from the client. Until the handshake is complete, the link can carry no data, so the handshake imposes an unavoidable startup latency on every connection. But what would happen, asked Yuchung Cheng, if one were to send data with the handshake packets? For simple transactions - an HTTP GET request followed by the contents of a web page, for example - sending the relevant data with the handshake packets would eliminate that latency. The result of this thought is the "TCP fast open" proposal.

RFC 793 (describing TCP) does allow data to be passed with the handshake packets, with the proviso that the data not be passed to applications until the handshake completes. One can consider fudging that last requirement to speed the process of transmitting data through a TCP connection, but there are some hazards to be dealt with. An obvious problem is the amplification of SYN flood attacks, which are bad enough when they only involve the kernel; if each received SYN packet were to take up application resources as well, the denial of service possibilities would be significantly worse.

Yuchung described an approach to fast open which is intended to get around most of the problems. The first step is the creation of a per-server secret which is hashed with information from each client to create a per-client cookie. That cookie is sent to the client as a special option on an ordinary SYN-ACK packet; the client can keep it and use it for fast opens in the future. The requirement to get a cookie first is a low bar for the prevention of SYN flood attacks, but it does make things a little harder. In addition, the server's secret is changed relatively often, and, if the server starts to see too many connections, fast open will simply be disabled until things calm down.
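The cookie generation might be sketched like this; the hash below is a toy stand-in for whatever keyed MAC the real implementation uses, and every name here is hypothetical:

    #include <stdint.h>

    static uint64_t server_secret;      /* rotated relatively often */

    /* Toy 64-bit mixer standing in for a proper keyed MAC. */
    static uint64_t mix(uint64_t x)
    {
        x ^= x >> 33;
        x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33;
        return x;
    }

    /* The cookie binds the server's secret to the client's address;
       it goes out as an option on the SYN-ACK and comes back on the
       client's future fast-open SYNs. */
    static uint64_t fastopen_cookie(uint32_t client_addr)
    {
        return mix(server_secret ^ client_addr);
    }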

One remaining problem is that about 5% of the systems on the net will drop SYN packets containing unknown options or data. There is little to be done in this situation; TCP fast open simply will not work. The client must thus remember cases where the fast-open SYN packet did not get through and just use ordinary opens in the future.

Fast open will not happen by default; applications on both ends of the connection must specifically request it. On the client side, the sendto() system call is used to request a fast-open connection; with the new MSG_FAST_OPEN flag, it functions like the combination of connect() and sendmsg(). On the server side, a setsockopt() call with the TCP_FAST_OPEN option will enable fast opens. Either way, applications need not worry about dealing with the fast-open cookies and such.
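In practice the calls look something like the sketch below (error handling omitted). The constant spellings used here, MSG_FASTOPEN and TCP_FASTOPEN, are the ones that later landed in mainline Linux, so the sketch compiles on current systems; the talk named them MSG_FAST_OPEN and TCP_FAST_OPEN:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* Client side: the first chunk of data rides in the SYN, so this
       one call stands in for connect() followed by send(). */
    static ssize_t client_request(int fd, const struct sockaddr_in *srv,
                                  const char *req, size_t len)
    {
        return sendto(fd, req, len, MSG_FASTOPEN,
                      (const struct sockaddr *)srv, sizeof(*srv));
    }

    /* Server side: opt in before listen(); the value bounds the number
       of pending fast-open connections, limiting SYN-flood exposure. */
    static int server_enable_fastopen(int fd)
    {
        int qlen = 16;              /* illustrative queue length */
        return setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN,
                          &qlen, sizeof(qlen));
    }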

In Google's testing, TCP fast open has been seen to improve page load times by anything between 4% and 40%. This technique works best in situations where the round trip time is high, naturally; the bigger the latency, the more value there is in removing it. A patch implementing this feature will be submitted for inclusion sometime soon.

Briefly: user-space network queues

While the previous two talks were concerned with improving the efficiency of data transfer over the net, Willem de Bruijn is concerned with network processing on the local host. In particular, he is working with high-end hardware: high-speed links, numerous processors, and, importantly, smart network adapters that can recognize specific flows and direct packets to connection-specific queues. By the time the kernel gets around to thinking about a given packet at all, it will already be sorted into the proper place, waiting for the application to ask for the data.

Actual processing of the packets will happen in the context of the receiving process as needed. So it all happens in the right context and on the right CPU; intermediate processing at the software IRQ level will be avoided. Willem even described a new interface whereby the application would receive packets directly from the kernel via a shared memory segment.
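The unetq interface itself was not pinned down in the talk, but the shared-memory idea can be sketched as a single-producer ring: the kernel appends packet descriptors, and the application drains them in its own context. Everything in this sketch, names included, is hypothetical:

    #include <stdint.h>

    struct pkt_desc {
        uint32_t offset;        /* payload offset in the shared region */
        uint32_t len;           /* payload length                      */
    };

    struct pkt_ring {
        volatile uint32_t head; /* advanced by the kernel (producer)   */
        uint32_t tail;          /* advanced by the process (consumer)  */
        uint32_t mask;          /* ring size minus one (power of two)  */
        struct pkt_desc desc[]; /* descriptor slots                    */
    };

    /* Drain every packet currently queued for this flow, in the
       receiving process's own context and on its own CPU, with no
       software-IRQ processing in between. */
    static void poll_ring(struct pkt_ring *r, const uint8_t *base,
                          void (*handle)(const uint8_t *, uint32_t))
    {
        while (r->tail != r->head) {
            const struct pkt_desc *d = &r->desc[r->tail & r->mask];
            handle(base + d->offset, d->len);
            r->tail++;
        }
    }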

In other words, this talk described a variant of the network channels concept, where packet processing is pushed as close to the application as possible. There are numerous details to be dealt with, including the usual hangups for the channels idea: firewall processing and such. The proposed use of a file in sysfs to pass packets to user space also seems unlikely to pass review. But this work may eventually reach a point where it is generally useful; those who are interested can find the patches on the unetq page.




LPC: Making the net go faster (Briefly: user-space network queues)

Posted Sep 15, 2011 16:15 UTC (Thu) by mjthayer (guest, #39183) [Link]

From the article:
> There are numerous details to be dealt with, including the usual hangups for the channels idea: firewall processing and such.

I am a networking ignoramus, so please excuse the following if necessary. I thought though that firewalling normally involves taking a decision about whether to allow or forbid a particular connection as a whole. Can't the packet sequence for a particular connection simply be assigned to a particular user-space buffer after (if) that connection has been allowed?

LPC: Making the net go faster (Briefly: user-space network queues)

Posted Sep 15, 2011 17:14 UTC (Thu) by appie (guest, #34002) [Link]

The idea behind network channels is to push (packet) processing out of the kernel toward the application. Oversimplified: a direct pipe between incoming packets at the hardware level and the application. Packet processing won't be done by the kernel (again, oversimplified), hence no firewall checks.
Firewalls would have to be implemented in user space, e.g. in a library; every application connecting to a network channel would need to link to that library and explicitly do its own firewalling.

Also see: Van Jacobson's network channels

LPC: Making the net go faster (Briefly: user-space network queues)

Posted Sep 23, 2011 13:26 UTC (Fri) by slashdot (guest, #22014) [Link]

Then put the firewall in the hardware too.

LPC: Making the net go faster (Briefly: user-space network queues)

Posted Sep 15, 2011 23:00 UTC (Thu) by skitching (guest, #36856) [Link]

There are two types of firewalling: plain address/port checks that are applied when a connection is opened, and "stateful firewalling" where the actual data stream is inspected.

Stateful firewalls can do things like allow HTTP GET/PUT operations, but block other HTTP methods. Sadly, the corporate firewall at my current employer does this, which prevents all subversion access to external subversion repositories :-(. However it can be useful.

LPC: Making the net go faster (Briefly: user-space network queues)

Posted Sep 16, 2011 9:09 UTC (Fri) by mjthayer (guest, #39183) [Link]

> There are two types of firewalling: plain address/port checks that are applied when a connection is opened, and "stateful firewalling" where the actual data stream is inspected.

I agree, that sort of firewall cannot be implemented by just allowing or disallowing connections when they are first made. If you don't want an additional data copy, it might work if the card writes to a shared memory buffer which is only made available to the application once the firewall has given the green light. That would slow things down quite a bit, but I think if you want to inspect data that closely you have to live with that anyway (or use an external firewall). Having the receiver process itself do the inspection, as suggested by appie above [1], is probably not really an option here, as this sort of firewall is most likely there to stop the user doing things they may want to do but you (for some value of you) don't.

[1] http://lwn.net/Articles/459118/

LPC: Making the net go faster (Briefly: user-space network queues)

Posted Jan 6, 2012 20:41 UTC (Fri) by whacker (guest, #55546) [Link]

> Sadly, the corporate firewall at my current employer does this, which prevents all subversion access to external subversion repositories :-(. However it can be useful.

This sort of thing is more likely being done by an HTTP proxy in userspace than by a stateful firewall in the kernel.

Already available in the kernel: See IBverbs API

Posted Sep 16, 2011 15:21 UTC (Fri) by clameter (subscriber, #17005) [Link]

The IBverbs API implements something along the lines discussed here and it works with several Ethernet NICs as well.

LPC: Making the net go faster

Posted Sep 17, 2011 17:01 UTC (Sat) by dbarv (subscriber, #55094) [Link]

TCP fast open seems to reinvent T/TCP (http://en.wikipedia.org/wiki/T/TCP). Maybe the related security problems will be avoided...

LPC: Making the net go faster

Posted Oct 3, 2011 11:00 UTC (Mon) by andika (subscriber, #47219) [Link]

http://code.google.com/p/kernel/wiki/unetq

Page "unetq" Not Found

Select an existing page from the project's list.

LPC: Making the net go faster

Posted Nov 18, 2011 21:47 UTC (Fri) by Lennie (guest, #49641) [Link]

I think it might be this page ?:

https://code.google.com/p/kernel/wiki/ProjectUnetq

LPC: Making the net go faster

Posted Dec 19, 2011 3:05 UTC (Mon) by realbright (guest, #81887) [Link]

Regarding unetq: I think we already have a well-made infrastructure for packet hooking, called netfilter.
Which one will be the winner: netfilter in the kernel, or unetq?

