
A ring buffer for epoll

By Jonathan Corbet
May 30, 2019
The set of system calls known collectively as epoll was designed to make polling for I/O events more scalable. To that end, it minimizes the amount of setup that must be done for each system call and returns multiple events so that the number of calls can also be minimized. But that, it turns out, is still not scalable enough for some users. The response to this problem, in the form of this patch series from Roman Penyaev, takes a familiar form: add yet another ring-buffer interface to the kernel.

The poll() and select() system calls can be used to wait until at least one of a set of file descriptors is ready for I/O. Each call, though, requires the kernel to set up an internal data structure so that it can be notified when any given descriptor changes state. Epoll gets around this by separating the setup and waiting phases, and keeping the internal data structure around for as long as it is needed.

An application starts by calling epoll_create1() to create a file descriptor to use with the subsequent steps. That call, incidentally, supersedes epoll_create(); it replaces an unused argument with a flags parameter. Then epoll_ctl() is used to add individual file descriptors to the set monitored by epoll. Finally, a call to epoll_wait() will block until at least one of the file descriptors of interest has something to report. This interface is a bit more work to use than poll(), but it makes a big difference for applications that are monitoring huge numbers of file descriptors.

That said, it would seem that there is still room for doing things better. Even though epoll is more efficient than its predecessors, an application still has to make a system call to get the next set of file descriptors that are ready for I/O. On a busy system, where there is almost always something in need of attention, it would be more efficient if there were a way to get new events without calling into the kernel. That is where Penyaev's patch set comes in; it creates a ring buffer shared between the application and the kernel that can be used to transmit events as they happen.

epoll_create() — the third time is the charm

The first step for an application that wishes to use this mechanism is to tell the kernel that polling will be used and how big the ring buffer should be. Of course, epoll_create1() does not have a parameter that can be used for the size information, so it is necessary to add epoll_create2():

    int epoll_create2(int flags, size_t size);

There is a new flag, EPOLL_USERPOLL, that tells the kernel to use a ring buffer to communicate events; the size parameter says how many entries the ring buffer should hold. This size will be rounded up to the next power of two; the result sets an upper bound on the number of file descriptors that this epoll instance will be able to monitor. A maximum of 65,536 entries is enforced by the current patch set.

File descriptors are then added to the polling set in the usual way with epoll_ctl(). There are some restrictions that apply here, though, since some modes of operation are not compatible with user-space polling. In particular, every file descriptor must request edge-triggered behavior with the EPOLLET flag. Only one event will be added to the ring buffer when a file descriptor signals readiness; continually adding events for level-triggered behavior clearly would not work well. The EPOLLWAKEUP flag (which can be used to prevent system suspend while specific events are being processed) does not work in this mode; EPOLLEXCLUSIVE is also not supported.

Two or three separate mmap() calls are required to map the ring buffer into user space. The first one should have an offset of zero and a length of one page; it will yield a page containing this structure:

    struct epoll_uheader {
	u32 magic;          /* epoll user header magic */
	u32 header_length;  /* length of the header + items */
	u32 index_length;   /* length of the index ring, always pow2 */
	u32 max_items_nr;   /* max number of items */
	u32 head;           /* updated by userland */
	u32 tail;           /* updated by kernel */

	struct epoll_uitem items[];
    };

The header_length field, somewhat confusingly, contains the length of both the epoll_uheader structure and the items array. As seen in this example program, the intended use pattern appears to be that the application will map the header structure, get the real length, unmap the just-mapped page, then remap it using header_length to get the full items array.

One might expect that items is the ring buffer, but there is a layer of indirection used here. Getting at the actual ring buffer requires calling mmap() another time with header_length as the offset and the index_length header field as the length. The result will be an array of integer indexes into the items array that functions as the real ring buffer.

The actual items used to indicate events are represented by this structure:

    struct epoll_uitem {
	__poll_t ready_events;
	__poll_t events;
	__u64 data;
    };

Here, events appears to be the set of events that was requested when epoll_ctl() was called, and ready_events is the set of events that has actually happened. The data field comes through directly from the epoll_ctl() call that added this file descriptor.

Whenever the head and tail fields differ, there is at least one event to be consumed from the ring buffer. To consume an event, the application should read the entry from the index array at head; this read should be performed in a loop until a non-zero value is found there. The loop, evidently, is required to wait, if necessary, until the kernel's write to that entry is visible. The value read is an index into the items array — almost. It is actually the index plus one. The data should be copied from the entry and ready_events set to zero; then the head index should be incremented.

So, in a cleaned up form, code that reads from the ring buffer will look something like this:

    while (header->tail == header->head)
        ;  /* Wait for an event to appear */
    while (index[header->head] == 0)
        ;  /* Wait for event to really appear */
    item = header->items + index[header->head] - 1;
    data = item->data;
    item->ready_events = 0;  /* Mark event consumed */
    header->head++;

In practice, this code is likely to be using C atomic operations rather than direct reads and writes, and head must be incremented in a circular fashion. But hopefully the idea is clear.

Busy-waiting on an empty ring buffer is obviously not ideal. Should the application find itself with nothing to do, it can still call epoll_wait() to block until something happens. This call will only succeed, though, if the events array is passed as NULL, and maxevents is set to zero; in other words, epoll_wait() will block, but it will not, itself, return any events to the caller. It will, though, helpfully return ESTALE to indicate that there are events available in the ring buffer.

This patch set is in its third revision, and there appears to be little opposition to its inclusion at this point. The work has not yet found its way into linux-next, but it still seems plausible that it could be deemed ready for the 5.3 merge window.

Some closing grumbles

Figuring out the above interface required a substantial amount of reverse engineering of the code. This is a rather complex new API, but it is almost entirely undocumented; that will make it hard to use, but the lack of documentation also makes it hard to review the API in the first place. It is doubtful that anybody beyond the author has written any code to use this API at this point. Whether the development community will fully understand this API before committing to it is far from clear.

Perhaps the saddest thing, though, is that this will be yet another of many ring-buffer interfaces in the kernel. Others include perf events, ftrace, io_uring, AF_XDP and, doubtless, others that don't come immediately to mind. Each of these interfaces has been created from scratch and must be understood (and consumers implemented) separately by user-space developers. Wouldn't it have been nice if the kernel had defined a set of standards for ring buffers shared with user space rather than creating something new every time? One cannot blame the current patch set for this failing; that ship sailed some time ago. But it does illustrate a shortcoming in how Linux kernel APIs are designed; they seem doomed to never fit into a coherent and consistent whole.




A ring buffer for epoll

Posted May 30, 2019 17:03 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

I'm really confused. Why wouldn't we just use the AIO poll interface?

A ring buffer for epoll

Posted May 31, 2019 5:42 UTC (Fri) by smurf (subscriber, #17840) [Link]

Too much overhead. If there's a continuous stream of ready file descriptors you want no system calls. This does it.

Would be even less overhead if the kernel had a single sensible ring buffer implementation. This is not.

A ring buffer for epoll

Posted Jun 1, 2019 7:15 UTC (Sat) by pbonzini (subscriber, #60935) [Link]

The event ring buffer can be accessed from user space without invoking io_getevents.

A ring buffer for epoll

Posted May 30, 2019 17:14 UTC (Thu) by bjorntopel (subscriber, #80345) [Link]

Regarding the "Some closing grumbles" section: We (as in the AF_XDP authors) are looking into supporting the io_uring in addition to the AF_XDP rings. At least for sockets, the io_uring looks like an excellent fit. Jens Axboe has made some really good design decisions there!

Now, let's pull the virtio ring into io_uring as well... ;-)

A ring buffer for epoll

Posted May 30, 2019 23:34 UTC (Thu) by mst@redhat.com (subscriber, #60682) [Link]

> Now, let's pull the virtio ring into io_uring as well... ;-)
The old split ring layout is somewhat complex.
The new packed ring format might be a good fit for that.

No need for new syscall

Posted May 30, 2019 21:30 UTC (Thu) by scientes (guest, #83068) [Link]

You just add a flag, and with that flag there is a second syscall argument. Look at futex() and the crazy variable number of arguments. glibc then magically calls it epoll_create2, or whatever. But no need for a new syscall, just a new flag.

No need for new syscall

Posted May 30, 2019 21:36 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

I don't quite get it why people are so opposed to new syscalls.

No need for new syscall

Posted May 30, 2019 21:50 UTC (Thu) by scientes (guest, #83068) [Link]

I just checked, epoll_create1() checks for unknown flags, so there totally is no need for a new syscall.

No need for new syscall

Posted May 30, 2019 22:53 UTC (Thu) by cyphar (subscriber, #110703) [Link]

We are running out of syscall space. 5.3 will probably have 434 common syscalls on all architectures, and there are apparently cache-related performance impacts once you pass 512 (on x86 at least). This doesn't mean we should always avoid new syscalls, but rather we should be careful when we add them. If the only user-facing purpose of a new syscall is to add a struct argument then we should look at doing it that way.

No need for new syscall

Posted May 31, 2019 11:28 UTC (Fri) by epa (subscriber, #39769) [Link]

It seems there are two different issues here. One is the ABI used to call into the kernel on different architectures. That may support a fixed number of 'system call numbers' or have performance reasons to keep it down. The other is the API provided to the C library and by the C library to applications so they can call the familiar named functions like open(2) or kill(2). You could have an operating system running on i386 that used only a syscall number when calling into the kernel, but still provided the usual POSIX system call names. Is there a reason Linux can't add new "system calls" indefinitely in this way?

No need for new syscall

Posted May 31, 2019 12:47 UTC (Fri) by smurf (subscriber, #17840) [Link]

Multiplex syscalls are generally frowned upon these days. Indirection eats another register for the "real" syscall number, tracing and syscall filtering get more complicated, … Besides, yes the syscall table would be full after adding the 512th entry, but extending it to 1024 is not exactly rocket science.

Adding a generator for these tables, in order to use a central point of syscall registry instead of the current arch hodgepodge, is certainly possible. Just do it …

No need for new syscall

Posted Jun 1, 2019 15:24 UTC (Sat) by luto (subscriber, #39314) [Link]

What’s the issue on x86? As far as I know, the only real issue is running into the silly x32 aliases, but we can easily fix that.

No need for new syscall

Posted May 31, 2019 6:56 UTC (Fri) by koenkooi (subscriber, #71861) [Link]

My issue with new syscalls is that they usually get added and enabled for a single platform, x86_64, and only added to more platforms months or years after that. This happened with the original epoll and accept4. The issue manifested itself as a 180 second delay during boot due to accept4:

* sys_accept4() was added in 2.6.28
* sys_accept4() was added for ARM in 2.6.36
* (e)glibc built against 2.6.32 headers on an ARM board running 2.6.32

With help from the systemd folks I tracked it down to accept4 missing, so I applied http://lists.infradead.org/pipermail/linux-arm-kernel/201... to the 2.6.32 kernel. Still a 3 minute delay. That's when I realized I needed to build eglibc against the patched 2.6.32 headers as well as patching the kernel. Running a kernel with the new syscall hooked up is not enough!

So every time a new syscall gets proposed that is desired by the base layers in the OS I keep an eye on the ARM syscall list to avoid surprises. Marcin keeps this table up to date: https://fedora.juszkiewicz.com.pl/syscalls.html

System calls and architectures

Posted May 31, 2019 13:50 UTC (Fri) by corbet (editor, #1) [Link]

That's not really a problem with new system calls; it's about how they are implemented in the kernel. The good news is that this situation has gotten a lot better and continues to improve. A lot of the system-call boilerplate is being unified across architectures, and it's increasingly expected that new system calls will be enabled for most or all architectures from the outset.

System calls and architectures

Posted May 31, 2019 14:27 UTC (Fri) by cyphar (subscriber, #110703) [Link]

And new (>403) syscalls now use the same number on all architectures, so in principle there should be no need to rebuild libraries to get a __NR_foobar definition on a given architecture -- libraries should be able to simply do a -ENOSYS check at runtime with a non-arch-specific __NR_foobar value.

A ring buffer for epoll

Posted May 31, 2019 0:55 UTC (Fri) by cesarb (subscriber, #6266) [Link]

> takes a familiar form: add yet another ring-buffer interface to the kernel.

I wonder if future operating system designs will use ring buffers for everything instead of system calls. Want to open a file? Add an "open file" request to one ring buffer, and wait for the corresponding response in another ring buffer.

A ring buffer for epoll

Posted May 31, 2019 2:50 UTC (Fri) by sbaugh (subscriber, #103291) [Link]

>I wonder if future operating system designs will use ring buffers for everything instead of system calls

Sounds like FlexSC: https://www.usenix.org/legacy/events/osdi10/tech/full_pap...

FlexSC

Posted May 31, 2019 7:33 UTC (Fri) by smurf (subscriber, #17840) [Link]

That paper seems very interesting. Too bad it's 9 years old and no follow-up has happened.

A ring buffer for epoll

Posted May 31, 2019 22:58 UTC (Fri) by cesarb (subscriber, #6266) [Link]

Very interesting. One of the things mentioned in that paper is that using a ring buffer for system calls allows running the kernel and user space in separate cores; this might be a way to reduce the impact of Spectre/Meltdown/etc mitigations, and even strengthen them by keeping both siblings of each SMT pair either in the kernel or in user space all the time (so there would no longer be a need to either disable SMT, or do a very expensive IPI on every kernel entry/exit to protect against MDS).

A ring buffer for epoll

Posted May 31, 2019 5:48 UTC (Fri) by smurf (subscriber, #17840) [Link]

That, and a single syscall – to signal the kernel, when you write the first event to an empty buffer. (That syscall already exists, by the way: futex_wait(). You simply need to also support kernel threads.)

A ring buffer for epoll

Posted May 31, 2019 21:50 UTC (Fri) by mm7323 (subscriber, #87386) [Link]

You could call that syscall batching. Apart from the downsides of error handling, I believe there are patent issues:

https://patents.google.com/patent/US9038075B2/en

A ring buffer for epoll

Posted Jun 1, 2019 8:05 UTC (Sat) by smurf (subscriber, #17840) [Link]

That patent is owned by Red Hat, so no problem here.

A ring buffer for epoll

Posted Jun 2, 2019 5:07 UTC (Sun) by alison (subscriber, #63752) [Link]

I think you mean "owned by IBM," ahem.

A ring buffer for epoll

Posted Jun 2, 2019 9:48 UTC (Sun) by pbonzini (subscriber, #60935) [Link]

I think you mean "donated to OIN".

A ring buffer for epoll

Posted Jun 2, 2019 14:52 UTC (Sun) by alison (subscriber, #63752) [Link]

>I think you mean "donated to OIN".

There's some good news, at least. The contrast between IBM's handling of RedHat and Oracle's of Sun is striking, at least so far.

A ring buffer for epoll

Posted Jun 2, 2019 19:30 UTC (Sun) by pbonzini (subscriber, #60935) [Link]

So far Red Hat has not been acquired; it's still an independent public company, although in the process of being acquired.

A ring buffer for epoll

Posted Jun 3, 2019 9:09 UTC (Mon) by smurf (subscriber, #17840) [Link]

Assignee of record is still RedHat.

A ring buffer for epoll

Posted May 31, 2019 14:14 UTC (Fri) by jhoblitt (subscriber, #77733) [Link]

What happens if userspace doesn't keep up, the ring buffer is full, and new fd events are generated?

Filling the ring buffer

Posted May 31, 2019 14:21 UTC (Fri) by corbet (editor, #1) [Link]

That cannot happen, as it turns out. I didn't get deeply into this in the article, maybe I should have. The new epoll code gives each file descriptor a dedicated entry in the items array; when one becomes ready, an index to it is added to the index array, which is the real ring buffer. Until user space consumes the item, there is nothing more to add to the index array - the file descriptor is already there (though more POLL* bits could be set). So the ring buffer can fill but never overflow.

Filling the ring buffer

Posted May 3, 2021 14:08 UTC (Mon) by brho (subscriber, #50662) [Link]

this sounds a lot like what i did in a similar "ring buffer for something like epoll" in another OS:

https://github.com/brho/akaros/blob/master/kern/include/r....

the 'CEQ' was designed so that i could do epoll in userspace on a non-linux research OS.

A ring buffer for epoll

Posted Jun 1, 2019 20:43 UTC (Sat) by andresfreund (subscriber, #69562) [Link]

> Figuring out the above interface required a substantial amount of reverse engineering of the code. This is a rather complex new API, but it is almost entirely undocumented; that will make it hard to use, but the lack of documentation also makes it hard to review the API in the first place. It is doubtful that anybody beyond the author has written any code to use this API at this point. Whether the development community will fully understand this API before committing to it is far from clear.

I don't understand how this is the case for just about every new Linux API. Coming from another community (postgres), I really don't understand why it's not a hard requirement to provide some minimal set of API docs. It doesn't have to be fully man-page formatted, nicely phrased, native-speaker level English. But it should provide at least enough information to be able to write that manpage (and a good bit of this article) without having to do a lot of original research. This isn't even hard to enforce?

A ring buffer for epoll

Posted Jun 2, 2019 16:05 UTC (Sun) by daney (guest, #24551) [Link]

Your code for reading items out of the ring buffer is undoubtedly missing memory barrier operations.

Within the kernel you have code review of memory access ordering issues and might have a chance at getting them implemented correctly. How do you ensure the same on the consuming side in userspace?

Somebody should probably provide a wrapper library for all this that tries to do the right thing.

System calls make nice barriers, if you eliminate the system call, you move the responsibility for correctly implementing the barriers to the authors of any userspace code.

A ring buffer for epoll

Posted Jun 3, 2019 0:36 UTC (Mon) by edeloget (subscriber, #88392) [Link]

I agree that handling memory barriers in user code is not very simple, but avoiding a system call is still a big win for the rare number of people who are in need for this interface (and, frankly, if you are in need for this API, you'd better write the code correctly :))

A ring buffer for epoll

Posted Jun 3, 2019 16:14 UTC (Mon) by daney (guest, #24551) [Link]

Because you cannot change (break) userspace after a new facility is added to the kernel, you have to make sure the userspace interfaces are fully specified, and correct *before* they are added.

Because use of this new epoll interface requires correct, race-free, access to multiple memory locations (head, tail, items[i].ready_events, etc.) by both the kernel and userspace, instead of a simple system call, specification of ordering is important. When running on x86, naive implementations may work by accident, where identical code may fail on more weakly ordered architectures. It would be nice to see something that also works on arm64, ppc, et al. from the start.

Also, since the kernel is consuming data from multiple memory locations that are under control of userspace, it would seem great care must be taken to ensure that no security vulnerabilities are introduced. The more common paradigm of: Copy in user data at system call time, validate, compute result, return to user, no longer holds.

A ring buffer for epoll

Posted Jun 9, 2019 13:04 UTC (Sun) by ianmcc (subscriber, #88379) [Link]

item = header->items + index[header->tail] - 1;

This looks suspicious. Should it be

item = header->items + index[header->head];

?

A ring buffer for epoll

Posted Jun 9, 2019 15:00 UTC (Sun) by corbet (editor, #1) [Link]

Sigh, it should have been using the head index, yes; that has been fixed. The "-1" is correct, though: remember that the index value is one higher than the actual distance into the array.

A ring buffer for epoll

Posted Jun 9, 2019 15:21 UTC (Sun) by excors (subscriber, #95769) [Link]

It looks like it stores index plus one so that 0 can have a special meaning, but I don't understand why 0 needs a special meaning. Why wouldn't the kernel just write the correct index before incrementing tail (with a write barrier in between), so that userspace only needs to wait for tail and not wait again for index? I think the kernel has to be doing a write barrier anyway (after writing to the epoll_uitem, before writing to tail/index) so I don't see how it would be needed for performance.


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds