A ring buffer for epoll
The poll() and select() system calls can be used to wait until at least one of a set of file descriptors is ready for I/O. Each call, though, requires the kernel to set up an internal data structure so that it can be notified when any given descriptor changes state. Epoll gets around this by separating the setup and waiting phases, and keeping the internal data structure around for as long as it is needed.
An application starts by calling epoll_create1() to create a file descriptor to use with the subsequent steps. That call, incidentally, supersedes epoll_create(); it replaces an unused argument with a flags parameter. Then epoll_ctl() is used to add individual file descriptors to the set monitored by epoll. Finally, a call to epoll_wait() will block until at least one of the file descriptors of interest has something to report. This interface is a bit more work to use than poll(), but it makes a big difference for applications that are monitoring huge numbers of file descriptors.
That said, it would seem that there is still room for doing things better. Even though epoll is more efficient than its predecessors, an application still has to make a system call to get the next set of file descriptors that are ready for I/O. On a busy system, where there is almost always something that is needing attention, it would be more efficient if there were a way to get new events without calling into the kernel. That is where Penyaev's patch set comes in; it creates a ring buffer shared between the application and the kernel that can be used to transmit events as they happen.
epoll_create() — the third time is the charm
The first step for an application that wishes to use this mechanism is to tell the kernel that polling will be used and how big the ring buffer should be. Of course, epoll_create1() does not have a parameter that can be used for the size information, so it is necessary to add epoll_create2():
int epoll_create2(int flags, size_t size);
There is a new flag, EPOLL_USERPOLL, that tells the kernel to use a ring buffer to communicate events; the size parameter says how many entries the ring buffer should hold. This size will be rounded up to the next power of two; the result sets an upper bound on the number of file descriptors that this epoll instance will be able to monitor. A maximum of 65,536 entries is enforced by the current patch set.
File descriptors are then added to the polling set in the usual way with epoll_ctl(). There are some restrictions that apply here, though, since some modes of operation are not compatible with user-space polling. In particular, every file descriptor must request edge-triggered behavior with the EPOLLET flag. Only one event will be added to the ring buffer when a file descriptor signals readiness; continually adding events for level-triggered behavior clearly would not work well. The EPOLLWAKEUP flag (which can be used to prevent system suspend while specific events are being processed) does not work in this mode; EPOLLEXCLUSIVE is also not supported.
Two or three separate mmap() calls are required to map the ring buffer into user space. The first one should have an offset of zero and a length of one page; it will yield a page containing this structure:
struct epoll_uheader { u32 magic; /* epoll user header magic */ u32 header_length; /* length of the header + items */ u32 index_length; /* length of the index ring, always pow2 */ u32 max_items_nr; /* max number of items */ u32 head; /* updated by userland */ u32 tail; /* updated by kernel */ struct epoll_uitem items[]; };
The header_length field, somewhat confusingly, contains the length of both the epoll_uheader structure and the items array. As seen in this example program, the intended use pattern appears to be that the application will map the header structure, get the real length, unmap the just-mapped page, then remap it using header_length to get the full items array.
One might expect that items is the ring buffer, but there is a layer of indirection used here. Getting at the actual ring buffer requires calling mmap() another time with header_length as the offset and the index_length header field as the length. The result will be an array of integer indexes into the items array that functions as the real ring buffer.
The actual items used to indicate events are represented by this structure:
struct epoll_uitem { __poll_t ready_events; __poll_t events; __u64 data; };
Here, events appears to be the set of events that was requested when epoll_ctl() was called, and ready_events is the set of events that has actually happened. The data field comes through directly from the epoll_ctl() call that added this file descriptor.
Whenever the head and tail fields differ, there is at least one event to be consumed from the ring buffer. To consume an event, the application should read the entry from the index array at head; this read should be performed in a loop until a non-zero value is found there. The loop, evidently, is required to wait, if necessary, until the kernel's write to that entry is visible. The value read is an index into the items array — almost. It is actually the index plus one. The data should be copied from the entry and ready_events set to zero; then the head index should be incremented.
So, in a cleaned up form, code that reads from the ring buffer will look something like this:
while (header->tail == header->head) ; /* Wait for an event to appear */ while (index[header->head] == 0) ; /* Wait for event to really appear */ item = header->items + index[header->head] - 1; data = item->data; item->ready_events = 0; /* Mark event consumed */ header->head++;
In practice, this code is likely to be using C atomic operations rather than direct reads and writes, and head must be incremented in a circular fashion. But hopefully the idea is clear.
Busy-waiting on an empty ring buffer is obviously not ideal. Should the application find itself with nothing to do, it can still call epoll_wait() to block until something happens. This call will only succeed, though, if the events array is passed as NULL, and maxevents is set to zero; in other words, epoll_wait() will block, but it will not, itself, return any events to the caller. It will, though, helpfully return ESTALE to indicate that there are events available in the ring buffer.
This patch set is in its third revision, and there appears to be little opposition to its inclusion at this point. The work has not yet found its way into linux-next, but it still seems plausible that it could be deemed ready for the 5.3 merge window.
Some closing grumbles
Figuring out the above interface required a substantial amount of reverse engineering of the code. This is a rather complex new API, but it is almost entirely undocumented; that will make it hard to use, but the lack of documentation also makes it hard to review the API in the first place. It is doubtful that anybody beyond the author has written any code to use this API at this point. Whether the development community will fully understand this API before committing to it is far from clear.
Perhaps the saddest thing, though, is that this will be yet another of many
ring-buffer interfaces in the kernel. Others include perf events, ftrace,
io_uring, AF_XDP and, doubtless, others that don't come
immediately to mind. Each of these interfaces has been created from
scratch and must be understood (and consumers implemented) separately by user-space
developers. Wouldn't it have been nice if the kernel had defined a set of
standards for ring buffers shared with user space rather than creating
something new every time? One cannot blame the current patch set for this
failing; that ship sailed some time ago. But it does illustrate a
shortcoming in how Linux kernel APIs are designed; they seem doomed to
never fit into a coherent and consistent whole.
Index entries for this article | |
---|---|
Kernel | Epoll |
(Log in to post comments)
A ring buffer for epoll
Posted May 30, 2019 17:03 UTC (Thu) by quotemstr (subscriber, #45331) [Link]
A ring buffer for epoll
Posted May 31, 2019 5:42 UTC (Fri) by smurf (subscriber, #17840) [Link]
Would be even less overhead if the kernel had a single sensible ring buffer implementation. This is not.
A ring buffer for epoll
Posted Jun 1, 2019 7:15 UTC (Sat) by pbonzini (subscriber, #60935) [Link]
A ring buffer for epoll
Posted May 30, 2019 17:14 UTC (Thu) by bjorntopel (subscriber, #80345) [Link]
Now, let's pull the virtio ring into io_uring as well... ;-)
A ring buffer for epoll
Posted May 30, 2019 23:34 UTC (Thu) by mst@redhat.com (subscriber, #60682) [Link]
The old split ring layout is somewhat complex.
The new packed ring format might be a good fit for that.
Not need for new syscall
Posted May 30, 2019 21:30 UTC (Thu) by scientes (guest, #83068) [Link]
Not need for new syscall
Posted May 30, 2019 21:36 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]
Not need for new syscall
Posted May 30, 2019 21:50 UTC (Thu) by scientes (guest, #83068) [Link]
Not need for new syscall
Posted May 30, 2019 22:53 UTC (Thu) by cyphar (subscriber, #110703) [Link]
Not need for new syscall
Posted May 31, 2019 11:28 UTC (Fri) by epa (subscriber, #39769) [Link]
Not need for new syscall
Posted May 31, 2019 12:47 UTC (Fri) by smurf (subscriber, #17840) [Link]
Adding a generator for these tables, in order to use a central point of syscall registry instead of the current arch hodgepodge, is certainly possible. Just do it …
Not need for new syscall
Posted Jun 1, 2019 15:24 UTC (Sat) by luto (subscriber, #39314) [Link]
Not need for new syscall
Posted May 31, 2019 6:56 UTC (Fri) by koenkooi (subscriber, #71861) [Link]
* sys_accept4() was added in 2.6.28
* sys_accept4() was added for ARM in 2.6.36
* (e)glibc built against 2.6.32 headers on and ARM board running 2.6.32
With help from the systemd folks I tracked it down to accept4 missing, so I applied http://lists.infradead.org/pipermail/linux-arm-kernel/201... to the 2.6.32 kernel. Still a 3 minute delay. That's when I realized I needed to build eglibc against the patched 2.6.32 headers as well as patching the kernel. Running a kernel with the new syscall hooked up is not enough!
So everytime a new syscall gets proposed that is desired by the base layers in the OS I keep an eye on the ARM syscall list to avoid surprises. Marcin keeps this table up to date: https://fedora.juszkiewicz.com.pl/syscalls.html
System calls and architectures
Posted May 31, 2019 13:50 UTC (Fri) by corbet (editor, #1) [Link]
That's not really a problem with new system calls; it's about how they are implemented in the kernel. The good news is that this situation has gotten a lot better and continues to improve. A lot of the system-call boilerplate is being unified across architectures, and it's increasingly expected that new system calls will be enabled for most or all architectures from the outset.
System calls and architectures
Posted May 31, 2019 14:27 UTC (Fri) by cyphar (subscriber, #110703) [Link]
A ring buffer for epoll
Posted May 31, 2019 0:55 UTC (Fri) by cesarb (subscriber, #6266) [Link]
I wonder if future operating system designs will use ring buffers for everything instead of system calls. Want to open a file? Add an "open file" request to one ring buffer, and wait for the corresponding response in another ring buffer.
A ring buffer for epoll
Posted May 31, 2019 2:50 UTC (Fri) by sbaugh (subscriber, #103291) [Link]
Sounds like FlexSC: https://www.usenix.org/legacy/events/osdi10/tech/full_pap...
FlexSC
Posted May 31, 2019 7:33 UTC (Fri) by smurf (subscriber, #17840) [Link]
A ring buffer for epoll
Posted May 31, 2019 22:58 UTC (Fri) by cesarb (subscriber, #6266) [Link]
A ring buffer for epoll
Posted May 31, 2019 5:48 UTC (Fri) by smurf (subscriber, #17840) [Link]
A ring buffer for epoll
Posted May 31, 2019 21:50 UTC (Fri) by mm7323 (subscriber, #87386) [Link]
A ring buffer for epoll
Posted Jun 1, 2019 8:05 UTC (Sat) by smurf (subscriber, #17840) [Link]
A ring buffer for epoll
Posted Jun 2, 2019 5:07 UTC (Sun) by alison (subscriber, #63752) [Link]
A ring buffer for epoll
Posted Jun 2, 2019 9:48 UTC (Sun) by pbonzini (subscriber, #60935) [Link]
I think you mean "donated to OIN".
A ring buffer for epoll
Posted Jun 2, 2019 14:52 UTC (Sun) by alison (subscriber, #63752) [Link]
There's some good news, at least. The contrast between IBM's handling of RedHat and Oracle's of Sun is striking, at least so far.
A ring buffer for epoll
Posted Jun 2, 2019 19:30 UTC (Sun) by pbonzini (subscriber, #60935) [Link]
A ring buffer for epoll
Posted Jun 3, 2019 9:09 UTC (Mon) by smurf (subscriber, #17840) [Link]
A ring buffer for epoll
Posted May 31, 2019 14:14 UTC (Fri) by jhoblitt (subscriber, #77733) [Link]
Filling the ring buffer
Posted May 31, 2019 14:21 UTC (Fri) by corbet (editor, #1) [Link]
That cannot happen, as it turns out. I didn't get deeply into this in the article, maybe I should have. The new epoll code gives each file descriptor a dedicated entry in the items array; when one becomes ready, an index to it is added to the index array, which is the real ring buffer. Until user space consumes the item, there is nothing more to add to the index array - the file descriptor is already there (though more POLL* bits could be set). So the ring buffer can fill but never overflow.
Filling the ring buffer
Posted May 3, 2021 14:08 UTC (Mon) by brho (subscriber, #50662) [Link]
https://github.com/brho/akaros/blob/master/kern/include/r....
the 'CEQ' was designed so that i could do epoll in userspace on a non-linux research OS.
A ring buffer for epoll
Posted Jun 1, 2019 20:43 UTC (Sat) by andresfreund (subscriber, #69562) [Link]
I don't understand how this is the case for just about every new linux API. Coming from another community (postgres), I really don't understand why it's not a hard requirement to provide some minimal set of API docs? It doesn't have to be in fully man-page formatted, nicely phrased, native-speaker level English. But it should provide at least enough information to be able to write that manpage (and a good bit of this article) without having to do a lot of original research. This isn't even hard to enforce?
A ring buffer for epoll
Posted Jun 2, 2019 16:05 UTC (Sun) by daney (guest, #24551) [Link]
Within the kernel you have code review of memory access ordering issues and might have a chance at getting them implemented correctly. How do you ensure the same on the consuming side in userspace?
Somebody should probably provide a wrapper library for all this that tries to do the right thing.
System calls make nice barriers, if you eliminate the system call, you move the responsibility for correctly implementing the barriers to the authors of any userspace code.
A ring buffer for epoll
Posted Jun 3, 2019 0:36 UTC (Mon) by edeloget (subscriber, #88392) [Link]
A ring buffer for epoll
Posted Jun 3, 2019 16:14 UTC (Mon) by daney (guest, #24551) [Link]
Because use of this new epool interface requires correct, race-free, access to multiple memory locations (head, tail, items[i].ready_events, etc.) by both the kernel and userspace, instead of a simple system call, specification of ordering is important. When running on x86, naive implementations may work by accident, where identical code may fail on more weakly ordered architectures. It would be nice to see something that also works on arm64, ppc, et al. from the start.
Also, since the kernel is consuming data from multiple memory locations that are under control of userspace, it would seem great care must be take to ensure that no security vulnerabilities are introduced. The more common paradigm of: Copy in user data at system call time, validate, compute result, return to user, no longer holds.
A ring buffer for epoll
Posted Jun 9, 2019 13:04 UTC (Sun) by ianmcc (subscriber, #88379) [Link]
This looks suspicious. Should it be
item = header->items + index[header->head];
?
A ring buffer for epoll
Posted Jun 9, 2019 15:00 UTC (Sun) by corbet (editor, #1) [Link]
Sigh, it should have been using the head index, yes; that has been fixed. The "-1" is correct, though: remember that the index value is one higher than the actual distance into the array.
A ring buffer for epoll
Posted Jun 9, 2019 15:21 UTC (Sun) by excors (subscriber, #95769) [Link]