LCA: A new approach to asynchronous I/O
Asynchronous I/O has been a problematic issue for the Linux kernel for many years. The current implementation is difficult to use, incomplete in its coverage, and hard to support within the kernel. More recently, there has been an attempt to resolve the problem with the syslet concept, wherein kernel threads would be used to make almost any system call potentially asynchronous. Syslets have their own problems, though, not the least of which being that their use can cause a user-space process to change its process ID over time. Work on this area has slowed, with few updates being seen since mid-2007.
Zach Brown is still working on the asynchronous I/O problem, though; he
used his linux.conf.au talk to discuss his current approach. The new
"acall" interface has the potential to resolve many of the problems which
have been seen in this area, but it is early-stage work which is likely to
evolve somewhat before it is seriously considered for mainline inclusion.
One of the big challenges with asynchronous kernel operations is that the kernel's idea of how to access task state is limited. For the most part, system calls expect the "current" variable to point to the relevant task structure. That proves to be a problem when things are running asynchronously, and, potentially, no longer have direct access to the originating process's state. The current AIO interface resolves this problem by splitting things into two phases: submission and execution. The submission phase has access to current and is able to block, but the execution phase is detached from all that. The end result is that AIO support requires a duplicate set of system call handlers and a separate I/O path. That, says Zach, is "why our AIO support still sucks after ten years of work."
The fibril or syslet idea replaces that approach with one which is conceptually different: system call handlers remain synchronous, and kernel threads are used to add asynchronous operation on top. This work has taken the form of some tricky scheduler hacks; if an operation which is meant to be asynchronous blocks, the scheduler quickly shifts over to another thread and returns to user space in that thread. That allows the preservation of the state built up to the blocking point and it avoids the cost of bringing in a new thread if the operation never has to block. But these benefits come at the cost of changing the calling process's ID - a change which is sure to cause confusion.
When Zach inherited this work, he decided to take a fresh look at it with the explicit short-term goal of making it easy to implement the POSIX AIO specification. Other features, such as syslets (which allow a process to load a simple program into the kernel for asynchronous execution) can come later if it seems like a good idea. The end result is the "acall" API; this code has not yet been posted to the lists for review, but it is available from Zach's web site.
With this interface, a user-space process specifies an asynchronous operation with a structure like this:
    struct acall_submission {
        u32 nr;
        u32 flags;
        u64 cookie;
        u64 completion_ring_pointer;
        u64 completion_pointer;
        u64 id_pointer;
        u64 args[6];
    };
In this structure, nr identifies which system call is to be invoked asynchronously, while args is the list of arguments to pass to that system call. The cookie field is a value used by the calling program to identify the operation; it should be non-zero if it is to be used. The flags and various _pointer fields will be described shortly.
To submit one or more asynchronous requests, the application will call:
    long acall_submit(struct acall_submission **submissions, unsigned long nr);
submissions is a list of pointers to requests, and nr is the length of that list. The return value will be the number of operations actually submitted. If something goes wrong in the submission process, the current implementation will return a value less than nr, but the error code saying exactly what went wrong will be lost if any operations were submitted successfully.
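As a concrete (and necessarily hypothetical) illustration, submitting a single asynchronous read() might look something like the sketch below. Since this code has not been merged, there is no glibc wrapper and no official syscall number; __NR_acall_submit is a placeholder, the structure uses uint32_t/uint64_t in place of the kernel's u32/u64, and the assumption that nr takes an ordinary syscall number is just that - an assumption.

    /* Sketch only: acall is unmerged, so the syscall number below is a
     * placeholder and the structure layout is copied from the article. */
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    struct acall_submission {
        uint32_t nr;
        uint32_t flags;
        uint64_t cookie;
        uint64_t completion_ring_pointer;
        uint64_t completion_pointer;
        uint64_t id_pointer;
        uint64_t args[6];
    };

    #define __NR_acall_submit 333   /* hypothetical, not a real number */

    static long acall_submit(struct acall_submission **submissions,
                             unsigned long nr)
    {
        return syscall(__NR_acall_submit, submissions, nr);
    }

    /* Submit one asynchronous read(); buf must stay valid until it completes. */
    int submit_async_read(int fd, void *buf, size_t len)
    {
        struct acall_submission sub;
        struct acall_submission *subs[1] = { &sub };

        memset(&sub, 0, sizeof(sub));
        sub.nr      = SYS_read;    /* assuming ordinary syscall numbers are used */
        sub.cookie  = 1;           /* non-zero: identifies this operation */
        sub.flags   = 0;           /* 0 = dedicated thread; see the flags discussion below */
        sub.args[0] = fd;
        sub.args[1] = (uint64_t)(uintptr_t)buf;
        sub.args[2] = len;

        return acall_submit(subs, 1) == 1 ? 0 : -1;
    }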
By default, acall_submit() will create a new kernel thread for each submitted operation. If the flags field for any request contains ACALL_SUBMIT_THREAD_POOL, that request will, instead, be submitted to a pool of waiting threads. Those threads are specific to the calling process, and they will only sit idle for 200ms before exiting. So submission to the thread pool may make sense if the application is submitting a steady stream of asynchronous operations; otherwise the kernel will still end up creating individual threads for each operation. Threads in the pool do not update their task state before each request, so they might be behind the current state of the calling process.
If the id_pointer field is non-NULL, acall_submit() will treat it as a pointer to an acall_id structure:
    struct acall_id {
        unsigned char opaque[16];
    };
This is a special value used by the application to identify this operation to the kernel. Internally it looks like this:
    struct acall_kernel_id {
        u64 cpu;
        u64 counter;
    };
It is, essentially, a key used to look up the operation in a red/black tree.
The completion_pointer field, if non-NULL, points instead to a structure like:
    struct acall_completion {
        u64 return_code;
        u64 cookie;
    };
The final status of the operation can be found in return_code, while cookie is the caller-supplied cookie value. Once that cookie has a non-zero value, the return code will be valid.
The application can wait for the completion of specific operations with a call to:
    long acall_comp_pwait(struct acall_id **uids, unsigned long nr,
                          struct timespec *utime, const sigset_t *sigmask,
                          size_t sigsetsize);
The uids array contains pointers to acall_id structures identifying the operations of interest; nr is the length of that array. If utime is not NULL, it points to a timespec structure specifying how long acall_comp_pwait() should wait before giving up. A set of signals to be masked during the operation can be given with sigmask and sigsetsize. A return value of one indicates that at least one operation actually completed.
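Continuing the sketch above, an application that set both id_pointer and completion_pointer at submission time could wait for that specific operation and pick up its result roughly as follows. Again, __NR_acall_comp_pwait is a placeholder number, not anything official.

    /* Sketch only: wait up to five seconds for one specific operation. */
    #include <signal.h>
    #include <stdint.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    struct acall_id         { unsigned char opaque[16]; };
    struct acall_completion { uint64_t return_code; uint64_t cookie; };

    #define __NR_acall_comp_pwait 334   /* hypothetical, not a real number */

    static long acall_comp_pwait(struct acall_id **uids, unsigned long nr,
                                 struct timespec *utime, const sigset_t *sigmask,
                                 size_t sigsetsize)
    {
        return syscall(__NR_acall_comp_pwait, uids, nr, utime,
                       sigmask, sigsetsize);
    }

    /* Returns the operation's return code, or -1 on timeout. */
    long wait_for_op(struct acall_id *id, struct acall_completion *comp)
    {
        struct acall_id *uids[1] = { id };
        struct timespec timeout = { .tv_sec = 5, .tv_nsec = 0 };

        if (acall_comp_pwait(uids, 1, &timeout, NULL, 0) < 1)
            return -1;                  /* nothing completed in time */
        if (comp->cookie == 0)          /* cookie set => return_code valid */
            return -1;
        return (long)comp->return_code;
    }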
An application submitting vast numbers of asynchronous operations may want to avoid making another system call to get the status of completed operations. Such applications can set up one or more completion rings, into which the status of completed operations will be written. A completion ring looks like:
    struct acall_completion_ring {
        uint32_t head;
        uint32_t nr;
        struct acall_completion comps[0];
    };
Initially, head should be zero, and nr should be the real length of the comps array. When the kernel is ready to store the results of an operation, it will first increment head, then put the results into comps[head % nr]. So a specific entry in the ring is only valid once the cookie field becomes non-zero. The kernel makes no attempt to avoid overwriting completion entries which have not yet been consumed by the application; it is assumed that the application will not submit more operations than will fit into a ring.
The actual ring to use is indicated by the completion_ring_pointer value in the initial submission. Among other things, that means that different operations can go into different rings, or that the application can switch to a differently-sized ring at any time. In theory, it also means that multiple processes could use the same ring, though waiting for completion will not work properly in that case.
If the application needs to wait until the ring contains at least one valid entry, it can call:
    long acall_ring_pwait(struct acall_completion_ring *ring, u32 tail, u32 min,
                          struct timespec *utime, const sigset_t *sigmask,
                          size_t sigsetsize);
This call will wait until the given ring contains at least min events since the one written at index tail. The utime, sigmask, and sigsetsize arguments have the same meaning as with acall_comp_pwait().
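For the ring-based path, a consumer might track its own tail counter and drain new entries after acall_ring_pwait() returns, along these lines. The ring itself would have been passed via completion_ring_pointer at submission time, and the syscall number is, as before, a placeholder.

    /* Sketch only: wait for and drain completion-ring entries.  The kernel
     * increments head and then writes comps[head % nr], so a slot is only
     * trustworthy once its cookie is non-zero. */
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    struct acall_completion      { uint64_t return_code; uint64_t cookie; };
    struct acall_completion_ring {
        uint32_t head;
        uint32_t nr;
        struct acall_completion comps[];
    };

    #define __NR_acall_ring_pwait 335   /* hypothetical, not a real number */

    static long acall_ring_pwait(struct acall_completion_ring *ring, uint32_t tail,
                                 uint32_t min, struct timespec *utime,
                                 const sigset_t *sigmask, size_t sigsetsize)
    {
        return syscall(__NR_acall_ring_pwait, ring, tail, min,
                       utime, sigmask, sigsetsize);
    }

    /* Consume everything between our tail and the kernel's head; returns
     * the updated tail to pass to the next call. */
    uint32_t drain_ring(struct acall_completion_ring *ring, uint32_t tail)
    {
        if (acall_ring_pwait(ring, tail, 1, NULL, NULL, 0) < 0)
            return tail;

        while (tail != ring->head) {
            uint32_t next = tail + 1;
            struct acall_completion *c = &ring->comps[next % ring->nr];

            if (c->cookie == 0)         /* kernel hasn't filled this slot yet */
                break;
            printf("op %llu finished: %lld\n",
                   (unsigned long long)c->cookie, (long long)c->return_code);
            c->cookie = 0;              /* mark the slot as consumed */
            tail = next;
        }
        return tail;
    }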
Finally, an outstanding operation can be canceled with:
    long acall_cancel(struct acall_id *uid);
Cancellation works by sending a KILL signal to the thread executing the operation. Depending on what was being done, that could result in partial execution of the request.
This API is probably subject to change in a number of ways. There is, for example, no limit to the size of the thread pool other than the general limit on the number of processes. Every request is assigned to a thread immediately, with threads created as needed; there is no way to queue a request until a thread becomes available in the future. The ability to load programs into the kernel for asynchronous execution ("syslets") could be added as well, though Zach gave the impression that he sees syslets as a relatively low-priority feature.
Beyond the new API, this asynchronous operation implementation differs from its predecessors in a couple of ways. Requests will always be handed off to threads for execution; there is no concept of executing synchronously until something blocks. That may increase the overhead in cases where the request could have been satisfied without blocking, though the use of the thread pool should minimize that cost. But the big benefit is that the calling process no longer changes its ID when things do block. That results in a more straightforward user-space API with minimal surprises - certainly a good thing to do.
Linus was at the presentation, and seemed to think that the proposed API
was not completely unreasonable. So it may well be that, before too long,
we'll see a version of the acall API proposed for the mainline. And that
could lead to a proper solution to the asynchronous I/O problem at last.
Index entries for this article:
  Kernel: Asynchronous I/O
  Kernel: Syslets
  Conference: linux.conf.au/2009
LCA: A new approach to asynchronous I/O
Posted Jan 29, 2009 3:06 UTC (Thu) by kjp (guest, #39639) [Link]
I'd rather manage my own thread pools in user space than figure out this crazy API, which essentially does the same thing.
LCA: A new approach to asynchronous I/O
Posted Feb 10, 2009 21:03 UTC (Tue) by ddaa (guest, #5338) [Link]
Seconded.
According to the mailinator blog, blocking I/O wrapped in threads using NPTL smokes non-blocking I/O in terms of performance. It is also fairly straightforward to implement at the application level.
There are certainly reasons to implement the POSIX standard, mostly political I guess. But for pragmatic application development on operating systems that have good threading, non-blocking I/O seems like more trouble to use than it's worth.
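For reference, the thread-wrapping approach the parent comment describes is indeed simple to sketch at the application level; something like the following is the basic idea. The names are purely illustrative (not from any library), and it spawns one detached worker thread per request rather than maintaining a pool.

    /* Illustrative sketch of "blocking I/O wrapped in threads": each request
     * gets a detached worker thread that performs a plain blocking read() and
     * reports the result through a caller-supplied callback. */
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct read_req {
        int fd;
        void *buf;
        size_t len;
        void (*done)(struct read_req *req, ssize_t result);  /* runs in the worker */
    };

    static void *read_worker(void *arg)
    {
        struct read_req *req = arg;
        ssize_t n = read(req->fd, req->buf, req->len);   /* blocks in this thread only */

        req->done(req, n);
        free(req);
        return NULL;
    }

    /* Submit an "asynchronous" read by pushing the blocking call into a thread. */
    int async_read(int fd, void *buf, size_t len,
                   void (*done)(struct read_req *, ssize_t))
    {
        struct read_req *req = malloc(sizeof(*req));
        pthread_t tid;

        if (req == NULL)
            return -1;
        *req = (struct read_req){ .fd = fd, .buf = buf, .len = len, .done = done };
        if (pthread_create(&tid, NULL, read_worker, req) != 0) {
            free(req);
            return -1;
        }
        return pthread_detach(tid);
    }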
Just an in-kernel thread pool
Posted Jan 29, 2009 15:41 UTC (Thu) by aliguori (subscriber, #30636) [Link]
Yes, one can argue that in the future, we could do smarter things about thread scheduling in the kernel but one could also argue that we could equally well expose those knobs down to userspace.
LCA: A new approach to asynchronous I/O
Posted Jan 29, 2009 19:48 UTC (Thu) by jd (subscriber, #26381) [Link]
I remember the arguments against M:N - added complexity, it's not as optimal, etc - but should I actually not be so totally out in left-field on this, syslets + M:N might be more efficient than other AIO solutions.
(Of course, if it were that simple, it would have already been done. The core kernel crew catch chaotic coding quirks like that. It follows I'm likely not just out in left field but about to step off the edge of the world. I'll let you know if there are any elephants holding it up.)
VMS All Over Again
Posted Feb 5, 2009 2:15 UTC (Thu) by ldo (guest, #40946) [Link]
Funny how every time I hear about a proposal for adding asynchronous I/O to Linux, I think about VMS. In that system, all time-consuming operations (I/O, getting information from another process etc) were asynchronous; the supposedly synchronous syscalls were just user-space glue that did nothing but make the asynchronous call, and then wait for it to complete before returning. So as far as the kernel was concerned, there was no synchronous/asynchronous distinction.
Seems to me we keep taking baby steps in that direction while trying to avoid admitting that it's the right idea.
VMS All Over Again
Posted Feb 5, 2009 14:50 UTC (Thu) by donbarry (guest, #10485) [Link]
*system calls* made asynchronously. The CPUs had no I/O capabilities, and one would write a system request into the first word of memory of the process. Peripheral processors would periodically scan process memory beginnings, see the request, and set a bit to acknowledge reception. I/O would begin, and later the word would be written to again to acknowledge completion, with a toggle bit as semaphore between PP and CPU.

Meanwhile, your faultlessly brilliant code would hopefully continue doing nuclear bomb calculations, cryptographic decrypts, or payroll, waiting for the washing-machine disks to do their deeds and the PPs to propagate their transfers back into central memory.
VMS All Over Again
Posted Aug 14, 2015 5:10 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]
Drive letters were a legacy of DOS; the original VAX allowed addressing filesystems on named drives (identified by actual names, not letters).
VMS All Over Again
Posted Aug 14, 2015 14:29 UTC (Fri) by trent (guest, #101635) [Link]
I'm not sure if you're also saying Windows doesn't support home drives anywhere other than C:\. That's nonsense; you can customize everything. Enterprise Windows deployments do this all the time.
VMS All Over Again
Posted Aug 14, 2015 15:12 UTC (Fri) by madscientist (subscriber, #16861) [Link]
Yes, a number of programs want to install into the user's home directory and aren't easy to install anywhere else. Of course after you've installed each one you can go back and try to move them by hand, and hope it works. What a hassle. We didn't want to bother: we just wanted his home directory on a large drive and everything would install there normally, and his sister's home directory with her schoolwork etc. would be on the smaller drive but they wouldn't have to fight over disk space. This should not require an MCSE certificate to accomplish.

Oh yes, I'm sure it's possible to put all home directories on other drives, if you're an enterprise Windows user. Here's a lovely article from microsoft themselves explaining how to do it... only during a clean install of Windows and even then it's not something any normal person would want to undertake. Or here's a LifeHacker article admitting it's "quite an undertaking" and showing an "utterly flawless" method that involves booting into system recovery mode. It's far beyond anything any normal person should have to do, and when we tried to follow one of these guides the resulting account was always dodgy and had weird quirks: files showed up in Explorer but couldn't be opened, or showed up twice, or whatever. When we went back to a normal C: home directory all those weird issues stopped.
VMS All Over Again
Posted Aug 15, 2015 2:48 UTC (Sat) by dlang (guest, #313) [Link]
On Android we see a bunch of apps that refuse to install on the SD card; same sort of mess.
VMS All Over Again
Posted Aug 14, 2015 14:18 UTC (Fri) by trent (guest, #101635) [Link]
The tight integration between the cache manager, I/O manager and file system driver actually makes NTFS incredibly performant when used correctly. What works well on Linux/UNIX rarely performs well on Windows; whereas what performs best on Windows isn't even possible on UNIX.
VMS All Over Again
Posted Aug 14, 2015 15:48 UTC (Fri) by madscientist (subscriber, #16861) [Link]
Well, if you can't use a filesystem in a normal way but instead have to create special and non-obvious patterns to use it "correctly", I'd say that's a problem with the filesystem. But maybe that's just me. I have absolutely no interest in writing Windows-only software, so saying that "what performs best on Windows isn't even possible on UNIX" is not a selling point for Windows! If it's not possible to do portably, why would I ever do it at all? 90+% of the users of software I work on deploy it on UNIX: only a small minority use it on Windows.

Some software I work on uses standard C/C++ I/O. It stores 20+G of data in a directory hierarchy of files averaging about 50K each. Every single aspect of this is worse on NTFS than Linux filesystems: it's slower to unpack zip files, to copy using Windows operations, to create, read, and update the files themselves. The fact that Windows locks files while they're open means that to do atomic replacements of files you have to loop and retry in case the rename step fails, you can't replace running programs then restart them, etc. Also every aspect of our build is slower: it's slower to clone Git repos and check out. Compilation is slower. Deployment is slower.
My wife and son are collaborating on writing a game using Unreal on Windows and storing it in Git. They were both cloning from a directory on my wife's system using shared partitions (not a protocol like ssh, etc.) That repo was constantly getting corrupted somehow and I would have to go in and fix it by making a bare copy of one of their good repos and replacing the shared one with that. So we moved the repo to my Linux system exported with Samba, so they're both still simply using a direct shared partition path to clone from, and now it works perfectly every time.
For other missing features, NTFS has multiple ways to do things "sort of" like a simple UNIX symbolic link but nothing that is really equivalent: either they have permissions restrictions or are not transparent. This ain't rocket science!
VMS All Over Again
Posted Aug 14, 2015 19:24 UTC (Fri) by trent (guest, #101635) [Link]
The locking of files is an interesting one because Linux/UNIX people interpret the inability to just blow away existing file content on Windows as a deficiency. It's actually a side-effect of a much more sophisticated and integrated cache manager, memory manager, object manager and I/O manager. What's actually happening here behind the scenes is *exactly* what facilitates Windows' ability to achieve thread-agnostic completion-oriented asynchronous I/O irrespective of the underlying device (disk, network or other). You pay a higher up-front cost for process startup and creation of all the underlying section objects, but that allows optimal multi-threaded access once established. You can achieve much higher performance and better hardware utilization with these facilities than you'd ever be able to on Linux/UNIX.
And yeah, git feels like a dog on Windows, because it's inherently designed around UNIX I/O primitives, which are the least optimal way of doing things on Windows.
VMS All Over Again
Posted Aug 14, 2015 21:24 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]
So Linux filesystem call quickly goes into the kernel and there it's quickly dispatched by the VFS layer and in most cases immediately returns the data from the metadata cache. No memory allocations or generic translation layers involved.
In Windows a system call has to construct an IRP and dispatch it. It goes through intermediate layers (which can filter and/or mutilate it) and then the target FS driver has to parse the request back into a usable form. It's all very generic and flexible - you can write a lot of functionality (like virus filters) in a completely generic way.
Except... It's slow! The layered model was unusably slow in the NT3.5 era, so Microsoft added FastIO which short-circuited all of that layered stuff ( http://www.osronline.com/article.cfm?id=166 ) with a simple function pointer table. Of course, asynchronous IO was emulated using (guess it!) thread pools.
VMS All Over Again
Posted Aug 15, 2015 19:11 UTC (Sat) by trent (guest, #101635) [Link]
What I would disagree with is this part:
> And async model based on IRPs also causes a lot of fundamental problems - most of IO is synchronous and layering it on top of asynchronous layer causes unnecessary slowdowns.
The NT kernel defers that decision to device drivers. They can decide whether or not they can complete a call synchronously or if it needs to be handled asynchronously. It adds a lot of complexity to the driver model, but it's exactly that complexity that facilitates the superior asynchronous I/O performance.
With PyParallel, I implemented support for dynamically switching between synchronous and asynchronous calls based on system load and protocol hints -- it works very well and basically gives you the best possible performance by using synchronous non-blocking calls if it won't impede concurrency, then switching to async calls as necessary. This gives you very, very low latency when you've got a few clients connected, or very fair latency when you have many (see http://pyparallel.org).
> So Linux filesystem call quickly goes into the kernel and there it's quickly dispatched by the VFS layer and in most cases immediately returns the data from the metadata cache. No memory allocations or generic translation layers involved.
Well the key to being faster is always to do less, and Linux does less, so, yeah, this particular code path is probably faster if you're just dealing with memory buffers and not Irp setup/teardown (although Irps are typically re-used and allocated from lookaside lists and whatnot, amortizing the overhead).
But keep in mind what we're talking about here -- the ability for Linux to do non-blocking file I/O. Or rather, the fact that all file I/O in Linux is blocking becomes problematic in two situations: a) you've got a single-threaded event loop and use non-blocking socket I/O to achieve concurrency -- blocking for file I/O in this situation impacts your ability to serve other clients' requests, or b) you've got a pool of processes or threads doing work -- in this case you'll typically size your pool to equal ncpu, but if you've got 8 procs/threads and 8 cores and they all issue blocking I/O, you're not optimally using your hardware because there is nothing for those cores to do until the IO returns (or if they're serving network clients, all network clients associated with that process/thread's epoll set will be affected).
That is where the complexity of the Windows I/O model shines -- because you can do everything asynchronously, it's much easier to a) provide much fairer service to clients (lower jitter, reduced stddev), and b) make better use of underlying hardware (either keeping all CPU cores saturated, or I/O channels saturated (depending on your application)).
VMS All Over Again
Posted Aug 16, 2015 8:28 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]
Not really. The drivers themselves are (almost) always called using an IRP, which can be submitted in any way. If a driver needs to make a further call to another layer, then it also must create an IRP and send it to the destination.
And at this point there's really little difference between synchronous and asynchronous calls (as was intended!). But synchronous calls still suffer the overhead of this generic system. It's not a large hit (Microsoft went to great lengths to speed up IRP processing), but it's there.
Ultimately, there's little reason to use asynchronous processing in most cases. Pretty much the only case where it matters is the network (it is often high-throughput and massively parallel), but almost all other common hardware is not very parallel.
VMS All Over Again
Posted Aug 16, 2015 8:37 UTC (Sun) by zlynx (guest, #2285) [Link]
I am surprised that you would say that. It is only true about simple, single spinning hard disks. And those are quickly going away.
SSDs work best at high queue depths and SSDs (of whichever type) are the future.
VMS All Over Again
Posted Aug 16, 2015 8:47 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]
NVMe is different, but we're likely to skip straight to NV-RAM, which is _completely_ different again.
VMS All Over Again
Posted Aug 14, 2015 17:29 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]
The _only_ operation that works faster on Windows is deletion, just because it's very cheap due to NTFS design.
Everything else is slow as molasses. The filesystem layer in Windows scales badly, far worse than Linux's. A directory with a million files works OK on Linux but is impossible on Windows. It's not unusual to find 1000x difference between them on metadata-heavy tests, of course in Linux's favor.