Concurrent page-fault handling with per-VMA locks
The mmap_lock (formerly called mmap_sem) is a reader/writer lock that controls access to a process's address space; before making changes there (mapping in a new range, for example), the kernel must acquire that lock. Page-fault handling must also acquire mmap_lock (in reader mode) to ensure that the address space doesn't change in surprising ways while a fault is being resolved. A process can have a large address space and many threads running (and incurring page faults) concurrently, turning mmap_lock into a significant bottleneck. Even if the lock itself is not contended, the constant cache-line bouncing hurts performance.
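For reference, the fault path today looks roughly like the sketch below. This is a heavily simplified rendition, not the kernel's actual fault-handling code (which also deals with retries, errors, and architecture-specific details), and the function name is invented for illustration; the point is that every fault takes mmap_lock in read mode around the VMA lookup and fault resolution:

```c
#include <linux/mm.h>

/*
 * Simplified sketch of the existing fault path: take mmap_lock in read
 * mode, look up the VMA covering the faulting address, resolve the fault,
 * then drop the lock.  Retry logic and error handling are omitted.
 */
static vm_fault_t fault_with_mmap_lock(struct mm_struct *mm,
				       unsigned long address,
				       unsigned int flags,
				       struct pt_regs *regs)
{
	struct vm_area_struct *vma;
	vm_fault_t ret = VM_FAULT_SIGSEGV;

	mmap_read_lock(mm);
	vma = find_vma(mm, address);
	if (vma && address >= vma->vm_start)
		ret = handle_mm_fault(vma, address, flags, regs);
	mmap_read_unlock(mm);

	return ret;
}
```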
Many attempts at solving the mmap_lock scalability problem have taken the form of speculative page-fault handling, where the work to resolve a fault is done without taking mmap_lock in the hope that the address space doesn't change in the meantime. Should concurrent access occur, the speculative page-fault code drops the work it has done and retries after taking mmap_lock. Various implementations have been shown over the years and they have demonstrated performance benefits, but the solutions are complex and none have managed to convince enough developers to be merged into the mainline kernel.
An alternative approach that has often been considered is range locking. Rather than locking the entire address space to make a change to a small part of it, range locking ensures exclusive access to the address range of interest while allowing accesses to other parts of the address space to proceed concurrently. Range locking turns out to be tricky as well, though, and no implementation has gotten close to being considered for merging.
VMA locking
A process's address space is described by a sequence of virtual memory areas (VMAs), represented by struct vm_area_struct. Each VMA corresponds to an independent range of address space; an mmap() call will normally create a new one, for example. Consecutive VMAs with the same characteristics can be merged; VMAs can also be split if, for example, a process changes the memory protections on a portion of the range. The number of VMAs varies from one process to the next, but it can grow to be quite large; the Emacs process within which this article is being written has over 1,100 of them, while gnome-shell has over 3,100.
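An abridged view of struct vm_area_struct (many fields omitted and reordered for illustration; see include/linux/mm_types.h for the real definition) shows the pieces most relevant to this discussion:

```c
/* Abridged for illustration; the real structure has many more fields. */
struct vm_area_struct {
	unsigned long vm_start;		/* first address covered by this VMA */
	unsigned long vm_end;		/* first address beyond the covered range */
	struct mm_struct *vm_mm;	/* the address space this VMA belongs to */
	pgprot_t vm_page_prot;		/* page protections for this range */
	unsigned long vm_flags;		/* VM_READ, VM_WRITE, VM_SHARED, ... */
	unsigned long vm_pgoff;		/* offset into vm_file, in page units */
	struct file *vm_file;		/* backing file, or NULL if anonymous */
	const struct vm_operations_struct *vm_ops;
	/* ... */
};
```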
At LSFMM this year, Matthew Wilcox suggested that the range-locking problem could be simplified by turning it into a VMA-locking problem. Since each VMA covers a range of the address space, locking the VMA would be equivalent to locking that range. The result would have much coarser resolution than true range locking, but it might still be good enough to be worth the effort.
A patch set from Suren Baghdasaryan is an attempt to find out whether that is the case. But, of course, it immediately ran into the complexities of memory-management subsystem locking. There are two distinct types of locks that need to be taken on a VMA:
- Page-fault handling needs to ensure that the VMA remains present while a fault is being resolved and that it doesn't change in problematic ways. This work can be done concurrently with the handling of other faults or a number of other tasks, though. So the page-fault handler needs to take what is essentially a read lock.
- Address-space changes will need exclusive access to one or more VMAs; while (for example) a VMA is being split, no other part of the kernel can be allowed to do anything with any of the parts. So these types of changes require a write lock.
The original idea had been to use a reader/writer lock for this task, but that led to another problem: write locks often need to be applied to multiple VMAs at once. It would be possible to implement this with reader/writer locks but, as Baghdasaryan pointed out in the cover letter: "Tracking all the locked VMAs, avoiding recursive locks and other complications would make the code more complex". There is surprisingly little desire for more complexity in the core memory-management code, so he went in search of a different solution.
The implementation
The scheme that emerged was a combination of a reader/writer lock and a sequence number that is added to every VMA, but also to the mm_struct structure that describes the address space as a whole. If the sequence number in a given VMA is equal to the mm_struct sequence number, then that VMA is considered locked for modification and inaccessible for concurrent page-fault handling. If the two numbers disagree, no lock exists and concurrent access is possible.
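Expressed as a sketch (the field names here are invented for illustration and need not match the patch set), the "locked for modification" test is just a comparison:

```c
/*
 * Illustrative sketch only.  Assume the patch set adds:
 *   mm->mm_lock_seq   - per-address-space sequence number
 *   vma->vm_lock_seq  - per-VMA copy of that number
 *   vma->vm_lock      - a per-VMA reader/writer lock (used below)
 * The names are invented for this sketch.
 */
static bool vma_marked_locked(struct vm_area_struct *vma)
{
	/* Equal sequence numbers mean the VMA is locked for modification. */
	return READ_ONCE(vma->vm_lock_seq) ==
	       READ_ONCE(vma->vm_mm->mm_lock_seq);
}
```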
When a page fault occurs, the handler will first attempt to read-lock the per-VMA lock; if that fails, it falls back to acquiring the full mmap_lock as is done now. If the read lock succeeds, though, the handler must also check the sequence numbers; if the sequence number for the relevant VMA matches the one in the mm_struct (which can only be changed while mmap_lock is held for writing), then other changes are afoot and handling must, once again, fall back to taking mmap_lock. Otherwise the VMA is available and the fault can be handled without locking the address space as a whole. The per-VMA read lock is released once that work is complete.
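The fault-side check described above might look something like the following; this is again a sketch with invented names, not the actual patch code:

```c
/*
 * Illustrative sketch.  Try to obtain per-VMA access for fault handling;
 * a false return means the caller must fall back to taking mmap_lock.
 */
static bool vma_read_trylock(struct vm_area_struct *vma)
{
	/* A writer currently holds the per-VMA lock: fall back. */
	if (!down_read_trylock(&vma->vm_lock))
		return false;

	/*
	 * The sequence numbers match, so this VMA has been marked locked
	 * for modification: drop the lock and fall back.
	 */
	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
		up_read(&vma->vm_lock);
		return false;
	}

	/* The fault can be handled here; call up_read() when done. */
	return true;
}
```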
When the memory-management subsystem must make address-space changes, it must instead lock each of the VMAs that will be affected. The first step is to take a write lock on mmap_lock; then, for each affected VMA, the kernel acquires that VMA's reader/writer lock in write mode (potentially waiting for any existing readers to let go of it). That lock is only held for long enough to set the VMA's sequence number equal to the mm_struct sequence number, though. Once that change has been made, the VMA remains locked even after the reader/writer lock is released.
Another way to describe this is to say that the per-VMA reader/writer lock really only exists to protect access to the per-VMA sequence number, which is the real per-VMA lock.
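On the write side, marking a single VMA as locked could look like the following sketch, with the same caveats about invented names:

```c
/*
 * Illustrative sketch.  Mark one VMA as locked for modification.  The
 * caller holds mmap_lock in write mode, so mm_lock_seq cannot change here.
 */
static void vma_mark_locked(struct vm_area_struct *vma)
{
	mmap_assert_write_locked(vma->vm_mm);

	/* Already marked as part of the current operation. */
	if (vma->vm_lock_seq == vma->vm_mm->mm_lock_seq)
		return;

	/* Wait for any fault handlers holding the lock in read mode. */
	down_write(&vma->vm_lock);
	WRITE_ONCE(vma->vm_lock_seq, vma->vm_mm->mm_lock_seq);
	up_write(&vma->vm_lock);	/* the VMA remains marked locked */
}
```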
After the kernel has locked all of the relevant VMAs, whatever changes need to be made can proceed. It will not be possible to handle page faults within those VMAs during this time (as is the case now), but other parts of the address space will be unaffected. Once the work is complete, all of those VMAs can be unlocked by simply increasing the mm_struct sequence number. There is no need to go back to each locked VMA — or even to remember which ones they are.
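Unlocking all of the marked VMAs, under the same assumptions, then amounts to a single increment performed before mmap_lock is released:

```c
/*
 * Illustrative sketch.  Invalidate every per-VMA sequence number at once
 * by advancing the mm-wide number; called with mmap_lock still held for
 * writing.
 */
static void vma_unlock_all(struct mm_struct *mm)
{
	mmap_assert_write_locked(mm);
	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
}
```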
There are, of course, plenty of other details that have been glossed over here, including the need to bring VMAs under read-copy-update protection so that they can be looked up without holding mmap_lock. But the locking scheme is the core that makes it all work. According to Baghdasaryan, the resulting performance increase is about 75% of that achieved with the speculative page-fault patches, so it's still leaving some performance on the table. But, he said: "Still, with lower complexity this approach might be more desirable".
This work is deemed to be a proof-of-concept at this point. Among other things, it only handles faults on anonymous pages and, even then, only those that are not in swap. Support for swapped and file-backed pages can be added later, he said, if the approach seems worth pursuing. Answering that question may take a while; core memory-management patches tend not to be merged quickly, and this discussion is just beginning. But if it works out, this patch set could be a step in the direction of the long-wished-for range-locking mechanism for process address spaces.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/mmap_sem |
| Kernel | Releases/6.4 |
Concurrent page-fault handling with per-VMA locks
Posted Sep 5, 2022 16:36 UTC (Mon) by developer122 (guest, #152928) [Link]
Or in this implementation are they just taken as collateral damage, and unnecessarily locked until the current write operation ends (finishes its work on the VMAs of interest)?
Rollover
Posted Sep 5, 2022 18:49 UTC (Mon) by corbet (editor, #1) [Link]
This is just a guess but ... I would expect that, if a VMA has been entirely idle for long enough for the sequence number to roll over and come all the way back around, the chances of an access coming at just the right time to cause a false positive are pretty tiny. But if that occurs, the worst that happens is that a fault is handled a bit more slowly than it otherwise would be.
Concurrent page-fault handling with per-VMA locks
Posted Sep 6, 2022 16:20 UTC (Tue) by adobriyan (subscriber, #30858) [Link]
VMAs are 200 bytes (on my F35 system) and on most systems they will be allocated from kmalloc-256, so there is plenty of space.
Concurrent page-fault handling with per-VMA locks
Posted Sep 6, 2022 16:35 UTC (Tue) by Wol (subscriber, #4433) [Link]
Could you make the sequence number include the pid?
What are the chances of a 16-bit rolling number concatenated with the low order 16-bit pid as your sequence number colliding?
Cheers,
Wol
Concurrent page-fault handling with per-VMA locks
Posted Sep 6, 2022 2:20 UTC (Tue) by calumapplepie (guest, #143655) [Link]
Or maybe the authors didn't think of this case, which would be concerning, since there is the potential for a deadlock if they got it wrong. We probably can't assume that there even is a current write operation which can complete, since I think the number is only incremented on write-operation completion. If it were incremented on both initiation and completion, you'd get that no-deadlock property: all VMAs would be unlocked whenever there is no write operation in progress, regardless of rollover, because the mmap sequence number would be odd while modifications are occurring and even otherwise, and VMA sequence numbers would only ever be set to odd values.
Both of these are fairly inexpensive solutions, as an extra increment isn't much compared to handling a page fault, and the structure in question will already be loaded into the L1 cache for modification. Similarly, I believe the first solution boils down to an additional decrement as well, since the mmap_lock must be acquired anyway in the fallback case. They also aren't mutually exclusive; just make the fallback path decrement the VMA sequence number by 2 to keep the odd/even pattern. I'd encourage the developers to implement both, given the impact that a bug here could have: at worst, an unpredictable deadlock that only occurs in long-running processes and is virtually impossible to reproduce; at best, degraded performance in long-running processes due to constantly falling back on the worst-case code. I'm willing to bet that *some* process's architecture involves setting up a few large VMAs full of pages being faulted in and out, but which never themselves get modified.
Concurrent page-fault handling with per-VMA locks
Posted Sep 16, 2022 2:15 UTC (Fri) by kiryl (subscriber, #41516) [Link]
Workloads that require extreme scaling (databases, VMMs, etc) tend to have one or a few very hot VMAs and per-VMA locking would not bring any help.
Concurrent page-fault handling with per-VMA locks
Posted Sep 16, 2022 7:28 UTC (Fri) by jem (subscriber, #24231) [Link]
Loading a lot of Emacs Lisp libraries can result in a zillion shared objects being loaded, each with their own VMAs.