
Concurrent page-fault handling with per-VMA locks

By Jonathan Corbet
September 5, 2022
The kernel is, in many ways, a marvel of scalability, but there is a longstanding pain point in the memory-management subsystem that has resisted all attempts at elimination: the mmap_lock. This lock was inevitably a topic at the 2022 Linux Storage, Filesystem, Memory-Management and BPF Summit (LSFMM), where the idea of using per-VMA locks was raised. Suren Baghdasaryan has posted an implementation of that idea — but with an interesting twist on how those locks are implemented.

The mmap_lock (formerly called mmap_sem) is a reader/writer lock that controls access to a process's address space; before making changes there (mapping in a new range, for example), the kernel must acquire that lock. Page-fault handling must also acquire mmap_lock (in reader mode) to ensure that the address space doesn't change in surprising ways while a fault is being resolved. A process can have a large address space and many threads running (and incurring page faults) concurrently, turning mmap_lock into a significant bottleneck. Even if the lock itself is not contended, the constant cache-line bouncing hurts performance.
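
In rough terms, the pattern looks something like the sketch below; this is a simplified illustration rather than the actual architecture-specific fault code, which carries far more retry and error-handling logic.

    /* Simplified sketch of the current fault path; the real code lives in
     * the architecture-specific handlers. */
    static vm_fault_t fault_with_mmap_lock(struct mm_struct *mm,
					   unsigned long address,
					   unsigned int flags,
					   struct pt_regs *regs)
    {
	struct vm_area_struct *vma;
	vm_fault_t ret = VM_FAULT_SIGSEGV;

	mmap_read_lock(mm);		/* whole address space, shared mode */
	vma = find_vma(mm, address);
	if (vma && vma->vm_start <= address)
		ret = handle_mm_fault(vma, address, flags, regs);
	mmap_read_unlock(mm);
	return ret;
    }

Every fault in every thread of the process takes that same lock, which is where the cache-line bouncing comes from.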

Many attempts at solving the mmap_lock scalability problem have taken the form of speculative page-fault handling, where the work to resolve a fault is done without taking mmap_lock in the hope that the address space doesn't change in the meantime. Should concurrent access occur, the speculative page-fault code drops the work it has done and retries after taking mmap_lock. Various implementations have been shown over the years and they have demonstrated performance benefits, but the solutions are complex and none have managed to convince enough developers to be merged into the mainline kernel.

An alternative approach that has often been considered is range locking. Rather than locking the entire address space to make a change to a small part of it, range locking ensures exclusive access to the address range of interest while allowing accesses to other parts of the address space to proceed concurrently. Range locking turns out to be tricky as well, though, and no implementation has gotten close to being considered for merging.

VMA locking

A process's address space is described by a sequence of virtual memory areas (VMAs), represented by struct vm_area_struct. Each VMA corresponds to an independent range of address space; an mmap() call will normally create a new one, for example. Consecutive VMAs with the same characteristics can be merged; VMAs can also be split if, for example, a process changes the memory protections on a portion of the range. The number of VMAs varies from one process to the next, but it can grow to be quite large; the Emacs process within which this article is being written has over 1,100 of them, while gnome-shell has over 3,100.
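
In heavily abridged form (the full definition lives in include/linux/mm_types.h), the structure describes the range it covers:

    /* Heavily abridged; many fields are omitted. */
    struct vm_area_struct {
	unsigned long vm_start;		/* first address covered by the VMA */
	unsigned long vm_end;		/* first address past the covered range */
	struct mm_struct *vm_mm;	/* the address space this VMA belongs to */
	unsigned long vm_flags;		/* protection and behavior flags */
	struct file *vm_file;		/* backing file, or NULL if anonymous */
	/* ... */
    };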

At LSFMM this year, Matthew Wilcox suggested that the range-locking problem could be simplified by turning it into a VMA-locking problem. Since each VMA covers a range of the address space, locking the VMA would be equivalent to locking that range. The result would have much coarser resolution than true range locking, but it might still be good enough to be worth the effort.

Baghdasaryan's patch set is an attempt to find out whether that is the case. But, of course, it immediately ran into the complexities of memory-management subsystem locking. There are two distinct types of locks that need to be taken on a VMA:

  • Page-fault handling needs to ensure that the VMA remains present while a fault is being resolved and that it doesn't change in problematic ways. This work can be done concurrently with the handling of other faults or a number of other tasks, though. So the page-fault handler needs to take what is essentially a read lock.
  • Address-space changes will need exclusive access to one or more VMAs; while (for example) a VMA is being split, no other part of the kernel can be allowed to do anything with any of the parts. So these types of changes require a write lock.

The original idea had been to use a reader/writer lock for this task, but that led to another problem: write locks often need to be applied to multiple VMAs at once. It would be possible to implement this with reader/writer locks but, as Baghdasaryan pointed out in the cover letter: "Tracking all the locked VMAs, avoiding recursive locks and other complications would make the code more complex". There is surprisingly little desire for more complexity in the core memory-management code, so he went in search of a different solution.

The implementation

The scheme that emerged was a combination of a reader/writer lock and a sequence number that is added to every VMA, but also to the mm_struct structure that describes the address space as a whole. If the sequence number in a given VMA is equal to the mm_struct sequence number, then that VMA is considered locked for modification and inaccessible for concurrent page-fault handling. If the two numbers disagree, no lock exists and concurrent access is possible.
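
In outline, the additions look something like the sketch below; the field names here are illustrative rather than copied from the patch set.

    /* Illustrative sketch of the added state, not verbatim from the patches. */
    struct mm_struct {
	/* ... */
	int mm_lock_seq;		/* bumped to release all locked VMAs */
    };

    struct vm_area_struct {
	/* ... */
	struct rw_semaphore lock;	/* guards updates to vm_lock_seq */
	int vm_lock_seq;		/* equal to mm_lock_seq when locked */
    };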

When a page fault occurs, the handler will first attempt to read-lock the per-VMA lock; if that fails then it falls back to acquiring the full mmap_lock as is done now. If the read lock succeeds, though, the handler must also check the sequence numbers; if the sequence number for the relevant VMA matches that in the mm_struct (which cannot change as long as mmap_lock is held), then other changes are afoot and handling must, once again, fall back to taking mmap_lock. Otherwise the VMA is available and the fault can be handled without locking the address space as a whole. The read lock will be released once that task is complete.
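
As a sketch, reusing the illustrative names above, the fault-side check might look like the following; lookup_vma_rcu() is a hypothetical stand-in for the RCU-protected VMA lookup discussed later, and object-lifetime details are glossed over.

    /* Sketch only: returns the VMA with its per-VMA read lock held, or NULL
     * to tell the caller to fall back to mmap_lock. */
    static struct vm_area_struct *vma_read_trylock(struct mm_struct *mm,
						   unsigned long address)
    {
	struct vm_area_struct *vma;

	rcu_read_lock();
	vma = lookup_vma_rcu(mm, address);	/* hypothetical helper */
	if (!vma || !down_read_trylock(&vma->lock)) {
		vma = NULL;			/* fall back to mmap_lock */
		goto out;
	}
	/* A matching sequence number means the VMA is write-locked. */
	if (vma->vm_lock_seq == READ_ONCE(mm->mm_lock_seq)) {
		up_read(&vma->lock);
		vma = NULL;			/* fall back to mmap_lock */
	}
    out:
	rcu_read_unlock();
	return vma;				/* read lock held if non-NULL */
    }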

When the memory-management system must make address-space changes, by contrast, it must lock each of the VMAs that will be affected. The first step is to take a write lock on mmap_lock, then, for each VMA, it will acquire the reader/writer lock in write mode (potentially waiting for any existing readers to let go of it). That lock is only held for long enough to set the VMA's sequence number equal to the mm_struct sequence number, though. Once that change has been made, the VMA is locked even after the reader/writer lock is released.

Another way to describe this is to say that the per-VMA reader/writer lock really only exists to protect access to the per-VMA sequence number, which is the real per-VMA lock.
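
A sketch of that write-side marking, again with illustrative names, might look like this; the caller is assumed to already hold mmap_lock for writing.

    /* Illustrative write-side sketch; mmap_lock must be write-held. */
    static void vma_write_lock(struct vm_area_struct *vma)
    {
	struct mm_struct *mm = vma->vm_mm;

	mmap_assert_write_locked(mm);
	if (vma->vm_lock_seq == mm->mm_lock_seq)
		return;				/* already locked in this cycle */

	down_write(&vma->lock);			/* wait for any readers to finish */
	WRITE_ONCE(vma->vm_lock_seq, mm->mm_lock_seq);
	up_write(&vma->lock);			/* the VMA stays locked regardless */
    }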

After the kernel has locked all of the relevant VMAs, whatever changes need to be made can proceed. It will not be possible to handle page faults within those VMAs during this time (as is the case now), but other parts of the address space will be unaffected. Once the work is complete, all of those VMAs can be unlocked by simply increasing the mm_struct sequence number. There is no need to go back to each locked VMA — or even to remember which ones they are.
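
The unlock side then reduces, roughly, to a single increment (again with illustrative names):

    /* Sketch: releasing every VMA locked during this mmap_lock cycle. */
    static void mm_unlock_all_vmas(struct mm_struct *mm)
    {
	mmap_assert_write_locked(mm);
	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
    }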

There are, of course, plenty of other details that have been glossed over here, including the need to bring VMAs under read-copy-update protection so that they can be looked up without holding mmap_lock. But the locking scheme is the core that makes it all work. According to Baghdasaryan, the resulting performance increase is about 75% of that achieved with the speculative page-fault patches, so it's still leaving some performance on the table. But, he said: "Still, with lower complexity this approach might be more desirable".

This work is deemed to be a proof-of-concept at this point. Among other things, it only handles faults on anonymous pages and, even then, only those that are not in swap. Support for swapped and file-backed pages can be added later, he said, if the approach seems worth pursuing. Answering that question may take a while; core memory-management patches tend not to be merged quickly, and this discussion is just beginning. But if it works out, this patch set could be a step in the direction of the long-wished-for range-locking mechanism for process address spaces.

Index entries for this article
Kernel: Memory management/mmap_sem
Kernel: Releases/6.4



Concurrent page-fault handling with per-VMA locks

Posted Sep 5, 2022 16:36 UTC (Mon) by developer122 (guest, #152928) [Link]

How does the kernel deal with sequence number rollover? ie in this example when the global sequence number rolls around to be equal to the stale numbers on a few of the thousands of VMAs?

Or in this implementation are they just taken as collateral damage, and unnecessarily locked until the current write operation ends (finishes its work on the VMAs of interest)?

Rollover

Posted Sep 5, 2022 18:49 UTC (Mon) by corbet (editor, #1) [Link]

This is just a guess but ... I would expect that, if a VMA has been entirely idle for long enough for the sequence number to roll over and come completely back around, the chances of an access coming at just the right time to cause a false positive are pretty tiny. But if that occurs, the worst that happens is that a fault is handled a bit more slowly than it otherwise would be.

Rollover

Posted Oct 12, 2022 16:46 UTC (Wed) by surenb (subscriber, #114881) [Link]

Mr Corbet is right as usual. The comment in https://lwn.net/ml/linux-kernel/c84136d3-703a-0e57-20ce-5... indicates "Overflow might produce false locked result but it's not critical." This is because we will simply fall back to taking mmap_lock in the case of a false positive. Per Laurent's comment I'm going to expand this comment to explain the details.

Concurrent page-fault handling with per-VMA locks

Posted Sep 5, 2022 21:05 UTC (Mon) by Homer512 (subscriber, #85295) [Link]

Do we even care about rollover? If the sequence number is 64 bit, a 5 GHz machine cannot roll it over in over 100 years, even if it could increment the number once per clock cycle.
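
For reference, the arithmetic behind that claim, assuming one increment every cycle at 5 GHz:

    2^64 increments / (5 × 10^9 increments per second) ≈ 3.7 × 10^9 seconds ≈ 117 years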

Concurrent page-fault handling with per-VMA locks

Posted Sep 6, 2022 15:31 UTC (Tue) by calumapplepie (guest, #143655) [Link]

Hmm, good point... the downside would be memory consumption. 64-bit sequence numbers would increase the size of every vma_struct by about 5% (by my estimate of scanning the code). `sudo cat /proc/*/maps | wc` tells me I have 146,368 VMAs right now on my laptop; 8 extra bytes each would add up to around a megabyte, which is fairly small, especially considering what I have running right now (Firefox, Chrome, Libreoffice, Calibre, and Evolution, plus a few other apps). That being said, I remember seeing an article on here about an effort to share vma_struct instances between processes, so there might be some group that's seeing much larger numbers.

Concurrent page-fault handling with per-VMA locks

Posted Sep 6, 2022 16:20 UTC (Tue) by adobriyan (subscriber, #30858) [Link]

> 64-bit sequence numbers would increase the size of every vma_struct by about 5% (by my estimate of scanning the code)

VMAs are 200 bytes (on my F35 system) and on most systems they will be allocated from kmalloc-256, so there is plenty of space.

Concurrent page-fault handling with per-VMA locks

Posted Sep 6, 2022 16:27 UTC (Tue) by adobriyan (subscriber, #30858) [Link]

Hmm... Fedora disabled SLAB merging.

Concurrent page-fault handling with per-VMA locks

Posted Sep 6, 2022 16:35 UTC (Tue) by Wol (subscriber, #4433) [Link]

> Hmm, good point... the downside would be memory consumption. 64-bit sequence numbers would increase the size of every vma_struct by about 5% (by my estimate of scanning the code).

Could you make the sequence number include the pid?

What are the chances of a 16-bit rolling number concatenated with the low order 16-bit pid as your sequence number colliding?

Cheers,
Wol

Concurrent page-fault handling with per-VMA locks

Posted Sep 6, 2022 20:42 UTC (Tue) by WolfWings (subscriber, #56790) [Link]

...hasn't the PID been quite a bit larger than 16-bit for some time now? Just software limited to only use 16-bit PIDs by default for compatibility?

Concurrent page-fault handling with per-VMA locks

Posted Sep 7, 2022 6:24 UTC (Wed) by Wol (subscriber, #4433) [Link]

I know. Which is why I said use the low-order 16 bits ...

Cheers,
Wol

Concurrent page-fault handling with per-VMA locks

Posted Sep 6, 2022 20:48 UTC (Tue) by Homer512 (subscriber, #85295) [Link]

True. All in all, corbet's answer is probably the better one: A false-positive due to rollover is harmless and just results in semi-random slow-paths. So we might even use a tiny 8 bit counter if it makes struct packing better.

Concurrent page-fault handling with per-VMA locks

Posted Sep 6, 2022 2:20 UTC (Tue) by calumapplepie (guest, #143655) [Link]

Perhaps it's handled by, in the numbers-are-equal case, attempting to acquire mmap_lock. If, on acquisition, we see that the numbers are still equal, we know a rollover must have happened (since the VMA is locked for modification and yet the thread that should be modifying it... isn't). So we can decrement the VMA sequence number, solving the issue until the next rollover without making it come sooner (as one would if we incremented the central sequence number).

Or maybe the authors didn't think of this case, which would be concerning, since there exists a potential for a deadlock if they did it wrong. We probably can't assume that there even is a current write operation which can complete, since I think the number is only incremented on write operation completion. If it was incremented on both operation completion and initiation, you'd get that property of no-deadlocks. All VMAs would be unlocked if there is no write operation, regardless of rollover, because the mmap sequence number would be odd if there are modifications occurring and even otherwise, and VMA sequence numbers would only ever be set to odd values.

Both of these are fairly inexpensive solutions, as an extra increment isn't much compared to handling a page fault, and the structure in question will already be loaded into the L1 cache for modification. Similarly, I believe the first solution boils down to an additional decrement as well, since the mmap_lock must be acquired anyways in the fallback case. They also aren't mutually exclusive; just make the fallback path decrement the VMA sequence number by 2 to keep the pattern of odds/evens. I'd encourage the devs to implement both, given the impact that a bug in this could have: at worst, an unpredictable deadlock that only occurs in long-running processes and is virtually impossible to reproduce; at best, degraded performance in long-running processes due to constantly falling back on the worst-case code. I'm willing to bet that *some* process's architecture involves setting up a few large VMAs full of pages being faulted in and out but which never itself gets modified.

Concurrent page-fault handling with per-VMA locks

Posted Sep 6, 2022 6:28 UTC (Tue) by kilobyte (subscriber, #108024) [Link]

If you can't use a large enough counter, what about reads checking if you're dangerously close to a rollover, and if so doing a fake write?

Concurrent page-fault handling with per-VMA locks

Posted Sep 6, 2022 10:56 UTC (Tue) by developer122 (guest, #152928) [Link]

I guess if you want to catch potential bugs like this, you need to init the counter to be close to rollover. I remember some wisdom once (dunno if it's still done) that timeouts in the kernel by default should be set to 10 minutes after boot, to help catch issues early.

Concurrent page-fault handling with per-VMA locks

Posted Sep 6, 2022 8:42 UTC (Tue) by unixbhaskar (guest, #44758) [Link]

"The number of VMAs varies from one process to the next, but it can grow to be quite large; the Emacs process within which this article is being written has over 1,100 of them, while gnome-shell has over 3,100."

Ummm :)

Concurrent page-fault handling with per-VMA locks

Posted Sep 6, 2022 8:49 UTC (Tue) by bojan (subscriber, #14302) [Link]

Exactly! Emacs! Come on! :-)

Concurrent page-fault handling with per-VMA locks

Posted Sep 16, 2022 2:15 UTC (Fri) by kiryl (subscriber, #41516) [Link]

Do Emacs and gnome-shell suffer from scalability issues? If so, I believe the problem is on the other side of the syscall.

Workloads that require extreme scaling (databases, VMMs, etc) tend to have one or a few very hot VMAs and per-VMA locking would not bring any help.

Concurrent page-fault handling with per-VMA locks

Posted Sep 16, 2022 7:28 UTC (Fri) by jem (subscriber, #24231) [Link]

The reason Emacs uses so many VMAs is the new Emacs Lisp Native Compilation feature introduced in version 28.1. Each Emacs Lisp file (.el) can be compiled into a byte-code file (.elc), which in turn can be compiled into a .eln file, which is a standard shared object file with a different file name suffix. (On Windows the files are DLLs.)

Loading a lot of Emacs Lisp libraries can result in a zillion shared objects being loaded, each with their own VMAs.


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds