Toward more predictable and reliable out-of-memory handling
OOM detection
The first patch set tries to answer a question that one might think would be obvious: how do we know when the system is out of memory? The problem is that a running system is highly dynamic. The lack of a free page to allocate at the moment does not mean that such pages could not be created; given the high cost of invoking the OOM killer, it is best not to declare an OOM situation if the kernel might be able to scrounge some memory from somewhere. Current kernels, though, are a bit unpredictable regarding when they give up and, in some cases, might wait too long.
If there are no pages to satisfy an allocation request, the kernel will perform direct reclaim to try to free some memory. In some cases, direct reclaim will be successful; that happens, for example, if it finds clean pages that can be immediately repurposed. In other cases, though, reclaiming pages requires writing them back to backing store; those pages will not be available for what is, from a computer's perspective, a long time. Still, they should become available eventually, so the kernel is justifiably reluctant to declare an OOM situation for as long as reclaimable pages exist.
The problem is that there are no real bounds on how long it might take for "reclaimable" pages to actually be reclaimed, for a number of reasons. Additionally, the allocator can conceivably find itself endlessly retrying if a single page is reclaimed, even if that page cannot be used for the current allocation request. As a result, the kernel can find itself hung up in allocation attempts that do not succeed, but which do not push the system into OOM handling.
Michal's patch defines a new heuristic for deciding when the system is truly out of memory. When an allocation attempt initially fails, the logic is similar to what is done in current kernels: a retry will be attempted (after an I/O wait) if there is a memory zone in the system where the sum of free and reclaimable pages is at least as large as the allocation request. If the retries continue to fail, though, a couple of changes come into play.
The first of those is that there is an upper bound of sixteen retries; after that, the kernel gives up and goes into OOM-handling mode. That may bring about an OOM situation sooner than current kernels (which can loop indefinitely) will, but, as Michal put it: "OOM killer would be more appropriate than looping without any progress for unbounded amount of time". Beyond that, the kernel's count of the number of reclaimable pages is discounted more heavily after each unsuccessful retry; after eight retries, that number will be cut in half. That makes it increasingly unlikely that the estimate of reclaimable pages will motivate the kernel to keep retrying.
The result of these changes is that the kernel will go into OOM handling in a more predictable manner when memory gets tight. Users will still curse the results, but the system as a whole should more reliably survive OOM situations.
The OOM reaper
At least, that should be the case if the OOM killer is actually able to free pages when the kernel invokes it. As has been seen in recent years, it is not that hard to create a situation where the OOM killer is unable to make any progress, usually because the targeted process is blocked on a lock and the OOM situation itself prevents that lock from being released. If an OOM-killed process cannot run, it cannot exit and, thus, it cannot free its memory; as a result, the entire OOM-killing mechanism fails.
The observation (credited to Mel Gorman and Oleg Nesterov) at the core of Michal's OOM reaper patch set is that it is not necessary to wait for the targeted process to die before stripping it of much of its memory. That process has received a SIGKILL signal, meaning it will not run again in user mode. That, in turn, means that it will no longer access any of its anonymous pages. Those pages can be reclaimed immediately without changing the end result.
The OOM reaper is implemented as a separate thread; this is done because the reaper must be able to run when it is called upon to do its work. Other kernel execution mechanisms, such as workqueues, might themselves be blocked by the OOM situation, so they cannot be counted upon. If this patch is merged, the oom_reaper thread will sit unused on the majority of Linux systems out there, but it will be certain to be available on the systems where it is needed.
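Conceptually, the reaper walks the victim's address space and tears down only the mappings whose contents can no longer matter to anybody. The following is a hedged userspace sketch of that filtering; the `vma` structure, the flag values, and the helper names are simplified stand-ins for illustration, not the kernel's actual types:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's VMA flags */
#define VM_SHARED	0x1	/* shared mapping: visible elsewhere */
#define VM_FILE		0x2	/* file-backed: may need writeback   */

struct vma {
	unsigned long flags;
	size_t        nr_pages;
};

/*
 * A SIGKILLed task will never run in user mode again, so its private
 * anonymous pages can no longer be read or written by anyone.  Those
 * are the pages the reaper may safely drop without waiting for exit().
 */
static bool reapable(const struct vma *vma)
{
	return !(vma->flags & (VM_SHARED | VM_FILE));
}

static size_t reap(const struct vma *vmas, size_t n)
{
	size_t freed = 0;

	for (size_t i = 0; i < n; i++)
		if (reapable(&vmas[i]))
			freed += vmas[i].nr_pages;	/* would unmap here */
	return freed;
}
```

For example, a victim with a large private anonymous heap, a file mapping, and a shared segment would have only the heap pages stripped; the file-backed and shared mappings are left for the normal exit path.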
The reaper is not without its rough edges. It must still take the mmap_sem lock to free the pages, meaning that it could be blocked if mmap_sem is held elsewhere. Still, Michal says that the probability of trouble "is reduced considerably" compared to current kernels. One other potential problem is that, if the targeted process is dumping core at the time it is killed, removing its pages may corrupt the dump. This tradeoff is worthwhile, though, Michal says, since keeping the system running is more important in such situations.
Memory-management patches are notoriously difficult to get merged into the kernel. With regard to the OOM detection patch, Michal said the work "has been sitting and waiting for the fundamental objections for quite some time and there were none". He would like to see it merged in 4.6 or thereafter. Objections to the OOM reaper have also been hard to find, but there has been no talk yet as to when that patch might head for the mainline. Once these patches get there, the OOM-handling subsystem may work a little better, but it seems unlikely that users will appreciate it any more than they do now.
Toward more predictable and reliable out-of-memory handling
Posted Dec 17, 2015 9:00 UTC (Thu) by rgb (subscriber, #57129) [Link]
Of course you don't really appreciate a crashing program, but I very much prefer that over the whole system freezing up. So I do appreciate a system responding to my input very much!
Toward more predictable and reliable out-of-memory handling
Posted Dec 21, 2015 16:13 UTC (Mon) by sam.thursfield (subscriber, #94496) [Link]
Toward more predictable and reliable out-of-memory handling
Posted Dec 23, 2015 1:28 UTC (Wed) by ploxiln (subscriber, #58395) [Link]
The last few times I ran into what should have been an OOM situation, everything just froze, for longer than I cared to wait (at least a few minutes). I've seen this for the whole system, and also for a couple processes in a memory cgroup. I've seen it with no swap at all, and also with some swap space. Freezing for ~15 seconds is worse than death IMHO.
I "devops" some servers that run some processes that occasionally act up, and I've resorted to writing a script that checks memory conditions every few seconds and if there isn't a nice large buffer of free+cache memory, goes about sending SIGTERM and then SIGKILL to the biggest members of the suspected set of processes (and notifying me).
Toward more predictable and reliable out-of-memory handling
Posted Dec 21, 2015 13:07 UTC (Mon) by sorokin (guest, #88478) [Link]
Can't the (1) be fixed by using something like posix_spawn and can't the (2) be fixed by mapping as MAP_PRIVATE only GOT and PLT and MAP_SHARED everything else in shared objects? As I understand in this case it will be possible to default to vm.overcommit_memory=2 and forget about oom-killer.
Toward more predictable and reliable out-of-memory handling
Posted Dec 23, 2015 0:39 UTC (Wed) by eternaleye (guest, #67051) [Link]
In particular, there are most certainly programs which allocate large amounts of address space without necessarily using that memory in the end.
In addition, changing every use of fork to posix_spawn is... a decidedly nontrivial undertaking.
Besides that, failing to check the return value of malloc() is a pervasive issue, which has been worsened by the default-overcommit behavior on Linux: it has trained people to believe that "malloc() never fails."
And then, if the user program DOES check the return value of malloc(), what are its options?
- Drop some internal caches
- Loop forever
- Die
i.e., the exact same issue as the OOM killer, with the added downside of not having the ability to take advantage of global information - say, page cache pages being clean.
And while it may _seem_ as though there's a benefit there, where the allocating process takes the hit, it's a false fairness - after all, a 3.9 GB allocation may have just succeeded for another program, and now your shell wants one measly page and can't get it.
Toward more predictable and reliable out-of-memory handling
Posted Apr 15, 2016 23:27 UTC (Fri) by oak (guest, #2786) [Link]
For example, programs that create threads. While the stack of the main thread grows dynamically, the rest of the threads have fixed-size stacks. While one can specify what size stack threads get, most programs don't.
By default, thread stack sizes on Linux are currently 8MB. If you have a lot of programs that use a lot of threads, without overcommit that would eat quite a bit of RAM.
Toward more predictable and reliable out-of-memory handling
Posted Dec 25, 2015 5:40 UTC (Fri) by Shabbyx (guest, #104730) [Link]
It's so annoying when OOM happens, it's infinitely more annoying when the kernel decides to freeze the whole system up for minutes, until it says "oh, sorry, I was supposed to kill this guy". I wished many times it would just go ahead with the kill immediately.
By 4.6, I might have already ended up with a new motherboard and more ram, so this fix would be too late.
Toward more predictable and reliable out-of-memory handling
Posted May 16, 2016 17:29 UTC (Mon) by Det (guest, #108823) [Link]
Sooo.. did you?
Toward more predictable and reliable out-of-memory handling
Posted Apr 12, 2019 9:44 UTC (Fri) by slack1256 (guest, #131415) [Link]
Toward more predictable and reliable out-of-memory handling
Posted Aug 23, 2016 6:09 UTC (Tue) by mikachu (guest, #5333) [Link]
http://lkml.iu.edu/hypermail/linux/kernel/1608.2/04321.html