Toward more predictable and reliable out-of-memory handling
OOM detection
The first patch set tries to answer a question that one might think would be obvious: how do we know when the system is out of memory? The problem is that a running system is highly dynamic. The lack of a free page to allocate at the moment does not mean that such pages could not be created; given the high cost of invoking the OOM killer, it is best not to declare an OOM situation if the kernel might be able to scrounge some memory from somewhere. Current kernels, though, are a bit unpredictable regarding when they give up and, in some cases, might wait too long.
If there are no pages to satisfy an allocation request, the kernel will perform direct reclaim to try to free some memory. In some cases, direct reclaim will be successful; that happens, for example, if it finds clean pages that can be immediately repurposed. In other cases, though, reclaiming pages requires writing them back to backing store; those pages will not be available for what is, from a computer's perspective, a long time. Still, they should become available eventually, so the kernel is justifiably reluctant to declare an OOM situation for as long as reclaimable pages exist.
The problem is that there are no real bounds on how long it might take for "reclaimable" pages to actually be reclaimed, for a number of reasons. Additionally, the allocator can conceivably find itself endlessly retrying if a single page is reclaimed, even if that page cannot be used for the current allocation request. As a result, the kernel can find itself hung up in allocation attempts that do not succeed, but which do not push the system into OOM handling.
Michal's patch defines a new heuristic for deciding when the system is truly out of memory. When an allocation attempt initially fails, the logic is similar to what is done in current kernels: a retry will be attempted (after an I/O wait) if there is a memory zone in the system where the sum of free and reclaimable pages is at least as large as the allocation request. If the retries continue to fail, though, a couple of changes come into play.
The first of those is that there is an upper bound of sixteen retries; after that, the kernel gives up and goes into OOM-handling mode. That may bring about an OOM situation sooner than current kernels (which can loop indefinitely) will, but, as Michal put it: "OOM killer would be more appropriate than looping without any progress for unbounded amount of time". Beyond that, the kernel's count of the number of reclaimable pages is discounted more heavily after each unsuccessful retry; after eight retries, that number will be cut in half. That makes it increasingly unlikely that the estimate of reclaimable pages will motivate the kernel to keep retrying.
The result of these changes is that the kernel will go into OOM handling in a more predictable manner when memory gets tight. Users will still curse the results, but the system as a whole should more reliably survive OOM situations.
The OOM reaper
At least, that should be the case if the OOM killer is actually able to free pages when the kernel invokes it. As has been seen in recent years, it is not that hard to create a situation where the OOM killer is unable to make any progress, usually because the targeted process is blocked on a lock and the OOM situation itself prevents that lock from being released. If an OOM-killed process cannot run, it cannot exit and, thus, it cannot free its memory; as a result, the entire OOM-killing mechanism fails.
The observation (credited to Mel Gorman and Oleg Nesterov) at the core of Michal's OOM reaper patch set is that it is not necessary to wait for the targeted process to die before stripping it of much of its memory. That process has received a SIGKILL signal, meaning it will not run again in user mode. That, in turn, means that it will no longer access any of its anonymous pages. Those pages can be reclaimed immediately without changing the end result.
The OOM reaper is implemented as a separate thread; this is done because the reaper must be able to run when it is called upon to do its work. Other kernel execution mechanisms, such as workqueues, might themselves be blocked by the OOM situation, so they cannot be counted upon. If this patch is merged, the oom_reaper thread will sit unused on the majority of Linux systems out there, but it will be certain to be available on the systems where it is needed.
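Conceptually, the reaper walks the victim's address space and tears down only the mappings whose contents can no longer matter to anybody. The following is a hedged userspace sketch of that filtering; the `vma` structure, the flag values, and the helper names are simplified stand-ins for illustration, not the kernel's actual types:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's VMA flags */
#define VM_SHARED	0x1	/* shared mapping: visible elsewhere */
#define VM_FILE		0x2	/* file-backed: may need writeback   */

struct vma {
	unsigned long flags;
	size_t        nr_pages;
};

/*
 * A SIGKILLed task will never run in user mode again, so its private
 * anonymous pages can no longer be read or written by anyone.  Those
 * are the pages the reaper may safely drop without waiting for exit().
 */
static bool reapable(const struct vma *vma)
{
	return !(vma->flags & (VM_SHARED | VM_FILE));
}

static size_t reap(const struct vma *vmas, size_t n)
{
	size_t freed = 0;

	for (size_t i = 0; i < n; i++)
		if (reapable(&vmas[i]))
			freed += vmas[i].nr_pages;	/* would unmap here */
	return freed;
}
```

For example, a victim with a large private anonymous heap, a file mapping, and a shared segment would have only the heap pages stripped; the file-backed and shared mappings are left for the normal exit path.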
The reaper is not without its rough edges. It must still take the mmap_sem lock to free the pages, meaning that it could be blocked if mmap_sem is held elsewhere. Still, Michal says that the probability of trouble "is reduced considerably" compared to current kernels. One other potential problem is that, if the targeted process is dumping core at the time it is killed, removing its pages may corrupt the dump. This tradeoff is worthwhile, though, Michal says, since keeping the system running is more important in such situations.
Memory-management patches are notoriously difficult to get merged into the kernel. With regard to the OOM detection patch, Michal said the work "has been sitting and waiting for the fundamental objections for quite some time and there were none". He would like to see it merged in 4.6 or thereafter. Objections to the OOM reaper have also been hard to find, but there has been no talk yet as to when that patch might head for the mainline. Once these patches get there, the OOM-handling subsystem may work a little better, but it seems unlikely that users will appreciate it any more than they do now.
Toward more predictable and reliable out-of-memory handling
Posted Dec 17, 2015 9:00 UTC (Thu) by rgb (subscriber, #57129) [Link]
Of course you don't really appreciate a crashing program, but I very much prefer that over the whole system freezing up. So I do appreciate a system responding to my input very much!
Toward more predictable and reliable out-of-memory handling
Posted Dec 21, 2015 16:13 UTC (Mon) by sam.thursfield (subscriber, #94496) [Link]
Toward more predictable and reliable out-of-memory handling
Posted Dec 23, 2015 1:28 UTC (Wed) by ploxiln (subscriber, #58395) [Link]
The last few times I ran into what should have been an OOM situation, everything just froze, for longer than I cared to wait (at least a few minutes). I've seen this for the whole system, and also for a couple processes in a memory cgroup. I've seen it with no swap at all, and also with some swap space. Freezing for ~15 seconds is worse than death IMHO.
I "devops" some servers that run some processes that occasionally act up, and I've resorted to writing a script that checks memory conditions every few seconds and if there isn't a nice large buffer of free+cache memory, goes about sending SIGTERM and then SIGKILL to the biggest members of the suspected set of processes (and notifying me).
Toward more predictable and reliable out-of-memory handling
Posted Dec 21, 2015 13:07 UTC (Mon) by sorokin (guest, #88478) [Link]
Can't the (1) be fixed by using something like posix_spawn and can't the (2) be fixed by mapping as MAP_PRIVATE only GOT and PLT and MAP_SHARED everything else in shared objects? As I understand in this case it will be possible to default to vm.overcommit_memory=2 and forget about oom-killer.
Toward more predictable and reliable out-of-memory handling
Posted Dec 23, 2015 0:39 UTC (Wed) by eternaleye (guest, #67051) [Link]
In particular, there are most certainly programs which allocate large amounts of address space without necessarily using that memory in the end.
In addition, changing every use of fork to posix_spawn is... a decidedly nontrivial undertaking.
Besides that, failing to check the return value of malloc() is a pervasive issue, which has been worsened by the default-overcommit behavior on Linux: it has trained people to believe that "malloc() never fails."
And then, if the user program DOES check the return value of malloc(), what are its options?
- Drop some internal caches
- Loop forever
- Die
i.e., the exact same issue as the OOM killer, with the added downside of not having the ability to take advantage of global information - say, page cache pages being clean.
And while it may _seem_ as though there's a benefit there, where the allocating process takes the hit, it's a false fairness - after all, a 3.9 GB allocation may have just succeeded for another program, and now your shell wants one measly page and can't get it.
Toward more predictable and reliable out-of-memory handling
Posted Apr 15, 2016 23:27 UTC (Fri) by oak (guest, #2786) [Link]
For example, programs that create threads. While the stack of the main thread grows dynamically, the rest of the threads have fixed-size stacks. While one can specify what size stack threads get, most programs don't.
By default, thread stack sizes on Linux are currently 8MB. If you have a lot of programs that use a lot of threads, without overcommit that would eat quite a bit of RAM.
Toward more predictable and reliable out-of-memory handling
Posted Dec 25, 2015 5:40 UTC (Fri) by Shabbyx (guest, #104730) [Link]
It's so annoying when OOM happens, it's infinitely more annoying when the kernel decides to freeze the whole system up for minutes, until it says "oh, sorry, I was supposed to kill this guy". I wished many times it would just go ahead with the kill immediately.
By 4.6, I might have already ended up with a new motherboard and more ram, so this fix would be too late.
Toward more predictable and reliable out-of-memory handling
Posted May 16, 2016 17:29 UTC (Mon) by Det (guest, #108823) [Link]
Sooo.. did you?
Toward more predictable and reliable out-of-memory handling
Posted Apr 12, 2019 9:44 UTC (Fri) by slack1256 (guest, #131415) [Link]
Toward more predictable and reliable out-of-memory handling
Posted Aug 23, 2016 6:09 UTC (Tue) by mikachu (guest, #5333) [Link]
http://lkml.iu.edu/hypermail/linux/kernel/1608.2/04321.html