The state of the pageout scalability patches
One of the core changes in this patch set remains the same: it still separates the least-recently-used (LRU) lists for pages backed by files and those backed by swap. When memory gets tight, it is generally preferable to evict page cache pages (those backed by files) rather than anonymous memory. File-backed pages are less likely to need to be written back to disk, and they are more likely to be well laid-out on disk, making it quicker to read them back in if necessary. Current Linux kernels keep both types of pages on the same LRU list, though, forcing the pageout code to scan over (potentially large numbers of) pages which it is not interested in evicting. Rik's patch improves this situation by splitting the LRU list in two, allowing the pageout code to look only at pages which might actually be candidates for eviction.
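The cost of a shared list can be illustrated with a toy model. The sketch below is plain Python rather than kernel code; `Page`, `scan_single_list()` and `scan_split_lists()` are hypothetical names invented for this illustration, and the 9:1 mix of anonymous to file-backed pages is an arbitrary assumption. It simply counts how many pages a reclaim pass must examine to find a given number of file-backed candidates.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Page:
    pfn: int            # page frame number (illustrative only)
    file_backed: bool   # True: page cache page, False: anonymous page


def scan_single_list(lru, wanted):
    """Walk one mixed LRU list, collecting only file-backed pages.

    Returns (candidates, pages_examined).
    """
    candidates, examined = [], 0
    for page in lru:
        if len(candidates) == wanted:
            break
        examined += 1
        if page.file_backed:
            candidates.append(page)
    return candidates, examined


def scan_split_lists(file_lru, wanted):
    """With separate lists, only the file list is walked at all."""
    candidates = []
    for page in file_lru:
        if len(candidates) == wanted:
            break
        candidates.append(page)
    return candidates, len(candidates)


# Assumed workload: nine anonymous pages for every file-backed page.
pages = [Page(pfn=i, file_backed=(i % 10 == 9)) for i in range(1000)]
mixed = deque(pages)
file_only = deque(p for p in pages if p.file_backed)

_, examined_mixed = scan_single_list(mixed, wanted=20)
_, examined_split = scan_split_lists(file_only, wanted=20)
print(examined_mixed, examined_split)   # prints "200 20"
```

In this model the mixed list forces the scanner past 180 anonymous pages to find 20 candidates; with the split, every page examined is a candidate.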
There comes a point, though, where anonymous pages need to be reclaimed as well. The kernel will make an effort to pick the best pages to evict by going for those which have not been recently referenced. Doing that, however, requires going through the entire list of anonymous pages, clearing the "referenced" bit on each. A large system can have many millions of anonymous pages; iterating over the entire set can take a long time. And, as it turns out, it's not really necessary.
The VM scalability patch set now changes that behavior by simply keeping a certain percentage of the system's anonymous pages on the inactive list - the first place the system looks for pages to evict. Those pages will drift toward the front of the list over time, but will be returned to the active list if they are used. Essentially, this patch is applying a form of the "referenced" test to a portion of anonymous memory - whether or not anonymous pages are being evicted at the time - rather than trying to check the referenced state of all anonymous pages when the kernel decides it needs to reclaim some of them.
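A rough model of that idea follows. It is a sketch under stated assumptions, not the patch's implementation: the `AnonLRU` class, its `inactive_fraction` parameter, and the 25% figure are all made up for illustration, and a Python set stands in for the hardware "referenced" bit.

```python
from collections import deque


class AnonLRU:
    """Toy model of active/inactive anonymous lists (not kernel code)."""

    def __init__(self, inactive_fraction):
        self.inactive_fraction = inactive_fraction  # hypothetical tunable
        self.active = deque()     # left = most recently activated
        self.inactive = deque()   # left = newest, right = next to evict
        self.referenced = set()   # stand-in for the "referenced" bit

    def add(self, page):
        self.active.appendleft(page)

    def touch(self, page):
        """Model a memory access marking the page as referenced."""
        self.referenced.add(page)

    def balance(self):
        """Deactivate the oldest active pages until the target is met."""
        total = len(self.active) + len(self.inactive)
        target = int(total * self.inactive_fraction)
        while len(self.inactive) < target and self.active:
            page = self.active.pop()
            self.referenced.discard(page)   # clear the bit on deactivation
            self.inactive.appendleft(page)

    def reclaim(self, count):
        """Evict up to `count` pages, looking only at the inactive list."""
        evicted = []
        scanned = 0
        limit = len(self.inactive)
        while len(evicted) < count and scanned < limit:
            page = self.inactive.pop()
            scanned += 1
            if page in self.referenced:
                self.referenced.discard(page)
                self.active.appendleft(page)   # used again: reactivate
            else:
                evicted.append(page)
        return evicted


lru = AnonLRU(inactive_fraction=0.25)
for pfn in range(100):
    lru.add(pfn)
lru.balance()          # pages 0..24 move to the inactive list
lru.touch(0)           # page 0 is used while it sits on the inactive list
evicted = lru.reclaim(3)
print(evicted)         # prints "[1, 2, 3]"
```

Page 0 was touched while inactive, so it returns to the active list instead of being evicted, and the reclaim pass never looks at the 75 pages that stayed active.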
Another set of patches addresses a different situation: pages which cannot be evicted at all. These pages might have been locked into memory with a system call like mlock(), be part of a locked SYSV shared memory region, or be part of a RAM disk, for example. They can be either page cache or anonymous pages. Either way, there is little point in having the reclaim code scan them, since it will not be possible to evict them. But, of course, the current reclaim code does have to scan over these pages.
This unneeded scanning, as it turns out, can be a problem. The extensive unevictable LRU document included with the patch claims:
Most of us are not currently working with systems of this size; one must spend a fair amount of money to gain the benefits of this sort of pathological behavior. Still, it seems like something which is worth fixing.
The solution, of course, is yet another list. When a page is determined to be unevictable, that page will go onto the special, per-zone unevictable list, after which the pageout code will simply not see it anymore. As a result of the variety of ways in which a page can become unevictable, the kernel will not always know at mapping time whether a specific page can go onto the unevictable list or not. So the pageout code must keep an eye out for those pages as it scans for reclaim candidates and shunt them over to the unevictable list as they are found. In relatively short order, the locked-down pages will accumulate in this list, freeing the pageout code to concentrate on pages it can actually do something about.
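The lazy "shunt them over as they are found" behavior can be sketched as follows. This is an illustrative Python model only; the `Zone` class, its list names, and the `mlocked` flag (standing in for any reason a page cannot be evicted) are assumptions made for the example and do not mirror the kernel's data structures.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Page:
    pfn: int
    mlocked: bool = False   # stand-in for any reason a page is unevictable


class Zone:
    """Toy per-zone lists; names are illustrative, not the kernel's."""

    def __init__(self, pages):
        self.lru = deque(pages)
        self.unevictable = deque()
        self.scanned = 0        # total pages examined by reclaim so far

    def reclaim(self, count):
        """Scan the LRU, evicting up to `count` pages.

        Unevictable pages met along the way are moved to their own
        list, so later passes never examine them again.
        """
        evicted = []
        while self.lru and len(evicted) < count:
            page = self.lru.popleft()
            self.scanned += 1
            if page.mlocked:
                self.unevictable.append(page)
            else:
                evicted.append(page)
        return evicted


# Assumed layout: 8 locked pages sit ahead of 4 evictable ones.
zone = Zone([Page(i, mlocked=(i < 8)) for i in range(12)])

zone.reclaim(2)
scanned_first = zone.scanned
zone.reclaim(2)
scanned_second = zone.scanned - scanned_first
print(scanned_first, scanned_second, len(zone.unevictable))  # prints "10 2 8"
```

The first pass pays for the locked pages once; the second pass examines only pages it can actually evict.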
Many of the concerns which have been raised about this patch set over the last year have been addressed. A few remain, though. Some of the new features require new page flags; these flags are in extremely short supply, so there is always pressure to find ways of implementing things which do not allocate more of them. There are a few too many configuration options and associated #ifdef blocks. And so on. Addressing these may take a while, but convincing everybody that these (rather fundamental) memory management changes are beneficial under all circumstances may take rather longer. So, while this patch set is making progress, a 2.6.27 merge is probably not in the cards.
Index entries for this article | |
---|---|
Kernel | Memory management/Page replacement algorithms |
Kernel | Scalability |
Posted Jun 19, 2008 4:07 UTC (Thu) by jwb (guest, #15467)
A system with 128GB of memory is completely within the reach of most businesses these days. I priced out a machine today with 32 CPU cores and 128GB of memory. It was $51000 at list price.
Posted Jun 19, 2008 6:16 UTC (Thu) by dlang (guest, #313)
I recently priced a 4-socket Opteron box with 128G of RAM at ~$25K. I ended up buying the 2-socket boxes, which are limited to 'only' 64G. These aren't the $2-3k 1U pizzabox machines that companies like to buy in large numbers for clusters, but if you have apps that don't cluster well (or have per-machine licensing) they can still be a _very_ good deal.