
VM changes in 2.6.6

Among the patches merged into the upcoming 2.6.6 release is a set of virtual memory changes. Changes to such a fundamental subsystem are always of interest, especially in the middle of a "stable" kernel series. Here, then, is a quick discussion of what has transpired.

In response to the reverse mapping VM discussions over the last month or so, Hugh Dickins has posted a series of patches which prepare the kernel for a full object-based reverse-mapping scheme and the removal of the per-page PTE chains. Hugh's patches carefully leave room for the inclusion of either his anonmm patches or Andrea Arcangeli's anon_vma work, though he seems to expect that anon_vma will win out. The full set of patches posted so far can be found in the "memory management" part of the "patches and updates" section, below.

Of those patches, the first three have been merged as of this writing. rmap 1 simply creates a new include file (linux/rmap.h) and moves many of the reverse-mapping declarations there. The second patch (rmap 2) changes the way the swap subsystem keeps track of swap cache pages; this change is needed to free up a couple of struct page fields for reverse-mapping use. Finally, rmap 3 finishes out the struct page work for the various architectures.

Later patches in Hugh's series get more ambitious; rmap 7 adds object-based reverse mapping for file-backed memory. Those patches have not been merged as of this writing, however.

A completely different set of patches which changes how the page cache works has been merged. The description of this work, as written by Andrew Morton, reads:

The basic problem which we (mainly Daniel McNeil) have been struggling with is in getting a really reliable fsync() across the page lists while other processes are performing writeback against the same file. It's like juggling four bars of wet soap with your eyes shut while someone is whacking you with a baseball bat.

This work made some fundamental changes in how page cache pages are tracked. The struct page structure has long included a field called "list", a list_head structure used to track the state of the page. When a page is marked dirty, or placed under I/O, it is put on a list along with other pages in the same state. Unfortunately, managing those lists as pages change state has proved to be difficult; hence the juggling analogy.
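
For contrast, the old scheme looked roughly like the fragment below. The field names (page->list, mapping->dirty_pages and friends) are as this author recalls them from the 2.6.5 sources; this is an illustration, not a verbatim excerpt.

    /*
     * Before 2.6.6 (roughly): each address_space kept clean_pages,
     * dirty_pages, io_pages, and locked_pages lists, and a page was
     * spliced from one list to another as its state changed, with
     * the mapping's lock held.
     */
    if (!TestSetPageDirty(page))
            list_move(&page->list, &mapping->dirty_pages);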

In response, the page lists have been removed altogether; as a side-benefit, this change shrinks struct page by eight bytes - a significant savings, considering that there is one such structure for every physical page in the system. The lists have been replaced with an enhanced radix tree which supports "tagging" of pages. When a page is dirtied, it is simply marked dirty in the radix tree, rather than being added to a list. Similarly, pages which are currently being written back to disk are marked. A new set of radix tree operations allows the kernel to find these pages when the need arises. Searching the tree is not as fast as following a dedicated list, but the radix tree implementation appears to be fast enough that few people will notice the difference.
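
As a rough illustration of how the new scheme fits together, here is a sketch of marking a page dirty and later finding dirty pages for writeback. The names (mapping->page_tree, radix_tree_tag_set(), radix_tree_gang_lookup_tag(), PAGECACHE_TAG_DIRTY) are taken from the merged code as this author reads it; the loop itself is a simplification, with locking and the actual page handling omitted.

    /*
     * Dirtying a page: just set a tag on its slot in the mapping's
     * radix tree (the tree lock is held by the caller).
     */
    radix_tree_tag_set(&mapping->page_tree, page->index,
                       PAGECACHE_TAG_DIRTY);

    /*
     * Writeback: find batches of dirty pages with a tagged "gang"
     * lookup, which walks the tree in ascending index order.
     */
    struct page *pages[16];
    pgoff_t index = 0;
    unsigned int i, found;

    do {
            found = radix_tree_gang_lookup_tag(&mapping->page_tree,
                            (void **)pages, index, 16,
                            PAGECACHE_TAG_DIRTY);
            for (i = 0; i < found; i++) {
                    index = pages[i]->index + 1;
                    /* lock the page, clear its dirty state, and
                       start I/O on it here */
            }
    } while (found);

Since the tagged lookup returns pages in ascending index order, a loop of this form is also where the file-offset writeback ordering described below comes from.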

These changes required touching a lot of VM and page cache code; every user of the page->list field had to be fixed. As a result of the changes, the order in which dirty pages are written to disk has changed; writing always happens in file-offset order now. This change appears to be an improvement for many applications; Andrew reports benchmark results as much as 30% faster. I/O can slow down in some situations involving parallel writes on SMP systems, however.


Aphorisms in the Linux community

Posted Apr 16, 2004 10:14 UTC (Fri) by Duncan (guest, #6647) [Link]

> It's like juggling four bars of wet soap
> with your eyes shut while someone is
> whacking you with a baseball bat.

LOL! I predict I'll see that in a sig line sometime in the future!

One of the interesting things about open source is the occasional insight it
offers into the simple humanity of the kernel hackers. Linus's always
real-world-applicable humor is another example, and one couldn't help but
draw one's own conclusions from the contrast between the pix of Bill Gates
with a cream pie in his face and Linus sitting on that dunk-tank board. I
know which guy's kernel *I'm* more likely to be comfortable with running
MY "mission critical applications" on, even if those "mission critical
applications" are no more than XMMS, my newsreader (PAN), and web browser
(Konqueror), all running at the same time.

Duncan

order of I/O

Posted Apr 16, 2004 22:53 UTC (Fri) by giraffedata (guest, #1954) [Link]

As a result of the changes, the order in which dirty pages are written to disk has changed; writing always happens in file-offset order now.

I find that hard to believe. The order in which pages are written to disk is controlled by the block layer/device driver, and tends to be disk address order. I presume this means to say the order in which the I/Os to write dirty pages to disk are requested of the block layer always happens in file-offset order (as opposed to order in which they became dirty) now. It's hard to see how that makes a big difference in performance, considering I/Os to clean all the dirty pages are requested at about the same time.

order of I/O

Posted Apr 17, 2004 0:17 UTC (Sat) by corbet (editor, #1) [Link]

Actually, the order is determined by the I/O scheduler. But the scheduler can only work with requests once they are handed to it. Changing that order can make a big difference in what the driver sees. And, apparently, in some situations, the performance difference can be significant - either better or worse.


Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds