Page migration
Memory migration patches have been circulating for some time now. The latest version is this patch set posted by Christoph Lameter. This patch deliberately does not solve the entire problem, but it does try to establish enough infrastructure that a full migration solution can be evolved eventually.
This patch does not automatically migrate memory for processes which have been moved; instead, it leaves the migration decision to user space. There is a new system call:
    long migrate_pages(pid_t pid, unsigned long maxnode,
                       unsigned long *old_nodes, unsigned long *new_nodes);
This call will attempt to move any pages belonging to the given process from old_nodes to new_nodes. There is also a new MPOL_MF_MOVE option to the set_mempolicy() system call which can be used to the same effect. Either way, user space can request that a given process vacate a set of nodes. This operation can be performed in response to an explicit move of the process itself (which might be done by a system scheduling daemon, for example), or in response to other events, such as the impending shutdown and removal of a node.
The implementation is simple for now: the code iterates over the process's memory and attempts to force each page needing migration to be swapped. When the process faults the page back in, it should then be allocated on the process's current node. The force-out process actually takes a few passes over the list; initially it passes over locked pages and just concerns itself with pages which are easy to evict. In later passes, it will wait for locked pages and do the hard work of getting the final pages out of memory.
Migrating pages by way of the swap device is not the most efficient way of
moving them across a NUMA system. Later work on the patch will be aimed at
adding direct node-to-node migration, and other features as well. In the
mean time, however, the developers would like to see the current
implementation merged in time for 2.6.15. Andrew Morton has expressed some reservations, however: he would
like to see an explanation of how this code can be made to work with near
complete reliability. There are a number of things which can prevent the
migration of pages; these include pages locked in place by user space, page
undergoing direct I/O, and more. Christoph responded that the patch will get there,
eventually. Whether this claim is sufficiently convincing to get the
migration patches into 2.6.15 remains to be seen.
Page migration
Posted Oct 27, 2005 22:31 UTC (Thu) by Duncan (guest, #6647) [Link]
> Migrating pages by way of the swap device is not the most
> efficient way of moving them across a NUMA system.
... Especially for those of us that have swap disabled! It'll be nice to
have migration working, but what's this with forcing it out to slow swap
so it can be faulted back in on the other node? It's likely that's so
slow the cost of doing it exceeds the potential benefit of NUMA optimized
memory accesses for the remaining lifetime of the process, in many cases!
Of course, if by "swapped out" and "faulted back in", it just means a trip
out of directly allocated memory into cache memory and faulted back in
from there, no big deal (unless no swap means it's disabled), but if it's
actually written to disk, that's /quite/ a bit of extra latency to make up
in NUMA optimization, enough so it's not likely to be worth it save for
processes running (and accessing that memory) for > an hour, anyway.
So... I guess my view on this depends on the defined value of
"eventually", altho I think I'd still prefer it wait out a version, to be
merged with .16 (or later), if it's all going to be swap dependent
for .15. After all, it's not like the patches won't still be there for
those that want them with .15, anyway, just not mainlined (-mm might be
fine).
Duncan
Why a system call?
Posted Oct 28, 2005 0:32 UTC (Fri) by xoddam (subscriber, #2322) [Link]
I don't understand how userspace is expected to know when and why it
would be worthwhile to move pages from node to node. We don't have
explicit system calls requesting that pages be written out to the swap
device or faulted back -- the OS handles it automatically and it's the
job of the VM system to do it well. Recent optimisations like swap
readahead are progress in that direction.
I imagine the intermediate use of the swap device is merely a stepping
stone for the implementor. I'm sure Christoph is aware of the price of
disk latency!
Perhaps nonlocal memory should be treated as a fast swap device after a
process has been migrated -- pages can be faulted across to local memory
when they are used on the new node. As an optimisation this could be
'extra lazy', since all pages are actually accessible -- eg. only pages
written to by the new node (and, if there is a good heuristic for this,
the most frequently read pages) need be copied to local memory.
Do NUMA systems have memory-to-memory copy operations which don't trash
the processor caches? I can imagine a "DMA" between node memories could
take place while the source pages remain readable by the processor.
Page migration
Posted Oct 31, 2005 13:58 UTC (Mon) by branden (guest, #7029) [Link]
Is there some overlap between this and suspend-to-disk functionality?
And if there isn't, should there be?