task_diag and statx()

By Jonathan Corbet
May 4, 2016

The interfaces supported by Linux to provide access to information about processes and files have literally been around for decades. One might think that, by this time, they would have reached a state of relative perfection. But things are not so perfect that developers are deterred from working on alternatives; the motivating factor in the two cases studied here is the same: reducing the cost of getting information out of the kernel while increasing the range of information that is available.

task_diag

There is no system call in Linux that provides information about running processes; instead, that information can be found in the /proc filesystem. Each process is represented by a directory under /proc; that directory contains a directory tree of its own with files providing information on just about every aspect of the process's existence. A quick look at the /proc hierarchy for a running bash instance reveals 279 files in 40 different directories. Whether one wants to know about control-group membership, resource usage, memory mappings, environment variables, open files, namespaces, out-of-memory-killer policies, or more, there is a file somewhere in that tree with the requisite information.

There are a lot of advantages to /proc, starting with the way it implements the classic Unix "everything is a file" approach. The information is readable as plain text, making it accessible from the command line or shell scripts. To a great extent, the interface is self-documenting, though some parts are more obvious than others. The contents of the stat file, for example, require an outside guide to be intelligible.

There are some downsides to this approach too, though. Accessing a file in /proc requires a minimum of three system calls — open(), read(), and close() — and that is after the file has been located in the directory hierarchy. Getting a range of information can require reading several files, with the system-call count multiplied accordingly. Some /proc files are expensive to read, and much of the resulting data may not be of interest to the reading process. There has been, to put it charitably, no unifying vision guiding the design of the /proc hierarchy, so each file there must be approached as a new parsing problem. It all adds up to a slow and cumbersome interface for applications that need significant amounts of information about multiple processes.
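To make that cost concrete, here is a minimal sketch of what a tool must do to learn the state of a single process. The helper names read_proc_file() and parse_stat() are invented for this illustration, and only the pid and state fields are extracted; a real consumer of the stat file would have several dozen more fields to deal with.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Read up to len-1 bytes of a /proc file into buf and NUL-terminate it.
 * Returns the number of bytes read, or -1 on error.  Every call costs
 * at least three system calls: open(), read(), and close().
 */
static ssize_t read_proc_file(const char *path, char *buf, size_t len)
{
	ssize_t n;
	int fd;

	if (len == 0)
		return -1;

	fd = open(path, O_RDONLY);		/* system call 1 */
	if (fd < 0)
		return -1;

	n = read(fd, buf, len - 1);		/* system call 2 */
	close(fd);				/* system call 3 */
	if (n < 0)
		return -1;

	buf[n] = '\0';
	return n;
}

/*
 * Extract the pid and state from a stat line such as "1234 (bash) S ...".
 * The command name may itself contain spaces and parentheses, so the
 * scan for the state resumes after the last ')'.
 * Returns 0 on success, -1 if the text does not parse.
 */
static int parse_stat(const char *text, int *pid, char *state)
{
	const char *p = strrchr(text, ')');

	if (!p || sscanf(text, "%d", pid) != 1)
		return -1;
	if (sscanf(p + 1, " %c", state) != 1)
		return -1;
	return 0;
}
```

Calling read_proc_file("/proc/self/stat", ...) and handing the result to parse_stat() yields one process's state for three system calls and one ad hoc parser; a ps-like tool repeats that for every process and for every file it needs.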

A possible solution comes in the form of the task_diag patch set from Andrey Vagin; it adds a binary interface allowing the extraction of lots of process information from the kernel using a single request. The starting point is a file called /proc/task-diag, which an interested process must open. That process then uses the netlink protocol to send a message describing the desired information, which can then be read back from the same file.

The request for information is contained within this structure:

    struct task_diag_pid {
	__u64   show_flags;
	__u64   dump_strategy;
	__u32   pid;
    };

The dump_strategy field tells the kernel which processes are of interest. Its value can be one of TASK_DIAG_DUMP_ONE (information about the single process identified by pid), TASK_DIAG_DUMP_THREAD (all threads of pid), TASK_DIAG_DUMP_CHILDREN (all children of pid), TASK_DIAG_DUMP_ALL (all processes in the system) or TASK_DIAG_DUMP_ALL_THREADS (all threads in the system).

The show_flags field, instead, describes which information is to be returned for each process. With TASK_DIAG_SHOW_BASE, the "base" information will be returned:

    struct task_diag_base {
	__u32   tgid;
	__u32   pid;
	__u32   ppid;
	__u32   tpid;
	__u32   sid;
	__u32   pgid;
	__u8    state;
	char    comm[TASK_DIAG_COMM_LEN];
    };

Other possible flags include TASK_DIAG_SHOW_CREDS to get credential information, TASK_DIAG_SHOW_VMA and TASK_DIAG_SHOW_VMA_STAT for information on memory mappings, TASK_DIAG_SHOW_STAT for resource-usage statistics, and TASK_DIAG_SHOW_STATM for memory-usage statistics. If this interface is merged into the mainline, other options will surely follow.
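As an illustration of how such a request might be filled in, consider the sketch below. It is not taken from the patch set: the numeric values given to the constants are placeholders (only their names are known here), the structure is redeclared with <stdint.h> types so that the fragment compiles outside the kernel tree, and init_children_request() is a name invented for this example. The netlink framing and the write to /proc/task-diag are left out entirely.

```c
#include <stdint.h>
#include <string.h>

/*
 * Placeholder values: the constants are named in the text above, but
 * the numbers assigned to them here are assumptions for illustration.
 */
enum {
	TASK_DIAG_DUMP_ONE,
	TASK_DIAG_DUMP_THREAD,
	TASK_DIAG_DUMP_CHILDREN,
	TASK_DIAG_DUMP_ALL,
	TASK_DIAG_DUMP_ALL_THREADS,
};

#define TASK_DIAG_SHOW_BASE	(1ULL << 0)	/* assumed bit position */
#define TASK_DIAG_SHOW_CREDS	(1ULL << 1)	/* assumed bit position */
#define TASK_DIAG_SHOW_STAT	(1ULL << 2)	/* assumed bit position */

/* Same fields as the request structure, with <stdint.h> types. */
struct task_diag_pid {
	uint64_t show_flags;
	uint64_t dump_strategy;
	uint32_t pid;
};

/*
 * Prepare a request for the selected information about every child
 * of the given process.
 */
static void init_children_request(struct task_diag_pid *req, uint32_t pid,
				  uint64_t show_flags)
{
	memset(req, 0, sizeof(*req));	/* also clears any padding bytes */
	req->show_flags = show_flags;
	req->dump_strategy = TASK_DIAG_DUMP_CHILDREN;
	req->pid = pid;
}
```

A caller interested in the base information and resource usage of all children of process 1 would pass TASK_DIAG_SHOW_BASE | TASK_DIAG_SHOW_STAT as show_flags; a single request of this kind stands in for a walk over many /proc directories.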

The patches have been through a few rounds of review. Presumably something along these lines will eventually be merged, but it is not clear that the level of review required to safely add a new major kernel API has happened. There is also no man page for this feature yet. So it would not be surprising if a few more iterations were required before this one is declared to be ready.

statx()

Information about files in Linux, as with all Unix-like systems, comes via the stat() system call and its variants. Developers have chafed against its limitations for a long time. This system call, being enshrined in the POSIX standard, cannot be extended to return more information. It will likely return information that the calling process doesn't need — a wasted effort that can, for some information and filesystems, be expensive. And so on. For these reasons, an extended stat() system call has been a topic of discussion for many years.

Back in 2010, David Howells proposed an xstat() call that addressed a number of these problems, but that proposal got bogged down in discussion without being merged. Six years later, David is back with a new version of this patch. Time will tell if he is more successful this time around.

The new system call is now called statx(); the proposed interface is:

	int statx(int dfd, const char *filename, unsigned atflag,
		  unsigned mask, struct statx *buffer);

The file of interest is identified by filename; that file is expected to be found in or underneath the directory indicated by the file descriptor passed in dfd. If dfd is AT_FDCWD, the filename is interpreted relative to the current working directory. If filename is null, information about the file represented by dfd is returned instead.

The atflag parameter modifies how the information is collected. If it is AT_SYMLINK_NOFOLLOW and filename is a symbolic link, information is returned about the link itself. Other atflag values include AT_NO_AUTOMOUNT to prevent filesystems from being automatically mounted by the request, AT_FORCE_ATTR_SYNC to force a network filesystem to update attributes from the server before returning the information, and AT_NO_ATTR_SYNC to avoid updating from the server, even at the cost of returning approximate information. That last option can speed things up considerably when querying information about files on remote filesystems.
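The dfd, filename, and atflag conventions described here appear to mirror those of the existing fstatat() system call, so the lookup behavior can be tried out today. The sketch below uses fstatat() as a stand-in for the proposed call; is_symlink_at() and size_at() are names made up for this example, and the sync-related flags (AT_FORCE_ATTR_SYNC and AT_NO_ATTR_SYNC) have no fstatat() counterpart and are not shown.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for fstatat() and the AT_* flags */
#endif
#include <fcntl.h>
#include <sys/stat.h>

/*
 * Return 1 if name, looked up relative to the directory dfd, is itself
 * a symbolic link, 0 if it is not, and -1 on error.  With
 * AT_SYMLINK_NOFOLLOW the call describes the link, not its target.
 */
static int is_symlink_at(int dfd, const char *name)
{
	struct stat st;

	if (fstatat(dfd, name, &st, AT_SYMLINK_NOFOLLOW) < 0)
		return -1;
	return S_ISLNK(st.st_mode) ? 1 : 0;
}

/*
 * Return the size of the file that name resolves to, following any
 * symbolic links along the way, or -1 on error.
 */
static long long size_at(int dfd, const char *name)
{
	struct stat st;

	if (fstatat(dfd, name, &st, 0) < 0)
		return -1;
	return (long long)st.st_size;
}
```

Passing AT_FDCWD as dfd makes both helpers resolve relative names against the current working directory. Passing a descriptor for /proc/self and the name "exe" shows the effect of the flag: the first helper reports on the link itself, while the second reports the size of the executable it points to.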

The mask parameter, instead, specifies which information the caller is looking for. The current patch set has fifteen options, ranging from STATX_MODE (to get the permission bits) to STATX_GEN (to get the current inode generation number, on filesystems that have such a concept). That mask appears again in the returned structure to indicate which fields are valid; that structure looks like:

    struct statx {
	__u32	st_mask;	/* What results were written [uncond] */
	__u32	st_information;	/* Information about the file [uncond] */
	__u32	st_blksize;	/* Preferred general I/O size [uncond] */
	__u32	st_nlink;	/* Number of hard links */
	__u32	st_gen;		/* Inode generation number */
	__u32	st_uid;		/* User ID of owner */
	__u32	st_gid;		/* Group ID of owner */
	__u16	st_mode;	/* File mode */
	__u16	__spare0[1];
	__u64	st_ino;		/* Inode number */
	__u64	st_size;	/* File size */
	__u64	st_blocks;	/* Number of 512-byte blocks allocated */
	__u64	st_version;	/* Data version number */
	__s64	st_atime_s;	/* Last access time */
	__s64	st_btime_s;	/* File creation time */
	__s64	st_ctime_s;	/* Last attribute change time */
	__s64	st_mtime_s;	/* Last data modification time */
	__s32	st_atime_ns;	/* Last access time (ns part) */
	__s32	st_btime_ns;	/* File creation time (ns part) */
	__s32	st_ctime_ns;	/* Last attribute change time (ns part) */
	__s32	st_mtime_ns;	/* Last data modification time (ns part) */
	__u32	st_rdev_major;	/* Device ID of special file */
	__u32	st_rdev_minor;
	__u32	st_dev_major;	/* ID of device containing file [uncond] */
	__u32	st_dev_minor;
	__u64	__spare1[16];	/* Spare space for future expansion */
    };

Many of those fields match those found in the classic struct stat or are close to them. Times have been split into separate second and nanosecond fields, enabling both high-precision timestamps and year-2038 compliance. The __spare1 array at the end is meant to allow other types of data to be added in the future. Finally, st_information gives general information about the file, including whether it's encrypted, whether it's a kernel-generated file, or whether it's stored on a remote server.
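Two habits follow from this layout: a field should only be trusted if its bit is set in the returned st_mask, and timestamps are compared as (seconds, nanoseconds) pairs. The sketch below shows both on a cut-down copy of the structure. The struct statx_subset type, the helper names, and the numeric value of STATX_GEN are all assumptions made for this illustration; <stdint.h> types stand in for the kernel's __u32 and __s64.

```c
#include <stdint.h>

#define STATX_GEN	(1U << 0)	/* placeholder value, not from the patch */

/* A few fields of the proposed struct statx, with <stdint.h> types. */
struct statx_subset {
	uint32_t st_mask;	/* which results were written */
	uint32_t st_gen;	/* inode generation number */
	int64_t  st_mtime_s;	/* last data modification time */
	int32_t  st_mtime_ns;	/* last data modification time (ns part) */
};

/*
 * Store the generation number in *gen and return 1 if the returned
 * mask marks st_gen as valid; otherwise leave *gen alone and return 0.
 */
static int get_generation(const struct statx_subset *sx, uint32_t *gen)
{
	if (!(sx->st_mask & STATX_GEN))
		return 0;
	*gen = sx->st_gen;
	return 1;
}

/*
 * Compare two modification times.  Returns a negative value if a is
 * older than b, zero if they are equal, and a positive value if a is
 * newer.  The 64-bit seconds field has no trouble with times past
 * 2^31 - 1 seconds (January 19, 2038).
 */
static int mtime_cmp(const struct statx_subset *a,
		     const struct statx_subset *b)
{
	if (a->st_mtime_s != b->st_mtime_s)
		return a->st_mtime_s < b->st_mtime_s ? -1 : 1;
	if (a->st_mtime_ns != b->st_mtime_ns)
		return a->st_mtime_ns < b->st_mtime_ns ? -1 : 1;
	return 0;
}
```

Comparing the seconds first and the nanoseconds only on a tie avoids folding both into a single number, which would need care to stay clear of overflow.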

The only response to this patch set, as of this writing, came from Jeff Layton, who suggested "I think we really ought to resist excessive bikeshedding this time around". If the other developers accept that advice, then it's possible that an enhanced stat() interface might just get into the kernel sometime this year. Nobody will be able to complain that this particular change has been rushed.

task_diag and statx()

Posted May 4, 2016 9:58 UTC (Wed) by kolyshkin (guest, #34342) [Link]

The task-diag interface was born as an answer to the complex and slow CRIU code that gathers information about the processes on the system (to be checkpointed). But the most obvious use case is to make utilities like ps and top fast again. I bet every seasoned sysadmin has seen a very slow "ps ax" and/or an unresponsive "top". Some testing with ps patched to use the task-diag interface showed a 5x to 10x speed improvement over one using the traditional /proc/$PID/ forest.

Will LWN readers be interested in an article describing task-diag in greater detail?

task_diag and statx()

Posted May 4, 2016 10:47 UTC (Wed) by johannbg (subscriber, #65743) [Link]

"Will LWN readers be interested in an article describing task-diag in greater details?"

Yeah why not?

task_diag and statx()

Posted May 4, 2016 20:26 UTC (Wed) by scottt (guest, #5028) [Link]

> Will LWN readers be interested in an article describing task-diag in greater details?

Yes, definitely!

task_diag and statx()

Posted May 8, 2016 8:37 UTC (Sun) by richmoore (guest, #53133) [Link]

I'd definitely be interested.

task_diag and statx()

Posted Aug 24, 2016 9:56 UTC (Wed) by xman (guest, #46972) [Link]

Like in other cases, CRIU just highlighted the extant pain. This has been annoying in so many ways. Beyond just simple efficiency, having the potential for parsing errors with such primitive functions of the OS seemed pretty crazy.

task_diag and statx()

Posted May 4, 2016 10:00 UTC (Wed) by kolyshkin (guest, #34342) [Link]

There is some more information about task-diag at https://criu.org/Task-diag

task_diag and statx()

Posted May 4, 2016 11:53 UTC (Wed) by Sesse (subscriber, #53779) [Link]

Curious to see the “spare space” at the end. Why doesn't it simply set a size-of-struct member at the start? It's what Win32 has done for years, with great success.

task_diag and statx()

Posted May 4, 2016 12:52 UTC (Wed) by fishface60 (subscriber, #88700) [Link]

When the suggested new clone syscall (for returning a file descriptor) was discussed, it passed the size of the struct in for extensibility, so new information could be added at the end without requiring padding, by the API later supporting a larger struct. So you call it something like:

struct statx buffer;
statx(dfd, filename, 0, STATX_MODE|STATX_GEN, sizeof(buffer), &buffer);

task_diag and statx()

Posted May 4, 2016 13:03 UTC (Wed) by Sesse (subscriber, #53779) [Link]

Yes, that would work equally well.

task_diag and statx()

Posted May 4, 2016 19:20 UTC (Wed) by nybble41 (subscriber, #55106) [Link]

There is already a parameter controlling which fields are accessed (mask), so it shouldn't even be necessary to pass the structure size. If any additional fields are appended in the future they will just be ignored unless the corresponding mask bits are set, so programs using the older, smaller structure (and none of the new mask bits) won't see any difference.

task_diag and statx()

Posted May 4, 2016 19:27 UTC (Wed) by Sesse (subscriber, #53779) [Link]

As long as you're not out of mask bits? =)

task_diag and statx()

Posted May 4, 2016 22:01 UTC (Wed) by nybble41 (subscriber, #55106) [Link]

You would need more mask bits, but I presume we wouldn't want to add more fields without corresponding mask bits anyway. That just gets back to the stat() API and the overhead of returning data the caller doesn't need.

If we were getting close to running out of mask bits, the logical thing to do, short of defining a new syscall, would be to add a single MASK_OTHER bit and a corresponding mask2 field in the structure which the caller would set before invoking statx(). (Which raises a question: Why not do the same with the existing mask parameter, which after all is already part of the structure? Is there some advantage to making the mask a parameter?)

task_diag and statx()

Posted May 4, 2016 22:43 UTC (Wed) by derobert (subscriber, #89569) [Link]

I believe the idea of the mask bits is so the caller can indicate "I really want this, even if it's expensive." If it's not expensive, it'll return it regardless; for example, it appears ext4 returns create time regardless of the mask https://git.kernel.org/cgit/linux/kernel/git/dhowells/lin...

So, if future fields work like that too, they may be returned even if not requested. Thus, the struct needs to start big enough.

task_diag and statx()

Posted May 5, 2016 0:22 UTC (Thu) by jiiksteri (subscriber, #75247) [Link]

> I believe the idea of the mask bits is so the caller can indicate "I really want this, even if its expensive." If its not expensive, it'll return it regardless

But that "it'll return it regardless" part is kind of dangerous, right? Because that gives sloppy $userspace_app the opportunity of relying on statx() always returning a valid statx->foo even when STATX_FOO isn't set in the mask parameter. And obviously because said $userspace_app is sloppy it doesn't check ->mask in the returned struct either.

I admit the example is a bit contrived, but if this sloppy $userspace_app ever becomes important enough, the kernel will have to yield and just state that this potentially expensive statx->foo will need to be returned regardless of the mask passed in.

This would never happen if the interface stated that _only_ the fields set in the mask parameter are ever set in the returned struct, period. And at that point it might just as well use the mask bits in the struct passed in.

task_diag and statx()

Posted May 5, 2016 11:18 UTC (Thu) by keeperofdakeys (guest, #82635) [Link]

That is a great point, and is related to ensuring unused flag bits are indeed zero https://lwn.net/Articles/588444/. The truth is, even if your documentation says one thing, you can't go back on an interface if something uses it - otherwise you'd break userspace.

task_diag and statx()

Posted May 5, 2016 15:13 UTC (Thu) by derobert (subscriber, #89569) [Link]

Mainly that'd be a risk that an app would break if the underlying filesystem were changed. E.g., it might work on ext4 but not btrfs (I haven't actually checked if they return different things).

That's certainly a tradeoff in this design. You can't have a three-valued (need, want if cheap, never) bit, though, and even then someone could pass "want if cheap" and forget to check the return mask.

I wouldn't be surprised if concerns like yours have been part of the discussion of previous statx/xstat patches. As the article notes, it's been subject to a lot of design...

task_diag and statx()

Posted May 7, 2016 11:08 UTC (Sat) by jiiksteri (subscriber, #75247) [Link]

> Mainly that'd be a risk that an app would break if the underlying filesystem were changed. E.g., it might work on ext4 but not btrfs (I haven't actually checked if they return different things).

...or if later developments suddenly made it more expensive for, say, ext4 to return a piece of information it used to return by default because it was easy. I'm not sure they could stop returning it by default at that point.

So, in a way, filesystems that return extra information here without being asked restrict their own freedom of implementation. And of course, as you mention, this could make things difficult for other filesystems that _don't_ return the same information without the flag; I imagine the user whose apps break when changing filesystems would easily blame the (new) filesystem for something like this :)

> I wouldn't be surprised if concerns like yours have been part of the discussion of previous statx/xstat patches.

Most likely. These are smart people and I appreciate the work they do. I'm just playing the devil's advocate here.

task_diag and statx()

Posted May 4, 2016 12:11 UTC (Wed) by giggls (subscriber, #48434) [Link]

I once learned that, in contrast to other systems like Windows, Unix-like systems have many simple system calls instead of fewer complicated ones.

So to me implementing something like this looks like combining the bad part of both of these approaches.

task_diag and statx()

Posted May 4, 2016 15:32 UTC (Wed) by HenrikH (subscriber, #31152) [Link]

And which are those bad parts and why are they bad? Simply stating that something is done way X in Windows does not make it inherently bad.

task_diag and statx()

Posted May 4, 2016 17:49 UTC (Wed) by zyga (subscriber, #81533) [Link]

The article clearly states that this is not new data, just data exposed from various places in one system call. The article also states that this makes some operations faster and easier (and I'd add myself, that unless the task at hand is frozen, also atomic). The downside of the existing, fine grained interface is that it is slower.

task_diag and statx()

Posted May 4, 2016 17:49 UTC (Wed) by flussence (subscriber, #85566) [Link]

Think of it as premature optimization. Windows rushed into adding as many features as possible (which is now a perpetual back-compat burden) while Unix gave people a few RISC-ish syscalls and left them to their own devices.

With the benefit of half a century or so in hindsight it doesn't seem excessive to say “yes, there's enough people doing this to warrant tuning it”. I for one would love to run htop sorted by CPU time and not have it permanently occupy a row of its own output.

task_diag and statx()

Posted May 4, 2016 20:18 UTC (Wed) by shemminger (subscriber, #5739) [Link]

Why is this using a system call and not something like generic netlink which is both extensible and message based?

task_diag and statx()

Posted May 4, 2016 23:56 UTC (Wed) by dw (subscriber, #12017) [Link]

Indeed, there is already a netlink task interface which provides at least some of these fields

task_diag and statx()

Posted May 18, 2016 17:48 UTC (Wed) by avagin (subscriber, #63724) [Link]

I think you mean taskstats. It doesn't handle user and pid namespaces properly. And it works only from the root user namespace.

We tried to use this interface in the second version of the task_diag patches:
https://lkml.org/lkml/2015/7/6/142

There is an opinion that netlink isn't completely suitable for non-network tasks. And Andy Lutomirski suggested adding a new interface to fix known issues with netlink:
https://www.mail-archive.com/netdev@vger.kernel.org/msg10...

task_diag and statx()

Posted May 6, 2016 15:13 UTC (Fri) by droundy (subscriber, #4559) [Link]

I was wondering the opposite: why is task_diag not implemented as a system call? What are the advantages of using the netlink protocol?

All I can think of is that it enables flexible returning of arbitrary length data through multiple reads. Is there anything else that provides a compelling advantage for netlink?

task_diag and statx()

Posted May 18, 2016 18:05 UTC (Wed) by avagin (subscriber, #63724) [Link]

Netlink allows the protocol to be extended without breaking backward compatibility. You can add new attributes or append new fields to an existing one.

task_diag and statx()

Posted May 18, 2016 17:19 UTC (Wed) by avagin (subscriber, #63724) [Link]

The first versions of task_diag used netlink sockets, but there were some questions about security issues and about the handling of user and pid namespaces.
Here is a thread with our discussion about using netlink for task_diag: https://lkml.org/lkml/2015/12/15/520.