task_diag and statx()
task_diag
There is no system call in Linux that provides information about running processes; instead, that information can be found in the /proc filesystem. Each process is represented by a directory under /proc; that directory contains a directory tree of its own with files providing information on just about every aspect of the process's existence. A quick look at the /proc hierarchy for a running bash instance reveals 279 files in 40 different directories. Whether one wants to know about control-group membership, resource usage, memory mappings, environment variables, open files, namespaces, out-of-memory-killer policies, or more, there is a file somewhere in that tree with the requisite information.
There are a lot of advantages to /proc, starting with the way it implements the classic Unix "everything is a file" approach. The information is readable as plain text, making it accessible from the command line or shell scripts. To a great extent, the interface is self-documenting, though some parts are more obvious than others. The contents of the stat file, for example, require an outside guide to be intelligible.
There are some downsides to this approach too, though. Accessing a file in /proc requires a minimum of three system calls — open(), read(), and close() — and that is after the file has been located in the directory hierarchy. Getting a range of information can require reading several files, with the system-call count multiplied accordingly. Some /proc files are expensive to read, and much of the resulting data may not be of interest to the reading process. There has been, to put it charitably, no unifying vision guiding the design of the /proc hierarchy, so each file there must be approached as a new parsing problem. It all adds up to a slow and cumbersome interface for applications that need significant amounts of information about multiple processes.
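To make the system-call cost concrete, here is a minimal sketch of fetching a single /proc file. The helper name read_proc_file() is made up for this illustration; the three calls it makes are the ones counted above, and a monitoring tool would repeat the whole sequence for every file of every process it watches.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>      /* open() */
#include <sys/types.h>  /* ssize_t */
#include <unistd.h>     /* read(), close() */

/*
 * Hypothetical helper: read up to size-1 bytes of a /proc file into buf
 * and NUL-terminate the result.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t read_proc_file(const char *path, char *buf, size_t size)
{
    ssize_t n;
    int fd;

    assert(path != NULL && buf != NULL && size > 0);

    fd = open(path, O_RDONLY);      /* system call 1 */
    if (fd < 0)
        return -1;

    /* A single read() is enough only when the file fits in buf. */
    n = read(fd, buf, size - 1);    /* system call 2 */
    close(fd);                      /* system call 3 */

    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}
```

Calling read_proc_file("/proc/self/stat", buf, sizeof(buf)) returns one line of space-separated fields, which the caller must still parse.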
A possible solution comes in the form of the task_diag patch set from Andrey Vagin; it adds a binary interface allowing the extraction of lots of process information from the kernel using a single request. The starting point is a file called /proc/task-diag, which an interested process must open. That process then uses the netlink protocol to send a message describing the desired information, which can then be read back from the same file.
The request for information is contained within this structure:
    struct task_diag_pid {
        __u64 show_flags;
        __u64 dump_strategy;
        __u32 pid;
    };
The dump_strategy field tells the kernel which processes are of interest. Its value can be one of TASK_DIAG_DUMP_ONE (the single process identified by pid), TASK_DIAG_DUMP_THREAD (all threads of pid), TASK_DIAG_DUMP_CHILDREN (all children of pid), TASK_DIAG_DUMP_ALL (all processes in the system), or TASK_DIAG_DUMP_ALL_THREADS (all threads in the system).
The show_flags field, instead, describes which information is to be returned for each process. With TASK_DIAG_SHOW_BASE, the "base" information will be returned:
    struct task_diag_base {
        __u32 tgid;
        __u32 pid;
        __u32 ppid;
        __u32 tpid;
        __u32 sid;
        __u32 pgid;
        __u8 state;
        char comm[TASK_DIAG_COMM_LEN];
    };
Other possible flags include TASK_DIAG_SHOW_CREDS to get credential information, TASK_DIAG_SHOW_VMA and TASK_DIAG_SHOW_VMA_STAT for information on memory mappings, TASK_DIAG_SHOW_STAT for resource-usage statistics, and TASK_DIAG_SHOW_STATM for memory-usage statistics. If this interface is merged into the mainline, other options will surely follow.
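As a rough illustration of how a request might be filled in, the sketch below asks for base and credential information about all children of a given process. It is self-contained only because it carries its own copy of the request structure; the numeric values given to the constants are placeholders (the real values live in the patch set's header), and the helper name task_diag_request_children() is invented here. Wrapping the request in a netlink message and writing it to /proc/task-diag is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the request structure shown above, with <stdint.h>
 * types standing in for the kernel's __u64 and __u32. */
struct task_diag_pid {
    uint64_t show_flags;
    uint64_t dump_strategy;
    uint32_t pid;
};

/* PLACEHOLDER values, assumed for illustration only. */
enum {
    TASK_DIAG_DUMP_ONE,
    TASK_DIAG_DUMP_THREAD,
    TASK_DIAG_DUMP_CHILDREN,
    TASK_DIAG_DUMP_ALL,
    TASK_DIAG_DUMP_ALL_THREADS
};
#define TASK_DIAG_SHOW_BASE   (1ULL << 0)   /* placeholder bit */
#define TASK_DIAG_SHOW_CREDS  (1ULL << 1)   /* placeholder bit */

/* Hypothetical helper: request base and credential data for every
 * child of the process identified by pid. */
void task_diag_request_children(struct task_diag_pid *req, uint32_t pid)
{
    assert(req != NULL);

    memset(req, 0, sizeof(*req));
    req->show_flags = TASK_DIAG_SHOW_BASE | TASK_DIAG_SHOW_CREDS;
    req->dump_strategy = TASK_DIAG_DUMP_CHILDREN;
    req->pid = pid;
}
```

One request of this shape would stand in for walking /proc and opening several files per child process.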
The patches have been through a few rounds of review. Presumably something along these lines will eventually be merged, but it is not clear that the level of review required to safely add a new major kernel API has happened. There is also no man page for this feature yet. So it would not be surprising if a few more iterations were required before this one is declared to be ready.
statx()
Information about files in Linux, as with all Unix-like systems, comes via the stat() system call and its variants. Developers have chafed against its limitations for a long time. This system call, being enshrined in the POSIX standard, cannot be extended to return more information. It will likely return information that the calling process doesn't need — a wasted effort that can, for some information and filesystems, be expensive. And so on. For these reasons, an extended stat() system call has been a topic of discussion for many years.
Back in 2010, David Howells proposed an xstat() call that addressed a number of these problems, but that proposal got bogged down in discussion without being merged. Six years later, David is back with a new version of this patch. Time will tell if he is more successful this time around.
The new system call is now called statx(); the proposed interface is:
int statx(int dfd, const char *filename, unsigned atflag, unsigned mask, struct statx *buffer);
The file of interest is identified by filename; that file is expected to be found in or underneath the directory indicated by the file descriptor passed in dfd. If dfd is AT_FDCWD, the filename is interpreted relative to the current working directory. If filename is null, information about the file represented by dfd is returned instead.
The atflag parameter modifies how the information is collected. If it is AT_SYMLINK_NOFOLLOW and filename is a symbolic link, information is returned about the link itself. Other atflag values include AT_NO_AUTOMOUNT to prevent filesystems from being automatically mounted by the request, AT_FORCE_ATTR_SYNC to force a network filesystem to update attributes from the server before returning the information, and AT_NO_ATTR_SYNC to avoid updating from the server, even at the cost of returning approximate information. That last option can speed things up considerably when querying information about files on remote filesystems.
The mask parameter, instead, specifies which information the caller is looking for. The current patch set has fifteen options, varying from STATX_MODE (to get the permission bits) to STATX_GEN to get the current inode generation number (on filesystems that have such a concept). That mask appears again in the returned structure to indicate which fields are valid; that structure looks like:
    struct statx {
        __u32 st_mask;          /* What results were written [uncond] */
        __u32 st_information;   /* Information about the file [uncond] */
        __u32 st_blksize;       /* Preferred general I/O size [uncond] */
        __u32 st_nlink;         /* Number of hard links */
        __u32 st_gen;           /* Inode generation number */
        __u32 st_uid;           /* User ID of owner */
        __u32 st_gid;           /* Group ID of owner */
        __u16 st_mode;          /* File mode */
        __u16 __spare0[1];
        __u64 st_ino;           /* Inode number */
        __u64 st_size;          /* File size */
        __u64 st_blocks;        /* Number of 512-byte blocks allocated */
        __u64 st_version;       /* Data version number */
        __s64 st_atime_s;       /* Last access time */
        __s64 st_btime_s;       /* File creation time */
        __s64 st_ctime_s;       /* Last attribute change time */
        __s64 st_mtime_s;       /* Last data modification time */
        __s32 st_atime_ns;      /* Last access time (ns part) */
        __s32 st_btime_ns;      /* File creation time (ns part) */
        __s32 st_ctime_ns;      /* Last attribute change time (ns part) */
        __s32 st_mtime_ns;      /* Last data modification time (ns part) */
        __u32 st_rdev_major;    /* Device ID of special file */
        __u32 st_rdev_minor;
        __u32 st_dev_major;     /* ID of device containing file [uncond] */
        __u32 st_dev_minor;
        __u64 __spare1[16];     /* Spare space for future expansion */
    };

Many of those fields match those found in the classic struct stat or are close to them. Times have been split into separate second and nanosecond fields, enabling both high-precision timestamps and year-2038 compliance. The __spare1 array at the end is meant to allow other types of data to be added in the future. Finally, st_information gives general information about the file, including whether it's encrypted, whether it's a kernel-generated file, or whether it's stored on a remote server.
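Two consequences of this layout for callers can be sketched in a few lines: a field should only be trusted when its bit is set in the returned st_mask, and the split time fields map directly onto a struct timespec. The structure below is a trimmed-down local stand-in for the proposed one, the value chosen for STATX_GEN is a placeholder, and both helper names are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>       /* struct timespec */

/* Local stand-in holding only the fields this sketch touches. */
struct statx_proposed {
    uint32_t st_mask;       /* which results were written */
    uint32_t st_gen;        /* inode generation number */
    int64_t  st_mtime_s;    /* last data modification time */
    int32_t  st_mtime_ns;   /* nanosecond part of the above */
};

#define STATX_GEN (1U << 0) /* PLACEHOLDER bit value, assumed */

/* Hypothetical helper: fetch the generation number only if the
 * filesystem actually supplied it. Returns 0 on success, -1 if not. */
int statx_gen(const struct statx_proposed *sx, uint32_t *gen)
{
    assert(sx != NULL && gen != NULL);

    if (!(sx->st_mask & STATX_GEN))
        return -1;          /* not valid for this file */
    *gen = sx->st_gen;
    return 0;
}

/* Hypothetical helper: combine the split second and nanosecond fields.
 * The caller is assumed to have checked st_mask already. */
struct timespec statx_mtime(const struct statx_proposed *sx)
{
    struct timespec ts;

    assert(sx != NULL);

    ts.tv_sec = (time_t)sx->st_mtime_s;
    ts.tv_nsec = sx->st_mtime_ns;
    return ts;
}
```

A caller that skips the st_mask check would read whatever happens to be in st_gen on filesystems that have no generation number.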
The only response to this patch set, as of this writing, came from Jeff Layton, who suggested "I think we really ought to resist excessive bikeshedding this time around". If the other developers accept that advice, then it's possible that an enhanced stat() interface might just get into the kernel sometime this year. Nobody will be able to complain that this particular change has been rushed.
task_diag and statx()
Posted May 4, 2016 9:58 UTC (Wed) by kolyshkin (guest, #34342) [Link]
Will LWN readers be interested in an article describing task-diag in greater detail?
task_diag and statx()
Posted May 4, 2016 10:47 UTC (Wed) by johannbg (subscriber, #65743) [Link]
Yeah why not?
task_diag and statx()
Posted May 4, 2016 20:26 UTC (Wed) by scottt (guest, #5028) [Link]
Yes, definitely!
task_diag and statx()
Posted May 8, 2016 8:37 UTC (Sun) by richmoore (guest, #53133) [Link]
task_diag and statx()
Posted Aug 24, 2016 9:56 UTC (Wed) by xman (guest, #46972) [Link]
task_diag and statx()
Posted May 4, 2016 10:00 UTC (Wed) by kolyshkin (guest, #34342) [Link]
task_diag and statx()
Posted Aug 24, 2016 9:57 UTC (Wed) by xman (guest, #46972) [Link]
task_diag and statx()
Posted May 4, 2016 11:53 UTC (Wed) by Sesse (subscriber, #53779) [Link]
task_diag and statx()
Posted May 4, 2016 12:52 UTC (Wed) by fishface60 (subscriber, #88700) [Link]
When the suggested new clone syscall (for returning a file descriptor) was discussed, it passed the size of the struct in for extensibility, so new information could be added at the end without requiring padding, by the API later supporting a larger struct. So you call it something like:

    struct statx buffer;
    statx(dfd, filename, 0, STATX_MODE|STATX_GEN, sizeof(buffer), &buffer);
task_diag and statx()
Posted May 4, 2016 13:03 UTC (Wed) by Sesse (subscriber, #53779) [Link]
task_diag and statx()
Posted May 4, 2016 19:20 UTC (Wed) by nybble41 (subscriber, #55106) [Link]
task_diag and statx()
Posted May 4, 2016 19:27 UTC (Wed) by Sesse (subscriber, #53779) [Link]
task_diag and statx()
Posted May 4, 2016 22:01 UTC (Wed) by nybble41 (subscriber, #55106) [Link]
If we were getting close to running out of mask bits, the logical thing to do, short of defining a new syscall, would be to add a single MASK_OTHER bit and a corresponding mask2 field in the structure which the caller would set before invoking statx(). (Which raises a question: Why not do the same with the existing mask parameter, which after all is already part of the structure? Is there some advantage to making the mask a parameter?)
task_diag and statx()
Posted May 4, 2016 22:43 UTC (Wed) by derobert (subscriber, #89569) [Link]
So, if future fields work like that too, they may be returned even if not requested. Thus, the struct needs to start big enough.
task_diag and statx()
Posted May 5, 2016 0:22 UTC (Thu) by jiiksteri (subscriber, #75247) [Link]
But that "it'll return it regardless" part is kind of dangerous, right? Because that gives sloppy $userspace_app the opportunity of relying on statx() always returning a valid statx->foo even when STATX_FOO isn't set in the mask parameter. And obviously because said $userspace_app is sloppy it doesn't check ->mask in the returned struct either.
I admit the example is a bit contrived, but if this sloppy $userspace_app ever becomes important enough, the kernel will have to yield and just state that this potentially expensive statx->foo will need to be returned regardless of the mask passed in.
This would never happen if the interface stated that _only_ the fields set in the mask parameter are ever set in the returned struct, period. And at that point it might just as well use the mask bits in the struct passed in.
task_diag and statx()
Posted May 5, 2016 11:18 UTC (Thu) by keeperofdakeys (guest, #82635) [Link]
task_diag and statx()
Posted May 5, 2016 15:13 UTC (Thu) by derobert (subscriber, #89569) [Link]
That's certainly a tradeoff in this design. You can't have a three-valued (need, want if cheap, never) bit, though, and even then someone could pass "want if cheap" and forget to check the return mask.
I wouldn't be surprised if concerns like yours have been part of the discussion of previous statx/xstat patches. As the article notes, it's been subject to a lot of design...
task_diag and statx()
Posted May 7, 2016 11:08 UTC (Sat) by jiiksteri (subscriber, #75247) [Link]
...or if later developments suddenly made it more expensive for, say, ext4 to return a piece of information it used to return by default because it was easy. I'm not sure they could stop returning it by default at that point.
So, in a way, filesystems that return extra information here without being asked restrict their own freedom of implementation. And of course, as you mention, this could make things difficult for other filesystems that _don't_ return the same information without the flag; I imagine the user whose apps break when changing filesystems would easily blame the (new) filesystem for something like this :)
> I wouldn't be surprised if concerns like yours have been part of the discussion of previous statx/xstat patches.
Most likely. These are smart people and I appreciate the work they do. I'm just playing the devil's advocate here.
task_diag and statx()
Posted May 4, 2016 12:11 UTC (Wed) by giggls (subscriber, #48434) [Link]
So to me implementing something like this looks like combining the bad part of both of these approaches.
task_diag and statx()
Posted May 4, 2016 15:32 UTC (Wed) by HenrikH (subscriber, #31152) [Link]
task_diag and statx()
Posted May 4, 2016 17:49 UTC (Wed) by zyga (subscriber, #81533) [Link]
task_diag and statx()
Posted May 4, 2016 17:49 UTC (Wed) by flussence (subscriber, #85566) [Link]
With the benefit of half a century or so in hindsight it doesn't seem excessive to say “yes, there's enough people doing this to warrant tuning it”. I for one would love to run htop sorted by CPU time and not have it permanently occupy a row of its own output.
task_diag and statx()
Posted May 4, 2016 20:18 UTC (Wed) by shemminger (subscriber, #5739) [Link]
task_diag and statx()
Posted May 4, 2016 23:56 UTC (Wed) by dw (subscriber, #12017) [Link]
task_diag and statx()
Posted May 18, 2016 17:48 UTC (Wed) by avagin (subscriber, #63724) [Link]
We tried to use this interface in the second version of the task_diag patches:
https://lkml.org/lkml/2015/7/6/142
There is an opinion that netlink isn't completely suitable for non-network tasks, and Andy Lutomirski has suggested adding new interfaces to fix the known issues with netlink:
https://www.mail-archive.com/netdev@vger.kernel.org/msg10...
task_diag and statx()
Posted May 6, 2016 15:13 UTC (Fri) by droundy (subscriber, #4559) [Link]
All I can think of is that it enables flexible returning of arbitrary length data through multiple reads. Is there anything else that provides a compelling advantage for netlink?
task_diag and statx()
Posted May 18, 2016 18:05 UTC (Wed) by avagin (subscriber, #63724) [Link]
task_diag and statx()
Posted May 18, 2016 17:19 UTC (Wed) by avagin (subscriber, #63724) [Link]
Here is a thread with our discussion about using netlink for task_diag: https://lkml.org/lkml/2015/12/15/520.