Seccomp and deep argument inspection
Kees Cook has been doing some thinking about plans for new seccomp features to work on soon. There were four separate areas that he was interested in, which he detailed in a lengthy mid-May message on the linux-kernel mailing list. One of those features, deep argument inspection, has been covered here before, but it would seem that we are getting closer to a resolution on how that all will work.
Deep arguments
Seccomp filtering (or "seccomp mode 2") allows a process to filter which system calls can be made by it or its threads—it can be used to "sandbox" a program such that it cannot make calls that it shouldn't. Those filters use the "classic" BPF (cBPF) language to specify which system calls and argument values to allow or disallow. The seccomp() system call is used to enable filtering mode or to load a cBPF filtering program. Those programs only have access to the values of the arguments passed to the system call; if those arguments are pointers, they cannot be dereferenced by seccomp, which means that accepting or rejecting the system call cannot depend on, for example, values in structures that are passed to system calls via pointers—or even string values.
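As an illustration of what such a filter can and cannot see, here is a minimal sketch of a cBPF program that inspects the system call number and one scalar argument. The choice of personality() as the target, the names sketch_filter, install_sketch_filter() and seccomp_sketch_demo(), and the restriction to two architectures are assumptions made for this example; none of them come from Cook's message. Note that the filter only ever reads fields of struct seccomp_data and never follows a pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/personality.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Assumption: the sketch only lists two little-endian 64-bit architectures. */
#if defined(__x86_64__)
#define SKETCH_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SKETCH_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only lists x86-64 and arm64"
#endif

/* Allow everything except personality() calls that are not a pure query. */
struct sock_filter sketch_filter[] = {
    /* [0] load the architecture, [1-2] kill the process if it is unexpected */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_ARCH, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* [3] load the syscall number, [4] skip to ALLOW unless personality() */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_personality, 0, 3),
    /* [5] load the low 32 bits of the first argument (a value, not a pointer) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
    /* [6] 0xffffffff means "query only": allow it, otherwise fall through */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0xffffffff, 1, 0),
    /* [7] fail the call with EPERM */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    /* [8] allow */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Load the filter into the calling thread; returns 0 on success. */
int install_sketch_filter(void)
{
    struct sock_fprog prog = {
        .len = sizeof(sketch_filter) / sizeof(sketch_filter[0]),
        .filter = sketch_filter,
    };

    assert(prog.len <= BPF_MAXINSNS);
    /* Required for unprivileged processes before loading a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog);
}

/*
 * Try the filter in a child so the caller stays unfiltered.
 * Returns 0 if the filter behaved as described, 1 if it did not,
 * and 2 if the filter could not be installed in this environment.
 */
int seccomp_sketch_demo(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return 2;
    if (pid == 0) {
        if (install_sketch_filter() != 0)
            _exit(2);
        if (personality(0xffffffff) == -1)      /* query: allowed */
            _exit(1);
        if (personality(PER_LINUX) != -1 || errno != EPERM)
            _exit(1);                           /* change: must fail */
        _exit(0);
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return 1;
    return WEXITSTATUS(status);
}
```

Calling seccomp_sketch_demo() from a test program exercises the filter without restricting the caller. If the argument were a pointer to a structure, all the filter could compare would be the address itself, which is the limitation discussed here.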
The reason that seccomp cannot dereference the pointers is to avoid the time-of-check-to-time-of-use (TOCTTOU) race condition, where user space can change the value of what is being pointed to between the time that the kernel checks it and the time that the value gets used. But certain system calls, especially newer ones like clone3() and openat2(), have some important arguments passed in structures via pointers. These new system calls are designed with an eye toward easily adding new arguments and flags by redefining the structure that gets passed; in his email, Cook called these "extensible argument" (or EA) system calls.
It does not make sense for seccomp to provide a mechanism to inspect the pointer arguments of every system call, he said: "[...] the grudging consensus was reached that having seccomp do this for ALL syscalls was likely going to be extremely disruptive for very little gain". But for the EA system calls (or perhaps only a subset of those), seccomp could copy the structure pointed to and make it available to the BPF program via its struct seccomp_data. That would mean that seccomp would need to change to perform that copy, which would require a copy_from_user() call, and affected system calls would need to be seccomp-aware so that they can use the cached copy if seccomp creates one.
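The copy-once idea can be modeled in user space. The sketch below is purely illustrative: struct ea_args, struct ea_cache, EA_FLAG_FORBIDDEN and the three functions are hypothetical names invented for this example, and the memcpy() merely stands in for the single copy_from_user() call. It shows why the system call has to consume the same snapshot that the filter checked.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical EA-style argument structure; not a real kernel type. */
struct ea_args {
    uint64_t flags;
    uint64_t mode;
};

/* Hypothetical flag that the sandbox policy rejects. */
#define EA_FLAG_FORBIDDEN 0x1ULL

/* Hypothetical per-call cache that the filtering layer would fill in. */
struct ea_cache {
    int valid;
    struct ea_args copy;
};

/* Stand-in for the single copy made before the filter runs. */
void ea_snapshot(struct ea_cache *cache, const struct ea_args *user)
{
    assert(cache && user);
    memcpy(&cache->copy, user, sizeof(cache->copy));
    cache->valid = 1;
}

/* The filter decides using only the snapshot, never the user memory. */
int ea_filter_allows(const struct ea_cache *cache)
{
    return cache->valid && !(cache->copy.flags & EA_FLAG_FORBIDDEN);
}

/*
 * A "seccomp-aware" call uses the cached copy when there is one;
 * without a cache it has to read the (possibly changed) user memory.
 */
uint64_t ea_flags_used_by_call(const struct ea_cache *cache,
                               const struct ea_args *user)
{
    return (cache && cache->valid) ? cache->copy.flags : user->flags;
}
```

If another thread rewrites the flags after ea_snapshot() has run, ea_flags_used_by_call() still returns the checked value when given the cache, while the cache-less path picks up the modified value. That second path is the TOCTTOU window.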
There are some other wrinkles to the problem, of course. The size of the structure passed to the EA system calls may grow over time in order to add new features. If the size is larger than expected on either side (user space or kernel), finding or filling zeroes in the "extra" space is specifically designed to mean that those new features are unused (the openat2() man page linked above has some good information on how this is meant to work). Since user space and the kernel do not have to be in lockstep, that will allow newer user-space programs to call into an older kernel and vice versa. But that also means that seccomp needs to be prepared to handle argument sizes larger (or smaller) than "expected" and ensure that the zero-filling is done correctly.
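The size rules described above can be sketched as a small user-space function. This is an analogue written for illustration, not the kernel's code: copy_versioned_struct() and the two struct how_v1/how_v2 layouts are hypothetical, and plain memory copies stand in for the user-space accessors.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical first and second revisions of an extensible structure. */
struct how_v1 {
    uint64_t flags;
    uint64_t mode;
};

struct how_v2 {
    uint64_t flags;
    uint64_t mode;
    uint64_t resolve;   /* field added in the newer revision */
};

/*
 * Copy a caller-supplied structure of usize bytes into a destination
 * that knows about ksize bytes:
 *  - usize < ksize: the caller is older; zero-fill the fields it lacks.
 *  - usize > ksize: the caller is newer; accept only if every byte the
 *    destination does not know about is zero, otherwise -E2BIG.
 * Returns 0 on success.
 */
int copy_versioned_struct(void *dst, size_t ksize,
                          const void *src, size_t usize)
{
    size_t common = usize < ksize ? usize : ksize;

    assert(dst && src);

    if (usize > ksize) {
        const unsigned char *tail = (const unsigned char *)src + ksize;

        for (size_t i = 0; i < usize - ksize; i++) {
            if (tail[i] != 0)
                return -E2BIG;  /* caller wants a feature we lack */
        }
    }

    memcpy(dst, src, common);
    if (usize < ksize)
        memset((unsigned char *)dst + usize, 0, ksize - usize);
    return 0;
}
```

Passing a struct how_v1 to a destination sized for struct how_v2 leaves resolve at zero, so the new feature reads as unused. Passing a struct how_v2 to a destination sized for struct how_v1 succeeds only while resolve is zero. A seccomp filter that receives a copy of the structure would need the same guarantees about the bytes it is allowed to read.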
It gets even more complicated because different threads might have different ideas of what the EA structure size is, Cook said:
He had suggestions of a few different possibilities to solve the problem, but seemed to prefer the zero-fill option:
Others commenting also seemed to prefer that option, though Jann Horn noted that there is no need to zero-fill beyond the size that the kernel knows about:
Implementing that new operation would require changes to cBPF, however, which is not going to happen, according to BPF maintainer Alexei Starovoitov: "cbpf is frozen." An alternative would be for seccomp to switch to extended BPF (eBPF) for its filters. Using eBPF would allow the filters to perform that operation themselves without adding any new opcodes, but switching to eBPF is something that Cook hopes to avoid. As he explained in a message back in 2018, eBPF is something of a fast-moving target, which worries him from a security standpoint: "[...] I want absolutely zero surprises when it comes to seccomp". Beyond that, eBPF would add a lot more code for the seccomp filter to interact with in potentially dangerous ways.
Aleksa Sarai, who is the developer behind the EA scheme, generally agreed with Cook's plan for handling those structures, but he raised another point. The structures may contain pointers—those cannot be dereferenced by seccomp either, of course. Should something be done so that the filters can access that data as well? When these "nested pointers" came up in another discussion, Linus Torvalds made it abundantly clear that he thinks that is not a problem that the kernel should deal with at all.
Less-deep arguments
A few days after his original post, Cook posted an item on the ksummit-discuss mailing list to suggest that there be a session at the (virtual) Kernel Summit in August to discuss these seccomp issues. Torvalds acknowledged that this kind of system call exists, but did not think there was much to discuss with regard to seccomp:
[...] And if you have some actual and imminent real security issue, you mention _that_ and explain _that_, and accept that maybe you need to do that expensive emulation (because the kernel people just don't care about your private hang-ups) or you need to explain why it's a real issue and why the kernel should help with your odd special case.
Cook seemed somewhat relieved in his response:
Christian Brauner, who has also been doing a lot of development in these areas, agreed that the filters could likely live without the ability to chase pointers any further than the top level. Sarai would like to see there at least be a path forward if requirements of that sort do arise, but seemed willing to keep things simple for now—perhaps forever.
io_uring
In his message on linux-kernel, Horn raised an interesting point for seccomp developers: handling io_uring. Since its introduction in early 2019, io_uring has rapidly added features that effectively allow routing around the normal system-call entry path, while still performing the actions that a seccomp filter might be trying to prevent.
Obviously, the filters could simply disallow the io_uring system calls entirely, but that may be problematic down the road. Sarai agreed that it is something that may need some attention. Cook said that he needed to look more closely at io_uring: "I thought this was strictly for I/O ... like it's named". Trying to filter based on the arguments to the io_uring system calls will be a difficult problem to solve, since the actual commands and their arguments are buried inside a ring buffer that lives in an mmap() region shared between the kernel and user space. Chasing pointers in that environment seems likely to require eBPF, or even stronger medicine.
It would seem that a reasonable path for inspecting the first level of structure "arguments" to some system calls has been identified. clone3() and openat2() are obvious candidates, since their flag arguments, which will help seccomp filters determine if the call is "reasonable" under the rules of the sandbox, live in such structures. On the other hand, complex, multiplexing system calls like ioctl() and bpf() were specifically mentioned as system calls for which it would not make sense to add the pointer-chasing feature. Though Cook did not put any timetable on his plans, one might think we will see this feature sometime before the end of the year.
Index entries for this article | |
---|---|
Kernel | Security/seccomp |
Security | Linux kernel/Seccomp |
Seccomp and deep argument inspection
Posted Jun 11, 2020 0:33 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]
Perhaps code can be added to flatten these syscalls into a long buffer, if this is really ever needed?
Seccomp and deep argument inspection
Posted Jun 11, 2020 2:15 UTC (Thu) by NYKevin (subscriber, #129325) [Link]
Regardless, this is obviously a lot more complicated than a single call to (the moral equivalent of) memcpy(), which is probably why Linus doesn't want to do it.
Seccomp and deep argument inspection
Posted Jun 11, 2020 1:26 UTC (Thu) by andresfreund (subscriber, #69562) [Link]
Seems like you'd have to have some integration at a different level, perhaps when io_uring gets SQEs from the ringbuffer.
Seccomp and deep argument inspection
Posted Jun 11, 2020 6:48 UTC (Thu) by roc (subscriber, #30627) [Link]
Seccomp and deep argument inspection
Posted Jun 11, 2020 7:05 UTC (Thu) by brauner (subscriber, #109349) [Link]
If I may fill in some history how we ended up with esyscalls. The versioning-by-size (vbs) idea behind extensible syscalls has already been expressed in sched_setattr() and perf_event_open() for a long time (with some ABI quirks). When I did clone3() we quickly shifted to a design that would allow for it to be easily extended or re-versioned. So I added a local and simple version of the later copy_struct_from_user() similar to what was done in other places to clone3() in 7f192e3cd316ba58c88dfa26796cf7 that implemented vbs.

+noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
+                                              struct clone_args __user *uargs,
+                                              size_t size)
+{
+        struct clone_args args;
+
+        if (unlikely(size > PAGE_SIZE))
+                return -E2BIG;
+
+        if (unlikely(size < sizeof(struct clone_args)))
+                return -EINVAL;
+
+        if (unlikely(!access_ok(uargs, size)))
+                return -EFAULT;
+
+        if (size > sizeof(struct clone_args)) {
+                unsigned char __user *addr;
+                unsigned char __user *end;
+                unsigned char val;
+
+                addr = (void __user *)uargs + sizeof(struct clone_args);
+                end = (void __user *)uargs + size;
+
+                for (; addr < end; addr++) {
+                        if (get_user(val, addr))
+                                return -EFAULT;
+                        if (val)
+                                return -E2BIG;
+                }
+
+                size = sizeof(struct clone_args);
+        }
+
+        if (copy_from_user(&args, uargs, size))
+                return -EFAULT;

At the same time, Aleksa was working on openat2() and copied the vbs logic from clone3() at which point we realized that it would probably make sense to add a copy_struct_from_user() that would implement vbs and expose it to the kernel in general. This logic was pulled in e524d16e7e324039f2a9f82e302f0a39ac7d5812 before openat2() landed. Then all the current custom vbs implementations were replaced by this (At least the ones where it could easily be done.) and the openat2() patchset switched over to it as well.
Seccomp and deep argument inspection
Posted Jun 11, 2020 17:42 UTC (Thu) by Jandar (subscriber, #85683) [Link]
> + return -EINVAL;
Doesn't this mean old user-space compiled with a smaller struct ceases to work?
Seccomp and deep argument inspection
Posted Jun 11, 2020 21:40 UTC (Thu) by brauner (subscriber, #109349) [Link]