The Sequoia seq_file vulnerability
A local root hole in the Linux kernel, called Sequoia, was disclosed by Qualys on July 20. A full system compromise is possible until the kernel is patched (or mitigations that may not be fully effective are applied). At its core, the vulnerability relies on a path through the kernel where 64-bit size_t values are "converted" to signed integers, which effectively results in an overflow. The flaw was reported to Red Hat on June 9, along with a local systemd denial-of-service vulnerability, leading to a kernel crash, found at the same time. Systems with untrusted local users need updates for both problems applied as soon as they are available—out of an abundance of caution, other systems likely should be updated as well.
Down in the guts of the kernel's seq_file interface, which is used for handling virtual files in /proc and the like, buffers are needed to store each line of the file's "contents". To start, a page of memory is allocated for the buffer, but if that is not sufficient, a new buffer that is twice the size of the old one is allocated. This is all done using a size_t, which is an unsigned 64-bit quantity (on x86_64) that is large enough to hold the results, so "the system would run out of memory long before this multiplication overflows".
But that value (m->size in the advisory) is passed to other functions that expect a signed 32-bit integer. In particular, the exploit uses the output of /proc/self/mountinfo to get to a place in the kernel where that is the case. The attacker can create a directory hierarchy with a path length larger than 1GB (roughly one-million nested directories), bind-mount it into a user namespace, and then delete the directory. In the namespace, the attacker opens and reads mountinfo, which causes the string "//deleted" to be written outside of the seq_file buffer. The seq_file interface creates a buffer that is 2GB in size, but down in the code that writes the string, that value gets interpreted as -2GB, which is used to calculate where to do the write. That is far outside the buffer, of course, and so it overwrites memory at a known offset elsewhere in the vmalloc() region.
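The arithmetic behind that overwrite can be modeled in a few lines of user-space C. This is a minimal sketch, not the kernel code: the helper names, the 4096-byte page size, and the idea that the string is placed by subtracting its length from the end of the buffer are all assumptions made for illustration. It also relies on the GCC/Clang behavior of reducing an out-of-range value modulo 2^32 when converting to int, which the C standard leaves implementation-defined.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE ((size_t)4096)

/* Length of "//deleted" including its terminating NUL byte. */
#define DELETED_LEN 10

/*
 * Hypothetical helper: return the buffer size reached by starting with
 * one page and doubling until `needed` bytes fit. Nothing is allocated;
 * only the size arithmetic is modeled.
 */
size_t grown_size(size_t needed)
{
    size_t size = SKETCH_PAGE_SIZE;

    /* Keep the doubling itself from wrapping around in size_t. */
    assert(needed <= SIZE_MAX / 2);

    while (size < needed)
        size <<= 1;
    return size;
}

/*
 * Hypothetical model of a callee that takes the buffer length as an int.
 * With GCC and Clang, a 2GB size_t becomes INT_MIN here.
 */
int length_as_int(size_t size)
{
    return (int)size;
}

/*
 * Offset, relative to the start of the buffer, at which a string placed
 * at the end of the buffer would begin, computed from the int length.
 */
long long write_offset(size_t size)
{
    return (long long)length_as_int(size) - DELETED_LEN;
}
```

For a path just over 1GB, grown_size() yields 2GB, length_as_int() turns that into -2GB, and write_offset() lands ten bytes below that, far outside the buffer. For a one-page buffer the same calculation gives a harmless offset of 4086.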
As might be guessed, writing a ten-byte fixed string outside of the buffer is only the start of a complicated, but easily replicated, series of gyrations leading to a root shell. The Qualys advisory goes into great detail about the journey, which involves several different kernel features that have come about over the last decade or so: user namespaces, BPF, and user-space page fault handling (or FUSE, which is much older). But in the end, it is the mishandling of the integer buffer size that gets the foot in the door; the fix is something of a band-aid that simply rejects seq_file buffer allocations that get "too large".
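The shape of such a "reject what is too large" check can be sketched in user space. The function name, the use of malloc(), and the particular cap (the largest page-aligned value that still fits in an int, assuming 4096-byte pages) are assumptions of this sketch rather than a copy of the kernel change.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed values for the sketch; the kernel derives its own from PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096
#define SKETCH_MAX_SIZE  ((size_t)(INT_MAX & ~(SKETCH_PAGE_SIZE - 1)))

/*
 * Hypothetical allocator wrapper: refuse any request whose size could
 * not be represented as a non-negative int further down the call chain.
 * Returns NULL for oversized requests or when malloc() fails.
 */
void *checked_buf_alloc(size_t size)
{
    /* The cap must itself be representable as an int. */
    assert(SKETCH_MAX_SIZE <= (size_t)INT_MAX);

    if (size > SKETCH_MAX_SIZE)
        return NULL;
    return malloc(size);
}
```

With a cap like this, a 2GB request fails cleanly at allocation time, so no callee ever sees a length that turns negative when treated as an int. It does nothing about the callees themselves, which is why it reads as a band-aid.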
Qualys said that it has exploit code that can gain root on "default installations of Ubuntu 20.04, Ubuntu 20.10, Ubuntu 21.04, Debian 11, and Fedora 34 Workstation"; furthermore, "other Linux distributions are certainly vulnerable, and probably exploitable". One of the suggested mitigations is turning off the ability for unprivileged users to create user namespaces (via /proc/sys/kernel/unprivileged_userns_clone), because that will stop their ability to mount the deep directory hierarchy that results in the 1GB+ path name. But even without user namespaces, there may be an alternative:
However, the attacker may mount a long directory via FUSE instead; we have not fully explored this possibility, because we accidentally stumbled upon CVE-2021-33910 in systemd: if an attacker FUSE-mounts a long directory (longer than 8MB), then systemd exhausts its stack, crashes, and therefore crashes the entire operating system (a kernel panic).
The systemd bug, which Qualys also describes well, stems from switching from strdup() to strdupa() in the mount-path handling code. That switched from heap-based allocations to stack-based ones, which provides the opportunity to exhaust the 8MB (by default) stack. So an overlong path that is found when systemd is parsing /proc/self/mountinfo will result in a segmentation fault; Linux is decidedly unhappy about running without an init process, so it panics as well.
A user can trigger the crash by mounting a FUSE filesystem, creating a directory path longer than 8MB, moving the FUSE filesystem to that directory, and causing systemd to parse the mountinfo file again. The fix in this case is to switch back to using strdup() as it was before an April 2015 commit.
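The difference between the two allocation strategies comes down to what happens when the input is huge. The following sketch is not systemd code; heap_copy() is a hypothetical stand-in for strdup(), written with malloc() and memcpy() so that it depends only on standard C. A heap-based copy either succeeds or returns NULL, whereas a stack-based copy of the same string has no way to report that the stack is too small.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for strdup(): copy `s` into heap memory.
 * Returns NULL when the allocation fails, so the caller can recover.
 * The caller owns the result and must free() it.
 */
char *heap_copy(const char *s)
{
    size_t len;
    char *copy;

    assert(s != NULL);

    len = strlen(s);
    copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;

    memcpy(copy, s, len + 1); /* includes the terminating NUL */
    return copy;
}
```

A 16MB path, twice the default stack limit mentioned above, is an ordinary request for the heap allocator; the price is that every caller has to remember to free the copy.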
The systemd crash was found while developing the Sequoia exploit, which uses BPF code to find and then overwrite the modprobe_path variable in the kernel. That path is used to run an executable as root (normally /sbin/modprobe) when a kernel module gets loaded. Pointing that at a different path, where an attacker-controlled executable lives, gives root privileges.
One would expect the BPF verifier to thwart any attempts to execute a harmful program, since that is the primary mechanism to restrict users from loading unsafe BPF programs. But that is where user-space page-fault handling comes into play: the userfaultfd() system call allows the exploit to effectively pause the kernel after the BPF verifier has been run. Then the overwrite of "//deleted" can be arranged to land at the "right" place in the BPF code, which is loaded into the vmalloc() region. userfaultfd() allows the exploit to change the BPF code after it has been verified but before it gets JIT-compiled, thus evading the verifier.
The out-of-bounds write of "//deleted" is turned into an information disclosure that allows the exploit code to have limited control over what gets overwritten. After that, the techniques from a 2020 BPF vulnerability found by Manfred Paul were used to "transform this limited out-of-bounds write into an arbitrary read and write of kernel memory".
The six-step "Exploitation overview" section of the advisory gives a nice
overview of the exploit, while the following section gives all the gory
details. It makes for interesting reading for those who are curious about
kernel exploits and the intricate steps that are needed to make them work.
The reports of both flaws were in the hands of Red Hat for nearly a month until they were reported to the closed mailing lists for kernel and distribution security reports on July 6. Two weeks after that, the coordinated release of the advisories was made and distribution updates started rolling out. It is not at all clear why there was a month-long delay, since neither of the fixes seems particularly challenging. Maybe that time was spent looking for other similar problems in the kernel and systemd.
Clearly the integer conversion at the heart of the exploit needed fixing; one wonders how many other size_t-to-int problems of that sort still linger in the kernel. One might also wonder what kind of a can of worms was opened when userfaultfd() was added to the kernel; it provides a way for user space to semi-arbitrarily pause the kernel in spots of its choosing. That may come in handy again for kernel exploits down the road.
Meanwhile, systemd has hopefully scrutinized any other stack-based allocations it is doing, especially for user-controllable values like paths. While exhausting the stack is often not a security problem for user-space programs (except, of course, for exploits like Stack Clash), systemd sits in a sensitive place in many Linux systems. Any user-controlled means to bring it down is a clear route to a denial of service on the system.
Integer-conversion problems of the sort we see here are something that would likely be impossible in some other languages (e.g. Rust). Stack (or memory) exhaustion, on the other hand, is not really something that can be handled at the language level; hitting a resource limit must be handled somehow and crashing a user-space program is often the least harmful thing the operating system can do. But at least we now have N-2 bugs in our systems; unfortunately, the unknown N is likely distressingly large.
Narrowing conversion
Posted Jul 21, 2021 23:05 UTC (Wed) by tialaramex (subscriber, #21167) [Link]
This is called an implicit narrowing conversion. Many modern languages outright forbid this, requiring that the programmer *explicitly* choose to perform a narrowing conversion or (perhaps better) even requiring that they explicitly decide what happens when the big thing won't fit in the small thing.
You can imagine that in kernel code, some sort of error or panic should likely be the choice, if the programmer did not instead decide to approach the problem in a different way to avoid a narrowing conversion. Apparently MSVC on Windows warns out of the box if you do an implicit narrowing conversion but alas GCC does not, even under -Wall, although -Wconversion exists.
Since it's the topic du jour: in Rust you can't have an implicit narrowing conversion, but the language does provide an explicit conversion that can silently lose information in narrowing. Programmers might write:
let signedInt = sixtyFourBitNumber as i32;
Whereas perhaps they ought to write:
let signedInt: i32 = sixtyFourBitNumber.try_into().unwrap(); // panic if it doesn't fit
or indeed:
let signedInt: i32 = sixtyFourBitNumber.try_into()?; // only compiles if the function we do this in has a suitable error return
let signedInt: i32 = sixtyFourBitNumber.try_into().unwrap_or(SAFE_DEFAULT); // we're confident SAFE_DEFAULT is fine
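For comparison, C (the language of the kernel code in question) has no built-in equivalent of a fallible conversion, but one can be written by hand. This is a minimal sketch; size_to_int() is a hypothetical helper name, not an existing kernel or libc function.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical checked narrowing: store `in` into `*out` only when it
 * fits in an int. Returns false, leaving `*out` untouched, otherwise.
 */
bool size_to_int(size_t in, int *out)
{
    assert(out != NULL);

    if (in > (size_t)INT_MAX)
        return false;

    *out = (int)in;
    return true;
}
```

The caller is forced to decide what a failed conversion means, which is the same decision the explicit Rust forms push onto the programmer.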
Narrowing conversion
Posted Jul 22, 2021 0:08 UTC (Thu) by roc (subscriber, #30627) [Link]
Narrowing conversion
Posted Jul 22, 2021 0:41 UTC (Thu) by Gaelan (guest, #145108) [Link]
Narrowing conversion
Posted Jul 22, 2021 1:21 UTC (Thu) by roc (subscriber, #30627) [Link]
I'm not sure what the best way to proceed from here is. One option would be to introduce that trait+method for truncating conversions, and deprecate or ban lossy "as" scalar conversions in a future edition. But obviously a lot of existing code would be affected.
Narrowing conversion
Posted Jul 23, 2021 0:47 UTC (Fri) by khim (subscriber, #9252) [Link]
> But obviously a lot of existing code would be affected.
Isn't it something Rust editions are supposed to cover?
Narrowing conversion
Posted Jul 23, 2021 7:50 UTC (Fri) by Fowl (subscriber, #65667) [Link]
Narrowing conversion
Posted Jul 23, 2021 9:54 UTC (Fri) by khim (subscriber, #9252) [Link]
Not really. In Rust, unlike in C++, editions are supposed to coexist. You can write one crate in Rust 2015, another in Rust 2018, and a third in Rust 2021.
Sure, if you are actively developing your code then you would probably want to upgrade it, but there is no pressure to do that ASAP.
Narrowing conversion
Posted Jul 23, 2021 16:40 UTC (Fri) by jezuch (subscriber, #52988) [Link]
FWIW, AFAICT Rust doesn't have implicit *widening* conversion either. Turns out, it too often turns into a footgun.
Narrowing conversion
Posted Jul 22, 2021 8:30 UTC (Thu) by Wol (subscriber, #4433) [Link]
If it is (clearly) changing the length of the storage field - as in between int32 and int64, why not? Only an extreme novice would think that you could stuff an int64 into an int32 and not run at least the risk of truncation. Okay if it's not obvious the target is an int32 then it's a more fuzzy issue.
Cheers,
Wol
Narrowing conversion
Posted Jul 22, 2021 9:53 UTC (Thu) by atnot (subscriber, #124910) [Link]
Narrowing conversion
Posted Jul 22, 2021 11:56 UTC (Thu) by roc (subscriber, #30627) [Link]
Narrowing conversion
Posted Jul 28, 2021 16:33 UTC (Wed) by sandsmark (guest, #62172) [Link]
Again proving that the safety of Rust is just a subset of C++ with -Werror -Weverything. :-P
(And I'd say C++ discourages it even more because casting is uglier/more verbose, but people tend to just cast everything just to get rid of warnings/errors so you basically end up with implicit casting anyways.)
Narrowing conversion
Posted Jul 28, 2021 17:07 UTC (Wed) by mathstuf (subscriber, #69389) [Link]
The Sequoia seq_file vulnerability
Posted Jul 22, 2021 0:02 UTC (Thu) by smcv (subscriber, #53363) [Link]
This sysctl is Debian/Ubuntu-specific; some other distro kernels have picked it up, such as Arch's non-default linux-hardened kernel, but it isn't an upstream feature. I believe the upstream equivalent is to set /proc/sys/user/max_user_namespaces to 0.
Either of these mitigations will break user-space applications that expect to be able to create sandbox environments via user namespaces, such as Firefox, Chromium, podman/toolbox, and all users of bwrap (Flatpak, WebKitGTK, libgnome-desktop (for sandboxed thumbnailers), Steam, etc.), so they should probably only be done on systems where user-space is known not to need sandboxes or containers. This is a security trade-off: the mitigation reduces the kernel's attack surface, but also reduces the ability for parts of user-space to distrust other parts of user-space.
Some of these applications can work around inability for unprivileged programs to create user namespaces by using a setuid-root helper (Chromium's deprecated setuid sandbox helper, or a setuid copy of bwrap), but this is another security tradeoff: a setuid-root helper might have root privilege escalation vulnerabilities itself.
It is also not always possible for a setuid-root helper to provide full functionality. A setuid copy of bwrap disables some of its features, because it would not be possible to implement them securely; this reduced functionality is known to break some Flatpak apps, particularly those that make use of sub-sandboxing, notably (again) Chromium.
The Sequoia seq_file vulnerability
Posted Jul 22, 2021 7:24 UTC (Thu) by ganneff (subscriber, #7069) [Link]
Firefox (78.12.0esr-1~deb10u1) at least doesn't seem to care - or doesn't complain.
The Sequoia seq_file vulnerability
Posted Jul 22, 2021 13:00 UTC (Thu) by smcv (subscriber, #53363) [Link]
I don't immediately have a Debian 10 system to hand, but on Debian 11 systems with firefox 88.0.1 and firefox-esr 78.11.0esr, the top-level `firefox` process is in the same userns and filesystem namespace as my interactive shell, but each `firefox -contentproc` subprocess has its own userns (you can see this with `ls -l /proc/$pid/ns/user`) and is running in a private filesystem namespace with an empty root directory (you can see this with `ls -l /proc/$pid/root/`).
The Sequoia seq_file vulnerability
Posted Aug 10, 2021 17:24 UTC (Tue) by Margaret48 (guest, #129042) [Link]
The Sequoia seq_file vulnerability
Posted Jul 22, 2021 7:49 UTC (Thu) by ma4ris5 (guest, #151140) [Link]
How about telling the compiler what the programmer's intention is?
What if there were an attribute that could express the expected valid range of a value?
> static void *seq_buf_alloc(unsigned long size
> __attribute__ ((expect_range (0, MAX_RW_COUNT))) )
> {
> if (unlikely(size > MAX_RW_COUNT))
> return NULL;
>
> return kvmalloc(size, GFP_KERNEL_ACCOUNT);
> }
Here the compiler could warn if a caller of seq_buf_alloc() could pass an out-of-range value for 'size'.
On the other hand, if no caller can create an oversized value, the compiler could optimize
the if condition away as dead code.
Older compilers would still work, because their code generation isn't affected:
they don't know about the new "expect_range" attribute.
Thus they would keep the check even in the dead-code case.
They wouldn't be able to show the additional warnings, though.
The Sequoia seq_file vulnerability
Posted Jul 22, 2021 10:30 UTC (Thu) by ma4ris5 (guest, #151140) [Link]
http://huonw.github.io/blog/2016/04/myths-and-legends-abo...
Rust seems to check for integer overflows in debug builds (i32, i64),
and it compiles the checks out in production builds. So Rust seems to take too
low-level an approach to overflows.
My "expect_range" suggestion is that the developer could state the range
regardless of the value's type (u8, s8, ..., u32, u64, s32, s64), and possibly the bit range
of the value, so that the compiler could verify, at compile time or during static analysis,
that an overflow can't happen when doing arithmetic.
Thus there should be enough information for static analysis, or the compiler,
to produce a mathematical proof that overflows are absent: out-of-range values don't exist.
Such a proof would enable the compiler to skip unnecessary branches and remove dead code.
Optional warnings could be used to guide development toward safe code.
The Wikipedia article on arbitrary-precision (integer) arithmetic describes the idea:
https://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic
> In many cases, the task or the programmer can guarantee that the integer values
> in a specific application will not grow large enough to cause an overflow. Such
> guarantees may be based on pragmatic limits: a school attendance program
> may have a task limit of 4,000 students. A programmer may design the computation
> so that intermediate results stay within specified precision boundaries.
The Sequoia seq_file vulnerability
Posted Jul 25, 2021 9:38 UTC (Sun) by geuder (guest, #62854) [Link]
I have not liked that approach since the days of assert(3) and NDEBUG, which is from the 1980s or even 70s (didn't do any programming during the latter). Code that behaves differently in debug builds just causes loads of new problems. You must fully test both variants to get a benefit reliably.
For the vulnerability at hand such mechanism would not have helped: Even if kernel debug builds were tested, it would be unlikely to have a test case creating 10GB of path name.
The Sequoia seq_file vulnerability
Posted Jul 25, 2021 18:21 UTC (Sun) by tialaramex (subscriber, #21167) [Link]
If you would rather overflow checks also happen in production, you can ask for the Rust compiler setting overflow-checks = true in your production builds, as well as debug where it is default.
However, that would turn such bugs into a kernel crash, rather than local root. Not really what you want.
Rust also lets you explicitly say what you intend to happen for overflow. For example you can say you want saturating arithmetic (127_i8.saturating_add(1_i8) == 127_i8) or wrapping arithmetic (127_i8.wrapping_add(1_i8) == -128_i8) and now, at last, you've actually expressed your intention and so it's more likely this code might do what you intended rather than crashing or worse.
Ultimately if the programmer did not express any intent, it's difficult to see what a practical general purpose language ought to do.
In WUFFS they get to just reject all programs that don't specify. So if you look at byte A and byte B and you purport that you can add those together A+B and store the result as one byte, the WUFFS compiler rejects that code. You need to spell out constraints so that A and B will be small enough for this to work (WUFFS will check the constraints at runtime and error out if they are not met) OR explain how the addition should saturate or wrap to get one byte out (then WUFFS generates code that does so).
This feels like an appropriate discipline in their situation, but most programmers would chafe badly under those conditions.
The Sequoia seq_file vulnerability
Posted Aug 6, 2021 9:40 UTC (Fri) by Randakar (guest, #27808) [Link]
In lieu of the compiler explicitly rejecting such conversions (which it should) the next best thing is to fail in the most loud and noisy way possible so that this type of bug has a bigger chance to get noticed.
Silently doing unexpected things is almost always worse than an error, even if it crashes the system.
The Sequoia seq_file vulnerability
Posted Jul 22, 2021 8:36 UTC (Thu) by pabs (subscriber, #43278) [Link]
The Sequoia seq_file vulnerability
Posted Jul 22, 2021 10:10 UTC (Thu) by immibis (guest, #105511) [Link]
The Sequoia seq_file vulnerability
Posted Jul 22, 2021 10:32 UTC (Thu) by jmaa (guest, #128856) [Link]
The Sequoia seq_file vulnerability
Posted Jul 22, 2021 20:55 UTC (Thu) by Paf (subscriber, #91811) [Link]
The Sequoia seq_file vulnerability
Posted Jul 23, 2021 0:35 UTC (Fri) by pabs (subscriber, #43278) [Link]
The Sequoia seq_file vulnerability
Posted Jul 23, 2021 11:59 UTC (Fri) by matthias (subscriber, #94967) [Link]
The Sequoia seq_file vulnerability
Posted Jul 25, 2021 13:34 UTC (Sun) by aaronmdjones (subscriber, #119973) [Link]
I don't have userspace helper binary support, I don't have BPF support, I don't have user namespaces support (unprivileged or not), I don't have FUSE support, and I don't have userfaultfd support.
As far as I can see, between the lack of user namespaces and the lack of FUSE, I have nothing to worry about, since exploiting this vulnerability relies on mounting, which is a privileged operation in my case (no; mount(8) is not setuid).
The Sequoia seq_file vulnerability
Posted Jul 26, 2021 6:10 UTC (Mon) by wtarreau (subscriber, #51152) [Link]