The Sequoia seq_file vulnerability

By Jake Edge
July 21, 2021

A local root hole in the Linux kernel, called Sequoia, was disclosed by Qualys on July 20. A full system compromise is possible until the kernel is patched (or mitigations that may not be fully effective are applied). At its core, the vulnerability relies on a path through the kernel where 64-bit size_t values are "converted" to signed integers, which effectively results in an overflow. The flaw was reported to Red Hat on June 9, along with a local systemd denial-of-service vulnerability, leading to a kernel crash, found at the same time. Systems with untrusted local users need updates for both problems applied as soon as they are available—out of an abundance of caution, other systems likely should be updated as well.

Down in the guts of the kernel's seq_file interface, which is used for handling virtual files in /proc and the like, buffers are needed to store each line of the file's "contents". To start, a page of memory is allocated for the buffer, but if that is not sufficient, a new buffer that is twice the size of the old one is allocated. This is all done using a size_t, which is an unsigned 64-bit quantity (on x86_64) that is large enough to hold the results, so "the system would run out of memory long before this multiplication overflows".
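
The growth pattern is simple enough to sketch in user-space C. This is only an illustration of the doubling described above, not the kernel's code; the 4096-byte page size and the names SKETCH_PAGE_SIZE and grown_size() are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)  /* assumed page size, not taken from the kernel */

/*
 * Return the buffer size reached by starting with one page and doubling
 * until `needed` bytes fit. Returns 0 if doubling would overflow size_t.
 */
static size_t grown_size(size_t needed)
{
    size_t size = SKETCH_PAGE_SIZE;

    while (size < needed) {
        if (size > SIZE_MAX / 2)
            return 0;   /* the next doubling would not fit in size_t */
        size <<= 1;     /* "twice the size of the old one" */
    }
    return size;
}
```

Under these assumptions, a record just over 1GB long ends up in a 2GB buffer: grown_size(((size_t)1 << 30) + 1) returns (size_t)1 << 31.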

But that value (m->size in the advisory) is passed to other functions that expect a signed 32-bit integer. In particular, the exploit uses the output of /proc/self/mountinfo to get to a place in the kernel where that is the case. The attacker can create a directory hierarchy with a path length larger than 1GB (roughly one-million nested directories), bind-mount it into a user namespace, and then delete the directory. In the namespace, the attacker opens and reads mountinfo, which causes the string "//deleted" to be written outside of the seq_file buffer. The seq_file interface creates a buffer that is 2GB in size, but down in the code that writes the string, that value gets interpreted as -2GB, which is used to calculate where to do the write. That is far outside the buffer, of course, and so it overwrites memory at a known offset elsewhere in the vmalloc() region.
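
The conversion itself can be shown in isolation. The following is a minimal sketch, not kernel code: the function name is made up, and the cast chain assumes a two's-complement target and a compiler (such as GCC or Clang) that wraps out-of-range conversions to signed types, which the C standard leaves implementation-defined.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * What a callee declared with a 32-bit signed parameter ends up seeing
 * when it is handed a size_t: the low 32 bits, reinterpreted as signed.
 * (Hypothetical helper for illustration only.)
 */
static int32_t as_seen_by_int_callee(size_t size)
{
    return (int32_t)(uint32_t)size;
}
```

With a 2GB size ((size_t)1 << 31), the callee sees INT32_MIN, which is -2GB, while small sizes such as 4096 pass through unchanged.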

As might be guessed, writing a ten-byte fixed string outside of the buffer is only the start of a complicated, but easily replicated, series of gyrations leading to a root shell. The Qualys advisory goes into great detail about the journey, which involves several different kernel features that have come about over the last decade or so: user namespaces, BPF, and user-space page fault handling (or FUSE, which is much older). But in the end, it is the mishandling of the integer buffer size that gets the foot in the door; the fix is something of a band-aid to simply reject seq_buf allocations that get "too large".

Qualys said that it has exploit code that can gain root on "default installations of Ubuntu 20.04, Ubuntu 20.10, Ubuntu 21.04, Debian 11, and Fedora 34 Workstation"; furthermore, "other Linux distributions are certainly vulnerable, and probably exploitable". One of the suggested mitigations is turning off the ability for unprivileged users to create user namespaces (via /proc/sys/kernel/unprivileged_userns_clone), because that prevents them from mounting the deep directory hierarchy that results in the 1GB+ path name. But even without user namespaces, there may be an alternative:

However, the attacker may mount a long directory via FUSE instead; we have not fully explored this possibility, because we accidentally stumbled upon CVE-2021-33910 in systemd: if an attacker FUSE-mounts a long directory (longer than 8MB), then systemd exhausts its stack, crashes, and therefore crashes the entire operating system (a kernel panic).

The systemd bug, which Qualys also describes well, stems from switching from strdup() to strdupa() in the mount-path handling code. That switched from heap-based allocations to stack-based ones, which provides the opportunity to exhaust the 8MB (by default) stack. So an overlong path that is found when systemd is parsing /proc/self/mountinfo will result in a segmentation fault; Linux is decidedly unhappy about running without an init process, so it panics as well.
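
The difference between the two allocation strategies can be sketched in a few lines of C. This is an illustration under stated assumptions, not systemd's code: the helper name dup_long_path() and the filler path made of 'a' characters are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup() under strict -std= modes */
#include <stdlib.h>
#include <string.h>

/*
 * Build a stand-in "path" of `len` bytes and duplicate it on the heap.
 * strdup() reports failure by returning NULL, so an oversized path is an
 * error the caller can handle. strdupa(), a GNU extension, would place
 * the same copy on the stack, where an oversized path simply overruns it.
 */
static char *dup_long_path(size_t len)
{
    char *path = malloc(len + 1);
    if (path == NULL)
        return NULL;
    memset(path, 'a', len);
    path[len] = '\0';

    char *copy = strdup(path);  /* heap allocation, independent of stack size */
    free(path);
    return copy;                /* caller must free() the result */
}
```

A 16MB path, twice the default stack size mentioned above, is duplicated without trouble this way; the stack-based variant is deliberately not shown running, since it would be expected to crash.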

A user can trigger the crash by mounting a FUSE filesystem, creating a directory path longer than 8MB, moving the FUSE filesystem to that directory, and causing systemd to parse the mountinfo file again. The fix in this case is to switch back to using strdup() as it was before an April 2015 commit.

The systemd crash was found while developing the Sequoia exploit, which uses BPF code to find and then overwrite the modprobe_path variable in the kernel. That path is used to run an executable as root (normally /sbin/modprobe) when a kernel module gets loaded. Pointing that at a different path, where an attacker-controlled executable lives, gives root privileges.

One would expect the BPF verifier to thwart any attempts to execute a harmful program, since that is the primary mechanism to restrict users from loading unsafe BPF programs. But that is where user-space page-fault handling comes into play: the userfaultfd() system call allows the exploit to effectively pause the kernel after the BPF verifier has been run. Then the overwrite of "//deleted" can be arranged to land at the "right" place in the BPF code, which is loaded into the vmalloc() region. userfaultfd() allows the exploit to change the BPF code after it has been verified but before it gets JIT-compiled, thus evading the verifier.

The out-of-bounds write of "//deleted" is turned into an information disclosure that allows the exploit code to have limited control over what gets overwritten. After that, the techniques from a 2020 BPF vulnerability found by Manfred Paul were used to "transform this limited out-of-bounds write into an arbitrary read and write of kernel memory". The six-step "Exploitation overview" section of the advisory gives a nice overview of the exploit, while the following section gives all the gory details. It makes for interesting reading for those who are curious about kernel exploits and the intricate steps that are needed to make them work.

The reports of both flaws were in the hands of Red Hat for nearly a month until they were reported to the closed mailing lists for kernel and distribution security reports on July 6. Two weeks after that, the coordinated release of the advisories was made and distribution updates started rolling out. It is not at all clear why there was a month-long delay, since neither of the fixes seems particularly challenging. Maybe that time was spent looking for other similar problems in the kernel and systemd.

Clearly the integer conversion at the heart of the exploit needed fixing; one wonders how many other size_t-to-int problems of that sort still linger in the kernel. One might also wonder what kind of a can of worms was opened when userfaultfd() was added to the kernel; it provides a way for user space to semi-arbitrarily pause the kernel in spots of its choosing. That may come in handy again for kernel exploits down the road.

Meanwhile, systemd has hopefully scrutinized any other stack-based allocations it is doing, especially for user-controllable values like paths. While exhausting the stack is often not a security problem for user-space programs (except, of course, for exploits like Stack Clash), systemd sits in a sensitive place in many Linux systems. Any user-controlled means to bring it down is a clear route to a denial of service on the system.

Integer-conversion problems of the sort we see here are something that would likely be impossible in some other languages (e.g. Rust). Stack (or memory) exhaustion, on the other hand, is not really something that can be handled at the language level; hitting a resource limit must be handled somehow and crashing a user-space program is often the least harmful thing the operating system can do. But at least we now have N-2 bugs in our systems; unfortunately, the unknown N is likely distressingly large.

Narrowing conversion

Posted Jul 21, 2021 23:05 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

> Clearly the integer conversion at the heart of the exploit needed fixing; one wonders how many other size_t-to-int problems of that sort still linger in the kernel.

This is called an implicit narrowing conversion. Many modern languages outright forbid this, requiring that the programmer *explicitly* choose to perform a narrowing conversion or (perhaps better) even requiring that they explicitly decide what happens when the big thing won't fit in the small thing.

You can imagine that in kernel code, some sort of error, or panic should likely be the choice, if the programmer did not instead decide to approach the problem in a different way to avoid a narrowing conversion. Apparently out-of-box Windows MSVC programs warn if you do an implicit narrowing conversion but alas GCC does not, even under -Wall, although -Wconversion exists.

Since it's the topic du jour: in Rust you can't have an implicit narrowing conversion, but the language does provide an explicit conversion that can silently lose information in narrowing. Programmers might write:

let signed_int = sixty_four_bit_number as i32;

Whereas perhaps they ought to write:

// editions before 2021 also need: use std::convert::TryInto;
let signed_int: i32 = sixty_four_bit_number.try_into().unwrap(); // panic if it doesn't fit

or indeed:

let signed_int: i32 = sixty_four_bit_number.try_into()?; // only compiles if the function we do this in has a suitable error return

let signed_int: i32 = sixty_four_bit_number.try_into().unwrap_or(SAFE_DEFAULT); // we're confident SAFE_DEFAULT is fine
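
For comparison, C has no built-in equivalent of a fallible conversion, but one can be written by hand. This is a minimal sketch; the helper name size_to_int_checked() is made up for the example, and nothing like it is claimed to exist in the kernel.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Narrow a size_t to an int only if the value fits.
 * On success, stores the value in *out and returns true.
 * On failure, leaves *out untouched and returns false, so the caller
 * has to decide explicitly what happens to an oversized value.
 */
static bool size_to_int_checked(size_t value, int *out)
{
    if (value > (size_t)INT_MAX)
        return false;
    *out = (int)value;
    return true;
}
```

The point of the shape is the same as with try_into(): the "does not fit" case becomes a visible branch at the call site instead of a silent truncation.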

Narrowing conversion

Posted Jul 22, 2021 0:08 UTC (Thu) by roc (subscriber, #30627) [Link]

"as" is one of Rust's warts. You shouldn't have the same syntax for both lossy and non-lossy conversions.

Narrowing conversion

Posted Jul 22, 2021 0:41 UTC (Thu) by Gaelan (guest, #145108) [Link]

For non-lossy conversions, there's the alternate syntax u64::from(some_u32), but it's still not great.

Narrowing conversion

Posted Jul 22, 2021 1:21 UTC (Thu) by roc (subscriber, #30627) [Link]

From and TryFrom let you express infallible and fallible scalar conversions. There probably should be a third trait+method to explicitly request a truncating/sign-modifying conversion. But "as" for scalar conversions is a footgun.

I'm not sure what the best way to proceed from here is. One option would be to introduce that trait+method for truncating conversions, and deprecate or ban lossy "as" scalar conversions in a future edition. But obviously a lot of existing code would be affected.

Narrowing conversion

Posted Jul 23, 2021 0:47 UTC (Fri) by khim (subscriber, #9252) [Link]

> But obviously a lot of existing code would be affected.

Isn't it something Rust editions are supposed to cover?

Narrowing conversion

Posted Jul 23, 2021 7:50 UTC (Fri) by Fowl (subscriber, #65667) [Link]

The code still would want to be upgraded to a newer edition at some point.

Narrowing conversion

Posted Jul 23, 2021 9:54 UTC (Fri) by khim (subscriber, #9252) [Link]

Not really. In Rust, unlike in C++, editions are supposed to coexist. You can write one crate in Rust 2015, another in Rust 2018, and a third in Rust 2021.

Sure, if you are actively developing your code then you would probably want to upgrade it, but there is no pressure to do that ASAP.

Narrowing conversion

Posted Jul 23, 2021 16:40 UTC (Fri) by jezuch (subscriber, #52988) [Link]

I guess it's "not great" by design, like the new-style casts in C++ are ugly by design to discourage casting in general.

FWIW, AFAICT Rust doesn't have implicit *widening* conversion either. Turns out, it too often turns into a footgun.

Narrowing conversion

Posted Jul 22, 2021 8:30 UTC (Thu) by Wol (subscriber, #4433) [Link]

> You shouldn't have the same syntax for both lossy and non-lossy conversions.

If it is (clearly) changing the length of the storage field - as in between int32 and int64, why not? Only an extreme novice would think that you could stuff an int64 into an int32 and not run at least the risk of truncation. Okay if it's not obvious the target is an int32 then it's a more fuzzy issue.

Cheers,
Wol

Narrowing conversion

Posted Jul 22, 2021 9:53 UTC (Thu) by atnot (subscriber, #124910) [Link]

I think just deferring to experience isn't very useful here. Adding explicit conversions quickly becomes muscle memory that isn't given the thought it should be given. You might just not consider whether it is lossy or not in that moment. Since languages should make it easier to do the right thing than the wrong thing, requiring different mechanisms for lossy and lossless conversions would give a crucial opportunity to think about how it should really be handled.

Narrowing conversion

Posted Jul 22, 2021 11:56 UTC (Thu) by roc (subscriber, #30627) [Link]

When you read "foobar as u32" it's not obvious whether that is (potentially) lossy or not. OK, with rust-analyzer I just have to hover "foobar" to see what type it is, but it would be so simple to just require different syntax for lossy vs non-lossy conversions.

Narrowing conversion

Posted Jul 28, 2021 16:33 UTC (Wed) by sandsmark (guest, #62172) [Link]

> Since it's the topic du jour, in Rust you can't have implicit narrowing conversion, but it does provide an explicit conversion that can silently lose information in narrowing, programmers might write:

Again proving that the safety of Rust is just a subset of C++ with -Werror -Weverything. :-P

(And I'd say C++ discourages it even more because casting is uglier/more verbose, but people tend to just cast everything just to get rid of warnings/errors so you basically end up with implicit casting anyways.)

Narrowing conversion

Posted Jul 28, 2021 17:07 UTC (Wed) by mathstuf (subscriber, #69389) [Link]

That's…quite a narrow view of what Rust provides. I'm feeling forgetful…what warning flag tells me when my code has data races in C++? When I write `return make_string().c_str();`?

The Sequoia seq_file vulnerability

Posted Jul 22, 2021 0:02 UTC (Thu) by smcv (subscriber, #53363) [Link]

> One of the suggested mitigations is turning off the ability for unprivileged users to create user namespaces (via /proc/sys/kernel/unprivileged_userns_clone)

This sysctl is Debian/Ubuntu-specific; some other distro kernels have picked it up, such as Arch's non-default linux-hardened kernel, but it isn't an upstream feature. I believe the upstream equivalent is to set /proc/sys/user/max_user_namespaces to 0.

Either of these mitigations will break user-space applications that expect to be able to create sandbox environments via user namespaces, such as Firefox, Chromium, podman/toolbox, and all users of bwrap (Flatpak, WebKitGTK, libgnome-desktop (for sandboxed thumbnailers), Steam, etc.), so they should probably only be done on systems where user-space is known not to need sandboxes or containers. This is a security trade-off: the mitigation reduces the kernel's attack surface, but also reduces the ability for parts of user-space to distrust other parts of user-space.

Some of these applications can work around the inability of unprivileged programs to create user namespaces by using a setuid-root helper (Chromium's deprecated setuid sandbox helper, or a setuid copy of bwrap), but this is another security tradeoff: a setuid-root helper might have root privilege escalation vulnerabilities itself.

It is also not always possible for a setuid-root helper to provide full functionality. A setuid copy of bwrap disables some of its features, because it would not be possible to implement them securely; this reduced functionality is known to break some Flatpak apps, particularly those that make use of sub-sandboxing, notably (again) Chromium.

The Sequoia seq_file vulnerability

Posted Jul 22, 2021 7:24 UTC (Thu) by ganneff (subscriber, #7069) [Link]

> Either of these mitigations will break user-space applications that expect to be able to create sandbox environments via user namespaces, such as Firefox, Chromium, podman/toolbox, and all users of bwrap (Flatpak, WebKitGTK, libgnome-desktop (for sandboxed thumbnailers), Steam, etc.), so they should probably only be done on systems where user-space is known not to need sandboxes or containers.

Firefox (78.12.0esr-1~deb10u1) at least doesn't seem to care - or doesn't complain.

The Sequoia seq_file vulnerability

Posted Jul 22, 2021 13:00 UTC (Thu) by smcv (subscriber, #53363) [Link]

I suspect this means that the content processes (`firefox -contentproc`) are not using user namespaces to limit their view of the rest of the system to protect it from the possibility that they might get compromised by a malicious site; or perhaps they're using a setuid helper to get into a different user namespace.

I don't immediately have a Debian 10 system to hand, but on Debian 11 systems with firefox 88.0.1 and firefox-esr 78.11.0esr, the top-level `firefox` process is in the same userns and filesystem namespace as my interactive shell, but each `firefox -contentproc` subprocess has its own userns (you can see this with `ls -l /proc/$pid/ns/user`) and is running in a private filesystem namespace with an empty root directory (you can see this with `ls -l /proc/$pid/root/`).

The Sequoia seq_file vulnerability

Posted Aug 10, 2021 17:24 UTC (Tue) by Margaret48 (guest, #129042) [Link]

Firefox, unlike Chromium, silently disables most parts of its sandbox when user namespaces aren't available, leaving only seccomp.

The Sequoia seq_file vulnerability

Posted Jul 22, 2021 7:49 UTC (Thu) by ma4ris5 (guest, #151140) [Link]

In this case, seq_buf_alloc() doesn't have an integer overflow; its argument is already an unsigned 64-bit value.

How about telling the compiler what the programmer's intention is?
What if there were an attribute that could state the expected range of valid values?

> static void *seq_buf_alloc(unsigned long size
>         __attribute__ ((expect_range (0, MAX_RW_COUNT))))
> {
>         if (unlikely(size > MAX_RW_COUNT))
>                 return NULL;
>
>         return kvmalloc(size, GFP_KERNEL_ACCOUNT);
> }

Here the compiler could warn if a caller of seq_buf_alloc() could pass an out-of-range value for 'size'.
On the other hand, if no caller can produce an oversized value, the compiler could optimize
the if condition away as dead code.

Older compilers would still work, because their code generation isn't affected:
they don't know about the new "expect_range" attribute.
Thus they would keep the check even in the dead-code case.
They wouldn't be able to show the additional warnings, though.

The Sequoia seq_file vulnerability

Posted Jul 22, 2021 10:30 UTC (Thu) by ma4ris5 (guest, #151140) [Link]

I found a page which elaborates Rust's approach for integer overflows:
http://huonw.github.io/blog/2016/04/myths-and-legends-abo...

Rust seems to check for integer overflows in debug builds (i32, i64),
and it compiles the checks out in production builds. So Rust seems to take too
low-level an approach to overflows.

My "expect_range" suggestion is that the developer could state the range
regardless of the value's type (u8, s8, ..., u32, u64, s32, s64), and possibly the bit range
of the value, so that the compiler could verify that an overflow can't happen when
doing arithmetic, either at compile time or during static analysis.

Thus there should be enough information, at static-analysis time or compile time,
to construct a mathematical proof that overflows are absent: out-of-range values don't exist.

A mathematical proof would enable the compiler to skip unnecessary branches and remove dead code.

Optional warnings could be used to guide development toward safe code.

The Wikipedia article on arbitrary-precision (integer) arithmetic describes this situation:
https://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic
> In many cases, the task or the programmer can guarantee that the integer values
> in a specific application will not grow large enough to cause an overflow. Such
> guarantees may be based on pragmatic limits: a school attendance program
> may have a task limit of 4,000 students. A programmer may design the computation
> so that intermediate results stay within specified precision boundaries.

The Sequoia seq_file vulnerability

Posted Jul 25, 2021 9:38 UTC (Sun) by geuder (guest, #62854) [Link]

> Rust seems to check for integer overflows in debug builds (i32, i64), and it compiles the checks out in production builds.

I have not liked that approach since the days of assert(3) and NDEBUG, which is from the 1980s or even 70s (didn't do any programming during the latter). Code that behaves differently in debug builds just causes loads of new problems. You must fully test both variants to get a benefit reliably.

For the vulnerability at hand, such a mechanism would not have helped: even if kernel debug builds were tested, it would be unlikely to have a test case creating more than 1GB of path name.

The Sequoia seq_file vulnerability

Posted Jul 25, 2021 18:21 UTC (Sun) by tialaramex (subscriber, #21167) [Link]

For this vulnerability what we've got is a narrowing conversion rather than conventional integer overflow. There's discussion above this on LWN. In C narrowing conversions are implicit -- if you assign a huge number into a small integer type, the compiler shrugs and throws away whatever doesn't fit. Rust's conversions are explicit -- if the types don't match you need to say "as" to get it converted, however a programmer may not be brought up short by the need for conversion and realise this is a terrible idea because some of their values get thrown away.

If you would rather overflow checks also happen in production, you can ask for the Rust compiler setting overflow-checks = true in your production builds, as well as debug where it is default.

However, that would turn such bugs into a kernel crash, rather than local root. Not really what you want.

Rust also lets you explicitly say what you intend to happen for overflow. For example you can say you want saturating arithmetic (127_i8.saturating_add(1_i8) == 127_i8) or wrapping arithmetic (127_i8.wrapping_add(1_i8) == -128_i8) and now, at last, you've actually expressed your intention and so it's more likely this code might do what you intended rather than crashing or worse.
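
The same two intentions can be spelled out by hand in C. This is a sketch with made-up helper names; the wrapping variant relies on the conversion of an out-of-range value to int8_t wrapping around, which is implementation-defined in C but is what GCC and Clang do on two's-complement targets.

```c
#include <stdint.h>

/* Saturating add: clamp the result to the int8_t range. */
static int8_t add_saturating_i8(int8_t a, int8_t b)
{
    int sum = a + b;  /* operands are promoted to int, so this cannot overflow */

    if (sum > INT8_MAX)
        return INT8_MAX;
    if (sum < INT8_MIN)
        return INT8_MIN;
    return (int8_t)sum;
}

/* Wrapping add: keep only the low 8 bits of the result. */
static int8_t add_wrapping_i8(int8_t a, int8_t b)
{
    uint8_t low = (uint8_t)((uint8_t)a + (uint8_t)b);  /* reduced modulo 256 */

    return (int8_t)low;  /* assumed to wrap, as on GCC and Clang */
}
```

With these, add_saturating_i8(127, 1) yields 127 and add_wrapping_i8(127, 1) yields -128, matching the Rust expressions above.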

Ultimately if the programmer did not express any intent, it's difficult to see what a practical general purpose language ought to do.

In WUFFS they get to just reject all programs that don't specify. So if you look at byte A and byte B and you purport that you can add those together A+B and store the result as one byte, the WUFFS compiler rejects that code. You need to spell out constraints so that A and B will be small enough for this to work (WUFFS will check the constraints at runtime and error out if they are not met) OR explain how the addition should saturate or wrap to get one byte out (then WUFFS generates code that does so).

This feels like an appropriate discipline in their situation, but most programmers would chafe badly under those conditions.

The Sequoia seq_file vulnerability

Posted Aug 6, 2021 9:40 UTC (Fri) by Randakar (guest, #27808) [Link]

Frankly crashing the kernel is *exactly* the right thing to do.

In lieu of the compiler explicitly rejecting such conversions (which it should) the next best thing is to fail in the most loud and noisy way possible so that this type of bug has a bigger chance to get noticed.

Silently doing unexpected things is almost always worse than an error, even if it crashes the system.

The Sequoia seq_file vulnerability

Posted Jul 22, 2021 8:36 UTC (Thu) by pabs (subscriber, #43278) [Link]

The BPF verifier evasion seems pretty concerning, but the article didn't mention if that has been fixed. I also wonder if changes to block each of the steps in the exploit have been made.

The Sequoia seq_file vulnerability

Posted Jul 22, 2021 10:10 UTC (Thu) by immibis (guest, #105511) [Link]

I'm not seeing the problem with the BPF verifier here. It is not concerning that the BPF verifier can malfunction if arbitrary memory is arbitrarily overwritten at arbitrary points in the process. Most code malfunctions most of the time under those conditions.

The Sequoia seq_file vulnerability

Posted Jul 22, 2021 10:32 UTC (Thu) by jmaa (guest, #128856) [Link]

I'm curious what the fix would be for BPF. I would have expected JITed code to be read/execute-only once compiled, but perhaps there's some limitation preventing W^X?

The Sequoia seq_file vulnerability

Posted Jul 22, 2021 20:55 UTC (Thu) by Paf (subscriber, #91811) [Link]

The BPF verifier is evaded by writing to the memory containing the program after the verifier has run. So, nothing BPF can do about it.

The Sequoia seq_file vulnerability

Posted Jul 23, 2021 0:35 UTC (Fri) by pabs (subscriber, #43278) [Link]

The article made it sound like userfaultfd() could be used to modify the BPF after the verifier ran; re-reading the paragraph, it sounds like I missed that that can't happen without a memory-write vulnerability?

The Sequoia seq_file vulnerability

Posted Jul 23, 2021 11:59 UTC (Fri) by matthias (subscriber, #94967) [Link]

Yes, userfaultfd() is only used to give the write vulnerability time to do its work by suspending the thread responsible for verifying and compiling the BPF at the right moment. On my first read, I was also puzzled about why there was not more talk about this seeming BPF vulnerability.

The Sequoia seq_file vulnerability

Posted Jul 25, 2021 13:34 UTC (Sun) by aaronmdjones (subscriber, #119973) [Link]

And it's times like these that I really thank myself for building my own kernels that only have what I need.

I don't have userspace helper binary support, I don't have BPF support, I don't have user namespaces support (unprivileged or not), I don't have FUSE support, and I don't have userfaultfd support.

As far as I can see, between the lack of user namespaces and the lack of FUSE, I have nothing to worry about, since exploiting this vulnerability relies on mounting, which is a privileged operation in my case (no; mount(8) is not setuid).

The Sequoia seq_file vulnerability

Posted Jul 26, 2021 6:10 UTC (Mon) by wtarreau (subscriber, #51152) [Link]

Same here and I totally agree.