Kernel Summit 2005: The ExecShield patches

[Posted July 19, 2005 by corbet]

From LWN's 2005 Kernel Summit coverage.

Arjan van de Ven took half an hour to discuss the set of security-related patches known as "ExecShield." He made an analogy with building security; just as a number of techniques are employed to secure a building, multiple approaches must be taken to computer security. The login process is like a guard at the door; SELinux is like a set of fire doors inside the building, and an intrusion detection system is like an alarm system. In this analogy, ExecShield is the locks on the windows.

The components of ExecShield include:

Non-executable data areas. With hardware no-execute (NX) support now widely available, this issue has just about gone away. The kernel already has support for this sort of protection.
Various runtime checks built into the kernel API. Arjan mentioned checking for double-free errors in particular. Freeing a chunk of memory twice, as it turns out, is a common bug - and it is exploitable.
Address-space randomization techniques. Some of the ExecShield randomization patches were merged into 2.6.12 (see this Kernel Page article), but there are some others. The full ExecShield patch can randomize the locations of the program body, brk() area, data area, and more.
The gcc FORTIFY_SOURCE option. This feature is currently only available for user-space programs, but a patch adding kernel support is circulating. This option, in particular, can check the correctness of many memory copy operations when the size of the affected buffer is known. It is less useful in the kernel - which keeps few buffers on the stack - but still worth having.
The -fstack-protector gcc option. It adds a canary to the stack, and can thus detect stack buffer overruns after they happen (but, with luck, before they do any harm).
Information hiding. The kernel exports information (through /proc/pid/maps, for example) which can be highly useful to attackers. In particular, disclosing the locations of memory areas defeats much of the advantage of randomization. So restricting access to that information improves the security of the system.
/dev/mem is used primarily by root kits. About the only legitimate user is the X server, which uses /dev/mem for access to the frame buffer. Read access to /dev/mem is an information leak, and there is no reason for allowing write access to kernel memory at all.
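
As a rough illustration of the -fstack-protector item above, the following Python sketch models a stack frame as a byte array, with a canary placed between a local buffer and the saved return address. It is a toy simulation and not gcc's implementation: the frame layout, the 8-byte canary, and the names call_with_canary and StackSmashDetected are all assumptions made for the example.

```python
import os


class StackSmashDetected(Exception):
    """Raised when the canary no longer matches at function return."""


# Hypothetical saved return address, used only for this illustration.
RETURN_ADDR = b"\xde\xad\xbe\xef\x00\x00\x00\x00"


def call_with_canary(payload: bytes, buf_size: int = 16) -> bytes:
    """Copy payload into a fixed-size buffer inside a toy stack frame.

    Assumed frame layout, low to high addresses:
    [buffer][canary][return address]
    """
    canary = os.urandom(8)  # fresh random value for every call
    frame = bytearray(buf_size) + bytearray(canary) + bytearray(RETURN_ADDR)

    # Unchecked copy, as strcpy() would do: it may run past the buffer.
    n = min(len(payload), len(frame))
    frame[:n] = payload[:n]

    # Epilogue check: verify the canary before the return address is used.
    if bytes(frame[buf_size:buf_size + 8]) != canary:
        raise StackSmashDetected("canary overwritten")
    return bytes(frame[buf_size + 8:])
```

A payload that fits in the buffer returns the saved address untouched; a longer one overwrites the canary on its way to the return address, so the overrun is noticed after it happens but before the corrupted address is used.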

Arjan estimates that the ExecShield patches mitigate the effects of some 25-30% of reported vulnerabilities in Linux systems. Even so, there was no discussion of merging the remaining ExecShield patches in the near future.

The ExecShield patches

Posted Jul 19, 2005 17:10 UTC (Tue) by jreiser (subscriber, #11027) [Link]

Random placement of the AT_SYSINFO page is another feature which is used by Fedora Core 4. The AT_SYSINFO page is a kernel-generated copy of linux-gate.so.1 which contains the "preferred" instructions for invoking a system call, and the code for sigreturn and rt_sigreturn. glibc-2.3.5 and FC4 kernels exploit these features.

There are unfortunate side effects, particularly on 32-bit x86. Random placement of the AT_SYSINFO page increases fragmentation of the address space; fewer large arrays are possible. Random placement tends to defeat pre-linking: process startup gets slower in proportion to the number and size of pre-linked shared libraries. Sigreturn will crash if the user moves the AT_SYSINFO page. See https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=162797 .

The ExecShield patches

Posted Jul 21, 2005 10:13 UTC (Thu) by mingo (subscriber, #31122) [Link]

What do you mean by "fewer large arrays are possible"? Randomization of shared libraries in FC4 happens in the first ~128MB of virtual memory. You should have no expectation of continuity in that area, and it's not a problem. The other 2.9 GB of virtual memory is not affected by this; there you can have arrays as large as you wish.

The ExecShield patches

Posted Jul 19, 2005 19:33 UTC (Tue) by jwb (guest, #15467) [Link]

/dev/mem is only writable by root. If an attacker has root access, they can load modules which make /dev/mem writable or implement an entirely different device which serves the same purpose.

The other improvements all seem sensible.

The ExecShield patches

Posted Jul 20, 2005 6:47 UTC (Wed) by dlang (guest, #313) [Link]

however, currently the primary reason why building a monolithic kernel to prevent root from loading modules falls short is the fact that root will have access to /dev/mem (or /dev/kmem) and can therefore fiddle with memory directly.

if access to those is cut off then people running especially security critical systems can build kernels that don't support loading modules AND don't support access to /dev/mem and gain a considerable amount of protection.

and face it, firewalls don't really need modules, they have a very static hardware configuration.

so I definitely see this as a useful option.

The ExecShield patches

Posted Jul 20, 2005 7:27 UTC (Wed) by nix (subscriber, #2304) [Link]

People running firewalls can already remove CAP_SYS_RAWIO from the kernel's capability bounding set, which bans reads and writes to /dev/mem. (Obviously, you have to grant an X server this capability, but there shouldn't be one of those running on a firewall anyway, really.)

The ExecShield patches

Posted Jul 20, 2005 20:42 UTC (Wed) by dlang (guest, #313) [Link]

unless you are running selinux (which most distros don't do, and I definitely don't trust RedHat enough to use it on a firewall) I am not aware of an easy way to do this.

if there is one please let me know how.

Re: access to /dev/mem

Posted Jul 22, 2005 1:47 UTC (Fri) by sweikart (guest, #4276) [Link]

Here's a good description of it:

http://lwn.net/1999/1202/kernel.php3

And here's an implementation for dropping capabilities at boot time:

http://lists.nas.nasa.gov/archives/ext/linux-security-aud...

Since you can disable access to /dev/mem with the capability bounding set, I would request that the semantics of /dev/mem not change.

-scott

Re: access to /dev/mem

Posted Jul 25, 2005 11:54 UTC (Mon) by nix (subscriber, #2304) [Link]

The one-liner I use on my firewall is online here.

The ExecShield patches

Posted Jul 22, 2005 12:39 UTC (Fri) by mingo (subscriber, #31122) [Link]

signed modules (an upcoming feature) will prevent the loading of 'rogue' modules.

You are right in that the weakest link determines the strength of the chain, but this does not mean we should not strengthen other links even if we know that they are not the weakest link. Once that final link is strengthened too we'll see a sudden jump in strength.

so i agree that in isolation, restricting /dev/mem is like closing the door but leaving the window open. I'd like to reassure you that we are closing the windows too :)

on a sidenote, rootkits do prefer /dev/mem over module insertion, because it's "more stealth". And not only is it more stealth, it's also more robust: by checking the actual kernel image a rootkit can make reasonably sure that it has the right kernel version - while with modules you either have the correct symbol map or you don't, and in the latter case the attacker can easily crash the system and raise attention. The most dangerous attackers prefer stealth over all - so that their unique methods stay hidden.

another sidenote is that if you cook your own kernel (which secure sites sometimes do), you can disable module support.

Re: restricting access to /dev/mem

Posted Jul 25, 2005 2:15 UTC (Mon) by sweikart (guest, #4276) [Link]

Since you can already block access to /dev/mem in userspace (using the Capability Bounding Set; see my earlier post, which was posted about the same time as yours), I would request that the kernel not alter the semantics of access to /dev/mem .

Instead, distributors can drop the capability to access /dev/mem in their startup scripts (which can be modified by people who need to access /dev/mem, whether it's for driver support or other reasons).

-scott

Re: restricting access to /dev/mem

Posted Jul 25, 2005 6:38 UTC (Mon) by mingo (subscriber, #31122) [Link]

You missed the following detail: we do not want to and cannot restrict access to /dev/mem in a total way, because X.org needs it to map the BIOS and/or the framebuffer. Hence the careful filtering of access, instead of blanket turning off. On those systems where X.org needs no access to /dev/mem at all (they are quite rare) we can turn it off permanently, but this does not solve the problem for all the other systems.

Re: restricting access to /dev/mem

Posted Jul 25, 2005 16:23 UTC (Mon) by sweikart (guest, #4276) [Link]

But I don't think the kernel should enforce *any* specific policy on access rights to /dev/mem; I think this policy should be left to userspace (which can drop the SYS_RAWIO capability from the Capability Bounding set, using /proc/sys/kernel/cap-bound). Other drivers and applications (including mine :-) need access to /dev/mem during the boot sequence, and we can drop SYS_RAWIO when we're done.

So, my proposal is that the kernel not enforce access rights. Instead, the distributors can drop SYS_RAWIO in their boot scripts, and people (like me) who need temporary access to /dev/mem can modify the boot scripts as needed.
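
For readers who have not worked with the bounding set, here is a minimal sketch of the arithmetic behind such a boot-script step. The bit numbers are the ones defined in <linux/capability.h>; the starting value 0xFFFFFEFF (everything except CAP_SETPCAP) and the helper name drop_caps are assumptions for illustration. The sketch only computes a mask and deliberately writes nothing to /proc/sys/kernel/cap-bound.

```python
from typing import Iterable

# Capability bit numbers as defined in <linux/capability.h>.
CAP_SETPCAP = 8
CAP_SYS_MODULE = 16
CAP_SYS_RAWIO = 17

# Assumed starting value: all capabilities except CAP_SETPCAP.
ASSUMED_DEFAULT_BOUND = 0xFFFFFEFF


def drop_caps(bounding_set: int, caps: Iterable[int]) -> int:
    """Return the 32-bit bounding-set mask with the given capability bits cleared."""
    mask = bounding_set & 0xFFFFFFFF
    for cap in caps:
        mask &= ~(1 << cap)
    return mask & 0xFFFFFFFF


if __name__ == "__main__":
    new_mask = drop_caps(ASSUMED_DEFAULT_BOUND, [CAP_SYS_RAWIO, CAP_SYS_MODULE])
    print(hex(new_mask))
```

Bits can only be removed from the bounding set, never added back, which is why this kind of step belongs at the end of the boot sequence, after anything that still needs /dev/mem has run.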

-scott

Re: restricting access to /dev/mem

Posted Jul 25, 2005 18:26 UTC (Mon) by mingo (subscriber, #31122) [Link]

Let me repeat it again: the problem cannot be solved via the SYS_RAWIO privilege or any other flat privilege bit. We don't want to give blanket /dev/mem access _even to processes that are allowed to read/write the safe portions of it_ (i.e. X.org).

(Furthermore, the kernel is perfectly right in enforcing that what is written/read in /dev/mem actually makes sense and doesn't corrupt the kernel itself.)

Re: restricting access to /dev/mem

Posted Jul 25, 2005 21:01 UTC (Mon) by sweikart (guest, #4276) [Link]

> We don't want to give blanket /dev/mem access _even to processes
> that are allowed to read/write the safe portions of it_
> (i.e. X.org).

I agree, I don't want to do this either. That's why I drop SYS_RAWIO (and CAP_SYS_MODULE, CAP_SYS_ADMIN, etc) in my boot scripts.

> (Furthermore, the kernel is perfectly right in enforcing that what
> is written/read in /dev/mem actually makes sense and doesn't
> corrupt the kernel itself.)

That's a good point. Here's my dilemma.

On my secure servers, I drop *all* capabilities in the Capability Bounding Set, and drop most process capabilities in most daemons. So, how do I change a daemon's process capabilities?

capsetp(3) tells you that "the only processes that have CAP_SETPCAP available to them are processes started as a kernel-thread", and that "you will need to recompile the kernel to modify this default". Since I want my server operators to be able to install distributor's kernel RPMs, recompiling the kernel doesn't work for me.

So, I wrote a (GPL'ed) command that opens /dev/mem and raises CAP_SETPCAP in cap_bset (the kernel variable that holds the system's Capability Bounding Set), forks, restores cap_bset, then raises CAP_SETPCAP in its parent process capability set (so the parent can change other processes). This command won't run on my Fedora Core workstation, because of ExecShield.

I've thought a little bit about how the kernel could safely make CAP_SETPCAP available to userspace. By default, you don't want any userspace processes (or the Capability Bounding Set) to have CAP_SETPCAP raised; that's too big a change in the normal security model. My idea would be to create an enable-CAP_SETPCAP option for init, that would be placed into the initdefault entry in /etc/inittab; if this option doesn't exist, then init would drop CAP_SETPCAP (in the Capability Bounding Set and in its own process capability set) before it creates any other processes.

-scott

Re: restricting access to /dev/mem

Posted Aug 9, 2006 2:12 UTC (Wed) by bluefoxicy (guest, #25366) [Link]

>> We don't want to give blanket /dev/mem access _even to processes
>> that are allowed to read/write the safe portions of it_
>> (i.e. X.org).
>
>I agree, I don't want to do this either. That's why I drop SYS_RAWIO (and CAP_SYS_MODULE, CAP_SYS_ADMIN, etc) in my boot scripts.

What Ingo is saying is he wants to do like grsecurity does and block access to /dev/mem EXCEPT for video memory. Xorg would be able to mmap() video memory in through /dev/mem; but it wouldn't be able to touch kernel memory through the same interface.

grsecurity does this so at worst an attacker can hijack X and totally screw up your video and maybe crash your system in the process (touching the video card wrong is BAD); they're just learning from a really good example. It's a much more fine-grained model than CAP_SYS_RAWIO because it's not straight "you can have /dev/mem or you can't."
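
To make the difference from a flat capability bit concrete, here is a small sketch of range-based filtering. It is not grsecurity's or ExecShield's code: the allow-list, the function name access_allowed, and the choice of the legacy VGA window (0xA0000 to 0xBFFFF) as the only permitted region are assumptions for the example. A real policy would also need to cover the card's actual framebuffer aperture.

```python
# Hypothetical allow-list of physical address ranges: (start, end exclusive).
# 0xA0000-0xBFFFF is the legacy VGA window on PCs.
ALLOWED_RANGES = [(0xA0000, 0xC0000)]


def access_allowed(offset: int, length: int, allowed=ALLOWED_RANGES) -> bool:
    """Permit a /dev/mem style access only if it lies entirely in one allowed range."""
    if length <= 0:
        return False
    end = offset + length
    return any(start <= offset and end <= stop for start, stop in allowed)
```

The check is per request rather than per process: the same process may map video memory and still be refused when it asks for an offset that falls in kernel memory, or for a range that straddles the edge of the permitted window.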

The ExecShield patches

Posted Jul 20, 2005 2:54 UTC (Wed) by ch (guest, #4097) [Link]

> About the only legitimate user is the X server, which uses /dev/mem
> for access to the frame buffer.

There are other user-space "device drivers" that use /dev/mem to get to memory mapped I/O devices, GPIO in particular.

The ExecShield patches

Posted Jul 25, 2005 18:31 UTC (Mon) by jsbarnes (guest, #4096) [Link]

Even X isn't a legitimate user--it can use the sysfs or /proc/bus/pci APIs
for accessing device memory (we're working on it for a future X.Org
release). Likewise, user level device drivers should use these
interfaces, /dev/mem will only get you into trouble (e.g. it's not very
portable or flexible).

The ExecShield patches

Posted Jul 31, 2005 2:13 UTC (Sun) by ch (guest, #4097) [Link]

It is just a set of conventions. If conventions against sysfs or /proc/bus/pci make more sense, that's fine. But sometimes the differences are just gratuitous. Maybe not here, but that's what to watch out for.

Performance impact?

Posted Jul 20, 2005 7:49 UTC (Wed) by farnz (subscriber, #17727) [Link]

Just out of interest, has anyone run numbers on the cost of the full set of ExecShield patches? Non-executable data is obviously free on the right hardware and hiding information previously exported doesn't cost, but checking for things like double free, or stack overruns does take some CPU time.

Performance impact?

Posted Jul 22, 2005 12:27 UTC (Fri) by mingo (subscriber, #31122) [Link]

there's no performance impact from the double-free checks.

you are right about exec-shield being cheap on newest hardware - but it's more than reasonable even on 'legacy' hardware where we apply the 'dynamic segment limit' trick to approximate NX protection. Maintaining the segment limit has some cost but it's not measurable.

What is measurable is overhead on systems that don't have NX but have fast syscall support: there we have to turn off fast syscalls because the fast syscall hardware does not follow segment limits accurately and securely. Fast syscalls are a new feature in the 2.6 kernel which can cut ~0.5 usecs off the syscall entry+exit cost. On CPUs that have neither fast syscalls nor NX (old hardware), or CPUs that have both (new systems), there's no such cost. Also, the difference between fast syscalls and int80 syscalls is more marked on new hardware - which has NX. So while exec-shield might cause an NX-less P3 to not use fast syscalls, the cost of that is not the same as if the latest P4 didn't have fast syscalls.

so in general, the performance overhead of exec-shield is not an issue. Even if it had considerable overhead (which it doesn't), i'd use it personally, considering the tough (and important) problem it tries to solve: to give a fair degree of transparent non-exec protection on CPUs that don't actually have such support in hardware. (i.e. don't have an "exec/dont-exec" bit in their pagetable format.)

Exec-shield is always on in FC1, FC2, FC3, FC4, RHEL3 and RHEL4.

Performance impact?

Posted Aug 9, 2006 2:07 UTC (Wed) by bluefoxicy (guest, #25366) [Link]

"you are right about exec-shield being cheap on newest hardware - but it's more than reasonable even on 'legacy' hardware where we apply the 'dynamic segment limit' trick to approximate NX protection. Maintaining the segment limit has some cost but it's not measurable."

[Stack +RW]
----------- <-- Segment Limit
[Libraries +RX]
[AnonMaps +RWX]
[Heap +RWX]
[Program +RX]

Do you guys still try to map libraries in below the 16MB limit to try to create NULL bytes in the addresses? That was a valiant effort but remember we still consider stuff like buffer overflows exploitable on x86-64 (where we have 48 bits VMA and 64 bit pointers, so they all contain 2 NULL bytes).

The one thing it does buy you is that you can try to map libraries low and move the segment limit down to the heap; keep NX anonymous mappings high so you don't get +X anonmaps as well. Then your image looks as follows:

[Stack +RW]
[AnonMaps +RW]
[Heap +RW]
----------- <-- Segment Limit
[Program +RX]
[Libraries +RX]

Unfortunately libraries are pretty set with the following (because statically linked data is accessed relative to the code segment and can't be separately reloaded, or something like that; I do not fully understand it yet):

[Library +RX]
[LibData +RWX]

So you can't truly make non-executable library data. PaX overloads the supervisor bit for this; the code segment limit works above the highest executable address, and below that the supervisor bit overloading takes effect.

PaX supplies 16 bits of library randomization, whereas ExecShield uses only 8. ES moves libs around in 1MB of VMA so things fit under the first 16MB; you could very well manage to use supervisor-bit overloading only on library data, CSLT working for the heap and anonmaps. PaX moves them in the first 256M so the original VMA layout is kept; everything below the stack gets the supervisor bit overloading.
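
The 8-bit and 16-bit figures can be reproduced from the window sizes, if one assumes that libraries are placed on 4KB page boundaries (the granularity is my assumption; it is not stated above). The helper name randomization_bits is made up for this sketch.

```python
PAGE_SIZE = 4096  # assumed placement granularity: one i386 page


def randomization_bits(window_bytes: int, granularity: int = PAGE_SIZE) -> int:
    """Number of whole bits of entropy for a base address chosen within a window."""
    positions = window_bytes // granularity
    if positions < 1:
        raise ValueError("window smaller than placement granularity")
    return positions.bit_length() - 1  # floor(log2(positions))


MB = 1024 * 1024
es_bits = randomization_bits(1 * MB)     # 1MB window, as described for ExecShield
pax_bits = randomization_bits(256 * MB)  # 256MB window, as described for PaX
```

Under that assumption a 1MB window gives 256 possible positions and a 256MB window gives 65536, which matches the 8 and 16 bits quoted.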

I actually measured the overhead of PaX's segment limit/SBO hybrid technique at just a bit lower than the SEGMEXEC method of splitting the address space in half; except the Pentium 4 has a flaw that makes the supervisor bit overloading part EXTREMELY slow. Pentium 3 or lower or any AMD chip works just fine.

Kernel Summit 2005: The ExecShield patches

Posted Aug 9, 2006 0:45 UTC (Wed) by bluefoxicy (guest, #25366) [Link]

/me gets on his fireproof underwear :)

* Non-executable data areas. With advances in the hardware area, this issue has just about gone away. The kernel already has support for this sort of protection.

- PaX, first release October 11, full page-granularity NX bit on i386 (he still hasn't figured a way to get NX bit on ARM though).

ExecShield uses a segmentation trick; the bottom cut of the address space is executable, the top cut isn't. The highest executable mapping is also the last by this design. Problem is, that's just below the stack; all non-executable data below there will execute if you tell it to.

PaX copied this trick from ES and W^X around last year; it now serves to take the pressure off access to the stack. For areas not covered, it still uses the original method: Mark those pages SUPERVISOR so that the MMU raises a fault with the kernel on access. If kernel sees an ITLB fault it kills the process (attempt to execute); a DTLB fault (data, read-write) is allowed. This literally gives us a per-page NX bit.
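
As a toy model of that decision (not PaX code; the function name and the string results are invented for illustration), the fault handler's logic as described here boils down to:

```python
def handle_fault(page_marked_supervisor: bool, is_instruction_fetch: bool) -> str:
    """Model the per-page NX emulation described above.

    Returns "no-fault" for ordinary pages, "kill" when user code tries to
    execute from a page marked supervisor, and "allow" for data accesses.
    """
    if not page_marked_supervisor:
        return "no-fault"  # ordinary page: the MMU does not trap at all
    if is_instruction_fetch:
        return "kill"      # ITLB fault: attempt to execute data
    return "allow"         # DTLB fault: plain read or write of data
```

The model leaves out the normal read/write permission checks, which still apply to the data access that is let through.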

I'm not sure on this, but I believe once an entry is in the DTLB, the MMU assumes it's okay for access until it gets flushed out (by a full TLB or whatnot). It still checks read/write permissions on attempt to read/write.

I'm also not sure on how it works on PPC, PPC64, SPARC, or SPARC64; none of those have NX bits in the PTE structure. Apparently there's an extra bit that can be used to coerce the CPU into letting the OS make one up (like done with i386); but on one of those it's used already for something else. Also there's crappy PLT things where the PLT is executable and shoved in a segment that holds data so a big chunk of data becomes RWX mandatory.

* Various runtime checks built into the kernel API. Arjan mentioned checking for double-free errors in particular. Freeing a chunk of memory twice, as it turns out, is a common bug - and it is exploitable.

- glibc, some time ago.

This is a good idea. It was implemented in glibc (in fact I think the patches came from in-house at RedHat).
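
The idea itself fits in a few lines. This is only a sketch of the concept with a made-up ToyAllocator class; as far as I know real allocators inspect their own chunk metadata instead of keeping a separate set of live pointers.

```python
class DoubleFreeError(Exception):
    """Raised when an address is freed that is not currently allocated."""


class ToyAllocator:
    """Bump allocator that remembers which addresses are live."""

    def __init__(self) -> None:
        self._next = 0x1000  # arbitrary starting address for the toy heap
        self._live = set()

    def alloc(self, size: int) -> int:
        addr = self._next
        self._next += max(size, 1)
        self._live.add(addr)
        return addr

    def free(self, addr: int) -> None:
        # The runtime check: refuse to free something that is not live.
        if addr not in self._live:
            raise DoubleFreeError(hex(addr))
        self._live.remove(addr)
```

Freeing the same address twice raises DoubleFreeError on the second call, instead of silently corrupting allocator state for an attacker to use later.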

* Address-space randomization techniques. Some of the ExecShield randomization patches were merged into 2.6.12 (see this Kernel Page article), but there are some others. The full ExecShield patch can randomize the locations of the program body, brk() area, data area, and more.

- PaX implemented randomization around 2001. It's also got better randomization over a wider area.

Yes we need brk() (heap) and main program randomization, which is in ES but not in mainline. The main program isn't going to move around unless it's a PIE; PaX can do it, but it crashes the program a lot and uses a VMA mirroring hack.

The thing is this isn't really much of a security feature. It's an amortization; if you're attacked, there's a slight chance that you'll be taken in; and a huge chance that the attack will fail. Your risk model says "WE CAN BE HACKED" but it also says "Anyone trying is likely to fail," which seems conflicting but really isn't. Turning up the randomization helps.
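
To put rough numbers on that, assume each attempt is an independent guess at a base address drawn uniformly from 2^bits positions, with the layout re-randomized between attempts (both are simplifying assumptions, and the function name is made up):

```python
def success_probability(bits: int, attempts: int) -> float:
    """Chance that at least one of `attempts` independent guesses hits the right address."""
    p_single = 1.0 / (1 << bits)
    return 1.0 - (1.0 - p_single) ** attempts


one_try_8 = success_probability(8, 1)    # a single guess against 8 bits
one_try_16 = success_probability(16, 1)  # a single guess against 16 bits
many_tries_16 = success_probability(16, 1000)
```

A single guess against 8 bits succeeds one time in 256, against 16 bits one time in 65536. Every failed guess is also likely to crash the target, which is the kind of noise an attacker would prefer to avoid.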

* The gcc FORTIFY_SOURCE option. This feature is currently only available for user-space programs, but a patch adding kernel support is circulating. This option, in particular, can check the correctness of many memory copy operations when the size of the affected buffer is known. It is less useful in the kernel - which keeps few buffers on the stack - but still worth having.

- This one came out of RedHat, and is somewhat useful.

First off FORTIFY_SOURCE does not affect only the stack; but the kernel doesn't use malloc() and thus getting this actually working right with freshly allocated buffers from kmalloc() or vmalloc() may be a pain.

FORTIFY_SOURCE may also have issues with being effective in a kernel world where things like memcpy() and strcpy() are not alone; copy_from_user() or whatever it is (I haven't kept up on my kernel hacking) needs some looks. In short, we need an actual FORTIFY_SOURCE for the kernel.

This is not much of a problem, however. FORTIFY_SOURCE uses two things from gcc: the __GNUC__ define (to see if we're using gcc), and the __builtin_* functions (mainly __builtin_object_size()). This could easily be modified for kernel use because, basically, it's completely implemented in headers.
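
The check itself is simple once the destination size is known. Here is a model of it in Python, where len(dest) stands in for what __builtin_object_size() would report at compile time; checked_copy and BufferOverflowDetected are invented names, and this is not the glibc header code.

```python
class BufferOverflowDetected(Exception):
    """Raised when a copy would write past the end of the destination."""


def checked_copy(dest: bytearray, src: bytes, n: int) -> None:
    """Copy up to n bytes of src into dest, refusing copies larger than dest."""
    if n < 0:
        raise ValueError("negative length")
    if n > len(dest):
        # The fortified path: abort instead of overflowing the buffer.
        raise BufferOverflowDetected(
            "copy of %d bytes into %d-byte buffer" % (n, len(dest))
        )
    chunk = src[:n]
    dest[:len(chunk)] = chunk  # same-length slice assignment keeps dest's size
```

The interesting part for the kernel is not this comparison but knowing the destination size in the first place, which is where buffers from kmalloc() or vmalloc() make life harder.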

* The -fstack-protector gcc option. It adds a canary to the stack, and can thus detect stack buffer overruns after they happen (but, with luck, before they do any harm).

- Back in the day, Crispin Cowan of Immunix wrote StackGuard (1996); then Etoh and Yoda made it ProPolice (1999); finally RedHat got it into gcc proper, direct port.

This is one I actually owe to RedHat for getting in mainline. I had been getting tired of the "if it's so good why doesn't anyone use it" argument for this, especially when OPENBSD used it forever (as much as I think OBSD is crap).

Unfortunately they made some changes. Originally the darn thing told you what function and sourcefile the stack smash occurred in; now the API can't even support that, you need to actually catch a stack smash in GDB to find out what you smashed. Great fun when you can't reproduce it. At least the function name would have been useful.

On Ubuntu, they'll be trying to figure out what called __stack_chk_fail() by attaching GDB to crashed processes. With a glibc modified to trigger this, plus the debugging versions of the code, we can get it.

* Information hiding. The kernel exports information (through /proc/pid/maps, for example) which can be highly useful to attackers. In particular, disclosing the locations of memory areas defeats much of the advantage of randomization. So restricting access to that information improves the security of the system.

- In grsecurity forever

Can't we protect this using SELinux policy? Might take some hacking of the /proc file system mind you.

This is directly in grsecurity; they evidently had the same thought back in 2001: /proc/$pid/maps breaks ASLR if the attacker has local access. Local access isn't hard either; a vuln in a CGI script, maybe get you a directory traversal, open www.l00t.com/somescript.php?src=.\./.\./.\./.\./.\./proc/$pid/maps (the .\. tricks some sick validation code that tries to clean ..'s out of the filename), just gotta find the right process ID.
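
To see how little work the attacker has left once the file is readable, here is a sketch that pulls base addresses out of maps-format text. The SAMPLE contents are invented for the example (the addresses and inode are not from a real process), and parse_maps is a hypothetical helper.

```python
from typing import List, NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    path: str


# Invented sample in the /proc/<pid>/maps line format:
# start-end perms offset dev inode pathname
SAMPLE = """08048000-0804c000 r-xp 00000000 03:01 12345      /bin/cat
b7f00000-b7f01000 rw-p 00000000 00:00 0
bfffe000-c0000000 rw-p 00000000 00:00 0          [stack]
"""


def parse_maps(text: str) -> List[Mapping]:
    """Parse maps-format text into (start, end, perms, path) records."""
    mappings = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start_s, end_s = fields[0].split("-")
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(Mapping(int(start_s, 16), int(end_s, 16), fields[1], path))
    return mappings


stack = [m for m in parse_maps(SAMPLE) if m.path == "[stack]"][0]
```

Every randomized base address is sitting there in plain hex, so anyone who can read the file does not need to guess anything.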

* /dev/mem is used primarily by root kits. About the only legitimate user is the X server, which uses /dev/mem for access to the frame buffer. Read access to /dev/mem is an information leak, and there is no reason for allowing write access to kernel memory at all.

- In grsecurity since forever ago as well.

Seriously there's not only a grsecurity option to prevent /dev/mem access EXCEPT for video memory; there's a policy element to do it through the ACL. That should get ported into SELinux as well, probably by an SELinux hook in the /dev/mem device itself.