Pointer tagging for x86 systems
Pointers are a fact of life for developers working in numerous languages. It is often convenient to be able to associate a small amount — a few bits at most — of ancillary information with a pointer. This can often be done within the pointer value itself with some careful masking and shifting. CPU manufacturers have been adding ways to support the addition of this sort of "tag" to pointers; the most recent may be AMD's "upper address ignore" (UAI) feature, support for which was recently posted by Bharata B Rao. This feature has an uncertain future in Linux, though, as the result of a fundamental design decision.
On a 64-bit system, a pointer is, naturally, 64 bits wide. But the CPU does not actually need all of those bits to dereference an address stored in a pointer. There are no systems (yet) that require — or can provide — all of the memory that can be addressed by 64 bits, meaning that there are ranges of address space that do not map to physical memory. Normally, user-space addresses start at (or near) zero and increase from there; that means that the highest-order bits will be zero even with the largest possible addresses. As a result, it can be possible to use those high-order bits to store other types of information.
There are numerous use cases for stashing metadata into those unused bits. Memory allocators could use that space to track different memory pools, for example, or for garbage collection. Database management systems have their own uses for that space. Applications can implement this sort of tagging now, but it must be done with care; an address with extra bits set is no longer a valid pointer, so that metadata must be masked out before dereferencing that pointer or passing it into code that does not understand the tagging scheme. That is error-prone and may slow down the application.
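The software-only approach described above can be sketched roughly as follows. This is a minimal illustration, not code from any of the projects mentioned; it assumes a 64-bit system whose user-space addresses fit in the low 57 bits, and the names TAG_SHIFT, TAG_BITS, tag_pointer(), pointer_tag() and untag_pointer() are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: user-space addresses fit in the low 57 bits, so the top
 * seven bits are free for a tag. All names here are hypothetical. */
#define TAG_SHIFT 57
#define TAG_BITS  7
#define TAG_MASK  (((UINT64_C(1) << TAG_BITS) - 1) << TAG_SHIFT)

/* Store a small tag in the unused high bits of a pointer. */
static inline uint64_t tag_pointer(void *p, unsigned int tag)
{
    uint64_t addr = (uint64_t)(uintptr_t)p;

    assert(tag < (1u << TAG_BITS));   /* the tag must fit in the field */
    assert((addr & TAG_MASK) == 0);   /* the high bits must start out clear */
    return addr | ((uint64_t)tag << TAG_SHIFT);
}

/* Recover the tag from a tagged value. */
static inline unsigned int pointer_tag(uint64_t tagged)
{
    return (unsigned int)((tagged & TAG_MASK) >> TAG_SHIFT);
}

/* Strip the tag; without hardware support this has to happen before
 * every dereference and before the value is handed to untagged code. */
static inline void *untag_pointer(uint64_t tagged)
{
    return (void *)(uintptr_t)(tagged & ~TAG_MASK);
}
```

Every access has to go through something like untag_pointer(); forgetting it once yields an invalid address, which is the error-prone step that the hardware features are meant to eliminate.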
To make life easier for the developers of this sort of application, CPU manufacturers have been adding the ability for the processor to simply ignore the non-address bits in an address value. Naturally, every manufacturer has invented its own way of supporting this feature. The AMD version, UAI, specifically allows the uppermost seven bits of an address to be used for ancillary data.
If accepted, AMD's implementation of this feature would not be the first; support for the Arm "top-byte ignore" feature was merged for the 5.4 kernel in 2019. At that time, a set of prctl() commands was added to control the use of this feature. Top-byte ignore can be enabled with:
int prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE, 0, 0, 0);
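As a rough sketch, an application might wrap that call so that it degrades gracefully on kernels or architectures without support. The helper name enable_tagged_addresses() is invented here, and the fallback definitions are only for building against headers older than 5.4; the call is expected to fail on systems that lack the feature.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/* Fallbacks for older headers; the values match <linux/prctl.h>
 * from the 5.4 kernel onward. */
#ifndef PR_SET_TAGGED_ADDR_CTRL
#define PR_SET_TAGGED_ADDR_CTRL 55
#endif
#ifndef PR_TAGGED_ADDR_ENABLE
#define PR_TAGGED_ADDR_ENABLE (1UL << 0)
#endif

/* Hypothetical helper: returns 0 if the kernel accepted the request,
 * or -1 if tagged addresses cannot be enabled for this process. */
int enable_tagged_addresses(void)
{
    /* The unused arguments must be zero; pass them as unsigned long
     * since prctl() is variadic. */
    if (prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE,
              0UL, 0UL, 0UL) == -1) {
        fprintf(stderr, "tagged addresses unavailable: %s\n",
                strerror(errno));
        return -1;
    }
    return 0;
}
```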
This interface was designed around Arm's implementation, which makes eight bits available for tag data. The AMD implementation only allows for seven bits, meaning that applications wanting to use tagged addresses will need a way to discover how many bits are available. So Rao's patch set starts with a patch from Kirill Shutemov (intended to add support for a similar Intel feature, more about that below) adding two new parameters to the above prctl() call, both of which are integer pointers. The first of those is for the caller to specify how many bits they would like to use for pointer metadata; the kernel will update that value to reflect the number of bits that are actually available. The second pointer tells the kernel where to store the number of bits to right-shift a pointer value to obtain the tag data.
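Once an application knows the bit count and the shift, pulling the tag out of a pointer is simple arithmetic. The sketch below does not call the proposed prctl() extension, which had not been merged at the time of writing; struct tag_layout and tag_from_pointer() are invented names, and the layouts mentioned afterward are illustrative assumptions derived from the bit counts given above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two values the proposed interface
 * would report back to user space. */
struct tag_layout {
    unsigned int nr_bits;  /* number of tag bits actually available */
    unsigned int shift;    /* right-shift that moves the tag down to bit 0 */
};

/* Extract the tag from a pointer value, given the reported layout. */
static inline uint64_t tag_from_pointer(uint64_t ptr, struct tag_layout l)
{
    assert(l.nr_bits > 0 && l.nr_bits < 64);
    assert(l.shift < 64);
    return (ptr >> l.shift) & ((UINT64_C(1) << l.nr_bits) - 1);
}
```

With seven tag bits at the top of the address (the UAI case), the layout would be { 7, 57 }; with Arm's eight bits it would be { 8, 56 }.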
The subsequent patches then implement support for UAI in the Linux kernel.
The idea is simple enough, and this feature already exists for the Arm architecture, but the UAI patches have still run into pushback, for a number of reasons. Perhaps the most fundamental of those is that UAI allows the most-significant bit of the address to be used by user space. In current systems, only kernel-space addresses have that bit set. Turning on UAI would allow user space to create pointer values that look like kernel addresses, but which would actually be valid user-space pointers. Those pointers can, of course, be passed into the kernel via system calls where, in the absence of due care, they might be interpreted as kernel-space addresses. The consequences of such confusion would not be good, and the possibility of it happening is relatively high.
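The confusion can be illustrated with a few lines of C. This is only a sketch of the reasoning, not kernel code: looks_like_kernel_address(), apply_uai_tag() and strip_uai_tag() are invented names, and the seven-bit tag in bits 63 to 57 is taken from the description of UAI above.

```c
#include <stdbool.h>
#include <stdint.h>

#define UAI_TAG_SHIFT 57
#define UAI_TAG_MASK  (UINT64_C(0x7f) << UAI_TAG_SHIFT)

/* The long-standing convention: only kernel addresses have bit 63 set. */
static inline bool looks_like_kernel_address(uint64_t addr)
{
    return (addr >> 63) != 0;
}

/* What user space could do with UAI enabled: place a tag in bits 63..57. */
static inline uint64_t apply_uai_tag(uint64_t addr, unsigned int tag)
{
    return (addr & ~UAI_TAG_MASK) |
           ((uint64_t)(tag & 0x7f) << UAI_TAG_SHIFT);
}

/* The address that is effectively dereferenced once the tag is ignored. */
static inline uint64_t strip_uai_tag(uint64_t addr)
{
    return addr & ~UAI_TAG_MASK;
}
```

A user address tagged with any value of 0x40 or above has bit 63 set, so a check like looks_like_kernel_address() can no longer tell it apart from a real kernel pointer.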
This mechanism could probably be made to work safely, but, as Andy Lutomirski said: "A lot of auditing of existing code would be needed to make it safe". Even more auditing would be required, of course, to keep it safe in a rapidly evolving kernel. It sounds like a recipe for ongoing security problems, which is why Thomas Gleixner said that "there is no justification for the bit 63 abuse". He suggested that AMD should rework the feature in its processors to disallow that bit in address tags; he did not say that this problem would block the merging of UAI, but the meaning was reasonably clear.
Another problem that Lutomirski pointed out is that UAI is not specific to any running context; once it is enabled, it is turned on for the entire CPU. That, too, could lead to unpleasant surprises, so he suggested that the kernel would need to make the UAI settings process-local, even if it slows down context switches considerably.
Finally, there is the issue of Intel's similar feature, called "Linear Address Masking" (LAM). It does not have the most-significant-bit issue that UAI has, and it is managed as part of the process context. It supports two modes, with either six or 15 bits being made available for ancillary data; the 15-bit mode only works if five-level page tables are not in use. LAM has been around for a while, and support patches were posted (by Shutemov) in early 2021. That work seems to have stalled after that posting, but can be expected to come back at some point.
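The difference between the two designs shows up clearly when the tag fields are written out as bit masks. The sketch below assumes that UAI uses bits 63 to 57, and that LAM uses bits 62 to 57 in its six-bit mode and bits 62 to 48 in its 15-bit mode; those positions are consistent with the bit counts above but are stated here as assumptions, and the function names are invented.

```c
#include <assert.h>
#include <stdint.h>

/* Build a mask covering bits hi..lo, inclusive. */
static inline uint64_t bit_range_mask(unsigned int hi, unsigned int lo)
{
    assert(hi < 64 && lo <= hi);
    return (~UINT64_C(0) >> (63 - hi)) & (~UINT64_C(0) << lo);
}

/* Assumed layouts: UAI takes the top seven bits, bit 63 included. */
static inline uint64_t uai_tag_mask(void)   { return bit_range_mask(63, 57); }
/* LAM leaves bit 63 alone in both modes. */
static inline uint64_t lam6_tag_mask(void)  { return bit_range_mask(62, 57); }
static inline uint64_t lam15_tag_mask(void) { return bit_range_mask(62, 48); }
```

Only the UAI mask overlaps bit 63, which is the bit that distinguishes kernel addresses from user-space addresses.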
Rao's UAI patch set deliberately keeps the AMD implementation entirely separate from the proposed LAM implementation, even though the two are doing essentially the same thing. That led recently appointed x86 co-maintainer Dave Hansen to object: "We'll have one x86 implementation of address bit masking. Both the Intel and AMD implementations will feed into a shared implementation". So this is something that would certainly need to be fixed before this work could be considered for mainline merging.
The other issues are tied to the design of the hardware, though, and will be rather harder to fix in kernel code. For these reasons, the sentiment among kernel developers seems to be that LAM is a better-designed implementation of pointer tagging and should perhaps be what all x86 systems use. In the above-linked message, Lutomirski concluded:
"I believe it's possible for a high-quality kernel UAI implementation to exist, but, as above, I think it would be slow, and it might be quite complex and fragile. Are we sure that it's worth supporting it?"
A better solution, he suggested, would be for AMD to go back to the drawing board and create its own implementation of LAM instead.
In the early days of Linux, kernel developers had to adapt to whatever the hardware manufacturers put out; the alternative was to not have hardware to run on at all. In 2022, though, those developers feel more confident in their ability to reject support for hardware features that, for whatever reason, they feel do not fit in well with the design of the system. If AMD is unable to get support for UAI into the kernel (it's worth noting that Rao hasn't given up yet), UAI is likely to go mostly unused and developers needing pointer tagging may gravitate toward competing CPUs. According to Gleixner (linked above), AMD was told about the problems with its implementation some time ago; the company may yet have reason to wish it had listened.
Pointer tagging for x86 systems
Posted Mar 28, 2022 17:10 UTC (Mon) by wtarreau (subscriber, #51152) [Link]
Pointer tagging for x86 systems
Posted Mar 28, 2022 17:49 UTC (Mon) by butlerm (subscriber, #13312) [Link]
Pointer tagging for x86 systems
Posted Mar 28, 2022 18:50 UTC (Mon) by farnz (subscriber, #17727) [Link]
While you're right that constructing kernel addresses is trivial, the mitigation today is also trivial - if an address is passed to the kernel with its top bit set, then the called code should simply fail noisily because Something is Bad.
In the UAI world, a pointer with the top bit set could be a kernel address, but it could also be the case that the user is using bit 63 as a tag bit, and the CPU will ignore it on access - the kernel can't tell.
Pointer tagging for x86 systems
Posted Mar 28, 2022 20:12 UTC (Mon) by bartoc (subscriber, #124262) [Link]
Pointer tagging for x86 systems
Posted Mar 30, 2022 15:12 UTC (Wed) by BenHutchings (subscriber, #37955) [Link]
Pointer tagging for x86 systems
Posted Mar 28, 2022 18:15 UTC (Mon) by jhoblitt (subscriber, #77733) [Link]
Pointer tagging for x86 systems
Posted Mar 28, 2022 18:33 UTC (Mon) by JoeBuck (subscriber, #2330) [Link]
Pointer tagging for x86 systems
Posted Mar 28, 2022 19:43 UTC (Mon) by jhoblitt (subscriber, #77733) [Link]
Pointer tagging for x86 systems
Posted Mar 28, 2022 22:35 UTC (Mon) by Paf (subscriber, #91811) [Link]
Pointer tagging for x86 systems
Posted Mar 28, 2022 20:29 UTC (Mon) by willy (subscriber, #9762) [Link]
> There are no systems (yet) that require — or can provide — all of the memory that can be addressed by 64 bits, meaning that there are ranges of address space that do not map to physical memory.
This conflates userspace addressing and kernel addressing. Userspace would love to have more address space available. Even with 64 bit pointers, address space fragmentation is a real thing, and prevents doing things like mmap() of an entire multi-petabyte file.
The real problem is with the CPU. More virtual address bits available to userspace means more levels of page table or larger tables at each level or some other undesirable expansion that affects performance. It also affects how the L3 (perhaps also L2 and L1?) caches are implemented as more address bits must be checked, and thus also stored. Five level page tables come with some real costs beyond the obvious extra level of lookup!
It is my opinion (based on precisely zero inside information) that Intel and AMD have decided that their current architectures will stop at five levels (57 bits of virtual address). When they need to go beyond this, we're talking about a 128 bit architecture with a rather different approach to page tables. So they're using the last few top bits to provide useful functionality like pointer tagging that people actually want.
Pointer tagging for x86 systems
Posted Mar 28, 2022 21:29 UTC (Mon) by jhoblitt (subscriber, #77733) [Link]
Pointer tagging for x86 systems
Posted Mar 29, 2022 3:10 UTC (Tue) by willy (subscriber, #9762) [Link]
Pointer tagging for x86 systems
Posted Mar 29, 2022 6:07 UTC (Tue) by wtarreau (subscriber, #51152) [Link]
This argument can only come from those nostalgic for the 8086/8088! What a disaster it was!
Pointer tagging for x86 systems
Posted Mar 29, 2022 12:33 UTC (Tue) by jem (subscriber, #24231) [Link]
sizeof (int) == 5 according to a C compiler I got to try out on a DEC-20.
http://pdp10.nocrew.org/docs/instruction-set/Byte.html
Pointer tagging for x86 systems
Posted Apr 1, 2022 1:02 UTC (Fri) by azz (subscriber, #371) [Link]
C is really not a good fit for the PDP-10. There's at least one PDP-10 C compiler (Alan Snyder's) where all the primitive types, including char, are 36 bits...
Pointer tagging for x86 systems
Posted Apr 7, 2022 14:35 UTC (Thu) by mrugiero (guest, #153040) [Link]
Pointer tagging for x86 systems
Posted Mar 29, 2022 9:31 UTC (Tue) by farnz (subscriber, #17727) [Link]
One thing that adds weight to your notion that they'll not go beyond 57 bits of VA space is that thus far, each page level in x86-64 adds 9 bits of VA space (because that fills an entire 4K page), and apart from the PTEs having their PAT bit in a different place to PDE and PDPTE (the PTE place for PAT is used in all higher level tables for "stop paging here, this is a large page"), the levels are all identical layouts.
And a second thing is the existence of Arm Morello as an implementation that supports the CHERI capabilities model (effectively 129 bit pointers consisting of a hidden "is a valid pointer" bit, then 64 bits used for compressed bounds, permissions and an object type field, 64 bits for an address). If Morello demonstrates that CHERI capabilities can be made to work well and have significant benefits while not hurting existing code, then Intel or AMD may well want to expand chapter 6 of the CHERI paper into a full design.
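The page-table arithmetic in the comment above works out as follows. This is a small sketch with invented names; it assumes 4K pages (12 offset bits) and nine bits of virtual address per level, as the comment describes.

```c
#include <assert.h>

enum {
    PAGE_OFFSET_BITS = 12,  /* a 4K page is addressed by 12 bits */
    BITS_PER_LEVEL   = 9    /* 512 eight-byte entries fill one 4K page */
};

/* Virtual-address width for a given number of page-table levels. */
static inline unsigned int va_bits(unsigned int levels)
{
    assert(levels > 0);
    return PAGE_OFFSET_BITS + BITS_PER_LEVEL * levels;
}

/* Bits of a 64-bit pointer that take no part in address translation. */
static inline unsigned int untranslated_bits(unsigned int levels)
{
    return 64 - va_bits(levels);
}
```

Four levels give 48 bits of virtual address and five levels give 57, leaving 16 and 7 untranslated bits respectively; a hypothetical sixth level would need 66 bits, which no longer fits in a 64-bit pointer.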
Pointer tagging for x86 systems
Posted Mar 28, 2022 21:23 UTC (Mon) by abufrejoval (subscriber, #100159) [Link]
Best get started earlier fixing this properly than pushing this along with these half-hearted approaches that just pile on more legacy.
Pointer tagging for x86 systems
Posted Mar 28, 2022 21:42 UTC (Mon) by dullfire (subscriber, #111432) [Link]
you mean you want x86 segmentation fully reinstated for AMD64?
Pointer tagging for x86 systems
Posted Mar 28, 2022 21:58 UTC (Mon) by abufrejoval (subscriber, #100159) [Link]
But generally speaking the old guys in the 1960s were much less restrained by today's VAX legacy and had some pretty cool ideas (and implementations like the IBM i-Series). There is a lot of inspiration to be found by looking back at what they theorized about and implemented back then.
Please note that capability based security is finding its way back in projects like Google's Fuchsia and many others really concerned about the long term viability of the von Neumann/Princeton memory model.
Pointer tagging for x86 systems
Posted Mar 28, 2022 23:33 UTC (Mon) by dullfire (subscriber, #111432) [Link]
It's very, very much like a capability-based permission system (and very, very much unlike the 8086 segmentation that was only for getting around the 16-bit limit).
Pointer tagging for x86 systems
Posted Mar 29, 2022 3:01 UTC (Tue) by mtaht (guest, #11087) [Link]
I wish they'd build it, even just as a virtual machine. It would help people to think better about where we should have started going in the 90s, especially securitywise, when it came to cpu architectures.
Take out some popcorn and watch their talk about security... https://millcomputing.com/docs/#security
Pointer tagging for x86 systems
Posted Mar 29, 2022 5:03 UTC (Tue) by pabs (subscriber, #43278) [Link]
Pointer tagging for x86 systems
Posted Mar 29, 2022 15:11 UTC (Tue) by willy (subscriber, #9762) [Link]
There are always trade-offs and people can have a real conversation about whether 32, 64 or 128 bytes is the correct size of a cache line, but there's a knee to this curve and 1-4 bytes is outside the scope of sane conversation.
Pointer tagging for x86 systems
Posted Mar 28, 2022 21:48 UTC (Mon) by calumapplepie (subscriber, #143655) [Link]
Besides, inventing a new magic CPU architecture that fixes all our problems will take years. In the meantime, let's try to use the stopgap measures to get as much performance and power efficiency as we can.
Pointer tagging for x86 systems
Posted Mar 28, 2022 22:05 UTC (Mon) by abufrejoval (subscriber, #100159) [Link]
And we should at least invest into making sure that new code can be written without fully qualified pointers (somewhat like Rust vs. Cx).
Finding ways to defuse pointers seems pretty easy compared to saving the planet, especially when potential architectures have already been proposed decades ago...
Pointer tagging for x86 systems
Posted Mar 28, 2022 22:39 UTC (Mon) by Paf (subscriber, #91811) [Link]
It’s not obvious to me these thoughts about addressing will be different.
Pointer tagging for x86 systems
Posted Mar 29, 2022 10:38 UTC (Tue) by james (subscriber, #1325) [Link]
The big problem with "next" architectures is that they don't have access to all the decades of experience conventional architectures do, but the first generation is expected to be at least competitive. If the general conclusion is that "this would be good if they changed x, y and z", it's probably too late (especially if changing those features can't be done while keeping compatibility). I'm hoping that ARM has learnt that lesson and isn't trying to productise CHERI too soon.
Examples
Take MIPS and SPARC as successful "next" architectures -- tremendously influential, but with a number of features which later RISC designs dropped.
i432 -- Bob Colwell (one of the lead designers of the Pentium Pro and Pentium 4) is fascinating on the subject: he says that both hardware and software engineers dropped the ball on performance. If they hadn't -- if they'd got to within 50% of competitive on performance, with security and reliability improvements, there would have been a market for the chip.
Itanium -- the big lesson they should have learnt before they ploughed billions into hardware and software is "can we actually provide compilers that do what we say they can", but it's at least arguable there were other mistakes. I seem to remember that when compilers scheduled Itanium programs, they used cycle timings from the current processors -- in particular, fast level 1 cache. That meant that a (theoretical) high-clocking Itanium with caches that were slower in terms of cycles (but not in nanoseconds) would spend a disproportionate amount of time waiting for data from cache when running existing binaries, so Intel didn't produce a system like that.
I really don't think that Alpha counts as "next" for anything other than performance -- and dominating the CPU performance tables throughout the 1990s, then getting dropped for business reasons, doesn't exactly count as an engineering failure.
Pointer tagging for x86 systems
Posted Mar 29, 2022 13:24 UTC (Tue) by farnz (subscriber, #17727) [Link]
CHERI has a decent chance, because it's designed to let you have a "legacy" capability that gives code all the permissions it had before CHERI came into existence; the theory is that you'll start your porting with all of the kernel and userspace pointers living in the legacy capability, and then gradually narrow things down over time, rather than having to do a big bang port. The idea is that you can put capabilities in place at the edges, and move inwards over time, to have a tiny trustworthy core that has "full" capabilities, and that simply restricts the capabilities on offer as you head further from the core.
One more thing I'd add to your Itanium example: the simulations that justified Itanium's design compared hand-written "perfect" Itanium code to current compiler output for x86. With hindsight, the thing that they could easily have seen up-front and didn't spot is that the compiler improvements needed for EPIC also improved performance of compiled code for OoOE, and thus their predicted performance advantage wasn't nearly as good as it could be because OoOE also benefited from their compiler improvements.
Pointer tagging for x86 systems
Posted Mar 28, 2022 23:21 UTC (Mon) by jrtc27 (subscriber, #107748) [Link]
CHERI is not like the other two, it *is* full capability-based addressing; the C stands for Capability
Pointer tagging for x86 systems
Posted Mar 28, 2022 22:11 UTC (Mon) by JoeBuck (subscriber, #2330) [Link]
> Perhaps the most fundamental of those is that UAI allows the most-significant bit of the address to be used by user space. In current systems, only kernel-space addresses have that bit set. Turning on UAI would allow user space to create pointer values that look like kernel addresses, but which would actually be valid user-space pointers.
Couldn't this be changed so that the most significant non-UAI bit, instead of the most significant bit, would be used to distinguish kernel addresses?
Pointer tagging for x86 systems
Posted Mar 28, 2022 23:36 UTC (Mon) by dullfire (subscriber, #111432) [Link]
(to be clear: any overhead to those code paths will be significant)
Pointer tagging for x86 systems
Posted Mar 29, 2022 16:36 UTC (Tue) by imMute (guest, #96323) [Link]
Pointer tagging for x86 systems
Posted Mar 29, 2022 16:44 UTC (Tue) by farnz (subscriber, #17727) [Link]
Bit 63 of a 64 bit register is the sign bit if you're interpreting the register contents as a signed integer; it thus has special handling to make it easier to check. test rax, rax ; jl error will jump to the label error if the pointer in rax has bit 63 set. test doesn't support a 64 bit immediate, so you need to free up a register, or accept the slowdown from accessing memory, if you want to check any bits other than 63.
Pointer tagging for x86 systems
Posted Mar 29, 2022 22:00 UTC (Tue) by khim (subscriber, #9252) [Link]
Why would you need test for that? Just use bt rax, 47; jc error and that's it.
Pointer tagging for x86 systems
Posted Mar 30, 2022 8:33 UTC (Wed) by farnz (subscriber, #17727) [Link]
Because test is friendlier to the OoOE machinery on modern CPUs than bt, and hence faster to execute. As this is pure overhead in the absence of bugs and/or malicious code attacking the kernel, we want it to be as lightweight as possible so as to spend more CPU resource doing useful work, and less CPU resource validating that userspace hasn't gone insane.
Using Agner's instruction table PDF, test reg, reg has had a lower reciprocal throughput than bt reg, imm since IvyBridge on Intel's high performance side, and since Intel Haswell processors, up until the latest Xeons, can be executed on more execution ports than bt reg,imm. On the low power Intel side (Atom from Silvermont onwards), test reg, reg takes one execution unit instead of both on Silvermont, and has 3x the throughput on Goldmont.
On the AMD side, test reg, reg becomes higher throughput than bt reg, imm in Excavator cores, returns to be equally cheap for Zen 1 and Zen 2, and then test reg, reg becomes cheaper than bt reg, imm in Zen 3. In the low power cores (Bobcat, Jaguar), the cost is the same for either instruction.
Hence the preference for test reg, reg over bt reg, imm - there are no CPUs on Agner's list where bt is faster than test, but there are several cores, including the current high performance µarches from Intel and AMD, where test is cheaper to execute than bt.
Pointer tagging for x86 systems
Posted Mar 30, 2022 10:21 UTC (Wed) by khim (subscriber, #9252) [Link]
I'm not asking why you would use bit 63 instead of bit 47 if you have a choice.
But we are discussing AMD-only features here, and you say that one would need to free up a register, or accept the slowdown from accessing memory. That's not true. If these future CPUs have a fast enough bt then there would be no slowdown (at least if that would be a CPU-specific kernel build, which is often acceptable for servers or things like ChromeOS).
Yes, you probably couldn't build a universal kernel that supports both UAI and Intel CPUs, but that's another, separate, issue.
Pointer tagging for x86 systems
Posted Mar 30, 2022 13:59 UTC (Wed) by farnz (subscriber, #17727) [Link]
If we change future CPUs to have a fast bt, then yes, we can use it instead of the test instruction that's faster on AMD EPYC processors.
But if we change UAI to not use bit 63 as part of the tag, then we could avoid the whole problem, too. And given that both Intel and AMD have changed from having bt reg, imm be as fast as test reg, reg to having test reg, reg be faster than bt reg, imm, I think a fix to UAI is almost certainly the simpler route.
This is especially true because UAI is the new thing - if making bt fast was worthwhile for things other than bt, then we'd have done it already. Making test reg, reg fast is worthwhile because it's a common idiom used by compilers for testing the relationship between a register and 0, so having it be fast speeds up other code.
Plus, your example code bt reg, 47 is buggy in its own right if AMD ever implements 5-level paging, and is buggy on Intel chips that exist today with 5-level paging. And, on top of that, if AMD does implement 5-level paging, UAI would leave no bit that can be uniquely used to distinguish kernel and user addresses, so there's no way to make the bt solution work reliably (with 5-level paging, bits 56 to 0 are VA bits, leaving 7 bits at bits 57 to 63 not translated, but UAI permits userspace processes to convert bits 63 to 57 into tag bits). Intel, at least, left bit 63 spare, and said that you get 6 tag bits with 5-level paging, or 15 with 4-level.
Pointer tagging for x86 systems
Posted Apr 1, 2022 7:57 UTC (Fri) by ecm (subscriber, #129897) [Link]
Pointer tagging for x86 systems
Posted Mar 30, 2022 13:51 UTC (Wed) by adobriyan (subscriber, #30858) [Link]
0000000000000000 <f>:
0: 48 85 c0 test rax,rax
3: 48 0f ba e0 3f bt rax,0x3f
why not use low bit instead ?
Posted Mar 30, 2022 16:00 UTC (Wed) by ballombe (subscriber, #9523) [Link]
After all unaligned accesses are not supported anymore so all pointers start with 3 zero bits.
why not use low bit instead ?
Posted Mar 30, 2022 17:17 UTC (Wed) by mathstuf (subscriber, #69389) [Link]
why not use low bit instead ?
Posted Apr 3, 2022 9:57 UTC (Sun) by dcoutts (subscriber, #5387) [Link]
https://gitlab.haskell.org/ghc/ghc/-/wikis/commentary/rts...
why not use low bit instead ?
Posted Apr 3, 2022 12:08 UTC (Sun) by mathstuf (subscriber, #69389) [Link]
why not use low bit instead ?
Posted Apr 4, 2022 12:01 UTC (Mon) by dcoutts (subscriber, #5387) [Link]
This hardware feature is almost certainly for performance though, not convenience. My guess is that it's primarily aimed at JVMs and similar.
I don't know for sure, but I'd guess that doing pointer tagging in software (and thus having to untag before dereferencing) is cheaper to do for the low bits than the high bits. That is, cheaper in terms of the extra instructions and their sizes. But then when doing it in hardware, a hardware impl can do it cheaply either way, and given that there are more bits available at the high end, it makes sense to use the high bits.
why not use low bit instead ?
Posted Apr 6, 2022 6:51 UTC (Wed) by anton (subscriber, #25547) [Link]
In many cases you know when using the address what the tag is, and then you can just use an offset at no or very low extra cost. E.g., if tag 3 means that we have a pointer to a cons cell, then car (aka head) accesses the machine word at offset -3, while cdr (tail) accesses the word at offset 5.
Low-bit tagging is used when 3, maybe 4 bits of tags are enough. If you need more, it becomes impractical, and you use high-bit tagging.
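The offset trick from the comment above can be sketched in C. This is an illustrative example only, assuming a 64-bit machine with 8-byte machine words and 8-byte-aligned cells; struct cons, CONS_TAG, tag_cons(), car() and cdr() are names made up for the sketch rather than taken from any particular Lisp or Haskell runtime.

```c
#include <assert.h>
#include <stdint.h>

#define CONS_TAG     3u  /* tag value used in the comment above */
#define LOW_TAG_MASK 7u  /* 8-byte alignment leaves three low bits free */

struct cons {
    intptr_t car;  /* at offset 0 */
    intptr_t cdr;  /* at offset 8 on a 64-bit machine */
};

/* Tag an aligned cell by adding the tag to its address. */
static inline uintptr_t tag_cons(struct cons *c)
{
    uintptr_t addr = (uintptr_t)c;

    assert((addr & LOW_TAG_MASK) == 0);  /* low bits must be free */
    return addr + CONS_TAG;
}

/* Since the tag is known, untagging folds into the load offset:
 * car sits at tagged - 3. */
static inline intptr_t car(uintptr_t tagged)
{
    return *(intptr_t *)(tagged - CONS_TAG);
}

/* cdr sits one word later: tagged - 3 + 8, i.e. tagged + 5. */
static inline intptr_t cdr(uintptr_t tagged)
{
    return *(intptr_t *)(tagged - CONS_TAG + sizeof(intptr_t));
}
```

Because the adjustment is a compile-time constant, it can be expressed as the displacement of the load instruction, so no separate masking step is needed.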
why not use low bit instead ?
Posted Mar 30, 2022 23:25 UTC (Wed) by neilbrown (subscriber, #359) [Link]
why not use low bit instead ?
Posted Apr 1, 2022 8:13 UTC (Fri) by marcH (subscriber, #57642) [Link]
Says who?
> so all pointers start with 3 zero bits.
Yes, as long as you use only 64-bit values.
Pointer tagging for x86 systems
Posted Apr 1, 2022 8:07 UTC (Fri) by marcH (subscriber, #57642) [Link]
> ....
> In the early days of Linux, kernel developers had to adapt to whatever the hardware manufacturers put out; the alternative was to not have hardware to run on at all. In 2022, though, those developers feel more confident in their ability to reject support for hardware features that, for whatever reason, they feel do not fit in well with the design of the system.
In the early days, software developers trusted CPU designs. Sure there were a couple bugs now and then but nothing huge. Then came spectre and friends...
Pointer tagging for x86 systems
Posted Apr 27, 2022 8:35 UTC (Wed) by cavok (subscriber, #33216) [Link]