
Detecting and handling split locks

June 7, 2019

This article was contributed by Marta Rybczyńska

The Intel architecture allows misaligned memory access in situations where other architectures (such as ARM or RISC-V) do not. One such situation is atomic operations on memory that is split across two cache lines. This feature is largely unknown, and its impact is understood even less. It turns out that the performance and security impact can be significant, breaking realtime applications or allowing a rogue application to slow down the system as a whole. Recently, Fenghua Yu has been working on detecting and handling these issues in the split-lock patch set, which is currently in its eighth revision.

From misaligned memory accesses to split locks

Misaligned memory access occurs when the processor accesses memory at an address that is not aligned to the size of the operand, such as an eight-byte operation on a four-byte-aligned variable. Reading four bytes from address 0x1008 is aligned, for example, while the same operation from 0x1006 is not. Misaligned accesses can cause varying behavior on different architectures, including correct and performant operation, an exception that stops the processor, or incorrect results.

Misaligned accesses may incur a performance penalty even if the processor transparently handles them. For example, a misaligned access may be split by the CPU into two separate memory operations. Another possibility is the processor generating an exception that is silently handled by the kernel. Portable and high-performance applications should avoid misaligned accesses; the kernel's code guidelines state that developers should assume natural alignment requirements on all platforms.

A special type of misaligned access is one that crosses a cache-line boundary, possibly forcing the processor to fetch two lines before performing the operation. Things get more complicated when an atomic operation is being performed: the processor must ensure that the data involved is seen consistently and correctly while the operation executes. Intel platforms support atomic accesses that are split across two cache lines; such an operation is called a "split lock".

With a split lock, the value needs to be kept coherent between different CPUs, which means ensuring that the two cache lines change together. As this is an uncommon operation, the hardware design takes a special path; as a result, split locks can have serious consequences, as described in the cover letter of Yu's patch set. Intel's choice was to lock the whole memory bus to solve the coherency problem: the processor holds the bus for the duration of the operation, meaning that no other CPUs or devices can access memory. A split lock thus blocks not only the CPU performing the access, but all others in the system. The bus-locking protocol itself also adds significant overhead to the system as a whole.

On the other hand, if the operand of an atomic operation fits into a single cache line, the processor uses a much less expensive cache lock. All of this means that developers can improve performance and avoid split locks simply by aligning their variables correctly.

Split lock consequences

Yu explained the use cases that motivated this work: hard realtime, cloud computing, and avoiding a security hole. The most important one seems to be related to systems that run hard-realtime applications on some cores and normal-priority processes on other cores. Split locks may cause the hard-realtime requirements to be broken: a split lock executed by the regular code locks the bus and blocks memory accesses by the realtime code. Yu noted that, until now, such complex realtime applications could not be supported for exactly this reason:

To date the designers have been unable to deploy these solutions as they have no way to prevent the "untrusted" user code from generating split lock and bus lock to block the hard real time code to access memory during bus locking.

In the cloud case, one user process from a guest system may block other cores from accessing memory and cause performance degradation across the whole system. In a similar way, malicious code may try to slow down the system deliberately in a denial-of-service attack.

Solutions

Intel processors, starting with the upcoming Tremont generation, will be able to generate an exception (called "Alignment Check", or #AC) when a split lock is detected. Earlier processors provide only an event counter for debugging purposes (exposed as sq_misc.split_lock in perf), which does not allow the system to take immediate action. Yu's work is based on this new capability.
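On pre-Tremont hardware, that counter can still be sampled with perf. The event name comes from the article; the exact command lines below are just one plausible way to use it:

```shell
# Count split locks machine-wide for ten seconds:
perf stat -e sq_misc.split_lock -a sleep 10

# Attribute split locks to a particular workload (./myapp is a placeholder):
perf record -e sq_misc.split_lock ./myapp
perf report
```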

The correct response to split locks, including what to do when they are detected while the system firmware is running, was the subject of some discussion during the review of the earlier version of the patch set. The implementation in the current version concentrates on detection of the problem.

If a split-lock event happens in the kernel itself, the kernel issues a warning and disables detection on the current CPU. After the warning, the faulting instruction will execute and the system will continue; whether the system should go on or panic was one of the main topics of discussion. The rationale is that a split lock in the kernel is a bug and should be fixed, but it is not so severe that the kernel should be made to panic.

The situation is different for user processes, which will be sent a fatal (by default) SIGBUS signal. The issue will need to be fixed before that program can run successfully. Something similar happens when a split lock is created by the system's firmware: the system will simply hang at that point. The developers decided on this handling because they were afraid that otherwise the firmware would never be fixed.

Split-lock detection is enabled by default when it is supported by the hardware. However, system administrators have control over the feature: they can use a new kernel parameter (nosplit_lock_detect) at boot time to disable it. There is also a sysfs interface to disable it at runtime at /sys/devices/system/cpu/split_lock_detect.
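The knobs named in the article can be sketched as follows; the boot parameter and sysfs path are from the patch set, while the exact read/write semantics of the sysfs file are an assumption here:

```shell
# At boot, on the kernel command line, detection can be turned off:
#     nosplit_lock_detect

# At run time, via the sysfs interface (assumed 0/1 semantics):
cat /sys/devices/system/cpu/split_lock_detect
echo 0 > /sys/devices/system/cpu/split_lock_detect   # disable
```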

The patch set also includes support for KVM: it emulates the feature's control register in guests, exposing the capability to them. The host system will have the feature enabled by default. What to do with guests was discussed on multiple occasions; the agreed-on solution, which will appear in the next iteration, is to enable detection in guests when the host kernel has it enabled. That means that, if the host kernel has split-lock detection enabled and a guest triggers the exception, the guest will be stopped. If, instead, the host kernel has detection disabled, the guest may choose to enable and use it, but is not required to.

Further work

The work has been through multiple iterations at this point and has received regular comments from kernel developers, including Thomas Gleixner and Ingo Molnar. It still has some issues pending, but at the current pace it should show up in the mainline kernel before too long.


Index entries for this article
Kernel: Architectures/x86
GuestArticles: Rybczynska, Marta



Detecting and handling split locks

Posted Jun 7, 2019 19:02 UTC (Fri) by jcm (subscriber, #18262) [Link]

Intel calls these "data chunk split misaligned" loads and stores. When they occur, the (e.g.) load buffer will hold them until the ROB indicates that the execution results would be committable to architectural state. There's a couple of fun patents if you search "data chunk split misaligned" in your favorite search engine.

Worst case scenario ?

Posted Jun 7, 2019 20:50 UTC (Fri) by meuh (guest, #22042) [Link]

What about an unaligned atomic operation across two consecutive unmapped pages, which will be mapped from physical memory belonging to two different NUMA nodes ?

Worst case scenario ?

Posted Jun 7, 2019 20:58 UTC (Fri) by mpr22 (subscriber, #60784) [Link]

At that point I submit that you need to either recruit better programmers or identify the saboteur :)

Worst case scenario ?

Posted Jun 7, 2019 21:24 UTC (Fri) by JoeBuck (subscriber, #2330) [Link]

"Identify the saboteur" ... if this approach can be used deliberately as an attack that could allow unprivileged code in a container to bring the host to its knees.

Worst case scenario ?

Posted Jun 7, 2019 23:41 UTC (Fri) by mpr22 (subscriber, #60784) [Link]

I wasn't even thinking that far, merely as far as "if your software system does this, then something is wrong with it and you should do something about that".

Detecting and handling split locks

Posted Jun 7, 2019 22:06 UTC (Fri) by scientes (guest, #83068) [Link]

ARM can handle unaligned memory accesses since ARMv6.

Detecting and handling split locks

Posted Jun 7, 2019 22:58 UTC (Fri) by daney (guest, #24551) [Link]

They are not talking about "normal" memory accesses. The fun seems to happen on unaligned "atomic" accesses. IIRC, ARM does not allow for cache line splitting on atomic operations nor on load-exclusive/store-exclusive.

Detecting and handling split locks

Posted Jul 5, 2021 2:00 UTC (Mon) by plugwash (subscriber, #29694) [Link]

It's not that simple.

arm can handle unaligned accesses on regular ldr and str instructions (and I think also ldrh and strh but i'm not 100% sure on that) since armv6. There are other instructions the hardware can't handle unaligned accesses on though (ldrd and vldr spring to mind as ones I've had trouble with in real code). Some of these will be emulated by 32-bit kernels but not by 64-bit kernels leading to programs crashing when run on 64-bit kernels.

While I haven't checked I certainly wouldn't expect fancy stuff like atomics to properly support unaligned access on arm.

Detecting and handling split locks

Posted May 28, 2022 2:06 UTC (Sat) by plugwash (subscriber, #29694) [Link]

Unaligned access on arm is a horrible mess.

On modern arm32, regular loads and stores are unaligned safe, but many other instructions (notably ldrd and vldr) are not. 32-bit kernels will trap and emulate unaligned accesses by default, but 64-bit kernels running 32-bit applications will not.

Detecting and handling split locks

Posted May 28, 2022 2:44 UTC (Sat) by pabs (subscriber, #43278) [Link]

Is the 32-bit userland on 64-bit kernel issue something that could be handled by the kernel or a hardware limitation?

Detecting and handling split locks

Posted Jun 8, 2019 1:12 UTC (Sat) by quotemstr (subscriber, #45331) [Link]

Why doesn't the processor just lock both cache lines in the case that an atomic operation spans a cache line? I don't see why it would have to lock the whole bus.

Detecting and handling split locks

Posted Jun 8, 2019 3:11 UTC (Sat) by khim (subscriber, #9252) [Link]

Because it would need to introduce a whole new protocol for that: basically adding *lots* of complexity to the common and very time-critical path for the sake of something which, in a properly designed system, doesn't happen at all.

Detecting and handling split locks

Posted Jun 8, 2019 23:43 UTC (Sat) by luto (subscriber, #39314) [Link]

Intel *has* that feature: TSX. I’m a bit surprised that microcode doesn’t emulate split locks using TSX.

Detecting and handling split locks

Posted Jun 9, 2019 22:53 UTC (Sun) by khim (subscriber, #9252) [Link]

TSX is a very recent invention and its implementation is very tricky and buggy. The first generations of CPUs that implemented it had it disabled on almost all models in the end.

I'm not surprised they haven't used it to implement something they needed 20 years ago.

Detecting and handling split locks

Posted Jun 8, 2019 5:14 UTC (Sat) by marcH (subscriber, #57642) [Link]

In software you can implement a large number of overly complex features for a fixed and low cost and deliver bug fixes later (or... never).

Hardware is a bit different.

Detecting and handling split locks

Posted Jun 8, 2019 19:39 UTC (Sat) by mokki (subscriber, #33200) [Link]

Most likely the easiest way to avoid deadlocks for this rare case is to take a bus lock.

Detecting and handling split locks

Posted Jun 10, 2019 10:03 UTC (Mon) by dmiller (guest, #115155) [Link]

You could get a deadlock pretty easily this way, and that's one thing that hardware engineers are going to be very careful to try and avoid. Imagine processor 1 has virtual address 0xA0000 mapped to physical address 0x1000 and 0xA1000 mapped to 0x2000, while processor 2 has 0xA0000 mapped to 0x2000 and 0xA1000 mapped to 0x1000. Now say that both do a 4-byte atomic operation to 0xA0FFF at about the same time. This is equivalent to the "Dining Philosophers Problem" for n=2 which requires a good amount of complexity to do safely.

Detecting and handling split locks

Posted Jun 8, 2019 17:55 UTC (Sat) by andresfreund (subscriber, #69562) [Link]

> The Intel architecture allows misaligned memory access in situations where other architectures (such as ARM or RISC-V) do not. One such situation is atomic operations on memory that is split across two cache lines. This feature is largely unknown, but its impact is even less so.

Does anybody have any insight why support for such "split" atomic accesses was added? Seems like it's an obvious source of complexity - without IMO having a lot of practical benefit? Allowing un-aligned memory accesses is certainly useful for some work, but I don't quite see that being necessary for atomic operations?

Detecting and handling split locks

Posted Jun 8, 2019 22:32 UTC (Sat) by pbonzini (subscriber, #60935) [Link]

Because back in the 386 (and earlier) days there was no cache and atomic operations were done by literally locking the whole bus, so it was simpler for the processor not to do any check on the alignment (after all, it did not do any check on non-locked accesses). And when a cache was added, backwards compatibility meant you couldn't drop the feature.

Detecting and handling split locks

Posted Jun 10, 2019 2:38 UTC (Mon) by jcm (subscriber, #18262) [Link]

Indeed. This is literally what the LOCK instruction prefix byte means in x86.

Detecting and handling split locks

Posted Jun 10, 2019 6:20 UTC (Mon) by epa (subscriber, #39769) [Link]

That doesn’t explain why it wasn’t quietly dropped with the move to x86-64.

Detecting and handling split locks

Posted Jun 10, 2019 7:10 UTC (Mon) by camhusmj38 (subscriber, #99234) [Link]

Lots of things should have been tidied up at that time. Presumably, AMD thought that might make porting more difficult.

Detecting and handling split locks

Posted Jun 10, 2019 20:26 UTC (Mon) by flussence (subscriber, #85566) [Link]

If x86-64 was to support a full x86-32 feature set anyway (which it had to, to avoid the fate of Itanic), it would probably be more costly to selectively gate off all those features in long mode than to let them through.

Detecting and handling split locks

Posted Jun 8, 2019 17:57 UTC (Sat) by lkundrak (subscriber, #43452) [Link]

Thanks for this article. It did an excellent job at being an easy read even for someone who happens to be rather ignorant about the multi-processor synchronization or cache architectures. Quality writing of this sort makes it easy to recommend subscribing to LWN to friends.

indeed

Posted Jun 9, 2019 1:19 UTC (Sun) by gus3 (guest, #61103) [Link]

I agree with you wholeheartedly. In fact, you inspired me to renew my subscription, which was set to end next month.

Improving portability advice

Posted Jun 8, 2019 20:54 UTC (Sat) by mirabilos (subscriber, #84359) [Link]

The kernel documentation currently states:

> When writing code, assume the target architecture has natural alignment
> requirements.

This is actually bad advice, here’s my suggestion for improvement,
and why this is important:

> When writing code, assume the target architecture has natural alignment
> requirements, but be prepared for less alignment.

On m68k, even dword (32-bit) quantities are aligned only to 16 bits, and there has been software that implicitly relies on the padding that natural-alignment architectures insert into a struct { short; int; }, padding that is not added on m68k.

We’ve been fixing some of them by changing some assertions (against
accidental struct growth) from sizeof(struct) == value to <= value,
but the proper fix is to make all struct padding explicit by adding
explicitly unused members to the structures. This also helps visualising
the problems of uninitialised padding when comparing them.

I’d appreciate it if someone with experience in the LKML community
could pick this up and discuss and fix it there.

Improving portability advice

Posted Jun 8, 2019 23:14 UTC (Sat) by scientes (guest, #83068) [Link]

I did that for SocketCAN frames. a2f11835994ed5bcd6d66c7205947cc482231b08

Improving portability advice

Posted Jun 9, 2019 4:29 UTC (Sun) by marcH (subscriber, #57642) [Link]

I liked this BTW, do you approve? http://www.catb.org/esr/structure-packing/

Improving portability advice

Posted Jun 9, 2019 11:53 UTC (Sun) by mirabilos (subscriber, #84359) [Link]

Oh, so ESR wrote a detailed article about it. I only knew it from OpenBSD commits and figured out the rest myself.

Yes! I’ve significantly reduced the in-memory size of several structures (which are then used in arrays, which pads them to powers of 2 at the end too, so going down from 36 to 32 actually reduces from 64 to 32) in some codebases.

Improving portability advice

Posted Jun 9, 2019 18:44 UTC (Sun) by itvirta (guest, #49997) [Link]

What exactly pads your structures to powers of two in arrays?

Improving portability advice

Posted Jun 10, 2019 13:17 UTC (Mon) by bcopeland (subscriber, #51750) [Link]

If the arrays are dynamically allocated, slab allocator (for example) might do this. I once looked at a case that was allocating a bunch of individual pointers that all got rounded up from 8 to 32 bytes, meaning 75% of each allocation was unused. It makes sense that the allocators work that way, but if you don't know that and do this sort of thing hundreds of times, and especially if the memory being allocated happens to be something like 2**n + small_amount, you can accidentally waste a lot of memory.

Improving portability advice

Posted Jun 11, 2019 13:26 UTC (Tue) by k8to (guest, #15413) [Link]

The typical userland allocators on Solaris & Windows also behave this way. It's not very surprising.

On Linux my memory is a bit muddied between the default glibc allocator, tcmalloc, jemalloc etc. so I'm less confident, but I would expect the same.

Improving portability advice

Posted Jun 11, 2019 15:02 UTC (Tue) by fotoba (subscriber, #61150) [Link]

If the pointers are huge or far, rather than near or unspecified
(https://www.geeksforgeeks.org/what-are-near-far-and-huge-...),
this problem would not happen, as I read what happens.

Improving portability advice

Posted Jun 14, 2019 8:03 UTC (Fri) by meuh (guest, #22042) [Link]

> This is actually bad advice, here’s my suggestion for improvement,
> and why this is important:
>
>> When writing code, assume the target architecture has natural alignment
>> requirements, but be prepared for less alignment.

The original advice should be interpreted as you have to explicitly pad your structure so that members are naturally aligned.

> [...] the proper fix is to make all struct padding explicit by adding explicitly unused members to the structures.

Yes, this is the rule you have to follow when designing a data structure to be exchanged between the kernel and user space (or between the kernel and firmware/hardware), so that the structure has a fixed size regardless of the architecture (and padding can be initialized and/or inspected for unknown bits set).

Detecting and handling split locks

Posted Jan 22, 2020 16:22 UTC (Wed) by erkki (subscriber, #124843) [Link]

I benchmarked this; split locks cause a 200-times slowdown for single-threaded workloads: https://rigtorp.se/split-locks/

Detecting and handling split locks

Posted Nov 14, 2023 20:10 UTC (Tue) by d3x0r (guest, #168005) [Link]

Is there any support in GDB for pausing on these? I apparently have some code that needs alignment, but I don't know where; most structures are typically aligned, so setting a breakpoint on the exchange would otherwise catch thousands of operations. Is there a signal generated? When I search for 'split lock detection gdb' I get articles about one or the other, but not both topics together.

Detecting and handling split locks

Posted Nov 14, 2023 21:28 UTC (Tue) by farnz (subscriber, #17727) [Link]

The easiest way to catch these is to boot with split_lock_detect=fatal; then the kernel will send your process a SIGBUS signal every time you get it wrong. You can then use catch signal SIGBUS or handle SIGBUS stop to get GDB to take control whenever you get a SIGBUS, and use your normal tools to debug the program. Note that catch signal SIGBUS sets a catchpoint, which can have all your usual nice tools for working with breakpoints applied to it (e.g. automatic commands run when you hit the catchpoint, or conditional behaviour such as continuing anyway for a SIGBUS that isn't related to a split lock).
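A sketch of that workflow as a debugging session (split_lock_detect=fatal must already be on the kernel command line; ./myapp is a placeholder program):

```shell
gdb ./myapp
(gdb) handle SIGBUS stop nopass   # stop on SIGBUS, don't deliver it
(gdb) run
# When a split lock raises #AC and the kernel sends SIGBUS, gdb stops
# at the offending instruction; inspect it and the call chain:
(gdb) x/i $pc
(gdb) bt
```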

Detecting and handling split locks

Posted Nov 15, 2023 4:25 UTC (Wed) by wsy (subscriber, #121706) [Link]


Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds