
Protecting control dependencies with volatile_if()

By Jonathan Corbet
June 18, 2021
Memory ordering issues are, as Linus Torvalds recently observed, "the rocket science of CS". Understanding memory ordering is increasingly necessary to write scalable code, so kernel developers often find themselves having to become rocket scientists. The subtleties associated with control dependencies turn out to be an especially tricky sort of rocket. A recent discussion about how to force control dependencies to be observed shows the sorts of difficulties that arise in this area.

Control dependencies

The C programming language was designed in the era of simple, uniprocessor computers. When a developer wrote a series of C statements, they could expect those statements to be executed in the order written. Decades later, though, the situation has become much more complicated; code can be extensively optimized by both compilers and CPUs to the point that it bears little resemblance to what was originally written. Code can be reordered and even eliminated if the compiler (or the processor) thinks that the end result will be the same. The effects of this reordering on single-threaded code are (in the absence of bugs) limited to making it run faster. When there are multiple threads of execution running simultaneously, though, there can be surprises in store. One thread may observe things happening in a different order than others, leading to all sorts of unfortunate confusion.

When the visible order of operations across processors is important, developers will often use barriers to ensure that operations are not reordered in damaging ways. There are, however, cases where developers can count on things happening in the right order because there is no alternative; these are described in terms of "dependencies". There are three broad classes of dependencies, described in this article from our recent lockless patterns series. Consider, for example, a simple data dependency:

    int x = READ_ONCE(a);
    WRITE_ONCE(b, x + 1);

The write to b simply cannot be reordered ahead of the read of a because neither the compiler nor the CPU knows what value should be written. The write has a data dependency on the preceding read; that dependency will prevent those two operations from being reordered. That, of course, assumes that the compiler does not conclude that it already knows what the value of a will be, perhaps from a previous read; that is why READ_ONCE() is used. The second article in the lockless patterns series describes READ_ONCE() and WRITE_ONCE() in detail.
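
Roughly speaking, both macros force the access to actually happen by routing it through a volatile-qualified pointer; a simplified sketch (not the kernel's exact definitions, which add type checking and other details) looks like this:

    /* Simplified sketch, not the kernel's exact definitions. */
    #define READ_ONCE(x)	 (*(const volatile typeof(x) *)&(x))
    #define WRITE_ONCE(x, v)	 (*(volatile typeof(x) *)&(x) = (v))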

Control dependencies are a bit more complex. Consider code like this:

    if (READ_ONCE(a))
    	WRITE_ONCE(b, 1);

There is no data dependency linking the read of a and the write to b, but that write can only occur if a has a non-zero value; the read of a must thus occur before the write. This ordering forced by a conditional branch is a control dependency. More generally, there are three things that must be present to establish a control dependency:

  • A read from one location (a in the case above)
  • A conditional branch that depends on the value that was read
  • A write to another location in one or more branches

When those conditions exist, there is a control dependency from the read to the write that prevents the two operations from being reordered with respect to each other.

The evil optimizing compiler

Or, at least, it would be nice if things worked that way. The problem is that, while the hardware works that way, the C language does not recognize the existence of control dependencies or, as the infamous kernel memory-barriers.txt document puts it: "Compilers do not understand control dependencies. It is therefore your job to ensure that they do not break your code." While there does not appear to be much of a history of code being broken through overly aggressive optimization of code with control dependencies, it is something that developers worry about. That has led to the proposal by Peter Zijlstra of a mechanism called volatile_if().

What sort of problem is this patch trying to address? Consider an example posted by Paul McKenney in the discussion:

    if (READ_ONCE(A)) {
	WRITE_ONCE(B, 1);
	do_something();
    } else {
	WRITE_ONCE(B, 1);
	do_something_else();
    }

This code has a control dependency between the read of A and the writes to B; each write is in a branch of the conditional statement and the fact that they write the same value does not affect the dependency. So one might conclude that the two operations could not be reordered. Compilers, though, might well rearrange the code to look like this instead:

    tmp = READ_ONCE(A);
    WRITE_ONCE(B, 1);
    if (tmp)
	do_something();
    else
	do_something_else();

This code looks equivalent, but the test on the value read from A no longer occurs before the write to B. That breaks the control dependency, freeing a sufficiently aggressive CPU to move the write ahead of the read, possibly creating a subtle and unpleasant bug.

Since C doesn't recognize control dependencies, avoiding this kind of bug can be difficult, even in cases where the developer is aware of the problem. One sure solution is to read A with acquire semantics and write B with release semantics, as described in the lockless patterns series, but acquire and release operations can be expensive on some architectures. That expense is not usually needed in this case.
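
For illustration, that heavier solution might look like the following sketch, using the kernel's smp_load_acquire() and smp_store_release() helpers (this code is not from the discussion):

    /* The acquire load orders all later accesses after the read of A;
       the release store orders all earlier accesses before the write
       to B.  Correct, but possibly costlier than needed. */
    if (smp_load_acquire(&A))
    	smp_store_release(&B, 1);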

volatile_if()

Zijlstra wrote in his proposal that a good solution would be to add a qualifier to the if statement to indicate that a dependency exists:

    volatile if (READ_ONCE(A)) {
    	/* ... */

The compiler would respond by ensuring that a conditional branch is emitted and that code from within the branches is not lifted out of those branches. That, however, requires cooperation from compiler writers; as Segher Boessenkool noted, that is unlikely to happen unless the standards committee gives its blessing to the idea of putting qualifiers like volatile on statements. Failing that, Zijlstra proposed a magic macro:

    volatile_if(condition) {
    	/* true case */
    } else {
    	/* false case */
    }

He provided implementations for a number of architectures; these generally depend on hand-written assembly code to manually emit the conditional branch instruction needed to create the control dependency at the CPU level.

The resulting discussion focused on two main topics: the implementation of volatile_if() and whether it is needed at all. On the implementation side, Torvalds suggested a simpler approach:

    #define barrier_true() ({ barrier(); 1; })
    #define volatile_if(x) if ((x) && barrier_true())

The barrier() macro causes no code to be emitted; it is just an empty block presented to the compiler as assembly code. That keeps the compiler from reordering operations from one side of the barrier to the other; it also, Torvalds said, would force the compiler to emit the branch since it could only be evaluated on the "true" side of the branch. Life turned out to not be so simple, though; a redefinition of barrier() along the lines suggested by Jakub Jelinek would be required to make this scheme actually work.
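
To see the mechanics, a volatile_if() written in terms of those two macros would expand to something like:

    /* volatile_if (READ_ONCE(A)) { WRITE_ONCE(B, 1); } expands to: */
    if ((READ_ONCE(A)) && ({ barrier(); 1; })) {
    	WRITE_ONCE(B, 1);
    }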

But Torvalds also wondered why developers were worried about this problem in the first place, since he does not think it can manifest in real code:

Again, semantics do matter, and I don't see how the compiler could actually break the fundamental issue of "load->conditional->store is a fundamental ordering even without memory barriers because of basic causality", because you can't just arbitrarily generate speculative stores that would be visible to others.

And, indeed, evidence of such problems actually occurring is hard to find. He did eventually come around to seeing that a problem could potentially exist but also made it clear that he doesn't think there is any code in the kernel now that would be affected by it.

The conversation (eventually) wound down without coming to any real conclusion on whether volatile_if() is needed or not. Experience says, though, that wariness toward compiler optimizations is usually a good idea. Even if no mechanism for explicitly marking control dependencies is merged into the mainline now, it will be waiting in the wings should future compiler releases create problems.




Protecting control dependencies with volatile_if()

Posted Jun 18, 2021 19:41 UTC (Fri) by ndesaulniers (subscriber, #110768) [Link]

I think the most interesting comment on that centithread came from Linus discussing the idea that perhaps full "memory" clobbers are somewhat overkill for some of their current use cases, but that there's no general way at the moment to permit just reads or just writes to be reordered by the compiler. Selected snippets:

> in the general barrier case, we very much want to have that "memory"
> clobber, because the whole point of the general barrier case is that
> we want to make sure that the compiler doesn't cache memory state
> across it (ie the traditional use was basically what we now use
> "cpu_relax()" for, and you would use it for busy-looping on some
> condition).

> In the case of "volatile_if()", we actually would like to have not a
> memory clobber, but a "memory read". IOW, it would be a barrier for
> any writes taking place, but reads can move around it.

> I don't know of any way to express that to the compiler.

> It would be much nicer to have a "memory read" marker instead, to let
> the compiler know "I need to have done all pending writes to memory,
> but I can still cache read values over this op because it doesn't
> _change_ memory".

> It's actually not all that uncommon that you have a "this
> doesn't modify memory, but you can't move writes around it".

> Anybody have ideas or suggestions for something like that?

https://lwn.net/ml/linux-kernel/CAHk-=wjwXs5+SOZGTaZ0bP9n...
https://lwn.net/ml/linux-kernel/CAHk-=wiOTcBuxx1pYj=mQz-f...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100953

Protecting control dependencies with volatile_if()

Posted May 14, 2022 12:26 UTC (Sat) by liusy58 (guest, #158480) [Link]

This part is much harder for me. Could you please give a more detailed explanation?

Help educate an old 8-bit assembler mindset

Posted Jun 19, 2021 2:48 UTC (Sat) by felixfix (subscriber, #242) [Link]

The last example is not clear to me. No matter which branch is taken, WRITE_ONCE(B, 1) will be executed first thing, before calling either do_something() or do_something_else(). How can there be any problem lifting that out of the branches and making it a single call before testing A?

This assumes A and B are not aliases (such as both being pointers to some location C) or I/O addresses; and I imagine there are other special cases. But with just ordinary memory, I don't see how the re-ordering can cause any problems.

Help educate an old 8-bit assembler mindset

Posted Jun 19, 2021 4:53 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

The only reason you would do this is if you *need* the read to happen before the write, notwithstanding the fact that the read and write are to unrelated objects/variables. In practice, that means you're probably talking about some weird IO-mapped memory or the like. Or else it means you're trying to do a weird lockless thing, which is a whole other ball of wax.

Regardless, it's not aliasing. If you play games with aliasing, one of two things happens:

1. You do not violate the strict aliasing rule (e.g. because the types are the same, one of them is char, etc.). The compiler is required to assume that your pointers may alias. It must emit additional loads and stores as may be necessary; READ_ONCE() etc. are not required.
2. You do violate the strict aliasing rule. Your code is meaningless garbage (UB happens), and no amount of READ_ONCE() etc. will fix it.

Help educate an old 8-bit assembler mindset

Posted Jun 19, 2021 14:07 UTC (Sat) by felixfix (subscriber, #242) [Link]

Eh, I guess I wasn't clear. What I want to know is, why is it considered bad (or "wrong") to optimize that common WRITE_ONCE to before the READ? I understand why if A and B are actually the same, or if there are I/O operations involved, etc. But I do not understand for the normal memory case.

Help educate an old 8-bit assembler mindset

Posted Jun 19, 2021 14:43 UTC (Sat) by corbet (editor, #1) [Link]

It's all a matter of ordering as seen by others. The write to B may be an "I'm dealing with it" flag; some other process may have changed A and wants to be sure that its change will be observed before B is written.
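
A sketch of that scenario (the flag protocol here is illustrative, not from the article):

/* CPU 0 posts a request: */
WRITE_ONCE(a, 1);

/* CPU 1 runs the article's example: */
if (READ_ONCE(a))
    WRITE_ONCE(b, 1);	/* the "I'm dealing with it" flag */

/* While the control dependency holds, anyone who observes b == 1 knows
   that CPU 1 has already seen a == 1.  If the store to b can move ahead
   of the load of a, that inference breaks. */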

Help educate an old 8-bit assembler mindset

Posted Jun 19, 2021 16:01 UTC (Sat) by felixfix (subscriber, #242) [Link]

But that too is a special case, only applying to threaded code (which I've written enough to be careful around).

"That breaks the control dependency, freeing a sufficiently aggressive CPU to move the write ahead of the read, possibly creating a subtle and unpleasant bug."

So again, other than special cases, including aliased variables, I/O, threaded code, and other unenumerated special cases, what kind of subtle and unpleasant bugs can this create? Or does that sentence only refer to special cases?

Help educate an old 8-bit assembler mindset

Posted Jun 19, 2021 17:47 UTC (Sat) by corbet (editor, #1) [Link]

This is an article about kernel programming, where everything is threaded. Or, as the article says, "The effects of this reordering on single-threaded code are (in the absence of bugs) limited to making it run faster. When there are multiple threads of execution running simultaneously, though, there can be surprises in store. One thread may observe things happening in a different order than others, leading to all sorts of unfortunate confusion."

Help educate an old 8-bit assembler mindset

Posted Jun 20, 2021 7:29 UTC (Sun) by pbonzini (subscriber, #60935) [Link]

To state the obvious, this only applies if there aren't any other ways that the ordering is guaranteed, the most obvious of which would be a spinlock or mutex.

Control dependencies also do not matter in most lockless cases, much less control dependencies where the compiler does have meaningful optimizations that would break. Of course, sometimes they do.

Help educate an old 8-bit assembler mindset

Posted Jun 20, 2021 13:58 UTC (Sun) by felixfix (subscriber, #242) [Link]

I had forgotten that this is about kernels, and thus always threaded. I have written threaded user-space programs, and written ancient 8 bit kernels which were not threaded, and once even modified one line in the linux kernel (some RS-232 control line) for my own use; but never combined those, and lost track of that aspect for this (lwn) thread.

Thanks to all concerned for my education.

Protecting control dependencies with volatile_if()

Posted Jun 19, 2021 4:28 UTC (Sat) by alison (subscriber, #63752) [Link]

> #define barrier_true() ({ barrier(); 1; })
> #define volatile_if(x) if ((x) && barrier_true())
> The barrier() macro causes no code to be emitted; it is just an empty block presented to the compiler as assembly code.

Hmm, wouldn't the code need __asm__ attribute to be presented as assembly code?

Protecting control dependencies with volatile_if()

Posted Jun 19, 2021 23:37 UTC (Sat) by tialaramex (subscriber, #21167) [Link]

Sure, it probably is defined as something like:

#define barrier() __asm__ __volatile__("":::"memory")

What that says is: Here are some assembly instructions you must execute at this point in my code. The instructions somehow don't need any of my variables in CPU registers, nor do the CPU registers need to be copied into any of my variables. However, these instructions touch memory somehow. Also, that list of instructions is empty.

This is a very common trick, the compiler is used to seeing this, and won't go "Huh? But your list of instructions is empty". Since it claims to touch memory in an unspecified way, no optimisations may move reads or writes past this line, as that would alter the meaning of the program. This is of course exactly what everybody who writes this wanted to achieve and the compiler authors know that.

Since the list of instructions is empty you don't even need to know the assembly language of the target CPU.

Protecting control dependencies with volatile_if()

Posted Jun 20, 2021 20:11 UTC (Sun) by itsmycpu (subscriber, #139639) [Link]

So why (and how) can this be implemented more efficiently (on any architecture) than a combination of load_acquire(A) and store_release(B)?

Protecting control dependencies with volatile_if()

Posted Jun 20, 2021 20:31 UTC (Sun) by itsmycpu (subscriber, #139639) [Link]

...or, in this case, just a load_acquire.

How is a barrier like

> #define barrier() __asm__ __volatile__("":::"memory")

more efficient than the barrier implied by load_acquire (on any architecture)?
(Not a rhetorical question.)

Protecting control dependencies with volatile_if()

Posted Jun 20, 2021 22:41 UTC (Sun) by tialaramex (subscriber, #21167) [Link]

This barrier has *zero* impact on the machine, it doesn't emit any instructions for the machine to run. This barrier only changes what the compiler will do. I can't emphasise that enough. There might in fact be no difference in the machine code at all, the goal of barrier() is only to prevent the compiler from potentially spotting an optimisation that's inappropriate and mustn't happen. If your compiler was not in fact going to attempt an optimisation that's now impossible the actual machine code is literally identical.

In contrast acquire/ release have consequences for the machine as well as the compiler. So the compiler will emit different machine code to ensure that the sequence of operations has acquire/ release semantics, and then at runtime, the actual CPU may be doing something different to ensure the semantics you asked for are achieved.

On x86-64 the instructions are only slightly different, and only in some cases, you actually pay for implicit acquire/ release semantics all the time on x86 without knowing it; on the original DEC Alpha it's a horrible nightmare of locking and CPU barriers; and on some ARM CPUs you can literally say "This load has acquire semantics" and "This store has release semantics" and everything works nicely.

So barrier() definitely CAN be faster because it doesn't emit any instructions at all. However of course just because you wrote fewer instructions doesn't make you faster. To find that out reliably in each case you'd have to benchmark. But first you'd have to be sure you care enough to go to that effort.

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 0:24 UTC (Mon) by itsmycpu (subscriber, #139639) [Link]

> This barrier has *zero* impact on the machine, it doesn't emit any instructions for the machine to run. This barrier only changes what the compiler will do.

Are you sure? I'm not so sure that architectures with a so-called "weaker" memory model don't require some kind of ordering instruction to avoid hardware re-ordering across instructions. In this case, it sounds like the conditional branch instruction might have (welcome) effects on hardware ordering. It is even called a "dummy" conditional branch. Specifically, there appears to be a certain equivalence on arm64 between

LDR X0, [X1]
CBNZ X0, 1f // Dummy ctrl

and ("with LTO")

LDAPR X0, [X1] // RCpc

> In contrast acquire/ release have consequences for the machine as well as the compiler.

Not necessarily. On modern x86, load_acquire and store_release are usually simple MOV instructions.

My question is about the specifics of architectures where this isn't so simple.

The ordering required here isn't really very different from acquire and/or release, and barrier() is apparently a very general tool. So it isn't clear what advantages it is meant to have on specific architectures, especially as it would seem that otherwise barrier() could be used to implement even load_acquire.

Or perhaps at the end of the email discussion, volatile_if is meant more as a convenience than as a possible performance optimization.

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 2:16 UTC (Mon) by tialaramex (subscriber, #21167) [Link]

> Are you sure?

Yes of course I'm sure. Why else would I write that I cannot emphasise this enough?

> I'm not so sure that architectures with a so-called "weaker" memory model don't require some kind of ordering instruction to avoid hardware re-ordering across instructions

Yes, most machines would need some explicit machine instructions to prevent the machine from re-ordering memory access if that was what you wanted to achieve. The Linux barrier() is a compiler barrier, not a CPU barrier. So it does not forbid the CPU from re-ordering memory accesses as it wishes, that is intentional. Use the right tool for the job.
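
In kernel terms those are two different primitives; a quick sketch of the distinction:

barrier();	/* compiler barrier only: emits no instructions, but the
		   compiler may not cache memory values across it */

smp_mb();	/* full barrier: constrains the compiler *and* emits a real
		   memory-barrier instruction on architectures that need
		   one (a DMB on arm64, for example) */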

> Not necessarily. On modern x86, load_acquire and store_release are usually simple MOV instructions.

If you'd only read a whole sentence further, you'd see I more or less spelled this out: it's actually the opposite, with all that implies. Simple MOV instructions have acquire/release semantics on x86, so you are incurring the cost of an acquire or release (and sometimes both) for every access. Yet that still isn't quite enough for full-on sequential consistency, so naive x86 programs still have surprising races.

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 3:00 UTC (Mon) by itsmycpu (subscriber, #139639) [Link]

> Yes, most machines would need some explicit machine instructions to prevent the machine from re-ordering memory access if that was what you wanted to achieve. The Linux barrier() is a compiler barrier, not a CPU barrier. So it does not forbid the CPU from re-ordering memory accesses as it wishes, that is intentional. Use the right tool for the job.

I'm afraid your answers are more confusing than clarifying.

If __asm__ __volatile__("":::"memory") results only in a CPU barrier even on "weaker" architectures, then in this case, on some architectures, that alone would not be sufficient. It is those architectures that my question is about.

It means that it requires something like the conditional branch instruction to complete the ordering.

So the question remains: why would that not also be enough to implement load_acquire, such that load_acquire has the same performance as barrier + cond. branch?

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 3:03 UTC (Mon) by itsmycpu (subscriber, #139639) [Link]

> If __asm__ __volatile__("":::"memory") results only in a CPU barrier even on "weaker" architectures,

Obviously, I meant: if it only results in a *compiler* barrier....

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 9:59 UTC (Mon) by tialaramex (subscriber, #21167) [Link]

> in this case, on some architectures, that alone would not be sufficient. It is those architectures that my question is about.

On which architectures do you believe that the CPU exposes memory writes that only happen after a conditional branch, before actually evaluating that branch? Because that's the scenario under consideration here. The compiler barrier is to prevent compilers from doing the same re-ordering, which can look fairly reasonable to them because they've got a different view of the situation.

Linux is not a hypothetical program, so, it isn't focused on executing correctly on hypothetical hardware, only real hardware.

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 11:56 UTC (Mon) by itsmycpu (subscriber, #139639) [Link]

> On which architectures do you believe that the CPU exposes memory writes that only happen after a conditional branch, before actually evaluating that branch?

On none, as a hardware optimization I don't expect that to happen.

The question I'm asking is if and why (or why not), on any architecture (perhaps arm64), this:

volatile_if(READ_ONCE(A)) {
    WRITE_ONCE(B,1);
}

might be faster than this:

if (load_acquire(A)) {
    WRITE_ONCE(B,1);
}

especially if the latter is well optimized by a compiler. I have the tendency to think that the second form using load_acquire can be just as fast, at least if the compiler does a good job. However perhaps there are architectures on which that isn't true, and that is what my question is about.

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 16:42 UTC (Mon) by farnz (subscriber, #17727) [Link]

Because, on the very weakest implementations of (e.g.) AArch64, load acquire is more expensive than load followed by a dependent branch, but the load followed by a dependent branch is a sufficient barrier for the purposes at hand.

Consider in terms of the following pseudo-code; thread 0 is writing to two memory locations, and before the two threads run on separate CPU cores, the memory at addr1 and addr2 is set to 0.


# Thread 0
MOV R0, #42
MOV R1, #99
STR [addr3], R1
STR [addr2], R1
STR_REL [addr1], R0

# Thread 1
MOV R5, 1
MOV R6, 1
LDR R4, [addr1]
CMP R4, #42
BEQ load
JMP end
.load
LDR R5, [addr2]
.end
LDR R6, [addr3]

The question we're trying to answer is "what values are in R5 and R6 of thread 1 at the label end?".

On an Alpha, there are 3 possible answers: if R4 is 42, then R5 and R6 must both be either 0 or 99, depending on whether or not the LDR R5 and LDR R6 observed the store to addr2 or not. If R4 is not 42, then R5 and R6 must be 1. Alpha is allowed to re-order the loads in thread 1 freely because they are independent of each other, and thus all of (R5 = 0, R6 = 0), (R5 = 0, R6 = 99), (R5 = 99, R6 = 0) and (R5 = 99, R6 = 99) are possible.

On x86, the answer is either R5 = R6 = 1 or R5 = R6 = 99; once the load of R4 has taken place, thread 1 is guaranteed to see the stores to both addr1 and addr2, and thus the loads to R5 and R6 must see the store to addr2. Thus, if R4 is 42, R5 is guaranteed to be 99, while if R4 is not 42, R5 is guaranteed to be 1. Same reasoning applies to R6 as to R5.

ARM's memory ordering is mostly weak, like Alpha's, but it adds in one more wrinkle; dependency chains act as ordering. Because of this, R5 is again either 1 or 99; however, this happens for slightly different reasons to x86. On ARM, the STR_REL guarantees that a load that observes the store to addr1 will also be able to observe the store to addr2; because the load of R5 has a dependency on the value of R4, the ARM memory ordering model guarantees that the load of R5 will take place after the load of R4, and hence will also observe the store to addr1. However, there's no such dependency in R6's control flow, and thus R6 is allowed to be either 0 or 99 if R4 is 42, because the CPU can reorder the load to happen before the conditional.

If the load to R4 was a load-acquire, then all three processors would observe the same memory contents. However, both ARM and Alpha would be denied reordering opportunities by the load-acquire that they have without it, and hence could be slower.

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 22:43 UTC (Mon) by itsmycpu (subscriber, #139639) [Link]

>Because, on the very weakest implementations of (e.g.) AArch64, load acquire is more expensive than load followed by a dependent branch, but the load followed by a dependent branch is a sufficient barrier for the purposes at hand.

This is the claim I am interested in.

> Consider in terms of the following pseudo-code; thread 0 is writing to two memory locations, and before the two threads run on separate CPU cores, the memory at adr1 and adr2 is set to 0.

Thanks for going into this detail.
I'm assuming that you also meant addr3 to be initialized with 0. It seems you changed the location of the label ".end" while you were writing your post. As it is, with the end label between the loads of R5 and R6, R6 couldn't be 1 (assuming addr3 is initialized with 0).

> On an Alpha, there are 3 possible answers: if R4 is 42, then R5 and R6 must both be either 0 or 99, depending on whether or not the LDR R5 and LDR R6 observed the store to addr2 or not. If R4 is not 42, then R5 and R6 must be 1. Alpha is allowed to re-order the loads in thread 1 freely because they are independent of each other, and thus all of (R5 = 0, R6 = 0), (R5 = 0, R6 = 99), (R5 = 99, R6 = 0) and (R5 = 99, R6 = 99) are possible.

R6 can be 1 only if the ".end" label is moved to the actual end. So here I am putting R6 aside, since then it will be like R5 (but see discussion of R6 for ARM.).

My understanding is that "volatile_if" is meant to prevent a re-ordering of the effective load of R5 to a point before the load of R4. In fact, that appears to be the whole point (I don't see any indication that it would be meant to affect only stores). If it does succeed in preventing that, then only (R5 = 99) is possible if R4 is 42, since 42 is written with a store_release.

So if Alpha behaves as you say, I don't see how the pseudo code would be sufficient to implement "volatile_if". And I wouldn't know what it takes to fix that, or how that would affect our discussion.

> On x86, the answer is either R5 = R6 = 1 or R5 = R6 = 99; once the load of R4 has taken place, thread 1 is guaranteed to see the stores to both addr1 and addr2, and thus the loads to R5 and R6 must see the store to addr2. Thus, if R4 is 42, R5 is guaranteed to be 99, while if R4 is not 42, R5 is guaranteed to be 1. Same reasoning applies to R6 as to R5.

Yes.
(R6 can be 1 only if the ".end" label is moved to the actual end.)

> ARM's memory ordering is mostly weak, like Alpha's, but it adds in one more wrinkle; dependency chains act as ordering. Because of this, R5 is again either 1 or 99; however, this happens for slightly different reasons to x86. On ARM, the STR_REL guarantees that a load that observes the store to addr1 will also be able to observe the store to addr2; because the load of R5 has a dependency on the value of R4, the ARM memory ordering model guarantees that the load of R5 will take place after the load of R4, and hence will also observe the store to addr1. However, there's no such dependency in R6's control flow, and thus R6 is allowed to be either 0 or 99 if R4 is 42, because the CPU can reorder the load to happen before the conditional.

Here it seems that you want the ".end" label to really be between the loads of R5 and R6, as you also don't claim that R6 could be 1. That means that you mean R6 to be outside of the volatile-if statement, and since also at the language level, volatile_if is meant to allow R6 to be 0 (regardless of R4's value) in that case, I can see that ARM would be able to take advantage of that and perform the load before or during the conditional. Whereas x86 would not.

In situations where an R6 does not exist, it still seems the compiler could optimize load_acquire to produce equivalent code also without a construct like volatile_if (in so far as that might be a distinct advantage), since R5 must be 99 either way, if R4 is 42.

Protecting control dependencies with volatile_if()

Posted Jun 22, 2021 3:38 UTC (Tue) by itsmycpu (subscriber, #139639) [Link]

A follow-up question for ARM would be:

What if the load of R6 appears in both the 'then' path and the 'else' path of the BEQ conditional branch? Instead of after the conditional paths join.

Then there would be two load instructions, each dependent on the conditional. Would that mean that loading R6 is now ordered, or is the CPU allowed to internally join the two paths and remove the dependency? Is ARM's behavior defined in this case? I think it needs to be, and the straightforward definition would be that R6 is then ordered. (And that the compiler could use this for optimizations.)

Protecting control dependencies with volatile_if()

Posted Jun 22, 2021 9:47 UTC (Tue) by farnz (subscriber, #17727) [Link]

First, thank you for reading my previous comment as intended, not as written - I did make a lot of mistakes trying to put together a case where ARM's ordering is weaker than store-acquire, but you caught my intentions.

Yes, if the load of R6 appears in both paths, then it's an ordered load, due to the data dependency affecting both cases, and thus ARM can't reorder in their memory model; in theory, the compiler can exploit this to get R6 ordered even though the load of R6 is unconditional, while still allowing future loads to be reordered before the load of R6.

The point of the ARM behaviour here is to give you a guarantee that code like the following Cish does what you intended without forcing full ordering between the two threads:


const int global42 = 42;
int global_int;
int *global_ptr;
void main() {
    global_ptr = &global42;
    full_memory_barrier();
    start_threads(thread0, thread1);
    wait_for_threads();
}

void thread0() {
    global_int = 123;
    atomic_store_explicit(&global_ptr, &global_int, memory_order_release);
}

void thread1() {
    int *i = global_ptr;
    int j = *i;
    if (i == &global_int) {
        assert(j == 123);
    } else {
        assert(j == 42);
    }
}

Because the store to global_ptr is a release store, any thread which observes that store also observes the store to global_int. thread1 can thus use the value of i to determine which value it loaded.

On Alpha's memory model (which is as weak as you can get), it is permissible for i to point at global_int but for the load of *i to read a stale value rather than 123. ARM disallows this particular case.

Another way of looking at it would be that you can fix thread1 to work correctly on any standards-compliant C compiler by making it:


void thread1() {
    int *i = atomic_load_explicit(&global_ptr, memory_order_consume);
    int j = *i;
    if (i == &global_int) {
        assert(j == 123);
    } else {
        assert(j == 42);
    }
}

And on Alpha, the atomic_load_explicit needs to put appropriate memory ordering instructions into the instruction stream for thread1; on ARM, those ordering instructions are implicit in ordinary loads.

Protecting control dependencies with volatile_if()

Posted Jun 22, 2021 21:03 UTC (Tue) by itsmycpu (subscriber, #139639) [Link]

> The point of the ARM behaviour here ....

Makes sense, and gives a good idea of ARM (and Alpha) in this regard.

Protecting control dependencies with volatile_if()

Posted Jun 19, 2021 6:02 UTC (Sat) by bartoc (guest, #124262) [Link]

wg14 (C) and wg21 (C++) are extremely unlikely to be open to anything remotely like this idea. I'm less sure of the feelings of the wg14 folks, but the wg21 folks are really not fans of stuff like volatile at all. It's extremely hard to specify, once specified in the standard's wording it's too weak to actually do what programmers want, and it's best left as a compiler attribute with documented platform-specific behavior. Is it possible that, where the annotation is appropriate, any situation where the compiler could reorder the read and write would indicate a bug in your code _anyway_? Meaning maybe the attribute should emit an error if the compiler decides it _can_ reorder an indicated write across the if.

Many uses of normal volatile variables are deprecated and will emit a warning in C++20 (I'm aware there's no C++ in Linux, but the formal memory model used in C was developed in concert with C++, when both had to specify the memory model more precisely in order to standardize threads and atomics). In particular v |= something where v is volatile (this seems to be to make the read-modify-write explicit; I'm not sure I'm a fan in the specific case of &= and |=, but basically agree in the case of += -= *= and /=).

Protecting control dependencies with volatile_if()

Posted Jun 19, 2021 14:53 UTC (Sat) by Paf (subscriber, #91811) [Link]

I don’t really understand the objection to &= and |= - is it a generalized objection to being able to write bitops like this? Can you say why?

Protecting control dependencies with volatile_if()

Posted Jun 19, 2021 18:47 UTC (Sat) by kkdwivedi (guest, #130744) [Link]

There is a recent paper (P2327) addressing compound operations (those used to flip bits in particular).

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p...

Protecting control dependencies with volatile_if()

Posted Jun 20, 2021 0:47 UTC (Sun) by Paf (subscriber, #91811) [Link]

Ah, thanks - I realize now I misread the original post as objecting to them. The decision of the standard to deprecate them seems ……. Let’s say “user hostile”.

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 3:33 UTC (Mon) by foom (subscriber, #14868) [Link]

The problem is that users expect volatile variables to be safe against interrupts/signals. And that is generally the case for read and write operations. So, by analogy, users expect `x |= 5;` on a volatile int to be the way to spell a read-modify-write which is also safe against interrupts. But, it has never guaranteed that -- neither in the spec nor in most implementations.
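
A sketch of why that expectation fails: the compound assignment is just an ordinary, interruptible load/modify/store sequence:

volatile int x;

x |= 5;
/* compiles to three separate steps:
       tmp = x;     -- load
       tmp |= 5;    -- modify
       x = tmp;     -- store
   an interrupt that modifies x between the load and the store will have
   its update silently overwritten */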

Protecting control dependencies with volatile_if()

Posted Jun 20, 2021 0:08 UTC (Sun) by tialaramex (subscriber, #21167) [Link]

The intended purpose of volatile is explicitly to cause actual memory reads/writes, because perhaps the "memory" isn't. It came into existence because the computers C was being written on had memory-mapped I/O, and early attempts at optimising C compilers were a huge step forward for performance but broke some of the device-handling code.

Today, volatile is rightly imagined as a property of actions, not variables. "volatile write variable -> $address" => "Literally write this to memory, even though it seems as though we're never reading that memory and so needn't bother, I promise that's actually what must happen". "volatile read $address -> variable" => "Literally read this from memory, even though it seems as though we already know the value, I promise that's actually what must be done". Whereas "volatile variable" is too vague.

On some hardware, there may be very fine memory mapping, such that you can actually read or write a smaller value than the natural machine word to special memory regions, maybe the normal word size is 32 bits, but for a small range of "memory" it actually matters whether you use the 16-bit move instruction or the 32-bit one because they behave differently. In some cases this gets finer than a single byte, maybe there's a flag in the bottom bit of address 0x0037 and you can actually read or write that single bit, from machine code and it has completely different behaviour from if you were to try to read or write the whole 0x0037 byte on that hardware.

So, because volatile is defined so openly in C, your compiler vendor can say OK, we make this compiler for this weird hardware, and we promise if you say that this is a _volatile_ byte pointer to 0x0037 and then you use |= 0x01 we will actually emit the machine code for the single bit write, even though that's not how it would be implemented for other occurrences of |= 0x01

But the language standard can't say anything about these platform specific details, so it just leaves that to the compiler vendor, and with the exception of specialist compiler vendors who care very much about this stuff, the compiler vendor says as little about it as possible, hoping you will go away.

Probably what _ought_ to happen if C was designed from scratch today is that these platforms provide a special intrinsic function and if you actually need to poke the bottom bit of address 0x0037 you call some intrinsic like __foocorp_bit_poke(0x0037, 0, 1) and how that's actually implemented is only a problem for the platform compiler developer. Anybody who needs the intrinsic (e.g. someone porting Linux to the platform) uses it, everybody else is blissfully unaware. But at the time they just added this "volatile" qualifier to the C language as a dumping ground for all features about hardware memory reads/ writes and called it a day.

Protecting control dependencies with volatile_if()

Posted Jun 20, 2021 2:44 UTC (Sun) by willy (subscriber, #9762) [Link]

It's not limited to foocorp though. A PCIe device (yes, even one designed today) can have vastly different behaviour for MMIO accesses that are single byte vs double-byte vs quad-byte. Most don't because that's painful for software to cope with, but you must define some form of generic MMIO accessor or the compiler cannot be used to write a device driver.

Protecting control dependencies with volatile_if()

Posted Jun 20, 2021 22:07 UTC (Sun) by tialaramex (subscriber, #21167) [Link]

Sure, but I was talking about bit-addressing.

Rust (sorry but I've spent lots of time staring at this recently so it's at the top of my head) provides

std::ptr::read_volatile(ptr) and std::ptr::write_volatile(ptr, foo)

If you provide read_volatile with say a pointer to i16 (a 16-bit signed integer), that'll compile and it'll emit an actual double-byte memory read, assuming that i16 pointer was pointing at the MMIO for your PCIe device you'll get what you wanted. If you do it with a pointer to u8 (an 8-bit unsigned integer) that'll do the single-byte read. And the same with the write equivalents.

These are stable but unsafe APIs. That is, Rust promises these are expected to still work tomorrow, but it doesn't vouch for the safety of just doing memory operations on raw addresses, so if this was a terrible idea you get to keep both halves when it breaks. They're defined in a completely generic way, so you could imagine calling read_volatile(pointer-to-4x4-matrix) but of course your CPU is not completely generic so it's between the CPU vendor, the compiler vendor, and the PCIe device maker whether that does anything useful.

However, for bit-addressing it's not at all obvious what the generic thing to do would be. So Rust just doesn't, and I argue that unlike your PCIe MMIO trick, bit addressed volatile memory access is in fact a niche feature in 2021.

Protecting control dependencies with volatile_if()

Posted Jun 19, 2021 22:52 UTC (Sat) by itsmycpu (subscriber, #139639) [Link]

My understanding of this situation is far from complete, as for example I do not sufficiently understand the implications of acquire/release on all architectures.

However it seems to me that this isn't really about if-statements, but about ordering two specific memory accesses without impacting other compiler or hardware optimizations.

So perhaps a construct like ORDER_ACCESS( A, B ) would better reflect the original intentions, and be applicable to a much larger group of use cases.

This would specifically instruct the compiler to generate code that completes previous memory accesses to A before initiating following memory accesses to B.

(And perhaps be best implemented as a compiler primitive or general language construct like barriers, and the code generation be informed by kernel engineers who have the practical experience across architectures and ideas regarding optimal code to be generated, such as using conditional branch instructions to create a hardware barrier.)

So:
tmp = READ_ONCE(A);
ORDER_ACCESS( A, B );
WRITE_ONCE(B, 1);

Protecting control dependencies with volatile_if()

Posted Jun 20, 2021 0:42 UTC (Sun) by xecycle (subscriber, #140261) [Link]

This sounds very similar to what the broken memory_order_consume is trying to do. Is that thing practically fixable?

Btw I feel a release order would be needed anyway on the write side, this form probably just exploits dependency order on the read side.

Protecting control dependencies with volatile_if()

Posted Jun 20, 2021 12:08 UTC (Sun) by Bigos (subscriber, #96807) [Link]

The Linux kernel uses dependent READ_ONCE() in order to implement something akin to memory_order_consume. The assumption is that the compilers will never reorder instructions such as these due to there being a data dependency:

int* ptr = READ_ONCE(slot);
int value = READ_ONCE(*ptr);

This article is about cases where there is no data dependency but there is a control dependency. AFAIK most weak-memory-model architectures respect such a dependency if there is a jump instruction in the middle, but the compiler must first not reorder the instructions itself and must actually generate a jump instruction (which is not always obvious as it could be lowered to a cmove-like instruction).
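
As an illustration of that cmove hazard (assumed code generation, not an example from the discussion): when both branches store to the same location, the compiler may compute the value with a conditional move and store it unconditionally, leaving no branch for the hardware to order against:

if (READ_ONCE(a))
    WRITE_ONCE(b, 1);
else
    WRITE_ONCE(b, 2);

/* may be lowered to the branchless form:
       tmp = READ_ONCE(a) ? 1 : 2;    -- a conditional move, not a branch
       WRITE_ONCE(b, tmp);
   and with the conditional branch gone, so is the control dependency */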

Regarding memory_order_consume (and [[carries_dependency]]), the problem is with ensuring that there is a dependency without involving the type system. If instead load(std::memory_order_consume) returned Dependent<T> and the compiler promised that loads using it are either data or control dependent on the original load then it would be a lot easier to define and implement. I think this is what the following is about:

http://wg21.link/P0750

Protecting control dependencies with volatile_if()

Posted Jun 21, 2021 5:40 UTC (Mon) by xecycle (subscriber, #140261) [Link]

Thanks, very informative; memory_order_consume was considered broken even before the day I first heard of it, so I didn't look into its details and never realized the distinction between data dependency and control dependency.

Protecting control dependencies with volatile_if()

Posted Jun 24, 2021 7:10 UTC (Thu) by tpo (subscriber, #25713) [Link]

I would like to suggest clarifying the context of the article's topic within the article itself.

My argument comes from the position, that LWN articles such as this one informally also serve as documentation and as reference material. As such it'd be splendid if the article could be as clear and unambiguous as possible.

My comment echoes felixfix's comments and confusion above.

The article is saying: "When there are multiple threads of execution running simultaneously, though, there can be surprises in store." Herein lies (IMO) the core of the whole article. But to me that single phrase is *way* understated. That assertion should be stated as strongly as possible so that the reader will not miss it. Otherwise - I posit - the reader can not really understand the article.

Also, having multiple threads is *not* by itself a sufficient condition for the troubles the article seeks to solve.

The point is that *iff* there are multiple actors *and* A and B below are somehow causally related, then:

LOAD A
STORE B

*must not be reordered*. By multiple actors I mean either something that can change state *or* something that has its own state that depends on either the value of A or B or the access thereof.

"Other actors" can be:

* an interrupt controller interrupting the execution of the code above and triggering execution of some different code
* any device
* another CPU core
* another CPU
* a different thread

If any of the above exist and treat (the access to A or B) or (the values of A and B) as somehow logically interdependent then reordering LOAD A and STORE B can lead to trouble.

Example: let's take a networking hardware component: if "LOAD A" would mean to the networking hardware "OK, the ethernet frame was read from the buffer which can now be used again" then it is crucial to the correct functioning of the network stack that the "STORE B" (as in write new data into the buffer) is correctly ordered.

Please note that I am not a kernel hacker and as such the above might not precisely reflect actual hardware. The point I'm trying to make is: a) the central problem should be stated much more clearly, and b) the problem is, as far as I'm able to tell, much more generic than only multithreaded situations.


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds