Protecting control dependencies with volatile_if()
Memory ordering has been described as "the rocket science of CS". Understanding memory ordering is increasingly necessary to write scalable code, so kernel developers often find themselves having to become rocket scientists. The subtleties associated with control dependencies turn out to be an especially tricky sort of rocket. A recent discussion about how to force control dependencies to be observed shows the sorts of difficulties that arise in this area.
Control dependencies
The C programming language was designed in the era of simple, uniprocessor computers. When a developer wrote a series of C statements, they could expect those statements to be executed in the order written. Decades later, though, the situation has become much more complicated; code can be extensively optimized by both compilers and CPUs, to the point that it bears little resemblance to what was originally written. Code can be reordered and even eliminated if the compiler (or the processor) thinks that the end result will be the same. The effects of this reordering on single-threaded code are (in the absence of bugs) limited to making it run faster. When there are multiple threads of execution running simultaneously, though, there can be surprises in store. One thread may observe things happening in a different order than others, leading to all sorts of unfortunate confusion.

When the visible order of operations across processors is important, developers will often use barriers to ensure that operations are not reordered in damaging ways. There are, however, cases where developers can count on things happening in the right order because there is no alternative; these are described in terms of "dependencies". There are three broad classes of dependencies, described in this article from our recent lockless patterns series. Consider, for example, a simple data dependency:
    int x = READ_ONCE(a);
    WRITE_ONCE(b, x + 1);
The write to b simply cannot be reordered ahead of the read of a because neither the compiler nor the CPU knows what value should be written. The write has a data dependency on the preceding read; that dependency will prevent those two operations from being reordered. That, of course, assumes that the compiler does not conclude that it already knows what the value of a will be, perhaps from a previous read; that is why READ_ONCE() is used. The second article in the lockless patterns series describes READ_ONCE() and WRITE_ONCE() in detail.
Control dependencies are a bit more complex. Consider code like this:
    if (READ_ONCE(a))
        WRITE_ONCE(b, 1);
There is no data dependency linking the read of a and the write to b, but that write can only occur if a has a non-zero value; the read of a must thus occur before the write. This ordering forced by a conditional branch is a control dependency. More generally, there are three things that must be present to establish a control dependency:
- A read from one location (a in the case above)
- A conditional branch that depends on the value that was read
- A write to another location in one or more branches
When those conditions exist, there is a control dependency from the read to the write that prevents the two operations from being reordered with respect to each other.
The evil optimizing compiler
Or, at least, it would be nice if things worked that way. The problem is that, while the hardware works that way, the C language does not recognize the existence of control dependencies or, as the infamous kernel memory-barriers.txt document puts it: "Compilers do not understand control dependencies. It is therefore your job to ensure that they do not break your code." While there does not appear to be much of a history of code being broken through overly aggressive optimization of code with control dependencies, it is something that developers worry about. That has led to the proposal by Peter Zijlstra of a mechanism called volatile_if().
What sort of problem is this patch trying to address? Consider an example posted by Paul McKenney in the discussion:
    if (READ_ONCE(A)) {
        WRITE_ONCE(B, 1);
        do_something();
    } else {
        WRITE_ONCE(B, 1);
        do_something_else();
    }
This code has a control dependency between the read of A and the writes to B; each write is in a branch of the conditional statement and the fact that they write the same value does not affect the dependency. So one might conclude that the two operations could not be reordered. Compilers, though, might well rearrange the code to look like this instead:
    tmp = READ_ONCE(A);
    WRITE_ONCE(B, 1);
    if (tmp)
        do_something();
    else
        do_something_else();
This code looks equivalent, but the test on the value read from A no longer occurs before the write to B. That breaks the control dependency, freeing a sufficiently aggressive CPU to move the write ahead of the read, possibly creating a subtle and unpleasant bug.
Since C doesn't recognize control dependencies, avoiding this kind of bug can be difficult, even in cases where the developer is aware of the problem. One sure solution is to read A with acquire semantics and write B with release semantics, as described in the lockless patterns series, but acquire and release operations can be expensive on some architectures. That expense is not usually needed in this case.
volatile_if()
Zijlstra wrote in his proposal that a good solution would be to add a qualifier to the if statement to indicate that a dependency exists:
    volatile if (READ_ONCE(A)) {
        /* ... */
    }
The compiler would respond by ensuring that a conditional branch is emitted and that code from within the branches is not lifted out of those branches. That, however, requires cooperation from compiler writers; as Segher Boessenkool noted, that is unlikely to happen unless the standards committee gives its blessing to the idea of putting qualifiers like volatile on statements. Failing that, Zijlstra proposed a magic macro:
    volatile_if(condition) {
        /* true case */
    } else {
        /* false case */
    }
He provided implementations for a number of architectures; these generally depend on hand-written assembly code to manually emit the conditional branch instruction needed to create the control dependency at the CPU level.
The resulting discussion focused on two main topics: the implementation of volatile_if() and whether it is needed at all. On the implementation side, Torvalds suggested a simpler approach:
    #define barrier_true() ({ barrier(); 1; })

    #define volatile_if(x) if ((x) && barrier_true())
The barrier() macro causes no code to be emitted; it is just an empty assembly sequence, presented to the compiler with a "memory" clobber. That keeps the compiler from reordering memory operations from one side of the barrier to the other; it also, Torvalds said, would force the compiler to emit the branch, since the barrier could only be evaluated on the "true" side of the branch. Life turned out not to be so simple, though; a redefinition of barrier() along the lines suggested by Jakub Jelinek would be required to make this scheme actually work.
But Torvalds also wondered why developers were worried about this problem in the first place, since he does not think it can manifest in real code:
Again, semantics do matter, and I don't see how the compiler could actually break the fundamental issue of "load->conditional->store is a fundamental ordering even without memory barriers because of basic causality", because you can't just arbitrarily generate speculative stores that would be visible to others.
And, indeed, evidence of such problems actually occurring is hard to find. He did eventually come around to seeing that a problem could potentially exist but also made it clear that he doesn't think there is any code in the kernel now that would be affected by it.
The conversation (eventually) wound down without coming to any real
conclusion on whether volatile_if() is needed or not. Experience
says, though, that wariness toward compiler optimizations is usually a good
idea. Even if no mechanism for explicitly marking control dependencies is
merged into the mainline now, it will be waiting in the wings should future
compiler releases create problems.
Protecting control dependencies with volatile_if()
Posted Jun 18, 2021 19:41 UTC (Fri) by ndesaulniers (subscriber, #110768) [Link]
> in the general barrier case, we very much want to have that "memory"
> clobber, because the whole point of the general barrier case is that
> we want to make sure that the compiler doesn't cache memory state
> across it (ie the traditional use was basically what we now use
> "cpu_relax()" for, and you would use it for busy-looping on some
> condition).
> In the case of "volatile_if()", we actually would like to have not a
> memory clobber, but a "memory read". IOW, it would be a barrier for
> any writes taking place, but reads can move around it.
> I don't know of any way to express that to the compiler.
> It would be much nicer to have a "memory read" marker instead, to let
> the compiler know "I need to have done all pending writes to memory,
> but I can still cache read values over this op because it doesn't
> _change_ memory".
> It's actually not all that uncommon that you have a "this
> doesn't modify memory, but you can't move writes around it".
> Anybody have ideas or suggestions for something like that?
https://lwn.net/ml/linux-kernel/CAHk-=wjwXs5+SOZGTaZ0bP9n...
https://lwn.net/ml/linux-kernel/CAHk-=wiOTcBuxx1pYj=mQz-f...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100953
Protecting control dependencies with volatile_if()
Posted May 14, 2022 12:26 UTC (Sat) by liusy58 (guest, #158480) [Link]
Help educate an old 8-bit assembler mindset
Posted Jun 19, 2021 2:48 UTC (Sat) by felixfix (subscriber, #242) [Link]
This assumes A and B are not aliases, such as being pointers to C; or IO addresses; and I imagine there are other special cases. But with just ordinary memory, I don't see how the re-ordering can cause any problems.
Help educate an old 8-bit assembler mindset
Posted Jun 19, 2021 4:53 UTC (Sat) by NYKevin (subscriber, #129325) [Link]
Regardless, it's not aliasing. If you play games with aliasing, one of two things happens:
1. You do not violate the strict aliasing rule (e.g. because the types are the same, one of them is char, etc.). The compiler is required to assume that your pointers may alias. It must emit additional loads and stores as may be necessary; READ_ONCE() etc. are not required.
2. You do violate the strict aliasing rule. Your code is meaningless garbage (UB happens), and no amount of READ_ONCE() etc. will fix it.
Help educate an old 8-bit assembler mindset
Posted Jun 19, 2021 14:07 UTC (Sat) by felixfix (subscriber, #242) [Link]
Help educate an old 8-bit assembler mindset
Posted Jun 19, 2021 14:43 UTC (Sat) by corbet (editor, #1) [Link]
It's all a matter of ordering as seen by others. The write to B may be a "I'm dealing with it" flag; some other process may have changed A and wants to be sure that its change will be observed before B is written.
Help educate an old 8-bit assembler mindset
Posted Jun 19, 2021 16:01 UTC (Sat) by felixfix (subscriber, #242) [Link]
"That breaks the control dependency, freeing a sufficiently aggressive CPU to move the write ahead of the read, possibly creating a subtle and unpleasant bug."
So again, other than special cases, including aliased variables, I/O, threaded code, and other unenumerated special cases, what kind of subtle and unpleasant bugs can this create? Or does that sentence only refer to special cases?
Help educate an old 8-bit assembler mindset
Posted Jun 19, 2021 17:47 UTC (Sat) by corbet (editor, #1) [Link]
This is an article about kernel programming, where everything is threaded. Or, as the article says, "The effects of this reordering on single-threaded code are (in the absence of bugs) limited to making it run faster. When there are multiple threads of execution running simultaneously, though, there can be surprises in store. One thread may observe things happening in a different order than others, leading to all sorts of unfortunate confusion."
Help educate an old 8-bit assembler mindset
Posted Jun 20, 2021 7:29 UTC (Sun) by pbonzini (subscriber, #60935) [Link]
Control dependencies also do not matter in most lockless cases, much less control dependencies where the compiler does have meaningful optimizations that would break. Of course, sometimes they do.
Help educate an old 8-bit assembler mindset
Posted Jun 20, 2021 13:58 UTC (Sun) by felixfix (subscriber, #242) [Link]
Thanks to all concerned for my education.
Protecting control dependencies with volatile_if()
Posted Jun 19, 2021 4:28 UTC (Sat) by alison (subscriber, #63752) [Link]
> #define volatile_if(x) if ((x) && barrier_true())
>The barrier() macro causes no code to be emitted; it is just an empty block presented to the compiler as assembly code.
Hmm, wouldn't the code need __asm__ attribute to be presented as assembly code?
Protecting control dependencies with volatile_if()
Posted Jun 19, 2021 23:37 UTC (Sat) by tialaramex (subscriber, #21167) [Link]
#define barrier() __asm__ __volatile__("":::"memory")
What that says is: Here are some assembly instructions you must execute at this point in my code. The instructions somehow don't need any of my variables in CPU registers, nor do the CPU registers need to be copied into any of my variables. However, these instructions touch memory somehow. Also, that list of instructions is empty.
This is a very common trick, the compiler is used to seeing this, and won't go "Huh? But your list of instructions is empty". Since it claims to touch memory in an unspecified way, no optimisations may move reads or writes past this line, as that would alter the meaning of the program. This is of course exactly what everybody who writes this wanted to achieve and the compiler authors know that.
Since the list of instructions is empty you don't even need to know the assembly language of the target CPU.
Protecting control dependencies with volatile_if()
Posted Jun 20, 2021 20:11 UTC (Sun) by itsmycpu (subscriber, #139639) [Link]
Protecting control dependencies with volatile_if()
Posted Jun 20, 2021 20:31 UTC (Sun) by itsmycpu (subscriber, #139639) [Link]
How is a barrier like
> #define barrer() __asm__ __volatile__("":::"memory")
more effective than the barrier implied by load_acquire (on any architecture) ?
(Not a rhetorical question.)
Protecting control dependencies with volatile_if()
Posted Jun 20, 2021 22:41 UTC (Sun) by tialaramex (subscriber, #21167) [Link]
In contrast acquire/ release have consequences for the machine as well as the compiler. So the compiler will emit different machine code to ensure that the sequence of operations has acquire/ release semantics, and then at runtime, the actual CPU may be doing something different to ensure the semantics you asked for are achieved.
On x86-64 the instructions are only slightly different, and only in some cases, you actually pay for implicit acquire/ release semantics all the time on x86 without knowing it; on the original DEC Alpha it's a horrible nightmare of locking and CPU barriers; and on some ARM CPUs you can literally say "This load has acquire semantics" and "This store has release semantics" and everything works nicely.
So barrier() definitely CAN be faster because it doesn't emit any instructions at all. However of course just because you wrote fewer instructions doesn't make you faster. To find that out reliably in each case you'd have to benchmark. But first you'd have to be sure you care enough to go to that effort.
Protecting control dependencies with volatile_if()
Posted Jun 21, 2021 0:24 UTC (Mon) by itsmycpu (subscriber, #139639) [Link]
Are you sure? I'm not so sure that architectures with a so-called "weaker" memory model don't require some kind of ordering instruction to avoid hardware re-ordering across instructions. In this case, it sounds like the conditional branch instruction might have (welcome) effects on hardware ordering. It is even called a "dummy" conditional branch. Specifically, there appears to be a certain equivalence on arm64 between
LDR X0, [X1]
CBNZ X0, 1f // Dummy ctrl
and ("with LTO")
LDAPR X0, [X1] // RCpc
> In contrast acquire/ release have consequences for the machine as well as the compiler.
Not necessarily. On modern x86, load_acquire and store_release are usually simple MOV instructions.
My question is about the specifics of architectures where this isn't so simple.
The ordering required here isn't really very different from acquire and/or release, and barrier() is apparently a very general tool. So it isn't clear what advantages it is meant to have on specific architectures, especially as it would seem that otherwise barrier() could be used to implement even load_acquire.
Or perhaps at the end of the email discussion, volatile_if is meant more as a convenience than as a possible performance optimization.
Protecting control dependencies with volatile_if()
Posted Jun 21, 2021 2:16 UTC (Mon) by tialaramex (subscriber, #21167) [Link]
Yes of course I'm sure. Why else would I write that I cannot emphasise this enough?
> I'm not so sure that architectures with a so-called "weaker" memory model don't require some kind of ordering instruction to avoid hardware re-ordering across instructions
Yes, most machines would need some explicit machine instructions to prevent the machine from re-ordering memory access if that was what you wanted to achieve. The Linux barrier() is a compiler barrier, not a CPU barrier. So it does not forbid the CPU from re-ordering memory accesses as it wishes, that is intentional. Use the right tool for the job.
> Not necessarily. On modern x86, load_acquire and store_release are usually simple MOV instructions.
If you'd only read a whole sentence further I more or less spelled this out, it's actually the opposite, with all that implies, simple MOV instructions have acquire/release semantics on x86. You are incurring the cost of an acquire or release and sometimes both for every access. Yet, that still isn't quite enough for full-on sequential consistency. So naive x86 programs still have surprising races.
Protecting control dependencies with volatile_if()
Posted Jun 21, 2021 3:00 UTC (Mon) by itsmycpu (subscriber, #139639) [Link]
I'm afraid your answers are more confusing than clarifying.
If __asm__ __volatile__("":::"memory") results only in a CPU barrier even on "weaker" architectures, then in this case, on some architectures, that alone would not be sufficient. It is those architectures that my question is about.
It means that it requires something like the conditional branch instruction to complete the ordering.
So the question remains: why would that not also be enough to implement load_acquire, such that load_acquire has the same performance as barrier + cond.branch ?
Protecting control dependencies with volatile_if()
Posted Jun 21, 2021 3:03 UTC (Mon) by itsmycpu (subscriber, #139639) [Link]
Obviously, I meant: if it only results in a *compiler* barrier....
Protecting control dependencies with volatile_if()
Posted Jun 21, 2021 9:59 UTC (Mon) by tialaramex (subscriber, #21167) [Link]
On which architectures do you believe that the CPU exposes memory writes that only happen after a conditional branch, before actually evaluating that branch? Because that's the scenario under consideration here. The compiler barrier is to prevent compilers from doing the same re-ordering, which can look fairly reasonable to them because they've got a different view of the situation.
Linux is not a hypothetical program, so, it isn't focused on executing correctly on hypothetical hardware, only real hardware.
Protecting control dependencies with volatile_if()
Posted Jun 21, 2021 11:56 UTC (Mon) by itsmycpu (subscriber, #139639) [Link]
On none, as a hardware optimization I don't expect that to happen.
The question I'm asking is if and why (or why not), on any architecture (perhaps arm64), this:
volatile_if(READ_ONCE(A)) {
WRITE_ONCE(B,1);
}
might be faster than this:
if (load_acquire(A)) {
WRITE_ONCE(B,1);
}
especially if the latter is well optimized by a compiler. I have the tendency to think that the second form using load_acquire can be just as fast, at least if the compiler does a good job. However perhaps there are architectures on which that isn't true, and that is what my question is about.
Protecting control dependencies with volatile_if()
Posted Jun 21, 2021 16:42 UTC (Mon) by farnz (subscriber, #17727) [Link]
Because, on the very weakest implementations of (e.g.) AArch64, load acquire is more expensive than load followed by a dependent branch, but the load followed by a dependent branch is a sufficient barrier for the purposes at hand.
Consider in terms of the following pseudo-code; thread 0 is writing to two memory locations, and before the two threads run on separate CPU cores, the memory at addr1 and addr2 is set to 0.
    # Thread 0
    MOV R0, #42
    MOV R1, #99
    STR [addr3], R1
    STR [addr2], R1
    STR_REL [addr1], R0

    # Thread 1
    MOV R5, 1
    MOV R6, 1
    LDR R4, [addr1]
    CMP R4, #42
    BEQ load
    JMP end
    .load
    LDR R5, [addr2]
    .end
    LDR R6, [addr3]
The question we're trying to answer is "what values are in R5 and R6 of thread 1 at the label end?".
On an Alpha, there are 3 possible answers: if R4 is 42, then R5 and R6 must both be either 0 or 99, depending on whether or not the LDR R5 and LDR R6 observed the store to addr2 or not. If R4 is not 42, then R5 and R6 must be 1. Alpha is allowed to re-order the loads in thread 1 freely because they are independent of each other, and thus all of (R5 = 0, R6 = 0), (R5 = 0, R6 = 99), (R5 = 99, R6 = 0) and (R5 = 99, R6 = 99) are possible.
On x86, the answer is either R5 = R6 = 1 or R5 = R6 = 99; once the load of R4 has taken place, thread 1 is guaranteed to see the stores to both addr1 and addr2, and thus the loads to R5 and R6 must see the store to addr2. Thus, if R4 is 42, R5 is guaranteed to be 99, while if R4 is not 42, R5 is guaranteed to be 1. Same reasoning applies to R6 as to R5.
ARM's memory ordering is mostly weak, like Alpha's, but it adds in one more wrinkle; dependency chains act as ordering. Because of this, R5 is again either 1 or 99; however, this happens for slightly different reasons to x86. On ARM, the STR_REL guarantees that a load that observes the store to addr1 will also be able to observe the store to addr2; because the load of R5 has a dependency on the value of R4, the ARM memory ordering model guarantees that the load of R5 will take place after the load of R4, and hence will also observe the store to addr1. However, there's no such dependency in R6's control flow, and thus R6 is allowed to be either 0 or 99 if R4 is 42, because the CPU can reorder the load to happen before the conditional.
If the load to R4 was a load-acquire, then all three processors would observe the same memory contents. However, both ARM and Alpha would be denied reordering opportunities by the load-acquire that they have without it, and hence could be slower.
Protecting control dependencies with volatile_if()
Posted Jun 21, 2021 22:43 UTC (Mon) by itsmycpu (subscriber, #139639) [Link]
This is the claim I am interested in.
> Consider in terms of the following pseudo-code; thread 0 is writing to two memory locations, and before the two threads run on separate CPU cores, the memory at adr1 and adr2 is set to 0.
Thanks for going into this detail.
I'm assuming that you also meant addr3 to be initialized with 0. It seems you changed the location of the label ".end" while you were writing your post. As it is, with the end label between the loads of R5 and R6, R6 couldn't be 1 (assuming addr3 is initialized with 0).
> On an Alpha, there are 3 possible answers: if R4 is 42, then R5 and R6 must both be either 0 or 99, depending on whether or not the LDR R5 and LDR R6 observed the store to addr2 or not. If R4 is not 42, then R5 and R6 must be 1. Alpha is allowed to re-order the loads in thread 1 freely because they are independent of each other, and thus all of (R5 = 0, R6 = 0), (R5 = 0, R6 = 99), (R5 = 99, R6 = 0) and (R5 = 99, R6 = 99) are possible.
R6 can be 1 only if the ".end" label is moved to the actual end. So here I am putting R6 aside, since then it will be like R5 (but see discussion of R6 for ARM.).
My understanding is that "volatile_if" is meant to prevent a re-ordering of the effective load of R5 to a point before the load of R4. In fact, that appears to be the whole point (I don't see any indication that it would be meant to affect only stores). If it does succeed in preventing that, then only (R5 = 99) is possible if R4 is 42, since 42 is written with a store_release.
So if Alpha behaves as you say, I don't see how the pseudo code would be sufficient to implement "volatile_if". And I wouldn't know what it takes to fix that, or how that would affect our discussion.
> On x86, the answer is either R5 = R6 = 1 or R5 = R6 = 99; once the load of R4 has taken place, thread 1 is guaranteed to see the stores to both addr1 and addr2, and thus the loads to R5 and R6 must see the store to addr2. Thus, if R4 is 42, R5 is guaranteed to be 99, while if R4 is not 42, R5 is guaranteed to be 1. Same reasoning applies to R6 as to R5.
Yes.
(R6 can be 1 only if the ".end" label is moved to the actual end.)
> ARM's memory ordering is mostly weak, like Alpha's, but it adds in one more wrinkle; dependency chains act as ordering. Because of this, R5 is again either 1 or 99; however, this happens for slightly different reasons to x86. On ARM, the STR_REL guarantees that a load that observes the store to addr1 will also be able to observe the store to addr2; because the load of R5 has a dependency on the value of R4, the ARM memory ordering model guarantees that the load of R5 will take place after the load of R4, and hence will also observe the store to addr1. However, there's no such dependency in R6's control flow, and thus R6 is allowed to be either 0 or 99 if R4 is 42, because the CPU can reorder the load to happen before the conditional.
Here it seems that you want the ".end" label to really be between the loads of R5 and R6, as you also don't claim that R6 could be 1. That means that you mean R6 to be outside of the volatile-if statement, and since also at the language level, volatile_if is meant to allow R6 to be 0 (regardless of R4's value) in that case, I can see that ARM would be able to take advantage of that and perform the load before or during the conditional. Whereas x86 would not.
In situations where an R6 does not exist, it still seems the compiler could optimize load_acquire to produce equivalent code also without a construct like volatile_if (in so far as that might be a distinct advantage), since R5 must be 99 either way, if R4 is 42.
Protecting control dependencies with volatile_if()
Posted Jun 22, 2021 3:38 UTC (Tue) by itsmycpu (subscriber, #139639) [Link]
What if the load of R6 appears in both the 'then' path and the 'else' path of the BEQ conditional branch? Instead of after the conditional paths join.
Then there would be two load instructions, each dependent on the conditional. Would that mean that loading R6 is now ordered, or is the CPU allowed to internally join the two paths and remove the dependency? Is ARM's behavior defined in this case? I think it needs to be, and the straightforward definition would be that R6 is then ordered. (And that the compiler could use this for optimizations.)
Protecting control dependencies with volatile_if()
Posted Jun 22, 2021 9:47 UTC (Tue) by farnz (subscriber, #17727) [Link]
First, thank you for reading my previous comment as intended, not as written - I did make a lot of mistakes trying to put together a case where ARM's ordering is weaker than store-acquire, but you caught my intentions.
Yes, if the load of R6 appears in both paths, then it's an ordered load, due to the data dependency affecting both cases, and thus ARM can't reorder in their memory model; in theory, the compiler can exploit this to get R6 ordered even though the load of R6 is unconditional, while still allowing future loads to be reordered before the load of R6.
The point of the ARM behaviour here is to give you a guarantee that code like the following Cish does what you intended without forcing full ordering between the two threads:
    const int global42 = 42;
    int global_int;
    int *global_ptr;

    void main()
    {
        global_ptr = &global42;
        full_memory_barrier();
        start_threads(thread0, thread1);
        wait_for_threads();
    }

    void thread0()
    {
        global_int = 123;
        atomic_store_explicit(global_ptr, &global_int, memory_order_release);
    }

    void thread1()
    {
        int *i = global_ptr;
        int j = *i;
        if (i == &global_int) {
            assert(j == 123);
        } else {
            assert(j == 42);
        }
    }
Because the store to global_ptr is a release store, any thread which observes that store also observes the store to global_int. thread1 can thus use the value of i to determine which value it loaded.
On Alpha's memory model (which is as weak as you can get), it is permissible for i to point at global42, but for the load to read 123, or vice-versa. ARM disallows this particular case.
Another way of looking at it would be that you can fix thread1 to work correctly on any standards-compliant C compiler by making it:
    void thread1()
    {
        int *i = atomic_load_explicit(global_ptr, memory_order_consume);
        int j = *i;
        if (i == &global_int) {
            assert(j == 123);
        } else {
            assert(j == 42);
        }
    }
And on Alpha, the atomic_load_explicit needs to put appropriate memory ordering instructions into the instruction stream for thread1; on ARM, those ordering instructions are implicit in ordinary loads.
Protecting control dependencies with volatile_if()
Posted Jun 22, 2021 21:03 UTC (Tue) by itsmycpu (subscriber, #139639) [Link]
Makes sense, and gives a good idea of ARM (and Alpha) in this regard.
Protecting control dependencies with volatile_if()
Posted Jun 19, 2021 6:02 UTC (Sat) by bartoc (guest, #124262) [Link]
Many uses of normal volatile variables are deprecated and will emit a warning in C++20 (I'm aware there's no C++ in Linux, but the formal memory model used in C was developed in concert with C++, when both had to specify the memory model more precisely in order to standardize threads and atomics). In particular, v |= something, where v is volatile, is deprecated (this seems to be to make the read-modify-write explicit; I'm not sure I'm a fan in the specific case of &= and |=, but basically agree in the case of +=, -=, *=, and /=).
Protecting control dependencies with volatile_if()
Posted Jun 19, 2021 14:53 UTC (Sat) by Paf (subscriber, #91811) [Link]
Protecting control dependencies with volatile_if()
Posted Jun 19, 2021 18:47 UTC (Sat) by kkdwivedi (guest, #130744) [Link]
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p...
Protecting control dependencies with volatile_if()
Posted Jun 20, 2021 0:47 UTC (Sun) by Paf (subscriber, #91811) [Link]
Protecting control dependencies with volatile_if()
Posted Jun 21, 2021 3:33 UTC (Mon) by foom (subscriber, #14868) [Link]
Protecting control dependencies with volatile_if()
Posted Jun 20, 2021 0:08 UTC (Sun) by tialaramex (subscriber, #21167) [Link]
Today, volatile is rightly imagined as a property of actions, not variables. "volatile write variable -> $address " => "Literally write this to memory, even though it seems as though we're never reading that memory and so needn't bother, I promise that's actually what must happen". "volatile read $address -> variable" => Literally read this from memory, even though it seems as though we already know the value, I promise that's actually what must be done". Whereas "volatile variable" is too vague.
On some hardware, there may be very fine memory mapping, such that you can actually read or write a smaller value than the natural machine word to special memory regions, maybe the normal word size is 32 bits, but for a small range of "memory" it actually matters whether you use the 16-bit move instruction or the 32-bit one because they behave differently. In some cases this gets finer than a single byte, maybe there's a flag in the bottom bit of address 0x0037 and you can actually read or write that single bit, from machine code and it has completely different behaviour from if you were to try to read or write the whole 0x0037 byte on that hardware.
So, because volatile is defined so openly in C, your compiler vendor can say: OK, we make this compiler for this weird hardware, and we promise that if you declare a _volatile_ byte pointer to 0x0037 and then use |= 0x01 on it, we will actually emit the machine code for the single-bit write, even though that's not how |= 0x01 would be implemented anywhere else.
But the language standard can't say anything about these platform-specific details, so it just leaves them to the compiler vendor; and with the exception of specialist compiler vendors who care very much about this stuff, the compiler vendor says as little about it as possible, hoping you will go away.
Probably what _ought_ to happen, if C were designed from scratch today, is that these platforms would provide a special intrinsic function; if you actually need to poke the bottom bit of address 0x0037, you call some intrinsic like __foocorp_bit_poke(0x0037, 0, 1), and how that's actually implemented is only a problem for the platform compiler developer. Anybody who needs the intrinsic (e.g. someone porting Linux to the platform) uses it; everybody else is blissfully unaware. But at the time, they just added this "volatile" qualifier to the C language as a dumping ground for all features about hardware memory reads/writes and called it a day.
Protecting control dependencies with volatile_if()
Posted Jun 20, 2021 2:44 UTC (Sun) by willy (subscriber, #9762) [Link]
Protecting control dependencies with volatile_if()
Posted Jun 20, 2021 22:07 UTC (Sun) by tialaramex (subscriber, #21167) [Link]
Rust (sorry but I've spent lots of time staring at this recently so it's at the top of my head) provides
std::ptr::read_volatile(ptr) and std::ptr::write_volatile(ptr, foo)
If you provide read_volatile with, say, a pointer to i16 (a 16-bit signed integer), that'll compile and emit an actual two-byte memory read; assuming that i16 pointer was pointing at the MMIO region for your PCIe device, you'll get what you wanted. If you do it with a pointer to u8 (an 8-bit unsigned integer), that'll do a single-byte read. And the same goes for the write equivalents.
These are stable but unsafe APIs. That is, Rust promises these are expected to still work tomorrow, but it doesn't vouch for the safety of just doing memory operations on raw addresses, so if this was a terrible idea you get to keep both halves when it breaks. They're defined in a completely generic way, so you could imagine calling read_volatile(pointer-to-4x4-matrix) but of course your CPU is not completely generic so it's between the CPU vendor, the compiler vendor, and the PCIe device maker whether that does anything useful.
However, for bit-addressing it's not at all obvious what the generic thing to do would be. So Rust just doesn't, and I argue that unlike your PCIe MMIO trick, bit addressed volatile memory access is in fact a niche feature in 2021.
Protecting control dependencies with volatile_if()
Posted Jun 19, 2021 22:52 UTC (Sat) by itsmycpu (subscriber, #139639) [Link]
However, it seems to me that this isn't really about if-statements, but about ordering two specific memory accesses without impacting other compiler or hardware optimizations.
So perhaps a construct like ORDER_ACCESS( A, B ) would better reflect the original intentions, and be applicable to a much larger group of use cases.
This would specifically instruct the compiler to generate code that completes previous memory accesses to A before initiating following memory accesses to B.
(And perhaps be best implemented as a compiler primitive or general language construct like barriers, and the code generation be informed by kernel engineers who have the practical experience across architectures and ideas regarding optimal code to be generated, such as using conditional branch instructions to create a hardware barrier.)
So:
tmp = READ_ONCE(A);
ORDER_ACCESS( A, B );
WRITE_ONCE(B, 1);
Protecting control dependencies with volatile_if()
Posted Jun 20, 2021 0:42 UTC (Sun) by xecycle (subscriber, #140261) [Link]
Btw I feel a release order would be needed anyway on the write side, this form probably just exploits dependency order on the read side.
Protecting control dependencies with volatile_if()
Posted Jun 20, 2021 12:08 UTC (Sun) by Bigos (subscriber, #96807) [Link]
int* ptr = READ_ONCE(slot);
int value = READ_ONCE(ptr);
This article is about cases where there is no data dependency but there is a control dependency. AFAIK most weak-memory-model architectures respect such a dependency if there is a jump instruction in the middle, but the compiler must first not reorder the instructions itself, and must actually generate a jump instruction (which is not always a given, as the branch could be lowered to a cmov-like conditional-move instruction).
Regarding memory_order_consume (and [[carries_dependency]]), the problem is with ensuring that there is a dependency without involving the type system. If instead load(std::memory_order_consume) returned Dependent<T> and the compiler promised that loads using it are either data or control dependent on the original load then it would be a lot easier to define and implement. I think this is what the following is about:
Protecting control dependencies with volatile_if()
Posted Jun 21, 2021 5:40 UTC (Mon) by xecycle (subscriber, #140261) [Link]
Protecting control dependencies with volatile_if()
Posted Jun 24, 2021 7:10 UTC (Thu) by tpo (subscriber, #25713) [Link]
My argument comes from the position, that LWN articles such as this one informally also serve as documentation and as reference material. As such it'd be splendid if the article could be as clear and unambiguous as possible.
My comment echoes felixfix's comments and confusion above.
The article says: "When there are multiple threads of execution running simultaneously, though, there can be surprises in store." Herein lies (IMO) the core of the whole article. But to me that single phrase is *way* understated. That assertion should be stated as strongly as possible so that the reader does not miss it. Otherwise, I posit, the reader cannot really understand the article.
Also having multiple threads is *not* a sufficient condition leading to the troubles that are being sought to be solved in the article.
The point is that *iff* there are multiple actors *and* A and B below are somehow causally related, then:
LOAD A
STORE B
*must not be reordered*. By multiple actors I mean either something that can change state *or* has its own state that depends on either the value of A or B or the access thereof.
"Other actors" can be:
* an interrupt controller interrupting the execution of the code above and triggering execution of some different code
* any device
* another CPU core
* another CPU
* a different thread
If any of the above exist and treat (the access to A or B) or (the values of A and B) as somehow logically interdependent then reordering LOAD A and STORE B can lead to trouble.
Example: let's take a networking hardware component. If "LOAD A" means to the networking hardware "OK, the ethernet frame was read from the buffer, which can now be used again", then it is crucial to the correct functioning of the network stack that "STORE B" (as in: write new data into the buffer) is ordered after it.
Please note that I am not a kernel hacker, and as such the above might not precisely reflect actual hardware. The point I'm trying to make is: a) the central problem should be stated much more clearly, and b) the problem is, as far as I'm able to tell, much more generic than only multithreaded situations.