I'm struggling to understand the standardese regarding memory_order_seq_cst guarantees, and I'm not able to find an appropriate example.
The code below demonstrates my situation.
#include <atomic>
#include <cstddef>
#include <thread>

void race_test() {
    std::atomic<std::size_t> data = 0;
    std::atomic_bool flag = false;
    constexpr std::size_t REP_COUNT = 1000;

    // Producer: increment data, set the flag, notify the consumer.
    std::thread inc_thread{[&] {
        for (std::size_t i = 0; i != REP_COUNT; ++i) {
            data.fetch_add(1, std::memory_order_relaxed);
            flag.store(true, std::memory_order_release);
            flag.notify_one();
        }
    }};

    // Consumer: wait on the flag while data is zero, then decrement data.
    std::thread dec_thread{[&] {
        for (std::size_t i = 0; i != REP_COUNT; ++i) {
            while (data.load(std::memory_order_relaxed) == 0) {
                flag.wait(false, std::memory_order_acquire);
                flag.store(false, std::memory_order_relaxed);
                std::atomic_thread_fence(std::memory_order_seq_cst); // the fence in question
            }
            data.fetch_sub(1, std::memory_order_relaxed);
        }
    }};

    inc_thread.join();
    dec_thread.join();
}

int main()
{
    for (std::size_t i = 0; i != 1000; ++i)
        race_test();
}
Ideally, inc_thread increments data, then sets flag and notifies dec_thread. dec_thread waits for flag to be set while data is zero, then decrements data and clears flag. This should repeat REP_COUNT times for each thread, with data ending as 0.
In the "real world" a producer thread pushes to a thread-safe queue at its leisure then notifies the consumer thread to process all available data. It is intentional that inc_thread
can potentially "run ahead" of dec_thread
. The goal is just that the data is consumed in a timely manner after being produced. I've just made the data
an atomic integer here since it was sufficient for demonstration.
This is currently "working", but is it guaranteed to work?
Without the atomic_thread_fence
this will occasionally deadlock with dec_thread
waiting on flag
while data
is non-zero.
I imagine this happens because dec_thread's flag.store(false) happens between inc_thread's flag.store(true) and flag.notify_one(), effectively erasing the next notification.
It sounds like std::atomic_thread_fence(std::memory_order_seq_cst) prevents reordering of flag.store(false) and data.load(), and I can see how that would prevent the deadlock, but then why doesn't std::atomic_thread_fence(std::memory_order_acq_rel) work in its place?
Am I understanding the situation properly?
I'll mention that this also "works" without the fence if flag.store(false) uses std::memory_order_seq_cst, but that can't be "correct"?
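That is, the alternative inner loop I mean is roughly this (a sketch of the variant I tried; the fence is removed and only the memory order on the clear changes):

while (data.load(std::memory_order_relaxed) == 0) {
    flag.wait(false, std::memory_order_acquire);
    flag.store(false, std::memory_order_seq_cst); // seq_cst store instead of relaxed store + fence
}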
Does std::memory_order_seq_cst actually guarantee this not to deadlock, or am I relying on some x86-specific behavior?
What needs to be changed to make this actually guaranteed to work?
I have greatly increased REP_COUNT to be fairly sure about racing to a deadlock or not. The example with 1000 reps seemed to be enough to deadlock about half of the time without the fence.
1 Answer
It seems to me that the reason the fence helps in your tests is that it acts as a StoreLoad barrier, preventing the flag.store(false, std::memory_order_relaxed) from being reordered with the subsequent data.load(std::memory_order_relaxed) at the top of the next loop iteration.
Without a barrier, such reordering could indeed result in a deadlock. Suppose we are on one of the last few iterations of the inner loop, and data has the value 0. Thanks to reordering, data.load() happens first and observes the value 0. While flag.store(false) is still delayed, all remaining iterations of inc_thread run. (This is especially plausible if notify_one() is optimized to check in user space whether there are any waiters, and to avoid calling the kernel if not.) Now the flag.store(false) commits, and the flag.wait() begins. Since no further stores of true from inc_thread will occur, the wait never ends.
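Spelled out as a rough timeline (one possible interleaving, not the only one):

// Hypothetical interleaving, with dec_thread's flag.store(false)
// effectively reordered after the data.load() of the next loop iteration:
//
//   dec_thread                               inc_thread
//   ----------                               ----------
//   flag.wait(false) returns (flag == true)
//   data.load() == 0     // hoisted above the delayed store
//                                            runs all remaining iterations:
//                                            data.fetch_add(1) ...
//                                            flag.store(true) ...
//                                            flag.notify_one()  // no waiter yet
//   flag.store(false)    // commits late, overwriting the final true
//   flag.wait(false)     // blocks forever: no further notifications coming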
I think this is also the only way to have a deadlock, in practice. If data.load() returns 0, we know that at least one instance of data.fetch_add(1) remains to be executed by inc_thread, and it will be followed (thanks to the release store) by a flag.store(true). If flag.store(false) hasn't been reordered with data.load(), then that store of true will eventually be visible to dec_thread, which will cause the wait() to return.
So the question now becomes:
Is an atomic_thread_fence(memory_order_seq_cst) guaranteed to provide a StoreLoad barrier, even if there are no other seq_cst operations for it to order with?
In practice, the answer is likely Yes, because that's how real machines and compilers implement the fence, and they don't attempt to optimize them out. However, if we are following the letter of the ISO C++ standard, I am afraid the answer is No.
Let's look at a simpler litmus test.
std::atomic<int> x{0}, y{0};  // #0
int xres;

void thr1() {
    // these stores commit in program order
    x.store(1, std::memory_order_release);  // #1
    y.store(1, std::memory_order_release);  // #2
}

void thr2() {
    y.store(2, std::memory_order_relaxed);                // #3
    std::atomic_thread_fence(std::memory_order_seq_cst);  // #4: StoreLoad barrier??
    xres = x.load(std::memory_order_relaxed);             // #5
}
If StoreLoad reordering were possible, we could end up with y == 2 && xres == 0. And I think the C++ standard actually allows that outcome.
I follow C++23 N4950. Looking at [atomics.order], which defines the semantics of seq_cst, the total order S is trivial and contains only the fence #4. The only axiom involving fences alone is p4.4 (paraphrasing):
If A,B are atomic operations on an object M such that A is coherence-ordered before B; X,Y are two seq_cst fences; X happens before A; and B happens before Y; then X precedes Y in S.
As there is only one fence in the program (namely #4), we can only apply p4.4 with X=Y=#4. It cannot precede itself, so the premise of p4.4 must be false. That is, if #4 happens before A, and B happens before #4, then A is not coherence-ordered before B.
But this doesn't tell us anything useful. The only candidate for A is #5, namely x.load(). Coherence ordering is only relevant for atomic operations on the same object, and the only other atomic operation on x, as a candidate for B, is #1. But clearly we have no way to show that #1 happens-before #4, as there is no release-acquire pair that could provide synchronization. (If we stretch the rules and consider B = #0, which does of course happen before #4, we conclude only that #4 is not coherence-ordered before #0, which was already obvious.)
So we've gained nothing at all from the seq_cst semantics of the fence. It does still function as an acq_rel fence, but that doesn't help either. Looking at [atomics.fences], an acq_rel fence would only do anything if sequenced before a store, or after a load, neither of which is the case for #4.

Thus, as far as ISO C++ is concerned, the fence in this program is a complete no-op and the compiler would be within its rights to ignore it. And of course, if the fence isn't there, then StoreLoad reordering is certainly possible, and the program may certainly end with y == 2 && xres == 0.
I agree this result seems counter-intuitive, because we are used to thinking of a seq_cst fence as a "full barrier" and it's normally implemented as such. But the ISO C++ memory model is not based on the concept of reordering commits to a coherent cache, even if real machines are, so it isn't bound by those rules.
To fix it, it seems that we need a seq_cst operation or fence in the other thread. For instance, we could add a fence between #1 and #2 in my test program. Then it's not hard to check that, depending on which fence precedes the other in S, we must either end up with y == 1 or with xres == 1. This is somewhat ironic, though, because this new fence is not doing anything from the standpoint of reordering (#1 and #2 already could not be reordered, due to their release ordering).
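Concretely, the modified litmus test would look something like this (a sketch; the added fence, labeled #1a here, is the only change):

void thr1() {
    x.store(1, std::memory_order_release);                // #1
    std::atomic_thread_fence(std::memory_order_seq_cst);  // #1a: added seq_cst fence
    y.store(1, std::memory_order_release);                // #2
}

Now S contains both #1a and #4, and depending on which of the two comes first in S, at least one of y == 1 or xres == 1 must hold at the end.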
Likewise, in your program, I think we could ensure deadlock-free behavior by placing a seq_cst fence in inc_thread between data.fetch_add(1) and flag.store(true). This is kind of a pity, since on a real machine this fence will be a real barrier instruction with a nontrivial cost, and at the level of the machine architecture it shouldn't actually be needed.
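In terms of your original code, that suggestion would look roughly like this (a sketch; only the fence line is new):

std::thread inc_thread{[&] {
    for (std::size_t i = 0; i != REP_COUNT; ++i) {
        data.fetch_add(1, std::memory_order_relaxed);
        // added seq_cst fence: gives the seq_cst fence in dec_thread
        // another seq_cst fence to be ordered against in the total order S
        std::atomic_thread_fence(std::memory_order_seq_cst);
        flag.store(true, std::memory_order_release);
        flag.notify_one();
    }
}};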
Comments:

seq_cst store is implemented with a full barrier, like a plain (release on x86) store plus thread_fence(seq_cst); normally with one xchg instruction, since only RMWs can be locked operations. You're correct that's a case of x86 having to be stronger than what the standard requires in order to be sufficient; AArch64 is much closer to just strong enough, where an SC store can reorder with later non-SC loads. – Peter Cordes, Mar 30 at 0:13

[…] inc thread from running far ahead of the dec thread (so some of the notify_one calls find no waiter to wake up). In that case the dec thread will see non-zero data every time and not enter the while(data == 0){} body. The final iteration of the REP_COUNT loop will decrement data down to zero. Is that intended, that you only wait when the dec thread is getting ahead of the inc thread? Which will tend to happen, since it makes no system calls when it's "behind", vs. the inc thread always making a notify (futex) syscall. – Peter Cordes, Mar 30 at 0:19

[…] flag.store(false) reordering with the data.load() in the while() condition: I haven't fully grokked why that matters. inc_thread runs "freely" (never waiting or spinning), so stores by dec_thread can appear between any of its operations. Your REP_COUNT of 1000 seems very low; I'd worry that the first thread to start could get a good fraction of the way through that before the second thread did anything, so there might not be many real iterations of testing. Especially with CPU frequency needing to ramp up for one core. – Peter Cordes, Mar 30 at 0:24

flag.store(false) can reorder with data.load(); then let's imagine that data.load() is done (returning 0), then there's a delay and inc runs entirely to completion (leaving flag == true), and then flag.store(false) occurs. Then we would indeed have a deadlock. So I agree that there has to be a StoreLoad barrier between them. – Nate Eldredge, Mar 30 at 0:44

Re "why doesn't std::atomic_thread_fence(std::memory_order_acq_rel) work in its place?": an acq_rel fence is never a StoreLoad barrier; i.e. it never prevents an earlier store from being reordered with a later load. – Nate Eldredge, Mar 30 at 5:58