I'm struggling to understand the standardese regarding memory_order_seq_cst guarantees, and I'm not able to find an appropriate example.
The code below demonstrates my situation.
#include <atomic>
#include <cstddef>
#include <thread>

void race_test() {
    std::atomic<std::size_t> data = 0;
    std::atomic_bool flag = false;
    constexpr std::size_t REP_COUNT = 1000;

    // Producer: increment data, set the flag, notify the consumer.
    std::thread inc_thread{[&] {
        for (std::size_t i = 0; i != REP_COUNT; ++i) {
            data.fetch_add(1, std::memory_order_relaxed);
            flag.store(true, std::memory_order_release);
            flag.notify_one();
        }
    }};

    // Consumer: wait on the flag while data is zero, then decrement data.
    std::thread dec_thread{[&] {
        for (std::size_t i = 0; i != REP_COUNT; ++i) {
            while (data.load(std::memory_order_relaxed) == 0) {
                flag.wait(false, std::memory_order_acquire);
                flag.store(false, std::memory_order_relaxed);
                std::atomic_thread_fence(std::memory_order_seq_cst); // the fence in question
            }
            data.fetch_sub(1, std::memory_order_relaxed);
        }
    }};

    inc_thread.join();
    dec_thread.join();
}

int main()
{
    for (std::size_t i = 0; i != 1000; ++i)
        race_test();
}
Ideally, inc_thread increments data, then sets flag and notifies dec_thread. dec_thread waits for flag to be set while data is zero, then decrements data and clears flag. This should repeat REP_COUNT times for each thread, with data ending as 0.
In the "real world" a producer thread pushes to a thread-safe queue at its leisure then notifies the consumer thread to process all available data. It is intentional that inc_thread
can potentially "run ahead" of dec_thread
. The goal is just that the data is consumed in a timely manner after being produced. I've just made the data
an atomic integer here since it was sufficient for demonstration.
This is currently "working", but is it guaranteed to work?
Without the atomic_thread_fence
this will occasionally deadlock with dec_thread
waiting on flag
while data
is non-zero.
I imagine this happens because dec_thread's flag.store(false) happens between inc_thread's flag.store(true) and flag.notify_one(), effectively erasing the next notification.
It sounds like std::atomic_thread_fence(std::memory_order_seq_cst) prevents reordering of flag.store(false) and data.load(), and I can see how that would prevent the deadlock, but then why doesn't std::atomic_thread_fence(std::memory_order_acq_rel) work in its place?
Am I understanding the situation properly?
I'll mention that this also "works" without the fence if flag.store(false) uses std::memory_order_seq_cst, but that can't be "correct"?
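That is, the alternative inner loop I mean is roughly this (a sketch of the variant I tried; the fence is removed and only the memory order on the clear changes):

while (data.load(std::memory_order_relaxed) == 0) {
    flag.wait(false, std::memory_order_acquire);
    flag.store(false, std::memory_order_seq_cst); // seq_cst store instead of relaxed store + fence
}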
Does std::memory_order_seq_cst actually guarantee this not to deadlock, or am I relying on some x86-specific behavior?
What needs to be changed to make this actually guaranteed to work?
I have greatly increased REP_COUNT to be fairly sure about racing to a deadlock or not. The example with 1000 reps seemed to be enough to deadlock about half of the time without the fence.
1 Answer
It seems to me that the reason the fence helps in your tests is that it acts as a StoreLoad barrier, preventing the flag.store(false, std::memory_order_relaxed) from being reordered with the subsequent data.load(std::memory_order_relaxed) at the top of the next loop iteration.
Without a barrier, such reordering could indeed result in a deadlock. Suppose we are on one of the last few iterations of the inner loop, and data has the value 0. Thanks to reordering, data.load() happens first and observes the value 0. While flag.store(false) is still delayed, all remaining iterations of inc_thread run. (This is especially plausible if notify_one() is optimized to check in user space whether there are any waiters, and to avoid calling the kernel if not.) Now the flag.store(false) commits, and the flag.wait() begins. Since no further stores of true from inc_thread will occur, the wait never ends.
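Spelled out as a rough timeline (one possible interleaving, not the only one):

// Hypothetical interleaving, with dec_thread's flag.store(false)
// effectively reordered after the data.load() of the next loop iteration:
//
//   dec_thread                               inc_thread
//   ----------                               ----------
//   flag.wait(false) returns (flag == true)
//   data.load() == 0     // hoisted above the delayed store
//                                            runs all remaining iterations:
//                                            data.fetch_add(1) ...
//                                            flag.store(true) ...
//                                            flag.notify_one()  // no waiter yet
//   flag.store(false)    // commits late, overwriting the final true
//   flag.wait(false)     // blocks forever: no further notifications coming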
I think this is also the only way to have a deadlock, in practice. If data.load() returns 0, we know that at least one instance of data.fetch_add(1) remains to be executed by inc_thread, and it will be followed (thanks to the release store) by a flag.store(true). If flag.store(false) hasn't been reordered with data.load(), then that store of true will eventually be visible to dec_thread, which will cause the wait() to return.
So the question now becomes:
Is an atomic_thread_fence(memory_order_seq_cst) guaranteed to provide a StoreLoad barrier, even if there are no other seq_cst operations for it to order with?
In practice, the answer is likely Yes, because that's how real machines and compilers implement the fence, and they don't attempt to optimize them out. However, if we are following the letter of the ISO C++ standard, I am afraid the answer is No.
Let's look at a simpler litmus test.
std::atomic<int> x{0}, y{0};  // #0
int xres;

void thr1() {
    // these stores commit in program order
    x.store(1, std::memory_order_release);  // #1
    y.store(1, std::memory_order_release);  // #2
}

void thr2() {
    y.store(2, std::memory_order_relaxed);                // #3
    std::atomic_thread_fence(std::memory_order_seq_cst);  // #4: StoreLoad barrier??
    xres = x.load(std::memory_order_relaxed);             // #5
}
If StoreLoad reordering were possible, we could end up with y == 2 && xres == 0. And I think the C++ standard actually allows that outcome.
I follow C++23 N4950. Looking at [atomics.order], which defines the semantics of seq_cst, the total order S is trivial and contains only the fence #4. The only axiom involving fences alone is p4.4 (paraphrasing):
If A,B are atomic operations on an object M such that A is coherence-ordered before B; X,Y are two seq_cst fences; X happens before A; and B happens before Y; then X precedes Y in S.
As there is only one fence in the program (namely #4), we can only apply p4.4 with X=Y=#4. It cannot precede itself, so the premise of p4.4 must be false. That is, if #4 happens before A, and B happens before #4, then A is not coherence-ordered before B.
But this doesn't tell us anything useful. The only candidate for A is #5, namely x.load(). Coherence ordering is only relevant for atomic operations on the same object, and the only other atomic operation on x, as a candidate for B, is #1. But clearly we have no way to show that #1 happens-before #4, as there is no release-acquire pair that could provide synchronization. (If we stretch the rules and consider B = #0, which does of course happen before #4, we conclude only that #4 is not coherence-ordered before #0, which was already obvious.)
So we've gained nothing at all from the seq_cst semantics of the fence. It does still function as an acq_rel fence, but that doesn't help either. Looking at [atomics.fences], an acq_rel fence would only do anything if sequenced before a store, or after a load, neither of which is the case for #4.

Thus, as far as ISO C++ is concerned, the fence in this program is a complete no-op and the compiler would be within its rights to ignore it. And of course, if the fence isn't there, then StoreLoad reordering is certainly possible, and the program may certainly end with y == 2 && xres == 0.
I agree this result seems counter-intuitive, because we are used to thinking of a seq_cst fence as a "full barrier" and it's normally implemented as such. But the ISO C++ memory model is not based on the concept of reordering commits to a coherent cache, even if real machines are, so it isn't bound by those rules.
To fix it, it seems that we need a seq_cst operation or fence in the other thread. For instance, we could add a fence between #1 and #2 in my test program. Then it's not hard to check that, depending on which fence precedes the other in S, we must either end up with y == 1 or with xres == 1. This is somewhat ironic, though, because this new fence is not doing anything from the standpoint of reordering (#1 and #2 already could not be reordered, due to their release ordering).
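Concretely, the modified litmus test would look something like this (a sketch; the added fence, labeled #1a here, is the only change):

void thr1() {
    x.store(1, std::memory_order_release);                // #1
    std::atomic_thread_fence(std::memory_order_seq_cst);  // #1a: added seq_cst fence
    y.store(1, std::memory_order_release);                // #2
}

Now S contains both #1a and #4, and depending on which of the two comes first in S, at least one of y == 1 or xres == 1 must hold at the end.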
Likewise, in your program, I think we could ensure deadlock-free behavior by placing a seq_cst fence in inc_thread between data.fetch_add(1) and flag.store(true). This is kind of a pity, since on a real machine this fence will be a real barrier instruction with a nontrivial cost, and at the level of the machine architecture it shouldn't actually be needed.
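In terms of your original code, that suggestion would look roughly like this (a sketch; only the fence line is new):

std::thread inc_thread{[&] {
    for (std::size_t i = 0; i != REP_COUNT; ++i) {
        data.fetch_add(1, std::memory_order_relaxed);
        // added seq_cst fence: gives the seq_cst fence in dec_thread
        // another seq_cst fence to be ordered against in the total order S
        std::atomic_thread_fence(std::memory_order_seq_cst);
        flag.store(true, std::memory_order_release);
        flag.notify_one();
    }
}};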
Comments:

seq_cst store is implemented with a full barrier, like a plain (release on x86) store plus thread_fence(seq_cst); normally with one xchg instruction, since only RMWs can be locked operations. You're correct that's a case of x86 having to be stronger than what the standard requires in order to be sufficient; AArch64 is much closer to just strong enough, where an SC store can reorder with later non-SC loads. – Peter Cordes, Mar 30 at 0:13

[…] inc thread from running far ahead of the dec thread (so some of the notify_one calls find no waiter to wake up). In that case the dec thread will see non-zero data every time and not enter the while(data == 0){} body. The final iteration of the REP_COUNT loop will decrement data down to zero. Is that intended, that you only wait when the dec thread is getting ahead of the inc thread? Which will tend to happen, since it makes no system calls when it's "behind", vs. the inc thread always making a notify (futex) syscall. – Peter Cordes, Mar 30 at 0:19

[…] flag.store(false) reordering with the data.load() in the while() condition: I haven't fully grokked why that matters. inc_thread runs "freely" (never waiting or spinning), so stores by dec_thread can appear between any of its operations. Your REP_COUNT of 1000 seems very low; I'd worry that the first thread to start could get a good fraction of the way through that before the second thread did anything, so there might not be many real iterations of testing. Especially with CPU frequency needing to ramp up for one core. – Peter Cordes, Mar 30 at 0:24

flag.store(false) can reorder with data.load(); then let's imagine that data.load() is done (returning 0), then there's a delay and inc runs entirely to completion (leaving flag == true), and then flag.store(false) occurs. Then we would indeed have a deadlock. So I agree that there has to be a StoreLoad barrier between them. – Nate Eldredge, Mar 30 at 0:44

Re "why doesn't std::atomic_thread_fence(std::memory_order_acq_rel) work in its place?": an acq_rel fence is never a StoreLoad barrier; i.e. it never prevents an earlier store from being reordered with a later load. – Nate Eldredge, Mar 30 at 5:58