So I've come across Jeff Preshing's wonderful blog posts on what `Acquire`/`Release` are and how they can be achieved with CPU barriers. I've also read that `SeqCst` is about a single total order that's guaranteed to be consistent with the coherence-ordered-before relation, though for historical reasons it may at times contradict the happens-before relation established by plain `Acquire`/`Release` operations.
My question is: how do the old GCC built-ins map onto the memory model introduced by C++11 (and later revisions)? In particular, how does `__sync_synchronize()` map into C++11 or later modern C/C++?
In the GCC manual this call is simply described as a full memory barrier, which I suppose is the combination of all four major kinds of barriers, i.e. LoadLoad/LoadStore/StoreLoad/StoreStore, all at once. But is `__sync_synchronize` equivalent to `std::atomic_thread_fence(memory_order_seq_cst)`? Or, formally speaking, is one of them stronger than the other? (I suppose that's the case here: in general a `SeqCst` fence should be stronger, since it requires the toolchain/platform to come up with a global ordering somehow, no?) And does it just happen that most CPUs out there provide only instructions that satisfy both at once (the full memory barrier of `__sync_synchronize`, the total sequential ordering of `std::atomic_thread_fence(memory_order_seq_cst)`), for example x86 `mfence` and PowerPC `hwsync`?
Whether `__sync_synchronize` and `std::atomic_thread_fence(memory_order_seq_cst)` are formally equivalent or merely effectively equivalent (i.e. formally different, but with no commercial CPU bothering to distinguish the two), technically speaking a `memory_order_relaxed` load on the same atomic still may not be relied upon to synchronize-with / create a happens-before relation with it, no? That is, technically speaking, all of the assertions below are allowed to fail, right?
```c
// Experiment 1, using C11 `atomic_thread_fence`: assertion is allowed to fail, right?
// (assumes <stdatomic.h>, <stdbool.h>, <assert.h>)

// global
static atomic_bool lock = false;
static atomic_bool critical_section = false;

// thread 1
atomic_store_explicit(&critical_section, true, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
atomic_store_explicit(&lock, true, memory_order_relaxed);

// thread 2
if (atomic_load_explicit(&lock, memory_order_relaxed)) {
    // We should really `memory_order_acquire` the `lock`
    // or `atomic_thread_fence(memory_order_acquire)` here,
    // or this assertion may fail, no?
    assert(atomic_load_explicit(&critical_section, memory_order_relaxed));
}
```
```c
// Experiment 2, using `SeqCst` directly on the atomic store

// global
static atomic_bool lock = false;
static atomic_bool critical_section = false;

// thread 1
atomic_store_explicit(&critical_section, true, memory_order_relaxed);
atomic_store_explicit(&lock, true, memory_order_seq_cst);

// thread 2
if (atomic_load_explicit(&lock, memory_order_relaxed)) {
    // Again we should really `memory_order_acquire` the `lock`
    // or `atomic_thread_fence(memory_order_acquire)` here,
    // or this assertion may fail, no?
    assert(atomic_load_explicit(&critical_section, memory_order_relaxed));
}
```
```c
// Experiment 3, using GCC built-in: assertion is allowed to fail, right?

// global
static atomic_bool lock = false;
static atomic_bool critical_section = false;

// thread 1
atomic_store_explicit(&critical_section, true, memory_order_relaxed);
__sync_synchronize();
atomic_store_explicit(&lock, true, memory_order_relaxed);

// thread 2
if (atomic_load_explicit(&lock, memory_order_relaxed)) {
    // We should somehow put a `LoadLoad` memory barrier here,
    // or the assert might fail, no?
    assert(atomic_load_explicit(&critical_section, memory_order_relaxed));
}
```
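(For contrast, here is the reader I believe is actually guaranteed to work for Experiments 1 and 2: the acquire load pairs with the writer's `SeqCst` fence or store, establishing synchronizes-with and hence happens-before. For Experiment 3 it should also work in practice, since `__sync_synchronize` compiles to a full barrier.)

```c
// thread 2, fixed: the acquire load synchronizes-with the writer's
// SeqCst fence (Experiment 1) or SeqCst store (Experiment 2), so the
// earlier relaxed store to `critical_section` happens-before the assert.
if (atomic_load_explicit(&lock, memory_order_acquire)) {
    assert(atomic_load_explicit(&critical_section, memory_order_relaxed));
}
```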
I've tried these snippets on my RPi 5 but I don't see the assertions fail. Granted, this doesn't formally prove anything, but it also doesn't shed any light on the difference between `__sync_synchronize` and `std::atomic_thread_fence(memory_order_seq_cst)`.
- This post seems to contain two questions that are really independent. The second half just seems to be about the fact that in order to actually achieve any synchronization or deduce anything about ordering, you need appropriate barriers in both threads, which seems to me pretty obvious. LoadLoad reordering in Thread 2 could certainly make the assert fail, and there's no way that any actions whatsoever in Thread 1 could prevent Thread 2 from doing so. – Nate Eldredge Commented Mar 23 at 19:43
- Here's a test that should work on your RPi 5 (it does on mine): Example of LoadLoad reordering. The two keys are (1) have the variables in separate cache lines; (2) have a test that you can repeat quickly, without re-running the program or spawning new threads every time, and that isn't reliant on any particular timing synchronization between the threads. – Nate Eldredge Commented Mar 23 at 20:02
- On my machine the number of required iterations is usually in the thousands, which is instantaneous when you're running them at full speed, but might never be seen if you're just taking one shot per run of your program. Another trick that can help (though not needed here) is to manually evict a cache line if you want a load or store to be delayed after one that's later in program order. – Nate Eldredge Commented Mar 23 at 20:04
- Thx for the snippet. I was not aware of the cache line part, and I did spawn a thread each time, which made it quite slow. Thx for the tips on how to experiment with such concurrency problems. – Not A Name Commented Mar 24 at 13:22
1 Answer
Yes, `__sync_synchronize()` is at least in practice equivalent to `std::atomic_thread_fence(memory_order_seq_cst)`.
Formally, `__sync_synchronize()` operates in terms of memory barriers and blocking memory reordering, since it predates the existence of C++11's formal memory model. `atomic_thread_fence` operates in terms of C++11's memory model; compiling to a full-barrier instruction is an implementation detail.
So, for example, the standard doesn't require `thread_fence` to do anything in a program that contains no `std::atomic<>` objects, because its behaviour is defined only in terms of atomics. `__sync_synchronize()` (and, in practice, `thread_fence` as an implementation detail in GCC/clang), on the other hand, could let you hack something up in terms of synchronizing on plain `int` variables. That's UB in C++11, and a bad idea even in terms of a known implementation like GCC; see Who's afraid of a big bad optimizing compiler? re: the obvious vs. non-obvious badness (like invented loads) that can happen when you just use memory barriers instead of `std::atomic` with `relaxed` for shared variables to stop the compiler from keeping them in registers.
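For illustration, a minimal sketch of that kind of pre-C++11 hack on plain variables (this is a data race, hence undefined behaviour in C11/C++11, even if GCC has historically compiled it the way the author hoped):

```c
// UB in C11/C++11: `data` and `ready` are plain (non-atomic) variables,
// so the cross-thread accesses below are a data race. Pre-C++11 code
// leaned on __sync_synchronize() acting as both a compiler barrier and
// a hardware full barrier.
static int data;
static int ready;

// thread 1
data = 42;
__sync_synchronize();   // full barrier (StoreStore, among others)
ready = 1;

// thread 2
if (ready) {                // racy read; the compiler may invent/tear loads
    __sync_synchronize();   // full barrier (LoadLoad, among others)
    int x = data;           // *intended* to see 42; nothing guarantees it
    (void)x;
}
```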
But my point is: in practice they work the same, yet they come from different memory models. The `__sync` builtins are specified in terms of barriers against local reordering of accesses to cache-coherent shared memory (i.e. a CPU-architecture view), whereas C++11 `std::atomic` is specified in terms of its formalism of modification orders and synchronizes-with / happens-before. That formalism allows some things that aren't plausible on a real CPU using cache-coherent shared memory.
Yes, in your code blocks the assertion could fail on a CPU where LoadLoad reordering is possible. It's probably not possible with both variables in the same cache line. See C++ atomic variable memory order problem can not reproduce LoadStore reordering example for another case of trying to reproduce memory reordering.
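Following the tips in the comments (separate cache lines; many fast trials within long-lived threads), here is one possible shape for such a repro. It's a sketch under stated assumptions rather than a tuned test: the monotonic counters make a reordering directly observable as `c < l`, and the names and iteration counts are arbitrary.

```c
// Sketch of a LoadLoad-reordering detector. Assumptions: 64-byte cache
// lines, a weakly ordered 64-bit CPU (e.g. the RPi 5's Cortex-A76),
// compiled with something like `gcc -O2 -pthread`.
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

// Keep the two variables in different cache lines so they can become
// visible to the reader independently.
_Alignas(64) static atomic_long lock;
_Alignas(64) static atomic_long critical_section;
static atomic_bool stop;

// Writer: `critical_section` is always written before `lock` with the
// same value, with a full fence between them. So at every instant the
// visible value of `critical_section` is >= the visible value of `lock`.
static void *writer(void *arg) {
    (void)arg;
    for (long i = 1; !atomic_load_explicit(&stop, memory_order_relaxed); i++) {
        atomic_store_explicit(&critical_section, i, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);  // writer side fully ordered
        atomic_store_explicit(&lock, i, memory_order_relaxed);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    long reorders = 0;
    for (long i = 0; i < 100000000; i++) {  // many fast trials per run
        // Program order: load `lock` first, then `critical_section`.
        long l = atomic_load_explicit(&lock, memory_order_relaxed);
        long c = atomic_load_explicit(&critical_section, memory_order_relaxed);
        if (c < l)  // only possible if the two loads were reordered
            reorders++;
    }
    atomic_store_explicit(&stop, true, memory_order_relaxed);
    pthread_join(t, NULL);
    printf("observed %ld LoadLoad reorderings\n", reorders);
    return 0;
}
```

Note that with `relaxed` loads the compiler is also allowed to reorder the two reads; either way, a nonzero count demonstrates that the ordering your Experiment 3 reader hopes for is not guaranteed. On a strongly ordered ISA like x86 this should always print 0; on AArch64 a nonzero count is possible but not guaranteed on any particular run.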