So I've come across Jeff Preshing's wonderful blog posts on what `Acquire`/`Release` are and how they can be achieved with CPU barriers. I've also read that `SeqCst` is about a single total order that's guaranteed to be consistent with the coherence-ordered-before relation, though for historical reasons it may at times contradict the happens-before relation established by plain `Acquire`/`Release` operations.
My question is: how do the old GCC built-ins map onto the memory model introduced by C++11 (and later revisions)? In particular, how does `__sync_synchronize()` map into C++11 or later modern C/C++?
In the GCC manual this call is simply described as a full memory barrier, which I suppose is the combination of all four major kinds of barriers, i.e. LoadLoad/LoadStore/StoreLoad/StoreStore, all at once. But is `__sync_synchronize` equivalent to `std::atomic_thread_fence(memory_order_seq_cst)`? Or, formally speaking, is one of them stronger than the other? (I suppose that's the case here: in general a `SeqCst` fence should be stronger, since it requires the toolchain/platform to come up with a global ordering somehow, no?) And does it just happen that most CPUs out there provide only instructions that satisfy both at once (the full memory barrier of `__sync_synchronize`, the total sequential ordering of `std::atomic_thread_fence(memory_order_seq_cst)`), for example x86 `mfence` and PowerPC `hwsync`?
Whether `__sync_synchronize` and `std::atomic_thread_fence(memory_order_seq_cst)` are formally equivalent or merely effectively equivalent (i.e. formally different, but with no commercial CPU bothering to distinguish the two), technically speaking a `memory_order_relaxed` load on the same atomic still may not be relied upon to synchronize-with / create a happens-before relation with it, no? That is, technically speaking, all of the assertions below are allowed to fail, right?
```c
// Experiment 1, using C11 `atomic_thread_fence`: assertion is allowed to fail, right?
// (assumes <stdatomic.h>, <stdbool.h>, <assert.h>)

// global
static atomic_bool lock = false;
static atomic_bool critical_section = false;

// thread 1
atomic_store_explicit(&critical_section, true, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
atomic_store_explicit(&lock, true, memory_order_relaxed);

// thread 2
if (atomic_load_explicit(&lock, memory_order_relaxed)) {
    // We should really `memory_order_acquire` the `lock`
    // or `atomic_thread_fence(memory_order_acquire)` here,
    // or this assertion may fail, no?
    assert(atomic_load_explicit(&critical_section, memory_order_relaxed));
}
```
```c
// Experiment 2, using `SeqCst` directly on the atomic store

// global
static atomic_bool lock = false;
static atomic_bool critical_section = false;

// thread 1
atomic_store_explicit(&critical_section, true, memory_order_relaxed);
atomic_store_explicit(&lock, true, memory_order_seq_cst);

// thread 2
if (atomic_load_explicit(&lock, memory_order_relaxed)) {
    // Again we should really `memory_order_acquire` the `lock`
    // or `atomic_thread_fence(memory_order_acquire)` here,
    // or this assertion may fail, no?
    assert(atomic_load_explicit(&critical_section, memory_order_relaxed));
}
```
```c
// Experiment 3, using GCC built-in: assertion is allowed to fail, right?

// global
static atomic_bool lock = false;
static atomic_bool critical_section = false;

// thread 1
atomic_store_explicit(&critical_section, true, memory_order_relaxed);
__sync_synchronize();
atomic_store_explicit(&lock, true, memory_order_relaxed);

// thread 2
if (atomic_load_explicit(&lock, memory_order_relaxed)) {
    // We should somehow put a `LoadLoad` memory barrier here,
    // or the assert might fail, no?
    assert(atomic_load_explicit(&critical_section, memory_order_relaxed));
}
```
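(For contrast, here is the reader I believe is actually guaranteed to work for Experiments 1 and 2: the acquire load pairs with the writer's `SeqCst` fence or store, establishing synchronizes-with and hence happens-before. For Experiment 3 it should also work in practice, since `__sync_synchronize` compiles to a full barrier.)

```c
// thread 2, fixed: the acquire load synchronizes-with the writer's
// SeqCst fence (Experiment 1) or SeqCst store (Experiment 2), so the
// earlier relaxed store to `critical_section` happens-before the assert.
if (atomic_load_explicit(&lock, memory_order_acquire)) {
    assert(atomic_load_explicit(&critical_section, memory_order_relaxed));
}
```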
I've tried these snippets on my RPi 5 but I don't see the assertions fail. Granted, this doesn't formally prove anything, but it also doesn't shed any light on the difference between `__sync_synchronize` and `std::atomic_thread_fence(memory_order_seq_cst)`.
- This post seems to contain two questions that are really independent. The second half just seems to be about the fact that in order to actually achieve any synchronization or deduce anything about ordering, you need appropriate barriers in both threads, which seems to me pretty obvious. LoadLoad reordering in Thread 2 could certainly make the assert fail, and there's no way that any actions whatsoever in Thread 1 could prevent Thread 2 from doing so. – Nate Eldredge Commented Mar 23 at 19:43
- Here's a test that should work on your RPi 5 (it does on mine): Example of LoadLoad reordering. The two keys are (1) have the variables in separate cache lines; (2) have a test that you can repeat quickly, without re-running the program or spawning new threads every time, and that isn't reliant on any particular timing synchronization between the threads. – Nate Eldredge Commented Mar 23 at 20:02
- On my machine the number of required iterations is usually in the thousands, which is instantaneous when you're running them at full speed, but might never be seen if you're just taking one shot per run of your program. Another trick that can help (though not needed here) is to manually evict a cache line if you want a load or store to be delayed after one that's later in program order. – Nate Eldredge Commented Mar 23 at 20:04
- Thx for the snippet. I was not aware of the cache line part, and I did spawn a thread each time, which made it quite slow. Thx for the tips on how to experiment with such concurrency problems. – Not A Name Commented Mar 24 at 13:22
1 Answer
Yes, `__sync_synchronize()` is at least in practice equivalent to `std::atomic_thread_fence(memory_order_seq_cst)`.
Formally, `__sync_synchronize()` operates in terms of memory barriers and blocking memory reordering, since it predates the existence of C++11's formal memory model. `atomic_thread_fence` operates in terms of C++11's memory model; compiling to a full-barrier instruction is an implementation detail.
So, for example, the standard doesn't require `thread_fence` to do anything in a program that contains no `std::atomic<>` objects, because its behaviour is defined only in terms of atomics. `__sync_synchronize()` (and, in practice, `thread_fence` as an implementation detail in GCC/clang), on the other hand, could let you hack something up in terms of synchronizing on plain `int` variables. That's UB in C++11, and a bad idea even in terms of a known implementation like GCC; see Who's afraid of a big bad optimizing compiler? re: the obvious vs. non-obvious badness (like invented loads) that can happen when you just use memory barriers instead of `std::atomic` with `relaxed` for shared variables to stop the compiler from keeping them in registers.
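For illustration, a minimal sketch of that kind of pre-C++11 hack on plain variables (this is a data race, hence undefined behaviour in C11/C++11, even if GCC has historically compiled it the way the author hoped):

```c
// UB in C11/C++11: `data` and `ready` are plain (non-atomic) variables,
// so the cross-thread accesses below are a data race. Pre-C++11 code
// leaned on __sync_synchronize() acting as both a compiler barrier and
// a hardware full barrier.
static int data;
static int ready;

// thread 1
data = 42;
__sync_synchronize();   // full barrier (StoreStore, among others)
ready = 1;

// thread 2
if (ready) {                // racy read; the compiler may invent/tear loads
    __sync_synchronize();   // full barrier (LoadLoad, among others)
    int x = data;           // *intended* to see 42; nothing guarantees it
    (void)x;
}
```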
But my point is: in practice they work the same, yet they come from different memory models. The `__sync` builtins are specified in terms of barriers against local reordering of accesses to cache-coherent shared memory (i.e. a CPU-architecture view), whereas C++11 `std::atomic` is specified in terms of its formalism of modification orders and synchronizes-with / happens-before. That formalism allows some things that aren't plausible on a real CPU using cache-coherent shared memory.
Yes, in your code blocks the assertion could fail on a CPU where LoadLoad reordering is possible. It's probably not possible with both variables in the same cache line. See C++ atomic variable memory order problem can not reproduce LoadStore reordering example for another case of trying to reproduce memory reordering.
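Following the tips in the comments (separate cache lines; many fast trials within long-lived threads), here is one possible shape for such a repro. It's a sketch under stated assumptions rather than a tuned test: the monotonic counters make a reordering directly observable as `c < l`, and the names and iteration counts are arbitrary.

```c
// Sketch of a LoadLoad-reordering detector. Assumptions: 64-byte cache
// lines, a weakly ordered 64-bit CPU (e.g. the RPi 5's Cortex-A76),
// compiled with something like `gcc -O2 -pthread`.
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

// Keep the two variables in different cache lines so they can become
// visible to the reader independently.
_Alignas(64) static atomic_long lock;
_Alignas(64) static atomic_long critical_section;
static atomic_bool stop;

// Writer: `critical_section` is always written before `lock` with the
// same value, with a full fence between them. So at every instant the
// visible value of `critical_section` is >= the visible value of `lock`.
static void *writer(void *arg) {
    (void)arg;
    for (long i = 1; !atomic_load_explicit(&stop, memory_order_relaxed); i++) {
        atomic_store_explicit(&critical_section, i, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);  // writer side fully ordered
        atomic_store_explicit(&lock, i, memory_order_relaxed);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    long reorders = 0;
    for (long i = 0; i < 100000000; i++) {  // many fast trials per run
        // Program order: load `lock` first, then `critical_section`.
        long l = atomic_load_explicit(&lock, memory_order_relaxed);
        long c = atomic_load_explicit(&critical_section, memory_order_relaxed);
        if (c < l)  // only possible if the two loads were reordered
            reorders++;
    }
    atomic_store_explicit(&stop, true, memory_order_relaxed);
    pthread_join(t, NULL);
    printf("observed %ld LoadLoad reorderings\n", reorders);
    return 0;
}
```

Note that with `relaxed` loads the compiler is also allowed to reorder the two reads; either way, a nonzero count demonstrates that the ordering your Experiment 3 reader hopes for is not guaranteed. On a strongly ordered ISA like x86 this should always print 0; on AArch64 a nonzero count is possible but not guaranteed on any particular run.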