I wanted a reliable benchmark that produces a lot of cache misses, and the obvious go-to seemed to be a pointer-chasing benchmark. I ended up trying google/multichase, but the results don't seem to be what I expected.
Multichase is essentially your classic pointer-chasing benchmark, except the authors have gone to great lengths to make sure the hardware prefetcher fails. It reports the average latency of dereferencing your pointer and has the option to interleave "work", which is basically just an empty loop. The relevant code snippet is as follows:
static void chase_work(per_thread_t *t) {
  void *p = t->x.cycle[0];
  size_t extra_work = strtoul(t->x.extra_args, 0, 0);
  size_t work = 0;
  size_t i;

  // the extra work is intended to be overlapped with a dereference,
  // but we don't want it to skip past the next dereference. so
  // we fold in the value of the pointer, and launch the deref then
  // go into a loop performing extra work, hopefully while the
  // deref occurs.
  do {
    x25(work += (uintptr_t)p; p = *(void **)p;
        // (I uncomment the following line for the second benchmark)
        // (it was not a part of the original source code)
        //work += (uintptr_t)p;
        for (i = 0; i < extra_work; ++i) { work ^= i; })
  } while (__sync_add_and_fetch(&t->x.count, 25));

  // we never actually reach here, but the compiler doesn't know that
  t->x.cycle[0] = p;
  t->x.dummy = work;
}
(x25 is a macro which repeats its argument 25 times, so x25(y) is just y y y ... y, 25 times.)
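For reference, a macro like that can be written with nested repetition. This is just a minimal sketch of the idea, not necessarily how multichase's headers actually define it:

// repeat the argument 5 times, then repeat that repetition 5 times,
// giving 25 copies of the argument in total
#define x5(y)  y y y y y
#define x25(y) x5(x5(y))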
extra_work is passed in as a parameter, and in an ideal scenario the processor should execute the "work" loop out of order while the pointer value is being fetched. If that happens, increasing extra_work should at first cause almost no increase in runtime, because the loop runs in the shadow of the cache miss; only beyond some point, when the cache misses are no longer the bottleneck, should the runtime grow linearly with extra_work.
Unfortunately, there was no such flat region at the start; the runtime was linear throughout. This made me suspect that the loop was not actually running while the processor waited on the pointer dereference.
To confirm this, I added a work += (uintptr_t)p; line after the dereference but before the loop, as shown in the code above. This forces the dereference and the loop to be ordered: the loop cannot proceed until the dereference completes, and the code already had the property that the next dereference cannot proceed until the loop finishes. So, if the loop had previously been executing out of order with the dereference, one would expect a sharp rise in runtime with this change. However, that is not observed; the rise is barely a few ns. I have attached the graph below, where extra_work is varied from 1 to 500.
My processor is an 11th Gen Intel(R) Core(TM) i5-11400H and I'm on Ubuntu 22.04. It seems obvious to me that the loop should execute out of order with the dereference, but that is clearly not happening, hence my questions:
(1) Why isn't it happening? (2) How can I fix this so the benchmark works as expected?
(To reproduce: clone the multichase GitHub repo, run make, and run ./multichase -c work:N.)
1 Answer
The issue was that my compiler (gcc 11.4.0) was reordering the loops and pointer dereferences: it dereferences all 25 pointers in order, stores their results in various registers and on the stack, and then runs all the work loops together, using the stored values as and when required. So indeed the loop was not running out of order with the pointer dereference.
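To illustrate, here is roughly the shape the compiler turned the loop body into. This is a hypothetical hand-written C equivalent shortened to 4 chases instead of 25 (the function name and signature are made up for illustration), not the actual generated code:

#include <stddef.h>
#include <stdint.h>

// Sketch of the transformation: the dependent pointer loads are issued
// back to back and their results kept in registers/on the stack, and only
// afterwards are the "work" loops run on the saved values.
static void *chase4_reordered(void *p, size_t extra_work, size_t *work_acc) {
  size_t work = *work_acc, i;

  // the dependent load chain, all issued first
  void *p0 = p;
  void *p1 = *(void **)p0;
  void *p2 = *(void **)p1;
  void *p3 = *(void **)p2;
  p = *(void **)p3;

  // the "extra work", executed afterwards on the saved pointers
  work += (uintptr_t)p0;
  for (i = 0; i < extra_work; ++i) work ^= i;
  work += (uintptr_t)p1;
  for (i = 0; i < extra_work; ++i) work ^= i;
  work += (uintptr_t)p2;
  for (i = 0; i < extra_work; ++i) work ^= i;
  work += (uintptr_t)p3;
  for (i = 0; i < extra_work; ++i) work ^= i;

  *work_acc = work;
  return p;
}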
@PeterCordes suggested asm("" : "+r"(work) :: "memory") and asm("" : "+r"(work), "+r"(p)); in the comments, and both work well. I used the latter since it avoids clobbering the full memory.
static void chase_work(per_thread_t *t) {
  void *p = t->x.cycle[0];
  size_t extra_work = strtoul(t->x.extra_args, 0, 0);
  size_t work = 0;
  size_t i;

  // the extra work is intended to be overlapped with a dereference,
  // but we don't want it to skip past the next dereference. so
  // we fold in the value of the pointer, and launch the deref then
  // go into a loop performing extra work, hopefully while the
  // deref occurs.
  do {
    x25(work += (uintptr_t)p; p = *(void **)p;
        for (i = 0; i < extra_work; ++i) { work ^= i; }
        asm("" : "+r"(work), "+r"(p)); )
  } while (__sync_add_and_fetch(&t->x.count, 25));

  // we never actually reach here, but the compiler doesn't know that
  t->x.cycle[0] = p;
  t->x.dummy = work;
}
It works as expected, and the flat region is observed at the start :-)
However, this method is compiler specific, since extended inline asm is a GCC/Clang extension (and the "memory"-clobber variant additionally forces the compiler to assume all memory may be touched, which is probably not ideal). That doesn't seem to affect performance here, but I wonder if there's a neater way to stop the reordering.
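If portability is a concern, one small mitigation (a sketch, assuming GCC or Clang; CHASE_BARRIER is a made-up name, and other compilers simply lose the ordering guarantee here) is to hide the barrier behind a macro:

// Empty asm with read-write register operands: the compiler must have the
// current values of `work` and `p` available at this point and must assume
// the asm may change them, so it cannot bunch the dereferences together
// across x25 repetitions. No "memory" clobber is needed.
#if defined(__GNUC__) || defined(__clang__)
#define CHASE_BARRIER(work, p) asm("" : "+r"(work), "+r"(p))
#else
#define CHASE_BARRIER(work, p) ((void)0)  // no ordering guarantee elsewhere
#endif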
The disassembly with the bunched-up pointer dereferences and work loops is here and here, in case someone is interested (pretty weird compiler optimization, isn't it xD).
Comments:
- cc (so GCC on Ubuntu) […] -O3 -fno-unroll-loops. Can you confirm you enabled optimization for your testing? __sync_add_and_fetch is a full memory barrier on x86 due to lock add, so store/reload of local vars in the extra_work loop couldn't overlap across iterations. Although it should still run in the shadow of each pointer-chasing load. Oh, but there's only one such barrier for every 25 pointer derefs, so IDK. Still, extra stores/reloads could matter when trying to figure out what's going on. – Peter Cordes, Mar 4 at 20:27
- […] x25 to x200 (and the value in the add & fetch) and see if similar results persist. – Box Box Box Box, Mar 4 at 20:36
- […] -c work:0 I get about 62.3 cycles. With work:20 I still get 62.3 cycles, fully in the shadow of the load. work:50: 68 to 69 cycles (at 2.8 to 3.2 GHz). work:70: about 77c. (SKL ROB capacity is 224 uops, the inner loops are 4 or 3 uops per iteration (the x25 manual unrolled loops aren't all the same), and there are 2 uops of NOPs to align the top of the loop plus an xor-zero.) – Peter Cordes, Mar 4 at 20:57
- […] work by x25 or x200 in terms of how many instructions the ROB has to hide for it to run in the shadow of a cache-miss load. You could post that as an answer. An asm("" : "+r"(work) :: "memory") statement could probably fix that. – Peter Cordes, Mar 4 at 21:37