
c - Pointer chasing benchmark - unexpected lack of out of order execution? - Stack Overflow


I wanted a reliable benchmark that incurs a lot of cache misses, and the obvious go-to seemed to be a pointer-chasing benchmark. I ended up trying google/multichase, but the results were not what I expected.

Multichase is essentially your classic pointer-chasing benchmark, except that the authors have gone to great lengths to make sure the hardware prefetcher fails. It reports the average latency for dereferencing your pointer, and has the option to interleave "work", which is basically just an empty loop. The relevant code snippet is as follows:

static void chase_work(per_thread_t *t) {
  void *p = t->x.cycle[0];
  size_t extra_work = strtoul(t->x.extra_args, 0, 0);
  size_t work = 0;
  size_t i;

  // the extra work is intended to be overlapped with a dereference,
  // but we don't want it to skip past the next dereference.  so
  // we fold in the value of the pointer, and launch the deref then
  // go into a loop performing extra work, hopefully while the
  // deref occurs.
  do {
    x25(work += (uintptr_t)p; p = *(void **)p;
        // (I uncomment the following line for the second benchmark)
        // (it was not a part of the original source code)
        //work += (uintptr_t)p;
        for (i = 0; i < extra_work; ++i) { work ^= i; })
  } while (__sync_add_and_fetch(&t->x.count, 25));

  // we never actually reach here, but the compiler doesn't know that
  t->x.cycle[0] = p;
  t->x.dummy = work;
}

(x25 is a macro which repeats its argument 25 times, so x25(y) is just y y y ... y, 25 times.)
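For reference, such a repeat macro can be written along these lines (a sketch; multichase's actual definition may be structured differently):

/* sketch of a repeat-25-times macro; the real multichase
   definition may differ in structure, but the effect is the same */
#define x5(y)  y y y y y
#define x25(y) x5(x5(y))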

extra_work is passed in as a parameter, and in the ideal scenario the processor executes the "work" loop out of order while the pointer value is being fetched. This would also mean that as extra_work is increased, the runtime should initially stay roughly flat, and beyond some point grow linearly with extra_work, once the cache misses are no longer the bottleneck.
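In rough terms (my own back-of-the-envelope model, not something from the multichase docs), the expected time per pointer is

    t_pointer ≈ max(t_miss, extra_work × t_iter)

where t_miss is the cache-miss latency and t_iter is the cost of one work-loop iteration: flat while the loop hides under the miss, and linear in extra_work once the loop dominates.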

Unfortunately, there was no such flat region at the start; the runtime was linear throughout. This made me suspect that the loop is not actually run while the processor waits for the pointer dereference.

To confirm this, I added a work += (uintptr_t)p; line just before the loop but after the dereference, as shown in the code above. This should order the dereference and the loop, so the loop cannot proceed until the dereference completes. And the code already had the property that the next dereference cannot proceed until the loop finishes.

So, if the loop was previously executing out of order with the dereference, one would expect a sharp rise in the runtime with this change. However, that is not observed: the rise is barely a few ns. I have attached the graph below, where I vary extra_work from 1 to 500.

My processor is an 11th Gen Intel(R) Core(TM) i5-11400H and I'm on Ubuntu 22.04. It seems obvious to me that the loop should be executed out of order with the dereference, but that is clearly not happening, hence my questions:

(1) Why isn't it happening? (2) How do I fix this so the benchmark works as expected?

(To reproduce: clone the multichase GitHub repo, run make, then run ./multichase -c work:N.)


asked Mar 4 at 19:56, edited Mar 4 at 20:11, by Box Box Box Box
  • From your Makefile, this looks like it's compiled with cc (so GCC on Ubuntu) -O3 -fno-unroll-loops. Can you confirm you enabled optimization for your testing? __sync_add_and_fetch is a full memory barrier on x86 due to lock add, so store/reload of local vars in the extra_work loop couldn't overlap across iterations. Although should still run in the shadow of each pointer-chasing load. Oh, but there's only one such barrier for every 25 pointer derefs, so IDK. Still, extra stores/reloads could matter when trying to figure out what's going on. – Peter Cordes Commented Mar 4 at 20:27
  • 1 @PeterCordes yes, I have just used the default makefile with all the optimizations enabled. I have also looked at the disassembly and nothing particularly surprising popped up, the work loops are just a bunch of standard xor, add, cmp, jne instructions. I will try changing x25 to x200 (and the value in the add & fetch) and see if similar results persist. – Box Box Box Box Commented Mar 4 at 20:36
  • 1 Can't repro, works as expected for me on Skylake (i7-6700k on Linux with EPP = balance_performance, so 3.9GHz normally, but actually 2.6GHz while this memory-bound workload is running). With -c work:0 I get about 62.3 cycles. With work:20 I still get 62.3 cycles, fully in the shadow of the load. work:50 - 68 to 69 cycles (at 2.8 to 3.2GHz). work:70 about 77c. (SKL ROB capacity is 224 uops, the inner loops are 4 or 3 uops per iteration (the x25 manual unrolled loops aren't all the same), and there are 2 uops of NOPs to align the top of the loop plus an xor-zero. – Peter Cordes Commented Mar 4 at 20:57
  • 1 Ah, looks like I found the issue - for some reason the disassembly has all the pointer dereferences inside the main loop bunched up together, followed by the 25 work loops bunched up together. Not really sure why it's happening or how to fix it, but oh well at least it explains the cause. @PeterCordes could you maybe check if on your system the disassembly looks similar? On my system it is dereferencing and storing the values on the stack +some registers and later reusing them in the loops all bunched up together, instead of interleaving the work loops :-| – Box Box Box Box Commented Mar 4 at 21:34
  • 2 Oh, yes, GCC 14.2's asm output has a couple of the work loops without a load, but most of them are interleaved with pointer-chasing loops as expected. (I had noticed the absence of a load once, but then looked again at a different inner loop that did have a load and forgot about it, xD.) But yeah, that would for sure explain it, effectively multiplying work by x25 or x200 in terms of how many instructions the ROB has to hide for it to run in the shadow of a cache-miss load. You could post that as an answer. An asm("" : "+r"(work) :: "memory") statement could probably fix that. – Peter Cordes Commented Mar 4 at 21:37

1 Answer


The issue was that my compiler (GCC 11.4.0) reorders the loops and pointer dereferences: it dereferences all 25 pointers in order, stores the results in various registers and on the stack, and then runs all 25 loops together, using the stored values as and when required. So the loop was indeed not running out of order with the pointer dereference.
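Schematically, the effect was as if GCC had transformed each group of 25 iterations into something like the following (a C-level sketch of the shape I saw in the disassembly, not actual compiler output):

  /* C-level sketch of the reordering effect, not actual GCC output:
     all 25 serially dependent dereferences are issued back to back... */
  void *p1 = *(void **)p;
  void *p2 = *(void **)p1;
  /* ... p3 through p25 likewise, spilled to registers and the stack ... */
  /* ...and only afterwards do all 25 work loops run on the saved values */
  for (i = 0; i < extra_work; ++i) { work ^= i; }
  for (i = 0; i < extra_work; ++i) { work ^= i; }
  /* ... 23 more copies of the loop ... */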

@PeterCordes suggested asm("" : "+r"(work) :: "memory") and asm("" : "+r"(work), "+r"(p)); in the comments, and both work well. I have used the latter since it avoids the full memory clobber.

static void chase_work(per_thread_t *t) {
  void *p = t->x.cycle[0];
  size_t extra_work = strtoul(t->x.extra_args, 0, 0);
  size_t work = 0;
  size_t i;

  // the extra work is intended to be overlapped with a dereference,
  // but we don't want it to skip past the next dereference.  so
  // we fold in the value of the pointer, and launch the deref then
  // go into a loop performing extra work, hopefully while the
  // deref occurs.
  do {
    x25(work += (uintptr_t)p; p = *(void **)p;
        for (i = 0; i < extra_work; ++i) { work ^= i; }
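        // the empty asm below is a compile-time barrier: "+r" marks
        // work and p as read and rewritten here, so GCC can no longer
        // pull the dereferences out and bunch the work loops together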
        asm("" : "+r"(work), "+r"(p)); )
  } while (__sync_add_and_fetch(&t->x.count, 25));

  // we never actually reach here, but the compiler doesn't know that
  t->x.cycle[0] = p;
  t->x.dummy = work;
}

It works as expected, and the flat region is observed at the start :-)
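For context (my understanding of the idiom, not something documented in multichase): an empty asm statement with "+r" operands emits no instructions, but it tells GCC that the named variables are read and rewritten at that point, so the compiler cannot move or merge the surrounding computations across it. Note that this constrains only compiler scheduling; the CPU is still free to execute out of order. A minimal standalone illustration (the function name is just for illustration):

#include <stdint.h>

/* minimal sketch of the empty-asm compiler-barrier idiom */
uint64_t barrier_demo(uint64_t x) {
  x += 1;            // computed before the barrier
  asm("" : "+r"(x)); // emits nothing, but GCC must treat x as read
                     // and rewritten here, so it cannot fold the
                     // two arithmetic steps into one
  x *= 3;            // computed after the barrier
  return x;
}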

However, this method is compiler-specific, since inline asm syntax varies between compilers, and the "memory" variant also clobbers all of memory, which is probably not ideal. Neither issue seems to affect performance here, but I wonder if there's a neater way to stop the reordering.

The disassemblies with the bunched-up pointer dereferences and work loops are here and here, in case someone is interested (pretty weird compiler optimization, isn't it? xD).
