I was running the STREAM benchmark on my new Intel i7-12700H and noticed the following result (I changed the allocation of a, b, c from the default static arrays to heap allocations aligned to 64 bytes). This is a system with dual-channel 4800 MT/s RAM, so the theoretical peak is 2 channels x 8 bytes x 4800 MT/s = 76.8 GB/s.
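For reference, the allocation change was along these lines (a minimal sketch; allocate_arrays is my own helper name, and the real edit also needs the matching frees and error handling):

#include <stdlib.h>

/* a, b, c become 64-byte-aligned heap allocations instead of the static
   arrays in stream.c (STREAM_TYPE, STREAM_ARRAY_SIZE, OFFSET as in stream.c) */
static STREAM_TYPE *a, *b, *c;

static void allocate_arrays(void)
{
    size_t n = sizeof(STREAM_TYPE) * (STREAM_ARRAY_SIZE + OFFSET);
    if (posix_memalign((void **)&a, 64, n) ||
        posix_memalign((void **)&b, 64, n) ||
        posix_memalign((void **)&c, 64, n))
        exit(1);   /* allocation failed */
}

The run itself: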
# Only use performance cores
OMP_NUM_THREADS=8 numactl -C 0-7 ./stream_c.exe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 400000000 (elements), Offset = 0 (elements)
Memory per array = 3051.8 MiB (= 3.0 GiB).
Total memory required = 9155.3 MiB (= 8.9 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 122497 microseconds.
(= 122497 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           52570.6     0.124356     0.121741     0.130264
Scale:          53046.5     0.123956     0.120649     0.125336
Add:            51644.6     0.187153     0.185886     0.189544
Triad:          51564.1     0.187952     0.186176     0.190740
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
This seems normal, but it uses the default byte counts from stream.c:
static double bytes[4] = {
    2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, /* Copy:  read a, write c */
    2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, /* Scale: read c, write b */
    3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, /* Add:   read a and b, write c */
    3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE  /* Triad: read b and c, write a */
};
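For context, those factors come straight from the four kernels (simplified from stream.c, without the OpenMP pragmas and timing code): one transfer is counted per array that appears in each kernel.

for (ssize_t j = 0; j < STREAM_ARRAY_SIZE; j++) c[j] = a[j];                   /* Copy  */
for (ssize_t j = 0; j < STREAM_ARRAY_SIZE; j++) b[j] = scalar * c[j];          /* Scale */
for (ssize_t j = 0; j < STREAM_ARRAY_SIZE; j++) c[j] = a[j] + b[j];            /* Add   */
for (ssize_t j = 0; j < STREAM_ARRAY_SIZE; j++) a[j] = b[j] + scalar * c[j];   /* Triad */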
This is not accurate for systems where a store to a cache line that is not already in the cache first triggers a read of that line, called a write allocate (WA) or read for ownership (RFO) depending on the source. Counting that extra read for each stored array, we would get
static double bytes[4] = {
    3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, /* Copy:  read a, RFO + write c */
    3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, /* Scale: read c, RFO + write b */
    4 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, /* Add:   read a and b, RFO + write c */
    4 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE  /* Triad: read b and c, RFO + write a */
};
and this gives
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           79788.6     0.123881     0.120318     0.127139
Scale:          78227.6     0.124536     0.122719     0.127953
Add:            68944.0     0.186474     0.185658     0.187657
Triad:          70287.2     0.187193     0.182110     0.192889
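As a sanity check on the arithmetic: for Copy, 3 x 8 B x 400,000,000 elements = 9.6 GB per pass, and 9.6 GB / 0.120318 s ≈ 79.8 GB/s, matching the reported rate (STREAM's MB here is 10^6 bytes). In other words, the timings are essentially unchanged from the first run; only the assumed byte counts grew by factors of 3/2 and 4/3.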
However, this is higher than the theoretical peak of 76.8 GB/s.
If I change the Scale kernel to operate in place and count 2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE bytes for it, on the grounds that the load already brings each line into the cache so there is no WA/RFO to add, I get roughly 53 GB/s again:
Scale: 53705.2 0.123021 0.119169 0.132730
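Concretely, the modified kernel looks roughly like this (my own change, not stock stream.c; which of the arrays is scaled in place does not matter for the traffic argument):

/* In-place Scale: the load of b[j] pulls each cache line in before the store,
   so the store cannot add a WA/RFO read on top of the counted traffic.
   Note: checkSTREAMresults() presumably needs a matching tweak or it will
   flag b as wrong. */
#pragma omp parallel for
for (ssize_t j = 0; j < STREAM_ARRAY_SIZE; j++)
    b[j] = scalar * b[j];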
To me, this strongly suggests that there is no RFO/WA happening. That surprises me, because I thought avoiding it was only possible with non-temporal stores, and I have verified that there are none in the binary; there is also no call to memcpy generated.
So, am I wrong, or is Alder Lake able to optimise away the RFO/WA reads in certain cases? Does anyone have more information on which CPUs can do this, and under what conditions it happens?