I was running the STREAM benchmark on my new Intel i7-12700H and noticed the following result (I changed the allocation of a, b, c from the default static arrays to heap allocations aligned to 64 bytes). This is a system with dual-channel 4800 MT/s RAM, so the theoretical peak is 2 channels x 8 bytes x 4800 MT/s = 76.8 GB/s.
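For reference, the allocation change was along these lines (a minimal sketch; allocate_arrays is my own helper name, and the real edit also needs the matching frees and error handling):

#include <stdlib.h>

/* a, b, c become 64-byte-aligned heap allocations instead of the static
   arrays in stream.c (STREAM_TYPE, STREAM_ARRAY_SIZE, OFFSET as in stream.c) */
static STREAM_TYPE *a, *b, *c;

static void allocate_arrays(void)
{
    size_t n = sizeof(STREAM_TYPE) * (STREAM_ARRAY_SIZE + OFFSET);
    if (posix_memalign((void **)&a, 64, n) ||
        posix_memalign((void **)&b, 64, n) ||
        posix_memalign((void **)&c, 64, n))
        exit(1);   /* allocation failed */
}

The run itself: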
# Only use performance cores
OMP_NUM_THREADS=8 numactl -C 0-7 ./stream_c.exe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 400000000 (elements), Offset = 0 (elements)
Memory per array = 3051.8 MiB (= 3.0 GiB).
Total memory required = 9155.3 MiB (= 8.9 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 122497 microseconds.
(= 122497 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           52570.6     0.124356     0.121741     0.130264
Scale:          53046.5     0.123956     0.120649     0.125336
Add:            51644.6     0.187153     0.185886     0.189544
Triad:          51564.1     0.187952     0.186176     0.190740
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
This seems normal, but it uses the default byte counts from stream.c:
static double bytes[4] = {
    2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, /* Copy:  read a, write c */
    2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, /* Scale: read c, write b */
    3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, /* Add:   read a and b, write c */
    3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE  /* Triad: read b and c, write a */
};
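For context, those factors come straight from the four kernels (simplified from stream.c, without the OpenMP pragmas and timing code): one transfer is counted per array that appears in each kernel.

for (ssize_t j = 0; j < STREAM_ARRAY_SIZE; j++) c[j] = a[j];                   /* Copy  */
for (ssize_t j = 0; j < STREAM_ARRAY_SIZE; j++) b[j] = scalar * c[j];          /* Scale */
for (ssize_t j = 0; j < STREAM_ARRAY_SIZE; j++) c[j] = a[j] + b[j];            /* Add   */
for (ssize_t j = 0; j < STREAM_ARRAY_SIZE; j++) a[j] = b[j] + scalar * c[j];   /* Triad */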
This is not accurate for systems where a store to a cache line that is not already in the cache first triggers a read of that line, called a write allocate (WA) or read for ownership (RFO) depending on the source. Counting that extra read for each stored array, we would get
static double bytes[4] = {
    3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, /* Copy:  read a, RFO + write c */
    3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, /* Scale: read c, RFO + write b */
    4 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE, /* Add:   read a and b, RFO + write c */
    4 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE  /* Triad: read b and c, RFO + write a */
};
and this gives
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           79788.6     0.123881     0.120318     0.127139
Scale:          78227.6     0.124536     0.122719     0.127953
Add:            68944.0     0.186474     0.185658     0.187657
Triad:          70287.2     0.187193     0.182110     0.192889
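As a sanity check on the arithmetic: for Copy, 3 x 8 B x 400,000,000 elements = 9.6 GB per pass, and 9.6 GB / 0.120318 s ≈ 79.8 GB/s, matching the reported rate (STREAM's MB here is 10^6 bytes). In other words, the timings are essentially unchanged from the first run; only the assumed byte counts grew by factors of 3/2 and 4/3.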
However, this is higher than the theoretical peak of 76.8 GB/s.
If I change the Scale kernel to operate in place and count 2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE bytes for it, on the grounds that the load already brings each line into the cache so there is no WA/RFO to add, I get roughly 53 GB/s again:
Scale: 53705.2 0.123021 0.119169 0.132730
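Concretely, the modified kernel looks roughly like this (my own change, not stock stream.c; which of the arrays is scaled in place does not matter for the traffic argument):

/* In-place Scale: the load of b[j] pulls each cache line in before the store,
   so the store cannot add a WA/RFO read on top of the counted traffic.
   Note: checkSTREAMresults() presumably needs a matching tweak or it will
   flag b as wrong. */
#pragma omp parallel for
for (ssize_t j = 0; j < STREAM_ARRAY_SIZE; j++)
    b[j] = scalar * b[j];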
To me, this strongly suggests that there is no RFO/WA happening. That surprises me, because I thought avoiding it was only possible with non-temporal stores, and I have verified that there are none in the binary; there is also no call to memcpy generated.
So, am I wrong, or is Alder Lake able to optimise away the RFO/WA reads in certain cases? Does anyone have more information on which CPUs can do this, and under what conditions it happens?