While implementing tiled matrix multiplication with AVX-512 and _mm512_i32gather_pd, targeting a 32 KiB L1 cache and a 1 MiB L2 cache, I noticed a significant increase in matmul time when kend changes from 144 to 152. Below is the code I wrote. The inner loop is unrolled over 10 rows so that ten FMAs (0.5-cycle reciprocal throughput each) roughly cover the ~5-cycle reciprocal throughput of the gather (5 / 0.5 = 10).
#include <immintrin.h>

// i, j, k: leading dimensions of the full row-major matrices (a is i x k,
// b is k x j, c is i x j); iend, jend, kend: extents of the current tile.
void matmul_tile(int i, int j, int k, double *c, double const *a, double const *b,
                 int iend, int jend, int kend){
    // Gather indices for one column of b: 8 consecutive rows, stride of j doubles.
    __m256i baseb = _mm256_setr_epi32(
        0 * j, 1 * j, 2 * j, 3 * j,
        4 * j, 5 * j, 6 * j, 7 * j
    );
    for(int ii = 0; ii <= iend - 10; ii += 10)
        for(int jj = 0; jj < jend; ++jj){
            // One accumulator per row of the 10-row strip of a.
            __m512d accums[10] = {
                _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd(),
                _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd()
            };
            for(int kk = 0; kk <= kend - 8; kk += 8){
                // One gather of b[kk..kk+7][jj] feeds ten FMAs, one per row of a.
                __m512d b_vals = _mm512_i32gather_pd(baseb, &b[kk * j + jj], sizeof(double));
                double const *a_base = &a[(ii + 0) * k + kk];
                accums[0] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[0]), accums[0]);
                accums[1] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[k]), accums[1]);
                accums[2] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[2 * k]), accums[2]);
                accums[3] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[3 * k]), accums[3]);
                accums[4] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[4 * k]), accums[4]);
                accums[5] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[5 * k]), accums[5]);
                accums[6] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[6 * k]), accums[6]);
                accums[7] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[7 * k]), accums[7]);
                accums[8] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[8 * k]), accums[8]);
                accums[9] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[9 * k]), accums[9]);
            }
            // Horizontally reduce each accumulator into its element of c.
            double *c_base = &c[(ii + 0) * j + jj];
            c_base[0] += _mm512_reduce_add_pd(accums[0]);
            c_base[j] += _mm512_reduce_add_pd(accums[1]);
            c_base[2 * j] += _mm512_reduce_add_pd(accums[2]);
            c_base[3 * j] += _mm512_reduce_add_pd(accums[3]);
            c_base[4 * j] += _mm512_reduce_add_pd(accums[4]);
            c_base[5 * j] += _mm512_reduce_add_pd(accums[5]);
            c_base[6 * j] += _mm512_reduce_add_pd(accums[6]);
            c_base[7 * j] += _mm512_reduce_add_pd(accums[7]);
            c_base[8 * j] += _mm512_reduce_add_pd(accums[8]);
            c_base[9 * j] += _mm512_reduce_add_pd(accums[9]);
        }
}
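For context, the kernel is driven by outer tiling loops roughly like the sketch below; the wrapper name matmul and the tile sizes TI / TJ / TK are placeholders, and edge tiles are ignored by assuming every dimension divides evenly:

// Simplified sketch of how matmul_tile is invoked: a is i x k, b is k x j,
// c is i x j, all row-major, with c zeroed beforehand.  Tile sizes are
// examples only (TI must be a multiple of 10, TK a multiple of 8).
void matmul(int i, int j, int k, double *c, double const *a, double const *b)
{
    enum { TI = 150, TJ = 256, TK = 144 };
    for (int i0 = 0; i0 < i; i0 += TI)
        for (int k0 = 0; k0 < k; k0 += TK)
            for (int j0 = 0; j0 < j; j0 += TJ)
                matmul_tile(i, j, k,
                            &c[i0 * j + j0],
                            &a[i0 * k + k0],
                            &b[k0 * j + j0],
                            TI, TJ, TK);
}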
I suspect this slowdown is related to the L1 cache size and the interaction of hardware prefetching with the gather. My estimate of the L1 footprint for computing one 10-row strip, assuming the L1 prefetcher brings in the next line after each gathered line of b, is as follows:
when kend is 144: 10 * 144 * 8 B (the 10-row strip of a) + 128 * 144 B (each demanded line of b plus its prefetched neighbor) = 29952 B ≈ 29 KiB
when kend is 152: 10 * 152 * 8 B + 128 * 152 B = 31616 B ≈ 31 KiB, which is nearly the entire 32 KiB L1.
So when kend is 152, it looks like the prefetched lines holding the next columns of B are evicted before the next jj iteration can use them, which slows the gathers down. I have also heard that lines brought in by the prefetcher but not yet demanded tend to be evicted before lines that have actually been used.
To be able to increase kend, I am considering inserting plain load (MOV) instructions inside the kk loop to manually pull the lines of the next column of b into L1, since on some processors the software prefetch instruction only brings the requested line into L2. A sketch of what I have in mind follows the list below. My concerns are:
- Is there a way to insert such a MOV instruction into the code without resorting to volatile inline asm?
- Could inserting these MOV instructions hurt _mm512_loadu_pd throughput?
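For reference, a minimal sketch of what I mean by "inserting a MOV" (illustrative only; the helper name touch_next_column and the one-column-ahead distance are placeholders, not measured code):

#include <immintrin.h>

/* Touch the line of b holding column (jj + 1) for row kk so it is already in
   L1 when the next jj iteration gathers it.  Would be called inside the kk loop. */
static inline void touch_next_column(double const *b, int j, int jj, int kk)
{
    double const *p = &b[kk * j + (jj + 1)];

    /* Option 1: software prefetch hint towards L1 (T0).  On processors where
       the hint is only honored into L2 this may not help, hence the question. */
    _mm_prefetch((char const *)p, _MM_HINT_T0);

    /* Option 2: an actual demand load (a real MOV) without inline asm: read
       through a volatile-qualified pointer and discard the value. */
    volatile double const *vp = p;
    (void)*vp;
}

Option 2 is the literal MOV variant, but since it issues on a load port it competes with the _mm512_loadu_pd loads and the gather, which is exactly what my second concern is about.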
Any insights or suggestions would be greatly appreciated!