While implementing tiled matrix multiplication with AVX-512 and _mm512_i32gather_pd, targeting a 32 KiB L1 cache and a 1 MiB L2 cache, I noticed a significant increase in matmul time when kend changes from 144 to 152. Below is the code I wrote. The inner loop is unrolled over 10 rows so that ten FMAs (0.5-cycle reciprocal throughput each) roughly cover the ~5-cycle reciprocal throughput of the gather (5 / 0.5 = 10).
#include <immintrin.h>

// i, j, k: leading dimensions of the full row-major matrices (a is i x k,
// b is k x j, c is i x j); iend, jend, kend: extents of the current tile.
void matmul_tile(int i, int j, int k, double *c, double const *a, double const *b,
                 int iend, int jend, int kend){
    // Gather indices for one column of b: 8 consecutive rows, stride of j doubles.
    __m256i baseb = _mm256_setr_epi32(
        0 * j, 1 * j, 2 * j, 3 * j,
        4 * j, 5 * j, 6 * j, 7 * j
    );
    for(int ii = 0; ii <= iend - 10; ii += 10)
        for(int jj = 0; jj < jend; ++jj){
            // One accumulator per row of the 10-row strip of a.
            __m512d accums[10] = {
                _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd(),
                _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd()
            };
            for(int kk = 0; kk <= kend - 8; kk += 8){
                // One gather of b[kk..kk+7][jj] feeds ten FMAs, one per row of a.
                __m512d b_vals = _mm512_i32gather_pd(baseb, &b[kk * j + jj], sizeof(double));
                double const *a_base = &a[(ii + 0) * k + kk];
                accums[0] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[0]), accums[0]);
                accums[1] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[k]), accums[1]);
                accums[2] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[2 * k]), accums[2]);
                accums[3] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[3 * k]), accums[3]);
                accums[4] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[4 * k]), accums[4]);
                accums[5] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[5 * k]), accums[5]);
                accums[6] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[6 * k]), accums[6]);
                accums[7] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[7 * k]), accums[7]);
                accums[8] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[8 * k]), accums[8]);
                accums[9] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[9 * k]), accums[9]);
            }
            // Horizontally reduce each accumulator into its element of c.
            double *c_base = &c[(ii + 0) * j + jj];
            c_base[0] += _mm512_reduce_add_pd(accums[0]);
            c_base[j] += _mm512_reduce_add_pd(accums[1]);
            c_base[2 * j] += _mm512_reduce_add_pd(accums[2]);
            c_base[3 * j] += _mm512_reduce_add_pd(accums[3]);
            c_base[4 * j] += _mm512_reduce_add_pd(accums[4]);
            c_base[5 * j] += _mm512_reduce_add_pd(accums[5]);
            c_base[6 * j] += _mm512_reduce_add_pd(accums[6]);
            c_base[7 * j] += _mm512_reduce_add_pd(accums[7]);
            c_base[8 * j] += _mm512_reduce_add_pd(accums[8]);
            c_base[9 * j] += _mm512_reduce_add_pd(accums[9]);
        }
}
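For context, the kernel is driven by outer tiling loops roughly like the sketch below; the wrapper name matmul and the tile sizes TI / TJ / TK are placeholders, and edge tiles are ignored by assuming every dimension divides evenly:

// Simplified sketch of how matmul_tile is invoked: a is i x k, b is k x j,
// c is i x j, all row-major, with c zeroed beforehand.  Tile sizes are
// examples only (TI must be a multiple of 10, TK a multiple of 8).
void matmul(int i, int j, int k, double *c, double const *a, double const *b)
{
    enum { TI = 150, TJ = 256, TK = 144 };
    for (int i0 = 0; i0 < i; i0 += TI)
        for (int k0 = 0; k0 < k; k0 += TK)
            for (int j0 = 0; j0 < j; j0 += TJ)
                matmul_tile(i, j, k,
                            &c[i0 * j + j0],
                            &a[i0 * k + k0],
                            &b[k0 * j + j0],
                            TI, TJ, TK);
}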
I suspect this slowdown is related to the L1 cache size and the interaction of hardware prefetching with the gather. My estimate of the L1 footprint for computing one 10-row strip, assuming the L1 prefetcher brings in the next line after each gathered line of b, is as follows:
when kend is 144: 10 * 144 * 8 B (the 10-row strip of a) + 128 * 144 B (each demanded line of b plus its prefetched neighbor) = 29952 B ≈ 29 KiB
when kend is 152: 10 * 152 * 8 B + 128 * 152 B = 31616 B ≈ 31 KiB, which is nearly the entire 32 KiB L1.
So when kend is 152, it looks like the prefetched lines holding the next columns of B are evicted before the next jj iteration can use them, which slows the gathers down. I have also heard that lines brought in by the prefetcher but not yet demanded tend to be evicted before lines that have actually been used.
To be able to increase kend, I am considering inserting plain load (MOV) instructions inside the kk loop to manually pull the lines of the next column of b into L1, since on some processors the software prefetch instruction only brings the requested line into L2. A sketch of what I have in mind follows the list below. My concerns are:
- Is there a way to insert such a MOV instruction into the code without resorting to volatile inline asm?
- Could inserting these MOV instructions hurt _mm512_loadu_pd throughput?
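For reference, a minimal sketch of what I mean by "inserting a MOV" (illustrative only; the helper name touch_next_column and the one-column-ahead distance are placeholders, not measured code):

#include <immintrin.h>

/* Touch the line of b holding column (jj + 1) for row kk so it is already in
   L1 when the next jj iteration gathers it.  Would be called inside the kk loop. */
static inline void touch_next_column(double const *b, int j, int jj, int kk)
{
    double const *p = &b[kk * j + (jj + 1)];

    /* Option 1: software prefetch hint towards L1 (T0).  On processors where
       the hint is only honored into L2 this may not help, hence the question. */
    _mm_prefetch((char const *)p, _MM_HINT_T0);

    /* Option 2: an actual demand load (a real MOV) without inline asm: read
       through a volatile-qualified pointer and discard the value. */
    volatile double const *vp = p;
    (void)*vp;
}

Option 2 is the literal MOV variant, but since it issues on a load port it competes with the _mm512_loadu_pd loads and the gather, which is exactly what my second concern is about.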
Any insights or suggestions would be greatly appreciated!