Inserting a MOV instruction to prefetch data into the L1 cache

While implementing tiled matrix multiplication with AVX-512 and _mm512_i32gather_pd, targeting a CPU with a 32 KiB L1 data cache and a 1 MiB L2 cache, I've noticed a significant jump in matmul time when the value of kend changes from 144 to 152. Below is the code I wrote. The loop is unrolled by 10 to balance instruction throughput: one gather per 5 cycles against ten FMAs at 0.5 cycles each (10 × 0.5 = 5 cycles), so the FMAs cover the gather throughput.

#include <immintrin.h>

// j: row stride (column count) of b and c; k: row stride of a.
void matmul_tile(int j, int k, double *c, double const *a, double const *b, int iend, int jend, int kend){
    // Element offsets of 8 consecutive rows within one column of row-major b.
    __m256i baseb = _mm256_setr_epi32(
        0 * j, 1 * j, 2 * j, 3 * j,
        4 * j, 5 * j, 6 * j, 7 * j
    );
    for(int ii = 0; ii <= iend - 10; ii += 10)
      for(int jj = 0; jj < jend; ++jj){
          // One accumulator per row of the 10-row strip of a.
          __m512d accums[10] = {
              _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd(),
              _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd(), _mm512_setzero_pd()
          };

          for(int kk = 0; kk <= kend - 8; kk += 8){
              // Gather 8 elements of column jj of b (rows kk..kk+7, stride j doubles).
              __m512d b_vals = _mm512_i32gather_pd(baseb, &b[kk * j + jj], sizeof(double));
              double const *a_base = &a[(ii + 0) * k + kk];

              accums[0] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[0]), accums[0]);
              accums[1] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[k]), accums[1]);
              accums[2] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[2 * k]), accums[2]);
              accums[3] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[3 * k]), accums[3]);
              accums[4] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[4 * k]), accums[4]);
              accums[5] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[5 * k]), accums[5]);
              accums[6] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[6 * k]), accums[6]);
              accums[7] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[7 * k]), accums[7]);
              accums[8] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[8 * k]), accums[8]);
              accums[9] = _mm512_fmadd_pd(b_vals, _mm512_loadu_pd(&a_base[9 * k]), accums[9]);
          }

          // Horizontally reduce each accumulator and add it into its row of c.
          double *c_base = &c[(ii + 0) * j + jj];
          c_base[0] += _mm512_reduce_add_pd(accums[0]);
          c_base[j] += _mm512_reduce_add_pd(accums[1]);
          c_base[2 * j] += _mm512_reduce_add_pd(accums[2]);
          c_base[3 * j] += _mm512_reduce_add_pd(accums[3]);
          c_base[4 * j] += _mm512_reduce_add_pd(accums[4]);
          c_base[5 * j] += _mm512_reduce_add_pd(accums[5]);
          c_base[6 * j] += _mm512_reduce_add_pd(accums[6]);
          c_base[7 * j] += _mm512_reduce_add_pd(accums[7]);
          c_base[8 * j] += _mm512_reduce_add_pd(accums[8]);
          c_base[9 * j] += _mm512_reduce_add_pd(accums[9]);
      }
}
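
For context, this kernel computes one tile of C. A driver might call it roughly as follows; this is a hypothetical sketch, not my actual code, and the tile sizes TI, TJ, TK are illustrative. It assumes n is a multiple of the tile sizes, since the kernel has no remainder handling:

void matmul(int n, double *c, double const *a, double const *b){
    const int TI = 40, TJ = 64, TK = 152;     // example tile sizes
    for(int i0 = 0; i0 < n; i0 += TI)
        for(int j0 = 0; j0 < n; j0 += TJ)
            for(int k0 = 0; k0 < n; k0 += TK)
                matmul_tile(n, n,             // row strides of b/c and a are all n
                            c + i0 * n + j0,  // C tile origin
                            a + i0 * n + k0,  // A tile origin
                            b + k0 * n + j0,  // B tile origin
                            TI, TJ, TK);
}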

I suspect this performance drop is related to the L1 cache size and to the effect of hardware prefetching on the gather instruction. Each gathered element of b sits in its own 64-byte line, and with next-line prefetching each row of b effectively costs two lines (128 bytes). The estimated L1 footprint for one 10-row strip is then:

when kend is 144: 10 * 144 * 8 + 128 * 144 = 29952 B ≈ 29.3 KiB

when kend is 152: 10 * 152 * 8 + 128 * 152 = 31616 B ≈ 30.9 KiB, nearly the full 32 KiB L1 capacity.

So when kend is 152, it seems that the prefetched lines holding the next column elements of matrix B are evicted before the next jj iteration starts, which slows down the gather. I have also heard that prefetched lines tend to be evicted before lines that were actually demand-loaded.
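
To make the arithmetic explicit, here is a small helper encoding the same model (hypothetical, just for checking the numbers; it assumes the adjacent-line prefetcher fetches exactly one extra 64-byte line per gathered line of b):

constexpr int strip_footprint_bytes(int kend){
    return 10 * kend * 8      // a: 10 rows of kend doubles
         + kend * 2 * 64;     // b: demanded line + next-line prefetch per kk row
}
static_assert(strip_footprint_bytes(144) == 29952, "about 29.3 KiB");
static_assert(strip_footprint_bytes(152) == 31616, "about 30.9 KiB");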

To be able to increase kend, I am considering inserting MOV (plain demand-load) instructions inside the kk loop to manually bring lines of the next columns of b into L1, since, as I understand it, the software prefetch instruction on some processors only fetches the requested line into L2. My concerns are (see the sketch after this list):

  1. Is there a way to insert a MOV instruction in the code without using volatile inline asm?
  2. Could inserting MOV instructions hurt _mm512_loadu_pd throughput?
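
For question 1, this is the kind of thing I mean; a minimal sketch, assuming a volatile read is acceptable where inline asm is not (the helper names are mine):

#include <immintrin.h>

// Option A: read through a volatile pointer; the compiler must emit a real
// scalar load (a MOV), which demand-fetches the line into L1.
static inline void touch_line(double const *p){
    volatile double const *vp = p;
    (void)*vp;
}

// Option B: the standard software-prefetch intrinsic with the T0 hint, which
// requests the line into all cache levels including L1 (subject to the caveat
// above about how a given processor honors the hint).
static inline void prefetch_line(double const *p){
    _mm_prefetch((char const *)p, _MM_HINT_T0);
}

Either form would be dropped into the kk loop once per 64-byte line of the next column of b. As far as I know, both kinds of load occupy a load port, which is exactly what worries me for question 2.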

Any insights or suggestions would be greatly appreciated!
