
cuda - Why is my shared memory blocking kernel slower than my global memory coalescing kernel? - Stack Overflow


I have started learning CUDA programming and I am currently reproducing the code from siboehm's CUDA matmul article. I am implementing the third kernel (shared-memory blocking), but when I benchmark it I get lower FLOPS than with the global-memory-coalescing kernel. I am basically using the same initialization and benchmarking code as the article, except that I use fixed 1024x1024 matrices (the throughput numbers are computed as sketched after the specs list below). My specs:

NVIDIA RTX A6000
Compute Capability: 8.6
CUDA 11.8
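
For reference, this is roughly how I turn the measured time into GFLOPS — a minimal sketch assuming the usual 2·M·N·K FLOP count for a GEMM and CUDA events for timing (the helper names are mine, not from the article):

#include <cuda_runtime.h>

// Times `reps` launches of the kernel wrapper (defined further down) and
// returns the average time per launch in milliseconds.
float time_kernel_ms(int M, int N, int K, float alpha, float* dA, float* dB,
                     float beta, float* dC, int reps) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < reps; i++)
        run_shared_memory_blocking_kernel(M, N, K, alpha, dA, dB, beta, dC);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

// A GEMM does 2*M*N*K floating-point operations (one multiply and one add
// per inner-product term), so:
double gflops(int M, int N, int K, float avg_ms) {
    return (2.0 * M * N * K) / (avg_ms * 1e-3) / 1e9;
}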

Kernel code (heavily commented, because I write down my thought process as I code):

#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))
__global__ void shared_memory_blocking_kernel(int M, int N, int K, float alpha, 
                                                    const float* A, const float* B,
                                                    float beta, float* C) {
    // This block is responsible for computing submatrix C_xy
    const uint bx = blockIdx.x;
    const uint by = blockIdx.y;

    // The block dimension (we force square blocks, so blockDim.x == blockDim.y)
    const uint bd = blockDim.x;

    // This thread's position within the block
    const uint tx = threadIdx.x;
    const uint ty = threadIdx.y;
    const uint local_mem_access = ty * bd + tx;

    // This thread's target element of C: the row follows how A is advanced
    // below (bx selects the row block), the column follows B (by selects the
    // column block)
    const uint row = bx * bd + ty;
    const uint col = by * bd + tx;

    float sum = 0.0f; // Value that will be computed by this thread

    // Allocate shared memory for the A and B submatrices:
    extern __shared__ float sharedMemory[];
    float* subA = sharedMemory;
    float* subB = (float*)&sharedMemory[blockDim.x * blockDim.y];

    // Runs CEIL_DIV(K, bd) times; note we force the block dim to be square here
    int num_iter = CEIL_DIV(K, bd);

    // bx*bd + ty is always the same for this thread, so we can just move the
    // A pointer there
    A += (bx * bd + ty) * K; // Now we point to the row this thread will load
    // Similarly, move the B pointer; note that B's row stride is N, not K
    B += by * bd + ty * N; // Now we point to the column this thread will load
    for (int i=0; i < num_iter; i++)
    {
        // Load the A and B tiles into shared memory with the coalescing trick:
        // consecutive threads (tx) access consecutive memory locations
        subA[local_mem_access] = A[tx];
        subB[local_mem_access] = B[tx];
        

        __syncthreads();
        // Jump by bd columns to reach the next tile of A
        A += bd;
        // Jump by bd rows (bd * N elements) to reach the next tile of B
        B += bd * N;
        // Compute the partial sum
        for (int k = 0; k < bd; k++){
            sum += subA[ty * bd + k] * subB[k * bd + tx];
        } 

        __syncthreads();
    }

    // This thread still writes its own element of C
    C[row * N + col] = alpha * sum + beta * C[row * N + col];
}

To run this code I use:

void run_shared_memory_blocking_kernel(int M, int N, int K, float alpha,
    float* A, float* B, 
    float beta, float* C) {
    dim3 gridDim(CEIL_DIV(M, 32), CEIL_DIV(N, 32));
    dim3 blockDim(32, 32);
    size_t sharedMemorySize = 2 * blockDim.x * blockDim.y * sizeof(float);
    shared_memory_blocking_kernel<<<gridDim, blockDim, sharedMemorySize>>>(M, N, K, alpha, A, B, beta, C);
}
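
For completeness, a minimal host-side driver along these lines is what I use for a quick sanity check (the buffer names and the all-ones initialization are mine, not the article's; error checking trimmed):

#include <cuda_runtime.h>
#include <vector>

int main() {
    const int M = 1024, N = 1024, K = 1024;
    std::vector<float> hA(M * K, 1.0f), hB(K * N, 1.0f), hC(M * N, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, M * K * sizeof(float));
    cudaMalloc(&dB, K * N * sizeof(float));
    cudaMalloc(&dC, M * N * sizeof(float));
    cudaMemcpy(dA, hA.data(), M * K * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), K * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), M * N * sizeof(float), cudaMemcpyHostToDevice);

    run_shared_memory_blocking_kernel(M, N, K, 1.0f, dA, dB, 0.0f, dC);
    cudaDeviceSynchronize();

    cudaMemcpy(hC.data(), dC, M * N * sizeof(float), cudaMemcpyDeviceToHost);
    // With all-ones inputs, alpha = 1 and beta = 0, every entry of C should
    // equal K (here 1024.0f)

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}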

When running my code I get roughly 1800 GFLOPS, but when running siboehm's code I get 2800 GFLOPS for the 1024 matrix size. I am just starting to learn how to use the profiler, so any advice on how to identify where the issue is would be greatly appreciated.
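
Concretely, I was planning to start from Nsight Compute's CLI with something like `ncu --set full -k shared_memory_blocking_kernel ./sgemm` (where `./sgemm` stands in for my benchmark binary), but I am not sure which sections or metrics to look at first.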
