The following code runs indefinitely, as kernel_loop
is stuck in an infinite loop. Shouldn't it be the case that the two small kernels get launched concurrently?
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>

int *d_x;
static constexpr int N = 1;

__global__ void init_buff(int *buff) {
    for (int i = 0; i < N; i++) {
        buff[i] = i;
    }
}

__global__ void kernel_loop(volatile int *buff) {
    while (true) {
        __threadfence();
        if (buff[0]) {
            break;
        }
    }
}
__global__ void kernel_write(volatile int *buff) {
    buff[0] = 1;
    __threadfence();  // per Ext3h's comment (see EDIT below)
}
int main() {
    cudaMalloc(reinterpret_cast<void **>(&d_x), sizeof(int) * N);
    init_buff<<<1, 1>>>(d_x);
    cudaDeviceSynchronize();
    cudaStream_t stream1, stream2;
    cudaStreamCreateWithFlags(&stream1, cudaStreamDefault);
    cudaStreamCreateWithFlags(&stream2, cudaStreamDefault);
    cudaDeviceSynchronize();
    kernel_loop<<<1, 1, 0, stream1>>>(d_x);
    kernel_write<<<1, 1, 0, stream2>>>(d_x);
    cudaDeviceSynchronize();
    return 0;
}
Additionally, if I change the order of the launches like so:
kernel_write<<<1, 1, 0, stream2>>>(d_x);
kernel_loop<<<1, 1, 0, stream1>>>(d_x);
the program runs to completion.
Furthermore,
cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream2, cudaStreamNonBlocking);
also causes an infinite loop.
For the record, __nanosleep-ing the looping thread also doesn't work.
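(That variant is a loop along these lines; the sleep duration shown is just an example, and __nanosleep requires compute capability 7.0 or newer:)
__global__ void kernel_loop(volatile int *buff) {
    while (true) {
        __threadfence();
        if (buff[0]) {
            break;
        }
        __nanosleep(100);  // back off instead of spinning at full speed
    }
}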
EDIT:
As per Ext3h's comment, added a __threadfence() to the writer kernel.
2 Answers
"Shouldn't it be the case that the two small kernels get launched concurrently?"
Not necessarily; there is a very long list of other constraints that may prohibit it.
The most important one when trying to run this on Windows: the driver batches kernel launches into coarser command buffers that get sent to the GPU. Grids that end up in the same buffer and don't have a dependency (i.e. they belong to different streams) have a chance to run in parallel. Grids that end up in different buffers are very unlikely to overlap.
If you are running your example on an idle GPU (right after the cudaDeviceSynchronize), the driver will not wait for the second kernel dispatch before sending a work package to the GPU, as it tries to get the GPU back to work as soon as possible. Batching only occurs under load.
You can use GPUView to inspect the actual work packages that end up going to the GPU, how they get scheduled and blocked, etc. It will not let you see grids, but you will understand why something isn't running.
Nsight can also tell you something similar, but it will usually not show why two grids did not overlap. In return, it lets you see timing at grid granularity.
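One constraint from that list that you can at least rule out programmatically is device support for concurrent kernel execution. A minimal sketch (querying device 0 is an assumption):
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    int concurrent = 0;
    // 1 if the device can execute multiple kernels at the same time, 0 otherwise
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentKernels, 0);
    printf("concurrent kernels supported: %d\n", concurrent);
    return 0;
}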
Launched kernels are not guaranteed to run concurrently; however, there is a way to enforce concurrent execution (as long as you stay within the limits of the GPU).
If you change the two __global__ kernel functions into __device__ functions and call them both from a single kernel, then they are guaranteed to run concurrently as long as you do not start more threads than can fit onto the GPU according to the NVidia occupancy calculator (see: https://xmartlabs.github.io/cuda-calculator/).
Here's example code (godbolt link: https://cuda.godbolt.org/z/dsfd3Mzbe):
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <stdio.h>

int *d_x;
static constexpr int N = 1;

__global__ void init_buff(int *buff) {
    for (int i = 0; i < N; i++) {
        buff[i] = i;
    }
}

// Spin until another block sets the flag.
__device__ void kernel_loop(volatile int *buff) {
    while (true) {
        __threadfence();
        if (buff[0]) {
            break;
        }
    }
    printf("done looping\n");
}

// Set the flag.
__device__ void kernel_write(volatile int *buff) {
    buff[0] = 1;
}

// Single kernel: block 0 runs the spin loop, block 1 writes the flag.
__global__ void kernel(volatile int *buff) {
    if (blockIdx.x == 0) { kernel_loop(buff); }
    else { kernel_write(buff); }
}

int main() {
    cudaMalloc(reinterpret_cast<void **>(&d_x), sizeof(int) * N);
    init_buff<<<1, 1>>>(d_x);
    cudaDeviceSynchronize();
    cudaStream_t stream1, stream2;
    cudaStreamCreateWithFlags(&stream1, cudaStreamDefault);
    cudaStreamCreateWithFlags(&stream2, cudaStreamDefault);
    cudaDeviceSynchronize();
    //kernel_loop<<<1, 1, 0, stream1>>>(d_x);
    //kernel_write<<<1, 1, 0, stream2>>>(d_x);
    kernel<<<2, 1>>>(d_x);
    cudaDeviceSynchronize();
    return 0;
}
The thing to remember is that for performance it is important NOT to diverge threads within the same warp (all threads in a warp should execute the same code most of the time). However, it is totally fine to diverge between different warps and between different blocks.
In this example I start 2 blocks. Block 0 gets assigned the kernel_loop task and all other blocks (well, block 1 really) run the kernel_write task.
If the total launch does not exceed the GPU limits (number of blocks, number of threads, shared memory, and register allocation), then all threads in that grid will run concurrently and thus your requirement is satisfied.
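A rough way to sanity-check that limit at runtime is the occupancy API. A minimal sketch that could go into main() of the example above, right before the kernel<<<2, 1>>> launch (device 0 is an assumption):
// How many blocks of this kernel can be resident per SM at block size 1?
int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, /*blockSize=*/1, /*dynSmemBytes=*/0);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// The spin/write scheme only works if the whole grid is resident at once,
// so the launched block count (2 here) must not exceed this bound.
int maxResidentBlocks = blocksPerSM * prop.multiProcessorCount;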
Comment on why volatile is bad
Yes, on the NVidia GPU volatile will pin your data to the globally visible L2 cache, but this has a number of drawbacks.
- Code is less transferable between the GPU and the CPU.
- volatile does not allow fine-grained access, whereas <cuda/atomic> does.
- It cannot synchronize the whole system (GPU+CPU), as <cuda/atomic> does.
- If you have a large struct and you wish to implement a lock-free/wait-free scheme on it, this cannot be done using volatile, whereas there are known ways to do this using atomic compare_exchange that can be ported from existing CPU implementations to the GPU more or less as-is.
- volatile does not add thread fences, whereas atomics from the <cuda/atomic> header add these automatically in the right places with the right semantics.
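As an illustration of that alternative (not code from the original post), here is a minimal sketch of the question's spin/write pair expressed with <cuda/atomic> instead of volatile. The helper names wait_for_flag and set_flag are made up for the sketch, and it assumes a toolkit recent enough to ship cuda::atomic_ref:
#include <cuda/atomic>

// Spin until another block publishes the flag.
__device__ void wait_for_flag(int *buff) {
    cuda::atomic_ref<int, cuda::thread_scope_device> flag(buff[0]);
    while (flag.load(cuda::memory_order_acquire) == 0) {
        // busy-wait; acquire ordering makes the writer's prior writes visible
    }
}

// Publish the flag.
__device__ void set_flag(int *buff) {
    cuda::atomic_ref<int, cuda::thread_scope_device> flag(buff[0]);
    flag.store(1, cuda::memory_order_release);
}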
volatile does not make code thread safe; only in Java does it do that. volatile merely prevents the compiler from reordering instructions. This is helpful if the compiler thinks there is no dependency, but you know better. – Johan, Feb 4 at 18:36