The following code runs indefinitely, as kernel_loop
is stuck in an infinite loop. Shouldn't it be the case that the two small kernels get launched concurrently?
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>

int *d_x;
static constexpr int N = 1;

__global__ void init_buff(int *buff) {
    for (int i = 0; i < N; i++) {
        buff[i] = i;
    }
}

__global__ void kernel_loop(volatile int *buff) {
    while (true) {
        __threadfence();
        if (buff[0]) {
            break;
        }
    }
}
__global__ void kernel_write(volatile int *buff) {
    buff[0] = 1;
    __threadfence();  // per Ext3h's comment (see EDIT below)
}
int main() {
    cudaMalloc(reinterpret_cast<void **>(&d_x), sizeof(int) * N);
    init_buff<<<1, 1>>>(d_x);
    cudaDeviceSynchronize();
    cudaStream_t stream1, stream2;
    cudaStreamCreateWithFlags(&stream1, cudaStreamDefault);
    cudaStreamCreateWithFlags(&stream2, cudaStreamDefault);
    cudaDeviceSynchronize();
    kernel_loop<<<1, 1, 0, stream1>>>(d_x);
    kernel_write<<<1, 1, 0, stream2>>>(d_x);
    cudaDeviceSynchronize();
    return 0;
}
Additionally, if I change the order of the launches like so:
kernel_write<<<1, 1, 0, stream2>>>(d_x);
kernel_loop<<<1, 1, 0, stream1>>>(d_x);
the program runs to completion.
Furthermore,
cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream2, cudaStreamNonBlocking);
also causes an infinite loop.
For the record, __nanosleep-ing the looping thread also doesn't work.
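(That variant is a loop along these lines; the sleep duration shown is just an example, and __nanosleep requires compute capability 7.0 or newer:)
__global__ void kernel_loop(volatile int *buff) {
    while (true) {
        __threadfence();
        if (buff[0]) {
            break;
        }
        __nanosleep(100);  // back off instead of spinning at full speed
    }
}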
EDIT:
As per Ext3h's comment, added a __threadfence() to the writer kernel.
2 Answers
"Shouldn't it be the case that the two small kernels get launched concurrently?"
Not necessarily; there is a very long list of other constraints that may prohibit it.
The most important one when trying to run this on Windows: the driver batches kernel launches into coarser command buffers that get sent to the GPU. Grids that end up in the same buffer and don't have a dependency (i.e. they belong to different streams) have a chance to run in parallel. Grids that end up in different buffers are very unlikely to overlap.
If you are running your example on an idle GPU (right after the cudaDeviceSynchronize), the driver will not wait for the second kernel dispatch before sending a work package to the GPU, as it tries to get the GPU back to work as soon as possible. Batching only occurs under load.
You can use GPUView to inspect the actual work packages that end up going to the GPU, how they get scheduled and blocked, etc. It will not let you see grids, but you will understand why something isn't running.
Nsight can also tell you something similar, but it will usually not show why two grids did not overlap. In return, it lets you see timing at grid granularity.
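One constraint from that list that you can at least rule out programmatically is device support for concurrent kernel execution. A minimal sketch (querying device 0 is an assumption):
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    int concurrent = 0;
    // 1 if the device can execute multiple kernels at the same time, 0 otherwise
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentKernels, 0);
    printf("concurrent kernels supported: %d\n", concurrent);
    return 0;
}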
Launched kernels are not guaranteed to run concurrently; however, there is a way to enforce concurrent execution (as long as you stay within the limits of the GPU).
If you change the two __global__ kernel functions into __device__ functions and call them both from a single kernel, then they are guaranteed to run concurrently as long as you do not start more threads than can fit onto the GPU according to the NVidia occupancy calculator (see: https://xmartlabs.github.io/cuda-calculator/).
Here's example code (godbolt link: https://cuda.godbolt.org/z/dsfd3Mzbe):
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <stdio.h>

int *d_x;
static constexpr int N = 1;

__global__ void init_buff(int *buff) {
    for (int i = 0; i < N; i++) {
        buff[i] = i;
    }
}

// Spin until another block sets the flag.
__device__ void kernel_loop(volatile int *buff) {
    while (true) {
        __threadfence();
        if (buff[0]) {
            break;
        }
    }
    printf("done looping\n");
}

// Set the flag.
__device__ void kernel_write(volatile int *buff) {
    buff[0] = 1;
}

// Single kernel: block 0 runs the spin loop, block 1 writes the flag.
__global__ void kernel(volatile int *buff) {
    if (blockIdx.x == 0) { kernel_loop(buff); }
    else { kernel_write(buff); }
}

int main() {
    cudaMalloc(reinterpret_cast<void **>(&d_x), sizeof(int) * N);
    init_buff<<<1, 1>>>(d_x);
    cudaDeviceSynchronize();
    cudaStream_t stream1, stream2;
    cudaStreamCreateWithFlags(&stream1, cudaStreamDefault);
    cudaStreamCreateWithFlags(&stream2, cudaStreamDefault);
    cudaDeviceSynchronize();
    //kernel_loop<<<1, 1, 0, stream1>>>(d_x);
    //kernel_write<<<1, 1, 0, stream2>>>(d_x);
    kernel<<<2, 1>>>(d_x);
    cudaDeviceSynchronize();
    return 0;
}
The thing to remember is that for performance it is important NOT to diverge threads within the same warp (all threads in a warp should execute the same code most of the time). However, it is totally fine to diverge between different warps and between different blocks.
In this example I start 2 blocks. Block 0 gets assigned the kernel_loop task and all other blocks (well, block 1 really) run the kernel_write task.
If the total launch does not exceed the GPU limits (number of blocks, number of threads, shared memory, and register allocation), then all threads in that grid will run concurrently and thus your requirement is satisfied.
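A rough way to sanity-check that limit at runtime is the occupancy API. A minimal sketch that could go into main() of the example above, right before the kernel<<<2, 1>>> launch (device 0 is an assumption):
// How many blocks of this kernel can be resident per SM at block size 1?
int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, /*blockSize=*/1, /*dynSmemBytes=*/0);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// The spin/write scheme only works if the whole grid is resident at once,
// so the launched block count (2 here) must not exceed this bound.
int maxResidentBlocks = blocksPerSM * prop.multiProcessorCount;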
Comment on why volatile is bad
Yes, on the NVidia GPU volatile will pin your data to the globally visible L2 cache, but this has a number of drawbacks.
- Code is less transferable between the GPU and the CPU.
- volatile does not allow fine-grained access, whereas <cuda/atomic> does.
- It cannot synchronize the whole system (GPU+CPU), as <cuda/atomic> does.
- If you have a large struct and you wish to implement a lock-free/wait-free scheme on it, this cannot be done using volatile, whereas there are known ways to do this using atomic compare_exchange that can be ported from existing CPU implementations to the GPU more or less as-is.
- volatile does not add thread fences, whereas atomics from the <cuda/atomic> header add these automatically in the right places with the right semantics.
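As an illustration of that alternative (not code from the original post), here is a minimal sketch of the question's spin/write pair expressed with <cuda/atomic> instead of volatile. The helper names wait_for_flag and set_flag are made up for the sketch, and it assumes a toolkit recent enough to ship cuda::atomic_ref:
#include <cuda/atomic>

// Spin until another block publishes the flag.
__device__ void wait_for_flag(int *buff) {
    cuda::atomic_ref<int, cuda::thread_scope_device> flag(buff[0]);
    while (flag.load(cuda::memory_order_acquire) == 0) {
        // busy-wait; acquire ordering makes the writer's prior writes visible
    }
}

// Publish the flag.
__device__ void set_flag(int *buff) {
    cuda::atomic_ref<int, cuda::thread_scope_device> flag(buff[0]);
    flag.store(1, cuda::memory_order_release);
}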
volatile does not make code thread safe; only in Java does it do that. volatile merely prevents the compiler from reordering instructions. This is helpful if the compiler thinks there is no dependency, but you know better. – Johan, Feb 4 at 18:36