Following is my test:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void HelloFromGPU(int nth) {
    nth = 4;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    while (1) {
        // spin forever so the kernel never finishes
    }
}

int main() {
    int count;
    cudaGetDeviceCount(&count);
    printf("CPU\n");
    dim3 grid(2, 2);
    dim3 block(2, 2);
    HelloFromGPU<<<grid, block>>>(32);   // asynchronous launch; this kernel never returns
    printf("after kernel\n");
    printf("start malloc\n");
    int *dev;
    int ret = cudaMalloc((void **)&dev, sizeof(int));
    printf("after malloc %d\n", ret);
    printf("start memcpy\n");
    ret = cudaMemset(dev, 0, sizeof(int));
    printf("after memset %d\n", ret);
    ret = cudaMemcpy(dev, &count, sizeof(int), cudaMemcpyHostToDevice);
    printf("after memcpy %d\n", ret);
    cudaFree(dev);
    printf("end\n");
    return 0;
}
As I understand it, cudaMalloc() blocks implicitly: it has to wait for all preceding kernels and memory operations to finish. But in my test, the output is:
CPU
after kernel
start malloc
after malloc 0
start memcpy
after memset 0
cudaMalloc() and cudaMemset() finish successfully even though the hello kernel has not finished. This doesn't align with what I thought I knew. How should I understand "Implicit Synchronization" in the NVIDIA documentation?
1 Answer
All kernel executions are asynchronous by default. cudaMalloc calls are synchronous by default within their respective stream (the NULL stream in your case), but they are not blocking(!) to the host. During the execution of cudaMalloc, the device pauses kernel execution and then resumes it. Unlike cudaMalloc, cudaMemcpy is synchronous with respect to the GPU and blocking(!) to the host, which is why you get the expected behavior there.
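A quick way to see this without a profiler (my own sketch, not part of the original answer; the helper name reportStreamState is made up) is to poll the NULL stream with cudaStreamQuery(), which returns cudaErrorNotReady while the kernel is still running but does not block the host:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints whether the NULL (default) stream still has
// pending work. cudaStreamQuery() does not block; it returns cudaErrorNotReady
// while the kernel is still running and cudaSuccess once the stream is idle.
static void reportStreamState(const char *where) {
    cudaError_t s = cudaStreamQuery(0);
    printf("%s: %s\n", where,
           s == cudaSuccess ? "stream idle" : "kernel still running");
}

Calling reportStreamState("after cudaMalloc") right after the cudaMalloc() in the test above should still print "kernel still running", consistent with cudaMalloc() returning without waiting for the kernel.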
To monitor this you can check the output of the profiler. (I modified your kernel so that it runs for some time and then quits; the modified kernel performs simple floating-point operations in a loop.) You can see that the cudaMalloc and cudaMemset calls overlap the kernel execution, but after them there is a long blocking line for the cudaMemcpy, which copies the memory only after the kernel execution has finished.
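The answer doesn't show the modified kernel; a minimal sketch of what it describes (simple floating-point work in a loop, then exit; the kernel name, signature and iteration count are my assumptions) could look like this:

// Guess at the "runs for some time, then quits" kernel: it burns time with
// simple floating-point operations in a loop instead of spinning forever, so
// the profiler timeline has a finite kernel for cudaMalloc/cudaMemset to overlap.
__global__ void BusyKernel(float *out, long long iters) {
    float v = threadIdx.x * 0.5f;
    for (long long i = 0; i < iters; ++i) {
        v = v * 1.000001f + 0.000001f;          // simple FP ops
    }
    if (out != nullptr) out[threadIdx.x] = v;   // keep the loop from being optimized away
}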
With the cudaMalloc call overlapping the kernel, the kernel execution takes 2.00129 s, but if I add cudaDeviceSynchronize right after the kernel launch, the kernel execution time is 1.93112 s.
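For reference, the variant being compared looks roughly like this (my sketch, reusing the BusyKernel guess above; the iteration count is arbitrary). With the synchronization in place, cudaMalloc() can no longer overlap the kernel:

int main() {
    dim3 grid(2, 2), block(2, 2);
    BusyKernel<<<grid, block>>>(nullptr, 500000000LL);
    cudaDeviceSynchronize();                  // added right after the launch
    int *dev;
    cudaMalloc((void **)&dev, sizeof(int));   // now starts only after the kernel has finished
    cudaFree(dev);
    return 0;
}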
EDIT: tested on CUDA 10.2, 11.2, 11.5, 11.8, 12.1 and 12.5 with the NVIDIA GPUs Titan Xp, V100 and RTX 3090.
cudaMemcpy(): "Kernel launches and host<->device memory copies that do not specify any stream parameter, or equivalently that set the stream parameter to zero, are issued to the default stream. They are therefore executed in order." (CUDA C++ Programming Guide: Programming Interface: CUDA Runtime: Asynchronous Concurrent Execution: Streams: Default Stream) There is no mention of cudaMalloc() or cudaMemset() implicitly using the default stream. – paleonix (Mar 19 at 13:31)
… cudaMalloc(). Although I am not sure if that was the case in very early Thrust. – paleonix (Mar 19 at 21:11)