Following is my test:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void HelloFromGPU(int nth) {
    nth = 4;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    while (1) {
        // spin forever so the kernel never finishes
    }
}

int main() {
    int count;
    cudaGetDeviceCount(&count);
    printf("CPU\n");
    dim3 grid(2, 2);
    dim3 block(2, 2);
    HelloFromGPU<<<grid, block>>>(32);   // asynchronous launch; this kernel never returns
    printf("after kernel\n");
    printf("start malloc\n");
    int *dev;
    int ret = cudaMalloc((void **)&dev, sizeof(int));
    printf("after malloc %d\n", ret);
    printf("start memcpy\n");
    ret = cudaMemset(dev, 0, sizeof(int));
    printf("after memset %d\n", ret);
    ret = cudaMemcpy(dev, &count, sizeof(int), cudaMemcpyHostToDevice);
    printf("after memcpy %d\n", ret);
    cudaFree(dev);
    printf("end\n");
    return 0;
}
As I understand it, cudaMalloc() blocks implicitly: it has to wait for all preceding kernels and memory operations to finish. But in my test, the output is:
CPU
after kernel
start malloc
after malloc 0
start memcpy
after memset 0
cudaMalloc() and cudaMemset() finish successfully even though the hello kernel has not finished. This doesn't align with what I thought I knew. How should I understand "Implicit Synchronization" in the NVIDIA documentation?
1 Answer
All kernel executions are asynchronous by default. cudaMalloc calls are synchronous by default within their respective stream (the NULL stream in your case), but they are not blocking(!) to the host. During the execution of cudaMalloc, the device pauses kernel execution and then resumes it. Unlike cudaMalloc, cudaMemcpy is synchronous with respect to the GPU and blocking(!) to the host, which is why you get the expected behavior there.
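A quick way to see this without a profiler (my own sketch, not part of the original answer; the helper name reportStreamState is made up) is to poll the NULL stream with cudaStreamQuery(), which returns cudaErrorNotReady while the kernel is still running but does not block the host:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints whether the NULL (default) stream still has
// pending work. cudaStreamQuery() does not block; it returns cudaErrorNotReady
// while the kernel is still running and cudaSuccess once the stream is idle.
static void reportStreamState(const char *where) {
    cudaError_t s = cudaStreamQuery(0);
    printf("%s: %s\n", where,
           s == cudaSuccess ? "stream idle" : "kernel still running");
}

Calling reportStreamState("after cudaMalloc") right after the cudaMalloc() in the test above should still print "kernel still running", consistent with cudaMalloc() returning without waiting for the kernel.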
To monitor this you can check the output of the profiler. (I modified your kernel so that it runs for some time and then quits; the modified kernel performs simple floating-point operations in a loop.) You can see that the cudaMalloc and cudaMemset calls overlap the kernel execution, but after them there is a long blocking line for the cudaMemcpy, which copies the memory only after the kernel execution has finished.
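The answer doesn't show the modified kernel; a minimal sketch of what it describes (simple floating-point work in a loop, then exit; the kernel name, signature and iteration count are my assumptions) could look like this:

// Guess at the "runs for some time, then quits" kernel: it burns time with
// simple floating-point operations in a loop instead of spinning forever, so
// the profiler timeline has a finite kernel for cudaMalloc/cudaMemset to overlap.
__global__ void BusyKernel(float *out, long long iters) {
    float v = threadIdx.x * 0.5f;
    for (long long i = 0; i < iters; ++i) {
        v = v * 1.000001f + 0.000001f;          // simple FP ops
    }
    if (out != nullptr) out[threadIdx.x] = v;   // keep the loop from being optimized away
}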
With the cudaMalloc call overlapping the kernel, the kernel execution takes 2.00129 s, but if I add cudaDeviceSynchronize right after the kernel launch, the kernel execution time is 1.93112 s.
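For reference, the variant being compared looks roughly like this (my sketch, reusing the BusyKernel guess above; the iteration count is arbitrary). With the synchronization in place, cudaMalloc() can no longer overlap the kernel:

int main() {
    dim3 grid(2, 2), block(2, 2);
    BusyKernel<<<grid, block>>>(nullptr, 500000000LL);
    cudaDeviceSynchronize();                  // added right after the launch
    int *dev;
    cudaMalloc((void **)&dev, sizeof(int));   // now starts only after the kernel has finished
    cudaFree(dev);
    return 0;
}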
EDIT: tested on CUDA 10.2, 11.2, 11.5, 11.8, 12.1 and 12.5 with the NVIDIA GPUs Titan Xp, V100 and RTX 3090.
cudaMemcpy(): "Kernel launches and host<->device memory copies that do not specify any stream parameter, or equivalently that set the stream parameter to zero, are issued to the default stream. They are therefore executed in order." (CUDA C++ Programming Guide: Programming Interface: CUDA Runtime: Asynchronous Concurrent Execution: Streams: Default Stream) There is no mention of cudaMalloc() or cudaMemset() implicitly using the default stream. – paleonix (Mar 19 at 13:31)
… cudaMalloc(). Although I am not sure if that was the case in very early Thrust. – paleonix (Mar 19 at 21:11)