
cuda - cudaMalloc need not block, why? - Stack Overflow


Following is my test

#include <stdio.h>

// Kernel that never returns: it spins forever so the device stays busy.
__global__ void HelloFromGPU(int nth) {
    nth = 4;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    while (1) {

    }
}

int main() {
    int count;
    cudaGetDeviceCount(&count);
    printf("CPU\n");
    dim3 grid(2,2);
    dim3 block(2,2);
    HelloFromGPU<<<grid,block>>>(32);   // asynchronous launch into the default stream
    printf("after kernel\n");
    printf("start malloc\n");
    int *dev;
    int ret = cudaMalloc((void**)&dev, sizeof(int));
    printf("after malloc %d\n", ret);
    printf("start memcpy\n");
    ret = cudaMemset(dev, 0, sizeof(int));
    printf("after memset %d\n", ret);
    ret = cudaMemcpy(dev, &count, sizeof(int), cudaMemcpyHostToDevice);
    printf("after memcpy %d\n", ret);
    cudaFree(dev);
    printf("end\n");
    return 0;
}

As far as I know, cudaMalloc() blocks implicitly: it has to wait for all preceding kernels and memory operations to finish. But in my test, the output is:

CPU
after kernel
start malloc
after malloc 0
start memcpy
after memset 0

cudaMalloc() and cudaMemset() finish successfully even though the hello kernel has not finished. This doesn't match my understanding. How should I understand the "Implicit Synchronization" section in the NVIDIA documentation?

asked Mar 19 at 12:00 by xiaobin (edited Mar 21 at 3:57)
  • "Any CUDA API call may block or synchronize for various reasons such as contention for or unavailability of internal resources. Such behavior is subject to change and undocumented behavior should not be relied upon." CUDA Runtime API: API Synchronization Behavior – paleonix Commented Mar 19 at 13:15
  • I know of this kind of behavior just for cudaMemcpy(): "Kernel launches and host <-> device memory copies that do not specify any stream parameter, or equivalently that set the stream parameter to zero, are issued to the default stream. They are therefore executed in order." CUDA C++ Programming Guide: Programming Interface: CUDA Runtime: Asynchronous Concurrent Execution: Streams: Default Stream There is no mention of cudaMalloc() or cudaMemset() implicitly using the default stream. – paleonix Commented Mar 19 at 13:31
  • "As I known, cudaMalloc() has implicit block, need to wait all kernels or memory operations finishing." That is an observation one can make with respect to device side activity. You have now proven it is not applicable to host code behavior. When I am teaching CUDA, I usually try to use the word "blocking" to reflect impact to host code/thread behavior progress, and "synchronizing" to refer to the effect if any, on device side activity/progress. – Robert Crovella Commented Mar 19 at 19:14
  • @RobertCrovella I'm not sure the linked Q&A is a good example. I would expect Thrust to explicitly synchronize between kernel (from previous algorithm) and cudaMalloc(). Although I am not sure if that was the case in very early Thrust. – paleonix Commented Mar 19 at 21:11
  • @paleonix If the kernel uses a specific stream and cudaMemcpy uses the default stream, will the copy wait for the kernel to finish? – xiaobin Commented Mar 21 at 2:43

1 Answer


All kernel executions are asynchronous by default. All cudaMalloc calls are synchronous by default within their respective stream (the NULL stream in your case), but they are not blocking(!) for the host. During the execution of cudaMalloc, the device will pause execution and then resume running the kernel. Unlike cudaMalloc, cudaMemcpy is synchronous on the GPU and blocking(!) for the host, which is why you see the behavior you observed.
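
A minimal sketch (not from the original answer), assuming the standard CUDA Runtime API: cudaStreamQuery() lets the host check, without blocking, whether the default stream has drained, so it can show that cudaMalloc already returned while the kernel is still running. SpinForever is an illustrative stand-in for the question's kernel.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void SpinForever() {
    while (true) { }            // keeps the device busy indefinitely
}

int main() {
    SpinForever<<<1, 1>>>();    // asynchronous launch, host continues immediately

    int *dev = NULL;
    cudaError_t err = cudaMalloc((void**)&dev, sizeof(int));
    printf("cudaMalloc returned %d\n", (int)err);   // expect 0 (cudaSuccess)

    // The kernel is still running, yet the host was not blocked by cudaMalloc.
    // cudaStreamQuery(0) should therefore return cudaErrorNotReady.
    printf("stream query: %d (cudaErrorNotReady = %d)\n",
           (int)cudaStreamQuery(0), (int)cudaErrorNotReady);

    // A cudaMemcpy on the default stream would block the host until the kernel
    // finishes -- with SpinForever that means forever, so it is omitted here.
    return 0;                   // the driver tears down the context on exit
}
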
To see this behavior, check the profiler output (I modified your kernel so that it runs for some time and then quits; the modified kernel performs simple floating-point operations in a loop). cudaMalloc and cudaMemset overlap the kernel execution, but after that there is a long blocking span for cudaMemcpy, which only copies the memory once the kernel execution has finished.
With cudaMalloc overlapping the kernel, the kernel execution takes 2.00129 s; if I add cudaDeviceSynchronize right after the kernel launch, the kernel execution takes 1.93112 s.
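
The modified kernel and the timing code are not shown in the answer; a minimal sketch of that kind of setup could look like the following (BusyKernel, the iteration count, and the event-based timing are illustrative assumptions, not the answerer's code):

#include <stdio.h>
#include <cuda_runtime.h>

// Bounded floating-point loop instead of while(1), so the grid eventually finishes.
__global__ void BusyKernel(float *out, int iters) {
    float v = threadIdx.x * 0.5f;
    for (int i = 0; i < iters; ++i) {
        v = v * 1.000001f + 0.000001f;     // simple FP work to burn time
    }
    out[threadIdx.x] = v;                  // keep the result observable
}

int main() {
    float *out = NULL;
    cudaMalloc((void**)&out, 256 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    BusyKernel<<<1, 256>>>(out, 1 << 28);
    // Variant A (as in the question): let cudaMalloc/cudaMemset overlap the kernel.
    // Variant B: uncomment the next line to make the host wait before calling them.
    // cudaDeviceSynchronize();
    int *dev = NULL;
    cudaMalloc((void**)&dev, sizeof(int));
    cudaMemset(dev, 0, sizeof(int));
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("default-stream span: %.3f ms\n", ms);

    cudaFree(dev);
    cudaFree(out);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

Comparing the printed span for the two variants gives a rough, host-only way to reproduce the difference the profiler showed between overlapping and non-overlapping cudaMalloc.
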

EDIT: tested this on CUDA 10.2, 11.2, 11.5, 11.8, 12.1 and 12.5 with NVIDIA GPUs: Titan Xp, V100 and RTX 3090.
