
Concurrent kernels using oneAPI SYCL and Nvidia GPU plugin - Stack Overflow


I'm trying to run two kernels concurrently on a single Nvidia GPU using oneAPI SYCL and the Nvidia plugin. Is this possible? If not, why? Here is where I'm at so far: I'm able to run two kernels concurrently when they are the same lambda function. When I split up the two kernels into two separate lambda functions, the kernels run serially. Here's some code:

#include <iostream>
#include <vector>

#include <sycl.hpp>

int main()
{
    std::vector< sycl::queue > queues;
    queues.emplace_back( sycl::gpu_selector_v );
    sycl::device d = queues[0].get_device();
    queues.emplace_back( d );

    auto name = d.get_info<sycl::info::device::name>();
    std::cout << name << "..." << std::flush;

    volatile int* ptr = sycl::malloc_device<int>( 1, queues[0] );

#if 1
    // This works!
    for( size_t i=0; i<2; i++ )
    {
        queues[i].single_task( [=]()
        {
            if( i == 0 )
            {
                *ptr = 0xaaaaaaaa;
                while( *ptr != 0xbbbbbbbb );
            }
            else
            {
                while( *ptr != 0xaaaaaaaa );
                *ptr = 0xbbbbbbbb;
            }
        } );
    }
#else
    // This one doesn't work
    queues[0].single_task( [=]()
    {
        *ptr = 0xaaaaaaaa;
        while( *ptr != 0xbbbbbbbb );
    } );
    queues[1].single_task( [=]()
    {
        while( *ptr != 0xaaaaaaaa );
        *ptr = 0xbbbbbbbb;
    } );
#endif
    queues[0].wait();
    queues[1].wait();
    std::cout << " passed!" << std::endl;
}

If the kernels run serially, the program deadlocks: the first kernel spins forever waiting for a value that only the second kernel can write, and the second kernel never gets to run. If the kernels run concurrently, both complete.

With oneAPI 2025.0 and the associated Nvidia plugin, the first branch of the #if does not deadlock, but the second branch does.

How can I get the second case to run without deadlock?

Edit:

Jonas's answer finds that the volatile keyword is the culprit here. I don't think this is the case.

Consider this example:

#include <chrono>
#include <cstdint>
#include <iostream>

#include <sycl.hpp>

#if defined(__SYCL_DEVICE_ONLY__)
#define GET_CYCLE_COUNT_STR(VAL) "mov.u32 %0, %%clock;" : "=r"(VAL) :: "memory"
#define GET_CYCLE_COUNT(VAL) asm volatile( GET_CYCLE_COUNT_STR(VAL) )
#else
#define GET_CYCLE_COUNT(VAL)
#endif

void delay()
{
    uint32_t cc0, cc1;
    GET_CYCLE_COUNT(cc0);
    do{
        GET_CYCLE_COUNT(cc1);
    } while( ( cc1 - cc0 ) < 1e9 );
}

int main()
{
    sycl::queue queue( sycl::gpu_selector_v );
    sycl::device d = queue.get_device();

    auto name = d.get_info<sycl::info::device::name>();
    std::cout << name << std::endl;

    auto tp = std::chrono::high_resolution_clock::now();
    // These run concurrently
    for( size_t i=0; i<2; i++ )
    {
        queue.single_task( [=](){ delay(); } );
    }
    queue.wait();
    auto elapsed = std::chrono::high_resolution_clock::now() - tp;
    std::cout << " - first test: " << std::chrono::duration<double,std::milli>( elapsed ).count() << "ms" << std::endl;

    tp = std::chrono::high_resolution_clock::now();
    // These run serially
    queue.single_task( [=](){ delay(); } );
    queue.single_task( [=](){ delay(); } );
    queue.wait();
    elapsed = std::chrono::high_resolution_clock::now() - tp;
    std::cout << " - second test: " << std::chrono::duration<double,std::milli>( elapsed ).count() << "ms" << std::endl;
}

Here, instead of testing concurrency via deadlock, we test with a busy-wait function. When I run this, the second test takes twice as long as the first, and the nsys profiler clearly shows that the kernels in the second case run serially.

I understand that SYCL doesn't guarantee that kernels which can run concurrently actually will, but it seems silly that a parallel compute implementation doesn't run kernels as simple as these in parallel!
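One way to check for overlap directly, without relying on deadlock or wall-clock timing, is SYCL's event profiling. Below is a minimal sketch of my own (not code from the question above): it enables profiling on the queue and compares the device timestamps of the two submissions; the intervals overlap only if the kernels actually ran concurrently.

#include <iostream>

#include <sycl.hpp>

int main()
{
    // enable_profiling lets us read device timestamps off the returned events
    sycl::queue queue( sycl::gpu_selector_v,
                       sycl::property::queue::enable_profiling{} );

    sycl::event e0 = queue.single_task( [](){ /* first kernel body */ } );
    sycl::event e1 = queue.single_task( [](){ /* second kernel body */ } );
    queue.wait();

    auto start0 = e0.get_profiling_info<sycl::info::event_profiling::command_start>();
    auto end0   = e0.get_profiling_info<sycl::info::event_profiling::command_end>();
    auto start1 = e1.get_profiling_info<sycl::info::event_profiling::command_start>();
    auto end1   = e1.get_profiling_info<sycl::info::event_profiling::command_end>();

    // Two intervals [start, end) overlap iff each starts before the other ends.
    bool overlapped = start0 < end1 && start1 < end0;
    std::cout << ( overlapped ? "concurrent" : "serial" ) << std::endl;
}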


1 Comment:

  • FYI: Use steady_clock for timing purposes, not high_resolution_clock. Take it from the library author. – paleonix, Jan 31

2 Answers


I was able to reproduce the behaviour you described on an RTX 4090. Oddly, the code passes when compiled with -fsycl-targets=x86_64 and run with sycl::cpu_selector_v, which leads me to believe the issue is constrained to the Nvidia backend.

Furthermore I tried the following:

  1. Manual loop unrolling using the same lambda function in each task submission. This also failed.
  2. Removing the volatile keyword. This worked.

The volatile keyword seems to be the culprit here. I am conjecturing that the Nvidia backend generates code that serialises the kernel launches in one case, yet not when they are launched inside a for-loop.
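If volatile is indeed the trigger, a more portable way to express the handshake is SYCL 2020's sycl::atomic_ref, which gives well-defined device-scope semantics without volatile. Here is a minimal sketch of the kernel bodies rewritten that way (my own illustration, reusing the ptr and queues from the question):

// ptr is a plain int* from sycl::malloc_device; no volatile qualifier needed.
using flag_ref = sycl::atomic_ref< int,
                                   sycl::memory_order::relaxed,
                                   sycl::memory_scope::device,
                                   sycl::access::address_space::global_space >;

queues[0].single_task( [=]()
{
    flag_ref flag( *ptr );
    flag.store( 0xaaaaaaaa );
    while( flag.load() != 0xbbbbbbbb );
} );
queues[1].single_task( [=]()
{
    flag_ref flag( *ptr );
    while( flag.load() != 0xaaaaaaaa );
    flag.store( 0xbbbbbbbb );
} );

Note that this only makes the data exchange well-defined; it does not by itself force the backend to schedule the two kernels concurrently.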

I will, however, make a couple of notes. The SYCL specification makes no guarantee that kernels that can launch concurrently will do so; it says only:

Command-groups with non-overlapping requirements may execute concurrently.

So relying on this behaviour is unsafe.

Another adjacent point: implementations of sycl::queue use multiple streams to enqueue kernels concurrently when no dependency is detected, whether through the sycl::buffer / sycl::accessor model or through explicit sycl::event dependencies. This is all to say, you do not need multiple sycl::queue objects to launch multiple kernels concurrently, as sketched below.
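For example (my own sketch, with a hypothetical buf allocation), two independent kernels submitted to one queue are free to overlap, while an explicit sycl::event dependency expresses ordering where it matters:

sycl::queue q( sycl::gpu_selector_v );
float* buf = sycl::malloc_device<float>( 2, q );

// Independent kernels on a single (out-of-order) queue: the runtime may overlap them.
sycl::event e0 = q.single_task( [=](){ buf[0] = 1.0f; } );
sycl::event e1 = q.single_task( [=](){ buf[1] = 2.0f; } );

// Where ordering matters, state it explicitly as an event dependency.
q.single_task( std::vector<sycl::event>{ e0, e1 }, [=](){ buf[0] += buf[1]; } );
q.wait();
sycl::free( buf, q );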

After further investigation, I've found why the kernels are being run serially. By default, after CUDA 12.2, a setting called CUDA_MODULE_LOADING is set to lazy. The CUDA C++ Programming Guide outlines issues with lazy module loading with respect to concurrent execution of kernels:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution

Relying on concurrent execution of kernels is described by the guide as an anti-pattern, but a workaround is to set the environment variable CUDA_MODULE_LOADING=EAGER.
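The variable has to be in the environment before the CUDA driver initialises, so set it in the launching shell or, as a sketch of my own, from the program itself before the first queue is created:

#include <cstdlib>

#include <sycl.hpp>

int main()
{
    // Must run before anything initialises the CUDA driver.
    // setenv is POSIX; on Windows use _putenv_s instead.
    setenv( "CUDA_MODULE_LOADING", "EAGER", /*overwrite=*/1 );

    sycl::queue q( sycl::gpu_selector_v );
    // ... submit kernels as before ...
}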
