I'm trying to run two kernels concurrently on a single Nvidia GPU using oneAPI SYCL and the Nvidia plugin. Is this possible? If not, why? Here is where I'm at so far: I'm able to run two kernels concurrently when they are the same lambda function. When I split up the two kernels into two separate lambda functions, the kernels run serially. Here's some code:
#include <iostream>
#include <vector>
#include <sycl.hpp>

int main()
{
    std::vector< sycl::queue > queues;
    queues.emplace_back( sycl::gpu_selector_v );
    sycl::device d = queues[0].get_device();
    queues.emplace_back( d );

    auto name = d.get_info<sycl::info::device::name>();
    std::cout << name << "..." << std::flush;

    volatile int* ptr = sycl::malloc_device<int>( 1, queues[0] );
#if 1
    // This works!
    for( size_t i = 0; i < 2; i++ )
    {
        queues[i].single_task( [=]()
        {
            if( i == 0 )
            {
                *ptr = 0xaaaaaaaa;
                while( *ptr != 0xbbbbbbbb );
            }
            else
            {
                while( *ptr != 0xaaaaaaaa );
                *ptr = 0xbbbbbbbb;
            }
        } );
    }
#else
    // This one doesn't work
    queues[0].single_task( [=]()
    {
        *ptr = 0xaaaaaaaa;
        while( *ptr != 0xbbbbbbbb );
    } );
    queues[1].single_task( [=]()
    {
        while( *ptr != 0xaaaaaaaa );
        *ptr = 0xbbbbbbbb;
    } );
#endif
    queues[0].wait();
    queues[1].wait();
    std::cout << " passed!" << std::endl;
}
If the kernels run serially, the program deadlocks; if they run concurrently, it does not. With oneAPI 2025.0 and the associated Nvidia plugin, the first branch of the #if does not deadlock, but the second branch does.
How can I get the second case to run without deadlock?
Edit:
Jonas's answer finds that the volatile keyword is the culprit here. I don't think this is the case. Consider this example:
#include <chrono>
#include <cstdint>
#include <iostream>
#include <sycl.hpp>

#if defined(__SYCL_DEVICE_ONLY__)
#define GET_CYCLE_COUNT_STR(VAL) "mov.u32 %0, %%clock;" : "=r"(VAL) :: "memory"
#define GET_CYCLE_COUNT(VAL) asm volatile( GET_CYCLE_COUNT_STR(VAL) )
#else
#define GET_CYCLE_COUNT(VAL)
#endif

void delay()
{
    uint32_t cc0, cc1;
    GET_CYCLE_COUNT(cc0);
    do {
        GET_CYCLE_COUNT(cc1);
    } while( ( cc1 - cc0 ) < 1e9 );
}

int main()
{
    sycl::queue queue( sycl::gpu_selector_v );
    sycl::device d = queue.get_device();
    auto name = d.get_info<sycl::info::device::name>();
    std::cout << name << std::endl;

    auto tp = std::chrono::high_resolution_clock::now();
    // These run concurrently
    for( size_t i = 0; i < 2; i++ )
    {
        queue.single_task( [=](){ delay(); } );
    }
    queue.wait();
    auto elapsed = std::chrono::high_resolution_clock::now() - tp;
    std::cout << " - first test: " << std::chrono::duration<double,std::milli>( elapsed ).count() << "ms" << std::endl;

    tp = std::chrono::high_resolution_clock::now();
    // These run serially
    queue.single_task( [=](){ delay(); } );
    queue.single_task( [=](){ delay(); } );
    queue.wait();
    elapsed = std::chrono::high_resolution_clock::now() - tp;
    std::cout << " - second test: " << std::chrono::duration<double,std::milli>( elapsed ).count() << "ms" << std::endl;
}
Here, instead of testing concurrency with deadlock, we test with a busy-wait function. When I run this, the second test takes twice as long as the first, and the nsys profiler clearly shows the kernels in the second case running serially.
I understand that SYCL doesn't guarantee that kernels which can run concurrently actually do, but it seems silly that a parallel compute implementation won't run kernels as simple as these in parallel!
asked Jan 29 at 17:17 by rubikssolver4; edited Jan 31 at 17:19 by paleonix

2 Answers
I was able to reproduce the behaviour you described on an RTX 4090. Oddly, the code passes when using -fsycl-targets=x86_64 and sycl::cpu_selector_v, which leads me to believe the issue is constrained to the Nvidia backend.
Furthermore, I tried the following:
- Manual loop unrolling, using the same lambda function in each task submission. This also failed.
- Removing the volatile keyword. This worked.
The volatile keyword seems to be the culprit here. I am conjecturing that the Nvidia backend generates code that serialises the kernel launches when they are submitted as separate lambdas, yet not when they are launched inside a for-loop.
I will, however, make a couple of notes. The SYCL specification makes no guarantee that kernels which can launch concurrently will do so; it only says:

    "Command-groups with non-overlapping requirements may execute concurrently."

So relying on this behaviour is unsafe.
Another adjacent point is that implementations of sycl::queue use multiple streams to enqueue kernels concurrently when no dependency is detected, whether through the sycl::buffer / sycl::accessor model or through explicit sycl::event dependencies. This is all to say, you do not need multiple sycl::queue objects in order to launch multiple kernels concurrently.
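To make the single-queue point concrete, here is a minimal sketch of submitting two independent kernels to one queue and expressing a dependency only where one is needed. The function name `demo`, the `__has_include` guard (so the sketch still compiles where no SYCL implementation is installed), and the `<sycl/sycl.hpp>` header path are my own choices, and whether the two independent kernels actually overlap on the device remains up to the backend.

```cpp
#include <iostream>
#include <vector>
#if __has_include(<sycl/sycl.hpp>)
#include <sycl/sycl.hpp>

// Two independent kernels on ONE queue: they touch disjoint USM
// allocations and share no event dependency, so the runtime is free to
// overlap them. A third kernel that needs both results names its
// dependencies explicitly via the event-taking single_task overload.
int demo()
{
    sycl::queue q{ sycl::gpu_selector_v };
    int* a = sycl::malloc_device<int>( 1, q );
    int* b = sycl::malloc_device<int>( 1, q );
    sycl::event e1 = q.single_task( [=](){ *a = 1; } );
    sycl::event e2 = q.single_task( [=](){ *b = 2; } );
    q.single_task( std::vector<sycl::event>{ e1, e2 }, [=](){ *a += *b; } );
    q.wait();
    int host = 0;
    q.memcpy( &host, a, sizeof( int ) ).wait();
    sycl::free( a, q );
    sycl::free( b, q );
    return host; // 3 if the dependency chain ran correctly
}
#else
// Fallback so the sketch compiles where SYCL headers are absent.
int demo() { std::cout << "SYCL headers not available\n"; return 3; }
#endif
```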
After further investigation, I've found why the kernels are run serially. Since CUDA 12.2, a setting called CUDA_MODULE_LOADING defaults to lazy. The CUDA C++ Programming Guide outlines issues with lazy module loading with respect to concurrent execution of kernels:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution
Relying on concurrent kernel execution is described by the guide as an anti-pattern, but a workaround is to set the environment variable CUDA_MODULE_LOADING=EAGER.
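In practice the workaround is applied in the shell before launching the program; `./a.out` below is just a placeholder for whatever your compiled SYCL binary is called.

```shell
# Opt out of lazy module loading for this shell session; CUDA modules
# are then loaded eagerly at context creation rather than at first
# kernel launch, avoiding the serialization described in the guide.
export CUDA_MODULE_LOADING=EAGER
echo "CUDA_MODULE_LOADING=$CUDA_MODULE_LOADING"
# ./a.out
```

The variable must be set before the process that initializes CUDA starts, since the driver reads it at initialization.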
Use steady_clock for timing purposes, not high_resolution_clock. Take it from the library author. – paleonix, Jan 31 at 17:21