I'm trying to run two kernels concurrently on a single Nvidia GPU using oneAPI SYCL and the Nvidia plugin. Is this possible? If not, why? Here is where I'm at so far: I'm able to run two kernels concurrently when they are the same lambda function. When I split up the two kernels into two separate lambda functions, the kernels run serially. Here's some code:
#include <iostream>
#include <vector>
#include <sycl.hpp>

int main()
{
    std::vector< sycl::queue > queues;
    queues.emplace_back( sycl::gpu_selector_v );
    sycl::device d = queues[0].get_device();
    queues.emplace_back( d );

    auto name = d.get_info<sycl::info::device::name>();
    std::cout << name << "..." << std::flush;

    volatile int* ptr = sycl::malloc_device<int>( 1, queues[0] );
#if 1
    // This works!
    for( size_t i = 0; i < 2; i++ )
    {
        queues[i].single_task( [=]()
        {
            if( i == 0 )
            {
                *ptr = 0xaaaaaaaa;
                while( *ptr != 0xbbbbbbbb );
            }
            else
            {
                while( *ptr != 0xaaaaaaaa );
                *ptr = 0xbbbbbbbb;
            }
        } );
    }
#else
    // This one doesn't work
    queues[0].single_task( [=]()
    {
        *ptr = 0xaaaaaaaa;
        while( *ptr != 0xbbbbbbbb );
    } );
    queues[1].single_task( [=]()
    {
        while( *ptr != 0xaaaaaaaa );
        *ptr = 0xbbbbbbbb;
    } );
#endif
    queues[0].wait();
    queues[1].wait();
    std::cout << " passed!" << std::endl;
}
If the kernels run serially, the program deadlocks; if they run concurrently, it does not. With oneAPI 2025.0 and the associated Nvidia plugin, the first branch of the #if does not deadlock, but the second branch does.
How can I get the second case to run without deadlock?
Edit:
Jonas's answer finds that the volatile keyword is the culprit here. I don't think this is the case. Consider this example:
#include <chrono>
#include <cstdint>
#include <iostream>
#include <sycl.hpp>

#if defined(__SYCL_DEVICE_ONLY__)
#define GET_CYCLE_COUNT_STR(VAL) "mov.u32 %0, %%clock;" : "=r"(VAL) :: "memory"
#define GET_CYCLE_COUNT(VAL) asm volatile( GET_CYCLE_COUNT_STR(VAL) )
#else
#define GET_CYCLE_COUNT(VAL)
#endif

void delay()
{
    uint32_t cc0, cc1;
    GET_CYCLE_COUNT(cc0);
    do {
        GET_CYCLE_COUNT(cc1);
    } while( ( cc1 - cc0 ) < 1e9 );
}

int main()
{
    sycl::queue queue( sycl::gpu_selector_v );
    sycl::device d = queue.get_device();
    auto name = d.get_info<sycl::info::device::name>();
    std::cout << name << std::endl;

    auto tp = std::chrono::high_resolution_clock::now();
    // These run concurrently
    for( size_t i = 0; i < 2; i++ )
    {
        queue.single_task( [=](){ delay(); } );
    }
    queue.wait();
    auto elapsed = std::chrono::high_resolution_clock::now() - tp;
    std::cout << " - first test: " << std::chrono::duration<double,std::milli>( elapsed ).count() << "ms" << std::endl;

    tp = std::chrono::high_resolution_clock::now();
    // These run serially
    queue.single_task( [=](){ delay(); } );
    queue.single_task( [=](){ delay(); } );
    queue.wait();
    elapsed = std::chrono::high_resolution_clock::now() - tp;
    std::cout << " - second test: " << std::chrono::duration<double,std::milli>( elapsed ).count() << "ms" << std::endl;
}
Here, instead of testing concurrency with deadlock, we test with a busy-wait function. When I run this, the second test takes twice as long as the first, and the nsys profiler clearly shows the kernels in the second case running serially.
I understand that SYCL doesn't guarantee that kernels which can run concurrently actually do, but it seems silly that a parallel compute implementation won't run kernels as simple as these in parallel!
asked Jan 29 at 17:17 by rubikssolver4; edited Jan 31 at 17:19 by paleonix

2 Answers
I was able to reproduce the behaviour you described on an RTX 4090. Oddly, the code passes when using -fsycl-targets=x86_64 and sycl::cpu_selector_v, which leads me to believe the issue is constrained to the Nvidia backend.
Furthermore, I tried the following:
- Manual loop unrolling, using the same lambda function in each task submission. This also failed.
- Removing the volatile keyword. This worked.
The volatile keyword seems to be the culprit here. I am conjecturing that the Nvidia backend generates code that serialises the kernel launches when they are submitted as separate lambdas, yet not when they are launched inside a for-loop.
I will, however, make a couple of notes. The SYCL specification makes no guarantee that kernels which can launch concurrently will do so; it only says:

    "Command-groups with non-overlapping requirements may execute concurrently."

So relying on this behaviour is unsafe.
Another adjacent point is that implementations of sycl::queue use multiple streams to enqueue kernels concurrently when no dependency is detected, whether through the sycl::buffer / sycl::accessor model or through explicit sycl::event dependencies. This is all to say, you do not need multiple sycl::queue objects in order to launch multiple kernels concurrently.
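To make the single-queue point concrete, here is a minimal sketch of submitting two independent kernels to one queue and expressing a dependency only where one is needed. The function name `demo`, the `__has_include` guard (so the sketch still compiles where no SYCL implementation is installed), and the `<sycl/sycl.hpp>` header path are my own choices, and whether the two independent kernels actually overlap on the device remains up to the backend.

```cpp
#include <iostream>
#include <vector>
#if __has_include(<sycl/sycl.hpp>)
#include <sycl/sycl.hpp>

// Two independent kernels on ONE queue: they touch disjoint USM
// allocations and share no event dependency, so the runtime is free to
// overlap them. A third kernel that needs both results names its
// dependencies explicitly via the event-taking single_task overload.
int demo()
{
    sycl::queue q{ sycl::gpu_selector_v };
    int* a = sycl::malloc_device<int>( 1, q );
    int* b = sycl::malloc_device<int>( 1, q );
    sycl::event e1 = q.single_task( [=](){ *a = 1; } );
    sycl::event e2 = q.single_task( [=](){ *b = 2; } );
    q.single_task( std::vector<sycl::event>{ e1, e2 }, [=](){ *a += *b; } );
    q.wait();
    int host = 0;
    q.memcpy( &host, a, sizeof( int ) ).wait();
    sycl::free( a, q );
    sycl::free( b, q );
    return host; // 3 if the dependency chain ran correctly
}
#else
// Fallback so the sketch compiles where SYCL headers are absent.
int demo() { std::cout << "SYCL headers not available\n"; return 3; }
#endif
```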
After further investigation, I've found why the kernels are run serially. Since CUDA 12.2, a setting called CUDA_MODULE_LOADING defaults to lazy. The CUDA C++ Programming Guide outlines issues with lazy module loading with respect to concurrent execution of kernels:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution
Relying on concurrent kernel execution is described by the guide as an anti-pattern, but a workaround is to set the environment variable CUDA_MODULE_LOADING=EAGER.
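In practice the workaround is applied in the shell before launching the program; `./a.out` below is just a placeholder for whatever your compiled SYCL binary is called.

```shell
# Opt out of lazy module loading for this shell session; CUDA modules
# are then loaded eagerly at context creation rather than at first
# kernel launch, avoiding the serialization described in the guide.
export CUDA_MODULE_LOADING=EAGER
echo "CUDA_MODULE_LOADING=$CUDA_MODULE_LOADING"
# ./a.out
```

The variable must be set before the process that initializes CUDA starts, since the driver reads it at initialization.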
Use steady_clock for timing purposes, not high_resolution_clock. Take it from the library author. – paleonix, Jan 31 at 17:21