I am writing C code in which I want the optional ability to accelerate certain sections with OpenMP, with the further optional ability to offload them to devices such as GPUs. For example, my matrix multiplication function leverages GPU acceleration:
/* ... */
int numeric_matmul(const float_t *pt_a, const float_t *pt_b, float_t *pt_c, uintmax_t t_m, uintmax_t t_k, uintmax_t t_n)
{
#ifdef _OPENMP
#pragma omp target teams distribute parallel for collapse(2) schedule(dynamic) map(to: pt_a[0 : t_m * t_k], pt_b[0 : t_k * t_n]) map(from: pt_c[0 : t_m * t_n])
#endif
    for(uintmax_t l_i = 0; l_i < t_m; l_i++)
    {
        for(uintmax_t l_j = 0; l_j < t_n; l_j++)
        {
            /* Compute the sum. */
            float_t l_sum = 0.0;
            for(uintmax_t l_p = 0; l_p < t_k; l_p++) l_sum += pt_a[l_i * t_k + l_p] * pt_b[l_p * t_n + l_j];
            /* Store the result. */
            pt_c[l_i * t_n + l_j] = l_sum;
        }
    }
    /* Return with success. */
    return 0;
}
And it works fine. However, when I try to use common mathematics functions (from math.h), I face an obstacle: how can I use those functions on both the CPU and the GPU?
I have this function here:
/**
 * @brief Perform the sigmoid function on a value.
 * @param t_x The input value.
 * @param pt_y The output value.
 * @return The result status code. In this case, it'll always return 0.
 */
static inline int numeric_sigmoid(float_t t_x, float_t *pt_y)
{
    /* Set the output value to the sigmoid of the input value. */
    *pt_y = 1.0 / (1.0 + expf(-t_x));
    /* Return with success. */
    return 0;
}
This relies on the expf function. I want it to be capable of running on both the CPU and the GPU. It runs fine on the CPU side of my codebase, but as soon as I try to leverage the GPU:
#pragma omp target teams distribute parallel for schedule(dynamic) map(to: pt_feedforward->ppt_hidden_layer_bias_buffer[l_i][0 : l_next_layer_activation_buffer_size]) map(from: pl_next_layer_activation_buffer[0 : l_next_layer_activation_buffer_size])
for(uintmax_t l_j = 0; l_j < l_next_layer_activation_buffer_size; l_j++)
{
    pl_next_layer_activation_buffer[l_j] += pt_feedforward->ppt_hidden_layer_bias_buffer[l_i][l_j];
    numeric_sigmoid(pl_next_layer_activation_buffer[l_j], &pl_next_layer_activation_buffer[l_j]);
}
I face a runtime error:
libgomp: pointer target not mapped for attach
I get GCC to compile with GPU offloading for my NVIDIA card by passing some additional flags through CMake:
cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_C_FLAGS="-fopenmp -foffload=nvptx-none -foffload-options=-misa=sm_80 -fcf-protection=none -fno-stack-protector -no-pie" ..
I also tried using -ffast-math, but that didn't seem to do anything either. Does anyone know how to properly use math.h and its functions in GPU-accelerated contexts when dealing with OpenMP's GPU offloading feature?
2 Answers
From the OpenMP spec's perspective, calling functions from the math library is no different from calling any other function.
To call a function from within a target region, the function must be declared as a target function (#pragma omp declare target) and also compiled for the targeted device.
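Applied to your sigmoid, a minimal sketch (the single-precision literals are a separate improvement, keeping the computation in float; see the comment below):

#include <math.h>

/* Mark the function for device compilation so it can be called
   from inside a target region. */
#pragma omp declare target
static inline int numeric_sigmoid(float_t t_x, float_t *pt_y)
{
    /* Single-precision literals avoid promoting the computation to double. */
    *pt_y = 1.0f / (1.0f + expf(-t_x));
    return 0;
}
#pragma omp end declare target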
Some math functions might have a specialized instruction or implementation available for the targeted architecture. In such cases, the compiler replaces the function call with a call to the specialized implementation.
For math functions, most OpenMP implementations ship libraries compiled for the targeted architecture. As with any other library you use from application code, you also need to link the math library (-lm). I think many OpenMP programmers are not used to explicitly linking the math library when needed, because for a long time GCC's -fopenmp implicitly linked it.
With gcc-13, you need to link with -lm -foffload=-lm. With gcc-14, it is sufficient to link with -lm; the offloading math library is then implied.
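For example (a sketch, assuming a single translation unit main.c built for nvptx offloading):

# gcc-13: link the host math library and the offload math library explicitly
gcc -fopenmp -foffload=nvptx-none main.c -o main -lm -foffload=-lm

# gcc-14: -lm is enough; the offload math library is implied
gcc -fopenmp -foffload=nvptx-none main.c -o main -lm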
This looks like neural network code. There's a rather fundamental problem that you are overlooking here: GPUs physically have their own memory. Modern GPUs can hide some of the complexity of executing operations on both the CPU and the GPU, but you simply cannot hide the time it physically takes to move data between different memories.
Hence, what you absolutely want to do is execute one big chunk of operations on the GPU: copy the raw input in, and copy the output out. Leave the CPU for things like file reading that are not mathematically hard.
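In OpenMP terms, that means wrapping several kernels in a single target data region so the buffers stay resident on the device between them. A minimal sketch, with hypothetical buffers pl_in/pl_out and size l_n:

/* One transfer in, one transfer out, for all kernels in the region. */
#pragma omp target data map(to: pl_in[0 : l_n]) map(tofrom: pl_out[0 : l_n])
{
    /* First kernel: runs on the device, no extra transfers. */
    #pragma omp target teams distribute parallel for
    for(uintmax_t l_j = 0; l_j < l_n; l_j++)
        pl_out[l_j] = pl_in[l_j] * 2.0f;

    /* Second kernel: reuses the device copies from the same region. */
    #pragma omp target teams distribute parallel for
    for(uintmax_t l_j = 0; l_j < l_n; l_j++)
        pl_out[l_j] += pl_in[l_j];
}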
Specifically, use CUDA/cuDNN for all operations that they directly implement; your OpenMP code is never going to be nearly as fast.
The expression 1.0 / (1.0 + expf(-t_x)) certainly uses double precision for the computation, so you should use 1.0f / (1.0f + expf(-t_x)) instead. Double precision is generally (much) slower on GPUs (especially client-side ones). – Jérôme Richard Commented Mar 30 at 19:36
-foffload=-lm is needed for older gcc versions: gcc.gnu.org/wiki/Offloading – Joachim Commented Mar 30 at 19:54