I have a vector called test with 7,200,000,000 elements, holding complex shorts (the stored variable is sample):
std::vector<std::complex<short> > test;
std::complex<short> sample;
I pre-allocate this as:
size_t nSamples = 7200000000;
test.reserve(nSamples);
I then use a for loop to populate each entry of this test vector:
for ( size_t iSample = 0; iSample < nSamples; iSample++ )
{
// code to generate a complex value called "sample"
// sample =
test.push_back(sample);
}
This works, but it takes 4 minutes to populate the vector and uses 1 thread at 100%. RAM usage starts at about 100 GB and drops by about 20 GB as the code runs.
I have an AMD Threadripper PRO 7975WX (32 cores / 64 threads) with 251 GB of DDR5-4800. The OS is Linux Mint 21.3.
So I tried to use OpenMP to parallelize this and make use of all 64 threads as follows:
size_t nSamples = 7200000000;
test.reserve(nSamples);
#pragma omp parallel for
for ( size_t iSample = 0; iSample < nSamples; iSample++ )
{
// code to generate a complex value called "sample"
// sample =
#pragma omp critical
test.push_back(sample);
}
What happens is that all 64 threads go to 100% utilization and all 251 GB of RAM are used up in just a few seconds. Linux then terminates the program with "terminated by signal SIGKILL (Forced quit)".
Any advice on why all the RAM gets used up? What am I doing wrong?
EDIT: Feb 16: Complete reproducible example of the problem:
#include <iostream>
#include <vector>
#include <complex>
#include <omp.h> // system header (found with -fopenmp), not a hard-coded path
int main()
{
// this vector holds the total samples that define waveform
std::vector<std::complex<short> > pulseSequence;
float duration = 180; // 180 seconds chirp sounding
float sampleRate = 40e6; // sample beyond Nyquist limit + prevent aliasing on SDR
size_t nSamples = duration * sampleRate;
std::complex<short> sample; // store wavepacket
pulseSequence.reserve(nSamples); //allocate memory for large vector (7,200,000,000 samples)
#pragma omp parallel for
for ( size_t iSample=0; iSample < nSamples; iSample++ )
{
// the following is bogus/test code for example purposes:
sample = std::complex<short>(0*iSample,0);
//save the sample into the main vector
#pragma omp critical // <-- tried with/without
{
pulseSequence.push_back(sample);
}
//output some progress:
if (iSample % 100000000 == 0)
{
std::cout << float(iSample)*100.0/float(nSamples) << "%" << std::endl;
}
}
return 0; // end program
}
Compiled with: g++ test.cpp -o test -fopenmp
3 Answers
While Dominik Kaszewski's answer (using resize() and operator[]) is a start, resize() will value-initialize the memory, which again wastes considerable time. This part of the C++ standard library is not very nice to use with parallelism.
Another potential performance issue with sequential initialization is that, given the "first touch" policy on NUMA systems, all memory will be local to the thread that initializes it, which is a problem when you then keep accessing the data in parallel.
The solution is often a custom allocator like the one in "How to make my uninitialised_allocator safe?", or better, the allocator adaptor provided by Boost in <boost/core/noinit_adaptor.hpp>.
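A minimal sketch of such an uninitialised-allocator adaptor (the name no_init_allocator is illustrative, not the Boost one; assumes C++17):

```cpp
#include <memory>
#include <type_traits>
#include <utility>
#include <vector>

// Forwards everything to the underlying allocator, but makes the
// no-argument construct() default-initialise, so vector::resize()
// skips zeroing elements of trivially default-constructible types.
template <class T, class A = std::allocator<T>>
struct no_init_allocator : A {
    using A::A;
    template <class U> struct rebind {
        using other = no_init_allocator<U,
            typename std::allocator_traits<A>::template rebind_alloc<U>>;
    };
    template <class U>
    void construct(U* p) noexcept(std::is_nothrow_default_constructible_v<U>) {
        ::new (static_cast<void*>(p)) U;  // default-init: no zeroing for trivial U
    }
    template <class U, class... Args>
    void construct(U* p, Args&&... args) {
        std::allocator_traits<A>::construct(static_cast<A&>(*this),
                                            p, std::forward<Args>(args)...);
    }
};
```

One caveat: std::complex<short> has a default constructor that value-initialises both parts, so for this specific element type the adaptor does not skip the zeroing; the trick pays off for trivially default-constructible element types (e.g. a plain struct of two shorts).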
In the improbable case that your actual initialization is as simple as the one in the MRE, you should probably use a lazy view instead: see C++20's std::views::iota, or for older standards e.g. <boost/iterator/counting_iterator.hpp>. Slightly more complicated patterns can be created by nesting this view/iterator inside a transform view/iterator.
Using a critical section/lock around the only line of work in your parallel loop effectively serializes it, i.e. it is almost always slower than the sequential code due to the parallel overheads. Critical sections, if needed at all, should only ever cover a small part of the work in a parallel loop.
You can replace reserve() and push_back() with resize() and operator[]. This will be more performant, since you only modify the size once, and it lets you remove the #pragma omp critical lock.
You can also try using <algorithm> to fill your vector; benchmark to compare its performance against OpenMP:
samples.resize(count);
std::transform(std::execution::par_unseq, samples.begin(), samples.end(), samples.begin(), generate_sample);
(Note: std::execution::par_unseq needs #include <execution>, and with GCC's libstdc++ the parallel algorithms require linking TBB, e.g. -ltbb.)
You should not use push_back, which imposes an ordering on the iterations.
- Allocate the vector with size zero.
- Use resize() so that the space is there.
- Then use [] indexing to fill the vector in parallel.
This works for single-socket systems. If you have a dual (quad, ...) socket system you need to account for "first touch".
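A sketch of these three steps applied to the MRE (the sample formula is a stand-in for the real generator; compile with -fopenmp, though the pragma is harmlessly ignored without it):

```cpp
#include <complex>
#include <cstddef>
#include <vector>

std::vector<std::complex<short>> make_pulse_sequence(std::size_t nSamples) {
    std::vector<std::complex<short>> seq;  // 1. starts at size zero
    seq.resize(nSamples);                  // 2. space (and initialisation) up front
    #pragma omp parallel for               // threads write disjoint index ranges
    for (std::size_t i = 0; i < nSamples; ++i)
        seq[i] = std::complex<short>(static_cast<short>(i % 1000), 0);  // 3. fill via []
    return seq;
}
```

Because each iteration writes only its own element, no critical section is needed, and size() is correct from the single resize().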
test.push_back is not thread safe. Multiple threads updating the structure will collide. – Tim Roberts, Feb 16 at 4:22