I have a vector called test with 7,200,000,000 elements, holding complex shorts (the stored variable is sample):
std::vector<std::complex<short> > test;
std::complex<short> sample;
I pre-allocate this as:
size_t nSamples = 7200000000;
test.reserve(nSamples);
I then use a for loop to populate each entry of this test vector:
for ( size_t iSample = 0; iSample < nSamples; iSample++ )
{
// code to generate a complex value called "sample"
// sample =
test.push_back(sample);
}
This works, but it takes 4 minutes to populate the vector and uses 1 thread at 100%. RAM usage starts at about 100 GB and drops by about 20 GB as the code runs.
I have an AMD Threadripper PRO 7975WX (32 cores / 64 threads) with 251 GB of DDR5-4800. The OS is Linux Mint 21.3.
So I tried to use OpenMP to parallelize this and make use of all 64 threads as follows:
size_t nSamples = 7200000000;
test.reserve(nSamples);
#pragma omp parallel for
for ( size_t iSample = 0; iSample < nSamples; iSample++ )
{
// code to generate a complex value called "sample"
// sample =
#pragma omp critical
test.push_back(sample);
}
What happens is that all 64 threads go to 100% utilization and all 251 GB of RAM are used up in just a few seconds. Linux then terminates the program with "terminated by signal SIGKILL (Forced quit)".
Any advice on why all the RAM gets used up? What am I doing wrong?
EDIT: Feb 16: Complete reproducible example of the problem:
#include <iostream>
#include <vector>
#include <complex>
#include <omp.h> // system header (found with -fopenmp), not a hard-coded path
int main()
{
// this vector holds the total samples that define waveform
std::vector<std::complex<short> > pulseSequence;
float duration = 180; // 180 seconds chirp sounding
float sampleRate = 40e6; // sample beyond Nyquist limit + prevent aliasing on SDR
size_t nSamples = duration * sampleRate;
std::complex<short> sample; // store wavepacket
pulseSequence.reserve(nSamples); //allocate memory for large vector (7,200,000,000 samples)
#pragma omp parallel for
for ( size_t iSample=0; iSample < nSamples; iSample++ )
{
// the following is bogus/test code for example purposes:
sample = std::complex<short>(0*iSample,0);
//save the sample into the main vector
#pragma omp critical // <-- tried with/without
{
pulseSequence.push_back(sample);
}
//output some progress:
if (iSample % 100000000 == 0)
{
std::cout << float(iSample)*100.0/float(nSamples) << "%" << std::endl;
}
}
return 0; // end program
}
Compiled with: g++ test.cpp -o test -fopenmp
3 Answers
While Dominik Kaszewski's answer (using resize() and operator[]) is a start, resize() will value-initialize the memory, which again wastes considerable time. This part of the C++ standard library is not very nice to use with parallelism.
Another potential performance issue with sequential initialization is that, given the "first touch" policy on NUMA systems, all memory will be local to the thread that initializes it, which is a problem when you then keep accessing the data in parallel.
The solution is often a custom allocator like the one in "How to make my uninitialised_allocator safe?", or better, the allocator adaptor provided by Boost in <boost/core/noinit_adaptor.hpp>.
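A minimal sketch of such an uninitialised-allocator adaptor (the name no_init_allocator is illustrative, not the Boost one; assumes C++17):

```cpp
#include <memory>
#include <type_traits>
#include <utility>
#include <vector>

// Forwards everything to the underlying allocator, but makes the
// no-argument construct() default-initialise, so vector::resize()
// skips zeroing elements of trivially default-constructible types.
template <class T, class A = std::allocator<T>>
struct no_init_allocator : A {
    using A::A;
    template <class U> struct rebind {
        using other = no_init_allocator<U,
            typename std::allocator_traits<A>::template rebind_alloc<U>>;
    };
    template <class U>
    void construct(U* p) noexcept(std::is_nothrow_default_constructible_v<U>) {
        ::new (static_cast<void*>(p)) U;  // default-init: no zeroing for trivial U
    }
    template <class U, class... Args>
    void construct(U* p, Args&&... args) {
        std::allocator_traits<A>::construct(static_cast<A&>(*this),
                                            p, std::forward<Args>(args)...);
    }
};
```

One caveat: std::complex<short> has a default constructor that value-initialises both parts, so for this specific element type the adaptor does not skip the zeroing; the trick pays off for trivially default-constructible element types (e.g. a plain struct of two shorts).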
In the improbable case that your actual initialization is as simple as the one in the MRE, you should probably use a lazy view instead: see C++20's std::views::iota, or for older standards e.g. <boost/iterator/counting_iterator.hpp>. Slightly more complicated patterns can be created by nesting this view/iterator inside a transform view/iterator.
Using a critical section/lock around the only line of work in your parallel loop effectively serializes it, i.e. it is almost always slower than the sequential code due to the parallel overheads. Critical sections, if needed at all, should only ever cover a small part of the work in a parallel loop.
You can replace reserve() and push_back() with resize() and operator[]. This will be more performant, since you only modify the size once, and it lets you remove the #pragma omp critical lock.
You can also try using <algorithm> to fill your vector; benchmark to compare its performance against OpenMP:
samples.resize(count);
std::transform(std::execution::par_unseq, samples.begin(), samples.end(), samples.begin(), generate_sample);
(Note: std::execution::par_unseq needs #include <execution>, and with GCC's libstdc++ the parallel algorithms require linking TBB, e.g. -ltbb.)
You should not use push_back, which imposes an ordering on the iterations.
- Allocate the vector with size zero.
- Use resize() so that the space is there.
- Then use [] indexing to fill the vector in parallel.
This works for single-socket systems. If you have a dual (quad, ...) socket system you need to account for "first touch".
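A sketch of these three steps applied to the MRE (the sample formula is a stand-in for the real generator; compile with -fopenmp, though the pragma is harmlessly ignored without it):

```cpp
#include <complex>
#include <cstddef>
#include <vector>

std::vector<std::complex<short>> make_pulse_sequence(std::size_t nSamples) {
    std::vector<std::complex<short>> seq;  // 1. starts at size zero
    seq.resize(nSamples);                  // 2. space (and initialisation) up front
    #pragma omp parallel for               // threads write disjoint index ranges
    for (std::size_t i = 0; i < nSamples; ++i)
        seq[i] = std::complex<short>(static_cast<short>(i % 1000), 0);  // 3. fill via []
    return seq;
}
```

Because each iteration writes only its own element, no critical section is needed, and size() is correct from the single resize().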
test.push_back is not thread safe. Multiple threads updating the structure will collide. – Tim Roberts, Feb 16 at 4:22