
c++ - Parallelized ray tracer is slower than single thread version - Stack Overflow


I'm writing a ray tracer based on the popular Ray Tracing in One Weekend series. I've tried to parallelize it by dividing the image into slices and giving each thread a portion of the image to render:

void Camera::render(const Hittable& world, const Hittable& lights) {
    unsigned n_threads = std::thread::hardware_concurrency();
    unsigned rows_per_job = img_height_val / n_threads;
    unsigned leftover = img_height_val % n_threads;
    std::vector<job_block_t> jobs;
    std::vector<std::thread> threads;

    for (unsigned i = 0; i < n_threads; ++i) {
        job_block_t job;
        job.row_start = i * rows_per_job;
        job.row_end = job.row_start + rows_per_job;
        if (i == (n_threads - 1)) {
            job.row_end += leftover;
        }

        job.row_size = img_width_val;
        jobs.push_back(job);
    }

    for (unsigned i = 0; i < (n_threads - 1); ++i) {
        threads.emplace_back([this, &world, &lights, &jobs, i] {
            color_per_job(world, lights, jobs[i]);
        });
    }

    color_per_job(world, lights, jobs[n_threads - 1]);

    {
        std::unique_lock<std::mutex> lock(mutex);
        cv.wait(lock, [this, n_threads] {
            return image_blocks.size() == n_threads;
        });
    }

    for (auto& t : threads) {
        if (t.joinable()) {
            t.join();
        }
    }

    reconstruct_image(std::cout);
    std::clog << "\nDone.\n";
}

void Camera::color_per_job(const Hittable& world, const Hittable& lights, job_block_t& job) {
    for (int j=job.row_start; j<job.row_end; ++j) {
        for (int i = 0; i < job.row_size; ++i) {
            Color pixel_color(0,0,0);
            for (int s_j = 0; s_j < samples_sqrt; ++s_j) {
                for (int s_i = 0; s_i < samples_sqrt; ++s_i) {
                    Ray r = get_ray(i, j, s_i, s_j);
                    pixel_color += ray_color(r, max_depth, world, lights);
                }
            }

            pixel_color *= pixel_samples_scale;
            pixel_color.set_x(linear_to_gamma(pixel_color.x()));
            pixel_color.set_y(linear_to_gamma(pixel_color.y()));
            pixel_color.set_z(linear_to_gamma(pixel_color.z()));

            const int index = j * job.row_size + i;
            job.indices.push_back(index);
            job.colors.push_back(pixel_color);
        }
    }

    {
        std::lock_guard<std::mutex> lock(mutex);
        image_blocks.push_back(job);
        cv.notify_one();
    }
}

Color Camera::ray_color(const Ray& r, int depth, const Hittable& world, const Hittable& lights) const {
    if (depth <= 0) {
        // no light returned after too many bounces
        return Color(0, 0, 0);
    }

    HitRecord rec;
    double shadow_acne_offset = 0.001;

    // no hits
    if (!world.hit(r, Interval(shadow_acne_offset, infinity), rec)) {
        return background;
    }

    ScatterRecord srec;
    Color color_from_emission = 
        rec.material()->emitted(r, rec, rec.u(), rec.v(), rec.point());

    if (!rec.material()->scatter(r,rec,srec)) {
        return color_from_emission;
    }
 
    Color color_from_scatter = Color();
    for (const auto& ray_t : srec.scattered_rays) {
        if (ray_t.skip_pdf || ray_t.pdf == nullptr) {
            color_from_scatter += ray_t.attenuation * ray_color(ray_t.skip_pdf_ray, depth-1, world, lights);
        } else {
            auto light_ptr = std::make_shared<HittablePDF>(lights, rec.point());
            MixturePDF p(light_ptr, ray_t.pdf);
            Ray scattered = Ray(rec.point(), p.generate(), r.time());
            double pdf_value = p.value(scattered.direction(), r.direction(), w);
            double scattering_pdf = rec.material()->scattering_pdf(r, rec, scattered, w);
            Color sample_color = ray_color(scattered, depth-1, world, lights);
                
            color_from_scatter += (ray_t.attenuation * scattering_pdf * sample_color) / pdf_value;
        }
    }

    return color_from_emission + color_from_scatter;
}

typedef struct JobBlock {
    int row_start;
    int row_end;
    int row_size;
    std::vector<int> indices;
    std::vector<Vec3> colors;
} job_block_t;

I'm adding the most important parts of the code; if more are needed I can post those as well. In any case, this is the repo. It can be built and executed in release mode using ./build.sh -r -run.

I'm aware that the architecture per se is extremely slow and needs refactoring, but it is still surprising that the multithreaded version is even slower:

To render this image (400x400, 1000 spp, and 10 ray bounces) it took 64 seconds with 1 thread and 82 with 16 threads; using 2 or 4 threads it gets back to around 64, so I'm assuming there might be some contention between the threads. I'm new to multithreading, so I might have made some big mistake.

asked Mar 4 at 11:43 by Niccolò Tiezzi
  • Unrelated: your cv.wait(...) construction at the end of ::render is unnecessary: calling join() on a thread already waits for it to terminate. – Botje, Mar 4 at 12:12
  • What you really should do (whenever you have performance issues) is run your program in a profiler; it will show you exactly where your hot path is and thus where your bottlenecks are. That will point you to the code causing the biggest slowdowns (and would probably point out the locations shown in the answers below too). – Pepijn Kramer, Mar 4 at 13:17
  • I would recommend running your application under a profiler tool to locate the bottleneck. Once you have data from a profiler it will be easier to locate the issue. Personally, I suspect a false-sharing problem. – Marek R, Mar 4 at 13:23
  • Thank you all for the comments; in case you are still interested, I've opened a follow-up: stackoverflow/beta/discussions/79484313/… – Niccolò Tiezzi, Mar 4 at 16:43
  • @NiccolòTiezzi It is at least a good opportunity to learn when to use shared_ptr and when not to. Your code is a good example of when not to use it: simulators and algorithms usually use as little indirection as possible, and in some cases almost no runtime polymorphism. See "Clean Code, Horrible Performance". – Ahmed AEK, Mar 5 at 10:50

3 Answers

for (int j=job.row_start; j<job.row_end; ++j) {
    for (int i = 0; i < job.row_size; ++i) {
        // some code here
        job.indices.push_back(index);
        job.colors.push_back(pixel_color);
    }
}

You are doing a lot of push_back calls in a tight loop, and every time the vector capacity grows it has to allocate memory, which also causes a lot of page faults as data is copied from the old storage to the new, and makes the code memory-bound.

Also, the global allocator is generally not made for fast concurrent allocations; it usually has a mutex in there, causing all threads to wait on one another. You need to reserve the memory up front, so the vector capacity never grows and no reallocation happens:

job.indices.reserve((job.row_end - job.row_start) * job.row_size);
job.colors.reserve((job.row_end - job.row_start) * job.row_size);
for (int j=job.row_start; j<job.row_end; ++j) {
...

Another point where threads are stepping on one another:

auto light_ptr = std::make_shared<HittablePDF>(lights, rec.point());
MixturePDF p(light_ptr, ray_t.pdf);

Again, the global allocator is not good at concurrent allocations; you can likely allocate those on the stack:

auto light_obj = HittablePDF{lights, rec.point()};
MixturePDF p(light_obj, *ray_t.pdf);

You will need to modify MixturePDF to store a raw pointer or reference instead of a shared_ptr.
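
As a rough illustration, a reference-based MixturePDF might look like the sketch below. It assumes the book-style PDF base class with virtual value() and generate() and a random_double() helper; the actual signatures in the repo differ (its value() takes extra arguments), so treat this as a shape, not a drop-in.

class MixturePDF : public PDF {
  public:
    // Borrow both components by reference: no heap allocation, no ref-counting.
    MixturePDF(const PDF& a, const PDF& b) : a_(a), b_(b) {}

    double value(const Vec3& direction) const override {
        // Mixture density: average of the two component densities.
        return 0.5 * a_.value(direction) + 0.5 * b_.value(direction);
    }

    Vec3 generate() const override {
        // Sample either component with equal probability.
        return random_double() < 0.5 ? a_.generate() : b_.generate();
    }

  private:
    const PDF& a_;  // borrowed; the caller must keep these alive
    const PDF& b_;  // for the lifetime of this object
};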

You use a lot of push_back or emplace_back, but I don't see any calls to reserve. Threads all need to get memory from the same place, so any re-allocation would cause them to go through that bottleneck.

Your best bet for multithreading is to pre-allocate the buffers as if they would be serviced by a single thread. Once you do that, the worker threads should each change values in different portions of that buffer (by reference). Threads are for computing, so try to eliminate any place where they need to perform memory allocation. A minimal sketch of this idea follows.
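
This sketch uses illustrative names (Pixel, render_image) and stubs out the per-pixel shading; the point is the single up-front allocation and the disjoint row ranges:

#include <cstddef>
#include <thread>
#include <vector>

struct Pixel { double r, g, b; };

void render_image(int width, int height, unsigned n_threads) {
    // One allocation for the whole image, done before any thread starts.
    std::vector<Pixel> framebuffer(static_cast<std::size_t>(width) * height);

    auto render_rows = [&](int row_start, int row_end) {
        for (int j = row_start; j < row_end; ++j)
            for (int i = 0; i < width; ++i)
                // Disjoint writes: no lock, no push_back, no allocation.
                framebuffer[static_cast<std::size_t>(j) * width + i] = {0.0, 0.0, 0.0};  // stub shading
    };

    // Assumes n_threads >= 1; the last thread takes the leftover rows.
    std::vector<std::thread> threads;
    int rows_per_job = height / static_cast<int>(n_threads);
    for (unsigned t = 0; t < n_threads; ++t) {
        int start = static_cast<int>(t) * rows_per_job;
        int end = (t + 1 == n_threads) ? height : start + rows_per_job;
        threads.emplace_back(render_rows, start, end);
    }
    for (auto& th : threads) th.join();  // join() already waits; no condition variable needed
}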

I believe the bottleneck you're facing is the use of the job queue. I've seen many solutions online that have the same problem.

First, you're using hardware_concurrency to determine the number of threads you want to use. That call actually returns the number of logical processors (see SMT or Hyper-Threading); if you're doing a lot of computation, you should try something closer to the physical CPU count or you won't see much speedup.
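
Something like this sketch, for instance; note that standard C++ has no portable way to query physical cores, so the halving below simply assumes a 2-way SMT machine:

#include <algorithm>
#include <thread>

unsigned logical = std::thread::hardware_concurrency();  // logical processors; may return 0
unsigned n_threads = std::max(1u, logical / 2);          // rough guess at the physical core count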

Also, you're using a mutex and a condition variable, which is correct but prone to frequent context switches that can hurt the scaling of your solution.

I'd try to see whether batching can be implemented, or maybe try some active-waiting methods (i.e. spinlocks instead of blocking locks). Also, as others suggested, reserving the memory in advance can be good, but std::vector already does a decent job, and memory caches are really efficient (so the bottleneck probably isn't there).
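
For the active-waiting idea, here is a minimal spinlock sketch over std::atomic_flag (whether it actually beats std::mutex depends on contention and hold times, so measure before committing):

#include <atomic>

class Spinlock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
  public:
    void lock() {
        // Busy-wait until we observe the flag clear and set it atomically.
        while (flag_.test_and_set(std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() { flag_.clear(std::memory_order_release); }
};

It can be used with std::lock_guard<Spinlock>, since it satisfies the BasicLockable requirements.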

There are also a lot of lock-free job queues. See for example LPRQ, which is a multi-producer/multi-consumer queue. The paper also has an artifact section from which you can get the actual implementation.

If you find that implementation too complicated, you can think of having a buffer from the producer to every consumer (in a lock-free manner); the implementation is much simpler (see here) and probably scales much better than a single buffer shared between all threads (assuming the thread count is known in advance).
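
As a sketch of that per-consumer-buffer idea, here is a textbook bounded single-producer/single-consumer ring buffer (illustrative only, not the LPRQ implementation from the paper); with one instance per worker, producer and consumer never contend on a lock:

#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t N>
class SpscQueue {
  public:
    bool push(const T& item) {            // called by the producer only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;                 // full (usable capacity is N - 1)
        buf_[t] = item;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    std::optional<T> pop() {              // called by the consumer only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return std::nullopt;          // empty
        T item = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return item;
    }

  private:
    T buf_[N];
    std::atomic<std::size_t> head_{0};    // advanced by the consumer
    std::atomic<std::size_t> tail_{0};    // advanced by the producer
};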
