I'm writing a raytracer based on the popular Ray Tracing in One Weekend series. I've tried to parallelize it by dividing the image into slices and giving each thread a portion of the image to render:
void Camera::render(const Hittable& world, const Hittable& lights) {
    unsigned n_threads = std::thread::hardware_concurrency();
    unsigned rows_per_job = img_height_val / n_threads;
    unsigned leftover = img_height_val % n_threads;
    std::vector<job_block_t> jobs;
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n_threads; ++i) {
        job_block_t job;
        job.row_start = i * rows_per_job;
        job.row_end = job.row_start + rows_per_job;
        if (i == (n_threads - 1)) {
            job.row_end += leftover;
        }
        job.row_size = img_width_val;
        jobs.push_back(job);
    }
    for (unsigned i = 0; i < (n_threads - 1); ++i) {
        threads.emplace_back([this, &world, &lights, &jobs, i] {
            color_per_job(world, lights, jobs[i]);
        });
    }
    color_per_job(world, lights, jobs[n_threads - 1]);
    {
        std::unique_lock<std::mutex> lock(mutex);
        cv.wait(lock, [this, n_threads] {
            return image_blocks.size() == n_threads;
        });
    }
    for (auto& t : threads) {
        if (t.joinable()) {
            t.join();
        }
    }
    reconstruct_image(std::cout);
    std::clog << "\nDone.\n";
}
void Camera::color_per_job(const Hittable& world, const Hittable& lights, job_block_t& job) {
    for (int j = job.row_start; j < job.row_end; ++j) {
        for (int i = 0; i < job.row_size; ++i) {
            Color pixel_color(0, 0, 0);
            for (int s_j = 0; s_j < samples_sqrt; ++s_j) {
                for (int s_i = 0; s_i < samples_sqrt; ++s_i) {
                    Ray r = get_ray(i, j, s_i, s_j);
                    pixel_color += ray_color(r, max_depth, world, lights);
                }
            }
            pixel_color *= pixel_samples_scale;
            pixel_color.set_x(linear_to_gamma(pixel_color.x()));
            pixel_color.set_y(linear_to_gamma(pixel_color.y()));
            pixel_color.set_z(linear_to_gamma(pixel_color.z()));
            const int index = j * job.row_size + i;
            job.indices.push_back(index);
            job.colors.push_back(pixel_color);
        }
    }
    {
        std::lock_guard<std::mutex> lock(mutex);
        image_blocks.push_back(job);
        cv.notify_one();
    }
}
Color Camera::ray_color(const Ray& r, int depth, const Hittable& world, const Hittable& lights) const {
    if (depth <= 0) {
        // no light returned after too many bounces
        return Color(0, 0, 0);
    }
    HitRecord rec;
    double shadow_acne_offset = 0.001;
    // no hits
    if (!world.hit(r, Interval(shadow_acne_offset, infinity), rec)) {
        return background;
    }
    ScatterRecord srec;
    Color color_from_emission =
        rec.material()->emitted(r, rec, rec.u(), rec.v(), rec.point());
    if (!rec.material()->scatter(r, rec, srec)) {
        return color_from_emission;
    }
    Color color_from_scatter = Color();
    for (const auto& ray_t : srec.scattered_rays) {
        if (ray_t.skip_pdf || ray_t.pdf == nullptr) {
            color_from_scatter += ray_t.attenuation * ray_color(ray_t.skip_pdf_ray, depth - 1, world, lights);
        } else {
            auto light_ptr = std::make_shared<HittablePDF>(lights, rec.point());
            MixturePDF p(light_ptr, ray_t.pdf);
            Ray scattered = Ray(rec.point(), p.generate(), r.time());
            double pdf_value = p.value(scattered.direction(), r.direction(), w);
            double scattering_pdf = rec.material()->scattering_pdf(r, rec, scattered, w);
            Color sample_color = ray_color(scattered, depth - 1, world, lights);
            color_from_scatter += (ray_t.attenuation * scattering_pdf * sample_color) / pdf_value;
        }
    }
    return color_from_emission + color_from_scatter;
}
typedef struct JobBlock {
    int row_start;
    int row_end;
    int row_size;
    std::vector<int> indices;
    std::vector<Vec3> colors;
} job_block_t;
I'm adding the most important parts of the code; if more is needed I can post that too. Anyway, this is the repo. It can be built and run in release mode with ./build.sh -r -run.
I'm aware that the architecture per se is extremely slow and needs refactoring, but it's still surprising that the multithreaded version is even slower:
To render this image (400x400, 1000 spp, 10 ray bounces) it took 64 seconds with 1 thread and 82 seconds with 16 threads. With 2 or 4 threads it goes back to around 64 seconds, so I'm assuming there might be some contention between the threads. I'm new to multithreading, so I might have made some big mistake.
asked Mar 4 at 11:43 by Niccolò Tiezzi

3 Answers
for (int j = job.row_start; j < job.row_end; ++j) {
    for (int i = 0; i < job.row_size; ++i) {
        // some code here
        job.indices.push_back(index);
        job.colors.push_back(pixel_color);
    }
}
You are doing a lot of push_back in a tight loop, and every time the vector capacity grows it has to allocate memory. That also causes a lot of page faults as data is copied from the old storage to the new, and makes the code memory-bound.
Also, the global allocator is generally not built for fast concurrent allocations; it usually has a mutex inside, causing all threads to wait on one another. You need to reserve the memory up front so the vector capacity never grows and no reallocation happens:
job.indices.reserve((job.row_end - job.row_start) * job.row_size);
job.colors.reserve((job.row_end - job.row_start) * job.row_size);
for (int j = job.row_start; j < job.row_end; ++j) {
    ...
Another point where threads are stepping on one another:
auto light_ptr = std::make_shared<HittablePDF>(lights, rec.point());
MixturePDF p(light_ptr, ray_t.pdf);
Again, the global allocator is not good at concurrent allocations; you can likely allocate those objects on the stack:
auto light_obj = HittablePDF{lights, rec.point()};
MixturePDF p(light_obj, ray_t.pdf);
You will need to modify MixturePDF to store a raw pointer or a reference instead of a shared_ptr.
You use a lot of push_back and emplace_back, but I don't see any calls to reserve. All threads need to get memory from the same place, so any reallocation forces them through that bottleneck.
Your best bet for multithreading is to pre-allocate the buffers as if they would be serviced by a single thread. Once you do that, each worker thread should write its values into a different portion of that buffer (by reference). Threads are for computing, so try to eliminate anywhere that they need to perform memory allocation.
I believe the bottleneck you're facing is the use of the job queue. I've seen many solutions online that have the same problem.
First, you're using hardware_concurrency to determine the number of threads to use. The catch is that this call returns the number of logical processors (see SMT or Hyper-Threading); if you're doing a lot of computation you should try something closer to the physical CPU count, or you won't see much speedup.
Also, you're using a mutex and a condition variable, which is correct but prone to frequent context switches that can hurt the scaling of your solution.
I'd try to see whether batching can be implemented, or try some active-waiting methods (i.e. spinlocks instead of locks). Also, as others suggested, reserving the memory in advance can help, but std::vector already does a decent job of amortizing growth, and memory caches are really efficient (so the bottleneck probably isn't there).
There are also many job queues that are lock-free. See for example LPRQ, a multi-producer/multi-consumer queue. The paper also has an artifact section from which you can get the actual implementation.
If you find that implementation too complicated, you can instead have a dedicated buffer from the producer to every consumer (in a lock-free manner); the implementation is much simpler (see here) and probably scales much better than a single buffer shared between all threads (assuming the thread count is known in advance).
Comments:
The cv.wait(...) construction at the end of ::render is unnecessary: calling join() on a thread already waits for it to terminate. – Botje, Mar 4 at 12:12
Learn when to use shared_ptr and when not to use it; your code is a good example of when not to. Simulators and algorithms usually use as little indirection as possible, and in some cases almost no runtime polymorphism. See "Clean Code, Horrible Performance". – Ahmed AEK, Mar 5 at 10:50