Something as simple as clearing an array of 32-bit RGBA pixels to a single color should be blazing fast, but that's not what's happening in my case:

    #include <stdint.h>
    #include <string.h>

    typedef struct PXLimage
    {
        int width;
        int height;
        unsigned char* data;
    } PXLimage;

    typedef struct PXLcolor
    {
        unsigned char rgba[4];
    } PXLcolor;

    void pxlImageClearColor(PXLimage* image, PXLcolor color)
    {
        for(uint32_t i = 0; i < image->width * image->height; i++)
        {
            memcpy(image->data + (i * 4), color.rgba, 4);
        }
    }

I profiled this software rendering C library using gprof and got the following results:

    Flat profile:

    Each sample counts as 0.01 seconds.
         % cumulative     self                self    total
      time    seconds  seconds      calls  us/call  us/call  name
     49.85      45.70    45.70     200000   228.50   228.50  pxlImageClearColor
     23.42      67.17    21.47                               _mcount_private
     11.52      77.73    10.56 1137265408     0.01     0.01  pxlImageSetPixelColor
      9.75      86.67     8.94                               __fentry__
      4.51      90.80     4.13     200000    20.65   206.13  pxlRendererDrawTriangle
      0.76      91.50     0.70     200000     3.50    66.64  pxlRendererDrawRect
      0.14      91.63     0.13     200000     0.65     4.23  pxlRendererDrawLine
      0.01      91.64     0.01     200000     0.05     0.05  pxlGetKey
      0.01      91.65     0.01     200000     0.05   228.55  pxlRendererClearColor
      0.01      91.66     0.01                               _pxlOutOfImageRange
      0.01      91.67     0.01                               main
      0.00      91.67     0.00    2000000     0.00     0.00  pxlRendererSetDrawColor
      0.00      91.67     0.00     200000     0.00     0.00  pxlWindowPresent
      0.00      91.67     0.00     100000     0.00     0.00  pxlWindowPollEvents
      0.00      91.67     0.00         10     0.00     0.00  _pxlFree
      0.00      91.67     0.00         10     0.00     0.00  _pxlMalloc

Nearly 50%? Why? So I tried optimizing further:

    void pxlImageClearColor(PXLimage* image, PXLcolor color)
    {
        for(uint32_t i = 0; i < image->width * image->height; i++)
        {
            ((uint32_t*)image->data)[i] = *((uint32_t*)color.rgba);
        }
    }

Profiler results:

    Flat profile:

    Each sample counts as 0.01 seconds.
         % cumulative     self                self    total
      time    seconds  seconds      calls  us/call  us/call  name
     48.45      39.80    39.80     200000   199.00   199.00  pxlImageClearColor
     24.31      59.77    19.97                               _mcount_private
     11.43      69.16     9.39 1137265408     0.01     0.01  pxlImageSetPixelColor
     10.24      77.57     8.41                               __fentry__
      4.59      81.34     3.77     200000    18.85   183.78  pxlRendererDrawTriangle
      0.73      81.94     0.60     200000     3.00    59.15  pxlRendererDrawRect
      0.15      82.06     0.12                               pxlImageGetPixelColor
      0.09      82.13     0.07     200000     0.35     3.54  pxlRendererDrawLine
      0.01      82.14     0.01     200000     0.05     0.05  pxlGetKey
      0.00      82.14     0.00    2000000     0.00     0.00  pxlRendererSetDrawColor
      0.00      82.14     0.00     200000     0.00   199.00  pxlRendererClearColor
      0.00      82.14     0.00     200000     0.00     0.00  pxlWindowPresent
      0.00      82.14     0.00     100000     0.00     0.00  pxlWindowPollEvents
      0.00      82.14     0.00         10     0.00     0.00  _pxlFree
      0.00      82.14     0.00         10     0.00     0.00  _pxlMalloc

A little faster, but still very slow. What am I missing? How can I optimize this?
This is all running on my CPU (i5-12450H) on a single core. An array of pixels is rendered into a Win32 window each frame: we clear the image to a single color, draw something on top, and then present it in the window.
asked Jan 31 at 1:38 by zakariamoknine

2 Answers
The most drastic optimization would be parallel processing, which your CPU supports: look into SIMD intrinsics or libraries that let you clear multiple pixels at once. However, even without parallelism, the function can be improved. The issue is the number of times you write to memory when copying only one pixel at a time. Your second implementation also copied one pixel at a time, so the small improvement is probably due to less call overhead. memset would be much better, but it only works when all four channel bytes are the same (for example black or white, 0x00000000 or 0xFFFFFFFF), because memset takes a single 8-bit value and copies it to every byte of every pixel. You can add a check at the beginning for r == g == b == a (color.rgba[0] through color.rgba[3]) and, if that is the case, just use memset(image->data, color.rgba[0], image->width * image->height * 4);
Your fallback should be your own memset-like implementation that takes more than an 8-bit parameter, in this case a 32-bit one. Your memset32 might look like:
```c
void memset32(void* dest, uint32_t value, size_t count)
{
    uint32_t* dest32 = (uint32_t*)dest;
    for (size_t i = 0; i < count; i++)
    {
        dest32[i] = value;
    }
}
```
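
To make the check described above concrete, here is a minimal sketch of how the two paths might be combined. It assumes `image->data` is at least 4-byte aligned (true for memory returned by `malloc`), repeats `memset32` so that it compiles on its own, and packs the color with `memcpy` rather than a pointer cast. The `assert` is only an illustrative precondition and is not part of the original library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct PXLimage { int width; int height; unsigned char* data; } PXLimage;
typedef struct PXLcolor { unsigned char rgba[4]; } PXLcolor;

/* Same 32-bit fill as memset32 above, repeated so this sketch is self-contained. */
static void memset32(void* dest, uint32_t value, size_t count)
{
    uint32_t* dest32 = (uint32_t*)dest;
    for (size_t i = 0; i < count; i++)
    {
        dest32[i] = value;
    }
}

void pxlImageClearColor(PXLimage* image, PXLcolor color)
{
    assert(image != NULL && image->data != NULL);

    size_t pixels = (size_t)image->width * (size_t)image->height;
    const unsigned char* c = color.rgba;

    if (c[0] == c[1] && c[1] == c[2] && c[2] == c[3])
    {
        /* All four channel bytes match, so a plain byte fill is enough. */
        memset(image->data, c[0], pixels * 4);
    }
    else
    {
        /* Pack the four channel bytes into one 32-bit value, in memory order. */
        uint32_t packed;
        memcpy(&packed, c, sizeof packed);
        memset32(image->data, packed, pixels);
    }
}
```

Whether the `memset` fast path is worth the extra branch depends on how often you clear to a gray-like color, so it is worth measuring both paths.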
As @j-kadditz mentioned, I should probably parallelize this using SIMD or other parallelizing utilities, which is what I will eventually end up doing.
But I implemented a much faster algorithm, all thanks to @weather-vane, who suggested the idea (back when this post was on the Staging Ground).
You write the first pixel (4 bytes), then on each iteration you double the amount of memory that has been filled. So we write 4 bytes; the next iteration memcpy's that data, and now 8 bytes are filled; the next iteration memcpy's everything again, and now 16 bytes are filled; and so on, while checking that doubling the block doesn't exceed the total image size. If it would, just write the remaining bytes.
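
    void pxlImageClearColor(PXLimage* image, PXLcolor color)
    {
        uint32_t nbytes = image->width * image->height * 4;
        if (nbytes < 4)
        {
            return; /* empty image: nothing to fill */
        }

        /* Seed the buffer with the first pixel. */
        memcpy(image->data, color.rgba, 4);
        uint32_t bytes_filled = 4;
        uint32_t next_fill;

        while (bytes_filled < nbytes)
        {
            /* Double the filled region, clamped to the buffer size. */
            next_fill = bytes_filled << 1;
            if (next_fill > nbytes)
            {
                next_fill = nbytes;
            }
            /* Source and destination never overlap, so memcpy is safe. */
            memcpy(image->data + bytes_filled, image->data, next_fill - bytes_filled);
            bytes_filled = next_fill;
        }
    }
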
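
The doubling idea is not specific to 4-byte pixels. As a rough sketch, here is a generic version with a hypothetical helper `fillPattern` (my own name, not part of the library) that tiles any small pattern across a buffer. It assumes `pattern` does not overlap `dst`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: tile `pattern` (patternSize bytes) across the first
   `total` bytes of `dst`, doubling the filled region on every pass. */
void fillPattern(void* dst, const void* pattern, size_t patternSize, size_t total)
{
    unsigned char* out = (unsigned char*)dst;
    assert(patternSize > 0);

    if (total == 0)
    {
        return;
    }

    /* Seed the buffer with one copy of the pattern (or less, if it does not fit). */
    size_t filled = patternSize < total ? patternSize : total;
    memcpy(out, pattern, filled);

    while (filled < total)
    {
        /* Copy everything filled so far, clamped to the space that remains. */
        size_t chunk = filled;
        if (chunk > total - filled)
        {
            chunk = total - filled;
        }
        /* chunk <= filled, so the source and destination ranges never overlap. */
        memcpy(out + filled, out, chunk);
        filled += chunk;
    }
}
```

With this, the clear would become something like `fillPattern(image->data, color.rgba, 4, (size_t)image->width * image->height * 4);`.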
I profiled again using gprof, and check this out:

    Each sample counts as 0.01 seconds.
         % cumulative     self                self    total
      time    seconds  seconds      calls  us/call  us/call  name
     47.67      19.58    19.58                               _mcount_private
     22.25      28.72     9.14 1137265408     0.01     0.01  pxlImageSetPixelColor
     19.04      36.54     7.82                               __fentry__
      9.30      40.36     3.82     200000    19.10   179.64  pxlRendererDrawTriangle
      1.44      40.95     0.59     200000     2.95    57.60  pxlRendererDrawRect
      0.17      41.02     0.07     200000     0.35     3.45  pxlRendererDrawLine
      0.05      41.04     0.02     200000     0.10     0.10  pxlImageClearColor
      0.05      41.06     0.02     200000     0.10     0.10  pxlWindowPresent
      0.02      41.07     0.01                               main
      0.00      41.07     0.00    2000000     0.00     0.00  pxlRendererSetDrawColor
      0.00      41.07     0.00     200000     0.00     0.00  pxlGetKey
      0.00      41.07     0.00     200000     0.00     0.10  pxlRendererClearColor
      0.00      41.07     0.00     100000     0.00     0.00  pxlWindowPollEvents
      0.00      41.07     0.00         10     0.00     0.00  _pxlFree
      0.00      41.07     0.00         10     0.00     0.00  _pxlMalloc

From 200 us/call to 0.1 us/call.
The bottleneck now is _mcount_private, which is what gprof uses to record timing of functions.
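
As for the SIMD route mentioned at the top of this answer, here is a minimal sketch of what it might look like with SSE2 intrinsics. `pxlFill32` and the `PXL_HAVE_SSE2` macro are hypothetical names, not part of the library, and I have not benchmarked this. It falls back to a scalar loop when SSE2 is not available. An optimizing compiler may already vectorize a plain loop, so measure before committing to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */
#define PXL_HAVE_SSE2 1
#endif

/* Hypothetical helper: fill `count` 32-bit pixels at `dest` with `value`. */
void pxlFill32(void* dest, uint32_t value, size_t count)
{
    unsigned char* out = (unsigned char*)dest;
    size_t i = 0;
    assert(dest != NULL || count == 0);

#ifdef PXL_HAVE_SSE2
    /* Broadcast the pixel into all four 32-bit lanes, then store 4 pixels per iteration.
       The unaligned store means `dest` needs no particular alignment. */
    const __m128i v = _mm_set1_epi32((int)value);
    for (; i + 4 <= count; i += 4)
    {
        _mm_storeu_si128((__m128i*)(out + i * 4), v);
    }
#endif

    /* Scalar tail (or the whole buffer when SSE2 is unavailable). */
    for (; i < count; i++)
    {
        memcpy(out + i * 4, &value, sizeof value);
    }
}
```

The packed `value` would be built from `color.rgba` with a 4-byte `memcpy`, and `count` would be `image->width * image->height`.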
[...] memset() variations that take other than one-byte params. There's also sample code there by user @Cosmin using mixed C and assembly, memset32(), which may be just what you need: it takes a 32-bit int and replicates it throughout the buffer using the asm instruction rep stosd -- I don't think it gets any faster than that. – greg spears, commented Jan 31 at 1:42

[...] memmove with memcpy. – Ian Abbott, commented Jan 31 at 10:43