optimization - Fastest way to clear an RGBA image to a single color in C - Stack Overflow

Something as simple as clearing an array of 32-bit RGBA pixels to a single color should be blazing fast. But that's not what's happening in my case:

typedef struct PXLimage
{
    int width;

    int height;

    unsigned char* data;

} PXLimage;


typedef struct PXLcolor
{
    unsigned char rgba[4];

} PXLcolor;


void pxlImageClearColor(PXLimage* image, PXLcolor color)
{
    for(uint32_t i = 0; i < image->width * image->height; i++)
    {
        memcpy(image->data + (i * 4), color.rgba, 4);
    }
}

I profiled this software rendering C library using gprof and got the following results:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 49.85     45.70    45.70   200000   228.50   228.50  pxlImageClearColor
 23.42     67.17    21.47                             _mcount_private
 11.52     77.73    10.56 1137265408     0.01     0.01  pxlImageSetPixelColor
  9.75     86.67     8.94                             __fentry__
  4.51     90.80     4.13   200000    20.65   206.13  pxlRendererDrawTriangle
  0.76     91.50     0.70   200000     3.50    66.64  pxlRendererDrawRect
  0.14     91.63     0.13   200000     0.65     4.23  pxlRendererDrawLine
  0.01     91.64     0.01   200000     0.05     0.05  pxlGetKey
  0.01     91.65     0.01   200000     0.05   228.55  pxlRendererClearColor
  0.01     91.66     0.01                             _pxlOutOfImageRange
  0.01     91.67     0.01                             main
  0.00     91.67     0.00  2000000     0.00     0.00  pxlRendererSetDrawColor
  0.00     91.67     0.00   200000     0.00     0.00  pxlWindowPresent
  0.00     91.67     0.00   100000     0.00     0.00  pxlWindowPollEvents
  0.00     91.67     0.00       10     0.00     0.00  _pxlFree
  0.00     91.67     0.00       10     0.00     0.00  _pxlMalloc

Almost 50%? Why? So I tried optimizing further:

void pxlImageClearColor(PXLimage* image, PXLcolor color)
{
    for(uint32_t i = 0; i < image->width * image->height; i++)
    {
        ((uint32_t*)image->data)[i] =  *((uint32_t*)color.rgba);
    }
}

Profiler results:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 48.45     39.80    39.80   200000   199.00   199.00  pxlImageClearColor
 24.31     59.77    19.97                             _mcount_private
 11.43     69.16     9.39 1137265408     0.01     0.01  pxlImageSetPixelColor
 10.24     77.57     8.41                             __fentry__
  4.59     81.34     3.77   200000    18.85   183.78  pxlRendererDrawTriangle
  0.73     81.94     0.60   200000     3.00    59.15  pxlRendererDrawRect
  0.15     82.06     0.12                             pxlImageGetPixelColor
  0.09     82.13     0.07   200000     0.35     3.54  pxlRendererDrawLine
  0.01     82.14     0.01   200000     0.05     0.05  pxlGetKey
  0.00     82.14     0.00  2000000     0.00     0.00  pxlRendererSetDrawColor
  0.00     82.14     0.00   200000     0.00   199.00  pxlRendererClearColor
  0.00     82.14     0.00   200000     0.00     0.00  pxlWindowPresent
  0.00     82.14     0.00   100000     0.00     0.00  pxlWindowPollEvents
  0.00     82.14     0.00       10     0.00     0.00  _pxlFree
  0.00     82.14     0.00       10     0.00     0.00  _pxlMalloc

A little faster, but still very slow. What am I missing? How can I optimize this?

This all runs on a single core of my CPU (i5-12450H). An array of pixels is rendered into a Win32 window each frame: we clear the image to a single color, draw something on top, and then present it in the window.

asked Jan 31 at 1:38 by zakariamoknine

Comments (2):
  • So, here is a great discussion on memset() variations that take other than one-byte params. There's also sample code there by user @Cosmin using mixed C and assembly memset32() which may be just what you need: it takes a 32-bit int and replicates it throughout the buffer using asm instruction rep stosd -- I don't think it gets any faster than that. – greg spears Commented Jan 31 at 1:42
  • This is not an exact duplicate of "How to memset() memory to a certain pattern instead of a single byte?", but the answers there should be applicable in this case. The answer by Aaron Digulla looks interesting but better to replace his memmove with memcpy. – Ian Abbott Commented Jan 31 at 10:43

2 Answers

The most drastic optimization would be parallel processing, which your CPU supports. Look into SIMD intrinsics or libraries that let you clear multiple pixels per instruction.

However, even without parallelism the function can be improved. The issue is the number of separate writes to memory: you copy only one pixel at a time. Your second implementation also writes one pixel at a time, so the small improvement probably comes from lower call overhead.

memset would be much better, but it only works when all four channel bytes are equal, for example 0x00000000 or 0xFFFFFFFF, because memset takes a single 8-bit value and copies it into every byte of every pixel. You can add a check at the beginning: if color.rgba[0] through color.rgba[3] are all equal, just use memset(image->data, color.rgba[0], image->width * image->height * 4).

Your fallback should be a custom memset-like function that accepts a value wider than 8 bits, in this case a 32-bit parameter. Your memset32 might look like:

void memset32(void* dest, uint32_t value, size_t count)
{
    uint32_t* dest32 = (uint32_t*)dest; 
    for (size_t i = 0; i < count; i++)
    {
        dest32[i] = value;
    }
}
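Putting the equal-channels check and the 32-bit fallback together could look roughly like the sketch below. The name pxlImageClearColorFast is hypothetical (it is not part of the original library), and the code assumes image->data is at least 4-byte aligned (as a malloc'd buffer would be) and that width and height are non-negative.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as the structs in the question. */
typedef struct PXLimage { int width; int height; unsigned char* data; } PXLimage;
typedef struct PXLcolor { unsigned char rgba[4]; } PXLcolor;

/* Fill `count` 32-bit pixels with `value` (same idea as memset32 above). */
static void memset32(void* dest, uint32_t value, size_t count)
{
    uint32_t* dest32 = (uint32_t*)dest;
    for (size_t i = 0; i < count; i++)
    {
        dest32[i] = value;
    }
}

/* Hypothetical name, for illustration only. */
void pxlImageClearColorFast(PXLimage* image, PXLcolor color)
{
    size_t pixels = (size_t)image->width * (size_t)image->height;
    const unsigned char* c = color.rgba;

    if (c[0] == c[1] && c[1] == c[2] && c[2] == c[3])
    {
        /* All four channel bytes match, so a plain byte fill is enough. */
        memset(image->data, c[0], pixels * 4);
    }
    else
    {
        /* Pack the four bytes into one 32-bit value without an aliasing cast. */
        uint32_t packed;
        memcpy(&packed, c, sizeof packed);
        memset32(image->data, packed, pixels);
    }
}
```

Copying the color into a uint32_t with memcpy keeps the byte order in memory identical to color.rgba, so the result does not depend on endianness.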


As @j-kadditz mentioned, I probably should parallelize the process using SIMD or other parallelizing utilities, which is what I will eventually end up doing.
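For reference, a SIMD clear might look something like the following minimal sketch. It assumes an x86 target with SSE2 (available on every x86-64 CPU) and a compiler that provides <emmintrin.h>. The function name fill32_sse2 is made up for this example. An optimizing compiler may already vectorize a plain 32-bit loop in a similar way, so this is worth measuring rather than assuming.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill `count` 32-bit pixels using 128-bit stores. */
void fill32_sse2(void* dest, uint32_t value, size_t count)
{
    uint32_t* p = (uint32_t*)dest;
    __m128i v = _mm_set1_epi32((int)value); /* four copies of the pixel */
    size_t i = 0;

    /* Four pixels per store; the unaligned store accepts any address. */
    for (; i + 4 <= count; i += 4)
    {
        _mm_storeu_si128((__m128i*)(p + i), v);
    }

    /* Remaining 0 to 3 pixels. */
    for (; i < count; i++)
    {
        p[i] = value;
    }
}
```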

In the meantime, I implemented a much faster algorithm, thanks to @weather-vane, who suggested the idea (back when this post was on the Staging Ground).

The idea: write the first pixel (4 bytes), then on each iteration double the amount of filled memory by copying everything written so far to the position right after it. So 4 bytes become 8, then 16, then 32, and so on, while checking that doubling does not run past the end of the image. When it would, copy only the remaining bytes.

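void pxlImageClearColor(PXLimage* image, PXLcolor color)
{
    size_t nbytes = (size_t)image->width * (size_t)image->height * 4;

    /* Nothing to fill; also avoids writing 4 bytes into an empty buffer. */
    if (nbytes == 0)
    {
        return;
    }

    /* Seed the buffer with the first pixel. */
    memcpy(image->data, color.rgba, 4);

    size_t bytes_filled = 4;

    while (bytes_filled < nbytes)
    {
        /* Double the filled region, clamped to the end of the image. */
        size_t next_fill = bytes_filled * 2;

        if (next_fill > nbytes)
        {
            next_fill = nbytes;
        }

        /* No overlap: the copy length is never more than bytes_filled. */
        memcpy(image->data + bytes_filled, image->data, next_fill - bytes_filled);
        bytes_filled = next_fill;
    }
}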
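To get a feel for why this helps, the small sketch below counts how many memcpy calls the doubling loop makes after the first pixel is written. The helper doubling_steps is hypothetical and only mirrors the loop's bookkeeping. For an assumed 640x480 image (the question does not state a size) the buffer is 1,228,800 bytes, and the loop runs 19 times, compared with 307,200 iterations of the per-pixel loop.

```c
#include <stddef.h>

/* Hypothetical helper: number of memcpy calls made by the doubling loop
   once the first 4-byte pixel is in place. */
unsigned doubling_steps(size_t nbytes)
{
    unsigned steps = 0;
    size_t filled = 4;

    while (filled < nbytes)
    {
        size_t next = filled * 2;
        if (next > nbytes)
        {
            next = nbytes; /* last step copies only the remainder */
        }
        filled = next;
        steps++;
    }
    return steps;
}
```

Each of those few calls copies a large contiguous block, which is the case memcpy implementations are typically tuned for.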

I profiled again with gprof, and check this out:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 47.67     19.58    19.58                             _mcount_private
 22.25     28.72     9.14 1137265408     0.01     0.01  pxlImageSetPixelColor
 19.04     36.54     7.82                             __fentry__
  9.30     40.36     3.82   200000    19.10   179.64  pxlRendererDrawTriangle
  1.44     40.95     0.59   200000     2.95    57.60  pxlRendererDrawRect
  0.17     41.02     0.07   200000     0.35     3.45  pxlRendererDrawLine
  0.05     41.04     0.02   200000     0.10     0.10  pxlImageClearColor
  0.05     41.06     0.02   200000     0.10     0.10  pxlWindowPresent
  0.02     41.07     0.01                             main
  0.00     41.07     0.00  2000000     0.00     0.00  pxlRendererSetDrawColor
  0.00     41.07     0.00   200000     0.00     0.00  pxlGetKey
  0.00     41.07     0.00   200000     0.00     0.10  pxlRendererClearColor
  0.00     41.07     0.00   100000     0.00     0.00  pxlWindowPollEvents
  0.00     41.07     0.00       10     0.00     0.00  _pxlFree
  0.00     41.07     0.00       10     0.00     0.00  _pxlMalloc

From about 200 us/call to 0.1 us/call.

The bottleneck now is _mcount_private, which belongs to gprof's own instrumentation (the hook called on function entry to record calls), not to the renderer.
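Since the profiling hooks now dominate the profile, one way to cross-check the numbers is to time the call directly in a build without -pg. The sketch below is only an illustration: average_call_us, demo_clear, and demo_buffer are made-up names, and demo_clear is a stand-in workload to be replaced with the real clear call. Note that clock() has coarse resolution and measures CPU time on POSIX systems but wall-clock time with the Microsoft C runtime, so many runs are needed for a usable average.

```c
#include <string.h>
#include <time.h>

/* Stand-in workload for illustration; replace with the real clear call. */
static unsigned char demo_buffer[64 * 64 * 4];

static void demo_clear(void* unused)
{
    (void)unused;
    memset(demo_buffer, 0x7F, sizeof demo_buffer);
}

/* Hypothetical helper: average microseconds per call of fn(arg) over `runs` calls. */
double average_call_us(void (*fn)(void*), void* arg, int runs)
{
    clock_t start = clock();
    for (int i = 0; i < runs; i++)
    {
        fn(arg);
    }
    clock_t end = clock();

    return (double)(end - start) * 1e6 / CLOCKS_PER_SEC / runs;
}
```

A call such as average_call_us(demo_clear, NULL, 100000) then gives a per-call figure that is not inflated by instrumentation overhead.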
