c - Why does gcc generate different code for s->a[i] and *(s->a+i)?

I compile this with gcc -O1:

#include <stdint.h>

struct mystruct{
    uint32_t arr[1];
    uint32_t something_after_arr;
};

uint32_t sum1(const struct mystruct *s,uintptr_t n){
    uint32_t result=0;
    for(uintptr_t i=0;i<n;++i)result+=s->arr[i];
    return result;
}

uint32_t sum2(const struct mystruct *s,uintptr_t n){
    uint32_t result=0;
    for(uintptr_t i=0;i<n;++i)result+=i[s->arr];
    return result;
}

uint32_t sum3(const struct mystruct *s,uintptr_t n){
    uint32_t result=0;
    for(uintptr_t i=0;i<n;++i)result+=*(s->arr+i);
    return result;
}

compiler (…) generated this x86_64 assembly (assembler directives removed):

sum1:
        mov     eax, 0
        test    rsi, rsi
        je      .L1
        mov     eax, DWORD PTR [rdi]
.L1:
        ret
sum2:
        mov     eax, 0
        test    rsi, rsi
        je      .L4
        mov     eax, DWORD PTR [rdi]
.L4:
        ret
sum3:
        test    rsi, rsi
        je      .L10
        mov     eax, 0
        mov     edx, 0
.L9:
        add     edx, DWORD PTR [rdi+rax*4]
        add     rax, 1
        cmp     rsi, rax
        jne     .L9
.L7:
        mov     eax, edx
        ret
.L10:
        mov     edx, 0
        jmp     .L7

Only sum3 has backwards jumps. sum1 and sum2 only have forward jump (which means no looping).

Why are sum3 and sum1 different and why no loop in sum1?

I expected sum1 and sum2 and sum3 have same code with loop (like clang (…)).

I compile this with gcc -O1:

#include <stdint.h>

struct mystruct{
    uint32_t arr[1];
    uint32_t something_after_arr;
};

uint32_t sum1(const struct mystruct *s,uintptr_t n){
    uint32_t result=0;
    for(uintptr_t i=0;i<n;++i)result+=s->arr[i];
    return result;
}

uint32_t sum2(const struct mystruct *s,uintptr_t n){
    uint32_t result=0;
    for(uintptr_t i=0;i<n;++i)result+=i[s->arr];
    return result;
}

uint32_t sum3(const struct mystruct *s,uintptr_t n){
    uint32_t result=0;
    for(uintptr_t i=0;i<n;++i)result+=*(s->arr+i);
    return result;
}

compiler (https://godbolt./#…) generated this x86_64 assembly (assembler directives removed):

sum1:
        mov     eax, 0
        test    rsi, rsi
        je      .L1
        mov     eax, DWORD PTR [rdi]
.L1:
        ret
sum2:
        mov     eax, 0
        test    rsi, rsi
        je      .L4
        mov     eax, DWORD PTR [rdi]
.L4:
        ret
sum3:
        test    rsi, rsi
        je      .L10
        mov     eax, 0
        mov     edx, 0
.L9:
        add     edx, DWORD PTR [rdi+rax*4]
        add     rax, 1
        cmp     rsi, rax
        jne     .L9
.L7:
        mov     eax, edx
        ret
.L10:
        mov     edx, 0
        jmp     .L7

Only sum3 has backwards jumps. sum1 and sum2 only have forward jump (which means no looping).

Why are sum3 and sum1 different and why no loop in sum1?

I expected sum1 and sum2 and sum3 have same code with loop (like clang (https://godbolt./#…)).

Share Improve this question edited Feb 4 at 15:53 3CxEZiVlQ 39.1k11 gold badges80 silver badges92 bronze badges asked Feb 4 at 13:44 BZZZZ 615 bronze badges

4 If n is anything but 0, all three versions invoke undefined behavior. So the whole code is kind of pointless in all 3 versions. – Lundin Commented Feb 4 at 13:49
2 @Lundin n can be 0 or 1. Did you mean i? – Patrick Roberts Commented Feb 4 at 13:53
3 @PatrickRoberts Ah I see what you mean yes I mean i. – Lundin Commented Feb 4 at 13:58
2 OT: Why are you using uintptr_t for n? Why not size_t? – Some programmer dude Commented Feb 4 at 14:03
2 @BZZZZ that's not the correct way to think about it. I would say that gcc is not aware that *(s->arr+i) is not allowed to overlap with something_after_arr. It is undefined behavior to access that member through arr no matter how you spell that access. – Patrick Roberts Commented Feb 4 at 14:08

| Show 4 more comments

2 Answers 2

Sorted by: Reset to default 7

Rather than trying to answer why they are different (which would require looking at the implementation of the compiler), I'll answer why the optimization applied to two of the three cases is valid here.

Given that the array only has one element, the compiler is allowed to assume that the only possible values for n are 0 or 1, since anything else would invoke undefined behavior. Because of that, it doesn't need to generate a loop, only a branch.

seems like gcc is aware that *(s->arr+i) can overlap with something_after_arr but not aware that s->arr[i] can overlap with something_after_arr

It's the opposite. GCC is not aware (at least, apparently, with -O1) that *(s->arr+i) cannot overlap with something_after_arr, but it is aware that s->arr[i] can't overlap with it. It's undefined behavior to access outside the bounds of an array, no matter how you spell that access, even if you know that other data may exist at the offset you specify.

Not that defending the principle of undefined behavior is within the scope of this answer, but as a developer, expecting code to access data outside of arr through it, is unintuitive, and makes that code incomprehensible. And if it's reasonable for a developer reading that code to assume it's not intended, then the compiler should be able to assume it as well, and provide useful optimizations based on that assumption as in the cases you see here.

For me, the moral of the story is to write intuitive code and expect useful optimizations, even if they may sometimes require studying to understand.

I would guess this is a side effect of all 3 versions having undefined behavior for any n larger than 1 and the only valid index for i is 0. When going beyond that, the compiler is free to generate any crap it wants.

Contrary to popular belief, C never allowed wild & crazy pointer arithmetic using any type across any random memory space. C only ever allowed pointer arithmetic within the bounds of a declared array, using a type corresponding to the array item type.

Meaning it is not allowed to do uint32_t pointer arithmetic outside the bounds of the array arr followed by dereferencing - that's undefined behavior regardless of what happens to exist in memory after the array. It's just how the additive operators are specified.

We may however iterate across the struct using a character pointer, byte by byte. This is thanks to a special rule in C23 6.3.2.3:

When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object.

After fixing your code to remove undefined behavior by doing arithmetic on a character type instead (uint8_t is treated as a character type by the compiler here), I got the same machine code for all 3 versions:

uint32_t sum1(const struct mystruct *s,uintptr_t n){
    uint32_t result=0;
    const uint8_t* u8 = (const uint8_t*)s;
    for(uintptr_t i=0; i<n*4; i+=4)
      result += *(uint32_t*) &u8[i];
    return result;
}

uint32_t sum2(const struct mystruct *s,uintptr_t n){
    uint32_t result=0;
    const uint8_t* u8 = (const uint8_t*)s;
    for(uintptr_t i=0; i<n*4; i+=4)
      result += *(uint32_t*) &i[u8];
    return result;
}

uint32_t sum3(const struct mystruct *s,uintptr_t n){
    uint32_t result=0;
    const uint8_t* u8 = (const uint8_t*)s;
    for(uintptr_t i=0; i<n*4; i+=4)
      result += *(uint32_t*) (u8+i);
    return result;
}

Messy diff views from Godbolt with gcc x86 -O3: https://godbolt./z/e6G5MGfxT (And yeah this turned into some far less efficient SIMD mess that I'm not even going to try to understand.)

De-referencing is ok since there is no padding, the addresses are aligned and the effective type stored in the memory is uint32_t. Otherwise all manner of other UB could appear here.

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

c - Why does gcc generate different code for s->a[i] and *(s->a+i)? - Stack Overflow

2 Answers 2

与本文相关的文章

评论列表(0)