I compile this with gcc -O1
:
#include <stdint.h>
struct mystruct{
uint32_t arr[1];
uint32_t something_after_arr;
};
uint32_t sum1(const struct mystruct *s,uintptr_t n){
uint32_t result=0;
for(uintptr_t i=0;i<n;++i)result+=s->arr[i];
return result;
}
uint32_t sum2(const struct mystruct *s,uintptr_t n){
uint32_t result=0;
for(uintptr_t i=0;i<n;++i)result+=i[s->arr];
return result;
}
uint32_t sum3(const struct mystruct *s,uintptr_t n){
uint32_t result=0;
for(uintptr_t i=0;i<n;++i)result+=*(s->arr+i);
return result;
}
compiler (…) generated this x86_64 assembly (assembler directives removed):
sum1:
mov eax, 0
test rsi, rsi
je .L1
mov eax, DWORD PTR [rdi]
.L1:
ret
sum2:
mov eax, 0
test rsi, rsi
je .L4
mov eax, DWORD PTR [rdi]
.L4:
ret
sum3:
test rsi, rsi
je .L10
mov eax, 0
mov edx, 0
.L9:
add edx, DWORD PTR [rdi+rax*4]
add rax, 1
cmp rsi, rax
jne .L9
.L7:
mov eax, edx
ret
.L10:
mov edx, 0
jmp .L7
Only sum3
has backwards jumps. sum1
and sum2
only have forward jump (which means no looping).
Why are sum3
and sum1
different and why no loop in sum1
?
I expected sum1
and sum2
and sum3
have same code with loop (like clang (…)).
I compile this with gcc -O1
:
#include <stdint.h>
struct mystruct{
uint32_t arr[1];
uint32_t something_after_arr;
};
uint32_t sum1(const struct mystruct *s,uintptr_t n){
uint32_t result=0;
for(uintptr_t i=0;i<n;++i)result+=s->arr[i];
return result;
}
uint32_t sum2(const struct mystruct *s,uintptr_t n){
uint32_t result=0;
for(uintptr_t i=0;i<n;++i)result+=i[s->arr];
return result;
}
uint32_t sum3(const struct mystruct *s,uintptr_t n){
uint32_t result=0;
for(uintptr_t i=0;i<n;++i)result+=*(s->arr+i);
return result;
}
compiler (https://godbolt./#…) generated this x86_64 assembly (assembler directives removed):
sum1:
mov eax, 0
test rsi, rsi
je .L1
mov eax, DWORD PTR [rdi]
.L1:
ret
sum2:
mov eax, 0
test rsi, rsi
je .L4
mov eax, DWORD PTR [rdi]
.L4:
ret
sum3:
test rsi, rsi
je .L10
mov eax, 0
mov edx, 0
.L9:
add edx, DWORD PTR [rdi+rax*4]
add rax, 1
cmp rsi, rax
jne .L9
.L7:
mov eax, edx
ret
.L10:
mov edx, 0
jmp .L7
Only sum3
has backwards jumps. sum1
and sum2
only have forward jump (which means no looping).
Why are sum3
and sum1
different and why no loop in sum1
?
I expected sum1
and sum2
and sum3
have same code with loop (like clang (https://godbolt./#…)).
2 Answers
Reset to default 7Rather than trying to answer why they are different (which would require looking at the implementation of the compiler), I'll answer why the optimization applied to two of the three cases is valid here.
Given that the array only has one element, the compiler is allowed to assume that the only possible values for n
are 0
or 1
, since anything else would invoke undefined behavior. Because of that, it doesn't need to generate a loop, only a branch.
seems like gcc is aware that
*(s->arr+i)
can overlap withsomething_after_arr
but not aware thats->arr[i]
can overlap withsomething_after_arr
It's the opposite. GCC is not aware (at least, apparently, with -O1
) that *(s->arr+i)
cannot overlap with something_after_arr
, but it is aware that s->arr[i]
can't overlap with it. It's undefined behavior to access outside the bounds of an array, no matter how you spell that access, even if you know that other data may exist at the offset you specify.
Not that defending the principle of undefined behavior is within the scope of this answer, but as a developer, expecting code to access data outside of arr
through it, is unintuitive, and makes that code incomprehensible. And if it's reasonable for a developer reading that code to assume it's not intended, then the compiler should be able to assume it as well, and provide useful optimizations based on that assumption as in the cases you see here.
For me, the moral of the story is to write intuitive code and expect useful optimizations, even if they may sometimes require studying to understand.
I would guess this is a side effect of all 3 versions having undefined behavior for any n
larger than 1 and the only valid index for i
is 0. When going beyond that, the compiler is free to generate any crap it wants.
Contrary to popular belief, C never allowed wild & crazy pointer arithmetic using any type across any random memory space. C only ever allowed pointer arithmetic within the bounds of a declared array, using a type corresponding to the array item type.
Meaning it is not allowed to do uint32_t
pointer arithmetic outside the bounds of the array arr
followed by dereferencing - that's undefined behavior regardless of what happens to exist in memory after the array. It's just how the additive operators are specified.
We may however iterate across the struct using a character pointer, byte by byte. This is thanks to a special rule in C23 6.3.2.3:
When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object.
After fixing your code to remove undefined behavior by doing arithmetic on a character type instead (uint8_t
is treated as a character type by the compiler here), I got the same machine code for all 3 versions:
uint32_t sum1(const struct mystruct *s,uintptr_t n){
uint32_t result=0;
const uint8_t* u8 = (const uint8_t*)s;
for(uintptr_t i=0; i<n*4; i+=4)
result += *(uint32_t*) &u8[i];
return result;
}
uint32_t sum2(const struct mystruct *s,uintptr_t n){
uint32_t result=0;
const uint8_t* u8 = (const uint8_t*)s;
for(uintptr_t i=0; i<n*4; i+=4)
result += *(uint32_t*) &i[u8];
return result;
}
uint32_t sum3(const struct mystruct *s,uintptr_t n){
uint32_t result=0;
const uint8_t* u8 = (const uint8_t*)s;
for(uintptr_t i=0; i<n*4; i+=4)
result += *(uint32_t*) (u8+i);
return result;
}
Messy diff views from Godbolt with gcc x86 -O3: https://godbolt./z/e6G5MGfxT (And yeah this turned into some far less efficient SIMD mess that I'm not even going to try to understand.)
De-referencing is ok since there is no padding, the addresses are aligned and the effective type stored in the memory is uint32_t
. Otherwise all manner of other UB could appear here.
n
is anything but0
, all three versions invoke undefined behavior. So the whole code is kind of pointless in all 3 versions. – Lundin Commented Feb 4 at 13:49n
can be0
or1
. Did you meani
? – Patrick Roberts Commented Feb 4 at 13:53i
. – Lundin Commented Feb 4 at 13:58uintptr_t
forn
? Why notsize_t
? – Some programmer dude Commented Feb 4 at 14:03*(s->arr+i)
is not allowed to overlap withsomething_after_arr
. It is undefined behavior to access that member througharr
no matter how you spell that access. – Patrick Roberts Commented Feb 4 at 14:08