I have the following code:
#include <iostream>
#include <numeric>
int main() {
volatile float
a0[4] = {1, 2, 3, 4},
a1[4] = {4, 5, 6, 7};
std::cout << std::inner_product(a0, a0 + 4, a1, 0.0F) << std::endl;
return 0;
}
When I compile the code with `-O3 -msse2` with GCC or Clang, I could not find evidence of vectorization in the output code.
- In the Clang version there are four `mulss` instructions and no looping instructions, suggesting that the multiplications are being performed individually.
- In the GCC version there are looping instructions such as `jne` and `je`, which there should not be if vectorization is happening.
- If the code were properly vectorized, there should be a single vector multiplication instruction that computes the element-wise product of corresponding elements in the arrays all at once, before summing them up for the inner product.
- Note about `volatile`: The arrays were declared `volatile` to prevent the compiler from simply evaluating `inner_product` at compile time. I also figured this might prevent the desired optimization, so I tried a different version without `volatile`, but the result is the same: no evidence of vector multiplication instructions.
If GCC and Clang are able to auto-vectorize `std::inner_product` (which might not be the case, in which case the answer is just "You cannot"), what are the necessary/correct compiler flags to do so? Are there any vectorization-friendly adjustments to my code (preferably portable ones) that are necessary, such as ensuring that the data is aligned to the size of a SIMD register, as a guess?
1 Answer
Using comments on the question and further experimentation, I was able to find the answer.
The reason is that vectorizing some `float` operations, including the reduction used in `inner_product`, requires `-ffast-math` to be enabled: the vectorized version regroups the additions, which can introduce rounding differences that performing them in the specified order, one at a time, would not introduce.
That the problem lies with `float` specifically can be further shown by using `int` instead:
#include <iostream>
#include <numeric>
int main() {
int a0[4], a1[4];
std::cin >> a0[0] >> a0[1] >> a0[2] >> a0[3] >> a1[0] >> a1[1] >> a1[2] >> a1[3];
std::cout << std::inner_product(a0, a0 + 4, a1, 0) << std::endl;
return 0;
}
With just `-O3` or `-Os`, this produces assembly that appears to at least partially vectorize the operation:
pshufd xmm2, xmm0, 245
pmuludq xmm0, xmm1
pshufd xmm0, xmm0, 232
pshufd xmm1, xmm1, 245
pmuludq xmm1, xmm2
pshufd xmm1, xmm1, 232
punpckldq xmm0, xmm1
pshufd xmm1, xmm0, 238
paddd xmm1, xmm0
pshufd xmm0, xmm1, 85
paddd xmm0, xmm1
To get the `float` version to vectorize, I can use `-ffast-math` (Demo) to tell the compiler to optimize the code regardless of the possibility of rounding differences:
movaps xmm0, xmmword ptr [rsp]
mulps xmm0, xmmword ptr [rsp + 16]
movaps xmm1, xmm0
unpckhpd xmm1, xmm0
addps xmm1, xmm0
movaps xmm0, xmm1
shufps xmm0, xmm1, 85
addss xmm0, xmm1
Comments:
- You need `-ffast-math`, since float addition is not associative. Also, `-msse2` is very outdated and only relevant to 32-bit code. Every 64-bit CPU supports SSE2, and the compiler knows that. `-march=x86-64-v2` or `-v3` are better baselines today. – Homer512, Feb 5 at 18:49
- `-ffast-math` does seem to work, but I do not really understand why the multiply part is forbidden from vectorization without `-ffast-math`. The values being multiplied with each other are independent from one another, so I see no opportunity for rounding errors. – CPlus, Feb 5 at 19:00
- The multiply part on its own could be vectorized, e.g. via `std::transform(…, std::multiplies<float>{})`. But extracting scalars from the vector for the final addition is too costly, so it doesn't make sense in an inner product. – Homer512, Feb 5 at 19:01