
c++ - Is GCC/Clang able to auto-vectorize std::inner_product? - Stack Overflow


I have the following code:

#include <iostream>
#include <numeric>
int main() {
    volatile float
        a0[4] = {1, 2, 3, 4},
        a1[4] = {4, 5, 6, 7};
    std::cout << std::inner_product(a0, a0 + 4, a1, 0.0F) << std::endl;
    return 0;
}

When I compile the code with -O3 -msse2 using GCC or Clang, I cannot find evidence of vectorization in the output code.

  • In the Clang version there are four mulss instructions and no looping instructions, suggesting that the multiplications are being performed individually.
  • In the GCC version there are looping instructions such as jne and je, which should not appear if vectorization is happening.
  • If the code is being properly vectorized, there should be a single vector multiplication instruction that computes the element-wise product of corresponding elements in the arrays all at once, before summing them up for the inner product.
  • Note about volatile: The arrays were declared volatile to prevent the compiler from simply evaluating inner_product at compile time. I also figured this might prevent the desired optimization, so I tried a different version without volatile but the result is the same, no evidence of vector multiplication instructions.

If GCC and Clang are able to auto-vectorize std::inner_product (which might not be the case, in which case the answer is just 'You cannot'), what are the necessary/correct compiler flags to do so? Are there any vectorization-friendly adjustments to my code (preferably portable ones) that are necessary? Such as ensuring that the data is aligned to the size of a SIMD register, as a guess?


Asked Feb 5 at 18:17 by CPlus, edited Feb 5 at 18:27
  • 3 You cannot vectorize a floating point sum without -ffast-math since addition is not associative. Also, -msse2 is very outdated and only relevant to 32 bit code. Every 64 bit CPU supports SSE2. The compiler knows that. -march=x86-64-v2 or -v3 are better baselines today – Homer512 Commented Feb 5 at 18:49
  • 1 Note in regard to this: Can -ffast-math be safely used on a typical project? Personally I always limit its use to specific compilation units that only contain code that I know will not be negatively affected. Or I simply use OpenBLAS, Eigen, etc which are better at vectorization anyway – Homer512 Commented Feb 5 at 18:55
  • @Homer512 -ffast-math does seem to work, but I do not really understand why the multiply part is forbidden from vectorization without -ffast-math. The values being multiplied with each other are independent from one another, so I see no opportunity for rounding errors. – CPlus Commented Feb 5 at 19:00
  • 2 Multiplication alone would be vectorized (e.g. an std::transform(…, std::multiplies<float>{})). But extracting scalars from the vector for the final addition is too costly so it doesn't make sense in an inner product – Homer512 Commented Feb 5 at 19:01
  • @Homer512 So the reason is: The performance penalty of unpacking the vector to perform the un-vectorized addition is not worth the performance gain of vectorizing the first step of multiplication? – CPlus Commented Feb 5 at 19:04

1 Answer


Using comments on the question and further experimentation, I was able to find the answer.

The reason is that vectorizing some floating-point operations, including the summation used in inner_product, requires -ffast-math, because reordering the additions can introduce rounding differences that performing them in the correct order, one at a time, would not introduce.

That the problem lies with float can be further shown by using int instead, since integer addition is associative:

#include <iostream>
#include <numeric>
int main() {
    int a0[4], a1[4];
    std::cin >> a0[0] >> a0[1] >> a0[2] >> a0[3] >> a1[0] >> a1[1] >> a1[2] >> a1[3];
    std::cout << std::inner_product(a0, a0 + 4, a1, 0) << std::endl;
    return 0;
}

With just -O3 or -Os, this produces assembly that appears to at least partially vectorize the operation:

pshufd  xmm2, xmm0, 245
pmuludq xmm0, xmm1
pshufd  xmm0, xmm0, 232
pshufd  xmm1, xmm1, 245
pmuludq xmm1, xmm2
pshufd  xmm1, xmm1, 232
punpckldq       xmm0, xmm1
pshufd  xmm1, xmm0, 238
paddd   xmm1, xmm0
pshufd  xmm0, xmm1, 85
paddd   xmm0, xmm1

To get the float version to vectorize, I can use -ffast-math to tell the compiler to optimize the code regardless of the possibility of rounding differences:

movaps  xmm0, xmmword ptr [rsp]
mulps   xmm0, xmmword ptr [rsp + 16]
movaps  xmm1, xmm0
unpckhpd        xmm1, xmm0
addps   xmm1, xmm0
movaps  xmm0, xmm1
shufps  xmm0, xmm1, 85
addss   xmm0, xmm1