arm - Does the Pico 2 use SIMD instructions or just loop unrolling in arm_dot_prod_f32?

The title says almost everything. Does the RP2350 (Cortex-M33) on the Raspberry Pi Pico 2 use any SIMD instructions at all (via the DSP extension) in arm_dot_prod_f32, or does it simply unroll the loop? I know for certain that the RP2040 doesn't have SIMD, and that the ESP32-S3 uses its own cool dsps_dotprod_f32_ae32, but what about the RP2350?

asked Mar 28 at 18:06 by Gios Xou, edited Mar 28 at 18:28 by artless-noise-bye-due2AI

I don't believe the Cortex-M33 includes the instructions used by arm_dot_prod_f32. – Tim Roberts Commented Mar 28 at 18:32
f32x4_t vecA, vecB; are SIMD vector types and vfmaq() is the intrinsic for a 16-byte vector FMA. But that code is inside #if defined(ARM_MATH_NEON). You can use a disassembler to see whether it uses any d or q registers, or only s registers with scalar FP, which en.wikipedia.org/wiki/ARM_Cortex-M#Cortex-M33 says is the only hardware FPU option on the Cortex-M33. Assuming Wikipedia is correct and the build system defines the appropriate macros for -mcpu=cortex-m33, it will have to use the scalar code paths, which at most unroll a loop. – Peter Cordes Commented Mar 28 at 18:32
Blog on Cortex-M DSP, PDF on Cortex-M DSP. The RP2350 'colophon' you cite says the device has DSP features. MAC (multiply accumulate) is useful for dot product and it is more than loop unrolling. It is not SIMD. (info for code writer, not library users). – artless-noise-bye-due2AI Commented Mar 28 at 18:35
It is also quite possible that CMSIS will translate vfmaq to a DSP instruction via includes (at least for some data types). – artless-noise-bye-due2AI Commented Mar 28 at 18:45

1 Answer

Today my Pico 2 was delivered. I downloaded this Arduino-IDE core and, by adding a few simple #error "messages" around arm_dot_prod_f32(...) in ~/Arduino/libraries/Arduino_CMSIS-DSP/src/arm_math.h, I found that the loop-unrolled code path is not even compiled.
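To make clear what "loop unrolling" means here, below is a minimal sketch of the two scalar variants in plain C. It is not the library source: the function names dot_plain and dot_unrolled4 are made up for illustration, and the unroll factor of four is an assumption based on how such scalar paths are commonly written. Neither variant uses SIMD; the unrolled one just does four multiply-accumulates per loop iteration to cut loop overhead.

```c
#include <stddef.h>

/* Plain scalar loop: one multiply-accumulate per iteration.
   Hypothetical helper, not part of CMSIS-DSP. */
float dot_plain(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Same arithmetic in the same order, unrolled by four.
   Still scalar: each statement is a single multiply-accumulate.
   Hypothetical helper, not part of CMSIS-DSP. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t blocks = n >> 2;          /* number of full groups of four */

    while (blocks > 0) {
        sum += (*a++) * (*b++);
        sum += (*a++) * (*b++);
        sum += (*a++) * (*b++);
        sum += (*a++) * (*b++);
        --blocks;
    }

    size_t tail = n % 4;             /* 0..3 leftover elements */
    while (tail > 0) {
        sum += (*a++) * (*b++);
        --tail;
    }
    return sum;
}
```

Both functions add the products in the same order, so they should return identical results; only the number of loop-counter updates and branches differs.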

Moreover

Even though __ARM_FEATURE_DSP is enabled via -march=armv8-m.main+fp+dsp and -mcpu=cortex-m33, as seen in boards.txt (and confirmed via #pragma message), compiling a sketch with:

#define ARM_MATH_NEON results in incompatibility errors
#define ARM_MATH_MVEF -> #error "MVE feature not supported"
#define LOOPUNROLL ON or #define ARM_MATH_LOOPUNROLL does nothing

Therefore I either have to add -DLOOPUNROLL=ON in boards.txt or find out whether something else is supposed to make it work.
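A quick way to see which vector features the compiler itself advertises, independent of the ARM_MATH_* library macros, is to test the predefined ACLE feature macros. The sketch below is only an illustration: the function name simd_path and the returned strings are made up, while __ARM_FEATURE_MVE, __ARM_NEON and __ARM_FEATURE_DSP are the compiler-defined macros (bit 1 of __ARM_FEATURE_MVE indicates floating-point MVE).

```c
/* Reports the widest vector/DSP feature the compiler advertises for the
   current target. Hypothetical helper; the strings are illustrative. */
const char *simd_path(void)
{
#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 2)
    return "MVE with float support (Helium)";
#elif defined(__ARM_NEON)
    return "NEON (Advanced SIMD)";
#elif defined(__ARM_FEATURE_DSP)
    return "DSP extension only (integer DSP instructions, scalar float)";
#else
    return "no Arm vector/DSP feature macros defined";
#endif
}
```

In a sketch this could be printed with Serial.println(simd_path());. Given the flags above, I would expect the DSP-only branch on the RP2350, which matches the errors seen with ARM_MATH_NEON and ARM_MATH_MVEF, but I have not verified the output on the board.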

Results

I ran this crude example a few times, both with loop unrolling (by manually editing the source) and with the normal loop, using -Ofast vs -O0 (optimization disabled):

#include <arm_math.h>

void setup() {
  Serial.begin(9600);

  float x[500];
  float y[500];
  float dest;

  for (int i=0; i<500; ++i){
    x[i] = (i+1)/1000.0;
    y[i] = i/1100.0;
  } 

  unsigned long startTime = micros(); 

  for (int i=0; i<500; ++i)
    arm_dot_prod_f32(x, y, i, &dest); 

  Serial.print(micros() - startTime);
  Serial.print(" microseconds | ");
  Serial.println(dest,7);
}

void loop() {}  // required by the Arduino core; the benchmark runs once in setup()

For -O0 (optimization disabled), the average results were:

Normal loop	Loop unrolled	Difference	N (size of x and y, loop bound for i)
20190 μs	14656 μs	5534 μs	500
12985 μs	9457 μs	3528 μs	400
7336 μs	5365 μs	1971 μs	300
3289 μs	2441 μs	848 μs	200
861 μs	657 μs	204 μs	100
220 μs	178 μs	42 μs	50
64 μs	54 μs	10 μs	25

(TODO: -Ofast)
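To put the -O0 numbers in perspective: the benchmark calls arm_dot_prod_f32 with block sizes 0, 1, ..., N-1, so the total work is N(N-1)/2 multiply-accumulates, which is why the times grow roughly quadratically with N. The helpers below are a small sketch of that arithmetic; the function names are made up, and the per-operation figure also includes call and timer overhead, so treat it as a rough average only.

```c
/* Total multiply-accumulates when the dot product is called with
   block sizes 0, 1, ..., n-1 (as in the benchmark loop above). */
unsigned long total_macs(unsigned long n)
{
    return n * (n - 1UL) / 2UL;
}

/* Average nanoseconds per multiply-accumulate for a measured total
   time in microseconds. Requires n >= 2 so that total_macs(n) > 0. */
double ns_per_mac(double total_us, unsigned long n)
{
    return total_us * 1000.0 / (double)total_macs(n);
}

/* Speedup factor of the unrolled loop over the normal loop. */
double speedup(double normal_us, double unrolled_us)
{
    return normal_us / unrolled_us;
}
```

For N = 500 this gives 124750 multiply-accumulates, about 162 ns each for the normal loop, and a speedup of roughly 1.38x from unrolling; for N = 25 the speedup drops to roughly 1.19x, presumably because fixed per-call overhead dominates the short vectors.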

Conclusion

Loop unrolling has an effect, but unfortunately it isn't compiled in by default, and I wasn't able to enable it without editing the source. Also, don't take my word for it; I might still be wrong. But I tried my best.
