The title says almost everything. Does the RP2350 (Cortex-M33, Raspberry Pi Pico 2) use any SIMD instructions at all via its DSP extension during `arm_dot_prod_f32`, or does it simply unroll the loop? I know for certain that the RP2040 doesn't have SIMD, and that the ESP32-S3 uses its own cool `dsps_dotprod_f32_ae32`, but for the RP2350...
1 Answer
Today I had my Pico 2 delivered. I downloaded this Arduino-IDE core and, by adding a few simple `#error "messages"` at `arm_dot_prod_f32(...)` in `~/Arduino/libraries/Arduino_CMSIS-DSP/src/arm_math.h`, I figured out that it doesn't even compile the loop-unrolled path.
Moreover, even though `__ARM_FEATURE_DSP` is enabled via `-march=armv8-m.main+fp+dsp` and `-mcpu=cortex-m33`, as seen in boards.txt (and tested via `#pragma message`), compiling a sketch with:

- `#define ARM_MATH_NEON` results in incompatibility errors
- `#define ARM_MATH_MVEF` -> `#error "MVE feature not supported"`
- `#define LOOPUNROLL ON` or `#define ARM_MATH_LOOPUNROLL` does nothing

Therefore I either have to add `-DLOOPUNROLL=ON` to boards.txt or see if anything else is supposed to make it work.
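The macro checks described above can be sketched as a small helper that reports which ACLE feature macros the compiler predefines for the current target. `__ARM_FEATURE_DSP`, `__ARM_FEATURE_MVE`, `__ARM_NEON` and `__ARM_FP` are standard ACLE macros; the `HAS_*` names and `format_arm_features` are hypothetical and not part of CMSIS-DSP. Note that the Cortex-M33 DSP extension provides integer SIMD on packed 8/16-bit values, while NEON belongs to Cortex-A cores and MVE (Helium) to cores such as the Cortex-M55, which would be consistent with the errors listed.

```c
#include <stdio.h>

/* ACLE feature macros are predefined by GCC/Clang from -mcpu / -march. */
#if defined(__ARM_FEATURE_DSP)
#define HAS_DSP 1   /* integer DSP extension (packed 8/16-bit SIMD) */
#else
#define HAS_DSP 0
#endif

#if defined(__ARM_FEATURE_MVE)
#define HAS_MVE 1   /* M-profile Vector Extension (Helium) */
#else
#define HAS_MVE 0
#endif

#if defined(__ARM_NEON)
#define HAS_NEON 1  /* Advanced SIMD (A-profile) */
#else
#define HAS_NEON 0
#endif

#if defined(__ARM_FP)
#define HAS_FPU 1   /* hardware floating point */
#else
#define HAS_FPU 0
#endif

/* Hypothetical helper: writes the flags into buf and returns snprintf's result. */
static int format_arm_features(char *buf, size_t len)
{
    return snprintf(buf, len, "DSP=%d MVE=%d NEON=%d FPU=%d",
                    HAS_DSP, HAS_MVE, HAS_NEON, HAS_FPU);
}
```

In a sketch, this could be called from `setup()` with a small `char` buffer and the result passed to `Serial.println`. With the flags quoted above one would expect the DSP and FPU flags to be set and the MVE and NEON flags to be clear, but that is worth confirming on the actual toolchain.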
Results
I ran this poor example a few times, both with loop unrolling (by manually editing the source) and with the normal loop, using `-Ofast` vs `-O0` (optimizations disabled).
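    #include <arm_math.h>

    void setup() {
      Serial.begin(9600);

      // Two 500-element input vectors and the dot-product result.
      float x[500];
      float y[500];
      float dest;

      for (int i = 0; i < 500; ++i) {
        x[i] = (i + 1) / 1000.0;
        y[i] = i / 1100.0;
      }

      // Time 500 calls, using i as the block size (0 to 499 elements).
      unsigned long startTime = micros();
      for (int i = 0; i < 500; ++i)
        arm_dot_prod_f32(x, y, i, &dest);

      Serial.print(micros() - startTime);
      Serial.print(" microseconds | ");
      Serial.println(dest, 7);
    }

    // Nothing to repeat; the benchmark runs once in setup().
    void loop() {}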
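Because the benchmark passes the loop counter `i` as the block size, the total work grows roughly quadratically with the array size rather than linearly. A minimal sketch of that arithmetic, with the hypothetical helper names `total_macs` and `us_per_mac`:

```c
/* Total multiply-accumulates when a dot product is called with
   block sizes 0, 1, ..., n-1, as the benchmark loop does. */
static unsigned long total_macs(unsigned long n)
{
    return n * (n - 1) / 2;
}

/* Average time per multiply-accumulate for a measured total in microseconds.
   Call overhead is folded in, so small n will look slower per element. */
static double us_per_mac(unsigned long total_us, unsigned long n)
{
    return (double)total_us / (double)total_macs(n);
}
```

For a size of 500 this gives 124750 multiply-accumulates in total, so a measured time is better read as a per-element cost than as the cost of a single 500-element dot product.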
For `-O0` (optimizations disabled), the average results were:
Normal loop | Loop-unrolled | Difference | Size of x, y and loop count i |
---|---|---|---|
20190 μs | 14656 μs | 5534 μs | 500 |
12985 μs | 9457 μs | 3528 μs | 400 |
7336 μs | 5365 μs | 1971 μs | 300 |
3289 μs | 2441 μs | 848 μs | 200 |
861 μs | 657 μs | 204 μs | 100 |
220 μs | 178 μs | 42 μs | 50 |
64 μs | 54 μs | 10 μs | 25 |
(TODO: `-Ofast`)
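The difference column is absolute; dividing the two timing columns gives the relative speedup instead. The sketch below uses only the `-O0` numbers from the table, and the array and function names are hypothetical. It works out to roughly 1.38x at size 500, falling to roughly 1.19x at size 25.

```c
#include <stdio.h>

/* -O0 timings from the table above, in microseconds. */
static const unsigned normal_us[]   = {20190, 12985, 7336, 3289, 861, 220, 64};
static const unsigned unrolled_us[] = {14656,  9457, 5365, 2441, 657, 178, 54};
static const unsigned sizes[]       = {500, 400, 300, 200, 100, 50, 25};

/* Ratio of normal-loop time to loop-unrolled time. */
static double speedup(unsigned normal, unsigned unrolled)
{
    return (double)normal / (double)unrolled;
}

/* Prints one "size: ratio" line per table row. */
static void print_speedups(void)
{
    for (size_t k = 0; k < sizeof sizes / sizeof sizes[0]; ++k)
        printf("%u: %.2fx\n", sizes[k], speedup(normal_us[k], unrolled_us[k]));
}
```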
Conclusion
Loop unrolling has an effect, but unfortunately it isn't compiled in by default, or at least I wasn't able to enable it without editing the source. Also, don't take my word for it; I might still be wrong. But I tried my best.
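For readers unfamiliar with the two code paths being compared, here is a minimal sketch of a plain scalar dot product next to a 4x unrolled one. It is a simplified illustration and not the CMSIS-DSP source; `dot_plain` and `dot_unrolled4` are hypothetical names. Both versions use only scalar floating-point operations, so unrolling mainly reduces loop overhead per element rather than adding any SIMD.

```c
#include <stddef.h>

/* Plain loop: one multiply-accumulate and one loop test per element. */
static float dot_plain(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* 4x unrolled: four multiply-accumulates per loop test,
   followed by a tail loop for the remaining 0 to 3 elements. */
static float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        sum += a[i]     * b[i];
        sum += a[i + 1] * b[i + 1];
        sum += a[i + 2] * b[i + 2];
        sum += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)
        sum += a[i] * b[i];

    return sum;
}
```

Both functions accumulate in the same order, so they should return the same value for the same input; only the number of loop-control operations differs.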
… `arm_dot_prod_f32`. – Tim Roberts Commented Mar 28 at 18:32

`f32x4_t vecA, vecB;` are SIMD vector types and `vfmaq()` is the intrinsic for a 16-byte vector FMA. But that code is inside `#if defined(ARM_MATH_NEON)`. You can use a disassembler to see if it uses any `d` or `q` registers, or if it only uses `s` registers with scalar FP, like en.wikipedia./wiki/ARM_Cortex-M#Cortex-M33 says is the only hardware FPU option on Cortex-M33. Assuming Wikipedia is correct and the build system defines the appropriate macros for `-mcpu=cortex-m33`, it will have to use the scalar code paths that at most unroll a loop. – Peter Cordes Commented Mar 28 at 18:32

… `vfmaq` to a DSP instruction via includes (at least for some data types). – artless-noise-bye-due2AI Commented Mar 28 at 18:45