Hi,
I have some code that performs a lot of multiplications between elements of a table. I wanted to optimize it by using the _qmpysp intrinsic.
The problem is that the cycle count goes up when I use this intrinsic followed by four _get32f_128() calls (to read back my values).
Does this mean that I am not using the intrinsics correctly, that my code is far from optimized even without the DSP intrinsics, or something else?
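For reference, the pattern I am timing looks roughly like the sketch below. The helper name mul4 and the array layout are made up for illustration. The intrinsic signatures and the lane order (first argument of _fto128 being the top lane) are my understanding of c6x.h, and the _TMS320C6600 guard macro is what I believe the TI compiler predefines, so please treat all of these as assumptions. The #else branch is a plain C fallback so the snippet also builds off-target.

```c
#if defined(_TMS320C6600)
#include <c6x.h>   /* TI intrinsics; only available with the TI compiler */
#endif

/* Hypothetical helper: out[i] = a[i] * b[i] for i = 0..3. */
static void mul4(const float *a, const float *b, float *out)
{
#if defined(_TMS320C6600)
    /* Assumption: _fto128 takes the most significant lane first. */
    __x128_t va = _fto128(a[3], a[2], a[1], a[0]);
    __x128_t vb = _fto128(b[3], b[2], b[1], b[0]);

    /* Four single-precision products in one intrinsic. */
    __x128_t p = _qmpysp(va, vb);

    /* The four read-backs whose cost I am measuring. */
    out[0] = _get32f_128(p, 0);
    out[1] = _get32f_128(p, 1);
    out[2] = _get32f_128(p, 2);
    out[3] = _get32f_128(p, 3);
#else
    /* Off-target fallback: plain scalar multiplies. */
    int i;
    for (i = 0; i < 4; i++) {
        out[i] = a[i] * b[i];
    }
#endif
}
```

In my real code the inputs come from the table rather than from two small arrays, but the multiply and the four read-backs are the same.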
I measured that the four calls that read back my values take 28 cycles, so I wanted to compare that number with extracting the values using a shift and a mask instead.
This is how I tried to do it:
Starting from a __x128_t value, which I will call "a" here, I write output = a >> 96 & 0xFFFFFFFF to get the first value of the __x128_t. But the compiler reports the error "expression must have integral type". So, how can I do that?
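To show what I mean by the shift and mask, here is a portable sketch of the extraction I am after. The names vec128_model, lane_bits and lane_float are hypothetical and made up for this post. The struct only models a 128-bit value as four 32-bit words; it is not the real __x128_t, on which the compiler rejects the shift. The float reinterpretation assumes 32-bit IEEE 754 floats.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit container (not TI's __x128_t):
 * w[0] holds bits 0..31, w[3] holds bits 96..127. */
typedef struct {
    uint32_t w[4];
} vec128_model;

/* What (a >> (32 * lane)) & 0xFFFFFFFF would yield if "a" were a
 * 128-bit integer: simply the selected 32-bit word. */
static uint32_t lane_bits(const vec128_model *a, unsigned lane)
{
    return a->w[lane & 3u];
}

/* Reinterpret the selected word as a float without changing its bits.
 * memcpy avoids the aliasing problems of a pointer cast. */
static float lane_float(const vec128_model *a, unsigned lane)
{
    uint32_t bits = lane_bits(a, lane);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

So a >> 96 & 0xFFFFFFFF in my attempt corresponds to lane_bits(&a, 3) in this model, since >> binds tighter than & in C.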
Thanks in advance for your answers.
Best regards,
Alex