Using the AM3358, I need to do a small amount of filtering on samples being read by the PRU, and I would like to do that filtering within the PRU. I have been looking at the various options and am trying to find the fastest way to get a signed result from a 32x32 signed multiply.
The hardware MAC is unsigned-only, which makes this a bit more difficult. I wrote the multiply in assembly, but I am not very proud of the number of cycles it takes to complete. In my case, I keep the whole 64-bit result for the accumulations (since it is there) and will truncate later.
If anyone has a better way to achieve signed multiplies on an unsigned multiplier, I would like to see it. I am hoping someone has a brilliant observation of signed/unsigned math that suggests a better way.
In my algorithm, I determine the sign of the result by XORing the two multiplicands and saving the msb of that XOR, then I take the absolute value of each multiplicand. After multiplying the two now-non-negative numbers, I pull out the 64-bit result and negate it if the result is supposed to be negative (according to that first XOR).
That takes a lot of cycles, but still less than doing a shift-and-add multiply algorithm for that many bits.
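For anyone who wants to check the approach on a host machine, here is a rough C model of the steps described above. It is only a sketch: the function names are made up for illustration, and `umul32x32` merely stands in for the unsigned hardware multiplier; it is not PRU code.

```c
#include <stdint.h>

/* Hypothetical stand-in for the unsigned-only hardware multiplier:
 * 32 x 32 -> 64 bits, unsigned. */
static uint64_t umul32x32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * (uint64_t)b;
}

/* Signed 32 x 32 -> 64 multiply built on the unsigned multiplier,
 * following the sign-and-magnitude steps described in the post.
 * The name is illustrative only. */
static int64_t smul32x32_sign_magnitude(int32_t si, int32_t y)
{
    uint32_t ua = (uint32_t)si;
    uint32_t ub = (uint32_t)y;

    /* Bit 31 of the XOR gives the sign of the product. */
    uint32_t negative = (ua ^ ub) >> 31;

    /* Absolute values, computed as 0 - x like the RSB instructions.
     * 0x80000000 maps to itself, which is the correct magnitude
     * (2^31) when read as an unsigned number. */
    if (ua >> 31) ua = 0u - ua;
    if (ub >> 31) ub = 0u - ub;

    /* Unsigned multiply of the two magnitudes. */
    uint64_t p = umul32x32(ua, ub);

    /* 64-bit two's-complement negate: invert, then add 1. */
    if (negative) p = ~p + 1u;

    /* Assumes a two's-complement target for this conversion. */
    return (int64_t)p;
}
```

Comparing `smul32x32_sign_magnitude(a, b)` against `(int64_t)a * b` for a handful of values, including `INT32_MIN` and zero, is a quick way to confirm the sign handling before counting cycles on the PRU.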
    ; determine sign of si * y, then abs both args
        XOR     r18, r14, r15       ; r18.t31 = sign of si * y
        QBBC    BCMAA1, r14, 31     ; go around if si >= 0
        RSB     r14, r14, 0         ; si = |si|
    BCMAA1:
        MOV     r28, r14
        QBBC    BCMAA2, r15, 31     ; go around if y >= 0
        RSB     r15, r15, 0         ; y = |y|
    BCMAA2:
        MOV     r29, r15
        MOV     r29, r15            ; delay cycle for MPY to complete
    ;   XOR     r19, r16, r17       ; r19.t31 = sign of co * x, early to save a cycle
        XIN     0, &r26, 2*4        ; get 64-bit unsigned result
        QBBC    BCMAA3, r18, 31     ; skip if result >= 0
        NOT     r20, r26
        NOT     r21, r27
        ADD     r20, r20, 1
        ADC     r21, r21, 0
        QBA     BCMAA4
    BCMAA3:
        MOV     r20, r26
        MOV     r21, r27
    BCMAA4:
This takes up to 14 cycles. Do you have a faster way, please?
Regards,
RandyP