Hi,
I want to write a function that computes the sum of products of 16-bit operands. Because I want it to be optimized, I have written it in linear assembly. The code is below:
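        .def    _SOP
_SOP:   .cproc  vec1, vec2, n, qvec1, qvec2, qout
        .reg    i, L_temp
        ZERO    L_temp                  ; accumulator = 0
        MVK     2, i                    ; loop count fixed at 2 (n is not used)
Loop1:  LDHU    *vec1++, A0             ; 1st element of vec1 in this group of four
        LDHU    *vec2++, A1             ; 1st element of vec2
        LDHU    *vec1++, A2             ; 2nd element of vec1
        LDHU    *vec2++, A3             ; 2nd element of vec2
        LDHU    *vec1++, B0             ; 3rd element of vec1
        LDHU    *vec2++, B1             ; 3rd element of vec2
        LDHU    *vec1++, B2             ; 4th element of vec1
        LDHU    *vec2++, B3             ; 4th element of vec2
        SHL     A0, 16, A0              ; move the 1st elements to the upper halfword
        SHL     B0, 16, B0              ; move the 3rd elements to the upper halfword
        SHL     A1, 16, A1
        SHL     B1, 16, B1
        OR      A0, A2, A0              ; pack 1st:2nd elements of vec1
        OR      B0, B2, B0              ; pack 3rd:4th elements of vec1
        OR      A1, A3, A1              ; pack 1st:2nd elements of vec2
        OR      B1, B3, B1              ; pack 3rd:4th elements of vec2
        SMPY2   A0, A1, A5:A4           ; two saturating 16x16 multiplies (A side)
        SMPY2   B0, B1, B5:B4           ; two saturating 16x16 multiplies (B side)
        SADD    A4, A5, A5              ; sum of the two A-side products
        SADD    B4, B5, B5              ; sum of the two B-side products
        SADD    A5, B5, A5              ; sum of all four products
        SADD    L_temp, A5, L_temp      ; accumulate with saturation
        SUB     i, 1, i                 ; decrement loop counter
 [i]    B       Loop1                   ; branch while i != 0
        .return L_temp
        .endproc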
vec1 and vec2 hold 16-bit operands.
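For checking the assembly output against known values, a plain C model of what the loop above appears to compute may be useful. This is only a sketch: the names sop_ref, smpy16 and sadd32 are made up for illustration, it assumes the operands are signed 16-bit values and that n is a multiple of 4, and it follows my reading of SMPY2 (product doubled, then saturated) and SADD (saturating 32-bit add).

```c
#include <assert.h>
#include <stdint.h>

/* Saturating 32-bit add, modelling SADD. */
static int32_t sadd32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* One half of SMPY2 (assumed behaviour): (a * b) << 1, clamped to INT32_MAX.
   The only overflow case is a == b == -32768. */
static int32_t smpy16(int16_t a, int16_t b)
{
    int64_t p = ((int64_t)a * b) * 2;
    return (p > INT32_MAX) ? INT32_MAX : (int32_t)p;
}

/* Hypothetical reference model; n must be a multiple of 4.
   The additions are done in the same order as in the assembly loop. */
int32_t sop_ref(const int16_t *vec1, const int16_t *vec2, int n)
{
    int32_t acc = 0;
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        int32_t a = sadd32(smpy16(vec1[i + 1], vec2[i + 1]),
                           smpy16(vec1[i], vec2[i]));
        int32_t b = sadd32(smpy16(vec1[i + 3], vec2[i + 3]),
                           smpy16(vec1[i + 2], vec2[i + 2]));
        acc = sadd32(acc, sadd32(a, b));
    }
    return acc;
}
```

Since the assembly loop count is fixed at 2, calling sop_ref with n = 8 should correspond to what _SOP returns for the same inputs, if the assumptions above hold.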
The basic idea is to use the functional units (.M, .S, .L, .D) in parallel. But when I check the generated assembly for this function, the compiler does not make the instructions parallel.
Any help would be highly appreciated.