Hello, everyone. Once again I'm looking for help optimizing an algorithm written in C6000 assembly language for the TMS320C6670 I'm developing on.
My goal is to build a sinewave generator that runs at the highest possible speed, based on a marginally stable recursive (IIR) filter as explained in the SPRA708 application note (http://www.ee.ic.ac.uk/pcheung/teaching/ee3_Study_Project/Sinewave%20generation%28708%29.pdf).
The C function has a declaration like this:
void singen_asm(int32_t* output, int32_t* seed, int32_t* coef, int32_t np);
Here output is the signal buffer, seed holds the initial values of the filter output, coef holds the filter coefficients, and np is the number of points to calculate.
The operation performed in every iteration, with fixed-point shifts of q1 = 29 and q2 = 59, is:
y[n] = ((A1*y[n-1]) >> q1) + ((A2*y[n-1]) >> q2) - y[n-2]
As a result, the sinewave output has 30 bits of precision (not counting the sign bit) instead of 31, which leaves headroom to avoid an overflow in the y accumulation.
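To make the recurrence concrete, here is a minimal C sketch of it. It is only an illustrative model, not necessarily the C version mentioned in this post: the name singen_ref is hypothetical, and the layout seed[0] = y[n-1], seed[1] = y[n-2], coef[0] = A1, coef[1] = A2 is an assumption read off the register usage in the assembly. It also assumes that >> on a negative signed value is an arithmetic shift, which C leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference model of the recurrence
 *   y[n] = ((A1*y[n-1]) >> 29) + ((A2*y[n-1]) >> 59) - y[n-2]
 * Assumed layout: seed[0] = y[n-1], seed[1] = y[n-2],
 *                 coef[0] = A1,     coef[1] = A2. */
void singen_ref(int32_t *output, const int32_t *seed,
                const int32_t *coef, int32_t np)
{
    assert(output != 0 && seed != 0 && coef != 0 && np >= 0);

    int32_t y1 = seed[0];   /* y[n-1] */
    int32_t y2 = seed[1];   /* y[n-2] */

    for (int32_t n = 0; n < np; n++) {
        /* Widen to 64 bits before multiplying so the product cannot overflow. */
        int64_t main_term = ((int64_t)coef[0] * y1) >> 29;  /* q1 = 29 */
        int64_t corr_term = ((int64_t)coef[1] * y1) >> 59;  /* q2 = 59 */

        /* Assumes the result fits in 32 bits (the headroom discussed above). */
        int32_t y = (int32_t)(main_term + corr_term - y2);

        output[n] = y;

        /* Shift the filter memory. */
        y2 = y1;
        y1 = y;
    }
}
```

As a quick check of the model, coef = {1 << 29, 0} represents a coefficient of exactly 1.0, and with seed = {0, -1000} the output repeats with period 6: 1000, 1000, 0, -1000, -1000, 0, ...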
I implemented a C version of the algorithm and an assembly version. The assembly version is as follows:
;; Function prototype
;; void singen_asm(int32_t* output, int32_t* seed, int32_t* acoef, int32_t np);
.global singen_asm
.sect ".text:appFastCode"
.map output/A4, seed/B4, acoef/A6, np/B6
.map np_cnt/A2
.map acoef1/A24, acoef2/B24, acuma/A28, acumb/B28, acumah2/A27, acumah1/A26, acumbh2/B27, acumbh1/B26, ycuma2/A31, ycuma1/A30, aux1/A16
; C/C++ interface params.
.map rtnptr/B3, stkptr/B15
.asmfunc
singen_asm:
;Disable interrupts.
DINT
|| SUB np, 01, np_cnt
; Load the seed from memory as a double word.
|| LDDW .D2 *seed, acumbh2:acumbh1
; Load the first position of acoef.
|| LDW .D1 *acoef++, acoef1
; Load the second position of acoef.
LDW .D1 *acoef, acoef2
NOP 4
; Initialize the accumulators acuma and acumb with seed[0] (from acumbh1).
MV acumbh1, ycuma1
|| MV acumbh1, acumb
MV acumbh2, ycuma2
|| MV ycuma1, acuma
singen_main_loop:
; Multiply acuma and acumb by the coefficients acoef1 and acoef2.
MPY32 .M1 acoef1, acuma, acumah2:acumah1
|| MPY32 .M2 acoef2, acumb, acumbh2:acumbh1
NOP 3
; Shift the product result.
SHRU .S1 acumah1, 29, acumah1
|| SHR .S2 acumbh2, 27, acumbh1
; Branch back to the loop start while np_cnt is nonzero (more points remain to be calculated).
[np_cnt] B singen_main_loop
; Subtract 1 from the counter np_cnt.
|| SUB .L1 np_cnt, 1, np_cnt
SHL .S1 acumah2, 3, acumah2
OR .S1 acumah2, acumah1, acumah1
; Bring the error-correction term over from datapath B.
|| MV acumbh1, acuma
; Subtract ycuma2 from the partial result.
SUB .L1 acumah1, ycuma2, acumah1
; Move ycuma1 to ycuma2 (shift the filter memory: y[n-2] = y[n-1]).
|| MV ycuma1, ycuma2
; Accumulate the partial results from datapaths A and B.
ADD .L1 acuma, acumah1, acuma
; Store the result into the output buffer.
STW .D1 acuma, *output++
; Move the result to ycuma1 and to acumb.
|| MV acuma, ycuma1
|| MV acuma, acumb
singen_end:
;Enable interrupts.
RINT
;Branch to return address.
B rtnptr
NOP 5
.endasmfunc
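The shift steps in the main loop can be modeled in C as a sanity check. The sketch below assumes the product register pair holds the high word in the odd register and the low word in the even register (acumah2:acumah1), so SHRU by 29 on the low word, SHL by 3 on the high word, and OR together yield the low 32 bits of the 64-bit product shifted right by 29, while SHR by 27 on the high word alone yields the product shifted right by 59. The function names are hypothetical, and arithmetic right shift of negative values is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits of (prod >> 29), built from the two 32-bit halves:
 * models SHRU lo,29 / SHL hi,3 / OR on datapath A. */
int32_t shift29_from_halves(int64_t prod)
{
    uint32_t lo = (uint32_t)prod;                    /* low word  (acumah1) */
    uint32_t hi = (uint32_t)((uint64_t)prod >> 32);  /* high word (acumah2) */
    return (int32_t)((hi << 3) | (lo >> 29));
}

/* prod >> 59, taken from the high word only (59 - 32 = 27):
 * models SHR hi,27 on datapath B. */
int32_t shift59_from_high(int64_t prod)
{
    int32_t hi = (int32_t)(prod >> 32);              /* high word (acumbh2) */
    return hi >> 27;
}

/* Confirms that both constructions agree with a direct 64-bit shift. */
void check_shift_models(int64_t prod)
{
    assert((uint32_t)shift29_from_halves(prod) == (uint32_t)(prod >> 29));
    assert(shift59_from_high(prod) == (int32_t)(prod >> 59));
}
```

Calling check_shift_models with a few positive and negative 64-bit products is enough to see that the two-register sequence and the plain 64-bit shifts give the same values.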
The branch-based assembly version generates approximately 1 output every 11 clock cycles, but I would like to bring that down to about 6 cycles. I know this is possible by using the SPLOOP structure to accelerate the main loop, but I'm not very familiar with that option. In some attempts I couldn't even build the test program because of resource conflicts, and in others the result was not what I expected.
Here is the latest version using the SPLOOP structure. This one builds and runs but doesn't generate a sinewave.
;; Function prototype
;; void singen_asm(int32_t* output, int32_t* seed, int32_t* acoef, int32_t np);
.global singen_asm
.sect ".text:appFastCode"
.map output/A4, seed/B4, acoef/A6, np/B6
.map np_cnt/B0
.map outputb/B2, acoef1/A24, acoef2/B24, acuma/A28, acumb/B28, acumah2/A27, acumah1/A26, acumbh2/B27, acumbh1/B26, ycuma2/A31, ycuma1/A30, aux1/A16
; C/C++ interface params.
.map rtnptr/B3, stkptr/B15
.asmfunc
singen_asm:
;Disable interrupts in code since code is software pipelined.
;; DINT
;; || SUB np, 01, np_cnt
;; || LDW .D2 *seed, ycumab1
;; NOP 4
;; MV ycumab1, ycuma1
;; || MVKL 0x0000FFFF, aux1
;; SHR .S1 ycuma1, 16, ycuma2
;; || AND ycuma1, aux1, acuma
;Disable interrupts.
DINT
;Subtract 1 from the counter np (disabled here; np is copied unchanged instead).
;|| SUB np, 01, np_cnt
|| MV np, np_cnt
;Load a double word (64 bits) from seed into the acumbh register pair.
|| LDDW .D2 *seed, acumbh2:acumbh1
;Load a word (32 bits) from acoef into register acoef1 and post-increment.
|| LDW .D1 *acoef++, acoef1
;Load a word (32 bits) from acoef into register acoef2.
LDW .D1 *acoef, acoef2
|| MV output, outputb
NOP 4
;Move the contents from datapath B to A (registers ycuma1 and ycuma2).
;Initialize the accumulators acuma and acumb with acumbh1 (seed[0] -> y[n-1]).
MV acumbh1, ycuma1
|| MV acumbh1, acumb
MV acumbh2, ycuma2
|| MV ycuma1, acuma
singen_main_loop:
[np_cnt] SPLOOPW 5
;Multiply acuma and acumb (y[n-1]) by the coefficients acoef1 and acoef2 (acoef[0], acoef[1]).
MPY32 .M1 acoef1, acuma, acumah2:acumah1
|| MPY32 .M2 acoef2, acumb, acumbh2:acumbh1
NOP 3
;Shift the multiplication results.
SHRU .S1 acumah1, 29, acumah1
|| SHR .S2 acumbh2, 27, acumbh1
;Subtract 1 from the counter np_cnt.
|| SUB .L2 np_cnt, 1, np_cnt
SHL .S1 acumah2, 3, acumah2
OR .D1 acumah2, acumah1, acumah1
;Add the error correction using the cross datapath.
|| MV acumbh1, acuma
;Subtract ycuma2 (y[n-2]) from the product.
SUB .L1 acumah1, ycuma2, acumah1
;Move ycuma1 to ycuma2 (y[n-2] = y[n-1]).
|| MV ycuma1, ycuma2
ADD .L1 acuma, acumah1, acuma
;Move the result to ycuma1 (y[n-1] = y[n]).
MV acuma, ycuma1
;Copy the result from acuma to acumb (datapath A to B).
MV acuma, acumb
;Store the result at the address pointed to by output. Post-increment the pointer.
|| STW .D2 acumb, *outputb++
;End of the loop.
SPKERNEL 0, 0
singen_end:
;Enable interrupts.
RINT
;Branch to return address.
B rtnptr
NOP 5
.endasmfunc
Is anyone familiar with the SPLOOP structure who can explain to me how to use it and what restrictions I must keep in mind? I would be very thankful if someone could help me migrate the branch-based loop to an SPLOOP-optimized version, since I think it is the only way to improve the performance.
Best regards,
--Adrian