Assembly algorithm optimization

admeltech

Other Parts Discussed in Thread: TMS320C6670, SYSBIOS

Hello, everyone. Once again I'm looking for help to optimize an algorithm using the C6000 Assembly Language of the TMS320C6670 I'm developing on.

My goal is to build a sinewave generator that runs at the highest possible speed, based on a marginally stable second-order IIR filter (a digital resonator), as explained in the SPRA708 application note (http://www.ee.ic.ac.uk/pcheung/teaching/ee3_Study_Project/Sinewave%20generation%28708%29.pdf).

The C function has a declaration like this:

void singen_asm(int32_t* output, int32_t* seed, int32_t* coef, int32_t np);

Here output is the signal buffer, seed holds the initial values of the filter output, coef holds the filter coefficients, and np is the number of points to calculate.

The operation performed in every iteration, with fixed-point shifts of q1 = 29 and q2 = 59, is:

y[i] = ((A1*y[i-1]) >> q1) + ((A2*y[i-1]) >> q2) - y[i-2]

The sinewave output therefore has 30 bits of precision (not counting the sign) instead of 31, to avoid an overflow in the y accumulation.
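For reference, the recurrence above can be modelled in plain C. This is only a sketch under some assumptions that are not stated in the post: the name singen_ref is made up, coef[0] (A1) is read as the coefficient in Q29, coef[1] (A2) as its next 30 fraction bits (Q59), and the seed layout follows the assembly comments (seed[0] is y[i-1], seed[1] is y[i-2]).

```c
#include <stdint.h>

/* Hypothetical reference model, not the poster's code.
 * Assumptions: coef[0] = A1 (Q29), coef[1] = A2 (low-order part, Q59),
 * seed[0] = y[-1], seed[1] = y[-2]. */
void singen_ref(int32_t *output, const int32_t *seed,
                const int32_t *coef, int32_t np)
{
    const int q1 = 29, q2 = 59;
    int32_t y1 = seed[0];   /* y[i-1] */
    int32_t y2 = seed[1];   /* y[i-2] */
    int32_t i;

    for (i = 0; i < np; i++) {
        /* 32x32 -> 64-bit signed products, like MPY32 into a register pair. */
        int64_t p1 = (int64_t)coef[0] * y1;
        int64_t p2 = (int64_t)coef[1] * y1;
        /* Right-shifting a negative value is implementation-defined in C;
         * this assumes an arithmetic shift, as common compilers provide. */
        int32_t y = (int32_t)((p1 >> q1) + (p2 >> q2) - y2);

        output[i] = y;
        y2 = y1;            /* shift the filter memory */
        y1 = y;
    }
}
```

A quick sanity check: with coef = {1 << 29, 0} (a coefficient of exactly 1.0, i.e. 60 degrees per sample) and both seeds equal to -s, the output repeats 0, s, s, 0, -s, -s with no rounding error, which makes it easy to compare against an assembly version.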

I implemented a C version of the algorithm and an Assembly one, as follows:

;; Function prototype
;; void singen_asm(int32_t* output, int32_t* seed, int32_t* acoef, int32_t np);

        .global singen_asm
        .sect ".text:appFastCode"

        .map output/A4, seed/B4, acoef/A6, np/B6
        .map np_cnt/A2
        .map acoef1/A24, acoef2/B24, acuma/A28, acumb/B28, acumah2/A27, acumah1/A26, acumbh2/B27, acumbh1/B26, ycuma2/A31, ycuma1/A30, aux1/A16
            ; C/C++ interface params.
        .map rtnptr/B3, stkptr/B15

        .asmfunc

singen_asm:
    ;Disable interrupts.
    DINT
    || SUB np, 01, np_cnt
    ; Load the seed from memory as a double word.
    || LDDW .D2 *seed, acumbh2:acumbh1
    ; Load the first element of acoef.
    || LDW .D1 *acoef++, acoef1
    ; Load the second element of acoef.
    LDW .D1 *acoef, acoef2
    NOP 4
    ; Initialize the accumulators acuma and acumb with seed[0] (from acumbh1).

    MV acumbh1, ycuma1
    || MV acumbh1, acumb
    MV acumbh2, ycuma2
    || MV ycuma1, acuma

singen_main_loop:
    ; Multiply acuma and acumb by the coefficients acoef1 and acoef2.
    MPY32 .M1 acoef1, acuma, acumah2:acumah1
    || MPY32 .M2 acoef2, acumb, acumbh2:acumbh1
    NOP 3
    ; Shift the product result.
    SHRU .S1 acumah1, 29, acumah1
    || SHR .S2 acumbh2, 27, acumbh1
    ; Branch back while np_cnt is nonzero (points remain) and decrement the counter.
    [np_cnt] B singen_main_loop
    ; Subtract 1 from the counter np_cnt.
    || SUB .L1 np_cnt, 1, np_cnt
    SHL .S1 acumah2, 3, acumah2
    OR .S1 acumah2, acumah1, acumah1
    ; Adds the error correction.
    || MV acumbh1, acuma
    ; Subtract ycuma2 from the partial result.
    SUB .L1 acumah1, ycuma2, acumah1
    ; Move ycuma1 to ycuma2 (To move the filter memory).
    || MV ycuma1, ycuma2
    ; Accumulate the partial results from datapath A and B.
    ADD .L1 acuma, acumah1, acuma
    ; Store the result into the output buffer.
    STW .D1 acuma, *output++
    ; Move the result to ycuma1 and to acumb.
    || MV acuma, ycuma1
    || MV acuma, acumb

singen_end:
    ;Enable interrupts.
    RINT
    ;Branch to return address.
    B rtnptr
    NOP 5

    .endasmfunc

The version above generates one output approximately every 11 clock cycles, but I would like to bring that down to about 6 cycles. I know this is possible using the SPLOOP structure to accelerate the main loop, but I'm not very familiar with that option. Sometimes I couldn't even build the test program because of resource conflicts, or the result was not what I expected.

Here is the latest version using the SPLOOP structure. This one builds and runs but doesn't generate a sinewave.

;; Function prototype
;; void singen_asm(int32_t* output, int32_t* seed, int32_t* acoef, int32_t np);

       .global singen_asm
       .sect ".text:appFastCode"

       .map output/A4, seed/B4, acoef/A6, np/B6
       .map np_cnt/B0
       .map outputb/B2, acoef1/A24, acoef2/B24, acuma/A28, acumb/B28, acumah2/A27, acumah1/A26, acumbh2/B27, acumbh1/B26, ycuma2/A31, ycuma1/A30, aux1/A16
           ; C/C++ interface params.
       .map rtnptr/B3, stkptr/B15

       .asmfunc

singen_asm:
   ;Disable interrupts in code since code is software pipelined.
   ;; DINT
   ;; || SUB np, 01, np_cnt
   ;; || LDW .D2 *seed, ycumab1
   ;; NOP 4
   ;; MV ycumab1, ycuma1
   ;; || MVKL 0x0000FFFF, aux1
   ;; SHR .S1 ycuma1, 16, ycuma2
   ;; || AND ycuma1, aux1, acuma

   ;Disable interrupts.
   DINT
   ;Subtract 1 from the counter np.
   ;|| SUB np, 01, np_cnt
   || MV np, np_cnt
   ;Load a double word (64 bits) from seed into the acumbh register pair.
   || LDDW .D2 *seed, acumbh2:acumbh1
   ;Load a word (32 bits) from acoef into register acoef1 and post-increment.
   || LDW .D1 *acoef++, acoef1
   ;Load a word (32 bits) from acoef into register acoef2.
   LDW .D1 *acoef, acoef2
   || MV output, outputb
   NOP 4
   ;Move the contents from datapath B to A (registers ycuma1 and ycuma2).
   ;Initialize the accumulators acuma and acumb with acumbh1 (seed[0] -> y[n-1]).
   MV acumbh1, ycuma1
   || MV acumbh1, acumb
   MV acumbh2, ycuma2
   || MV ycuma1, acuma

singen_main_loop:
   [np_cnt] SPLOOPW 5
   ;Multiply acuma and acumb (y[n-1]) by the coefficients acoef1 and acoef2 (acoef[0], acoef[1]).
   MPY32 .M1 acoef1, acuma, acumah2:acumah1
   || MPY32 .M2 acoef2, acumb, acumbh2:acumbh1
   NOP 3
   ;Shift the multiplication results.
   SHRU .S1 acumah1, 29, acumah1
   || SHR .S2 acumbh2, 27, acumbh1
   ;Subtract 1 from the counter np_cnt.
   || SUB .L2 np_cnt, 1, np_cnt
   SHL .S1 acumah2, 3, acumah2
   OR .D1 acumah2, acumah1, acumah1
   ;Add the error correction using the cross datapath.
   || MV acumbh1, acuma
   ;Subtract ycuma2 (y[n-2]) from the product.
   SUB .L1 acumah1, ycuma2, acumah1
   ;Move ycuma1 to ycuma2 (y[n-2] = y[n-1]).
   || MV ycuma1, ycuma2
   ADD .L1 acuma, acumah1, acuma
   ;Move the result to ycuma1 (y[n-1] = y[0]).
   MV acuma, ycuma1
   ;Copy the result from acuma to acumb (datapath A to B).
   MV acuma, acumb
   ;Store the result at the address pointed to by output. Post-increment the pointer.
   || STW .D2 acumb, *outputb++
   ;End the loop.
   SPKERNEL 0, 0

singen_end:
   ;Enable interrupts.
   RINT
   ;Branch to return address.
   B rtnptr
   NOP 5

   .endasmfunc

Is there anyone familiar with the SPLOOP structure who can explain to me how to use it and what restrictions I must keep in mind? I would be very thankful if someone could help me migrate the branch-based loop to an SPLOOP-optimized version, since I think it is the only way to improve the performance.

Best regards,

--Adrian

over 11 years ago

0 George Mock over 11 years ago

TI__Guru**** 232880 points

Have you considered writing it in linear assembly instead? If you can't do that, then I can only refer you to my response on your last thread.

Thanks and regards,

-George

0 admeltech over 11 years ago in reply to George Mock

Prodigy 245 points

Hi, George.

I have never used linear assembly. Besides hand-writing the complete assembly functions I try to optimize, I sometimes use C intrinsics with good performance, but only for really simple tasks such as multiplying 16-bit vectors, or for other operations that I know for sure map directly to assembly instructions.

I don't know how good the Assembly Optimizer is at pipelining my code, taking into account data dependencies, memory access times and so on.

How can I convert my design into linear assembly? How is it different from the method I'm using?

Thanks.

0 George Mock over 11 years ago in reply to admeltech

TI__Guru**** 232880 points

Adrián González said:
How can I convert my design into linear assembly?

Learn about the assembly optimizer from the chapter titled Using the Assembly Optimizer of the C6000 compiler manual. Many examples of linear assembly can be found in the C6000 Programmer's Guide.

Adrián González said:
How is it different from the method I'm using?

You are still writing assembly language. But you don't have to worry about scheduling instructions, partitioning instructions to the A or B side, or allocating machine registers. All of that is taken care of for you. It is a level of programming abstraction that lies between C and assembly.

Thanks and regards,

-George

0 Archaeologist over 11 years ago in reply to George Mock

TI__Guru* 84225 points

Before dropping from C to assembly or linear assembly, are you sure the C code couldn't be made more efficient? Could you post it here?

0 admeltech over 11 years ago in reply to George Mock

Prodigy 245 points

Hi George.

I recently tested linear assembly with a basic "strcpy"-style function that I had already implemented in regular assembly using the SPLOOP structure. I also compared its performance against the plain C loop, as shown:

C basic code

for(k = 0; k < BFSZ; k++)
    buffd[k] = buffa[k];

Parallel assembly function

    .global cpyer_asm
    .sect ".text:appFastCode"

    .asmfunc

cpyer_asm:
    MV A6, A0               ; Do A6 loops
    [A0] SPLOOPW 1          ; Check loop
    LDH .D1 *A4++, A2       ; Load source value
    NOP 1
    SUB .S1 A0, 1, A0       ; Adjust loop counter
    NOP 2                   ; Wait for source to load
    MV .L2X A2, B2          ; Position data for write
    SPKERNEL 0, 0           ; End loop
    || STH .D2 B2,*B4++     ; Store value

cpyer_asm_end:
    B B3
    NOP 5

    .endasmfunc

Linear assembly function

    .global cpyer_asm_linear
    .sect ".text:appFastCode"

cpyer_asm_linear:   .cproc orig, dest, size
    .reg auxcnt, aux
    .no_mdep

    MV size, auxcnt

cpyer_asm_linear_loop:   .trip 512
    .mdep r1, r2
    LDH *orig++{r1}, aux
    STH aux, *dest++{r2}
    SUB auxcnt, 1, auxcnt
    [auxcnt] B cpyer_asm_linear_loop

    .return
    .endproc

In the simulator, the C function takes 8703 clock cycles to copy 512 data items (16.998 cycles/datum), the parallel assembly 535 (1.045 cycles/datum) and the linear assembly 6672 (13.031 cycles/datum). I have to admit that linear assembly is faster than C, even with the "optimization for speed" option enabled, and that it is easy to write since there is no need to assign any resources by hand, but the jump from 535 to 6672 cycles is unexpected.
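The per-datum figures are simply the measured cycle count divided by the 512 copied elements. A minimal sketch of that arithmetic (the function name is made up for illustration):

```c
/* Average number of CPU cycles spent per copied element.
 * Hypothetical helper, used only to reproduce the figures quoted above. */
double cycles_per_datum(unsigned long cycles, unsigned long count)
{
    return (double)cycles / (double)count;
}
```

Calling it with (8703, 512), (535, 512) and (6672, 512) reproduces the three ratios quoted above.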

Do I have to include any directive in the .sa file or change the Build Options of my project to make those optimizations apply?

0 Archaeologist over 11 years ago in reply to admeltech

TI__Guru* 84225 points

I get different results. With just the options "-o2 -mv6400+", the compiler turns the C code into much better code than the ASM or SA, which are about the same. I am measuring only CPU cycles; memory stalls are not counted. The reason the C code does better is that it combines several LDH into LDNDW. Here is the code I used. If this is not representative of the code you used, please post a complete, compilable test case including all of the command-line options.

/* 
   cl6x -o2 -mv6400+ --abi=eabi x.asm y.sa w.c

   C version: 274 CPU cycles
   ASM version: 533 CPU cycles
   SA version: 526 CPU cycles
*/

#include <stdio.h>
#include <time.h>

#define BFSZ 512

void copy(short buffd[restrict], short buffa[])
{
    int k;
    for(k = 0; k < BFSZ; k++)
        buffd[k] = buffa[k];
}

short dst[BFSZ], src[BFSZ];

extern void cpyer_asm(short orig[], short dest[], int size);
extern void cpyer_asm_linear(short orig[], short dest[], int size);

int main()
{
    clock_t start, stop, overhead = clock();
    overhead = clock() - overhead;

    start = clock();
    copy(dst, src);
    stop = clock();
    printf("C version: %d CPU cycles\n", (int)(stop - start - overhead));

    start = clock();
    cpyer_asm(src, dst, BFSZ);
    stop = clock();
    printf("ASM version: %d CPU cycles\n", (int)(stop - start - overhead));

    start = clock();
    cpyer_asm_linear(src, dst, BFSZ);
    stop = clock();
    printf("SA version: %d CPU cycles\n", (int)(stop - start - overhead));

    return 0;
}
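To illustrate what combining several LDH into one wide load means, here is a rough portable model in C. It is only a sketch and not what the compiler literally emits: the name copy_wide is made up, and memcpy through a 64-bit temporary stands in for the non-aligned double-word load and store. With optimization enabled the compiler performs this transformation on the simple loop by itself, so there is normally no need to write it by hand.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a "wide" copy: four 16-bit elements are moved per
 * iteration through one 64-bit temporary. A fixed-size 8-byte memcpy is the
 * portable way to express an unaligned 64-bit load or store. */
void copy_wide(int16_t *restrict dst, const int16_t *restrict src, size_t n)
{
    size_t k = 0;

    /* Main loop: one 64-bit load and one 64-bit store per four elements. */
    for (; k + 4 <= n; k += 4) {
        uint64_t chunk;
        memcpy(&chunk, &src[k], sizeof chunk);
        memcpy(&dst[k], &chunk, sizeof chunk);
    }
    /* Tail: copy the remaining 0 to 3 elements one at a time. */
    for (; k < n; k++)
        dst[k] = src[k];
}
```

The restrict qualifiers play the same role as in the copy function above: they promise that the two buffers do not overlap, which is what allows loads to be grouped ahead of stores.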

0 Douglas Gwyn over 11 years ago in reply to Archaeologist

Expert 2210 points

Indeed, in my previous bout with the C6xxx family several years ago, I was pleasantly surprised by how well the C compiler optimized such code. My own policy has been to code everything possible in C, and resort to any form of assembly language only where the code is found to be a bottleneck.

0 admeltech over 11 years ago in reply to Archaeologist

Prodigy 245 points

Hi, Archaeologist. I implemented your version, with the restrict qualifier on the buffer and the time.h library to measure the function performance instead of the Code Composer Studio Clock tool. The results I get are almost the same or even worse:

ASM: 534 cycles, SA: 6671 cycles, C: 13341 cycles

I'm ashamed to ask this, but I don't really know how to set the optimization options. I try to modify them on the "Build > C6000 Compiler > Optimizations" page of the "Build options" in CCS5. I can't find --opt_level (i.e., -O2) there, only --opt_for_speed, which I set to 2. --gen_opt_info is disabled (set to 0) and --call_assumptions is set to 0, which is what I actually see in my build log.

Note: cpyer.asm and cpyer_linear.sa are the 2 versions of the assembly function. The C version is in the main.c file.

**** Build of configuration Debug for project linear_asm_test ****

C:\ti\ccsv5\utils\bin\gmake -k all
'Building file: ../app.cfg'
'Invoking: XDCtools'
"C:/ti/xdctools_3_22_04_46/xs" --xdcpath="C:/ti/bios_6_32_05_54/packages;C:/ti/ipc_1_24_00_16/packages;" xdc.tools.configuro -o configPkg -t ti.targets.elf.C66 -p ti.platforms.evm6670 -r release -c "C:/ti/ccsv5/tools/compiler/c6000" --compileOptions "-g --optimize_with_debug" "../app.cfg"
making package.mak (because of package.bld) ...
generating interfaces for package configPkg (because package/package.xdc.inc is older than package.xdc) ...
configuring app.xe66 from package/cfg/app_pe66.cfg ...
generating custom ti.sysbios library makefile ...
Starting build of library sources ...
Build of libraries done.
cle66 package/cfg/app_pe66.c ...
'Finished building: ../app.cfg'
' '
'Building file: ../cpyer.asm'
'Invoking: C6000 Compiler'
"C:/ti/ccsv5/tools/compiler/c6000/bin/cl6x" -mv6600 -g --include_path="C:/ti/ccsv5/tools/compiler/c6000/include" --display_error_number --diag_warning=225 --abi=eabi --opt_for_speed=2 --gen_opt_info=2 --call_assumptions=1 --preproc_with_compile --preproc_dependency="cpyer.pp" --cmd_file="./configPkg/compiler.opt" "../cpyer.asm"
'Finished building: ../cpyer.asm'
' '
'Building file: ../cpyer_linear.sa'
'Invoking: C6000 Compiler'
"C:/ti/ccsv5/tools/compiler/c6000/bin/cl6x" -mv6600 -g --include_path="C:/ti/ccsv5/tools/compiler/c6000/include" --display_error_number --diag_warning=225 --abi=eabi --opt_for_speed=2 --gen_opt_info=2 --call_assumptions=1 --preproc_with_compile --preproc_dependency="cpyer_linear.pp" --cmd_file="./configPkg/compiler.opt" "../cpyer_linear.sa"
'Finished building: ../cpyer_linear.sa'
' '
'Building file: ../main.c'
'Invoking: C6000 Compiler'
"C:/ti/ccsv5/tools/compiler/c6000/bin/cl6x" -mv6600 -g --include_path="C:/ti/ccsv5/tools/compiler/c6000/include" --display_error_number --diag_warning=225 --abi=eabi --opt_for_speed=2 --gen_opt_info=2 --call_assumptions=1 --preproc_with_compile --preproc_dependency="main.pp" --cmd_file="./configPkg/compiler.opt" "../main.c"
"../main.c", line 70: warning #112-D: statement is unreachable
"../main.c", line 40: warning #179-D: variable "k" was declared but never referenced
'Finished building: ../main.c'
' '
'Building target: linear_asm_test.out'
'Invoking: C6000 Linker'
"C:/ti/ccsv5/tools/compiler/c6000/bin/cl6x" -mv6600 -g --display_error_number --diag_warning=225 --abi=eabi --opt_for_speed=2 --gen_opt_info=2 --call_assumptions=1 -z -m"linear_asm_test.map" --warn_sections -i"C:/ti/ccsv5/tools/compiler/c6000/lib" -i"C:/ti/ccsv5/tools/compiler/c6000/include" --reread_libs --rom_model -o "linear_asm_test.out" -l"./configPkg/linker.cmd" "./main.obj" "./cpyer_linear.obj" "./cpyer.obj" -l"libc.a"
<Linking>
'Finished building target: linear_asm_test.out'
' '

**** Build Finished ****

**** Build of configuration Debug for project linear_asm_test ****

C:\ti\ccsv5\utils\bin\gmake -k all
1 archivo(s) copiado(s).
making ../src/sysbios/sysbios.lib ...
gmake[1]: Nothing to be done for `all'.
'Building file: ../cpyer.asm'
'Invoking: C6000 Compiler'
"C:/ti/ccsv5/tools/compiler/c6000/bin/cl6x" -mv6600 -g --include_path="C:/ti/ccsv5/tools/compiler/c6000/include" --display_error_number --diag_warning=225 --abi=eabi --opt_for_speed=2 --gen_opt_info=0 --call_assumptions=1 --preproc_with_compile --preproc_dependency="cpyer.pp" --cmd_file="./configPkg/compiler.opt" "../cpyer.asm"
'Finished building: ../cpyer.asm'
' '
'Building file: ../cpyer_linear.sa'
'Invoking: C6000 Compiler'
"C:/ti/ccsv5/tools/compiler/c6000/bin/cl6x" -mv6600 -g --include_path="C:/ti/ccsv5/tools/compiler/c6000/include" --display_error_number --diag_warning=225 --abi=eabi --opt_for_speed=2 --gen_opt_info=0 --call_assumptions=1 --preproc_with_compile --preproc_dependency="cpyer_linear.pp" --cmd_file="./configPkg/compiler.opt" "../cpyer_linear.sa"
'Finished building: ../cpyer_linear.sa'
' '
'Building file: ../main.c'
'Invoking: C6000 Compiler'
"C:/ti/ccsv5/tools/compiler/c6000/bin/cl6x" -mv6600 -g --include_path="C:/ti/ccsv5/tools/compiler/c6000/include" --display_error_number --diag_warning=225 --abi=eabi --opt_for_speed=2 --gen_opt_info=0 --call_assumptions=1 --preproc_with_compile --preproc_dependency="main.pp" --cmd_file="./configPkg/compiler.opt" "../main.c"
"../main.c", line 70: warning #112-D: statement is unreachable
"../main.c", line 40: warning #179-D: variable "k" was declared but never referenced
'Finished building: ../main.c'
' '
'Building target: linear_asm_test.out'
'Invoking: C6000 Linker'
"C:/ti/ccsv5/tools/compiler/c6000/bin/cl6x" -mv6600 -g --display_error_number --diag_warning=225 --abi=eabi --opt_for_speed=2 --gen_opt_info=0 --call_assumptions=1 -z -m"linear_asm_test.map" --warn_sections -i"C:/ti/ccsv5/tools/compiler/c6000/lib" -i"C:/ti/ccsv5/tools/compiler/c6000/include" --reread_libs --rom_model -o "linear_asm_test.out" -l"./configPkg/linker.cmd" "./main.obj" "./cpyer_linear.obj" "./cpyer.obj" -l"libc.a"
<Linking>
'Finished building target: linear_asm_test.out'
' '

**** Build Finished ****

Do I have to set the optimization options by hand in a .cmd file or something? As far as I can see, the optimization is working for neither the C file nor the SA file.

Thanks for your help and patience.

0 Archaeologist over 11 years ago in reply to admeltech

TI__Guru* 84225 points

You must use --opt_level 2 or greater or else the SA and C versions will have awful performance.

What version of CCS are you using?

0 George Mock over 11 years ago in reply to admeltech

TI__Guru**** 232880 points

Adrián González said:
I'm ashamed to ask this, but I don't really know how to set the optimization options. I try to modify them on the "Build > C6000 Compiler > Optimizations" page of the "Build options" in CCS5. I can't find --opt_level (i.e., -O2) there, only --opt_for_speed, which I set to 2. --gen_opt_info is disabled (set to 0) and --call_assumptions is set to 0, which is what I actually see in my build log.

Look in Build > C6000 Compiler > Basic Options. There is a drop-down box labeled Optimization level.

For general tips on what compiler options to use and why, please see this wiki article.

Thanks and regards,

-George

0 admeltech over 11 years ago in reply to George Mock

Prodigy 245 points

Hi, everyone.

I can't believe I hadn't noticed the Optimization level option in the project options. After enabling it (-O2), I got the following benchmarks:

C version: 546 cycles, ASM version: 534 cycles, SA version: 538 cycles

Those numbers are impressively close to each other, even for the C version, although I didn't get the 274 cycles Archaeologist did. I guess that difference is due to loop unrolling by two.
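To make the unrolling guess concrete, this is roughly what a copy loop unrolled by two looks like when written out by hand in C. It is only an illustration with a made-up name (copy_unrolled2). It does not confirm that unrolling explains the 274-cycle result.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hand-unrolled copy: each pass of the main loop moves two
 * elements, so the counter update and branch are paid half as often. */
void copy_unrolled2(int16_t *restrict dst, const int16_t *restrict src, size_t n)
{
    size_t k;

    for (k = 0; k + 1 < n; k += 2) {
        dst[k]     = src[k];
        dst[k + 1] = src[k + 1];
    }
    /* If n is odd, one element is left over. */
    if (k < n)
        dst[k] = src[k];
}
```

An optimizing compiler can apply this kind of unrolling to the plain loop on its own when it knows enough about the trip count, as it does for the constant BFSZ in the copy function posted earlier.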

What is the recommended configuration for the Optimizer options? What criteria are used to select the levels?

---

I now have to return to my main problem (the sinewave generation) and try to improve its performance, but I guess that using at least linear assembly is enough to make the compiler use the hardware loop buffer for the inner loops.
