Assembly algorithm optimization

Hello, everyone. Once again I'm looking for help to optimize an algorithm using the C6000 Assembly Language of the TMS320C6670 I'm developing on.

My goal is to make a sinewave generator that runs as fast as possible, based on a marginally stable IIR filter (a digital resonator), as explained in the SPRA708 application note (http://www.ee.ic.ac.uk/pcheung/teaching/ee3_Study_Project/Sinewave%20generation%28708%29.pdf).

The C function has a declaration like this:

void singen_asm(int32_t* output, int32_t* seed, int32_t* coef, int32_t np);

Here output is the signal buffer, seed holds the initial values of the filter output, coef holds the filter coefficients, and np is the number of points to be calculated.

The operation performed in every iteration, with fixed-point shifts of q1 = 29 and q2 = 59, is:

y[n] = ((A1*y[n-1]) >> q1) + ((A2*y[n-1]) >> q2) - y[n-2]

As a result, the sinewave output has 30 bits of precision (excluding the sign bit) instead of 31, which avoids overflow in the y accumulation.
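
For reference, the recurrence above can be sketched in portable C. This is only a minimal model, not the C version mentioned in this post: the name singen_ref is hypothetical, and the layout seed[0] = y[n-1], seed[1] = y[n-2], coef[0] = A1, coef[1] = A2 is an assumption inferred from the assembly listing that follows.

```c
#include <stdint.h>

/* Reference model of
 *   y[n] = ((A1*y[n-1]) >> 29) + ((A2*y[n-1]) >> 59) - y[n-2]
 * Assumed layout (inferred from the assembly, not confirmed):
 *   seed[0] = y[n-1], seed[1] = y[n-2], coef[0] = A1, coef[1] = A2.
 * Right-shifting a negative int64_t is implementation-defined in C;
 * this sketch assumes an arithmetic (sign-preserving) shift. */
void singen_ref(int32_t *output, const int32_t *seed,
                const int32_t *coef, int32_t np)
{
    int32_t y1 = seed[0];   /* y[n-1] */
    int32_t y2 = seed[1];   /* y[n-2] */
    int32_t n;

    for (n = 0; n < np; n++) {
        /* Main product, scaled by q1 = 29. */
        int64_t main_term = ((int64_t)coef[0] * y1) >> 29;
        /* Correction product, scaled by q2 = 59. */
        int64_t corr_term = ((int64_t)coef[1] * y1) >> 59;
        int32_t y = (int32_t)(main_term + corr_term - y2);

        output[n] = y;
        y2 = y1;            /* shift the filter memory */
        y1 = y;
    }
}
```

A model like this is mainly useful as a golden reference: run it with the same seed and coef arrays and compare its output buffer with the assembly output sample by sample.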

I implemented a C version of the algorithm and an Assembly one, as follows:

;; Function prototype
;; void singen_asm(int32_t* output, int32_t* seed, int32_t* acoef, int32_t np);


        .global singen_asm
        .sect ".text:appFastCode"

        .map output/A4, seed/B4, acoef/A6, np/B6
        .map np_cnt/A2
        .map acoef1/A24, acoef2/B24, acuma/A28, acumb/B28, acumah2/A27, acumah1/A26, acumbh2/B27, acumbh1/B26, ycuma2/A31, ycuma1/A30, aux1/A16
            ; C/C++ interface params.
        .map rtnptr/B3, stkptr/B15

        .asmfunc


singen_asm:
    ;Disable interrupts.
    DINT
    || SUB np, 01, np_cnt
    ; Load the seed from memory as a double word.
    || LDDW .D2 *seed, acumbh2:acumbh1
    ; Load the first position of acoef.
    || LDW .D1 *acoef++, acoef1
    ; Load the second position of acoef.
    LDW .D1 *acoef, acoef2
    NOP 4
    ; Initialize the accumulators acuma and acumb with seed[0] (from acumbh1).

    MV acumbh1, ycuma1
    || MV acumbh1, acumb
    MV acumbh2, ycuma2
    || MV ycuma1, acuma

singen_main_loop:
    ; Multiply acuma and acumb by the coefficients acoef1 and acoef2.
    MPY32 .M1 acoef1, acuma, acumah2:acumah1
    || MPY32 .M2 acoef2, acumb, acumbh2:acumbh1
    NOP 3
    ; Shift the product result.
    SHRU .S1 acumah1, 29, acumah1
    || SHR .S2 acumbh2, 27, acumbh1
    ; Branch back while np_cnt is nonzero (more points remain) and decrement the counter.
    [np_cnt] B singen_main_loop
    ; Subtract 1 from the np_cnt counter.
    || SUB .L1 np_cnt, 1, np_cnt
    SHL .S1 acumah2, 3, acumah2
    OR .S1 acumah2, acumah1, acumah1
    ; Adds the error correction.
    || MV acumbh1, acuma
    ; Subtract ycuma2 from the partial result.
    SUB .L1 acumah1, ycuma2, acumah1
    ; Move ycuma1 to ycuma2 (To move the filter memory).
    || MV ycuma1, ycuma2
    ; Accumulate the partial results from datapaths A and B.
    ADD .L1 acuma, acumah1, acuma
    ; Store the result into the output buffer.
    STW .D1 acuma, *output++
    ; Move the result to ycuma1 and to acumb.
    || MV acuma, ycuma1
    || MV acuma, acumb

singen_end:
    ;Enable interrupts.
    RINT
    ;Branch to return address.
    B rtnptr
    NOP 5

    .endasmfunc

This version generates 1 output roughly every 11 clock cycles, but I would like to bring that down to about 6 cycles. I know this is possible using the SPLOOP structure to accelerate the main loop, but I'm not very familiar with it. Sometimes I couldn't even build the test program because of resource conflicts, or the result was not what I expected.

Here is the latest version using the SPLOOP structure. This one builds and runs but doesn't generate a sinewave.

;; Function prototype
;; void singen_asm(int32_t* output, int32_t* seed, int32_t* acoef, int32_t np);


        .global singen_asm
        .sect ".text:appFastCode"

        .map output/A4, seed/B4, acoef/A6, np/B6
        .map np_cnt/B0
        .map outputb/B2, acoef1/A24, acoef2/B24, acuma/A28, acumb/B28, acumah2/A27, acumah1/A26, acumbh2/B27, acumbh1/B26, ycuma2/A31, ycuma1/A30, aux1/A16
            ; C/C++ interface params.
        .map rtnptr/B3, stkptr/B15

        .asmfunc


singen_asm:
    ;Disable interrupts in code since code is software pipelined.
    ;; DINT
    ;; || SUB np, 01, np_cnt
    ;; || LDW .D2 *seed, ycumab1
    ;; NOP 4
    ;; MV ycumab1, ycuma1
    ;; || MVKL 0x0000FFFF, aux1
    ;; SHR .S1 ycuma1, 16, ycuma2
    ;; || AND ycuma1, aux1, acuma

    ;Disable interrupts.
    DINT
    ;Subtract 1 from the np counter.
    ;|| SUB np, 01, np_cnt
    || MV np, np_cnt
    ;Load a double word (64 bits) from seed into the acumbh register pair.
    || LDDW .D2 *seed, acumbh2:acumbh1
    ;Load a word (32 bits) from acoef into register acoef1 and post-increment.
    || LDW .D1 *acoef++, acoef1
    ;Load a word (32 bits) from acoef into register acoef2.
    LDW .D1 *acoef, acoef2
    || MV output, outputb
    NOP 4
    ;Move the contents from datapath B to A (registers ycuma1 and ycuma2).
    ;Initialize the accumulators acuma and acumb with acumbh1 (seed[0] -> y[n-1]).
    MV acumbh1, ycuma1
    || MV acumbh1, acumb
    MV acumbh2, ycuma2
    || MV ycuma1, acuma

singen_main_loop:
    [np_cnt] SPLOOPW 5
    ;Multiply acuma and acumb (y[n-1]) by the coefficients acoef1 and acoef2 (acoef[0], acoef[1]).
    MPY32 .M1 acoef1, acuma, acumah2:acumah1
    || MPY32 .M2 acoef2, acumb, acumbh2:acumbh1
    NOP 3
    ;Shift the multiplication results.
    SHRU .S1 acumah1, 29, acumah1
    || SHR .S2 acumbh2, 27, acumbh1
    ;Subtract 1 from the np_cnt counter.
    || SUB .L2 np_cnt, 1, np_cnt
    SHL .S1 acumah2, 3, acumah2
    OR .D1 acumah2, acumah1, acumah1
    ;Add the error correction using the cross datapath.
    || MV acumbh1, acuma
    ;Subtract ycuma2 (y[n-2]) from the product.
    SUB .L1 acumah1, ycuma2, acumah1
    ;Move ycuma1 to ycuma2 (y[n-2] = y[n-1]).
    || MV ycuma1, ycuma2
    ADD .L1 acuma, acumah1, acuma
    ;Move the result to ycuma1 (y[n-1] = y[0]).
    MV acuma, ycuma1
    ;Copy the result from acuma to acumb (datapath A to B).
    MV acuma, acumb
    ;Store the result at the address pointed to by output. Post-increment the pointer.
    || STW .D2 acumb, *outputb++
    ;End of the loop.
    SPKERNEL 0, 0

singen_end:
    ;Enable interrupts.
    RINT
    ;Branch to return address.
    B rtnptr
    NOP 5

    .endasmfunc

Is there anyone familiar with the SPLOOP structure who can explain to me how to use it and which restrictions I must keep in mind? I would be very thankful if someone could help me migrate the branch-based loop to an SPLOOP-optimized version, since I think it is the only way to improve the performance.

Best regards,

--Adrian

  • Have you considered writing it in linear assembly instead?  If you can't do that, then I can only refer you to my response on your last thread.

    Thanks and regards,

    -George

  • In reply to George Mock:

    Hi, George.

    I have never used linear assembly. Besides fully hand-coding the assembly functions I want to optimize, I sometimes use C intrinsics with good results, but only for really simple tasks, such as multiplying 16-bit vectors, or for operations that I know map directly onto assembly instructions.

    I don't know how good the Assembly Optimizer is at pipelining my code while taking into account data dependencies, memory access times and so on.

    How can I convert my design into linear assembly? How is it different from the method I'm using?

    Thanks.

  • In reply to admeltech:

    Adrián González
    How can I convert my design into linear assembly?

    Learn about the assembly optimizer from the chapter titled Using the Assembly Optimizer of the C6000 compiler manual.  Many examples of linear assembly can be found in the C6000 Programmer's Guide.

    Adrián González
    How is it different from the method I'm using?

    You are still writing assembly language.  But you don't have to worry about scheduling instructions, partitioning instructions to the A or B side, or allocating machine registers.  All of that is taken care of for you.  It is a level of programming abstraction that lies between C and assembly.

    Thanks and regards,

    -George

  • In reply to George Mock:

    Before dropping from C to assembly or linear assembly, are you sure the C code couldn't be made more efficient?  Could you post it here?

  • In reply to George Mock:

    Hi George.

    I recently tested linear assembly with a basic "strcpy"-like function that I had already implemented in regular assembly using the SPLOOP structure. I also compared its performance with the basic C operation, as shown:

    C basic code

    for(k = 0; k < BFSZ; k++)
      buffd[k] = buffa[k];

    Parallel assembly function

        .global cpyer_asm
        .sect ".text:appFastCode"

        .asmfunc

    cpyer_asm:
        MV A6, A0;Do A6 loops
        [A0] SPLOOPW 1;Check loop
        LDH .D1 *A4++, A2;Load source value
        NOP 1
        SUB .S1 A0, 1, A0;Adjust loop counter
        NOP 2;Wait for source to load
        MV .L2X A2, B2;Position data for write
        SPKERNEL 0, 0;End loop
        || STH .D2 B2,*B4++;Store value

    cpyer_asm_end:
        B B3
        NOP 5

        .endasmfunc

    Linear assembly function

        .global cpyer_asm_linear
        .sect ".text:appFastCode"


    cpyer_asm_linear:    .cproc orig, dest, size
        .reg auxcnt, aux
        .no_mdep

        MV size, auxcnt

    cpyer_asm_linear_loop:    .trip 512
        .mdep r1, r2
        LDH *orig++{r1}, aux
        STH aux, *dest++{r2}
        SUB auxcnt, 1, auxcnt
        [auxcnt] B cpyer_asm_linear_loop

        .return
        .endproc

    In the simulator, the C function takes 8703 clock cycles to copy 512 data items (16.998 cycles/datum), the parallel assembly 535 (1.045 cycles/datum) and the linear assembly 6672 (13.031 cycles/datum). I have to admit that linear assembly is faster than C even with the "optimization for speed" option enabled, and that it is easy to write since there is no need to assign any resources, but the jump from 535 to 6672 cycles is unexpected.
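
The per-datum figures quoted above are simply each cycle count divided by the 512 copied items. A minimal sketch of that arithmetic follows; cycles_per_element is a hypothetical helper name, not part of the posted project.

```c
/* Average cost per copied element: total cycles divided by element count. */
double cycles_per_element(unsigned long cycles, unsigned long count)
{
    return (double)cycles / (double)count;
}
```

For example, cycles_per_element(6672, 512) gives 13.03125, so the linear assembly build here costs roughly 12.5 times as much per item as the hand-scheduled SPLOOP version (6672 / 535).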

    Do I have to include any directive in the .sa file or change the Build Options of my project to make those optimizations apply?

  • In reply to admeltech:

    I get different results.  With just the options "-o2 -mv6400+", the compiler turns the C code into much better code than the ASM or SA, which are about the same.   I am measuring only CPU cycles; memory stalls are not counted.  The reason the C code does better is that it combines several LDH into LDNDW.  Here is the code I used.  If this is not representative of the code you used, please post a complete, compilable test case including all of the command-line options.

    /* 
       cl6x -o2 -mv6400+ --abi=eabi x.asm y.sa w.c
    
       C version: 274 CPU cycles
       ASM version: 533 CPU cycles
       SA version: 526 CPU cycles
    */
    
    #include <stdio.h>
    #include <time.h>
    
    #define BFSZ 512
    
    void copy(short buffd[restrict], short buffa[])
    {
        int k;
        for(k = 0; k < BFSZ; k++)
            buffd[k] = buffa[k];
    }
    
    short dst[BFSZ], src[BFSZ];
    
    extern void cpyer_asm(short orig[], short dest[], int size);
    extern void cpyer_asm_linear(short orig[], short dest[], int size);
    
    int main()
    {
        clock_t start, stop, overhead = clock();
        overhead = clock() - overhead;
    
        start = clock();
        copy(dst, src);
        stop = clock();
        printf("C version: %d CPU cycles\n", (int)(stop - start - overhead));
    
        start = clock();
        cpyer_asm(src, dst, BFSZ);
        stop = clock();
        printf("ASM version: %d CPU cycles\n", (int)(stop - start - overhead));
    
        start = clock();
        cpyer_asm_linear(src, dst, BFSZ);
        stop = clock();
        printf("SA version: %d CPU cycles\n", (int)(stop - start - overhead));
    
        return 0;
    }
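
The remark about combining several LDH into one wider load can be illustrated in portable C. The sketch below is not the compiler's actual output: copy_wide is a hypothetical name, and it assumes short is 16 bits, so four elements fit in one 64-bit transfer. It only shows the idea of moving four elements per iteration instead of one.

```c
#include <stdint.h>
#include <string.h>

/* Copy n 16-bit elements, four at a time through a 64-bit temporary.
 * memcpy with a fixed 8-byte size is the portable way to express an
 * unaligned 64-bit load or store; an optimizing compiler can typically
 * lower each call to a single wide memory instruction.
 * Assumes sizeof(short) == 2 and non-overlapping buffers. */
void copy_wide(short *restrict dst, const short *restrict src, size_t n)
{
    size_t k = 0;

    /* Main loop: one 64-bit load and one 64-bit store per four elements. */
    for (; k + 4 <= n; k += 4) {
        uint64_t chunk;
        memcpy(&chunk, &src[k], sizeof chunk);
        memcpy(&dst[k], &chunk, sizeof chunk);
    }

    /* Tail: copy the remaining 0 to 3 elements one by one. */
    for (; k < n; k++)
        dst[k] = src[k];
}
```

With 512 elements the main loop runs 128 times, which is consistent in spirit with the C build above beating the one-halfword-per-iteration assembly, although the exact schedule the compiler picks may differ.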
    
  • In reply to Archaeologist:

    Indeed, in my previous bout with the C6xxx family several years ago, I was pleasantly surprised by how well the C compiler optimized such code.  My own policy has been to code everything possible in C, and resort to any form of assembly language only where the code is found to be a bottleneck.

  • In reply to Archaeologist:

    Hi, Archaeologist. I implemented your version with the "restrict" qualifier on the buffer, and used the "time.h" library to measure the function performance instead of the Code Composer clock tool. The results I get are almost the same, or even worse:

    ASM: 534 cycles, SA: 6671 cycles, C: 13341 cycles

    I'm ashamed to ask this, but I don't really know how to set the optimization options. I try to modify them on the "Build > c6000 Compiler > Optimizations" page of the "Build options" in CCS5. I can't find --opt_level (i.e., -O2) there, only --opt_for_speed, which I set to 2. --gen_opt_info is disabled (set to 0) and --call_assumptions is set to 0, which is what I actually see in my build log.

    Note: cpyer.asm and cpyer_linear.sa are the 2 versions of the assembly function. The C version is in the main.c file.

    **** Build of configuration Debug for project linear_asm_test ****

    C:\ti\ccsv5\utils\bin\gmake -k all
    'Building file: ../app.cfg'
    'Invoking: XDCtools'
    "C:/ti/xdctools_3_22_04_46/xs" --xdcpath="C:/ti/bios_6_32_05_54/packages;C:/ti/ipc_1_24_00_16/packages;" xdc.tools.configuro -o configPkg -t ti.targets.elf.C66 -p ti.platforms.evm6670 -r release -c "C:/ti/ccsv5/tools/compiler/c6000" --compileOptions "-g --optimize_with_debug" "../app.cfg"
    making package.mak (because of package.bld) ...
    generating interfaces for package configPkg (because package/package.xdc.inc is older than package.xdc) ...
    configuring app.xe66 from package/cfg/app_pe66.cfg ...
    generating custom ti.sysbios library makefile ...
    Starting build of library sources ...
    Build of libraries done.
    cle66 package/cfg/app_pe66.c ...
    'Finished building: ../app.cfg'
    ' '
    'Building file: ../cpyer.asm'
    'Invoking: C6000 Compiler'
    "C:/ti/ccsv5/tools/compiler/c6000/bin/cl6x" -mv6600 -g --include_path="C:/ti/ccsv5/tools/compiler/c6000/include" --display_error_number --diag_warning=225 --abi=eabi --opt_for_speed=2 --gen_opt_info=2 --call_assumptions=1 --preproc_with_compile --preproc_dependency="cpyer.pp" --cmd_file="./configPkg/compiler.opt"  "../cpyer.asm"
    'Finished building: ../cpyer.asm'
    ' '
    'Building file: ../cpyer_linear.sa'
    'Invoking: C6000 Compiler'
    "C:/ti/ccsv5/tools/compiler/c6000/bin/cl6x" -mv6600 -g --include_path="C:/ti/ccsv5/tools/compiler/c6000/include" --display_error_number --diag_warning=225 --abi=eabi --opt_for_speed=2 --gen_opt_info=2 --call_assumptions=1 --preproc_with_compile --preproc_dependency="cpyer_linear.pp" --cmd_file="./configPkg/compiler.opt"  "../cpyer_linear.sa"
    'Finished building: ../cpyer_linear.sa'
    ' '
    'Building file: ../main.c'
    'Invoking: C6000 Compiler'
    "C:/ti/ccsv5/tools/compiler/c6000/bin/cl6x" -mv6600 -g --include_path="C:/ti/ccsv5/tools/compiler/c6000/include" --display_error_number --diag_warning=225 --abi=eabi --opt_for_speed=2 --gen_opt_info=2 --call_assumptions=1 --preproc_with_compile --preproc_dependency="main.pp" --cmd_file="./configPkg/compiler.opt"  "../main.c"
    "../main.c", line 70: warning #112-D: statement is unreachable
    "../main.c", line 40: warning #179-D: variable "k" was declared but never referenced
    'Finished building: ../main.c'
    ' '
    'Building target: linear_asm_test.out'
    'Invoking: C6000 Linker'
    "C:/ti/ccsv5/tools/compiler/c6000/bin/cl6x" -mv6600 -g --display_error_number --diag_warning=225 --abi=eabi --opt_for_speed=2 --gen_opt_info=2 --call_assumptions=1 -z -m"linear_asm_test.map" --warn_sections -i"C:/ti/ccsv5/tools/compiler/c6000/lib" -i"C:/ti/ccsv5/tools/compiler/c6000/include" --reread_libs --rom_model -o "linear_asm_test.out" -l"./configPkg/linker.cmd"  "./main.obj" "./cpyer_linear.obj" "./cpyer.obj" -l"libc.a"
    <Linking>
    'Finished building target: linear_asm_test.out'
    ' '

    **** Build Finished ****

    **** Build of configuration Debug for project linear_asm_test ****

    C:\ti\ccsv5\utils\bin\gmake -k all
            1 file(s) copied.
    making ../src/sysbios/sysbios.lib ...
    gmake[1]: Nothing to be done for `all'.
    'Building file: ../cpyer.asm'
    'Invoking: C6000 Compiler'
    "C:/ti/ccsv5/tools/compiler/c6000/bin/cl6x" -mv6600 -g --include_path="C:/ti/ccsv5/tools/compiler/c6000/include" --display_error_number --diag_warning=225 --abi=eabi --opt_for_speed=2 --gen_opt_info=0 --call_assumptions=1 --preproc_with_compile --preproc_dependency="cpyer.pp" --cmd_file="./configPkg/compiler.opt"  "../cpyer.asm"
    'Finished building: ../cpyer.asm'
    ' '
    'Building file: ../cpyer_linear.sa'
    'Invoking: C6000 Compiler'
    "C:/ti/ccsv5/tools/compiler/c6000/bin/cl6x" -mv6600 -g --include_path="C:/ti/ccsv5/tools/compiler/c6000/include" --display_error_number --diag_warning=225 --abi=eabi --opt_for_speed=2 --gen_opt_info=0 --call_assumptions=1 --preproc_with_compile --preproc_dependency="cpyer_linear.pp" --cmd_file="./configPkg/compiler.opt"  "../cpyer_linear.sa"
    'Finished building: ../cpyer_linear.sa'
    ' '
    'Building file: ../main.c'
    'Invoking: C6000 Compiler'
    "C:/ti/ccsv5/tools/compiler/c6000/bin/cl6x" -mv6600 -g --include_path="C:/ti/ccsv5/tools/compiler/c6000/include" --display_error_number --diag_warning=225 --abi=eabi --opt_for_speed=2 --gen_opt_info=0 --call_assumptions=1 --preproc_with_compile --preproc_dependency="main.pp" --cmd_file="./configPkg/compiler.opt"  "../main.c"
    "../main.c", line 70: warning #112-D: statement is unreachable
    "../main.c", line 40: warning #179-D: variable "k" was declared but never referenced
    'Finished building: ../main.c'
    ' '
    'Building target: linear_asm_test.out'
    'Invoking: C6000 Linker'
    "C:/ti/ccsv5/tools/compiler/c6000/bin/cl6x" -mv6600 -g --display_error_number --diag_warning=225 --abi=eabi --opt_for_speed=2 --gen_opt_info=0 --call_assumptions=1 -z -m"linear_asm_test.map" --warn_sections -i"C:/ti/ccsv5/tools/compiler/c6000/lib" -i"C:/ti/ccsv5/tools/compiler/c6000/include" --reread_libs --rom_model -o "linear_asm_test.out" -l"./configPkg/linker.cmd"  "./main.obj" "./cpyer_linear.obj" "./cpyer.obj" -l"libc.a"
    <Linking>
    'Finished building target: linear_asm_test.out'
    ' '

    **** Build Finished ****

    Must I set the optimization options by hand in a .cmd file or something? As far as I can see, the optimization is working for neither the C file nor the SA file.

    Thanks for your help and patience.

  • In reply to admeltech:

    You must use --opt_level 2 or greater or else the SA and C versions will have awful performance.

    What version of CCS are you using?

  • In reply to admeltech:

    Adrián González
    I'm ashamed to ask this, but I don't really know how to set the optimization options. I try to modify them on the "Build > c6000 Compiler > Optimizations" page of the "Build options" in CCS5. I can't find --opt_level (i.e., -O2) there, only --opt_for_speed, which I set to 2. --gen_opt_info is disabled (set to 0) and --call_assumptions is set to 0, which is what I actually see in my build log.

    Look in Build > C6000 Compiler > Basic Options.  There is a drop-down box labeled Optimization level.  

    For general tips on what compiler options to use and why, please see this wiki article.

    Thanks and regards,

    -George