code performance optimization



Dear community,

How can I improve the performance of a loop like this?

 

            for (i = 0; i < XMIT_BUFF_SIZE; i++)
            {
                X[i] = A[i] + B[i] + C[i] + D[i];
            }

The IDE version is 4.1.0.02002.
Thanks!
Max

  • Hi,

    What device are you using?

    Regards,

    Hyun

  • hi,

     

    You could write this in assembly. The C5515 has a dual-MAC operation that executes in one cycle. You can refer to http://www.ti.com/lit/ug/swpu068e/swpu068e.pdf

    Regards,

    Hyun

  • You could also look into calling the optimized vector addition routine from DSPLIB several times: http://focus.ti.com/docs/toolsw/folders/print/sprc100.html

    ushort oflag = add (DATA *x, DATA *y, DATA *r, ushort nx, ushort scale)

    This function adds two vectors, element by element.

    Also, when you download DSPLIB, the fully optimized assembly source code is provided. You could start with the vector add routine and modify it accordingly.
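As a rough sketch of the "call add several times" idea, the four-vector sum can be built from three pairwise additions. The `add` below is only a portable stand-in with the same prototype as the one quoted above, so the example compiles off-target; its saturation and overflow-flag behaviour are assumptions, and on the C5515 you would link the real DSPLIB routine instead. The helper name `sum4` is hypothetical, and whether the real routine permits in-place output (`r` equal to `x`) should be checked in the DSPLIB documentation.

```c
#include <assert.h>

typedef short DATA;             /* 16-bit sample type, named as in the thread */
typedef unsigned short ushort;

/* Stand-in for DSPLIB add(): r[i] = x[i] + y[i], halved when scale != 0.
   ASSUMPTION: results saturate to 16 bits and the return value is 1 if
   any element saturated. This is NOT the DSPLIB source. */
ushort add(DATA *x, DATA *y, DATA *r, ushort nx, ushort scale)
{
    ushort oflag = 0;
    ushort i;

    assert(x && y && r);
    for (i = 0; i < nx; i++) {
        long s = (long)x[i] + (long)y[i];
        if (scale) {
            /* floor division by 2, like an arithmetic right shift */
            s = (s >= 0) ? s / 2 : -((-s + 1) / 2);
        }
        if (s > 32767L)  { s = 32767L;  oflag = 1; }
        if (s < -32768L) { s = -32768L; oflag = 1; }
        r[i] = (DATA)s;
    }
    return oflag;
}

/* Hypothetical helper: x = a + b + c + d using three pairwise adds.
   The second and third calls write their result in place into x. */
ushort sum4(DATA *a, DATA *b, DATA *c, DATA *d, DATA *x, ushort nx)
{
    ushort oflag = 0;

    oflag |= add(a, b, x, nx, 0);   /* x = a + b */
    oflag |= add(x, c, x, nx, 0);   /* x = x + c */
    oflag |= add(x, d, x, nx, 0);   /* x = x + d */
    return oflag;
}
```

This makes three passes over the data, so it trades some memory traffic for not having to touch the assembly; a single fused loop avoids the extra passes.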

    Hope this helps,
    Mark

  • Hi Mark,

     

    Thanks for your advice. 

    I modified add.asm from DSPLIB as follows:

     

    ; ---------------------------------------------------------------
    ; Start of outer loop
    ; for (i=0; i<nx; i++)
    ;    R(i) = X(i) + Y(i) + A(i) + B(i);
    ; ---------------------------------------------------------------
    RPTBLOCAL loop ;start the outer loop
    ADD *AR0+, *AR1+, AC0 ;vector add of two inputs
    MOV HI(AC0), AR5
    ADD *AR3+, *AR4+, AC0 ;vector add of two inputs
    MOV HI(AC0), AR6
    ADD *AR5,  *AR6, AC2 ;vector add of two inputs
    ; ---------------------------------------------------------------
    ; To implement scaling:
    ; if(scale = #1) then AC2=AC2/2;
    ; otherwise *r_ptr+ = AC2
    ; ---------------------------------------------------------------
    XCC loop, T1!=#0 ;testing for scaling
          ||SFTA AC2, -1 ;if scale=1, AC2=AC2/2
    loop: MOV HI(AC2), *AR2+ ;end of outer loop

     

     

    Processing time has decreased by a factor of four.

     

    I pointed auxiliary registers AR3 and AR4 at the two new data vectors and performed the addition pair by pair.

    In this code I use auxiliary registers AR5 and AR6 to store the temporary results of the additions. According to swpu068e, there are only 8 such registers.
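For checking the output of a modified routine like this, a plain C reference model of the intended computation can be handy. The sketch below follows the comment header of the loop (R(i) = X(i) + Y(i) + A(i) + B(i), optionally halved) and mirrors the pair-by-pair order with two temporaries. The function name `sum4_ref` is hypothetical, and the model assumes the sums stay within 16 bits; it does not try to reproduce the DSP's saturation or accumulator behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Reference model: r[i] = x[i] + y[i] + a[i] + b[i], halved if scale != 0.
   ASSUMPTION: the caller keeps every result within the 16-bit range. */
void sum4_ref(const short *x, const short *y,
              const short *a, const short *b,
              short *r, size_t nx, int scale)
{
    size_t i;

    assert(x && y && a && b && r);
    for (i = 0; i < nx; i++) {
        long t0 = (long)x[i] + (long)y[i];   /* first pair  */
        long t1 = (long)a[i] + (long)b[i];   /* second pair */
        long acc = t0 + t1;                  /* combine the partial sums */

        if (scale) {
            /* floor division by 2, like an arithmetic right shift by 1 */
            acc = (acc >= 0) ? acc / 2 : -((-acc + 1) / 2);
        }
        r[i] = (short)acc;
    }
}
```

Running the same input vectors through this model and through the assembly routine, then comparing element by element, is one way to confirm the temporaries hold what is expected.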

     

    Could I store the temporary results of the additions in another location without reducing performance?

    If the number of input vectors increases, I will need more auxiliary registers to point to them.

     

    Thanks!

     

    Max