SOP Optimization in Assembly!

Other Parts Discussed in Thread: CCSTUDIO

Hi,

I want to write a function that computes the sum of products of 16-bit operands, and I want it to be optimized, so I have written it in linear assembly. The code follows:

                .def _SOP
_SOP: .cproc vec1, vec2, n, qvec1, qvec2, qout
                .reg         i, L_temp
                ZERO        L_temp
                MVK            2, i
               
Loop1:            LDHU        *vec1++, A0
                LDHU        *vec2++, A1
                LDHU        *vec1++, A2
                LDHU        *vec2++, A3

                LDHU        *vec1++, B0
                LDHU        *vec2++, B1
                LDHU        *vec1++, B2
                LDHU        *vec2++, B3

                SHL        A0, 16, A0
                SHL        B0, 16, B0
                SHL        A1, 16, A1
                SHL        B1, 16, B1

                OR         A0, A2, A0
                OR         B0, B2, B0
                OR         A1, A3, A1
                OR         B1, B3, B1

                SMPY2    A0, A1, A5:A4
                SMPY2    B0, B1, B5:B4
               
                SADD    A4, A5, A5
                SADD    B4, B5, B5
                SADD    A5, B5, A5
                SADD    L_temp, A5, L_temp

                SUB        i,1,i
            [i]    B        Loop1
               
                .return L_temp       
                .endproc

vec1 and vec2 are arrays of 16-bit operands.

The basic idea is to use the functional units (.M, .S, .L, .D) in parallel. But when I check the generated assembly for this function, the compiler does make the instructions parallel.

Any help would be highly appreciated.

  • The array length of vec1 and vec2 is 8.

    For example, vec1 = {1, 2, 3, 4, 5, 6, 7, 8};

  • Hello Saad, I'm not sure I completely understand your question. Are you saying that the compiler does make the instructions parallel, or do you mean it doesn't make the instructions parallel?
    I'm glad to see you attempted to write optimized code for C6x, but I always use C with intrinsics instead of SA or ASM. In the majority of cases the compiler gives equally good performance. You may also want to study the FIR filter implementations in DSPLIB. What you are trying to do can also be seen as FIR filtering, and you can study the various FIR filters provided here: C:\CCStudio_v3.3\c64plus\dsplib_v210\src

    That should give you a good idea of how an optimal function can be written for what you are trying to do. If you face any issues, please write back and we will be happy to help you.

    Regards,
    Gagan

  • Thanks Gagan,

    My question was basically this: I know, at least theoretically, that linear assembly is more optimized than C. I wrote the SOP code with the architecture in mind, and I was expecting the assembler to make the statements parallel, which it did not. For example, I was using .L1 and .L2 in statements right after each other, but the assembler did not make them parallel.

    Am I missing an option that I need to set in the build options?

  • Generally, linear assembly does not pay much attention to the functional unit you specify, but the A/B side specification does guide the compiler's optimization. So when you write .L1 and .L2, they behave the same as .1 and .2: they only specify which side to use and do not affect which functional unit the compiler actually allocates. As for why the two statements are not executed in parallel, it is the compiler that decides which statements run in parallel; you cannot dictate the parallelism. Hope this helps.

  • Hello Saad, I apologize, but I still don't completely understand your question. Anyway, I tried compiling the code you provided and got the schedule below.

    ;*----------------------------------------------------------------------------*
    ;*   SOFTWARE PIPELINE INFORMATION
    ;*
    ;*      Loop source line                 : 7
    ;*      Loop closing brace source line   : 40
    ;*      Known Minimum Trip Count         : 1                   
    ;*      Known Max Trip Count Factor      : 1
    ;*      Loop Carried Dependency Bound(^) : 3
    ;*      Unpartitioned Resource Bound     : 4
    ;*      Partitioned Resource Bound(*)    : 4
    ;*      Resource Partition:
    ;*                                A-side   B-side
    ;*      .L units                     0        0    
    ;*      .S units                     2        2    
    ;*      .D units                     4*       4*   
    ;*      .M units                     1        1    
    ;*      .X cross paths               1        0    
    ;*      .T address paths             4*       4*   
    ;*      Long read paths              0        0    
    ;*      Long write paths             0        0    
    ;*      Logical  ops (.LS)           3        1     (.L or .S unit)
    ;*      Addition ops (.LSD)          2        2     (.L or .S or .D unit)
    ;*      Bound(.L .S .LS)             3        2    
    ;*      Bound(.L .S .D .LS .LSD)     4*       3    
    ;*
    ;*      Searching for software pipeline schedule at ...
    ;*         ii = 4  Schedule found with 5 iterations in parallel
    ;*

    As you can see, the loop is bound by pressure on the .D units. I also see that your code uses many halfword loads. You can easily replace them with word loads or even double-word loads. Also, the SHL and OR sequences can easily be replaced with PACK instructions. All these changes are much easier to make in C.
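To make the .D-unit bound concrete, here is a rough back-of-the-envelope sketch in C. It only models the load side: it assumes the two .D units (one per side) can each issue one load per cycle, and it ignores every other bound the compiler reports. The helper names `unit_bound` and `load_bound_cycles_per_sample` are made up for this illustration.

```c
#include <assert.h>

/* Lower bound on cycles per loop iteration imposed by one kind of
   functional unit: ceil(ops_per_iter / units). */
static int unit_bound(int ops_per_iter, int units)
{
    assert(ops_per_iter >= 0 && units > 0);
    return (ops_per_iter + units - 1) / units;
}

/* Cycles per product when only the loads are considered.
   Assumption: two .D units are available for loads. */
static double load_bound_cycles_per_sample(int loads_per_iter,
                                           int products_per_iter)
{
    assert(products_per_iter > 0);
    return (double)unit_bound(loads_per_iter, 2) / products_per_iter;
}
```

With the posted loop (8 LDHU instructions for 4 products per iteration), `unit_bound(8, 2)` gives 4, which matches the `ii = 4` schedule above, so the loop cannot go below about 1 cycle per product. If the same 4 products were fed by 2 double-word loads, one per vector, the load-side bound would drop to 1 cycle per iteration, or 0.25 cycles per product.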

    Can you be more specific on what you expect to happen and what you are observing?

    Regards,
    Gagan

  • Thanks a lot Gagan, you are really helping.

    The thing is that I have a C function that I need to optimize. As a first step I used intrinsics. It got faster, but I wanted to optimize it even more. So I found out that the next step would be to write the code in linear assembly, which I did. The frustrating part is that the linear assembly version was not any faster. Then I opted to go to actual assembly, and here again my code wasn't any faster. In certain cases my code was slower than what the C compiler produced.

    I am confused: if assembly is more optimized, then why isn't my code showing that?

  • Hello Saad, I'm glad I'm being able to help! Please see below:

    > The thing is that I have a C function that I need to optimize. As a first step I used intrinsics. It got faster, but I wanted to optimize it even more. So I found out that the next step would be to write the code in linear assembly, which I did. The frustrating part is that the linear assembly version was not any faster. Then I opted to go to actual assembly, and here again my code wasn't any faster. In certain cases my code was slower than what the C compiler produced.

    I am sorry about your issue. Please note that optimization is not about writing serial assembly or hand assembly. The only advantage that serial assembly or hand ASM gives you is more control over how the functional units are used and which ASM instructions are picked. Also, for very convoluted loops, you can sometimes use tricks to get slightly better performance. There are a few big issues with this approach:

    • Performance is limited by the user's understanding of how the functionality can be expressed in ASM. The result is often not the best possible if the user is unaware of special instructions or optimization tricks that could be used.
    • Any functionality change will require serious rewriting of the code. In some cases, the user may have to redo the code completely, so the earlier coding effort is wasted.
    • Debugging and bug fixing are difficult. Also, if the code is moved to a newer generation of the core, it must be rewritten to use the newer enhancements.

    TI has spent many years enhancing the C6000 C compiler to let developers get maximum performance from the architecture. The compiler not only tries various optimization techniques to get the best performance, it also provides feedback on the issues that prevent it from doing better. Most of the time, the user can give the compiler additional information (using pragmas) that helps it make more efficient optimization decisions. The compiler also lets the user request device-specific instructions through intrinsics. Considering the advantages of C software development, the efficiency and flexibility of the compiler, and the suitability of the architecture for C code development, I highly recommend that you choose SA or ASM only very rarely.
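As a minimal sketch of what "additional information" can look like, the loop below passes alignment and trip-count promises to the TI compiler through `_nassert` and `#pragma MUST_ITERATE`. The function name `dotprod_hinted` and the specific promises (8-byte alignment, count at least 4 and a multiple of 4) are assumptions chosen for illustration, not part of the code discussed in this thread. The TI-specific lines are guarded so the file still builds with a host compiler.

```c
#include <assert.h>

/* Hypothetical example: a plain C dot product with compiler hints. */
int dotprod_hinted(const short * restrict m,
                   const short * restrict n,
                   int count)
{
    int i, sum = 0;

    /* Host-side check of the promise made to the compiler below. */
    assert(count >= 4 && count % 4 == 0);

#ifdef __TI_COMPILER_VERSION__
    /* Promise that both pointers are double-word (8-byte) aligned. */
    _nassert((int)m % 8 == 0);
    _nassert((int)n % 8 == 0);
    /* Promise a trip count of at least 4 that is a multiple of 4. */
    #pragma MUST_ITERATE(4, , 4)
#endif
    for (i = 0; i < count; i++) {
        sum += m[i] * n[i];
    }
    return sum;
}
```

The `restrict` qualifiers tell the compiler the two arrays do not overlap. Whether these hints are enough for the compiler to pick wide loads and packed multiplies on its own depends on the compiler version and options, so the generated software pipeline information is the place to check.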

    For your kernel, please consider the dot product kernel in DSPLIB: C:\CCStudio_v3.3\c64plus\dsplib_v210\src\DSP_dotprod

    The kernel implements the same algorithm that you are trying to implement:

    int DSP_dotprod_cn (
        const short *m,    /* Pointer to first vector  */
        const short *n,    /* Pointer to second vector */
        int count          /* Length of vectors.       */
    )
    {
        int i, sum = 0;
        for (i = 0; i < count; i++) {
            sum += m[i] * n[i];
        }
        return sum;
    }

    Please see below the optimized C code:

    int DSP_dotprod (
        short * restrict m,
        short * restrict n,
        int count
    )
    {
        int i;
        int sum1 = 0;
        int sum2 = 0;
        /* The kernel assumes that the data pointers are double word aligned */
        _nassert((int)m % 8 == 0);
        _nassert((int)n % 8 == 0);

        /* The kernel assumes that the input count is multiple of 4 */
        for (i = 0; i < count; i+=4) {
            sum1 += _dotp2(_lo(_amemd8_const(&m[i])),  _lo(_amemd8_const(&n[i])));
            sum2 += _dotp2(_hi(_amemd8_const(&m[i])),  _hi(_amemd8_const(&n[i])));
        }
        return (sum1+sum2);
    }

    Please see below the performance comparison of the above code with the SA code you provided:
    SA Code: 1 clock per sample
    Above C code: 0.25 clock per sample

    That is, the code is 4 times faster than the one you provided.

    As you will notice, it is not SA/ASM that gives performance. It is an understanding of the architecture and correct use of the device that gives performance. In this case the performance was achieved by using special instructions, namely the double-word load (_amemd8_const) and the dot-product instruction (_dotp2).
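If you want to convince yourself on a PC that the packed formulation computes the same sum as the plain loop, a host-side emulation along these lines may help. This is only a sketch: `dotp2_emul`, `load_pair`, `dotprod_ref` and `dotprod_packed_emul` are hypothetical stand-ins written in portable C, not TI intrinsics, and they say nothing about cycle counts on the DSP.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for _dotp2: multiply the signed low halves and the signed
   high halves of two 32-bit words and add the two products. */
static int dotp2_emul(uint32_t a, uint32_t b)
{
    int16_t a_lo = (int16_t)(uint16_t)(a & 0xFFFFu);
    int16_t a_hi = (int16_t)(uint16_t)(a >> 16);
    int16_t b_lo = (int16_t)(uint16_t)(b & 0xFFFFu);
    int16_t b_hi = (int16_t)(uint16_t)(b >> 16);
    return (int)a_lo * b_lo + (int)a_hi * b_hi;
}

/* Copy two adjacent 16-bit samples into one 32-bit word. Both vectors
   are packed the same way, so byte order does not affect the sum. */
static uint32_t load_pair(const int16_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Plain reference loop, same arithmetic as DSP_dotprod_cn. */
static int dotprod_ref(const int16_t *m, const int16_t *n, int count)
{
    int i, sum = 0;
    for (i = 0; i < count; i++) {
        sum += m[i] * n[i];
    }
    return sum;
}

/* Packed version: four samples per iteration, two partial sums. */
static int dotprod_packed_emul(const int16_t *m, const int16_t *n, int count)
{
    int i, sum1 = 0, sum2 = 0;
    assert(count % 4 == 0);   /* same assumption as the kernel above */
    for (i = 0; i < count; i += 4) {
        sum1 += dotp2_emul(load_pair(&m[i]),     load_pair(&n[i]));
        sum2 += dotp2_emul(load_pair(&m[i + 2]), load_pair(&n[i + 2]));
    }
    return sum1 + sum2;
}
```

For the example vector {1, 2, 3, 4, 5, 6, 7, 8} multiplied with itself, both functions return 204.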


    I hope the above helps.

    Regards,
    Gagan