Linear assembly optimization problem on CCS v5

When I reuse linear assembly (.sa) files written for the C64x platform on the C66x platform,

the .sa function consumes far more cycles on C66x (CCS v5) than on C64x (CCS v3.3): 18000 cycles on CCS v5 versus 2000 cycles on CCS v3.

I then compared the .asm files generated from the same .sa file and found that the software pipelining produced by CCS v3.3 is much better than that produced by CCS v5.

Compiler versions: 7.3.0 (CCS v5), 6.1.11 (CCS v3).

PS: optimization level o3 is already applied in my CCS v5 project.

I'm anxious about this problem. Can anyone give me advice on this issue?

Thanks!

  • How is the performance with 7.3.0 when you compile that file for C64x?

    Could you please post the software pipelining information comment block from the output assembly file for each version?

  • I now know that the poor linear assembly efficiency on the C6670 was caused by a wrong setting under "Basic Options": I had entered 0 in the "optimize for code " option while --O3 was selected.

    After I left "optimize for code " at its default (empty), the efficiency improved a lot.

    However, when I compare the efficiency of the same .sa file on C66x and C64x, the difference is still obvious.

    A simple source code example follows.

          /****************.sa code***************/

    /*******  this function computes the dot product of two vectors and then right-shifts the result by the expected number of bits ***********/

                           .text
                     .global  Bas_CplxShrMul16
           
    xxxxx: .cproc     A_x,       B_x,      A_d0,   B_t,     A_nx
     
                     .no_mdep
     
                     .reg       A_01, A_23,   A_32:A_10, A_1:A_0 
                     .reg       B_01, B_23,   B_32:B_10
                     .reg       A_tempr0,     A_tempi0,  A_tempr1,  A_tempi1
                     .reg       A_i,   A_p,   A_t  
                      
                      MV        B_t,       A_t
                      SHR       A_nx,      1,         A_i               
                      SUB       A_i,       2,         A_i       ;                 
    LOOP:          
                      LDNDW    *A_x++,      A_32:A_10
                      LDNDW    *B_x++,      B_32:B_10
                                                                ;

                      SWAP2     A_10,       A_01                       
                      SWAP2     B_10,       B_01
                      SWAP2     A_32,       A_23                       
                      SWAP2     B_32,       B_23
                                                 

                      DOTPN2    A_01,       B_01,      A_tempr0    
                      DOTP2     A_01,       B_10,      A_tempi0
                      DOTPN2    A_23,       B_23,      A_tempr1    
                      DOTP2     A_23,       B_32,      A_tempi1       
                     
                      SHR       A_tempr0,   A_t,       A_tempr0 ;
                      SHR       A_tempi0,   B_t,       A_tempi0
                      SHR       A_tempr1,   A_t,       A_tempr1 ;

                      SHR       A_tempi1,   B_t,       A_tempi1

                      PACK2     A_tempi0,   A_tempr0,  A_0 
                      PACK2     A_tempi1,   A_tempr1,  A_1       
                      STNDW     A_1:A_0,   *A_d0++
                                    
                      BDEC      LOOP,       A_i 
                     
                      AND       0x01,  A_nx,    A_p
               [!A_p] B         END
                      LDNW      *A_x,       A_10
                      LDNW      *B_x,       B_10
                                                                ;

                      SWAP2     A_10,       A_01                       
                      SWAP2     B_10,       B_01
                                                                ;

                      DOTPN2    A_01,       B_01,      A_tempr0    
                      DOTP2     A_01,       B_10,      A_tempi0
                     
                      SHR       A_tempr0,   A_t,       A_tempr0 ;

                      SHR       A_tempi0,   B_t,       A_tempi0
                      PACK2     A_tempi0,   A_tempr0,  A_0 
                      STNW      A_0,       *A_d0
    END:            
                     .endproc
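
For readers who want to check the linear assembly against something simpler, here is a minimal C sketch of what the routine appears to compute. It rests on assumptions that the post does not state: little-endian data, each complex sample stored as a 16-bit real part followed by a 16-bit imaginary part, and an arithmetic right shift for negative values. The name `cplx_shr_mul16_ref` is hypothetical and is not part of the original code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical reference model of the .sa routine (not the original code).
 * a, b, out hold n complex samples as interleaved int16 pairs: re, im.
 * Assumes >> on a negative int32_t is an arithmetic shift, as SHR is on C6000.
 */
static void cplx_shr_mul16_ref(const int16_t *a, const int16_t *b,
                               int16_t *out, int shift, int n)
{
    assert(shift >= 0 && shift < 32);
    assert(n >= 0);

    for (int k = 0; k < n; k++) {
        int32_t ar = a[2 * k], ai = a[2 * k + 1];
        int32_t br = b[2 * k], bi = b[2 * k + 1];

        /* DOTPN2: difference of the two 16x16 products (real part). */
        int64_t re = (int64_t)(ar * br) - (int64_t)(ai * bi);
        /* DOTP2: sum of the two cross products (imaginary part). */
        int64_t im = (int64_t)(ar * bi) + (int64_t)(ai * br);

        /* SHR on the 32-bit result. The single corner case where the sum
           does not fit in 32 bits (all inputs -32768) is not modeled. */
        int32_t re_s = (int32_t)re >> shift;
        int32_t im_s = (int32_t)im >> shift;

        /* PACK2 keeps only the low 16 bits of each result (no saturation). */
        out[2 * k]     = (int16_t)(uint16_t)(re_s & 0xFFFF);
        out[2 * k + 1] = (int16_t)(uint16_t)(im_s & 0xFFFF);
    }
}
```

Running the same input buffers through this model and through the .sa routine built for each target is one way to confirm that both compiler versions still produce identical results before comparing cycle counts.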

    /*******************************loop kernel of .asm generated from ccs v5****************************/

    $C$L4:    ; PIPED LOOP KERNEL
               SHR     .S2     A_tempi0$1,B_t,A_tempi0$2 ; |89| <0,12>
    ||         SHR     .S1     A_tempr1',A_t,A_tempr1' ; |90| <0,12> right shift
    ||         DOTP2   .M2X    A_01',B_10,A_tempi0$1 ; |84| <1,8>
    ||         DOTPN2  .M1X    A_23',B_23',A_tempr1' ; |85| <1,8>
    ||         LDNDW   .D2T2   *A_x++(8),A_32:A_10 ; |74| <3,0>

               MV      .D2X    A_tempr0,A_tempr0$4 ; |88| <0,13> Define a twin register
    ||         PACKLH2 .L2     A_32,A_32,A_23'   ; |79| <2,5>
    ||         PACKLH2 .S2     A_10,A_10,A_01'   ; |77| <2,5>
    ||         LDNDW   .D1T1   *B_x++(8),B_32:B_10 ; |75| <3,1>

               PACK2   .L2     A_tempi0$2,A_tempr0$4,A_0 ; |94| <0,14>
    ||         PACK2   .S2X    A_tempi1$2,A_tempr1',A_1 ; |95| <0,14>
    ||         BDEC    .S1     $C$L4,A_i         ; |98| <1,10>
    ||         PACKLH2 .L1     B_10,B_10,B_01$2  ; |78| <2,6>

       [ A0]   SUB     .D1     A0,1,A0           ; <0,15>
    || [!A0]   STNDW   .D2T2   A_1:A_0,*A_d0++(8) ; |96| <0,15>
    ||         SHR     .S2     A_tempi1$2,B_t,A_tempi1$2 ; |91| <1,11>
    ||         SHR     .S1     A_tempr0$1,A_t,A_tempr0 ; |88| <1,11> right shift
    ||         DOTP2   .M2X    A_23',B_32,A_tempi1$2 ; |86| <2,7>
    ||         PACKLH2 .L1     B_32,B_32,B_23'   ; |80| <2,7>
    ||         DOTPN2  .M1X    A_01',B_01$2,A_tempr0$1 ; |83| <2,7>

     

    /*******************************kernel loop of .asm generated from ccs v3.3****************************/

    $C$L4:    ; PIPED LOOP KERNEL

       [ A0]   MPYSU   .M1     2,A0,A0           ; <0,15>
    ||         SHR     .S1     A_tempr0$2,A_t,A_tempr0$4 ; |88| <1,12> right shift
    ||         BDEC    .S2     $C$L4,A_i''       ; |98| <1,12>
    ||         PACKLH2 .L1     A_10'',A_10'',A_01' ; |77| <3,6>
    ||         PACKLH2 .L2X    A_32$2,A_32$1,A_23' ; |79| <3,6>  ^
    ||         MV      .D1X    B_10$2,B_10$1     ; |75| <3,6> Define a twin register
    ||         LDNDW   .D2T1   *A_x++(8),A_32$1:A_10'' ; |74| <5,0>  ^

               MV      .S1X    A_1$2,A_1$1       ; |95| <0,16>  ^ Define a twin register
    ||         SHR     .S2     A_tempr1$1,A_t',A_tempr1$1 ; |90| <1,13> right shift
    ||         PACKLH2 .L1     B_10$1,B_10$1,B_01'' ; |78| <3,7>
    ||         DOTP2   .M2     A_23',B_32',A_tempi1$2 ; |86| <3,7>
    ||         PACKLH2 .L2     B_32',B_32',B_23' ; |80| <3,7>
    ||         DOTP2   .M1     A_01',B_10$1,A_tempi0$4 ; |84| <3,7>
    ||         LDNDW   .D1T2   *B_x++(8),B_32':B_10$2 ; |75| <5,1>

       [!A0]   STNDW   .D1T1   A_1$1:A_0'',*A_d0'++(8) ; |96| <0,17>
    ||         PACK2   .L2     A_tempi1$1,A_tempr1$1,A_1$2 ; |95| <1,14>  ^
    ||         PACK2   .L1     A_tempi0$3,A_tempr0$4,A_0'' ; |94| <1,14>
    ||         SHR     .S2     A_tempi1$2,B_t'',A_tempi1$1 ; |91| <2,11>
    ||         SHR     .S1     A_tempi0$4,B_t',A_tempi0$3 ; |89| <2,11>
    ||         DOTPN2  .M1     A_01',B_01'',A_tempr0$2 ; |83| <3,8>
    ||         DOTPN2  .M2     A_23',B_23',A_tempr1$1 ; |85| <3,8>
    ||         MV      .D2X    A_32$1,A_32$2     ; |74| <4,5>  ^ Define a twin register

    /*---------------------------------------------------source code end-----------------------------------------------------*/

    It's obvious that the kernel generated by CCS v3.3 is one cycle shorter than the one generated by CCS v5: three execute packets per iteration instead of four.

    In CCS v5 I only set the --O3 optimization level and left all other optimization-related options at their defaults.

    So now I want to know whether I made a mistake in the project properties, because I believe the C6670 should run faster than the C6416 anyway.

  • This happens because internally the C66x compiler is slightly different and newer than the C64x compiler.  The C66x compiler usually generates better code, but there are still some places where the C64x compiler generates better code.  The C6000 compiler team is presently working on closing that performance gap.  This test case is a good example of one such gap.

    I've submitted SDSCM00046105 to track this issue.

  • I should mention that this has nothing to do with the CCS version; it's the change from C64x to C66x which exposes the problem.  You should be able to compile that serial assembly file as C64x and it will have the same performance as before.

  • I don't have a test case ready, but I have observed that the 7-series compiler sometimes produces worse pipelined loops for the C6416 than the 6-series compiler does.

  • Thank you for your response.

    Actually, I think the .sa source code I posted above can serve as a test case.

  • Thank you for your fast response.

    Yes, maybe I can sometimes compile the linear assembly as C64x and then use the generated .asm on the C66x device.