Linear assembly optimization problem on CCS v5

kabalagala

When I reuse C64x linear assembly (.sa) files on the C66x platform, the .sa function consumes far more cycles on C66x (CCS v5) than on C64x (CCS v3.3): about 18000 cycles under CCS v5 versus 2000 cycles under CCS v3.3.

I then compared the .asm files generated from the same .sa file and found that the software pipelining produced under CCS v3.3 is much better than under CCS v5.

Compiler versions: 7.3.0 (CCS v5), 6.1.11 (CCS v3.3).

P.S. In my CCS v5 project, optimization level -O3 is already applied.

I'm anxious about this problem. Can anyone give me advice on this issue?

Thanks!

over 13 years ago

0 Archaeologist over 13 years ago

TI__Guru* 84285 points

How does 7.3.0 perform when you compile that file for C64x?

Could you please post the software pipelining information comment block from the output assembly file for each version?

0 kabalagala over 13 years ago in reply to Archaeologist

Intellectual 300 points

I now know that the poor linear assembly performance on the C6670 was caused by a wrong setting under "Basic Options": I had entered 0 in the "optimize for code " option while -O3 was selected.

After I left "optimize for code " at its default (empty), the performance improved a lot.

However, when I compare C66x against C64x for the same .sa file, the difference is still obvious.

A simple source code example follows.

/****************.sa code***************/

/******* this function multiplies two complex vectors (using the dot-product instructions) and then right-shifts each result by the requested number of bits ***********/

                       .text
                 .global Bas_CplxShrMul16

xxxxx: .cproc     A_x,       B_x,      A_d0,   B_t,     A_nx

                 .no_mdep

                 .reg       A_01, A_23,   A_32:A_10, A_1:A_0
                 .reg       B_01, B_23,   B_32:B_10
                 .reg       A_tempr0,     A_tempi0, A_tempr1, A_tempi1
                 .reg       A_i,   A_p,   A_t

                  MV        B_t,       A_t
                  SHR       A_nx,      1,         A_i
                  SUB       A_i,       2,         A_i       ;
LOOP:
                  LDNDW    *A_x++,      A_32:A_10
                  LDNDW    *B_x++,      B_32:B_10
                                                            ;

                  SWAP2     A_10,       A_01
                  SWAP2     B_10,       B_01
                  SWAP2     A_32,       A_23
                  SWAP2     B_32,       B_23

                  DOTPN2    A_01,       B_01,      A_tempr0
                  DOTP2     A_01,       B_10,      A_tempi0
                  DOTPN2    A_23,       B_23,      A_tempr1
                  DOTP2     A_23,       B_32,      A_tempi1

                  SHR       A_tempr0,   A_t,       A_tempr0 ;
                  SHR       A_tempi0,   B_t,       A_tempi0
                  SHR       A_tempr1,   A_t,       A_tempr1 ;
                  SHR       A_tempi1,   B_t,       A_tempi1

                  PACK2     A_tempi0,   A_tempr0, A_0
                  PACK2     A_tempi1,   A_tempr1, A_1
                  STNDW     A_1:A_0,   *A_d0++

                  BDEC      LOOP,       A_i

                  AND       0x01,  A_nx,    A_p
           [!A_p] B         END
                  LDNW      *A_x,       A_10
                  LDNW      *B_x,       B_10
                                                            ;

                  SWAP2     A_10,       A_01
                  SWAP2     B_10,       B_01
                                                            ;

                  DOTPN2    A_01,       B_01,      A_tempr0
                  DOTP2     A_01,       B_10,      A_tempi0

                  SHR       A_tempr0,   A_t,       A_tempr0 ;
                  SHR       A_tempi0,   B_t,       A_tempi0
                  PACK2     A_tempi0,   A_tempr0, A_0
                  STNW      A_0,       *A_d0
END:
                 .endproc
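
For readers who do not read linear assembly, the routine above can be sketched in plain C. This is only a reading of the listing, not code from the original poster: the name `cplx_shr_mul16_ref` is hypothetical, and it assumes little-endian data stored as interleaved (real, imaginary) 16-bit pairs, with `nx` counting complex samples. DOTPN2 is taken to yield the real part, DOTP2 the imaginary part, and PACK2 to keep the low 16 bits of each shifted result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference model of the .sa routine (name not from the thread).
 * Assumed layout: x[2k] = real, x[2k+1] = imaginary, same for y and d. */
void cplx_shr_mul16_ref(const int16_t *x, const int16_t *y, int16_t *d,
                        int shift, int nx)
{
    assert(shift >= 0 && shift < 32);
    for (int k = 0; k < nx; k++) {
        int32_t xr = x[2 * k], xi = x[2 * k + 1];
        int32_t yr = y[2 * k], yi = y[2 * k + 1];

        /* Sums fit in 32 bits unless all four inputs are -32768. */
        int32_t re = xr * yr - xi * yi;   /* role of DOTPN2 */
        int32_t im = xr * yi + xi * yr;   /* role of DOTP2  */

        /* Assumes >> on a negative int32_t is an arithmetic shift (as SHR is)
         * and that the cast keeps the low 16 bits (as PACK2 does). */
        d[2 * k]     = (int16_t)(re >> shift);
        d[2 * k + 1] = (int16_t)(im >> shift);
    }
}
```

A model like this processes one sample per iteration, so it covers both the unrolled-by-two main loop and the odd-sample tail of the .sa version; it is meant for checking results, not for performance.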

/*******************************loop kernel of .asm generated from ccs v5****************************/

$C$L4:    ; PIPED LOOP KERNEL
           SHR     .S2     A_tempi0$1,B_t,A_tempi0$2 ; |89| <0,12>
||         SHR     .S1     A_tempr1',A_t,A_tempr1' ; |90| <0,12> right shift
||         DOTP2   .M2X    A_01',B_10,A_tempi0$1 ; |84| <1,8>
||         DOTPN2  .M1X    A_23',B_23',A_tempr1' ; |85| <1,8>
||         LDNDW   .D2T2   *A_x++(8),A_32:A_10 ; |74| <3,0>

           MV      .D2X    A_tempr0,A_tempr0$4 ; |88| <0,13> Define a twin register
||         PACKLH2 .L2     A_32,A_32,A_23'   ; |79| <2,5>
||         PACKLH2 .S2     A_10,A_10,A_01'   ; |77| <2,5>
||         LDNDW   .D1T1   *B_x++(8),B_32:B_10 ; |75| <3,1>

           PACK2   .L2     A_tempi0$2,A_tempr0$4,A_0 ; |94| <0,14>
||         PACK2   .S2X    A_tempi1$2,A_tempr1',A_1 ; |95| <0,14>
||         BDEC    .S1     $C$L4,A_i         ; |98| <1,10>
||         PACKLH2 .L1     B_10,B_10,B_01$2 ; |78| <2,6>

   [ A0]   SUB     .D1     A0,1,A0           ; <0,15>
|| [!A0]   STNDW   .D2T2   A_1:A_0,*A_d0++(8) ; |96| <0,15>
||         SHR     .S2     A_tempi1$2,B_t,A_tempi1$2 ; |91| <1,11>
||         SHR     .S1     A_tempr0$1,A_t,A_tempr0 ; |88| <1,11> right shift
||         DOTP2   .M2X    A_23',B_32,A_tempi1$2 ; |86| <2,7>
||         PACKLH2 .L1     B_32,B_32,B_23'   ; |80| <2,7>
||         DOTPN2  .M1X    A_01',B_01$2,A_tempr0$1 ; |83| <2,7>

/*******************************kernel loop of .asm generated from ccs v3.3****************************/

$C$L4: ; PIPED LOOP KERNEL

   [ A0]   MPYSU   .M1     2,A0,A0           ; <0,15>
||         SHR     .S1     A_tempr0$2,A_t,A_tempr0$4 ; |88| <1,12> right shift
||         BDEC    .S2     $C$L4,A_i''       ; |98| <1,12>
||         PACKLH2 .L1     A_10'',A_10'',A_01' ; |77| <3,6>
||         PACKLH2 .L2X    A_32$2,A_32$1,A_23' ; |79| <3,6> ^
||         MV      .D1X    B_10$2,B_10$1     ; |75| <3,6> Define a twin register
||         LDNDW   .D2T1   *A_x++(8),A_32$1:A_10'' ; |74| <5,0> ^

           MV      .S1X    A_1$2,A_1$1       ; |95| <0,16> ^ Define a twin register
||         SHR     .S2     A_tempr1$1,A_t',A_tempr1$1 ; |90| <1,13> right shift
||         PACKLH2 .L1     B_10$1,B_10$1,B_01'' ; |78| <3,7>
||         DOTP2   .M2     A_23',B_32',A_tempi1$2 ; |86| <3,7>
||         PACKLH2 .L2     B_32',B_32',B_23' ; |80| <3,7>
||         DOTP2   .M1     A_01',B_10$1,A_tempi0$4 ; |84| <3,7>
||         LDNDW   .D1T2   *B_x++(8),B_32':B_10$2 ; |75| <5,1>

   [!A0]   STNDW   .D1T1   A_1$1:A_0'',*A_d0'++(8) ; |96| <0,17>
||         PACK2   .L2     A_tempi1$1,A_tempr1$1,A_1$2 ; |95| <1,14> ^
||         PACK2   .L1     A_tempi0$3,A_tempr0$4,A_0'' ; |94| <1,14>
||         SHR     .S2     A_tempi1$2,B_t'',A_tempi1$1 ; |91| <2,11>
||         SHR     .S1     A_tempi0$4,B_t',A_tempi0$3 ; |89| <2,11>
||         DOTPN2  .M1     A_01',B_01'',A_tempr0$2 ; |83| <3,8>
||         DOTPN2  .M2     A_23',B_23',A_tempr1$1 ; |85| <3,8>
||         MV      .D2X    A_32$1,A_32$2     ; |74| <4,5> ^ Define a twin register

/*---------------------------------------------------source code end-----------------------------------------------------*/

It's obvious that the kernel generated by CCS v3.3 takes one cycle fewer per iteration than the one generated by CCS v5 (3 execute packets versus 4).

In CCS v5 I only set the -O3 optimization level and left every other optimization-related option at its default.

So now I want to know whether I made a mistake in the project properties, because I believe the C6670 should run faster than the C6416 anyway.
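
To put the kernel difference in numbers, here is a rough sketch of the steady-state cost. The helper name `kernel_cycles` is hypothetical. It assumes that each kernel iteration handles two complex samples (as the LDNDW loads in the .sa loop suggest) and it ignores the prolog, the epilog, and the odd-sample tail, so it is an estimate only.

```c
#include <stdint.h>

/* Hypothetical helper: approximate cycles spent in the pipelined kernel.
 * nx = number of complex samples, ii = execute packets per kernel iteration
 * (4 in the CCS v5 listing, 3 in the CCS v3.3 listing). */
uint32_t kernel_cycles(uint32_t nx, uint32_t ii)
{
    return (nx / 2u) * ii;  /* two complex samples per kernel iteration */
}
```

Under these assumptions, 1000 complex samples cost about 2000 kernel cycles with the 4-cycle kernel and about 1500 with the 3-cycle kernel, so the newer compiler's loop is roughly a third slower in steady state.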

0 Archaeologist over 13 years ago in reply to kabalagala

TI__Guru* 84285 points

This happens because internally the C66x compiler is slightly different and newer than the C64x compiler. The C66x compiler usually generates better code, but there are still some places where the C64x compiler generates better code. The C6000 compiler team is presently working on closing that performance gap. This test case is a good example of one such gap.

I've submitted SDSCM00046105 to track this issue.

0 Archaeologist over 13 years ago in reply to Archaeologist

TI__Guru* 84285 points

I should mention that this has nothing to do with the CCS version; it's the change from C64x to C66x that exposes the problem. You should be able to compile that linear assembly file as C64x and it will have the same performance as before.

0 Victor Kazmirenko over 13 years ago in reply to Archaeologist

Guru 13202 points

I don't have a test case ready, but I have observed that the 7-series compiler sometimes produces worse pipelined loops for the C6416 than the 6-series one.

0 kabalagala over 13 years ago in reply to Victor Kazmirenko

Intellectual 300 points

Thank you for your response.

Actually, I think the .sa source code I posted above can serve as a test case.

0 kabalagala over 13 years ago in reply to Archaeologist

Intellectual 300 points

Thank you for your fast response.

Yes, maybe I can sometimes compile the linear assembly as C64x and then use the generated .asm on the C66x device.
