TMS320C6455: Software pipelining

Part Number: TMS320C6455

Hi Team,

Good day.

Our customer is having trouble optimizing the simple loop below.

for (i = 0; i < 16000; i++) {
    x[i] = i;
    y[i] = i;
}

When they set the optimization level to -O2 in CCSv5, the generated assembly file shows an iteration interval of 1 for this loop.
The loop should therefore take about 16,000 cycles, but when they run the code and measure it, it takes 120,000 cycles. What do you think could be the issue?

With optimization off (-Ooff), the loop took 400,000 cycles, and with -O1 it took 150,000. Attached is the software pipelining information for the loop.
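As a quick sanity check on these figures, the per-iteration cost can be worked out as below. This is only a sketch of the arithmetic; the function names are made up for illustration and the inputs are the cycle counts quoted above.

```c
#include <stdio.h>

/* Average number of cycles spent per loop iteration. */
double cycles_per_iteration(unsigned long total_cycles, unsigned long iterations)
{
    return (double)total_cycles / (double)iterations;
}

/* Print the per-iteration cost for the three measurements quoted above. */
void print_loop_costs(void)
{
    const unsigned long n = 16000UL; /* loop trip count */

    printf("-Ooff: %.3f cycles/iteration\n", cycles_per_iteration(400000UL, n));
    printf("-O1:   %.3f cycles/iteration\n", cycles_per_iteration(150000UL, n));
    printf("-O2:   %.3f cycles/iteration\n", cycles_per_iteration(120000UL, n));
}
```

At -O2 this comes to 7.5 cycles per iteration, against the 1 cycle per iteration that an iteration interval of 1 would suggest.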

Thanks and regards,

Art

  • Hi Art,

    A few things:

    1. Where are the x and y arrays kept: DDR or internal memory?

    2. Is the cache enabled?

    3. From the generated asm code (without any intrinsics), I see that STW and LDW are used. Note that the C66x architecture can do double-word stores, so further optimization may be possible here (see the guides referenced below).

    SPLOOPD 1 ;2 ; [] (P)
    $C$L2: ; PIPED LOOP KERNEL
    ADD .L2 1,B4,B4 ; [B_L66] |9| (P) <0,0> ^
    || ADD .L1 1,A3,A3 ; [A_L66] |9| (P) <0,0> ^ Define a twin register
    || STW .D1T1 A3,*A4++(4) ; [A_D64P] |11| (P) <0,0> ^
    || STW .D2T2 B4,*B5++(4) ; [B_D64P] |11| (P) <0,0> ^
    SPKERNEL 1,0 ; []

    Have you looked at the guide at https://www.ti.com/lit/ug/sprui04b/sprui04b.pdf? Using intrinsics, you may be able to reduce the cycle count further.

    You can also refer to https://www.ti.com/lit/an/sprabg7/sprabg7.pdf?ts=1620277770543&ref_url=https%253A%252F%252Fwww.google.com%252F for further optimization ideas.
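To illustrate the double-word store idea, here is a minimal portable C sketch that writes two 32-bit elements per 8-byte copy. The function name `fill_index_pairs` is hypothetical, and whether a given compiler actually turns each 8-byte copy into a single double-word store is an assumption to check in the generated assembly. With the TI compiler, the documented `_amem8` and `_itoll` intrinsics are the usual way to request an aligned 8-byte access explicitly; they are left out here so the sketch builds with any C99 compiler.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: set x[i] = i and y[i] = i for i in [0, n),
 * writing two 32-bit elements with each 8-byte copy.
 * 'restrict' tells the compiler that x and y do not overlap.
 */
void fill_index_pairs(int32_t *restrict x, int32_t *restrict y, size_t n)
{
    size_t i;

    /* Main loop: one 8-byte copy per array per pass, covering two elements. */
    for (i = 0; i + 1 < n; i += 2) {
        const int32_t pair[2] = { (int32_t)i, (int32_t)(i + 1) };

        memcpy(&x[i], pair, sizeof pair);
        memcpy(&y[i], pair, sizeof pair);
    }

    /* Tail: handle the last element when n is odd. */
    if (i < n) {
        x[i] = (int32_t)i;
        y[i] = (int32_t)i;
    }
}
```

For the loop in the question this would be called as `fill_index_pairs(x, y, 16000)`. Halving the number of stores only helps if the stores are the bottleneck, so the memory placement and cache questions above still apply.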

    Let me know if you are able to proceed with the above.

    Thanks