CCS/TCI6630K2L: Why does a software-pipelined loop take more cycles than expected on C66x?

Part Number: TCI6630K2L

Tool/software: Code Composer Studio

Hi,

Below is the code of a loop.

for (ii = 0; ii < len_words; ii++)
{
    src1 = _amemd8(pdSrc++);
    src2 = _amemd8(pdSrc2++);
    _amemd8(pdDst++) = src1;
    _amemd8(pdDst2++) = src2;
}

The software pipeline information for this loop is:

;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop found in file : C:/Users/Lekha/workspac_Tarana_Optimiz/Source/memcopy.c
;* Loop source line : 43
;* Loop opening brace source line : 44
;* Loop closing brace source line : 49
;* Known Minimum Trip Count : 1
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 0
;* Unpartitioned Resource Bound : 2
;* Partitioned Resource Bound(*) : 2
;* Resource Partition:
;* A-side B-side
;* .L units 0 0
;* .S units 0 0
;* .D units 2* 2*
;* .M units 0 0
;* .X cross paths 0 0
;* .T address paths 2* 2*
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 0 0 (.L or .S or .D unit)
;* Bound(.L .S .LS) 0 0
;* Bound(.L .S .D .LS .LSD) 1 1
;*
;* Searching for software pipeline schedule at ...
;* ii = 2 Schedule found with 3 iterations in parallel
;* Done
;*
;* Loop will be splooped
;* Collapsed epilog stages : 0
;* Collapsed prolog stages : 0
;* Minimum required memory pad : 0 bytes
;*
;* Minimum safe trip count : 1
;*----------------------------------------------------------------------------*
$C$L1: ; PIPED LOOP PROLOG

SPLOOP 2 ;6 ; (P)
|| MV .L2X A4,B6

;** --------------------------------------------------------------------------*
$C$L2: ; PIPED LOOP KERNEL
$C$DW$L$t_memcpy$3$B:

LDDW .D2T2 *B6++,B5:B4 ; |47| (P) <0,0>
|| LDDW .D1T1 *A3++,A5:A4 ; |48| (P) <0,0>

NOP 2

SPMASK L1
|| MV .L1 A7,A6

NOP 1

SPKERNEL 2,0
|| STDW .D2T2 B5:B4,*B7++ ; |47| <0,5>
|| STDW .D1T1 A5:A4,*A6++ ; |48| <0,5>

$C$DW$L$t_memcpy$3$E:
;** --------------------------------------------------------------------------*
$C$L3: ; PIPED LOOP EPILOG
;** --------------------------------------------------------------------------*

The software pipeline information shows 2 cycles per iteration, but the loop is taking 21 cycles per iteration. I do not understand what the problem is.

Thanks & regards,

Raj Bhavani.

  • There are two things to consider.

    The loop takes 2 CPU cycles per iteration once it reaches steady state. It takes a few cycles for the loop to pipe up to steady state, and a few more to pipe down when the loop concludes.

    By saying "CPU cycles" I ignore cycles lost to cache misses, memory latencies, and the like.  Is this code located in external memory?  Is either the source, or the destination, or both, located in external memory?  If so, more cycles are required.

    Thanks and regards,

    -George
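
The pipe-up and pipe-down cost described above can be put into a rough formula. The following is a minimal sketch, not something taken from the compiler documentation: the function name `pipelined_loop_cycles` is hypothetical, and it assumes the `;6` comment on `SPLOOP 2 ;6` (together with the `<0,5>` store annotations) means a single iteration spans 6 cycles. It ignores memory stalls and loop setup code.

```c
#include <assert.h>

/*
 * Idealized cycle count for a software-pipelined loop.
 *   trip_count       : number of loop iterations
 *   ii               : iteration interval (cycles between iteration starts)
 *   iteration_cycles : cycles one iteration needs from first to last
 *                      instruction (assumed to be 6 for this loop)
 * A new iteration starts every ii cycles, and the last one still needs
 * iteration_cycles to drain. Memory stalls are NOT modeled.
 */
static unsigned long pipelined_loop_cycles(unsigned long trip_count,
                                           unsigned long ii,
                                           unsigned long iteration_cycles)
{
    assert(ii > 0 && iteration_cycles > 0);
    if (trip_count == 0) {
        return 0;
    }
    return (trip_count - 1) * ii + iteration_cycles;
}
```

Under these assumptions, `pipelined_loop_cycles(30, 2, 6)` gives 64 cycles for 30 iterations, or a little over 2 cycles per iteration. Pipe-up and pipe-down alone therefore cannot account for a figure of 21 cycles per iteration; the remainder would have to come from stalls outside this model.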

  • Hi George,

    Code, source, and destination are all located in local memory.

    Thanks & Regards.
    Raj Bhavani.
  • Raj Bhavani B said:
    software pipeline information showing as 2 cycles, but its taking 21 cycles for one iteration

    Exactly how do you measure the number of cycles executed for one iteration?

    Thanks and regards,

    -George

  • Please reply to the question about how you measure cycles.

    Thanks and regards,

    -George

  • Since it has been a while, I presume you have resolved your problem.  Please let us know how you resolved it.

    Thanks and regards,

    -George

  • Hi George,

    Sorry for the delay.

    I measured the cycle count of the whole loop using TSCL and divided it by the number of iterations. For example, the loop (30 iterations) took 630 cycles to complete, so one iteration took 21 cycles.

    We have observed that the software-pipelined function takes a different number of cycles from call to call. The very first time, the loop takes 21 cycles per iteration, but from the second call onwards it takes 2 cycles per iteration.

    Thanks,
    Raj Bhavani.
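
A measurement along these lines could be sketched as follows. This is not the poster's actual code: the helper names (`cycle_counter_start`, `cycle_counter_read`, `elapsed_cycles`, `time_call`) are hypothetical, and the host branch using `clock()` is only a stand-in so the sketch compiles off-target. On the C66x it relies on `TSCL` from `c6x.h`, which starts counting after the first write to it.

```c
#include <assert.h>
#include <stdint.h>

#ifdef _TMS320C6X
#include <c6x.h>   /* declares the TSCL time-stamp counter register */
void     cycle_counter_start(void) { TSCL = 0; }   /* any write starts it */
uint32_t cycle_counter_read(void)  { return TSCL; }
#else
#include <time.h>  /* host stand-in only: clock ticks, not CPU cycles */
void     cycle_counter_start(void) { }
uint32_t cycle_counter_read(void)  { return (uint32_t)clock(); }
#endif

/* Elapsed count between two 32-bit reads. Unsigned subtraction stays
 * correct across a single wraparound of the counter. */
uint32_t elapsed_cycles(uint32_t start, uint32_t stop)
{
    return stop - start;
}

typedef void (*timed_fn)(void);

/* Times one call of fn() and subtracts the cost of two back-to-back
 * counter reads. The indirect call itself is still included. */
uint32_t time_call(timed_fn fn)
{
    uint32_t t0 = cycle_counter_read();
    uint32_t t1 = cycle_counter_read();
    fn();
    uint32_t t2 = cycle_counter_read();

    uint32_t overhead = elapsed_cycles(t0, t1);
    uint32_t total    = elapsed_cycles(t1, t2);
    return (total > overhead) ? (total - overhead) : 0;
}
```

Calling `cycle_counter_start()` once and then `time_call()` twice on the same copy routine, recording the two results separately, makes the first-call versus later-call difference visible instead of averaging it away. Dividing each result by `len_words` gives the per-iteration figure.
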
  • Raj,

    If a routine takes a long time the first time and is faster the next time, that is a classic example of the speed improvement of cached memory. In spite of your comment that all program and data are in local memory, there are obvious cache effects taking place here, in particular longer times for accessing the data during the first pass through the loop.

    There are advanced tools available for trace features that can shed light on memory access delays. And there are several documents that may help you, such as the Optimizing Loops app note and the DSP cache user guide. For detailed software support for the TCI6630K2L, you will want to contact one of our support Partners, CommAgility or Azcom.

    Regards,
    RandyP