CCS/TMS320C6742: Loop Carried Dependency Bound, for a simple dot product

Part Number: TMS320C6742

Tool/software: Code Composer Studio

I am looking to optimize C++ code that computes a simple dot product of timeData with each of two operand arrays:

void Math::DotProduct(float* restrict timeData, float* restrict operand1, float* restrict operand2)
{
	int i = 0;
	float real = 0;
	float imag = 0;

	_nassert((int) timeData % 8 == 0);
	_nassert((int) operand1 % 8 == 0);
	_nassert((int) operand2 % 8 == 0);
	#pragma MUST_ITERATE(8,,2)
	for (i = 0; i < m_nLen; i++)
	{
		real += timeData[i] * operand1[i];
		imag += timeData[i] * operand2[i];
	}

	m_real = real;
	m_imag = imag;
}

Each iteration does a multiply and an add for each total. The compiler's software pipeline information for the optimized loop is:

;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : C:/Code/SineMsmt.cpp
;*      Loop source line                 : 312
;*      Loop opening brace source line   : 312
;*      Loop closing brace source line   : 315
;*      Loop Unroll Multiple             : 2x
;*      Known Minimum Trip Count         : 2                    
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 4
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound(*)    : 3
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0     
;*      .S units                     0        0     
;*      .D units                     2        1     
;*      .M units                     2        2     
;*      .X cross paths               1        3*    
;*      .T address paths             3*       3*    
;*      Long read paths              0        0     
;*      Long write paths             0        0     
;*      Logical  ops (.LS)           2        2     (.L or .S unit)
;*      Addition ops (.LSD)          0        1     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             1        1     
;*      Bound(.L .S .D .LS .LSD)     2        2     
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 4  Schedule found with 4 iterations in parallel
;*
;*      Register Usage Table:
;*          +-----------------------------------------------------------------+
;*          |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;*          |00000000001111111111222222222233|00000000001111111111222222222233|
;*          |01234567890123456789012345678901|01234567890123456789012345678901|
;*          |--------------------------------+--------------------------------|
;*       0: |   *** * *      *               |      *                         |
;*       1: |   *            *               |    ***                         |
;*       2: |   ***          *               |    *** **                      |
;*       3: |   ******       **              |    ****                        |
;*          +-----------------------------------------------------------------+
;*
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages       : 0
;*      Collapsed prolog stages       : 0
;*      Minimum required memory pad   : 0 bytes
;*
;*      For further improvement on this loop, try option -mh8
;*
;*      Minimum safe trip count       : 1 (after unrolling)
;*      Min. prof. trip count  (est.) : 3 (after unrolling)
;*
;*      Mem bank conflicts/iter(est.) : { min 0.000, est 0.000, max 0.000 }
;*      Mem bank perf. penalty (est.) : 0.0%
;*
;*
;*      Total cycles (est.)         : 12 + trip_cnt * 4        
;*----------------------------------------------------------------------------*
;*        SINGLE SCHEDULED ITERATION
;*
;*        $C$C867:
;*   0              LDNDW   .D2T2   *B6++(8),B5:B4    ; |313| 
;*   1              LDNDW   .D1T1   *A16++(8),A5:A4   ; |313| 
;*   2              LDNDW   .D1T1   *A3++(8),A7:A6    ; |314| 
;*   3              NOP             3
;*   6              MPYSP   .M2X    A5,B5,B9          ; |313| 
;*   7              MPYSP   .M1     A4,A6,A17         ; |314| 
;*     ||           MPYSP   .M2X    A4,B4,B5          ; |313| 
;*   8              MPYSP   .M1     A5,A7,A4          ; |314| 
;*   9              NOP             1
;*  10              ADDSP   .L2     B9,B8,B8          ; |313|  ^ 
;*  11              ADDSP   .L2     B5,B7,B7          ; |313|  ^ 
;*     ||           ADDSP   .L1     A17,A8,A8         ; |314|  ^ 
;*  12              ADDSP   .L1     A4,A9,A9          ; |314|  ^ 
;*     ||           SPBR            $C$C867
;*  13              NOP             3
;*  16              ; BRANCHCC OCCURS {$C$C867}       ; |312| 

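The ii of 4 is set by the Loop Carried Dependency Bound: the ADDSP instructions (marked ^ in the schedule) each depend on the result of the same add from the previous loop iteration.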

Since this is a simple accumulation, though, two (or more) partial totals could be kept in parallel and summed after the loop; iteration i+1 doesn't really depend on iteration i.
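As a rough sketch of what I mean by parallel totals, written out by hand: the function name DotProductSplit, the explicit len parameter and the output pointers are hypothetical (my real code is the class member above), the count of four partial totals per output is only an example, and the sketch assumes len is a multiple of 4. The TI-specific restrict, _nassert and MUST_ITERATE hints are left out so that it compiles anywhere.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical free-function variant of Math::DotProduct, for illustration only.
// Assumption: len is a multiple of 4, so no remainder loop is needed.
void DotProductSplit(const float* timeData, const float* operand1,
                     const float* operand2, std::size_t len,
                     float* outReal, float* outImag)
{
    assert(len % 4 == 0);

    // Four independent running totals per output: each add only depends on
    // the add made to the same total four elements earlier.
    float real0 = 0.0f, real1 = 0.0f, real2 = 0.0f, real3 = 0.0f;
    float imag0 = 0.0f, imag1 = 0.0f, imag2 = 0.0f, imag3 = 0.0f;

    for (std::size_t i = 0; i < len; i += 4)
    {
        real0 += timeData[i]     * operand1[i];
        real1 += timeData[i + 1] * operand1[i + 1];
        real2 += timeData[i + 2] * operand1[i + 2];
        real3 += timeData[i + 3] * operand1[i + 3];

        imag0 += timeData[i]     * operand2[i];
        imag1 += timeData[i + 1] * operand2[i + 1];
        imag2 += timeData[i + 2] * operand2[i + 2];
        imag3 += timeData[i + 3] * operand2[i + 3];
    }

    // Combine the partial totals once, after the loop.
    *outReal = (real0 + real1) + (real2 + real3);
    *outImag = (imag0 + imag1) + (imag2 + imag3);
}
```

Note that this changes the order of the floating-point additions, so the result may differ from the single-total loop in the last bits. I have not checked what ii the compiler reports for this version.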

Is there any way to tell the compiler this, or to get the compiler to use it to lower the Loop Carried Dependency Bound?