CCS/TMS320C6742: Loop Carried Dependency Bound, for a simple dot product

Raman Sridahran

Part Number: TMS320C6742

Tool/software: Code Composer Studio

I am looking to optimize C++ code that computes a simple dot product of timeData with two operand arrays:

void Math::DotProduct(float* restrict timeData, float* restrict operand1, float* restrict operand2)
{
	int i = 0;
	float real = 0;
	float imag = 0;

	_nassert((int) timeData % 8 == 0);
	_nassert((int) operand1 % 8 == 0);
	_nassert((int) operand2 % 8 == 0);
	#pragma MUST_ITERATE(8,,2)

	for (i=0; i<m_nLen; i++) {
		real += timeData[i] * operand1[i];
		imag += timeData[i] * operand2[i];
	}
	
	m_real = real;
	m_imag = imag;

}

The loop does a multiply and an add for each output. The compiler's software pipeline information for the optimized loop is:

;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : C:/Code/SineMsmt.cpp
;*      Loop source line                 : 312
;*      Loop opening brace source line   : 312
;*      Loop closing brace source line   : 315
;*      Loop Unroll Multiple             : 2x
;*      Known Minimum Trip Count         : 2                    
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 4
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound(*)    : 3
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0     
;*      .S units                     0        0     
;*      .D units                     2        1     
;*      .M units                     2        2     
;*      .X cross paths               1        3*    
;*      .T address paths             3*       3*    
;*      Long read paths              0        0     
;*      Long write paths             0        0     
;*      Logical  ops (.LS)           2        2     (.L or .S unit)
;*      Addition ops (.LSD)          0        1     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             1        1     
;*      Bound(.L .S .D .LS .LSD)     2        2     
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 4  Schedule found with 4 iterations in parallel
;*
;*      Register Usage Table:
;*          +-----------------------------------------------------------------+
;*          |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;*          |00000000001111111111222222222233|00000000001111111111222222222233|
;*          |01234567890123456789012345678901|01234567890123456789012345678901|
;*          |--------------------------------+--------------------------------|
;*       0: |   *** * *      *               |      *                         |
;*       1: |   *            *               |    ***                         |
;*       2: |   ***          *               |    *** **                      |
;*       3: |   ******       **              |    ****                        |
;*          +-----------------------------------------------------------------+
;*
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages       : 0
;*      Collapsed prolog stages       : 0
;*      Minimum required memory pad   : 0 bytes
;*
;*      For further improvement on this loop, try option -mh8
;*
;*      Minimum safe trip count       : 1 (after unrolling)
;*      Min. prof. trip count  (est.) : 3 (after unrolling)
;*
;*      Mem bank conflicts/iter(est.) : { min 0.000, est 0.000, max 0.000 }
;*      Mem bank perf. penalty (est.) : 0.0%
;*
;*
;*      Total cycles (est.)         : 12 + trip_cnt * 4        
;*----------------------------------------------------------------------------*
;*        SINGLE SCHEDULED ITERATION
;*
;*        $C$C867:
;*   0              LDNDW   .D2T2   *B6++(8),B5:B4    ; |313| 
;*   1              LDNDW   .D1T1   *A16++(8),A5:A4   ; |313| 
;*   2              LDNDW   .D1T1   *A3++(8),A7:A6    ; |314| 
;*   3              NOP             3
;*   6              MPYSP   .M2X    A5,B5,B9          ; |313| 
;*   7              MPYSP   .M1     A4,A6,A17         ; |314| 
;*     ||           MPYSP   .M2X    A4,B4,B5          ; |313| 
;*   8              MPYSP   .M1     A5,A7,A4          ; |314| 
;*   9              NOP             1
;*  10              ADDSP   .L2     B9,B8,B8          ; |313|  ^ 
;*  11              ADDSP   .L2     B5,B7,B7          ; |313|  ^ 
;*     ||           ADDSP   .L1     A17,A8,A8         ; |314|  ^ 
;*  12              ADDSP   .L1     A4,A9,A9          ; |314|  ^ 
;*     ||           SPBR            $C$C867
;*  13              NOP             3
;*  16              ; BRANCHCC OCCURS {$C$C867}       ; |312|

The ii of 4 is limited by the ADDSP instructions (marked ^ in the listing), each of which depends on the accumulator value produced by the previous iteration.

Since this is a simple accumulation, though, two (or more) partial totals can be computed in parallel and summed later; iteration i+1 doesn't really depend on iteration i.

Is there any way to tell the compiler this, or to get the compiler to use this to lower the Loop Carried Dependency Bound?
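
A minimal sketch of the "several totals summed later" idea, written as plain C++ so it compiles anywhere. The names DotResult and DotProductSplit are hypothetical, the length is passed as a parameter instead of m_nLen, and the length is assumed to be a multiple of 4. The TI-specific restrict, _nassert and MUST_ITERATE hints from the original function are left out here and would presumably be kept in a real build. Whether this form actually lowers the ii on the C6742 is not verified by this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type; the original stores into m_real / m_imag.
struct DotResult {
    float real;
    float imag;
};

// Hypothetical free-function variant of Math::DotProduct that keeps four
// independent partial sums per output, so no add in the loop body depends
// on another add from the same pass through the loop.
DotResult DotProductSplit(const float* timeData,
                          const float* operand1,
                          const float* operand2,
                          std::size_t len)
{
    assert(len % 4 == 0);  // assumption: length is a multiple of 4

    float real0 = 0.0f, real1 = 0.0f, real2 = 0.0f, real3 = 0.0f;
    float imag0 = 0.0f, imag1 = 0.0f, imag2 = 0.0f, imag3 = 0.0f;

    for (std::size_t i = 0; i < len; i += 4) {
        real0 += timeData[i]     * operand1[i];
        real1 += timeData[i + 1] * operand1[i + 1];
        real2 += timeData[i + 2] * operand1[i + 2];
        real3 += timeData[i + 3] * operand1[i + 3];

        imag0 += timeData[i]     * operand2[i];
        imag1 += timeData[i + 1] * operand2[i + 1];
        imag2 += timeData[i + 2] * operand2[i + 2];
        imag3 += timeData[i + 3] * operand2[i + 3];
    }

    // Combine the partial sums once, after the loop.
    DotResult result;
    result.real = (real0 + real1) + (real2 + real3);
    result.imag = (imag0 + imag1) + (imag2 + imag3);
    return result;
}
```

Splitting the sum changes the order of the floating-point additions, so the result can differ from the single-accumulator loop in the last bits.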

over 3 years ago

0 George Mock over 3 years ago

TI__Guru**** 232670 points

I'd like to reproduce this result. For the source file SineMsmt.cpp, please follow the directions in the article How to Submit a Compiler Test Case.

Thanks and regards,

-George

0 Raman Sridahran over 3 years ago in reply to George Mock

Prodigy 130 points

Unfortunately, I don't want to expose the source code for the entire preprocessed file. Is there a snippet I could send, focused on this loop?

Compiler options are below (CCS v5.4):

-mv6740 
--abi=eabi -O3 
--include_path="C:/ti/ccsv5/tools/compiler/c6000_7.4.4/include" 
--include_path="../../../../../../hw/c67xmathlib_2_01_00_00/inc" 
--include_path="../../../../../../include" 
--define=omapl138 
--define=_TI_DSP_CCS 
--display_error_number 
--diag_warning=225 
--interrupt_threshold=456000 
--opt_for_speed=5 
--cmd_file="C:/Code/compiler.opt"

0 George Mock over 3 years ago in reply to Raman Sridahran

TI__Guru**** 232670 points

Please note How to Submit a Compiler Test Case includes directions on how to send the test case just to me by private message.

If that is not good enough ... Consider changing the preprocessed file to remove all the functions except Math::DotProduct. Be sure to leave in the definition of the Math class, and everything it depends on.

Since your options include ...

Raman Sridahran said:
--cmd_file="C:/Code/compiler.opt"

... please include that file, or show the contents of the file.

Thanks and regards,

-George

0 Raman Sridahran over 3 years ago in reply to George Mock

Prodigy 130 points

George,

I have sent the files your way via PM.

Thanks for the assistance.

0 George Mock over 3 years ago in reply to Raman Sridahran

TI__Guru**** 232670 points

This thread was closed in private messages. The code emitted by the compiler was optimal for the input. But it took detailed explanation to make that point clear.

Thanks and regards,

-George
