Tool/software: Code Composer Studio
I am looking to optimize C++ code that computes a simple dot product of timeData with two operand arrays:
void Math::DotProduct(float* restrict timeData, float* restrict operand1, float* restrict operand2)
{
    int i = 0;
    float real = 0;
    float imag = 0;

    _nassert((int) timeData % 8 == 0);
    _nassert((int) operand1 % 8 == 0);
    _nassert((int) operand2 % 8 == 0);

    #pragma MUST_ITERATE(8,,2)
    for (i=0; i<m_nLen; i++)
    {
        real += timeData[i] * operand1[i];
        imag += timeData[i] * operand2[i];
    }

    m_real = real;
    m_imag = imag;
}
Each iteration does a multiply and an add per output. The compiler's software pipeline information for the optimized loop is:
;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : C:/Code/SineMsmt.cpp
;*      Loop source line                 : 312
;*      Loop opening brace source line   : 312
;*      Loop closing brace source line   : 315
;*      Loop Unroll Multiple             : 2x
;*      Known Minimum Trip Count         : 2
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 4
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound(*)    : 3
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     0        0
;*      .D units                     2        1
;*      .M units                     2        2
;*      .X cross paths               1        3*
;*      .T address paths             3*       3*
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           2        2     (.L or .S unit)
;*      Addition ops (.LSD)          0        1     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             1        1
;*      Bound(.L .S .D .LS .LSD)     2        2
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 4  Schedule found with 4 iterations in parallel
;*
;*      Register Usage Table:
;*          +-----------------------------------------------------------------+
;*          |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;*          |00000000001111111111222222222233|00000000001111111111222222222233|
;*          |01234567890123456789012345678901|01234567890123456789012345678901|
;*          |--------------------------------+--------------------------------|
;*       0: | *** * * * | * |
;*       1: | * * | *** |
;*       2: | *** * | *** ** |
;*       3: | ****** ** | **** |
;*          +-----------------------------------------------------------------+
;*
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages       : 0
;*      Collapsed prolog stages       : 0
;*      Minimum required memory pad   : 0 bytes
;*
;*      For further improvement on this loop, try option -mh8
;*
;*      Minimum safe trip count       : 1 (after unrolling)
;*      Min. prof. trip count  (est.) : 3 (after unrolling)
;*
;*      Mem bank conflicts/iter(est.) : { min 0.000, est 0.000, max 0.000 }
;*      Mem bank perf. penalty (est.) : 0.0%
;*
;*
;*      Total cycles (est.)           : 12 + trip_cnt * 4
;*----------------------------------------------------------------------------*
;*        SINGLE SCHEDULED ITERATION
;*
;*        $C$C867:
;*   0              LDNDW   .D2T2   *B6++(8),B5:B4    ; |313|
;*   1              LDNDW   .D1T1   *A16++(8),A5:A4   ; |313|
;*   2              LDNDW   .D1T1   *A3++(8),A7:A6    ; |314|
;*   3              NOP             3
;*   6              MPYSP   .M2X    A5,B5,B9          ; |313|
;*   7              MPYSP   .M1     A4,A6,A17         ; |314|
;*     ||           MPYSP   .M2X    A4,B4,B5          ; |313|
;*   8              MPYSP   .M1     A5,A7,A4          ; |314|
;*   9              NOP             1
;*  10              ADDSP   .L2     B9,B8,B8          ; |313|  ^
;*  11              ADDSP   .L2     B5,B7,B7          ; |313|  ^
;*     ||           ADDSP   .L1     A17,A8,A8         ; |314|  ^
;*  12              ADDSP   .L1     A4,A9,A9          ; |314|  ^
;*     ||           SPBR            $C$C867
;*  13              NOP             3
;*  16              ; BRANCHCC OCCURS {$C$C867}       ; |312|
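The ii is limited by the ADDSP accumulations (marked ^ in the schedule), each of which depends on the result from the previous loop iteration; this is the Loop Carried Dependency Bound of 4 in the report.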
Since this is a simple accumulation, though, two (or more) partial totals could be kept in parallel and summed after the loop; iteration i+1 doesn't really depend on iteration i.
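To make the idea concrete, here is a minimal hand-written sketch of what I mean by separate totals. The schedule already appears to keep two accumulators per output (B8/B7 and A8/A9), so this sketch uses four. DotProductSplit and its parameters are hypothetical names for illustration, not my actual member function; it assumes len is a multiple of 4, and it leaves out restrict, _nassert and MUST_ITERATE so that it builds with any C++ compiler. I have not measured whether it actually lowers the ii.

```cpp
#include <cstddef>

// Hypothetical free-function version of the loop with four independent
// partial sums per output. Assumes len is a multiple of 4.
void DotProductSplit(const float* timeData, const float* operand1,
                     const float* operand2, std::size_t len,
                     float& realOut, float& imagOut)
{
    float real0 = 0.0f, real1 = 0.0f, real2 = 0.0f, real3 = 0.0f;
    float imag0 = 0.0f, imag1 = 0.0f, imag2 = 0.0f, imag3 = 0.0f;

    for (std::size_t i = 0; i < len; i += 4) {
        // Each accumulator is updated once per pass, so no add in this
        // body waits on another add from the same pass.
        real0 += timeData[i]     * operand1[i];
        real1 += timeData[i + 1] * operand1[i + 1];
        real2 += timeData[i + 2] * operand1[i + 2];
        real3 += timeData[i + 3] * operand1[i + 3];

        imag0 += timeData[i]     * operand2[i];
        imag1 += timeData[i + 1] * operand2[i + 1];
        imag2 += timeData[i + 2] * operand2[i + 2];
        imag3 += timeData[i + 3] * operand2[i + 3];
    }

    // Combine the partial totals once, after the loop.
    realOut = (real0 + real1) + (real2 + real3);
    imagOut = (imag0 + imag1) + (imag2 + imag3);
}
```

Note that summing in a different order can change the floating-point rounding slightly compared with the single running total, which is acceptable for my use.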
Is there any way to tell the compiler this, or to get the compiler to exploit it to lower the Loop Carried Dependency Bound?