Hi
I ran into a curious problem where the compiler generates a less-than-ideal software-pipelined loop after inlining a function call. Consider the following code:
#include <stdint.h>

//#define EXPAND

typedef struct X {
    uint32_t *mem;
} X;

void f1(const X* x, uint16_t a, uint16_t b) {
    volatile uint32_t * ptr = (volatile uint32_t*) x->mem;
    *ptr = a + b;
}

void f2(const X* restrict x, const uint16_t * restrict data1,
        const uint16_t * restrict data2, const uint32_t len)
{
    if(len < 16) {
        uint32_t i;

        for(i=0; i<len; ++i)
        {
#ifdef EXPAND
            volatile uint32_t * ptr = (volatile uint32_t*) x->mem;
            *ptr = data1[i] + data2[i];
#else
            f1(x, data1[i], data2[i]);
#endif

        }
    }
}
The compiler behaves very differently depending on whether or not EXPAND is defined (yes, I am aware that the compiler internally takes completely different execution paths in the two cases).
When EXPAND is defined, the compiler generates the following for f2:
;******************************************************************************
;* TMS320C6x C/C++ Codegen                                          PC v7.4.7 *
;* Date/Time created: Mon Nov 17 10:08:23 2014                                *
;******************************************************************************
        .compiler_opts --abi=eabi --c64p_l1d_workaround=default --endian=little --hll_source=on --long_precision_bits=32 --mem_model:code=near --mem_model:const=data --mem_model:data=far_aggregates --object_format=elf --silicon_version=6500 --symdebug:none

;******************************************************************************
;* GLOBAL FILE PARAMETERS                                                     *
;*                                                                            *
;*   Architecture      : TMS320C64x+                                          *
;*   Optimization      : Enabled at level 3                                   *
;*   Optimizing for    : Speed                                                *
;*                       Based on options: -o3, no -ms                        *
;*   Endian            : Little                                               *
;*   Interrupt Thrshld : Disabled                                             *
;*   Data Access Model : Far Aggregate Data                                   *
;*   Pipelining        : Enabled                                              *
;*   Speculate Loads   : Enabled with threshold = 0                           *
;*   Memory Aliases    : Presume are aliases (pessimistic)                    *
;*   Debug Info        : No Debug Info                                        *
;*                                                                            *
;******************************************************************************

;******************************************************************************
;* FUNCTION NAME: f2                                                          *
;*                                                                            *
;*   Regs Modified     : A0,A3,A4,A5,A6,B4,B5                                 *
;*   Regs Used         : A0,A3,A4,A5,A6,B3,B4,B5,B6                           *
;*   Local Frame Size  : 0 Args + 0 Auto + 0 Save = 0 byte                    *
;******************************************************************************
f2:
;** --------------------------------------------------------------------------*
           MV      .L1X    B4,A5             ; |16|
||         CMPEQ   .L2     B6,0,B4           ; |17|
||         MV      .S2X    A6,B5             ; |16|

           XOR     .L2     1,B4,B4           ; |17|
           CMPLTU  .L1X    B6,16,A3          ; |17|
           AND     .L1X    B4,A3,A0          ; |17|

   [!A0]   BNOP    .S1     $C$L4,5           ; |17|
|| [ A0]   MVC     .S2     B6,ILC
           ; BRANCHCC OCCURS {$C$L4}         ; |17|
;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : C:/test.c
;*      Loop source line                 : 20
;*      Loop opening brace source line   : 21
;*      Loop closing brace source line   : 29
;*      Known Minimum Trip Count         : 1
;*      Known Maximum Trip Count         : 15
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound(*)    : 2
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     0        0
;*      .D units                     2*       1
;*      .M units                     0        0
;*      .X cross paths               1        0
;*      .T address paths             2*       1
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical ops (.LS)            0        0     (.L or .S unit)
;*      Addition ops (.LSD)          1        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        0
;*      Bound(.L .S .D .LS .LSD)     1        1
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 2  Schedule found with 4 iterations in parallel
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages       : 0
;*      Collapsed prolog stages       : 0
;*      Minimum required memory pad   : 0 bytes
;*
;*      Minimum safe trip count       : 1
;*----------------------------------------------------------------------------*
$C$L1:    ; PIPED LOOP PROLOG

           SPLOOP  2       ;8                ; (P)
||         LDW     .D1T1   *A4,A6

;** --------------------------------------------------------------------------*
$C$L2:    ; PIPED LOOP KERNEL
           LDHU    .D1T1   *A5++,A4          ; |24| (P) <0,0>
           LDHU    .D2T2   *B5++,B4          ; |24| (P) <0,1>
           NOP             4
           ADD     .L1X    B4,A4,A3          ; |24| <0,6>

           SPKERNEL 3,0
||         STW     .D1T1   A3,*A6            ; |24| <0,7>

;** --------------------------------------------------------------------------*
$C$L3:    ; PIPED LOOP EPILOG
;** --------------------------------------------------------------------------*
$C$L4:
           RETNOP  .S2     B3,5              ; |31|
           ; BRANCH OCCURS {B3}              ; |31|
In contrast, when EXPAND is not defined, the compiler generates the following:
;******************************************************************************
;* TMS320C6x C/C++ Codegen                                          PC v7.4.7 *
;* Date/Time created: Mon Nov 17 10:10:07 2014                                *
;******************************************************************************
        .compiler_opts --abi=eabi --c64p_l1d_workaround=default --endian=little --hll_source=on --long_precision_bits=32 --mem_model:code=near --mem_model:const=data --mem_model:data=far_aggregates --object_format=elf --silicon_version=6500 --symdebug:none

;******************************************************************************
;* GLOBAL FILE PARAMETERS                                                     *
;*                                                                            *
;*   Architecture      : TMS320C64x+                                          *
;*   Optimization      : Enabled at level 3                                   *
;*   Optimizing for    : Speed                                                *
;*                       Based on options: -o3, no -ms                        *
;*   Endian            : Little                                               *
;*   Interrupt Thrshld : Disabled                                             *
;*   Data Access Model : Far Aggregate Data                                   *
;*   Pipelining        : Enabled                                              *
;*   Speculate Loads   : Enabled with threshold = 0                           *
;*   Memory Aliases    : Presume are aliases (pessimistic)                    *
;*   Debug Info        : No Debug Info                                        *
;*                                                                            *
;******************************************************************************

;******************************************************************************
;* FUNCTION NAME: f2                                                          *
;*                                                                            *
;*   Regs Modified     : A0,A3,A4,A5,B4,B5                                    *
;*   Regs Used         : A0,A3,A4,A5,A6,B3,B4,B5,B6                           *
;*   Local Frame Size  : 0 Args + 0 Auto + 0 Save = 0 byte                    *
;******************************************************************************
f2:
;** --------------------------------------------------------------------------*
           CMPEQ   .L2     B6,0,B5           ; |17|
           XOR     .L2     1,B5,B5           ; |17|
           CMPLTU  .L1X    B6,16,A3          ; |17|
           AND     .L1X    B5,A3,A0          ; |17|

   [!A0]   BNOP    .S1     $C$L4,5           ; |17|
|| [ A0]   SUB     .L2     B6,1,B5
           ; BRANCHCC OCCURS {$C$L4}         ; |17|
;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : C:/test.c
;*      Loop source line                 : 20
;*      Loop opening brace source line   : 21
;*      Loop closing brace source line   : 29
;*      Known Minimum Trip Count         : 1
;*      Known Maximum Trip Count         : 15
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 7
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound(*)    : 2
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     0        0
;*      .D units                     2*       2*
;*      .M units                     0        0
;*      .X cross paths               1        0
;*      .T address paths             2*       2*
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical ops (.LS)            0        0     (.L or .S unit)
;*      Addition ops (.LSD)          1        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        0
;*      Bound(.L .S .D .LS .LSD)     1        1
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 7  Schedule found with 2 iterations in parallel
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages       : 0
;*      Collapsed prolog stages       : 0
;*      Minimum required memory pad   : 0 bytes
;*
;*      Minimum safe trip count       : 1
;*----------------------------------------------------------------------------*
$C$L1:    ; PIPED LOOP PROLOG

           SPLOOPD 7       ;14               ; (P)
||         MV      .L1     A4,A5
||         MV      .L2X    A6,B5
||         MV      .S1X    B4,A4
||         MVC     .S2     B5,ILC

;** --------------------------------------------------------------------------*
$C$L2:    ; PIPED LOOP KERNEL

           LDHU    .D1T1   *A4++,A3          ; |11| (P) <0,0>  ^
||         LDHU    .D2T2   *B5++,B4          ; |11| (P) <0,0>  ^

           LDW     .D1T2   *A5,B4            ; |11| (P) <0,1>
           NOP             3
           ADD     .L1X    B4,A3,A3          ; |11| (P) <0,5>  ^
           STW     .D2T1   A3,*B4            ; |11| (P) <0,6>  ^
           SPKERNEL 0,0
;** --------------------------------------------------------------------------*
$C$L3:    ; PIPED LOOP EPILOG
           NOP             1
;** --------------------------------------------------------------------------*
$C$L4:
           RETNOP  .S2     B3,5              ; |31|
           ; BRANCH OCCURS {B3}              ; |31|
So after inlining f1, the compiler apparently no longer sees that the store through ptr cannot modify x->mem. It reloads x->mem in every loop iteration, which raises the loop-carried dependency bound from 0 to 7, and thus generates an ii=7 schedule (2 iterations in parallel) as opposed to the ii=2 schedule (4 iterations in parallel) above.
Now I'm not overly worried about that behavior, as it's unlikely to affect the overall performance too much, and I can always optimize by hand where relevant. Still, I'd like to maintain a coding style that does not induce unnecessary performance penalties. So I wonder whether there is anything I am missing, and what modifications (if any) would let the compiler do a better job. Enabling "optimistic" aliasing assumptions seems to help, but these assumptions are a lot stricter than what's required by ISO C and too strict for my purposes.
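To illustrate the kind of modification meant here, the sketch below hoists the x->mem load out of the loop by hand, so that the inlined helper no longer contains a load the compiler has to reason about. The names f1_ptr and f2_hoisted are hypothetical and not part of the code above. The sketch assumes that the stores through ptr never change x->mem itself, and it has not been checked against compiler v7.4.7, so the resulting schedule is unknown.

```c
#include <stdint.h>

typedef struct X {
    uint32_t *mem;
} X;

/* Hypothetical helper: takes the destination pointer directly instead of
 * the struct, so the inlined body contains no load of x->mem. */
static inline void f1_ptr(volatile uint32_t *ptr, uint16_t a, uint16_t b)
{
    *ptr = a + b;
}

/* Hypothetical variant of f2 with the x->mem load hoisted by hand. */
void f2_hoisted(const X * restrict x,
                const uint16_t * restrict data1,
                const uint16_t * restrict data2,
                const uint32_t len)
{
    if (len < 16) {
        /* Read x->mem once, before the loop. Assumption: the stores
         * through ptr never modify x->mem itself. */
        volatile uint32_t *ptr = (volatile uint32_t *) x->mem;
        uint32_t i;

        for (i = 0; i < len; ++i) {
            f1_ptr(ptr, data1[i], data2[i]);
        }
    }
}
```

This is the same hoisting the compiler performs on its own in the EXPAND build (the LDW issued alongside SPLOOP), written out explicitly. The cost is that every caller has to be restructured this way, which is exactly the kind of coding-style constraint the question is about.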
Regards and thanks for any help
Markus