Hi
I ran into a curious problem where the compiler generates a less-than-ideal software pipelined loop after inlining a function call. Consider the following code:
    #include <stdint.h>

    //#define EXPAND

    typedef struct X {
        uint32_t *mem;
    } X;

    void f1(const X* x, uint16_t a, uint16_t b)
    {
        volatile uint32_t * ptr = (volatile uint32_t*) x->mem;
        *ptr = a + b;
    }

    void f2(const X* restrict x, const uint16_t * restrict data1,
            const uint16_t * restrict data2, const uint32_t len)
    {
        if(len < 16)
        {
            uint32_t i;
            for(i=0; i<len; ++i)
            {
    #ifdef EXPAND
                volatile uint32_t * ptr = (volatile uint32_t*) x->mem;
                *ptr = data1[i] + data2[i];
    #else
                f1(x, data1[i], data2[i]);
    #endif
            }
        }
    }
The compiler behaves very differently depending on whether or not EXPAND is defined (yes, I am aware that the compiler internally takes completely different execution paths in the two cases).
When EXPAND is defined, the compiler generates the following for f2:
;******************************************************************************
;* TMS320C6x C/C++ Codegen                                          PC v7.4.7 *
;* Date/Time created: Mon Nov 17 10:08:23 2014                                *
;******************************************************************************
	.compiler_opts --abi=eabi --c64p_l1d_workaround=default --endian=little --hll_source=on --long_precision_bits=32 --mem_model:code=near --mem_model:const=data --mem_model:data=far_aggregates --object_format=elf --silicon_version=6500 --symdebug:none

;******************************************************************************
;* GLOBAL FILE PARAMETERS                                                     *
;*                                                                            *
;*   Architecture      : TMS320C64x+                                          *
;*   Optimization      : Enabled at level 3                                   *
;*   Optimizing for    : Speed                                                *
;*                       Based on options: -o3, no -ms                        *
;*   Endian            : Little                                               *
;*   Interrupt Thrshld : Disabled                                             *
;*   Data Access Model : Far Aggregate Data                                   *
;*   Pipelining        : Enabled                                              *
;*   Speculate Loads   : Enabled with threshold = 0                           *
;*   Memory Aliases    : Presume are aliases (pessimistic)                    *
;*   Debug Info        : No Debug Info                                        *
;*                                                                            *
;******************************************************************************

;******************************************************************************
;* FUNCTION NAME: f2                                                          *
;*                                                                            *
;*   Regs Modified     : A0,A3,A4,A5,A6,B4,B5                                 *
;*   Regs Used         : A0,A3,A4,A5,A6,B3,B4,B5,B6                           *
;*   Local Frame Size  : 0 Args + 0 Auto + 0 Save = 0 byte                    *
;******************************************************************************
f2:
;** --------------------------------------------------------------------------*
           MV      .L1X    B4,A5             ; |16|
||         CMPEQ   .L2     B6,0,B4           ; |17|
||         MV      .S2X    A6,B5             ; |16|

           XOR     .L2     1,B4,B4           ; |17|
           CMPLTU  .L1X    B6,16,A3          ; |17|
           AND     .L1X    B4,A3,A0          ; |17|

   [!A0]   BNOP    .S1     $C$L4,5           ; |17|
|| [ A0]   MVC     .S2     B6,ILC
           ; BRANCHCC OCCURS {$C$L4}         ; |17|
;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : C:/test.c
;*      Loop source line                 : 20
;*      Loop opening brace source line   : 21
;*      Loop closing brace source line   : 29
;*      Known Minimum Trip Count         : 1
;*      Known Maximum Trip Count         : 15
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound(*)    : 2
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     0        0
;*      .D units                     2*       1
;*      .M units                     0        0
;*      .X cross paths               1        0
;*      .T address paths             2*       1
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          1        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        0
;*      Bound(.L .S .D .LS .LSD)     1        1
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 2  Schedule found with 4 iterations in parallel
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages       : 0
;*      Collapsed prolog stages       : 0
;*      Minimum required memory pad   : 0 bytes
;*
;*      Minimum safe trip count       : 1
;*----------------------------------------------------------------------------*
$C$L1:    ; PIPED LOOP PROLOG
           SPLOOP  2       ;8                ; (P)
||         LDW     .D1T1   *A4,A6
;** --------------------------------------------------------------------------*
$C$L2:    ; PIPED LOOP KERNEL
           LDHU    .D1T1   *A5++,A4          ; |24| (P) <0,0>
           LDHU    .D2T2   *B5++,B4          ; |24| (P) <0,1>
           NOP             4
           ADD     .L1X    B4,A4,A3          ; |24| <0,6>

           SPKERNEL 3,0
||         STW     .D1T1   A3,*A6            ; |24| <0,7>
;** --------------------------------------------------------------------------*
$C$L3:    ; PIPED LOOP EPILOG
;** --------------------------------------------------------------------------*
$C$L4:
           RETNOP  .S2     B3,5              ; |31|
           ; BRANCH OCCURS {B3}              ; |31|
In contrast, when EXPAND is not defined, the compiler generates
;******************************************************************************
;* TMS320C6x C/C++ Codegen                                          PC v7.4.7 *
;* Date/Time created: Mon Nov 17 10:10:07 2014                                *
;******************************************************************************
	.compiler_opts --abi=eabi --c64p_l1d_workaround=default --endian=little --hll_source=on --long_precision_bits=32 --mem_model:code=near --mem_model:const=data --mem_model:data=far_aggregates --object_format=elf --silicon_version=6500 --symdebug:none

;******************************************************************************
;* GLOBAL FILE PARAMETERS                                                     *
;*                                                                            *
;*   Architecture      : TMS320C64x+                                          *
;*   Optimization      : Enabled at level 3                                   *
;*   Optimizing for    : Speed                                                *
;*                       Based on options: -o3, no -ms                        *
;*   Endian            : Little                                               *
;*   Interrupt Thrshld : Disabled                                             *
;*   Data Access Model : Far Aggregate Data                                   *
;*   Pipelining        : Enabled                                              *
;*   Speculate Loads   : Enabled with threshold = 0                           *
;*   Memory Aliases    : Presume are aliases (pessimistic)                    *
;*   Debug Info        : No Debug Info                                        *
;*                                                                            *
;******************************************************************************

;******************************************************************************
;* FUNCTION NAME: f2                                                          *
;*                                                                            *
;*   Regs Modified     : A0,A3,A4,A5,B4,B5                                    *
;*   Regs Used         : A0,A3,A4,A5,A6,B3,B4,B5,B6                           *
;*   Local Frame Size  : 0 Args + 0 Auto + 0 Save = 0 byte                    *
;******************************************************************************
f2:
;** --------------------------------------------------------------------------*
           CMPEQ   .L2     B6,0,B5           ; |17|
           XOR     .L2     1,B5,B5           ; |17|
           CMPLTU  .L1X    B6,16,A3          ; |17|
           AND     .L1X    B5,A3,A0          ; |17|

   [!A0]   BNOP    .S1     $C$L4,5           ; |17|
|| [ A0]   SUB     .L2     B6,1,B5
           ; BRANCHCC OCCURS {$C$L4}         ; |17|
;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : C:/test.c
;*      Loop source line                 : 20
;*      Loop opening brace source line   : 21
;*      Loop closing brace source line   : 29
;*      Known Minimum Trip Count         : 1
;*      Known Maximum Trip Count         : 15
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 7
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound(*)    : 2
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     0        0
;*      .D units                     2*       2*
;*      .M units                     0        0
;*      .X cross paths               1        0
;*      .T address paths             2*       2*
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          1        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        0
;*      Bound(.L .S .D .LS .LSD)     1        1
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 7  Schedule found with 2 iterations in parallel
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages       : 0
;*      Collapsed prolog stages       : 0
;*      Minimum required memory pad   : 0 bytes
;*
;*      Minimum safe trip count       : 1
;*----------------------------------------------------------------------------*
$C$L1:    ; PIPED LOOP PROLOG
           SPLOOPD 7       ;14               ; (P)
||         MV      .L1     A4,A5
||         MV      .L2X    A6,B5
||         MV      .S1X    B4,A4
||         MVC     .S2     B5,ILC
;** --------------------------------------------------------------------------*
$C$L2:    ; PIPED LOOP KERNEL
           LDHU    .D1T1   *A4++,A3          ; |11| (P) <0,0>  ^
||         LDHU    .D2T2   *B5++,B4          ; |11| (P) <0,0>  ^

           LDW     .D1T2   *A5,B4            ; |11| (P) <0,1>
           NOP             3
           ADD     .L1X    B4,A3,A3          ; |11| (P) <0,5>  ^
           STW     .D2T1   A3,*B4            ; |11| (P) <0,6>  ^
           SPKERNEL 0,0
;** --------------------------------------------------------------------------*
$C$L3:    ; PIPED LOOP EPILOG
           NOP             1
;** --------------------------------------------------------------------------*
$C$L4:
           RETNOP  .S2     B3,5              ; |31|
           ; BRANCH OCCURS {B3}              ; |31|
So after inlining, the compiler doesn't see that a and b cannot alias x->mem, reloads x->mem in every loop iteration, and thus generates an ii=7 schedule (2 iterations in parallel) as opposed to the ii=2 schedule (4 iterations in parallel) above.
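As a rough illustration of what the two schedules mean in cycles, the sketch below uses the textbook estimate for a modulo-scheduled loop: a new iteration starts every ii cycles and the last one needs a full pass through all overlapped stages. The function name `pipelined_cycles` is made up for this example, and the estimate ignores memory stalls, loop setup, and any SPLOOP-specific details, so treat the numbers as ballpark figures only.

```c
#include <stdint.h>

/* Idealized cycle count of a software-pipelined loop.
 * ii         : initiation interval (cycles between iteration starts)
 * stages     : number of iterations in flight ("iterations in parallel")
 * trip_count : number of loop iterations executed
 * Assumption: no stalls, no setup cost; this is an estimate, not a
 * statement about the exact hardware behavior. */
uint32_t pipelined_cycles(uint32_t ii, uint32_t stages, uint32_t trip_count)
{
    if (trip_count == 0) {
        return 0;
    }
    return ii * (trip_count + stages - 1);
}
```

With the maximum trip count of 15 reported in the listings, this gives 2 * (15 + 4 - 1) = 36 cycles for the ii=2 schedule and 7 * (15 + 2 - 1) = 112 cycles for the ii=7 schedule.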
Now I'm not overly worried about that behavior, as it's unlikely to affect the overall performance too much, and I can always optimize where relevant. Still, I'd like to maintain a coding style that does not induce unnecessary performance penalties. So I wonder if there is anything I am missing, and what modifications (if any) would let the compiler do a better job. Enabling "optimistic" aliasing assumptions seems to help, but these assumptions are a lot stricter than what's required by ISO C and too strict for my purposes.
Regards and thanks for any help
Markus
Please show the exact compiler build options.
Thanks and regards,
-George
Do you need more than the .compiler_opts line in the assembler output? If so, what else?
Regards
Markus
Using the exact same build options as the customer is the best way to ensure the same results are seen. In this case, I am able to reproduce the results only by looking at the settings in the GLOBAL FILE PARAMETERS comment block. That worked this time, but is not always reliable.
I filed SDSCM00051184 in the SDOWP system to have this investigated. It is filed not as a defect, but as a performance issue. There are two possible results. One, this is judged to be a problem in the compiler which will get fixed. Two, an explanation is given which shows why the compiler cannot treat these two cases the same way. Feel free to follow this issue with the SDOWP link below in my signature.
I'm sorry I cannot explain it to you. I think the scope difference between a function call or not has something to do with it. Those volatile variables might contribute some effect or another. I'm just not sure, and that is why I filed the report.
Thanks and regards,
-George
George Mock said: Using the exact same build options as the customer is the best way to ensure the same results are seen. In this case, I am able to reproduce the results only by looking at the settings in the GLOBAL FILE PARAMETERS comment block. That worked this time, but is not always reliable.
Of course. I didn't doubt that but I was not at my desk when I wrote the reply, so I couldn't check the options. I now see that the .compiler_opts line is indeed very incomplete.
Thank you for all your efforts
Markus
Something went wrong when I filed the first report on this issue. It is not in the system. So I filed another one. The ID number is SDSCM00051184. I changed the earlier post to show this second ID number. But I'm posting this as well just to avoid any possible confusion.
Thanks and regards,
-George
George Mock said: I think the scope difference between a function call or not has something to do with it.
I believe this is correct, but I have no details yet. The inlined case goes from ii=2 to ii=7 between 6.0.x and 6.1.x, and 6.1.x brings improved behavior to "restrict" through better scope annotations. It's possible that something else became too conservative.
I also see that while 7.4.x is still ii=7, 8.x is back to ii=2, thanks to improved alias analysis.
I haven't looked at the underpinnings enough to comment further.
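For compiler versions that still produce the ii=7 schedule, one source-level change that might sidestep the reload is to read x->mem once before the loop and pass the resulting pointer to the helper. The sketch below is only an illustration: `f1_ptr` and `f2_hoisted` are hypothetical names, the code has not been checked against any TI compiler release, and it assumes that the stores through the pointer never modify x->mem itself.

```c
#include <stdint.h>

typedef struct X {
    uint32_t *mem;
} X;

/* Hypothetical variant of f1: it receives the target address directly,
 * so the callee contains no load of x->mem that could be re-issued
 * per iteration after inlining. */
static inline void f1_ptr(volatile uint32_t *ptr, uint16_t a, uint16_t b)
{
    *ptr = (uint32_t)a + b;
}

void f2_hoisted(const X *restrict x,
                const uint16_t *restrict data1,
                const uint16_t *restrict data2,
                const uint32_t len)
{
    if (len < 16) {
        /* Load x->mem exactly once, outside the loop.
         * Assumption: writes through ptr never change x->mem. */
        volatile uint32_t *const ptr = (volatile uint32_t *)x->mem;
        uint32_t i;

        for (i = 0; i < len; ++i) {
            f1_ptr(ptr, data1[i], data2[i]);
        }
    }
}
```

The volatile store still happens once per iteration, as in the original code; only the pointer load moves out of the loop. Whether this actually restores the ii=2 schedule on 7.4.x would have to be confirmed from the software pipeline information in the generated assembly.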