Optimization, aliasing, and inline functions

Hi


I ran into a curious problem where the compiler generates a less-than-ideal software pipelined loop after inlining a function call. Consider the following code:

#include <stdint.h>

//#define EXPAND
typedef struct X
{
	uint32_t *mem;
} X;

void f1(const X* x, uint16_t a, uint16_t b)
{
	volatile uint32_t * ptr = (volatile uint32_t*) x->mem;
	*ptr = a + b;
}

void f2(const X* restrict x, const uint16_t * restrict data1, const uint16_t * restrict data2, const uint32_t len)
{
	if(len < 16)
	{
		uint32_t i;
		for(i=0; i<len; ++i)
		{
#ifdef EXPAND
			volatile uint32_t * ptr = (volatile uint32_t*) x->mem;
			*ptr = data1[i] + data2[i];
#else
			f1(x, data1[i], data2[i]);
#endif

		}
	}
}

The compiler behaves very differently depending on whether or not EXPAND is defined (yes, I am aware that the compiler internally takes completely different execution paths in the two cases).
When EXPAND is defined, the compiler generates the following for f2:

;******************************************************************************
;* TMS320C6x C/C++ Codegen                                          PC v7.4.7 *
;* Date/Time created: Mon Nov 17 10:08:23 2014                                *
;******************************************************************************
	.compiler_opts --abi=eabi --c64p_l1d_workaround=default --endian=little --hll_source=on --long_precision_bits=32 --mem_model:code=near --mem_model:const=data --mem_model:data=far_aggregates --object_format=elf --silicon_version=6500 --symdebug:none 

;******************************************************************************
;* GLOBAL FILE PARAMETERS                                                     *
;*                                                                            *
;*   Architecture      : TMS320C64x+                                          *
;*   Optimization      : Enabled at level 3                                   *
;*   Optimizing for    : Speed                                                *
;*                       Based on options: -o3, no -ms                        *
;*   Endian            : Little                                               *
;*   Interrupt Thrshld : Disabled                                             *
;*   Data Access Model : Far Aggregate Data                                   *
;*   Pipelining        : Enabled                                              *
;*   Speculate Loads   : Enabled with threshold = 0                           *
;*   Memory Aliases    : Presume are aliases (pessimistic)                    *
;*   Debug Info        : No Debug Info                                        *
;*                                                                            *
;******************************************************************************

;******************************************************************************
;* FUNCTION NAME: f2                                                          *
;*                                                                            *
;*   Regs Modified     : A0,A3,A4,A5,A6,B4,B5                                 *
;*   Regs Used         : A0,A3,A4,A5,A6,B3,B4,B5,B6                           *
;*   Local Frame Size  : 0 Args + 0 Auto + 0 Save = 0 byte                    *
;******************************************************************************
f2:
;** --------------------------------------------------------------------------*

           MV      .L1X    B4,A5             ; |16| 
||         CMPEQ   .L2     B6,0,B4           ; |17| 
||         MV      .S2X    A6,B5             ; |16| 

           XOR     .L2     1,B4,B4           ; |17| 
           CMPLTU  .L1X    B6,16,A3          ; |17| 
           AND     .L1X    B4,A3,A0          ; |17| 

   [!A0]   BNOP    .S1     $C$L4,5           ; |17| 
|| [ A0]   MVC     .S2     B6,ILC

           ; BRANCHCC OCCURS {$C$L4}         ; |17| 
;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : C:/test.c
;*      Loop source line                 : 20
;*      Loop opening brace source line   : 21
;*      Loop closing brace source line   : 29
;*      Known Minimum Trip Count         : 1                    
;*      Known Maximum Trip Count         : 15                    
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound(*)    : 2
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0     
;*      .S units                     0        0     
;*      .D units                     2*       1     
;*      .M units                     0        0     
;*      .X cross paths               1        0     
;*      .T address paths             2*       1     
;*      Long read paths              0        0     
;*      Long write paths             0        0     
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          1        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        0     
;*      Bound(.L .S .D .LS .LSD)     1        1     
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 2  Schedule found with 4 iterations in parallel
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages       : 0
;*      Collapsed prolog stages       : 0
;*      Minimum required memory pad   : 0 bytes
;*
;*      Minimum safe trip count       : 1
;*----------------------------------------------------------------------------*
$C$L1:    ; PIPED LOOP PROLOG

           SPLOOP  2       ;8                ; (P) 
||         LDW     .D1T1   *A4,A6

;** --------------------------------------------------------------------------*
$C$L2:    ; PIPED LOOP KERNEL
           LDHU    .D1T1   *A5++,A4          ; |24| (P) <0,0> 
           LDHU    .D2T2   *B5++,B4          ; |24| (P) <0,1> 
           NOP             4
           ADD     .L1X    B4,A4,A3          ; |24| <0,6> 

           SPKERNEL 3,0
||         STW     .D1T1   A3,*A6            ; |24| <0,7> 

;** --------------------------------------------------------------------------*
$C$L3:    ; PIPED LOOP EPILOG
;** --------------------------------------------------------------------------*
$C$L4:    
           RETNOP  .S2     B3,5              ; |31| 
           ; BRANCH OCCURS {B3}              ; |31| 

In contrast, when EXPAND is not defined, the compiler generates the following for f2:

;******************************************************************************
;* TMS320C6x C/C++ Codegen                                          PC v7.4.7 *
;* Date/Time created: Mon Nov 17 10:10:07 2014                                *
;******************************************************************************
	.compiler_opts --abi=eabi --c64p_l1d_workaround=default --endian=little --hll_source=on --long_precision_bits=32 --mem_model:code=near --mem_model:const=data --mem_model:data=far_aggregates --object_format=elf --silicon_version=6500 --symdebug:none 

;******************************************************************************
;* GLOBAL FILE PARAMETERS                                                     *
;*                                                                            *
;*   Architecture      : TMS320C64x+                                          *
;*   Optimization      : Enabled at level 3                                   *
;*   Optimizing for    : Speed                                                *
;*                       Based on options: -o3, no -ms                        *
;*   Endian            : Little                                               *
;*   Interrupt Thrshld : Disabled                                             *
;*   Data Access Model : Far Aggregate Data                                   *
;*   Pipelining        : Enabled                                              *
;*   Speculate Loads   : Enabled with threshold = 0                           *
;*   Memory Aliases    : Presume are aliases (pessimistic)                    *
;*   Debug Info        : No Debug Info                                        *
;*                                                                            *
;******************************************************************************

;******************************************************************************
;* FUNCTION NAME: f2                                                          *
;*                                                                            *
;*   Regs Modified     : A0,A3,A4,A5,B4,B5                                    *
;*   Regs Used         : A0,A3,A4,A5,A6,B3,B4,B5,B6                           *
;*   Local Frame Size  : 0 Args + 0 Auto + 0 Save = 0 byte                    *
;******************************************************************************
f2:
;** --------------------------------------------------------------------------*
           CMPEQ   .L2     B6,0,B5           ; |17| 
           XOR     .L2     1,B5,B5           ; |17| 
           CMPLTU  .L1X    B6,16,A3          ; |17| 
           AND     .L1X    B5,A3,A0          ; |17| 

   [!A0]   BNOP    .S1     $C$L4,5           ; |17| 
|| [ A0]   SUB     .L2     B6,1,B5

           ; BRANCHCC OCCURS {$C$L4}         ; |17| 
;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : C:/test.c
;*      Loop source line                 : 20
;*      Loop opening brace source line   : 21
;*      Loop closing brace source line   : 29
;*      Known Minimum Trip Count         : 1                    
;*      Known Maximum Trip Count         : 15                    
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 7
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound(*)    : 2
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0     
;*      .S units                     0        0     
;*      .D units                     2*       2*    
;*      .M units                     0        0     
;*      .X cross paths               1        0     
;*      .T address paths             2*       2*    
;*      Long read paths              0        0     
;*      Long write paths             0        0     
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          1        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        0     
;*      Bound(.L .S .D .LS .LSD)     1        1     
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 7  Schedule found with 2 iterations in parallel
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages       : 0
;*      Collapsed prolog stages       : 0
;*      Minimum required memory pad   : 0 bytes
;*
;*      Minimum safe trip count       : 1
;*----------------------------------------------------------------------------*
$C$L1:    ; PIPED LOOP PROLOG

           SPLOOPD 7       ;14               ; (P) 
||         MV      .L1     A4,A5
||         MV      .L2X    A6,B5
||         MV      .S1X    B4,A4
||         MVC     .S2     B5,ILC

;** --------------------------------------------------------------------------*
$C$L2:    ; PIPED LOOP KERNEL

           LDHU    .D1T1   *A4++,A3          ; |11| (P) <0,0>  ^ 
||         LDHU    .D2T2   *B5++,B4          ; |11| (P) <0,0>  ^ 

           LDW     .D1T2   *A5,B4            ; |11| (P) <0,1> 
           NOP             3
           ADD     .L1X    B4,A3,A3          ; |11| (P) <0,5>  ^ 
           STW     .D2T1   A3,*B4            ; |11| (P) <0,6>  ^ 
           SPKERNEL 0,0
;** --------------------------------------------------------------------------*
$C$L3:    ; PIPED LOOP EPILOG
           NOP             1
;** --------------------------------------------------------------------------*
$C$L4:    
           RETNOP  .S2     B3,5              ; |31| 
           ; BRANCH OCCURS {B3}              ; |31| 

So after inlining, the compiler doesn't see that a and b cannot alias x->mem. It reloads x->mem in every loop iteration (the LDW inside the kernel) and thus generates an ii=7 schedule (2 iterations in parallel) as opposed to the ii=2 schedule (4 iterations in parallel) above.

Now I'm not overly worried about that behavior: it's unlikely to affect the overall performance too much, and I can always optimize where relevant. But I'd like to maintain a coding style that does not induce unnecessary performance penalties, so I wonder whether I am missing anything and what modifications (if any) would let the compiler do a better job. Enabling "optimistic" aliasing assumptions seems to help, but these assumptions are a lot stricter than what ISO C requires and too strict for my purposes.

Regards and thanks for any help

Markus

  • Please show the exact compiler build options.

    Thanks and regards,

    -George

  • Do you need more than the .compiler_opts line in the assembler output? If so, what else?

    Regards

    Markus

  • Using the exact same build options as the customer is the best way to ensure the same results are seen.  In this case, I am able to reproduce the results using only the settings in the GLOBAL FILE PARAMETERS comment block.  That worked this time, but is not always reliable.

    I filed SDSCM00051184 in the SDOWP system to have this investigated.  It is filed not as a defect, but as a performance issue.  There are two possible results.  One, this is judged to be a problem in the compiler which will get fixed.  Two, an explanation is given which shows why the compiler cannot treat these two cases the same way. Feel free to follow this issue with the SDOWP link below in my signature.

    I'm sorry I cannot explain it to you.  I think the scope difference between a function call or not has something to do with it.  Those volatile variables might contribute some effect or another.  I'm just not sure, and that is why I filed the report.

    Thanks and regards,

    -George

  • George Mock said:

    Using the exact same build options as the customer is the best way to ensure the same results are seen.  In this case, I am able to reproduce the results using only the settings in the GLOBAL FILE PARAMETERS comment block.  That worked this time, but is not always reliable.


    Of course. I didn't doubt that, but I was not at my desk when I wrote the reply, so I couldn't check the options. I now see that the .compiler_opts line is indeed very incomplete.


    Thank you for all your efforts

    Markus

  • Something went wrong when I filed the first report on this issue.  It is not in the system.  So I filed another one.  The ID number is SDSCM00051184.  I changed the earlier post to show this second ID number.  But I'm posting this as well just to avoid any possible confusion.

    Thanks and regards,

    -George

  • George Mock said:

    I think the scope difference between a function call or not has something to do with it.

    I believe this is correct, but I have no details yet.  The inlined case goes from ii=2 to ii=7 between 6.0.x and 6.1.x, and 6.1.x brings improved behavior to "restrict" through better scope annotations.  It's possible that something else became too conservative.

    I also see that while 7.4.x is still ii=7, 8.x is back to ii=2, thanks to improved alias analysis.

    I haven't looked at the underpinnings enough to comment further.
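
For compiler versions that still produce the ii=7 schedule, one source-level change that might sidestep the issue is to read x->mem once before the loop and pass the resulting pointer to the helper, so that no load of x->mem remains in the loop body for the alias analysis to worry about. The sketch below is only an illustration and is not confirmed anywhere in this thread: the names f1_ptr and f2_hoisted are hypothetical, and the resulting schedule has not been checked with 7.4.7. It also assumes that x->mem does not change while f2_hoisted runs.

```c
#include <stdint.h>

typedef struct X
{
	uint32_t *mem;
} X;

/* Hypothetical variant of f1: takes the destination pointer directly
   instead of loading it from the struct on every call. */
static inline void f1_ptr(volatile uint32_t * ptr, uint16_t a, uint16_t b)
{
	*ptr = a + b;
}

/* Hypothetical variant of f2: x->mem is read once, outside the loop. */
void f2_hoisted(const X* restrict x, const uint16_t * restrict data1, const uint16_t * restrict data2, const uint32_t len)
{
	if(len < 16)
	{
		volatile uint32_t * const ptr = (volatile uint32_t*) x->mem;
		uint32_t i;
		for(i=0; i<len; ++i)
		{
			f1_ptr(ptr, data1[i], data2[i]);
		}
	}
}
```

The volatile store still happens once per iteration, as in the original code; only the load of x->mem moves out of the loop. Whether this restores the ii=2 schedule would have to be checked in the SOFTWARE PIPELINE INFORMATION block of the generated assembly.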