Hi.
I have a problem with how (and when) the TI compiler will inline functions that access volatile objects. In particular, I want to wrap some hardware register accesses on a C64x+. The actual code is lengthy with register base addresses added, arguments checked, etc. For demonstration purposes, I've reduced the program to:
static inline uint32_t Get(volatile uint32_t *ptr)
{
    return *ptr;
}
int Test(uint32_t *ptr)
{
    int i;
    for(i=0; i<128; ++i)
    {
        if( Get(ptr) == 0 ) return 1;
    }
    return 0;
}
Unfortunately, the compiler refuses to inline "Get", and the assembler output is:
;******************************************************************************
;* FUNCTION NAME: Test *
;* *
;* Regs Modified : A0,A1,A3,A4,A5,B3 *
;* Regs Used : A0,A1,A3,A4,A5,B3,SP *
;* Local Frame Size : 0 Args + 0 Auto + 0 Save = 0 byte *
;******************************************************************************
Test:
;** --------------------------------------------------------------------------*
CALL .S1 Get ; |31|
MV .L1 A4,A5 ; |27|
MV .L1 A5,A4 ; |31|
MV .L1X B3,A1 ; |27|
MVK .S1 0x80,A3 ; |29|
;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;* Disqualified loop: Loop contains a call
;* Disqualified loop: Loop contains non-pipelinable instructions
;*----------------------------------------------------------------------------*
$C$L1:
ADDKPC .S2 $C$RL0,B3,0 ; |31|
$C$RL0: ; CALL OCCURS {Get} {0} ; |31|
;** --------------------------------------------------------------------------*
CMPEQ .L1 A4,0,A4 ; |31|
|| SUB .S1 A3,1,A3 ; |29|
SUB .L1 A4,1,A4 ; |31|
AND .L1 A4,A3,A0 ; |29|
[ A0] B .S1 $C$L1 ; |29|
|| [ A0] MV .D1 A5,A4 ; |31|
|| [!A0] CMPEQ .L1 A4,0,A4 ; |33|
[ A0] CALL .S1 Get ; |31|
[!A0] RETNOP .S2X A1,3 ; |34|
; BRANCHCC OCCURS {$C$L1} ; |29|
;** --------------------------------------------------------------------------*
NOP 2
; BRANCH OCCURS {A1} ; |34|
My understanding is that the volatile parameter inhibits inlining (although I honestly have no clue why that would be).
Therefore, I made the parameter non-volatile and instead used a cast to make the access volatile:
static inline uint32_t Get(uint32_t *src)
{
    return *(volatile uint32_t*)src;
}
This gets inlined, but unfortunately the volatile access seems to be optimized away, as the load instruction was pulled outside the loop!
;******************************************************************************
;* FUNCTION NAME: Test *
;* *
;* Regs Modified : A3,A4,A5,A6,B0,B1,B4 *
;* Regs Used : A3,A4,A5,A6,B0,B1,B3,B4 *
;* Local Frame Size : 0 Args + 0 Auto + 0 Save = 0 byte *
;******************************************************************************
Test:
;** --------------------------------------------------------------------------*
LDW .D1T1 *A4,A6 ; |29|
MVK .L2 0x1,B1
NOP 1
;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop found in file : C:/Test.c
;* Loop source line : 29
;* Loop opening brace source line : 30
;* Loop closing brace source line : 32
;* Known Minimum Trip Count : 1
;* Known Maximum Trip Count : 128
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 2
;* Unpartitioned Resource Bound : 1
;* Partitioned Resource Bound(*) : 1
;* Resource Partition:
;* A-side B-side
;* .L units 1* 0
;* .S units 0 0
;* .D units 0 0
;* .M units 0 0
;* .X cross paths 0 1*
;* .T address paths 0 0
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 2 3 (.L or .S or .D unit)
;* Bound(.L .S .LS) 1* 0
;* Bound(.L .S .D .LS .LSD) 1* 1*
;*
;* Searching for software pipeline schedule at ...
;* ii = 2 Schedule found with 5 iterations in parallel
;* Done
;*
;* Loop will be splooped
;* Collapsed epilog stages : 4
;* Collapsed prolog stages : 0
;* Minimum required memory pad : 0 bytes
;*
;* Minimum safe trip count : 1
;*----------------------------------------------------------------------------*
$C$L1: ; PIPED LOOP PROLOG
[ B1] SPLOOPW 2 ;10 ; (P)
;** --------------------------------------------------------------------------*
$C$L2: ; PIPED LOOP KERNEL
NOP 1
CMPEQ .L1 A6,0,A3 ; |27| (P) <0,1>
SPMASK S2
|| MVK .S2 0x80,B4 ; |29|
|| SUB .S1 A3,1,A4 ; |27| (P) <0,2> ^
SUB .L2 B4,1,B4 ; |29| (P) <0,3>
[ B1] MV .L1 A4,A5 ; |29| (P) <0,4> ^
|| AND .L2X A4,B4,B0 ; |29| (P) <0,4> ^
[!B0] ZERO .S2 B1 ; |29| (P) <0,5> ^
NOP 2
NOP 1
SPKERNEL 0,0
;** --------------------------------------------------------------------------*
$C$L3: ; PIPED LOOP EPILOG
;** --------------------------------------------------------------------------*
RETNOP .S2 B3,4 ; |34|
CMPEQ .L1 A5,0,A4 ; |33|
; BRANCH OCCURS {B3} ; |34|
Finally, if I replace this last inline function by a macro (which to my understanding should not make a difference!), it seems to work:
#define Get(src) (*(volatile uint32_t*)src)
produces:
;******************************************************************************
;* FUNCTION NAME: Test *
;* *
;* Regs Modified : A0,A3,A4,A5,A6,A7,B0,B1 *
;* Regs Used : A0,A3,A4,A5,A6,A7,B0,B1,B3 *
;* Local Frame Size : 0 Args + 0 Auto + 0 Save = 0 byte *
;******************************************************************************
Test:
;** --------------------------------------------------------------------------*
MVK .L2 0x1,B1
;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop found in file : C:/Test.c
;* Loop source line : 29
;* Loop opening brace source line : 30
;* Loop closing brace source line : 32
;* Known Minimum Trip Count : 1
;* Known Maximum Trip Count : 128
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 9
;* Unpartitioned Resource Bound : 2
;* Partitioned Resource Bound(*) : 2
;* Resource Partition:
;* A-side B-side
;* .L units 1 0
;* .S units 0 0
;* .D units 1 0
;* .M units 0 0
;* .X cross paths 0 0
;* .T address paths 1 0
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 4 5 (.L or .S or .D unit)
;* Bound(.L .S .LS) 1 0
;* Bound(.L .S .D .LS .LSD) 2* 2*
;*
;* Searching for software pipeline schedule at ...
;* ii = 9 Schedule found with 2 iterations in parallel
;* Done
;*
;* Loop will be splooped
;* Collapsed epilog stages : 1
;* Collapsed prolog stages : 0
;* Minimum required memory pad : 0 bytes
;*
;* Minimum safe trip count : 1
;*----------------------------------------------------------------------------*
$C$L1: ; PIPED LOOP PROLOG
[ B1] SPLOOPW 9 ;18 ; (P)
;** --------------------------------------------------------------------------*
$C$L2: ; PIPED LOOP KERNEL
NOP 4
SPMASK L1
|| MV .L1 A4,A6
[ B1] LDW .D1T1 *A6,A4 ; |31| (P) <0,5> ^
NOP 2
SPMASK S1,L2
|| MVK .S1 0x80,A5 ; |29|
|| MV .L2 B1,B0
NOP 1
CMPEQ .L1 A4,0,A7 ; |31| <0,10> ^
SUB .L1 A5,1,A5 ; |29| <0,11>
|| SUB .S1 A7,1,A7 ; |31| <0,11> ^
[ B0] MV .L1 A7,A3 ; |29| <0,12>
|| AND .S1 A7,A5,A0 ; |29| <0,12> ^
[!A0] ZERO .L2 B1 ; |29| <0,13> ^
MV .L2 B1,B0 ; |29| <0,14> Split a long life(pre-sched)
NOP 1
NOP 1
SPKERNEL 0,0
;** --------------------------------------------------------------------------*
$C$L3: ; PIPED LOOP EPILOG
;** --------------------------------------------------------------------------*
RETNOP .S2 B3,4 ; |34|
CMPEQ .L1 A3,0,A4 ; |33|
; BRANCH OCCURS {B3} ; |34|
To me, the second case looks like a compiler bug, but I am not certain. I would like to minimize the number of macros in my code, so I'd really prefer an inline function. Does anyone know how to achieve that safely?
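For anyone who wants to reproduce this, here is a sketch that puts the three variants from above into one self-contained file. The names GetVolatileParam, GetCast, GET_MACRO, the three Test* functions and CheckVariants are only for illustration and are not the names in my real code. The macro argument is parenthesized here, which makes no difference for this test. Note that this only checks that all three variants return the same results; it says nothing about whether the load stays inside the loop, which still has to be checked in the assembler output.

```c
#include <assert.h>
#include <stdint.h>

/* Variant 1: volatile-qualified parameter. */
static inline uint32_t GetVolatileParam(volatile uint32_t *ptr)
{
    return *ptr;
}

/* Variant 2: plain parameter, volatile cast at the point of access. */
static inline uint32_t GetCast(uint32_t *src)
{
    return *(volatile uint32_t *)src;
}

/* Variant 3: macro doing the same cast (argument parenthesized). */
#define GET_MACRO(src) (*(volatile uint32_t *)(src))

int TestVolatileParam(uint32_t *ptr)
{
    int i;
    for (i = 0; i < 128; ++i)
    {
        if (GetVolatileParam(ptr) == 0) return 1;
    }
    return 0;
}

int TestCast(uint32_t *ptr)
{
    int i;
    for (i = 0; i < 128; ++i)
    {
        if (GetCast(ptr) == 0) return 1;
    }
    return 0;
}

int TestMacro(uint32_t *ptr)
{
    int i;
    for (i = 0; i < 128; ++i)
    {
        if (GET_MACRO(ptr) == 0) return 1;
    }
    return 0;
}

/* Functional check only: every variant must agree on both inputs. */
void CheckVariants(void)
{
    uint32_t zero = 0;
    uint32_t nonzero = 0x1234u;

    assert(TestVolatileParam(&zero) == 1);
    assert(TestCast(&zero) == 1);
    assert(TestMacro(&zero) == 1);

    assert(TestVolatileParam(&nonzero) == 0);
    assert(TestCast(&nonzero) == 0);
    assert(TestMacro(&nonzero) == 0);
}
```

Calling CheckVariants() from main should pass silently on any conforming compiler. The interesting part is then to compare the generated code for the three Test* functions.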
Thanks a lot
Markus