Why are MUST_ITERATE and _nassert both ineffective?

SuitJune Young

Genius 3985 points

Hi, TIers.

I want to optimize an algorithm in CCS 6.1.0.00104.

The C function is:

static void demo(float * restrict a) {
    float _buf[128];
    float * restrict _fptr = _buf;
    unsigned char i,k,j;

    _nassert ((int)(_fptr) % 8 == 0);
    _nassert ((int)(a) % 8 == 0);

    //#pragma MUST_ITERATE(64, ,64)
    for (i = 0; i < 64; i++)
    {
        _fptr[i] = a[i];
        i++;
        _fptr[i] = a[i];
    }

    #pragma MUST_ITERATE(128 ,128)

    for (j = 0; j < 128; j++)
    { //Line 154
        k = bitrv_LUT[j];
        a[j] = _fptr[k];
    } //Line 157

    return;
}

The function already contains MUST_ITERATE and _nassert. However, the compiler still advises that the loop would do better with MUST_ITERATE and _nassert.
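For reference, a minimal sketch of how the two hints are typically written for the TI C6000 compiler, where `MUST_ITERATE` takes `(min, max, multiple)`. The function name `copy128`, the constant `COPY_LEN`, the `int` loop counter, and the host fallback for `_nassert` are assumptions made for illustration; they are not taken from the project above and are not claimed to remove the advice.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fallback so the sketch also builds off-target:
 * on a non-TI compiler, _nassert becomes a run-time check. */
#ifndef __TI_COMPILER_VERSION__
#define _nassert(cond) assert(cond)
#endif

#define COPY_LEN 128 /* hypothetical name for the fixed trip count */

/* Copy COPY_LEN floats from src to dst; both must be 8-byte aligned. */
static void copy128(float *restrict dst, const float *restrict src)
{
    int i; /* assumption: plain int counter instead of unsigned char */

    /* Alignment hints, placed before the loop they describe. */
    _nassert((uintptr_t)dst % 8 == 0);
    _nassert((uintptr_t)src % 8 == 0);

#ifdef __TI_COMPILER_VERSION__
    /* min = 128, max = 128, multiple = 128 */
    #pragma MUST_ITERATE(128, 128, 128)
#endif
    for (i = 0; i < COPY_LEN; i++) {
        dst[i] = src[i];
    }
}
```

Note that `MUST_ITERATE(128 ,128)` sets the minimum and maximum, while the form quoted in the compiler advice, `MUST_ITERATE(128, ,128)`, sets the minimum and the multiple.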

The corresponding asm file is:

;******************************************************************************
;* TMS320C6x C/C++ Codegen PC v7.4.12 *
;* Date/Time created: Wed Mar 11 13:32:43 2015 *
;******************************************************************************
.compiler_opts --abi=coffabi --c64p_l1d_workaround=off --endian=little --hll_source=on --long_precision_bits=40 --mem_model:code=near --mem_model:const=data --mem_model:data=far_aggregates --object_format=coff --silicon_version=6740 --symdebug:dwarf

;******************************************************************************
;* GLOBAL FILE PARAMETERS *
;* *
;* Architecture : TMS320C674x *
;* Optimization : Enabled at level 3 *
;* Optimizing for : Speed *
;* Based on options: -o3, no -ms *
;* Endian : Little *
;* Interrupt Thrshld : Disabled *
;* Data Access Model : Far Aggregate Data *
;* Pipelining : Enabled *
;* Speculate Loads : Enabled with threshold = 9 *
;* Memory Aliases : Presume are aliases (pessimistic) *
;* Debug Info : DWARF Debug *
;* *
;******************************************************************************

The following information is for the second loop (lines 154 to 157):

;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop found in file : ../hello.c
;* Loop source line : 153
;* Loop opening brace source line : 154
;* Loop closing brace source line : 157
;* Loop Unroll Multiple : 2x
;* Known Minimum Trip Count : 64
;* Known Maximum Trip Count : 64
;* Known Max Trip Count Factor : 64
;* Loop Carried Dependency Bound(^) : 0
;* Unpartitioned Resource Bound : 3
;* Partitioned Resource Bound(*) : 3
;* Resource Partition:
;* A-side B-side
;* .L units 0 0
;* .S units 0 0
;* .D units 3* 2
;* .M units 0 0
;* .X cross paths 1 0
;* .T address paths 3* 3*
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 1 0 (.L or .S or .D unit)
;* Bound(.L .S .LS) 0 0
;* Bound(.L .S .D .LS .LSD) 2 1
;*
;* Searching for software pipeline schedule at ...
;* ii = 3 Schedule found with 7 iterations in parallel
;*
;* Register Usage Table:
;* +-----------------------------------------------------------------+
;* |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;* |00000000001111111111222222222233|00000000001111111111222222222233|
;* |01234567890123456789012345678901|01234567890123456789012345678901|
;* |--------------------------------+--------------------------------|
;* 0: | ** ** | ***** |
;* 1: | * ** | ***** |
;* 2: | ***** | **** |
;* +-----------------------------------------------------------------+
;*
;* Done
;*
;* Loop will be splooped
;* Collapsed epilog stages : 0
;* Collapsed prolog stages : 0
;* Minimum required memory pad : 0 bytes
;*
;* Minimum safe trip count : 1 (after unrolling)
;* Min. prof. trip count (est.) : 3 (after unrolling)
;*
;* Mem bank conflicts/iter(est.) : { min 0.000, est 0.250, max 2.000 }
;* Mem bank perf. penalty (est.) : 7.7%
;*
;* Effective ii : { min 3.00, est 3.25, max 5.00 }
;*
;*
;* Total cycles (est.) : 18 + min_trip_cnt * 3 = 210
;*----------------------------------------------------------------------------*
;* SETUP CODE
;*
;* MV A6,B7
;* ADD 1,A6,A6
;* MV A3,B6
;*
;* SINGLE SCHEDULED ITERATION
;*
;* $C$C36:
;* 0 LDBU .D2T2 *B7++(2),B8 ; |156|
;* 1 NOP 6
;* 7 LDW .D2T2 *+B6[B8],B4 ; |156|
;* 8 NOP 2
;* 10 LDBU .D1T1 *A6++(2),A4 ; |156|
;* 11 NOP 2
;* 13 MVD .M2 B4,B5 ; |156| Split a long life
;* 14 NOP 1
;* 15 LDW .D1T1 *+A3[A4],A5 ; |156|
;* 16 NOP 3
;* 19 MV .L1X B5,A4 ; |156| Define a twin register
;* 20 STNDW .D1T1 A5:A4,*A7++(8) ; |156|
;* || SPBR $C$C36
;* 21 ; BRANCHCC OCCURS {$C$C36} ; |153|
;*
;* If you know that this loop will always execute at a multiple of <128> and at least <128> times, try adding "#pragma MUST_ITERATE(128, ,128)" just before the loop.
;*
;* Consider adding assertions to indicate n-byte alignment of variables a if they are actually n-byte aligned: _nassert((int)(a) % == 0).
;*----------------------------------------------------------------------------*
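The numbers in this report can be cross-checked with a little arithmetic. The sketch below assumes the relationships suggested by the printed values themselves: 128 source iterations unrolled 2x give a trip count of 64, the effective ii is the scheduled ii plus the estimated bank conflicts per iteration, and the penalty is the conflict share of the effective ii. The helper names are hypothetical, and the last two formulas are inferred from the figures, not taken from TI documentation.

```c
#include <assert.h>

/* Total cycle estimate as printed in the report: overhead + trips * ii. */
static int total_cycles(int overhead, int trip_count, int ii)
{
    return overhead + trip_count * ii;
}

/* Assumed: effective ii = scheduled ii + bank conflicts per iteration. */
static double effective_ii(int ii, double conflicts_per_iter)
{
    return (double)ii + conflicts_per_iter;
}

/* Assumed: penalty = share of the effective ii lost to bank conflicts. */
static double bank_penalty(int ii, double conflicts_per_iter)
{
    return conflicts_per_iter / effective_ii(ii, conflicts_per_iter);
}
```

With the values above, `total_cycles(18, 128 / 2, 3)` gives 210, `effective_ii(3, 0.25)` gives 3.25, and `bank_penalty(3, 0.25)` is about 0.077, in line with the 7.7% printed. The trip count of 64 after 2x unrolling suggests that the 128-iteration information did reach the compiler, even though the advice is still printed.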

Looking forward to any reply.

BR!

over 10 years ago

George Mock over 10 years ago

TI Guru 244470 points

I suspect a bug in the compiler. However, I can reproduce only part of your results, and not all of them. Please preprocess the source file and attach to your next post. Also show the compiler version (different from the CCS version) and the build options exactly as the compiler sees them.

Thanks and regards,

-George

SuitJune Young over 10 years ago in reply to George Mock

Genius 3985 points

Thanks for your reply.

Compiler info:
C6000 Compiler Tools 7.4.12 com.ti.cgt.c6000.7.4.win32.feature.group Texas Instruments

static void demo(float * restrict a) {
    float _buf[128];
    float * restrict _fptr = _buf;

    unsigned char i,k,j;

    _nassert ((int)(_fptr) % 8 == 0);
    _nassert ((int)(a) % 8 == 0);

    #pragma MUST_ITERATE(128, 128, )
    for (i = 0; i < 128; i++)
    {
        _fptr[i] = a[i];
    }

    #pragma MUST_ITERATE(128, 128, )
    for (j = 0; j < 128; j++)
    {
        k = bitrv_LUT[j];
        a[j] = _fptr[k];
    }

    return;
}
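To make the data flow of `demo()` easier to follow, here is a small host-side sketch of the same copy-then-gather pattern. The definition of `bitrv_LUT` is not shown in this thread; the sketch assumes, from the name only, that it is a 128-entry table holding the 7-bit bit reversal of each index. `reverse7`, `permute`, and `LUT_LEN` are hypothetical names.

```c
#include <assert.h>

#define LUT_LEN 128 /* table size used in the thread */

/* Hypothetical stand-in for bitrv_LUT[j]: reverse the low 7 bits of j. */
static unsigned char reverse7(unsigned char j)
{
    unsigned char r = 0;
    int b;

    for (b = 0; b < 7; b++) {
        r = (unsigned char)((r << 1) | ((j >> b) & 1u));
    }
    return r;
}

/* Copy a[] to a scratch buffer, then gather back through the table,
 * mirroring the two loops of demo(). */
static void permute(float *a)
{
    float buf[LUT_LEN];
    int j;

    for (j = 0; j < LUT_LEN; j++) {
        buf[j] = a[j];
    }
    for (j = 0; j < LUT_LEN; j++) {
        a[j] = buf[reverse7((unsigned char)j)];
    }
}
```

Under that assumption the second loop is an indexed gather: every load address depends on a byte loaded from the table, which fits the `LDBU` followed by `LDW *+B6[B8]` pairs in the generated code. Bit reversal is its own inverse, so calling `permute` twice restores the original order.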

For the release build, the printed pipeline information is:

For first loop:
Searching for software pipeline schedule at ...
;* ii = 4 Schedule found with 2 iterations in parallel

For second loop:
Searching for software pipeline schedule at ...
;* ii = 3 Schedule found with 5 iterations in parallel
If you know that this loop will always execute at a multiple of <128> and at least <128> times, try adding "#pragma MUST_ITERATE(128, ,128)" just before the loop.

I am confused. For the first loop, the compiler does not advise me to use MUST_ITERATE, yet it finds only 2 iterations in parallel. For the second loop, it does advise me to use MUST_ITERATE, yet it finds 5 iterations in parallel. It seems that the loop has been unrolled.
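One possible reading of these figures, offered as a sketch and not as a confirmed explanation: in a software-pipelined loop, the number of iterations in parallel is the number of pipeline stages, roughly the length of one scheduled iteration divided by ii, rounded up. It is reported separately from the "Loop Unroll Multiple" line. The helper name below is hypothetical.

```c
#include <assert.h>

/* Pipeline stages needed to cover one iteration: ceil(cycles / ii). */
static int iterations_in_parallel(int iteration_cycles, int ii)
{
    return (iteration_cycles + ii - 1) / ii;
}
```

In the earlier listing, one scheduled iteration spans cycles 0 through 20 (21 cycles) at ii = 3, and `iterations_in_parallel(21, 3)` gives the 7 iterations in parallel reported there. By the same arithmetic, 5 iterations at ii = 3 would correspond to an iteration of 13 to 15 cycles, and 2 iterations at ii = 4 to one of 5 to 8 cycles; those schedule lengths are not shown in this thread.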

opt_test.zip

George Mock over 10 years ago in reply to SuitJune Young

TI Guru 244470 points

I apologize for the delay. Thank you for submitting a test case.

I focused on the second loop. That is the only one where I see the compiler issue advice. I can reproduce the same results you describe. I suspect some problem in the compiler as the cause. I filed SDSCM00051669 in the SDOWP system to have this investigated. Feel free to follow it with the SDOWP link below in my signature.

Thanks and regards,

-George
