I have a customer with the following TMS320F28335 code:
/* solution 1 */
#pragma MUST_ITERATE(16, 16)
#pragma UNROLL(16)
for (samplecount = 0; samplecount < 16; samplecount++)
{
    acc += *psrc++;
}
With the options -g -os -on2 -o3 -b -ml -mn -v28 --float_support=fpu32 -mf5 --single_inline -mf, the compiler generates the following code:
00B8C3 0EC3 MOVU ACC,*+XAR3[0]
00B8C4 0DCB ADDU ACC,*+XAR3[1]
00B8C5 D108 MOVB XAR1,#8
00B8C6 D009 MOVB XAR0,#9
00B8C7 0DD3 ADDU ACC,*+XAR3[2]
00B8C8 0DDB ADDU ACC,*+XAR3[3]
00B8C9 0DE3 ADDU ACC,*+XAR3[4]
00B8CA 0DEB ADDU ACC,*+XAR3[5]
00B8CB 0DF3 ADDU ACC,*+XAR3[6]
00B8CC 0DFB ADDU ACC,*+XAR3[7]
00B8CD 0D9B ADDU ACC,*+XAR3[AR1]
00B8CE D10F MOVB XAR1,#15
00B8CF 0D93 ADDU ACC,*+XAR3[AR0]
00B8D0 D00A MOVB XAR0,#10
00B8D1 0D93 ADDU ACC,*+XAR3[AR0]
00B8D2 D00B MOVB XAR0,#11
00B8D3 0D93 ADDU ACC,*+XAR3[AR0]
00B8D4 D00C MOVB XAR0,#12
00B8D5 0D93 ADDU ACC,*+XAR3[AR0]
00B8D6 D00D MOVB XAR0,#13
00B8D7 0D93 ADDU ACC,*+XAR3[AR0]
00B8D8 D00E MOVB XAR0,#14
00B8D9 0D93 ADDU ACC,*+XAR3[AR0]
00B8DA 0D9B ADDU ACC,*+XAR3[AR1]
00B8DB DB10 ADDB XAR3,#16
If he changes the arguments of #pragma MUST_ITERATE to (32, 32), he gets a more compact result:
/* solution 2 */
#pragma MUST_ITERATE(32, 32)
#pragma UNROLL(16)
for (samplecount = 0; samplecount < 16; samplecount++)
{
    acc += *psrc++;
}
And this is the result of the compiler:
00B903 0D87 ADDU ACC,*XAR7++
00B904 0D87 ADDU ACC,*XAR7++
00B905 0D87 ADDU ACC,*XAR7++
00B906 0D87 ADDU ACC,*XAR7++
00B907 0D87 ADDU ACC,*XAR7++
00B908 0D87 ADDU ACC,*XAR7++
00B909 0D87 ADDU ACC,*XAR7++
00B90A 0D87 ADDU ACC,*XAR7++
00B90B 0D87 ADDU ACC,*XAR7++
00B90C 0D87 ADDU ACC,*XAR7++
00B90D 0D87 ADDU ACC,*XAR7++
00B90E 0D87 ADDU ACC,*XAR7++
00B90F 0D87 ADDU ACC,*XAR7++
00B910 0D87 ADDU ACC,*XAR7++
00B911 0D87 ADDU ACC,*XAR7++
00B912 0D87 ADDU ACC,*XAR7++
As far as I understand the description of #pragma MUST_ITERATE in the "TMS320C28x Optimizing C/C++ Compiler User's Guide" (SPRU514C), its arguments must state exactly how many times the loop executes so that the compiler can produce optimized code that still works correctly. In solution 2, however, MUST_ITERATE claims 32 iterations although the loop actually executes only 16 times, and yet the compiler generates the better result: 16 ADDU instructions with post-increment addressing. In solution 1, MUST_ITERATE is set to the correct value of 16, and the result is less efficient: 25 instructions, eight of which are MOVB instructions that load an index register for the offsets 8 to 15.
What's the explanation for this behavior?
Could it cause a problem to specify #pragma MUST_ITERATE(32, 32) for a loop that executes only 16 times?
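For reference, the fragment from solution 1 can be wrapped into a complete function roughly as follows. This is only a sketch under assumptions: the function name sum16 and the types (16-bit unsigned samples, 32-bit accumulator) are not from the customer code but are inferred from the ADDU instructions in the listings, and the __TI_COMPILER_VERSION__ guard is added only so that the same file also builds with a non-TI host compiler.

```c
#include <stdint.h>

/* Hypothetical wrapper around the loop from solution 1.
   Assumed types: 16-bit unsigned samples, 32-bit accumulator. */
uint32_t sum16(const uint16_t *psrc)
{
    uint32_t acc = 0;
    unsigned int samplecount;

    /* MUST_ITERATE(min, max): the first argument is the minimum and the
       second the maximum trip count the programmer guarantees.
       Here both match the real loop bound of 16.
       The pragmas are TI-specific, so other compilers skip them. */
#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(16, 16)
#pragma UNROLL(16)
#endif
    for (samplecount = 0; samplecount < 16; samplecount++)
    {
        acc += *psrc++;
    }

    return acc;
}
```

Called on a buffer holding the values 1 to 16, sum16 should return 136 regardless of how the loop is unrolled, which gives a simple check when comparing the two pragma settings.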
Thanks for clarification!
Best regards
Peter Forstner
MCU FAE Europe