TMS320F28335: #pragma MUST_ITERATE & #pragma UNROLL

Hello,
I have a customer with the following TMS320F28335 code:

/* solution 1 */
  #pragma MUST_ITERATE(16, 16)
  #pragma UNROLL(16)
  for (samplecount = 0; samplecount < 16; samplecount++)       
  {                                                                       
    acc += *psrc++;
  }  

And this is what the compiler generates with the options -g -os -on2 -o3 -b -ml -mn -v28 --float_support=fpu32 -mf5 --single_inline -mf:

  00B8C3 0EC3        MOVU       ACC,*+XAR3[0]
  00B8C4 0DCB        ADDU       ACC,*+XAR3[1]
  00B8C5 D108        MOVB       XAR1,#8
  00B8C6 D009        MOVB       XAR0,#9
  00B8C7 0DD3        ADDU       ACC,*+XAR3[2]
  00B8C8 0DDB        ADDU       ACC,*+XAR3[3]
  00B8C9 0DE3        ADDU       ACC,*+XAR3[4]
  00B8CA 0DEB        ADDU       ACC,*+XAR3[5]
  00B8CB 0DF3        ADDU       ACC,*+XAR3[6]
  00B8CC 0DFB        ADDU       ACC,*+XAR3[7]
  00B8CD 0D9B        ADDU       ACC,*+XAR3[AR1]
  00B8CE D10F        MOVB       XAR1,#15
  00B8CF 0D93        ADDU       ACC,*+XAR3[AR0]
  00B8D0 D00A        MOVB       XAR0,#10
  00B8D1 0D93        ADDU       ACC,*+XAR3[AR0]
  00B8D2 D00B        MOVB       XAR0,#11
  00B8D3 0D93        ADDU       ACC,*+XAR3[AR0]
  00B8D4 D00C        MOVB       XAR0,#12
  00B8D5 0D93        ADDU       ACC,*+XAR3[AR0]
  00B8D6 D00D        MOVB       XAR0,#13
  00B8D7 0D93        ADDU       ACC,*+XAR3[AR0]
  00B8D8 D00E        MOVB       XAR0,#14
  00B8D9 0D93        ADDU       ACC,*+XAR3[AR0]
  00B8DA 0D9B        ADDU       ACC,*+XAR3[AR1]
  00B8DB DB10        ADDB       XAR3,#16

If he changes the #pragma MUST_ITERATE arguments to (32, 32), he gets a more compact result:

/* solution 2 */
  #pragma MUST_ITERATE(32, 32)
  #pragma UNROLL(16)
  for (samplecount = 0; samplecount < 16; samplecount++)        
  {                                                                        
    acc += *psrc++;
  }     

And this is what the compiler generates:

  00B903 0D87        ADDU       ACC,*XAR7++
  00B904 0D87        ADDU       ACC,*XAR7++
  00B905 0D87        ADDU       ACC,*XAR7++
  00B906 0D87        ADDU       ACC,*XAR7++
  00B907 0D87        ADDU       ACC,*XAR7++
  00B908 0D87        ADDU       ACC,*XAR7++
  00B909 0D87        ADDU       ACC,*XAR7++
  00B90A 0D87        ADDU       ACC,*XAR7++
  00B90B 0D87        ADDU       ACC,*XAR7++
  00B90C 0D87        ADDU       ACC,*XAR7++
  00B90D 0D87        ADDU       ACC,*XAR7++
  00B90E 0D87        ADDU       ACC,*XAR7++
  00B90F 0D87        ADDU       ACC,*XAR7++
  00B910 0D87        ADDU       ACC,*XAR7++
  00B911 0D87        ADDU       ACC,*XAR7++
  00B912 0D87        ADDU       ACC,*XAR7++

As far as I understand the description of #pragma MUST_ITERATE in the "TMS320C28x Optimizing C/C++ Compiler User's Guide" (spru514c), you have to state exactly how many times the loop executes so that the compiler can produce good, working optimized code. But in solution 2 above, MUST_ITERATE is set to 32 iterations while the loop actually executes only 16 times, and the compiler generates the better result. In solution 1, MUST_ITERATE is set to the correct value of 16, and the result is less efficient code.

What's the explanation for this behavior?
Could there be a problem with combining #pragma MUST_ITERATE(32, 32) and a loop that is executed only 16 times?

Thanks for clarification!

Best regards
Peter Forstner

MCU FAE Europe
  • Peter,

    The scenario you described does look unusual. I don't know all the internal details of the optimizer, but keep in mind that #pragma MUST_ITERATE only informs the compiler about the loop's behaviour; it is the optimizer that makes the final decision about the generated code. It makes good assumptions most of the time, but it has many other considerations to weigh, and this may be a case where it simply didn't guess right. That is an intrinsic part of hand-optimizing code.

    Additional information about hand-optimizing loops can be found in this app note (SPRA666). Although it covers another DSP family (C6000), it explains these two pragmas (pages 19 and 20) and discusses considerations for their use.

    Further analysis would require additional details about the development environment and the source code; check this topic.

    Hope this helps,

    Rafael
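For reference, the pragma usage discussed in this thread can be sketched as below. This is only a minimal sketch, not the customer's code: the function names, `BLOCK_LEN`, and the `uint16_t`/`uint32_t` types are assumptions (the unsigned 16-bit source is inferred from the MOVU/ADDU instructions in the listings). As I read the compiler guide, MUST_ITERATE is a guarantee the programmer gives to the compiler, so the sketch keeps the arguments equal to the real trip count of 16 instead of relying on the (32, 32) variant. The pragmas are guarded so that non-TI compilers simply skip them, and a hand-unrolled version is included as the fallback if the optimizer's choice for the pragma version is not good enough.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_LEN 16  /* hypothetical name; must match the pragma arguments */

/* Loop form: the pragmas state the real trip count (16 iterations). */
uint32_t sum_block_loop(const uint16_t *psrc)
{
    uint32_t acc = 0;
    int samplecount;

#ifdef __TI_COMPILER_VERSION__  /* TI-specific pragmas, hidden from other compilers */
    #pragma MUST_ITERATE(16, 16)
    #pragma UNROLL(16)
#endif
    for (samplecount = 0; samplecount < BLOCK_LEN; samplecount++)
    {
        acc += *psrc++;
    }
    return acc;
}

/* Hand-unrolled form: 16 explicit post-increment adds, no pragmas needed. */
uint32_t sum_block_unrolled(const uint16_t *psrc)
{
    uint32_t acc = 0;

    acc += *psrc++; acc += *psrc++; acc += *psrc++; acc += *psrc++;
    acc += *psrc++; acc += *psrc++; acc += *psrc++; acc += *psrc++;
    acc += *psrc++; acc += *psrc++; acc += *psrc++; acc += *psrc++;
    acc += *psrc++; acc += *psrc++; acc += *psrc++; acc += *psrc++;
    return acc;
}

/* Small self-check: both forms must return the same sum. */
void demo_check(void)
{
    uint16_t samples[BLOCK_LEN];
    int i;

    for (i = 0; i < BLOCK_LEN; i++)
    {
        samples[i] = (uint16_t)(i + 1);  /* 1, 2, ..., 16 */
    }
    assert(sum_block_loop(samples) == 136u);  /* 1 + 2 + ... + 16 */
    assert(sum_block_loop(samples) == sum_block_unrolled(samples));
}
```

Calling `demo_check()` from a test build confirms that the two forms agree. Which instruction sequence the C28x compiler actually emits for each form still depends on the compiler version and options, so the generated assembly is worth inspecting in either case.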