Using pragmas to unroll loops

Hi everyone!

In my application, I'm using the C28 core of an F28M35H52C1 to execute a control law at a rate of Fcontrol, and an external AD7634 ADC to sample my feedback signal. The ADC is the SPI master and sends 18-bit samples at a fixed frequency Fs. The C28 core receives these samples in its SPI module. I'm using Code Composer Studio v4.2.5.00005.

I'm oversampling, which means Fs > Fcontrol. Therefore, I'm using the SPI FIFO buffer to accumulate samples, and it is emptied at every control cycle. For example, if Fs = 600 kS/s and Fcontrol = 100 kHz, I configure the FIFO buffer to generate an interrupt every 12th received word (since the samples are 18 bits wide, the SPI module is configured to receive 9-bit words, each representing half of an ADC sample). The related ISR empties the FIFO buffer, processes the samples, performs the decimation and executes the control law.
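The FIFO interrupt level in that example follows from the two rates: 600 kS/s / 100 kHz = 6 samples per control cycle, times 2 words per sample = 12 words. A minimal sketch of deriving it at compile time is shown below. The macro names and the `SPI_FIFO_DEPTH` value are assumptions made for illustration and are not taken from the original project; the FIFO depth in particular should be checked against the device manual.

```c
#include <assert.h>
#include <stdint.h>

#define FS_HZ            600000UL  /* ADC sample rate (example value from the post) */
#define FCONTROL_HZ      100000UL  /* control-loop rate (example value from the post) */
#define WORDS_PER_SAMPLE 2UL       /* one 18-bit sample arrives as two 9-bit words */
#define SPI_FIFO_DEPTH   16UL      /* ASSUMPTION: RX FIFO depth, check the device manual */

/* Words that accumulate in the RX FIFO during one control cycle. */
#define FIFO_SIZE (WORDS_PER_SAMPLE * FS_HZ / FCONTROL_HZ)

/* C89-style compile-time checks: a negative array size aborts the build. */
typedef char fs_must_be_multiple_of_fcontrol[(FS_HZ % FCONTROL_HZ == 0UL) ? 1 : -1];
typedef char fifo_level_must_fit_in_fifo[(FIFO_SIZE <= SPI_FIFO_DEPTH) ? 1 : -1];

/* Run-time version of the same arithmetic.
 * Returns 0 if the rates do not divide evenly. */
static uint32_t fifo_words_per_cycle(uint32_t fs_hz, uint32_t fcontrol_hz)
{
    assert(fcontrol_hz != 0u);
    if (fs_hz % fcontrol_hz != 0u) {
        return 0u;
    }
    return (uint32_t)WORDS_PER_SAMPLE * (fs_hz / fcontrol_hz);
}
```

With the values above, `FIFO_SIZE` evaluates to 12, matching the interrupt level described in the post.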

Because I only have 10 us per control cycle, I want to optimize the code that empties the FIFO buffer and processes the samples. Initially, I was reading samples out of the buffer with this code:

// Read FIFO_SIZE words from the RX FIFO
for(i = 0; i < FIFO_SIZE; i++)
{
    rdata[i] = SpiaRegs.SPIRXBUF;
}

where FIFO_SIZE = 12. This executed in 412 CPU cycles, which takes 2.74 us at C28_FREQ = 150 MHz. But the ADC sends a new 9-bit word every 1/Fs = 1.6666 us, so I decided to manually unroll this loop. For the example above, I wrote:

rdata[0] = SpiaRegs.SPIRXBUF;
rdata[1] = SpiaRegs.SPIRXBUF;
rdata[2] = SpiaRegs.SPIRXBUF;
rdata[3] = SpiaRegs.SPIRXBUF;
rdata[4] = SpiaRegs.SPIRXBUF;
rdata[5] = SpiaRegs.SPIRXBUF;
rdata[6] = SpiaRegs.SPIRXBUF;
rdata[7] = SpiaRegs.SPIRXBUF;
rdata[8] = SpiaRegs.SPIRXBUF;
rdata[9] = SpiaRegs.SPIRXBUF;
rdata[10] = SpiaRegs.SPIRXBUF;
rdata[11] = SpiaRegs.SPIRXBUF;

This executes in 75 CPU cycles, or 500 ns, which is far better. My problem is that I want to do this parametrically, i.e., I want the loop to be unrolled automatically, independently of Fs and Fcontrol. In other words, I want the loop unrolled as a function of FIFO_SIZE at compile time, rather than writing it out manually for each case.
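One compiler-independent way to get this effect is to let the preprocessor generate the repeated statements. The sketch below is not from the original thread and is only one possible approach: the `REPEAT_n` macros, `READ_WORD`, and `drain_fifo` are hypothetical names, the register is passed in as a pointer so the code also builds off-target (on the device it would be the address of `SpiaRegs.SPIRXBUF`), and the chain stops at 16 entries as an arbitrary limit. It requires `FIFO_SIZE` to be a plain integer literal, not an expression.

```c
#include <assert.h>
#include <stdint.h>

#define FIFO_SIZE 12   /* must be a plain integer literal for the token pasting below */

/* REPEAT_n(S) expands to S(0) S(1) ... S(n-1); extend the chain for larger n. */
#define REPEAT_1(S)  S(0)
#define REPEAT_2(S)  REPEAT_1(S)  S(1)
#define REPEAT_3(S)  REPEAT_2(S)  S(2)
#define REPEAT_4(S)  REPEAT_3(S)  S(3)
#define REPEAT_5(S)  REPEAT_4(S)  S(4)
#define REPEAT_6(S)  REPEAT_5(S)  S(5)
#define REPEAT_7(S)  REPEAT_6(S)  S(6)
#define REPEAT_8(S)  REPEAT_7(S)  S(7)
#define REPEAT_9(S)  REPEAT_8(S)  S(8)
#define REPEAT_10(S) REPEAT_9(S)  S(9)
#define REPEAT_11(S) REPEAT_10(S) S(10)
#define REPEAT_12(S) REPEAT_11(S) S(11)
#define REPEAT_13(S) REPEAT_12(S) S(12)
#define REPEAT_14(S) REPEAT_13(S) S(13)
#define REPEAT_15(S) REPEAT_14(S) S(14)
#define REPEAT_16(S) REPEAT_15(S) S(15)

/* Two-level expansion so FIFO_SIZE is replaced by 12 before ## is applied. */
#define REPEAT_PASTE(n, S) REPEAT_##n(S)
#define REPEAT(n, S)       REPEAT_PASTE(n, S)

/* rx_reg is a hypothetical stand-in for &SpiaRegs.SPIRXBUF. */
static void drain_fifo(volatile const uint16_t *rx_reg, uint16_t *dst)
{
    assert(rx_reg != 0 && dst != 0);
#define READ_WORD(k) dst[(k)] = *rx_reg;
    /* Expands to FIFO_SIZE straight-line reads: dst[0] = *rx_reg; ... */
    REPEAT(FIFO_SIZE, READ_WORD)
#undef READ_WORD
}
```

Changing `FIFO_SIZE` to another literal between 1 and 16 regenerates the straight-line reads without editing the function body.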

After some research, I got to know the MUST_ITERATE and UNROLL pragmas, and I expected to get some results with the following code:

#pragma MUST_ITERATE(2,FIFO_SIZE,2);
#pragma UNROLL(FIFO_SIZE);
// Read FIFO_SIZE words from the RX FIFO
for(i = 0; i < FIFO_SIZE; i++)
{
    rdata[i] = SpiaRegs.SPIRXBUF;
}

But it didn't work as expected. With FIFO_SIZE = 12, the compiled code was:

240 for(i = 0; i < FIFO_SIZE; i++)

0x00801E: 5207 CMPB AL, #0x7

242 rdata[i] = SpiaRegs.SPIRXBUF;

C$DW$L$_spiRxFifoIsr$2$B, C$L2:

0x00801F: ED05 SBF 5, NEQ
0x008020: 761F0280 MOVW DP, #0x280
0x008022: 5B2B MOVZ AR3, @0x2b
0x008023: 6F04 SB 4, UNC
0x008024: 761F0280 MOVW DP, #0x280
0x008026: 5B2A MOVZ AR3, @0x2a
0x008027: 9A01 MOVB AL, #0x1
0x008028: 94A3 ADD AL, @AR3

240 for(i = 0; i < FIFO_SIZE; i++)

0x008029: 560301A9 MOV ACC, @AL << 1
0x00802B: 0EA9 MOVU ACC, @AL
0x00802C: 1E42 MOVL *-SP[2], ACC

I tried different values of FIFO_SIZE and of every argument of both pragmas, and also tried all optimization settings (the code above was generated with -O1). The resulting code is always nearly the same. The code above executes in 1.5333 us, which isn't satisfactory.
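For reference, the cycle counts quoted in this post convert to time as cycles / C28_FREQ. The helper below is a small sketch of that arithmetic; `cycles_to_ns` is a hypothetical name, and the result is truncated to whole nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a CPU cycle count to whole nanoseconds (truncating) for a given clock. */
static uint64_t cycles_to_ns(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz != 0u);
    return ((uint64_t)cycles * 1000000000ULL) / cpu_hz;
}
```

At 150 MHz, 412 cycles gives 2746 ns and 75 cycles gives 500 ns, while the 10 us control period corresponds to a budget of 1500 cycles.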

I don't have much experience with assembly code, but it seems to me that the unrolling simply wasn't done: it looks like typical loop code. I thought it would be trivial for the optimizer to unroll this loop, since only one statement is being executed and the array index is exactly the loop index.

Is there anything wrong in my code? Am I forgetting any setting? 

After solving this, I also want to be able to unroll the following code, which puts two 9-bit words together to form the 18-bit samples sent by the ADC:

for(i = 0; i < FIFO_SIZE; i++)
{
    valorADCH = (0x000001FF & (Uint32)rdata[i]) << 9;
    i++;
    valorADCL = 0x000001FF & rdata[i];
    valorI[rdata_point] = valorADCH + valorADCL;
    rdata_point++;
}
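The same word-pairing step can be written with a single index per sample, so the loop counter is not modified inside the body; this may make the trip count easier for a compiler to reason about, though that is not verified here. The sketch below uses standard `stdint.h` types instead of `Uint32`, and `combine_words` and `WORD_MASK` are hypothetical names. It uses `|` where the original uses `+`; the two are equivalent because the masked halves do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_MASK 0x01FFu  /* keep the low 9 bits of each SPI word */

/* Pair up 9-bit words (high half first) into 18-bit samples.
 * n_words must be even; returns the number of samples written. */
static size_t combine_words(const uint16_t *words, size_t n_words, uint32_t *samples)
{
    size_t s;
    assert(words != NULL && samples != NULL);
    assert(n_words % 2u == 0u);
    for (s = 0; s < n_words / 2u; s++) {
        /* Widen to 32 bits before shifting: int may be only 16 bits wide. */
        uint32_t high = (uint32_t)(words[2u * s] & WORD_MASK) << 9;
        uint32_t low  = (uint32_t)(words[2u * s + 1u] & WORD_MASK);
        samples[s] = high | low;
    }
    return s;
}
```

For example, the word pair 0x01FF, 0x01FF combines to the full-scale value 0x3FFFF, and bits above the ninth in either word are discarded by the mask.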

Thanks in advance,

Gabriel

  • Please, could anyone help me with this?

  • This is an old post, but I'm responding to it because it came up in my search on the same question and was helpful. Gabriel's method does work for me now with compiler 6.4.6. Without the pragmas, execution time is 218 cycles; with them, only 59.

    I used:

    #pragma MUST_ITERATE(SIZE)

    #pragma UNROLL(SIZE)

    And the code is shorter. I wonder why the compiler doesn't unroll on its own.

  • An UNROLL pragma overrides the compiler's automatic decision regarding whether and how much to unroll a loop.  While the compiler does a good job in many cases, it is very difficult to make the best decision for all cases all the time.  It is not surprising to find a case where the compiler makes a bad decision.  When that happens, the UNROLL pragma gives you a way out.

    Note that MUST_ITERATE(SIZE) says the loop always iterates at least SIZE times.  If that is ever false, your code is likely to silently do the wrong thing.  Be cautious.

    Thanks and regards,

    -George
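Putting the replies above together, a minimal sketch of the pragma-based version might look like the following. It assumes `FIFO_SIZE` is a compile-time constant, so the loop always runs exactly that many times and the promise made by MUST_ITERATE cannot be broken at run time. The pragmas are written without trailing semicolons, as in the reply that reported success, and are guarded with `__TI_COMPILER_VERSION__` so other compilers skip them. `drain_fifo` and `rx_reg` are hypothetical names; on the device `rx_reg` would be the address of `SpiaRegs.SPIRXBUF`. Whether the loop is actually unrolled still depends on the compiler version and optimization level.

```c
#include <assert.h>
#include <stdint.h>

#define FIFO_SIZE 12  /* compile-time constant trip count */

/* rx_reg is a hypothetical stand-in for &SpiaRegs.SPIRXBUF. */
static void drain_fifo(volatile const uint16_t *rx_reg, uint16_t *dst)
{
    uint16_t i;
    assert(rx_reg != 0 && dst != 0);
    /* The pragmas must directly precede the loop they describe. */
#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(FIFO_SIZE, FIFO_SIZE)  /* min == max: exactly FIFO_SIZE trips */
#pragma UNROLL(FIFO_SIZE)                   /* request full unrolling */
#endif
    for (i = 0; i < FIFO_SIZE; i++) {
        dst[i] = *rx_reg;  /* each read takes the next word from the RX FIFO */
    }
}
```

If the trip count were instead a run-time variable, the MUST_ITERATE bounds would have to be guaranteed by the caller, which is the situation the caution above warns about.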