Hercules RM42 for-loop slow?

Falk Sobe

Hi Support,

I have run into a problem with a simple for-loop that copies data from the MibSPI RX RAM to main RAM.
I transfer 64 bytes and the copy takes 19 µs. The Hercules runs at 100 MHz.
The loop looks like this:

for( i = headerSize; i < lenTransfer; i++ )
{
	*(pData++) = mibspiRAM1->rx[i].data;
}

In disassembly:

   636:                 for( i = headerSize; i < lenTransfer; i++ ) 
   637:                 { 
0x00007A28 E1A01005  MOV      r1,r5
0x00007A2C EA000004  B        0x00007A44
   638:                         *(pData++) = mibspiRAM1->rx[i].data; 
   639:                 } 
   640:                  
0x00007A30 E59F7424  LDR      r7,[pc,#1060]    ; @0x00007E5C
0x00007A34 E0877101  ADD      r7,r7,r1,LSL #2
0x00007A38 E1D770B0  LDRH     r7,[r7,#0]
0x00007A3C E4C67001  STRB     r7,[r6],#1
0x00007A40 E2811001  ADD      r1,r1,#1
0x00007A44 E1510002  CMP      r1,r2
0x00007A48 3AFFFFF8  BCC      0x00007A30

Why do these few instructions need 19 µs?

Best regards

Falk

over 11 years ago

0 Christian Herget over 11 years ago

TI__Expert 6985 points

Hi Falk,

19 µs sounds like a reasonable amount of time for this loop. You have 7 instructions in the loop, so even if every instruction took only one cycle you would end up at 7 * 64 / 100 MHz ≈ 4.5 µs. Other factors slow this down further, such as the load and store instructions and the branch. In particular, the LDRH from the SPI memory consumes quite a few cycles and blocks the STRB from executing, because the STRB depends on the result of the preceding LDRH. The BCC could also be mispredicted, which causes the processor pipeline to be flushed.
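The estimate above can be reproduced with a few lines of C. This is only a back-of-the-envelope sketch: the helper names are made up for illustration, and it assumes 64 loop iterations, as in the estimate.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: time in microseconds for `cycles` CPU cycles at `clock_hz`. */
static double cycles_to_us(uint32_t cycles, uint32_t clock_hz)
{
    assert(clock_hz > 0u);
    return (double)cycles * 1e6 / (double)clock_hz;
}

/* Hypothetical helper: average CPU cycles spent per loop iteration,
 * given a measured run time in microseconds. */
static double cycles_per_iteration(double measured_us, uint32_t clock_hz,
                                   uint32_t iterations)
{
    assert(iterations > 0u);
    return measured_us * 1e-6 * (double)clock_hz / (double)iterations;
}
```

With these, cycles_to_us(7 * 64, 100000000) gives the 4.48 µs lower bound, while cycles_per_iteration(19.0, 100000000, 64) comes out at roughly 30 cycles per iteration, so most of the measured time is spent waiting rather than executing instructions.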

I would try to speed up this loop with the help of #pragma MUST_ITERATE ( min, max, multiple ).
The multiple argument in particular could help.

#pragma MUST_ITERATE (0, 64, 8)
for( i = headerSize; i < lenTransfer; i++ )
{
    *(pData++) = mibspiRAM1->rx[i].data;
}

Please refer to SPNU151J Section 5.10.16.1 for more info on the pragma.

Best Regards,
Christian

0 Zhaohong Zhang over 11 years ago in reply to Christian Herget

TI__Mastermind 22715 points

To be more specific, it takes 12 VCLK cycles for an LDR instruction to return when reading from peripheral space. This is the performance bottleneck on this device. You may consider using a load-multiple instruction (this needs to be done in assembly code) or DMA (if the operation is repetitive and you do not need to reconfigure the DMA registers) to free up CPU time.
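To put a number on that bottleneck, here is a rough cost model in C. It is a sketch only: the function name is invented, the 12-cycle figure is the one quoted above, and the VCLK frequency is left as a parameter because it is not stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles per peripheral read, as quoted in the post above. */
#define PERIPH_READ_VCLK_CYCLES 12u

/* Hypothetical helper: time in microseconds spent purely on `reads`
 * single loads from peripheral space at a VCLK of `vclk_hz`.
 * vclk_hz is an assumption supplied by the caller. */
static double peripheral_read_us(uint32_t reads, uint32_t vclk_hz)
{
    assert(vclk_hz > 0u);
    return (double)reads * (double)PERIPH_READ_VCLK_CYCLES * 1e6
           / (double)vclk_hz;
}
```

For 64 reads this gives 7.68 µs if VCLK were 100 MHz and 15.36 µs if VCLK were 50 MHz. Either way the peripheral reads alone account for a large share of the measured 19 µs.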

Thanks and regards,

Zhaohong

0 Falk Sobe over 11 years ago in reply to Zhaohong Zhang

Intellectual 335 points

Thanks for the answers!

@ Christian:

The transfer size can change every time. So does it make sense to use MUST_ITERATE?

@ Zhaohong

I think the RM42 has no DMA? Is there a way to speed this up, e.g. by forcing an AXI burst transfer (in assembler)?

Best regards

Falk

0 Christian Herget over 11 years ago in reply to Falk Sobe

TI__Expert 6985 points

Hi Falk,

If you can guarantee that the transfer size is always a multiple of X > 1, then MUST_ITERATE makes sense. If not, then you are right, it doesn't make any sense.
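To illustrate why the guarantee matters, here is a hand-unrolled copy that is only valid when the count is a multiple of 4. It is a sketch of the kind of transformation such a guarantee permits, not the code the TI compiler actually emits. The type rx_entry_t and the function name are stand-ins for illustration; the real buffer layout comes from the device header.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for one MibSPI RX buffer entry (assumed layout, for illustration). */
typedef struct {
    uint16_t flags;
    uint16_t data;
} rx_entry_t;

/* Copies the low byte of `count` RX entries to dst.
 * The caller must guarantee that count is a multiple of 4. */
static void copy_rx_multiple_of_4(const volatile rx_entry_t *rx,
                                  uint8_t *dst, uint32_t count)
{
    uint32_t i;

    assert((count % 4u) == 0u);

    for (i = 0u; i < count; i += 4u) {
        /* Do the four slow peripheral reads first ... */
        uint16_t d0 = rx[i].data;
        uint16_t d1 = rx[i + 1u].data;
        uint16_t d2 = rx[i + 2u].data;
        uint16_t d3 = rx[i + 3u].data;

        /* ... then the four stores, with one loop branch per four bytes. */
        dst[i]      = (uint8_t)d0;
        dst[i + 1u] = (uint8_t)d1;
        dst[i + 2u] = (uint8_t)d2;
        dst[i + 3u] = (uint8_t)d3;
    }
}
```

Without the guarantee, the loop would need a remainder path for the last 1 to 3 bytes. Whether the unrolled form is measurably faster on the RM42 would have to be checked on the hardware.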

Hand-optimized assembly might speed things up a little, but don't expect too much of an improvement.

You can try the following code and compile it with -O2:

void SPI_COPY(mibspiRAM_t *mibspi, uint8_t *pData, uint32_t headerSize, uint32_t lenTransfer)
{
    uint32_t i;

    for( i = headerSize; i < lenTransfer; i++ )
    {
        *(pData++) = mibspi->rx[i].data;
    }
    
    return;
}

In my case this produced the following assembly:

    .text
    .arm
    
    .def     SPI_COPY
    .asmfunc 
    
SPI_COPY:
        CMP       A4, A3                ; compare lenTransfer with headerSize
        ADD       A1, A1, A3, LSL #2    ; mibspi = mibspi + (headerSize * 4);
        BXLS      LR                    ; if(lenTransfer <= headerSize) return;

        SUB       V9, A4, A3            ; temp1 = lenTransfer - headerSize;
        ADD       A1, A1, #512          ; mibspi += 512 bytes (offset of rx[]);
copy_loop:
        LDRH      A3, [A1], #4          ; temp2 = *mibspi; mibspi += 4 bytes;
        SUBS      V9, V9, #1            ; temp1--;
        STRB      A3, [A2], #1          ; *pData = temp2; pData += 1 byte;

        BNE       copy_loop             ; if(temp1 != 0) goto copy_loop;

        BX        LR
    .endasmfunc

The above appears to be highly optimized already.

To save a few more cycles you could also mark the C function as __inline, if it wasn't already inlined automatically by the compiler.

Best Regards
Christian
