CPU clocks per operation

Other Parts Discussed in Thread: TMS570LS3137

Hello,

I'm currently working with a TMS570 MCU and there is one point I can't understand.

I've written a small function that delays execution for a given number of CPU clocks:

void Wait(uint32_t lVal)
{
	asm(" LSR r0, r0, #1");

	asm("LoopLabel:");
	asm(" SUBS r0, r0, #1");
	asm(" BNE LoopLabel");
}

I expected each loop iteration to take 2 CPU cycles, based on the Cortex™-R5 and Cortex-R5F Technical Reference Manual:

1 cycle for 'SUBS' and 1 for 'BNE'.

So the shift is used to divide the input cycle count by 2.

Unfortunately, the loop takes 4 cycles per iteration. I checked this with the debugger, measuring the clocks between the start and the end of the loop.

Additionally, I checked it by toggling a GPIO while calling the function with an input value of 0xFFFFFFFF.

With a CPU frequency of 300 MHz it should take about 0xFFFFFFFF/300000000 = 14 sec, but when the program runs it takes about 28 seconds.
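
The estimate above can be reproduced on a host PC with a few lines of C. This is only a minimal sketch, not target code: the helper name `expected_seconds` is hypothetical, and the clock frequency and cycle counts are the figures quoted in this post.

```c
#include <stdint.h>

/* Hypothetical helper: estimated run time of Wait() in seconds.
 * Wait() shifts its argument right by one bit, so the loop body
 * executes (lVal >> 1) times. The wrap-around case lVal < 2
 * (counter starts at 0) is not modeled here. */
double expected_seconds(uint32_t lVal, unsigned cycles_per_iteration,
                        double cpu_hz)
{
    uint64_t iterations = (uint64_t)(lVal >> 1);
    uint64_t cycles = iterations * cycles_per_iteration;
    return (double)cycles / cpu_hz;
}
```

With `lVal = 0xFFFFFFFF` and `cpu_hz = 300e6`, 2 cycles per iteration gives roughly 14.3 s and 4 cycles per iteration gives roughly 28.6 s, which matches the expected and the observed times.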

So why does this loop take 4 cycles?

Thanks a lot in advance for your help.

  • I have confirmed your data using PMU. On average, those two instructions take 4 CPU cycles. I am not too sure about the cycles for the BNE instruction even with correct branch prediction. I will need to do some research.

    Thanks and regards,

    Zhaohong
  • Dear Zhaohong,

    thank you for your answer.
    I've checked it with different input values for the counter, and the loop always takes about double the expected time...

    I will wait for your answer if you find any information.

    Thanks a lot in advance for your help!
  • I added two instructions to your original function as follows.
    void Wait(uint32_t lVal)
    {
    asm(" LSR r0, r0, #1");
    asm("LoopLabel:");
    asm(" ADD r0, r0, #1");
    asm(" SUB r0, r0, #1");
    asm(" SUBS r0, r0, #1");
    asm(" BNE LoopLabel");
    }

    I found that these four instructions take about 6 cycles on average.

    asm(" ADD r0, r0, #1");
    asm(" SUB r0, r0, #1");
    asm(" SUBS r0, r0, #1");
    asm(" BNE LoopLabel");

    It seems that the BNE instruction takes about 3 cycles. I will experiment with the Cortex-R5 branch prediction settings before calling this function to see if there is any difference.

    Thanks and regards,

    Zhaohong
  • Hello Zhaohong,

    I'm glad to see your answer.

    It is strange that BNE takes 3 cycles. According to the manual it should take 1 or 8.

    I'd like to ask: if it takes 3 cycles with my configuration and 3 cycles with yours, is it guaranteed to take about 3 cycles regardless of other configuration, startup files, or anything else?

    And of course I will be happy if you find out why it takes 3 cycles.

    Thanks a lot in advance for your help!

  • In the following link, you can find the details about the Cortex R4 execution pipeline. Cortex-R5 has the same pipeline.

    You can see that the first two stages (fe1 and fe2) are flushed when the branch is taken at the pd stage (correctly predicted). Since the 2 cycles for fe1 and fe2 are lost, the branch actually takes 3 cycles in the pipeline. I also confirmed the same behavior on the TMS570LS3137 by running the test from on-chip RAM.

    Thanks and regards,

    Zhaohong
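
The measurements in this thread fit a very simple cost model, sketched below in host-side C. The model is an assumption for illustration, not something taken from the reference manual: each ALU instruction is charged 1 cycle, and a correctly predicted taken branch is charged 3 cycles (1 cycle plus the 2 flushed fetch stages). The names `loop_cycles`, `ALU_CYCLES`, and `TAKEN_BRANCH_CYCLES` are hypothetical.

```c
/* Simplified per-iteration cost model (illustrative assumption):
 * 1 cycle per ALU instruction, 3 cycles for a correctly predicted
 * taken branch (1 cycle + fe1 and fe2 flushed). Dual issue, the final
 * not-taken branch, and memory wait states are ignored. */
enum { ALU_CYCLES = 1, TAKEN_BRANCH_CYCLES = 3 };

unsigned loop_cycles(unsigned alu_instructions)
{
    return alu_instructions * ALU_CYCLES + TAKEN_BRANCH_CYCLES;
}
```

Under these assumptions, `loop_cycles(1)` gives 4 for the original SUBS/BNE loop and `loop_cycles(3)` gives 6 for the ADD/SUB/SUBS/BNE variant, in line with the averages reported above.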

  • Hello Zhaohong,

    thanks a lot for your help.

    Now it is clear where the two clocks were lost.

    It's a pity that the lower bound for the B instruction is stated incorrectly in the manual (for both R4F and R5F).

    The upper bound (8 and 9 clocks) is correct, taking into account the pipeline diagram you sent me.

  • Dear Zhaohong,

    unfortunately I have to bring this topic up again.
    It was necessary to move some of the functions to RAM.
    I have also moved the test function with the loop there.

    It was moved to RAM (together with the flash API) by:

        flash_handler :
        {
            main.obj (.text)
            Flash.obj (.text, .const)
            --library = F021_API_CortexR4_BE_L2FMC.lib (.text)
        } load = FLASH_HANDLER, run = RAM,
          LOAD_START(_lFlashHandler_Load_g),
          RUN_START(_lFlashHandler_Run_g),
          SIZE(_lFlashHandler_Size_g)

    Now I'm trying to measure the clocks again and I get strange results.

    I use a loop counter value (R0) of 10000 and expect to get ~40000 CPU clocks (~4 clocks per loop).

    But when it is executed from RAM, the result is 180039. I ran the same test from flash and it takes 40022.

    I can't understand why. Do you have any idea?

    Thanks a lot in advance for your help!

  • When we measured 4 cycles per loop, the code was already in cache. The result should not change when the code is moved, as long as the new location is cacheable. Would you please check the MPU settings for the RAM and see whether it is cacheable?

    Thanks and regards,

    Zhaohong
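
To put the two measurements side by side, the totals can be converted to average cycles per iteration. The helper below is a hypothetical host-side sketch; the only inputs are the numbers quoted in the question (180039 and 40022 cycles for 10000 iterations).

```c
#include <stdint.h>

/* Hypothetical helper: average cycles per loop iteration, rounded to
 * the nearest integer. Precondition: iterations must be non-zero. */
uint32_t cycles_per_iteration(uint32_t total_cycles, uint32_t iterations)
{
    return (total_cycles + iterations / 2u) / iterations;
}
```

This gives about 18 cycles per iteration for the RAM run and about 4 for the flash run. A difference that large on an identical instruction sequence suggests instruction fetch latency rather than the pipeline itself, which is consistent with the cacheability question above.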

  • Hello Zhaohong,

    yes, you are absolutely right. The code was not cached because of a wrong MPU configuration.
    Now everything is OK.

    Thank you and have a nice day!