AM2634: Code runs more than 2x slower the first time it executes

Nanda Marwali

Part Number: AM2634

We were benchmarking our code on a TI AM2634 control card when we found that the first time the code executes, it runs much slower (it takes more than 2x the CPU cycles). To illustrate the problem, we time 64 NOP instructions with the CPU cycle counter, as shown below. The code sits in a 1 ms timer ISR (the only ISR in the system), so it runs periodically, and all code runs from RAM on one of the AM2634 cores. The first time the ISR executes, CycleNum = 150. The second time, CycleNum drops to a lower number. Only from the third or fourth execution onward does CycleNum equal 64 and stay constant at 64.

What is the reason for this behavior? Is it expected? Is it related to CPU instruction caching, or did we miss something in our configuration of the CPU core? Please explain. This is a big problem for us because our control loop must fit within a very tight execution-time budget, and code that runs more than 2x slower the first few times is not acceptable.

Below is the code we use to produce the issue:

uint32_t CycleNum;
uint32_t t1, t2, t3;

void isr_1ms_task

{

CycleCounterP_reset();

t1 = CycleCounterP_getCount32();

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

__asm(" NOP");

t2 = CycleCounterP_getCount32();

t3 = CycleCounterP_getCount32();

CycleNum = t2 - t1 - (t3 - t2) + 1;

}
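
The correction in the last line of the ISR can be read as follows: t2 - t1 covers the 64 NOPs plus one counter read, and t3 - t2 is the cost of one back-to-back counter read, which is subtracted out. A minimal sketch of the same arithmetic as a standalone helper is shown here. The name net_cycles is hypothetical (it is not an SDK function), and the sample readings in the note after it are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the SDK): same arithmetic as the ISR.
 * t1: counter value before the code under test
 * t2: counter value right after the code under test
 * t3: counter value read again immediately after t2
 */
static uint32_t net_cycles(uint32_t t1, uint32_t t2, uint32_t t3)
{
    uint32_t measured = t2 - t1;      /* code under test + one counter read */
    uint32_t read_overhead = t3 - t2; /* cost of one back-to-back read */

    /* The +1 is kept from the original formula. */
    return measured - read_overhead + 1u;
}
```

For example, with illustrative readings t1 = 100, t2 = 170, t3 = 177, this gives 70 - 7 + 1 = 64. Because the arithmetic is unsigned 32-bit, the differences stay correct even if the counter wraps between reads.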

over 2 years ago

0 Nanda Marwali over 2 years ago

Prodigy 10 points

By the way, to confirm the result above, we also used the RTI timer to measure the elapsed time across the 64 NOP instructions, and the two methods (CPU cycle counter and RTI timer) show the same elapsed time.

0 Ming Wei over 2 years ago in reply to Nanda Marwali

TI__Guru 55385 points

Hi Nanda,

If you put all the code, data, and stacks into TCM, you will get 64 cycles every time. I ran the code above in a loop 100 times with everything in TCM, and it returned the output below. If I put everything in OCRAM instead, I get 145 cycles, because OCRAM is much slower than TCM. Even the 145 cycles for 64 NOPs is a result of caching.

[Cortex_R5_0] CycleNum=64
CycleNum=64
CycleNum=64
CycleNum=64
CycleNum=64
CycleNum=64
CycleNum=64
CycleNum=64
CycleNum=64
CycleNum=64
CycleNum=64
CycleNum=64
CycleNum=64
CycleNum=64

...

Please see my attached linker.cmd file for the empty project.

Best regards,

Ming

https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/908/0777.linker.cmd
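
One way to apply this advice from C source is to tag the time-critical function and its data with section attributes, then map those sections into TCM in the linker command file. This is only a sketch: the section names .tcm_code and .tcm_data, the PLACE_IN_TCM switch, and the function and variable names are assumptions for illustration and are not taken from the attached linker.cmd.

```c
#include <stdint.h>

/* Hypothetical section names: they must match output sections that the
 * linker command file maps into a TCM memory region. Without PLACE_IN_TCM
 * defined, the macros expand to nothing and default placement is used. */
#if defined(PLACE_IN_TCM)
#define TCM_CODE __attribute__((section(".tcm_code")))
#define TCM_DATA __attribute__((section(".tcm_data")))
#else
#define TCM_CODE
#define TCM_DATA
#endif

/* State used by the time-critical routine (hypothetical). */
TCM_DATA static volatile uint32_t control_step_count;

/* Time-critical routine (hypothetical body). */
TCM_CODE void control_step(void)
{
    control_step_count++;
}

/* Accessor so other code can read the counter. */
uint32_t control_steps_done(void)
{
    return control_step_count;
}
```

The attributes only name the sections. The linker command file still has to place them in TCM, and the stack has to be located there separately.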

0 Nanda Marwali over 2 years ago

Prodigy 10 points

Hmm, so we need to use the TCM to get consistent execution speed? But there is only around 64 KB of TCM, right? What I don't understand is why the execution time changes on the second and third runs before finally settling at 64 cycles. Here is what we got:

First run: CycleNum = 150

Second run: CycleNum = 98

Third run: CycleNum = 64

Fourth run and onward: CycleNum = 64

Is this the result of caching?

You said that the 145 you got is the result of caching. But then why does it gradually decrease until it gets to 64 cycles in our case?

Nanda
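
To capture this settling behavior without stopping in the debugger on every interrupt, the ISR could store each measurement in a small buffer. A minimal sketch, assuming a hypothetical record_cycles helper that is called at the end of the ISR (outside the timed window) with the computed CycleNum:

```c
#include <stdint.h>

#define MAX_SAMPLES 8u

/* Hypothetical log of the first MAX_SAMPLES measurements. */
static uint32_t cycle_log[MAX_SAMPLES];
static uint32_t cycle_log_len;

/* Store one measurement; further samples are dropped once the log is full. */
static void record_cycles(uint32_t cycles)
{
    if (cycle_log_len < MAX_SAMPLES) {
        cycle_log[cycle_log_len++] = cycles;
    }
}
```

After a few interrupts, cycle_log can be inspected once to see how many runs it takes for the count to settle.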

0 Ming Wei over 2 years ago in reply to Nanda Marwali

TI__Guru 55385 points

Hi Nanda,

It is due to caching: a read from OCRAM without caching takes 9 cycles. You will need to plan the usage of the TCM carefully so that the mission-critical code, data, and even the stack are in TCM.

Best regards,

Ming
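
If some time-critical code has to stay in cached OCRAM, one possible mitigation (not suggested in this thread, so treat it as an assumption) is to run the routine a few times before the timer interrupt is enabled, so that the first timed run is more likely to find its instructions in the cache. The sketch below uses hypothetical names throughout. It only helps if the routine is safe to call early and if nothing evicts those cache lines afterward.

```c
#include <stdint.h>

typedef void (*task_fn)(void);

/* Stand-in for the time-critical routine (hypothetical). */
static uint32_t demo_calls;

static void demo_task(void)
{
    demo_calls++;
}

/* Call the routine a fixed number of times before interrupts are enabled,
 * so its instructions are fetched at least once ahead of the timed runs. */
static void warm_up(task_fn task, uint32_t passes)
{
    for (uint32_t i = 0; i < passes; i++) {
        task();
    }
}
```

A pass count of 3 would mirror the runs reported above, where the count settled by the third execution. Placing the code in TCM, as recommended, remains the way to get the same cycle count on every run.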
