Clock cycles per instruction

Pablo Cottens

Other Parts Discussed in Thread: RM48L952, RM57L843, HALCOGEN

Hello,

I've been using the Profile Clock in CCS to measure performance for my RM48L952.

The thing is that I don't think it's working properly. I measured the execution time of a single instruction (a simple mov) and got 16 cycles @160MHz. I ran the same mov on an RM42 LaunchPad and it took 12 cycles @50MHz.

Correct me if I'm wrong, but shouldn't that instruction take just 1 cycle to execute? I know the Cortex-R architecture has a pipeline of 7 or 8 stages (can't remember now, I may be confusing it with Cortex-A). Is it possible that the pipeline is flushed at every breakpoint and that's why it's taking so long? Is the lockstep responsible for such a delay?

I'm unable to understand this.

over 10 years ago

0 Pablo Cottens over 10 years ago

Genius 3020 points

Another theory that could explain this behaviour.

Could it be because of the wait states for memory access? @160MHz, if I'm not mistaken, it was necessary to program 3 or 4 wait states.

If so, could executing code from TCRAM be faster? In the initialization process, could I add some code to load the flash content into TCRAM and then transfer control to it? I think that for that I'd have to compile the code relative to the PC. How can I do that in CCS (probably a question for another forum)?

0 Anthony F. Seely over 10 years ago in reply to Pablo Cottens

TI__Guru 68290 points

Pablo,

I don't know how you are trying to measure the single move, but I doubt the result is very accurate.

And like you said there is a pipeline effect - so you really need to measure larger segments of code in order to get a reasonable result that shows the actual performance.
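To make the point about larger segments concrete, here is a minimal sketch of the arithmetic. The function name and the sample numbers are hypothetical; it only assumes that you can read a free-running 32-bit cycle counter before and after the code and that you have separately measured the fixed cost of taking a measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Average cycles per repetition of a code segment.
 * start, end : raw 32-bit cycle-counter readings taken around the segment
 * overhead   : fixed cost of the measurement itself, in cycles
 *              (obtain it by timing an empty segment)
 * reps       : how many times the segment ran between the two readings
 * Unsigned subtraction keeps the result valid across one counter wrap. */
double avg_cycles_per_rep(uint32_t start, uint32_t end,
                          uint32_t overhead, uint32_t reps)
{
    assert(reps > 0);
    uint32_t elapsed = end - start;        /* modulo 2^32, wrap-safe */
    if (elapsed < overhead) {
        return 0.0;                        /* below the measurement floor */
    }
    return (double)(elapsed - overhead) / (double)reps;
}
```

With a hypothetical fixed overhead of 16 cycles, one repetition of a 1-cycle instruction reads as 17 raw cycles, while 1000 repetitions read as 1016. The fixed cost dominates the first measurement and nearly disappears in the second, which is why a single instruction cannot be timed meaningfully this way.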

If you have the PROTRACE box ($$) there is an option to show 'cycle accurate' trace from the ETM without doing any instrumentation of your code.

Regarding the question about CCS, this can be done. The compiler manual actually explains how, and the runtime lib / linker even work together to make it easy, i.e. the linker creates a table w. the data to transfer and the runtime lib includes a function that interprets the table and does the transfer for you. There is information on this in the assembly language tools manual, literature number SPNU118. Look for section 8.8 Linker-Generated Copy Tables.
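The idea behind a copy table can be sketched in a few lines. This is a simplified, hypothetical stand-in written so that it runs on any host; the type and function names are made up for illustration and are not the record layout or routine that the TI linker and runtime library actually use (see SPNU118 for those).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified copy-table record: where a section is
 * stored (load address) and where it must execute (run address). */
typedef struct {
    const void *load_addr;   /* e.g. a location in flash */
    void       *run_addr;    /* e.g. a location in RAM   */
    size_t      size;        /* number of bytes to move  */
} copy_record_t;

typedef struct {
    size_t               num_recs;
    const copy_record_t *recs;
} copy_table_t;

/* Walk the table and copy every section from its load address to its
 * run address. Returns the total number of bytes copied. */
size_t copy_sections(const copy_table_t *table)
{
    assert(table != NULL);
    size_t total = 0;
    for (size_t i = 0; i < table->num_recs; i++) {
        const copy_record_t *r = &table->recs[i];
        memcpy(r->run_addr, r->load_addr, r->size);
        total += r->size;
    }
    return total;
}
```

In the real toolchain you would not write the table by hand: the linker emits it from the load/run placement in the linker command file, and startup code calls the library routine before jumping to the relocated code.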

One note: you will get the best cycle performance if you run the part at 0 WS from flash, which I think means < 45 or 50MHz (would need to check the datasheet). As you add wait states you'll lose performance running from flash. The flash is wide and does read more than one 32-bit word per cycle, but there is a penalty getting started, if you will, so it comes out to > 1 cycle / word.

You actually will not get full performance of the CPU by running code from RAM as this will create contention between program fetches and data accesses on the TCMRAM bus. If you live w. the wait state penalty on the flash at higher speeds you won't have this contention because the flash uses TCM bus "A" and the RAM "B0" & "B1" and accesses to these different ports can go on in parallel. So moving code to RAM may not actually improve performance - depends on what the code does and whether it performs lots of ram accesses. Maybe if it's some sort of recursive calculation that uses only registers you'd get a big speedup...

Ok so if you want the best performance in the Hercules family you should check out the RM57L843. This part runs at a much higher clock frequency and has program & data caches as well as a very wide flash (I think physically it's 256 wide).

0 FinleyQ over 8 years ago in reply to Anthony F. Seely

Prodigy 80 points

Dear Anthony,

I use a Launch XL2 board; the MCU is an RM57L843. I set GCLK = 300MHz, HCLK = 150MHz and the other clocks to 75MHz with HALCoGen. I measured the execution time of a simple piece of code { i = 1000; while( i-- ); } with an oscilloscope. The result is 200us; when I change the value of i to 500, the result is 100us. But I think it should take much less than 200us. I have not found a valid method to improve efficiency. Can you provide some suggestions for solving this problem?
Maybe the instruction cycle is too long? Do you have any documents that describe the instruction cycle (or other related aspects) for this problem?

Thanks,
Best Regards.

0 Anthony F. Seely over 8 years ago in reply to FinleyQ

TI__Guru 68290 points

user4513915,

In this particular case, I would guess that the issue is not having the cache enabled or configured correctly.
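As a quick sanity check on the numbers reported above (200us for 1000 iterations with GCLK = 300MHz), the measurement can be converted into cycles per loop iteration. This is only back-of-the-envelope arithmetic with a hypothetical helper name, and it ignores the IO toggling overhead of the oscilloscope method.

```c
#include <assert.h>

/* Cycles spent per loop iteration, derived from a wall-clock measurement.
 * elapsed_us : measured time in microseconds
 * cpu_mhz    : CPU clock in MHz (microseconds * MHz = cycles)
 * iterations : number of loop iterations executed in that time */
double cycles_per_iteration(double elapsed_us, double cpu_mhz,
                            unsigned long iterations)
{
    assert(iterations > 0);
    return elapsed_us * cpu_mhz / (double)iterations;
}
```

Plugging in the reported values, `cycles_per_iteration(200.0, 300.0, 1000)` gives 60 cycles per iteration, and the 500-iteration run (100us) gives the same figure. That is far more than a decrement-and-branch loop would normally be expected to need, which is consistent with the CPU stalling on instruction fetches rather than with slow instructions.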

Using an oscilloscope for performance measurements is difficult with a 300MHz processor like this, because the act of toggling an IO pin can take many clock cycles at 300MHz. First, the IO system runs slower (150 or 75MHz), and then many cycles are required to perform the write to the IO. So if you try to use an oscilloscope to measure a relatively small interval, the IO overhead can become a significant source of error.

The best idea is to use the PMU.

First, the PMU has minimal overhead. It takes only 6 cycles to copy the PMU register to a working register.
And no pointer is needed for this because the PMU register (unlike the RTI) is in the coprocessor space ... it is not memory mapped.

Second, the PMU can count CPU clock cycles - much better resolution than the RTI, which has dividers.

Third, the PMU has two additional counters that can count events like cache misses, so it can give insight into why your code takes a long time to execute.

HALCoGen provides PMU functions, and there is an app note in the product folder that explains how to use the PMU for measurements.
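For readers who want to see what reading the PMU cycle counter through the coprocessor interface looks like, here is a rough sketch. It is not the HALCoGen API: the function names are hypothetical, the target branch assumes a GCC-compatible bare-metal toolchain running in a privileged mode, and the host branch is only a fake counter so the file compiles and can be exercised off-target. The CP15 encodings are the ARMv7 PMU registers PMCR, PMCNTENSET and PMCCNTR.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__arm__) && defined(__GNUC__) && !defined(__linux__)
/* Target build (assumption: GCC-style inline asm, privileged mode). */
static inline void pmu_write_pmcr(uint32_t v)      /* PMCR */
{ __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(v)); }
static inline void pmu_write_cntenset(uint32_t v)  /* PMCNTENSET */
{ __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(v)); }
static inline uint32_t pmu_read_ccnt(void)         /* PMCCNTR */
{ uint32_t v; __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v)); return v; }
#else
/* Host build: fake counter that advances by one on every read. */
static uint32_t fake_ccnt;
static inline void pmu_write_pmcr(uint32_t v) { if (v & 4u) fake_ccnt = 0u; }
static inline void pmu_write_cntenset(uint32_t v) { (void)v; }
static inline uint32_t pmu_read_ccnt(void) { return ++fake_ccnt; }
#endif

/* Enable the cycle counter and reset it to zero. */
void cycle_counter_start(void)
{
    pmu_write_cntenset(1u << 31);            /* bit 31: cycle counter enable */
    pmu_write_pmcr((1u << 0) | (1u << 2));   /* E: enable, C: reset cycle counter */
}

/* Read the current cycle count; no pointer or memory access is involved. */
uint32_t cycle_counter_read(void)
{
    return pmu_read_ccnt();
}

/* Wrap-safe difference between two readings. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    assert(sizeof(uint32_t) == 4);
    return end - start;
}
```

Typical use would be to call `cycle_counter_start()` once, take a reading before and after the code of interest, and pass both readings to `elapsed_cycles()`. Timing an empty region the same way gives the measurement overhead to subtract.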

0 FinleyQ over 8 years ago in reply to Anthony F. Seely

Prodigy 80 points

Hello Anthony,

Thanks so much for your response.

I had disabled the cache because I am using the SCI DMA function. I ran another test with the cache enabled, and the result seems correct: the running speed improved about 10 times.

But the data transferred by DMA is not right when I use the cache. How can I use the cache and DMA together and still guarantee a good processing speed?

Thanks Again,

Best Regards,

Finley Quan

0 FinleyQ over 8 years ago in reply to Anthony F. Seely

Prodigy 80 points

Hello Anthony,

Thanks for your suggestion.

And I have found the answer about using CACHE and SCI-DMA at the same time at e2e.ti.com/.../666341

Best Regards,
Finley Quan.

0 Anthony F. Seely over 8 years ago in reply to FinleyQ

TI__Guru 68290 points

Great - glad you got this working.
