
TDA4VM: Reading of GTC timer on 1GHz R5F takes 235ns

Part Number: TDA4VM

Hi there,

I used the 200 MHz GTC timer, as enabled in the examples, to measure its own access time using two consecutive reads. I have an inline function that reads the lower 32 bits of the timer like this:

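/* Read the lower 32 bits of the GTC counter value (CNTCV_LO at GTC base + 0x8) */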
static inline uint32_t GetTimer(void)
{
    return (*(volatile uint32_t*)(0xa90000UL + 0x8U));
}

Here is my test code:

...
uint32_t t1 = GetTimer();
uint32_t t2 = GetTimer();
printf("### Info: GetTimer time offset: %u ns\n", (t2 - t1) * 5u );
...

To be sure, I checked the assembly, where we can see the two read accesses with the address already loaded in r4. Next there is the subtraction, the multiplication by 5 (one 200 MHz tick is 5 ns) done as a shift-add (r0 + (r0 << 2) = 5*r0), and the printf call. Simple, clean code without any overhead.

; return (*(volatile uint32_t*)(0xa90000UL + 0x8U));
a22540c8: 00 00 94 e5 ldr r0, [r4]
a22540cc: 00 10 94 e5 ldr r1, [r4]
; printf("### Info: GetTimer time offset: %u ns\n", (t2 - t1) * 5u );
a22540d0: 00 00 41 e0 sub r0, r1, r0
a22540d4: 00 11 80 e0 add r1, r0, r0, lsl #2
a22540d8: 3c 03 9f e5 ldr r0, [pc, #828]
a22540dc: 4f 22 00 eb bl #35132 <printf>

Then I'm consistently getting this output from the Linux IPC logging utility:

### Info: GetTimer time offset: 235 ns

If I consider the 1 GHz R5F clock, I find 235 ns simply too long, especially in comparison to pure embedded ARM cores like the M4/M7, where peripherals are accessed without any significant wait states. Is there an explanation, did I do something wrong, or is the CPU perhaps not running at that speed? The logging message says "### CPU Frequency = 1000000000 Hz" and I verified the 200 MHz GTC clock. I used the shipped vision example and placed my small code into the app_init.c file after the init phase.

Thanks a lot.

Best regards,
Tomas

  • Hi Tomas,

    Can you please share detailed steps, so that we can reproduce this issue on our end?

    Regards,
    Parth

  • Hi Parth,

    here are my steps:

    • Open ti-processor-sdk-rtos-j721e-evm-08_04_00_06\vision_apps\platform\j721e\rtos\common\app_init.c
    • Navigate to line 414 and insert the following code (in text form for copy & paste):
      {
        uint32_t t1 = (*(volatile uint32_t*)(0xa90000UL + 0x8U));
        uint32_t t2 = (*(volatile uint32_t*)(0xa90000UL + 0x8U));
        appLogPrintf("### Info: GetTimer time offset: %u ns\n", (t2 - t1) * 5u );
      }

    • Recompile the code
      make vx_app_rtos_linux_mcu2_0 -s
    • Load the ELF file onto the system

    • Run ../vx_app_arm_remote_log.out to see the IPC log

    • Check the reported value in the IPC log

    The reported value of 240 ns is the timer readout difference multiplied by 5, since the 200 MHz GTC clock gives 5 ns per tick (here, a delta of 48 ticks × 5 ns = 240 ns).

    As an additional test, I inserted 5 NOPs between the timer readouts:

    {
      uint32_t t1 = (*(volatile uint32_t*)(0xa90000UL + 0x8U));
      __asm("nop");
      __asm("nop");
      __asm("nop");
      __asm("nop");
      __asm("nop");
      uint32_t t2 = (*(volatile uint32_t*)(0xa90000UL + 0x8U));
      appLogPrintf("### Info: GetTimer time offset with nop: %u ns\n", (t2 - t1) * 5u );
    }

    And this is what I have got:

    [MCU2_0] 14058.719712 s: ### Info: GetTimer time offset with nop: 280 ns

    Before we had 240 ns; with the additional 5 NOPs we have 280 ns, i.e. the 5 NOPs took 40 ns => 8 ns per NOP. Of course, the speed depends on the I-cache and the prefetch queue, so the exact time is not easy to measure, but still, on a 1 GHz CPU I would expect faster execution, as known from the ARM M4/M7 where such measurements are very precise.

    Thanks
    Regards,
    Tomas

  • Hi Tomas,

    In the first scenario, I'm checking internally regarding the latency of read from the GTC register.
    But as per my understanding, it wouldn't be the same as on the ARM M4/M7, as we have a heterogeneous architecture and there are latencies associated with the read operation.

    Regarding "nop", as the nop operation is not taken from cache, it has some delay, whereas the later ones are cached operation.
    To check this, you could call __asm("nop") once before profiling (i.e. before t1). After this, it will take 1-2 ns for "nop" operation.
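
    For example, a minimal sketch of that warm-up, assuming the same register address and appLogPrintf call as in your snippet above (the leading NOP is the only addition):

    {
      __asm("nop"); /* warm-up: fetches this code block into the I-cache before timing starts */
      uint32_t t1 = (*(volatile uint32_t*)(0xa90000UL + 0x8U));
      __asm("nop");
      __asm("nop");
      __asm("nop");
      __asm("nop");
      __asm("nop");
      uint32_t t2 = (*(volatile uint32_t*)(0xa90000UL + 0x8U));
      appLogPrintf("### Info: GetTimer time offset with warm nop: %u ns\n", (t2 - t1) * 5u );
    }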

    Regards,
    Nikhil

  • Hi Nikhil.

    thanks for the reply. I modified the code with the NOP as you proposed, and now it takes 1 ns per NOP, as expected. It makes sense with the cache; it takes some time to be loaded, that's clear.
    Thanks

    Regards,
    Tomas

  • Hi Tomas,

    In the first scenario, I'm checking internally regarding the latency of read from the GTC register

    Regarding this, 235 ns is expected. It wouldn't be the same as on the ARM M4/M7, as we have a heterogeneous architecture and there are latencies associated with the read operation. There are multiple hops the R5 has to traverse, including an async bridge, for both the command and the data.

    Regards,
    Nikhil

  • Hi Nikhil,

    thanks, this confirms my investigation. I also ran the same code on the C66 and got nearly the same result (a little bit faster), i.e. it's due to the architecture.

    I also tried the performance counters inside the R5F core. I used modified code from https://opensource.apple.com/source/xnu/xnu-4570.41.2/osfmk/arm/kpc_arm.c (functions enable_counter/read_counter) and measured in CPU clocks. Also interesting: the delta between 2 NOPs is 8 clocks, of which 6 clocks go to capturing the performance counter itself, i.e. 2 NOPs = 2 clocks. When I increase the number of NOPs to 4 or more, I get 232+ clocks. This shows how the CPU needs to load the next code chunk and has to wait for access to the memory. The same experiment with an empty for-loop of 1000 iterations, consisting of 5 instructions including one backward jump (which is always bad), takes 11312 clocks. On average and roughly, 11312 / 1000 / 5 => 2.2 clocks per instruction, which is OK. A sketch of the counter access follows below.
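
    For reference, a minimal sketch of that counter access (standard ARMv7 PMU registers PMCR/PMCNTENSET/PMCCNTR via CP15, GCC-style inline asm; function names are illustrative, not the exact kpc_arm.c code, and this assumes the code runs at a privileged level):

    #include <stdint.h>

    /* Enable and reset the PMU cycle counter: set PMCR.E (enable) and
       PMCR.C (cycle counter reset), then turn the cycle counter on in
       PMCNTENSET (bit 31). */
    static inline void pmu_enable_cycle_counter(void)
    {
      uint32_t v;
      __asm volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(v));  /* read PMCR */
      v |= (1u << 0) | (1u << 2);                               /* E = enable, C = reset cycle counter */
      __asm volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(v)); /* write PMCR */
      v = (1u << 31);
      __asm volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(v)); /* PMCNTENSET: cycle counter on */
    }

    /* Read the current value of the cycle counter (PMCCNTR). */
    static inline uint32_t pmu_read_cycle_counter(void)
    {
      uint32_t v;
      __asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v));
      return v;
    }

    Used around the empty loop it would look like this:

    {
      pmu_enable_cycle_counter();
      uint32_t c1 = pmu_read_cycle_counter();
      for (volatile uint32_t i = 0; i < 1000u; i++) { /* empty body */ }
      uint32_t c2 = pmu_read_cycle_counter();
      appLogPrintf("### Info: loop took %u clocks\n", c2 - c1); /* includes the ~6-clock capture overhead */
    }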

    Thanks again for your help.

    Regards,
    Tomas