
TDA4VM: Reading of GTC timer on 1GHz R5F takes 235ns

Part Number: TDA4VM

Hi there,

I used the 200 MHz GTC timer, as enabled in the examples, to measure its own access time using two consecutive reads. I have an inline function that reads the lower 32 bits of the timer like this:

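/* Read the lower 32 bits of the GTC counter value (CNTCV_LO at GTC base + 0x8) */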
static inline uint32_t GetTimer(void)
{
    return (*(volatile uint32_t*)(0xa90000UL + 0x8U));
}

Here is my test code:

...
uint32_t t1 = GetTimer();
uint32_t t2 = GetTimer();
printf("### Info: GetTimer time offset: %u ns\n", (t2 - t1) * 5u );
...

To be sure, I checked the assembly, where we can see the two read accesses with the address already loaded in r4. Next there is the subtraction, the multiplication by 5 (one 200 MHz tick is 5 ns) done as a shift-add (r0 + (r0 << 2) = 5*r0), and the printf call. Simple, clean code without any overhead.

; return (*(volatile uint32_t*)(0xa90000UL + 0x8U));
a22540c8: 00 00 94 e5 ldr r0, [r4]
a22540cc: 00 10 94 e5 ldr r1, [r4]
; printf("### Info: GetTimer time offset: %u ns\n", (t2 - t1) * 5u );
a22540d0: 00 00 41 e0 sub r0, r1, r0
a22540d4: 00 11 80 e0 add r1, r0, r0, lsl #2
a22540d8: 3c 03 9f e5 ldr r0, [pc, #828]
a22540dc: 4f 22 00 eb bl #35132 <printf>

Then I'm consistently getting this output from the Linux IPC logging utility:

### Info: GetTimer time offset: 235 ns

If I consider the 1 GHz R5F clock, I find 235 ns simply too long, especially in comparison to pure embedded ARM cores like the M4/M7, where peripherals are accessed without any significant wait states. Is there an explanation, did I do something wrong, or is the CPU perhaps not running at that speed? The logging message says "### CPU Frequency = 1000000000 Hz" and I verified the 200 MHz GTC clock. I used the shipped vision example and placed my small code into the app_init.c file after the init phase.

Thanks a lot.

Best regards,
Tomas

  • Hi Tomas,

    Can you please share detailed steps, so that we can reproduce this issue on our end?

    Regards,
    Parth

  • Hi Parth,

    here are my steps:

    • Open ti-processor-sdk-rtos-j721e-evm-08_04_00_06\vision_apps\platform\j721e\rtos\common\app_init.c
    • Navigate to line 414 and insert the following code (in text form for copy & paste):
      {
        uint32_t t1 = (*(volatile uint32_t*)(0xa90000UL + 0x8U));
        uint32_t t2 = (*(volatile uint32_t*)(0xa90000UL + 0x8U));
        appLogPrintf("### Info: GetTimer time offset: %u ns\n", (t2 - t1) * 5u );
      }

    • Recompile the code
      make vx_app_rtos_linux_mcu2_0 -s
    • Load the ELF file onto the system

    • Run ../vx_app_arm_remote_log.out to see the IPC log

    • Check the reported value in the IPC log

    The reported value of 240 ns is the timer readout difference multiplied by 5, since the 200 MHz GTC clock gives 5 ns per tick (here, a delta of 48 ticks × 5 ns = 240 ns).

    As an additional test, I inserted 5 NOPs between the timer readouts:

    {
      uint32_t t1 = (*(volatile uint32_t*)(0xa90000UL + 0x8U));
      __asm("nop");
      __asm("nop");
      __asm("nop");
      __asm("nop");
      __asm("nop");
      uint32_t t2 = (*(volatile uint32_t*)(0xa90000UL + 0x8U));
      appLogPrintf("### Info: GetTimer time offset with nop: %u ns\n", (t2 - t1) * 5u );
    }

    And this is what I have got:

    [MCU2_0] 14058.719712 s: ### Info: GetTimer time offset with nop: 280 ns

    Before we had 240 ns; with the additional 5 NOPs we have 280 ns, i.e. the 5 NOPs took 40 ns => 8 ns per NOP. Of course, the speed depends on the I-cache and the prefetch queue, so the exact time is not easy to measure, but still, on a 1 GHz CPU I would expect faster execution, as known from the ARM M4/M7 where such measurements are very precise.

    Thanks
    Regards,
    Tomas

  • Hi Tomas,

    In the first scenario, I'm checking internally regarding the latency of read from the GTC register.
    But as per my understanding, it wouldn't be the same as on the ARM M4/M7, as we have a heterogeneous architecture and there are latencies associated with the read operation.

    Regarding "nop", as the nop operation is not taken from cache, it has some delay, whereas the later ones are cached operation.
    To check this, you could call __asm("nop") once before profiling (i.e. before t1). After this, it will take 1-2 ns for "nop" operation.
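
    For example, a minimal sketch of that warm-up, assuming the same register address and appLogPrintf call as in your snippet above (the leading NOP is the only addition):

    {
      __asm("nop"); /* warm-up: fetches this code block into the I-cache before timing starts */
      uint32_t t1 = (*(volatile uint32_t*)(0xa90000UL + 0x8U));
      __asm("nop");
      __asm("nop");
      __asm("nop");
      __asm("nop");
      __asm("nop");
      uint32_t t2 = (*(volatile uint32_t*)(0xa90000UL + 0x8U));
      appLogPrintf("### Info: GetTimer time offset with warm nop: %u ns\n", (t2 - t1) * 5u );
    }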

    Regards,
    Nikhil

  • Hi Nikhil.

    thanks for the reply. I modified the code with the NOP as you proposed, and now it takes 1 ns per NOP, as expected. It makes sense with the cache; it takes some time to be loaded, that's clear.
    Thanks

    Regards,
    Tomas

  • Hi Tomas,

    In the first scenario, I'm checking internally regarding the latency of read from the GTC register

    Regarding this, 235 ns is expected. It wouldn't be the same as on the ARM M4/M7, as we have a heterogeneous architecture and there are latencies associated with the read operation. There are multiple hops the R5 has to traverse, including an async bridge, for both the command and the data.

    Regards,
    Nikhil

  • Hi Nikhil,

    thanks, this confirms my investigation. I also ran the same code on the C66 and got nearly the same result (a little bit faster), i.e. it's due to the architecture.

    I also tried the performance counters inside the R5F core. I used modified code from https://opensource.apple.com/source/xnu/xnu-4570.41.2/osfmk/arm/kpc_arm.c (functions enable_counter/read_counter) and measured in CPU clocks. Also interesting: the delta between 2 NOPs is 8 clocks, of which 6 clocks go to capturing the performance counter itself, i.e. 2 NOPs = 2 clocks. When I increase the number of NOPs to 4 or more, I get 232+ clocks. This shows how the CPU needs to load the next code chunk and has to wait for access to the memory. The same experiment with an empty for-loop of 1000 iterations, consisting of 5 instructions including one backward jump (which is always bad), takes 11312 clocks. On average and roughly, 11312 / 1000 / 5 => 2.2 clocks per instruction, which is OK. A sketch of the counter access follows below.
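
    For reference, a minimal sketch of that counter access (standard ARMv7 PMU registers PMCR/PMCNTENSET/PMCCNTR via CP15, GCC-style inline asm; function names are illustrative, not the exact kpc_arm.c code, and this assumes the code runs at a privileged level):

    #include <stdint.h>

    /* Enable and reset the PMU cycle counter: set PMCR.E (enable) and
       PMCR.C (cycle counter reset), then turn the cycle counter on in
       PMCNTENSET (bit 31). */
    static inline void pmu_enable_cycle_counter(void)
    {
      uint32_t v;
      __asm volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(v));  /* read PMCR */
      v |= (1u << 0) | (1u << 2);                               /* E = enable, C = reset cycle counter */
      __asm volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(v)); /* write PMCR */
      v = (1u << 31);
      __asm volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(v)); /* PMCNTENSET: cycle counter on */
    }

    /* Read the current value of the cycle counter (PMCCNTR). */
    static inline uint32_t pmu_read_cycle_counter(void)
    {
      uint32_t v;
      __asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v));
      return v;
    }

    Used around the empty loop it would look like this:

    {
      pmu_enable_cycle_counter();
      uint32_t c1 = pmu_read_cycle_counter();
      for (volatile uint32_t i = 0; i < 1000u; i++) { /* empty body */ }
      uint32_t c2 = pmu_read_cycle_counter();
      appLogPrintf("### Info: loop took %u clocks\n", c2 - c1); /* includes the ~6-clock capture overhead */
    }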

    Thanks again for your help.

    Regards,
    Tomas