am243x freeRTOS is slow

Rasty Slutsker

we research real-time problems in our system.

The idea is to test task reaction time to interrupt (via semaphore port) and measure jitter in task wakeup time.

Setup:
125 uSec interrupt that posts semaphore to task, whcih in turn posts another semaphore.
CPU time stamp counter is sampled in 2 places/tasks. Each time after semaphoreTake.

In total we have 2 hop, namely semaphore posted from ISR, tasks wakes up, samples time and posts semapore to the next task at the lower priority, whick samples time again.

We measure difference in time stamps that represents jitter in task wake up.

With each new "hop" jitter increases, there is a visible correlation between system tick (1 mSec) and jitter.

It looks like Tick notification takes too much time or CPU is just slow.

I checked that DDR is cached and both I/D caches are turned on.

Is it possible to review SOC configuration, maybe some PLL/Frequency is not set correctly?

over 1 year ago

0 Mukul Bhatnagar over 1 year ago

TI__Guru* 83935 points

Rasty

If you are running these experiments on the EVM , please share the reproducible code, so that the team can closely analyze what you are observing.

0 Rasty Slutsker over 1 year ago in reply to Mukul Bhatnagar

Expert 1820 points

Hi Mukul

I implemeted cyclic test that mesusured task wakeup jitter, please follow xSemaphoreTake/xSemaphoreGive pairs.

1. Task 1 is waked up by Interrupt

2. Task 2 is waked up by task 1

3. Task 3 is waked up by task 2

priorities 1< 2 < 3

I measure jitter of time of wake up in each task.

If I have one task jitter is few uSec, if 3 jitter gows to 28 uSec, which is not logical taking into account that tasks do nothing.

please follow followins vars:

empty_am243x-evm_r5fss0-0_freertos_ti-arm-clang.zip

Best regards

Rasty

0 Swargam Anil over 1 year ago in reply to Rasty Slutsker

TI__Mastermind 46626 points

Hello Rasty,

I am looking at your queries and you may expect reply in one or two days .

Regards,

S.Anil.

0 Rasty Slutsker over 1 year ago in reply to Swargam Anil

Expert 1820 points

One more thing that I cannot explain

0 Rasty Slutsker over 1 year ago in reply to Rasty Slutsker

Expert 1820 points

if I do a cross-check between: FreeRTOS tick (1 mSec) vs timer interrupt(125 uSec) vs CPU time stamp-counter, I obtain CPU frequency (R5 core), something like 13 mHz (!)

0 Mukul Bhatnagar over 1 year ago in reply to Rasty Slutsker

TI__Guru* 83935 points

Rasty

I believe you have evaluated AM64 in the past , and had some critical feedback on performance for it to be a replacement for AM57x.

AM24 belongs to the same family - what pre-evaluation was done prior to picking this for your project, what are the performance criteria evaluated during processor selection?. What big picture KPIs are you trying to accomplish. Are you evaluating device or Free RTOS offering - do you have other OS choices via 3P on AM24?

Just trying to understand the big picture while you do various experiments.

0 Rasty Slutsker over 1 year ago in reply to Mukul Bhatnagar

Expert 1820 points

Hi Mukul,

FreeRTOS is good for us.

We found that performance is low. After week of testing I manage to narrow problem down to very simple question:

Why CPU frequence that I measure by means of CycleCounterP_getCount32 and 2 periodic timers (for cross check) show only 12956289.072268067Hz instead of 800 Mhz?

Best regards

RAsty

0 Mukul Bhatnagar over 1 year ago in reply to Rasty Slutsker

TI__Guru* 83935 points

Understood, thanks for clarifying that. Will have the team look at this Monday, as I am sure there is a disconnect somewhere - your request is logical.

0 Rasty Slutsker over 1 year ago in reply to Rasty Slutsker

Expert 1820 points

Hi Mukul

1. I found the reason for incorrect clock measurements - it was WFI instruction in FreeRTOS idle hook.

2, When I removed it I got very good measurements with verion 9.0.0.35 of SDK

3. version 8.6.0.43 (our version ) provides strange numbers.

0 Karan Saxena over 1 year ago in reply to Rasty Slutsker

TI__Guru* 77774 points

Rasty

Rasty Slutsker said:
Why CPU frequence that I measure by means of CycleCounterP_getCount32 and 2 periodic timers (for cross check) show only 12956289.072268067Hz instead of 800 Mhz?

Please share how these experiments were carried out. Can you please share a snippet?

Rasty Slutsker said:
1. I found the reason for incorrect clock measurements - it was WFI instruction in FreeRTOS idle hook.

WFI should not create a problem unless there is another interrupt which generated before your timer kicks in. If another ISR/Task (higher priority) interrupts the core, the frequency calculation will go for a toss.

Example

time_stamp(t1)

delay(2 ms)

time_stamp(t2)

If between t2 and t1, there is an additional context switch (than switch to idle and back), then the t2-t1 would not be 2ms.

Please share what hook function are you referring to here.

Rasty Slutsker said:
2, When I removed it I got very good measurements with verion 9.0.0.35 of SDK

3. version 8.6.0.43 (our version ) provides strange numbers.

Please clarify, does this mean that with 9.0.0.35, if you remove WFI from the idle hook, results are reasonable. But even when you do the same with 8.6.0.43, results are still incorrect?

Regards

Karan

0 Mukul Bhatnagar over 1 year ago in reply to Karan Saxena

TI__Guru* 83935 points

Additionally it would be good to clarify if there are any reasons why you would need to be on SDK8.6, my hope is that for new development like yours you will be on the latest SDK offering i.e. SDK 9.0.1

https://www.ti.com/tool/download/MCU-PLUS-SDK-AM243X/09.01.00.41

There seems to be reasonable amount of bug fixes in this release over the previous one

https://software-dl.ti.com/mcu-plus-sdk/esd/AM243X/09_01_00_41/exports/docs/api_guide_am243x/RELEASE_NOTES_09_01_00_PAGE.html

0 Karan Saxena over 1 year ago in reply to Mukul Bhatnagar

TI__Guru* 77774 points

1. This application is compiled using GCC (not TI CLANG). Action (TI/Rasty) to replicate a simple application with one single ISR and 3 tasks (this should not have any additional code blocks/functionality.

2. Rasty to share GCC compiler flags being used to compile the application. TI will review with the compiler team. Suspicion is that the flags used for GCC compilation of the app are not optimal.

Rasty Slutsker said:
1. I found the reason for incorrect clock measurements - it was WFI instruction in FreeRTOS idle hook.

WFI will stop PMU and hence could lead to incorrect measurements of cycle counter.

Regards

Karan

0 Rasty Slutsker over 1 year ago in reply to Karan Saxena

Expert 1820 points

Mukul Bhatnagar

I managed to reproduce jitter on top of your example.

It is enough to add low priority task that constantly (tight loop) fills big array. CPU runs out cache and needs cache refill.

One task addes jitter of 12 uSecs. Is it possible that cache refill takes 12 uSecs?

It also influences interrupt reaction time.

Can someone review with is DDR and cache settings?

0 Swargam Anil over 1 year ago in reply to Rasty Slutsker

TI__Mastermind 46626 points

Hello Rasty ,

We have tested the above example on our bench and noticed that the PMU Timer is not functioning properly.

Instead of the PMU timer, we have added a GTC timer in our Application, and please try to check at your bench, and share your test results.

I am trying to reproduce the same issue with GCC compilers and let you know the status.

5125.empty_am243x-evm_r5fss0-0_freertos_ti-arm-clang.zip

Regards,

S.Anil.

0 Swargam Anil over 1 year ago in reply to Swargam Anil

TI__Mastermind 46626 points

Hello Rasty,

Rasty Slutsker said:
It is enough to add low priority task that constantly (tight loop) fills big array. CPU runs out cache and needs cache refill.

Can you please share the reproducible code with us to further debug the issue at our side ?

Rasty Slutsker said:
One task addes jitter of 12 uSecs. Is it possible that cache refill takes 12 uSecs?

We need to check this timing details with other exerts as well.

If you remove the non-cached for this buffers memory, still do you have a jitter issue ?

Regards,

S.Anil.

0 Rasty Slutsker over 1 year ago in reply to Swargam Anil

Expert 1820 points

You can monitor these counters.

You can stop debugger and put 0 each to restart measurement. You should see values close to 125 (after division by 800.0)

[0] is ISR cycle time

[1],[2],[3] task cycle time.

empty_am243x-evm_r5fss0-0_freertos_ti-arm-clang_1.zip

Try "as is" then commenot out in main.

gHop4Task = xTaskCreateStatic( empty_hop4, /* Pointer to the function that implements the task. */
gHop5Task = xTaskCreateStatic( empty_hop5, /* Pointer to the function that implements the task. */

0 Swargam Anil over 1 year ago in reply to Rasty Slutsker

TI__Mastermind 46626 points

Hello Rasty,

Rasty Slutsker said:
empty_am243x-evm_r5fss0-0_freertos_ti-arm-clang_1.zip

In this example, you have created only one task ? Do I need to create 2 other tasks and will do I check on my side ?

Regards,

S.Anil.

0 Rasty Slutsker over 1 year ago in reply to Swargam Anil

Expert 1820 points

3301.empty_am243x-evm_r5fss0-0_freertos_ti-arm-clang.zip

I uploaded example that I've shown during the meeting.

Thansk

Rasty

0 Karan Saxena over 1 year ago in reply to Rasty Slutsker

TI__Guru* 77774 points

Update on this thread

1. TI able to replicate the issue and see jitter in task scheduling when lower priority tasks are filling up large buffers (32 x 10000 / 8 Bytes = 40KB) and doing a buffer copy.

Expectation is that this is overwrites the Cache contents and due to this the other thread execution and context switch takes time, leading to jitter. This is validated by putting the large buffers in uncached DDR, after which there is no jitter.

Open questions

Buffers used in the lower priority tasks should cache in the data cache and not the instruction, so it not possible to cause thrashing of instructions which should be services by instruction cache.
Is 12 us (max jitter which we see in tasks) explainable?
- Additional data point here is that every cache miss will incur penalty of fetching from DDR. The DDR access latency for is 280ns as per https://www.ti.com/lit/an/spracv1b/spracv1b.pdf
- So an additional cache misses of 43 can lead to an increase in execution time by 43 x 280 ns = 12us

Additional information

ISR execution time is << 125us and comes out to be 4-6us.

Next steps

Run PMU profiling to measure cache miss data for I$ and D$

Regards

Karan

0 Karan Saxena over 1 year ago in reply to Karan Saxena

TI__Guru* 77774 points

Update as of 3/6

TI is still working on the above actions, getting PMU data.

Example which can be used for this - mcu_plus_sdk_am243x_09_01_00_41\examples\drivers\pmu\pmu_multievent\pmu_multievent.c

Regards

Karan

0 Karan Saxena over 1 year ago in reply to Karan Saxena

TI__Guru* 77774 points

Notes from the experiments

1. D$ misses are high for ISR as well (probably because of the profiling code), so it makes sense to have an impact on ISR execution when you have lower priority tasks filling up large buffers (32 x 10000 / 8 Bytes = 40KB) and doing a buffer copy.

D$ misses in the ISR are >4000.

Next step would be to try put the benchmark code & buffers in uncached memory or TCM.

2. Moving the time stamping code to the beginning of the ISR (& tasks) reduces the jitter, this is probably because no the delta due to cache miss is consistent for consecutive runs. ISR hits at worst case latency of 127us (1.6% jitter), this earlier was around 130us (4% jitter).

3. Time of execution of ISR in case when the low priority tasks filling up large buffers aren't there is 1us and in case when the buffer filling and copy is enabled in the other tasks, the execution time increases to 6us.

If there are any additional experiments which your team has done? Please post results. Let's identify the next steps when we meet today.

Regards

Karan

0 Karan Saxena over 1 year ago in reply to Karan Saxena

TI__Guru* 77774 points

Next steps

1. See consecutive times of execution for for ISR, if consecutive times of accesses are different, then the time difference could manifest as a jitter.

2. Keep the profiling code in TCM/DDR uncached when running this experiment.

3. Reduce the size of buffer to see impact on jitter. TI tried 4KB instead of 40KB -> this reduces the jitter.

Exp from customer

1. Move example to internal RAM

Action on TI: how to move the autogen code to internal memory. This will reduce the jitter but will not remove it as the cache will still come into the picture when you run from internal memory.

Jitter should not be there if things are in TCM, as there is no cache in between the R5 and TCM access.

2. Customer is currently not using TCM or internal memory.

3. DDR freq - 1600MT/s

4. TCP/IP - could be in uncached DDR.

Regards

Karan

0 Rasty Slutsker over 1 year ago in reply to Karan Saxena

Expert 1820 points

Meanwhile we apply following strategy"

Separate heaps for internal, external cached and external non cached memory. This works prette nice with C++, thanks to overload of "new"/"delete"

The problem is that freertos does not know how to work with multiply heaps, so we are going to make adaptations to freertos as well.

I discussed this on Freertos developers forum.

Next step is code separation according to sections, It works well with C and not so good with C++.

Need GCC version 14. It would be greate If you could help with this tool-chain. Latest binary on ARM site is for 13.

Best regards

Rasty

0 Rasty Slutsker over 1 year ago in reply to Karan Saxena

Expert 1820 points

Karan Saxena

Can you help with explanation how to move (attach attributes to) autogenerated ISR wrapper to specifc memory section:

Thanks

Rasty

0 Swargam Anil over 1 year ago in reply to Rasty Slutsker

TI__Mastermind 46626 points

Hello Rasty ,

I have tried two methods. One method is the _attribute method and another one is of keeping functions in the .text in the linker command file.

In my testing I found that the 2nd method is simple since we do no need to update the generated files as well.

Please try the below method ,. Just you can update the linker cmd file to keep functions in the TCM memory.

And the TCMA memory size is also defined in the same linker command file.

/* This is the stack that is used by code running within main()
 * In case of NORTOS,
 * - This means all the code outside of ISR uses this stack
 * In case of FreeRTOS
 * - This means all the code until vTaskStartScheduler() is called in main()
 *   uses this stack.
 * - After vTaskStartScheduler() each task created in FreeRTOS has its own stack
 */
--stack_size=16384
/* This is the heap size for malloc() API in NORTOS and FreeRTOS
 * This is also the heap used by pvPortMalloc in FreeRTOS
 */
--heap_size=32768
-e_vectors  /* This is the entry of the application, _vector MUST be plabed starting address 0x0 */

/* This is the size of stack when R5 is in IRQ mode
 * In NORTOS,
 * - Here interrupt nesting is enabled
 * - This is the stack used by ISRs registered as type IRQ
 * In FreeRTOS,
 * - Here interrupt nesting is disabled
 * - This is stack that is used initally when a IRQ is received
 * - But then the mode is switched to SVC mode and SVC stack is used for all user ISR callbacks
 * - Hence in FreeRTOS, IRQ stack size is less and SVC stack size is more
 */
__IRQ_STACK_SIZE = 256;
/* This is the size of stack when R5 is in IRQ mode
 * - In both NORTOS and FreeRTOS nesting is disabled for FIQ
 */
__FIQ_STACK_SIZE = 256;
__SVC_STACK_SIZE = 4096; /* This is the size of stack when R5 is in SVC mode */
__ABORT_STACK_SIZE = 256;  /* This is the size of stack when R5 is in ABORT mode */
__UNDEFINED_STACK_SIZE = 256;  /* This is the size of stack when R5 is in UNDEF mode */

SECTIONS
{
    /* This has the R5F entry point and vector table, this MUST be at 0x0 */
    .vectors:{} palign(8) > R5F_VECS
    /*.controlfnc:{} palign(8) > R5F_TCMA*/


    /* This has the R5F boot code until MPU is enabled,  this MUST be at a address < 0x80000000
     * i.e this cannot be placed in DDR
     */
    GROUP {
        .text.hwi: palign(8)
        .text.cache: palign(8)
        .text.mpu: palign(8)
        .text.boot: palign(8)
        .text:abort: palign(8) /* this helps in loading symbols when using XIP mode */
    } > MSRAM

    /* This is rest of code. This can be placed in DDR if DDR is available and needed */
    GROUP {
        .text:   {} palign(8)   /* This is where code resides */
        .rodata: {} palign(8)   /* This is where const's go */
    } > DDR ///MSRAM

    /* This is rest of initialized data. This can be placed in DDR if DDR is available and needed */
    GROUP {
        .data:   {} palign(8)   /* This is where initialized globals and static go */
    } > DDR ///MSRAM

    /* This is rest of uninitialized data. This can be placed in DDR if DDR is available and needed */
    GROUP {
        .bss:    {} palign(8)   /* This is where uninitialized globals go */
        RUN_START(__BSS_START)
        RUN_END(__BSS_END)
        .sysmem: {} palign(8)   /* This is where the malloc heap goes */
        .stack:  {} palign(8)   /* This is where the main() stack goes */
    } > DDR ///MSRAM

    /* This is where the stacks for different R5F modes go */
    GROUP {
        .irqstack: {. = . + __IRQ_STACK_SIZE;} align(8)
        RUN_START(__IRQ_STACK_START)
        RUN_END(__IRQ_STACK_END)
        .fiqstack: {. = . + __FIQ_STACK_SIZE;} align(8)
        RUN_START(__FIQ_STACK_START)
        RUN_END(__FIQ_STACK_END)
        .svcstack: {. = . + __SVC_STACK_SIZE;} align(8)
        RUN_START(__SVC_STACK_START)
        RUN_END(__SVC_STACK_END)
        .abortstack: {. = . + __ABORT_STACK_SIZE;} align(8)
        RUN_START(__ABORT_STACK_START)
        RUN_END(__ABORT_STACK_END)
        .undefinedstack: {. = . + __UNDEFINED_STACK_SIZE;} align(8)
        RUN_START(__UNDEFINED_STACK_START)
        RUN_END(__UNDEFINED_STACK_END)
    } > DDR ///MSRAM

    /* Sections needed put into TCM */
    GROUP {
        CRITICAL_DATA_SECTION:  {} palign(8)   /* critical data section */
        CRITICAL_TEXT_SECTION: {} palign(8)   /* critical text section */
    } > DDR ///MSRAM

    /* Sections needed for C++ projects */
    GROUP {
        .ARM.exidx:  {} palign(8)   /* Needed for C++ exception handling */
        .init_array: {} palign(8)   /* Contains function pointers called before main */
        .fini_array: {} palign(8)   /* Contains function pointers called after main */
    } > MSRAM

    /* General purpose user shared memory, used in some examples */
    .bss.user_shared_mem (NOLOAD) : {} > USER_SHM_MEM
    /* this is used when Debug log's to shared memory are enabled, else this is not used */
    .bss.log_shared_mem  (NOLOAD) : {} > LOG_SHM_MEM
    /* this is used only when IPC RPMessage is enabled, else this is not used */
    .bss.ipc_vring_mem   (NOLOAD) : {} > RTOS_NORTOS_IPC_SHM_MEM
    /* General purpose non cacheable memory, used in some examples */
    .bss.nocache (NOLOAD) : {} > NON_CACHE_MEM

	GROUP {
    .text.TimerP_isr0 : {} palign(8)
    .text.TimerTick : {} palign(8)
	} >  R5F_TCMA


     /*.DDR_UNCACHED.filebuf:   {} palign(8) > DDR_UNCACHED*/


}




/*
NOTE: Below memory is reserved for DMSC usage
 - During Boot till security handoff is complete
   0x701E0000 - 0x701FFFFF (128KB)
 - After "Security Handoff" is complete (i.e at run time)
   0x701F4000 - 0x701FFFFF (48KB)

 Security handoff is complete when this message is sent to the DMSC,
   TISCI_MSG_SEC_HANDOVER

 This should be sent once all cores are loaded and all application
 specific firewall calls are setup.
*/

MEMORY
{
    R5F_VECS  : ORIGIN = 0x00000000 , LENGTH = 0x00000040
    R5F_TCMA  : ORIGIN = 0x00000040 , LENGTH = 0x00007FC0
    R5F_TCMB0 : ORIGIN = 0x41010000 , LENGTH = 0x00008000

    /* memory segment used to hold CPU specific non-cached data, MAKE to add a MPU entry to mark this as non-cached */
    NON_CACHE_MEM : ORIGIN = 0x70060000 , LENGTH = 0x8000

    /* when using multi-core application's i.e more than one R5F/M4F active, make sure
     * this memory does not overlap with other R5F's
     */
    MSRAM     : ORIGIN = 0x70080000 , LENGTH = 0x40000

    DDR       : ORIGIN = 0x80000000 , LENGTH = 0x80000

    /* This section can be used to put XIP section of the application in flash, make sure this does not overlap with
     * other CPUs. Also make sure to add a MPU entry for this section and mark it as cached and code executable
     */
    FLASH     : ORIGIN = 0x60100000 , LENGTH = 0x80000


    DDR_UNCACHED  : ORIGIN = 0x80080000 , LENGTH = 0x70000
    /* shared memory segments */
    /* On R5F,
     * - make sure there is a MPU entry which maps below regions as non-cache
     */
    USER_SHM_MEM            : ORIGIN = 0x701D0000, LENGTH = 0x180
    LOG_SHM_MEM             : ORIGIN = 0x701D0000 + 0x180, LENGTH = 0x00004000 - 0x180
    RTOS_NORTOS_IPC_SHM_MEM : ORIGIN = 0x701D4000, LENGTH = 0x0000C000
}

Please look at the memory allocation window. The above two functions are stored in the TCM memory only.

I have tried with TCMA and the same procedure for other memory regions as well.

Regards,

S.Anil.

0 Rasty Slutsker over 1 year ago in reply to Swargam Anil

Expert 1820 points

OK. Thanks. It helps.

Code Composer Studio™︎

Code Composer Studio forum

am243x freeRTOS is slow