This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

am243x freeRTOS is slow

Hi

we research real-time problems in our system.

The idea is to test task reaction time to interrupt (via semaphore port) and measure jitter in task wakeup time.

Setup:
125 uSec interrupt that posts semaphore to task, whcih in turn posts another semaphore.
CPU time stamp counter is sampled in 2 places/tasks. Each time after semaphoreTake.

In total we have 2 hop, namely semaphore posted from ISR, tasks wakes up, samples time and posts semapore to the next task at the lower priority, whick samples time again.

We measure difference in time stamps that represents jitter in task wake up.

With each new "hop" jitter increases, there is a visible correlation between system tick (1 mSec) and jitter.

It looks like Tick notification takes too much time or CPU is just slow.

I checked that DDR is cached and both I/D caches are turned on.

Is it possible to review SOC configuration, maybe some PLL/Frequency is not set correctly?

  • Rasty

    If you are running these experiments on the EVM , please share the reproducible code, so that the team can closely analyze what you are observing.

  • Hi Mukul

    I implemeted cyclic test that mesusured task wakeup jitter, please follow xSemaphoreTake/xSemaphoreGive pairs.

    1. Task 1 is waked up by Interrupt

    2. Task 2 is waked up by task 1

    3. Task 3 is waked up by task 2

    priorities 1< 2 < 3

    I measure jitter of time of wake up in each task.

    If I have one task jitter is few uSec, if 3 jitter gows to 28 uSec, which is not logical taking into account that tasks do nothing.

    please follow followins vars:

    empty_am243x-evm_r5fss0-0_freertos_ti-arm-clang.zip

    Best regards

    Rasty

  • Hello Rasty,

    I am looking at your queries and you may expect reply in one or two days .

    Regards,

    S.Anil.

  • One more thing that I cannot explain

  • if I do a cross-check between: FreeRTOS tick (1 mSec) vs timer interrupt(125 uSec) vs CPU time stamp-counter, I obtain CPU frequency (R5 core), something like 13 mHz (!)

  • Rasty

    I believe you have evaluated AM64 in the past , and had some critical feedback on performance for it to be a replacement for AM57x. 

    AM24 belongs to the same family - what pre-evaluation was done prior to picking this for your project, what are the performance criteria evaluated during processor selection?. What big picture KPIs are you trying to accomplish. Are you evaluating device or Free RTOS offering - do you have other OS choices via 3P on AM24?

    Just trying to understand the big picture while you do various experiments.  

  • Hi Mukul,

    FreeRTOS is good for us.

    We found that performance is low. After week of testing I manage to narrow problem down to very simple question:

    Why CPU frequence that I measure by means of CycleCounterP_getCount32 and 2 periodic timers (for cross check) show only 12956289.072268067Hz instead of 800 Mhz? 

    Best regards

    RAsty 

  • Understood, thanks for clarifying that. Will have the team look at this Monday, as I am sure there is a disconnect somewhere - your request is logical. 

  • Hi Mukul

    1. I found the reason for incorrect clock measurements  - it was WFI instruction in FreeRTOS idle hook.

    2, When I removed it I got very good measurements with verion 9.0.0.35 of SDK

    3. version 8.6.0.43 (our version ) provides strange numbers.

  • Rasty

    Why CPU frequence that I measure by means of CycleCounterP_getCount32 and 2 periodic timers (for cross check) show only 12956289.072268067Hz instead of 800 Mhz? 

    Please share how these experiments were carried out. Can you please share a snippet?

    1. I found the reason for incorrect clock measurements  - it was WFI instruction in FreeRTOS idle hook.

    WFI should not create a problem unless there is another interrupt which generated before your timer kicks in. If another ISR/Task (higher priority) interrupts the core, the frequency calculation will go for a toss.

    Example

    time_stamp(t1)

    delay(2 ms)

    time_stamp(t2)

    If between t2 and t1, there is an additional context switch (than switch to idle and back), then the t2-t1 would not be 2ms.

    Please share what hook function are you referring to here.

    2, When I removed it I got very good measurements with verion 9.0.0.35 of SDK

    3. version 8.6.0.43 (our version ) provides strange numbers.

    Please clarify, does this mean that with 9.0.0.35, if you remove WFI from the idle hook, results are reasonable. But even when you do the same with 8.6.0.43, results are still incorrect?

    Regards

    Karan

  • Additionally it would be good to clarify if there are any reasons why you would need to be on SDK8.6, my hope is that for new development like yours you will be on the latest SDK offering i.e. SDK 9.0.1

    https://www.ti.com/tool/download/MCU-PLUS-SDK-AM243X/09.01.00.41

    There seems to be reasonable amount of bug fixes in this release over the previous one

    https://software-dl.ti.com/mcu-plus-sdk/esd/AM243X/09_01_00_41/exports/docs/api_guide_am243x/RELEASE_NOTES_09_01_00_PAGE.html

  • 1. This application is compiled using GCC (not TI CLANG). Action (TI/Rasty) to replicate a simple application with one single ISR and 3 tasks (this should not have any additional code blocks/functionality.

    2. Rasty to share GCC compiler flags being used to compile the application. TI will review with the compiler team. Suspicion is that the flags used for GCC compilation of the app are not optimal.

    3.

    1. I found the reason for incorrect clock measurements  - it was WFI instruction in FreeRTOS idle hook.

    WFI will stop PMU and hence could lead to incorrect measurements of cycle counter.

    Regards

    Karan

  • I managed to reproduce jitter on top of your example.

    It is enough to add low priority task that constantly (tight loop) fills big array. CPU runs out cache and needs cache refill.

    One task addes jitter of 12 uSecs. Is it possible that cache refill takes 12 uSecs?

    It also influences interrupt reaction time.

    Can someone review with is DDR  and cache settings?

  • Hello Rasty ,

    We have tested the above example on our bench and noticed that the PMU Timer is not functioning properly.

    Instead of the PMU timer, we have added a GTC timer in our Application, and please try to check at your bench, and share your test results.

    I am trying to reproduce the same issue with GCC compilers and let you know the status.

    5125.empty_am243x-evm_r5fss0-0_freertos_ti-arm-clang.zip

    Regards,

    S.Anil.

  • Hello Rasty,

    It is enough to add low priority task that constantly (tight loop) fills big array. CPU runs out cache and needs cache refill.

    Can you please share the reproducible code with us to further debug the issue at our side ?

    One task addes jitter of 12 uSecs. Is it possible that cache refill takes 12 uSecs?

    We need to check this timing details with other exerts as well.

    If you remove the non-cached for this buffers memory, still do you have a jitter issue ?

    Regards,

    S.Anil.

  • Hi

    You can monitor these counters.

    You can stop debugger and put 0 each to restart measurement. You should see values close to 125 (after division by 800.0)

    [0] is ISR cycle time

    [1],[2],[3] task cycle time.

     empty_am243x-evm_r5fss0-0_freertos_ti-arm-clang_1.zip

    Try "as is" then commenot out in main.

    gHop4Task = xTaskCreateStatic( empty_hop4, /* Pointer to the function that implements the task. */
    gHop5Task = xTaskCreateStatic( empty_hop5, /* Pointer to the function that implements the task. */

  • Hello Rasty,

    empty_am243x-evm_r5fss0-0_freertos_ti-arm-clang_1.zip

    In this example, you have created only one task ? Do I need to create 2 other tasks and will do I check on my side ?

    Regards,

    S.Anil.

  • 3301.empty_am243x-evm_r5fss0-0_freertos_ti-arm-clang.zip

    Hi 

    I uploaded example that I've shown during the meeting.

    Thansk

    Rasty

  • Update on this thread

    1. TI able to replicate the issue and see jitter in task scheduling when lower priority tasks are filling up large buffers (32 x 10000 / 8 Bytes = 40KB) and doing a buffer copy.

    Expectation is that this is overwrites the Cache contents and due to this the other thread execution and context switch takes time, leading to jitter. This is validated by putting the large buffers in uncached DDR, after which there is no jitter.

    Open questions

    • Buffers used in the lower priority tasks should cache in the data cache and not the instruction, so it not possible to cause thrashing of instructions which should be services by instruction cache.
    • Is 12 us (max jitter which we see in tasks) explainable?
      • Additional data point here is that every cache miss will incur penalty of fetching from DDR. The DDR access latency for is 280ns as per https://www.ti.com/lit/an/spracv1b/spracv1b.pdf
      • So an additional cache misses of 43 can lead to an increase in execution time by 43 x 280 ns = 12us

    Additional information

    • ISR execution time is << 125us and comes out to be 4-6us.

    Next steps

    • Run PMU profiling to measure cache miss data for I$ and D$

    Regards

    Karan

  • Update as of 3/6

    TI is still working on the above actions, getting PMU data.

    Example which can be used for this - mcu_plus_sdk_am243x_09_01_00_41\examples\drivers\pmu\pmu_multievent\pmu_multievent.c

    Regards

    Karan

  • Notes from the experiments

    1. D$ misses are high for ISR as well (probably because of the profiling code), so it makes sense to have an impact on ISR execution when you have lower priority tasks filling up large buffers (32 x 10000 / 8 Bytes = 40KB) and doing a buffer copy.

    D$ misses in the ISR are >4000.

    Next step would be to try put the benchmark code & buffers in uncached memory or TCM.

    2. Moving the time stamping code to the beginning of the ISR (& tasks) reduces the jitter, this is probably because no the delta due to cache miss is consistent for consecutive runs. ISR hits at worst case latency of 127us (1.6% jitter), this earlier was around 130us (4% jitter).

    3. Time of execution of ISR in case when the low priority tasks filling up large buffers aren't there is 1us and in case when the buffer filling and copy is enabled in the other tasks, the execution time increases to 6us.

    If there are any additional experiments which your team has done? Please post results. Let's identify the next steps when we meet today.

    Regards

    Karan

  • Next steps

    1. See consecutive times of execution for for ISR, if consecutive times of accesses are different, then the time difference could manifest as a jitter.

    2. Keep the profiling code in TCM/DDR uncached when running this experiment.

    3. Reduce the size of buffer to see impact on jitter. TI tried 4KB instead of 40KB -> this reduces the jitter.

    Exp from customer

    1. Move example to internal RAM

    Action on TI: how to move the autogen code to internal memory. This will reduce the jitter but will not remove it as the cache will still come into the picture when you run from internal memory.

    Jitter should not be there if things are in TCM, as there is no cache in between the R5 and TCM access.

    2. Customer is currently not using TCM or internal memory.

    3. DDR freq - 1600MT/s

    4. TCP/IP - could be in uncached DDR.

    Regards

    Karan

  • Hi 

    Meanwhile we apply following strategy"

    Separate heaps for internal, external cached and external non cached memory. This works prette nice with C++, thanks to overload of "new"/"delete"

    The problem is that freertos does not know how to work with multiply heaps, so we are going to make adaptations to freertos as well.

    I discussed this on Freertos developers forum.

    Next step is code separation according to sections, It works well with C and not so good with C++.

    Need GCC version 14. It would be greate If you could help with this tool-chain. Latest binary on ARM site is for 13.

    Best regards

    Rasty

  • Hi 

    Can you help with explanation how to move (attach attributes to) autogenerated ISR wrapper to specifc memory section:

    Thanks

    Rasty

  • Hello Rasty ,

    I have tried two methods. One method is the _attribute method and another one is of keeping functions in the .text in the linker command file.

    In my testing I found that the 2nd method is simple since we do no need to update the generated files as well.

    Please try the below method ,. Just you can update the linker cmd file to keep functions in the TCM memory.

    And the TCMA memory size is also defined in the same linker command file. 

    /* This is the stack that is used by code running within main()
     * In case of NORTOS,
     * - This means all the code outside of ISR uses this stack
     * In case of FreeRTOS
     * - This means all the code until vTaskStartScheduler() is called in main()
     *   uses this stack.
     * - After vTaskStartScheduler() each task created in FreeRTOS has its own stack
     */
    --stack_size=16384
    /* This is the heap size for malloc() API in NORTOS and FreeRTOS
     * This is also the heap used by pvPortMalloc in FreeRTOS
     */
    --heap_size=32768
    -e_vectors  /* This is the entry of the application, _vector MUST be plabed starting address 0x0 */
    
    /* This is the size of stack when R5 is in IRQ mode
     * In NORTOS,
     * - Here interrupt nesting is enabled
     * - This is the stack used by ISRs registered as type IRQ
     * In FreeRTOS,
     * - Here interrupt nesting is disabled
     * - This is stack that is used initally when a IRQ is received
     * - But then the mode is switched to SVC mode and SVC stack is used for all user ISR callbacks
     * - Hence in FreeRTOS, IRQ stack size is less and SVC stack size is more
     */
    __IRQ_STACK_SIZE = 256;
    /* This is the size of stack when R5 is in IRQ mode
     * - In both NORTOS and FreeRTOS nesting is disabled for FIQ
     */
    __FIQ_STACK_SIZE = 256;
    __SVC_STACK_SIZE = 4096; /* This is the size of stack when R5 is in SVC mode */
    __ABORT_STACK_SIZE = 256;  /* This is the size of stack when R5 is in ABORT mode */
    __UNDEFINED_STACK_SIZE = 256;  /* This is the size of stack when R5 is in UNDEF mode */
    
    SECTIONS
    {
        /* This has the R5F entry point and vector table, this MUST be at 0x0 */
        .vectors:{} palign(8) > R5F_VECS
        /*.controlfnc:{} palign(8) > R5F_TCMA*/
    
    
        /* This has the R5F boot code until MPU is enabled,  this MUST be at a address < 0x80000000
         * i.e this cannot be placed in DDR
         */
        GROUP {
            .text.hwi: palign(8)
            .text.cache: palign(8)
            .text.mpu: palign(8)
            .text.boot: palign(8)
            .text:abort: palign(8) /* this helps in loading symbols when using XIP mode */
        } > MSRAM
    
        /* This is rest of code. This can be placed in DDR if DDR is available and needed */
        GROUP {
            .text:   {} palign(8)   /* This is where code resides */
            .rodata: {} palign(8)   /* This is where const's go */
        } > DDR ///MSRAM
    
        /* This is rest of initialized data. This can be placed in DDR if DDR is available and needed */
        GROUP {
            .data:   {} palign(8)   /* This is where initialized globals and static go */
        } > DDR ///MSRAM
    
        /* This is rest of uninitialized data. This can be placed in DDR if DDR is available and needed */
        GROUP {
            .bss:    {} palign(8)   /* This is where uninitialized globals go */
            RUN_START(__BSS_START)
            RUN_END(__BSS_END)
            .sysmem: {} palign(8)   /* This is where the malloc heap goes */
            .stack:  {} palign(8)   /* This is where the main() stack goes */
        } > DDR ///MSRAM
    
        /* This is where the stacks for different R5F modes go */
        GROUP {
            .irqstack: {. = . + __IRQ_STACK_SIZE;} align(8)
            RUN_START(__IRQ_STACK_START)
            RUN_END(__IRQ_STACK_END)
            .fiqstack: {. = . + __FIQ_STACK_SIZE;} align(8)
            RUN_START(__FIQ_STACK_START)
            RUN_END(__FIQ_STACK_END)
            .svcstack: {. = . + __SVC_STACK_SIZE;} align(8)
            RUN_START(__SVC_STACK_START)
            RUN_END(__SVC_STACK_END)
            .abortstack: {. = . + __ABORT_STACK_SIZE;} align(8)
            RUN_START(__ABORT_STACK_START)
            RUN_END(__ABORT_STACK_END)
            .undefinedstack: {. = . + __UNDEFINED_STACK_SIZE;} align(8)
            RUN_START(__UNDEFINED_STACK_START)
            RUN_END(__UNDEFINED_STACK_END)
        } > DDR ///MSRAM
    
        /* Sections needed put into TCM */
        GROUP {
            CRITICAL_DATA_SECTION:  {} palign(8)   /* critical data section */
            CRITICAL_TEXT_SECTION: {} palign(8)   /* critical text section */
        } > DDR ///MSRAM
    
        /* Sections needed for C++ projects */
        GROUP {
            .ARM.exidx:  {} palign(8)   /* Needed for C++ exception handling */
            .init_array: {} palign(8)   /* Contains function pointers called before main */
            .fini_array: {} palign(8)   /* Contains function pointers called after main */
        } > MSRAM
    
        /* General purpose user shared memory, used in some examples */
        .bss.user_shared_mem (NOLOAD) : {} > USER_SHM_MEM
        /* this is used when Debug log's to shared memory are enabled, else this is not used */
        .bss.log_shared_mem  (NOLOAD) : {} > LOG_SHM_MEM
        /* this is used only when IPC RPMessage is enabled, else this is not used */
        .bss.ipc_vring_mem   (NOLOAD) : {} > RTOS_NORTOS_IPC_SHM_MEM
        /* General purpose non cacheable memory, used in some examples */
        .bss.nocache (NOLOAD) : {} > NON_CACHE_MEM
    
    	GROUP {
        .text.TimerP_isr0 : {} palign(8)
        .text.TimerTick : {} palign(8)
    	} >  R5F_TCMA
    
    
         /*.DDR_UNCACHED.filebuf:   {} palign(8) > DDR_UNCACHED*/
    
    
    }
    
    
    
    
    /*
    NOTE: Below memory is reserved for DMSC usage
     - During Boot till security handoff is complete
       0x701E0000 - 0x701FFFFF (128KB)
     - After "Security Handoff" is complete (i.e at run time)
       0x701F4000 - 0x701FFFFF (48KB)
    
     Security handoff is complete when this message is sent to the DMSC,
       TISCI_MSG_SEC_HANDOVER
    
     This should be sent once all cores are loaded and all application
     specific firewall calls are setup.
    */
    
    MEMORY
    {
        R5F_VECS  : ORIGIN = 0x00000000 , LENGTH = 0x00000040
        R5F_TCMA  : ORIGIN = 0x00000040 , LENGTH = 0x00007FC0
        R5F_TCMB0 : ORIGIN = 0x41010000 , LENGTH = 0x00008000
    
        /* memory segment used to hold CPU specific non-cached data, MAKE to add a MPU entry to mark this as non-cached */
        NON_CACHE_MEM : ORIGIN = 0x70060000 , LENGTH = 0x8000
    
        /* when using multi-core application's i.e more than one R5F/M4F active, make sure
         * this memory does not overlap with other R5F's
         */
        MSRAM     : ORIGIN = 0x70080000 , LENGTH = 0x40000
    
        DDR       : ORIGIN = 0x80000000 , LENGTH = 0x80000
    
        /* This section can be used to put XIP section of the application in flash, make sure this does not overlap with
         * other CPUs. Also make sure to add a MPU entry for this section and mark it as cached and code executable
         */
        FLASH     : ORIGIN = 0x60100000 , LENGTH = 0x80000
    
    
        DDR_UNCACHED  : ORIGIN = 0x80080000 , LENGTH = 0x70000
        /* shared memory segments */
        /* On R5F,
         * - make sure there is a MPU entry which maps below regions as non-cache
         */
        USER_SHM_MEM            : ORIGIN = 0x701D0000, LENGTH = 0x180
        LOG_SHM_MEM             : ORIGIN = 0x701D0000 + 0x180, LENGTH = 0x00004000 - 0x180
        RTOS_NORTOS_IPC_SHM_MEM : ORIGIN = 0x701D4000, LENGTH = 0x0000C000
    }
    

    Please look at the memory allocation window. The above two functions are stored in the TCM memory only.

    I have tried with TCMA and the same procedure for other memory regions as well.

    Regards,

    S.Anil.

  • OK. Thanks. It helps.