AM6442: Cortex R5F Performance

Part Number: AM6442
Other Parts Discussed in Thread: SYSCONFIG

Hello,

We are trying to switch from the F28379D to the higher-performance AM64 platform. The AM64 looks like a great option, but some measurements have surprised us. The same math computation takes:

  • 280 µs on the Delfino (F28379D) at 200 MHz with optimization level 2;
  • 215 µs on the R5F on the AM64 at 800 MHz with optimization level 3;
  • 30 µs on the A53 on the AM64 at 1 GHz with optimization level 0.

We find the performance of the R5F surprising. Is a 200 MHz Delfino supposed to perform almost the same as an 800 MHz Arm R5F? Is there a way to improve the performance of the R5F core (compiler optimization, enabling hardware peripherals)?
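
To make the comparison clock-independent, the three measurements can be converted to core cycles (time in µs multiplied by clock in MHz). The following is a minimal sketch of that arithmetic; the function name `cycles_spent` is illustrative only and is not part of any TI SDK.

```c
#include <stdint.h>

/* Core cycles consumed by a run: execution time in microseconds
 * multiplied by the core clock in MHz (us * cycles/us = cycles). */
static uint32_t cycles_spent(uint32_t time_us, uint32_t clock_mhz)
{
    return time_us * clock_mhz;
}
```

With the figures above, `cycles_spent(280, 200)` gives 56,000 cycles for the F28379D, `cycles_spent(215, 800)` gives 172,000 cycles for the R5F, and `cycles_spent(30, 1000)` gives 30,000 cycles for the A53. The R5F therefore spends roughly three times as many cycles as the Delfino on the same computation.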

Regards,

Andrean

  • Hi Andrean,

    Thanks for your query.

    "The same math computation takes:"

    Can you provide more information about what exactly you are running?

    For example:

    In which memory are the code and data stored?

    How is the data accessed?

    Best Regards

    Ashwani

  • Hello Ashwani,

    We are implementing multi-axis motion control. The computation consists of trigonometric functions, IIR filters, multiplications, divisions, and the like. We use only the "float" type for computation.

    The code and data are stored in DDR RAM. All the data (arrays) are accessed randomly. The application runs on FreeRTOS.

    Best regards,

    Andrean

  • Hi Andrean,

    The major delay comes from memory access, especially when the code and data are placed in DDR RAM. The AM64x has three levels of memory: TCM, OCRAM, and DDR. The access latencies for each level are tabulated in the application note "Sitara™ AM64x/AM243x Benchmarks: Cortex-R5 Memory Access Latency (Rev. B)" (ti.com).

    Try to place the code and data, especially the most frequently used code and data, in TCM and OCRAM. Please avoid using DDR as much as possible.

    Best regards,

    Ming

  • Hello Ming,

    Thanks for the fast answer!

    We use DDR, but data and instructions are cached. We have just rerun the same tests with the cache disabled: computing the same motion on the R5F then takes 3 ms instead of 215 µs. We also tried placing the data and code in TCM; the R5F then performs the same as it does from DDR with the cache enabled.

    Is there anything else that could help to improve the performance of the R5 core?

    Best regards, 

    Andrean

  • Hi Andrean,

    Could you please share, or check, your linker.cmd file and the SysConfig settings for the memory regions?

    Here are some guidelines:

    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1236505/faq-am64x-profinet-component-placement-recommendation-for-sitara-mpu-memory

    Best Regards

    Ashwani

  • Hello Ashwani,

    This is the linker file:

    
     /* This is the stack that is used by code running within main()
      * In case of NORTOS,
      * - This means all the code outside of ISR uses this stack
      * In case of FreeRTOS
      * - This means all the code until vTaskStartScheduler() is called in main()
      *   uses this stack.
      * - After vTaskStartScheduler() each task created in FreeRTOS has its own stack
      */
    
     --stack_size=8192
    /* This is the heap size for malloc() API in NORTOS and FreeRTOS
    * This is also the heap used by pvPortMalloc in FreeRTOS
    */
     --heap_size=200000
    -e_vectors  /* This is the entry of the application, _vectors MUST be placed at starting address 0x0 */
    
    /* This is the size of stack when R5 is in IRQ mode
     * In NORTOS,
     * - Here interrupt nesting is enabled
     * - This is the stack used by ISRs registered as type IRQ
     * In FreeRTOS,
     * - Here interrupt nesting is enabled
     * - This is the stack that is used initially when an IRQ is received
     * - But then the mode is switched to SVC mode and SVC stack is used for all user ISR callbacks
     * - Hence in FreeRTOS, IRQ stack size is less and SVC stack size is more
     */
    __IRQ_STACK_SIZE = 256;
    /* This is the size of stack when R5 is in FIQ mode
     * - In both NORTOS and FreeRTOS nesting is disabled for FIQ
     */
    __FIQ_STACK_SIZE = 256;
    __SVC_STACK_SIZE = 4096; /* This is the size of stack when R5 is in SVC mode */
    __ABORT_STACK_SIZE = 256;  /* This is the size of stack when R5 is in ABORT mode */
    __UNDEFINED_STACK_SIZE = 256;  /* This is the size of stack when R5 is in UNDEF mode */
    
    
    
    SECTIONS
    {
        .vectors  : {
        } > R5F_VECS   , palign(8) 
    
    
        GROUP  :   {
        .text.hwi : {
        } palign(8)
        .text.cache : {
        } palign(8)
        .text.mpu : {
        } palign(8)
        .text.boot : {
        } palign(8)
        .text:abort : {
        } palign(8)
        } > MSRAM  
    
    
        GROUP  :   {
        .text : {
        } palign(8)
        .rodata : {
        } palign(8)
        } > DDR  
    
    
        GROUP  :   {
        .data : {
        } palign(8)
        } > DDR  
    
    
        GROUP  :   {
        .bss : {
        } palign(8)
        RUN_START(__BSS_START)
        RUN_END(__BSS_END)
        .sysmem : {
        } palign(8)
        .stack : {
        } palign(8)
        } > DDR  
    
    
        GROUP  :   {
        .irqstack : {
            . = . + __IRQ_STACK_SIZE;
        } align(8)
        RUN_START(__IRQ_STACK_START)
        RUN_END(__IRQ_STACK_END)
        .fiqstack : {
            . = . + __FIQ_STACK_SIZE;
        } align(8)
        RUN_START(__FIQ_STACK_START)
        RUN_END(__FIQ_STACK_END)
        .svcstack : {
            . = . + __SVC_STACK_SIZE;
        } align(8)
        RUN_START(__SVC_STACK_START)
        RUN_END(__SVC_STACK_END)
        .abortstack : {
            . = . + __ABORT_STACK_SIZE;
        } align(8)
        RUN_START(__ABORT_STACK_START)
        RUN_END(__ABORT_STACK_END)
        .undefinedstack : {
            . = . + __UNDEFINED_STACK_SIZE;
        } align(8)
        RUN_START(__UNDEFINED_STACK_START)
        RUN_END(__UNDEFINED_STACK_END)
        } > MSRAM  
    
    
        GROUP  :   {
        .ARM.exidx : {
        } palign(8)
        .init_array : {
        } palign(8)
        .fini_array : {
        } palign(8)
        } > MSRAM  
    
        .bss.user_shared_mem (NOLOAD) : {
        } > USER_SHM_MEM    
    
        .bss.log_shared_mem (NOLOAD) : {
        } > LOG_SHM_MEM    
    
        .bss.ipc_vring_mem (NOLOAD) : {
        } > RTOS_NORTOS_IPC_SHM_MEM    
    
        .bss.nocache (NOLOAD) : {
        } > NON_CACHE_MEM    
    
    
        GROUP  :   {
        motion : {
        } align(8)
        } > R5F_TCMB0  
    
    
    }
    
    
    MEMORY
    {
        R5F_VECS   : ORIGIN = 0x0 , LENGTH = 0x40 
        R5F_TCMA   : ORIGIN = 0x40 , LENGTH = 0x7FC0 
        R5F_TCMB0   : ORIGIN = 0x41010000 , LENGTH = 0x8000 
        NON_CACHE_MEM   : ORIGIN = 0x70060000 , LENGTH = 0x8000 
        MSRAM   : ORIGIN = 0x70080000 , LENGTH = 0x40000 
        USER_SHM_MEM   : ORIGIN = 0x701D0000 , LENGTH = 0x80 
        LOG_SHM_MEM   : ORIGIN = 0x701D0080 , LENGTH = 0x3F80 
        RTOS_NORTOS_IPC_SHM_MEM   : ORIGIN = 0x701D4000 , LENGTH = 0xC000 
        FLASH   : ORIGIN = 0x60100000 , LENGTH = 0x80000 
        DDR   : ORIGIN = 0x80000000 , LENGTH = 0x1F0000 
    
        /* For memory Regions not defined in this core but shared by other cores with the current core */
    
    
    }
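
    The `motion` output section above is mapped to R5F_TCMB0. The following is a minimal sketch of how a routine might be assigned to that section from C, assuming a compiler that accepts the GCC/Clang-style `section` attribute. The `USE_MOTION_SECTION` build switch, the `MOTION_CODE` macro, and the `biquad_t` / `biquad_step` names are hypothetical and are not taken from the actual application.

    ```c
    /* Hypothetical build switch: define USE_MOTION_SECTION when building for
     * the R5F so the routine is emitted into the "motion" input section, which
     * the linker file maps to R5F_TCMB0. Left undefined, the code builds
     * unchanged on a host machine. */
    #ifdef USE_MOTION_SECTION
    #define MOTION_CODE __attribute__((section("motion")))
    #else
    #define MOTION_CODE
    #endif

    /* Single-precision direct-form I biquad (one IIR filter stage). */
    typedef struct {
        float b0, b1, b2;   /* feed-forward coefficients */
        float a1, a2;       /* feedback coefficients, a0 normalized to 1 */
        float x1, x2;       /* previous two inputs */
        float y1, y2;       /* previous two outputs */
    } biquad_t;

    /* Processes one sample and updates the filter state. */
    MOTION_CODE float biquad_step(biquad_t *f, float x)
    {
        float y = f->b0 * x + f->b1 * f->x1 + f->b2 * f->x2
                - f->a1 * f->y1 - f->a2 * f->y2;

        f->x2 = f->x1;
        f->x1 = x;
        f->y2 = f->y1;
        f->y1 = y;
        return y;
    }
    ```

    Only code carrying the attribute lands in `motion`. Data that the routine touches (here the `biquad_t` instance) stays wherever `.data`, `.bss`, or the stack are placed, which is DDR in this linker file, unless it is assigned to a TCM section as well.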
    

    This is the MPU configuration:

    Best regards,

    Andrean

  • Hi Andrean,

    Thanks for sharing these.

    Can you please share the generated memory map file as well?

    Best Regards

    Ashwani

  • Hello Ashwani,

    Yes, sure.

    Best regards,

    Andrean

  • Thanks Andrean,

    Allow me some time to review and get back to you.

    Best Regards

    Ashwani

  • Hello Ashwani,

    Did you have time to review my files?

    Best regards,

    Andrean

  • Hi Andrean,

    Sorry for the delay in responding.

    I am still discussing this internally.

    Please allow me some more time.

    Thank you for your patience.

    Best Regards

    Ashwani

  • Hi Andrean,

    Can you try the following in linker.cmd?

    Add this region to the MEMORY block:

    MSRAM1 : ORIGIN = 0x700C0000 , LENGTH = 0x100000

    Then change all references to DDR in the SECTIONS block to MSRAM1.
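
    Before adding a region by hand, it can be worth confirming that it does not collide with the regions already defined. The following is a minimal host-side sketch that checks the MEMORY block posted earlier in this thread together with the proposed MSRAM1. The type and function names (`mem_region_t`, `region_end`, `regions_overlap`, `count_overlaps`) are illustrative only and are not part of any TI tool.

    ```c
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        const char *name;
        uint32_t origin;   /* ORIGIN from linker.cmd */
        uint32_t length;   /* LENGTH from linker.cmd */
    } mem_region_t;

    /* Regions copied from the posted MEMORY block, plus the proposed MSRAM1. */
    static const mem_region_t regions[] = {
        { "R5F_VECS",                0x00000000u, 0x00000040u },
        { "R5F_TCMA",                0x00000040u, 0x00007FC0u },
        { "R5F_TCMB0",               0x41010000u, 0x00008000u },
        { "NON_CACHE_MEM",           0x70060000u, 0x00008000u },
        { "MSRAM",                   0x70080000u, 0x00040000u },
        { "MSRAM1",                  0x700C0000u, 0x00100000u },
        { "USER_SHM_MEM",            0x701D0000u, 0x00000080u },
        { "LOG_SHM_MEM",             0x701D0080u, 0x00003F80u },
        { "RTOS_NORTOS_IPC_SHM_MEM", 0x701D4000u, 0x0000C000u },
        { "FLASH",                   0x60100000u, 0x00080000u },
        { "DDR",                     0x80000000u, 0x001F0000u },
    };
    #define NUM_REGIONS (sizeof regions / sizeof regions[0])

    /* Exclusive end address; computed in 64 bits so origin + length cannot wrap. */
    static uint64_t region_end(const mem_region_t *r)
    {
        return (uint64_t)r->origin + r->length;
    }

    /* Two half-open ranges overlap when each starts before the other ends. */
    static bool regions_overlap(const mem_region_t *a, const mem_region_t *b)
    {
        return a->origin < region_end(b) && b->origin < region_end(a);
    }

    /* Returns the number of overlapping region pairs in the table. */
    static size_t count_overlaps(void)
    {
        size_t n = 0;
        for (size_t i = 0; i < NUM_REGIONS; i++) {
            for (size_t j = i + 1; j < NUM_REGIONS; j++) {
                if (regions_overlap(&regions[i], &regions[j])) {
                    n++;
                }
            }
        }
        return n;
    }
    ```

    MSRAM1 starts exactly where MSRAM ends (0x700C0000) and ends at 0x701C0000, below USER_SHM_MEM at 0x701D0000, so `count_overlaps()` returns 0 for this table. Note that MSRAM1 (0x100000 bytes) is smaller than the DDR region it replaces (0x1F0000 bytes), so the relocated sections, including the 200000-byte heap, have to fit in 1 MB.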

    Best regards,

    Ming

  • Hello Ming,

    If I place everything in MSRAM, the execution time improves:

    - 180 µs without optimization;
    - 80 µs with optimization level 3.

    Thank you for the help!

    Best regards,

    Andrean