AM6442: Cortex R5F Performance

Andrean С.

Part Number: AM6442
Other Parts Discussed in Thread: SYSCONFIG

Hello,

We are trying to switch from F28379D to a more performant AM64 platform. AM64 seems like a great option, but some measurements have surprised us. The same math computation takes:

280 µs on Delfino at 200 MHz, optimization level 2;
215 µs on the R5F on AM64 at 800 MHz, optimization level 3;
30 µs on the A53 on AM64 at 1 GHz, optimization level 0.

The R5F result surprises us. Is a 200 MHz Delfino supposed to perform almost the same as an 800 MHz Arm R5F? Is there a way to improve the performance of the R5F core (optimization settings, enabling HW peripherals)?
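To make the comparison clock-independent, the timings above can be converted to elapsed core cycles. The following is a minimal sketch; the helper name `elapsed_cycles` is hypothetical, and it counts clock cycles including any memory stalls, not executed instructions.

```c
#include <stdint.h>

/* Elapsed core clock cycles for a run lasting time_us microseconds on a core
 * clocked at clock_mhz MHz. One microsecond at 1 MHz is exactly one cycle,
 * so cycles = time_us * clock_mhz. */
static uint32_t elapsed_cycles(uint32_t time_us, uint32_t clock_mhz)
{
    return time_us * clock_mhz;
}
```

With the figures quoted above, `elapsed_cycles(280, 200)` gives 56,000 cycles for the Delfino, `elapsed_cycles(215, 800)` gives 172,000 for the R5F, and `elapsed_cycles(30, 1000)` gives 30,000 for the A53. The R5F therefore spends roughly three times as many cycles as the Delfino on the same computation, which is consistent with time being lost outside the arithmetic itself.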

Regards,

Andrean

over 2 years ago

0 Ashwani Goel over 2 years ago

TI__Mastermind 27610 points

Hi Andrean,

Thanks for your query.

Andrean С. said:
The same math computation takes:

Can you provide more information on what exactly you are trying to do here?

For example:

In which memory are the data and code stored?

How is the data accessed?

Best Regards

Ashwani

0 Andrean С. over 2 years ago in reply to Ashwani Goel

Prodigy 180 points

Hello Ashwani,

We are implementing multi-axis motion control. It uses trigonometric functions, IIR filters, multiplications, divisions and the like. For computation, we only use the "float" type.

The code and data are stored in DDR RAM. All the data (arrays) is accessed randomly. The application runs on FreeRTOS.

Best regards,

Andrean

0 Ming Wei over 2 years ago in reply to Andrean С.

TI__Guru 56855 points

Hi Andrean,

The main delay comes from memory access, especially when the code and data are placed in DDR RAM. The AM64x has three levels of memory: TCM, OCRAM and DDR. The access latencies for each level are tabulated in Sitara AM64x/AM243x Benchmarks: Cortex-R5 Memory Access Latency (Rev. B) (ti.com).

Try to place the code and data, especially the most frequently used code and data, in TCM and OCRAM. Please avoid using DDR as much as possible.

Best regards,

Ming

0 Andrean С. over 2 years ago in reply to Ming Wei

Prodigy 180 points

Hello Ming,

Thanks for the fast answer!

We use DDR, but data and instructions are cached. We have just rerun the same tests with the cache disabled: it takes 3 ms (instead of 215 µs) to compute the same motion on the R5F. We also tried placing the data and code in TCM; the R5F then performs the same as it does from DDR with the cache enabled.

Is there anything else that could help to improve the performance of the R5 core?

Best regards,

Andrean

0 Ashwani Goel over 2 years ago in reply to Andrean С.

TI__Mastermind 27610 points

Hi Andrean С,

Can you please share or check the memory region settings in your linker.cmd file and SysConfig file?

Here are some guidelines:

https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1236505/faq-am64x-profinet-component-placement-recommendation-for-sitara-mpu-memory

Best Regards

Ashwani

0 Andrean С. over 2 years ago in reply to Ashwani Goel

Prodigy 180 points

Hello Ashwani,

This is the linker file:


 /* This is the stack that is used by code running within main()
  * In case of NORTOS,
  * - This means all the code outside of ISR uses this stack
  * In case of FreeRTOS
  * - This means all the code until vTaskStartScheduler() is called in main()
  *   uses this stack.
  * - After vTaskStartScheduler() each task created in FreeRTOS has its own stack
  */

 --stack_size=8192
/* This is the heap size for malloc() API in NORTOS and FreeRTOS
* This is also the heap used by pvPortMalloc in FreeRTOS
*/
 --heap_size=200000
-e_vectors  /* This is the entry of the application, _vectors MUST be placed starting at address 0x0 */

/* This is the size of stack when R5 is in IRQ mode
 * In NORTOS,
 * - Here interrupt nesting is enabled
 * - This is the stack used by ISRs registered as type IRQ
 * In FreeRTOS,
 * - Here interrupt nesting is enabled
 * - This is the stack that is used initially when an IRQ is received
 * - But then the mode is switched to SVC mode and SVC stack is used for all user ISR callbacks
 * - Hence in FreeRTOS, IRQ stack size is less and SVC stack size is more
 */
__IRQ_STACK_SIZE = 256;
/* This is the size of stack when R5 is in FIQ mode
 * - In both NORTOS and FreeRTOS nesting is disabled for FIQ
 */
__FIQ_STACK_SIZE = 256;
__SVC_STACK_SIZE = 4096; /* This is the size of stack when R5 is in SVC mode */
__ABORT_STACK_SIZE = 256;  /* This is the size of stack when R5 is in ABORT mode */
__UNDEFINED_STACK_SIZE = 256;  /* This is the size of stack when R5 is in UNDEF mode */



SECTIONS
{
    .vectors  : {
    } > R5F_VECS   , palign(8) 


    GROUP  :   {
    .text.hwi : {
    } palign(8)
    .text.cache : {
    } palign(8)
    .text.mpu : {
    } palign(8)
    .text.boot : {
    } palign(8)
    .text:abort : {
    } palign(8)
    } > MSRAM  


    GROUP  :   {
    .text : {
    } palign(8)
    .rodata : {
    } palign(8)
    } > DDR  


    GROUP  :   {
    .data : {
    } palign(8)
    } > DDR  


    GROUP  :   {
    .bss : {
    } palign(8)
    RUN_START(__BSS_START)
    RUN_END(__BSS_END)
    .sysmem : {
    } palign(8)
    .stack : {
    } palign(8)
    } > DDR  


    GROUP  :   {
    .irqstack : {
        . = . + __IRQ_STACK_SIZE;
    } align(8)
    RUN_START(__IRQ_STACK_START)
    RUN_END(__IRQ_STACK_END)
    .fiqstack : {
        . = . + __FIQ_STACK_SIZE;
    } align(8)
    RUN_START(__FIQ_STACK_START)
    RUN_END(__FIQ_STACK_END)
    .svcstack : {
        . = . + __SVC_STACK_SIZE;
    } align(8)
    RUN_START(__SVC_STACK_START)
    RUN_END(__SVC_STACK_END)
    .abortstack : {
        . = . + __ABORT_STACK_SIZE;
    } align(8)
    RUN_START(__ABORT_STACK_START)
    RUN_END(__ABORT_STACK_END)
    .undefinedstack : {
        . = . + __UNDEFINED_STACK_SIZE;
    } align(8)
    RUN_START(__UNDEFINED_STACK_START)
    RUN_END(__UNDEFINED_STACK_END)
    } > MSRAM  


    GROUP  :   {
    .ARM.exidx : {
    } palign(8)
    .init_array : {
    } palign(8)
    .fini_array : {
    } palign(8)
    } > MSRAM  

    .bss.user_shared_mem (NOLOAD) : {
    } > USER_SHM_MEM    

    .bss.log_shared_mem (NOLOAD) : {
    } > LOG_SHM_MEM    

    .bss.ipc_vring_mem (NOLOAD) : {
    } > RTOS_NORTOS_IPC_SHM_MEM    

    .bss.nocache (NOLOAD) : {
    } > NON_CACHE_MEM    


    GROUP  :   {
    motion : {
    } align(8)
    } > R5F_TCMB0  


}


MEMORY
{
    R5F_VECS   : ORIGIN = 0x0 , LENGTH = 0x40 
    R5F_TCMA   : ORIGIN = 0x40 , LENGTH = 0x7FC0 
    R5F_TCMB0   : ORIGIN = 0x41010000 , LENGTH = 0x8000 
    NON_CACHE_MEM   : ORIGIN = 0x70060000 , LENGTH = 0x8000 
    MSRAM   : ORIGIN = 0x70080000 , LENGTH = 0x40000 
    USER_SHM_MEM   : ORIGIN = 0x701D0000 , LENGTH = 0x80 
    LOG_SHM_MEM   : ORIGIN = 0x701D0080 , LENGTH = 0x3F80 
    RTOS_NORTOS_IPC_SHM_MEM   : ORIGIN = 0x701D4000 , LENGTH = 0xC000 
    FLASH   : ORIGIN = 0x60100000 , LENGTH = 0x80000 
    DDR   : ORIGIN = 0x80000000 , LENGTH = 0x1F0000 

    /* For memory Regions not defined in this core but shared by other cores with the current core */


}
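The `motion` output section above is mapped to R5F_TCMB0, but code only lands there if it is emitted into an input section of that name. The following is a minimal sketch of how a hot routine might be tagged, assuming a GCC/Clang-style `section` attribute and assuming the linker collects input sections named `motion` into the output section of the same name. The filter type and function are hypothetical and stand in for the actual motion code.

```c
/* Assumption: the toolchain accepts the GCC/Clang section attribute.
 * "motion" matches the output section name used in the linker file above. */
#define IN_MOTION_SECTION __attribute__((section("motion")))

/* Hypothetical first-order low-pass (IIR) filter state. */
typedef struct {
    float a; /* smoothing coefficient, 0 < a <= 1 */
    float y; /* previous output */
} lowpass_t;

/* One filter step: y += a * (x - y). Tagged so that the function body is
 * emitted into the "motion" section instead of the default .text. */
IN_MOTION_SECTION
float lowpass_step(lowpass_t *f, float x)
{
    f->y += f->a * (x - f->y);
    return f->y;
}
```

Whether the function really ended up in TCM can be confirmed from the generated map file by looking up the address of `lowpass_step`. Anything it calls that is not tagged the same way (for example library math routines) still runs from wherever `.text` is placed.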

This is the MPU configuration:

Best regards,

Andrean

0 Ashwani Goel over 2 years ago in reply to Andrean С.

TI__Mastermind 27610 points

Hi Andrean,

Thanks for the files.

Can you please share the generated memory_map file as well?

Best Regards

Ashwani

0 Andrean С. over 2 years ago in reply to Ashwani Goel

Prodigy 180 points

Hello Ashwani,

Yes, sure.

Best regards,

Andrean

0 Ashwani Goel over 2 years ago in reply to Andrean С.

TI__Mastermind 27610 points

Thanks Andrean,

Allow me some time to review and get back to you.

Best Regards

Ashwani

0 Andrean С. over 2 years ago in reply to Ashwani Goel

Prodigy 180 points

Hello Ashwani,

Did you have time to review my files?

Best regards,

Andrean

0 Ashwani Goel over 2 years ago in reply to Andrean С.

TI__Mastermind 27610 points

Hi Andrean С,

Sorry for the delay in responding.

I am still discussing this internally.

Please allow me some more time.

Thanks for your patience.

Best Regards

Ashwani

+1 Ming Wei over 2 years ago in reply to Ashwani Goel

TI__Guru 56855 points

Hi Andrean,

Can you try the following in linker.cmd?

Add this region to the MEMORY block:

MSRAM1 : ORIGIN = 0x700C0000 , LENGTH = 0x100000

Then change all references to DDR in the SECTIONS block to MSRAM1.
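Before rebuilding, it may be worth checking that the new region does not collide with the regions already listed in the linker file. The following is a minimal sketch using only the origins and lengths quoted in this thread; the `region_t` type and `regions_overlap` helper are hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* One MEMORY entry from the linker file: name, start address, size in bytes. */
typedef struct {
    const char *name;
    uint32_t origin;
    uint32_t length;
} region_t;

/* True when the half-open address ranges [origin, origin + length) intersect.
 * End addresses are computed in 64 bits so that a region ending at the top
 * of the 32-bit address space does not wrap around. */
static bool regions_overlap(region_t a, region_t b)
{
    uint64_t a_end = (uint64_t)a.origin + a.length;
    uint64_t b_end = (uint64_t)b.origin + b.length;
    return a.origin < b_end && b.origin < a_end;
}

/* Values copied from the linker file and from the suggestion above. */
static const region_t NON_CACHE_MEM = { "NON_CACHE_MEM", 0x70060000u, 0x8000u };
static const region_t MSRAM         = { "MSRAM",         0x70080000u, 0x40000u };
static const region_t MSRAM1        = { "MSRAM1",        0x700C0000u, 0x100000u };
static const region_t USER_SHM_MEM  = { "USER_SHM_MEM",  0x701D0000u, 0x80u };
```

With these values MSRAM ends at 0x700C0000, exactly where MSRAM1 begins, and MSRAM1 ends at 0x701C0000, below USER_SHM_MEM at 0x701D0000, so `regions_overlap` returns false for each pair. The check covers only the regions named in this linker file and says nothing about on-chip RAM used by other cores or by the bootloader. Note also that MSRAM1 (0x100000 bytes) is smaller than the DDR region it replaces (0x1F0000 bytes), so the link succeeds only if the relocated sections, including the 200000-byte heap, fit within it.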

Best regards,

Ming

0 Andrean С. over 2 years ago in reply to Ming Wei

Prodigy 180 points

Hello Ming,

If I place everything in MSRAM, the execution time improves:

- 180 µs without optimization
- 80 µs with optimization level 3

Thank you for the help!

Best regards,

Andrean
