This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

MCU-PLUS-SDK-AM263X: Performance degrades when code and data are placed in the TCM area.

Part Number: MCU-PLUS-SDK-AM263X

Hi, Expert

I would like to reduce the CPU load by placing the functions and variables used in the PWM ISR into TCMA and TCMB.

Following the link below, I assigned all of the functions and variables to the TCM area (verified through the .map file).

https://e2e.ti.com/support/microcontrollers/arm-based-microcontrollers-group/arm-based-microcontrollers/f/arm-based-microcontrollers-forum/1165683/faq-mcu-plus-sdk-am263x-how-to-add-code-into-tcm-using-linker-cmd-file

However, when I measured the load, the performance actually got worse.

The test proceeded as follows.

1. Source Code

uint32_t gTestIndex __attribute__((__section__(".controldata")));
uint32_t gTestTable[1000] __attribute__((__section__(".controldata")));

__attribute__((__section__(".controlfnc"))) void MakeLoad(void)
{
    for (gTestIndex = 0U; gTestIndex < 1000; gTestIndex++)
    {
        gTestTable[gTestIndex] = gTestIndex + 1U;
    }
}

uint16_t gCounterBufferStart[100];
uint16_t gCounterBufferEnd[100];
uint16_t gCounterBufferDiff[100];
uint32_t gCounterBufferStartIndex = 0U;
uint32_t gCounterBufferEndIndex = 0U;

static void EPWM_ISR(void *handle)
{
    volatile bool status;

    if (gCounterBufferStartIndex < 100)
    {
        gCounterBufferStart[gCounterBufferStartIndex++] = EPWM_getTimeBaseCounterValue(0x50080000ul);
    }
    else
    {
        /* NOP */
    }

    MakeLoad();

    if (gCounterBufferEndIndex < 100)
    {
        gCounterBufferEnd[gCounterBufferEndIndex] = EPWM_getTimeBaseCounterValue(0x50080000ul);
        gCounterBufferDiff[gCounterBufferEndIndex] = gCounterBufferEnd[gCounterBufferEndIndex] - gCounterBufferStart[gCounterBufferEndIndex];
        gCounterBufferEndIndex++;
    }
    else
    {
        /* NOP */
    }

    status = EPWM_getEventTriggerInterruptStatus(gEpwmBaseAddr);
    if (status == true)
    {
        EPWM_clearEventTriggerInterruptFlag(gEpwmBaseAddr);
    }

    return;
}

2. Test configuration

- The PWM ISR fires when the TBC (time-base counter) is 0.

- To measure the execution time of MakeLoad, the TBC is read immediately before and after the call, and the load is evaluated as the difference between the two readings.

3. Test Result

- When the functions and variables are not assigned to the TCM via the "__attribute__" keyword:

Result: both the function and the variables are allocated in OCRAM.

- When the functions and variables are assigned to the TCM via the "__attribute__" keyword:

Result: the function is placed in TCMA and the variables in TCMB.

The above code was written for testing, and the code to be applied in practice is more complex.

However, the load increased when the actual code was assigned to TCM in the same way.

What could be causing this?

Please let us know if you need any data for the analysis.

Best Regards

Jiung Choi

  • Hi Jiung

    Your ISR (static void EPWM_ISR(void *handle)) is, I think, in OCRAM, since it is not itself placed in TCM.

    So the program has to jump from OCRAM to TCM when MakeLoad() is executed. This branch might take more time because execution moves from one memory region to another.

    I see two options here:

    1. Measure the execution time inside the MakeLoad API itself, at the start and end of its execution.
    2. Or place the entire ISR code in TCM so that everything runs from TCM.
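    A minimal sketch of option 2, reusing the .controlfnc section name from your code. This is only an illustration: the EPWM register accesses are omitted so the sketch stays self-contained, and on the target they would remain exactly as in your original ISR.

```c
#include <stdint.h>

uint32_t gTestTable[1000];

/* Both the worker and the ISR are placed in the same code section,
 * which the linker.cmd maps to R5F_TCMA. The call from EPWM_ISR to
 * MakeLoad then stays inside TCM, avoiding a cross-region branch. */
__attribute__((__section__(".controlfnc")))
void MakeLoad(void)
{
    uint32_t i;
    for (i = 0U; i < 1000U; i++)
    {
        gTestTable[i] = i + 1U;
    }
}

__attribute__((__section__(".controlfnc")))
void EPWM_ISR(void *handle)
{
    (void)handle;
    MakeLoad();
    /* EPWM event-flag handling omitted in this sketch */
}
```

    The section attribute on the ISR is the key difference from the original test code, where only MakeLoad carried the attribute.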
    - In order to measure the time during which MakeLoad is executed, TBC is measured before and after the function, and the load is evaluated as the difference value.

    The measurement should be done using the CPU cycle counter.

    Please find the APIs here:

    https://software-dl.ti.com/mcu-plus-sdk/esd/AM263X/latest/exports/docs/api_guide_am263x/KERNEL_DPL_CYCLE_COUNTER_PAGE.html

    https://software-dl.ti.com/mcu-plus-sdk/esd/AM263X/latest/exports/docs/api_guide_am263x/group__KERNEL__DPL__CYCLE__COUNTER.html

    What do you mean by TBC here?

    For more information on Optimizations:

    https://www.ti.com/lit/an/sprad27a/sprad27a.pdf?ts=1678344317965&ref_url=https%253A%252F%252Fwww.ti.com%252Fproduct%252FAM2634

    Thanks & regards

    Sri Vidya

  • Hi, Sri Vidya

    First of all, the ISR was also placed in TCM.

    However, the load is still higher than before it was moved to TCM.

    Are there any other methods that can be applied?

    And TBC means TimeBaseCounter of PWM module.

    Also we want everything to be done within the PWM ISR.

    Therefore, the PWM TBC becomes the load reference.

    Unless there is a specific reason to measure with the CPU cycle counter, we judge this method to be more suitable for us.

    Currently we have PWM ISR triggered when TBC is 0, and TBC is increased up to 1000.

    When TBC reaches 1000, it is automatically reset to 0.
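    One caveat with this measurement method: since the TBC wraps back to 0 after 1000, a plain uint16_t subtraction (end - start) is only valid when MakeLoad finishes before the wrap point. A wrap-aware difference, assuming an up-count from 0 to TBPRD = 1000 and then a reset (a sketch, not SDK code), would be:

```c
#include <stdint.h>

#define EPWM_TBPRD (1000U) /* assumption: up-count 0..1000, then reset to 0 */

/* Wrap-aware counter difference. A plain uint16_t subtraction wraps
 * modulo 65536, not modulo the PWM period, so it gives a wrong value
 * whenever the counter rolls over between the two reads. This version
 * is correct as long as the measured interval is shorter than one
 * full PWM period. */
uint16_t tbcElapsed(uint16_t start, uint16_t end)
{
    if (end >= start)
    {
        return (uint16_t)(end - start);
    }
    /* counter wrapped once: 1001 counts per period (0..1000 inclusive) */
    return (uint16_t)((EPWM_TBPRD + 1U - start) + end);
}
```

    For example, start = 990 and end = 10 gives 21 counts, whereas the raw subtraction would give 64556.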

    +++

    All of the ISR routines are assigned to TCMA, as shown below.
    It is better than before, but the load is still higher than before the allocation to TCM.

    Best Regards

    Jiung Choi

  • Hi

    And TBC means TimeBaseCounter of PWM module.

    Understood.

    Could you share your linker.cmd file to see where your program stack and IRQ stack are present?

    Also, are your build settings in debug mode or release mode?

    uint16_t gCounterBufferStart[100];
    uint16_t gCounterBufferEnd[100];
    uint16_t gCounterBufferDiff[100];
    uint32_t gCounterBufferStartIndex = 0U;
    uint32_t gCounterBufferEndIndex = 0U;

    volatile bool status;

    Could you also tell me if these above variables are in TCM or OCRAM?

    Regards

    Sri Vidya

  • Hi, Sri Vidya

    Unfortunately, file uploads are not possible due to our internal security policy.

    However, the IRQ stack area is also allocated to TCM as shown below.

    /* This is where the stacks for different R5F modes go */
    .irqstack: {. = . + __IRQ_STACK_SIZE;} align(8) > R5F_TCMA
    RUN_START(__IRQ_STACK_START)
    RUN_END(__IRQ_STACK_END)
    .fiqstack: {. = . + __FIQ_STACK_SIZE;} align(8) > R5F_TCMB
    RUN_START(__FIQ_STACK_START)
    RUN_END(__FIQ_STACK_END)
    .svcstack: {. = . + __SVC_STACK_SIZE;} align(8) > R5F_TCMB
    RUN_START(__SVC_STACK_START)
    RUN_END(__SVC_STACK_END)
    .abortstack: {. = . + __ABORT_STACK_SIZE;} align(8) > R5F_TCMB
    RUN_START(__ABORT_STACK_START)
    RUN_END(__ABORT_STACK_END)
    .undefinedstack: {. = . + __UNDEFINED_STACK_SIZE;} align(8) > R5F_TCMB
    RUN_START(__UNDEFINED_STACK_START)
    RUN_END(__UNDEFINED_STACK_END)

    In addition, the load is the same even when the variables used for the load measurement are assigned to TCM.

    (The addresses on the right of the figure below show that they are assigned to TCMB.)

    Best Regards

    Jiung Choi

  • Hi, Sri Vidya

    Below is the complete linker.cmd file currently in use.

    Best Regards

    Jiung Choi


    /* This is the stack that is used by code running within main()
    * In case of NORTOS,
    * - This means all the code outside of ISR uses this stack
    * In case of FreeRTOS
    * - This means all the code until vTaskStartScheduler() is called in main()
    * uses this stack.
    * - After vTaskStartScheduler() each task created in FreeRTOS has its own stack
    */
    --stack_size=16384
    /* This is the heap size for malloc() API in NORTOS and FreeRTOS
    * This is also the heap used by pvPortMalloc in FreeRTOS
    */
    --heap_size=32768
    -e_vectors /* This is the entry point of the application; _vectors MUST be placed at starting address 0x0 */

    /* This is the size of stack when R5 is in IRQ mode
    * In NORTOS,
    * - Here interrupt nesting is enabled
    * - This is the stack used by ISRs registered as type IRQ
    * In FreeRTOS,
    * - Here interrupt nesting is enabled
    * - This is the stack that is used initially when an IRQ is received
    * - But then the mode is switched to SVC mode and the SVC stack is used for all user ISR callbacks
    * - Hence in FreeRTOS, the IRQ stack size is smaller and the SVC stack size is larger
    */
    __IRQ_STACK_SIZE = 256;
    /* This is the size of stack when R5 is in FIQ mode
    * - In both NORTOS and FreeRTOS, nesting is disabled for FIQ
    */
    __FIQ_STACK_SIZE = 256;
    __SVC_STACK_SIZE = 4096; /* This is the size of stack when R5 is in SVC mode */
    __ABORT_STACK_SIZE = 256; /* This is the size of stack when R5 is in ABORT mode */
    __UNDEFINED_STACK_SIZE = 256; /* This is the size of stack when R5 is in UNDEF mode */

    SECTIONS
    {
    /* This has the R5F entry point and vector table, this MUST be at 0x0 */
    .vectors:{} palign(8) > R5F_VECS

    /* This has the R5F boot code until the MPU is enabled; this MUST be at an address < 0x80000000,
    * i.e. this cannot be placed in DDR
    */
    GROUP {
    .text.hwi: palign(8)
    .text.cache: palign(8)
    .text.mpu: palign(8)
    .text.boot: palign(8)
    .text:abort: palign(8) /* this helps in loading symbols when using XIP mode */
    } > R5F_TCMA

    /* This is rest of code. This can be placed in DDR if DDR is available and needed */
    GROUP {
    .text: {} palign(8) /* This is where code resides */
    .rodata: {} palign(8) /* This is where const's go */
    } > OCRAM

    /* This is rest of initialized data. This can be placed in DDR if DDR is available and needed */
    GROUP {

    .data: {} palign(8) /* This is where initialized globals and static go */
    } > OCRAM

    /* This is rest of uninitialized data. This can be placed in DDR if DDR is available and needed */
    GROUP {
    .bss: {} palign(8) /* This is where uninitialized globals go */
    RUN_START(__BSS_START)
    RUN_END(__BSS_END)
    .sysmem: {} palign(8) /* This is where the malloc heap goes */
    .stack: {} palign(8) /* This is where the main() stack goes */
    } > OCRAM

    /* This is where the stacks for different R5F modes go */
    .irqstack: {. = . + __IRQ_STACK_SIZE;} align(8) > R5F_TCMA
    RUN_START(__IRQ_STACK_START)
    RUN_END(__IRQ_STACK_END)
    .fiqstack: {. = . + __FIQ_STACK_SIZE;} align(8) > R5F_TCMB
    RUN_START(__FIQ_STACK_START)
    RUN_END(__FIQ_STACK_END)
    .svcstack: {. = . + __SVC_STACK_SIZE;} align(8) > R5F_TCMB
    RUN_START(__SVC_STACK_START)
    RUN_END(__SVC_STACK_END)
    .abortstack: {. = . + __ABORT_STACK_SIZE;} align(8) > R5F_TCMB
    RUN_START(__ABORT_STACK_START)
    RUN_END(__ABORT_STACK_END)
    .undefinedstack: {. = . + __UNDEFINED_STACK_SIZE;} align(8) > R5F_TCMB
    RUN_START(__UNDEFINED_STACK_START)
    RUN_END(__UNDEFINED_STACK_END)

    /* Sections needed for C++ projects */
    GROUP {
    .ARM.exidx: {} palign(8) /* Needed for C++ exception handling */
    .init_array: {} palign(8) /* Contains function pointers called before main */
    .fini_array: {} palign(8) /* Contains function pointers called after main */
    } > OCRAM

    /* General purpose user shared memory, used in some examples */
    .bss.user_shared_mem (NOLOAD) : {} > USER_SHM_MEM
    /* this is used when Debug log's to shared memory are enabled, else this is not used */
    .bss.log_shared_mem (NOLOAD) : {} > LOG_SHM_MEM
    /* this is used only when IPC RPMessage is enabled, else this is not used */
    .bss.ipc_vring_mem (NOLOAD) : {} > RTOS_NORTOS_IPC_SHM_MEM
    /* this is used only when Secure IPC is enabled */
    .bss.sipc_hsm_queue_mem (NOLOAD) : {} > MAILBOX_HSM
    .bss.sipc_r5f_queue_mem (NOLOAD) : {} > MAILBOX_R5F

    .intvecs : {} palign(8) > R5F_TCMA
    .controlfnc : {} palign(8) > R5F_TCMA
    .controldata : {} palign(8) > R5F_TCMB
    }

    MEMORY
    {
    R5F_VECS : ORIGIN = 0x00000000 , LENGTH = 0x00000040
    R5F_TCMA : ORIGIN = 0x00000040 , LENGTH = 0x00007FC0
    R5F_TCMB : ORIGIN = 0x00080000 , LENGTH = 0x00008000

    /* when using multi-core applications, i.e. more than one R5F/M4F active, make sure
    * this memory does not overlap with the other R5Fs' memory
    */
    OCRAM : ORIGIN = 0x700C0000 , LENGTH = 0x40000

    /* This section can be used to put XIP section of the application in flash, make sure this does not overlap with
    * other CPUs. Also make sure to add a MPU entry for this section and mark it as cached and code executable
    */
    FLASH : ORIGIN = 0x60200000 , LENGTH = 0x80000

    /* shared memories that are used by RTOS/NORTOS cores */
    /* On R5F,
    * - make sure there is a MPU entry which maps below regions as non-cache
    */
    USER_SHM_MEM : ORIGIN = 0x701D0000, LENGTH = 0x00004000
    LOG_SHM_MEM : ORIGIN = 0x701D4000, LENGTH = 0x00004000
    PROGSIG : ORIGIN = 0x701FFFF0, LENGTH = 0x00000010

    /* MSS mailbox memory is used as shared memory; we don't use the bottom 32*12 bytes, since it's used as a SW queue by ipc_notify */
    RTOS_NORTOS_IPC_SHM_MEM : ORIGIN = 0x72000000, LENGTH = 0x3E80
    MAILBOX_HSM: ORIGIN = 0x44000000 , LENGTH = 0x000003CE
    MAILBOX_R5F: ORIGIN = 0x44000400 , LENGTH = 0x000003CE
    }

  • Hi Jiung Choi

    I will try to replicate your issue at my end and reply here soon. Please allow me some time; thanks for your patience.

    We have seen similar issues many times when the code is not placed entirely in TCM: the resulting jumps between TCM and OCRAM increase the execution time.

    I see that the global variables are in OCRAM:

    Could you move them to TCM?

    My recommendation would be to use the below linker.cmd file and run the application in release mode with optimization set to -Oz for better performance.

    linker.zip

    Regards

    Sri Vidya

  • Hi, Sri Vidya

    Thanks for the advice.

    Using the optimization option as an internal policy requires a lot of internal discussion.

    Please let us know if there is anything updated.

    Best Regards

    Jiung Choi

  • Hi, sorry

    We did see similar issue many times as the code is not placed entirely in TCM. And there would be jumps from TCM to OCRAM in the program which increases the execution time. 

    I see that the global variables are in OCRAM:

    I just updated my reply. Could you please check previous reply?

    Using the optimization option as an internal policy requires a lot of internal discussion.

    I understand.

    Could you try the options below, which are lower optimization levels compared to -Oz? Just providing this for more information.

    https://software-dl.ti.com/codegen/docs/tiarmclang/compiler_tools_user_guide/compiler_manual/using_compiler/compiler_options/optimization_options.html
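    Since tiarmclang is Clang/LLVM-based, the levels documented on that page are roughly the following (a summary only; see the linked guide for the authoritative descriptions):

```
-O0  no optimization, best debug view
-O1  basic optimizations
-O2  moderate speed optimizations (common release default)
-O3  aggressive speed optimizations
-Os  optimize for code size
-Oz  optimize aggressively for code size
```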

    Apart from the optimizations, I should still be able to get the TBCTR delta in TCM lower than in OCRAM. Will update soon.

    Regards

    Sri Vidya

  • Hi, Sri Vidya

    First of all, the above answer did not resolve the issue. (It was marked as resolved by a wrong click.)

    Even though I moved all global variables to TCMB, the load is measured as below. (When not using TCM: 41.)

    Therefore, we still believe that using TCM is slower.

    Global variables assigned to TCM can be checked as follows.

    Please let me know if there are any additional things to try.

    Best Regards

    Jiung Choi

  • Hi Jiung Choi

    I performed the same experiment in Debug and Release modes.

    My TBPRD value for PWM is 10000

    TBCLK is 200MHz.

    The MakeLoad function is as below:

    These are the results in debug mode:

    Without TCM:

    With TCM:

    In Release mode, the below are the results:

    Without TCM:

    With TCM:

    In this case, TCM and OCRAM seem to have similar performance.

    Regards

    Sri Vidya

  • Hi

    Along with the above,

    I have also tried changing the MPU Settings of EPWM Module as shown below:

    SDK app guide for basic info: https://software-dl.ti.com/mcu-plus-sdk/esd/AM263X/latest/exports/docs/api_guide_am263x/KERNEL_DPL_MPU_ARMV7_PAGE.html

    MPU configuration can be done through SysConfig for a specific memory region and length. The drop-downs help in understanding the different access types:

    1. By default TCM memory is configured as cached.
    2. By default other memory regions are configured as strongly ordered.

    The above Attributes in the Syscfg are explained in the ARM R5 TRM in the below table:

    https://documentation-service.arm.com/static/5f042788cafe527e86f5cc83?token=

    Refer to Section 4.3.21, "c6, MPU memory region programming registers".

    Below are the results obtained after the MPU region configuration:

    Without TCM:

    With TCM:

  • Hi, Sri Vidya

    Thank you for confirming

    Based on the above results, the improvement by TCM seems to be insignificant.

    Is this level of improvement in line with what TI expects?

    If this is the expected level of improvement, there is no need for further checks on our side.

    Best Regards

    Jiung Choi

  • This difference is happening because we are measuring the performance of a recurring piece of code (in our case, a for loop), which the CPU will have placed in the cache.

    For more information, the ARM R5 programmer's guide provides details about the above:

    Link to this page: https://developer.arm.com/documentation/den0042/a/Cache

    Here is a page comparing cache and TCM performance:

    https://developer.arm.com/documentation/den0042/a/Tightly-Coupled-Memory/Performance-of-TCM-compared-to-cache

    Hope this helps.

    Regards

    Sri Vidya

  • Hi, Sri Vidya

    Thank you for confirming

    I understand that this difference can be caused by the presence or absence of the cache.

    Therefore, the more complex the software, the more likely it is that values are fetched from the shared memory rather than the cache; in that case, TCM should be effective.

    Considering these details, we will review whether it is applicable to our use case.

    Thank you for your support.

    Best Regards

    Jiung Choi