EK-TM4C1294XL: RAMFUNC aka running functions out of RAM is slower than running from Flash

Part Number: EK-TM4C1294XL

Running functions from RAM is typically much faster on most MCUs. However, I created a quick benchmark project which shows that, on this part, the same function runs about 25% slower from RAM than from Flash.

See the code below:

"

/******************************************************************************/
/* MISC_NOPs
*
* The function does a certain number of NOPs.
* */
/******************************************************************************/
void MISC_NOPs(unsigned int nops)
{
    unsigned int i;

    for (i = 0; i < nops; i++)
    {
        NOP();
    }
}

#pragma CODE_SECTION(MISC_NOPs_RAM,".TI.ramfunc");
/******************************************************************************/
/* MISC_NOPs_RAM
*
* The function does a certain number of NOPs.
* */
/******************************************************************************/
void MISC_NOPs_RAM(unsigned int nops)
{
    unsigned int i;

    for (i = 0; i < nops; i++)
    {
        NOP();
    }
}

"

The version running from RAM takes ~13 µs, while the one running from Flash takes ~10 µs.

Linker:

"

MEMORY
{
    /* Application stored in and executes from internal flash */
    FLASH (RX) : origin = APP_BASE, length = 0x00100000
    /* Application uses internal RAM for data */
    SRAM (RWX) : origin = 0x20000000, length = 0x00030000
    SRAM_RAMFUNC (RWX) : origin = 0x20030000, length = 0x00010000
}

/* Section allocation in memory */

SECTIONS
{
    .intvecs: > APP_BASE
    .text : > FLASH
    .const : > FLASH
    .cinit : > FLASH
    .pinit : > FLASH
    .init_array : > FLASH

    .vtable : > RAM_BASE
    .data : > SRAM
    .bss : > SRAM
    .sysmem : > SRAM
    .stack : > SRAM, fill 0xAAAA5555

    .TI.ramfunc : {} LOAD = FLASH,
                     RUN = SRAM_RAMFUNC,
                     LOAD_START(RamfuncsLoadStart),
                     LOAD_SIZE(RamfuncsLoadSize),
                     LOAD_END(RamfuncsLoadEnd),
                     RUN_START(RamfuncsRunStart),
                     RUN_SIZE(RamfuncsRunSize),
                     RUN_END(RamfuncsRunEnd),
                     PAGE = 0, ALIGN(4)

}

"
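The LOAD/RUN split above means the `.TI.ramfunc` image is stored in Flash and must be copied to SRAM_RAMFUNC before MISC_NOPs_RAM is called. Whether the toolchain's startup code already does this is not stated in the thread; if it does not, a copy along the following lines would be needed. This is a minimal sketch: the helper name `ramfunc_copy` is hypothetical, and the way the linker symbols are referenced in the comment is an assumption about how the linker exports them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy a RAM-function image from its load address (Flash) to its
 * run address (SRAM). Returns the number of bytes copied.
 * Hypothetical helper, not part of any TI library.
 */
size_t ramfunc_copy(void *run_start, const void *load_start, size_t size)
{
    /* Nothing to do for an empty section or when load == run. */
    if (size == 0u || run_start == load_start) {
        return 0u;
    }
    memcpy(run_start, load_start, size);
    return size;
}

/*
 * On the target this might be called once, early in main(), using the
 * symbols defined by the linker command file (assumed usage):
 *
 *   extern uint8_t RamfuncsLoadStart, RamfuncsRunStart, RamfuncsLoadSize;
 *   ramfunc_copy(&RamfuncsRunStart, &RamfuncsLoadStart,
 *                (size_t)&RamfuncsLoadSize);
 */
```

If the copy is missing or happens too late, the RAM version executes whatever happens to be in SRAM, so it is worth ruling out before comparing timings.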

  • What level of optimization did you use? Without optimization, the loop index "i" is stored in system RAM and reread on every iteration. When executing from RAM, these reads and writes compete with the opcode fetches. The flash has a separate bus interface to the CPU, so RAM reads and writes can occur at the same time as flash reads. That may explain your results.
  • I was going to comment along much the same lines as vendor's Bob, while noting the following two points:

    • It is recommended that programs always use the Code region (i.e. Flash), because the Cortex-M4F has separate buses that can perform instruction fetches and data accesses simultaneously.
    • Your measurements, though not your overall result, are sure to change if you "cascade" the NOPs rather than holding them "hostage" within so restrictive a loop.

    As the ARM design has proven superior to earlier designs, the "Flash vs. SRAM" performance may prove an advantage, not a liability...
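One way to read the "cascaded NOPs" suggestion is to issue the NOPs back to back, so the loop counter is touched far less often per NOP. The following is a minimal sketch under stated assumptions: the thread never shows how NOP() is defined, so a GCC/Clang-style inline-assembly stand-in is used here, and the name `MISC_NOPs_Cascaded` is hypothetical. The return value exists only so the number of NOPs issued can be checked.

```c
/*
 * Assumption: NOP() is a single no-op instruction. This stand-in uses
 * GCC/Clang inline assembly; another compiler would need its own form.
 */
#ifndef NOP
#define NOP() __asm__ volatile ("nop")
#endif

/* Back-to-back NOPs with no loop overhead in between. */
#define NOP4() do { NOP(); NOP(); NOP(); NOP(); } while (0)
#define NOP8() do { NOP4(); NOP4(); } while (0)

/*
 * Issue 'nops' NOPs, eight at a time where possible.
 * Returns the number of NOPs actually issued.
 */
unsigned int MISC_NOPs_Cascaded(unsigned int nops)
{
    unsigned int issued = 0u;
    unsigned int blocks = nops / 8u;   /* full groups of eight */
    unsigned int rest   = nops % 8u;   /* leftover single NOPs */

    while (blocks-- > 0u) {
        NOP8();
        issued += 8u;
    }
    while (rest-- > 0u) {
        NOP();
        issued++;
    }
    return issued;
}
```

With fewer counter updates per NOP, less data traffic competes with instruction fetches when the code runs from SRAM. Building with optimization enabled should additionally keep the counters in registers. Whether the RAM copy then matches the Flash copy would still have to be measured on the board.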