MSP432P401R: Performance with SRAM_CODE at 48 MHz.

Jacob Livchitz

Part Number: MSP432P401R
Other Parts Discussed in Thread: MSP432E401Y

Hi, I am a UMD student and TI intern who is working to get a CoreMark benchmark running on my MSP432P401R EVM. I am using a CCS project and running at 48 MHz. I am getting a lower than expected CoreMark score. I noticed that with all code running from SRAM, I hit a score of 2.41, while when I run code from flash my score improves to 2.6. The expected number is 3.4. Can someone please explain why the SRAM based runs are slower? I expected SRAM to be 0 wait-states and thereby perform better than flash. My data is in SRAM in both cases. My compiler flags are:

-mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --fp_mode=relaxed

and my map file has a memory configuration of:

name       origin    length    used      unused    attr
MAIN       00000000  00040000  000000e4  0003ff1c  R  X
INFO       00200000  00004000  00000000  00004000  R  X
SRAM_CODE  01000000  00008000  00006b0c  000014f4  RW X
SRAM_DATA  20008000  00008000  00006e08  000011f8  RW

Thank you!

over 5 years ago

Dennis Lehman over 5 years ago

TI__Guru 71216 points

Hi Jacob,

What software are you using to measure the benchmarks?

Jacob Livchitz over 5 years ago in reply to Dennis Lehman

Prodigy 50 points

Hi Dennis,

I am using DWT->CYCCNT to measure the elapsed number of cycles in a standard CoreMark I cloned from GitHub and ported to bare metal.
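For readers reproducing this kind of measurement, here is a minimal sketch of the cycle-delta arithmetic. On the target, the two samples would be reads of DWT->CYCCNT taken after the counter has been enabled (with CMSIS, by setting CoreDebug_DEMCR_TRCENA_Msk in CoreDebug->DEMCR and DWT_CTRL_CYCCNTENA_Msk in DWT->CTRL). The helper names elapsed_cycles and cycles_to_seconds are hypothetical, not part of CoreMark or the SDK, and the code is written to build on a host so the wrap-around behaviour can be checked. CYCCNT is 32 bits wide, so at 48 MHz it wraps roughly every 89 seconds.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two samples of a free-running 32-bit counter
 * such as DWT->CYCCNT. Unsigned subtraction is modulo 2^32, so the result
 * is still correct across one wrap-around (but not across two or more). */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Convert a cycle count to seconds for a given core clock in Hz.
 * cpu_hz is an assumption supplied by the caller, e.g. 48000000. */
static double cycles_to_seconds(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz > 0u);
    return (double)cycles / (double)cpu_hz;
}
```

A run longer than one wrap period needs either a software overflow count or a slower timer such as SysTick.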

-Jacob

Zack L. over 5 years ago in reply to Jacob Livchitz

TI__Expert 3390 points

Jacob,

What startup code did you use before running the benchmark? If you look in our MSP432 SDK there is a file "source/ti/devices/msp432p4xx/startup_system_files/msp432p401r.c" that provides some startup code that is helpful to run before running the main application. Most of our examples run something like this before main() is called.

You can see that the code enables all SRAM banks and configures the Flash wait states. Missing that setup could explain the results you saw.

-Zack

Jacob Livchitz over 5 years ago in reply to Zack L.

Prodigy 50 points

Hi Zack,

I do run the startup code you mentioned before running my main. This code sets the system clock to 48 MHz, enables all SRAM banks, and sets 1 wait state for the FLASH.

Also, can you verify my understanding that (assuming a proper setup) SRAM should be running with 0 waitstates and hence should outperform FLASH runs?

Chester Gillon over 5 years ago

Guru 92251 points

Jacob Livchitz said:
I noticed that with all code running on SRAM, I hit a score of 2.41. While when I run code from flash my score improves to 2.6.

I can reproduce that with a MSP432P401R project with the CPU clock set to 48 MHz and -O4 --opt_for_speed=5 as the optimisation settings.

"CoreMark Benchmarking for ARM Cortex Processors" was used as the basis for setting up the CoreMark test.

Running with the code in FLASH results in:

[CORTEX_M4_0] 2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 15967
Total time (secs): 15.967000
Iterations/Sec   : 125.258345
Iterations       : 2000
Compiler version : TI20002001
Compiler flags   : -mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include/CMSIS --include_path=/home/mr_halfword/E2E_example_projects/MSP432_coremark --include_path=/home/mr_halfword/ti/ccs1000/ccs/tools/compiler/ti-cgt-arm_20.2.1.LTS/include --advice:power=all --define=__MSP432P401R__ --define=ccs --define=ITERATIONS=0 --define=SEED_METHOD=SEED_ARG --define=COMPILER_FLAGS= -g --gcc --diag_warning=225 --diag_wrap=off --display_error_number --abi=eabi 
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x4983
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 125.258345 / TI20002001 -mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include/CMSIS --include_path=/home/mr_halfword/E2E_example_projects/MSP432_coremark --include_path=/home/mr_halfword/ti/ccs1000/ccs/tools/compiler/ti-cgt-arm_20.2.1.LTS/include --advice:power=all --define=__MSP432P401R__ --define=ccs --define=ITERATIONS=0 --define=SEED_METHOD=SEED_ARG --define=COMPILER_FLAGS= -g --gcc --diag_warning=225 --diag_wrap=off --display_error_number --abi=eabi  / STACK

Which is a score of 125.258345 / 48 = 2.61.

Running the code from SRAM results in:

[CORTEX_M4_0] 2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 17408
Total time (secs): 17.408000
Iterations/Sec   : 114.889706
Iterations       : 2000
Compiler version : TI20002001
Compiler flags   : -mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include/CMSIS --include_path=/home/mr_halfword/E2E_example_projects/MSP432_coremark --include_path=/home/mr_halfword/ti/ccs1000/ccs/tools/compiler/ti-cgt-arm_20.2.1.LTS/include --advice:power=all --define=__MSP432P401R__ --define=ccs --define=ITERATIONS=0 --define=SEED_METHOD=SEED_ARG --define=COMPILER_FLAGS= -g --gcc --diag_warning=225 --diag_wrap=off --display_error_number --abi=eabi 
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x4983
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 114.889706 / TI20002001 -mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include/CMSIS --include_path=/home/mr_halfword/E2E_example_projects/MSP432_coremark --include_path=/home/mr_halfword/ti/ccs1000/ccs/tools/compiler/ti-cgt-arm_20.2.1.LTS/include --advice:power=all --define=__MSP432P401R__ --define=ccs --define=ITERATIONS=0 --define=SEED_METHOD=SEED_ARG --define=COMPILER_FLAGS= -g --gcc --diag_warning=225 --diag_wrap=off --display_error_number --abi=eabi  / STACK

Which is a score of 114.889706 / 48 = 2.39.
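The two scores above can be recomputed from the report fields. The sketch below assumes one tick is one millisecond, which matches the logs (Total ticks 15967 alongside Total time 15.967000 s); the function names are hypothetical and are not part of the CoreMark sources.

```c
#include <assert.h>
#include <stdint.h>

/* Iterations per second from the CoreMark report fields.
 * ticks_per_sec is 1000 for the logs in this thread (assumed from the
 * "Total ticks" and "Total time (secs)" lines agreeing at that ratio). */
static double iterations_per_sec(uint32_t iterations, uint32_t ticks,
                                 uint32_t ticks_per_sec)
{
    assert(ticks > 0u);
    return (double)iterations * (double)ticks_per_sec / (double)ticks;
}

/* Clock-normalised score: CoreMark iterations per second per MHz. */
static double coremark_per_mhz(double iter_per_sec, double cpu_mhz)
{
    assert(cpu_mhz > 0.0);
    return iter_per_sec / cpu_mhz;
}
```

For example, coremark_per_mhz(iterations_per_sec(2000, 15967, 1000), 48.0) gives about 2.61 for the FLASH run, and the same call with 17408 ticks gives about 2.39 for the SRAM run.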

Jacob Livchitz said:
Can someone please explain why the SRAM based runs are slower? I expected SRAM to be 0 wait-states and thereby perform better than flash. My data is in SRAM in both cases.

The MSP432P4xx TRM shows that the Cortex-M4F has multiple bus interfaces, including ICODE, DCODE and SBUS.

I think the CoreMark score is lower when the code is running in SRAM than when the code is running in FLASH because instruction fetches and data accesses both have to access SRAM.

Whereas when the code is running in FLASH the ICODE and SBUS interfaces can be active at the same time, allowing some overlap between instruction fetches and data accesses. I haven't looked at the ARM Cortex-M4F pipeline structure to confirm this theory.
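To put a number on the gap this theory would need to explain, the run times above can be converted to average CPU cycles per CoreMark iteration. This is a rough sketch which assumes the ticks are milliseconds and that the CPU clock is exactly 48 MHz; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Average CPU cycles spent per CoreMark iteration.
 * Assumes ticks are milliseconds and the core runs at exactly cpu_khz kHz,
 * so total cycles = ticks_ms * cpu_khz. 64-bit arithmetic avoids overflow,
 * since a 16 second run at 48 MHz is already about 7.7e8 cycles. */
static uint64_t cycles_per_iteration(uint32_t ticks_ms, uint32_t cpu_khz,
                                     uint32_t iterations)
{
    assert(iterations > 0u);
    return ((uint64_t)ticks_ms * (uint64_t)cpu_khz) / iterations;
}
```

With the MSP432P401R figures above this gives 383,208 cycles per iteration from FLASH and 417,792 cycles per iteration from SRAM, so the SRAM run spends roughly 9% more cycles per iteration.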

For reference, my project created in CCS 10 is attached. There are build configurations for Release_FLASH and Release_SRAM which indicate if the code runs from FLASH or SRAM.

MSP432_coremark.zip

Chester Gillon over 5 years ago in reply to Chester Gillon

Guru 92251 points

I also ran coremark on a MSP432E401Y with the CPU set to 120 MHz.

With the code in FLASH the result was:

[CORTEX_M4_0] 2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 15133
Total time (secs): 15.133000
Iterations/Sec   : 264.323003
Iterations       : 4000
Compiler version : TI20002001
Compiler flags   : -mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --include_path=/home/mr_halfword/ti/simplelink_msp432e4_sdk_4_10_00_13/source --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include/CMSIS --include_path=/home/mr_halfword/E2E_example_projects/MSP432_coremark --include_path=/home/mr_halfword/ti/ccs1000/ccs/tools/compiler/ti-cgt-arm_20.2.1.LTS/include --advice:power=all --define=CMSIS --define=ccs --define=ITERATIONS=0 --define=SEED_METHOD=SEED_ARG --define=COMPILER_FLAGS= --define=__MSP432E401Y__ -g --gcc --diag_warning=225 --diag_wrap=off --display_error_number --abi=eabi 
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x65c5
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 264.323003 / TI20002001 -mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --include_path=/home/mr_halfword/ti/simplelink_msp432e4_sdk_4_10_00_13/source --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include/CMSIS --include_path=/home/mr_halfword/E2E_example_projects/MSP432_coremark --include_path=/home/mr_halfword/ti/ccs1000/ccs/tools/compiler/ti-cgt-arm_20.2.1.LTS/include --advice:power=all --define=CMSIS --define=ccs --define=ITERATIONS=0 --define=SEED_METHOD=SEED_ARG --define=COMPILER_FLAGS= --define=__MSP432E401Y__ -g --gcc --diag_warning=225 --diag_wrap=off --display_error_number --abi=eabi  / STACK

Which is a score of 264.323003 / 120 = 2.20.

Running the code from SRAM results in:

[CORTEX_M4_0] 2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 12818
Total time (secs): 12.818000
Iterations/Sec   : 234.045873
Iterations       : 3000
Compiler version : TI20002001
Compiler flags   : -mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --include_path=/home/mr_halfword/ti/simplelink_msp432e4_sdk_4_10_00_13/source --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include/CMSIS --include_path=/home/mr_halfword/E2E_example_projects/MSP432_coremark --include_path=/home/mr_halfword/ti/ccs1000/ccs/tools/compiler/ti-cgt-arm_20.2.1.LTS/include --advice:power=all --define=CMSIS --define=ccs --define=ITERATIONS=0 --define=SEED_METHOD=SEED_ARG --define=COMPILER_FLAGS= --define=__MSP432E401Y__ -g --gcc --diag_warning=225 --diag_wrap=off --display_error_number --abi=eabi 
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0xcc42
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 234.045873 / TI20002001 -mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --include_path=/home/mr_halfword/ti/simplelink_msp432e4_sdk_4_10_00_13/source --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include/CMSIS --include_path=/home/mr_halfword/E2E_example_projects/MSP432_coremark --include_path=/home/mr_halfword/ti/ccs1000/ccs/tools/compiler/ti-cgt-arm_20.2.1.LTS/include --advice:power=all --define=CMSIS --define=ccs --define=ITERATIONS=0 --define=SEED_METHOD=SEED_ARG --define=COMPILER_FLAGS= --define=__MSP432E401Y__ -g --gcc --diag_warning=225 --diag_wrap=off --display_error_number --abi=eabi  / STACK

Which is a score of 234.045873 / 120 = 1.95.

Therefore, on a MSP432E401Y the score is also lower when running the code in SRAM, compared to FLASH.

Also, a MSP432E401Y at 120 MHz has lower scores per MHz than a MSP432P401R at 48 MHz. They both have a Cortex-M4F, so I guess the difference in scores is down to the relative performance of the memories.

The updated project is attached, which has different build configurations to support the combinations of MSP432E401Y/MSP432P401R and FLASH/SRAM.

4682.MSP432_coremark.zip

Jacob Livchitz over 5 years ago in reply to Chester Gillon

Prodigy 50 points

Thanks so much for the response,

Just for reference, I achieved the best CoreMark scores when I ran code from FLASH at SYSCLK = 12 MHz, since this allows my FLASH accesses to have no wait-states. At 48 MHz my FLASH is set to have 1 wait-state. The CoreMark scores I am getting are 3.01 at 12 MHz and 2.62 at 48 MHz respectively. When I moved my code from FLASH to SRAM I tried a lot of different configurations of code and data mapped to ICODE, DCODE, and SBUS. All runs were worse. Talking with my advisor, we came to the conclusion that the hardware implementation may not have 0 waitstate SRAM. Also, the FLASH has a prefetch buffer, which could explain why FLASH performs reasonably well.
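Since these scores are normalised per MHz, it may help to convert them back to absolute throughput. The sketch below does that for the two FLASH scores quoted above; the function names are hypothetical, and the inputs are the rounded scores from this post rather than raw measurements.

```c
#include <assert.h>

/* Absolute throughput (CoreMark iterations per second) recovered from a
 * per-MHz score and the clock it was measured at. */
static double throughput_from_score(double score_per_mhz, double cpu_mhz)
{
    assert(cpu_mhz > 0.0);
    return score_per_mhz * cpu_mhz;
}

/* Fraction of the per-MHz score retained when moving from the slow clock
 * to the fast clock; 1.0 would mean performance scales perfectly. */
static double scaling_efficiency(double score_fast_clock,
                                 double score_slow_clock)
{
    assert(score_slow_clock > 0.0);
    return score_fast_clock / score_slow_clock;
}
```

With 3.01 at 12 MHz and 2.62 at 48 MHz, this gives about 36 iterations per second at 12 MHz and about 126 at 48 MHz. The 48 MHz run is therefore still about 3.5 times faster in absolute terms, and retains roughly 87% of the per-MHz score despite the FLASH wait state.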

I also experimented with the Dhrystone benchmark and got similar results. I obtained the best results running from FLASH with SYSCLK = 12 MHz and using the GNU compiler.

I would appreciate any other insights,

Thanks again.

Chester Gillon over 5 years ago in reply to Jacob Livchitz

Guru 92251 points

Jacob Livchitz said:
Talking with my advisor, we came to the conclusion that the hardware implementation may not have 0 waitstate SRAM.

The CoreMark and Dhrystone benchmark tests involve fetching instructions, performing calculations and performing data read/writes. With such overlapped functionality it is difficult to predict the effect of memory wait states on the overall execution time.

To investigate the effect of memory wait states, I set up a test program which executes a sequence of 15,600 manually unrolled instructions in a loop, where each instruction in the loop is 32 bits wide and is expected to take 1 cycle to execute when fetched from memory with zero wait states. There are no data accesses in the test, so the test duration is dominated by the instruction fetch rate from memory.

From looking at the instruction timings, including the loop overhead and function call overhead, one function call should take 15,976,454 to 15,978,506 cycles with zero wait states. The variation arises because branch instructions incur a pipeline refill, which according to the ARM Cortex-M4 instruction timings takes 1 to 3 cycles.

The tests were run with a MSP432P401R set to a 48 MHz clock, and changing the memory the code was executing from. The CCS Profile Clock was used to obtain the timings. The results were:

Code location	Execution time in cycles
SRAM	15,977,477
FLASH with flash bank read buffering enabled	19,972,102
FLASH with flash bank read buffering disabled	31,954,949

The conclusions at a 48 MHz CPU clock are:

SRAM is zero wait states, as the measured execution time was within the range expected for zero wait states.
FLASH is one wait state, since with flash bank read buffering disabled the measured execution time was 2.0 times that expected for zero wait states.
With one wait state but with the read buffer enabled, the FLASH execution time in this test was 1.25 times that of the zero wait state SRAM. This shows the read buffer helping to hide the extra latency caused by the wait state. In this test the linear access to FLASH helped the read buffering; a program which makes non-linear accesses to FLASH might not get such a speed-up from the read buffer.
Note that by default the SystemInit() function enables the flash read buffering; I modified the function to disable it for the test with read buffering disabled.
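The ratios quoted in these conclusions can be checked from the measured cycle counts. The following is a small sketch using the figures from the table and the expected zero wait state range given earlier; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Expected cycles for one function call with zero wait states, taken from
 * the instruction timing estimate in this post. */
#define EXPECTED_MIN_CYCLES 15976454u
#define EXPECTED_MAX_CYCLES 15978506u

/* Non-zero if a measurement is consistent with zero wait state fetches. */
static int within_zero_wait_state_range(uint32_t measured_cycles)
{
    return measured_cycles >= EXPECTED_MIN_CYCLES &&
           measured_cycles <= EXPECTED_MAX_CYCLES;
}

/* Execution time relative to a baseline run, e.g. FLASH relative to SRAM. */
static double slowdown(uint32_t measured_cycles, uint32_t baseline_cycles)
{
    assert(baseline_cycles > 0u);
    return (double)measured_cycles / (double)baseline_cycles;
}
```

Here within_zero_wait_state_range(15977477) holds for the SRAM run, slowdown(31954949, 15977477) is about 2.0 for unbuffered FLASH, and slowdown(19972102, 15977477) is about 1.25 for buffered FLASH. Since the test issues one instruction per cycle at zero wait states, slowdown minus one approximates the average stall cycles per instruction fetch.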

Jacob Livchitz said:
When I moved my code from FLASH to SRAM I tried a lot of different configurations of code and data mapped to ICODE, DCODE, and SBUS. All runs were worse.

Although FLASH has slower access than SRAM at 48 MHz, as a result of the FLASH wait states, the overall test performs better when FLASH is used for instructions and SRAM is used for data. This is due to the ARM "Harvard architecture", which has separate buses for instruction access to FLASH and data access to SRAM. I believe the separate buses allow instruction fetches and data accesses to happen at the same time (as per https://community.arm.com/developer/ip-products/processors/f/cortex-a-forum/8615/how-to-explain-the-harvard-architecture-of-arm-processor-at-instruction-level)

I have attached my project used for the above tests.

MSP432_estimate_wait_state_cycles.zip

Jacob Livchitz over 5 years ago in reply to Chester Gillon

Prodigy 50 points

Thanks for the response, I found the insight very helpful. I do have one question regarding the Harvard bus architecture. When I execute from SRAM_CODE space (0x01...) am I using ICODE or SBUS? It seems to me that when I run at 48 MHz my performance degrades due to bus conflicts. If SRAM_CODE works over the ICODE bus, I should expect the same performance (per MHz) with 48 MHz and 0 wait-state SRAM as I got with 12 MHz and zero wait-state FLASH. However, my numbers do not reflect this.

-Jacob

Chester Gillon over 5 years ago in reply to Jacob Livchitz

Guru 92251 points

Jacob Livchitz said:
When I execute from SRAM_CODE space (0x01...) am I using ICODE or SBUS?

My understanding from the documentation is that ICODE will be used.

Jacob Livchitz said:
If SRAM_CODE works over the ICODE bus, I should expect, the same performance (per MHz) with 48MHz and 0 wait-state SRAM as I got with 12MHz and zero wait states FLASH.

While the Cortex-M4F CPU will be fetching instructions over the ICODE bus, I think the conflict will be when the ICODE (for instruction fetches) and SBUS (for data accesses) buses both need to access the SRAM. From the documentation I can't find any mention of the SRAM allowing dual-port accesses.

Dennis Lehman over 5 years ago in reply to Chester Gillon

TI__Guru 71216 points

Hi Jacob,

It's been a few days since I have heard from you so I’m assuming your question has been answered.
If this isn’t the case, please click the "This did NOT resolve my issue" button and reply to this thread with more information.
If this thread locks, please click the "Ask a related question" button and in the new thread describe the current status of your issue and any additional details you may have to assist us in helping to solve your issues.
