MSP432P401R: Performance with SRAM_CODE at 48 MHz.

Part Number: MSP432P401R
Other Parts Discussed in Thread: MSP432E401Y

Hi, I am a UMD student and TI intern working to get a CoreMark benchmark running on my MSP432P401R EVM. I am using a CCS project and running at 48 MHz, and I am getting a lower than expected CoreMark score. With all code running from SRAM I get a score of 2.41, while running the code from flash improves the score to 2.6. The expected number is 3.4. Can someone please explain why the SRAM-based runs are slower? I expected SRAM to have 0 wait states and therefore to perform better than flash. My data is in SRAM in both cases. My compiler flags are:

-mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --fp_mode=relaxed

and my map file has a memory configuration of:

name        origin    length    used      unused    attr
MAIN        00000000  00040000  000000e4  0003ff1c  R  X
INFO        00200000  00004000  00000000  00004000  R  X
SRAM_CODE   01000000  00008000  00006b0c  000014f4  RW X
SRAM_DATA   20008000  00008000  00006e08  000011f8  RW

Thank you!

  • Hi Jacob,

    What software are you using to measure the benchmarks?

  • Hi Dennis, 

    I am using DWT->CYCCNT to calculate the delta number of cycles in a standard CoreMark I cloned from Github and ported to baremetal. 
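
As a minimal sketch of the cycle-delta arithmetic behind a DWT->CYCCNT measurement: the helper names `cycles_elapsed` and `wrap_period_seconds` are hypothetical and are not taken from the CoreMark port discussed here. CYCCNT is a 32-bit free-running counter, so the delta is only meaningful if the counter wraps at most once between the two samples.

```c
#include <stdint.h>

/* Hypothetical helper: cycles elapsed between two samples of a 32-bit
 * free-running counter such as DWT->CYCCNT. Unsigned subtraction modulo 2^32
 * gives the right delta even if the counter wrapped once between samples. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Hypothetical helper: seconds until a 32-bit cycle counter wraps at the
 * given core clock frequency in Hz. */
static double wrap_period_seconds(double cpu_hz)
{
    return 4294967296.0 / cpu_hz;
}
```

At 48 MHz the counter wraps roughly every 89 seconds, so a single start/end pair can only time a run shorter than that; longer runs would need the wraps to be counted separately.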

    -Jacob

  • Jacob,

    What startup code did you use before running the benchmark? If you look in our MSP432 SDK there is a file "source/ti/devices/msp432p4xx/startup_system_files/msp432p401r.c" that provides some startup code that is helpful to run before running the main application. Most of our examples run something like this before main() is called.

    You can see that the code enables all SRAM banks and configures the flash wait states. That could explain the results you saw.

    -Zack

  • Hi Zack,

    I do run the startup code you mentioned before running my main. This code sets the system clock to 48 MHz, enables all SRAM banks, and sets 1 wait state for the FLASH.

    Also, can you verify my understanding that (assuming a proper setup) SRAM should be running with 0 waitstates and hence should outperform FLASH runs?

  • Jacob Livchitz said:
    I noticed that with all code running on SRAM, I hit a score of 2.41. While when I run code from flash my score improves to 2.6.

    I can reproduce that with an MSP432P401R project with the CPU clock set to 48 MHz and -O4 --opt_for_speed=5 as the optimisation settings.

    "CoreMark Benchmarking for ARM Cortex Processors" was used as the basis for setting up the CoreMark test.

    Running with the code in FLASH results in:

    [CORTEX_M4_0] 2K performance run parameters for coremark.
    CoreMark Size    : 666
    Total ticks      : 15967
    Total time (secs): 15.967000
    Iterations/Sec   : 125.258345
    Iterations       : 2000
    Compiler version : TI20002001
    Compiler flags   : -mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include/CMSIS --include_path=/home/mr_halfword/E2E_example_projects/MSP432_coremark --include_path=/home/mr_halfword/ti/ccs1000/ccs/tools/compiler/ti-cgt-arm_20.2.1.LTS/include --advice:power=all --define=__MSP432P401R__ --define=ccs --define=ITERATIONS=0 --define=SEED_METHOD=SEED_ARG --define=COMPILER_FLAGS= -g --gcc --diag_warning=225 --diag_wrap=off --display_error_number --abi=eabi 
    Memory location  : STACK
    seedcrc          : 0xe9f5
    [0]crclist       : 0xe714
    [0]crcmatrix     : 0x1fd7
    [0]crcstate      : 0x8e3a
    [0]crcfinal      : 0x4983
    Correct operation validated. See README.md for run and reporting rules.
    CoreMark 1.0 : 125.258345 / TI20002001 -mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include/CMSIS --include_path=/home/mr_halfword/E2E_example_projects/MSP432_coremark --include_path=/home/mr_halfword/ti/ccs1000/ccs/tools/compiler/ti-cgt-arm_20.2.1.LTS/include --advice:power=all --define=__MSP432P401R__ --define=ccs --define=ITERATIONS=0 --define=SEED_METHOD=SEED_ARG --define=COMPILER_FLAGS= -g --gcc --diag_warning=225 --diag_wrap=off --display_error_number --abi=eabi  / STACK

    Which is a score of 125.258345 / 48 = 2.61.

    Running the code from SRAM results in:

    [CORTEX_M4_0] 2K performance run parameters for coremark.
    CoreMark Size    : 666
    Total ticks      : 17408
    Total time (secs): 17.408000
    Iterations/Sec   : 114.889706
    Iterations       : 2000
    Compiler version : TI20002001
    Compiler flags   : -mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include/CMSIS --include_path=/home/mr_halfword/E2E_example_projects/MSP432_coremark --include_path=/home/mr_halfword/ti/ccs1000/ccs/tools/compiler/ti-cgt-arm_20.2.1.LTS/include --advice:power=all --define=__MSP432P401R__ --define=ccs --define=ITERATIONS=0 --define=SEED_METHOD=SEED_ARG --define=COMPILER_FLAGS= -g --gcc --diag_warning=225 --diag_wrap=off --display_error_number --abi=eabi 
    Memory location  : STACK
    seedcrc          : 0xe9f5
    [0]crclist       : 0xe714
    [0]crcmatrix     : 0x1fd7
    [0]crcstate      : 0x8e3a
    [0]crcfinal      : 0x4983
    Correct operation validated. See README.md for run and reporting rules.
    CoreMark 1.0 : 114.889706 / TI20002001 -mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include/CMSIS --include_path=/home/mr_halfword/E2E_example_projects/MSP432_coremark --include_path=/home/mr_halfword/ti/ccs1000/ccs/tools/compiler/ti-cgt-arm_20.2.1.LTS/include --advice:power=all --define=__MSP432P401R__ --define=ccs --define=ITERATIONS=0 --define=SEED_METHOD=SEED_ARG --define=COMPILER_FLAGS= -g --gcc --diag_warning=225 --diag_wrap=off --display_error_number --abi=eabi  / STACK

    Which is a score of 114.889706 / 48 = 2.39.
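
The score arithmetic above can be sketched as a small helper. The function name `coremark_per_mhz` is hypothetical, and the tick rate of 1000 ticks per second is an assumption read off the logs (15967 ticks reported as 15.967 seconds).

```c
/* Hypothetical helper: CoreMark/MHz from the values printed in a CoreMark
 * report. ticks_per_sec is assumed to be 1000 for the logs in this thread. */
static double coremark_per_mhz(unsigned long iterations,
                               unsigned long total_ticks,
                               unsigned long ticks_per_sec,
                               double cpu_mhz)
{
    double seconds = (double)total_ticks / (double)ticks_per_sec;
    double iterations_per_sec = (double)iterations / seconds;
    return iterations_per_sec / cpu_mhz;
}
```

With the FLASH log, `coremark_per_mhz(2000, 15967, 1000, 48.0)` gives about 2.61, and with the SRAM log, `coremark_per_mhz(2000, 17408, 1000, 48.0)` gives about 2.39.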

    Jacob Livchitz said:
    Can someone please explain why the SRAM based runs are slower? I expected SRAM to be 0 wait-states and thereby perform better than flash. My data is in SRAM in both cases.

    The MSP432P4xx TRM shows that the Cortex-M4F has multiple bus interfaces (ICODE, DCODE and SBUS).

    I think the CoreMark score is lower when the code runs from SRAM than when it runs from FLASH because instruction fetches and data accesses both have to access the SRAM.

    When the code runs from FLASH, by contrast, the ICODE and SBUS interfaces can be active at the same time, allowing some overlap between instruction fetches and data accesses. I haven't looked at the ARM Cortex-M4F pipeline structure to confirm this theory.

    For reference, my project created in CCS 10 is attached. There are build configurations for Release_FLASH and Release_SRAM, which select whether the code runs from FLASH or SRAM.

    MSP432_coremark.zip

  • I also ran coremark on a MSP432E401Y with the CPU set to 120 MHz.

    With the code in FLASH the result was:

    [CORTEX_M4_0] 2K performance run parameters for coremark.
    CoreMark Size    : 666
    Total ticks      : 15133
    Total time (secs): 15.133000
    Iterations/Sec   : 264.323003
    Iterations       : 4000
    Compiler version : TI20002001
    Compiler flags   : -mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --include_path=/home/mr_halfword/ti/simplelink_msp432e4_sdk_4_10_00_13/source --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include/CMSIS --include_path=/home/mr_halfword/E2E_example_projects/MSP432_coremark --include_path=/home/mr_halfword/ti/ccs1000/ccs/tools/compiler/ti-cgt-arm_20.2.1.LTS/include --advice:power=all --define=CMSIS --define=ccs --define=ITERATIONS=0 --define=SEED_METHOD=SEED_ARG --define=COMPILER_FLAGS= --define=__MSP432E401Y__ -g --gcc --diag_warning=225 --diag_wrap=off --display_error_number --abi=eabi 
    Memory location  : STACK
    seedcrc          : 0xe9f5
    [0]crclist       : 0xe714
    [0]crcmatrix     : 0x1fd7
    [0]crcstate      : 0x8e3a
    [0]crcfinal      : 0x65c5
    Correct operation validated. See README.md for run and reporting rules.
    CoreMark 1.0 : 264.323003 / TI20002001 -mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --include_path=/home/mr_halfword/ti/simplelink_msp432e4_sdk_4_10_00_13/source --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include/CMSIS --include_path=/home/mr_halfword/E2E_example_projects/MSP432_coremark --include_path=/home/mr_halfword/ti/ccs1000/ccs/tools/compiler/ti-cgt-arm_20.2.1.LTS/include --advice:power=all --define=CMSIS --define=ccs --define=ITERATIONS=0 --define=SEED_METHOD=SEED_ARG --define=COMPILER_FLAGS= --define=__MSP432E401Y__ -g --gcc --diag_warning=225 --diag_wrap=off --display_error_number --abi=eabi  / STACK

    Which is a score of 264.323003 / 120 = 2.20.

    Running the code from SRAM results in:

    [CORTEX_M4_0] 2K performance run parameters for coremark.
    CoreMark Size    : 666
    Total ticks      : 12818
    Total time (secs): 12.818000
    Iterations/Sec   : 234.045873
    Iterations       : 3000
    Compiler version : TI20002001
    Compiler flags   : -mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --include_path=/home/mr_halfword/ti/simplelink_msp432e4_sdk_4_10_00_13/source --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include/CMSIS --include_path=/home/mr_halfword/E2E_example_projects/MSP432_coremark --include_path=/home/mr_halfword/ti/ccs1000/ccs/tools/compiler/ti-cgt-arm_20.2.1.LTS/include --advice:power=all --define=CMSIS --define=ccs --define=ITERATIONS=0 --define=SEED_METHOD=SEED_ARG --define=COMPILER_FLAGS= --define=__MSP432E401Y__ -g --gcc --diag_warning=225 --diag_wrap=off --display_error_number --abi=eabi 
    Memory location  : STACK
    seedcrc          : 0xe9f5
    [0]crclist       : 0xe714
    [0]crcmatrix     : 0x1fd7
    [0]crcstate      : 0x8e3a
    [0]crcfinal      : 0xcc42
    Correct operation validated. See README.md for run and reporting rules.
    CoreMark 1.0 : 234.045873 / TI20002001 -mv7M4 --code_state=16 --float_support=FPv4SPD16 -me -O4 --opt_for_speed=5 --include_path=/home/mr_halfword/ti/simplelink_msp432e4_sdk_4_10_00_13/source --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include --include_path=/home/mr_halfword/ti/ccs1000/ccs/ccs_base/arm/include/CMSIS --include_path=/home/mr_halfword/E2E_example_projects/MSP432_coremark --include_path=/home/mr_halfword/ti/ccs1000/ccs/tools/compiler/ti-cgt-arm_20.2.1.LTS/include --advice:power=all --define=CMSIS --define=ccs --define=ITERATIONS=0 --define=SEED_METHOD=SEED_ARG --define=COMPILER_FLAGS= --define=__MSP432E401Y__ -g --gcc --diag_warning=225 --diag_wrap=off --display_error_number --abi=eabi  / STACK

    Which is a score of 234.045873 / 120 = 1.95.

    Therefore, on an MSP432E401Y the score is also lower when running the code from SRAM than from FLASH.

    An MSP432E401Y at 120 MHz also has lower scores per MHz than an MSP432P401R at 48 MHz. They both have a Cortex-M4F, so I guess the difference in scores is down to the relative performance of the memories.

    The updated project is attached. It has separate build configurations to support the combinations of MSP432E401Y/MSP432P401R and FLASH/SRAM.

    4682.MSP432_coremark.zip

  • Thanks so much for the response, 

    Just for reference, I achieved the best CoreMark scores when I ran code from FLASH at SYSCLK = 12 MHz, since this allows FLASH accesses to have no wait states. At 48 MHz my FLASH is set to have 1 wait state. The CoreMark scores I am getting are 3.01 at 12 MHz and 2.62 at 48 MHz respectively. When I moved my code from FLASH to SRAM I tried a lot of different configurations of code and data mapped to ICODE, DCODE, and SBUS. All runs were worse. Talking with my advisor, we came to the conclusion that the hardware implementation may not have 0 wait state SRAM. Also, the FLASH has a prefetch buffer, which could explain why FLASH performs reasonably well.

    I also experimented with the Dhrystone benchmark and got similar results. The best results I obtained running from FLASH with SYSCLK = 12MHz and using the GNU compiler. 

    I would appreciate any other insights, 

    Thanks again. 

  • Jacob Livchitz said:
    Talking with my advisor, we came to the conclusion that the hardware implementation may not have 0 waitstate SRAM.

    The CoreMark and Dhrystone benchmark tests involve fetching instructions, performing calculations and performing data read/writes. With such overlapped functionality it is difficult to predict the effect of memory wait states on the overall execution time.

    To investigate the effect of memory wait states, I set up a test program which executes a sequence of 15,600 manually unrolled instructions in a loop, where each instruction in the loop is 32 bits and is expected to take 1 cycle to execute when fetched from memory with zero wait states. There are no data accesses in the test, so the test duration is dominated by the instruction fetch rate from memory.

    From looking at instruction timings, including the loop overhead and function call overhead, with zero wait states one function call should take 15,976,454 to 15,978,506 cycles. The variation is due to branch instructions invoking the overhead of the number of cycles required for a pipeline refill, which according to the ARM Cortex-M4 instruction timing is 1 to 3 cycles.
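
As a rough cross-check of that range, and assuming the whole spread between the minimum and maximum estimates comes from pipeline refills costing 1 to 3 cycles each, the number of refills per function call can be worked backwards. The helper names below are hypothetical and are not part of the attached project.

```c
#include <stdint.h>

/* Hypothetical helper: number of pipeline refills implied by a [min, max]
 * cycle estimate, assuming each refill costs between refill_min and
 * refill_max cycles and nothing else varies. */
static uint32_t implied_refills(uint32_t cycles_min, uint32_t cycles_max,
                                uint32_t refill_min, uint32_t refill_max)
{
    return (cycles_max - cycles_min) / (refill_max - refill_min);
}

/* Hypothetical helper: cycles left over once every refill has been charged
 * at its minimum cost. */
static uint32_t non_refill_cycles(uint32_t cycles_min, uint32_t refills,
                                  uint32_t refill_min)
{
    return cycles_min - refills * refill_min;
}
```

Under that assumption, `implied_refills(15976454, 15978506, 1, 3)` gives 1026 refills, leaving 15,975,428 cycles for the straight-line instructions and the remaining overhead.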

    The tests were run on an MSP432P401R set to a 48 MHz clock, changing the memory the code executed from. The CCS Profile Clock was used to obtain the timings. The results were:

    Code location                                    Execution time in cycles
    SRAM                                             15,977,477
    FLASH with flash bank read buffering enabled     19,972,102
    FLASH with flash bank read buffering disabled    31,954,949

    The conclusions, at a 48 MHz CPU clock, are:

    1. SRAM has zero wait states, as the measured execution time was within the range expected for zero wait states.
    2. FLASH has one wait state, since with flash bank read buffering disabled the measured execution time was 2.0 times that expected for zero wait states.
    3. With one wait state but with the read buffer enabled, FLASH execution time in this test was 1.25 times that of the zero wait state SRAM. This shows the read buffer helping to hide the extra latency caused by the wait state. The linear access to FLASH in this test helped the read buffering; a program which makes non-linear accesses to FLASH might not get the same speed-up from the read buffer.
    4. By default the SystemInit() function enables the flash read buffering; I modified the function to disable it for the test with read buffering disabled.
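
The ratios quoted in these conclusions can be reproduced from the measured cycle counts in the table. This is only a sketch of the arithmetic; the helper names `slowdown` and `within_range` are hypothetical.

```c
/* Hypothetical helper: execution time relative to a baseline measurement. */
static double slowdown(unsigned long measured_cycles,
                       unsigned long baseline_cycles)
{
    return (double)measured_cycles / (double)baseline_cycles;
}

/* Hypothetical helper: non-zero if value lies in the inclusive range [lo, hi]. */
static int within_range(unsigned long value, unsigned long lo, unsigned long hi)
{
    return value >= lo && value <= hi;
}
```

Taking the SRAM run (15,977,477 cycles) as the baseline, `slowdown(19972102, 15977477)` is about 1.25 and `slowdown(31954949, 15977477)` is about 2.0, while `within_range(15977477, 15976454, 15978506)` confirms that the SRAM run falls inside the estimate for zero wait states.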

    Jacob Livchitz said:
    When I moved my code from FLASH to SRAM I tried a lot of different configurations of code and data mapped to ICODE, DCODE, and SBUS. All runs were worse.

    FLASH has slower access than SRAM at 48 MHz because of the FLASH wait states. However, the ARM "Harvard architecture" provides separate buses for instruction access to FLASH and data access to SRAM, so the overall test performs better when FLASH is used for instructions and SRAM is used for data. I believe this is because the separate buses allow instruction fetches and data accesses to happen at the same time (as per https://community.arm.com/developer/ip-products/processors/f/cortex-a-forum/8615/how-to-explain-the-harvard-architecture-of-arm-processor-at-instruction-level)
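
To make that trade-off concrete, here is a deliberately simplified toy model. It is not a description of the real Cortex-M4F pipeline or the MSP432 memory system. It assumes that code in FLASH costs a fixed average number of cycles per instruction fetch (1.25 was the ratio measured above for linear code with read buffering), and that with code in SRAM every data access stalls the instruction fetch by one cycle because both compete for the same memory. All names are hypothetical.

```c
/* Toy model only: cycles for n_instr instructions with code in FLASH and data
 * in SRAM. Fetches and data accesses are assumed to use separate buses, so
 * only the average fetch cost per instruction matters. */
static double cycles_code_in_flash(double n_instr, double fetch_cost)
{
    return n_instr * fetch_cost;
}

/* Toy model only: cycles for n_instr instructions with code and data sharing
 * one SRAM. Each data access is assumed to add one stall cycle;
 * data_access_fraction is the share of instructions that access data. */
static double cycles_code_in_sram(double n_instr, double data_access_fraction)
{
    return n_instr * (1.0 + data_access_fraction);
}
```

In this model the two cases break even when a quarter of the instructions access data. A workload with a higher share of loads and stores would run faster from FLASH despite the wait state, which is the direction of the CoreMark results in this thread, though the real break-even point depends on details the model ignores.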

    I have attached my project used for the above tests.

    MSP432_estimate_wait_state_cycles.zip

  • Thanks for the response, I found the insight very helpful. I do have one question regarding the Harvard bus architecture. When I execute from SRAM_CODE space (0x01...), am I using ICODE or SBUS? It seems to me that when I run at 48 MHz my performance degrades due to bus conflicts. If SRAM_CODE works over the ICODE bus, I should expect the same performance (per MHz) with 48 MHz and zero wait state SRAM as I got with 12 MHz and zero wait state FLASH. However, my numbers do not reflect this.

    -Jacob

  • Jacob Livchitz said:
    When I execute from SRAM_CODE space (0x01...) am I using ICODE or SBUS?

    My understanding from the documentation is that ICODE will be used.

    Jacob Livchitz said:
    If SRAM_CODE works over the ICODE bus, I should expect, the same performance (per MHz) with 48MHz and 0 wait-state SRAM as I got with 12MHz and zero wait states FLASH.

    While the Cortex-M4F CPU will be fetching instructions over the ICODE bus, I think the conflict will be when the ICODE (for instruction fetches) and SBUS (for data accesses) buses both need to access the SRAM. From the documentation I can't find any mention of the SRAM allowing dual-port accesses.

  • Hi Jacob,

    It's been a few days since I have heard from you so I’m assuming your question has been answered.
    If this isn’t the case, please click the "This did NOT resolve my issue" button and reply to this thread with more information.
    If this thread locks, please click the "Ask a related question" button and in the new thread describe the current status of your issue and any additional details you may have to assist us in helping to solve your issues.

