AM2431: XIP - Benchmark issues

Part Number: AM2431
Other Parts Discussed in Thread: SYSCONFIG, UNIFLASH

This is regarding two different items on XIP:

  • Benchmarking

I am doing some XIP benchmarking on the AM243x EVM using the example benchmark project provided in the SDK. I am seeing higher cycle counts on SDK 11.02.00.24 than on SDK 10.01.00.32.

On SDK 10.01, the OSPI input clock was 133.33MHz with a clock divider of 4, in DDR and PHY mode. The FIR filter computation's max (flash read) execution time matches the XIP sample output listed in the SDK guide, around ~55000 cycles.

On SDK 11.02, the OSPI input clock is 166.66MHz with a clock divider of 4, in DDR and PHY mode.

Here the same FIR filter computation has a max value of about 85000 to 87000 cycles, compared to the expected output of about 58000 to 59000 cycles.

I tried a clock divider value of 2 and the flash initialization in "Board_driversOpen()" failed. I did some digging and came across this thread: https://e2e.ti.com/support/microcontrollers/arm-based-microcontrollers-group/arm-based-microcontrollers/f/arm-based-microcontrollers-forum/1477978/mcu-plus-sdk-am243x-questions-about-ospi-ineffective-8d-mode/5680440?tisearch=e2e-sitesearch&keymatch=AM243x%252525252520OSPI%252525252520speeds#

That thread lists a valid configuration for the OSPI controller and the possible configurations for getting PHY mode working. Please confirm this also holds on SDK 11.02.

 

As per this thread, when PHY mode is enabled in sysconfig, the clock divider is skipped and the input clock is used as is: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1431856/faq-am62x-am62ax-am62px-am62d-q1-am64x-am243x-ospi-phy-tuning-algorithm

So with 8D-8D-8D at 166.66MHz, we should be able to achieve ~300MB/s bandwidth.
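As a rough check of that figure, here is a minimal sketch of the arithmetic in Python. It assumes the theoretical peak is simply clock rate times data lines times edges per clock, and it ignores command, address and dummy-cycle overhead, so real throughput will be lower. The function name is only illustrative.

```python
def peak_bandwidth_mb_s(clock_mhz, data_lines, ddr):
    """Theoretical peak transfer rate in MB/s, ignoring protocol overhead."""
    edges_per_clock = 2 if ddr else 1          # DDR transfers on both clock edges
    bits_per_clock = data_lines * edges_per_clock
    return clock_mhz * bits_per_clock / 8      # bits -> bytes

# 8D-8D-8D: 8 data lines, DDR, full 166.66 MHz input clock
octal_ddr = peak_bandwidth_mb_s(166.66, data_lines=8, ddr=True)
# Same mode if a divider of 4 were actually applied to the input clock
octal_ddr_div4 = peak_bandwidth_mb_s(166.66 / 4, data_lines=8, ddr=True)
# 1S-1S-1S: 1 data line, SDR
single_sdr = peak_bandwidth_mb_s(166.66, data_lines=1, ddr=False)

print(f"8D-8D-8D @ 166.66 MHz:      {octal_ddr:.1f} MB/s")
print(f"8D-8D-8D @ 166.66 MHz / 4:  {octal_ddr_div4:.1f} MB/s")
print(f"1S-1S-1S @ 166.66 MHz:      {single_sdr:.1f} MB/s")
```

By this estimate the 8D-8D-8D peak is about 333 MB/s, so ~300MB/s is a reasonable expectation only if the divider really is bypassed in PHY mode.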

How can I verify from SW/FW that PHY mode is actually in use and that the OSPI is clocked at the right frequency?

What changed in the latest SDK? Is this performance expected?

  • Flash write in XIP

Here are my thoughts right now. Please let me know if this is feasible.

  1. Via sysconfig, place the required flash write function in a section in SRAM.
  2. Mark the required flash region with WR permission in the MPU, but keep the rest for XIP with RD permission.
  3. When the interrupt for a flash write is triggered, stop/halt my application's execution from flash.
  4. Disable DAC mode. (This may not be required.)
  5. Call the flash write function to perform the desired writes.

Thanks,

Prasanna

  • Hi Prasanna,

    Could you provide the sbl_ospi.cfg configuration file you're using with the XIP benchmark example? I'd like to examine the flash memory offset settings and writer configurations.

    Can you also share the link to the XIP benchmark implementation you're testing?

    Meanwhile, I will run the xip benchmarks on MCU+ SDK 11.02 and 10.01 on my setup and check for the discrepancy.

    Regards,

    Aryamaan Chaurasia

  • Hi Aryamaan,

    I made sure I am not erasing the PHY tuning vector at 0x2000000 and put all the images at the default offsets, so there is no need to worry about the vector being overwritten.

    XIP image is using flash offset 0x400000.

    Here is my sbl_ospi.cfg file:

    # First point to sbl_uart_uniflash binary, which function's as a server to flash one or more files
    --flash-writer=sbl_prebuilt/am243x-evm/sbl_uart_uniflash.release.hs_fs.tiimage

    # When sending bootloader make sure to flash at offset 0x0. ROM expects bootloader at offset 0x0
    --file=sbl_ospi_am243x-evm_r5fss0-0_nortos_ti-arm-clang/sbl_ospi.Debug.hs_fs.tiimage --operation=flash --flash-offset=0x0

    --file=XIP_test/Debug/XIP_test.appimage.hs_fs --operation=flash --flash-offset=0x80000
    --file=XIP_test/Debug/XIP_test.appimage_xip --operation=flash-xip

    # Program the OSPI PHY tuning attack vector
    --operation=flash-phy-tuning-data

    This is the example project I am using for the benchmark: 

    C:\ti\mcu_plus_sdk_am243x_11_02_00_24\examples\kernel\dpl\xip_benchmark\am243x-evm\r5fss0-0_freertos.

    I am running only one R5 core and not using all other cores.

    Thanks,

    Prasanna

  • Hi Prasanna,

    I noticed that you are running the XIP benchmark in debug mode.

    For production purposes, we strongly recommend running the example in release mode.

    I have tested the xip benchmark on R5FSS0-0 core in release mode:

    Mode      Stats (number of cycles)    SDK 10.01    SDK 11.02
    NO-SYNC   Min                         26741        24050
    NO-SYNC   Max                         54002        48656
    NO-SYNC   Avg                         29580        26513

    Additionally,

    The xip_benchmark example uses an OSPI clock frequency of 166 MHz, with an input clock division of 4, and has PHY and DMA enabled for both SDK versions. For more details on the OSPI configurations used in the XIP benchmark examples, please refer to the sbl_ospi example.

    The ospi_flash_xip example uses an OSPI clock frequency of 133 MHz, with an input clock division of 4, and has PHY and DMA enabled for both SDK versions.

    Regarding your second question,

    Flash writes can be performed normally in XIP mode. You can refer to the ospi_flash_io example for guidance on how to perform flash writes.

    Note that for successful flash writes, the memory regions must be configured as Strongly Ordered, rather than Cached.

    Regards,

    Aryamaan Chaurasia

  • Hi Aryamaan,

    Thanks for the response. I retried a few things and I am now seeing even more unbelievable numbers. These are consistent across several runs, not just a one-time wonder.

    - No difference shown between Debug/Release build profile. They produce the same numbers.

    - No difference when OSPI - DMA is ON/OFF. 

    - In the new SDK 11.02 -> sysconfig shows an option to toggle on/off "Validate OTP" under OSPI configuration. This wasn't present in the older SDK versions. I thought this was making a difference and tried disabling that option. But no improvement. I still think there is some potential mis-configuration in OSPI IP which is blowing the numbers up.

    The uncached run takes about 10 times longer than the cached access: SRAM @ 1600MB/s vs. flash @ 160MB/s. This suggests the OSPI could be running at 166.66 MHz using the PHY, but in 1S-1S-1S mode instead of 8D-8D-8D.

    Even more confusing, memcpy is also slower than in my previous runs. The CPU clock is running at 800MHz as per the SBL profile logs. Has the CPU somehow slowed down as well?

    Here is my OSPI configuration in my SBL. My XIP App doesn't have OSPI/Flash drivers enabled at all. I am skipping them altogether. 

    Logs:


    DMSC Firmware Version 11.2.5--v11.02.05 (Fancy Rat)
    DMSC Firmware revision 0xb
    DMSC ABI revision 4.0

    rx_buf 0xaf
    KPI_DATA: [BOOTLOADER_PROFILE] CPU Clock : 800.000 MHz
    KPI_DATA: [BOOTLOADER_PROFILE] Boot Media : NOR SPI FLASH
    KPI_DATA: [BOOTLOADER_PROFILE] Boot Media Clock : 166.667 MHz
    KPI_DATA: [BOOTLOADER_PROFILE] Boot Image Size : 0 KB
    KPI_DATA: [BOOTLOADER_PROFILE] Cores present :
    r5f0-0
    KPI_DATA: [BOOTLOADER PROFILE] SYSFW init : 11299us
    KPI_DATA: [BOOTLOADER PROFILE] System_init : 11324us
    KPI_DATA: [BOOTLOADER PROFILE] Drivers_open : 1694us
    KPI_DATA: [BOOTLOADER PROFILE] Board_driversOpen : 31057us
    KPI_DATA: [BOOTLOADER PROFILE] Sciclient Get Version : 9990us
    KPI_DATA: [BOOTLOADER PROFILE] CPU load : 3354us
    KPI_DATA: [BOOTLOADER PROFILE] SBL End : 5us
    KPI_DATA: [BOOTLOADER_PROFILE] SBL Total Time Taken : 68726us

    Image loading done, switching to application ...

    ### XIP benchmark ###
    FIR:
    24049 cycles (code/data fully cached) ,
    284667 cycles (code/data not cached) ,
    50112 cycles (code/data not-cached 1 of 10 iterations)
    MEMCPY:
    1564 cycles (code/data fully cached) ,
    28781 cycles (code/data not cached) ,
    4295 cycles (code/data not-cached 1 of 10 iterations)
    All tests have passed!!
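To put numbers on the slowdown in the log above, here is a minimal sketch in Python that computes the ratio of the "not cached" to the "fully cached" cycle counts and converts cycles to microseconds. It assumes the cycle counts are CPU cycles at the 800 MHz reported by the SBL. The helper names are illustrative only.

```python
CPU_MHZ = 800.0  # from the SBL log line "CPU Clock : 800.000 MHz"

# Cycle counts copied from the benchmark log
results = {
    "FIR":    {"cached": 24049, "not_cached": 284667, "not_cached_1_of_10": 50112},
    "MEMCPY": {"cached": 1564,  "not_cached": 28781,  "not_cached_1_of_10": 4295},
}

def slowdown(cycles, baseline):
    """How many times longer 'cycles' is than 'baseline'."""
    return cycles / baseline

def cycles_to_us(cycles, cpu_mhz=CPU_MHZ):
    """Convert CPU cycles to microseconds (assumes a constant CPU clock)."""
    return cycles / cpu_mhz

for name, r in results.items():
    ratio = slowdown(r["not_cached"], r["cached"])
    print(f"{name}: not cached is {ratio:.1f}x the cached run "
          f"({cycles_to_us(r['not_cached']):.1f} us vs "
          f"{cycles_to_us(r['cached']):.1f} us)")
```

With these inputs the ratios come out to roughly 11.8x for FIR and 18.4x for MEMCPY.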

    Thanks,

    Prasanna

  • Hi Prasanna,

    I've observed that the minimum cycles (fully cached) are identical for both our setups. However, the discrepancy in maximum cycles suggests a potential issue with the flash configuration, as the flash read speeds in your case might be significantly slower.

    To help investigate, could you please confirm whether the AM243x EVM is equipped with the Serial NOR OSPI Flash (S28HS512T)?

    Additionally, to check the flash configuration, would you mind sharing the sysconfig file from your SBL OSPI example? Have you made any modifications to the XIP benchmark example, or are you running the default example without any changes?

    Regarding the Validate OTP option, enabling it may introduce a slight performance overhead, but it shouldn't cause such a substantial slowdown unless the validation process fails.

    Regards,

    Aryamaan Chaurasia

  • Hi Aryamaan,

    Yes, I am using the EVM with the mentioned flash part on board. I also expected to see only minimal overhead with OTP validation turned on.

    I haven't made any changes to the XIP Benchmark code/configuration. 

    Here is my sbl syscfg file:

    0451.example.zip

    Thanks,

    Prasanna

  • Hi Prasanna,

    Can you try using a newly installed MCU+ SDK 11.02, which includes the unchanged SBL OSPI example and XIP Benchmark example? This will help us isolate the issue, as the SBL OSPI code remains unchanged.

    I've tested the unchanged SBL OSPI example and XIP Benchmark example, and I'm unable to reproduce the issue you're experiencing.

    Thanks and Regards,

    Aryamaan Chaurasia

  • Hi Aryamaan,

    I re-ran the default sbl_ospi image that was part of the 11.02 SDK and the numbers match yours. The moment I switch to my SBL_OSPI, I see poor numbers reported.

    Based on your review of the SBL syscfg file, do you see any outliers or obvious settings that do not match the base SBL OSPI?

    First, I want to understand whether this is a compile-time issue, a run-time issue, or a combination of both:

    - Compile time: compiler settings, build environment, or sysconfig. It could even be a version difference.

    - Run time: OSPI speed/mode, PHY not being enabled or used properly, or OTP validation failing and the retry logic taking longer to complete.

    Thanks,

    Prasanna

  • Hi Prasanna,

    I've observed that the memory configurations in your SBL OSPI example differ from the original SBL OSPI, notably with the addition of a flash memory section. Could you share the complete SBL OSPI example? Are you performing any Flash operations in your SBL OSPI example?

    Regards,

    Aryamaan Chaurasia

  • Hi Aryamaan,

    Quick updates on my experiments from yesterday:

    - Yes. I have some env variables in the flash, which are read by the SBL. I have already removed the flash read part of my SBL code, along with the MPU region and memory section. I let the XIP app configure the MPU region as cached and executable. Still no improvement in XIP performance.

    - I tried changing the OSPI clock to 133.33 MHz, but for some reason this didn't work and the benchmark didn't run at all. I wanted to see whether the benchmark numbers get slower at this rate, to narrow down the issue.

    - I verified over JTAG that the OSPI registers are set up correctly for 8D-8D-8D mode. I checked OSPI_CONFIG_REG, OSPI_DEV_INSTR_RD_CONFIG_REG and OSPI_DEV_INSTR_WR_CONFIG_REG and don't see any issues there.

    - I did some quick checks on the build env and noticed that I am using sysconfig 1.25 while the SDK example shows 1.21. I further probed and found this in sysconfig for the TargetConfigs. This is different than the sbl_ospi example that comes with the SDK. Is this variant making any default assumptions ?
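The register check mentioned above can be made less error-prone by decoding the raw values read over JTAG instead of inspecting the bits by eye. Below is a minimal sketch in Python. The bit positions are assumptions based on the commonly documented layout of the Cadence OSPI controller and should be checked against the AM243x TRM before relying on them. The sample values are made up for illustration and are not readings from real hardware.

```python
XFER_TYPE = {0: "single", 1: "dual", 2: "quad", 3: "octal"}

def field(reg, lsb, width):
    """Extract a bit field of 'width' bits starting at bit 'lsb'."""
    return (reg >> lsb) & ((1 << width) - 1)

def decode_config(reg):
    """Decode OSPI_CONFIG_REG (bit positions are assumptions, verify in the TRM)."""
    return {
        "controller_enabled":    bool(field(reg, 0, 1)),
        "phy_mode_enabled":      bool(field(reg, 3, 1)),
        "direct_access_enabled": bool(field(reg, 7, 1)),
        "baud_divisor":          2 * (field(reg, 19, 4) + 1),  # 0 -> /2, 1 -> /4, ...
        "dtr_protocol_enabled":  bool(field(reg, 24, 1)),
        "phy_pipeline_enabled":  bool(field(reg, 25, 1)),
    }

def decode_read_instr(reg):
    """Decode OSPI_DEV_INSTR_RD_CONFIG_REG (bit positions are assumptions)."""
    return {
        "opcode":       hex(field(reg, 0, 8)),
        "instr_lines":  XFER_TYPE[field(reg, 8, 2)],
        "ddr_enabled":  bool(field(reg, 10, 1)),
        "addr_lines":   XFER_TYPE[field(reg, 12, 2)],
        "data_lines":   XFER_TYPE[field(reg, 16, 2)],
        "dummy_cycles": field(reg, 24, 5),
    }

# Made-up sample values, built field by field so the intent is visible
sample_config = (1 << 0) | (1 << 3) | (1 << 7) | (1 << 19) | (1 << 24) | (1 << 25)
sample_read = 0xEE | (3 << 8) | (1 << 10) | (3 << 12) | (3 << 16) | (20 << 24)

for name, value in decode_config(sample_config).items():
    print(f"CONFIG   {name}: {value}")
for name, value in decode_read_instr(sample_read).items():
    print(f"RD_INSTR {name}: {value}")
```

Feeding in the values captured with my SBL and with the stock sbl_ospi, then comparing the two decoded outputs, should show whether PHY mode, the divider field or the dummy cycles differ.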

    Thanks,

    Prasanna

  • Hi Prasanna,

    1). That's correct. For an XIP application, the flash region should be marked as Cached and enabled for code execution. However, this is the default setting for the XIP Benchmark example, so this shouldn't be an issue.

    2). With OSPI Clock to 133.33 MHz, the following numbers are obtained:

    Benchmark   Stats (number of cycles)    SDK 11.02
    FIR         Min                         24050
    FIR         Max                         51734
    FIR         Avg                         26873
    MEMCPY      Min                         1564
    MEMCPY      Max                         4259
    MEMCPY      Avg                         1841

    The benchmark results for the maximum number of CPU cycles are indeed slower.

    3). Can you also verify the values of the OSPI registers during the debugging of XIP Benchmark example?

    4). SysConfig version should not be an issue, since I too am using version 1.25.

    Can you also confirm the number of READ and CMD dummy cycles configured for your Flash? This is present in Protocol Enable Configuration Section in SysConfig.
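To quantify "slower" in item 2), here is a minimal sketch in Python that compares the FIR numbers at 133.33 MHz with the SDK 11.02 release-mode results at 166 MHz reported earlier in this thread. It assumes the two runs differ only in the OSPI clock.

```python
def percent_change(new, old):
    """Relative change of 'new' versus 'old', in percent."""
    return (new - old) / old * 100.0

# FIR cycle counts on SDK 11.02, copied from the two tables in this thread
fir_166 = {"min": 24050, "max": 48656, "avg": 26513}
fir_133 = {"min": 24050, "max": 51734, "avg": 26873}

for stat in ("min", "max", "avg"):
    change = percent_change(fir_133[stat], fir_166[stat])
    print(f"FIR {stat}: {change:+.1f}% at 133.33 MHz vs 166 MHz")
```

The fully cached minimum is unchanged, while the maximum grows by about 6.3%. That is the expected pattern when only the flash clock changes.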

    Regards,

    Aryamaan Chaurasia

  • Hi Aryamaan,

    I am looking at other high priority tasks right now. I haven't been able to make progress on this item. It will be a while till I get back to this item.

    Can you please clarify one thing? In CCS, I always click on "Rebuilding project" after I make any changes to syscfg. Hopefully this regenerates the required PM/DM cfgs for the Sci drivers for DMSC. I am not sure if DMSC is somehow not clocking the OSPI right.

    Here is the quick flash configuration in sysconfig: 

       

    Thanks,

    Prasanna

  • Hi Prasanna,

    In CCS, I always click on "Rebuilding project" after I make any changes to syscfg, hopefully this re-generates the required PM/DM cfgs for the Sci Drivers for DMSC. I am not sure if DMSC is somehow not clocking the OSPI right.

    This should work fine. The flash configuration you attached is correct, assuming the on-board flash is the S28HS512T in your case as well.

    I am looking at other high priority tasks right now. I haven't been able to make progress on this item. It will be a while till I get back to this item.

    Looking forward to it.

    Thanks,

    Vaibhav