AM2431: XIP - Benchmark issues

Part Number: AM2431
Other Parts Discussed in Thread: SYSCONFIG, UNIFLASH

This is regarding two different items on XIP:

  • Benchmarking

I am doing some XIP benchmarking on the AM243x EVM, running the example benchmark project provided in the SDK. I am seeing higher cycle counts on SDK 11.02.00.24 than on SDK 10.01.00.32.

On 10.01, the OSPI input clock was 133.33MHz with a clock divider of 4, with DDR and PHY mode enabled. The FIR filter computation max (flash read) execution time matches the XIP sample output listed in the SDK guide, around 55000 cycles.

On 11.02, the OSPI input clock is 166.66MHz with a clock divider of 4, with DDR and PHY mode enabled.

Here the max value for the same FIR filter computation is about 85000 to 87000 cycles, compared to the estimated output of about 58000 to 59000 cycles.

I tried a clock divider value of 2 and the flash initialization in "Board_drivers_open()" failed. I did some digging and came across this thread : https://e2e.ti.com/support/microcontrollers/arm-based-microcontrollers-group/arm-based-microcontrollers/f/arm-based-microcontrollers-forum/1477978/mcu-plus-sdk-am243x-questions-about-ospi-ineffective-8d-mode/5680440?tisearch=e2e-sitesearch&keymatch=AM243x%252525252520OSPI%252525252520speeds#

That thread lists a valid configuration for the OSPI controller and the configurations that can get PHY mode working. Please confirm that this also holds on SDK 11.02.

 

As per this thread, when PHY mode is enabled in sysconfig, the clock divider is skipped and the input clock is used as is. https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1431856/faq-am62x-am62ax-am62px-am62d-q1-am64x-am243x-ospi-phy-tuning-algorithm

So with 8D-8D-8D at 166.66MHz, we should be able to achieve ~300MB/s bandwidth.
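
As a rough sanity check on that figure, the sketch below computes the theoretical peak data-phase bandwidth of a serial flash bus from the clock, the number of data lines, and the data rate. The function name and parameters are illustrative and not part of the SDK. The calculation ignores command, address, and dummy cycles as well as controller overhead, so sustained throughput will be lower than the peak it reports. The divide-by-4 case is included only for comparison, in case the divider were applied after all.

```python
def ospi_peak_bandwidth_mb_s(clock_mhz, data_lines, ddr, divider=1):
    """Theoretical peak data-phase bandwidth in MB/s (1 MB = 10**6 bytes).

    clock_mhz  : input clock in MHz
    data_lines : number of data lines (8 for octal, 1 for single)
    ddr        : True if data is transferred on both clock edges
    divider    : clock divider applied to the input clock (assumed 1 in PHY mode)
    """
    bits_per_clock = data_lines * (2 if ddr else 1)
    return (clock_mhz / divider) * bits_per_clock / 8


octal_ddr = ospi_peak_bandwidth_mb_s(166.66, data_lines=8, ddr=True)
octal_ddr_div4 = ospi_peak_bandwidth_mb_s(166.66, data_lines=8, ddr=True, divider=4)
single_sdr = ospi_peak_bandwidth_mb_s(166.66, data_lines=1, ddr=False)

print(f"8D-8D-8D, no divider: {octal_ddr:.1f} MB/s")
print(f"8D-8D-8D, divider 4 : {octal_ddr_div4:.1f} MB/s")
print(f"1S-1S-1S, no divider: {single_sdr:.1f} MB/s")
```

The octal DDR peak at 166.66MHz is about 333 MB/s, so ~300MB/s is plausible only if the divider really is bypassed.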

How can I verify from SW/FW that PHY mode is actually in use and that the OSPI is clocked at the right frequency?

What changed in the latest SDK? Is this performance expected?

  • Flash write in XIP

Here are my thoughts right now. Please let me know if this is feasible.

  1. Via sysconfig, place the function that writes to the flash in a section in SRAM.
  2. Mark the required flash region with WR permission in the MPU, but keep the rest for XIP with RD permission.
  3. When the interrupt for a flash write is triggered, stop/halt my application's execution from flash.
  4. Disable DAC mode. (Maybe not required.)
  5. Call the flash write function to perform the desired writes.

Thanks,

Prasanna

  • Hi Prasanna,

    Could you provide the sbl_ospi.cfg configuration file you're using with the XIP benchmark example? I'd like to examine the flash memory offset settings and writer configurations.

    Can you also share the link to the XIP benchmark implementation you're testing?

    Meanwhile, I will run the xip benchmarks on MCU+ SDK 11.02 and 10.01 on my setup and check for the discrepancy.

    Regards,

    Aryamaan Chaurasia

  • Hi Aryamaan,

    I made sure I am not erasing the PHY tuning vector at 0x2000000, and I put all the images at their default offsets, so there is no need to worry about overwriting the vector.

    XIP image is using flash offset 0x400000.

    Here is my sbl_ospi.cfg file:

    # First point to sbl_uart_uniflash binary, which functions as a server to flash one or more files
    --flash-writer=sbl_prebuilt/am243x-evm/sbl_uart_uniflash.release.hs_fs.tiimage

    # When sending bootloader make sure to flash at offset 0x0. ROM expects bootloader at offset 0x0
    --file=sbl_ospi_am243x-evm_r5fss0-0_nortos_ti-arm-clang/sbl_ospi.Debug.hs_fs.tiimage --operation=flash --flash-offset=0x0

    --file=XIP_test/Debug/XIP_test.appimage.hs_fs --operation=flash --flash-offset=0x80000
    --file=XIP_test/Debug/XIP_test.appimage_xip --operation=flash-xip

    # Program the OSPI PHY tuning attack vector
    --operation=flash-phy-tuning-data

    This is the example project I am using for the benchmark: 

    C:\ti\mcu_plus_sdk_am243x_11_02_00_24\examples\kernel\dpl\xip_benchmark\am243x-evm\r5fss0-0_freertos.

    I am running only one R5 core; the other cores are not used.

    Thanks,

    Prasanna

  • Hi Prasanna,

    I noticed that you are running the xip benchmark in debug mode.

    For production purposes, we strongly recommend running the example in release mode.

    I have tested the xip benchmark on R5FSS0-0 core in release mode:

    Mode      Stats (number of cycles)   SDK 10.01   SDK 11.02
    NO-SYNC   Min                        26741       24050
    NO-SYNC   Max                        54002       48656
    NO-SYNC   Avg                        29580       26513
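
To put the release-mode numbers above in relative terms, here is a small sketch that computes the percentage change from SDK 10.01 to SDK 11.02 for each statistic. The cycle counts are copied from the reported results; the dictionary and function names are illustrative only.

```python
# Release-mode NO-SYNC cycle counts reported above: (SDK 10.01, SDK 11.02)
release_cycles = {
    "Min": (26741, 24050),
    "Max": (54002, 48656),
    "Avg": (29580, 26513),
}


def percent_change(old, new):
    """Relative change from old to new, in percent (negative means fewer cycles)."""
    return (new - old) / old * 100.0


for stat, (sdk_10_01, sdk_11_02) in release_cycles.items():
    print(f"{stat}: {percent_change(sdk_10_01, sdk_11_02):+.1f}%")
```

Each statistic comes out roughly 10% lower on 11.02 in this measurement, which is the opposite direction from the regression reported in the original post.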

    Additionally,

    The xip_benchmark example uses an OSPI clock frequency of 166 MHz, with an input clock division of 4, and has PHY and DMA enabled for both SDK versions. For more details on the OSPI configurations used in the XIP benchmark examples, please refer to the sbl_ospi example.

    The ospi_flash_xip example uses an OSPI clock frequency of 133 MHz, with an input clock division of 4, and has PHY and DMA enabled for both SDK versions.

    Regarding your second question,

    Flash writes can be performed normally in XIP mode; refer to the ospi_flash_io example for guidance on how to perform them.

    Note that for successful flash writes, the memory regions must be configured as Strongly Ordered, rather than Cached.

    Regards,

    Aryamaan Chaurasia

  • Hi Aryamaan,

    Thanks for the response. I retried a few things and I am now seeing even more unbelievable numbers. These are consistent across several runs, not just a one-time wonder.

    - No difference shown between Debug/Release build profile. They produce the same numbers.

    - No difference when OSPI - DMA is ON/OFF. 

    - In the new SDK 11.02 -> sysconfig shows an option to toggle on/off "Validate OTP" under OSPI configuration. This wasn't present in the older SDK versions. I thought this was making a difference and tried disabling that option. But no improvement. I still think there is some potential mis-configuration in OSPI IP which is blowing the numbers up.

    The uncached run takes about 10 times longer than the cached one, i.e. roughly SRAM @ 1600MB/s vs Flash @ 160MB/s. This implies OSPI could be running at 166.66 MHz using the PHY, but in 1S-1S-1S mode instead of 8D-8D-8D.

    Even more confusing, memcpy is also slower than in my previous runs. The CPU clock is running at 800MHz as per the SBL profile logs. Has the CPU somehow slowed down as well?

    Here is my OSPI configuration in my SBL. My XIP App doesn't have OSPI/Flash drivers enabled at all. I am skipping them altogether. 

    Logs:


    DMSC Firmware Version 11.2.5--v11.02.05 (Fancy Rat)
    DMSC Firmware revision 0xb
    DMSC ABI revision 4.0

    rx_buf 0xaf
    KPI_DATA: [BOOTLOADER_PROFILE] CPU Clock : 800.000 MHz
    KPI_DATA: [BOOTLOADER_PROFILE] Boot Media : NOR SPI FLASH
    KPI_DATA: [BOOTLOADER_PROFILE] Boot Media Clock : 166.667 MHz
    KPI_DATA: [BOOTLOADER_PROFILE] Boot Image Size : 0 KB
    KPI_DATA: [BOOTLOADER_PROFILE] Cores present :
    r5f0-0
    KPI_DATA: [BOOTLOADER PROFILE] SYSFW init : 11299us
    KPI_DATA: [BOOTLOADER PROFILE] System_init : 11324us
    KPI_DATA: [BOOTLOADER PROFILE] Drivers_open : 1694us
    KPI_DATA: [BOOTLOADER PROFILE] Board_driversOpen : 31057us
    KPI_DATA: [BOOTLOADER PROFILE] Sciclient Get Version : 9990us
    KPI_DATA: [BOOTLOADER PROFILE] CPU load : 3354us
    KPI_DATA: [BOOTLOADER PROFILE] SBL End : 5us
    KPI_DATA: [BOOTLOADER_PROFILE] SBL Total Time Taken : 68726us

    Image loading done, switching to application ...

    ### XIP benchmark ###
    FIR:
    24049 cycles (code/data fully cached) ,
    284667 cycles (code/data not cached) ,
    50112 cycles (code/data not-cached 1 of 10 iterations)
    MEMCPY:
    1564 cycles (code/data fully cached) ,
    28781 cycles (code/data not cached) ,
    4295 cycles (code/data not-cached 1 of 10 iterations)
    All tests have passed!!
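
For reference, the sketch below turns the cycle counts from this log into slowdown ratios and wall-clock times. It assumes the benchmark counts CPU cycles at the 800 MHz reported in the SBL profile; the names used are illustrative and not taken from the SDK.

```python
CPU_HZ = 800_000_000  # "CPU Clock : 800.000 MHz" from the SBL profile log

# Cycle counts copied from the XIP benchmark output above
results = {
    "FIR": {"cached": 24049, "not_cached": 284667, "one_of_ten": 50112},
    "MEMCPY": {"cached": 1564, "not_cached": 28781, "one_of_ten": 4295},
}


def cycles_to_us(cycles, cpu_hz=CPU_HZ):
    """Convert a CPU cycle count to microseconds."""
    return cycles / cpu_hz * 1e6


def slowdown(entry):
    """Ratio of the fully uncached run to the fully cached run."""
    return entry["not_cached"] / entry["cached"]


for name, entry in results.items():
    print(
        f"{name}: cached {cycles_to_us(entry['cached']):.2f} us, "
        f"not cached {cycles_to_us(entry['not_cached']):.2f} us, "
        f"slowdown x{slowdown(entry):.1f}"
    )
```

With these inputs the uncached FIR run is a little under 12 times slower than the cached one, and the uncached memcpy run is a little over 18 times slower.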

    Thanks,

    Prasanna

  • Hi Prasanna,

    I've observed that the minimum cycles (fully cached) are identical for both our setups. However, the discrepancy in maximum cycles suggests a potential issue with the flash configuration, as the flash read speeds in your case might be significantly slower.

    To help investigate, could you please confirm whether the AM243x EVM is equipped with the Serial NOR OSPI Flash (S28HS512T)?

    Additionally, to check the flash configuration, would you mind sharing the sysconfig file from your SBL OSPI example? Have you made any modifications to the XIP benchmark example, or are you running the default example without any changes?

    Regarding the Validate OTP option, enabling it may introduce a slight performance overhead, but it shouldn't cause such a substantial slowdown unless the validation process fails.

    Regards,

    Aryamaan Chaurasia