TMS570LC4357: Load/store instructions timings

Part Number: TMS570LC4357

Hi,

I'm performing some measurements at instruction level, using the PMU in order to validate the expected results of EMIF accesses to an external FPGA.

I also extended those tests to measure loads and stores to different kinds of memory: L2_Flash, L2_RAM, EMIF, SYSTEM module registers.

Based on the different clocks of the system, I am trying to work out the timing of a single load or a single store instruction from the Cortex-R5 documentation (Appendix B), but my measurements do not match the expected timings at all.

When performing the measurements, the system is in the following state:

  • the PMU cycle counter is configured to count each cycle
  • GCLK = 300MHz (cortex-r5 clock)
  • HCLK=150MHz (Level 2 memories clock)
  • VCLK = 75MHz (Clock for System modules under PCR1)
  • VCLK3 = 50MHz (EMIF clock)
  • Flash wait states are configured according to the TMS570LC4357 datasheet (i.e. 3)
  • EMIF configuration to access the FPGA: 16-bit data bus, read setup/strobe/hold cycles combined = 8 EMIF Clock cycles; write setup/strobe/hold cycles combined = 5 EMIF clock cycles
  • caches are disabled
  • MPU is enabled: EMIF and SYSTEM peripherals memories are strongly ordered memory, RAM and flash are normal cacheable memories
  • all cycle counts given below have had the cost of accessing the PMU cycle count register subtracted (6 cycles: this measurement matched the Cortex-R5 documentation)
  • the measurement covers only one instruction (ldr or str)
  • all addresses used for loads and stores are aligned on 64 bits
  • all cortex-r5 performance features (example: dual issue) are left in their reset state (enabled)
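
For readers who want to reproduce this kind of measurement, the procedure described in the list above could look roughly like the following. This is a minimal sketch, not the code actually used for the numbers in this thread: the helper names `read_pmccntr`, `corrected_cycles` and `time_ldr32` are hypothetical, the PMU is assumed to be already enabled with the cycle counter running, and the 6-cycle overhead is simply the value quoted above. The CP15 read only compiles for ARM targets with a GCC-style compiler; elsewhere a stub returns 0 so that the arithmetic can still be checked on a host. A real measurement would probably also need barriers around the timed instruction, which are omitted here.

```c
#include <stdint.h>

/* Cost of reading the PMU cycle counter, as quoted in the list above. */
#define PMU_READ_OVERHEAD 6u

/* Read PMCCNTR (CP15 c9, c13, 0). Assumes the PMU is already enabled. */
static inline uint32_t read_pmccntr(void)
{
#if defined(__arm__) && defined(__GNUC__)
    uint32_t value;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(value));
    return value;
#else
    return 0u; /* host stub: no PMU available */
#endif
}

/* Elapsed cycles minus the counter-read overhead.
 * Unsigned subtraction stays correct across a single counter wrap. */
static inline uint32_t corrected_cycles(uint32_t start, uint32_t end)
{
    return (end - start) - PMU_READ_OVERHEAD;
}

/* Time one 32-bit load from addr (hypothetical helper). */
static inline uint32_t time_ldr32(const volatile uint32_t *addr)
{
    uint32_t start = read_pmccntr();
    (void)*addr;                      /* the single load being measured */
    uint32_t end = read_pmccntr();
    return corrected_cycles(start, end);
}
```

On the target, something like `time_ldr32((const volatile uint32_t *)0x08000000)` would correspond to the "L2RAM: load 32-bits" case.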

Addresses used during tests:

  • L2RAM: 0x08000000
  • L2FLASH: 0x00044000
  • SYSTEM module: 0xFFFFFF00
  • EMIF: 0x60000000

Here are the results of the measurements:

  • L2RAM: load 16-bits >> 14 cycles
  • L2RAM: load 32-bits >> 14 cycles
  • L2RAM: load 64-bits >> 14 cycles
  • L2RAM: store 16-bits >> 1 cycle
  • L2RAM: store 32-bits >> 1 cycle
  • L2RAM: store 64-bits >> 1 cycle
  • L2FLASH: load 16-bits >> 24 cycles
  • L2FLASH: load 32-bits >> 24 cycles
  • L2FLASH: load 64-bits >> 24 cycles
  • SYS: load 16-bits >> 31 cycles
  • SYS: load 32-bits >> 31 cycles
  • SYS: write 16-bits >> 27 cycles
  • SYS: write 32-bits >> 27 cycles
  • EMIF: load 16-bits >> 129 cycles
  • EMIF: load 32-bits >> 177 cycles
  • EMIF: load 64-bits >> 275 cycles
  • EMIF: store 16-bits >> 50 cycles
  • EMIF: store 32-bits >> 48 cycles
  • EMIF: store 64-bits >> 88 cycles

Some info that I can see from these measurements:

  • For EMIF accesses, one 16-bit read transfer should take 48 core cycles (8 cycles at 50 MHz >> 48 cycles at 300 MHz), and there are 48 cycles between a 16-bit load and a 32-bit load. There is also a difference of roughly 2*48 cycles between a 32-bit load and a 64-bit load.
  • Store in internal RAM is indicated to take one cycle in chapter B.11 of Cortex-R5 TRM, and this is what is measured
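
The arithmetic behind the EMIF observation can be written out as below. This is only a sketch of a naive model, assuming that each access is split into 16-bit transfers and that each transfer costs the combined setup/strobe/hold time; the function names are hypothetical and nothing here is taken from TI documentation.

```c
#include <assert.h>
#include <stdint.h>

#define GCLK_HZ       300000000u  /* core clock */
#define VCLK3_HZ       50000000u  /* EMIF clock */
#define EMIF_BUS_BITS 16u         /* data bus width to the FPGA */

/* Convert EMIF clock cycles to core cycles (integer clock ratio assumed). */
static uint32_t emif_to_core_cycles(uint32_t emif_cycles)
{
    assert(GCLK_HZ % VCLK3_HZ == 0u);
    return emif_cycles * (GCLK_HZ / VCLK3_HZ);
}

/* Number of 16-bit bus transfers needed for one access of access_bits. */
static uint32_t emif_transfers(uint32_t access_bits)
{
    return (access_bits + EMIF_BUS_BITS - 1u) / EMIF_BUS_BITS;
}

/* Bus time in core cycles for one access under this naive model. */
static uint32_t emif_bus_core_cycles(uint32_t access_bits,
                                     uint32_t emif_cycles_per_transfer)
{
    return emif_transfers(access_bits) *
           emif_to_core_cycles(emif_cycles_per_transfer);
}
```

With the 8-cycle read timing this gives 48, 96 and 192 core cycles for 16-, 32- and 64-bit loads, so the measured 129, 177 and 275 cycles leave a remainder of 81, 81 and 83 cycles respectively. With the 5-cycle write timing the same model gives 30, 60 and 120 core cycles.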

But I can't answer the following questions, even after searching this forum, ARM community forums, ARM application notes. I don't understand what I am doing wrong!

  • L2 RAM has 0 wait states and its clock (HCLK) is half the core clock (GCLK): the load should last 2 or 3 cycles, right?
  • L2 FLASH has 3 wait states and its clock (HCLK) is half the core clock (GCLK): the load should last 6 or 7 cycles, right? 
  • EMIF: a single EMIF 16-bit read transfer should take 48 cycles: why are there 81 cycles (129 - 48) of overhead?
  • EMIF: why do a 16-bit store and a 32-bit store take the same amount of time (a single transfer should be 5 EMIF cycles >> 30 core cycles)? I thought all accesses to strongly ordered memory had to complete before the core continued execution?
  • EMIF: why does the 64-bit store not take 4 * 30 cycles, but less (88)?
  • SYSTEM modules registers: why those numbers?

Any help on this will be highly appreciated!

Thanks,

Gael

  • Hi Gael,

    With the latest silicon revision (revision B), there is a fix for the EMIF issue that forced the external memory to be configured as "strongly-ordered". You can now configure it to be of normal-type or device-type and see a significant performance improvement.

    It will take us some time to get the cycle analysis completed for the accesses you measured. I will keep you posted on the progress in getting this data.

    Regards,
    Sunil
  • Hi Sunil,

    Thank you for the info. I was aware of that, and I also took the measurements with the MPU configured as "Device" for both the SYSTEM peripherals and the EMIF space.

    Under those conditions, the measured times are the following:

    • L2RAM: load 16-bits >> 14 cycles
    • L2RAM: load 32-bits >> 14 cycles
    • L2RAM: load 64-bits >> 14 cycles
    • L2RAM: store 16-bits >> 1 cycle
    • L2RAM: store 32-bits >> 1 cycle
    • L2RAM: store 64-bits >> 1 cycle
    • L2FLASH: load 16-bits >> 24 cycles
    • L2FLASH: load 32-bits >> 24 cycles
    • L2FLASH: load 64-bits >> 24 cycles
    • SYS: load 16-bits >> 29 cycles
    • SYS: load 32-bits >> 29 cycles
    • SYS: write 16-bits >> 1 cycle
    • SYS: write 32-bits >> 1 cycle
    • EMIF: load 16-bits >> 125 cycles
    • EMIF: load 32-bits >> 173 cycles
    • EMIF: load 64-bits >> 271 cycles
    • EMIF: store 16-bits >> 1 cycle
    • EMIF: store 32-bits >> 1 cycle
    • EMIF: store 64-bits >> 1 cycle

    In device mode, is the order of accesses preserved inside an MPU region? By that I mean: if the core performs a write to 0x60000000, it will take 1 cycle, but if it is followed by a read from 0x60000004, will the core be stalled until the previous write is completed?

    I also found on the forum that EMIF access has an "internal delay of 12 VCLK" to start a transfer (assuming that "VCLK" refers to VCLK3) but it was not for TMS570LC4357 specifically. Is that delay applicable to EMIF in the TMS570LC4357?
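
    To put that figure in core cycles, here is a small sketch of the conversion. It assumes, as stated above, that "VCLK" means VCLK3 at 50 MHz; if the 75 MHz VCLK were meant, the result would differ. The function name is hypothetical.

    ```c
    #include <stdint.h>

    /* Convert cycles of a slower clock into core (GCLK) cycles, rounding up. */
    static uint32_t to_core_cycles(uint32_t cycles, uint32_t clk_hz,
                                   uint32_t gclk_hz)
    {
        uint64_t scaled = (uint64_t)cycles * gclk_hz;
        return (uint32_t)((scaled + clk_hz - 1u) / clk_hz);
    }
    ```

    `to_core_cycles(12, 50000000, 300000000)` gives 72 core cycles if VCLK3 is meant, and `to_core_cycles(12, 75000000, 300000000)` gives 48 if the 75 MHz VCLK is meant. Under the VCLK3 assumption, 72 cycles plus the 48-cycle 16-bit read would come to 120 cycles, which is close to the 125 cycles measured for a 16-bit EMIF load in Device mode.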

    Is there a similar delay for reads from the L2RAM? the L2FLASH? the peripherals registers under the Peripheral Interconnect Subsystem?

    I hope you could get me some answers.

    Thanks,

    Gael

  • Hi Sunil,

    Any news about the memories access times?
    Thanks
    Gael
  • Hi Gael,

    I do not have this data yet. I will check on the progress and get back to you.

    Regards,
    Sunil
  • Hi Sunil,

    Any update on the progress?

    Best regards,
    Gael
  • Hi Sunil,

    I hope things are going well. When do you think you will be able to provide some answers?

    Thanks,
    Gael
  • Hi Gael,

    Unfortunately I have not been able to find any bandwidth or throughput information for the EMIF. The data may have to be retaken on our end, depending on how critically it is required.

    Trying to answer some other questions you asked:

    In device mode, is the order of accesses preserved inside an MPU region? By that I mean: if the core performs a write to 0x60000000, it will take 1 cycle, but if it is followed by a read from 0x60000004, will the core be stalled until the previous write is completed?

    >> Yes, in this case the CPU will wait for the write to finish first.

    I also found on the forum that EMIF access has an "internal delay of 12 VCLK" to start a transfer (assuming that "VCLK" refers to VCLK3) but it was not for TMS570LC4357 specifically. Is that delay applicable to EMIF in the TMS570LC4357?

    >> Yes, this delay is applicable to the LC4357 as well.

    Is there a similar delay for reads from the L2RAM? the L2FLASH? the peripherals registers under the Peripheral Interconnect Subsystem?

    >> Yes, it takes 30 CPU cycles to read from L2RAM/L2FLASH or from any of the peripheral registers on the LC4357.
  • Thanks for these answers.

    For the last one, could you detail the 30 CPU cycles for a read? How could this number be the same for flash, RAM and peripherals if the clocks and wait states for each one (flash, ram, peripherals) are not the same?

    Moreover, the measurements I provided to you show less than 30 cycles for each access type (ram: 14 cycles, flash: 24 cycles, SYS module: 29 cycles). How can this be explained?

    Thanks,
    Gael
  • Hi Sunil,

    Can you provide an update on the progress of your investigations?

    Thanks,
    Gael
  • Hi,

    Any news on this subject?

    Thanks,
    Gael
  • Hi,
    Any idea when you will be able to give an answer for this question?
    Regards,
    Gael
  • Hi Gael,

    Getting these internal timings is hard as it requires dedicated time from the design team to reproduce the accesses and provide the data. I will get back to you with an estimated time for the data to be available. Sorry for the long delay in getting these timings to you.

    Regards,
    Sunil
  • OK, thank you. I understand this could take a long time, but I would like to be sure I have not missed anything in the access configuration of the different memories (where applicable) that could affect system performance.