TMS570LC4357: Load/store instructions timings

Gael Le Moing

Part Number: TMS570LC4357

Hi,

I'm performing some measurements at instruction level, using the PMU in order to validate the expected results of EMIF accesses to an external FPGA.

I also extended those tests to measure loads and stores to different kinds of memory: L2_Flash, L2_RAM, EMIF, and SYSTEM module registers.

Based on the different clocks of the system, I am trying to work out the timing of a single load or a single store instruction from the Cortex-R5 documentation (Appendix B), but my measurements do not match the expected timings at all.

When performing the measurements, the system is in the following state:

the PMU cycle counter is configured to count each cycle
GCLK = 300MHz (cortex-r5 clock)
HCLK=150MHz (Level 2 memories clock)
VCLK = 75MHz (Clock for System modules under PCR1)
VCLK3 = 50MHz (EMIF clock)
Flash wait states are configured according to the TMS570LC4357 datasheet (i.e. 3)
EMIF configuration to access the FPGA: 16-bit data bus, read setup/strobe/hold cycles combined = 8 EMIF Clock cycles; write setup/strobe/hold cycles combined = 5 EMIF clock cycles
caches are disabled
MPU is enabled: the EMIF and SYSTEM peripheral regions are strongly-ordered memory; RAM and flash are normal cacheable memory
all cycle counts given below have had the cost of reading the PMU cycle count register subtracted (6 cycles; this figure matches the Cortex-R5 documentation)
the measurement covers only one instruction (ldr or str)
all addresses used for loads and stores are aligned on 64 bits
all Cortex-R5 performance features (e.g. dual issue) are left in their reset state (enabled)
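The original post does not show its measurement code. The following is only a minimal sketch of how such a measurement could be set up, assuming a GCC-style toolchain (the inline-assembly syntax differs for the TI compiler). The helper names `net_cycles`, `pmu_enable_ccnt` and `time_ldr32` are hypothetical; the 6-cycle overhead is the figure quoted in the list above. The target-only part is guarded so that the overhead-subtraction helper can also be compiled on a host.

```c
#include <stdint.h>

/* Net cost of the measured instruction: raw PMU delta minus the cost of
 * reading the cycle counter. Unsigned subtraction stays correct across a
 * single 32-bit counter wrap. Returns 0 if the delta is below the overhead. */
static inline uint32_t net_cycles(uint32_t start, uint32_t end, uint32_t overhead)
{
    uint32_t raw = end - start;
    return (raw > overhead) ? (raw - overhead) : 0u;
}

#if defined(__arm__) && defined(__GNUC__)
/* Enable the PMU cycle counter, counting every core cycle (PMCR.D = 0). */
static inline void pmu_enable_ccnt(void)
{
    uint32_t pmcr;
    __asm__ volatile ("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));   /* read PMCR */
    pmcr |= (1u << 0) | (1u << 2);   /* E: enable counters, C: reset cycle counter */
    pmcr &= ~(1u << 3);              /* D = 0: no divide-by-64 */
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 0" :: "r"(pmcr));
    __asm__ volatile ("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u << 31)); /* PMCNTENSET.C */
}

/* Time a single 32-bit load. Assumption: whether the second counter read
 * waits for the load to complete depends on the pipeline; the thread does
 * not say how this was handled in the original measurements. */
static inline uint32_t time_ldr32(const volatile uint32_t *addr)
{
    uint32_t t0, t1, tmp;
    __asm__ volatile (
        "isb\n\t"
        "mrc p15, 0, %0, c9, c13, 0\n\t"   /* PMCCNTR before */
        "ldr %2, [%3]\n\t"                 /* instruction under test */
        "mrc p15, 0, %1, c9, c13, 0\n\t"   /* PMCCNTR after */
        : "=&r"(t0), "=&r"(t1), "=&r"(tmp)
        : "r"(addr)
        : "memory");
    (void)tmp;
    return net_cycles(t0, t1, 6u);         /* 6 cycles: overhead quoted in the post */
}
#endif
```

On the target this would be used as `pmu_enable_ccnt();` followed by, for example, `time_ldr32((const volatile uint32_t *)0x08000000)` for the L2RAM case.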

Addresses used during tests:

L2RAM: 0x08000000
L2FLASH: 0x00044000
SYSTEM module: 0xFFFFFF00
EMIF: 0x60000000

Here are the results of the measurements:

L2RAM: load 16-bits >> 14 cycles
L2RAM: load 32-bits >> 14 cycles
L2RAM: load 64-bits >> 14 cycles
L2RAM: store 16-bits >> 1 cycle
L2RAM: store 32-bits >> 1 cycle
L2RAM: store 64-bits >> 1 cycle
L2FLASH: load 16-bits >> 24 cycles
L2FLASH: load 32-bits >> 24 cycles
L2FLASH: load 64-bits >> 24 cycles
SYS: load 16-bits >> 31 cycles
SYS: load 32-bits >> 31 cycles
SYS: write 16-bits >> 27 cycles
SYS: write 32-bits >> 27 cycles
EMIF: load 16-bits >> 129 cycles
EMIF: load 32-bits >> 177 cycles
EMIF: load 64-bits >> 275 cycles
EMIF: store 16-bits >> 50 cycles
EMIF: store 32-bits >> 48 cycles
EMIF: store 64-bits >> 88 cycles

Some info that I can see from these measurements:

For EMIF accesses, one 16-bit read transfer should take 48 core cycles (8 cycles at 50 MHz = 48 cycles at 300 MHz), and there are 48 cycles between a 16-bit load and a 32-bit load. There is also a difference of roughly 2*48 cycles (98 measured) between a 32-bit load and a 64-bit load.
Chapter B.11 of the Cortex-R5 TRM indicates that a store to internal RAM takes one cycle, and this is what is measured
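The clock-ratio arithmetic behind the first observation can be written out as a small sketch. The function names are hypothetical, and the model assumes the core clock is an integer multiple of the EMIF clock, which holds for the 300 MHz / 50 MHz setup described above.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a duration in bus-clock cycles to core-clock cycles.
 * Assumes the core clock is an integer multiple of the bus clock. */
static inline uint32_t bus_to_core_cycles(uint32_t bus_cycles,
                                          uint32_t core_hz, uint32_t bus_hz)
{
    assert(bus_hz != 0u && core_hz % bus_hz == 0u);
    return bus_cycles * (core_hz / bus_hz);
}

/* Number of bus transfers needed for one access on a narrower data bus. */
static inline uint32_t emif_transfers(uint32_t access_bits, uint32_t bus_bits)
{
    assert(bus_bits != 0u);
    return (access_bits + bus_bits - 1u) / bus_bits;
}
```

With GCLK = 300 MHz and VCLK3 = 50 MHz, `bus_to_core_cycles(8, 300000000, 50000000)` gives 48 core cycles per 16-bit read transfer. A 32-bit load needs one more transfer than a 16-bit load (measured difference: 177 - 129 = 48), and a 64-bit load needs two more than a 32-bit load (expected 96, measured 275 - 177 = 98).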

But I can't answer the following questions, even after searching this forum, the ARM community forums, and ARM application notes. I don't understand what I am doing wrong!

L2 RAM has 0 wait states and its clock (HCLK) is half the core clock (GCLK): the load should take 2 or 3 cycles, right?
L2 FLASH has 3 wait states and its clock (HCLK) is half the core clock (GCLK): the load should take 6 or 7 cycles, right?
EMIF: a single 16-bit EMIF read transfer should take 48 cycles: why are there 81 cycles (129 - 48) of overhead?
EMIF: why do a 16-bit store and a 32-bit store take about the same time (50 and 48 cycles), when a single 16-bit write should take 5 EMIF cycles = 30 core cycles? I thought all accesses to strongly-ordered memory had to complete before the core continued execution.
EMIF: why does the 64-bit store take fewer than 4 * 30 cycles (88 measured)?
SYSTEM module registers: what explains the measured numbers?
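To make the EMIF questions above concrete, the measured values can be compared against a naive model in which every 16-bit transfer costs a fixed number of core cycles (48 for reads, 30 for writes). This is only a sketch of that arithmetic; `residual_cycles` is a hypothetical helper, and the model itself is an assumption, not a documented EMIF timing.

```c
#include <stdint.h>

/* Cycles left over after subtracting the modelled transfer time.
 * Signed on purpose: a negative result means the naive model
 * overestimates the measured time. */
static inline int32_t residual_cycles(int32_t measured, int32_t transfers,
                                      int32_t cycles_per_transfer)
{
    return measured - transfers * cycles_per_transfer;
}
```

For the strongly-ordered reads, the residuals are 129 - 1*48 = 81, 177 - 2*48 = 81 and 275 - 4*48 = 83, which looks like a nearly constant overhead on top of the serial transfers. For the stores, the residuals are 50 - 1*30 = 20, 48 - 2*30 = -12 and 88 - 4*30 = -32, so the same model does not describe the measured store times.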

Any help on this will be highly appreciated!

Thanks,

Gael

over 5 years ago

0 Sunil Oak over 5 years ago

TI__Mastermind 49120 points

Hi Gael,

With the latest silicon revision (revision B), there is a fix for the EMIF issue that forced the external memory to be configured as "strongly-ordered". You can now configure it to be of normal-type or device-type and see a significant performance improvement.

It will take us some time to get the cycle analysis completed for the accesses you measured. I will keep you posted on the progress in getting this data.

Regards,
Sunil

0 Gael Le Moing over 5 years ago in reply to Sunil Oak

Expert 1020 points

Hi Sunil,

Thank you for the info. I was aware of that, and I also took the measurements with the MPU regions for both the SYSTEM peripherals and the EMIF space configured as "Device".

Under those conditions, the measured times are the following:

L2RAM: load 16-bits >> 14 cycles
L2RAM: load 32-bits >> 14 cycles
L2RAM: load 64-bits >> 14 cycles
L2RAM: store 16-bits >> 1 cycle
L2RAM: store 32-bits >> 1 cycle
L2RAM: store 64-bits >> 1 cycle
L2FLASH: load 16-bits >> 24 cycles
L2FLASH: load 32-bits >> 24 cycles
L2FLASH: load 64-bits >> 24 cycles
SYS: load 16-bits >> 29 cycles
SYS: load 32-bits >> 29 cycles
SYS: write 16-bits >> 1 cycle
SYS: write 32-bits >> 1 cycle
EMIF: load 16-bits >> 125 cycles
EMIF: load 32-bits >> 173 cycles
EMIF: load 64-bits >> 271 cycles
EMIF: store 16-bits >> 1 cycle
EMIF: store 32-bits >> 1 cycle
EMIF: store 64-bits >> 1 cycle

In device mode, is the order of accesses preserved within an MPU region? By that I mean: if the core performs a write to 0x60000000, it will take 1 cycle, but if it is followed by a read from 0x60000004, will the core be stalled until the previous write has completed?

I also found on the forum that EMIF access has an "internal delay of 12 VCLK" to start a transfer (assuming that "VCLK" refers to VCLK3) but it was not for TMS570LC4357 specifically. Is that delay applicable to EMIF in the TMS570LC4357?

Is there a similar delay for reads from the L2RAM, the L2FLASH, or the peripheral registers under the Peripheral Interconnect Subsystem?

I hope you can get me some answers.

Thanks,

Gael

0 Gael Le Moing over 5 years ago in reply to Gael Le Moing

Expert 1020 points

Hi Sunil,

Any news about the memories access times?
Thanks
Gael

0 Sunil Oak over 5 years ago in reply to Gael Le Moing

TI__Mastermind 49120 points

Hi Gael,

I do not have this data yet. I will check on the progress and get back to you.

Regards,
Sunil

0 Gael Le Moing over 5 years ago in reply to Sunil Oak

Expert 1020 points

Hi Sunil,

Any update on the progress?

Best regards,
Gael

0 Gael Le Moing over 5 years ago in reply to Gael Le Moing

Expert 1020 points

Hi Sunil,

I hope things are going well. When do you think you will be able to provide some answers?

Thanks,
Gael

0 Sunil Oak over 5 years ago in reply to Gael Le Moing

TI__Mastermind 49120 points

Hi Gael,

Unfortunately I have not been able to find any bandwidth or throughput information for the EMIF. The data may have to be retaken on our end, depending on how critical it is for you.

Trying to answer some other questions you asked:

In device mode, is the order of accesses preserved within an MPU region? By that I mean: if the core performs a write to 0x60000000, it will take 1 cycle, but if it is followed by a read from 0x60000004, will the core be stalled until the previous write has completed?

>> Yes, in this case the CPU will wait for the write to finish first.

I also found on the forum that EMIF access has an "internal delay of 12 VCLK" to start a transfer (assuming that "VCLK" refers to VCLK3) but it was not for TMS570LC4357 specifically. Is that delay applicable to EMIF in the TMS570LC4357?

>> Yes, this delay is applicable to the LC4357 as well.

Is there a similar delay for reads from the L2RAM, the L2FLASH, or the peripheral registers under the Peripheral Interconnect Subsystem?

>> Yes, it takes 30 CPU cycles to read from L2RAM/L2FLASH or from any of the peripheral registers on the LC4357.
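The 12-VCLK EMIF start delay confirmed above can be expressed in core cycles and compared with the measured EMIF reads. This is only a sketch of the arithmetic: it assumes, as the question does, that "VCLK" means VCLK3 at 50 MHz, and it reuses the 8-EMIF-cycle read timing and 16-bit bus from the first post. The helper names are hypothetical, and nothing here is a confirmed breakdown from TI.

```c
#include <stdint.h>

/* Clock values taken from the first post; VCLK3 as the delay clock is an assumption. */
enum { GCLK_HZ = 300000000, VCLK3_HZ = 50000000 };

/* Start delay in core cycles for a delay given in VCLK3 cycles. */
static inline int32_t emif_start_delay_core_cycles(int32_t vclk3_cycles)
{
    return vclk3_cycles * (GCLK_HZ / VCLK3_HZ);
}

/* Measured read time minus the modelled 16-bit transfers (8 EMIF cycles
 * each) and minus the 12-VCLK3 start delay. */
static inline int32_t unexplained_read_cycles(int32_t measured, int32_t transfers)
{
    const int32_t per_transfer = 8 * (GCLK_HZ / VCLK3_HZ);   /* 48 core cycles */
    return measured - transfers * per_transfer
                    - emif_start_delay_core_cycles(12);
}
```

Under these assumptions the start delay is 72 core cycles. For the 16-bit load, that leaves 129 - 48 - 72 = 9 cycles unexplained in the strongly-ordered case and 125 - 48 - 72 = 5 cycles in the device case; for the 64-bit load the figures are 11 and 7.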

0 Gael Le Moing over 5 years ago in reply to Sunil Oak

Expert 1020 points

Thanks for these answers.

For the last one, could you detail the 30 CPU cycles for a read? How could this number be the same for flash, RAM and peripherals if the clocks and wait states for each one (flash, ram, peripherals) are not the same?

Moreover, the measurements I provided to you show less than 30 cycles for each access type (ram: 14 cycles, flash: 24 cycles, SYS module: 29 cycles). How can this be explained?

Thanks,
Gael

0 Gael Le Moing over 5 years ago in reply to Gael Le Moing

Expert 1020 points

Hi Sunil,

Can you provide an update on the progress of your investigations?

Thanks,
Gael

0 Gael Le Moing over 5 years ago in reply to Gael Le Moing

Expert 1020 points

Hi,

Any news on this subject?

Thanks,
Gael

0 Gael Le Moing over 5 years ago in reply to Gael Le Moing

Expert 1020 points

Hi,
Any idea when you will be able to give an answer for this question?
Regards,
Gael

0 Sunil Oak over 5 years ago in reply to Gael Le Moing

TI__Mastermind 49120 points

Hi Gael,

Getting these internal timings is hard as it requires dedicated time from the design team to reproduce the accesses and provide the data. I will get back to you with an estimated time for the data to be available. Sorry for the long delay in getting these timings to you.

Regards,
Sunil

0 Gael Le Moing over 5 years ago in reply to Sunil Oak

Expert 1020 points

OK, thank you. I understand this could take a long time, but I would like to be sure I have not missed anything in the configuration of the different memory accesses (where applicable) that could affect system performance.
