Hi,
I'm performing some measurements at instruction level, using the PMU in order to validate the expected results of EMIF accesses to an external FPGA.
I also extended those tests to measure loads and stores to different kinds of memory: L2_Flash, L2_RAM, EMIF, and SYSTEM module registers.
Given the different clocks in the system, I am trying to understand the timing of a single load or a single store instruction based on the Cortex-R5 documentation (Appendix B), but my measurements do not match the timings I expect.
When performing the measurements, the system is in the following state:
- the PMU cycle counter is configured to count each cycle
- GCLK = 300 MHz (Cortex-R5 clock)
- HCLK = 150 MHz (Level 2 memories clock)
- VCLK = 75 MHz (clock for System modules under PCR1)
- VCLK3 = 50 MHz (EMIF clock)
- Flash wait states are configured according to the TMS570LC4357 datasheet (i.e. 3)
- EMIF configuration to access the FPGA: 16-bit data bus, read setup/strobe/hold cycles combined = 8 EMIF Clock cycles; write setup/strobe/hold cycles combined = 5 EMIF clock cycles
- caches are disabled
- MPU is enabled: the EMIF and SYSTEM peripheral regions are strongly ordered memory; RAM and flash are normal cacheable memory
- all cycle counts given below have had the cost of reading the PMU cycle count register subtracted (6 cycles; this measured overhead agrees with the Cortex-R5 documentation)
- the measurement covers only one instruction (ldr or str)
- all addresses used for loads and stores are 64-bit aligned
- all Cortex-R5 performance features (for example, dual issue) are left in their reset state (enabled)
Addresses used during tests:
- L2RAM: 0x08000000
- L2FLASH: 0x00044000
- SYSTEM module: 0xFFFFFF00
- EMIF: 0x60000000
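For reference, the overhead correction described above can be sketched roughly as follows. This is only an illustration, not my exact test code: the function names `read_cycle_counter`, `net_cycles` and `measure_load32` are hypothetical, the PMU is assumed to be already enabled with the cycle counter running, and any barriers around the measured instruction are left out. The non-ARM branch is a stand-in so the sketch also compiles off-target.

```c
#include <stdint.h>

/* Cost of reading the PMU cycle counter, as measured in this post. */
#define PMU_READ_OVERHEAD_CYCLES 6u

#if defined(__arm__)
/* Read PMCCNTR (CP15 c9, c13, 0). Assumes privileged mode and a PMU
 * whose cycle counter has already been enabled. */
static inline uint32_t read_cycle_counter(void)
{
    uint32_t value;
    __asm__ volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r"(value));
    return value;
}
#else
/* Off-target stand-in (assumption for illustration only):
 * a fake counter that advances by 20 on every read. */
static uint32_t fake_counter;
static inline uint32_t read_cycle_counter(void)
{
    fake_counter += 20u;
    return fake_counter;
}
#endif

/* Cycles between two counter reads, minus the read overhead.
 * Unsigned subtraction keeps the result valid across a counter wrap. */
static inline uint32_t net_cycles(uint32_t start, uint32_t end)
{
    return (end - start) - PMU_READ_OVERHEAD_CYCLES;
}

/* Time one 32-bit load from the given address. */
static inline uint32_t measure_load32(const volatile uint32_t *addr)
{
    uint32_t start = read_cycle_counter();
    uint32_t sink = *addr;          /* the single instruction under test */
    uint32_t end = read_cycle_counter();
    (void)sink;
    return net_cycles(start, end);
}
```

On the target, `measure_load32((const volatile uint32_t *)0x08000000)` would correspond to the "L2RAM: load 32-bits" line in the results below.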
Here are the results of the measurements:
- L2RAM: load 16-bits >> 14 cycles
- L2RAM: load 32-bits >> 14 cycles
- L2RAM: load 64-bits >> 14 cycles
- L2RAM: store 16-bits >> 1 cycle
- L2RAM: store 32-bits >> 1 cycle
- L2RAM: store 64-bits >> 1 cycle
- L2FLASH: load 16-bits >> 24 cycles
- L2FLASH: load 32-bits >> 24 cycles
- L2FLASH: load 64-bits >> 24 cycles
- SYS: load 16-bits >> 31 cycles
- SYS: load 32-bits >> 31 cycles
- SYS: store 16-bits >> 27 cycles
- SYS: store 32-bits >> 27 cycles
- EMIF: load 16-bits >> 129 cycles
- EMIF: load 32-bits >> 177 cycles
- EMIF: load 64-bits >> 275 cycles
- EMIF: store 16-bits >> 50 cycles
- EMIF: store 32-bits >> 48 cycles
- EMIF: store 64-bits >> 88 cycles
Some observations from these measurements:
- For EMIF accesses, one 16-bit read transfer should take 48 core cycles (8 cycles at 50 MHz >> 48 cycles at 300 MHz). The difference between a 16-bit load and a 32-bit load is exactly 48 cycles (177 - 129), and the difference between a 32-bit load and a 64-bit load is 98 cycles (275 - 177), which is close to 2 * 48.
- Chapter B.11 of the Cortex-R5 TRM indicates that a store to internal RAM takes one cycle, and this is what I measure.
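The arithmetic behind the first observation can be written out as a small sketch. The macro and function names here are made up for illustration; the clock values, EMIF cycle counts and measured numbers are the ones listed in this post.

```c
#include <stdint.h>

#define GCLK_MHZ          300u  /* Cortex-R5 core clock */
#define VCLK3_MHZ          50u  /* EMIF clock */
#define EMIF_READ_CYCLES    8u  /* read setup + strobe + hold */
#define EMIF_WRITE_CYCLES   5u  /* write setup + strobe + hold */

/* Convert a duration in EMIF clock cycles to core clock cycles. */
static inline uint32_t emif_to_core_cycles(uint32_t emif_cycles)
{
    return emif_cycles * (GCLK_MHZ / VCLK3_MHZ);
}

/* Measured EMIF load costs for 16-, 32- and 64-bit loads. */
static const uint32_t emif_load_cycles[3] = { 129u, 177u, 275u };

/* Extra cycles when going from width index (i - 1) to width index i. */
static inline uint32_t load_step(unsigned i)
{
    return emif_load_cycles[i] - emif_load_cycles[i - 1u];
}
```

With these numbers, `emif_to_core_cycles(EMIF_READ_CYCLES)` is 48 and `emif_to_core_cycles(EMIF_WRITE_CYCLES)` is 30, while `load_step(1)` is 48 (one extra 16-bit transfer) and `load_step(2)` is 98 (two extra transfers, 2 cycles more than 2 * 48).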
But I can't answer the following questions, even after searching this forum, the ARM community forums, and the ARM application notes. I don't understand what I am doing wrong!
- L2 RAM has 0 wait states and its clock (HCLK) is half the core clock (GCLK): a load should take 2 or 3 cycles, right? Why do I measure 14?
- L2 FLASH has 3 wait states and its clock (HCLK) is half the core clock (GCLK): a load should take 6 or 7 cycles, right? Why do I measure 24?
- EMIF: a single 16-bit EMIF read transfer should take 48 cycles: why are there 81 cycles (129 - 48) of overhead?
- EMIF: why do a 16-bit store and a 32-bit store take about the same time (50 and 48 cycles), when each transfer should take 5 EMIF cycles >> 30 core cycles? I thought every access to strongly ordered memory had to complete before the core continued execution.
- EMIF: why does a 64-bit store take less than 4 * 30 cycles (88 measured)?
- SYSTEM module registers: where do these numbers (31 cycles for a load, 27 for a store) come from?
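To make the EMIF questions concrete, here is a small sketch of the naive model I am comparing against: each access is split into 16-bit bus transfers, each costing the fixed number of core cycles computed earlier (48 per read transfer, 30 per write transfer). The function names are hypothetical and the model itself is only my assumption about how the timing should add up.

```c
#include <stdint.h>

#define CORE_CYCLES_PER_EMIF_READ   48  /* 8 EMIF cycles * (300 MHz / 50 MHz) */
#define CORE_CYCLES_PER_EMIF_WRITE  30  /* 5 EMIF cycles * (300 MHz / 50 MHz) */

/* Number of transfers on the 16-bit EMIF data bus for one access. */
static inline int32_t emif_transfers(int32_t access_bits)
{
    return access_bits / 16;
}

/* Measured cycles minus the naive expectation. Positive means
 * unexplained overhead; negative means faster than the naive model. */
static inline int32_t emif_residual(int32_t measured_cycles,
                                    int32_t access_bits,
                                    int32_t cycles_per_transfer)
{
    return measured_cycles - emif_transfers(access_bits) * cycles_per_transfer;
}
```

For loads the residual is almost constant: 81, 81 and 83 cycles for 16-, 32- and 64-bit accesses. For stores it is 20, -12 and -32 cycles, so the 32-bit and 64-bit stores come out faster than the naive model allows, which is exactly what I can't explain.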
Any help on this will be highly appreciated!
Thanks,
Gael