OMAP3530 memory bus clock verification

Other Parts Discussed in Thread: OMAP3530

Hi,

I am trying to verify the correct setting of the bus clock of an OMAP3530 using simple assembler functions with known cycle counts.

The idea is to turn off the D-cache and run bus and core at the same clock speed.
In that case a 'load multiple' from internal memory is expected to have a cycle count very close to that of the D-cache-on case.

(For 'load multiple', every bus clock cycle should provide one data fetch.)
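
(For reference: turning the D-cache off on the Cortex-A8 comes down to clearing the C bit in the CP15 control register SCTLR. A minimal sketch, with the cache clean/invalidate that has to go with it left out:)

Code:
--------------->
    mrc     p15, 0, r0, c1, c0, 0   ; read SCTLR
    bic     r0, r0, #4              ; clear bit 2 (C = D-cache enable)
    mcr     p15, 0, r0, c1, c0, 0   ; write SCTLR back
    dsb                             ; make sure the change has taken effect
    isb
<-----------------------------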

In addition, chapter 11.3.3.2 of the OMAP3530 Technical Reference Manual (spruf98b) states: "The device-embedded RAM [...] Operates at full L3 interconnect clock frequency"

See the cache measurements below.
p1 seems fine in all cases.
When the data cache is turned off, the profiling numbers exceed the expected values by far!
In p2 the profiling numbers for internal memory increase by a factor of 52(!).
In p3 the profiling numbers increase further, even though only the processor clock is raised.

How can the effects seen on the hardware be explained? What can drive the required core/bus cycles up so high?



p1: D-Cache ON (CPU=250MHz, DPLL3=250MHz)
Function                            cycles      cycles/10^6
testLoop                             2000471     2.00
testMem_SRAM (internal)             10000649    10.00
testMem2_SRAM (internal)             7000484     7.00
testMem_SDRAM (external)            10000471    10.00
testMem2_SDRAM (external)            7000306     7.00

p2: D-Cache OFF (CPU=250MHz, DPLL3=250MHz)
Function                            cycles      cycles/10^6
testLoop                             2001381     2.00
testMem_SRAM (internal)            520204489   520.20
testMem2_SRAM (internal)           272099033   272.10
testMem_SDRAM (external)           685616405   685.62
testMem2_SDRAM (external)          347525487   347.53

p3: D-Cache OFF (CPU=500MHz, DPLL3=250MHz)
Function                            cycles      cycles/10^6
testLoop                             2001249     2.00
testMem_SRAM (internal)            760001521   760.00
testMem2_SRAM (internal)           402406243   402.41
testMem_SDRAM (external)          1087827357  1087.83
testMem2_SDRAM (external)          544892391   544.89

p4: D-Cache OFF (CPU=500MHz, DPLL3=332MHz)
Function                            cycles      cycles/10^6
testLoop                             2001727     2.00
testMem_SRAM (internal)            654763809   654.76
testMem2_SRAM (internal)           341618941   341.62
testMem_SDRAM (external)           897161839   897.16
testMem2_SDRAM (external)          448581711   448.58


Code:
--------------->
testLoop   
        ;;  r0 is loop-counter
    subs    r0, r0, #1 ; reduce loop counter
    bne     testLoop
[...]

testMem
        ;; r0 is loop-counter
        ;; r1 is address to read from
testMemLoop
    ldr     r2, [r1]        ; execute 10 single loads
    ldr     r2, [r1, #4]
    ldr     r2, [r1, #8]
    ldr     r2, [r1, #12]
    ldr     r2, [r1, #16]
    ldr     r2, [r1, #20]
    ldr     r2, [r1, #24]
    ldr     r2, [r1, #28]
    ldr     r2, [r1, #32]
    ldr     r2, [r1, #36]
    subs    r0, r0, #1     ; reduce loop counter
    bne     testMemLoop
[...]

testMem2
        ;; r0 is loop-counter
        ;; r1 is address to read from
[...]
testMem2loop
    ldmia   r1, {r2-r11}     ; load multiple (10 registers)
    subs    r0, r0, #1       ; reduce loop counter
    bne     testMem2loop
[...]
<-----------------------------
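
(In case someone wants to reproduce the numbers: one way to take such cycle counts on the Cortex-A8 is the CP15 cycle counter PMCCNTR, read before and after each test function. A minimal sketch; the exact PMCR bit assignments should be double-checked against the ARM Architecture Reference Manual:)

Code:
--------------->
        ;; enable the performance monitors, reset and start the cycle counter
    mrc     p15, 0, r0, c9, c12, 0   ; read PMCR
    orr     r0, r0, #7               ; E=1 (enable), P=1/C=1 (reset counters)
    mcr     p15, 0, r0, c9, c12, 0   ; write PMCR
    mov     r0, #0x80000000
    mcr     p15, 0, r0, c9, c12, 1   ; PMCNTENSET: enable the cycle counter

        ;; read the cycle counter (before and after the function under test)
    mrc     p15, 0, r0, c9, c13, 0   ; r0 = PMCCNTR
<-----------------------------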

thanks!

jhoff

  • I don't have any comments right now about the numbers you have posted.

    I was curious to understand from you what your expectations were.  You mentioned the numbers blew through your expectations, but you didn't indicate what those were.
    It would be beneficial to understand that as well.  Thank you.

  • Hi Brandon,

    For testMem (10 single loads) I expected 20*10^6 cycles: for every load operation I expected one instruction fetch and one data fetch.

    For testMem2 ('load multiple' into 10 registers) I originally expected 11*10^6 cycles: one instruction fetch plus 10 data fetches, with one fetch per bus/CPU tick.

     

    I actually forgot to mention that I execute every loop 10^6 times.
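
    In numbers, the expectation was:
    testMem:  10^6 iterations * 10 loads * (1 instruction fetch + 1 data fetch) = 20*10^6 cycles
    testMem2: 10^6 iterations * (1 instruction fetch + 10 data fetches)         = 11*10^6 cycles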

     

    Anyone any comments/ideas?

    thanks!

     

    jhoff

  • I believe the root of this stems from the Cortex-A8 internal memory architecture, so your answer is likely in the ARM documentation as opposed to the OMAP3 TRM. I suspect this is just a byproduct of the device being designed entirely around having the internal L1 cache enabled: there are significant latencies for accessing the L2 cache (minimum 8 cycles), let alone going across the L3 interconnect to another on-chip memory like the OCM_RAM. I am not sure how much latency, if any, is mitigated by disabling the caching, but it seems at least that you have proved one should keep the data cache enabled to maintain reasonable performance.

    If you want to measure the speed of the OCM_RAM, you may be able to do so using DMA transfers, though I am not sure how this would help if you are going to depend on the CPU accessing the data anyway. What is your end goal in this testing? Do you have concerns about the OMAP3 meeting your intended system's bandwidth requirements?

  • Hi Bernie,

    thanks for your reply!

    Let me explain what the goal behind this exercise is.

    Our software exchanges a relatively large amount of data with memory, so the CPU/bus clock (next to the cache sizes) has a relatively big effect on how fast our code runs in the end.

    So I'm setting the bus clock ... and I wanted to verify that this is correct (that the bus clock actually ticks at the speed I think it does ...).

    How do you determine if the bus-clock is correct, if you can't measure it directly?

    That's why I tried to verify the bus clock in relation to the CPU clock using the assembler routines above, which led me to the results I could not really explain.

     

    jhoff

  • jhoff said:
    How do you determine if the bus-clock is correct, if you can't measure it directly?

    Unfortunately I do not know of a way to directly measure the bus clock. You can calculate what it should be based on your input clock and the register values within the device, which is what I would typically suggest, though this is not truly proof of the speed. You could run benchmark code like you were trying to do while changing the bus clock, to see whether your clock changes actually affect performance, but this would not give a real measurement of the clock, only an indication that the change took effect.
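
    For DPLL3 that calculation is roughly CORE_CLK = SYS_CLK * CORE_DPLL_MULT / (CORE_DPLL_DIV + 1) / CORE_DPLL_CLKOUT_DIV, with those fields taken from CM_CLKSEL1_PLL, and the L3 interconnect clock is then CORE_CLK divided by the CLKSEL_L3 field of CM_CLKSEL_CORE. Please verify the exact formula and field positions against the PRCM chapter of the TRM, I am quoting this from memory.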

    If you use the Linux drivers to modify the system frequency, you can measure the voltage the part is running at to determine what the clock frequencies should be, since the driver adjusts the clock and voltage levels together through a series of operating points (OPPs), which are discussed in chapter 10 of the PSP user's guide.

  • Hi Bernie,

    thanks for putting more thought into that.

     

    I calculated the bus clock based on the input clock and the register settings (that's how the numbers in my first post came to life). And I was able to see the cycle counts of the code above change when I change the clock speed.

    You can even see this dependency in my very first post: between p3 and p4 I only change the DPLL3 setting, and the cycle counts change accordingly (e.g. testMem_SDRAM drops from about 1088*10^6 to about 897*10^6 cycles).

    Therefore I started trusting my settings :-)

    Yet the large "offset" relative to the expected values still puzzles me.

     

    I don't have Linux installed at this point in time, since bare-metal is better for my profiling purposes.

     

    jhoff