PROCESSOR-SDK-AM335X: L1 cache performance comparison

Part Number: PROCESSOR-SDK-AM335X


I am comparing the performance of am335x (on a Beaglebone Black) vs am437x (on a MYIR Rico board).

Using u-boot on both platforms, I run the identical program (pseudo-program listed):

while (1) {
  setGPIO();
  clearGPIO();
  for (ii = 0; ii < 50000; ++ii);
}

making sure the "for" loop is not optimized away and the same assembly code is generated for both platforms.
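One common way to keep such a delay loop from being optimized away is a volatile loop counter. The sketch below is only an illustration of that idea, not the code actually used in this test: `set_gpio()`, `clear_gpio()`, and `pulse_and_spin()` are hypothetical names, and the GPIO functions are stubs standing in for the real register writes. A volatile counter is loaded and stored on every iteration, so the loop timing then also depends on L1 data cache behavior.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the real GPIO register writes; on hardware
 * these would write the SoC's GPIO set/clear registers. Here they only
 * count calls so the sketch can run on any host. */
static unsigned gpio_toggles;
static void set_gpio(void)   { gpio_toggles++; }
static void clear_gpio(void) { gpio_toggles++; }

/* One pass of the measurement: a GPIO pulse followed by a busy loop.
 * The volatile counter forces a load and store on every iteration, so
 * the compiler cannot delete the loop. Returns the final counter value. */
static uint32_t pulse_and_spin(uint32_t iterations)
{
    volatile uint32_t ii;

    set_gpio();
    clear_gpio();
    for (ii = 0; ii < iterations; ++ii)
        ;
    return ii;
}
```

Calling `pulse_and_spin(50000)` inside `while (1)` reproduces the shape of the pseudo-program above. Comparing the generated assembly on both platforms, as described, is still the reliable check.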

The am335x runs about 2.5 times faster than the 437x.  Is that an expected result?

Additional info:

- the GPIO spikes are used to measure timing with an oscilloscope.

- the GPIO timings without the "for" loop are about 4-8 nanoseconds slower on the 437x (i.e. barely any difference in timing)

- when the "for" loop is replaced with a giant function greater than 256k in size so as to force the processors to access DRAM, the 437x gradually begins to outperform the 335x because of its wider memory path.

But the big question remains, why is the 335x outperforming the 437x when running from cache?

- Chuan Neng Lee

  Precise Automation, LLC

  • What are the MPU frequencies? Have you tried running this from kernel user space? You can find Linux performance benchmarks here: processors.wiki.ti.com/.../Processor_SDK_Linux_Kernel_Performance_Guide
  • Also, can you enable some bits in the L2 Cache Pre-fetch Control Register in am437x? Use this code:

            unsigned int val;
            val = readl(0x48242f60);
            val |= 0x50000000;
            omap_smc1(0x113, val);

    I added it in board.c and saw better performance in AM437x.

    Steve K.
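For readers wondering what 0x50000000 selects: assuming the register at 0x48242f60 is the Prefetch Control Register of an ARM L2C-310 (PL310) cache controller, bits 30 and 28 are the double linefill and data prefetch enables. The sketch below only names those bits; the macro and function names are made up for illustration and do not come from u-boot.

```c
#include <stdint.h>

/* L2C-310 Prefetch Control Register bits (illustrative names). */
#define L2_PREFETCH_DATA_EN      (1u << 28) /* data prefetch enable */
#define L2_PREFETCH_INSTR_EN     (1u << 29) /* instruction prefetch enable */
#define L2_DOUBLE_LINEFILL_EN    (1u << 30) /* double linefill enable */

/* Return the register value with the same two bits set as in the
 * snippet above (val |= 0x50000000). The caller still has to write the
 * result back through the secure monitor call. */
static uint32_t enable_l2_prefetch_bits(uint32_t val)
{
    return val | L2_DOUBLE_LINEFILL_EN | L2_PREFETCH_DATA_EN;
}
```

Bit 29 (instruction prefetch) is left untouched here because the posted value does not set it.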

  • Thank you for your kind response.

    I am running this from u-boot, so there is no OS involved.  I've been using a scope to measure the timings because I was not completely convinced about the frequency settings.  Now, I haven't been able to figure out a way to directly or indirectly measure the current MPU frequency setting.  Can you give me a suggestion?

    Thank you!

  • Hi Steve,

    Unfortunately, because I am running with u-boot rather than Linux, I don't have omap_smc1(). However, I understand your intention, i.e. to set the control register via the smc exception, so I'll do the equivalent in u-boot. Thank you!
  • The code I mention is in u-boot. There are several omap_smc macros. I think I used omap_smc1 from omap-common/lowlevel_init.S.

    Steve K.
  • Steve,

    Thank you for your hint. I managed to enable the L2 cache in board.c with the equivalent of:
    mov r12, #0x102
    mov r0, #1
    smc #0
    and the 437x performance improved by 2.5x for medium code size (32k < code-size < 256k).

    However, for code size < 32k, the 335x continues to outperform the 437x.

    Steve and Biser,
    Perhaps the new questions are:
    - how do I verify the MPU speed setting, either by direct measurement or by inspecting a register?
    - is it possible that the L1 cache is disabled while the L2 cache is enabled? What register should I check for that possibility?

    Thank you!

    - Chuan Neng Lee

    Update: For good measure, to explicitly enable L1 cache (just in case), I added to board.c the equivalent of

    mov r12, #0x116
    mov r0, #1
    smc #0

    but there was no change in performance for code size < 32k
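On the question of which register shows whether L1 is enabled: on ARMv7-A cores the System Control Register (SCTLR, CP15 c1) has the I bit (bit 12) for the instruction cache and the C bit (bit 2) for the data cache, with the M bit (bit 0) enabling the MMU. The sketch below decodes a SCTLR value; the helper names are illustrative, and the read itself is compiled only for 32-bit ARM targets. It assumes the code runs in a privileged mode, as u-boot normally does.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data/unified cache enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

#if defined(__arm__)
/* Read SCTLR; only valid in a privileged mode. */
static inline uint32_t read_sctlr(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(v));
    return v;
}
#endif

static bool l1_icache_enabled(uint32_t sctlr)
{
    return (sctlr & SCTLR_I) != 0;
}

static bool l1_dcache_enabled(uint32_t sctlr)
{
    return (sctlr & SCTLR_C) != 0;
}

/* With the MMU off, data accesses are not cached even if C is set,
 * so this checks both bits. */
static bool l1_dcache_effective(uint32_t sctlr)
{
    return (sctlr & (SCTLR_C | SCTLR_M)) == (SCTLR_C | SCTLR_M);
}
```

On the target, something like `printf("SCTLR=%08x\n", read_sctlr());` in board.c would show the bits directly. Data caching also depends on the page table attributes of the memory being accessed, which this sketch does not inspect.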

  • Just to close this issue, I found that the u-boot shipped with the 437x board set the following values:

    PRCM_CM_CLKSEL_DPLL_CORE = 0x3e817 (i.e. apparently 1024 MHz)
    PRCM_CM_CLKSEL_DPLL_MPU = 0x25817 (i.e. apparently 629 MHz)

    (thanks to Biser for the hint as to what to look for).

    With both registers set for 1024 MHz, the 437x is now "only" 37% slower than the 335x for integer operations running out of L1 cache. I believe this is not too far off the benchmarks pointed to by Biser.

    Since both Biser and Steve pointed to different problems with the u-boot config I'll attempt to mark both answers as resolving my issue. Thanks!
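  • Actually, the divider value in the register is 1 less than the real divider: the low field 0x17 is 23, so the real divider is 24. The multipliers are 0x3e8 = 1000 and 0x258 = 600. So for the CORE you have
    1000 * 24 MHz / 24 = 1000 MHz

    Similarly for the MPU you have
    600 * 24 MHz / 24 = 600 MHz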

    Steve K.
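The arithmetic above can be sketched as a small decoder. It assumes the CM_CLKSEL_DPLL_* layout implied by the values in this thread (multiplier in bits 18:8, divider minus one in bits 6:0) and a 24 MHz reference clock; the function names are illustrative. Any post-dividers downstream of the DPLL are ignored, so this gives the DPLL lock frequency rather than necessarily the final clock.

```c
#include <stdint.h>

/* Multiplier field M: bits 18:8 of the CLKSEL register. */
static uint32_t dpll_mult(uint32_t clksel)
{
    return (clksel >> 8) & 0x7ffu;
}

/* Divider field N: bits 6:0; the real divider is N + 1. */
static uint32_t dpll_div_field(uint32_t clksel)
{
    return clksel & 0x7fu;
}

/* DPLL lock frequency in MHz for a given reference clock in MHz. */
static uint32_t dpll_lock_mhz(uint32_t clksel, uint32_t ref_mhz)
{
    return ref_mhz * dpll_mult(clksel) / (dpll_div_field(clksel) + 1u);
}
```

With a 24 MHz reference, `dpll_lock_mhz(0x3e817, 24)` gives 1000 and `dpll_lock_mhz(0x25817, 24)` gives 600, matching the figures above.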