AM5728: Simple addition runs way too slow (MPU speed)

J. Sch.

Part Number: AM5728

Tool/software:

Hi everyone,

I use a Beagle-Board X15 with an AM5728 SoC and the "Sciopta" RTOS (so no Linux). Only Core0 and DSP0 are active. My firmware runs in general, but some calculations (in my case converting data into "Flatbuffer" structures) seem extremely slow.

The oscillator runs at 20 MHz, and the DPLL_MPU dividers are set for OPP_NOM (M=500, N=9, M2=1). So the MPU should be running at 1000 MHz (i.e. one CPU cycle = 1 ns).
Compiler: GNU v7.3.1 (FSF), IDE: CodeComposerStudio 12.8.0 with BlackHawk XDSv560v2 Debugger.
Caching should be active, the memory map (see idkAM572x.ld) looks like
dram0     (rw) : org = 0x80000000,    len = 512M
rom0      (rwx): org = 0xA0000000,    len = 16M
no_cache0 (rw) : org = 0xA1000000,    len = 240M
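
As a sanity check on the 1000 MHz figure, the DPLL_MPU settings above can be plugged into the lock-mode relation CLKOUT = CLKINP * M / ((N + 1) * M2). The following is only a minimal sketch: the relation is assumed here and should be verified against the DPLL chapter of the TRM, and `dpll_clkout_hz` is a hypothetical helper name, not part of any TI or Sciopta API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Output frequency of a DPLL in lock mode.
 * Assumed relation (check the TRM): CLKOUT = CLKINP * M / ((N + 1) * M2)
 *   clkinp_hz : reference clock in Hz (20 MHz oscillator in this thread)
 *   m, n, m2  : multiplier and divider values as programmed
 */
static uint64_t dpll_clkout_hz(uint64_t clkinp_hz, uint32_t m, uint32_t n, uint32_t m2)
{
    assert(m2 != 0u); /* a post-divider of 0 would divide by zero */
    return (clkinp_hz * m) / (((uint64_t)n + 1u) * m2);
}
```

With the values from this post, `dpll_clkout_hz(20000000u, 500u, 9u, 1u)` returns 1000000000, i.e. 1000 MHz, which matches the expectation of 1 ns per CPU cycle.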

I inserted the following test code right at the beginning ("start_hook") of my RTOS ("Sciopta"). No interrupts are active at that early stage.

    GPIOPinWrite(SOC_GPIO4_BASE, 17, GPIO_PIN_HIGH);
    uint32_t i;
    uint32_t dummy = 0;
    uint32_t erg = 0;
    for (i = 0; i < 1000; i++) {
        erg = dummy + 2;
    }
    GPIOPinWrite(SOC_GPIO4_BASE, 17, GPIO_PIN_LOW);

Disassembly:
    491            GPIOPinWrite(SOC_GPIO4_BASE, 17, GPIO_PIN_HIGH);
    a000a230:   E3A02001            mov        r2, #1
    a000a234:   E3A01011            mov        r1, #0x11
    a000a238:   E3A00A09            mov        r0, #0x9000
    a000a23c:   E3440805            movt       r0, #0x4805
    a000a240:   EB03D0AF            bl         GPIOPinWrite
    493            uint32_t dummy = 0;
    a000a244:   E3A03000            mov        r3, #0
    a000a248:   E50B3010            str        r3, [r11, #-0x10]
    494            uint32_t erg = 0;
    a000a24c:   E3A03000            mov        r3, #0
    a000a250:   E50B3014            str        r3, [r11, #-0x14]
    495            for(i = 0; i < 1000; i++) {
    a000a254:   E3A03000            mov        r3, #0
    a000a258:   E50B3008            str        r3, [r11, #-8]
    a000a25c:   EA000005            b          #0xa000a278
    496               erg = dummy + 2;
    a000a260:   E51B3010            ldr        r3, [r11, #-0x10]
    a000a264:   E2833002            add        r3, r3, #2
    a000a268:   E50B3014            str        r3, [r11, #-0x14]
    495            for(i = 0; i < 1000; i++) {
    a000a26c:   E51B3008            ldr        r3, [r11, #-8]
    a000a270:   E2833001            add        r3, r3, #1
    a000a274:   E50B3008            str        r3, [r11, #-8]
    a000a278:   E51B3008            ldr        r3, [r11, #-8]
    a000a27c:   E3530FFA            cmp        r3, #0x3e8
    a000a280:   3AFFFFF6            blo        #0xa000a260
    498            GPIOPinWrite(SOC_GPIO4_BASE, 17, GPIO_PIN_LOW);
    a000a284:   E3A02000            mov        r2, #0
    a000a288:   E3A01011            mov        r1, #0x11
    a000a28c:   E3A00A09            mov        r0, #0x9000
    a000a290:   E3440805            movt       r0, #0x4805
    a000a294:   EB03D09A            bl         GPIOPinWrite

Content of register R11: 0x800363D4 (located in section "dram", see linker map, above).

This test code takes 277 µs (measured with an oscilloscope on GPIO pin 4.17), which means that one iteration takes 277 ns = 277 CPU cycles.
To my eyes that seems way too slow for a simple addition!

Notes:

Making the variables "volatile" didn't change anything.
How can I verify the actual speed of the bus between MPU and DDR3 RAM? (DDR3 RAM is "Kingston D2516EC4BXGGB" with word write speed of 1066 Mb/s -> 30ns per word.)
The MPU does not seem to be in sleep mode: in register CM_MPU_MPU_CLKCTRL, the bits STBYST and IDLEST are both "0".
In register CM_MPU_CLKSTCTRL, bit CLKACTIVITY_MPU_GCLK = 1, so the MPU clock is running.
Bit CLKTRCTRL = 2, i.e. SW_WKUP: "Start a software forced wake-up transition on the domain"
MMU seems to be off (not selected in the Sciopta RTOS config tool)
How can I check whether there are Cache misses?
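
One way to verify the notes about MMU and caching directly on the core is to read the ARMv7-A System Control Register (SCTLR) and look at its M, C and I enable bits. The following is a minimal sketch, not Sciopta or TI code: `sctlr_read` and `sctlr_decode` are hypothetical helper names, the read must run in a privileged mode, and it only compiles on a 32-bit ARM target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural enable bits in the ARMv7-A SCTLR. */
#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data and unified cache enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

typedef struct {
    bool mmu_on;
    bool dcache_on;
    bool icache_on;
} sctlr_state_t;

/* Decode a raw SCTLR value into the three enable flags. */
static sctlr_state_t sctlr_decode(uint32_t sctlr)
{
    sctlr_state_t s;
    s.mmu_on    = (sctlr & SCTLR_M) != 0u;
    s.dcache_on = (sctlr & SCTLR_C) != 0u;
    s.icache_on = (sctlr & SCTLR_I) != 0u;
    return s;
}

#if defined(__arm__)
/* Read SCTLR via CP15 (c1, c0, 0); requires a privileged mode. */
static inline uint32_t sctlr_read(void)
{
    uint32_t value;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(value));
    return value;
}
#endif
```

Calling `sctlr_decode(sctlr_read())` in the start hook and inspecting the result in the debugger shows whether the MMU, data cache and instruction cache are actually enabled at that point. The same register can also be inspected in the debugger's CP15 register view, if the debugger exposes it.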

Any ideas why this code takes so long?
Thanks a lot!

Juergen

4 months ago

0 Josue Zamitiz-Ayala 4 months ago

TI__Mastermind 32295 points

Hello Juergen,

From a SW standpoint, Sciopta is not supported officially by TI. We cannot comment on the behavior of this OS since we do not validate our HW with it.

In terms of the HW side, I will have to reassign to our ARM MPU HW engineer.

-Josue

0 J. Sch. 4 months ago in reply to Josue Zamitiz-Ayala

Prodigy 10 points

Yes, please forward the question to your MPU HW engineer. Thank you!

0 Richard Woodruff 4 months ago in reply to J. Sch.

TI__Mastermind 23715 points

Hello,

The above seems reasonable for a core which does not have the MMU or data cache enabled. Your loop of 1000 has 2 loads and 1 store to non-cached, strongly ordered memory, so each access takes about 277/3 = 92 ns. The round-trip time to the DDR for a single load is in the 90 ns range. It is not possible to turn on the data cache on an ARM without the MMU enabled. Your icache is probably enabled, otherwise it would take longer still. If you turn on the data cache, only the first loop iteration will take about 90 ns, and the following ones will be more like 4 ns per loop, as they will be hits in the L1. For this kind of code, the most efficient way to time it is to use ARM ETM trace. It removes the need to instrument with a GPIO; you just use symbol addresses as trigger points.
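
The arithmetic in this reply can be written out as a small model. This is only a sketch that takes the figures quoted above at face value (3 memory accesses per iteration, about 90 ns for an uncached DDR access, about 4 ns per iteration once the data is in L1); the function names are hypothetical and the numbers are estimates, not measurements.

```c
#include <stdint.h>

/* Average time per memory access in ns, from a measured total in microseconds. */
static double ns_per_access(double total_us, uint32_t iterations, uint32_t accesses_per_iter)
{
    return (total_us * 1000.0) / ((double)iterations * (double)accesses_per_iter);
}

/*
 * Rough loop time in ns with the data cache on: the first iteration pays
 * the miss latency, every later iteration is assumed to hit in L1.
 */
static double cached_loop_ns(uint32_t iterations, double first_iter_ns, double hit_iter_ns)
{
    if (iterations == 0u) {
        return 0.0;
    }
    return first_iter_ns + (double)(iterations - 1u) * hit_iter_ns;
}
```

With the measured 277 µs, `ns_per_access(277.0, 1000u, 3u)` gives roughly 92 ns per access, and `cached_loop_ns(1000u, 90.0, 4.0)` gives 4086 ns, so under these assumptions the whole loop would drop from 277 µs to about 4 µs with the data cache enabled.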

Regards,

Richard W.

0 J. Sch. 4 months ago in reply to Richard Woodruff

Prodigy 10 points

Hi Richard,

thanks for the reply! It contained really valuable information that I did not find in all the TI documents/data sheets/manuals that I studied.

Best regards
Juergen
