Hi everyone,
I use a BeagleBoard-X15 with an AM5728 SoC running the "Sciopta" RTOS (so no Linux). Only Core0 and DSP0 are active. My firmware runs in general, but some calculations (in my case converting data into "Flatbuffer" structures) seem extremely slow.
- The oscillator is 20 MHz, and the DPLL_MPU settings are those for OPP_NOM (M=500, N=9, M2=1). So the MPU should be running at 1000 MHz (i.e. one CPU cycle = 1 ns).
- Compiler: GNU v7.3.1 (FSF), IDE: CodeComposerStudio 12.8.0 with BlackHawk XDSv560v2 Debugger.
- Caching should be active. The memory map (see idkAM572x.ld) looks like this:
dram0 (rw) : org = 0x80000000, len = 512M
rom0 (rwx): org = 0xA0000000, len = 16M
no_cache0 (rw) : org = 0xA1000000, len = 240M
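As a cross-check of the clock settings listed above, here is a minimal sketch of the arithmetic. It assumes the relation f_out = f_in * M / ((N + 1) * M2), which reproduces the 1000 MHz stated above but is not quoted from the TRM; the function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed DPLL relation (not taken from the post or the TRM text):
 * f_out = f_in * M / ((N + 1) * M2) */
uint64_t dpll_out_hz(uint64_t f_in_hz, uint32_t m, uint32_t n, uint32_t m2)
{
    return (f_in_hz * m) / ((uint64_t)(n + 1u) * m2);
}

/* Length of one clock cycle in picoseconds for a given frequency in Hz. */
uint64_t cycle_time_ps(uint64_t f_hz)
{
    return 1000000000000ULL / f_hz;
}
```

With f_in = 20 MHz, M = 500, N = 9 and M2 = 1, dpll_out_hz() gives 1000000000 Hz and cycle_time_ps() gives 1000 ps, i.e. 1 ns per cycle.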
I inserted the following test code right at the beginning ("start_hook") of my RTOS ("Sciopta"). No interrupts are active at that early stage.
GPIOPinWrite(SOC_GPIO4_BASE, 17, GPIO_PIN_HIGH);
uint32_t i;
uint32_t dummy = 0;
uint32_t erg = 0;
for(i = 0; i < 1000; i++) {
erg = dummy + 2;
}
GPIOPinWrite(SOC_GPIO4_BASE, 17, GPIO_PIN_LOW);
Disassembly:
491 GPIOPinWrite(SOC_GPIO4_BASE, 17, GPIO_PIN_HIGH);
a000a230: E3A02001 mov r2, #1
a000a234: E3A01011 mov r1, #0x11
a000a238: E3A00A09 mov r0, #0x9000
a000a23c: E3440805 movt r0, #0x4805
a000a240: EB03D0AF bl GPIOPinWrite
493 uint32_t dummy = 0;
a000a244: E3A03000 mov r3, #0
a000a248: E50B3010 str r3, [r11, #-0x10]
494 uint32_t erg = 0;
a000a24c: E3A03000 mov r3, #0
a000a250: E50B3014 str r3, [r11, #-0x14]
495 for(i = 0; i < 1000; i++) {
a000a254: E3A03000 mov r3, #0
a000a258: E50B3008 str r3, [r11, #-8]
a000a25c: EA000005 b #0xa000a278
496 erg = dummy + 2;
a000a260: E51B3010 ldr r3, [r11, #-0x10]
a000a264: E2833002 add r3, r3, #2
a000a268: E50B3014 str r3, [r11, #-0x14]
495 for(i = 0; i < 1000; i++) {
a000a26c: E51B3008 ldr r3, [r11, #-8]
a000a270: E2833001 add r3, r3, #1
a000a274: E50B3008 str r3, [r11, #-8]
a000a278: E51B3008 ldr r3, [r11, #-8]
a000a27c: E3530FFA cmp r3, #0x3e8
a000a280: 3AFFFFF6 blo #0xa000a260
498 GPIOPinWrite(SOC_GPIO4_BASE, 17, GPIO_PIN_LOW);
a000a284: E3A02000 mov r2, #0
a000a288: E3A01011 mov r1, #0x11
a000a28c: E3A00A09 mov r0, #0x9000
a000a290: E3440805 movt r0, #0x4805
a000a294: EB03D09A bl GPIOPinWrite
Content of register R11: 0x800363D4 (located in region "dram0", see the memory map above).
This test code takes 277 µs (measured with an oscilloscope on GPIO4 pin 17), which means that one iteration takes 277 ns = 277 CPU cycles at 1000 MHz.
In my eyes that seems way too slow for a simple addition!!
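To put the measurement in relation to the disassembly, here is a small sketch of the per-iteration arithmetic. The instruction and memory-access counts are read off the loop body between 0xa000a260 and 0xa000a280; the macro and function names are hypothetical, and attributing the whole time to memory accesses is only an assumption for the sake of the estimate.

```c
#include <stdint.h>

/* Values taken from the measurement in the post. */
#define TOTAL_NS         277000u      /* 277 us between the two GPIO edges */
#define ITERATIONS       1000u
#define CPU_HZ           1000000000u  /* assumed 1000 MHz MPU clock */

/* Counted from the disassembly of one loop iteration. */
#define INSTR_PER_ITER   9u  /* 3x ldr, 2x add, 2x str, cmp, blo */
#define MEM_ACC_PER_ITER 5u  /* 3 loads + 2 stores, all relative to r11 */

uint32_t ns_per_iteration(void)
{
    return TOTAL_NS / ITERATIONS;
}

uint32_t cycles_per_iteration(void)
{
    return (uint32_t)((uint64_t)ns_per_iteration() * CPU_HZ / 1000000000u);
}

/* Rough estimate: assumes the whole time is spent in the stack accesses. */
uint32_t ns_per_memory_access(void)
{
    return ns_per_iteration() / MEM_ACC_PER_ITER;
}
```

Under that assumption, each of the five accesses to the stack frame in DDR would cost roughly 55 ns.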
Notes:
- Making the variables "volatile" didn't change anything.
- How can I verify the actual speed of the bus between the MPU and the DDR3 RAM? (The DDR3 RAM is a "Kingston D2516EC4BXGGB" with a word write speed of 1066 Mb/s -> 30 ns per word.)
- The MPU seems NOT to be in sleep mode:
  Register CM_MPU_MPU_CLKCTRL: bits STBYST and IDLEST are both "0".
  Register CM_MPU_CLKSTCTRL: bit CLKACTIVITY_MPU_GCLK = 1, so the MPU clock is running; field CLKTRCTRL = 2, i.e. SW_WKUP: "Start a software forced wake-up transition on the domain".
- The MMU seems to be off (not selected in the Sciopta RTOS config tool).
- How can I check whether there are Cache misses?
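Regarding the MMU and cache notes above: one possible check is to read the ARMv7-A System Control Register (SCTLR, CP15 c1) and look at the M, C and I bits. The following is only a sketch; the helper names are made up, and the register read is compiled only for 32-bit ARM targets and must run in a privileged mode. As far as I know, on ARMv7-A data accesses are treated as Strongly-ordered (not cached) while the MMU is disabled, regardless of the C bit, which is what data_caching_possible() encodes.

```c
#include <stdint.h>

#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data/unified cache enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

/* Reads SCTLR on an ARMv7-A core (privileged mode only).
 * Returns 0 when built for any other target. */
uint32_t read_sctlr(void)
{
    uint32_t v = 0;
#if defined(__arm__)
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(v));
#endif
    return v;
}

int mmu_enabled(uint32_t sctlr)    { return (sctlr & SCTLR_M) != 0; }
int dcache_enabled(uint32_t sctlr) { return (sctlr & SCTLR_C) != 0; }
int icache_enabled(uint32_t sctlr) { return (sctlr & SCTLR_I) != 0; }

/* Assumption: data accesses can only be cached if both the MMU
 * and the data cache enable bit are set. */
int data_caching_possible(uint32_t sctlr)
{
    return mmu_enabled(sctlr) && dcache_enabled(sctlr);
}
```

Calling data_caching_possible(read_sctlr()) in the start hook would show whether the stack accesses in the test loop can hit the data cache at all. Counting actual misses would additionally need the PMU event counters (for example the L1 data cache refill event 0x03), which this sketch does not cover.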
Any ideas why this code takes so long?
Thanks a lot!
Juergen