Poor OMAP35 ARM performance (3.3MIPS instead of 333 MIPS)

Other Parts Discussed in Thread: OMAP-L137, SYSCONFIG, AM3517

Hi there

I just did a simple benchmark of the ARM Cortex-A8 inside the OMAP35. Do you know if the ARM caches are enabled by default?
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344b/BABIGFEH.html

Code Composer v4.2 with rtssrc.zip, debug configuration

Using the following GEL-files: http://processors.wiki.ti.com/index.php/OMAP_and_Sitara_CCS_support
GELs for OMAP3EVM for CCSv4  

EBVBeagle Rev C4
Configuration  Location                              loop 10'000'000  loop 50'000'000
Debug          External SDRAM (0x8000 0000)          6.280s           31.400s
Release        External SDRAM (0x8000 0000)          4.656s           23.227s
Debug          On-Chip SRAM Internal (0x4020 0000)   4.995s           24.997s
Release        On-Chip SRAM Internal (0x4020 0000)   3.728s           18.639s

Shouldn't the OMAP35 outperform the OMAP-L137? I guess the poster in the thread below is using much faster memory..
http://e2e.ti.com/support/dsp/omap_applications_processors/f/42/p/43501/168120.aspx

Benchmark code:

#include <stdio.h>
#include "omap35xx_base_regs.h"
#include "omap35xx_prcm.h"
#include "omap35xx_gptimer.h"
#include "RegisterIoMacros.h"

void setupPerformanceCounter() {
 // enable clocks
 OMAP_PRCM_PER_CM_REGS* pPowerClockRegs = (OMAP_PRCM_PER_CM_REGS*)OMAP_PRCM_PER_CM_REGS_PA;
 // select 32.768kHz
 CLRREG32(&pPowerClockRegs->CM_CLKSEL_PER, CLKSEL_GPT4);
 // enable functional clock
 SETREG32(&pPowerClockRegs->CM_FCLKEN_PER, CM_CLKEN_GPT4);
 // enable interface clock
 SETREG32(&pPowerClockRegs->CM_ICLKEN_PER, CM_CLKEN_GPT4);
 // wait until GPTimer is ready to use
 while (INREG32(&pPowerClockRegs->CM_IDLEST_PER) & CM_IDLEST_ST_GPT4) {
  ;
 }
 
 // enable GPTimer4
 OMAP_GPTIMER_REGS* pTimerReg = (OMAP_GPTIMER_REGS*)OMAP_GPTIMER4_REGS_PA;
 // Soft reset GPTIMER and wait until finished
 SETREG32(&pTimerReg->TIOCP, SYSCONFIG_SOFTRESET);
 while ((INREG32(&pTimerReg->TISTAT) & GPTIMER_TISTAT_RESETDONE) == 0) {
  ;
 }
 // clear interrupts
 OUTREG32(&pTimerReg->TISR, 0);
 
 // start count at zero
 OUTREG32(&pTimerReg->TLDR, 0);
 // Trigger a counter reload by writing to TTGR 
 OUTREG32(&pTimerReg->TTGR, 0xFFFFFFFF);
 
 // Start the timer, set for auto reload
 OUTREG32(&pTimerReg->TCLR, GPTIMER_TCLR_ST|GPTIMER_TCLR_AR); 
}

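unsigned int ticksToMilliseconds(unsigned int nofTicks) {
 // 64-bit intermediate: nofTicks * 1000 would overflow 32 bits after about 131s
 return (unsigned int)((unsigned long long)nofTicks * 1000ULL / 32768ULL);
}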

void hdelay(unsigned int count)
{
   printf("loop %u..",count);
   volatile OMAP_GPTIMER_REGS* pTimerReg = (OMAP_GPTIMER_REGS*)OMAP_GPTIMER4_REGS_PA;
   unsigned int startTCRR = INREG32(&pTimerReg->TCRR);
   volatile unsigned int i;
   for(i=0;i<count;i++) {
       ;
   }
   unsigned int endTCRR = INREG32(&pTimerReg->TCRR);
   unsigned int duration_ms = ticksToMilliseconds(endTCRR - startTCRR);
   printf("..done in %ums\n",duration_ms);
}

int main(void) {
 setupPerformanceCounter();
 hdelay(10000000);
 hdelay(50000000);
 return 0;
}

  • Update

    I found a whetstone benchmark sourcecode: http://www.netlib.org/benchmark/whetstone.c
    So I created a small sample project based on this source (with GPTimer4 added as the timing source). The results are below:
    Using the following GEL-files: http://processors.wiki.ti.com/index.php/OMAP_and_Sitara_CCS_support
    GELs for OMAP3EVM for CCSv4

    Code Composer v4 Project:

    Device      Configuration   Whetstone Result
    EBVBeagle   Debug           1.8 MIPS / 54.3 seconds
    EBVBeagle   Release         3.2 MIPS / 31.6 seconds

    Android Benchmarks:

    The OMAP35 scores 333.3 MIPS
    http://processors.wiki.ti.com/index.php/Android_Comparative_Benchmarks#RowboPerf:_ARM

    Question:

    Why is my standalone project 100x slower than the Android result?
    Please run the attached CCSv4 project and post your results..

    3630.WhetstoneBenchmarkARM.zip

  • Update:
    With the instruction cache enabled, it looks like the problem is D-Cache related (the default is MMU off and D-Cache disabled).

    I-Cache, D-Cache and MMU combinations: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka8788.html
    Why must I enable the MMU to use the D-Cache but not for the I-Cache: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13835.html

    This Assembler code executes in 250ms, resulting in about 600MIPS. (50'000'000 * 3 Instructions / 0.25s = 600MIPS)

    asm("LOOP_COUNT   .word 0x2FAF080"); //50000000
    void asmdelay()
    {

     printf("loop 50000000 in registers..");
     volatile OMAP_GPTIMER_REGS* pTimerReg = (OMAP_GPTIMER_REGS*)OMAP_GPTIMER4_REGS_PA;
     unsigned int startTCRR = INREG32(&pTimerReg->TCRR);
     // R0 is the running counter, R1 holds the loop limit
     asm(" LDR R1, LOOP_COUNT");
     asm(" MOV R0, #0 ;running variable");
     asm("loop:");
     asm(" ADD R0, R0, #1 ;Increment it");
     asm(" CMP R0, R1 ;Check the limit");
     asm(" BLE loop ;Loop if not finished");
     
     unsigned int endTCRR = INREG32(&pTimerReg->TCRR);
     unsigned int duration_ms = ticksToMilliseconds(endTCRR - startTCRR);
     printf("..done in %ums\n",duration_ms);

    }

  • Hi MGun,

    Since Whetstone is a floating point benchmark, I think some of the bad performance has 2 possible causes:

    1. The project is not compiled with flags that enable VFP code generation.

    2. The CCS support math libs were not compiled to generate VFP code and use software floating point emulation instead.

    If you look at the generated assembly, do you see VFP instructions like the ones shown below? These run on the VFP hardware and improve performance.

     fsitod  d5, s20
     fldd    d6, .L87
     fmuld   d4, d5, d6

    Same with the math libs: Whetstone calls trig functions like floating point sin(). Do you see VFP instructions in the math library?
    I can post the results I obtained in Linux, I tested with the same whetstone you are using: http://www.netlib.org/benchmark/whetstone.c
    This is the difference that can be seen due to issue #2.
    AM3730 running at 1 GHz: 200 MIPS - Using software floating point emulation in the math libraries.
    AM3730 running at 1 GHz: 555.6 MIPS - Using a math lib built with VFP support.
    AM3517 running at 600MHz 111 MIPS - Using software floating point emulation in the math libraries.
    AM3517 running at 600MHz 357 MIPS - Using a math lib built with VFP support.
    There should also be a significant difference due to issue #1, but I don't have the numbers for that.

  • MGun,

    I forgot to mention that I'm not 100% sure, but I think this flag "-mv7a8" will enable VFP code generation. I briefly checked the project you sent and I don't think I saw this flag, but I may be mistaken. 

  •  Hi Jeff

    Thank you very much for your effort! I'll have a look at the compiler flags and the generated assembly, but here is a simpler example showing that something is wrong (not related to floating point):

    A simple for loop over 50'000'000 iterations takes up to 30 seconds (see my first post). However, if I loop with my own assembler code (values held directly in registers), it only takes 250ms. I have attached the example as a zip file.
    I think the problem is that the compiler places the loop variable in SDRAM, and without the D-Cache the processor has to fetch it from the "slow" SDRAM every time.

    1033.OMAP3530_LoopBenchmarkARM.zip

                                    Number of loops   Duration [ms]   Number of instructions inside the loop   MIPS
    C-Code generated loop [Debug]   50'000'000        34025ms         7                                        50M * 7 / 34.025s = 10.2MIPS
    Assembler loop                  50'000'000        250ms           3                                        50M * 3 / 0.25s = 600MIPS

    C-Loop:

    unsigned int i;
    for(i=0;i<count;i++) {
        ;
    }

    Compiler generated loop [Debug]:

                C$DW$L$_hdelay, C$L3:
    0x80022B3C:   E59DC00C LDR             R12, [R13, #12]
    0x80022B40:   E28CC001 ADD             R12, R12, #1
    0x80022B44:   E58DC00C STR             R12, [R13, #12]
    0x80022B48:   E59D000C LDR             R0, [R13, #12]
    0x80022B4C:   E59DC000 LDR             R12, [R13]
    0x80022B50:   E15C0000 CMP             R12, R0
    0x80022B54:   8AFFFFF8 BHI             C$L3

    Assembler loop:

      // Increment R0 until R0=R1
     asm(" LDR R1, LOOP_COUNT");
     asm(" MOV R0, #0   ;running variable");
     asm("loop:");
     asm(" ADD R0, R0, #1  ;Increment it");
     asm(" CMP R0, R1   ;Check the limit");
     asm(" BLE loop   ;Loop if not finished");

  • MGun,

    Sorry for the slow response. I do not have the time to look at the code you submitted, but below I'm listing the MMU setup that I was using to do low level testing some time ago, I hope that can help you get your DCache enabled.

     

    ;******************************************************************************
    ;* initmmu v#####                                                               *
    ;* Copyright (c) 2008@%%%% Texas Instruments Incorporated                     *
    ;* Author: : Modified the original  code for cortex A8 on OMAP3 *
    ;* device  *
    ;******************************************************************************

            .text
            .state32
            .align 4

    Fault         .set    00B             ; constant defines for level 1 pagetable
    Section       .set    0010B           ; a trailing B denotes a binary number
    B             .set    0100B
    C             .set    1000B
    TTBit         .set    10000B
    Domain        .set    111100000B
    FullAccess    .set    110000000000B

    ttb_first_level .field 0x80000000, 32
    domain_val      .field 0xffffffff, 32
    loop_count      .field 0xfff, 32

            .global  ARM_InitMMUentry
            .armfunc ARM_InitMMUentry

    ARM_InitMMUentry:

    ; if MMU/MPU enabled - disable it (useful for ARMulator tests)
    ; also disable the caches and invalidate the TLBs
    ; NOTE: this would not be required from a cold reset

            MRC     p15, #0, r0, c1, c0, #0     ; read CP15 register 1 into r0
            BIC     r0, r0, #0x1                ; clear bit 0
            MCR     p15, #0, r0, c1, c0, #0     ; write value back

            MOV     r0, #0
            MCR     p15, #0, r0, c7, c5, #0     ; invalidate instruction cache

            MRC     p15, #0, R0, c1, c0, #1     ; Read Auxiliary Control Register
            MOV     R1, #0x3D
            AND     R0, R1, R0                  ; Clear L2EN bit
            MCR     p15, #0, R0, c1, c0, #1     ; Disable L2$

            MCR     p15, #0, r0, c8, c7, #0     ; invalidate TLBs

    ; Cortex-A8 supports two translation tables
            ; Configure translation table base (TTB) control register cp15,c2
            ; to a value of all zeros, indicates we are using TTB register 0.

            MOV     r0,#0x0
            MCR     p15, #0, r0, c2, c0, #2

            LDR     r0, ttb_first_level         ; set start of Translation Table
                                                ; base (16k Boundary)
            MCR     p15, #0, r0, c2, c0, #0     ; write to CP15 register 2

    ; Create translation table for flat mapping
    ; Top 12 bits of VA is pointer into table
    ; Create 4096 entries from 000xxxxx to fffxxxxx

            LDR     r1,loop_count               ; loop counter
            MOV     r2, #(TTBit | Section)      ; build descriptor pattern in reg
            ORR     r2, r2, #(Domain | FullAccess)

    _init_ttb_1:
            ORR     r3, r2, r1, LSL #20         ; use loop counter to create
                                                ; individual table entries
            ORR     r3,r3,#1100B                ; set cachable and bufferable
                                                ; attributes for the section (3:2)
            STR     r3, [r0, r1, LSL #2]        ; str r3 at TTB base + loopcount*4
            SUBS    r1, r1, #1                  ; decrement loop counter
            BPL     _init_ttb_1

    ;===================================================================
    ; Setup domain control register - domain_val = 0xffffffff puts
    ; every domain in manager mode (b11), not client mode (b01)
    ;===================================================================

            MRC     p15, #0, r0, c3, c0, #0     ; Read Domain Access Control Register
            LDR     r0, domain_val              ; Initialize every domain entry to b11 (manager)
            MCR     p15, #0, r0, c3, c0, #0     ; Write Domain Access Control Register

    ; enable MMU
            MRC     p15, #0, r0, c1, c0, #0     ; read CP15 register 1 into r0
            BIC     r0, r0, #(0x1  <<12)        ; ensure I Cache disabled
            BIC     r0, r0, #(0x1  <<2)         ; ensure D Cache disabled
            ORR     r0, r0, #0x1                ; enable MMU before scatter loading
            MCR     p15, #0, r0, c1, c0, #0     ; write CP15 register 1

    ; Now that the MMU is enabled, virtual to physical address translations will
    ; occur and affect the next instruction fetch. Even if this module is
    ; remapped, the branch instruction should be safe as it is
    ; contained in the pipeline.  However, this should not be relied upon
    ; (as this file stands, it flat-maps
    ; the entire address space, so there is no problem).

            BX      LR

  • Hi Jeff

    Thank you very much for your code snippet!
    This saved me a lot of time, and the loop with 50'000'000 iterations now executes within 580ms (~600MIPS) instead of 30 seconds!

    unsigned int i;
    unsigned int count = 50000000;
    for(i=0;i<count;i++) {
        ;
    }

    I'll now continue to investigate the floating point bottleneck with the compiler options you posted before..

    Thanks
    Michael

  • MGun,

    Glad it could help you. Please post back if you do find more bottlenecks.

    I did find an example where another TI'er ran Whetstone in CCSv3 project at 600MHz clock rate and got 111 MIPS with the same example code I posted. So this tells me that Dcache and Icache are set up.  Also it indicates the math library is not built for armv7 architecture.

    Thanks to the D-Cache, the standalone Whetstone project now scores 20.7 MIPS [Release Build] instead of 3.2 MIPS.
    I had a look at the disassembly, but did not find any of the instructions you mentioned (fsitod, fldd, fmuld).

    Looks like CCSv4 Projects "out of the box" are not using the full potential of the Cortex-A8 at all..
    My Compiler Options look like this: (-mv7A8 is defined)

    Here is my updated project with the I-Cache, D-Cache and MMU enabled: 0131.WhetstoneBenchmarkARM_v2(MMU).zip
    Would you be so kind as to ask some CCSv4 experts what has to be adjusted to use the full potential of the 3530? I have the impression that I'm the only one using the 3530 with Code Composer v4 "bare metal"..

    I found a note in the ARM Optimizing C/C++ Compiler v4.7 User's Guide (SPNU151F) on Page 53 (VFP Support):

  • Mgun,

    Sorry for the slow response.  I'm now trying to find someone to help you. I think you should be able to achieve 111.1 Whetstone MIPS.  I don't think the CCSv5 C support libs are compiled for armv7 architecture.  I do believe you can recompile them yourself with the proper CFLAGS to support armv7 and utilize the VFP for floating point. Then I think you can achieve 357 Whetstone MIPS.

  • Hi Jeff

    Yes we reached about 110 MIPS after recompiling the CCSv5 libraries for the armv7 architecture with VFP support.
    We also had to enable the I-cache and configure the MMU to use the D-Cache to get 110 MIPS.

    That's currently fine for us. However, I still don't know how the Android code reached 333 Whetstone MIPS.

    Thank you!
    Michael

  • Hello All!

    I tried to use the code example from the earlier posts in our bootloader.

    As I understand it, this is the example from the ARM developer guide, but there they also switch on the D-Cache and I-Cache.

    If I do that with this example, the micro resets itself (or hangs).

    The example from the ARM guide works fine until the bootloader finishes its job and jumps into the loaded code. I did switch off the caches and the MMU before that, of course, but it seems that I forgot to flush or invalidate something.

    Can anyone show the right way to switch the caches and the MMU off?

     

    Thank You all in advance,

    Alexey