Poor OMAP35 ARM performance (3.3MIPS instead of 333 MIPS)

MGun


Hi there

I just did a simple benchmark of the ARM Cortex-A8 inside the OMAP35. Do you know if the ARM caches are enabled by default?
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344b/BABIGFEH.html

Code Composer v4.2 with rtssrc.zip, debug configuration

Using the following GEL-files: http://processors.wiki.ti.com/index.php/OMAP_and_Sitara_CCS_support
GELs for OMAP3EVM for CCSv4

EBVBeagle Rev C4
Configuration	Location	loop 10'000'000	loop 50'000'000
Debug	External SDRAM (0x8000 0000)	6.280s	31.400s
Release	External SDRAM (0x8000 0000)	4.656s	23.227s
Debug	On-Chip SRAM Internal (0x4020 0000)	4.995s	24.997s
Release	On-Chip SRAM Internal (0x4020 0000)	3.728s	18.639s

Shouldn't the OMAP35 outperform the OMAP-L137? I guess the poster in the thread below is using much faster memory.
http://e2e.ti.com/support/dsp/omap_applications_processors/f/42/p/43501/168120.aspx

Benchmark code:

#include <stdio.h>
#include "omap35xx_base_regs.h"
#include "omap35xx_prcm.h"
#include "omap35xx_gptimer.h"
#include "RegisterIoMacros.h"

void setupPerformanceCounter() {
    // enable clocks
    OMAP_PRCM_PER_CM_REGS* pPowerClockRegs = (OMAP_PRCM_PER_CM_REGS*)OMAP_PRCM_PER_CM_REGS_PA;
    // select 32.768kHz
    CLRREG32(&pPowerClockRegs->CM_CLKSEL_PER, CLKSEL_GPT4);
    // enable functional clock
    SETREG32(&pPowerClockRegs->CM_FCLKEN_PER, CM_CLKEN_GPT4);
    // enable interface clock
    SETREG32(&pPowerClockRegs->CM_ICLKEN_PER, CM_CLKEN_GPT4);
    // wait until GPTimer is ready to use
    while (INREG32(&pPowerClockRegs->CM_IDLEST_PER) & CM_IDLEST_ST_GPT4) {
        ;
    }

    // enable GPTimer4
    OMAP_GPTIMER_REGS* pTimerReg = (OMAP_GPTIMER_REGS*)OMAP_GPTIMER4_REGS_PA;
    // Soft reset GPTIMER and wait until finished
    SETREG32(&pTimerReg->TIOCP, SYSCONFIG_SOFTRESET);
    while ((INREG32(&pTimerReg->TISTAT) & GPTIMER_TISTAT_RESETDONE) == 0) {
        ;
    }
    // clear interrupts
    OUTREG32(&pTimerReg->TISR, 0);

    // start count at zero
    OUTREG32(&pTimerReg->TLDR, 0);
    // Trigger a counter reload by writing to TTGR
    OUTREG32(&pTimerReg->TTGR, 0xFFFFFFFF);

    // Start the timer, set for auto reload
    OUTREG32(&pTimerReg->TCLR, GPTIMER_TCLR_ST|GPTIMER_TCLR_AR);
}

unsigned int ticksToMilliseconds(unsigned int nofTicks) {
    return nofTicks * 1000 / 32768;
}

void hdelay(unsigned int count)
{
    // count and duration_ms are unsigned, so print them with %u
    printf("loop %u..",count);
    volatile OMAP_GPTIMER_REGS* pTimerReg = (OMAP_GPTIMER_REGS*)OMAP_GPTIMER4_REGS_PA;
    unsigned int startTCRR = INREG32(&pTimerReg->TCRR);
    volatile unsigned int i;
    for(i=0;i<count;i++) {
        ;
    }
    unsigned int endTCRR = INREG32(&pTimerReg->TCRR);
    unsigned int duration_ms = ticksToMilliseconds(endTCRR - startTCRR);
    printf("..done in %ums\n",duration_ms);
}

int main(void) {
    setupPerformanceCounter();
    hdelay(10000000);
    hdelay(50000000);
    return 0;
}
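
A note on the tick conversion above: `nofTicks * 1000` is evaluated in 32 bits, so it wraps once the measured interval exceeds 4'294'967 ticks (about 131 seconds at 32.768 kHz). The runs in this post are far below that, but a slower run could report a wrong duration. A minimal sketch of a conversion with a 64-bit intermediate follows; the names `ticks_to_ms` and `GPT_TICKS_PER_SECOND` are made up for illustration and are not part of the original project.

```c
#include <stdint.h>

/* Assumption: the timer runs from the 32.768 kHz clock selected in
   setupPerformanceCounter(). */
#define GPT_TICKS_PER_SECOND 32768u

/* Convert timer ticks to milliseconds. The multiplication is done in
   64 bits so it cannot wrap for any 32-bit tick count. */
static uint32_t ticks_to_ms(uint32_t ticks)
{
    return (uint32_t)(((uint64_t)ticks * 1000u) / GPT_TICKS_PER_SECOND);
}

/* Elapsed ticks between two TCRR reads. Unsigned subtraction gives the
   right answer across a single counter wrap-around. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}
```

For example, `ticks_to_ms(32768u)` yields 1000, and `elapsed_ticks(0xFFFFFFF0u, 0x10u)` yields 32 even though the counter wrapped between the two reads.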

over 14 years ago

0 MGun over 14 years ago

Intellectual 895 points

Update

I found the source code of a Whetstone benchmark: http://www.netlib.org/benchmark/whetstone.c
So I created a small sample project based on this source (adding GPTimer4 as the timing source); the results are below:
Using the following GEL-files: http://processors.wiki.ti.com/index.php/OMAP_and_Sitara_CCS_support
GELs for OMAP3EVM for CCSv4

Code Composer v4 Project:

Device	Configuration	Whetstone Result
EBVBeagle	Debug	1.8 MIPS / 54.3 seconds
EBVBeagle	Release	3.2 MIPS / 31.6 seconds

Android Benchmarks:

The OMAP35 scores 333.3 MIPS
http://processors.wiki.ti.com/index.php/Android_Comparative_Benchmarks#RowboPerf:_ARM

Question:

Why is my standalone project 100x slower than the Android result?
Please run the attached CCSv4 project and post your results..

3630.WhetstoneBenchmarkARM.zip

0 MGun over 14 years ago in reply to MGun

Intellectual 895 points

Update:
The instruction cache is enabled, so the slowdown looks D-cache related (by default the MMU is off and the D-cache is disabled).

I-Cache, D-Cache and MMU combinations: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka8788.html
Why must I enable the MMU to use the D-Cache but not for the I-Cache: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13835.html
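
For reference, the three enable bits discussed in the links above live in the CP15 c1 control register: bit 0 (MMU), bit 2 (D-cache) and bit 12 (I-cache). The sketch below only models the bit arithmetic on a plain integer so it can be checked on any host; the helper names are hypothetical, and on the target the value still has to be read and written with MRC/MCR p15, #0, Rx, c1, c0, #0.

```c
#include <stdint.h>

/* Bit positions in the CP15 c1 control register (ARMv7-A). */
#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data cache enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

/* Hypothetical helper: return the register value with the MMU and both
   caches switched on. It does not touch the hardware. */
static uint32_t sctlr_enable_mmu_and_caches(uint32_t sctlr)
{
    return sctlr | SCTLR_M | SCTLR_C | SCTLR_I;
}

/* The D-cache bit alone is not enough: data caching needs the MMU on too. */
static int dcache_effective(uint32_t sctlr)
{
    return (sctlr & SCTLR_M) != 0 && (sctlr & SCTLR_C) != 0;
}
```

A valid translation table has to be in place before the MMU bit is set; this sketch covers only the final control-register value.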

This assembler code executes in 250 ms, resulting in about 600 MIPS (50'000'000 * 3 instructions / 0.25 s = 600 MIPS).

asm("LOOP_COUNT .word 0x2FAF080"); //50000000
void asmdelay()
{
    printf("loop 50000000 in registers..");
    volatile OMAP_GPTIMER_REGS* pTimerReg = (OMAP_GPTIMER_REGS*)OMAP_GPTIMER4_REGS_PA;
    unsigned int startTCRR = INREG32(&pTimerReg->TCRR);
    // R1 holds the loop limit, R0 the running counter
    asm(" LDR R1, LOOP_COUNT");
    asm(" MOV R0, #0 ;running variable");
    asm("loop:");
    asm(" ADD R0, R0, #1 ;Increment it");
    asm(" CMP R0, R1 ;Check the limit");
    asm(" BLE loop ;Loop if not finished");

    unsigned int endTCRR = INREG32(&pTimerReg->TCRR);
    unsigned int duration_ms = ticksToMilliseconds(endTCRR - startTCRR);
    printf("..done in %ums\n",duration_ms);
}

0 Jeff L over 14 years ago in reply to MGun

TI__Expert 5960 points

Hi MGun,

Since Whetstone is a floating point benchmark, I think some of the bad performance has two possible causes, so there are two things to check:

1. Compile with flags that enable VFP code generation.

2. Check the CCS support math libs and see if they have been compiled to generate VFP code or whether they are software floating point emulation.

If you look at the generated assembly, do you see VFP instructions like the ones shown below? These run on the VFP hardware and improve performance.

fsitod d5, s20

fldd d6, .L87

fmuld d4, d5, d6

The same applies to the math libs: Whetstone calls trig functions such as floating point sin(). Do you see VFP instructions in the math library?
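
As a small probe for this check, a function like the one below can be compiled and its disassembly inspected. It is only a sketch with a made-up name: with VFP code generation enabled, the conversion and the multiply would be expected to show up as VFP instructions of the kind listed above, whereas a software floating point build would call runtime helper routines instead. The exact output depends on the compiler and its options.

```c
/* Hypothetical probe function: an int-to-double conversion followed by
   a double-precision multiply. */
static double scale_count(int count, double factor)
{
    double as_double = (double)count;  /* conversion, e.g. fsitod with VFP */
    return as_double * factor;         /* multiply, e.g. fmuld with VFP */
}
```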

I can post the results I obtained in Linux, I tested with the same whetstone you are using: http://www.netlib.org/benchmark/whetstone.c

This is the difference that can be seen due to issue #2.

AM3730 running at 1 GHz: 200 MIPS - Using software floating point emulation in the math libraries.

AM3730 running at 1 GHz: 555.6 MIPS - Using a math lib built with VFP support.

AM3517 running at 600 MHz: 111 MIPS - Using software floating point emulation in the math libraries.

AM3517 running at 600 MHz: 357 MIPS - Using a math lib built with VFP support.

There should also be a significant difference due to issue #1, but I don't have the numbers for that.

0 Jeff L over 14 years ago in reply to Jeff L

TI__Expert 5960 points

MGun,

I forgot to mention that I'm not 100% sure, but I think this flag "-mv7a8" will enable VFP code generation. I briefly checked the project you sent and I don't think I saw this flag, but I may be mistaken.

0 MGun over 14 years ago in reply to Jeff L

Intellectual 895 points

Hi Jeff

Thank you very much for your effort! I'll have a look at the compiler flags and the generated assembly, but here is a simpler example showing that something is wrong (not related to floating point):

A simple for loop over 50'000'000 iterations takes up to 30 seconds (see my first post). However, if I loop with my own assembler code (values held directly in registers), it only takes 250 ms. I have attached the example as a zip file.
I think the problem is that the compiler places the loop variable in SDRAM, and without the D-Cache the processor has to fetch it from the "slow" SDRAM every time.

1033.OMAP3530_LoopBenchmarkARM.zip

Loop	Number of loops	Duration [ms]	Instructions per iteration	MIPS
C-Code generated loop [Debug]	50'000'000	34025	7	50M * 7 / 34.025s = 10.3 MIPS
Assembler loop	50'000'000	250	3	50M * 3 / 0.25s = 600 MIPS
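
The MIPS column is plain arithmetic: iterations times instructions per iteration, divided by the duration. A minimal sketch with a hypothetical helper name, in case someone wants to redo the calculation for their own timings:

```c
/* Million instructions per second for a counted loop.
   duration_ms is the measured wall time in milliseconds. */
static double loop_mips(double iterations, double instr_per_iter, double duration_ms)
{
    double seconds = duration_ms / 1000.0;
    return iterations * instr_per_iter / seconds / 1e6;
}
```

With the figures from the table, `loop_mips(50e6, 3, 250)` gives 600 and `loop_mips(50e6, 7, 34025)` gives roughly 10.3.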

C-Loop:

unsigned int i;
for(i=0;i<count;i++) {
;
}

Compiler generated loop [Debug]:

            C$DW$L$_hdelay, C$L3:
0x80022B3C:   E59DC00C LDR             R12, [R13, #12]
0x80022B40:   E28CC001 ADD             R12, R12, #1
0x80022B44:   E58DC00C STR             R12, [R13, #12]
0x80022B48:   E59D000C LDR             R0, [R13, #12]
0x80022B4C:   E59DC000 LDR             R12, [R13]
0x80022B50:   E15C0000 CMP             R12, R0
0x80022B54:   8AFFFFF8 BHI             C$L3
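
The listing shows three LDR and one STR per iteration, i.e. four stack accesses that go to SDRAM when the D-Cache is off. As a rough estimate only, and assuming the whole loop time is spent on those accesses (an upper bound, not a measurement), the cost per access can be worked out like this; the helper name is made up:

```c
/* Average time per data access in nanoseconds, under the assumption
   that the entire loop duration is caused by data accesses. */
static double ns_per_access(double duration_s, double iterations, double accesses_per_iter)
{
    return duration_s * 1e9 / (iterations * accesses_per_iter);
}
```

For the debug loop, `ns_per_access(34.025, 50e6, 4)` comes out at about 170 ns per access, which is in the range one would expect for uncached external memory rather than for register or cache traffic.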

Assembler loop:

  // Increment R0 until R0=R1
asm(" LDR R1, LOOP_COUNT");
asm(" MOV R0, #0   ;running variable");
asm("loop:");
asm(" ADD R0, R0, #1 ;Increment it");
asm(" CMP R0, R1   ;Check the limit");
asm(" BLE loop   ;Loop if not finished");

0 Jeff L over 14 years ago in reply to MGun

TI__Expert 5960 points

MGun,

Sorry for the slow response. I do not have the time to look at the code you submitted, but below I'm listing the MMU setup that I was using to do low level testing some time ago, I hope that can help you get your DCache enabled.

;******************************************************************************
;* initmmu v#####                                                             *
;* Author: : Modified the original code for Cortex-A8 on OMAP3 device         *
;******************************************************************************
        .text
        .state32
        .align 4

Fault           .set 00B            ; constant defines for level 1 pagetable
Section         .set 0010B          ; B suffix denotes a binary number
B               .set 0100B
C               .set 1000B
TTBit           .set 10000B
Domain          .set 111100000B
FullAccess      .set 110000000000B

ttb_first_level .field 0x80000000, 32
domain_val      .field 0xffffffff, 32
loop_count      .field 0xfff, 32

        .global ARM_InitMMUentry
        .armfunc ARM_InitMMUentry
ARM_InitMMUentry:
        ; if MMU/MPU enabled - disable it (useful for ARMulator tests)
        ; also disable the L2 cache and invalidate the TLBs
        ; NOTE: this would not be required from a cold reset
        MRC p15, #0, r0, c1, c0, #0     ; read CP15 register 1 into r0
        BIC r0, r0, #0x1                ; clear bit 0
        MCR p15, #0, r0, c1, c0, #0     ; write value back
        MOV r0, #0
        MCR p15, #0, r0, c7, c5, #0     ; invalidate instruction cache
        MRC p15, #0, R0, c1, c0, #1     ; Read Auxiliary Control Register
        MOV R1, #0x3D
        AND R0, R1, R0                  ; Clear L2EN bit
        MCR p15, #0, R0, c1, c0, #1     ; Disable L2$
        MCR p15, #0, r0, c8, c7, #0     ; invalidate TLBs

        ; Cortex-A8 supports two translation tables
        ; Configure translation table base (TTB) control register cp15,c2
        ; to a value of all zeros, indicates we are using TTB register 0.
        MOV r0,#0x0
        MCR p15, #0, r0, c2, c0, #2
        LDR r0, ttb_first_level         ; set start of Translation Table
                                        ; base (16k Boundary)
        MCR p15, #0, r0, c2, c0, #0     ; write to CP15 register 2

        ; Create translation table for flat mapping
        ; Top 12 bits of VA is pointer into table
        ; Create 4096 entries from 000xxxxx to fffxxxxx
        LDR r1,loop_count               ; loop counter
        MOV r2, #(TTBit | Section)      ; build descriptor pattern in reg
        ORR r2, r2, #(Domain | FullAccess)
_init_ttb_1:
        ORR r3, r2, r1, LSL #20         ; use loop counter to create
                                        ; individual table entries
        ORR r3,r3,#1100B                ; set cachable and bufferable
                                        ; attributes (3:2) for every section
        STR r3, [r0, r1, LSL #2]        ; str r3 at TTB base + loopcount*4
        SUBS r1, r1, #1                 ; decrement loop counter
        BPL _init_ttb_1

        ;===================================================================
        ; Setup domain control register
        ; NOTE: domain_val 0xffffffff sets every domain to b11 (manager),
        ; not b01 (client)
        ;===================================================================
        MRC p15, #0, r0, c3, c0, #0     ; Read Domain Access Control Register
        LDR r0, domain_val              ; Initialize every domain entry to b11
        MCR p15, #0, r0, c3, c0, #0     ; Write Domain Access Control Register

        ; enable MMU
        MRC p15, #0, r0, c1, c0, #0     ; read CP15 register 1 into r0
        BIC r0, r0, #(0x1 <<12)         ; ensure I Cache disabled
        BIC r0, r0, #(0x1 <<2)          ; ensure D Cache disabled
        ORR r0, r0, #0x1                ; enable MMU before scatter loading
        MCR p15, #0, r0, c1, c0, #0     ; write CP15 register 1

        ; Now the MMU is enabled, virtual to physical address translations will
        ; occur and affect the next instruction fetch. Even if this module is
        ; remapped, the branch instruction should be safe as it is
        ; contained in the pipeline. However, this should not be relied upon
        ; (as this file stands, it flat-maps the entire address space, so
        ; there is no problem).
        BX LR
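
For readers who find the table-building loop easier to follow in C, here is a host-side sketch of the descriptor it writes. The constants mirror the `.set` values in the listing; the function names are made up for illustration, and the array stands in for the 16 KB table at `ttb_first_level`. Note that, as in the assembly, every 1 MB section (including peripheral register space) is marked cacheable and bufferable; a real project may want device regions mapped differently.

```c
#include <stdint.h>

#define SECTION     0x002u   /* Section    .set 0010B */
#define TT_BIT      0x010u   /* TTBit      .set 10000B */
#define DOMAIN      0x1E0u   /* Domain     .set 111100000B */
#define FULL_ACCESS 0xC00u   /* FullAccess .set 110000000000B */
#define CB_BITS     0x00Cu   /* cacheable + bufferable, bits 3:2 */

/* First-level section descriptor for a flat (VA == PA) mapping.
   index is the top 12 bits of the address, i.e. the 1 MB section number. */
static uint32_t flat_section_descriptor(uint32_t index)
{
    return (index << 20) | FULL_ACCESS | DOMAIN | TT_BIT | CB_BITS | SECTION;
}

/* Fill all 4096 entries, as the _init_ttb_1 loop does on the target. */
static void build_flat_table(uint32_t table[4096])
{
    for (uint32_t i = 0; i < 4096u; i++) {
        table[i] = flat_section_descriptor(i);
    }
}
```

The entry for the start of SDRAM, `flat_section_descriptor(0x800)`, comes out as 0x80000DFE.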

0 MGun over 14 years ago in reply to Jeff L

Intellectual 895 points

Hi Jeff

Thank you very much for your code snippet!
This saved me a lot of time, and the loop with 50'000'000 iterations now executes within 580 ms (~600 MIPS) instead of 30 seconds!

unsigned int i;
unsigned int count = 50000000;
for(i=0;i<count;i++) {
;
}

I'll now continue to investigate the floating point bottleneck with the compiler options you posted before..

Thanks
Michael

0 Jeff L over 14 years ago in reply to MGun

TI__Expert 5960 points

MGun,

Glad it could help you. Please post back if you do find more bottlenecks.

I did find an example where another TI'er ran Whetstone in a CCSv3 project at a 600 MHz clock rate and got 111 MIPS with the same example code I posted. So this tells me that the D-cache and I-cache are set up. It also indicates that the math library is not built for the armv7 architecture.

0 MGun over 14 years ago in reply to Jeff L

Intellectual 895 points

Thanks to the D-Cache, the standalone Whetstone project now scores 20.7 MIPS [Release Build] instead of 3.2 MIPS.
I had a look at the disassembly, but did not find any of the instructions you mentioned (fsitod, fldd, fmuld).

It looks like CCSv4 projects "out of the box" do not use the full potential of the Cortex-A8 at all.
My Compiler Options look like this: (-mv7A8 is defined)

Here is my updated project with the I-Cache, D-Cache and MMU enabled: 0131.WhetstoneBenchmarkARM_v2(MMU).zip
Would you be so kind as to ask some CCSv4 experts what has to be adjusted to use the full potential of the 3530? I have the impression that I'm the only one using the 3530 with Code Composer v4 "bare metal".

I found a note in the ARM Optimizing C/C++ Compiler v4.7 User's Guide (SPNU151F) on Page 53 (VFP Support):

0 Jeff L over 14 years ago in reply to MGun

TI__Expert 5960 points

MGun,

Sorry for the slow response. I'm now trying to find someone to help you. I think you should be able to achieve 111.1 Whetstone MIPS. I don't think the CCSv5 C support libs are compiled for armv7 architecture. I do believe you can recompile them yourself with the proper CFLAGS to support armv7 and utilize the VFP for floating point. Then I think you can achieve 357 Whetstone MIPS.

0 MGun over 14 years ago in reply to Jeff L

Intellectual 895 points

Hi Jeff

Yes, we reached about 110 MIPS after recompiling the CCSv5 libraries for the armv7 architecture with VFP support.
We also had to enable the I-cache and configure the MMU to use the D-Cache to get 110 MIPS.

That's currently fine for us. However, I still don't know how the Android code reached 333 Whetstone MIPS.

Thank you!
Michael

0 Alexey Govorukhin over 14 years ago in reply to MGun

Prodigy 180 points

Hello All!

I tried to use the code example from the earlier posts in our bootloader.

As I understand it, this is the example from the ARM developer guide, but there they also switch ON the D-cache and I-cache.

If I do that with this example, the micro resets itself (or hangs).

The example from the ARM guide works fine until the bootloader finishes its job and jumps into the loaded code. Of course I switched off the caches and the MMU before that, but it seems I forgot to flush or invalidate something.

Can anyone show the right way to switch the caches and the MMU off?

Thank You all in advance,

Alexey
