OMAP L138 - ARM9 performance

Przemyslaw Baranski

Hi!

This is my first post on this forum. I want to share my experience with ARM9 execution time. I tested the following loop:

int i;
int max = 90000000;
for( i = 0; i < max; i++ );

with different configurations:
- ICache on/off,
- variables i and max declared as register (register int i; register int max = 90000000;) or not,
- program and variables placed in shared memory or in ARM memory.

I use Code Composer Studio 4.2.4. Compiler optimization was switched off. The core runs at 300 MHz.
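
The two variants of the test can be sketched in portable C as below. This is only an illustration under assumptions: the function names count_in_memory, count_in_register and time_loop are made up for this sketch, volatile is used to imitate the memory-resident counter even when optimization is on, and clock() stands in for whatever timer was actually used for the measurements.

```c
#include <time.h>

/* Counter kept in memory: volatile forces a load and a store on every
   iteration, similar to the unoptimized non-register build. */
int count_in_memory(int max)
{
    volatile int i;
    for (i = 0; i < max; i++)
        ;
    return i;
}

/* Counter requested in a register. Note that 'register' is only a hint;
   an optimizing compiler may remove this empty loop altogether. */
int count_in_register(int max)
{
    register int i;
    for (i = 0; i < max; i++)
        ;
    return i;
}

/* Hypothetical helper: time one loop variant with the C library clock().
   Returns the elapsed processor time in seconds. */
double time_loop(int (*loop)(int), int max)
{
    clock_t start = clock();
    (void)loop(max);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_loop(count_in_memory, 90000000) and time_loop(count_in_register, 90000000) and printing both values gives the two timings to compare. On a bare-metal target clock() may not be available, in which case a hardware timer would have to take its place.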

I have not yet tested the influence of the data cache. I'll post an update once I have more free time.

The results are in the attachment.

When the variables are NOT declared as registers, the loop compiles to the following:
$C$L1:
LDR           R0, $C$CON1
LDR           R12, [R0]
ADD           R12, R12, #0x1
STR           R12, [R0]
LDR           R12, $C$CON2
LDR           R0, $C$CON1
LDR           R12, [R12]
LDR           R0, [R0]
CMP           R12, R0
BGT           $C$L1

The variables i and max are reloaded from memory on every iteration, which hurts performance.

When the variables i and max are declared as registers, the loop compiles to:
$C$L1:
ADD           R12, R12, #0x1
CMP           R4, R12
BGT           $C$L1
which eliminates the memory accesses and gives a considerable boost.
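
To put numbers on the two listings, the per-iteration counts can be read straight off the assembly: the non-register loop has 10 instructions with 7 data accesses (6 LDR, 3 of which fetch variable addresses from the literal pool, plus 1 STR), while the register loop has 3 instructions and no data accesses. The sketch below just multiplies these out; the constant and function names are made up for illustration.

```c
#include <stdint.h>

/* Per-iteration counts read off the two listings above. */
enum {
    MEM_INSTR_PER_ITER = 10,       /* 6 LDR, 1 ADD, 1 STR, 1 CMP, 1 BGT */
    MEM_DATA_ACCESS_PER_ITER = 7,  /* 6 loads + 1 store */
    REG_INSTR_PER_ITER = 3,        /* ADD, CMP, BGT */
    REG_DATA_ACCESS_PER_ITER = 0
};

/* Total count over the whole loop, e.g. 90000000 iterations. */
uint64_t total_count(uint64_t iterations, unsigned per_iteration)
{
    return iterations * per_iteration;
}
```

For 90000000 iterations this gives 900000000 instructions and 630000000 data accesses for the non-register loop, against 270000000 instructions for the register loop. The instruction count alone differs by only about 3.3x, which suggests that the rest of the gap between the extreme cases comes from how long the instruction fetches and data accesses take.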

Conclusions:
1) The execution time for the extreme cases differs by a factor of 50!
2) Inspect the generated assembly code to pinpoint bottlenecks.

Best regards
Przemyslaw Baranski

over 11 years ago

nikunj.rudani over 11 years ago

I appreciate your work! It makes sense that a considerable boost is achieved when the ICache is on and both variables are declared as register. Thanks.

Przemyslaw Baranski over 11 years ago in reply to nikunj.rudani

Hi Nikunj!
Thanks for your interest.
1) For some cases I calculated a MIPS figure and also compared the results with an ARM7 microcontroller clocked at 55 MHz, with the program running from RAM. The old ARM7 (@55 MHz) seems to outperform the ARM9 (@300 MHz) when the latter is not using the ICache.
2) Regarding the ARM9 on the OMAP-L138, some people say that running a program from ARM memory should give better results than running it from shared memory. The results, however, show otherwise. I guess running from ARM memory might give better results when the SCR is being used by other peripherals and access to shared memory is therefore queued.
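
The MIPS figure mentioned in point 1 can be computed along the following lines. This is a sketch under assumptions: the function names are invented here, the instruction count is taken to be iterations times instructions per iteration, and the elapsed time is whatever was measured on the target (no measured values are reproduced here). Dividing by the clock frequency is one way to compare cores that run at different speeds, such as 55 MHz and 300 MHz.

```c
/* MIPS = instructions executed / (elapsed seconds * 1e6). */
double mips(double instructions, double seconds)
{
    return instructions / (seconds * 1e6);
}

/* Instructions per clock cycle: MIPS normalized by the clock in MHz,
   so that cores with different clock rates can be compared. */
double mips_per_mhz(double mips_value, double clock_mhz)
{
    return mips_value / clock_mhz;
}
```

For example, a purely hypothetical run of 270000000 instructions in 1 second would be 270 MIPS, or 0.9 instructions per cycle at 300 MHz.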

Best regards

Przemyslaw Baranski

nikunj.rudani over 11 years ago in reply to Przemyslaw Baranski

You are absolutely right: ARM memory will perform better than shared memory only when the SCR is being used by other peripherals, because accesses to shared memory are then queued.

If performance enhancement is your area of interest, then you should try the SIMD instruction set of the NEON architecture.
