OMAP L138 - ARM9 performance

Hi!

This is my first post on this forum. I wanted to share my experience with ARM9 execution time. I tested the following loop:

int i;
int max = 90000000;
for( i = 0; i < max; i++ );

with different configurations:
- ICache on/off,
- variables i and max declared as registers (register int i; register int max=90000000) or as ordinary variables,
- program and variables placed in shared memory or in ARM memory.

I use Code Composer Studio 4.2.4. Compiler optimization was switched off. The core runs at 300 MHz.

I have not yet tested the influence of the data cache. I'll post an update once I have more free time.

The results are in the attachment.

When the variables are NOT declared as registers, the loop compiles to the following:
$C$L1:
LDR           R0, $C$CON1
LDR           R12, [R0]
ADD           R12, R12, #0x1
STR           R12, [R0]
LDR           R12, $C$CON2
LDR           R0, $C$CON1
LDR           R12, [R12]
LDR           R0, [R0]
CMP           R12, R0
BGT           $C$L1

The variables i and max are loaded from memory on every iteration (and i is stored back), which hurts performance.

When the variables i and max are declared as registers, the loop compiles to:
$C$L1:
ADD           R12, R12, #0x1
CMP           R4, R12
BGT           $C$L1
which eliminates the memory accesses inside the loop and gives a considerable speed boost.
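
For anyone who wants to reproduce the comparison, here is a minimal sketch in C. It is not the code used for the measurements above. The function names are hypothetical, `volatile` is used as an assumption to force a memory access on every use of the variables (roughly what the unoptimized non-register build does), and timing relies on the standard `clock()` function rather than a hardware timer on the target.

```c
#include <time.h>

/* Counter and bound live in memory: volatile forces a load (and a store
   for i) on every access, similar to the non-register listing above. */
static int loop_in_memory(int max_init)
{
    volatile int i;
    volatile int max = max_init;
    for (i = 0; i < max; i++)
        ;
    return i;
}

/* Counter and bound are hinted to live in CPU registers. */
static int loop_in_registers(int max_init)
{
    register int i;
    register int max = max_init;
    for (i = 0; i < max; i++)
        ;
    return i;
}

/* Hypothetical timing helper: CPU seconds spent in one call to loop(max). */
static double seconds_for(int (*loop)(int), int max)
{
    clock_t start = clock();
    (void)loop(max);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `seconds_for(loop_in_memory, 90000000)` and `seconds_for(loop_in_registers, 90000000)` gives the two times to compare. As in the original test, this should be built with optimization off, since an optimizing compiler may remove the empty register loop entirely.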

Conclusions:
1) The execution time between the best and the worst case differs by a factor of 50!
2) Inspect the generated assembly code to pinpoint bottlenecks.
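
To compare configurations independently of the loop count, a measured time can be converted into average CPU cycles per iteration. The helper below is a small sketch with a hypothetical name. The 300 MHz clock and the 90000000 iterations come from the test above, while any time passed in is whatever was measured (the 1.5 s in the comment is a made-up value, not a result from the attachment).

```c
/* Average CPU cycles per loop iteration:
   cycles = seconds * clock_hz, divided by the number of iterations. */
static double cycles_per_iteration(double seconds, double clock_hz,
                                   double iterations)
{
    return seconds * clock_hz / iterations;
}

/* Example with a made-up time of 1.5 s:
   cycles_per_iteration(1.5, 300e6, 90e6) is 5 cycles per iteration. */
```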

Best regards
Przemyslaw Baranski

  • Appreciate your work! It makes sense that with the ICache on and both variables declared as registers, a considerable boost can be achieved. Thanks.

  • Hi Nikunj!
    Thanks for your interest.
    1) For some cases I calculated a MIPS figure and also compared the results with an ARM7 microcontroller clocked at 55 MHz, with the program running from RAM. The old ARM7 (@55 MHz) seems to outperform the ARM9 (@300 MHz) when the latter is not using the ICache.
    2) Regarding the ARM9 on the OMAP-L138, some people say that running a program from ARM memory should give better results than running it from shared memory. The results, however, show something else. I guess running from ARM memory might bring better results when the SCR is being used by other peripherals and access to shared memory is therefore queued.
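
The MIPS figure mentioned in point 1) can be estimated from the instruction count of the loop body. The sketch below uses a hypothetical helper name. The per-iteration instruction counts come from the two listings in the first post (10 instructions for the memory version, 3 for the register version), and the elapsed time is whatever was measured (the 1.5 s in the comment is a made-up value).

```c
/* Millions of instructions per second for a loop that executes
   instr_per_iter instructions on each of `iterations` passes. */
static double mips(double instr_per_iter, double iterations, double seconds)
{
    return instr_per_iter * iterations / (seconds * 1e6);
}

/* Example with a made-up time of 1.5 s for the 3-instruction register loop:
   mips(3.0, 90e6, 1.5) is 180 MIPS. */
```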


    Best regards

    Przemyslaw Baranski

  • You are absolutely right: ARM memory will perform better than shared memory only when the SCR is also being used by other peripherals, because accesses to shared memory are then queued.

    If performance is your main interest, you could also look at SIMD instruction sets such as NEON, but note that NEON is only available on ARMv7 Cortex-A cores, not on the ARM926EJ-S core of the OMAP-L138.