I have just fired up the ARM side of my OMAP-L137 processor after using the DSP side exclusively.
It seems about 10x slower than it should be, so I wrote a simple delay loop that runs entirely from internal memory (0xFFFF0000).
(Note: I have no external memory on my PC board.)
void hdelay(int32 count)
{
    /* volatile keeps the optimizer from removing the empty loop */
    volatile uint32 i;
    for (i = 0; i < count; i++)
        ;
}
I am calling it as such:
hdelay(100000000);
This iterates 100 million times, and I am timing the result. The core runs at 300MHz with no interrupts or DMA, and the code is built with 16-bit instructions and optimizations ON.
On the ARM side it takes 30 seconds for 100M loops. Since the inner loop is about 6 instructions, that comes out to 50ns per instruction (20MIPS).
On the 6747 side it takes 16 seconds for 100M loops (12 instructions including NOP stalls) = about 13ns per instruction (about 75MIPS).
On another product, the 6713 (300MHz) takes 8 seconds for 100M loops (21 cycles per loop) = 3.8ns per cycle (260MIPS).
I checked the clocks using the OBSCLK pin.
Am I missing something somewhere in my memory setup?
When I run the ARM test from SHARED memory I get similar results, although I would expect it to take longer from shared RAM.
Do I need to cache internal memory in the ARM?
Where can I find wait state and cycle count information on the various memories inside the chip?
Thanks,
-howy