OMAP-L138 Execution speed too slow

We are having trouble with the OMAP-L138 execution speed.
We are using the eXperimenters Kit with CCS4, and are programming the ARM9 side.

We are using a GEL file provided with the kit to set 300 MHz on PLL0 and 132 MHz on PLL1.
We added code in the GEL file to set OBSSEL to present SYSCLK6 on the OBSCLK pin with
a divide of 30 (OBSDIV), and we see a 10 MHz clock on OBSCLK (TP6).  This indicates that PLL0 is outputting
a 300 MHz clock.

To verify the execution speed, we have written a small loop that outputs a square wave on a GPIO pin.
We used the TI ARM926EJ-S Device Cycle Accurate Simulator to count the cycles in this loop, and
according to the simulator, the loop takes 220,034 cycles to execute.  When we run the code on the
experimenter board the loop takes 10.1 ms.  This implies a ~22 MHz clock.
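The arithmetic behind that estimate can be sketched as follows. This is only a worked check of the numbers quoted above; the function names are made up for illustration and are not part of any TI tool.

```c
#include <assert.h>

/* Implied clock in MHz: cycles divided by elapsed microseconds. */
double implied_clock_mhz(double cycles, double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    return cycles / (elapsed_ms * 1000.0);
}

/* Time in ms the same cycle count would take at a given clock in MHz. */
double expected_ms(double cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return cycles / (clock_mhz * 1000.0);
}
```

With the figures above, `implied_clock_mhz(220034, 10.1)` comes out a little under 22 MHz, while `expected_ms(220034, 300)` is roughly 0.73 ms, which is what the loop should take if every simulated cycle really were one 300 MHz clock period.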


We feel that we are setting up the PLLs correctly, but the CPU seems to be running at
~22 MHz instead of 300 MHz.

We have also tried to enable the Instruction and Data caches using:

    .align    4
$C$CON3:    .field      0000307Ch,32    ; SBO bits set, vectors at 0xFFFF0000, I & D cache enabled, MMU bit (bit 0) clear

    LDR r0, $C$CON3              ; disable MMU, enable caches, write buffer
    MCR p15, #0, r0, c1, c0, #0  ; write the value to the CP15 c1 control register

What are we missing?

Marc

  • Marc,

    The cycle-accurate simulator might not model the entire system interconnect and clock tree accurately. Details of the simulator can be found at: http://focus.ti.com/lit/ml/sprs397/sprs397.pdf

    You can try to run some internal algorithm benchmarks instead of reaching out of the core to GPIO.

    Thanks,
    Gaurav
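A minimal sketch of what such an internal benchmark could look like is below. It is an assumption about what was meant, not code from TI: the function name and the choice of a linear congruential step are arbitrary, and the timing itself (a hardware timer, or toggling the GPIO once before and once after the call) is left to the caller.

```c
#include <stdint.h>

/*
 * ALU-only workload: every iteration depends on the previous one, and the
 * result is returned so the compiler cannot discard the loop. Nothing in
 * here touches peripheral registers.
 */
uint32_t alu_benchmark(uint32_t iterations)
{
    uint32_t x = 1u;
    uint32_t i;

    for (i = 0; i < iterations; i++) {
        x = x * 1664525u + 1013904223u;  /* one LCG step per iteration */
    }
    return x;
}
```

Timing a large iteration count and dividing by the elapsed time gives a figure that does not depend on GPIO writes. It still depends on where the code and stack live, so instruction fetches from uncached memory would show up in the result.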

  • Marc Bunyard said:

    We are using a GEL file provided with the kit to set 300 MHz on PLL0 and 132 MHz on PLL1.
    We added code in the GEL file to set OBSSEL to present SYSCLK6 on the OBSCLK pin with
    a divide of 30 (OBSDIV), and we see a 10 MHz clock on OBSCLK (TP6).  This indicates that PLL0 is outputting
    a 300 MHz clock.

    Yes!  Nicely done.  This is the most accurate way to verify your PLL setup.

    Marc Bunyard said:

    To verify the execution speed, we have written a small loop that outputs a square wave on a GPIO pin.
    We used the TI ARM926EJ-S Device Cycle Accurate Simulator to count the cycles in this loop, and
    according to the simulator, the loop takes 220,034 cycles to execute.  When we run the code on the
    experimenter board the loop takes 10.1 ms.  This implies a ~22 MHz clock.


    We feel that we are setting up the PLLs correctly, but the CPU seems to be running at
    ~22 MHz instead of 300 MHz.

    Abandon this line of thinking -- it's not measuring CPU speed but rather the speed of the configuration bus!  Those writes are going to peripheral registers, which reside in what we generally call "configuration space".  Writes are buffered, but if you're doing a ton of them in a row then eventually the buffer fills up.  Once that happens the CPU stalls until a write completes and makes room for another one in the write buffer.  So in your case, where you're doing tons of writes, you are spending most of your time with a stalled CPU core because there's no room for the data in the write buffer.  In other words, the CPU is not running at the wrong speed; the slow loop is a consequence of doing lots of writes to configuration memory.

  • Thank you for the reply.  I should clarify the description of the loop that is outputting to the GPIO.  This loop outputs to the GPIO twice in 10.1 ms (or every 220,034 cycles as counted by the Simulator). 

        int xxx;

        for (;;)
        {
            for (xxx = 0; xxx < 10000; xxx++)
                ;
            setStepEn();                    // Set line high
            for (xxx = 0; xxx < 10000; xxx++)
                ;
            clrStepEn();                    // Set line low
        }

    It doesn't seem like the writes to the GPIO are holding off the CPU.  I will add that we are executing out of shared RAM, and our stack is in shared RAM.  Do we need the cache running?  Is there a problem with having our data and instructions running out of the same memory space?

  • Thanks for clarifying.  Executing out of shared RAM will also be extremely slow without proper cache configuration.  I believe that requires configuring the MMU too, since the memory policy (cacheable, bufferable) is specified per MMU page.
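As a rough illustration of the per-page memory policy, the sketch below builds a flat (virtual = physical) first-level table of 1 MB sections for an ARMv5 core such as the ARM926EJ-S, marking one section cacheable and bufferable and leaving the rest uncached. The function names are hypothetical, and the shared RAM base of 0x80000000 used in the usage note is an assumption that should be checked against the OMAP-L138 datasheet. The control register bit positions are included to show that the value 0x307C from the original post has the MMU bit clear.

```c
#include <assert.h>
#include <stdint.h>

/* CP15 c1 control register bits on ARM926EJ-S. */
#define CR_M (1u << 0)    /* MMU enable */
#define CR_C (1u << 2)    /* data cache enable */
#define CR_I (1u << 12)   /* instruction cache enable */
#define CR_V (1u << 13)   /* exception vectors at 0xFFFF0000 */

/* First-level section descriptor fields (ARMv5). */
#define SECT_TYPE      0x2u                      /* bits [1:0] = 0b10: 1 MB section */
#define SECT_B         (1u << 2)                 /* bufferable */
#define SECT_C         (1u << 3)                 /* cacheable */
#define SECT_BIT4      (1u << 4)                 /* written as 1 on ARMv5 */
#define SECT_DOMAIN(d) ((uint32_t)(d) << 5)      /* domain number, bits [8:5] */
#define SECT_AP_RW     (0x3u << 10)              /* read/write access */

#define NUM_SECTIONS 4096u                       /* 4 GB / 1 MB */

/* Descriptor for the 1 MB section starting at 'base' (must be 1 MB aligned). */
uint32_t section_desc(uint32_t base, int cached)
{
    uint32_t d;

    assert((base & 0x000FFFFFu) == 0u);
    d = base | SECT_AP_RW | SECT_DOMAIN(0) | SECT_BIT4 | SECT_TYPE;
    if (cached) {
        d |= SECT_C | SECT_B;
    }
    return d;
}

/* Identity-map all 4 GB; only the section at 'cached_base' is cacheable. */
void build_flat_table(uint32_t table[NUM_SECTIONS], uint32_t cached_base)
{
    uint32_t i;

    for (i = 0; i < NUM_SECTIONS; i++) {
        uint32_t base = i << 20;
        table[i] = section_desc(base, base == cached_base);
    }
}
```

On the target, something like `build_flat_table(table, 0x80000000u)` would be followed by loading the 16 KB-aligned table address into the translation table base register (CP15 c2), setting up domain access (CP15 c3), invalidating the caches, and only then writing the control register with `CR_M` set together with `CR_C` and `CR_I`. Peripheral sections are deliberately left uncached here. As far as I understand the ARM926EJ-S documentation, the data cache has no effect while the MMU is disabled, which would fit the behaviour described in this thread.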

  • Thank you Brad.  We have decided to use Windows CE.  CE sets up the caches and MMU for the ARM, and the execution speed is where we expect it. Now to find a BSP for the Evaluation Kit.

  • Marc, here is the download for the BSP for the eval board in case you didn't find it yet:

    http://focus.ti.com/docs/toolsw/folders/print/wincesdk-am1xomapl1x.html