too slow L3 bus in OMAP3530

Hi,

 

I work on the DevKit8000 development board under U-Boot (without Linux). ARM_FCLK is set to 500 MHz, L3_CLK = 133 MHz and L4_CLK = 66 MHz. I have a problem with program execution time: execution is much too slow when the program reads data from RAM (internal or external).

 

This code executes fast enough:

0x81000060:   E2899001 ADD             R9, R9, #1

0x81000064:   E1540009 CMP             R4, R9

0x81000068:   CAFFFFFC BGT             0x81000060

 

Each iteration of the loop takes 6 ns at ARM_FCLK = 500 MHz, so everything is fine here.

 

But this code:

0x81000070:   E593E000 LDR             R14, [R3]

0x81000074:   E28EC001 ADD             R12, R14, #1

0x81000078:   E583C000 STR             R12, [R3]

0x8100007C:   E5931000 LDR             R1, [R3]

0x81000080:   E1540001 CMP             R4, R1

0x81000084:   CAFFFFF9 BGT             0x81000070

 

Each iteration of this loop takes 650 ns. If we assume that the ADD, CMP and BGT instructions each execute in one ARM_FCLK cycle (2 ns), then each read from or write to RAM takes about 215 ns! Why does it take so long when L3_CLK is 133 MHz? I think I have turned off the ICLK of all peripherals (except UART3), and IVA2.2 is off. The execution time doesn't depend on whether the program runs from internal or external RAM (the data is placed in the same RAM as the program).

All interrupts are turned off. The L1 and L2 caches are turned on (I think, judging by PM_PWSTCTRL_MPU and PM_PWSTST_MPU). I load the program via RS232 under U-Boot and debug it via an XDS100v2 (the disassembly listings are copied from CCS during debugging).
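
The 215 ns figure can be reproduced with a short calculation. This is only a sketch of the arithmetic in the post: the one-cycle cost for ADD, CMP and BGT is the poster's assumption, and the constant and function names are made up for illustration.

```python
# Figures quoted in the post above.
ARM_FCLK_HZ = 500e6      # ARM core clock
L3_CLK_HZ = 133e6        # L3 interconnect clock
LOOP_NS = 650.0          # measured time per loop iteration
ALU_INSTRUCTIONS = 3     # ADD, CMP, BGT (assumed to take one core cycle each)
MEMORY_ACCESSES = 3      # LDR, STR, LDR


def per_access_ns(loop_ns, alu_count, mem_count, core_hz):
    """Time left per memory access once the ALU cycles are subtracted."""
    core_cycle_ns = 1e9 / core_hz
    return (loop_ns - alu_count * core_cycle_ns) / mem_count


access_ns = per_access_ns(LOOP_NS, ALU_INSTRUCTIONS, MEMORY_ACCESSES, ARM_FCLK_HZ)
l3_cycles = access_ns * L3_CLK_HZ / 1e9

print(f"{access_ns:.1f} ns per access, about {l3_cycles:.1f} L3 clock cycles")
```

Under these assumptions each access costs a little under 29 L3 clock cycles, which is why the result looks more like a latency problem than a bus bandwidth limit.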

 

 

Reading from peripheral registers is more or less ok.

Test code:

0x810000B0:   E592148C LDR             R1, [R2, #1164]

0x810000B4:   E2833002 ADD             R3, R3, #2

0x810000B8:   E1540003 CMP             R4, R3

0x810000BC:   E58D1004 STR             R1, [R13, #4]

0x810000C0:   E592C48C LDR             R12, [R2, #1164]

0x810000C4:   E58DC004 STR             R12, [R13, #4]

0x810000C8:   1AFFFFF8 BNE             0x810000B0

 

Each iteration of this loop takes 480 ns. If we assume that ADD, CMP and BNE each execute in one ARM_FCLK cycle (2 ns) and that each of the two STR instructions to [R13, #4] (writes to RAM via the L3 bus) takes 215 ns, then each LDR via L4 takes ~24 ns (~2 cycles of the L4 bus clock). Another test that contained only L4 accesses showed a read time of 149 ns from a peripheral register, which is not so good.
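
The same kind of bookkeeping can be applied to the peripheral loop. This sketch assumes, as the post does, one core cycle for ADD, CMP and BNE and 215 ns for each stack store; the names are illustrative only.

```python
ARM_FCLK_HZ = 500e6       # ARM core clock
L4_CLK_HZ = 66e6          # L4 interconnect clock
LOOP_NS = 480.0           # measured time per loop iteration
ALU_INSTRUCTIONS = 3      # ADD, CMP, BNE (assumed one core cycle each)
STACK_STORES = 2          # the two STR instructions to [R13, #4]
STACK_STORE_NS = 215.0    # RAM access estimate taken from the previous loop
L4_LOADS = 2              # the two LDR instructions from [R2, #1164]


def l4_load_ns(loop_ns):
    """Time per L4 read after subtracting ALU cycles and stack stores."""
    core_cycle_ns = 1e9 / ARM_FCLK_HZ
    remaining = loop_ns - ALU_INSTRUCTIONS * core_cycle_ns
    remaining -= STACK_STORES * STACK_STORE_NS
    return remaining / L4_LOADS


load_ns = l4_load_ns(LOOP_NS)
l4_cycles = load_ns * L4_CLK_HZ / 1e9

print(f"{load_ns:.1f} ns per L4 read, about {l4_cycles:.1f} L4 clock cycles")
```

With these inputs the estimate comes out at about 22 ns per L4 read, in the same range as the ~24 ns quoted above. The result is very sensitive to the 215 ns assumed for the stack stores, so it should be treated as a rough bound rather than a measurement.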

 

What can cause such slow operations on the L3 bus?

 

 

Best regards

 

  • Hi again,

    Maybe I asked my question in the wrong way. In other words: what are the access latency and the read/write times from the ARM and the DSP to the L3 bus address space? I can't find this information.

     

    Best regards

  • Hi Tom,

    The short answer is that this depends on the peripheral you are addressing. L3 and L4 are relatively dumb OCP buses, providing access from an initiator (DSP/ARM/DMA/GFX/etc.) to a target (UART/SPI/I2C/TIMER/MEMORY/etc.). Some targets respond faster than others, and in this case (with the data caches not active) the L3 can't do anything other than wait for the transaction to finish, thereby stalling the ARM core (as you have discovered).

    For writes, though, you have the possibility of doing what are called posted writes, which allow the ARM core to continue (even though the transaction isn't finished) until it is eventually blocked by another transaction that needs the first one to finish before continuing...

    I therefore don't think the 215 ns delay you see for external memory access originates from the L3 as such, but rather from the external DDR RAM access, which might be configured for 64-byte burst access; that takes much longer than fetching just the 4 bytes you expect. Without the data cache enabled, you are simply throwing away 15/16 of the data read...

    Can you try the same experiment, but using the internal OCM memory for data as well as for code instead of external data memory? I expect this will give you a speed increase, but to be honest I'm not 100% sure. Normally the biggest increase is achieved when you enable the data cache and set the memory mapping to cacheable and bufferable...
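
The 15/16 figure follows directly from the burst and word sizes. A minimal sketch, assuming the 64-byte burst mentioned above (the actual SDRC burst configuration on the board is not known from this thread):

```python
from fractions import Fraction

BURST_BYTES = 64   # assumed burst length, as suggested in the reply above
WORD_BYTES = 4     # one 32-bit LDR


def discarded_fraction(burst_bytes, used_bytes):
    """Share of a burst that is fetched but never used by the core."""
    return Fraction(burst_bytes - used_bytes, burst_bytes)


wasted = discarded_fraction(BURST_BYTES, WORD_BYTES)
print(f"{wasted} of each burst is discarded")
```

With a data cache enabled, the remaining words of the burst would instead fill a cache line and serve later accesses to neighbouring addresses.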
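
To make "cacheable and bufferable" concrete: on ARMv7-A cores such as the Cortex-A8, these are the C and B bits of a page-table entry, and they only take effect once the MMU and data cache are enabled in the system control register. The sketch below just computes the bit patterns for a 1 MB section entry in the short-descriptor format. It does not touch hardware, the function and constant names are illustrative, and full access permissions and domain 0 are assumptions that a real setup would have to review.

```python
# ARMv7-A short-descriptor first-level "section" entry (1 MB mapping).
SECTION_TYPE = 0b10            # bits [1:0] = 0b10 marks a section
B_BIT = 1 << 2                 # bufferable
C_BIT = 1 << 3                 # cacheable
AP_FULL_ACCESS = 0b11 << 10    # AP[1:0] = 0b11, read/write (assumed policy)
SECTION_SIZE = 1 << 20         # sections are 1 MB aligned

# System control register (SCTLR) enable bits.
SCTLR_M = 1 << 0               # MMU enable
SCTLR_C = 1 << 2               # data cache enable
SCTLR_I = 1 << 12              # instruction cache enable


def section_descriptor(base, cacheable, bufferable, domain=0):
    """Build a first-level section descriptor for a 1 MB aligned base."""
    if base % SECTION_SIZE:
        raise ValueError("section base must be 1 MB aligned")
    desc = base | AP_FULL_ACCESS | (domain << 5) | SECTION_TYPE
    if cacheable:
        desc |= C_BIT
    if bufferable:
        desc |= B_BIT
    return desc


# The 1 MB section holding the 0x81000000 addresses seen in the disassembly.
cached = section_descriptor(0x81000000, cacheable=True, bufferable=True)
uncached = section_descriptor(0x81000000, cacheable=False, bufferable=False)

print(hex(cached))    # 0x81000c0e
print(hex(uncached))  # 0x81000c02
print(hex(SCTLR_M | SCTLR_C | SCTLR_I))  # bits to set in SCTLR: 0x1005
```

Peripheral regions would normally keep C and B cleared so that register reads and writes are not cached.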

    I hope this helps you move forward.
      Søren

  • Hi,

    Thanks for the answer. So maybe the L3 bus is fast, but the latency is long and that slows things down. I didn't test access to the internal memory, because my application has to use external RAM for the large amount of data it handles.

    You say that I'll get the biggest speed increase when I turn on the cacheable and bufferable features. Turning on the ARM or DSP cache is easy, but I don't know how to make them buffer data. I wanted to create an array in the ARM cache and then send it via DMA to SDRAM, but the ARM's cache is not memory-mapped, so I can't put data there myself. So I'll have to use the DSP core and its cache to do that. Is there any way to make the ARM use its cache as a buffer?

    Best regards

    Tom