This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

OMAP-L138: Access latency to registers in the peripheral area

We wanted to measure the performance of our code by querying an OMAP internal timer.
The DSP CPU runs at 300 MHz, the timer is clocked at 24 MHz.

Executing the following code snippet, in which the timer is queried twice,
and the read values are stored in the onchip shared memory

    *( ( unsigned int * )0x8001c000 ) = TIMER0_ADDR[ TIMER_CNT34 ];
    *( ( unsigned int * )0x8001c004 ) = TIMER0_ADDR[ TIMER_CNT34 ];

yields, for example, the following timer values in shared memory:

 0x8001c000:  0x644AEA04  0x644AEA0E

Between the second and the first value there is a difference
of 10 ticks, i.e. a latency of 10 × (1/24 MHz) ≈ 417 nsec
in the execution of the two lines of code. We would have expected
no difference, or a difference of at most 1. It looks as if the CPU is stalled
by something, perhaps by the data transfer across the system interconnect.
The associated assembly code is only four lines and offers no
explanation for the effect. Interrupts were disabled during the
test.

Do you have an explanation for this observed latency?

Or have we perhaps misconfigured our system in a way that leads to such an effect?

(In another thread, I read something like: "... The PRU and DSP config port
are at a similar 'distance' from the SYSCFG module ... Reads will be around
30-40 DSP clock cycles."

Could this be related to what I have observed?)

  • There is a latency for the DSP core to access peripheral registers of other components of the SoC, including the timer modules. Because of this, I recommend using the DSP's internal timestamp registers, which increment at the DSP core frequency and can therefore give you nearly cycle-accurate measurements.

    See this other thread for more info.

    Regards, Daniel
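    As a sketch of the suggestion above: on the C674x, the free-running time-stamp counter (TSCL/TSCH) increments at the CPU clock and is read without crossing the interconnect. The TI-compiler path assumes the `<c6x.h>` declarations; the host-side stand-in is an assumption that exists only so the sketch compiles and runs anywhere.

    ```c
    /* Benchmarking with the DSP-internal time-stamp counter instead of a
       peripheral timer. On a TI C6000 compiler, TSCL is declared in <c6x.h>
       and any write to it starts the counter; the host stand-in below is a
       stated assumption so the sketch is self-contained. */
    #include <stdio.h>
    #include <stdint.h>

    #ifdef _TMS320C6X
    #include <c6x.h>                     /* declares the TSCL control register */
    #define READ_TSC() (TSCL)
    #else
    static uint32_t fake_tsc;            /* host stand-in for TSCL */
    #define READ_TSC() (fake_tsc += 7)   /* pretend 7 cycles pass per read */
    #endif

    int main(void)
    {
    #ifdef _TMS320C6X
        TSCL = 0;                        /* writing any value starts the TSC */
    #endif
        uint32_t t0 = READ_TSC();
        /* ... code under test ... */
        uint32_t t1 = READ_TSC();
        printf("%u cycles\n", (unsigned)(t1 - t0));
        return 0;
    }
    ```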

  • Daniel,

    I am currently working on the implementation of a driver for data transfer via the uPP interface.
    In this context, we have wondered why we do not achieve the expected transfer behaviour.
    It is difficult to judge where a delay comes from: whether the uPP peripheral initiates
    the transfer only after it has filled one of its transfer buffers, whether the access to the
    uPP registers is slow, or ...

    For a write access to UPQD0, I have measured 40 nsec, sometimes 20 nsec
    (20 nsec corresponds to 6 clock cycles × 3.3 nsec, which is plausible given the NOPs).

    For a read access to the same register, I have measured 100 nsec.

    In general, read accesses to various registers in the peripheral area have always
    shown values in the range of 100-160 nsec. The access to the timer register
    mentioned in my first question was the worst, at 360 nsec.

     

    We want to understand where this latency comes from. We suspect that it is caused by
    something in the system interconnect (SPRUH77, Fig. 4-1). Some questions about it:
      - Is the latency actually caused by the SCRs and bridges?
      - What is the buffer size within SCR / bridges for write operations?
      - Is there any buffer (caching) for (burst) read operations?
      - Is data available on the minimum latency for read/write operations crossing SCRn / BRn?
      - Can the latency be higher in case of concurrent accesses by different modules?

    We want to know whether this latency can be reduced by any means (caching, non-volatile pointer types, block reads, ...).

    It is also conceivable to let a PRU execute the code controlling the transfer via the uPP,
    and therefore we wonder whether the PRU experiences the same latency as the DSP core.

    Best regards

    Joerg

  • Joerg,

    Can you provide a little background on what you're trying to do in the system with the uPP? Generally, the uPP DMA registers should be set up in the background of ongoing data transactions, thus minimizing the effect of any latency for CPU read/write accesses to those registers.

    The stall measurements you mention for reads and writes to register space in the last post are in the ballpark, though I don't have exact reference numbers available. The timer numbers seem high; I wonder if that is an effect of how your benchmark is written. Specifically, you mention:

    > and the read values are stored in the onchip shared memory

    If the L2 cache is enabled for the shared memory range, then the write to that address may cause an L2 line allocation, adding latency to the benchmark. You may try assigning the timer values directly to variables instead. The compiler will (probably) issue two reads from the timer and assign them to internal CPU registers (double-check the assembly to be sure). That may bring the latency back down to the expected range.
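    A sketch of that revised benchmark: both timer reads land in locals (CPU registers) before anything is stored, so an L2 write allocation cannot fall between the two reads. The `TIMER0_ADDR`/`TIMER_CNT34` names follow the first post; the base address and offset in the hardware branch, and the host stand-in array, are assumptions so the snippet compiles and runs anywhere.

    ```c
    /* Read the timer into locals first, store to shared memory afterwards,
       so any cache-allocation cost falls outside the measured window.
       Hardware address/offset are assumptions to check against SPRUH77. */
    #include <stdio.h>
    #include <stdint.h>

    #define TIMER_CNT34   (0x14 / 4)   /* assumed TIM34 word offset */
    #ifdef _TMS320C6X
    #define TIMER0_ADDR   ((volatile uint32_t *)0x01C20000)  /* assumed base */
    #else
    static uint32_t fake_timer[16] = { [TIMER_CNT34] = 0x644AEA04u };
    #define TIMER0_ADDR   fake_timer   /* host stand-in for the timer */
    #endif

    int main(void)
    {
        /* Back-to-back reads into locals, no store in between. */
        uint32_t t0 = TIMER0_ADDR[TIMER_CNT34];
        uint32_t t1 = TIMER0_ADDR[TIMER_CNT34];

        /* Store / report only after both reads are done. */
        printf("t0=0x%08X t1=0x%08X delta=%u\n",
               (unsigned)t0, (unsigned)t1, (unsigned)(t1 - t0));
        return 0;
    }
    ```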

    Along similar lines, how are you benchmarking the stalls for access to UPP registers?

    In any case, back to your follow-on questions:

    First, you may reference the following overview info:

    http://processors.wiki.ti.com/index.php/OMAP-L1x/C674x/AM1x_SOC_Architecture_and_Throughput_Overview

    http://processors.wiki.ti.com/images/d/d9/Integra_%28OMAP-L1x%29Interconnectivity_v1.zip

    Next, some background: writes are "buffered", aka "posted", aka "fire-and-forget". The perceived stall is thus minimized, since the CPU is released while the data traverses the system. For reads, in contrast, the CPU is stalled while the read command traverses the system and the return data flows back. That is why read stall time is significantly higher than write stall time.
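    One standard consequence of posted writes, sketched below: when code must know that a peripheral write has actually landed (e.g. before timing a subsequent action), it can read the same register back, because the read stalls until the posted write has drained through the bridges. The register here is a plain-variable stand-in, an assumption so the pattern compiles anywhere; on hardware it would be a volatile pointer into register space.

    ```c
    /* Read-back after a posted write as a synchronization point. */
    #include <stdio.h>
    #include <stdint.h>

    static uint32_t periph_reg;                    /* stand-in register */
    #define REG (*(volatile uint32_t *)&periph_reg)

    int main(void)
    {
        REG = 0xA5u;   /* posted write: CPU is released immediately   */
        (void)REG;     /* read-back: stalls until the write has landed */
        printf("reg = 0x%02X\n", (unsigned)REG);
        return 0;
    }
    ```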

      - Is the latency actually caused by the SCRs and bridges?

    Yes. The DSP also has internal latency, so the total is the sum of DSP latency, SCR latency, bridge latency, and peripheral response time.

      - What is the buffer size within SCR / bridges for write operations?

    Each SCR has effectively a 1 word buffer. Each bridge can store between 4 and 8 words depending on the specific bridge instance.

      - Is there any buffer (caching) for (burst) read operations?

    Cache accesses can burst. However, non-cacheable accesses never burst, and the uPP registers are strictly non-cacheable.

      - Is data available on the minimum latency for read/write operations crossing SCRn / BRn?

    No.

      - Can the latency be higher in case of concurrent accesses by different modules?

    Yes, there can be contention if you're using the same "link" of the interconnect. However, the main SCRs are crossbars and can support truly concurrent, non-blocking accesses between independent masters and slaves.

    Regards
    Kyle

     

  • Kyle,

    In our system, an FPGA is connected to the OMAP via uPP. The FPGA itself has data areas that often need
    to be written and read burst-wise via the uPP interface. For these data areas, we expect higher transfer rates
    with the uPP than with other interfaces; this is our motivation. While these data areas
    are accessed, we have some kind of ongoing data transaction via the uPP for a certain period.
    The FPGA also provides single 4-byte registers that need to be accessed from time to time.
    These single register accesses are singular transactions on the uPP interface, where latency counts:
    it matters how fast a uPP register can be accessed and how fast a uPP transfer can be initiated.
    That's the background.

    My benchmark looks as follows: a first version of the software runs on a test board, on which the uPP is used
    in a loopback application; what is sent out via one channel is received via the other. By triggering the data transfer
    repeatedly, the transfer rate can be monitored, e.g. by measuring the rate of the enable signal in the transmit path.
    A modification of the code, e.g. adding a NOP or an additional access to a uPP register, immediately
    affects the measured transfer rate. Often I see a direct relation to the executed assembly code,
    but sometimes the change in the transfer rate is not reflected in the assembly code, as is the case for the accesses to uPP registers.

    Do you think that the way I have measured the performance leads to misleading values?

    Best regards

    Joerg

     

  • After reading the recommended documentation, I come to the conclusion
    that the latency is determined by the design of the system interconnect.
    Apparently, the only way a user can influence the latency of accesses to registers
    in the peripheral area is by changing the master priority values.
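
    To illustrate, a sketch of adjusting one master's priority field. The assumed layout (3-bit PRI fields within a SYSCFG MSTPRIn register, 0 = highest priority) and the example bit position should be checked against the SYSCFG chapter of SPRUH77; only the read-modify-write pattern is the point here, so the register is a plain variable rather than a memory-mapped address.

    ```c
    /* Rewriting a 3-bit priority field inside a MSTPRIn-style register.
       Field width/position are stated assumptions; verify against SPRUH77. */
    #include <stdio.h>
    #include <stdint.h>

    static uint32_t set_pri_field(uint32_t reg, unsigned shift, unsigned pri)
    {
        reg &= ~(0x7u << shift);        /* clear the 3-bit field  */
        reg |= (pri & 0x7u) << shift;   /* write the new priority */
        return reg;
    }

    int main(void)
    {
        uint32_t mstpri = 0x44444444u;  /* example starting value */
        /* e.g. give the field at bits 26:24 the highest priority (0) */
        mstpri = set_pri_field(mstpri, 24, 0);
        printf("MSTPRI = 0x%08X\n", mstpri);
        return 0;
    }
    ```

    On real hardware, note that writes to the SYSCFG registers may additionally require the KICK0R/KICK1R unlock sequence, depending on silicon revision.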