[FAQ] PRU Read & Write Latencies

Nick Saulnier

I am writing a PRU application for a PRU-ICSS device (AM261x, AM263Px, AM263x, AM335x, AM437x, AM57x), PRUSS device (AM62x), or a PRU_ICSSG device (AM24x, AM64x, AM65x). I want to calculate how many clock cycles a read or a write will take. How do I do it?

----------------------------------------------------------------------------------------------------------------------------------------------

This FAQ updates the previous FAQ, [FAQ] PRU: How do I calculate read and write latencies?. It can be thought of as a "first draft" of an update to the PRU Read Latencies app note. Once the app note is updated, this FAQ will be revised to state that the app note has the most up-to-date information.

Arbitration delay can also affect read or write latency. For more information, refer to [FAQ] PRU Arbitration Delay.

This FAQ is a work-in-progress! If you are reading this while it still has the work-in-progress label, feel free to create a new e2e thread to chat with us about the latest updates.

4 months ago

PRU Read & Write Instructions

Processor cores work by executing instructions. Even if the PRU firmware is programmed in C, the C compiler converts the C code into instructions that the PRU cores can run. PRU firmware can also be written in assembly to directly control every single instruction that runs on the PRU cores. If the application cares about whether a read executes in 6 PRU clock cycles or 7 PRU clock cycles, the time-sensitive code should probably be written in assembly.

This document will use the term "latency" to describe the time it takes for an instruction to execute.

The full PRU instruction set is documented in the PRU Assembly Instruction User Guide. That document lists three kinds of read & write instructions:

LBBO (Load Byte Burst) & SBBO (Store Byte Burst)

LBBO and SBBO are the default read (i.e., load) and write (i.e., store) PRU instructions. This FAQ will focus on calculating read and write latencies with LBBO and SBBO.

LBCO (Load Byte Burst with Constant Table Offset) & SBCO (Store Byte Burst with Constant Table Offset)

LBCO and SBCO work similarly to LBBO and SBBO. However, instead of using a register to specify the read/write memory address, LBCO and SBCO use an address from the PRU Constant Table. The latency for LBCO and SBCO to execute is the same as the latency for LBBO and SBBO.

While the latency to execute just the read or write is the same, LBCO/SBCO take fewer clock cycles overall. Since the read/write address for LBBO/SBBO is stored in a register, LBBO/SBBO require additional assembly instructions to load the address into the register. Since LBCO and SBCO use a constant table entry for the read/write address instead of a register value, LBCO and SBCO do not require the additional "address load" assembly instructions.
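The difference can be put into numbers with a small host-side calculation. This is only a sketch: the function names are made up for illustration, and the default of two single-cycle instructions to build a 32-bit address in a register is an assumption, not something this FAQ specifies. It uses the 2 + N rule for a word-aligned read from PRU Data RAM and ignores arbitration delay.

```python
# Rough cycle comparison of LBBO vs. LBCO for a word-aligned read from PRU Data RAM.
# Assumptions (hypothetical, for illustration): the function names, and the default
# of two single-cycle instructions to place a 32-bit address in a register.

def read_clocks(n_words):
    """Clocks for the read itself: 2 + N for N 32-bit words (no arbitration delay)."""
    return 2 + n_words

def lbbo_total_clocks(n_words, address_setup_instructions=2):
    """LBBO needs the address in a register first; each setup instruction costs 1 clock."""
    return address_setup_instructions + read_clocks(n_words)

def lbco_total_clocks(n_words):
    """LBCO takes its base address from the constant table, so there is no setup cost."""
    return read_clocks(n_words)

print(lbbo_total_clocks(4))  # 8
print(lbco_total_clocks(4))  # 6
```

If the same address register is reused for many LBBO/SBBO accesses, the setup cost is paid only once, so the gap shrinks accordingly.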

Broadside commands: XIN (Register Transfer In), XOUT (Register Transfer Out), XCHG (Register Exchange)

XIN & XOUT can be used to read or write up to 31 registers to or from a broadside interface. Since XIN & XOUT execute in a single PRU clock cycle, they are the fastest way to move data. However, XIN and XOUT can only communicate with modules that are attached to that broadside interface.

Different PRU devices have different broadside interface attachments. For example, AM335x can use XIN & XOUT to move registers between the PRU core and scratchpad registers, or directly from one PRU core to the other. AM64x can use XIN & XOUT to move information to and from scratchpad registers, but NOT directly to another PRU_ICSSG core. However, AM64x can ALSO use XIN & XOUT to access its own dedicated broadside (BS) RAM, or to read & write multiple registers to a memory space outside of the PRU_ICSSG through the XFR2VBUSP interface. For more information, refer to the Technical Reference Manual (TRM) for the desired processor.

Using XFR2VBUSP to conduct reads & writes is discussed in a later section.

How do I tell if a read or write is deterministic?

PRU cores are completely deterministic. That means we can know exactly how long every PRU instruction will take to execute... with some exceptions, which are described below.

Non-read & write instructions

PRU instructions that are not reads or writes are completely deterministic. These instructions will always take exactly one PRU clock cycle to execute.

Read & write instructions to an address within the PRU Subsystem

Reads and writes to an address within the PRU subsystem are completely deterministic. However, there is some extra math involved:

There are specific rules for how long a read or a write instruction will take. We will cover those rules in depth below.

But what if a PRU core is trying to access a peripheral within the PRU subsystem, and another core is already reading or writing to that peripheral? Or what if two cores try to access the same peripheral on the exact same clock cycle? Only one of the cores can access the peripheral at a time, so the PRU subsystem arbitrates between the two cores to decide which core gets to go first. Once one core is reading or writing to the peripheral, then the second core will stall until the peripheral becomes available. The stall time is called "arbitration delay".

When designing a PRU system with sensitive timing, the designer must keep in mind the lowest possible delay (i.e., "time to do the read/write"), and the highest possible delay (i.e., "time to do the read/write" + "arbitration delay").
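As a minimal sketch of that bookkeeping, the helper below returns the best-case and worst-case latency for one access. The function name and the example arbitration value are placeholders chosen for illustration; the real worst-case arbitration delay depends on the device and on which other cores share the endpoint (see the arbitration delay FAQ).

```python
# Best-case / worst-case latency for a single read or write.
# latency_bounds() and the example numbers are hypothetical, for illustration only.

def latency_bounds(access_clocks, max_arbitration_clocks):
    """Return (lowest, highest) delay in PRU clocks for one access."""
    lowest = access_clocks                             # time to do the read/write
    highest = access_clocks + max_arbitration_clocks   # plus the arbitration delay
    return lowest, highest

# Example: a 6-clock read, with a placeholder worst-case arbitration delay of 4 clocks.
best, worst = latency_bounds(access_clocks=6, max_arbitration_clocks=4)
print(best, worst)   # 6 10
print(worst - best)  # 4 (the jitter the timing budget has to absorb)
```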

Read & write instructions to a system address outside the PRU Subsystem

Reads and writes to addresses outside of the PRU subsystem are NOT deterministic. There may be choke points, arbitration delay, impacts from other cores, etc. If you run tests to approximate the latency of external reads and writes in your system, we suggest loading the processor as similarly as possible to your actual use case. For example, a test where the PRU is the only core running on the entire device will not give a good representation of system behavior when a Linux core is driving a display, communicating over Ethernet, etc.

PRU-ICSS cores and PRU-SS cores:
These cores will not experience arbitration delay between PRU cores when the signals exit the PRU Subsystem. However, PRU subsystems with XFR2VBUSP can have arbitration between a PRU core and an XFR2VBUSP instance in the same slice.

PRU_ICSSG:
Accesses to the external system must take arbitration delays into account from other PRU cores, as well as XFR2VBUSP accelerators.

Calculating write latencies with SBBO/SBCO

PRU-ICSS, PRU-SS Write to memory locations outside of the PRU subsystem: N PRU clocks to write N words

These instructions are fire-and-forget.

There are buffers between the PRU subsystem and the rest of the device. It takes N clock cycles for SBBO/SBCO to write N words to the system bus buffer. The PRU then moves on to the next instruction without waiting for the write to complete. It takes additional time for the value written to the buffer to travel through the system buses and eventually update the external memory address.

This assumes that the SBBO writes are to addresses that are word aligned (e.g., 0x100, 0x104, 0x108, etc). Whenever the write crosses a 4 byte boundary, it takes another PRU clock cycle.
- e.g., write 2 bytes to 0x100: one PRU clock
- e.g., write 2 bytes to 0x102: one PRU clock
- e.g., write 2 bytes to 0x103: two PRU clocks (one clock to write to the word at 0x100, one clock to write to the word at 0x104)
- Thus, if the write address is not word aligned, the equation changes to N + 1 PRU clocks for SBBO to write N 32-bit words
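The alignment rule above amounts to counting how many 32-bit words the write touches. The following sketch reproduces the examples in the list; the function name is made up for illustration, and the result covers only the PRU-side clocks, not arbitration delay or the time for the data to reach its destination.

```python
# PRU clocks for an SBBO/SBCO write: one clock per 32-bit word touched.
# sbbo_write_clocks() is a hypothetical helper name, not a TI API.

def sbbo_write_clocks(address, n_bytes):
    """Count the 32-bit words covered by [address, address + n_bytes)."""
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    first_word = address // 4
    last_word = (address + n_bytes - 1) // 4
    return last_word - first_word + 1

print(sbbo_write_clocks(0x100, 2))   # 1
print(sbbo_write_clocks(0x102, 2))   # 1
print(sbbo_write_clocks(0x103, 2))   # 2 (touches the words at 0x100 and 0x104)
print(sbbo_write_clocks(0x100, 16))  # 4 (N words, aligned: N clocks)
print(sbbo_write_clocks(0x101, 16))  # 5 (N words, unaligned: N + 1 clocks)
```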

PRU_ICSSG Write to memory locations outside of the PRU subsystem: N PRU clocks to write N words, plus per-slice arbitration delay

See the PRU-ICSS / PRU-SS section above, and the arbitration delay FAQ, [FAQ] PRU Arbitration Delay.

PRU-ICSS, PRU-SS, PRU_ICSSG Write to memory locations within the PRU subsystem: N PRU clocks to write N words, plus arbitration delay

There are no buffers between the PRU core and memory internal to the PRU subsystem. Thus, if another core is using an endpoint connected to the ICSS CBASS bus, the PRU core has to wait for the endpoint to become available before it can perform a write.

The internal CBASS interconnect is "fully switched". That means that multiple cores can use the CBASS simultaneously, as long as the cores are accessing different endpoints. For example, PRU0 can access DRAM at the same time that PRU1 is accessing the PRU's hardware UART.

See the arbitration delay FAQ for graphics that list the CBASS endpoints.

The same word alignment rules apply here. That is, if the write is not 32-bit word aligned, add 1 additional clock cycle.

Calculating read latencies with LBBO / LBCO

Read instructions take multiple PRU clock cycles. The exact number depends on how far the address being read is from the PRU core. The fastest LBBO / LBCO reads are from DMEM0, DMEM1, or SMEM (the PRU subsystem Data RAM).

LBBO from PRU DRAM takes 2 + N PRU clocks to read N 32-bit words (plus arbitration delay, plus 1 PRU clock if address is not word aligned)

LBBO takes 3 clock cycles to get the first 4 bytes of data from PRU internal memory. Every additional 4 bytes adds 1 clock.

For example,
LBBO loadDestination, loadAddress, 0, 16
takes 6 PRU clocks: 3 clocks to load the first 4 bytes, and 3 more clocks to load the remaining 12 bytes (1 clock per additional 4 bytes).
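The read rule can be written as a one-line calculation. This is a sketch with a made-up function name; the arbitration delay is passed in as a parameter because its value depends on the device and on what the other cores are doing.

```python
# PRU clocks for an LBBO/LBCO read from PRU Data RAM: 2 + N for N 32-bit words,
# plus 1 clock if the address is not word aligned, plus any arbitration delay.
# lbbo_dram_read_clocks() is a hypothetical helper name, not a TI API.

def lbbo_dram_read_clocks(n_words, word_aligned=True, arbitration_clocks=0):
    clocks = 2 + n_words
    if not word_aligned:
        clocks += 1
    return clocks + arbitration_clocks

print(lbbo_dram_read_clocks(1))                      # 3 (first 4 bytes)
print(lbbo_dram_read_clocks(4))                      # 6 (the 16-byte example above)
print(lbbo_dram_read_clocks(4, word_aligned=False))  # 7
print(lbbo_dram_read_clocks(16))                     # 18 (64 bytes)
```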

Depending on the PRU subsystem, reads from different locations may have arbitration delay from different sources.

For more information about arbitration delay, refer to [FAQ] PRU Arbitration Delay.

What about XFR2VBUS?

What is XFR2VBUS?

PRU cores can use the XIN / XOUT commands to swap up to 64 Bytes of data with the XFR2VBUS accelerator in a single clock cycle.

How can XFR2VBUS save PRU clock cycles?

For a 64 Byte write, instead of stalling for 16 PRU clock cycles to send the data with SBBO/SBCO, the PRU can send the data to XFR2VBUS in 1 clock cycle. Then the PRU code can continue executing instructions, while the XFR2VBUS accelerator handles the actual data transfer.

A PRU 64 Byte read from on-chip SRAM could stall the PRU core for ~40-80 PRU clock cycles, depending on the processor. Even in the best case, a 64 Byte read from PRU internal memory would stall the PRU core for 18 clock cycles. Alternatively, the PRU core could trigger the XFR2VBUS to read the memory with 1 clock cycle, and then load the data from XFR2VBUS with another clock cycle later on. The XFR2VBUS can also be configured to auto-increment on read: in that case, loading data from the XFR2VBUS automatically triggers the next read, saving another clock cycle.
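The numbers above can be reproduced with a short calculation. This is a sketch under the assumptions that the accesses are word aligned and that there is no arbitration delay; the function and constant names are invented for illustration. It counts only the clocks spent by the PRU core, not the time the accelerator needs to complete the transfer.

```python
import math

# PRU core clocks spent on a transfer, with and without XFR2VBUS.
# All names are hypothetical; accesses are assumed word aligned with no arbitration delay.

def sbbo_write_stall_clocks(n_bytes):
    """SBBO/SBCO write: one PRU clock per 32-bit word."""
    return math.ceil(n_bytes / 4)

def lbbo_internal_read_stall_clocks(n_bytes):
    """LBBO/LBCO read from PRU internal memory, best case: 2 + N clocks for N words."""
    return 2 + math.ceil(n_bytes / 4)

XFR2VBUS_WRITE_CLOCKS = 1  # one clock to hand the data to the accelerator
XFR2VBUS_READ_CLOCKS = 2   # one clock to trigger the read, one clock to load the data

n_bytes = 64
print(sbbo_write_stall_clocks(n_bytes), XFR2VBUS_WRITE_CLOCKS)         # 16 1
print(lbbo_internal_read_stall_clocks(n_bytes), XFR2VBUS_READ_CLOCKS)  # 18 2
```

The saving only materializes if the PRU has useful work to do between triggering a read and collecting the data.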

Which PRU subsystems have the XFR2VBUS accelerator?

AM335x, AM437x, AM57x, K2G: No XFR2VBUS

AM261x, AM263Px, AM263x, AM62x: XFR2VBUS RX (1 per slice = 2 per PRU subsystem)

AM243x, AM64x, AM65x: XFR2VBUS TX (2 per slice = 4 per PRU subsystem), XFR2VBUS RX (3 per slice = 6 per PRU subsystem)

Are there any downsides to using XFR2VBUS?

XFR2VBUS is not appropriate for every single use case.

The XFR2VBUS can only queue up one read or write at a time, so the data cannot be pipelined.

While XFR2VBUS can be used to read or write data from the PRU subsystem, XFR2VBUS cannot directly access the PRU subsystem's internal CBASS bus. So any read or write to memory within the PRU subsystem must physically exit the PRU subsystem, go through the SoC level bus, and access the internal CBASS from the SoC bus. This means that the actual clock cycles for the XFR2VBUS to read data from the PRU's local DMEM will be greater than the clock cycles that an LBBO/LBCO command would take, since PRU cores can use local addresses with LBBO/LBCO to directly access the PRU's internal CBASS.

Note! Use the local address for resources within the PRU Subsystem

Check the PRU's memory map, and use the local memory addresses to access PRU memory, peripherals, and registers. If you use the system addresses for these targets, then the LBCO/LBBO/SBCO/SBBO commands will access these targets through the SoC bus instead of directly through the PRU's local CBASS bus. This leads to larger latencies, and potentially introduces additional jitter in the latency.

For more information, refer to the graphics in [FAQ] PRU Arbitration Delay.
