AM6442: Does the ICSSG1 PRU0 support the assembly instructions DMB, DSB, ISB

Part Number: AM6442

Hello,

Does the ICSSG1 PRU0 support assembly instructions like DMB, DSB and ISB? The PRU Assembly Instruction User Guide does not mention these three instructions.

Where can I find how many clock cycles each PRU assembly instruction costs, for example SBBO and LBBO?

  • edited April 21, 2022

    Hello,

    General answers

    If the assembly instruction is not listed in the PRU Assembly Instruction User Guide (https://www.ti.com/lit/spruij2), then it does not exist for the PRU. The PRU does not support DMB, DSB, and ISB.

    For read latencies, take a look at the PRU Read Latencies App note (https://www.ti.com/lit/sprace8). We have not yet added AM64x to this document, but the basics will still apply:
    1) all PRU instructions that are not memory reads execute in a single PRU clock cycle
    2) PRU write instructions to memory locations outside of the PRU subsystem are fire-and-forget. There are buffers between the PRU subsystem and the rest of the device. It takes one clock cycle for SBBO to write to the system bus buffer. The PRU then moves on to the next instruction without waiting for the write to complete. It may take additional time for the buffered value to travel through the system buses and eventually update the external memory address.
    3) PRU write instructions to memory locations within the PRU subsystem take one clock cycle, plus any arbitration delay. This is because there are no buffers between the PRU core and memory internal to the PRU subsystem. Thus, if another core is using the ICSS bus, the PRU core has to wait for the bus to become available before it can perform a write. See the "super deep dive" section below for information on how to calculate arbitration delay.
    4) PRU read instructions take multiple PRU clock cycles, depending on how far away the memory being read is.

    Super deep dive on LBBO with PRU DRAM, for PRU-ICSS devices and AM65x

    LBBO takes 3 clock cycles to get the first 4 bytes of data from PRU internal memory. Every additional 4 bytes adds 1 clock.
    For example, LBBO &loadDestination, loadAddress, 0, 16
    takes 3 clocks to load the first 4 bytes and 3 more clocks to load the remaining 12 bytes, for 6 clocks in total.
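
    The timing rule above can be written as a small helper. This is only a sketch in plain C, not TI code: the name lbbo_cycles is hypothetical, rounding a partial trailing word up to a full 4-byte transfer is an assumption, and arbitration delay is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: estimated PRU clocks for an LBBO from PRU-internal
 * memory, using the rule quoted above (3 clocks for the first 4 bytes,
 * 1 clock for each additional 4 bytes). Arbitration delay is not modeled. */
uint32_t lbbo_cycles(uint32_t nbytes)
{
    assert(nbytes > 0);                   /* zero-length loads are not modeled */
    uint32_t words = (nbytes + 3u) / 4u;  /* assumption: partial words round up */
    return 3u + (words - 1u);
}
```

    With this model, lbbo_cycles(16) evaluates to 6, matching the 16-byte example.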

    While we are going SUPER in depth here, let's also talk about arbitration delay:

    Any time a PRU core performs a load from internal memory, there is a chance that another core is already accessing that memory.

    The ICSS bus is 4 Bytes wide, so if another core is performing an atomic write of x Bytes then it will take ceil(x/4) PRU clocks for the write to occur. E.g., a write of 30 bytes would take ceil(30/4) = ceil(7.5) = 8 PRU clocks to complete.

    If a PRU core and a non-PRU core try to access the same ICSS location (SRAM, UART, ECAP, etc.) during the same PRU clock cycle, the PRU wins. However, if the PRU initiates a transaction to SRAM while another core is already executing a burst write, the PRU transaction will stall until the burst write is completed. So if the non-PRU core's write takes 8 PRU clocks to complete, the longest the PRU would have to wait is 8 - 1 = 7 PRU clocks. If you want to completely avoid arbitration delay, you could program the non-PRU core to perform multiple 4 Byte writes instead of a single long write. In that case, the PRU would win arbitration on every clock cycle in which it initiates a read.
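
    The arbitration arithmetic above can be sketched in plain C as follows. The function names are hypothetical, and the model assumes exactly what is described here: a 4 Byte wide ICSS bus and a PRU access that arrives just after the other core's burst has started.

```c
#include <assert.h>
#include <stdint.h>

#define ICSS_BUS_BYTES 4u  /* bus width stated in the text above */

/* PRU clocks needed for another core's atomic write of nbytes: ceil(nbytes / 4). */
uint32_t burst_write_clocks(uint32_t nbytes)
{
    return (nbytes + ICSS_BUS_BYTES - 1u) / ICSS_BUS_BYTES;
}

/* Longest stall (in PRU clocks) a PRU access can see behind that burst:
 * one clock less than the burst length, per the 8 - 1 = 7 example. */
uint32_t worst_case_pru_stall(uint32_t nbytes)
{
    assert(nbytes > 0);  /* an empty burst is not modeled */
    return burst_write_clocks(nbytes) - 1u;
}
```

    For the 30 byte write, burst_write_clocks(30) is 8 and worst_case_pru_stall(30) is 7. For a single 4 Byte write the stall is 0, which is why splitting a long write into 4 Byte writes avoids the delay.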

    What about LBBO with PRU DRAM on AM24x & AM64x? 

    AM24x and AM64x have 64-bit external CBASS buses, but 32-bit internal CBASS buses as per the current revision of the Technical Reference Manual. So I would expect that any writes to DRAM would need to go through the 32-bit internal CBASS bus. If true, that means AM24x and AM64x have the same LBBO timing as all the previous PRU-ICSS / PRU_ICSSG devices. I am double checking with the hardware designer.

    Regards,

    Nick

  • Hi Nick,

    1) In my program, I found that SBBO costs 4 PRU clock cycles. My program mixes C and assembly.

    Below is part of my assembly code:

    /********************************************************************************
    *inline_read() - read data
    *@cmd: not used now
    *@databuf: the memory address to store the read data
    ********************************************************************************/
    void inline_read(uint32_t cmd, uint32_t databuf)
    {
    __asm volatile(
    "LDI R26, 0 \n\t " /* R26 temporarily holds the data that has been read */
    "LDI R27, 0 \n\t " /* R27 counts how much data has been read so far */
    "LDI R30.b2, 0 \n\t "
    "LDI R30.b0, 0 \n\t "

    "loop_read: \n\t"
    "SET R30, R30, 16 \n\t "
    "MOV R26.b0, R31.b1 \n\t " /* read input data from R31.b1 while bit 16 is high */
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "CLR R30, R30, 16 \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "SET R30, R30, 16 \n\t "
    "MOV R26.b1, R31.b1 \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "CLR R30, R30, 16 \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "SET R30, R30, 16 \n\t "
    "MOV R26.b2, R31.b1 \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "CLR R30, R30, 16 \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "SET R30, R30, 16 \n\t "
    "MOV R26.b3, R31.b1 \n\t "
    "SBBO &R26.b0, R15, R27.b1, 4 \n\t " /* store the 4 bytes at the address in R15 plus offset R27.b1 */
    "CLR R30, R30, 16 \n\t "
    "nop \n\t "
    "nop \n\t "
    "ADD R27.b0, R27.b0, 2 \n\t " /* R27.b0: read word number */
    "ADD R27.b1, R27.b1, 4 \n\t " /* R27.b1: read bytes number */
    "QBGT loop_read, R27.b0, 64 \n\t " /* if R27.b0 < 64, continue to read */

    "LDI R30.b0, 0x0 \n\t "
    "LDI R30.b2, 0x0 \n\t "
    );
    }

    This code produces a periodic high/low toggle on R30 bit 16.

    In my case, I found that SBBO costs 4 PRU clock cycles. So I am wondering whether the cycle cost of SBBO increases with the number of bytes stored to memory?
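
    One way to see where the 4-cycle figure comes from is to compare the high phases of the loop above. This is only a sketch: it assumes every non-memory instruction takes one PRU clock and that all high phases on the pin are equally long, and the helper name implied_store_cycles is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: if a reference high phase lasts
 * reference_phase_cycles, and the high phase containing the store also holds
 * single_cycle_instrs one-clock instructions, the store must fill the rest. */
uint32_t implied_store_cycles(uint32_t reference_phase_cycles,
                              uint32_t single_cycle_instrs)
{
    assert(reference_phase_cycles >= single_cycle_instrs);
    return reference_phase_cycles - single_cycle_instrs;
}
```

    In the loop above, a normal high phase spans MOV, four NOPs and CLR, which is 6 clocks. The high phase containing the store spans MOV, SBBO and CLR, so implied_store_cycles(6, 2) gives 4 clocks for SBBO.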

    2) I also found that if I use the code below for the loop_read part,

    "loop_read: \n\t"
    "SET R30, R30, 16 \n\t "
    "MOV R26.b0, R31.b1 \n\t " /* read input data from R31.b1 while bit 16 is high */
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "CLR R30, R30, 16 \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "nop \n\t "
    "SET R30, R30, 16 \n\t "
    "MOV R26.b1, R31.b1 \n\t "
    "SBBO &R26.b0, R15, R27.b1, 2 \n\t " /* store the 2 bytes at the address in R15 plus offset R27.b1 */
    "CLR R30, R30, 16 \n\t "
    "nop \n\t "
    "nop \n\t "
    "ADD R27.b0, R27.b0, 1 \n\t " /* R27.b0: read word number */
    "ADD R27.b1, R27.b1, 2 \n\t " /* R27.b1: read bytes number */
    "QBGT loop_read, R27.b0, 64 \n\t " /* if R27.b0 < 64, continue to read */

    I cannot get a periodic high/low toggle on R30 bit 16. Is there any instruction optimization when SBBO and ADD are close to each other?

    Best Regards

    xixiguo