AM2434: EtherCAT PRU read/write time

Part Number: AM2434

Hi TI experts,

 

We are evaluating whether an EtherCAT cycle time of 50 us is achievable on AM2434, so we need to measure the runtime of the functions HW_EscReadIsr and HW_EscWriteIsr.

Surprisingly, our test shows that HW_EscReadIsr takes 4 us when the RPDO size is 9 bytes and 5.5 us when the RPDO size is 19 bytes.

I would like to know if our measurement is correct. I saw that the OCRAM speed is 64 bits @ 250 MHz in https://e2e.ti.com/support/microcontrollers/arm-based-microcontrollers-group/arm-based-microcontrollers/f/arm-based-microcontrollers-forum/1287056/am2432-what-s-the-bandwidth-of-the-shared-ocram.

But the time difference between 19 bytes and 9 bytes is 1.5 us, which means 0.15 us per byte.
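The two measurements can be separated into a per-byte cost and a size-independent cost with a two-point linear fit. The sketch below assumes the runtime grows linearly with the RPDO size; `copy_cost_t` and `fit_copy_cost` are hypothetical names used only for illustration.

```c
/* Two-point linear model: time_us = fixed_us + per_byte_us * size_bytes. */
typedef struct {
    double per_byte_us; /* slope: incremental cost of one more byte */
    double fixed_us;    /* intercept: cost that does not depend on the size */
} copy_cost_t;

/* Fit the model through two (size, time) measurements. */
static copy_cost_t fit_copy_cost(double size_a, double t_a_us,
                                 double size_b, double t_b_us)
{
    copy_cost_t c;
    c.per_byte_us = (t_b_us - t_a_us) / (size_b - size_a);
    c.fixed_us = t_a_us - c.per_byte_us * size_a;
    return c;
}
```

Calling `fit_copy_cost(9, 4.0, 19, 5.5)` gives 0.15 us per byte and about 2.65 us of fixed time, so under this model more than half of the 4 us measured for 9 bytes would not depend on the payload size.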

So if the runtime is valid and cannot be reduced, what is the read/write time for the shared memory? We also need to know whether using one core for EtherCAT and a separate core for motor control is feasible.

Thanks,

Jianyu

  • The 150 ns per byte latency is expected behavior for byte-by-byte R5F access to ICSSG SMEM through the interconnect. Here's the assessment and recommended path forward:

    For your 19-byte process data, switching from byte copy to 32-bit word copy is the correct immediate optimization. This reduces the number of bus transactions from 19 to 7 (4 word reads + 3 byte reads for the remainder), so the cost becomes ~150 ns per 32-bit word rather than per byte [1]. The ICSSG internal SMEM (64 KB) has ~15 ns access latency from the PRU cores, but R5F access traverses the external interconnect, which adds the latency you're seeing.
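To make the transaction count concrete, here is a small sketch that counts the reads needed for a given start address and length. It assumes byte reads up to the next 4-byte boundary, 32-bit reads for the aligned middle, and byte reads for the tail; `word_copy_reads` is a hypothetical helper, not part of the SDK.

```c
#include <stdint.h>

/* Reads issued by a plain byte-by-byte copy: one per byte. */
static unsigned byte_copy_reads(unsigned len)
{
    return len;
}

/* Reads issued by a copy that aligns first, then uses 32-bit reads. */
static unsigned word_copy_reads(uint32_t addr, unsigned len)
{
    unsigned head = (4u - (addr & 3u)) & 3u; /* bytes before the boundary, 0 if aligned */
    if (head > len) {
        head = len;
    }
    unsigned rest = len - head;
    return head + rest / 4u + rest % 4u;     /* head bytes + words + tail bytes */
}
```

For 19 bytes starting on a 4-byte boundary this gives 4 word reads plus 3 byte reads, 7 in total, compared with 19 for the byte copy.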

  • OK, so this applies to all DPRAM reads/writes, SDO and PDO. Are ESC registers also included?

  • The explanation above mainly applies to SDO, PDO and DPRAM read/write. Because ESC registers are also emulated via ICSS SMEM, it is technically applicable to them too, but do you see a frequent access path in the stack to ESC registers cyclically, done 8 bits at a time?

  • Hi Pratheesh,

    It seems that the start address of the PDO is not 4-byte aligned. If I directly use a 4-byte copy like the code below, the drive simply crashes.

    while (c>=4)
    {
    *(uint32_t*)d = *(uint32_t*)s;
    d+=4;
    s+=4;
    c-=4;
    }

    I plotted the address of &pEsc[0x1100], which is supposed to be the SM2 address, and it changes every cycle.

    After I added 4-byte alignment handling at the start, I could not observe any improvement in running time.

    When testing 19-byte RPDO, we do get around 500 ns improvement. The running time is reduced to around 5 us.

    I wonder if any more improvement is possible. The current test replaces bsp_read with bsp_myread.

  • void bsp_myread(PRUICSS_Handle pruIcssHandle, uint8_t *pdata, uint16_t address,
                  uint16_t len)
    {
        uint8_t *pEsc = (uint8_t *)(((PRUICSS_HwAttrs *)(
                                         pruIcssHandle->hwAttrs))->baseAddr + PRUICSS_SHARED_RAM);
        uint8_t *d = pdata;
        const uint8_t *s = &pEsc[address];
        int c = len, remainder = 4 - ((long)s & 3);
        if (remainder != 0 && c >= remainder)
        {
            while (remainder)
            {
                *d = *s;
                d++;
                s++;
                remainder--;
            }
            c -= remainder;
        }
        while (c>=4)
        {
            *(uint32_t*)d = *(uint32_t*)s;
            d+=4;
            s+=4;
            c-=4;
        }
        while (c)
        {
            *d = *s;
            d++;
            s++;
            c--;
        }
    }

    Current function is attached.
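Two details in the alignment prologue of the function above may be worth checking. `4 - ((long)s & 3)` evaluates to 4, not 0, when `s` is already aligned, and `c -= remainder` runs after the loop has counted `remainder` down to 0, so `c` is never reduced and the later loops copy `len` more bytes on top of the head bytes. A minimal sketch of a head-length computation that avoids both, with `head_len` as a hypothetical helper name:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bytes to copy one at a time before `s` reaches a 4-byte boundary,
 * capped at `len`. Returns 0 when `s` is already aligned.
 */
static size_t head_len(const void *s, size_t len)
{
    size_t head = (4u - ((uintptr_t)s & 3u)) & 3u;
    return head < len ? head : len;
}

/*
 * Intended use inside the copy routine (same variable names as above):
 *
 *     size_t head = head_len(s, c);
 *     c -= head;                  // subtract before the loop consumes it
 *     while (head--) *d++ = *s++;
 */
```

Note that this only aligns the source. The destination `d` may still be unaligned for the 32-bit stores, which is worth verifying separately.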

  • It is mainly ESC register 0x220 (AL Event). It is handled every 1 ms, and it is OK if it takes longer.

    but do you see a frequent access path in the stack to ESC registers cyclically, done 8 bits at a time?
  • When testing 19-byte RPDO, we do get around 500 ns improvement. The running time is reduced to around 5 us.

    What optimization level is used? Can you look at the disassembly-level deltas?


    I wonder if any more improvement is possible.

    There are some possibilities, such as adding PRU DMA to R5F TCM, but this requires firmware modifications as well as potential stack interface changes (we did something similar in the AM437x EtherCAT implementation via EDMA). This is not a planned feature on AM64/AM243/AM26x as of now.

    https://dr-download.ti.com/software-development/driver-or-library/MD-JLNx46uE7Y/01.00.10.00/EtherCAT_Slave_Datasheet.pdf 

  • Our optimization level is Optimize most (O3); the assembly is as follows.

    70122234 <bsp_myread>:
    70122234: e52db004       str  r11, [sp, #-4]!
    70122238: e28db000       add  r11, sp, #0
    7012223c: e24dd02c       sub  sp, sp, #44
    70122240: e50b0020       str  r0, [r11, #-32]
    70122244: e50b1024       str  r1, [r11, #-36]
    70122248: e1a01002       mov  r1, r2
    7012224c: e1a02003       mov  r2, r3
    70122250: e1a03001       mov  r3, r1
    70122254: e14b32b6       strh  r3, [r11, #-38]
    70122258: e1a03002       mov  r3, r2
    7012225c: e14b32b8       strh  r3, [r11, #-40]
    70122260: e51b3020       ldr  r3, [r11, #-32]
    70122264: e5933004       ldr  r3, [r3, #4]
    70122268: e5933004       ldr  r3, [r3, #4]
    7012226c: e2833801       add  r3, r3, #65536
    70122270: e50b3018       str  r3, [r11, #-24]
    70122274: e51b3024       ldr  r3, [r11, #-36]
    70122278: e50b3008       str  r3, [r11, #-8]
    7012227c: e15b32b6       ldrh  r3, [r11, #-38]
    70122280: e51b2018       ldr  r2, [r11, #-24]
    70122284: e0823003       add  r3, r2, r3
    70122288: e50b300c       str  r3, [r11, #-12]
    7012228c: e15b32b8       ldrh  r3, [r11, #-40]
    70122290: e50b3010       str  r3, [r11, #-16]
    70122294: e51b300c       ldr  r3, [r11, #-12]
    70122298: e2033003       and  r3, r3, #3
    7012229c: e2633004       rsb  r3, r3, #4
    701222a0: e50b3014       str  r3, [r11, #-20]
    701222a4: e51b3014       ldr  r3, [r11, #-20]
    701222a8: e3530000       cmp  r3, #0
    701222ac: 0a000026       beq  0x7012234c <bsp_myread+0x118> @ imm = #152
    701222b0: e51b2010       ldr  r2, [r11, #-16]
    701222b4: e51b3014       ldr  r3, [r11, #-20]
    701222b8: e1520003       cmp  r2, r3
    701222bc: ba000022       blt  0x7012234c <bsp_myread+0x118> @ imm = #136
    701222c0: ea00000c       b  0x701222f8 <bsp_myread+0xc4> @ imm = #48
    701222c4: e51b300c       ldr  r3, [r11, #-12]
    701222c8: e5d32000       ldrb  r2, [r3]
    701222cc: e51b3008       ldr  r3, [r11, #-8]
    701222d0: e5c32000       strb  r2, [r3]
    701222d4: e51b3008       ldr  r3, [r11, #-8]
    701222d8: e2833001       add  r3, r3, #1
    701222dc: e50b3008       str  r3, [r11, #-8]
    701222e0: e51b300c       ldr  r3, [r11, #-12]
    701222e4: e2833001       add  r3, r3, #1
    701222e8: e50b300c       str  r3, [r11, #-12]
    701222ec: e51b3014       ldr  r3, [r11, #-20]
    701222f0: e2433001       sub  r3, r3, #1
    701222f4: e50b3014       str  r3, [r11, #-20]
    701222f8: e51b3014       ldr  r3, [r11, #-20]
    701222fc: e3530000       cmp  r3, #0
    70122300: 1affffef       bne  0x701222c4 <bsp_myread+0x90> @ imm = #-68
    70122304: e51b2010       ldr  r2, [r11, #-16]
    70122308: e51b3014       ldr  r3, [r11, #-20]
    7012230c: e0423003       sub  r3, r2, r3
    70122310: e50b3010       str  r3, [r11, #-16]
    70122314: ea00000c       b  0x7012234c <bsp_myread+0x118> @ imm = #48
    70122318: e51b300c       ldr  r3, [r11, #-12]
    7012231c: e5932000       ldr  r2, [r3]
    70122320: e51b3008       ldr  r3, [r11, #-8]
    70122324: e5832000       str  r2, [r3]
    70122328: e51b3008       ldr  r3, [r11, #-8]
    7012232c: e2833004       add  r3, r3, #4
    70122330: e50b3008       str  r3, [r11, #-8]
    70122334: e51b300c       ldr  r3, [r11, #-12]
    70122338: e2833004       add  r3, r3, #4
    7012233c: e50b300c       str  r3, [r11, #-12]
    70122340: e51b3010       ldr  r3, [r11, #-16]
    70122344: e2433004       sub  r3, r3, #4
    70122348: e50b3010       str  r3, [r11, #-16]
    7012234c: e51b3010       ldr  r3, [r11, #-16]
    70122350: e3530003       cmp  r3, #3
    70122354: caffffef       bgt  0x70122318 <bsp_myread+0xe4> @ imm = #-68
    70122358: ea00000c       b  0x70122390 <bsp_myread+0x15c> @ imm = #48
    7012235c: e51b300c       ldr  r3, [r11, #-12]
    70122360: e5d32000       ldrb  r2, [r3]
    70122364: e51b3008       ldr  r3, [r11, #-8]
    70122368: e5c32000       strb  r2, [r3]
    7012236c: e51b3008       ldr  r3, [r11, #-8]
    70122370: e2833001       add  r3, r3, #1
    70122374: e50b3008       str  r3, [r11, #-8]
    70122378: e51b300c       ldr  r3, [r11, #-12]
    7012237c: e2833001       add  r3, r3, #1
    70122380: e50b300c       str  r3, [r11, #-12]
    70122384: e51b3010       ldr  r3, [r11, #-16]
    70122388: e2433001       sub  r3, r3, #1
    7012238c: e50b3010       str  r3, [r11, #-16]
    70122390: e51b3010       ldr  r3, [r11, #-16]
    70122394: e3530000       cmp  r3, #0
    70122398: 1affffef       bne  0x7012235c <bsp_myread+0x128> @ imm = #-68
    7012239c: e320f000       nop
    701223a0: e320f000       nop
    701223a4: e28bd000       add  sp, r11, #0
    701223a8: e49db004       ldr  r11, [sp], #4
    701223ac: e12fff1e       bx  lr

  • Hi Pratheesh,

    Since EDMA is not supported on AM24 so far, could you please check whether any further optimization can be done on the assembly code of the function bsp_myread that the customer provided above? Thanks.

    In the meantime, the customer will analyze whether the current configuration can meet their application-level target. If it cannot and there are no other optimizations to try, we may discuss porting EDMA to AM24 later.

    Thanks,

    Kevin

  • Our optimization level is Optimize most (O3); the assembly is as follows.

    We did some analysis of the assembly snippet.

    Critical Issue: Excessive Stack Spilling in Every Loop Iteration

    The most severe problem is that all loop variables (src ptr, dst ptr, count, alignment) are stored on stack and reloaded on every iteration rather than being kept in registers.
    Word-copy loop (0x70122318–0x70122348) — 13 instructions per 4 bytes:
    ldr r3, [r11, #-12] ; reload src ptr FROM STACK
    ldr r2, [r3] ; actual load
    ldr r3, [r11, #-8] ; reload dst ptr FROM STACK
    str r2, [r3] ; actual store
    ldr r3, [r11, #-8] ; reload dst ptr AGAIN
    add r3, r3, #4
    str r3, [r11, #-8] ; write dst ptr BACK to stack
    ldr r3, [r11, #-12] ; reload src ptr AGAIN
    add r3, r3, #4
    str r3, [r11, #-12] ; write src ptr BACK to stack
    ldr r3, [r11, #-16] ; reload count
    sub r3, r3, #4
    str r3, [r11, #-16] ; write count back to stack
    Only 2 of these 13 instructions do real work. The rest are redundant stack traffic.

    SDK  memcpy():
    ; Inner loop: 16 bytes in 2 instructions, pointers STAY in registers
    ldm r1!, {r3, r4, r12, r14} ; load 4 words, auto-increment r1
    stm r0!, {r3, r4, r12, r14} ; store 4 words, auto-increment r0
    subs r2, r2, #0x10 ; decrement count
    bhs #loop ; branch if ≥ 0
    SDK memcpy copies 16 bytes in ~3 instructions. The customer's code takes ~13 instructions for 4 bytes — roughly a 10× instruction overhead.

    1. Maybe try __restrict to tell the compiler that the src/dst parameters do not alias
    void bsp_myread(void *__restrict dst, const void *__restrict src, uint16_t addr, uint16_t len);
    2. Use local register copies of pointers:
    uint8_t *d = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src + addr;
    // compiler can now keep d and s in registers
    3. Replace the word-copy loop with LDM/STM (or just use memcpy (__builtin_memcpy))
    // After alignment handling:
    uint32_t words = remaining >> 2;
    if (words) {
    memcpy(d, s, words << 2); // SDK memcpy uses LDM/STM internally
    d += words << 2;
    s += words << 2;
    remaining &= 3;
    }
    // byte tail
    while (remaining--) *d++ = *s++;
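The three suggestions above can be combined roughly as follows. This is a sketch under assumptions, not the SDK implementation: `copy_from_esc` is a hypothetical name, the source pointer is assumed to already include the ESC address offset, and whether the toolchain's `memcpy` issues only aligned word accesses on this memory region should be confirmed in the disassembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes from src to dst; src and dst must not overlap. */
static void copy_from_esc(uint8_t *restrict dst,
                          const uint8_t *restrict src, size_t len)
{
    uint8_t *d = dst;        /* local copies the compiler can keep in registers */
    const uint8_t *s = src;

    /* Head: byte copies until the source is 4-byte aligned (0 if aligned). */
    size_t head = (4u - ((uintptr_t)s & 3u)) & 3u;
    if (head > len) {
        head = len;
    }
    len -= head;
    while (head--) {
        *d++ = *s++;
    }

    /* Bulk: whole 32-bit words, delegated to memcpy. */
    size_t bulk = len & ~(size_t)3;
    if (bulk) {
        memcpy(d, s, bulk);
        d += bulk;
        s += bulk;
    }

    /* Tail: the remaining 0 to 3 bytes. */
    len &= 3u;
    while (len--) {
        *d++ = *s++;
    }
}
```

Using `memcpy` for the bulk part also avoids casting a possibly unaligned destination pointer to `uint32_t *`.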

    Region                          Current             Optimized (estimate)
    4-byte aligned copy             ~13 instr/word      ~3 instr/16 bytes (LDM/STM)
    19-byte RPDO (approx. 5 words)  ~65+ instructions   ~10–15 instructions
    Expected runtime improvement    ~5 µs baseline      Potentially an additional 1–1.5 µs reduction