SK-AM64B: RPmsg between A53 and R5 performance update, cont

Update Dec 6 2024
The previous thread is starting to get a bit long:
https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1388960/sk-am64b-rpmsg-between-a53-and-r5-performance-update
I am splitting the later discussion off into this new thread.

Hi Nick,

For now I do not have much time to spend on this topic. We are entering the "final stage" of pre-investigations and I have some more topics to look at.


The last thing I did was try to get rid of RPMsg with the help of GPIO IRQs, using SIGINT in user space. The example is done, but the measurements are not. Maybe I will have some time in between to do the measurements.
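For context, here is a minimal sketch of what the user-space side of that approach might look like, assuming a small kernel driver (not shown) forwards the GPIO IRQ to the registered process as SIGINT; the handler just sets a flag and the main loop does the timestamping:

    /* Sketch only: user-space side of the GPIO-IRQ notification idea.
     * Assumes a small kernel driver (not shown) sends SIGINT to this
     * process when the R5F toggles the GPIO line. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_irq;

    static void irq_handler(int sig)
    {
            (void)sig;
            got_irq = 1;            /* only set a flag; real work happens in main() */
    }

    int main(void)
    {
            struct sigaction sa;

            memset(&sa, 0, sizeof(sa));
            sa.sa_handler = irq_handler;
            sigaction(SIGINT, &sa, NULL);

            for (;;) {
                    pause();        /* sleep until the signal arrives */
                    if (got_irq) {
                            got_irq = 0;
                            /* timestamp here, then read the shared buffer */
                            printf("notification received\n");
                    }
            }
    }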

Currently I am focusing on the multicore scheduling. I will open another thread for this topic soon.

If you have any updates on this topic, it would be great to see the results.

  • Hello Chris,

    Sounds good. I might not be the right team member to own the thread on multicore scheduling, but I would love to follow that thread to learn along with you.

    I'll try to get some test runs in tomorrow or the next day, and I'll update the thread as soon as I have anything worth sharing!

    Regards,

    Nick

  • Partial update: I updated my code to measure average and worst-case latencies, and validated it on ti-rpmsg-char (a rough sketch of that bookkeeping is below). Updated code and other details are here:
    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1410313/am6442-communication-latency-issues-between-a53-and-r5-in-a-linux-rt-system/5434861#5434861

    I am almost done applying the updated code to the zerocopy example, and I'll run tests there next.
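
    For anyone following along, the average/worst-case/histogram bookkeeping follows the usual pattern; this is just a rough sketch of the idea, not the exact test code (the bin width and dump format are assumptions):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of the per-measurement bookkeeping: running sum for the average,
     * running max for the worst case, one histogram bin per usec. */
    #define NUM_BINS 20000

    static uint64_t lat_sum, lat_max, lat_count;
    static uint32_t histogram[NUM_BINS];

    static void record_latency(uint64_t usec)
    {
            lat_sum += usec;
            lat_count++;
            if (usec > lat_max)
                    lat_max = usec;
            histogram[usec < NUM_BINS ? usec : NUM_BINS - 1]++;   /* clamp outliers */
    }

    static void dump_histogram(const char *path)
    {
            FILE *f = fopen(path, "w");

            if (!f)
                    return;
            /* one "latency , count" line per bin, up to the worst case seen */
            for (uint64_t i = 0; i <= lat_max && i < NUM_BINS; i++)
                    fprintf(f, "%" PRIu64 " , %" PRIu32 "\n", i, histogram[i]);
            fclose(f);
    }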

    Regards,

    Nick

  • Pinging the thread to keep it active.

    I still want to finish running timing tests on zerocopy, and then see if I can implement and test some memory-polling code for the notification mechanism instead of RPMsg; a sketch of that idea is below. However, since this is not currently a near-term need for y'all, I have spent the last couple of weeks getting other major customer requests under control. I'm hoping to pivot back to this effort in a couple more weeks.
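
    The rough idea for the polling variant (just a sketch, not a final design; the carveout address and flag offset below are placeholders) would be to mmap the shared carveout through /dev/mem and watch a flag word that the R5F bumps when new data is ready:

    /* Sketch: poll a notification flag in the shared carveout instead of
     * waiting on an RPMsg. Address, size, and flag offset are placeholders. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define CARVEOUT_PA  0xa5000000UL   /* placeholder physical address */
    #define CARVEOUT_SZ  0x100000UL
    #define FLAG_WORD    0              /* placeholder: word index the R5F increments */

    int main(void)
    {
            int fd = open("/dev/mem", O_RDWR | O_SYNC);
            if (fd < 0) {
                    perror("open /dev/mem");
                    return 1;
            }

            volatile uint32_t *base = mmap(NULL, CARVEOUT_SZ, PROT_READ | PROT_WRITE,
                                           MAP_SHARED, fd, CARVEOUT_PA);
            if (base == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            uint32_t last = base[FLAG_WORD];
            for (;;) {
                    uint32_t now = base[FLAG_WORD];
                    if (now != last) {      /* R5F bumped the flag: new data ready */
                            last = now;
                            /* timestamp + process the buffer here */
                    }
                    /* optionally usleep() here to trade latency for CPU load */
            }
    }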

    Regards,

    Nick

  • Hello Chris,

    Alex told me y'all were circling back to look at this more, so I will also put time towards finishing those zerocopy benchmark tests I was looking into. Just to set timeline expectations, I will probably not have any major updates this week, since it's the end of the workday and Tuesday is the only other day I am working this week. However, I will have more updates next week.

    Have you made any interesting progress on your side that I should be aware of, so we do not duplicate effort?

    Feel free to send your test code to Alex so that I can review it.

    Regards,

    Nick

  • Hello Chris,

    I finally had enough time to finish writing the test code to see where the time is being spent in the zerocopy example. The test code is still a bit buggy: I am getting segmentation faults on longer runs, and I am not sure if some variable is overflowing somewhere. For some reason the numbers are getting a bit mixed up as well. But it is enough to get an idea of what contributes the most to the round-trip latency.

    Most of the latency is due to:
    Linux write data to buffer (~7.1 msec for 1 MByte)
    Linux read data from buffer (~52 msec for 1 MByte)
    Linux>R5F RPMsg + R5F read/write + R5F>Linux RPMsg (~12.5 msec for 1 MByte)

    In general, the sync commands before and after reading seem to take <10 usec, so they are negligible relative to the reads and writes.
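
    (For reference, assuming the zerocopy example brackets CPU access with the standard dma-buf sync ioctls, which is what the sync1..sync4 numbers below are timing, the write side looks roughly like this; the helper and function names are mine, not from the example:)

    /* Sketch of the dma-buf CPU-access bracketing being timed. buf_fd is the
     * dma-buf file descriptor, map is its CPU mapping. The read/validate side
     * brackets its loop the same way (sync3/sync4). */
    #include <linux/dma-buf.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    static int dmabuf_sync(int buf_fd, uint64_t flags)
    {
            struct dma_buf_sync sync = { .flags = flags };

            return ioctl(buf_fd, DMA_BUF_IOCTL_SYNC, &sync);
    }

    static void write_pattern(int buf_fd, uint32_t *map, size_t words, uint32_t pattern)
    {
            size_t i;

            dmabuf_sync(buf_fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE);  /* sync1 */
            for (i = 0; i < words; i++)                                    /* buffer_write */
                    map[i] = pattern;
            dmabuf_sync(buf_fd, DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE);    /* sync2 */
    }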

    I'll hold off on posting my test code and the histograms, just to see if I get more time to debug exactly what is going on. But here's a representative output:

    ./rpmsg_char_zerocopy -r 2 -e carveout_ipc-memories@a5000000 -t 0x01010101 -n 50
    Created endpt device rpmsg-char-2-2009, fd = 4 port = 1025
    Exchanging 50 messages with rpmsg device on rproc id 2 ...
    
    dma-buf address: 0xa5000000
    root@am64xx-evm:~/241206_zerocopy_test#
    Completed 50 buffer updates successfully on rpmsg-char-2-2009
    
    Buffer size = 1048576 bytes = 1024 kbytes
    Pattern = 0x01010101
    Total execution time for the test: 3 seconds
    Many different latencies were measured in this test.
    
    latency = Linux>R5F RPMsg + R5F read/write + R5F>Linux RPMsg.
    Average latency: 12432
    Worst-case latency: 13805
    Histogram data at latency_histogram.txt
    
    sync1 = Linux time to sync dmabuf before writing.
    Average sync1: 1
    Worst-case sync1: 8
    Histogram data at sync1_histogram.txt
    
    buffer_write = Linux time to write dmabuf.
    Average buffer_write: 7129
    Worst-case buffer_write: 7558
    Histogram data at buffer_write_histogram.txt
    
    sync2 = Linux time to sync dmabuf after writing.
    Average sync2: 1
    Worst-case sync2: 6
    Histogram data at sync2_histogram.txt
    
    sync3 = Linux time to sync dmabuf before reading.
    Average sync3: 2
    Worst-case sync3: 12
    Histogram data at sync3_histogram.txt
    
    buffer_read = Linux time to read dmabuf *IN MSEC*.
    Average buffer_read: 52
    Worst-case buffer_read: 62
    Histogram data at buffer_read_histogram.txt
    
    // NOTE: the sync4 output below is WRONG; the real avg = 2, worst-case = 10
    // (see the histogram further below)
    // for some reason two extra 19057 values got added on top of
    // the 50 valid measurements
    sync4 = Linux time to sync dmabuf after reading.
    Average sync4: 764
    Worst-case sync4: 19057
    Histogram data at sync4_histogram.txt
    
    Number of iterations should = 50
    latency iterations = 49
    sync1 iterations = 50
    buffer_write iterations = 50
    sync2 iterations = 50
    sync3 iterations = 50
    buffer_read iterations = 48
    sync4 iterations = 52
    TEST STATUS: PASSED
    
    # vi sync4_histogram.txt
    // all 50 measurements are <= 10usec
    0 , 0
    1 , 28
    2 , 12
    3 , 2
    4 , 0
    5 , 4
    6 , 1
    7 , 0
    8 , 2
    9 , 0
    10 , 1
    

    Regards,

    Nick

  • This is what "write data" (buffer_write) measures:

    void buffer_init
    ...
            for(i = 0; i < lbuf->size / sizeof(uint32_t); i++)
                    lbuf->shared_buf[i] = pattern;

    and this is what "read data" (buffer_read) measures:

    int buffer_validate
    ...
        for (i = 0; i < len; i++) {
                if (lbuf->shared_buf[i] != pattern)
                        break;
        }
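
    The timing around those two loops is just CLOCK_MONOTONIC bracketing. A simplified sketch of the idea (the struct shape and call signatures here are assumed from the excerpts above, not the exact instrumented code):

    #include <stddef.h>
    #include <stdint.h>
    #include <time.h>

    /* assumed shapes, just to make the sketch hang together */
    struct local_buf {
            uint32_t *shared_buf;
            size_t    size;
    };
    void buffer_init(struct local_buf *lbuf, uint32_t pattern);
    int  buffer_validate(struct local_buf *lbuf, uint32_t pattern);

    static uint64_t ts_usec(void)
    {
            struct timespec ts;

            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
    }

    static void time_one_iteration(struct local_buf *lbuf, uint32_t pattern)
    {
            uint64_t t0, buffer_write_usec, buffer_read_usec;

            t0 = ts_usec();
            buffer_init(lbuf, pattern);        /* the write loop above */
            buffer_write_usec = ts_usec() - t0;

            t0 = ts_usec();
            buffer_validate(lbuf, pattern);    /* the read loop above */
            buffer_read_usec = ts_usec() - t0;

            /* feed these into the average / worst-case / histogram bookkeeping */
            (void)buffer_write_usec;
            (void)buffer_read_usec;
    }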