SK-AM64B: RPmsg between A53 and R5 performance update, cont

Update Dec 6 2024
The previous thread is starting to get a bit long:
https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1388960/sk-am64b-rpmsg-between-a53-and-r5-performance-update
I am splitting the later discussion off into this new thread.

Hi Nick,

For now I do not have much time to spend on this topic. We are entering the "final stage" of pre-investigations and I have some more topics to look at.


The last thing I did was try to get rid of RPMsg with the help of GPIO IRQs, using SIGINT in user space. The example is done, but the measurements are not. Maybe I will have some time in between to do the measurements.
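For context, here is a minimal sketch of what the user-space side of that approach might look like, assuming a small kernel driver (not shown) forwards the GPIO IRQ to the registered process as SIGINT; the handler just sets a flag and the main loop does the timestamping:

    /* Sketch only: user-space side of the GPIO-IRQ notification idea.
     * Assumes a small kernel driver (not shown) sends SIGINT to this
     * process when the R5F toggles the GPIO line. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_irq;

    static void irq_handler(int sig)
    {
            (void)sig;
            got_irq = 1;            /* only set a flag; real work happens in main() */
    }

    int main(void)
    {
            struct sigaction sa;

            memset(&sa, 0, sizeof(sa));
            sa.sa_handler = irq_handler;
            sigaction(SIGINT, &sa, NULL);

            for (;;) {
                    pause();        /* sleep until the signal arrives */
                    if (got_irq) {
                            got_irq = 0;
                            /* timestamp here, then read the shared buffer */
                            printf("notification received\n");
                    }
            }
    }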

Currently I am focusing on the multicore scheduling. I will open another thread for this topic soon.

If you have any updates on this topic, it would be great to see the results.

  • Hello Chris,

    Sounds good. I might not be the right team member to own the thread on multicore scheduling, but I would love to follow that thread to learn along with you.

    I'll try to get some test runs in tomorrow or the next day, and I'll update the thread as soon as I have anything worth sharing!

    Regards,

    Nick

  • Partial update: I updated my code to measure average and worst-case latencies, and validated it on ti-rpmsg-char (a rough sketch of that bookkeeping is below). Updated code and other details are here:
    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1410313/am6442-communication-latency-issues-between-a53-and-r5-in-a-linux-rt-system/5434861#5434861

    I am almost done applying the updated code to the zerocopy example, and I'll run tests there next.
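
    For anyone following along, the average/worst-case/histogram bookkeeping follows the usual pattern; this is just a rough sketch of the idea, not the exact test code (the bin width and dump format are assumptions):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of the per-measurement bookkeeping: running sum for the average,
     * running max for the worst case, one histogram bin per usec. */
    #define NUM_BINS 20000

    static uint64_t lat_sum, lat_max, lat_count;
    static uint32_t histogram[NUM_BINS];

    static void record_latency(uint64_t usec)
    {
            lat_sum += usec;
            lat_count++;
            if (usec > lat_max)
                    lat_max = usec;
            histogram[usec < NUM_BINS ? usec : NUM_BINS - 1]++;   /* clamp outliers */
    }

    static void dump_histogram(const char *path)
    {
            FILE *f = fopen(path, "w");

            if (!f)
                    return;
            /* one "latency , count" line per bin, up to the worst case seen */
            for (uint64_t i = 0; i <= lat_max && i < NUM_BINS; i++)
                    fprintf(f, "%" PRIu64 " , %" PRIu32 "\n", i, histogram[i]);
            fclose(f);
    }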

    Regards,

    Nick

  • Pinging the thread to keep it active.

    I still want to finish running timing tests on zerocopy, and then see if I can implement and test some memory-polling code for the notification mechanism instead of RPMsg; a sketch of that idea is below. However, since this is not currently a near-term need for y'all, I have spent the last couple of weeks getting other major customer requests under control. I'm hoping to pivot back to this effort in a couple more weeks.
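
    The rough idea for the polling variant (just a sketch, not a final design; the carveout address and flag offset below are placeholders) would be to mmap the shared carveout through /dev/mem and watch a flag word that the R5F bumps when new data is ready:

    /* Sketch: poll a notification flag in the shared carveout instead of
     * waiting on an RPMsg. Address, size, and flag offset are placeholders. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define CARVEOUT_PA  0xa5000000UL   /* placeholder physical address */
    #define CARVEOUT_SZ  0x100000UL
    #define FLAG_WORD    0              /* placeholder: word index the R5F increments */

    int main(void)
    {
            int fd = open("/dev/mem", O_RDWR | O_SYNC);
            if (fd < 0) {
                    perror("open /dev/mem");
                    return 1;
            }

            volatile uint32_t *base = mmap(NULL, CARVEOUT_SZ, PROT_READ | PROT_WRITE,
                                           MAP_SHARED, fd, CARVEOUT_PA);
            if (base == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            uint32_t last = base[FLAG_WORD];
            for (;;) {
                    uint32_t now = base[FLAG_WORD];
                    if (now != last) {      /* R5F bumped the flag: new data ready */
                            last = now;
                            /* timestamp + process the buffer here */
                    }
                    /* optionally usleep() here to trade latency for CPU load */
            }
    }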

    Regards,

    Nick

  • Hello Chris,

    Alex told me y'all were circling back to look at this more, so I will also put time towards finishing those zerocopy benchmark tests I was looking into. Just to set timeline expectations, I will probably not have any major updates this week, since it's the end of the workday and Tuesday is the only other day I am working this week. However, I will have more updates next week.

    Have you made any interesting progress on your side that I should be aware of, so we do not duplicate effort?

    Feel free to send your test code to Alex so that I can review it.

    Regards,

    Nick

  • Hello Chris,

    I finally had enough time to finish writing the test code to see where the time is being spent in the zerocopy example. The test code is still a bit buggy: I am getting segmentation faults on longer runs, and I am not sure if some variable is overflowing somewhere. For some reason the numbers are getting a bit mixed up as well. But it is enough to get an idea of what contributes the most to the round-trip latency.

    Most of the latency is due to:
    Linux write data to buffer (~7.1 msec for 1 MByte)
    Linux read data from buffer (~52 msec for 1 MByte)
    Linux>R5F RPMsg + R5F read/write + R5F>Linux RPMsg (~12.5 msec for 1 MByte)

    In general, the sync commands before and after reading seem to take <10 usec, so they are negligible relative to the reads and writes.
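
    (For reference, assuming the zerocopy example brackets CPU access with the standard dma-buf sync ioctls, which is what the sync1..sync4 numbers below are timing, the write side looks roughly like this; the helper and function names are mine, not from the example:)

    /* Sketch of the dma-buf CPU-access bracketing being timed. buf_fd is the
     * dma-buf file descriptor, map is its CPU mapping. The read/validate side
     * brackets its loop the same way (sync3/sync4). */
    #include <linux/dma-buf.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    static int dmabuf_sync(int buf_fd, uint64_t flags)
    {
            struct dma_buf_sync sync = { .flags = flags };

            return ioctl(buf_fd, DMA_BUF_IOCTL_SYNC, &sync);
    }

    static void write_pattern(int buf_fd, uint32_t *map, size_t words, uint32_t pattern)
    {
            size_t i;

            dmabuf_sync(buf_fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE);  /* sync1 */
            for (i = 0; i < words; i++)                                    /* buffer_write */
                    map[i] = pattern;
            dmabuf_sync(buf_fd, DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE);    /* sync2 */
    }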

    I'll hold off on posting my test code and the histograms, just to see if I get more time to debug exactly what is going on. But here's a representative output:

    ./rpmsg_char_zerocopy -r 2 -e carveout_ipc-memories@a5000000 -t 0x01010101 -n 50
    Created endpt device rpmsg-char-2-2009, fd = 4 port = 1025
    Exchanging 50 messages with rpmsg device on rproc id 2 ...
    
    dma-buf address: 0xa5000000
    root@am64xx-evm:~/241206_zerocopy_test#
    Completed 50 buffer updates successfully on rpmsg-char-2-2009
    
    Buffer size = 1048576 bytes = 1024 kbytes
    Pattern = 0x01010101
    Total execution time for the test: 3 seconds
    Many different latencies were measured in this test.
    
    latency = Linux>R5F RPMsg + R5F read/write + R5F>Linux RPMsg.
    Average latency: 12432
    Worst-case latency: 13805
    Histogram data at latency_histogram.txt
    
    sync1 = Linux time to sync dmabuf before writing.
    Average sync1: 1
    Worst-case sync1: 8
    Histogram data at sync1_histogram.txt
    
    buffer_write = Linux time to write dmabuf.
    Average buffer_write: 7129
    Worst-case buffer_write: 7558
    Histogram data at buffer_write_histogram.txt
    
    sync2 = Linux time to sync dmabuf after writing.
    Average sync2: 1
    Worst-case sync2: 6
    Histogram data at sync2_histogram.txt
    
    sync3 = Linux time to sync dmabuf before reading.
    Average sync3: 2
    Worst-case sync3: 12
    Histogram data at sync3_histogram.txt
    
    buffer_read = Linux time to read dmabuf *IN MSEC*.
    Average buffer_read: 52
    Worst-case buffer_read: 62
    Histogram data at buffer_read_histogram.txt
    
    // NOTE: the sync4 output below is WRONG; the real avg = 2, worst-case = 10
    // (see the histogram further below)
    // for some reason two extra 19057 values got added on top of
    // the 50 valid measurements
    sync4 = Linux time to sync dmabuf after reading.
    Average sync4: 764
    Worst-case sync4: 19057
    Histogram data at sync4_histogram.txt
    
    Number of iterations should = 50
    latency iterations = 49
    sync1 iterations = 50
    buffer_write iterations = 50
    sync2 iterations = 50
    sync3 iterations = 50
    buffer_read iterations = 48
    sync4 iterations = 52
    TEST STATUS: PASSED
    
    # vi sync4_histogram.txt
    // all 50 measurements are <= 10usec
    0 , 0
    1 , 28
    2 , 12
    3 , 2
    4 , 0
    5 , 4
    6 , 1
    7 , 0
    8 , 2
    9 , 0
    10 , 1
    

    Regards,

    Nick

  • This is what "write data" (buffer_write) measures:

    void buffer_init
    ...
            for(i = 0; i < lbuf->size / sizeof(uint32_t); i++)
                    lbuf->shared_buf[i] = pattern;

    and this is what "read data" (buffer_read) measures:

    int buffer_validate
    ...
        for (i = 0; i < len; i++) {
                if (lbuf->shared_buf[i] != pattern)
                        break;
        }
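
    The timing around those two loops is just CLOCK_MONOTONIC bracketing. A simplified sketch of the idea (the struct shape and call signatures here are assumed from the excerpts above, not the exact instrumented code):

    #include <stddef.h>
    #include <stdint.h>
    #include <time.h>

    /* assumed shapes, just to make the sketch hang together */
    struct local_buf {
            uint32_t *shared_buf;
            size_t    size;
    };
    void buffer_init(struct local_buf *lbuf, uint32_t pattern);
    int  buffer_validate(struct local_buf *lbuf, uint32_t pattern);

    static uint64_t ts_usec(void)
    {
            struct timespec ts;

            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
    }

    static void time_one_iteration(struct local_buf *lbuf, uint32_t pattern)
    {
            uint64_t t0, buffer_write_usec, buffer_read_usec;

            t0 = ts_usec();
            buffer_init(lbuf, pattern);        /* the write loop above */
            buffer_write_usec = ts_usec() - t0;

            t0 = ts_usec();
            buffer_validate(lbuf, pattern);    /* the read loop above */
            buffer_read_usec = ts_usec() - t0;

            /* feed these into the average / worst-case / histogram bookkeeping */
            (void)buffer_write_usec;
            (void)buffer_read_usec;
    }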