SK-AM64B: RPmsg between A53 and R5 performance update

Part Number: SK-AM64B
Other Parts Discussed in Thread: TMDS64EVM

Hi,

I measured the RPMsg communication performance between a user-space task on the A53 and the R5F running FreeRTOS. The payload was 4 KB. I see some outliers of 2-3 milliseconds. Could you please provide some performance data?

cyclictest looks good, at around 60 us.

Right now I'm using:

SDK 09_02_01_10 with Kernel 6.1 on the SK-AM64B.

and

https://git.ti.com/cgit/rpmsg/rpmsg_char_zerocopy/

I tried changing the example to use TCM only for the 4 KB data, but saw no improvement there.

Thanks a lot

  • Hi,

I'm not completely sure what you're seeing, but outliers of 2-3 milliseconds could be due to a known issue. The Linux kernel currently (TI SDK 09.02.01) handles mailbox messages (which are used to trigger RPMsg handling) on a workqueue, i.e. without any kind of prioritization. There are patches on LKML from Andrew Davis that improve this by using a threaded interrupt handler. I'm not sure when these will be available in a TI SDK; maybe someone from TI knows more.

    Please note that I'm not TI. They might have other advice for you.

    Regards,

    Dominic

  • Hello Chris,

I heard through the grapevine that Dominic might have provided some code for you? I'm not sure what kernel version it was based on. Here's the code on kernel 6.10:
    https://lore.kernel.org/linux-kernel/20240410135942.61667-1-afd@ti.com/
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3f58c1f4206f37d0af4595a9046c76016334b301

    No update yet from the developer on our side on backporting the code to kernel 6.6 - I'm still hoping we can get the patches in for SDK 10.0, but we'll see. I've got a sync with the developer in the middle of the week.

    Regards,

    Nick

  • The developers have backported the code to kernel 6.6. We are going through the code review now before merging in the patches.

    I am working on benchmarking code now so that we can quantify performance. Planning to run tests on kernel 6.6 on Thursday or Friday.

    Regards,

    Nick

  • Hi Nick,

thanks for the feedback. Looking forward to seeing the results.


You are right, I have already tested a kernel patch from Dominic Rath. The outliers are now a lot better: we are seeing roughly 400 us with a 4 KB payload.


    It would be great if you could provide us your patches for testing.

    Regards,

    Chris

  • Hello Chris,

    it seems the patches already made their way into TI's Git repository:

    https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/log/?h=ti-rt-linux-6.6.y-cicd

I'm not sure whether it makes more sense to try this kernel on an SDK 09.02.01 installation, or to use a 10.00 build from TI's CICD.

    Regards,

    Dominic

  • Hello Chris,
    Hello Nick,

running kernel 6.6 on a 09.02.01.10 Debian image works just fine, BUT it seems that ti-rt-linux-6.6.y-cicd is missing the misc/dma-buf-phys.c "driver" that the zerocopy example relies on to determine the physical address of the DMA-buf.

    Cherry-picking the commit from the 6.1 kernel worked. The driver compiles and works just the same.

Can you find out if there's a reason why dma-buf-phys is not (yet) in 6.6? It would be great if that could be included in time for SDK 10.00.00.

    Regards,

    Dominic

  • Hello Chris,

    Quick code review

    Dominic shared some of your code with me. I won't go super in-depth because today I'm working on running some other benchmarks on those patches Dominic pointed to above, but here are some notes:

    To double check: is this the latency you wanted to measure, or were you hoping to get the latency for something else?

    Linux RPMsg to R5F --> R5F reads 4kB of R5F TCM --> R5F writes 4kB to R5F TCM --> R5F RPMsg to Linux

    The code I currently have access to is NOT measuring the time for Linux to write to the R5F's TCM memory, or for Linux to read from the R5F's TCM memory.

    The R5F code is using DDR (assuming you are using rproc_id = 2 = R5F_MAIN0_0, as per https://git.ti.com/cgit/rpmsg/ti-rpmsg-char/tree/include/rproc_id.h ). The R5F code might run slightly faster if you move those R5F DDR data allocations to TCM (best case) or SRAM (if you run out of TCM memory).

    If this is the linker.cmd file you are using: https://git.ti.com/cgit/rpmsg/rpmsg_char_zerocopy/tree/rtos/am64x-evm/r5fss0-0_freertos/ti-arm-clang/linker.cmd

    Keep the DDR_0 allocation - this is needed for Linux to initialize the remote core

    DDR_1 allocations are what we want to get rid of.

FIRST, build the binary file, then check the generated .map file to see how much memory each region is using. For more information, refer to https://dev.ti.com/tirex/explore/node?a=7qm9DIS__LATEST&node=A__AdPavpRhU8yrU-EQk33UdQ__AM64-ACADEMY__WI1KRXP__LATEST

    Now that you know how much memory is being used, you can move around memory allocations as needed in the linker.cmd file.

    Other Updates 

    As Dominic noted, we already merged the patches into the ti-linux-kernel repo for kernel 6.6 so that we can try to get them into the Linux SDK 10.0. Today and tomorrow I am running rigorous benchmarks to test performance, and what performance improvements we see over the "baseline" before the patches were applied.

    The zerocopy example has not yet been ported to kernel 6.6. Dominic's solution of cherry-picking the dma-buf-phys commit from 6.1 should be good enough for now to enable your proof-of-concept. Our tentative plan is to migrate the zerocopy example to use a different memory sharing solution called remoteproc CDEV in kernel 6.6, but we'll get that done after we make sure we've enabled you.

    Regards,

    Nick

  • Final post of the day:

    One of the big improvements of the workqueue fix is that we should be able to effectively increase the priority of the IPC task.

I'm still figuring out the best way to do this (e.g., elevate the priority of the userspace application? Find some way to elevate the priority of the Linux mailbox driver?). But for now I am seeing big differences in results when I elevate priority as opposed to when I don't.
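
    For reference, one way to raise the priority from inside the userspace application itself (rather than wrapping it with chrt as in the tests below) would be sched_setscheduler(); this is just a minimal sketch, not the code used for these tests:

    #include <sched.h>
    #include <stdio.h>
    
    /* Put the calling process on the SCHED_FIFO real-time scheduling class.
     * Requires root or CAP_SYS_NICE. */
    static int set_rt_priority(int prio)
    {
    	struct sched_param param = { .sched_priority = prio };
    
    	if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
    		perror("sched_setscheduler");
    		return -1;
    	}
    	return 0;
    }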

    Tests on kernel 6.6, with the mailbox patches.

    regular priority:

    root@am64xx-evm:~# ./rpmsg_char_simple -r 2 -n 1000000
    Created endpt device rpmsg-char-2-1040, fd = 4 port = 1025
    Exchanging 1000000 messages with rpmsg device rpmsg-char-2-1040 on rproc id 2 ...
    
    
    Communicated 1000000 messages successfully on rpmsg-char-2-1040
    
    Total execution time for the test: 35 seconds
    Average round-trip latency: 35
    Worst-case round-trip latency: 1407
    Histogram data at histogram.txt
    TEST STATUS: PASSED
    

    elevated priority:

    root@am64xx-evm:~# ./rpmsg_char_simple -r 2 -n 1000000 & chrt -f -p 10 $!
    root@am64xx-evm:~# Created endpt device rpmsg-char-2-1024, fd = 4 port = 1025
    Exchanging 1000000 messages with rpmsg device rpmsg-char-2-1024 on rproc id 2 ...
    
    
    Communicated 1000000 messages successfully on rpmsg-char-2-1024
    
    Total execution time for the test: 35 seconds
    Average round-trip latency: 34
    Worst-case round-trip latency: 134
    Histogram data at histogram.txt
    TEST STATUS: PASSED
    

    Regards,

    Nick

  • Hi Nick,

thanks for the update. You are right, the code is not measuring the time that Linux needs to write and read the data. I have to change this in the final test.

    Actually, we want to measure the following:

    R5F writes 4kB to R5F TCM -> R5F RPMsg to Linux -> Linux reads 4kB from R5F TCM -> Linux writes 4kB to R5F TCM -> Linux RPMsg to R5F -> R5F reads 4kB from R5F TCM

The first write and the last read on the R5F are Ethernet data. I'm pretty sure there is a nice way to get that data into the right memory quickly.

Anyway, for now we care about the communication time. This is the time we want to compare against our current design.

  • Hi Nick,

    I can see the improvement. The delays are a lot better now.

    I changed the time measurement to the following:

    Linux writes 4kB to R5F TCM -> Linux RPMsg to R5F-> R5F reads 4kB from R5F TCM -> R5F writes 4kB to R5F TCM -> R5F RPMsg to Linux -> Linux reads 4kB from R5F TCM (that includes all we need)

The time went back to, let's say, "not that fast" - which is understandable.

Now the question is which memory is best to use. The new bottleneck seems to be the memory access. Maybe we are already working with the best trade-off between the two worlds.

    have a nice weekend!

    regards,

    Chris 

  • Hello Chris,

    Ok, I've done some initial benchmarks for round-trip latency with the default RPMsg Echo example (NOT the shared memory example yet). Hopefully this gives you an idea of what kind of performance you can expect from RPMsg. I'll run one last test over the weekend and add those results later.

    Next week I'll ask around to see if anyone has insight into optimizing Linux reads & writes to internal memory - all my key contacts were out today.

    Summary 

RT Linux is still a Linux environment, and because of that there IS still the chance for very rare, unbounded behavior. However, the mailbox workqueue patches do lead to much cleaner behavior, especially when the application is granted elevated priority (which is one of the key ways we tune RT Linux performance).

    Testing Environment 

    AM64x EVM (TMDS64EVM)

    Version of Linux: ti-linux-kernel branch ti-rt-linux-6.6.y-cicd before and after adding the mailbox patches
    Before mailbox patches: https://software-dl.ti.com/cicd-report/linux/index.html?section=snapshot&platform=am64xx&snapshot=cicd.scarthgap.202407230400 (used an internal filesystem build of tisdk-default-image-rt-am64xx-evm, NOT the thinlinux filesystem)
After mailbox patches: internal build cicd.scarthgap.202407250901

    Version of ti-rpmsg-char example: attached

    /*
     * rpmsg_char_simple.c
     *
     * Simple Example application using rpmsg-char library
     *
     * Copyright (c) 2020 Texas Instruments Incorporated - https://www.ti.com
     *
     * Redistribution and use in source and binary forms, with or without
     * modification, are permitted provided that the following conditions
     * are met:
     *
     * *  Redistributions of source code must retain the above copyright
     *    notice, this list of conditions and the following disclaimer.
     *
     * *  Redistributions in binary form must reproduce the above copyright
     *    notice, this list of conditions and the following disclaimer in the
     *    documentation and/or other materials provided with the distribution.
     *
     * *  Neither the name of Texas Instruments Incorporated nor the names of
     *    its contributors may be used to endorse or promote products derived
     *    from this software without specific prior written permission.
     *
     *  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
     *  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
     *  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
     *  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
     *  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
     *  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
     *  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
     *  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
     *  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
     *  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
     *  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
     *
     */
    
    #include <sys/select.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/ioctl.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <pthread.h>
    #include <time.h>
    #include <stdbool.h>
    #include <semaphore.h>
    
    #include <linux/rpmsg.h>
    #include <ti_rpmsg_char.h>
    
    #define NUM_ITERATIONS	100
    #define REMOTE_ENDPT	14
    
    /*
     * This test can measure round-trip latencies up to 20 ms
     * Latencies measured in microseconds (us)
     */
    #define LATENCY_RANGE 20000
    
    int send_msg(int fd, char *msg, int len)
    {
    	int ret = 0;
    
    	ret = write(fd, msg, len);
    	if (ret < 0) {
    		perror("Can't write to rpmsg endpt device\n");
    		return -1;
    	}
    
    	return ret;
    }
    
    int recv_msg(int fd, int len, char *reply_msg, int *reply_len)
    {
    	int ret = 0;
    
    	/* Note: len should be max length of response expected */
    	ret = read(fd, reply_msg, len);
    	if (ret < 0) {
    		perror("Can't read from rpmsg endpt device\n");
    		return -1;
    	} else {
    		*reply_len = ret;
    	}
    
    	return 0;
    }
    
    /* single thread communicating with a single endpoint */
    int rpmsg_char_ping(int rproc_id, char *dev_name, unsigned int local_endpt, unsigned int remote_endpt,
    		    int num_msgs)
    {
    	int ret = 0;
    	int i = 0;
    	int packet_len;
    	char eptdev_name[64] = { 0 };
    	/*
    	 * Each RPMsg packet can have up to 496 bytes of data:
    	 * 512 bytes total - 16 byte header = 496
    	 */
    	char packet_buf[496] = { 0 };
    	rpmsg_char_dev_t *rcdev;
    	int flags = 0;
    	struct timespec ts_current;
    	struct timespec ts_end;
    
    	/*
    	 * Variables used for latency benchmarks
    	 */
    	struct timespec ts_start_test;
    	struct timespec ts_end_test;
    	/* latency measured in us */
    	int latency = 0;
    	int latencies[LATENCY_RANGE] = {0};
    	int latency_worst_case = 0;
    	double latency_average = 0; /* try double, since long long might have overflowed w/ 1Billion+ iterations */
    	FILE *file_ptr;
    
            /*
             * Open the remote rpmsg device identified by dev_name and bind the
    	 * device to a local end-point used for receiving messages from
    	 * remote processor
             */
    	sprintf(eptdev_name, "rpmsg-char-%d-%d", rproc_id, getpid());
    	rcdev = rpmsg_char_open(rproc_id, dev_name, local_endpt, remote_endpt,
    				eptdev_name, flags);
            if (!rcdev) {
    		perror("Can't create an endpoint device");
    		return -EPERM;
            }
            printf("Created endpt device %s, fd = %d port = %d\n", eptdev_name,
    		rcdev->fd, rcdev->endpt);
    
            printf("Exchanging %d messages with rpmsg device %s on rproc id %d ...\n\n",
    		num_msgs, eptdev_name, rproc_id);
    
    	clock_gettime(CLOCK_MONOTONIC, &ts_start_test);
    
    	for (i = 0; i < num_msgs; i++) {
    		memset(packet_buf, 0, sizeof(packet_buf));
    
    		/* minimum test: send 1 byte */
    		sprintf(packet_buf, "h");
    		/* "normal" test: do the hello message */
    		//sprintf(packet_buf, "hello there %d!", i);
    		/* maximum test: send 496 bytes */
    		//sprintf(packet_buf, "0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345");
    
    		packet_len = strlen(packet_buf);
    
    		/* double-check: is packet_len changing, or fixed at 496? */
    		//printf("packet_len = %d\n", packet_len);
    
    		/* remove prints to speed up the test execution time */
    		//printf("Sending message #%d: %s\n", i, packet_buf);
    
    		clock_gettime(CLOCK_MONOTONIC, &ts_current);
    		ret = send_msg(rcdev->fd, (char *)packet_buf, packet_len);
    		if (ret < 0) {
    			printf("send_msg failed for iteration %d, ret = %d\n", i, ret);
    			goto out;
    		}
    		if (ret != packet_len) {
    			printf("bytes written does not match send request, ret = %d, packet_len = %d\n",
    				ret, packet_len);
    			goto out;
    		}
    
    		ret = recv_msg(rcdev->fd, packet_len, (char *)packet_buf, &packet_len);
    		clock_gettime(CLOCK_MONOTONIC, &ts_end);
    		if (ret < 0) {
    			printf("recv_msg failed for iteration %d, ret = %d\n", i, ret);
    			goto out;
    		}
    
    		/* latency measured in usec; include tv_sec so that measurements
    		 * crossing a second boundary don't come out negative */
    		latency = (ts_end.tv_sec - ts_current.tv_sec) * 1000000 +
    			  (ts_end.tv_nsec - ts_current.tv_nsec) / 1000;
    
    		/* if latency does not fit in the histogram, throw an error and exit */
    		if (latency >= LATENCY_RANGE) {
    			printf("latency is too large to be recorded: %d usec\n", latency);
    			goto out;
    		}
    
    		/* increment the counter for that specific latency measurement */
    		latencies[latency]++;
    
    		/* remove prints to speed up the test execution time */
    		//printf("Received message #%d: round trip delay(usecs) = %ld\n", i,(ts_end.tv_nsec - ts_current.tv_nsec)/1000);
    		//printf("%s\n", packet_buf);
    	}
    
    	clock_gettime(CLOCK_MONOTONIC, &ts_end_test);
    
    	/* find worst-case latency */
    	for (i = LATENCY_RANGE - 1; i > 0; i--) {
    		if (latencies[i] != 0) {
    			latency_worst_case = i;
    			break;
    		}
    	}
    
    	/* WARNING: The average latency calculation is currently being validated */
    	/* find the average latency */
    	for (i = LATENCY_RANGE - 1; i > 0; i--) {
    		/* e.g., if latencies[60] = 17, that means there was a latency of 60us 17 times */
    		latency_average = latency_average + (latencies[i]*i)/num_msgs;
    	}
    	/* old code from using long long instead of double */
    	/*latency_average = latency_average / num_msgs;*/
    
    	/* export the latency measurements to a file */
    	file_ptr = fopen("histogram.txt", "w");
    
    	for (unsigned int i = 0; i < LATENCY_RANGE; i++)
    	{
    		fprintf(file_ptr, "%d , ", i);
    		fprintf(file_ptr, "%d", latencies[i]);
    		fprintf(file_ptr, "\n");
    	}
    	fclose(file_ptr);
    
    	printf("\nCommunicated %d messages successfully on %s\n\n",
    		num_msgs, eptdev_name);
    	printf("Total execution time for the test: %ld seconds\n",
    		ts_end_test.tv_sec - ts_start_test.tv_sec);
    	printf("Average round-trip latency: %f\n", latency_average);
    	printf("Worst-case round-trip latency: %d\n", latency_worst_case);
    	printf("Histogram data at histogram.txt\n");
    
    out:
    	ret = rpmsg_char_close(rcdev);
    	if (ret < 0)
    		perror("Can't delete the endpoint device\n");
    
    	return ret;
    }
    
    void usage()
    {
    	printf("Usage: rpmsg_char_simple [-r <rproc_id>] [-n <num_msgs>] [-d \
    	       <rpmsg_dev_name] [-p <remote_endpt] [-l <local_endpt] \n");
    	printf("\t\tDefaults: rproc_id: 0 num_msgs: %d rpmsg_dev_name: NULL remote_endpt: %d\n",
    		NUM_ITERATIONS, REMOTE_ENDPT);
    }
    
    int main(int argc, char *argv[])
    {
    	int ret, status, c;
    	int rproc_id = 0;
    	int num_msgs = NUM_ITERATIONS;
    	unsigned int remote_endpt = REMOTE_ENDPT;
    	unsigned int local_endpt = RPMSG_ADDR_ANY;
    	char *dev_name = NULL;
    
    	while (1) {
    		c = getopt(argc, argv, "r:n:p:d:l:");
    		if (c == -1)
    			break;
    
    		switch (c) {
    		case 'r':
    			rproc_id = atoi(optarg);
    			break;
    		case 'n':
    			num_msgs = atoi(optarg);
    			break;
    		case 'p':
    			remote_endpt = atoi(optarg);
    			break;
    		case 'd':
    			dev_name = optarg;
    			break;
    		case 'l':
    			local_endpt = atoi(optarg);
    			break;
    		default:
    			usage();
    			exit(0);
    		}
    	}
    
    	if (rproc_id < 0 || rproc_id >= RPROC_ID_MAX) {
    		printf("Invalid rproc id %d, should be less than %d\n",
    			rproc_id, RPROC_ID_MAX);
    		usage();
    		return 1;
    	}
    
    	/* Use auto-detection for SoC */
    	ret = rpmsg_char_init(NULL);
    	if (ret) {
    		printf("rpmsg_char_init failed, ret = %d\n", ret);
    		return ret;
    	}
    
    	status = rpmsg_char_ping(rproc_id, dev_name, local_endpt, remote_endpt, num_msgs);
    
    	rpmsg_char_exit();
    
    	if (status < 0) {
    		printf("TEST STATUS: FAILED\n");
    	} else {
    		printf("TEST STATUS: PASSED\n");
    	}
    
    	return 0;
    }
    

Version of R5F firmware: whatever is built into SDK 10.0 by default

    Tests with 10 million iterations

Summary of round-trip latencies

    basic run
    ./rpmsg_char_simple -r 2 -n 10000000
    w/ patches:  avg 34 us, worst 2,347 us
    w/o patches: avg 35 us, worst 2,564 us

    increase priority
    ./rpmsg_char_simple -r 2 -n 10000000 & chrt -f -p 10 $!
    w/ patches:  avg 32 us, worst 118 us
    w/o patches: avg 58 us, worst 1,207 us

    background load on Linux
    stress-ng --cpu-method=all -c 4 &
    ./rpmsg_char_simple -r 2 -n 10000000
    w/ patches:  avg 57 us, worst 6,582 us
    w/o patches: avg 95 us, worst 6,941 us

    increase priority, background load on Linux
    stress-ng --cpu-method=all -c 4 &
    ./rpmsg_char_simple -r 2 -n 10000000 & chrt -f -p 10 $!
    w/ patches:  avg 31 us, worst 168 us
    w/o patches: avg 58 us, worst 5,903 us

Histogram plots

    [Histogram images were attached in the original post for each of the four tests above (basic run; increase priority; background load on Linux; increase priority with background load on Linux), each with and without the mailbox patches.]

    Tests with 1 billion iterations 

Summary of round-trip latencies

    increase priority
    ./rpmsg_char_simple -r 2 -n 1000000000 & chrt -f -p 10 $!
    w/ patches:  avg 34 us, worst 2,243 us
    w/o patches: avg 60 us, worst 2,609 us

    increase priority, background load on Linux
    stress-ng --cpu-method=all -c 4 &
    ./rpmsg_char_simple -r 2 -n 1000000000 & chrt -f -p 10 $!
    w/ patches (2 runs):  avg 32 us / 32 us, worst 240 us / 236 us
    w/o patches:          avg 57 us, worst 8,358 us

Histogram plots

    [Histogram images were attached in the original post for the two tests above (increase priority, including a zoomed view; increase priority with background load on Linux), each with and without the mailbox patches.]

    Hope this helps,

    Nick

  • Hello Chris & Dominic,

    All the 1 billion iteration tests finished running, and the raw data has been added above. I am still double-checking the results with different developers on my side to make sure everything makes sense, but here is the high level summary:

    Key takeaways:

    1) "average" performance with and without priority elevation is about the same. HOWEVER, increasing the priority of the application combined with the workqueue patches leads to MUCH cleaner data.

    2) Over 3 billion iterations with increased priority (three separate 1 billion iteration tests), the code with the workqueue patches had only 2 iterations where the latency was higher than 300us.

    This performance is about what I would expect - RT Linux can be made much more real-time than "regular" Linux, but it is still NOT AN RTOS. While we seem to have put a "soft" upper bound on the worst-case latency, I am not sure if it is possible to implement a "hard" upper bound on the worst case latency, like we would with a true RTOS. With that said, histogram plots are much more useful here than just "average" and "worst case" numbers in order to get a feel for the likelihood that the software will behave in a particular way.

    Regards,

    Nick

One more note, if you decide to adapt the code I attached for your own tests: the "average latency" code breaks in the 1 billion test range, so I had to use Excel to manually calculate the averages. I'm still working on appropriate replacement code - you can't brute-force it, because even long long datatypes eventually overflow with enough iterations, so I'm looking into other options.

    Regards,

    Nick

  • Hello Chris,

    Final update of the day. I am looking into Linux latency to read/write internal memory now. If you want me looking into something else, let me know.

    Commentary on the above results 

    Test method

    Our RT Linux developer confirmed that the test methodology above looks good, with the comment that stress-ng is good for simulating high processor loads. You could also use a test that does a bunch of DDR accesses to stress the processor in a different way - that test might not be as useful for your system, but it could be handy for a system with graphics that would need to have heavy DDR accesses.

    Do the results make sense? 

Yes, the results look like what he would expect. Even the 2 outliers in the 3 billion iterations are reasonable - he said that, unfortunately, outliers like that in any RT Linux system are unbounded, so there isn't necessarily a way to limit how bad the latency on a rare outlier is going to be.

    In terms of what CAUSED those outliers, it could be anything. Sometimes you can debug and remove outliers like that, but RT Linux will always have the chance that unbounded latency will occur after the system has run for long enough.

    TI is NOT currently in a position to guide customers through optimizing their RT systems, since there are a LOT of different settings to adjust. But the general process is to disable EVERY piece of code that is not being used in order to minimize the code that can cause issues, and then work from there. For example, the above tests were run on the "default" filesystem image, which has a bunch of extra example code on it. So someone trying to optimize an RT Linux system might instead start with the "base" filesystem image to see if outliers were still observed, run a test for multiple days to get many billions of iterations, and work from there.

    What about optimizing reads & writes to shared memory? 

    I am still looking into this. It seems like there are a bunch of tactics for trying to speed up accesses to DDR, but it does not seem like these tactics apply to SRAM or TCM memory accesses (I haven't tested at this point). I'll include the optimizations here for my own reference and any future customers.

    DDR optimizations

    Linux often does not actually allocate the pages of memory when it assigns memory to a program - instead, it allocates the pages of memory when the program actually wants to use the memory, which can slow performance. Some people "prewarm" the pages by touching all the memory pages they need BEFORE they actually need to access the memory, to make sure that the memory is allocated and ready to use.

Similarly, Linux can flush pages that were allocated, requiring the pages to be allocated AGAIN the next time the program wants to use them. Pages can be "locked" to prevent them from getting flushed (see the sketch below).

    Make sure that the kernel image & file system take up as little DDR as possible. The more free space there is in DDR, the faster DDR accesses occur.

    Remove unneeded code from your kernel & filesystem. The DDR reads & writes that you care about will go faster if there is not a lot of unneeded code that is also trying to access the DDR in the background.

    There are also ways to adjust settings in the DDR hardware itself (instead of in Linux) - the code Pekka discussed that gives additional priority to reads is an example of these settings. These can be helpful as well (e.g., we can prioritize DDR accesses from specific cores in the QoS (Quality of Service) settings), but they are not very specific (e.g., the DDR controller cannot be programmed to treat one Linux application's DDR accesses differently from another Linux application's accesses - all the DDR knows is that both accesses come from the A53 core).
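
    As a minimal, generic Linux userspace sketch of the prewarming and page-locking ideas above (not code from the zerocopy project):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>
    
    #define BUF_SIZE (4 * 1024)
    
    int main(void)
    {
    	/* Lock all current and future pages of this process in RAM so the
    	 * kernel cannot page them out (requires CAP_IPC_LOCK or a suitable
    	 * RLIMIT_MEMLOCK). */
    	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
    		perror("mlockall");
    		return 1;
    	}
    
    	char *buf = malloc(BUF_SIZE);
    	if (!buf)
    		return 1;
    
    	/* "Prewarm" the buffer: touch every page once so the kernel faults
    	 * the pages in now, instead of during the latency-critical path. */
    	long page = sysconf(_SC_PAGESIZE);
    	for (size_t off = 0; off < BUF_SIZE; off += (size_t)page)
    		buf[off] = 0;
    
    	/* ... latency-critical reads/writes of buf happen here ... */
    
    	free(buf);
    	return 0;
    }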

    SRAM / TCM memory accesses 

    I'm still digging into this.

    Regards,

    Nick

  • Hello Chris,

    No substantial updates for you at this point. I've spent some time looking into different options for optimizing reads & writes, but so far most of what I have looked into is theoretical, and I don't have any code I can share.

(Some topics I've been looking into: whether it's faster for Linux to write to one memory location and the R5F to write to a different memory location. For R5F -> A53, I know Dominic has looked into writing from the R5F directly into the A53's L2 cache through the ACP port (thread here), but I haven't figured out yet whether that's something we can take advantage of if Linux is running on the A53 cores, or whether it's specific to RTOS on the A53s. I can also see in the Linux remoteproc driver where the TCM memory is mapped as "write combining" with devm_ioremap_wc, but I haven't figured out whether something similar can be done from Linux userspace so we can benchmark whether it provides any kind of performance improvement.)

    If there is something specific you'd like me to focus on, let me know.

    Regards,

    Nick

  • Hi Nick,

last update before the holiday. I just want to share my measurement results for TCM vs. DDR. I cannot see a big difference between the memories. What I can see in all of my measurements is a lower limit: there is a floor imposed by Linux, and there is no way to get faster than that.

All of my tests run in a 10-million-iteration loop:

    stress-ng --cpu-method=all -c 4 & ./rpmsg_char_zerocopy -s 160 -r 2 -e carveout_apps-shared-memory -t 0x01010101 -n 10000000

Memory             Payload (bytes)   avg (us)   max (us)
    DDR 0xA600_0000    160               56         195
                       320               70         169 *)
                       1280              144        324
                       2560              256        424
                       5120              448        663
    TCM 0x7810_0000    160               50         243
                       320               70         226
                       1280              162        318
                       2560              281        427
                       5120              527        674

*) That the max value with the 320-byte payload is lower than the one with 160 bytes comes down to the test duration.

Right now we are working with these results to see if we can meet our requirements. Unfortunately, the performance win that we get from using IPC is nearly eaten up by the slightly lower single-core performance of the Cortex-A53 compared with the Cortex-A9.

  • Hello Chris,

    Results with elevated Linux priority?

Out of curiosity, if you increase the priority of the rpmsg_char_zerocopy application, does that lead to any differences in performance? Something like this:

    stress-ng --cpu-method=all -c 4 &

./rpmsg_char_zerocopy -s 160 -r 2 -e carveout_apps-shared-memory -t 0x01010101 -n 10000000 & chrt -f -p 10 $!

    Optimize the MCU+ code 

    When I look at the AM64x MCU+ code in the zerocopy project, a lot of the instructions & data are being stored in DDR, which I would expect to slow down execution significantly. Can you try with all of those extra data regions getting stored in SRAM instead to see if R5F becomes more responsive?

    i.e., go here: 
    https://git.ti.com/cgit/rpmsg/rpmsg_char_zerocopy/tree/rtos/am64x-evm/r5fss0-0_freertos/ti-arm-clang/linker.cmd

    And everywhere you see "DDR_1", replace it with "MSRAM" (except for the MEMORY{} section, just leave that alone for now)
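
    For illustration, the edit would look roughly like this in the SECTIONS part of linker.cmd - the output section names below are generic placeholders, so match them to whatever the project's linker.cmd actually maps to DDR_1:

    SECTIONS
    {
        /* sections that the file places in DDR_0 stay in DDR_0 */
    
        .text:   {} palign(8) > MSRAM    /* was: > DDR_1 */
        .rodata: {} palign(8) > MSRAM    /* was: > DDR_1 */
        .data:   {} palign(8) > MSRAM    /* was: > DDR_1 */
        .bss:    {} palign(8) > MSRAM    /* was: > DDR_1 */
    }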

    You shouldn't need to make any other changes. When I check the .map output file from the default project, it is putting 0x20ef0 of data into DDR_1, but allocating 0x40000 of SRAM, so you already have a large enough allocation from the RTOS side. Just make sure that your SRAM allocation on the Linux side doesn't overlap 0x70080000 - 0x700C0000.

    OUTPUT FILE NAME:   <ipc_rpmsg_zerocopy_linux.release.out>
    ENTRY POINT SYMBOL: "_vectors"  address: 00000000
    
    
    MEMORY CONFIGURATION
    
             name            origin    length      used     unused   attr    fill
    ----------------------  --------  ---------  --------  --------  ----  --------
      R5F_VECS              00000000   00000040  00000040  00000000  RWIX
      R5F_TCMA              00000040   00007fc0  000010f8  00006ec8  RWIX
      R5F_TCMB0             41010000   00008000  00000000  00008000  RWIX
      FLASH                 60100000   00080000  00000000  00080000  RWIX
      MSRAM                 70080000   00040000  00000000  00040000  RWIX
      LINUX_IPC_SHM_MEM     a0000000   00100000  00000000  00100000  RWIX
      DDR_0                 a0100000   00001000  00001000  00000000  RWIX
      DDR_1                 a0101000   00eff000  00020ef0  00ede110  RWIX
      USER_SHM_MEM          a5000000   00000080  00000000  00000080  RWIX
      LOG_SHM_MEM           a5000080   00003f80  00000000  00003f80  RWIX
      RTOS_NORTOS_IPC_SHM_M a5004000   0000c000  00006680  00005980  RWIX
    

    Then you can rebuild the firmware like this and you're good to go:
    rpmsg_char_zerocopy/rtos$ make -s -C am64x-evm/system_freertos all
    (assuming you already exported the MCU_PLUS_SDK_PATH as discussed at https://git.ti.com/cgit/rpmsg/rpmsg_char_zerocopy/tree/rtos/README.md )

    The code would run even faster if you put some of it in TCM, but that takes a couple extra steps. I can make those changes for you if you want to share the test code with me offline.

    Performance of A53 vs A9? 

    I am surprised to hear that you're seeing lower performance on AM64x than AM437x. Could you point me to some of the tests that you're running?

    Regards,

    Nick

  • Hi Nick,

yes, my results are with a high-priority RT task. In my code the communication is done within a separate RT task with priority 99. That's the reason why my call works without "chrt -f -p 10 $!". But you are absolutely right, I should have mentioned that.

    I will try to change my example code to SRAM and do some new measurements.

Regarding the core performance I mentioned earlier: I am talking about the single-core performance of the A53. I did some CoreMark measurements, and the results I see fit other published A53 results, even from other vendors.

    https://www.eembc.org/coremark/scores.php

The single-core performance is important because we are not sure whether we can parallelize our customer's application. But I guess that could or should be discussed in another thread.

    Regards,

    Chris

  • Hello Chris,

    Gotcha, ok.

    I was double-checking with one of our developers, and they suggested real-time tasks be assigned a priority between 40 and 98 (so for my future tests, I'll probably use something like 80 instead of the 10 used in my previous tests).

    Single core performance would be good for another thread.

    Another potential option for reduced shared memory latency - cache coherency? 

    The Ethernet drivers currently work by using DMA to directly copy Ethernet frames into the A53's L2 cache. It looks like we SHOULD be able to use the R5F to also write data directly into the A53's L2 cache through the ACP port. If we can get that working, it should significantly reduce the latency to send data from R5F to the A53.

    For more information about the basic concept, refer to section "IO Coherency Support" in the AM64x Technical Reference Manual (TRM).

As far as I can tell, we should be able to set ASEL = 14. That means that all R5F writes (TO VERIFY: even writes to internal TCM addresses?) would get routed through the A53's L2 cache.

    ASEL 15 does not update the data values in Cache, but it marks the cache values as "dirty" (i.e., they are now out-of-date). Then the write goes on to actually write the value to the address in memory. This option would not actually save us much time.

    ASEL 14 actually updates the data values in the cache, so that when A53 goes to read the values, it just reads out of cache instead of making a read to the actual shared memory region.

    The associated register for R5F0 core 0 is QOS_LITE_MAIN_0_CPU0_WMST_MAP0. You could double-check the register value from Linux with
    devmem2 0x45D82D00 w 

    There is also a register for reading from the L2 cache: QOS_LITE_MAIN_0_CPU0_RMST_MAP0 (0x45D8_2900)

    Outstanding questions:

    Does this cache coherency work with TCM memory, or ONLY SRAM or DDR addresses? (SRAM & DDR are mentioned in the registers)

    For shared memory writes from A53 to the R5F, is it faster to do a cache flush to the TCM? Or to just write to the L2 cache, and then have the R5F read directly from the L2 cache with the RMST register above?

    I'll try to spend more time looking into this tomorrow.

    Regards,

    Nick

  • Hello Chris,

    I spent a couple more hours looking into cache coherency today. For now I would NOT suggest trying it out - there are some potential gotchas that I was not aware of. I am still investigating to see if this looks like a potential implementation or not.

    To set expectations, I won't be able to make any progress on Friday. Next week I plan to finish porting the zerocopy example to kernel 6.6, and then I'll start running tests alongside you to see what we can do to drive average latency lower.

    Regards,

    Nick

  • Hello Chris,

    Apologies for the delayed responses - I was out sick for a while, and am still catching up.

    Support for Linux SDK 10.0 has been added to the zerocopy project, master branch:
    https://git.ti.com/cgit/rpmsg/rpmsg_char_zerocopy/

    I am working on benchmarking each part of the zerocopy project now. i.e., for both sending & receiving data, I want to see the time required to
    - start syncing shared memory location
- write/read data between the data buffer and the cache (I'm not sure whether syncing between cache & memory occurs while the write happens, or afterwards)
    - finish syncing shared memory location

    in addition to the RPMsg signaling & R5F processing time. I'll also run tests for different kinds of memory to see if different memories have significantly different performance.
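
    For reference, a rough sketch of how those per-phase timings could be captured with clock_gettime(); the begin_sync()/end_sync() functions below are stand-in stubs, not the zerocopy example's actual cache-sync calls:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    
    /* Stand-in stubs: replace with the example's real cache-sync operations. */
    static void begin_sync(void *shm, size_t len) { (void)shm; (void)len; }
    static void end_sync(void *shm, size_t len)   { (void)shm; (void)len; }
    
    /* Difference between two timespecs in microseconds. */
    static int64_t elapsed_us(struct timespec a, struct timespec b)
    {
    	return (int64_t)(b.tv_sec - a.tv_sec) * 1000000 +
    	       (b.tv_nsec - a.tv_nsec) / 1000;
    }
    
    static void timed_send(void *shm, const void *data, size_t len)
    {
    	struct timespec t0, t1, t2, t3;
    
    	clock_gettime(CLOCK_MONOTONIC, &t0);
    	begin_sync(shm, len);              /* start syncing shared memory  */
    	clock_gettime(CLOCK_MONOTONIC, &t1);
    	memcpy(shm, data, len);            /* copy payload into shared mem */
    	clock_gettime(CLOCK_MONOTONIC, &t2);
    	end_sync(shm, len);                /* finish syncing shared memory */
    	clock_gettime(CLOCK_MONOTONIC, &t3);
    
    	printf("sync-start %lld us, copy %lld us, sync-end %lld us\n",
    	       (long long)elapsed_us(t0, t1),
    	       (long long)elapsed_us(t1, t2),
    	       (long long)elapsed_us(t2, t3));
    }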

    Regards,

    Nick

  • Hi Nick,

    hope you get better quickly!

    Did you get my zerocopy example code? 
I'm really looking forward to seeing your results.

    As I mentioned before, I do not see any significant latency changes if I change the shared memory location or run the R5F code out of SRAM.
     

    regards,

    Chris

  • Hello Chris,

    Thank you! I did get the example code that was uploaded on Sept 5. If there was more recent code you wanted me to look at, I don't currently see it.

    Did you get to play around more with isolating the task to 1 core and potentially disabling the preempt code? I assume from the Sept 5 code that you have not yet tried memory polling instead of RPMsg signaling.

    I still have some more development to do on my side before running the next round of tests, but I'm hoping to be able to start getting some initial results Thursday/Friday.

    Regards,

    Nick