AM6442: Communication Latency Issues between A53 and R5 in a Linux-RT System

Part Number: AM6442

I downloaded TI's Linux kernel from the following URL and switched to the ti-rt-linux6.6.y-cicd branch, which is the RT (real-time) version of the Linux kernel.

https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/

I am using the following example code, based on rpmsg_char_simple.c:

/*
 * rpmsg_char_simple.c
 *
 * Simple Example application using rpmsg-char library
 *
 * Copyright (c) 2020 Texas Instruments Incorporated - https://www.ti.com
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * *  Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *
 * *  Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * *  Neither the name of Texas Instruments Incorporated nor the names of
 *    its contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 *  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 *  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 *  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 *  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 *  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 *  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 *  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 *  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 *  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 *  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 *  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 */

#include <sys/select.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <stddef.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <time.h>
#include <stdbool.h>
#include <semaphore.h>

#include <linux/rpmsg.h>
#include <ti_rpmsg_char.h>

#define NUM_ITERATIONS	100
#define REMOTE_ENDPT	14

/*
 * This test can measure round-trip latencies up to 20 ms
 * Latencies measured in microseconds (us)
 */
#define LATENCY_RANGE 20000

int send_msg(int fd, char *msg, int len)
{
	int ret = 0;

	ret = write(fd, msg, len);
	if (ret < 0) {
		perror("Can't write to rpmsg endpt device\n");
		return -1;
	}

	return ret;
}

int recv_msg(int fd, int len, char *reply_msg, int *reply_len)
{
	int ret = 0;

	/* Note: len should be max length of response expected */
	ret = read(fd, reply_msg, len);
	if (ret < 0) {
		perror("Can't read from rpmsg endpt device\n");
		return -1;
	} else {
		*reply_len = ret;
	}

	return 0;
}

/* single thread communicating with a single endpoint */
int rpmsg_char_ping(int rproc_id, char *dev_name, unsigned int local_endpt, unsigned int remote_endpt,
		    int num_msgs)
{
	int ret = 0;
	int i = 0;
	int packet_len;
	char eptdev_name[64] = { 0 };
	/*
	 * Each RPMsg packet can have up to 496 bytes of data:
	 * 512 bytes total - 16 byte header = 496
	 */
	char packet_buf[496] = { 0 };
	rpmsg_char_dev_t *rcdev;
	int flags = 0;
	struct timespec ts_current;
	struct timespec ts_end;

	/*
	 * Variables used for latency benchmarks
	 */
	struct timespec ts_start_test;
	struct timespec ts_end_test;
	/* latency measured in us */
	int latency = 0;
	int latencies[LATENCY_RANGE] = {0};
	int latency_worst_case = 0;
	double latency_average = 0; /* try double, since long long might have overflowed w/ 1Billion+ iterations */
	FILE *file_ptr;

	/*
	 * Open the remote rpmsg device identified by dev_name and bind the
	 * device to a local end-point used for receiving messages from
	 * remote processor
	 */
	sprintf(eptdev_name, "rpmsg-char-%d-%d", rproc_id, getpid());
	rcdev = rpmsg_char_open(rproc_id, dev_name, local_endpt, remote_endpt,
				eptdev_name, flags);
	if (!rcdev) {
		perror("Can't create an endpoint device");
		return -EPERM;
	}
	printf("Created endpt device %s, fd = %d port = %d\n", eptdev_name,
		rcdev->fd, rcdev->endpt);

	printf("Exchanging %d messages with rpmsg device %s on rproc id %d ...\n\n",
		num_msgs, eptdev_name, rproc_id);

	clock_gettime(CLOCK_MONOTONIC, &ts_start_test);

	for (i = 0; i < num_msgs; i++) {
		memset(packet_buf, 0, sizeof(packet_buf));

		/* minimum test: send 1 byte */
		sprintf(packet_buf, "h");
		/* "normal" test: do the hello message */
		//sprintf(packet_buf, "hello there %d!", i);
		/* maximum test: send 496 bytes */
		//sprintf(packet_buf, "0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345");

		packet_len = strlen(packet_buf);

		/* double-check: is packet_len changing, or fixed at 496? */
		//printf("packet_len = %d\n", packet_len);

		/* remove prints to speed up the test execution time */
		//printf("Sending message #%d: %s\n", i, packet_buf);

		clock_gettime(CLOCK_MONOTONIC, &ts_current);
		ret = send_msg(rcdev->fd, (char *)packet_buf, packet_len);
		if (ret < 0) {
			printf("send_msg failed for iteration %d, ret = %d\n", i, ret);
			goto out;
		}
		if (ret != packet_len) {
			printf("bytes written does not match send request, ret = %d, packet_len = %d\n",
				ret, packet_len);
			goto out;
		}

		ret = recv_msg(rcdev->fd, packet_len, (char *)packet_buf, &packet_len);
		clock_gettime(CLOCK_MONOTONIC, &ts_end);
		if (ret < 0) {
			printf("recv_msg failed for iteration %d, ret = %d\n", i, ret);
			goto out;
		}

		/*
		 * latency measured in usec; include the tv_sec difference so a
		 * tv_nsec wrap between the two timestamps cannot go negative
		 */
		latency = (ts_end.tv_sec - ts_current.tv_sec) * 1000000 +
			  (ts_end.tv_nsec - ts_current.tv_nsec) / 1000;

		/* if latency is outside the histogram range, throw an error and exit */
		if (latency >= LATENCY_RANGE) {
			printf("latency is too large to be recorded: %d usec\n", latency);
			goto out;
		}

		/* increment the counter for that specific latency measurement */
		latencies[latency]++;

		/* remove prints to speed up the test execution time */
		//printf("Received message #%d: round trip delay(usecs) = %ld\n", i,(ts_end.tv_nsec - ts_current.tv_nsec)/1000);
		//printf("%s\n", packet_buf);
	}

	clock_gettime(CLOCK_MONOTONIC, &ts_end_test);

	/* find worst-case latency */
	for (i = LATENCY_RANGE - 1; i > 0; i--) {
		if (latencies[i] != 0) {
			latency_worst_case = i;
			break;
		}
	}

	/* WARNING: The average latency calculation is currently being validated */
	/* find the average latency */
	for (i = LATENCY_RANGE - 1; i > 0; i--) {
		/* e.g., if latencies[60] = 17, that means there was a latency of 60us 17 times */
		latency_average = latency_average + (latencies[i]*i)/num_msgs;
	}
	/* old code from using long long instead of double */
	/*latency_average = latency_average / num_msgs;*/

	/* export the latency measurements to a file */
	file_ptr = fopen("histogram.txt", "w");

	for (unsigned int i = 0; i < LATENCY_RANGE; i++)
	{
		fprintf(file_ptr, "%d , ", i);
		fprintf(file_ptr, "%d", latencies[i]);
		fprintf(file_ptr, "\n");
	}
	fclose(file_ptr);

	printf("\nCommunicated %d messages successfully on %s\n\n",
		num_msgs, eptdev_name);
	printf("Total execution time for the test: %ld seconds\n",
		ts_end_test.tv_sec - ts_start_test.tv_sec);
	printf("Average round-trip latency: %f usec\n", latency_average);
	printf("Worst-case round-trip latency: %d usec\n", latency_worst_case);
	printf("Histogram data at histogram.txt\n");

out:
	ret = rpmsg_char_close(rcdev);
	if (ret < 0)
		perror("Can't delete the endpoint device\n");

	return ret;
}

void usage()
{
	printf("Usage: rpmsg_char_simple [-r <rproc_id>] [-n <num_msgs>] [-d \
	       <rpmsg_dev_name>] [-p <remote_endpt>] [-l <local_endpt>]\n");
	printf("\t\tDefaults: rproc_id: 0 num_msgs: %d rpmsg_dev_name: NULL remote_endpt: %d\n",
		NUM_ITERATIONS, REMOTE_ENDPT);
}

int main(int argc, char *argv[])
{
	int ret, status, c;
	int rproc_id = 0;
	int num_msgs = NUM_ITERATIONS;
	unsigned int remote_endpt = REMOTE_ENDPT;
	unsigned int local_endpt = RPMSG_ADDR_ANY;
	char *dev_name = NULL;

	while (1) {
		c = getopt(argc, argv, "r:n:p:d:l:");
		if (c == -1)
			break;

		switch (c) {
		case 'r':
			rproc_id = atoi(optarg);
			break;
		case 'n':
			num_msgs = atoi(optarg);
			break;
		case 'p':
			remote_endpt = atoi(optarg);
			break;
		case 'd':
			dev_name = optarg;
			break;
		case 'l':
			local_endpt = atoi(optarg);
			break;
		default:
			usage();
			exit(0);
		}
	}

	if (rproc_id < 0 || rproc_id >= RPROC_ID_MAX) {
		printf("Invalid rproc id %d, should be less than %d\n",
			rproc_id, RPROC_ID_MAX);
		usage();
		return 1;
	}

	/* Use auto-detection for SoC */
	ret = rpmsg_char_init(NULL);
	if (ret) {
		printf("rpmsg_char_init failed, ret = %d\n", ret);
		return ret;
	}

	status = rpmsg_char_ping(rproc_id, dev_name, local_endpt, remote_endpt, num_msgs);

	rpmsg_char_exit();

	if (status < 0) {
		printf("TEST STATUS: FAILED\n");
	} else {
		printf("TEST STATUS: PASSED\n");
	}

	return 0;
}

I tested with more than 100,000 iterations and set the priority with the command ./rpmsg_char_simple -r 2 -n 100000 & chrt -f -p 10 $!. The test results are shown in the following figure:

From the figure, it can be seen that the maximum latency is 1177us, which is too high and does not meet our current requirements. Is there room for optimization?

  • Hi TI Experts,

    Could you please check if this issue is related to the below thread?

    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1388960/sk-am64b-rpmsg-between-a53-and-r5-performance-update/5328241?

    Customer needs to improve this performance before selecting am64x, so this is quite important for us to support them.

    Thanks,

    Kevin

  • Hello there,

    I see you found my sample test code from a previous thread.

    Notes on the test output

    Average latency output
    Note that the average latency calculation in the code is broken for longer test runs (e.g., millions or billions of iterations). I still haven't figured out a clean way to calculate the weighted average in-place for that many iterations, so for those earlier tests I ended up calculating it manually in Excel afterwards (import histogram.txt, multiply the number of times each latency was measured by that latency, SUM() the products, and divide by the total number of measurements).
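
    If it helps, here is a minimal offline sketch of that same weighted-average calculation in C instead of Excel. It assumes the "<latency> , <count>" per-line format that histogram.txt is written in by the code above, and it accumulates in double precision so very long runs do not overflow:

    #include <stdio.h>

    /*
     * Offline sketch: recompute the average latency from histogram.txt.
     * Each line of the file is "<latency_us> , <count>", as written by the
     * test code above.
     */
    int main(void)
    {
    	FILE *fp = fopen("histogram.txt", "r");
    	long latency, count;
    	double sum = 0.0, total = 0.0;

    	if (!fp)
    		return 1;

    	while (fscanf(fp, "%ld , %ld", &latency, &count) == 2) {
    		sum += (double)latency * (double)count; /* weight each latency by how often it occurred */
    		total += (double)count;
    	}
    	fclose(fp);

    	if (total > 0.0)
    		printf("average latency = %.2f usec over %.0f samples\n",
    			sum / total, total);
    	return 0;
    }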

    Is a test of 100,000 iterations representative of edge cases?
    No. If you are really trying to get a feel for longer-term behavior, I would run on the order of millions or billions of iterations. With that said, don't run tests yet - I want to talk about your results first.

    I have since learned that the priority should really be placed above 50. -p 80 or -p 98 have been suggested to me as good settings for future tests.
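
    For example, something along these lines (same pattern as the commands later in this thread; adjust the iteration count to whatever you want to run):

    ./rpmsg_char_simple -r 2 -n 1000000 & chrt -f -p 80 $!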

    Your results are very different from mine - let's figure out why

    Even when I was using stress-ng to add a bunch of background load on the A53 cores, I saw an average latency of 30-35 usec over 3 billion iterations, and the latency went over 1msec only 2 times. Your average latency is about 30x worse: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1388960/sk-am64b-rpmsg-between-a53-and-r5-performance-update/5328241#5328241 

    What frequency are your A53 cores running at?

    What other software is running on Linux at the same time?

    What filesystem are you using?

    Where is your filesystem running? (i.e., off of the SD card? Are you using NFS to access the filesystem on your PC? an NFS filesystem cannot be used for latency tests like this)

    What results do you see if you just use the default filesystem image from the download page, running on an SD card?
    PROCESSOR-SDK-LINUX-AM64X Software development kit (SDK) | TI.com
    tisdk-default-image-am64xx-evm

    What is your usecase? What is your latency target for average & worst case? 

    This will be a longer discussion. So keep it in mind, but for now let's focus on figuring out why your initial test results were so different from mine.

    Regards,

    Nick

  • Hi Nick,

    Currently, the CPU is running at a frequency of 1 GHz. I am using the following file system.

    By default, the file system runs from an SD card. Only the system's default tasks are running, no other tasks. After the system starts, I run the following command:

    ./rpmsg_char_simple -r 2 -n 100000 & chrt -f -p 10 $!

    I am not sure whether my operating environment is the same as yours.

    Regards,

    Mack

  • To add, the SDK version we are using is 10.00.07.04, as shown below:

  • Hello Mack,

    Discussion about the test environment

    Ok, that is the environment I would expect you to use.

    I ran my tests on a pre-release build of SDK 10.0. I should probably run the tests again on the actual SDK 10.0 release, just to make sure things are consistent. That will probably take me a couple of days to set up.

    Disclaimer about RT Linux

    In the meantime, we should discuss your use case. Before you spend a bunch of development time, I want to make sure that it ACTUALLY makes sense to include RT Linux in your critical control path. Keep in mind that RT Linux is NOT a true real-time OS. We can make RT Linux statistically LIKELY to meet certain latencies, but RT Linux cannot be GUARANTEED to meet certain latencies. So if missing a latency requirement once a year or once every couple of years will break your design, we do NOT suggest putting RT Linux within the latency-critical control path.

    For more thoughts on that, refer to
    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1085663/faq-sitara-multicore-system-design-how-to-ensure-computations-occur-within-a-set-cycle-time 

    Questions about your usecase 

    What do you want to do with the processor, and what tasks do you intend to run on each core?

    How much data would you be transferring between cores?

    What average latency would be needed?

    What worst case latency would be needed?

    Regards,

    Nick

  • Hi Nick

    1. Discussion on "Missing a latency requirement once a year or once every couple of years ":

      We do not have scenarios where the equipment runs for a year or several years, so we do not need to consider that situation.

    2. Discussion on "what the processor is used for":

      We use the R5F core in the AM64x for motor driving, and the A53 mainly to send commands to the R5F to control motor operation. Currently, we use the AM6442, which has two A53 cores. In the future, we can separate real-time tasks and non-real-time tasks, placing real-time tasks on one core and non-real-time tasks on another. For example, tasks that send commands to the R5F core are real-time tasks, and we will run them on a CPU dedicated to real-time tasks.

    3. Discussion on "how much data will be transferred between cores":

      We send a maximum of 4 KB of data to the R5F.

    4. Discussion on "what is the required average latency and what is the required worst-case latency":

      We only care about the worst-case latency. Currently, our requirement is that the maximum delay should not exceed 100 microseconds (us).

    Regards,

    Mack

  • Hello Mack,

    missing latency requirements 

    Let's talk about this a bit more. What happens if the maximum delay does occasionally exceed 100 us? Does the entire system break, or is that ok as long as it is a rare occurrence?

    how much data will be transferred at once 

    Since your usecase can require more than 496 bytes of data to be transferred at once, we do not want to just use RPMsg for data transfer (each Linux RPMsg data packet can only hold up to 496 bytes. It is more efficient to allocate a shared memory region and set up a notification mechanism when the shared memory is ready, than it is to just send a bunch of RPMsg messages).
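
    To make that concrete, here is a rough sketch (NOT the actual rpmsg_char_zerocopy implementation): allocate a buffer from a DMA heap, copy the whole payload into it once, and use a tiny RPMsg purely as a "data is ready" doorbell. The heap path and doorbell byte below are made-up placeholders, rpmsg_fd is an endpoint fd opened as in rpmsg_char_simple.c, and how the R5F finds the buffer (and the reply path) is omitted:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <linux/dma-buf.h>
    #include <linux/dma-heap.h>

    int notify_via_shared_mem(int rpmsg_fd, const void *payload, size_t len)
    {
    	struct dma_heap_allocation_data alloc = { .len = len, .fd_flags = O_RDWR | O_CLOEXEC };
    	struct dma_buf_sync sync = { .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE };
    	char doorbell = '!';
    	void *shm;
    	int heap_fd;

    	heap_fd = open("/dev/dma_heap/carveout_apps-shared-memory", O_RDWR); /* placeholder path */
    	if (heap_fd < 0 || ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc) < 0)
    		return -1;

    	shm = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, alloc.fd, 0);
    	if (shm == MAP_FAILED)
    		return -1;

    	ioctl(alloc.fd, DMA_BUF_IOCTL_SYNC, &sync);	/* begin CPU write access */
    	memcpy(shm, payload, len);			/* 1. data goes into shared memory */
    	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
    	ioctl(alloc.fd, DMA_BUF_IOCTL_SYNC, &sync);	/* end CPU write access */

    	return write(rpmsg_fd, &doorbell, 1) == 1 ? 0 : -1;	/* 2. 1-byte "go" message */
    }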

    The zerocopy example would be a good starting point for you. I am still working on porting the code to Linux kernel 6.6 - hoping to have the time to finish that porting within the next couple of days.

    https://git.ti.com/cgit/rpmsg/rpmsg_char_zerocopy/

    worst-case latency 

    Is this maximum delay one-way latency, or round-trip latency? If it is one-way latency, is it for Linux --> R5F, R5F --> Linux, or both?

    Does this latency include computation time? Or is it JUST for the time to get data from one core to the other core?

    Keep in mind with Linux that the interrupt response time will be a limiting factor for how quickly Linux can react to an interrupt from the R5F. More details at the link I provided above: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1085663/faq-sitara-multicore-system-design-how-to-ensure-computations-occur-within-a-set-cycle-time

    And you can get an idea about the out-of-the-box interrupt response latency for the default AM64x SDK 10.0 filesystem from the cyclic test benchmarks:
    https://software-dl.ti.com/processor-sdk-linux/esd/AM64X/10_00_07_04/exports/docs/devices/AM64X/linux/RT_Linux_Performance_Guide.html#stress-ng-and-cyclic-test
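
    For reference, a typical way to reproduce that kind of measurement locally is to run cyclictest at a high RT priority with a stress-ng load in the background. The exact options used for the published SDK numbers may differ; this is just one reasonable invocation:

    stress-ng --cpu-method=all -c 4 &
    cyclictest -m -S -p 98 -i 1000 -D 1h -q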

    Regards,

    Nick

  • zerocopy update

    For anyone watching the zerocopy repo, you'll see that we just split off branch ti-linux-6.1 for customers who are using Linux kernel 5.10 & Linux kernel 6.1, and I just merged some patches into the main branch adding remoteproc_cdev support for Linux kernel 6.6: https://git.ti.com/cgit/rpmsg/rpmsg_char_zerocopy/

    Please wait another day or so before playing around with kernel 6.6. I still have to make some major updates to the README file to explain all the steps. Tentative timeframe is
    * Linux SDK 10.0: support added for all supported devices this week
    * MCU+ SDK 10.0: support will be added one processor at a time - it might take a couple weeks or longer. Until then, you can use an earlier version of the MCU+ SDK.
    * benchmark code: since multiple customers are looking at this project from a low-latency standpoint, I am starting to play around with code to benchmark & improve performance. Code will be added as it matures, exact time TBD
    * AM62Px support: will be added eventually, exact time TBD. If you're an AM62Px customer and support still hasn't been added, please create a new e2e thread and ask for me.

    A note on building AM64x RTOS zerocopy code

    It looks like the AM64x code hasn't been modified since SDK 8.4, but I just verified the AM64x code can be built with MCU+ SDK 9.0 if the uint32 variable that throws a build error is changed to uint16 - not sure if the other processors require similar modifications. I haven't tried the AM64x project with an SDK release with the memory configurator tool yet (SDK 9.1 and later).

    The build looked like this on a Linux computer:

    // copy the rtos project to the MCU+ SDK
    ~/sdks/mcu_plus_sdk_am64x_09_00_00_35$ mkdir examples/drivers/ipc/ipc_rpmsg_zerocopy_linux
    ~/sdks/mcu_plus_sdk_am64x_09_00_00_35$ cp -r ~/git/rpmsg_char_zerocopy_worktree/rpmsg_char_zerocopy/rtos/* examples/drivers/ipc/ipc_rpmsg_zerocopy_linux/
    // build the examples
    // I had to go fix the uint32/uint16 build error here
    ~/sdks/mcu_plus_sdk_am64x_09_00_00_35$ make -s -C examples/drivers/ipc/ipc_rpmsg_zerocopy_linux/am64x-evm/system_freertos all
    // copy the R5F0_0 output to my EVM
    ~/sdks/mcu_plus_sdk_am64x_09_00_00_35$ scp examples/drivers/ipc/ipc_rpmsg_zerocopy_linux/am64x-evm/r5fss0-0_freertos/ti-arm-clang/ipc_rpmsg_zerocopy_linux.release.out root@192.168.1.164:~

    Regards,

    Nick

  • Hi Nick

    latency requirements 

    If the maximum delay occasionally exceeds 100 us, the entire device will lose precision, so we cannot allow this to happen.

    worst-case latency 

    This maximum delay is round-trip latency. The communication time from A53 to R5F and from R5F to A53 refers to the hardware communication time, not the time it takes for an application to send and receive data. Previously, we used an EtherCAT bus and measured the time from sending data to receiving data, which was between 20 and 30 microseconds. That measurement also refers to the interval between the network card sending and receiving data, not the interval at the application layer.

    Here, I have a question: what steps do I need to take to achieve the performance shown in the diagram below?

    Regards

    Mack

  • Hello Mack,

    More details needed to make design suggestions

    Ok. I need more details about your expected usecase. I am not sure if the public forums are the easiest way to do that, or if there is a better way for you to give me that information. I am asking Kevin for his suggestions.

    I will need specifics like
    * what is the exact start time and end time of your latency requirements, and what are the exact steps that need to happen between start and end times (for example, is the R5F controlling elements of your system over EtherCAT, and then we have 100usec between receiving an EtherCAT message and sending an EtherCAT message? Are we using the PRU subsystem to do motor control?)
    * what do we expect each core to be doing?
    * what data is travelling in each direction between cores?
    * Is the amount of data to transfer variable, or is it always 4096 bytes? If we are transferring different amounts of data between cores, are there different latency requirements for each amount of data?

    How to replicate my test results 

    I have an AM64x board set up with SDK 10.0 RT Linux now. I am working on finishing the Linux side of the zerocopy example, since that would be the starting point for testing performance for your usecase. I will let you know once porting is complete.

    Disclaimer - RPMsg might not be the right IPC method for your usecase

    Depending on your exact usecase, we might not be able to use RPMsg as the signaling mechanism between cores. That IPC protocol was designed for ease-of-use and ease of upstreaming into a Linux driver, but it was NOT designed for ultra-low latency. That does not mean that your usecase is impossible - there are many different ways to signal and pass data between different cores. But it DOES mean that software development will take more effort, since you won't be able to just copy-paste code.

    Regards,

    Nick

  • Hello Mack,

    zerocopy update

    Linux SDK 10.0 support has been added to the master branch for AM62x, AM62Ax, & AM64x:
    https://git.ti.com/cgit/rpmsg/rpmsg_char_zerocopy/ 

    Please do expect additional updates to the repo over the coming weeks and months. In addition to updates listed in my previous response, we might update the build process from autotools to cmake (still evaluating whether that makes sense for this project).

    Other updates 

    I have rewritten my benchmark code for RPMsg latencies that you found above, and have started adding similar benchmark capabilities into the zerocopy project. I will provide additional updates as I start testing and debugging the benchmark code.

    Regards,

    Nick

  • Hi Nick

    I tested the rpmsg_char_zerocopy example using the command (./rpmsg_char_zerocopy -r 2 -n 100,) and the following issue occurred.

    When using the command (./rpmsg_char_zerocopy -r 2 -p 14 -n 100), the following error occurred. Could you please advise on how to correctly set the remote_endpt parameter and explain why the error messages in the screenshot appear?

    Regards,

    Mack

  • Hello Mack,

    Summary: the zerocopy code is not getting loaded into the R5F core. I think you are still loading the default RPMsg Echo firmware.

    Details

    For remoteproc IDs, please refer to this file:
    https://git.ti.com/cgit/rpmsg/ti-rpmsg-char/tree/include/rproc_id.h

    I tested last week with this command which worked fine (where I modified 0xa500_0000 in the Linux devicetree file to be my shared memory region):
    ./rpmsg_char_zerocopy -r 2 -n 100 -e /dev/dma_heap/carveout_ipc-memories@a5000000
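
    The devicetree change behind that heap path is, roughly, a carveout node like the one below under reserved-memory (the node name matches the /dev/dma_heap/carveout_ipc-memories@a5000000 path in the command above; the size shown here is only an example and must not overlap your other reserved-memory carveouts):

    	ipc-memories@a5000000 {
    		compatible = "dma-heap-carveout";
    		reg = <0x00 0xa5000000 0x00 0x100000>;
    		no-map;
    	};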

    -r 2 refers to R5F0_0, which should work fine if you have the correct firmware loaded and running on R5F0_0

    16 is the RPMsg endpoint number defined in the R5F firmware:
    https://git.ti.com/cgit/rpmsg/rpmsg_char_zerocopy/tree/rtos/ipc_rpmsg_zerocopy.c#n60

    On the other hand, 13 & 14 are the RPMsg endpoints that are defined in the default RPMsg Echo firmware.
    from mcu_plus_sdk_am64x_09_02_01_05/examples/drivers/ipc/ipc_rpmsg_echo_linux/ipc_rpmsg_echo.c

    #define IPC_RPMESSAGE_SERVICE_PING        "ti.ipc4.ping-pong"
    #define IPC_RPMESSAGE_ENDPT_PING          (13U)
    
    /* This is used to run the echo test with user space kernel */
    #define IPC_RPMESSAGE_SERVICE_CHRDEV      "rpmsg_chrdev"
    #define IPC_RPMESSAGE_ENDPT_CHRDEV_PING   (14U)
    

    Next steps 

    While you are running the RPMsg Echo code, you can try running the out-of-the-box echo example to see if it works as expected:
    https://dev.ti.com/tirex/explore/node?a=7qm9DIS__LATEST&node=A__Ab31zORiXVgIbeWGmbktOA__AM64-ACADEMY__WI1KRXP__LATEST

    And then you can follow these steps to load your zerocopy code into the R5F core instead of the RPMsg Echo code:
    https://dev.ti.com/tirex/explore/node?a=7qm9DIS__LATEST&node=A__AdAyuKWUWVV5j4wBc7C6XA__AM64-ACADEMY__WI1KRXP__LATEST

    Warning: graceful shutdown might be broken on SDK 9.2.1 & SDK 10.0 

    I just became aware of this a day ago, so I have not had much time to test. But if stopping and then restarting an R5F core through its remoteproc state file (echo stop, then echo start) causes the processor to freeze, that means graceful shutdown is broken. In that case, just update the link to your new firmware, and then reboot the processor to load the new R5F firmware.
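
    For reference, the stop/start sequence goes through the remoteproc sysfs interface. The remoteprocX index that corresponds to a given R5F core can change every boot, so check dmesg | grep remoteproc first; remoteproc2 below is only an example:

    echo stop > /sys/class/remoteproc/remoteproc2/state
    echo start > /sys/class/remoteproc/remoteproc2/state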

    Regards,

    Nick

  • Hi Nick:

    1. I have modified the device tree as shown below.

    	apps-shared-memory {
    		compatible = "dma-heap-carveout";
    		reg = <0x00 0xa6000000 0x00 0x2000000>;
    		no-map;
    	};

    The device node carveout_apps-shared-memory appeared under the /dev/dma_heap directory. Using the following command ./rpmsg_char_zerocopy -r 2 -n 100 -e /dev/dma_heap/carveout_apps-shared-memory still results in the following error.

    2. I followed these steps to load your zerocopy code into the R5F core instead of the RPMsg Echo code:
    https://dev.ti.com/tirex/explore/node?a=7qm9DIS__LATEST&node=A__AdAyuKWUWVV5j4wBc7C6XA__AM64-ACADEMY__WI1KRXP__LATEST

    The following error occurred, and I am using the RT version of SDK 10.00.07.

    Is the SDK version you are using the same as mine?

    Do I need to flash any firmware on the R5F core?

    Regards,

    Mack

  • Hello Mack,

    The error output gives the right steps. If R5F0 core1 is still running, you cannot stop R5F0 core0. So you would need to disable core1 first.

    HOWEVER. I have confirmed that there is a bug in the remoteproc driver so that graceful shutdown does not work properly in SDK 10.0. So after you update the remoteproc firmware link, you will want to reboot the board. Then the remoteproc driver will load your new firmware into the R5F core during boot time.

    Here are the exact steps that I want you to do:

    // SDK 10.0
    am64xx-evm login: root
    root@am64xx-evm:~# uname -a
    Linux am64xx-evm 6.6.32-rt32-ti-rt-g04a9ad081f0f-dirty #1 SMP PREEMPT_RT Fri Jul 26 14:42:37 UTC 2024 aarch64 GNU/Linux
    
    // first, let's check what happened during boot time
    // we can see that both r5f0_0 and r5f0_1 booted successfully
    // this is one way to see which remoteprocX value was assigned to each core
    // remember remoteprocX mappings can change every boot
    root@am64xx-evm:~# dmesg | grep remoteproc
    [    9.932368] platform 78000000.r5f: configured R5F for remoteproc mode
    [    9.961286] remoteproc remoteproc0: 78000000.r5f is available
    [    9.971516] k3-m4-rproc 5000000.m4fss: configured M4 for remoteproc mode
    [    9.972892] remoteproc remoteproc0: powering up 78000000.r5f
    [    9.972928] remoteproc remoteproc0: Booting fw image am64-main-r5f0_0-fw, size 434436
    [    9.974851] remoteproc remoteproc1: 5000000.m4fss is available
    [   10.003441] remoteproc remoteproc1: powering up 5000000.m4fss
    [   10.003471] remoteproc remoteproc1: Booting fw image am64-mcu-m4f0_0-fw, size 88248
    [   10.038734] platform 78200000.r5f: configured R5F for remoteproc mode
    [   10.055169] remoteproc remoteproc2: 78200000.r5f is available
    [   10.067912] remoteproc remoteproc2: powering up 78200000.r5f
    [   10.067942] remoteproc remoteproc2: Booting fw image am64-main-r5f0_1-fw, size 141772
    [   10.096189] remoteproc remoteproc0: remote processor 78000000.r5f is now up
    [   10.098237] remoteproc remoteproc1: remote processor 5000000.m4fss is now up
    [   10.129346] remoteproc remoteproc2: remote processor 78200000.r5f is now up
    ...
    
    // it doesn't matter where your R5F binary is stored.
    // For this test, my binary and userspace application is in /root/
    root@am64xx-evm:~# ls
    ipc_rpmsg_zerocopy_linux.release.out  rpmsg_char_zerocopy
    
    // this is where the remoteproc driver looks for firmware to load
    root@am64xx-evm:~# cd /lib/firmware/
    
    // where are the links currently pointing?
    root@am64xx-evm:/lib/firmware# ls -al
    total 24788
    drwxr-xr-x  8 root root    4096 Feb 27 19:49 .
    drwxr-xr-x 75 root root   49152 Mar  9  2018 ..
    -rw-r--r--  1 root root    2040 Mar  9  2018 LICENCE.ibt_firmware
    -rw-r--r--  1 root root    2046 Mar  9  2018 LICENCE.iwlwifi_firmware
    lrwxrwxrwx  1 root root      42 Feb 27 19:49 am64-main-r5f0_0-fw -> /root/ipc_rpmsg_zerocopy_linux.release.out
    lrwxrwxrwx  1 root root      79 Mar  9  2018 am64-main-r5f0_0-fw-sec -> /usr/lib/firmware/ti-ipc/am64xx/ipc_echo_test_mcu1_0_release_strip.xer5f.signed
    lrwxrwxrwx  1 root root      59 Mar  9  2018 am64-main-r5f0_1-fw -> /usr/lib/firmware/mcusdk-benchmark_demo/am64-main-r5f0_1-fw
    lrwxrwxrwx  1 root root      79 Mar  9  2018 am64-main-r5f0_1-fw-sec -> /usr/lib/firmware/ti-ipc/am64xx/ipc_echo_test_mcu1_1_release_strip.xer5f.signed
    
    // I already have am64-main-r5f0_0-fw pointing to my firmware
    // but let's pretend I also wanted to load the same binary into r5f0_1
    root@am64xx-evm:/lib/firmware# ln -sf ~/ipc_rpmsg_zerocopy_linux.release.out am64-main-r5f0_1-fw
    
    // is the link updated?
    root@am64xx-evm:/lib/firmware# ls -al
    total 24788
    drwxr-xr-x  8 root root    4096 Mar  7 11:19 .
    drwxr-xr-x 75 root root   49152 Mar  9  2018 ..
    -rw-r--r--  1 root root    2040 Mar  9  2018 LICENCE.ibt_firmware
    -rw-r--r--  1 root root    2046 Mar  9  2018 LICENCE.iwlwifi_firmware
    lrwxrwxrwx  1 root root      42 Feb 27 19:49 am64-main-r5f0_0-fw -> /root/ipc_rpmsg_zerocopy_linux.release.out
    lrwxrwxrwx  1 root root      79 Mar  9  2018 am64-main-r5f0_0-fw-sec -> /usr/lib/firmware/ti-ipc/am64xx/ipc_echo_test_mcu1_0_release_strip.xer5f.signed
    lrwxrwxrwx  1 root root      42 Mar  7 11:19 am64-main-r5f0_1-fw -> /root/ipc_rpmsg_zerocopy_linux.release.out
    
    // reboot to load the new firmware
    root@am64xx-evm:/lib/firmware# reboot -f

    Regards,

    Nick

  • Hello Mack,

    I wanted to give you a partial update.

    ti-rpmsg-char example is updated and ready for testing

    I have updated the ti-rpmsg-char example code that you found earlier, and now the code should be able to calculate average latency over billions of test runs. This is different from the test code you found a week or so ago, where the average latency calculations in the code would break with large numbers of test runs. As a reminder, for the 1 billion test runs in this previous thread, I had to calculate average latency by importing the histogram data into excel, and calculating the average within Excel: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1388960/sk-am64b-rpmsg-between-a53-and-r5-performance-update/5328241#5328241

    The updated code is attached as the three patches below. I will not push it to the public repo yet, since I might want to make additional changes in order to integrate the latency benchmarks into future SDK documentation: https://software-dl.ti.com/processor-sdk-linux/esd/AM64X/10_00_07_04/exports/docs/devices/AM64X/linux/RT_Linux_Performance_Guide.html 

    https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/0001_2D00_Update_2D00_max_2D00_message_2D00_size_2D00_to_2D00_496.patch

    https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/0002_2D00_Replace_2D00_fixed_2D00_number_2D00_of_2D00_read_2D00_bytes.patch

    https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/0003_2D00_example_2D00_add_2D00_latency_2D00_avg_2D00_worst_2D00_case_2D00_histogram.patch

    Initial test results 

    With the understanding that I have NOT optimized the filesystem and Linux kernel, this is about as good as RPMsg performance gets with the out-of-the-box default filesystem. Adding shared memory reads and writes will add more latency. If these numbers are not good enough for your usecase, you would need to evaluate other methods of signaling between Linux and the R5F.

    The latency to send 1 byte over RPMsg is much shorter than the latency to send 496 bytes over RPMsg. The test that is 1 billion iterations of sending 496 bytes is still running as I type this, but I will add the results as soon as the test finishes.

    Test methodology 

    SOFTWARE

    RT Linux filesystem:
    tisdk-default-image-am64xx-evm.rootfs.wic.xz from https://www.ti.com/tool/download/PROCESSOR-SDK-LINUX-AM64X, SDK 10.00.07.04

    R5F code:
    use the ipc_rpmsg_echo_linux firmware that comes prebuilt on the filesystem image

    Linux userspace code:
    Apply the patches above to https://git.ti.com/cgit/rpmsg/ti-rpmsg-char/

    Modify these lines in the Linux userspace code and rebuild the application in order to select between sending 1 byte, 496 bytes, or something else:

                    /* select how many bytes to send per message */
    
                    /* send 1 byte */
                    sprintf(packet_buf, "0");
                    /* send 4 bytes */
                    //sprintf(packet_buf, "0123");
                    /* send 32 bytes */
                    //sprintf(packet_buf, "01234567890123456789012345678901");
                    /* "normal" test: do the hello message */
                    //sprintf(packet_buf, "hello there %d!", i);
                    /* maximum test: send 496 bytes (i.e., 495 bytes plus null termination) */
                    //sprintf(packet_buf, "012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234");

    I renamed the output files to help me keep track of which binary was sending what amount of bytes.

    Finally, I disabled core1 of the R5F subsystem, and ran tests with core0 (just to make sure code running on the other core did not impact latencies).

    The below tests were run like this:
    talk to R5F0_0. run the 1 byte test 1 billion times, and increase the priority above 50 and below 99
    ./rpmsg_char_simple_1byte -r 2 -n 1000000000 & chrt -f -p 80 $!

    Another potential test would be to add a background load on Linux. In that case, the command would look like
    stress-ng --cpu-method=all -c 4 & ./rpmsg_char_simple_1byte -r 2 -n 1000000000 & chrt -f -p 80 $!

    Gut check: tests with 10,000 iterations 

    Test                                                             Avg latency    Worst latency
    ./rpmsg_char_simple_1byte -r 2 -n 10000 & chrt -f -p 80 $!      28 usec        205 usec
    ./rpmsg_char_simple_496byte -r 2 -n 10000 & chrt -f -p 80 $!    147 usec       257 usec

    I did not plot the histograms for these tests.

    Tests with 1 billion iterations 

    Test                                                                  Avg latency    Worst latency
    ./rpmsg_char_simple_1byte -r 2 -n 1000000000 & chrt -f -p 80 $!      28 usec        203 usec
    ./rpmsg_char_simple_496byte -r 2 -n 1000000000 & chrt -f -p 80 $!    147 usec       334 usec

    Histogram plots were attached for these two tests:
    ./rpmsg_char_simple_1byte -r 2 -n 1000000000 & chrt -f -p 80 $!
    ./rpmsg_char_simple_496byte -r 2 -n 1000000000 & chrt -f -p 80 $!

    Regards,

    Nick

  • The 1 billion iteration test results for sending 496 bytes have been updated above.

    Also, this test validates the updated average & worst-case calculations used in my updated code. I will add the same measurements to the zerocopy example next.

    Regards,

    Nick

  • Hello Mack,

    I will be on vacation from Oct 7 - Oct 11. Please let me know if you need anything from me before then, and I will try to get it done for you before I go on vacation.

    Regards,

    Nick

  • Hi Nick:

    I have tested the rpmsg_char_zerocopy example, sending 4096 bytes of data from the A53 core to the R5F core and having the R5F core return 4096 bytes back to the A53 core. The maximum round-trip time is around 350 microseconds, which basically meets our requirements. I have a question: what is the difference between the rpmsg_char_zerocopy example and the rpmsg_char_simple example? Could you explain how each of them works?

    Regards,

    Mack

  • Hello Mack,

    What is the difference between the two examples?

    rpmsg_char_simple sends an RPMsg from Linux userspace to the remote core, and then the remote core sends that same message back. In the code I modified above (https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1410313/am6442-communication-latency-issues-between-a53-and-r5-in-a-linux-rt-system/5434861#5434861), you can send a message that is 1 byte, up to a message that is 496 bytes.

    rpmsg_char_zerocopy sets up a shared memory region. Linux copies data to that shared memory region, and then sends an RPMsg to notify the remote core that the data is ready to read. The remote core reads the shared memory, writes a different data pattern to the shared memory, and then sends an RPMsg back to the Linux userspace app. Finally, Linux userspace reads the shared memory, and checks that the data has been modified in the expected way.
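
    In rough C terms, the Linux-side sequence described above looks something like the sketch below (illustrative only - the function name, notification payload, and 0xAA/0x55 patterns are made up, and the real example also handles buffer setup and cache maintenance):

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    /* shm is the mmap'ed shared-memory buffer, rpmsg_fd the RPMsg endpoint fd */
    static int zerocopy_round_trip(int rpmsg_fd, uint8_t *shm, size_t len)
    {
    	char reply[496];
    	size_t i;

    	memset(shm, 0xAA, len);				/* 1. Linux writes the test pattern   */
    	if (write(rpmsg_fd, "go", 2) != 2)		/* 2. short RPMsg: "data is ready"    */
    		return -1;
    	if (read(rpmsg_fd, reply, sizeof(reply)) < 0)	/* 3. block until the R5F answers     */
    		return -1;
    	for (i = 0; i < len; i++)			/* 4. verify the R5F rewrote the data */
    		if (shm[i] != 0x55)
    			return -1;
    	return 0;
    }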

    Codesys questions 

    Our "regular" cyclic test results are reported in the SDK docs. For SDK 10.0, we saw worst-case cyclic test results of about 50-60 usec:
    https://software-dl.ti.com/processor-sdk-linux/esd/AM64X/10_00_07_04/exports/docs/devices/AM64X/linux/RT_Linux_Performance_Guide.html

    However, one of my team members reported seeing huge spikes in Linux cycle time when they were using Codesys. The large latency was caused by the CODESYS Codemeter license application. For more information on those tests, see https://www.ti.com/lit/spradh0, section “Optimizations”.

    Are you also seeing much larger cyclic test results while Codesys is running? If you are still seeing small cyclic test results, what are you doing to prevent the Codemeter code from hurting performance?

    My team is still learning about Codesys, so it is possible we do not understand something (like if Codemeter does not actually run with certain licenses of Codesys).

    Other updates 

    I did not have time to do the testing I wanted to do on zerocopy before my vacation. I will be back the week of October 14.

    It can be helpful to have someone else review your test code (to make sure that the code is actually testing what you want the code to test). If you want me to go over your test code after I return, please share it with Kevin Peng and I will take a look when I get back.

    Regards,

    Nick

  • Hello,

    Apologies for the delayed responses. Let me know if you need anything else from me to enable IPC.

    Do you have any information for me about cyclic test results while CODESYS is running?

    Regards,

    Nick

  • For future readers, we started a new discussion around IPC between a Linux driver and a remote core. That discussion is here:
    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1441319/am6442-creating-ipc-between-linux-kernel-driver-and-r5f 

  • Hi,

    Closing the thread, as there has been no response for a while. Feel free to ping back if you want to continue the discussion.

    Regards

    Ashwani