
AM4378: PRU-ICSS0 shared memory issue

Part Number: AM4378

I am developing an application that uses both PRU-ICSS subsystems on an AM4378 processor. I am using TI CGT PRU 2.3.3 to compile my PRU firmwares.

On the ARM core, I run Linux with a custom PRU-ICSS driver. This driver enables the PRU-ICSS, enables the OCP masters, uploads the PRU firmwares, starts the PRU cores, and communicates with the PRU firmware through the PRU-ICSS1 Shared RAM.

When I load/store 4-byte values in PRU-ICSS1 Shared RAM, everything works fine. But when values larger than 4 bytes are stored by the PRU-ICSS0 cores, the first 4 bytes are lost.

To investigate this problem in more detail, I wrote the following test firmware:

#include <stdint.h>
#include <string.h>

extern far uint8_t __PRU_CREG_BASE_PRU_SRAM;

void main()
{
	uint8_t *dst = &__PRU_CREG_BASE_PRU_SRAM;
	uint8_t tmp[16];
	uint32_t i;

	for (i = 0; i < 16; i++)
	{
		/* Fill temporary buffer with incremental sequence */
		tmp[i] = i;
		/* Fill Shared RAM with 0xff */
		dst[i] = 0xff;
	}

	memcpy(dst, tmp, 16);
	for (;;);
}

The program prepares a 16-byte buffer with an incrementing sequence and copies the contents of this buffer to the beginning of Shared RAM.
Then I read this data back from Shared RAM on the ARM core using the custom driver.

When this program runs on PRU-ICSS0 PRU0, I get: 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f ff ff ff ff
When this program runs on PRU-ICSS1 PRU0, I get: 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f

I generated assembler listings using dispru. They are identical for PRU-ICSS0 and PRU-ICSS1 except for the Shared RAM offset (0x50000 and 0x10000, respectively).

To copy the memory, the compiler generates these instructions:

LBBO &R14.b0, R0, 0, 16
SBBO &R14.b0, R1, 0, 16

Thus, the compiler generates the correct instructions. But these instructions are executed differently on PRU-ICSS0 and PRU-ICSS1.

What could be wrong? How can I fix this?

  • Hello Andrey,

    Since you are using PRU CGT 2.3.3, I am guessing you are not using the Linux Processor SDK (which as of SDK 6.1 only has PRU CGT 2.3.2). Let me know if that is not the case.

    Note that on AM437x, ICSS0 does not have a shared DRAM (reference PRU-ICSS / PRU_ICSSG Feature Comparison Across Devices). So I'm not sure where the compiler is trying to write the code for ICSS0.

    You can see TI's example linker command files for AM437x ICSS0 and ICSS1 in our PRU Software Support Package. For example, see examples/am437x/PRU_RPMsg_Echo_Interrupt0_0/AM437x_PRU_SS0.cmd and examples/am437x/PRU_RPMsg_Echo_Interrupt1_0/AM437x_PRU_SS1.cmd

    Feel free to reply with additional discussion.

    Regards,

    Nick

  • Hello Nick.

    Yes, I don't use the Linux Processor SDK, but I explored it in the early stages of development and it helped me a lot.

    Yes, I know that PRU-ICSS0 doesn't have internal Shared RAM. I use the PRU-ICSS1 Shared RAM from both ICSS1 and ICSS0. My linker scripts are very similar to the ones from the PRU Software Support Package. Because PRU-ICSS0 has a local mapping of the PRU-ICSS1 memories at 0x40000 (Ext port to PRU-ICSS1), I added the following to my linker script for ICSS0:

    PRU_SRAM        : org = 0x00050000 len = 0x00008000 CREGISTER=28 /* 32kB Shared RAM */

    So, in my map file for PRU-ICSS0 I can see global symbol:

    abs   00050000  __PRU_CREG_BASE_PRU_SRAM

    As you can see, in my test program I removed the cregister attribute definition, so access to the Shared RAM is done via LBBO/SBBO.

  • I did some more tests.

    1) Copy 16 bytes from PRU-ICSS0 Data RAM0 to PRU-ICSS0 Data RAM0 by PRU-ICSS0 PRU0.

    #include <stdint.h>
    #include <string.h>
    
    void main()
    {
    	uint8_t *src = (uint8_t *) 0x00010;
    	uint8_t *dst = (uint8_t *) 0x00000;
    	uint32_t i;
    
    	for (i = 0; i < 16; i++)
    	{
    		/* Fill temporary buffer with incremental sequence */
    		src[i] = i;
    		/* Fill destination buffer with 0xff */
    		dst[i] = 0xff;
    	}
    
    	memcpy(dst, src, 16);
    	for (;;);
    }
    

    Test passed. Destination buffer: 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f.

    2) Copy 16 bytes from PRU-ICSS0 Data RAM0 to PRU-ICSS0 Data RAM1 by PRU-ICSS0 PRU0.

    #include <stdint.h>
    #include <string.h>
    
    void main()
    {
    	uint8_t *src = (uint8_t *) 0x00010;
    	uint8_t *dst = (uint8_t *) 0x02000;
    	uint32_t i;
    
    	for (i = 0; i < 16; i++)
    	{
    		/* Fill temporary buffer with incremental sequence */
    		src[i] = i;
    		/* Fill destination buffer with 0xff */
    		dst[i] = 0xff;
    	}
    
    	memcpy(dst, src, 16);
    	for (;;);
    }

    Test passed. Destination buffer: 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f.

    3) Copy 16 bytes from PRU-ICSS0 Data RAM0 to PRU-ICSS1 Data RAM0 by PRU-ICSS0 PRU0.

    #include <stdint.h>
    #include <string.h>
    
    void main()
    {
    	uint8_t *src = (uint8_t *) 0x00010;
    	uint8_t *dst = (uint8_t *) 0x40000;
    	uint32_t i;
    
    	for (i = 0; i < 16; i++)
    	{
    		/* Fill temporary buffer with incremental sequence */
    		src[i] = i;
    		/* Fill destination buffer with 0xff */
    		dst[i] = 0xff;
    	}
    
    	memcpy(dst, src, 16);
    	for (;;);
    }
    

    Test fails. Destination buffer: 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f ff ff ff ff.

    4) Copy 16 bytes from PRU-ICSS1 Shared RAM to PRU-ICSS1 Shared RAM by PRU-ICSS0 PRU0.

    #include <stdint.h>
    #include <string.h>
    
    void main()
    {
    	uint8_t *src = (uint8_t *) 0x50010;
    	uint8_t *dst = (uint8_t *) 0x50000;
    	uint32_t i;
    
    	for (i = 0; i < 16; i++)
    	{
    		/* Fill temporary buffer with incremental sequence */
    		src[i] = i;
    		/* Fill destination buffer with 0xff */
    		dst[i] = 0xff;
    	}
    
    	memcpy(dst, src, 16);
    	for (;;);
    }

    Test fails. Destination buffer: 00 01 02 03 04 05 06 07 08 09 0a 0b ff ff ff ff.

  • Hello Andrey,

    I do not see anything obviously wrong with what you are doing. I do not recognize this issue, but I am checking around on this end.

    Interesting to note that the 4 bytes can disappear from the beginning or the end of the copied data. Does that part of the pattern seem to hold? i.e., ICSS0 transfer from ICSS0 -> ICSS1 loses the first 4 bytes, ICSS0 transfer from ICSS1 -> ICSS1 loses the last 4 bytes?

    Regards,

    Nick

  • Hello Andrey,

    Summary: You found a bug

    It looks like AM437x has an issue when writing (or reading?) more than 4 bytes through the local VBUSP bridge at 0x004_0000. This behavior is expected for ICSS0 writing into ICSS1, or ICSS1 writing into ICSS0. To the best of my knowledge, this issue should only affect AM437x devices (i.e. other PRU-containing devices are fine). This issue should be documented in the next version of the AM437x Errata.

    Workarounds:

    1) use the global address for bulk reads & writes to the other ICSS instead of the local address

    2) use the local address, but do reads & writes 4 bytes at a time (e.g., create a loop where only 4 bytes are copied at a time, use an unrolled jump table, etc)
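    To illustrate workaround 2, here is a minimal C sketch (the helper name and the whole-words assumption are mine, not from the support package; on the PRU, each loop iteration should compile to a single-word LBBO/SBBO pair):

    ```c
    #include <stdint.h>

    /*
     * Workaround 2 sketch: copy across the ICSS0 <-> ICSS1 bridge one
     * 32-bit word at a time, so no single burst is larger than 4 bytes.
     * Helper name and signature are illustrative only; assumes the copy
     * length is a whole number of 32-bit words.
     */
    static void bridge_copy_words(volatile uint32_t *dst,
                                  const volatile uint32_t *src,
                                  uint32_t nwords)
    {
        uint32_t i;

        for (i = 0; i < nwords; i++)
            dst[i] = src[i]; /* each iteration is a single-word load/store */
    }
    ```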

    After a certain size threshold, I would expect a bulk read with global addressing to be faster than a bunch of local reads in a loop. However, I am not sure what that threshold is. Our PRU Read Latencies documentation may provide some guidance for time critical applications.

    Feel free to reply with any follow-up.

    Regards,

    Nick

  • Andrey Mozzhuhin said:
    Copy 16 bytes from PRU-ICSS0 Data RAM0 to PRU-ICSS1 Data RAM0 by PRU-ICSS0 PRU0.
    Destination buffer: 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f ff ff ff ff.

    Whether Shared or Data ram is used doesn't seem to matter (I don't see how it could), just whether or not the transaction crosses between the two instances. It looks like something really funky going on in the ICSS0↔ICSS1 bridge, as if a fifo pointer is off by one or something like that.

    Something to try: prefill dst with 0xAA instead of 0xFF, and after filling src and dst but before doing the memcpy, fill some other buffer in ICSS1 with 0x55 and read it back, to ensure any fifo buffers along the way in the interconnect are filled with that value. That might reveal whether the bridge is receiving an extra word (last four bytes of dst are still 0xFF), writing stale data from a fifo (last four bytes are 0x55), or not writing the last word of dst at all (last four bytes are 0xAA).
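    A rough sketch of that experiment in C (the helper name and pointer parameters are illustrative; on the actual hardware, dst, src, and scratch would point at the addresses used in the earlier tests):

    ```c
    #include <stdint.h>
    #include <string.h>

    /*
     * Diagnostic sketch: prefill dst with 0xAA ("never written" marker),
     * prime any interconnect FIFOs with 0x55 ("stale data" marker) via a
     * scratch buffer in ICSS1, then perform the 16-byte copy. The tail of
     * dst then distinguishes the failure modes. Pointers are parameters
     * here so the logic is testable; real addresses are hardware-specific.
     */
    static void run_diagnostic(volatile uint8_t *dst,
                               volatile uint8_t *src,
                               volatile uint8_t *scratch)
    {
        uint8_t sink;
        uint32_t i;

        for (i = 0; i < 16; i++) {
            src[i] = (uint8_t)i;
            dst[i] = 0xAA;      /* marker: "this word was never written" */
        }
        for (i = 0; i < 16; i++) {
            scratch[i] = 0x55;  /* marker: "stale FIFO data" */
            sink = scratch[i];  /* read back to push it through any FIFOs */
            (void)sink;
        }
        memcpy((void *)dst, (const void *)src, 16);
        /* Inspect dst[12..15]: 0xAA = last word not written,
         * 0x55 = stale word written, anything else = extra word received. */
    }
    ```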

    Andrey Mozzhuhin said:
    Copy 16 bytes from PRU-ICSS1 Shared RAM to PRU-ICSS1 Shared RAM by PRU-ICSS0 PRU0.

    Destination buffer: 00 01 02 03 04 05 06 07 08 09 0a 0b ff ff ff ff.

    So my guess would be that in the last case, the lbbo (from ICSS1 by ICSS0) reads something like
    ff ff ff ff 00 01 02 03 04 05 06 07 08 09 0a 0b
    and the subsequent sbbo (to ICSS1 by ICSS0) turned that into
    00 01 02 03 04 05 06 07 08 09 0a 0b ff ff ff ff

    Does copying from ICSS1 to ICSS0 by ICSS0 indeed result in ff ff ff ff 00 01 02 03 04 05 06 07 08 09 0a 0b ?

    Does the same thing happen when ICSS1 is performing a transfer to/from an ICSS0 data ram?

  • Hello Matthijs,

    Thanks for commenting - I always appreciate your insight into PRU subjects. I haven't run tests on this end, but the IP developer expects there to be issues with LBBO and SBBO commands that cross the ICSS0 <-> ICSS1 bridge from both ICSS0 and ICSS1.

    Regards,

    Nick

  • Thanks guys for the answers.

    It is sad to know that this is a hardware bug. I will try to use suggested workarounds.

    Just for the record, I did another round of tests.

    1) Copy 16 bytes from PRU-ICSS0 Data RAM0 to PRU-ICSS1 Shared RAM using global address 0x54410000 by PRU-ICSS0 PRU0. Data copied without loss.

    2) Copy 16 bytes from PRU-ICSS1 Shared RAM using local address 0x50000 to PRU-ICSS0 Data RAM0 by PRU-ICSS0 PRU0. Destination buffer: 00 01 02 03 00 01 02 03 04 05 06 07 08 09 0a 0b.
    3) Copy 16 bytes from PRU-ICSS1 Data RAM0 to PRU-ICSS0 Data RAM0 using local address by PRU-ICSS1 PRU0. Destination buffer: 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f ff ff ff ff.
  • Nick Saulnier said:
    After a certain size threshold, I would expect a bulk read with global addressing to be faster than a bunch of local reads in a loop. However, I am not sure what that threshold is. Our PRU Read Latencies documentation may provide some guidance for time critical applications

    Writes to L3 are as fast as local writes as long as there's space in the FIFO, but sustained write throughput from PRU to its local memories via the L3 interconnect is pretty bad, and reads even more so. Here are some measurements I did on an AM335x [correction: AM572x]:

    • first few writes: 1+n cycles per n-word write
    • [see correction in next post] sustained writes: average 4.5*n cycles per n-word write, i.e. 4.5 cycles/word regardless of the number of words per write
    • [see correction in next post] reads: average 33.4+4.9*n cycles per n-word read

    These numbers strongly imply that the L3->PRUSS bridge breaks requests up into single-word transfers. This would also explain why this workaround even works, even though access from L3 to PRUSS0 also traverses the problematic VBUSP bridge.

    Another problem with this workaround is that you need to be really careful with synchronization since writes are asynchronous. Given that it seems to break transfers up into individual words, atomicity is lost as well (although I'm not 100% sure whether atomicity is guaranteed for multi-word accesses within PRUSS).
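    One common mitigation sketch for the asynchrony (not verified on AM437x; it assumes the interconnect orders a read behind earlier posted writes to the same target port) is to read back the last location written before signalling completion:

    ```c
    #include <stdint.h>

    /*
     * Write-then-readback "fence" sketch for posted writes through the
     * L3 path. Assumes read-after-write ordering to the same target,
     * which has NOT been verified on AM437x.
     */
    static void write_word_sync(volatile uint32_t *dst, uint32_t val)
    {
        uint32_t readback;

        *dst = val;        /* posted write: may complete later */
        readback = *dst;   /* read stalls until the write has landed */
        (void)readback;
    }
    ```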

    I noticed that the PRU Read Latencies doesn't include any information about latency for access across the PRUSS0↔PRUSS1 bridge. Does the bridge add any latency?

    Nick Saulnier said:
    Thanks for commenting - I always appreciate your insight into PRU subjects. I haven't run tests on this end, but the IP developer expects there to be issues with LBBO and SBBO commands that cross the ICSS0 <-> ICSS1 bridge from both ICSS0 and ICSS1.

    I had my post sitting in draft so I hadn't seen yet that a hardware bug had been confirmed before I hit post.

    It would be nice to get explicit confirmation that the expected behaviour is consistently what it appears to be so far:

    • an aligned n-word write (n>1) via the bridge actually writes the last n-1 words of the write-data followed by a garbage word
    • an aligned n-word read (n>1) via the bridge returns a garbage word followed by the first n-1 words of the actual data

    Knowledge of this behaviour would allow the following workarounds:

    // replace:
    lbbo	&r3, addr, 0, 16
    // by:
    lbbo	&r2, addr, 0, 20  // caution: clobbers r2
    // or:
    lbbo	&r3, addr, 4, 16
    lbbo	&r3, addr, 0, 4
    
    // replace:
    sbbo	&r3, addr, 0, 16
    // by:
    sbbo	&r2, addr, 0, 20  // caution: clobbers 4 bytes at addr+16
    // or:
    sbbo	&r2, addr, 0, 16
    sbbo	&r6, addr, 12, 4
    

    The first workaround for each costs merely 1 cycle extra, but clobbers a register (for load) or 4 bytes of memory (for store).
    The second workaround for each costs the time of a single-word load/store (and sacrifices atomicity), but avoids clobbering anything.

  • I mucked up the measurements (used wrong target address). Corrected values:

    • sustained single-word writes: average 2.8 cycles per write
    • sustained multi-word writes: average 2.3 cycles per word
    • reads: average 28+2.25*n±1 cycles per n-word load

    Also I mistakenly said it was measured on an AM335x, I actually used an AM572x for the tests. (I don't expect this to matter much.)

    Regardless, the conclusions are still the same: the PRU→L3 master port, the L3 interconnect, the PRUSS interconnect, and the target memory are all able to sustain 1.0-1.1 cycles/word when using large transfers. The bottleneck is therefore the L3→PRUSS bridge, and the logical conclusion is that it breaks requests up into single-word transfers.

  • Hello Matthijs & Andrey,

    Thanks again for bringing this up and for your feedback. We'll need to do more testing on this side, so I'm going to close the thread.

    Regards,

    Nick

  • Nick Saulnier said:
    We'll need to do more testing on this side, so I'm going to close the thread.

    The logic of that statement seems puzzling to me. If more testing is done, this thread would be the logical place to report the results thereof. How else would anyone following this topic and anyone who finds this topic in the future find these results?

  • Hello Matthijs,

    Typically I do not let e2e threads go weeks without a response. I will not have bandwidth to dive into this until 2020, so it could be some time until I have any updates. But I hear you, and I will make a note to go back and update this thread when we figure out what we are going to put in the AM437x Errata.

    Regards,

    Nick