Tool/software: Linux
I am developing a small application running in Linux userspace on the AM4378 using the 4.1 kernel (specifically the kernel from the 02.00.02 SDK). It communicates with a small program, also under development, running on PRUSS1 and built with the PRU software support package. I am using rpmsg for PRU communication and can mostly send large volumes of data to the PRU this way without any apparent issues. In some circumstances I want to "flush" the incoming rpmsg queue, which I do on the PRU by reading messages off the queue in a very tight loop until it is empty. Usually this works fine, but very rarely it leads to an infinite loop in which the PRU "reads" the same set of messages over and over from the rpmsg queue. This continues even after terminating the Linux userspace application that wrote the messages, which would appear to rule out the possibility that the application is resending the same set of messages.
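For reference, the flush is essentially the loop below. This is a minimal sketch using the pru_rpmsg API from the support package; `RPMSG_BUF_SIZE` here is the 512-byte payload size used by the rpmsg examples, and the transport setup is assumed to happen elsewhere in my firmware.

```c
#include <stdint.h>
#include <pru_rpmsg.h>

#define RPMSG_BUF_SIZE 512  /* payload size used by the support package examples */

/* Drain every pending message from the incoming rpmsg queue, discarding
 * payloads, until pru_rpmsg_receive reports no buffer is available. */
void flush_rpmsg_queue(struct pru_rpmsg_transport *transport)
{
    uint16_t src, dst, len;
    uint8_t payload[RPMSG_BUF_SIZE];

    while (pru_rpmsg_receive(transport, &src, &dst, payload, &len) ==
           PRU_RPMSG_SUCCESS) {
        /* message discarded */
    }
}
```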
To me this strongly suggests a race condition in the rpmsg stack itself. Looking over the PRU side of the code, I noticed that in this line the PRU increments the used buffer index (which the Linux kernel uses to determine which entries in the used ring still need processing) before it fills in the id and length fields of that entry. The PRU does not kick the kernel until after filling in those fields, but as far as I can tell, once kicked the kernel will process every used element that precedes the current used index. So if the PRU is mid-update on a second used element when the kernel gets around to responding to the kick for the first one, the kernel can still end up processing invalid data. (I am not 100% sure about this; the kernel-side code is harder to follow.)
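To make the suspected ordering concrete, here is a paraphrase of what I believe that code does. This is illustrative only, not the actual support-package source; the types are the vring definitions from the package's pru_virtio_ring.h.

```c
#include <stdint.h>
#include <pru_virtio_ring.h>  /* struct vring, vring_used */

/* Suspected ordering: the used index is published before the entry it
 * refers to has been filled in. */
void add_used_buf_as_i_read_it(struct vring *vr, uint32_t head, uint32_t len)
{
    struct vring_used *used = vr->used;
    uint16_t slot = used->idx % vr->num;

    used->idx++;                 /* (1) kernel may now consume this slot...  */
    used->ring[slot].id  = head; /* (2) ...but its id/len are only written   */
    used->ring[slot].len = len;  /*     afterwards, so a kernel already      */
                                 /*     walking the ring can read stale data */
}
```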
My questions here are:
- What is the correct tag/branch of the PRU software support package repository to use with the 02.00.02 SDK kernel? (I checked these files in alongside my own code and did not keep track of which version I had copied them from.)
- Is there some mechanism that I am not seeing that prevents the kernel from receiving invalid data in the situation I laid out?
- Would waiting to increment that used index until after writing all data to the entry be sufficient to fix this potential issue? (I am a little worried about the compiler reordering the assignments, since nothing seems to be declared volatile; see the sketch after this list.)
- Are there any other known issues that could explain this behavior? (I am having a hard time figuring out how this potential race condition would actually cause the issue I am seeing.)
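For the third question, the reordering I have in mind looks like the sketch below (untested). The barrier shown is GCC syntax, which I do not believe clpru accepts, so the volatile-qualified pointer may be the more portable way to keep the compiler from reordering the stores.

```c
#include <stdint.h>
#include <pru_virtio_ring.h>  /* struct vring, vring_used */

/* Compile-time fence (GCC syntax; clpru would need volatile accesses or
 * an equivalent of its own). */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

/* Proposed ordering: fill in the used-ring entry first, fence, and only
 * then publish the entry by advancing the used index. */
void add_used_buf_fixed(struct vring *vr, uint32_t head, uint32_t len)
{
    volatile struct vring_used *used = vr->used;
    uint16_t slot = used->idx % vr->num;

    used->ring[slot].id  = head; /* entry is still invisible to the kernel */
    used->ring[slot].len = len;

    COMPILER_BARRIER();          /* commit id/len before publishing */

    used->idx = used->idx + 1;   /* kernel may now consume the entry */
}
```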