Tool/software: Linux
I am developing a small application running in Linux userspace on the AM4378 using the 4.1 kernel (specifically the kernel from the 02.00.02 SDK). It communicates with a small program, also under development, running on PRUSS1 and built with the PRU software support package. I am using rpmsg for PRU communication and can mostly send large volumes of data to the PRU this way without any apparent issues. In some circumstances I want to "flush" the incoming rpmsg queue, which I do on the PRU by reading messages off the queue in a very tight loop until it is empty. Usually this works fine, but very rarely it leads to an infinite loop in which the PRU "reads" the same set of messages over and over from the rpmsg queue. This continues even after terminating the Linux userspace application that wrote the messages, which would appear to rule out the possibility that the application is resending the same set of messages.
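For reference, the flush is essentially the loop below. This is a minimal sketch using the pru_rpmsg API from the support package; `RPMSG_BUF_SIZE` here is the 512-byte payload size used by the rpmsg examples, and the transport setup is assumed to happen elsewhere in my firmware.

```c
#include <stdint.h>
#include <pru_rpmsg.h>

#define RPMSG_BUF_SIZE 512  /* payload size used by the support package examples */

/* Drain every pending message from the incoming rpmsg queue, discarding
 * payloads, until pru_rpmsg_receive reports no buffer is available. */
void flush_rpmsg_queue(struct pru_rpmsg_transport *transport)
{
    uint16_t src, dst, len;
    uint8_t payload[RPMSG_BUF_SIZE];

    while (pru_rpmsg_receive(transport, &src, &dst, payload, &len) ==
           PRU_RPMSG_SUCCESS) {
        /* message discarded */
    }
}
```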
To me this strongly suggests a race condition in the rpmsg stack itself. Looking over the PRU side of the code, I noticed that in this line the PRU increments the used buffer index (which the Linux kernel uses to determine which entries in the used ring still need processing) before it fills in the id and length fields of that entry. The PRU does not kick the kernel until after filling in those fields, but as far as I can tell, once kicked the kernel will process every used element that precedes the current used index. So if the PRU is mid-update on a second used element when the kernel gets around to responding to the kick for the first one, the kernel can still end up processing invalid data. (I am not 100% sure about this; the kernel-side code is harder to follow.)
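To make the suspected ordering concrete, here is a paraphrase of what I believe that code does. This is illustrative only, not the actual support-package source; the types are the vring definitions from the package's pru_virtio_ring.h.

```c
#include <stdint.h>
#include <pru_virtio_ring.h>  /* struct vring, vring_used */

/* Suspected ordering: the used index is published before the entry it
 * refers to has been filled in. */
void add_used_buf_as_i_read_it(struct vring *vr, uint32_t head, uint32_t len)
{
    struct vring_used *used = vr->used;
    uint16_t slot = used->idx % vr->num;

    used->idx++;                 /* (1) kernel may now consume this slot...  */
    used->ring[slot].id  = head; /* (2) ...but its id/len are only written   */
    used->ring[slot].len = len;  /*     afterwards, so a kernel already      */
                                 /*     walking the ring can read stale data */
}
```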
My questions here are:
- What is the correct tag/branch of the PRU software support package repository to use with the 02.00.02 SDK kernel? (I checked these files in alongside my own code and did not keep track of which version I had copied them from.)
- Is there some mechanism that I am not seeing that prevents the kernel from receiving invalid data in the situation I laid out?
- Would waiting to increment that used index until after writing all data to the entry be sufficient to fix this potential issue? (I am a little worried about the compiler reordering the assignments, since nothing seems to be declared volatile; see the sketch after this list.)
- Are there any other known issues that could explain this behavior? (I am having a hard time figuring out how this potential race condition would actually cause the issue I am seeing.)
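For the third question, the reordering I have in mind looks like the sketch below (untested). The barrier shown is GCC syntax, which I do not believe clpru accepts, so the volatile-qualified pointer may be the more portable way to keep the compiler from reordering the stores.

```c
#include <stdint.h>
#include <pru_virtio_ring.h>  /* struct vring, vring_used */

/* Compile-time fence (GCC syntax; clpru would need volatile accesses or
 * an equivalent of its own). */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

/* Proposed ordering: fill in the used-ring entry first, fence, and only
 * then publish the entry by advancing the used index. */
void add_used_buf_fixed(struct vring *vr, uint32_t head, uint32_t len)
{
    volatile struct vring_used *used = vr->used;
    uint16_t slot = used->idx % vr->num;

    used->ring[slot].id  = head; /* entry is still invisible to the kernel */
    used->ring[slot].len = len;

    COMPILER_BARRIER();          /* commit id/len before publishing */

    used->idx = used->idx + 1;   /* kernel may now consume the entry */
}
```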