MCU-PLUS-SDK-AM243X: Cores deadlocked on IpcNotify_sendMsg

Part Number: MCU-PLUS-SDK-AM243X

I occasionally have 2 or 3 cores locked up in infinite loops on RPMessage_vringPutEmptyRxBuf()->IpcNotify_sendMsg().

Obviously, no core can be emptying a mailbox if all of them are stuck in an infinite loop waiting for another core to empty a mailbox.

Looking at the overall design of RPMessage, I don't see how deadlocks like this can be avoided.

Can the RPMessage designer confirm that this is a fundamental characteristic? Perhaps a tradeoff for speed? I'll rewrite RPMessage if I need to, but I'm hoping I'm missing something. Meanwhile, having one core lock up another is not something we can have in our system.

  • Hello Traveler,

    Thanks for your query.

    I occasionally have 2 or 3 cores locked up in infinite loops on RPMessage_vringPutEmptyRxBuf()->IpcNotify_sendMsg().

    Can you please tell us which cores are locked up in the infinite loop?

    Have you modified the default example provided in the MCU+SDK?

    Which version of MCU+SDK are you using?

    Regards,

    Tushar

  • I'm using cores 0, 2, 3, 4 (per the IPC core enumeration) for RPMessage IPC. As I've been developing, I've had 0, 3, 4 deadlock simultaneously once. Any pair of cores deadlocks a couple of times every 2-3 days of development. I haven't been recording which pair, but it seems like 0 (the M4F core) might be in the deadlock more often than the others.

    All cores are heavily loaded. Processing loops on each core have periods of 20-500 microseconds. I have ~30 receivers using callbacks. Receivers respect their interrupt context. All senders have timeouts, most with timeout 0 and a higher-level queue to deal with rejected sends (see the sketch at the end of this reply).

    It all works quite nicely except for the deadlocks.

    We had to lock versions at Industrial Communications SDK 9.0.0.3.
    However, I back-ported the fixes to RPMessage_send that are in the main development branch (by Ashwin Raj), as that issue was also causing deadlocks.
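
    For reference, the non-blocking sender pattern is roughly the sketch below. This is illustrative only: it assumes the standard MCU+ SDK RPMessage_send() called with timeout 0, and the PendingMsg type and txQueue_push() are hypothetical placeholders for our higher-level queue, not SDK functions.

        #include <string.h>
        #include <stdint.h>
        #include <drivers/ipc_rpmsg.h>
        #include <kernel/dpl/SystemP.h>

        #define MSG_MAX_LEN  (64u)

        typedef struct {
            uint8_t  data[MSG_MAX_LEN];
            uint16_t len;
            uint16_t remoteCoreId;
            uint16_t remoteEndPt;
        } PendingMsg;

        /* Hypothetical application-level queue that holds sends rejected
         * with timeout 0 until a background task retries them */
        extern int32_t txQueue_push(const PendingMsg *msg);

        int32_t app_trySend(const void *data, uint16_t len,
                            uint16_t remoteCoreId, uint16_t remoteEndPt,
                            uint16_t localEndPt)
        {
            /* timeout = 0: return immediately instead of blocking when the
             * transport has no free buffer */
            int32_t status = RPMessage_send((void *)data, len,
                                            remoteCoreId, remoteEndPt,
                                            localEndPt, 0u);
            if (status != SystemP_SUCCESS)
            {
                /* Defer the message instead of waiting on the remote core */
                PendingMsg msg;
                memcpy(msg.data, data, len);
                msg.len          = len;
                msg.remoteCoreId = remoteCoreId;
                msg.remoteEndPt  = remoteEndPt;
                status = txQueue_push(&msg);
            }
            return status;
        }

    A background task drains the queue and retries, so a full transport slows the sender down rather than blocking it.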

  • Hi Traveler,

    Can you please provide sample code to replicate the issue, for faster debugging?

    Can you please also check, when the deadlock happens, which piece of code each core is currently executing?

    Regards,

    Tushar

  • Hi Tushar,

    I'm sorry but the code is too large and too dependent on the specific hardware we have.

    The code my cores lock up in is inside vringPutEmptyRxBuf:

        IpcNotify_sendMsg(remoteCoreId,
            IPC_NOTIFY_CLIENT_ID_RPMSG,
            rxMsgValue,
            1 /* wait for message to be posted */
            );

    My system is heavily loaded on most cores, with shared memory and lots of IPC (RPMessage and spinlocks). However, this looks like a logic error in the design of RPMessage. There's not a lot of detail in the TRM about the crossbars involved, but the RPMessage code seems to assume that there are no mailbox writes in flight and that there is strong ordering between how different cores access a given mailbox when multiple transactions are in progress.

    I have since modified RPMessage to not wait infinitely on the other cores (with the necessary additional logic to ensure reliable comms), roughly along the lines of the sketch below.
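
    A minimal sketch of the idea, assuming the MCU+ SDK IpcNotify_sendMsg() API; the retry limit and the recovery policy are choices in my modification, not SDK behaviour.

        #include <stdint.h>
        #include <drivers/ipc_notify.h>
        #include <kernel/dpl/SystemP.h>

        #define RX_KICK_MAX_TRIES  (1000u)

        /* Bounded replacement for the blocking kick inside
         * RPMessage_vringPutEmptyRxBuf(): give up after a while instead of
         * spinning forever on the remote core's mailbox FIFO */
        static int32_t vring_kickRemoteBounded(uint16_t remoteCoreId, uint32_t rxMsgValue)
        {
            int32_t  status;
            uint32_t tries = 0u;

            do
            {
                /* waitForFifoNotFull = 0: return an error instead of spinning
                 * when the remote core's mailbox FIFO is full */
                status = IpcNotify_sendMsg(remoteCoreId,
                             IPC_NOTIFY_CLIENT_ID_RPMSG,
                             rxMsgValue,
                             0u);
                tries++;
            } while ((status != SystemP_SUCCESS) && (tries < RX_KICK_MAX_TRIES));

            /* On failure the caller re-kicks the vring later instead of
             * blocking the core indefinitely */
            return status;
        }

    Returning an error here pushes recovery up to the caller, which is where the additional reliability logic lives.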

  • Hi Traveler,

    Thanks for your patience.

    I have filed an internal Jira ticket for the above issue, and it is likely to be fixed in an upcoming release of the SDK (i.e. v10.1).

    Regards,

    Tushar