MCU-PLUS-SDK-AM243X: Cores deadlocked on IpcNotify_sendMsg

Part Number: MCU-PLUS-SDK-AM243X

I occasionally have 2 or 3 cores locked up in infinite loops on RPMessage_vringPutEmptyRxBuf()->IpcNotify_sendMsg().

Obviously, no core can be emptying a mailbox if all of them are stuck in an infinite loop waiting for another core to empty a mailbox.

Looking at the overall design of RPMessage, I don't see how deadlocks like this can be avoided.

Can the RPMessage designer confirm that this is a fundamental characteristic? Perhaps a tradeoff for speed? I'll rewrite RPMessage if I need to, but I'm hoping I'm missing something. Meanwhile, having one core lock up another is not something we can have in our system.

  • Hello Traveler,

    Thanks for your query.

    I occasionally have 2 or 3 cores locked up in infinite loops on RPMessage_vringPutEmptyRxBuf()->IpcNotify_sendMsg().

    Can you please tell us which cores are locked up in the infinite loop?

    Have you modified the default example provided in the MCU+SDK?

    Which version of MCU+SDK are you using?

    Regards,

    Tushar

  • I'm using cores 0, 2, 3, 4 (per the IPC core enumeration) for RPMessage IPC. As I've been developing, I've had 0, 3, 4 deadlock simultaneously once. Any pair of cores deadlocks a couple of times every 2-3 days of development. I haven't been recording which pair, but it seems like 0 (the M4F core) might be in the deadlock more often than the others.

    All cores are heavily loaded. Processing loops on each core have periods of 20-500 microseconds. I have ~30 receivers using callbacks. Receivers respect their interrupt context. All senders have timeouts, most with timeout 0 and a higher-level queue to deal with rejected sends (see the sketch at the end of this reply).

    It all works quite nicely except for the deadlocks.

    We had to lock versions at Industrial Communications SDK 9.0.0.3.
    However, I back-ported the fixes to RPMessage_send that are in the main development branch (by Ashwin Raj), as that issue was also causing deadlocks.
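
    For reference, the non-blocking sender pattern is roughly the sketch below. This is illustrative only: it assumes the standard MCU+ SDK RPMessage_send() called with timeout 0, and the PendingMsg type and txQueue_push() are hypothetical placeholders for our higher-level queue, not SDK functions.

        #include <string.h>
        #include <stdint.h>
        #include <drivers/ipc_rpmsg.h>
        #include <kernel/dpl/SystemP.h>

        #define MSG_MAX_LEN  (64u)

        typedef struct {
            uint8_t  data[MSG_MAX_LEN];
            uint16_t len;
            uint16_t remoteCoreId;
            uint16_t remoteEndPt;
        } PendingMsg;

        /* Hypothetical application-level queue that holds sends rejected
         * with timeout 0 until a background task retries them */
        extern int32_t txQueue_push(const PendingMsg *msg);

        int32_t app_trySend(const void *data, uint16_t len,
                            uint16_t remoteCoreId, uint16_t remoteEndPt,
                            uint16_t localEndPt)
        {
            /* timeout = 0: return immediately instead of blocking when the
             * transport has no free buffer */
            int32_t status = RPMessage_send((void *)data, len,
                                            remoteCoreId, remoteEndPt,
                                            localEndPt, 0u);
            if (status != SystemP_SUCCESS)
            {
                /* Defer the message instead of waiting on the remote core */
                PendingMsg msg;
                memcpy(msg.data, data, len);
                msg.len          = len;
                msg.remoteCoreId = remoteCoreId;
                msg.remoteEndPt  = remoteEndPt;
                status = txQueue_push(&msg);
            }
            return status;
        }

    A background task drains the queue and retries, so a full transport slows the sender down rather than blocking it.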

  • Hi Traveler,

    Can you please provide sample code to replicate the issue, for faster debugging?

    Can you please also check, when the deadlock happens, which piece of code each core is currently executing?

    Regards,

    Tushar

  • Hi Tushar,

    I'm sorry but the code is too large and too dependent on the specific hardware we have.

    The code my cores lock up in is inside vringPutEmptyRxBuf:

        IpcNotify_sendMsg(remoteCoreId,
            IPC_NOTIFY_CLIENT_ID_RPMSG,
            rxMsgValue,
            1 /* wait for message to be posted */
            );

    My system is heavily loaded on most cores, with shared memory and lots of IPC (RPMessage and spinlocks). However, this looks like a logic error in the design of RPMessage. There's not a lot of detail in the TRM about the crossbars involved, but the RPMessage code seems to assume that there are no mailbox writes in flight and that there is strong ordering between how different cores access a given mailbox when multiple transactions are in progress.

    I have since modified RPMessage to not wait infinitely on the other cores (with the necessary additional logic to ensure reliable comms), roughly along the lines of the sketch below.
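
    A minimal sketch of the idea, assuming the MCU+ SDK IpcNotify_sendMsg() API; the retry limit and the recovery policy are choices in my modification, not SDK behaviour.

        #include <stdint.h>
        #include <drivers/ipc_notify.h>
        #include <kernel/dpl/SystemP.h>

        #define RX_KICK_MAX_TRIES  (1000u)

        /* Bounded replacement for the blocking kick inside
         * RPMessage_vringPutEmptyRxBuf(): give up after a while instead of
         * spinning forever on the remote core's mailbox FIFO */
        static int32_t vring_kickRemoteBounded(uint16_t remoteCoreId, uint32_t rxMsgValue)
        {
            int32_t  status;
            uint32_t tries = 0u;

            do
            {
                /* waitForFifoNotFull = 0: return an error instead of spinning
                 * when the remote core's mailbox FIFO is full */
                status = IpcNotify_sendMsg(remoteCoreId,
                             IPC_NOTIFY_CLIENT_ID_RPMSG,
                             rxMsgValue,
                             0u);
                tries++;
            } while ((status != SystemP_SUCCESS) && (tries < RX_KICK_MAX_TRIES));

            /* On failure the caller re-kicks the vring later instead of
             * blocking the core indefinitely */
            return status;
        }

    Returning an error here pushes recovery up to the caller, which is where the additional reliability logic lives.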

  • Hi Traveler,

    Thanks for your patience.

    I have filed an internal Jira ticket for the above issue, and it is likely to be fixed in an upcoming release of the SDK (i.e. v10.1).

    Regards,

    Tushar