Linux/AM5728: Mailbox failure when using both DSPs

Part Number: AM5728
Other Parts Discussed in Thread: BQ40Z60

Tool/software: Linux

I'm running a test where I start and stop the DSPs on the AM5728 in a continuous loop. The test involves a few MessageQ messages going back and forth once the DSP has been started.

I can run the test successfully for an entire day on either DSP independently, as long as I am only running the start/stop loop on one of the DSPs.

If I run the test asynchronously and start/stop both DSPs, the test only lasts a few minutes and I see these errors:

[ 809.010549] omap-mailbox 48840000.mailbox: Try increasing MBOX_TX_QUEUE_LEN
[ 809.017583] omap-rproc 40800000.dsp: PM mbox_send_message failed: -105
[ 822.130458] omap-mailbox 48842000.mailbox: Try increasing MBOX_TX_QUEUE_LEN
[ 822.137461] omap-rproc 41000000.dsp: PM mbox_send_message failed: -105

I'm at a loss as to why these mailbox failures only pop up when I'm starting/stopping both DSPs at the same time.

Any ideas?

  • Here's some device tree detail:

    dra74x.dtsi:

    &mailbox5 {
        mbox_ipu1_ipc3x: mbox_ipu1_ipc3x {
            ti,mbox-tx = <6 2 2>;
            ti,mbox-rx = <4 2 2>;
            status = "disabled";
        };
        mbox_dsp1_ipc3x: mbox_dsp1_ipc3x {
            ti,mbox-tx = <5 2 2>;
            ti,mbox-rx = <1 2 2>;
            status = "disabled";
        };
    };

    &mailbox6 {
        mbox_ipu2_ipc3x: mbox_ipu2_ipc3x {
            ti,mbox-tx = <6 2 2>;
            ti,mbox-rx = <4 2 2>;
            status = "disabled";
        };
        mbox_dsp2_ipc3x: mbox_dsp2_ipc3x {
            ti,mbox-tx = <5 2 2>;
            ti,mbox-rx = <1 2 2>;
            status = "disabled";
        };
    };

    custom dts file:

    &dsp1 {
        status = "okay";
        memory-region = <&dsp1_cma_pool>;
        mboxes = <&mailbox5 &mbox_dsp1_ipc3x>;
        timers = <&timer5>;
    };

    &dsp2 {
        status = "okay";
        memory-region = <&dsp2_cma_pool>;
        mboxes = <&mailbox6 &mbox_dsp2_ipc3x>;
        timers = <&timer6>;
    };
  • The software team has been notified. They will respond here.
  • Hi, Gerard,

    I am trying to set up a reproduction on the TI EVM. From the error message, I assume MessageQ traffic is required. My idea is to have our ex02_messageq example running, then unbind both DSPs, then bind them again, re-run, and re-unbind. There may be issues in setting it up, but let's see what I get. It will take me a while to get things ready.

    Rex
  • Thank you. When it fails, I'll often see an oops like the one below. In this case, the application software on the ARM had just closed the DSP's MessageQ and was getting ready to destroy the host MessageQ; a sketch of that teardown follows the log.

    [ 950.515904] [0000002c] *pgd=9cfcb003, *pmd=9cfb6003, *pte=00000000
    [ 950.522361] Internal error: Oops: a07 [#1] PREEMPT SMP ARM
    [ 950.527874] Modules linked in: cmemk(O) rpmsg_proto virtio_rpmsg_bus omap_remoteproc remoteproc virtio_ring virtio fpga_config bnep hci_uart btbcm usb_f_ecm rfcomm xhci_plat_hcd xhci_hcd bluetooth g_ether usb_f_rndis libcomposite u_ether usbcore extcon_palmas dwc3 udc_core uio_pdrv_genirq uio phy_omap_usb2 dwc3_omap bridge stp llc xt_tcpudp ipv6 iptable_filter ip_tables x_tables hw_info bq40z60_battery mpu9250(C) contact_closure_vail leds_vail extcon_usb_gpio extcon rtc_isl1208
    [ 950.570792] CPU: 0 PID: 1104 Comm: hwtest_server Tainted: G WC O 4.4.41-gf9f6f0db2d #1
    [ 950.579616] Hardware name: Generic DRA74X (Flattened Device Tree)
    [ 950.585736] task: dd34d400 ti: de0bc000 task.ti: de0bc000
    [ 950.591168] PC is at rpmsg_sock_release+0xe4/0x11c [rpmsg_proto]
    [ 950.597198] LR is at 0xfffffffa
    [ 950.600355] pc : [<bf27672c>] lr : [<fffffffa>] psr: 80060013
    [ 950.600355] sp : de0bdee8 ip : dd52b774 fp : de0bdefc
    [ 950.611882] r10: ddbd4c08 r9 : 00000008 r8 : dd51c660
    [ 950.617129] r7 : de047e50 r6 : 00000000 r5 : dd4548c0 r4 : dcc50800
    [ 950.623685] r3 : 00000000 r2 : 00000000 r1 : 00000003 r0 : bf2771dc
    [ 950.630243] Flags: Nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
    [ 950.637408] Control: 30c5387d Table: 9ccb16c0 DAC: fffffffd
    [ 950.643179] Process hwtest_server (pid: 1104, stack limit = 0xde0bc218)
    [ 950.649823] Stack: (0xde0bdee8 to 0xde0be000)
    [ 950.654199] dee0: dd4548c0 bf2773c0 de0bdf14 de0bdf00 c0559e1c bf276654
    [ 950.662414] df00: ddbd4c00 dd4548e0 de0bdf24 de0bdf18 c0559eac c0559e00 de0bdf5c de0bdf28
    [ 950.670627] df20: c012632c c0559ea4 00000000 00000000 de0bdf54 dd34d770 c0b3189c 00000000
    [ 950.678839] df40: dd34d400 c000fbe4 de0bc000 00000000 de0bdf6c de0bdf60 c01264dc c01262b0
    [ 950.687055] df60: de0bdf8c de0bdf70 c004d51c c01264d8 de0bc010 c000fbe4 de0bdfb0 de0bc000
    [ 950.695268] df80: de0bdfac de0bdf90 c0012cf4 c004d490 b5795fa0 afbfda10 00000011 00000006
    [ 950.703483] dfa0: 00000000 de0bdfb0 c000fa8c c0012c4c 00000000 afbff4d4 00000002 00000000
    [ 950.711697] dfc0: b5795fa0 afbfda10 00000011 00000006 b5795eb4 b2500880 00000003 00000006
    [ 950.719910] dfe0: 00000000 afbfda08 afbff910 b5ce4764 80060010 00000011 82993410 00000028
    [ 950.728120] Backtrace:
    [ 950.730595] [<bf276648>] (rpmsg_sock_release [rpmsg_proto]) from [<c0559e1c>] (sock_release+0x28/0xa4)
    [ 950.739941] r5:bf2773c0 r4:dd4548c0
    [ 950.743550] [<c0559df4>] (sock_release) from [<c0559eac>] (sock_close+0x14/0x1c)
    [ 950.750976] r5:dd4548e0 r4:ddbd4c00
    [ 950.754591] [<c0559e98>] (sock_close) from [<c012632c>] (__fput+0x88/0x1d8)
    [ 950.761587] [<c01262a4>] (__fput) from [<c01264dc>] (____fput+0x10/0x14)
    [ 950.768314] r10:00000000 r9:de0bc000 r8:c000fbe4 r7:dd34d400 r6:00000000 r5:c0b3189c
    [ 950.776218] r4:dd34d770
    [ 950.778775] [<c01264cc>] (____fput) from [<c004d51c>] (task_work_run+0x98/0xcc)
    [ 950.786123] [<c004d484>] (task_work_run) from [<c0012cf4>] (do_work_pending+0xb4/0xb8)
    [ 950.794072] r7:de0bc000 r6:de0bdfb0 r5:c000fbe4 r4:de0bc010
    [ 950.799793] [<c0012c40>] (do_work_pending) from [<c000fa8c>] (slow_work_pending+0xc/0x20)
    [ 950.808003] r7:00000006 r6:00000011 r5:afbfda10 r4:b5795fa0
    [ 950.813719] Code: e30701dc e3a02000 e34b0f27 e59331d0 (e583202c)
    [ 950.822682] ---[ end trace d0089e7fbb7baa73 ]---
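
    For context, here is a minimal sketch of that teardown, assuming the standard IPC 3.x MessageQ API (the function and variable names are illustrative, not our actual code). rpmsg_sock_release() in the backtrace is the kernel side of closing these endpoints.

    #include <ti/ipc/Std.h>
    #include <ti/ipc/MessageQ.h>

    /* Sketch only: the teardown in progress when the oops hit.
     * The remote queue is closed first, then the local queue is
     * destroyed; the crash occurs while the underlying rpmsg
     * socket for one of these endpoints is being released. */
    static void shutdownIpc(MessageQ_Handle hostQueue,
                            MessageQ_QueueId remoteQueueId)
    {
        MessageQ_close(&remoteQueueId);  /* done with the DSP's queue   */
        MessageQ_delete(&hostQueue);     /* destroy the host-side queue */
    }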
  • Rex Chang said:

    I am trying to set up a reproduction on the TI EVM. From the error message, I assume MessageQ traffic is required. My idea is to have our ex02_messageq example running, then unbind both DSPs, then bind them again, re-run, and re-unbind. There may be issues in setting it up, but let's see what I get. It will take me a while to get things ready.

    Rex,

    Any luck getting this test set up with the example TI MessageQ application?

    Thanks

  • Hi, Gerard,

    I apologize for the slow response. I am getting a different error from yours. I am trying to reset each DSP in a one-DSP scenario, with different reset intervals, to see what makes the difference.

    Using numLoops: 100000; payloadSize: 8, procId : 4
    Entered MessageQApp_execute
    Using numLoops: 100000; payloadSize: 8, procId : 3
    Entered MessageQApp_execute
    Local MessageQId: 0x81
    Local MessageQId: 0x80
    Error in MessageQ_open [-1]
    Error in MessageQ_open [-1]
    Leaving MessageQApp_execute

    Leaving MessageQApp_execute

    Rex
  • Does it work for you if you run the loop test with only a single DSP?
  • Hi, Gerard,

    In the one-DSP case, I got the first error at the 10th iteration, recovered at the 11th, then got errors thereafter. From the code, the error seems to be coming from NameServer. I am trying to understand what the complaint is and checking whether something isn't being cleaned up properly.

    Rex
  • Rex Chang said:

    In the one-DSP case, I got the first error at the 10th iteration, recovered at the 11th, then got errors thereafter. From the code, the error seems to be coming from NameServer. I am trying to understand what the complaint is and checking whether something isn't being cleaned up properly.

    I don't think this will apply to you and your EVM testing, but on our custom hardware, we'd get intermittent errors like you described if we did not reset PCIe before powering off the DSP.

  • Rex,

    Could you tell me what version of the TI IPC library/MessageQ example application you're using and provide a link, please?

    Thanks
  • Hi, Gerard,

    I used ProcSDK 3.3.0.4, and the IPC version should be 3.44.01.01. MessageQBench is included in the Linux filesystem. Linux ProcSDK 3.3.0.4 can be downloaded from software-dl.ti.com/.../index_FDS.html

    My shell script to test:

    i=0
    while true
    do
        let "i=i+1"
        echo "================================================="
        echo "================================================="
        echo "Test #$i"
        echo "================================================="

        MessageQBench 100000 2 4 &
        MessageQBench 100000 2 3 &

        sleep 2

        echo " "
        echo "-------------------------------------------------"
        echo " "

        echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/unbind
        echo 41000000.dsp > /sys/bus/platform/drivers/omap-rproc/unbind

        echo " "
        echo "-------------------------------------------------"
        echo " "

        sleep 15

        echo 40800000.dsp > /sys/bus/platform/drivers/omap-rproc/bind
        echo 41000000.dsp > /sys/bus/platform/drivers/omap-rproc/bind

        echo " "
        echo " "
        sleep 20
    done

    Rex
  • Hi, Gerard,

    The mailbox errors look to me like the mailbox queues are filling up before the messages get consumed. If your applications use the mailbox directly, you may need to handle the case where it clogs up. We are checking internally to see whether the pending messages in the Tx queue are resent after the DSP comes up, but in the meantime, could you try increasing MBOX_TX_QUEUE_LEN to cover the DSP downtime? We can't understand why two DSPs make a difference compared to the one-DSP scenario, because we use separate mailboxes to talk to DSP1 and DSP2. Each has its own state and behaves in exactly the same fashion as if it were the only processor running. It may be that both DSPs are using the same Tx fifo_id, which exacerbates the problem.
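
    For reference, in 4.4-era kernels the queue depth is a compile-time constant in include/linux/mailbox_controller.h, so changing it means rebuilding the kernel. Something like:

    /* include/linux/mailbox_controller.h (4.4-era kernel) */
    #define MBOX_TX_QUEUE_LEN  20    /* default depth of the software Tx queue */

    /* experiment: e.g. double it to ride out the DSP downtime */
    /* #define MBOX_TX_QUEUE_LEN  40 */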

    Rex
  • Hi, Gerard,

    Here is more information about the errors you are getting.

    Remoteproc attempts auto-suspend when a 10-second timeout expires during which there was no communication with the DSP. Before going into the suspend state, it sends one more message to the DSP. A suspended DSP is normally woken up when a message send is attempted, and the DSP reading a message creates space in the h/w FIFO to send one more. Since nothing on the DSP is processing messages, remoteproc's last attempt fails because the queue is full. Hence the error "omap-rproc: PM mbox_send_message failed: -105"; -105 is -ENOBUFS, returned when the MBOX_TX_QUEUE_LEN limit is reached.
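
    The failing check lives in the mailbox framework's software ring buffer; paraphrased from drivers/mailbox/mailbox.c in a 4.4-era kernel, it looks roughly like this:

    /* Paraphrased from drivers/mailbox/mailbox.c: mbox_send_message()
     * queues each message into a fixed-size software ring buffer in
     * front of the controller's h/w FIFO. When all slots are in
     * flight, the send fails with -ENOBUFS (errno 105), which is the
     * "-105" in the log above. */
    static int add_to_rbuf(struct mbox_chan *chan, void *mssg)
    {
        unsigned long flags;
        int idx;

        spin_lock_irqsave(&chan->lock, flags);

        /* See if there is any space left */
        if (chan->msg_count == MBOX_TX_QUEUE_LEN) {
            spin_unlock_irqrestore(&chan->lock, flags);
            return -ENOBUFS;
        }

        idx = chan->msg_free;
        chan->msg_data[idx] = mssg;
        chan->msg_count++;

        if (idx == MBOX_TX_QUEUE_LEN - 1)
            chan->msg_free = 0;
        else
            chan->msg_free++;

        spin_unlock_irqrestore(&chan->lock, flags);
        return idx;
    }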

    Rex

  • Rex Chang said:

    The mailbox errors look to me like the mailbox queues are filling up before the messages get consumed. If your applications use the mailbox directly, you may need to handle the case where it clogs up. We are checking internally to see whether the pending messages in the Tx queue are resent after the DSP comes up, but in the meantime, could you try increasing MBOX_TX_QUEUE_LEN to cover the DSP downtime? We can't understand why two DSPs make a difference compared to the one-DSP scenario, because we use separate mailboxes to talk to DSP1 and DSP2. Each has its own state and behaves in exactly the same fashion as if it were the only processor running. It may be that both DSPs are using the same Tx fifo_id, which exacerbates the problem.

    Our application does not use the mailboxes directly; we only use MessageQ. As posted above, we have separate mailboxes specified for each DSP, so I wouldn't think there is any kind of conflict when both DSPs are active.

  • Gerard,

    The mailboxes are used by MessageQ, but we agree that each DSP uses a separate set, so they shouldn't be the direct reason for the issues you are seeing. When the DSPs wake up in your tests, are they able to consume messages? How many messages do you consume before resetting them?

    Since this might be timing-related now that both DSPs are being reset, have you tried extending the time between each reset to see if that has an impact?

  • RonB said:

    The mailboxes are used by MessageQ, but we agree that each DSP uses a separate set, so they shouldn't be the direct reason for the issues you are seeing. When the DSPs wake up in your tests, are they able to consume messages? How many messages do you consume before resetting them?

    Since this might be timing-related now that both DSPs are being reset, have you tried extending the time between each reset to see if that has an impact?

    While the test is working successfully, each DSP consumes 3 MessageQ messages before being powered off/reset. Adding delays to extend the time between resets did not have an impact, unfortunately. Has there been any progress on getting the MessageQ example application working on the EVM on your end?

    Thanks!

  • Hi, Gerard,

    The mailbox driver doesn't clean up any pending messages when shutting down a remote processor, so if there are pending messages in the Tx queue, they will remain in the mailbox IP. We'd like to understand your test scenario: are there multiple applications, is each application talking to both DSP cores or just a single core, and how many MessageQs are being used?

    A MessageQ_open() typically involves looking up a named MessageQ that may live on any remote core, and works by querying each remote processor to ask whether that named MessageQ object exists there. Applications typically do this in a do..while loop, and each invocation sends messages to all active processors. So even if your application is designed to talk to a single core, this lookup keeps sending messages to all cores. There should be logic in the IPC libraries to prevent sending these queries to a downed processor.
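
    For illustration, here is a minimal sketch of that lookup loop on the Linux side, assuming the IPC 3.x MessageQ API (the queue name and back-off interval are illustrative):

    #include <unistd.h>
    #include <ti/ipc/Std.h>
    #include <ti/ipc/MessageQ.h>

    /* Sketch only: open the DSP-side queue, retrying while the name
     * is not yet registered. Every failed MessageQ_open() sends
     * NameServer queries to the remote cores, which is the traffic
     * that can pile up against a downed DSP. */
    static Int openSlaveQueue(MessageQ_QueueId *remoteQueueId)
    {
        Int status;

        do {
            status = MessageQ_open("SLAVE_DSP1", remoteQueueId);
            if (status == MessageQ_E_NOTFOUND)
                usleep(10000);  /* back off before the next query */
        } while (status == MessageQ_E_NOTFOUND);

        return status;
    }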

    One could add logic to empty all outstanding messages while freeing up the channel at the driver level, but we need to understand whether there's a gap in the first place: why are there 20+ pending messages in the queue? The PM runtime suspend should also not be happening (and we don't believe it is in your test scenario), since it is only triggered after 10 seconds of no activity between the MPU and DSP, and while it runs it is all but guaranteed to be the only message exchange in flight (a request and an ack). And there is no runtime suspend if the processor is shut down. We do have error recovery tests, which exercise a recovery (shutdown and restart) while an application is running, but we never ran into the issue of pending messages. So we are curious to know what your DSP image is doing.

    Rex
  • Rex Chang said:


    The mailbox driver doesn't clean up any pending messages when shutting down a remote processor, so if there are pending messages in the Tx queue, they will remain in the mailbox IP. We'd like to understand your test scenario: are there multiple applications, is each application talking to both DSP cores or just a single core, and how many MessageQs are being used?
     

    There is 1 application running on the ARM per DSP. Each application is assigned to and only talks to a single DSP core. Each application has a single host MessageQ and attempts to open the slave MessageQ of the DSP it is talking to. 

    Thanks,

    Gerard

  • Gerard,

    As in my previous post, the mailbox doesn't clean up any pending messages, but when the remote processor comes up, there should be a pending mailbox interrupt for it, so it depends on what the DSP does with those messages.

    We had an internal discussion, and we can add code to drain/remove all outstanding messages when the mailbox channel is released, but that's only a patch-up at the kernel level. We still don't have an explanation for the stale/outstanding messages; we need to understand whether there are gaps in the IPC code that need to be plugged, or gaps in the customer application/baseimage itself.
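
    A rough, hypothetical sketch of what such a drain could look like in the framework's mbox_free_channel() (abridged; field names follow drivers/mailbox/mailbox.c, and this is not shipped code):

    #include <linux/mailbox_controller.h>

    /* HYPOTHETICAL sketch: drop any queued-but-unsent Tx messages when
     * the channel is released, so stale messages do not survive a DSP
     * stop/start cycle. Abridged from the real mbox_free_channel(). */
    void mbox_free_channel(struct mbox_chan *chan)
    {
        unsigned long flags;

        if (!chan || !chan->cl)
            return;

        chan->mbox->ops->shutdown(chan);

        spin_lock_irqsave(&chan->lock, flags);
        chan->cl = NULL;
        chan->active_req = NULL;
        chan->msg_count = 0;    /* drain: forget queued Tx messages */
        chan->msg_free = 0;
        spin_unlock_irqrestore(&chan->lock, flags);
    }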

    Because I am seeing different errors from yours, is there any way we can get your DSP application to reproduce the exact errors on our EVM so we can look at the issue further? If it can be shared, could you send it to Randy to forward to me?

    Rex
  • Rex Chang said:


    Because I am seeing different errors from yours, is there any way we can get your DSP application to reproduce the exact errors on our EVM so we can look at the issue further? If it can be shared, could you send it to Randy to forward to me?

    It's concerning that you were seeing errors, period, given that it was your own demo code on an EVM. Is that going to result in a patch?

    I will get our stripped-down DSP application up and running on the EVM and try to reproduce the asynchronous load failure related to the mailbox queue.

    Thanks