Hello,
This is my first ever post to a forum of any kind, so I hope I have given enough information below. Reading it again, I now worry that it is too long :)
This is more of an architectural question than a how-do-I question. We have a fully functional system with an instability that, after much testing and investigation, I am currently attributing to DSP/Link. By the way, before I question the stability of DSP/Link below, let me say that we have been using TI processors (DSP and OMAP) and DSP/BIOS for many years now and I am very happy with both. Some of the ancillary software has been much more challenging to use due to the packaging system, though much of it has proven to be fairly stable.
We are using DSP/Link in a slightly different way than just a codec interface, though one that it would seem to support, and we are seeing random crashes in both the Linux system and the DSP/BIOS system. Since we are using DSP/Link as a more general IPC mechanism between the ARM and the DSP, our use case is very different from that of a codec: many messages per second, non-periodic operation, either side can initiate a message, etc.
To give a little background, we have been running a system on the OMAP2 and now the OMAP3 with an IPC system that we implemented using the mailbox subsystem. Now that we want to integrate media codecs into our system using TI's media stack (DSP/BIOS, DSP/Link, Codec Engine, DMAI, gstreamer-ti), we have ported the very bottom of our ARM/DSP communication system to use DSP/Link MSGQ. Note that we have not yet integrated Codec Engine, only the system components that Codec Engine needs in order to run. Our system running on the DSP is a wireless communication system that takes in voice samples and Ethernet frames as input, with data rates ranging from very low to very high (many Mbps). This data is transferred between the ARM and the DSP.
In general, the system works as required, though as we run more and more data through the system, we get these random crashes. The data rate does not seem to affect the rate of crashes. When the crash is on the ARM side, it is indicated by a Linux kernel PREEMPT or a Linux crash with a stack dump indicating that something in LDRV or MSGQ has failed. When the DSP crashes, we see DSP MMU faults with fairly random addresses. These crashes happen after fairly random amounts of time, sometimes on the order of minutes, sometimes hours. Just to be clear, I have never seen a DSP MMU fault with our previous IPC system, except during the very first stages of integration. It was also rare for Linux to crash in such a way that a power cycle was the only way to recover it. Both are now regular occurrences in a previously stable system.
I realize that these indications could be caused by a myriad of bugs and instabilities within our own code, but I am sure that these particular indications are not. The integration with DSP/Link was an exercise in changing one file of our code, maybe 30 lines of actual code on each of the Linux and DSP sides. This is very simple code that initializes the MSGQ and sends messages with it, basically replacing our allocate functions with MSGQ_alloc and replacing our send/receive with MSGQ_send and MSGQ_recv. The rest of our system has not changed in any significant way, and the data profile has not changed from our previous system, at least in the ways that we are running tests to validate behavior with DSP/Link.
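To illustrate the shape of that change, here is a minimal sketch of a transport shim in which only the allocate/send/receive hooks differ between backends. It is not our actual code: every name in it (ipc_msg, ipc_transport, the loop_* functions) is hypothetical, and the backend shown is an in-process loopback stand-in rather than real DSP/Link or mailbox calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical message layout. A real DSP/Link message would have to begin
 * with the library's own message header, which is omitted here. */
typedef struct ipc_msg {
    uint16_t len;            /* payload length in bytes */
    uint8_t  payload[64];
} ipc_msg;

/* The only part that changes when the backend is swapped. */
typedef struct ipc_transport {
    ipc_msg *(*alloc)(void);
    void     (*release)(ipc_msg *msg);
    int      (*send)(ipc_msg *msg);     /* 0 on success, -1 on failure */
    int      (*recv)(ipc_msg **msg);    /* 0 on success, -1 if empty   */
} ipc_transport;

/* Loopback backend: a small in-process ring of message pointers. */
#define LOOP_DEPTH 8u
static ipc_msg *loop_q[LOOP_DEPTH];
static unsigned loop_head;   /* next slot to read  */
static unsigned loop_tail;   /* next slot to write */

static ipc_msg *loop_alloc(void)          { return calloc(1, sizeof(ipc_msg)); }
static void     loop_release(ipc_msg *m)  { free(m); }

static int loop_send(ipc_msg *msg)
{
    if (loop_tail - loop_head == LOOP_DEPTH)
        return -1;                        /* ring is full */
    loop_q[loop_tail++ % LOOP_DEPTH] = msg;
    return 0;
}

static int loop_recv(ipc_msg **msg)
{
    if (loop_head == loop_tail)
        return -1;                        /* ring is empty */
    *msg = loop_q[loop_head++ % LOOP_DEPTH];
    return 0;
}

static const ipc_transport loopback = {
    loop_alloc, loop_release, loop_send, loop_recv
};

/* Upper layers call only through the interface, so they do not change
 * when the backend does. Returns 0 if the payload survives a round trip. */
int ipc_roundtrip(const ipc_transport *t, const void *data, uint16_t len)
{
    ipc_msg *tx;
    ipc_msg *rx = NULL;
    int ok;

    assert(t != NULL);
    tx = t->alloc();
    if (tx == NULL)
        return -1;
    if (len > sizeof tx->payload) {       /* payload does not fit */
        t->release(tx);
        return -1;
    }
    tx->len = len;
    memcpy(tx->payload, data, len);
    if (t->send(tx) != 0) {
        t->release(tx);
        return -1;
    }
    if (t->recv(&rx) != 0)
        return -1;
    ok = (rx->len == len && memcmp(rx->payload, data, len) == 0) ? 0 : -1;
    t->release(rx);
    return ok;
}

/* Convenience entry point for trying the sketch with the loopback backend. */
int ipc_loopback_demo(void)
{
    return ipc_roundtrip(&loopback, "ping", 5);
}
```

Calling ipc_loopback_demo() exercises the whole path. In a layout like this, moving to a different IPC mechanism means supplying a different ipc_transport table and leaving the callers alone, which is why the port touched so little code.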
- Has anyone attempted using DSP/Link in production code and been able to run for days without crashes? This seems like the first engineering test for an IPC mechanism, but I would say DSP/Link is not stable in this way. I am open to hearing otherwise and to see some test data indicating that it is on our platform, but I have not been able to find that documentation.
- What are the limitations of DSP/Link in hard terms such as messages/second, etc.? We have not seen throughput issues, but a qualified set of measurements (OMAP3530 at 500/350 MHz, n messages/second) would be helpful for designers looking to use the system. If I have missed it in the documentation, I would appreciate a reference.
- Is this use case something that the designers of DSP/Link would shy away from? Every indication from the documentation says that this is an approved use case for DSP/Link. I realize that systems are designed for their purpose and I wonder if DSP/Link was designed for the general case but only tested for the more specific case of a media codec accelerator. The requirements for successful operation could be vastly simpler for the latter vs. the former. The fact that DSP/Link is now deprecated does not give me a warm and fuzzy feeling.
- Is there any way that a slight setup issue could cause something like this? I highly doubt this. We spent a lot of time getting to know DSP/Link, in fact more time than it took to write and test our previous system, before integrating and I believe we have it set up properly. I have read through E2E posts enough to know that we have a pretty clear understanding of this system.
- Is SYS/Link a viable alternative yet? I would be willing to give it a try, but it is at beta1 and, because of that, not very well supported in the greater community. I would also need some positive indication that it would resolve these types of issues.
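On the messages/second question above, the kind of figure I have in mind could be gathered with a harness along these lines. This is only a sketch under stated assumptions: send_fn, rate_result, measure_rate and stub_send are hypothetical names, the sender is a stub rather than a real IPC call, and clock() reports processor time, so on a real target a wall-clock timer would be the better choice for a blocking send.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical hook: sends one message through the IPC layer under test.
 * Returns 0 on success, nonzero on failure. */
typedef int (*send_fn)(void *ctx);

typedef struct rate_result {
    uint32_t sent;          /* sends that reported success              */
    uint32_t failed;        /* sends that reported failure              */
    double   seconds;       /* processor time consumed, from clock()    */
    double   msgs_per_sec;  /* 0.0 if the interval was too short to time */
} rate_result;

/* Calls send() count times and reports how many succeeded and how fast. */
rate_result measure_rate(send_fn send, void *ctx, uint32_t count)
{
    rate_result r = {0, 0, 0.0, 0.0};
    clock_t start;
    uint32_t i;

    assert(send != NULL);
    start = clock();
    for (i = 0; i < count; i++) {
        if (send(ctx) == 0)
            r.sent++;
        else
            r.failed++;
    }
    r.seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    if (r.seconds > 0.0)
        r.msgs_per_sec = r.sent / r.seconds;
    return r;
}

/* Stand-in sender: counts calls instead of touching any hardware. */
static int stub_send(void *ctx)
{
    (*(uint32_t *)ctx)++;
    return 0;
}
```

Passing stub_send with a pointer to a uint32_t counter gives a baseline for the loop overhead itself. Replacing the stub with a function that wraps the real send path would give the number that could be quoted alongside the clock rates.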
I hope that there is a simpler explanation for what we are seeing, but in the absence of some ideas from the TI community, my current plan of attack is to rework the bottom of DSP/Link (IPS, etc.). I hope that I can find a race condition that can be easily resolved, but looking at the complexity of the design of the multiprocessor communication structures, I highly doubt that this will be possible. I realize that reworking this is extreme, but I have to be 100% certain that this is stable. Right now, I can't verify that either by reading the code or by reading the documentation, so if anyone can help with either, it would be very useful.
Thanks in advance,
Tom
System:
- OMAP3530
- DSP/BIOS 5.41.09.34
- DSP/Link 1.65.00.03
- Framework Components 2.26.00.01
- XDC 3.20.06.81
- CGT 6.1.19
- Linux 2.6.34 with OpenEmbedded