DSP/Link - Heavy use of DSP/Link, Linux and DSP/BIOS crashes, are they related?

TSC

Other Parts Discussed in Thread: OMAP3530, OMAP-L137

Hello,

This is my first ever post to a forum of any kind, so I hope I have given enough information below. Reading it again, I now worry that it is too long :)

This is more of an architectural question than a how-do-I question. We have a fully functional system with an instability that, after much testing and investigation, I am currently attributing to DSP/Link. By the way, before I question the stability of DSP/Link below, let me say that we have been using TI processors (DSP and OMAP) and DSP/BIOS for many years now and I am very happy with both of those systems. Some of the ancillary software has been much more challenging to use due to the packaging system, though much of it has proven to be fairly stable.

We are using DSP/Link in a slightly different way than just a codec interface, though one that it would seem to support, and we are seeing random crashes within both the Linux system and the DSP/BIOS system. Since we are using DSP/Link as a more general IPC mechanism between the ARM and the DSP, our use case is very different from that of a codec: many messages per second, non-periodic operation, either side can initiate a message, etc.

To give a little background, we have been running a system on the OMAP2 and now the OMAP3 with an IPC system that we implemented using the mailbox subsystem. Now that we want to integrate media codecs into our system using TI's media stack (DSP/BIOS, DSP/Link, Codec Engine, DMAI, gstreamer-ti), we have ported the very bottom of our ARM/DSP communication system to use DSP/Link MSGQ. Note that we have not yet integrated Codec Engine, only the system components that Codec Engine needs to run on. Our system running on the DSP is a wireless communication system that takes voice samples and Ethernet frames as input, with data rates ranging from very low to very high (many Mbps). This data is transferred between the ARM and the DSP.

In general, the system works as required, though as we run more and more data through the system, we get these random crashes. The data rate does not seem to affect the rate of crashes. On the ARM side, a crash shows up as a Linux kernel PREEMPT or a Linux crash with a stack dump indicating that something in LDRV or MSGQ has failed. On the DSP side, we see DSP MMU faults with fairly random addresses. These crashes happen after fairly random amounts of time, sometimes on the order of minutes, sometimes hours. Just to be clear, I never saw a DSP MMU fault with our previous IPC system, except during the very first stages of integration. It was equally rare for Linux to crash so badly that a power cycle was the only way to recover it. These are now regular occurrences in a previously stable system.

I realize that these indications could be caused by a myriad of bugs and instabilities within our own code, but I am sure that these particular indications are not. The integration with DSP/Link was an exercise in changing one file of our code and maybe 30 lines of actual code on the Linux and the DSP side. This is very simple code that initializes the MSGQ and sends messages with it, basically replacing our allocate functions with MSGQ_alloc and replacing our send/receive with MSGQ_send and MSGQ_recv. The rest of our system has not changed in any significant way and the data profile has not changed from our previous system, at least in the ways that we are running tests to validate behavior with DSP/Link.

Has anyone attempted using DSP/Link in production code and been able to run for days without crashes? This seems like the first engineering test for an IPC mechanism, but I would say DSP/Link is not stable in this way. I am open to hearing otherwise and to see some test data indicating that it is on our platform, but I have not been able to find that documentation.
What are the limitations of DSP/Link in hard terms such as messages/second, etc.? We have not seen throughput issues, but a qualified set of measurements (OMAP3530 at 500/350 MHz, n messages/second) would be helpful for designers looking to use the system. If I have missed it in the documentation, I would appreciate a reference.
Is this use case something that the designers of DSP/Link would shy away from? Every indication from the documentation says that this is an approved use case for DSP/Link. I realize that systems are designed for their purpose and I wonder if DSP/Link was designed for the general case but only tested for the more specific case of a media codec accelerator. The requirements for successful operation could be vastly simpler for the latter vs. the former. The fact that DSP/Link is now deprecated does not give me a warm and fuzzy feeling.
Is there any way that a slight setup issue could cause something like this? I highly doubt this. We spent a lot of time getting to know DSP/Link, in fact more time than it took to write and test our previous system, before integrating and I believe we have it set up properly. I have read through E2E posts enough to know that we have a pretty clear understanding of this system.
Is SYS/Link a viable alternative yet? I would be willing to give it a try, but it is beta1 and not very well supported in the greater community due to this. I also would need some positive indication that it would resolve these types of issues.

I hope that there is a simpler explanation for what we are seeing, but in the absence of some ideas from the TI community, my current plan of attack is to rework the bottom of DSP/Link (IPS, etc.). I hope that I can find a race condition that can be easily resolved, but looking at the complexity of the design of the multiprocessor communication structures, I highly doubt that this will be possible. I realize that reworking this is extreme, but I have to be 100% certain that this is stable. Right now, I can't verify either by reading the code or by reading documentation, so if anyone can help with either, it would be very useful.

Thanks in advance,

Tom

System:

OMAP3530
DSP/BIOS 5.41.09.34
DSP/Link 1.65.00.03
Framework Components 2.26.00.01
XDC 3.20.06.81
CGT 6.1.19
Linux 2.6.34 with OpenEmbedded

over 14 years ago

Chris Ring over 14 years ago


Tom Tom said:
Has anyone attempted using DSP/Link in production code and been able to run for days without crashes? This seems like the first engineering test for an IPC mechanism, but I would say DSP/Link is not stable in this way. I am open to hearing otherwise and to see some test data indicating that it is on our platform, but I have not been able to find that documentation.

We have many customers going to production with DSP Link. I'm sure this statement would carry more weight if you were hearing it from some of them (or TI field teams that can vouch for it), so I welcome them posting followups on this.

Tom Tom said:
What are the limitations of DSP/Link in hard terms such as messages/second, etc.? We have not seen throughput issues, but a qualified set of measurements (OMAP3530 at 500/350 MHz, n messages/second) would be helpful for designers looking to use the system. If I have missed it in the documentation, I would appreciate a reference.

I think we used to ship some benchmark figures in previous releases, but I couldn't find them in the more recent releases. I'll see if I can dig some of these up.

Tom Tom said:
Is this use case something that the designers of DSP/Link would shy away from? Every indication from the documentation says that this is an approved use case for DSP/Link. I realize that systems are designed for their purpose and I wonder if DSP/Link was designed for the general case but only tested for the more specific case of a media codec accelerator. The requirements for successful operation could be vastly simpler for the latter vs. the former. The fact that DSP/Link is now deprecated does not give me a warm and fuzzy feeling.

Using the lower-level primitives is certainly a supported use case. While the traditional DVSDK/Codec Engine use cases (e.g. PROC and MSGQ) may get more use, many customers use the lower level APIs.

And to be clear, DSP Link is a supported product. Deprecated is too strong a term as deprecated means that users are encouraged to stop using it - that's not the case with DSP Link. On many platforms, e.g. DM644x, DM6467, OMAP-L137, DSP Link is currently the only production qualified TI-provided solution, and will remain that way. On newer devices, we're pushing the next generation SysLink product, but there are no plans to backport SysLink to some of these other devices, and as a result, DSP Link support will continue. (Although, admittedly, new development is not taking place on DSP Link).

Tom Tom said:
Is there any way that a slight setup issue could cause something like this? I highly doubt this. We spent a lot of time getting to know DSP/Link, in fact more time than it took to write and test our previous system, before integrating and I believe we have it set up properly. I have read through E2E posts enough to know that we have a pretty clear understanding of this system.

Not likely, but perhaps. The fact that things generally work for a while and then a random MMU fault occurs makes me think (like you) that it's a bug on the DSP side. Somebody following a bad pointer or blowing their stack.

Sometimes the MMU faults are more interesting than you may think. For the faulting addresses, are there any patterns to them? Comparing them to your memory map, are they consistently in a code section? Data section?

Are you providing addresses from the ARM side that might have been translated poorly to DSP-side addresses? Do the crashes go away if you disable cache?

All just general common debugging tips that may help isolate the issue.
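
One way to look for a pattern in the faulting addresses is to bucket each logged address by memory-map region. The following is only a minimal sketch: the region names, base addresses, and lengths are placeholders invented for illustration (they are not from this thread or from any TI memory map), and `find_region` and `tally_faults` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One region of the DSP memory map. */
typedef struct {
    const char *name;
    uint32_t    base;   /* first address of the region */
    uint32_t    len;    /* region length in bytes */
} mem_region_t;

/* Placeholder values for illustration only; take the real ones
 * from the DSP linker map file. */
const mem_region_t dsp_map[] = {
    { "code",   0x87000000u, 0x00100000u },
    { "data",   0x87100000u, 0x00080000u },
    { "shared", 0x87200000u, 0x00040000u },
};

#define NUM_REGIONS (sizeof dsp_map / sizeof dsp_map[0])

/* Returns the index of the region containing addr, or NUM_REGIONS if none. */
size_t find_region(uint32_t addr)
{
    for (size_t i = 0; i < NUM_REGIONS; i++) {
        if (addr >= dsp_map[i].base && addr - dsp_map[i].base < dsp_map[i].len)
            return i;
    }
    return NUM_REGIONS;
}

/* Tallies logged fault addresses per region. counts must have
 * NUM_REGIONS + 1 entries; the last one counts unmapped addresses. */
void tally_faults(const uint32_t *addrs, size_t n, unsigned counts[])
{
    assert(addrs != NULL && counts != NULL);
    for (size_t i = 0; i <= NUM_REGIONS; i++)
        counts[i] = 0;
    for (size_t i = 0; i < n; i++)
        counts[find_region(addrs[i])]++;
}
```

If most faults land in one region, or consistently outside every region, that narrows down whether the bad pointer came from code, data, or a translated ARM-side address.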

Tom Tom said:
Is SYS/Link a viable alternative yet? I would be willing to give it a try, but it is beta1 and not very well supported in the greater community due to this. I also would need some positive indication that it would resolve these types of issues.

I wish I could say "yes", but I have to say "no" at this point. SysLink is not yet GA qualified, and won't be for a few more months. It will also require an update from BIOS 5 to BIOS 6, which is an additional hurdle you may not want to introduce.

I tend to recommend "whatever the DVSDKs are using" as they do get the most testing and mileage on them. Right now, I don't know of any OMAP3-based DVSDK planning to move to SysLink.

If you do happen to find a bug, _please_ report it so we can address it. If you have a simple test case, or can reproduce it in one of the DSP Link examples (maybe find one that looks like your use case and throw it in a loop for a few days), please let us know! We want to fix it for everyone.

Chris

TSC over 14 years ago in reply to Chris Ring


Hi Chris,

Thanks for your response. I was more interested in the breadth of use cases. I certainly believe that companies are deploying DSP/Link-based media systems, but are your customers using it for other purposes? I'm also glad to hear that DSP/Link is still supported. I would definitely appreciate any benchmarks you can find, even on older versions. We are currently digging down into the details of the MPCS component to try to come to our own conclusions as to the performance. I will post our conclusions and test approach when our testing is a little more mature.

Chris Ring said:

Using the lower-level primitives is certainly a supported use case. While the traditional DVSDK/Codec Engine use cases (e.g. PROC and MSGQ) may get more use, many customers use the lower level APIs.

We are actually only using the PROC, MSGQ, and POOL interfaces. We are just using them a bit differently than the codec server use case does.

Something I should have mentioned in my original post is that we have had to make some modifications to DSP/Link. I realize that this completely invalidates any guarantees that TI would make. Here is a quick rundown of our changes. I would not mind posting the actual changes if someone wanted to take a look at them.

The DSP MMU no longer requires that virtual and physical addresses be the same.
- We are interfacing the DSP with an FPGA on the GPMC bus, so there is no way to map the FPGA into the DSP's physical address space.
Exposed some of the kernel APIs to our kernel modules.
- We are using DSP/Link to send messages from the network stack, the network stack is in the kernel, and parts of our driver run in interrupt context.
The Linux-side spinlock was implemented with a kernel mutex, which we changed to a real spinlock in order to use these APIs in interrupt context. This change required a fix to the call to kernel_create, which can sleep and so cannot be called under a spinlock.
- See above.
Applied a patch from OpenEmbedded to share the DSP MMU interrupt.
- I believe this is so that a message can be printed out.
Turn the core off when stopping DSP/Link.
- Power management.
Changed the DSP GPTIMER (5) to use the 32 kHz clock.
- Power management.

Chris Ring said:

Not likely, but perhaps. The fact that things generally work for a while and then a random MMU fault occurs makes me think (like you) that it's a bug on the DSP side. Somebody following a bad pointer or blowing their stack.

The DSP MMU faults are not the only problem. We also see faults on the Linux side (PREEMPT and stack traces from MSGQ, LDRV). I agree that a problem on the DSP side could cause the MMU faults, but it is unlikely to cause the Linux-side faults.

How much additional stack does DSP/Link require? I have not increased my stack size since integrating with DSP/Link, though it is large and we were not near the limit the last time I checked. I will try increasing it and see if that resolves the issue.
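
For reference, a generic way to measure how close a task gets to its stack limit is to paint the stack with a known pattern and later count how much of the pattern survives. The sketch below illustrates that technique on a plain array. It assumes a stack that grows downward from the high end of the buffer, uses an arbitrary fill value, and does not use any DSP/BIOS API; `stack_paint` and `stack_unused_words` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u  /* arbitrary sentinel chosen for this sketch */

/* Fill the whole stack buffer with the sentinel before the task starts. */
void stack_paint(uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    for (size_t i = 0; i < words; i++)
        stack[i] = STACK_FILL;
}

/* Count untouched words starting from the low end of the buffer.
 * Assumes the stack grows downward, so the low end is used last. */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t unused = 0;

    assert(stack != NULL);
    while (unused < words && stack[unused] == STACK_FILL)
        unused++;
    return unused;
}
```

Comparing the unused count before and after the DSP/Link integration would show how much extra stack the new call path consumes. A count of zero means the stack has been used all the way to its limit and may have overflowed.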

We actually assert on the physical memory range of the pointers being given to MSGQ on both sides. It is a pretty wide range, but the addresses we are faulting on are not within that range, nor within our memory space. So at some level, we know that the message pointers we give to DSP/Link are within the correct range. Previously, we could validate that the addresses we were dealing with were actually allocated / deallocated from the right place, but we can no longer do that.
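
The kind of range assertion described above can be sketched roughly as follows. This is an illustration under assumptions, not our actual code: `pool_range_t`, `msg_in_pool`, and `assert_msg_in_pool` are hypothetical names, and the pool base and size would come from whatever shared-memory region the messages are allocated from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the shared-memory pool that messages come from. */
typedef struct {
    uintptr_t base;  /* first valid byte of the pool */
    size_t    size;  /* pool length in bytes */
} pool_range_t;

/* Returns true only if [ptr, ptr + len) lies entirely inside the pool. */
bool msg_in_pool(const pool_range_t *pool, const void *ptr, size_t len)
{
    uintptr_t addr = (uintptr_t)ptr;

    if (len == 0 || len > pool->size)
        return false;
    if (addr < pool->base)
        return false;
    /* Compare offsets rather than end addresses to avoid overflow. */
    return (addr - pool->base) <= (pool->size - len);
}

/* Call just before handing a message to the IPC layer, and just after
 * receiving one, so a bad pointer is caught at the boundary. */
void assert_msg_in_pool(const pool_range_t *pool, const void *ptr, size_t len)
{
    assert(msg_in_pool(pool, ptr, len));
}
```

A check like this only proves that a pointer falls inside the pool. It cannot tell whether the buffer is currently allocated, which is the validation we lost in the port.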

Chris Ring said:

Sometimes the MMU faults are more interesting than you may think. For the faulting addresses, are there any patterns to them? Comparing them to your memory map, are they consistently in a code section? Data section?

We are definitely translating addresses properly. It would be hard to imagine how it could run for an hour with improperly translated addresses.

We have not yet identified a discernible pattern to the fault addresses. In some cases the fault address is within our DSP memory space, in others it looks more like a register address inaccessible and unknown to the DSP.

We have not tried disabling the cache on the ARM side yet; we'll give this a try and see how it goes. We have not disabled caching on the DSP side and probably cannot, since this is a real-time system and running without the cache puts us at risk of not meeting our timelines.

Chris Ring said:

If you do happen to find a bug, _please_ report it so we can address it. If you have a simple test case, or can reproduce it in one of the DSP Link examples (maybe find one that looks like your use case and throw it in a loop for a few days), please let us know! We want to fix it for everyone.

We are currently digging in. It will likely take a few days, but I will respond here with whatever we find.

Regards,

Tom
