DM6467T ARM hangs after Comm_create during instantiation of DSP codec with DVSDK 3.10 GA

We have a product using two DM6467Ts that are booted over PCI, running Linux and a custom frame capture application that is based on the demonstration applications. Each DM6467T handles a separate video stream, resizing and encoding with the 1080p HD h.264 encoder codec (h264fhdvenc). The codec runs on the DSP with the Codec Engine managing the connection to the Codec Server.

We have a rare problem with one of our ARMs locking up when initializing the codec. Once this starts to happen, it will happen each time we begin encoding frames, even after performing a reset on the PCI bus. But if we power cycle the box, the problem will disappear for a long time - possibly days and dozens of runs - before it arises again. Usually the problem is repeatable at that point, but not always.

We were able to remotely connect to our device at a customer's site when it was in this repeatable error state. When I ran our program with CE_DEBUG=2, here are the last few messages emitted before the ARM hangs:

 

@304,924,477us: [+2 T:0x49181490] ti.sdo.dmai - [Venc1] Creating encoder h264fhdvenc for max 736x480 bitrate 622000 ratectrl 4
@304,924,719us: [+0 T:0x49181490] ti.sdo.ce.video1.VIDENC1 - VIDENC1_create> Enter (engine=0xa9218, name='h264fhdvenc', params=0xd4a30)
@304,924,869us: [+0 T:0x49181490] CV - VISA_create(0xa9218, 'h264fhdvenc', 0xd4a30, 0x520, 'ti.sdo.ce.video1.IVIDENC1')
@304,924,998us: [+0 T:0x49181490] CV - VISA_create2(0xa9218, 'h264fhdvenc', 0xd4a30, 0x30, 0x520, 'ti.sdo.ce.video1.IVIDENC1')
@304,925,193us: [+0 T:0x49181490] CE - Engine_createNode(0xa9218, 'h264fhdvenc', 520, 0xd4a30, 0x30, 0x49180a78)
@304,925,332us: [+0 T:0x49181490] CE - Engine> allocNode Enter(engine=0xa9218, impId='h264fhdvenc')
@304,925,477us: [+0 T:0x49181490] CE - Engine> allocNode(). Calling (Comm_create(gppfromnode_6670_2, 0xd4d50, NULL)

I've dug around a bit in that code but was unable to see any way for the software to fail. We've had this happen on a few units. We have temperature sensors on the board and they are all nominal (none greater than 46 °C).

There are other debugging steps I can (and will) take, such as trying to dig out dmesg messages from DDR2 via PCI on the host. I don't hold out much hope for that. Also, we have a serial port on the ARM UART, so if we can get it to happen in our lab then we might be able to gather more information here.

Are there any known issues with this lower layer Codec Engine code? Is it possible that there is some sort of resource conflict with the DSP side of the comm channel (like DMA channels)? Any other ideas?

 

  • I'm not familiar with any known issues related to Comm_create() hanging the ARM.

    From your description it sounds like some system resource is being exhausted and the code is not handling that case very well.  Perhaps your DSPLink kernel module needs to be rmmod'ed and re-insmod'ed, not that that would be a solution, but an indication of where the problem lies.

    Comm_create() produces entry/exit traces, and we can see other entry/exit traces before the hang, so I would expect to see the entry trace for Comm_create().  However, when the Linux kernel hangs (which doesn't necessarily mean it's dead, perhaps it's just spinning) there might be other user-level prints that were performed but never made it to the display, due to buffering of user-level output or UART difficulties.  This is all to say that your application might have gone further than the point indicated by the trace output, and we just don't know.

    When the hang occurs, can you kill the app and rmmod dsplinkk, then insmod dsplinkk.ko (possibly through a "loadmodules.sh" script) again and try the app?

    Regards,

    - Rob

     

     

  • Rob,

    Thanks for the reply.

    No, there is no way to restart the app nor to reload dsplinkk. We have UART0 available on a header and even that console just stops (though it spews some spaces first).

    Also, I reload dsplinkk from within the application before each run. This is not strictly necessary, but earlier in the development cycle I had been modifying the buffer sizes so frequently that I added system() calls to reinsert cmemk and dsplinkk when the program starts. I've never removed that from the application because it seemed to work well. Plus, that keeps the sizes in one place - the application. I don't have to have dependencies between a boot script and the application. I've found those drivers to be reliable. Do you see a problem with that approach?

    We were able to get a system in this state again and we ran it with CE_DEBUG=3. Here are the last few lines:

     

    @62,267,176us: [+0 T:0x49159490 S:0x491589c4] OC - Comm_create> Enter(queueName='gppfromnode_2081_1', queue=0x8ca68, attrs=0x0)
    @62,267,318us: [+0 T:0x49159490 S:0x491589ac] OM - Memory_alloc> Enter(0x4)
    @62,267,453us: [+0 T:0x49159490 S:0x491589ac] OM - Memory_alloc> return (0x8ca88)
    @62,268,723us: [+0 T:0x49159490 S:0x491589c4] OC - Comm_create> return (0x8ca88)
    @62,268,907us: [+0 T:0x49159490 S:0x491589cc] OC - Comm_put> Enter(queue=0x0, msg=0x49966900)
    @62,269,112us: [+0 T:0x49159490 S:0x491589cc] OC - Comm_put> return (0)
    @62,269,260us: [+0 T:0x49159490 S:0x491589c4] OC - Comm_get> Enter(queue=0x10000, msg=0x49158a74, timeout=-1)
    

    So it got past Comm_create() and stopped at Comm_get().
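
Finding the last call that never returned can be automated when the trace is long. The following is a minimal sketch in Python, not part of the Codec Engine tooling. It assumes the CE_DEBUG=3 line layout seen above (`@<time>us: [... T:<thread> ...] <module> - <func>> Enter(...)` or `<func>> return (...)`); the function name `unreturned_calls` and the regular expression are illustrative assumptions, and lines in any other layout are simply skipped.

```python
import re

# Assumed layout, taken from the trace lines quoted in this thread:
#   @62,269,260us: [+0 T:0x49159490 S:0x491589c4] OC - Comm_get> Enter(queue=...)
TRACE_RE = re.compile(
    r"@(?P<time>[\d,]+)us:\s+"
    r"\[[^\]]*?T:(?P<thread>0x[0-9a-fA-F]+)[^\]]*\]\s+"
    r"(?P<module>\S+)\s+-\s+"
    r"(?P<func>\w+)>\s*(?P<event>Enter|return)\b"
)


def unreturned_calls(lines):
    """Return (thread, function, time_us) for every Enter that has no return."""
    stacks = {}  # thread id -> list of (function, time_us), innermost call last
    for line in lines:
        match = TRACE_RE.search(line)
        if match is None:
            continue  # not an entry/exit trace line in the assumed layout
        thread = match.group("thread")
        func = match.group("func")
        time_us = int(match.group("time").replace(",", ""))
        stack = stacks.setdefault(thread, [])
        if match.group("event") == "Enter":
            stack.append((func, time_us))
        else:
            # Pop the innermost pending call with the same function name.
            for index in range(len(stack) - 1, -1, -1):
                if stack[index][0] == func:
                    del stack[index]
                    break
    pending = []
    for thread, stack in stacks.items():
        for func, time_us in stack:
            pending.append((thread, func, time_us))
    return pending


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < ce_debug_trace.txt
    for thread, func, time_us in unreturned_calls(sys.stdin):
        print(f"thread {thread}: {func} entered at {time_us} us and never returned")
```

Run against the seven lines above, this reports only the `Comm_get` call as pending. As noted earlier in the thread, buffered output may hide later trace lines, so the result is a lower bound on how far the application got.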

    Honestly, I haven't looked at the codec interactions very much since I discovered through dmesg (log_buf over our PCI memory interface - that still works) that the last kernel message is the enabling of McASP0 to read audio samples via EDMA3 from an AIC32. So I was looking into DMA conflicts with the DSP as a possible source of error. See http://e2e.ti.com/support/embedded/f/354/p/140290/510872.aspx. I modified my ARM kernel to mark all the DSP channels as used (I think I got them all - still no reply on that thread). But it made no difference.

    There's something in the ARM/DSP relationship that is failing; I just don't know what it is. I'm still leaning towards DMA (I don't know how the QDMA channels are reserved for the DSP in the ARM kernel, and the DSP seems to need 7 of the 8!).

    Also keep in mind that this could be a hardware failure. It is rare, but when it happens it seems to be repeatable. And when it falls out of that condition, it could go a long time before we can get it back into that state. I'm trying to narrow down (even if it's a hardware problem) by looking at what could cause these symptoms.

    I will try to use the PCI MMR interface to determine what the ARM processor is doing when it dies. As you said, it may be spinning. It's either spinning or halted, huh?

    Thanks for the ideas, Rob.

     

  • From the other thread, "Which DMA (EDMA or QDMA) channels are used by the Codec Engine and Codec Server in DM6467T DVSDK 3.10?", the following were posted that may be of interest:

    Posted by  Replied on Mon, Oct 31 2011 7:45 PM

    If you don't find a resource conflict with ARM McASP driver and DSP side, you may want to dig into the SDMA/MDMA lockup issue in Advisory 3.0.3.

    The problem has to do with an EDMA hardware deadlock situation that arises when the same EDMA TC is used to perform writes to BOTH DSP SDMA (L1, L2, and HDVICP RAM/Buffers through SDMA port) AND slave memories (DDR2, EMIFA, HDVICP0/1 EDMA ports, or ARM TCM).

    Since EDMA is a shared peripheral, this may be caused by parallel accesses from multiple cores. For example, if the McASP driver is using TC0 for issuing EDMA writes to one of the slave memories while simultaneously the codec on the DSP is using TC0 for writes to L2, the deadlock condition can arise.

    The same condition can arise if a single DSP-side codec, for example, uses the same TC for writes to L2 and DDR simultaneously, but most codecs that are aware of the issue have probably addressed this by selecting different TCs for submitting transfers based on the destination memory type. If you are able to run the codecs without issue (until you start the ARM McASP) then chances are this is already taken care of. Otherwise you will need to contact your codec vendor. (I am sorry but I am not sure who that is, and you should go to the source).

    One experiment you can do is to change the ARM or DSP side DMA channel to TC mappings. You can look at the EDMA DMAQNUMi registers (via CCS by displaying the contents of these registers; look up the EDMA documentation for your board for the address and how to parse the register bits). There are 8 32-bit DMAQNUMi registers (i: 0..7) where each register encodes mappings for 8 of the EDMA channels, and each channel is thus mapped to its associated TC queue using the 4-bit encoding. All 64 channels are mapped this way. Observe what these registers are after the Linux side boots, to see what the kernel mapped. (Each driver potentially will set the entry for its own channel, so you may want to pause after the McASP driver is installed and see the assignment.)

    See if this is helpful,

    Murat
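
The register experiment described in the post above can be scripted once the eight DMAQNUMi values have been read out (for example from CCS). The following Python sketch is illustrative only and rests on several assumptions that are not stated in the thread: that the lowest 4-bit field of each register belongs to the lowest-numbered channel of its group of 8, that queue n feeds TCn, and that you supply your own table of which channel writes to which class of memory. The names `decode_dmaqnum`, `shared_queues`, `SDMA` and `SLAVE` are hypothetical. Check the field layout against the EDMA documentation for the device before trusting the result.

```python
SDMA = "dsp_sdma"  # DSP L1/L2 or HDVICP RAM/buffers reached through the SDMA port
SLAVE = "slave"    # DDR2, EMIFA, HDVICP0/1 EDMA ports, or ARM TCM


def decode_dmaqnum(regs):
    """Map EDMA channel number (0..63) to queue number from DMAQNUM0..7.

    Assumption: each register holds eight 4-bit fields, lowest field first.
    """
    if len(regs) != 8:
        raise ValueError("expected the 8 DMAQNUMi register values")
    mapping = {}
    for reg_index, value in enumerate(regs):
        for slot in range(8):
            channel = reg_index * 8 + slot
            mapping[channel] = (value >> (4 * slot)) & 0xF
    return mapping


def shared_queues(channel_to_queue, channel_dest):
    """Return the queues that carry writes to both SDMA and slave memories.

    channel_dest is a user-supplied dict: channel number -> SDMA or SLAVE.
    """
    dests_by_queue = {}
    for channel, dest in channel_dest.items():
        queue = channel_to_queue[channel]
        dests_by_queue.setdefault(queue, set()).add(dest)
    return sorted(
        queue
        for queue, dests in dests_by_queue.items()
        if {SDMA, SLAVE} <= dests
    )


if __name__ == "__main__":
    # Made-up register values and destinations, purely for illustration.
    regs = [0x00000010, 0, 0, 0, 0, 0, 0, 0]
    dests = {0: SLAVE, 1: SDMA, 2: SDMA}
    print(shared_queues(decode_dmaqnum(regs), dests))
```

Any queue this reports is a candidate for the condition described in the advisory, not proof of it, because the destination table is only as good as your knowledge of what each driver and codec actually transfers.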

    ----

    Posted by  replied on Tue, Nov 1 2011 9:10 AM

     

    Murat,

    I read through the erratum more carefully yesterday and concluded that since I do not know what the codec is doing, it is indeed possible that this is an issue.

    It looks like the deadlock can happen for memory transactions initiated by a source other than just EDMA TCs. All that matters is that the source generate a write command through the DSP's SDMA followed immediately by a memory write transfer to one of the other memory devices. Then under certain conditions deadlock ensues. On our board the sources can be the ARM data cache, an EDMA TC, PCI bus or VLYNQ(?). We use none of the others that are mentioned in the advisory.

    One of the areas where I'm confused is the DSP SDMA. In the advisory it seems like this is a separate device, but in my reading of the interconnect diagram in spraaw4.pdf (DM6467 Architecture and Throughput), it appears that the DSP SDMA is really part of the caching infrastructure for the entire chip (both the ARM and DSP). Does that sound right to you?

    I will take a look at how the TC's are configured and see. I know that our audio driver uses TI's EDMA services in the ARM Linux kernel to copy from McASP0 to DDR2. The reason I do not suspect our audio driver (as much!) is because we have two video input processors (what we call "channels") that are identical except that only one of them handles audio. Yet both channels appear to suffer from this problem (one at a time - that is, one channel's DM6467 will hang and the other's will not).

    The codec is provided by TI. It is h264fhdvenc 01.10.02.03 (in DVSDK 3_10_00_19 GA). TI is generally very good about documenting their software - especially about the version numbers and what the test platforms were. But I saw nothing in the User's Guide (SPRUGN8D.pdf) about the advisory, nor did I see anything about whom to contact for support issues such as this. To me, the customer, it seems like TI is the point of contact for this codec.

     

    (Q8) So could you please point me to someone who knows the source of the codec? Someone at TI must know.

    If you don't mind, I will copy your message and my reply to the other thread (http://e2e.ti.com/support/embedded/f/354/t/140023.aspx "ARM hangs after Comm_create during instantiation of DSP codec"). We're really discussing that problem rather than "which DMA channels are used by ARM/DSP".

    Thanks for all your help, Murat.

    Doug

    ----

     

     

  • I just noticed another thread on this subject:

    http://e2e.ti.com/support/embedded/f/354/p/102526/523318.aspx  (DM6467CZUT7 dm6467_h264fhdvenc_01_10_02_03_production.bin encode 720p system halted)

    There's just as little action there as here.

     

  • We are also having the same problem. The DM6467T hangs at the function call to "VIDENC1_create" for h264fhdvenc. It happens rarely (1 out of 300 DM6467T boots).

    Is there a solution for this?

    Regards

    Sharath

  • Sharath,

    Sadly, no. For all the details, see http://e2e.ti.com/support/embedded/linux/f/354/t/140290.aspx (Which DMA (EDMA or QDMA) channels are used by the Codec Engine and Codec Server in DM6467T DVSDK 3.10?). It's clearly a problem with the start-up code for the Codec Engine - they're using DMA channels in a manner that can (and does, on rare occasion) lead to processor lock-up as indicated in Advisory 3.0.3.

    We've been back and forth with TI about this. We're too small for TI to expend any more effort resolving the issue. They spent a lot of time bickering about it, but very little time looking directly at my observations and trying to see what could be going wrong within their own code. They constantly pushed me to do the work of tracking down the problem. Each of their software departments claimed that they are careful not to violate the constraints listed in the erratum (perhaps they are, in isolation), but my observations clearly show improper allocation of DMA channel/queue combinations for the transactions being performed. TI's engineers have moved on and even a third party company solution looks unlikely at this point.

    In the last bit of correspondence we had with TI on this matter, we pressed them to know if other customers were experiencing this problem. They claimed that no one was (even though there are others on the forum). But they did reveal that the customers that use the 1080p encoder were not using ARM Linux. Apparently they're running stand-alone programs rather than the full DVSDK promoted by TI. Their final recommendation to us was to have us hire a third party company to help us write such an application. This is not a good option for us. It would take a man-year or more to recreate the services we are using with the ARM Linux kernel.

    It is possible that this summer I will dig into TI's source code to find the culprit and fix it. I had already found an incorrect set of DMA channel reservations - different on the ARM and DSP (documented in the other thread). We have thousands of DM6467s in the field that we cannot use with the 1080p codec, and we'll have thousands more before our next design is available. Luckily there are more options now for encoding 1080p in real-time.

    If I do find a solution later this summer, I'll post it here and in the other thread.

    Doug

  • Upon rereading my message, I see that it is more negative on TI than I wanted. Sorry about that. TI did spend time with us trying to determine the cause of the problem and they did look at my observations. But they did not seem to think their code could be at fault (at least not the version we are using). I think TI has no one with the knowledge of how all the components fit together in the full DM6467 ARM Linux-based solution for DVSDK 3.10. Somewhere in that vast infrastructure the problem lurks.

    Doug