CPU DMA interrupt stops occurring even though DMAs are still running.

CPU: 5535, CCS v4.2, CGTOOLS 4.3.9, BIOS: 5.41.13.42 Platform:5535ezdsp

We are having trouble with the aggregate DMA interrupt.

We have a SW system that runs the codecs at 48 kHz stereo and streams the ADC data to an SD card.

We are using the DMAs for the rx/tx codec transfers and for the transfers of data to/from the SD card (mmc0), so we essentially have three DMA sources that can create a CPU DMA interrupt.

dma1 ch0-3 are used for the codec, dma0 ch0 for SD writes, and dma0 ch1 for SD reads.

The dma_isr( ) is pretty straightforward; here is some pseudocode:

dma_isr()
{
    ifrValue = CSL_SYSCTRL_REGS->DMAIFR;
    CSL_SYSCTRL_REGS->DMAIFR = ifrValue;  // clear/acknowledge all latched events immediately

    if (ifrValue shows an rx/tx codec event has happened)
    {
        service the rx/tx codec interrupt
    }

    if (ifrValue shows an sd card event has happened)
    {
        service the sd card interrupt
    }
}  // done


The system runs fine for the most part, but after half an hour to an hour of continuous running it suddenly stops. By "stop" I mean that we stop getting all DMA interrupts.

When I halt the system with the emulator, this is what I see.

IER0: 0x0130  (dma int (0x0100) is enabled)
IFR0: 0xE010  (no DMA int pending).
DMAIFR = 0x00F0 (shows all 4 codec dma events have happened)
DMAIER = 0x00F0 (shows all 4 codec dma events are enabled).

So somehow the codec DMA events failed to trigger a CPU DMA interrupt in IFR0.

If I continue running under the emulator, I see that the DMAs are still running and all my other threads are still running; it's just that we have stopped getting the CPU DMA interrupt.

I put in observation variables at the end of my dma_isr( ), and I know that on occasion, while I am in the dma_isr for one type of DMA event, a new DMAIFR bit has been set for one of the other events. In this situation, I see that the DMA interrupt bit of IFR0 is also set. Let's call this situation a latentDMA event. For example, I might be servicing an SD event and I'll see at the end of the ISR that a new DMAIFR bit is set for the codec DMA event. I view this as a normal situation, as I expect that upon exit of the dma_isr( ), once the DMA INT bit of IER0 is re-enabled, it will generate another dma_isr to service the new event.

On even rarer occasions, I see at the end of my dma_isr( ) that a new DMAIFR bit has been set as described above, HOWEVER the DMA interrupt bit of IFR0 is NOT set. Let's call this situation a missedDMA event.

When the system breaks, I see that a missedDMA event occurred in the last dma_isr( ).

However, a missedDMA event didn't always break the system. For example, in one run, out of 533535 DMA ISRs, there were 54 missedDMA events. The last one that occurred broke the system. In that missed event, we were servicing the SD DMA event when a new Rx codec DMA event came in.
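
The end-of-ISR check described above can be sketched as a small classifier. This is only an illustration: the function and type names are hypothetical, the DMA bit mask 0x0100 is the one quoted from IER0 above, and masking DMAIFR with DMAIER (so that only enabled events count) is an assumption rather than something stated in the post.

```c
#include <assert.h>
#include <stdint.h>

#define IFR0_DMA_BIT 0x0100u  /* DMA bit in IFR0/IER0, per the 0x0100 mask quoted above */

/* Hypothetical labels for the three end-of-ISR outcomes discussed in the post. */
typedef enum { DMA_EVT_NONE, DMA_EVT_LATENT, DMA_EVT_MISSED } dma_evt_t;

/* Classify register snapshots taken at the end of dma_isr().
 * Assumption: only events enabled in DMAIER are expected to raise the CPU interrupt. */
dma_evt_t classify_dma_exit(uint16_t dmaifr, uint16_t dmaier, uint16_t ifr0)
{
    uint16_t pending = (uint16_t)(dmaifr & dmaier);
    if (pending == 0) {
        return DMA_EVT_NONE;           /* nothing new arrived during the ISR */
    }
    /* A new event is flagged: did it also latch the CPU-level DMA bit? */
    return (ifr0 & IFR0_DMA_BIT) ? DMA_EVT_LATENT : DMA_EVT_MISSED;
}

/* The halted-state values from the post (DMAIFR = DMAIER = 0x00F0, IFR0 = 0xE010). */
void example_halted_state(void)
{
    assert(classify_dma_exit(0x00F0u, 0x00F0u, 0xE010u) == DMA_EVT_MISSED);
}
```

With the register values captured when the system was halted, this classifier reports a missed event, which matches the description of the failure.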

So something in the timing and/or mechanism of acknowledging the DMAIFR, so that the next DMA event will trigger a CPU DMA INT, is broken. It seems I have done all one should do on the SW side of things: acknowledge the events that occurred, service them, return.

Are there any known pipeline issues, silicon issues, SW bugs, or anything else that can explain why the aggregate DMA interrupt architecture can occasionally fail if another DMA interrupt comes in while we are in the dma_isr?

What is the timing from a DMAIFR bit being set to the eventual setting of the CPU DMA interrupt bit in IFR0?  What is the silicon logic that determines when a new DMAIFR event will create a new CPU DMA INT?

Any help will be greatly appreciated.

-Shawn

(PS also posted this on the BIOS forum)

  • Shawn,

    This is the first report of this issue. We are aware that all 16 DMA interrupts in DMAIFR share the same CPU DMA interrupt. Thus the DMA ISR must read the DMAIFR and service all channels that are flagged during a single execution of the DMA ISR. You are seeing a timing relationship here. I will check and post here if new information is found. Please do the same if you have new data.

    Regards.
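
A minimal sketch of the "service every flagged channel in one pass" idea from the reply above, written against a host-side mock rather than the real register. The mock type, the helper names, and the example handler are all hypothetical; the write-1-to-clear behavior mirrors the acknowledge step in the original pseudocode. Re-reading the register until no enabled flag remains is one possible design, not something the reply prescribes.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for DMAIFR (hypothetical); writing a 1 clears that flag. */
typedef struct { volatile uint16_t flags; } mock_w1c_reg;

static uint16_t reg_read(const mock_w1c_reg *r) { return r->flags; }
static void reg_ack(mock_w1c_reg *r, uint16_t mask) { r->flags &= (uint16_t)~mask; }

typedef void (*chan_handler)(unsigned chan);

/* Acknowledge and service flagged channels until none of the enabled ones remain.
 * Returns the number of channel events serviced. */
unsigned dma_isr_drain(mock_w1c_reg *dmaifr, uint16_t enabled, chan_handler h)
{
    unsigned serviced = 0;
    uint16_t pending;
    while ((pending = (uint16_t)(reg_read(dmaifr) & enabled)) != 0) {
        reg_ack(dmaifr, pending);              /* acknowledge before servicing */
        for (unsigned ch = 0; ch < 16; ++ch) {
            if (pending & (1u << ch)) { h(ch); ++serviced; }
        }
    }
    return serviced;
}

/* Example handler: records channels and, once, mimics an event arriving mid-ISR. */
static mock_w1c_reg *g_reg;
static uint16_t g_serviced_mask;

static void example_handler(unsigned ch)
{
    g_serviced_mask |= (uint16_t)(1u << ch);
    if (ch == 0 && g_reg) {
        g_reg->flags |= 0x0010u;               /* channel 4 fires while channel 0 is serviced */
        g_reg = 0;                             /* inject only once */
    }
}
```

In this model, an event that lands while another channel is being serviced is picked up by the next loop iteration instead of depending on a second CPU interrupt.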

  • Steve,

    So do you agree that I'm not doing anything improper in the sw logic of my dma_isr( )? 

    Update: I tweaked the DMA codec architecture so that I am only servicing a codec Rx DMA interrupt and an SD interrupt in my dma_isr. In other words, I reduced the number of DMA interrupt sources from 3 to 2.

    I ran the code in this configuration last night. It ran for 236 minutes (almost 4 hours) without failure, but it still failed in the same manner. In the last DMA interrupt that was serviced, a "missedDMAEvent" was logged. I was servicing an SD interrupt when a codec Rx DMA event occurred.

    Here are the stats from this run:

    Total dma_isrs: 3699214  
    rx dma isrs: 1327132
    sd dma isrs: 2372561
    latent Dma Events: 233
    Missed Dma Events: 410

    So we now have a system with only two different DMA interrupt sources that still exhibits an issue with the timing mechanism of DMAIFR clears and CPU DMA interrupt latching in IFR0. I can't make it any simpler from a DMA use perspective.

    Where can I find detailed information about the logic/timing relating to DMAIFR events and when the dma bit is set in the CPU IFR0? I've already read the info in spruh87c and swpu073e. 

    The next thing I am going to try (even though I shouldn't have to) is to disable other interrupt sources while in my dma_isr( ), so that I am sure there are no context switches while in my dma_isr( ).

    thanks

    Shawn

  • Steve,

    Here is an update. I modified the code so that, within the dma_isr( ), other interrupt sources are disabled and dma_isr( ) is not interruptible. This seems to have fixed the problem. I successfully ran for about 19 hrs with no failures! (Enough data to almost fill a 16G SD card.) I stopped the emulator just to check on some of my metrics.

    Total dma_isrs: 17,620,985
    rx dma isrs: 6,320,876
    sd dma isrs: 11,300,685
    latent Dma Events: 2040
    Missed Dma Events: 0!
     

    So it seems:

    If I have other interrupt sources enabled during the dma_isr( ), we create the rare situation of missing dma interrupts if a new dma event occurs before dma_isr completion.

    If interrupts are essentially disabled during the dma_isr( ), we never miss a dma interrupt if a new dma event occurs before the dma_isr( ) completion.
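
The masking approach described above might look roughly like the following. It is a host-side sketch only: the thread does not show how interrupts were actually disabled, so the mock CPU struct and the helper names are hypothetical, and the 0x0130 value is simply the IER0 reading quoted earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for the CPU interrupt-enable register (hypothetical). */
typedef struct { uint16_t ier0; } mock_cpu;

/* Mask every source except those in `keep`; returns the previous IER0 for restoring. */
uint16_t isr_mask_others(mock_cpu *cpu, uint16_t keep)
{
    uint16_t saved = cpu->ier0;
    cpu->ier0 = (uint16_t)(saved & keep);
    return saved;
}

/* Put IER0 back exactly as it was on entry. */
void isr_restore(mock_cpu *cpu, uint16_t saved)
{
    cpu->ier0 = saved;
}

/* Example: nothing may preempt the DMA ISR body, then the original mask returns. */
void example_mask_restore(void)
{
    mock_cpu cpu = { 0x0130u };
    uint16_t saved = isr_mask_others(&cpu, 0x0000u);
    assert(cpu.ier0 == 0x0000u);
    /* ... service the DMA events here ... */
    isr_restore(&cpu, saved);
    assert(cpu.ier0 == 0x0130u);
}
```

Saving and restoring the whole register, rather than setting individual bits on exit, avoids accidentally enabling a source that was disabled before the ISR ran.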

    Do you know of any requirement/restriction that would dictate that the DMA ISR must NOT be interruptible?

    So at this point there is either:

    (a) Something wrong with BIOS (this is a BIOS project) in the way its HWI dispatcher works that breaks the aggregate DMA interrupt if other interrupts are enabled, a new HWI comes in (interrupting the dma_isr), and then another DMA event occurs before the DMA ISR has completed.

    OR

    (b) Some timing issue within the CPU itself that breaks the aggregate DMA interrupt if interrupts are enabled within the dma_isr, it gets interrupted, and then another dma event occurs before the dma isr has completed.

    Thoughts?

    -S

  • Hello Shawn,

      One suggestion: try again with all the printf calls removed from your code.

     

    Thanks

     Vasanth  

     

  • There are no printf( ) calls in use

    -S

  • Shawn,

    Thanks for the update. Vasantha and I are tracking this. We will create a test case with other interrupt sources enabled during a dma_isr without BIOS to run on our EVM.

    Thanks for your feedback.

    Regards.

    Steve

  • Shawn,

        I understand that the codec is running at 48 kHz, but can you provide more details on the rest of the configuration, especially the codec/DMA configuration in your environment?

     Thanks

     Vasanth

      

  • pop me an e-mail at information@appliedsignalprocessing.com and I can send you a couple of files

    -S

  • Steve et al.

    I did my investigation and posed my questions based on the initial information I had at the time, which was that none of our ISRs were manipulating the IFR.

    But after a couple of different people raised this question again, we dug deeper and found a rogue line of code in one of our ISRs that was manipulating the IFR, which explains all of the observations I made during my investigation.
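
The thread does not show what the rogue line actually did, so the following is only one plausible illustration of how IFR manipulation in another ISR could produce the symptoms above. It assumes IFR0 flags are cleared by writing a 1 to them (as the C55x CPU documentation is understood to describe), and it uses a host-side mock; the struct, helper names, and the choice of 0x0010 as the "other" flag are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define IFR0_DMA   0x0100u  /* DMA flag, per the mask quoted in the thread */
#define IFR0_OTHER 0x0010u  /* some other source's flag (hypothetical choice) */

/* Host-side model of a flag register where writing a 1 clears that flag (assumed). */
typedef struct { uint16_t bits; } mock_ifr;

static uint16_t ifr_read(const mock_ifr *r) { return r->bits; }
static void ifr_write(mock_ifr *r, uint16_t v) { r->bits &= (uint16_t)~v; }

/* Hazard: a read-modify-write writes back 1s for every flag that is currently pending. */
void clear_flag_rmw(mock_ifr *r, uint16_t flag)
{
    ifr_write(r, (uint16_t)(ifr_read(r) | flag));
}

/* Safer: write only the bit being acknowledged. */
void clear_flag_only(mock_ifr *r, uint16_t flag)
{
    ifr_write(r, flag);
}

void example_ifr_hazard(void)
{
    mock_ifr a = { IFR0_DMA | IFR0_OTHER };
    clear_flag_rmw(&a, IFR0_OTHER);
    assert(a.bits == 0);          /* the pending DMA flag was wiped as well */

    mock_ifr b = { IFR0_DMA | IFR0_OTHER };
    clear_flag_only(&b, IFR0_OTHER);
    assert(b.bits == IFR0_DMA);   /* the DMA flag survives */
}
```

Under these assumptions, an ISR that preempts dma_isr( ) and rewrites IFR0 can erase a freshly latched DMA flag while the DMAIFR bits stay set, which would be consistent with both the missed events and the fact that making dma_isr( ) non-interruptible hid the problem.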

    We've modified that ISR and we are re-running the tests with this new code.  I expect all will go well, but will know for sure after about 24hrs.

    Thanks to all who provided suggestions and insightful questions and thanks for the willingness to run your own tests to try and repeat/understand the behavior I was seeing.

    Best regards,

    -Shawn

  • Shawn,

    Good news! Sounds like you have found the spot. We will wait for your confirmation.

    Regards.