Random NMI Watchdog Reset on Concerto C28 core

We are trying to track down the cause of a random restart issue where the Concerto F28M36 is reset by the C28 NMI Watchdog. The issue happens randomly, sometimes showing up after a couple of hours, while other times it will reset after a couple of days or even longer. We've seen it happen on multiple boards, but have not been able to identify a condition which causes it to happen predictably.

Using a Spectrum Digital XDS560V2 emulator, we have identified that the CRESC and MRESC registers both indicate the system was reset due to a C28 NMI watchdog timeout, but we never get an NMI interrupt. We have an NMI interrupt handler implemented that includes an ESTOP0 instruction. We have verified the interrupt handler works, and that the debugger stops at the ESTOP0 instruction, by using the NMIFLGFRC register to force an NMI interrupt.

Additionally, we have verified that the workaround in the sprz375f silicon errata document, which calls for writing to address 0x4E58, does not fix our issue. After setting this address to 7 before calling InitAnalogSystemClock, we have still seen the reset happen.

Are there any known issues in the Concerto processors that could explain these resets? We have verified multiple times and on multiple boards that the reset was caused by an NMI watchdog timeout, but I don't understand why this would be the case if we never get an NMI interrupt.

  • Hi John,

    On this device, the C28 NMI watchdog will not reset the whole F28M36 device. It only resets the C28 (control) subsystem, but it also generates an NMI to the M3. So in this case, do you have an NMI handler on the M3 as well? If not, the M3 NMI watchdog will reset the full device, and this reset will appear on the XRSn pin.

    Are you seeing XRSn toggle in this case?

    You mentioned that this issue is seen on many boards. Are there some boards where this issue is not seen ever?

    "Are there any known issues in the Concerto processors that could explain these resets?"

    We have not come across an issue like this, where a reset is caused by the NMI watchdog but no NMI is generated. There are many reasons why an NMI could be generated (please see the CNMICFG register details), but in all such cases the NMI handler should be called unless the NMI vector is getting corrupted somehow.

    Regards,

    Vivek Singh

     

  • When this condition occurs, we see CRESC as 0x14, indicating XRESN and NMIWD as the causes of reset. MRESC is 0x4000 0000, also indicating an NMI watchdog from the C28. This is expected because we are not handling the NMI on the M3, instead allowing the watchdog to reset the processor.

    The issue has been seen on several boards, but it is difficult to say whether some boards do not exhibit the symptoms as they occur so infrequently.

    We haven't been able to determine which of the C28 NMIs causes the reset because we have not been able to catch the NMI before the reset. We have read through the reference manual and looked at the CNMIFLG register to see the potential causes. After reading through the list of potential NMI causes, it seems that these flags are set by a hardware issue. Is there something that we could be doing in software to cause an NMI condition to be generated?

    We will continue to investigate, and will look into the possibility of the NMI interrupt vector getting corrupted.
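
The reset-cause values quoted in this post are bit masks, so a first step when reading them is to list which bit positions are set. The sketch below does only that arithmetic. It deliberately does not name the bits, because the bit-to-cause mapping has to be taken from the device reference manual. `set_bit_positions` is a hypothetical helper written for this illustration, not part of any TI library.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the positions of the set bits of `value` into out[] (lowest bit
 * first) and returns how many were found. Hypothetical helper for reading
 * register dumps such as CRESC or MRESC; it assigns no meaning to the bits. */
size_t set_bit_positions(uint32_t value, unsigned out[32])
{
    size_t count = 0;
    for (unsigned bit = 0; bit < 32u; ++bit) {
        if ((value >> bit) & 1u) {
            out[count++] = bit;
        }
    }
    return count;
}
```

For the values in this thread, 0x14 has bits 2 and 4 set, and 0x40000000 has bit 30 set.
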
  • Hi John,

    Can you check the value in the CTOMIPCBOOTSTS register after code jumps to your application?

    Also, after code execution jumps to your application, set a breakpoint at the C28x BOOTROM entry point (0x3ffead) so that if a reset happens, execution stops at the BOOTROM entry before any code gets executed. Also halt the M3 (after running enough code to get the C28x running) and check the CRESC and NMIFLG registers once execution halts at the BOOTROM entry point after the reset.

    Regards,

    Vivek Singh

     

  • Hi Vivek,

    We have set up a test using the breakpoint at 0x3ffead as you specified. It might take a couple of days to reproduce the problem, so we will provide more info then. We are not able to halt the M3 core because the C28 and M3 are communicating with each other, so the C28 would not continue operating with the M3 halted.

    Thanks,
    John
  • Hi John,

    Thanks. If the M3 cannot be halted, then it would be good to have an NMI handler on the M3 and a breakpoint at the entry of that handler (which is reached when the C28 resets). Basically, we want to avoid the reset via XRSn caused by the M3 NMIWD. Is that possible?

    Regards,

    Vivek Singh

  • Hi Vivek,

    Yes, we have an NMI interrupt handler that loops forever. I will place a breakpoint in this handler to stop the M3 core.

    Thanks,
    John
  • Hi Vivek,

    We've been working through this test using the force NMI flag register, and we have a couple questions. First, when we force an NMI, we never stop at the breakpoint you specified, 0x3FFEAD. It looks like the C28 BOOTROM entry point for the processors we are using is located at 0x3FFE9A. Is this the location you intend for us to break at?

    The second question is in regards to the CTOMIPCBOOTSTS register. You've indicated you want us to check the status of the register after jumping to our application. Did you want us to look at this register before or after we get the NMI? The register value is 0x40010000 when the application first starts after a power cycle. We haven't caught the interrupt yet and will provide info on the status of the register after that.

    Thanks,
    John
  • Hi Vivek,

    We have a couple more questions that we'd appreciate clarification for. Testing with the NMI Flag Force register, we have been able to force an NMI interrupt to occur. However, when we allow the NMI WD to reset the C28, the NMI flag is set to 0. Will this behavior be different with an NMI that occurs without the NMI Flag Force?

    The second question is in regard to the CTOMIPCBOOTSTS register. After using the NMI Flag Force register, we have noticed that this register is not set to indicate the NMI flag that reset the device. Will this also behave differently if an actual condition causes the NMI?

    Thanks,
    John
  • Hi Vivek,

    Over the weekend we finally got some results. The NMIFLG register is set to 0x0005, indicating a C28 Ram Uncorrectable error. In this case, we never got into the NMI ISR, and the debugger stopped at the breakpoint at the C28 Boot Rom entry point.

    What can cause the Ram Uncorrectable NMI error to occur, and is there any way to determine more precisely what area of RAM is causing the problem? Is there anything we could be doing in the firmware to cause this kind of error?

    We still have the debugger connected and stopped, so any assistance you can give in further digging in to the issue would be greatly appreciated.

    Thanks,
    John
  • Hi John,

    The RAMs are ECC/parity protected. If there is a 2-bit ECC error or a 1-bit parity error, this NMI flag gets set and an NMI is generated.

    Now there are two issues:

    First, why the RAM uncorrectable error is happening.

    Second, why the NMI is not getting generated in this case.

    For the first issue: in your application, are you using the RAMTEST feature to check that the ECC/parity logic works?

    We have error log registers that capture the address that caused the issue. The error status is logged in the "CUEFLG" register; depending on what caused the error, you can check CCUNCREADDR (CPU) or CDUNCREADDR (DMA) for the address at which the error was generated. Once you know the address, check whether that address is being modified using the RAMTEST feature. If not, this could be a genuine error, and it needs to be handled properly.

    For the second issue: in the case of an uncorrectable error during a fetch, the CPU might execute a wrong instruction, which could raise an ITRAP instead of the NMI, so the NMI handler would not get called. You could map the ITRAP handler to the NMI handler function as well and see if that works.

    Regards,

    Vivek Singh
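
The reason a 1-bit parity error counts as uncorrectable can be shown with a small host-side model. This is a sketch under stated assumptions: it uses a single parity bit over a 16-bit word, which is a generic scheme chosen for illustration and not necessarily how the device organizes its parity. `parity16`, `pw_write` and `pw_read_ok` are hypothetical names.

```c
#include <stdint.h>

/* Returns 1 if `value` has an odd number of set bits, else 0. */
unsigned parity16(uint16_t value)
{
    unsigned p = 0;
    while (value) {
        p ^= 1u;
        value &= (uint16_t)(value - 1u);   /* clear the lowest set bit */
    }
    return p;
}

/* A stored word together with the parity bit computed when it was written.
 * Illustrative model only; the real RAM keeps parity in hardware. */
struct protected_word {
    uint16_t data;
    unsigned parity;
};

struct protected_word pw_write(uint16_t data)
{
    struct protected_word w = { data, parity16(data) };
    return w;
}

/* Returns 1 if the stored parity still matches the data, 0 on a mismatch. */
int pw_read_ok(const struct protected_word *w)
{
    return parity16(w->data) == w->parity;
}
```

A single flipped bit makes `pw_read_ok` fail, but the mismatch does not say which bit flipped, so the word cannot be repaired. A second flipped bit makes the parity match again, so that case goes unnoticed by parity alone.
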

  • Hi Vivek,

    It appears the registers capturing the ram error details were reset after the NMI WD reset the C28 CPU, so we aren't able to see any information regarding the cause of the NMI. We will follow your suggestion and remap the Illegal Trap ISR to the same handler as the NMI handler and hopefully will be able to catch the error that way. We will let you know when we have more information.

    -John
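
The remapping described here amounts to pointing two vector entries at one handler. The sketch below models that on a host machine. `struct vect_table` and its field names are hypothetical stand-ins for the device's vector table, and `fault_isr` only counts calls where target code would record state and halt.

```c
typedef void (*isr_t)(void);

/* Hypothetical two-entry stand-in for the C28 vector table. */
struct vect_table {
    isr_t nmi;       /* non-maskable interrupt vector */
    isr_t illegal;   /* illegal instruction trap (ITRAP) vector */
};

static struct vect_table g_vectors;
static volatile unsigned g_fault_hits;

/* Shared handler. On the target this would save debug state and then
 * execute ESTOP0 so that the debugger halts before the watchdog resets. */
static void fault_isr(void)
{
    g_fault_hits++;
}

/* Routes both the NMI and the ITRAP vector to the same handler.
 * Assumption: on the target, the vector table writes would be wrapped in
 * EALLOW/EDIS because the table is write protected. */
void install_fault_handlers(void)
{
    g_vectors.nmi = fault_isr;
    g_vectors.illegal = fault_isr;
}
```

With both entries installed, either fault path ends up in the same breakpoint location, so one ESTOP0 covers both cases.
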
  • John,

    You are right. Those registers get cleared by a C28x reset. Since we know that there is an error in the RAMs, if your NMI handler function is in RAM, could you move it to Flash instead?

    Regards,

    Vivek Singh 

  • Hi Vivek,

    Our NMI handler is already located in Flash, so no change is needed. We will let you know when we catch the error again.

    Thanks,
    John
  • Hi Vivek,

    We had this issue occur again over the weekend. We had a breakpoint set on both the ITRAP ISR and an ESTOP instruction in the NMI ISR, but we were not able to catch the error occurring before the C28 core had reset. So far we have seen the issue being caused by a C28 Ram Uncorrectable NMI flag on multiple occasions, but have not been able to determine the address causing the error because the C28 has been reset.

    Is there any other way we can determine the address causing the RAM uncorrectable error, or any other information that would be helpful in further debugging this issue?

    Thanks,
    John
  • Hi John,

    This is very strange.

    The other option I can think of to debug this is to set up a timer that generates periodic interrupts with a period much shorter than the NMIWD counter period (so that it generates an interrupt before the NMIWD counter expires). Inside the timer ISR, check the NMI flag and, if it is set, execute an ESTOP so that the CPU halts, which prevents the C28x from getting reset.

    This needs a small change in your code, though. Is this change possible without much impact on the application?

    Regards,

    Vivek Singh
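
The polling idea in this post can be sketched as follows. This is a host-side model, not target code: `sim_nmi_flag` stands in for the C28 NMI flag register, `halt_cpu` stands in for an ESTOP0 instruction, and `poll_period_ok` with its `margin` parameter is a hypothetical sanity check for "much shorter than the NMIWD period". All of these names are assumptions made for the illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Host-side stand-ins (assumptions, not real register definitions). */
static volatile uint16_t sim_nmi_flag;       /* models the NMI flag register */
static volatile uint16_t captured_nmi_flag;  /* last non-zero value observed */
static volatile bool     halt_requested;     /* models reaching ESTOP0 */

static void halt_cpu(void)
{
    /* On the target this would be an ESTOP0 instruction; here we only
     * record that the halt point was reached. */
    halt_requested = true;
}

/* Body of the periodic timer ISR. Returns true if an NMI flag was pending,
 * in which case the flag value is saved and the CPU is halted. */
bool timer_isr_poll_nmi(void)
{
    uint16_t flags = sim_nmi_flag;
    if (flags == 0u) {
        return false;
    }
    captured_nmi_flag = flags;
    halt_cpu();
    return true;
}

/* True if the timer period is at most 1/margin of the watchdog period,
 * both expressed in the same clock cycles. */
bool poll_period_ok(uint32_t timer_cycles, uint32_t nmiwd_cycles,
                    uint32_t margin)
{
    if (timer_cycles == 0u || margin == 0u) {
        return false;
    }
    return timer_cycles <= nmiwd_cycles / margin;
}
```

Because the halt happens before the watchdog expires, the flag and error registers can still be read with the debugger instead of being lost to the reset.
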

  • Hi Vivek,

    We will attempt to add something as you proposed. However, we have seen in the past that code changes can make the problem not occur, so we might not be able to catch the problem after making this change. We will update you with more information in a few days.

    -John
  • John,

    Could you also check the silicon revision on which you are seeing this failure?

    Regards,

    Vivek Singh

  • Hi Vivek,

    We have been testing with two boards. One of the boards has a Revision 0 part and the other board has a production revision B part. Of note, this issue occurs more frequently on the board with a revision 0 part, usually once every 2-3 days. On the revision B board, the issue occurs generally after about 5-6 days.

    Thanks,
    John
  • Hi John,

    Thanks for providing the info. We have the following advisory in the errata document for this device:

    "Advisory C28x Flash: The SBF and BF Instructions Will Not Execute From Flash"

    Though this should only impact Rev 0 devices, I would still suggest checking whether this condition applies to your case (SBF/BF instructions in your code) and, if so, trying the suggested workaround to see if that fixes the issue.

    Regards,

    Vivek Singh

  • John,

    Would you be able to share the NMI/ITRAP handler ISR code for review? Do you disable/enable the PIE at any point? The RAM contents of the C28x should be intact across C28NMIWD resets; for example, if you write any debug information to the C28x RAMs (other than the part that is used by the C-BootROM), it should remain intact for debug.

    Are you able to halt the M3 (probably in a while(1) loop with WDOG disabled on the M3) as soon as it gets an NMI saying that the C28x got reset? Are there any other NMI flags set on the M3 when this happens, other than the one that shows a C28NMIWDRST happened?

    Are you installing any C28 PIE mismatch handler, using the USER_SWREG1 and USER_SWREG2 registers?

    Were you able to try this on any silicon revision later than REVA?

    Best Regards

    Santosh
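
Since RAM contents are said to survive a C28NMIWD reset, one way to use that is a small self-validating record that a handler fills in and the startup code inspects. The following is a minimal sketch under assumptions: `struct crumb`, `CRUMB_MAGIC` and the XOR check are hypothetical, and on the target the record would have to live in a RAM region that neither the boot ROM nor the C startup code overwrites.

```c
#include <stdbool.h>
#include <stdint.h>

#define CRUMB_MAGIC 0x4E4D4921u   /* arbitrary marker, "NMI!" in ASCII */

/* Debug record intended to survive a C28 NMI watchdog reset. */
struct crumb {
    uint32_t magic;       /* CRUMB_MAGIC when the record was written */
    uint32_t nmi_flag;    /* NMI flag value seen by the handler */
    uint32_t fault_addr;  /* address read from the error log registers */
    uint32_t check;       /* XOR of the three fields above */
};

static uint32_t crumb_check(const struct crumb *c)
{
    return c->magic ^ c->nmi_flag ^ c->fault_addr;
}

/* Called from the fault or timer handler before the reset can occur. */
void crumb_store(struct crumb *c, uint16_t nmi_flag, uint32_t fault_addr)
{
    c->magic = CRUMB_MAGIC;
    c->nmi_flag = nmi_flag;
    c->fault_addr = fault_addr;
    c->check = crumb_check(c);
}

/* Called early after reset: true only if a complete record is present. */
bool crumb_valid(const struct crumb *c)
{
    return c->magic == CRUMB_MAGIC && c->check == crumb_check(c);
}
```

The magic value and check word distinguish a real record from whatever the RAM happened to contain at power-up, so the startup code does not report stale or random data.
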

  • All,

    This issue is being discussed offline.

    Regards,

    Vivek Singh

  • After detailed debug, it was found that the issue was related to a "POP RB" instruction getting interrupted, which is not supported by the CPU. This was happening because the EINT instruction was used inside an ISR to enable global interrupts (to allow nested interrupts), but interrupts were not disabled again before exiting the ISR.

    If you enable interrupts inside an ISR (using the EINT instruction), you must disable them (using the DINT instruction) before exiting the ISR to avoid unexpected issues.

    Regards,

    Vivek Singh
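
The rule in this final post can be illustrated with a host-side model of the interrupt mask. This is a sketch, not target code: `g_intm` mirrors the C28x INTM bit, `EINT` and `DINT` are redefined here as simple assignments (on the target they come from the TI device headers), and `isr_context_restore` is a hypothetical stand-in for the compiler-generated ISR epilogue that contains the POP RB.

```c
#include <stdbool.h>

/* Host-side model (assumption): 1 = maskable interrupts disabled,
 * 0 = enabled, mirroring the INTM bit. */
static volatile int g_intm = 0;
static bool g_restore_was_interruptible = false;

#define EINT (g_intm = 0)   /* model of "enable global interrupts" */
#define DINT (g_intm = 1)   /* model of "disable global interrupts" */

/* The CPU masks interrupts automatically when it enters an ISR. */
static void isr_entry(void)
{
    g_intm = 1;
}

/* Models the context restore at the end of an ISR (including POP RB).
 * Records whether another interrupt could have arrived during it. */
static void isr_context_restore(void)
{
    if (g_intm == 0) {
        g_restore_was_interruptible = true;
    }
}

/* Pattern that caused the resets: nesting enabled, never disabled again. */
void nested_isr_unsafe(void)
{
    isr_entry();
    EINT;                    /* allow nested interrupts */
    /* ... ISR work ... */
    isr_context_restore();   /* runs with interrupts still enabled */
}

/* Corrected pattern: DINT before the ISR returns. */
void nested_isr_safe(void)
{
    isr_entry();
    EINT;                    /* allow nested interrupts */
    /* ... ISR work ... */
    DINT;                    /* mask again before the context restore */
    isr_context_restore();
}
```

In the safe variant the restore sequence always runs with interrupts masked, which matches the requirement that POP RB must not be interrupted.
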