SYSBIOS Exceptions and Errors

Hi,

I’m trying to understand how Exceptions and Errors are handled by SYSBIOS.


We are using a C6678 on custom hardware (CCS v5.3 and SYS/BIOS 6.34.2.18).

The code I’m working with has the following in the .cfg file.

Exception.returnHook = "&ExceptionEvent";
Error.raiseHook = "&ErrorEvent";
Error.policy = Error.UNWIND;
Error.maxDepth = 1;
Exception.enablePrint = true;

And the code is as follows:

void ErrorEvent(Error_Block *eb);   // forward declaration, used by ExceptionEvent()

void ExceptionEvent(void)
{
    ErrorEvent(NULL);
}

void ErrorEvent(Error_Block *eb)
{
    // Print Debug Data
}

Basically the same code is executed when either an exception or error is raised.

Here is what I have noticed:
- When an error is raised, “DEBUG_ErrorRaiseEvent()” is called once and SYS/BIOS stops.
- When an exception is raised, “ExceptionEvent()” is called continuously and only a hard reset stops it.
- Sometimes both an exception and an error are raised at the same time.

My questions:
- Is the above behavior expected?
- Where can I find documentation explaining the difference between Errors and Exceptions?
- I got an E_spOutOfBounds error raised by Task_checkStacks() in Task.c, but it is hard to track down which task caused the problem. According to ROV, none of the stacks were blown (not even close). The error also does not always happen (maybe 1 in 5 resets), and it does not happen in the same task each time. Any suggestions on how to find the problem area? I tried following the steps described here:
http://processors.wiki.ti.com/index.php/SYS/BIOS_FAQs#4_Exception_Dump_Decoding_Using_the_CCS_Register_View
Following those directions, “ti_sysbios_family_c64p_Hwi_dispatchC__I” was at the top of the stack.

- As for exceptions, what causes them? What are some examples? Is there a list? Can we enable or disable certain exceptions and not others?

Thanks

  • Errors are software events that any module can generate by calling Error_raise. In short, when you set the error policy to UNWIND, the following is expected to happen when an error is detected and Error_raise is called:
    1. The Error.raiseHook function is invoked.
    2. If eb==NULL, the application terminates. Otherwise the caller of Error_raise gets control back, and eb is initialized with the error details.

    The Exception module operates at a lower, hardware level and connects directly to the CPU exception. When an exception happens, the functions hooked through the module's parameters are invoked, and you should see a message that describes what happened along with the content of all registers. Also, every time an exception happens, Error_raise is invoked too.
    I can't tell why the function ExceptionEvent() is invoked repeatedly, but you may want to first try moving all printing code from ErrorEvent() to ExceptionEvent(), and not calling ErrorEvent() from ExceptionEvent() at all. There could be some interplay between the two that I can't figure out right now.

    E_spOutOfBounds should print the address of the task instance as its first parameter, and that should tell you which task's stack is causing the error. In ROV, you can go to that task and check its stack base address, its stack size, and the address of its context. If task->context is not within the stack's limits, you get E_spOutOfBounds.
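The range check described above can be reproduced by hand with the three values read from ROV. What follows is only a minimal host-side sketch, not the actual Task_checkStacks() code: the function name and parameters are hypothetical, and treating the upper bound as exclusive is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when a task's saved context pointer
 * lies inside its stack, i.e. in [stack_base, stack_base + stack_size).
 * All three arguments are values copied by hand from the ROV Task view.
 */
bool sp_within_stack(uintptr_t stack_base, size_t stack_size, uintptr_t context)
{
    if (context < stack_base) {
        return false;                       /* below the stack */
    }
    /* Compare offsets so that stack_base + stack_size cannot overflow. */
    return (context - stack_base) < stack_size;
}
```

For example, with a stack base of 0x1000 and a size of 0x800, a context of 0x1400 passes the check, while 0x0FFF or 0x1800 would be reported as out of bounds.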

    The exceptions are CPU events, and you can find more about them in the docs for the CPU you are using (www.ti.com/.../sprugh7.pdf). All SYS/BIOS does is to let you run some functions in case an exception occurs. You can disable the Exception module by setting ti.sysbios.family.c64p.Hwi.enableException to 'false'.

    To see Exception's docs, go to the directory 'docs/cdoc' in your SYS/BIOS installation and open index.html. Using the left pane, navigate to ti->sysbios->family->c64p, and click on Exception. For further reading on Error, there is a section on error handling in the SYS/BIOS User's Guide in Chapter "Instrumentation". The guide can be found in the directory 'docs' in your SYS/BIOS installation or here. Bear in mind that the linked guide is for a newer version of SYS/BIOS, while the version that you are using is fairly old now. Another resource is a document on error support in XDCtools and the API for the module Error.

  • Hi,

    Thanks for a response. This is very helpful.
    I'll go through the documentation you've provided and will try your suggestion.
    I'll get back to you later if I have any more questions.

    Thanks
  • Hi,

    I decided to modify the .cfg file and add hooks for the following:

    Exception.internalHook
    Exception.externalHook
    Exception.nmiHook

    I also removed the “Exception.returnHook” that was configured in the original code.

    I wanted to find out exactly which exception is being triggered.

    As a result, I saw that the external exception is triggered multiple times (in around 1 in 5 resets).

    The following data was also displayed in the console window:

    DMC Exception MPFAR=0x600xxxxx MPFSR=0x120
    Supervisor Read violation, Fault ID=0x1

    I was able to find the source code that prints this in “Exception.c”.
    It looks like the exception was triggered when the PCIe space at “0x600xxxxx” was accessed?

    Here are my questions:

    - Does the above error mean that I’m running in “User Mode” and trying to access a restricted area that can be accessed in “Supervisor Mode” only? I’ve been accessing PCIe space all the time without any issues.
    - How can I check if I’m in “User Mode” or “Supervisor Mode”?
    - How can I change from one to the other?
    - When an exception is detected, is the NMI interrupt triggered? Our NMI routine only has a “b nrp” instruction followed by a bunch of “nop” instructions. How is the Exception module executed? What context does it run under?
    - Any other suggestions on how to debug this error?

    It would be easier if this exception happened after every reset, but it happens only occasionally (around 1 in 5 times).

    Thanks

  • Hi,
    Can someone help me out with this issue?

    Thanks
  • I moved this to the device forum. It seems like the software portion has been answered and you need more hardware information now.

    Todd
  • DSP Engineer said:
    - Does the above error mean that I’m running in “User Mode” and trying to access a restricted area that can be accessed in “Supervisor Mode” only? I’ve been accessing PCIe space all the time without any issues.

    No, the exception message "Supervisor Read violation" means you were running in Supervisor mode and attempted a read that was not allowed.

    DSP Engineer said:
    - How can I check if I’m in the “User Mode” or “Supervisor Mode”?

    The TSR (Task State Register) CXM bits (bits 7-6) record the current mode - 0 for Supervisor and 1 for User (2 & 3 are undefined).  But we know you're in Supervisor mode from the printed Exception info.
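As a rough illustration of the CXM decoding described above, here is a small sketch that extracts bits 7-6 from a TSR value. The macro, enum, and function names are hypothetical. The value is passed in as a plain integer so the sketch also builds on a host; how the TSR is read on the target (for example through the compiler's control-register support) is left as an assumption to verify against your toolchain.

```c
#include <stdint.h>

#define TSR_CXM_SHIFT 6u     /* CXM occupies TSR bits 7-6 */
#define TSR_CXM_MASK  0x3u

/* Hypothetical names for the execution modes encoded in CXM. */
typedef enum {
    MODE_SUPERVISOR = 0,     /* CXM == 0 */
    MODE_USER       = 1,     /* CXM == 1 */
    MODE_UNDEFINED  = 2      /* CXM == 2 or 3 */
} ExecMode;

/* Decode the current execution mode from a raw TSR value. */
ExecMode tsr_exec_mode(uint32_t tsr)
{
    uint32_t cxm = (tsr >> TSR_CXM_SHIFT) & TSR_CXM_MASK;

    if (cxm == 0u) {
        return MODE_SUPERVISOR;
    }
    if (cxm == 1u) {
        return MODE_USER;
    }
    return MODE_UNDEFINED;
}
```

The same function could be applied to an ITSR or NTSR value to see which mode was interrupted.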

    DSP Engineer said:
    - How can I change from one to another?

    Interrupts and exceptions transition the CPU to Supervisor mode (this includes a SW exception generated with the SWE assembly instruction). So if you're in User mode when an interrupt or exception happens, the ISR runs in Supervisor mode. The nominal case is to return to the mode that was interrupted, but the ISR could instead choose to return in Supervisor mode by modifying the CXM bits in the ITSR or NTSR register. Those registers hold a copy of the TSR state from when the interrupt (ITSR) or exception (NTSR) occurred.

    DSP Engineer said:
    - When an exception is detected, is the NMI interrupt triggered? Our NMI routine only has a “b nrp” instruction followed by a bunch of “nop” instructions. How is the Exception module executed? What context does it run under?

    Sort of.  NMI is just one of the sources of an exception.  Exceptions are serviced by ISTP vector #1, and these include internal, external, NMI, and SW-generated (SWE asm instruction).  When in the vector #1 service routine, you can look at the EFR (Exception Flag Register) to determine which of those 4 happened.

    For the output you're seeing, this would be an external exception (value 0x40000000 in EFR), which you already know.
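To make the EFR decoding concrete, here is a minimal sketch of how a vector #1 handler might classify the exception source. Only the external-exception value 0x40000000 comes from this thread. The other bit positions (NMI at bit 31, internal at bit 1, software at bit 0) are taken from my reading of the CPU reference guide and should be verified there. The names and the priority order used when several flags are set are assumptions for illustration only.

```c
#include <stdint.h>

/* EFR flag bits; positions other than EXF are assumptions to verify. */
#define EFR_OXF (1u << 0)    /* software exception (SWE instruction) */
#define EFR_IXF (1u << 1)    /* internal exception */
#define EFR_EXF (1u << 30)   /* external exception (0x40000000) */
#define EFR_NXF (1u << 31)   /* NMI */

/* Hypothetical classification of the exception source. */
typedef enum {
    EXC_NONE,
    EXC_SOFTWARE,
    EXC_INTERNAL,
    EXC_EXTERNAL,
    EXC_NMI
} ExcSource;

/* Return the first source flagged in EFR, checked from NMI downwards. */
ExcSource efr_source(uint32_t efr)
{
    if (efr & EFR_NXF) return EXC_NMI;
    if (efr & EFR_EXF) return EXC_EXTERNAL;
    if (efr & EFR_IXF) return EXC_INTERNAL;
    if (efr & EFR_OXF) return EXC_SOFTWARE;
    return EXC_NONE;
}
```

More than one flag can be set at once, so a real handler would likely test each bit rather than stop at the first match.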

    If your vector #1 contains just B NRP (and NOPs) then you don't have the Exception Module enabled.  But since other things you say clearly indicate that you do have it enabled, something else is going on.  Just to clarify, your vector #1 (which would be at address ISTP + 0x20) is just "B NRP" plus NOPs?

    The SYS/BIOS Exception module contains a flag "enableExternalMPC".  If you don't already have that set to true then I would suggest setting it.

    Now, all that being said, you need to figure out why the PCIe access is causing this intermittently.  I see that this thread has been moved to the Keystone forum, so hopefully someone there might have an idea.

    Regards,

    - Rob

  • Hi,

    Thanks for your response.

    You said that “Supervisor Read violation” means that the DSP was running in Supervisor mode and attempted a read that was not allowed. What does “not allowed” mean? Can you give me a few examples?
    Could a read from a memory address that does not exist cause it?

    I did some more digging and found the following:
    - The “Completion Timeout Status” bit is set in the “PCI Express Uncorrectable Error Status Register (PCIE_UNCERR)” when the exception occurs. It is not set when the exception is not triggered.
    - After the exception, reading from the PCIe address space using the Memory Browser returns all 0s after a short delay (a few seconds). All other addresses work fine. It looks like the PCIe core has crashed. During normal operation, a read from the PCIe address space using the Memory Browser works without any issues.

    The exception always occurs shortly (approx. 1 s) after main() is done, when all tasks start running. Once we’re past this critical time we seem to run fine, even though we perform thousands of PCIe reads/writes every second. The crash does not happen every time either (maybe 1 in 5 resets). It looks like the PCIe bus might be getting flooded initially. It should not “crash” though.

    On the other side of the PCIe bus we have an FPGA that we read from and write to. The FPGA also performs DMA transfers to and from the DSP over PCIe.
    Is there a way to tell whether the DSP or the FPGA (DMAs) is causing the PCIe core to crash or stop responding?
    Are there any other registers I can read to narrow down the problem?

    I tried setting a breakpoint in the NMI handler, but it did not trigger when the exception occurred. Maybe the NMI is not set up correctly? Here is the NMI setup in the .cfg file.

    var Int_NMIParams = new ti_sysbios_family_c64p_Hwi.Params();
    Int_NMIParams.instance.name = "NMI_hdlr";
    Int_NMIParams.eventId = -1;
    Program.global.Int_NMI = ti_sysbios_family_c64p_Hwi.create(2, "&Nmi_hdlr", Int_NMIParams);

    To answer your question, the NMI handler contains a call to a function that sets a global variable used for debugging, right before “B NRP”.
    I’m also setting “enableExternalMPC” to true.

    Is the decision about which exception occurred made in the NMI handler (ISTP + 0x20)? Is the debug data (registers etc.) that is printed in the CCS console window also printed in the context of the NMI?

    Thanks for your help

  • DSP Engineer said:
    You said that “Supervisor Read violation” means that the DSP was running in Supervisor mode and attempted a read that was not allowed. What does “not allowed” mean? Can you give me a few examples?
    Could a read from a memory address that does not exist cause it?

    Note the MPFSR for DMC (which controls L1D) that was printed in the CCS console window. It contains the bits that were lacking in an MPPA entry. Your previous post shows this as 0x120. The "Supervisor read violation" stems from that value, as well as the Fault ID of 0x1. I'm unable to figure out how to paste the MPPA register diagram here, but the bit assignments, from 15 -> 0, are:
        AID5 AID4 AID3 AID2 AID1 AID0 AIDX LOCAL rsvd rsvd SR SW SX UR UW UX
    and a value of 0x120 maps to the LOCAL and SR bits being set.  The LOCAL bit means it was the CPU and not DMA.
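The mapping from 0x120 to the LOCAL and SR bits can be checked with a few masks. This is only a sketch using the bit layout listed above (LOCAL at bit 8, SR at bit 5, and so on down to UX at bit 0). The macro, struct, and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Access-type bits, numbered per the 15 -> 0 layout quoted above. */
#define MPFSR_UX    (1u << 0)   /* user execute */
#define MPFSR_UW    (1u << 1)   /* user write */
#define MPFSR_UR    (1u << 2)   /* user read */
#define MPFSR_SX    (1u << 3)   /* supervisor execute */
#define MPFSR_SW    (1u << 4)   /* supervisor write */
#define MPFSR_SR    (1u << 5)   /* supervisor read */
#define MPFSR_LOCAL (1u << 8)   /* access came from the CPU, not DMA */

/* Hypothetical decoded view of an MPFSR value. */
typedef struct {
    bool local;
    bool sr, sw, sx;
    bool ur, uw, ux;
} MpfsrBits;

/* Split a raw MPFSR value into its individual flags. */
MpfsrBits mpfsr_decode(uint32_t mpfsr)
{
    MpfsrBits b;

    b.local = (mpfsr & MPFSR_LOCAL) != 0u;
    b.sr    = (mpfsr & MPFSR_SR) != 0u;
    b.sw    = (mpfsr & MPFSR_SW) != 0u;
    b.sx    = (mpfsr & MPFSR_SX) != 0u;
    b.ur    = (mpfsr & MPFSR_UR) != 0u;
    b.uw    = (mpfsr & MPFSR_UW) != 0u;
    b.ux    = (mpfsr & MPFSR_UX) != 0u;
    return b;
}
```

Decoding 0x120 this way sets only `local` and `sr`, which matches the "Supervisor Read violation" message printed by the Exception module.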

    A read from a memory address that does not exist could well cause this, but I'm not sure since such an access might also be reported by the L2 (UMC) memory controller.

    DSP Engineer said:

    I tried setting a breakpoint in the NMI handler, but it did not trigger when the exception occurred. Maybe the NMI is not set up correctly? Here is the NMI setup in the .cfg file.

    var Int_NMIParams = new ti_sysbios_family_c64p_Hwi.Params();
    Int_NMIParams.instance.name = "NMI_hdlr";
    Int_NMIParams.eventId = -1;
    Program.global.Int_NMI = ti_sysbios_family_c64p_Hwi.create(2, "&Nmi_hdlr", Int_NMIParams);

    You need to use 1 for the vector # in the Hwi.create() call.  But if you're enabling the Exception module then it will plug vector #1 with its handler.

    I will point out again that vector #1 is not just the NMI handler, that is legacy usage.  Vector #1 is the exception handler and can be triggered by the 4 sources I mentioned above - internal exception, external exception (your case), NMI, and SW exception (SWE instruction).  An exception handler will look at EFR to figure out which of the 4 triggered it.

    DSP Engineer said:
    Is the decision about which exception occurred made in the NMI handler (ISTP + 0x20)? Is the debug data (registers etc.) that is printed in the CCS console window also printed in the context of the NMI?

    An NMI handler would not print what you see in the console window. When EFR.NMI is set, that means that the NMI input (pin) to the chip was asserted, nothing more and nothing less.  What that means depends on the board that houses the DSP.

    For other exceptions:

    • internal exception - look in IERR to further decode
    • external exception - look in L1D/L1P/L2 memory controller fault registers (as is done by CCS in your case)
    • NMI - what I said above
    • SW exception - the code issued an SWE assembly instruction, and decoding it depends on the SW setup, but SYS/BIOS never does this.

    I don't have any insight into the PCIe issues that you ask about.  It sure sounds like the PCIe is dead when your problems occur, and the "Supervisor read violation" is probably because of that, as are your CCS memory window values for the PCIe addresses.

    Regards,

    - Rob

  • Hi,

    Thanks for the reply. It's very good information.

    I think PCIe is dead as well. When we reduce traffic over PCIe, the exceptions do not happen as often.
    Now we need to figure out what is crashing the PCIe.

    If anyone has any suggestions on what we can look at, it would be appreciated.

    Thanks