C6678 internal exception

I am using a C6678 DSP with XDC 3.24.7.73, SYS/BIOS 6.32.5.54, and compiler c6000_7.3.4. One of the cores had an internal exception. I have hooks into the BIOS error module and the XDC exception module that record the registers from the exception context.

 

DEFAULT 4441: <<<  Core 0 detected core 7 is in abort state with code 0x0

DEFAULT 4442: <<<  Core 7 Abort Msg: ti.sysbios.family.c64p.Exception: line 248: .

DEFAULT 4443: <<<  Core 7 Abort ArgsInfo:  ; .

DEFAULT 4444: <<<  Core 7 Register log: ERF 0x2 NRP 0x0 NTSR 0x1000C IERR 0x0 A31 0x14D000 A30 0xBE4 B31 0xD4 B30 0xBD8 ITSR 0x1000C IRP 0x0 SSR 0x0 AMR 0x0 ILC 0x2 RILC 0x0.

DEFAULT 4445: <<<  Core 7 Memory Exception Regs: UMC_MPFAR 0x184A280 UMC_MPFSR 0x110 PMC_MPFAR 0x0 PMC_MPFSR 0x0 DMC_MPFAR 0x184A280 DMC_MPFSR 0x120

DEFAULT 4446: <<<  Core 7 Register A[0-14]: 0x0 0x0 0xABFF6914 0x0 0xABFF6910 0xFFFFFFFF 0xABFF6914 0x398 0x3F 0x2A 0x0 0xABFF5D3C 0x1 0xF9513FDF 0x88100

DEFAULT 4447: <<<  Core 7 Register A[15-29]: 0x0 0x4 0xABFF656C 0xD400FD78 0x1A8 0x264011C 0x0 0xABFF5FF8 0xABFF5FE8 0xABFF5FD8 0xABFF5FE8 0xE4C 0x830E1C 0x1D 0xE4C

DEFAULT 4448: <<<  Core 7 Register B[0-14]: 0x1 0x1 0x0 0x1 0xABFF6914 0xFFFFFFFF 0x0 0xFFFFFFFF 0x1 0x82DA94 0x1 0xABFF6918 0x0 0x1 0x82DF80

DEFAULT 4449: <<<  Core 7 Register B[15-29]: 0x83EB68 0x264031C 0xD40001C0 0x2640310 0xFFFFFFFF 0x8622 0x0 0x8A 0x7F 0x50 0xE1 0x6B667CD2 0x0 0xBE0 0xABFF691C

 

An internal exception is recorded in the ERF. The memory protection fault registers also captured an address during the exception. This address maps to the internal register space of the 6678.
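As a cross-check on the ERF value in the log, here is a minimal decoding sketch. It assumes the logged ERF is the CPU's exception flag register and that the flags sit at the usual C64x+/C66x positions (SXF bit 0, IXF bit 1, EXF bit 30, NXF bit 31). The macro and function names are made up for illustration; please verify the bit positions against the CPU reference guide.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed exception flag positions; verify against the CPU guide. */
#define EFR_SXF (UINT32_C(1) << 0)   /* software exception */
#define EFR_IXF (UINT32_C(1) << 1)   /* internal exception */
#define EFR_EXF (UINT32_C(1) << 30)  /* external (maskable) exception */
#define EFR_NXF (UINT32_C(1) << 31)  /* NMI exception */

/* Returns nonzero if the internal exception flag is set. */
int efr_has_internal(uint32_t efr)
{
    return (efr & EFR_IXF) != 0;
}

/* Prints the name of every flag that is set in the given value. */
void efr_print(uint32_t efr)
{
    printf("flags:%s%s%s%s\n",
           (efr & EFR_SXF) ? " SXF" : "",
           (efr & EFR_IXF) ? " IXF" : "",
           (efr & EFR_EXF) ? " EXF" : "",
           (efr & EFR_NXF) ? " NXF" : "");
}
```

Under these assumptions, the logged value 0x2 has only the internal exception flag set, which matches the statement above.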

 

From the memory map:

01800000 01BFFFFF 0 01800000 0 01BFFFFF 4M C66x CorePac Registers

  • Any pointers on how to debug the issue further?

    The address 0184 A280 is not defined in the CorePac users guide. The address 0184 A27Ch is the last Level 2 Memory Protection Page Attribute Register 31.
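The address arithmetic behind this observation can be sketched as follows. The base address 0x0184A200 for the first page attribute register is an assumption, derived from the statement above that register 31 sits at 0x0184A27C with 32 registers of 4 bytes each. The function names are hypothetical.

```c
#include <stdint.h>

#define L2MPPA_BASE   UINT32_C(0x0184A200)  /* assumed address of L2MPPA0 */
#define L2MPPA_COUNT  32u                   /* L2MPPA0 .. L2MPPA31 */
#define COREPAC_START UINT32_C(0x01800000)  /* from the data sheet memory map */
#define COREPAC_END   UINT32_C(0x01BFFFFF)

/* Address of L2 page attribute register n (0..31). */
uint32_t l2mppa_addr(uint32_t n)
{
    return L2MPPA_BASE + 4u * n;
}

/* Nonzero if addr falls inside the L2MPPA register block. */
int is_l2mppa(uint32_t addr)
{
    return addr >= L2MPPA_BASE && addr < L2MPPA_BASE + 4u * L2MPPA_COUNT;
}

/* Nonzero if addr falls inside the 4M CorePac register range. */
int in_corepac_regs(uint32_t addr)
{
    return addr >= COREPAC_START && addr <= COREPAC_END;
}
```

With these definitions, the faulting address 0x0184A280 is inside the CorePac register range but exactly one word past L2MPPA31, so it is not one of the page attribute registers.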

  • Hi SThakkar,
    Please refer to the TI-RTOS forum thread below for debugging the issue:
    e2e.ti.com/.../1335028
    We recommend that you start a new thread on the TI-RTOS forum for further support and follow-up. Thank you.
  • Hi Rajasekaran,

    Thank you for the pointer. I looked at the thread in the BIOS forum. I get that the idea is to look at the memory map and the disassembly around the instruction where the exception happened.

    In my case, though, both NRP and IERR are 0. So I cannot even roughly determine what section of code was executing before the exception was caught.

    In addition to the register log, I do have the memory exception registers, which DID catch something. So let's see what we can determine from this data.

    UMC_MPFSR 0x110
    The set bits indicate a supervisor write request, and that the access was a "LOCAL" access.

    UMC_MPFAR 0x184A280
    The address 0x184A280 is not defined in the memory map of the C66x CorePac User's Guide, so it is quite odd that this address was recorded during the exception.
    The memory map in the C6678 data sheet places it in the CorePac register range:
    01800000 01BFFFFF 0 01800000 0 01BFFFFF 4M C66x CorePac Registers

    Similarly,
    DMC_MPFSR 0x120
    The set bits indicate a supervisor read request, and that the access was a "LOCAL" access.

    DMC_MPFAR 0x184A280
    same unknown address.
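The two status values can be decoded with a small helper like the sketch below. The LOCAL (bit 8), SR (bit 5), and SW (bit 4) positions are consistent with the readings of 0x110 and 0x120 given above. The SX/UR/UW/UX positions in bits 3..0 are an assumption to be checked against the CorePac guide, and `mpfsr_decode` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Writes the names of the set MPFSR access bits into out, separated by
 * spaces, and returns the number of characters written. */
size_t mpfsr_decode(uint32_t mpfsr, char *out, size_t out_len)
{
    static const struct { uint32_t mask; const char *name; } bits[] = {
        { UINT32_C(1) << 8, "LOCAL" },  /* access came from the local CPU */
        { UINT32_C(1) << 5, "SR" },     /* supervisor read */
        { UINT32_C(1) << 4, "SW" },     /* supervisor write */
        { UINT32_C(1) << 3, "SX" },     /* assumed: supervisor execute */
        { UINT32_C(1) << 2, "UR" },     /* assumed: user read */
        { UINT32_C(1) << 1, "UW" },     /* assumed: user write */
        { UINT32_C(1) << 0, "UX" },     /* assumed: user execute */
    };
    size_t used = 0;

    if (out_len == 0) {
        return 0;
    }
    out[0] = '\0';
    for (size_t i = 0; i < sizeof bits / sizeof bits[0]; i++) {
        if ((mpfsr & bits[i].mask) == 0) {
            continue;
        }
        int n = snprintf(out + used, out_len - used, "%s%s",
                         used ? " " : "", bits[i].name);
        if (n < 0 || (size_t)n >= out_len - used) {
            break;  /* buffer too small: stop with a truncated string */
        }
        used += (size_t)n;
    }
    return used;
}
```

Decoding 0x110 this way yields LOCAL and SW, and 0x120 yields LOCAL and SR, so the UMC saw a local supervisor write and the DMC a local supervisor read to the same address.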

    The application leaves the memory protection configuration to the default values.
    This exception happens after several hours of execution, not at boot-up, when I would normally expect configuration code to be setting up registers.

    The issue is also reproducible, and consistently points to same values in the exception registers log. So, the addresses are not random or a coincidence.

    Register dump of another crash:
    DEFAULT 1493: <<< Core 0 detected core 2 is in abort state with code 0x0
    DEFAULT 1494: <<< Core 2 Abort Msg: ti.sysbios.family.c64p.Exception: line 248: .
    DEFAULT 1495: <<< Core 2 Abort ArgsInfo: ; .
    DEFAULT 1496: <<< Core 2 Register log: ERF 0x2 NRP 0x0 NTSR 0x1000C IERR 0x0 A31 0x10 A30 0xA1FF6C7C B31 0xB B30 0x0 ITSR 0x1000C IRP 0x0 SSR 0x0 AMR 0x0 ILC 0x2 RILC 0x0.
    DEFAULT 1497: <<< Core 2 Memory Exception Regs: UMC_MPFAR 0x184A280 UMC_MPFSR 0x110 PMC_MPFAR 0x0 PMC_MPFSR 0x0 DMC_MPFAR 0x184A280 DMC_MPFSR 0x120
    DEFAULT 1498: <<< Core 2 Register A[0-14]: 0x0 0x0 0x1 0x0 0x0 0xFFFFFFFF 0x0 0x82DAC4 0xD400FD58 0x20 0x2 0x5 0x0 0x0 0x0
    DEFAULT 1499: <<< Core 2 Register A[15-29]: 0x0 0x3F 0xD4000000 0x0 0x0 0xD67 0x0 0xA1FF5FF8 0xCC 0xA1FF5FD8 0xA1FF5FE8 0xE34 0xC900C90 0xE34 0xE34
    DEFAULT 1500: <<< Core 2 Register B[0-14]: 0x0 0x1 0x0 0x1 0x0 0xFFFFFFFF 0x1 0x2 0x0 0x1 0x0 0x0 0xBFD7FF74 0x0 0x82DF78
    DEFAULT 1501: <<< Core 2 Register B[15-29]: 0x83EB68 0x3F 0xC1 0x264031C 0x1A8 0xA1FF5FC8 0x1F40 0xA1FF5FC8 0xA1FF5FF8 0x0 0xFEC0FEC0 0x0 0x0 0xBC8 0x881
  • Any new information? Can we close the thread?

    Ran
  • The only similar case where an unexplained crash was observed involved code residing in an ECC-protected area of memory. Here is a description of what needs to be done and why. Code to do it is enclosed: /cfs-file/__key/communityserver-discussions-components-files/791/3581.EricEdc_5F00_sample.ced

    1. L1PEDADDR records the address of the last instruction fetched. The customer needs to check whether the 128 bytes after that address contain "NOP"s or instructions.
    • If not, the memory is not initialized and the parity bits are random. This can cause a parity error on a fetch into L1P, BUT it does not come from SER.
    • If the following 128 bytes are initialized ("NOP" or instructions) and SL2 didn't see any parity error, this is proof of SER in L1P.
    2. If SER is confirmed above, the solution is to invalidate the L1P cache.
    3. When an L1P parity error occurs, it triggers an NMI interrupt.
    4. CorePac document, section 11.2.4, L1P Cache Error Recovery Upon Error Detection ==> When there is a parity error for program fetch from the L1P cache, error detection logic sends a direct exception event to the DSP (IERR.IFX event). In turn, the DSP invalidates program code by flushing the content of the L1P Cache. ==> This needs to be done by the user, regardless of whether SYS/BIOS is used. Not sure whether it can be done inside the NMI handler; may check with Bhavin.
    5. A double-bit parity error inside L2 cannot be corrected, so the option is to do a local DSP reset. A single-bit error is corrected by the L2 controller, and no user action is required.
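The L1P invalidate mentioned in the list above could look roughly like the sketch below. This is not the enclosed sample code. It assumes the L1PINV register sits at 0x01845028 and that writing 1 to bit 0 starts a global L1P invalidate that the hardware clears when done; both should be verified against the CorePac guide. The function name and the poll limit are hypothetical, and on a real target the vendor's cache API would normally be used instead of raw register access.

```c
#include <stdint.h>

/* Assumed address of the L1P global invalidate register; verify first. */
#define L1PINV_ADDR UINT32_C(0x01845028)

/* Requests a global L1P invalidate through the given register and polls
 * until bit 0 clears. Returns 0 on completion, -1 if max_polls reads
 * passed without the bit clearing. The register is passed as a pointer
 * so the routine can be exercised off-target with a dummy variable. */
int l1p_invalidate_all(volatile uint32_t *l1pinv, uint32_t max_polls)
{
    *l1pinv = 1u;                      /* request the invalidate */
    while (max_polls-- > 0) {
        if ((*l1pinv & 1u) == 0) {
            return 0;                  /* hardware reports completion */
        }
    }
    return -1;                         /* still busy after max_polls reads */
}
```

On the DSP the call would be along the lines of `l1p_invalidate_all((volatile uint32_t *)L1PINV_ADDR, 100000u)`. Whether this is safe to do from inside the NMI handler is the open question raised above.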

     

    Topic 2: Review the EDC example code intended for Processor SDK

    • L1P doesn't support scrubbing, so the code was revised to do an L1P invalidate instead.
    • The functions look correct and can be used at startup.