[FAQ] TDA4VM: Why does my board enter an abort state after running my application for a long time?

Part Number: TDA4VM
Other Parts Discussed in Thread: SYSBIOS

I am running a deep learning application on a custom board for long durations. After 2 to 48 hours, the application hangs and enters an abort state while running code on the C7x. I see the same issue when I run the application on a TI EVM. How can I debug this?

  • We have found two issues that could cause boards that use TDA4VM to enter an abort state. They are:

    1. Issues with the board design, especially with the power distribution network (PDN)
    2. Using an under-tested chip that was not screened with the production-level test program

    Here are the steps to handle similar issues:

    1. Try running the application on different boards, including a TI EVM. If some boards fail the long-duration test case but others pass, then it is most likely an issue with the chip.
    2. If using a custom board, get your schematic reviewed by TI. This process can be done through a local FAE.
    3. Get the die ID of the chip using the following commands, then create a new E2E thread by clicking "Ask a related question" and post the die ID. TI can determine internally which version of the test program was used for the chip, but this process takes 1 to 2 weeks.
      1. devmem2 0x43000020 – for die id 0
      2. devmem2 0x43000024 – for die id 1
      3. devmem2 0x43000028 – for die id 2
      4. devmem2 0x4300002C – for die id 3
    4. While waiting for the schematic review and die ID analysis, set up CCS (Code Composer Studio) by following the documentation: software-dl.ti.com/.../ccs_setup_j721e.html
    5. In the past, we observed the C7x to be the root cause of a system abort. To determine whether your behavior is similar, load the C7x symbols in CCS, set a breakpoint at ti_sysbios_family_c7x_Exception_dispatch__E, and compare the following register values:
      1. IERR (0x0800 indicates a Streaming Engine exception)
      2. IEAR (in our case 0xAB31B308, which could be an address within the function MMALIB_CNN_convolve_row_ixX_ixX_oxX_exec_ci; the address may differ on your system)
      3. SE0_FSR (0x01005 indicates a parity error)
    6. Reduce the C7x speed by running k3conf set clock 15 0 500000000 on the target. This halves C7x speed/performance, but in our past tests we found that halving the C7x speed works around the issue. If this does not work, then it is very likely that the issue is not the same as the one observed in the past.
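
The command-line parts of steps 3 and 6 could be collected into a small script along these lines. This is only a sketch: the addresses, device ID, clock ID and frequency come from the steps above, while the function names and the DEVMEM and K3CONF override variables are made up for illustration so that the functions can be dry-run with stubs. The "dump clock" subcommand is assumed to be present in your k3conf build; if it is not, drop those two lines.

```shell
#!/bin/sh
# Sketch only: wraps the devmem2 and k3conf commands from steps 3 and 6.
# DEVMEM and K3CONF are hypothetical override variables (not part of either
# tool) so that the functions can be dry-run with stubs off-target.
DEVMEM="${DEVMEM:-devmem2}"
K3CONF="${K3CONF:-k3conf}"

# Step 3: print the raw devmem2 output for die ID words 0 to 3.
# The output is not parsed, because its format depends on the devmem2 build.
read_die_id() {
    word=0
    for addr in 0x43000020 0x43000024 0x43000028 0x4300002C; do
        echo "die id $word ($addr):"
        "$DEVMEM" "$addr"
        word=$((word + 1))
    done
}

# Step 6: show the C7x clock, apply the reduced rate, then show it again.
# Device ID 15, clock ID 0 and 500000000 Hz are taken from step 6.
# "dump clock" is an assumption about your k3conf build.
reduce_c7x_clock() {
    "$K3CONF" dump clock 15
    "$K3CONF" set clock 15 0 500000000
    "$K3CONF" dump clock 15
}
```

After sourcing the script on the target as root, something like read_die_id > die_id.txt captures the four words for the E2E post, and reduce_c7x_clock applies the workaround before the long-duration test is restarted.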

    In the end, we fixed the long-duration abort issue by improving the custom board design and replacing the chip on the board with a newer one.

    Regards,

    Takuma