TDA4AL-Q1: Linux crashes during a long test

Part Number: TDA4AL-Q1

Hi TI friends,

Linux version: ti-processor-sdk-linux-j721s2-evm-08_06_00_10

RTOS version: ti-processor-sdk-rtos-j721s2-evm-08_06_00_11

Boot mode: SBL

Hardware: our own custom board

After Linux crashes:

1: The Linux serial port stops responding, and the Linux serial log shows nothing abnormal before the hang.

2: The MCU1_0 serial port still outputs log messages.

3: We monitor CPU and memory usage; neither shows anything abnormal before the crash.

wyptestcpust.txt

4: In Linux, I started a thread that writes to eMMC; this thread stopped writing eMMC data after the crash.

5: The problem takes anywhere from a few hours to a few days to occur.

So, is there a good way to troubleshoot crashes like this?

Thanks

  • Hi,

    Can you share the complete Linux crash log?

    - Keerthy

  • Hi,

    It looks like a power outage.

    [2023/9/22 21:30:51] ...i=3009300
    [2023/9/22 21:30:51] [MCU2_0] 100301.468232 s: REMOTE_SERVICE: RX: mpu1_0 (port 1025) -> mcu2_0 (port 21) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:30:51] [MCU1_0] 100301.496317 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:30:51] [MCU1_0] 100301.496415 s: mcu1_0 get fps:1e,3000
    [2023/9/22 21:30:52] [MCU1_0] 100301.496474 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:30:52] [MCU2_0] 100301.495807 s: REMOTE_SERVICE: TX: mcu2_0 (port 21) -> mpu1_0 (port 1025) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:30:52] 100301.523696 s: SOC TEMP = 86.664 ~ 89. 78 !!!
    [2023/9/22 21:30:52] 100301.523729 s: Left Camera TEMP = 54.100,Right Camera TEMP = 44.350 !!!
    [2023/9/22 21:30:52] 100301.529160 s: PCBA TEMP = 75!!!
    [2023/9/22 21:30:52] [MCU1_0] 100301.709599 s: [handle_input_voltage]: gTimeoutFlag_50ms = -1
    [2023/9/22 21:30:52] [MCU1_0] 100301.709700 s: Send voltage modle voltage=11.857374
    [2023/9/22 21:30:53] [MCU2_0] 100302.508911 s: REMOTE_SERVICE: RX: mpu1_0 (port 1025) -> mcu2_0 (port 21) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:30:53] [MCU1_0] 100302.537324 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:30:53] [MCU1_0] 100302.537426 s: mcu1_0 get fps:1e,3000
    [2023/9/22 21:30:53] [MCU1_0] 100302.537487 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:30:53] [MCU2_0] 100302.536809 s: REMOTE_SERVICE: TX: mcu2_0 (port 21) -> mpu1_0 (port 1025) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:30:53] 100302.565161 s: SOC TEMP = 86.664 ~ 89.447 !!!
    [2023/9/22 21:30:53] 100302.565194 s: Left Camera TEMP = 54.100,Right Camera TEMP = 44.350 !!!
    [2023/9/22 21:30:53] 100302.570693 s: PCBA TEMP = 75!!!
    [2023/9/22 21:30:53] ...i=3009350
    [2023/9/22 21:30:53] [MCU1_0] 100302.749705 s: [handle_input_voltage]: gTimeoutFlag_50ms = -1
    [2023/9/22 21:30:53] [MCU1_0] 100302.749819 s: Send voltage modle voltage=11.857374
    [2023/9/22 21:30:54] [MCU2_0] 100303.532024 s: REMOTE_SERVICE: RX: mpu1_0 (port 1025) -> mcu2_0 (port 21) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:30:54] [MCU1_0] 100303.560322 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:30:54] [MCU1_0] 100303.560420 s: mcu1_0 get fps:1e,3000
    [2023/9/22 21:30:54] [MCU1_0] 100303.560479 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:30:54] [MCU2_0] 100303.559808 s: REMOTE_SERVICE: TX: mcu2_0 (port 21) -> mpu1_0 (port 1025) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:30:54] 100303.587862 s: SOC TEMP = 85.728 ~ 89.262 !!!
    [2023/9/22 21:30:54] 100303.587894 s: Left Camera TEMP = 54.100,Right Camera TEMP = 44.350 !!!
    [2023/9/22 21:30:54] 100303.593373 s: PCBA TEMP = 75!!!
    [2023/9/22 21:30:54] [MCU1_0] 100303.791582 s: [handle_input_voltage]: gTimeoutFlag_50ms = -1
    [2023/9/22 21:30:54] [MCU1_0] 100303.791684 s: Send voltage modle voltage=11.830129
    [2023/9/22 21:30:54] ...i=3009400
    [2023/9/22 21:30:54] PERF: TOTAL: 30. 0 FPS
    [2023/9/22 21:30:55] [MCU2_0] 100304.554103 s: REMOTE_SERVICE: RX: mpu1_0 (port 1025) -> mcu2_0 (port 21) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:30:55] [MCU1_0] 100304.582389 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:30:55] [MCU1_0] 100304.582498 s: mcu1_0 get fps:1e,3000
    [2023/9/22 21:30:55] [MCU1_0] 100304.582561 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:30:55] [MCU2_0] 100304.581821 s: REMOTE_SERVICE: TX: mcu2_0 (port 21) -> mpu1_0 (port 1025) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:30:55] 100304.610065 s: SOC TEMP = 86.664 ~ 89.262 !!!
    [2023/9/22 21:30:55] 100304.610124 s: Left Camera TEMP = 54.100,Right Camera TEMP = 44.350 !!!
    [2023/9/22 21:30:55] 100304.615838 s: PCBA TEMP = 74!!!
    [2023/9/22 21:30:55] [MCU1_0] 100304.831553 s: [handle_input_voltage]: gTimeoutFlag_50ms = -1
    [2023/9/22 21:30:55] [MCU1_0] 100304.831653 s: Send voltage modle voltage=11.862823
    [2023/9/22 21:30:56] [MCU2_0] 100305.595742 s: REMOTE_SERVICE: RX: mpu1_0 (port 1025) -> mcu2_0 (port 21) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:30:56] [MCU1_0] 100305.624336 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:30:56] [MCU1_0] 100305.624437 s: mcu1_0 get fps:1e,3000
    [2023/9/22 21:30:56] [MCU1_0] 100305.624494 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:30:56] [MCU2_0] 100305.623806 s: REMOTE_SERVICE: TX: mcu2_0 (port 21) -> mpu1_0 (port 1025) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:30:56] 100305.652033 s: SOC TEMP = 86.664 ~ 88.893 !!!
    [2023/9/22 21:30:56] 100305.652065 s: Left Camera TEMP = 54.100,Right Camera TEMP = 44.350 !!!
    [2023/9/22 21:30:56] 100305.657496 s: PCBA TEMP = 75!!!
    [2023/9/22 21:30:56] [MCU1_0] 100305.872530 s: [handle_input_voltage]: gTimeoutFlag_50ms = -1
    [2023/9/22 21:30:56] [MCU1_0] 100305.872631 s: Send voltage modle voltage=11.873722
    [2023/9/22 21:30:56] ...i=3009450
    [2023/9/22 21:30:57] [MCU2_0] 100306.637306 s: REMOTE_SERVICE: RX: mpu1_0 (port 1025) -> mcu2_0 (port 21) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:30:57] [MCU1_0] 100306.665316 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:30:57] [MCU1_0] 100306.665416 s: mcu1_0 get fps:1e,3000
    [2023/9/22 21:30:57] [MCU1_0] 100306.665478 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:30:57] [MCU2_0] 100306.664807 s: REMOTE_SERVICE: TX: mcu2_0 (port 21) -> mpu1_0 (port 1025) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:30:57] 100306.693146 s: SOC TEMP = 86.477 ~ 89.631 !!!
    [2023/9/22 21:30:57] 100306.693202 s: Left Camera TEMP = 54.100,Right Camera TEMP = 44.350 !!!
    [2023/9/22 21:30:57] 100306.698646 s: PCBA TEMP = 75!!!
    [2023/9/22 21:30:57] [MCU1_0] 100306.912592 s: [handle_input_voltage]: gTimeoutFlag_50ms = -1
    [2023/9/22 21:30:57] [MCU1_0] 100306.912702 s: Send voltage modle voltage=11.857374
    [2023/9/22 21:30:57] ...i=3009500
    [2023/9/22 21:30:58] [MCU2_0] 100307.678374 s: REMOTE_SERVICE: RX: mpu1_0 (port 1025) -> mcu2_0 (port 21) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:30:58] [MCU1_0] 100307.706327 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:30:58] [MCU1_0] 100307.706427 s: mcu1_0 get fps:1e,3000
    [2023/9/22 21:30:58] [MCU1_0] 100307.706487 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:30:58] [MCU2_0] 100307.705811 s: REMOTE_SERVICE: TX: mcu2_0 (port 21) -> mpu1_0 (port 1025) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:30:58] 100307.734334 s: SOC TEMP = 86.664 ~ 88.893 !!!
    [2023/9/22 21:30:58] 100307.734367 s: Left Camera TEMP = 54.100,Right Camera TEMP = 44.350 !!!
    [2023/9/22 21:30:58] 100307.739735 s: PCBA TEMP = 74!!!
    [2023/9/22 21:30:58] [MCU1_0] 100307.952560 s: [handle_input_voltage]: gTimeoutFlag_50ms = -1
    [2023/9/22 21:30:58] [MCU1_0] 100307.952663 s: Send voltage modle voltage=11.830129
    [2023/9/22 21:30:59] [MCU2_0] 100308.719630 s: REMOTE_SERVICE: RX: mpu1_0 (port 1025) -> mcu2_0 (port 21) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:30:59] [MCU1_0] 100308.748389 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:30:59] [MCU1_0] 100308.748505 s: mcu1_0 get fps:1e,3000
    [2023/9/22 21:30:59] [MCU1_0] 100308.748570 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:30:59] [MCU2_0] 100308.747820 s: REMOTE_SERVICE: TX: mcu2_0 (port 21) -> mpu1_0 (port 1025) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:30:59] 100308.775773 s: SOC TEMP = 86.477 ~ 89. 78 !!!
    [2023/9/22 21:30:59] 100308.775852 s: Left Camera TEMP = 54.100,Right Camera TEMP = 44.350 !!!
    [2023/9/22 21:30:59] 100308.781729 s: PCBA TEMP = 75!!!
    [2023/9/22 21:30:59] [MCU1_0] 100308.992575 s: [handle_input_voltage]: gTimeoutFlag_50ms = -1
    [2023/9/22 21:30:59] [MCU1_0] 100308.992675 s: Send voltage modle voltage=11.868273
    [2023/9/22 21:30:59] ...i=3009550
    [2023/9/22 21:31:00] [MCU2_0] 100309.761590 s: REMOTE_SERVICE: RX: mpu1_0 (port 1025) -> mcu2_0 (port 21) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:31:00] [MCU1_0] 100309.789324 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:31:00] [MCU1_0] 100309.789423 s: mcu1_0 get fps:1e,3000
    [2023/9/22 21:31:00] [MCU1_0] 100309.789479 s: canhandler type 0xaaa2c00a
    [2023/9/22 21:31:00] [MCU2_0] 100309.788806 s: REMOTE_SERVICE: TX: mcu2_0 (port 21) -> mpu1_0 (port 1025) cmd = 0x0000000c, prm_size = 384 bytes ... !!!
    [2023/9/22 21:31:00] 100309.817023 s: SOC TEMP = 86.664 ~ 88.708 !!!
    [2023/9/22 21:31:00] 100309.817061 s: Left Camera TEMP = 54.100,Right Camera TEMP = 43.850 !!!
    [2023/9/22 21:31:00] 100309.822644 s: PCBA TEMP = 75!!!
    [2023/9/22 21:31:00] [MCU1_0] 100310.032545 s: [handle_input_voltage]: gTimeoutFlag_50ms = -1
    [2023/9/22 21:31:00] [MCU1_0] 100310.032647 s: Send voltage modle voltage=11.857374
    [2023/9/23 9:44:41] [MCU2_0] 100310.802269 s: REMOTE_SERVICE: RX: mpu1_0 (port 1025) -> mcu2_0 (port 21) cmd = 0x0000000c, prm_size = 384 b
    [2023/9/23 9:44:41] Connection closed.
    
    [END] 2023/9/23 9:44:41
    

    If the power supply is unstable, how can we troubleshoot it from software?

    Thanks
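
One software-side check on a capture like the one above is to look for silences in the host timestamps. The sketch below is a minimal, hypothetical helper: the names `parse_stamp` and `find_gaps` and the 5-second threshold are assumptions, not part of any TI tool. It only reads the `[YYYY/M/D H:MM:SS]` prefix added by the terminal program and reports where logging went quiet; it cannot say why.

```python
import re
from datetime import datetime

# Host-side timestamp prefix as it appears in the capture, e.g. "[2023/9/22 21:30:51]".
STAMP = re.compile(r"\[(\d{4})/(\d{1,2})/(\d{1,2}) (\d{1,2}):(\d{2}):(\d{2})\]")


def parse_stamp(line):
    """Return the host timestamp of a log line, or None if the line has none."""
    m = STAMP.search(line)
    if m is None:
        return None
    return datetime(*(int(g) for g in m.groups()))


def find_gaps(lines, threshold_s=5.0):
    """Yield (previous_time, next_time, gap_seconds) for silences longer than threshold_s."""
    prev = None
    for line in lines:
        now = parse_stamp(line)
        if now is None:
            continue  # lines without a timestamp are ignored
        if prev is not None:
            gap = (now - prev).total_seconds()
            if gap > threshold_s:
                yield prev, now, gap
        prev = now


# Two lines copied from the capture above, on either side of the silence.
sample = [
    "[2023/9/22 21:31:00] [MCU1_0] 100310.032647 s: Send voltage modle voltage=11.857374",
    "[2023/9/23 9:44:41] Connection closed.",
]
gaps = list(find_gaps(sample))
for before, after, seconds in gaps:
    print(f"silence of {seconds:.0f} s between {before} and {after}")
```

Running the same function over the full capture file (for example `find_gaps(open(path))`, with `path` pointing at the saved log) would list every silence longer than the threshold.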

  • Hi,

    Can you capture whether there were any voltage fluctuations on the MPU and CORE rails when this happens?
    If the Linux side has no power, I am not sure how the software could keep running.

    - Keerthy

  • Hi,

    1. The measured voltage is normal when the system is stuck.

    2. How can software catch voltage anomalies? Are there any special registers?

    3. If it is an MPU voltage problem, how can we avoid it?

    Thanks

  • Renf,

    You mentioned a power outage. How did you detect that?

    Best Regards,

    Keerthy 

  • Hi,

     

    3. If it is an MPU voltage problem, how can we avoid it?

    I'm not sure it is the voltage. I am only assuming a voltage problem; my question is, if it is a voltage problem, how should I solve it?

    The kernel log shows no error output, so I only suspect a power problem.

    Thanks

  • Hi,

    Ah, okay. Can you attach a debugger to the A72 and check whether we can still halt it?

    If not, check whether any other core is active by connecting to it with a debugger.

    Best Regards,

    Keerthy 

  • Hi,

    Sorry, our board previously had no JTAG; it has now been added. In the current hung state, JTAG cannot read the register values, and it shows the A72 as free running.

    Thanks.

  • Hi,

    That means A72 is now inaccessible. Any chance you could connect to MCU R5F?

    - Keerthy

  • Hi,

    The C7x can be connected.

    This is the log from mcu1_0; it shows that the other cores are fine, apart from the A72 (mpu1_0).

    Thanks

  • Hi Renf,

    Thanks for the logs. It appears that the other cores are running, so we can rule out voltage/power glitch
    issues, since the C7x and A72 are fed by the same voltage source.

    A few questions:

    • What is the application you are running?
    • Can you see any crash logs on the Linux console?
    • Are the logs below from MCU_UART?

      This is the log from mcu1_0; it shows that the other cores are fine, apart from the A72 (mpu1_0).

    • It will be difficult to comment without any traces. Is this reproducible on TI EVM?

    - Keerthy

  • Hi,

    What is the application you are running?

    1: Our own application.

    Can you see any crash logs on the Linux console?

    2: No, we cannot find any crash logs on the Linux console.

    The logs below are from MCU_UART?

    3: Yes, this is the log from mcu1_0.

    It will be difficult to comment without any traces. Is this reproducible on TI EVM?

    4: We don't have a TI EVM board here.

  • 1:Our own applications

    What is being run on A72 specifically? Is it some vision application?

    - Keerthy

  • Hi,

    What is being run on A72 specifically? Is it some vision application?

    yes

    thanks

  • Hi,

    JTAG log:

    CortexA72_0_0: Error: (Error -6310) PRSC module failed to read a register. (Emulation package 9.13.0.00201)
    CortexA72_0_0: Trouble Halting Target CPU: (Error -2062 - (0:23:2)) Unable to halt device. Reset the device, and retry the operation. If error persists, confirm configuration, power-cycle the board, and/or try more reliable JTAG settings (e.g. lower TCLK). (Emulation package 9.13.0.00201)

    Thanks

  • Renf,

    Is there a way to binary-search what the application is doing, for example by disabling some features
    and checking whether the hang is due to one particular feature? This is really hard to predict, as we cannot connect to the A72
    and have no debug logs either. Since the application is custom, we cannot reproduce it on the EVM.

    - Keerthy
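
The binary search suggested above can be sketched as follows. This is only an illustration under strong assumptions: exactly one feature is responsible, the hang reproduces reliably within the chosen soak time, and the features can be enabled independently. The feature names and the `hangs` callback are hypothetical stand-ins for the real application and the real long-running test.

```python
def find_culprit(features, hangs):
    """Narrow a list of features down to the one whose presence triggers the hang.

    `hangs(enabled)` is a hypothetical callback: it runs the soak test with only
    the `enabled` features switched on and returns True if the A72 hung.
    """
    candidates = list(features)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # If the hang shows up with only the first half enabled, the culprit is in it;
        # otherwise it must be in the remaining features.
        candidates = half if hangs(half) else candidates[len(half):]
    return candidates[0]


# Stand-in for the real soak test: pretend "encoder" is the feature that hangs the A72.
features = ["camera", "isp", "dl_inference", "encoder", "emmc_logger", "can_bridge"]
culprit = find_culprit(features, lambda enabled: "encoder" in enabled)
print(culprit)
```

Because the hang in this thread takes hours or days to appear, each call to `hangs` would be a long test run; six features need at most three such runs.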

  • Hi,

    We boot using SBL. Could the problem be related to this boot method?

    Thanks

  • Hi,

    I didn't understand the question. The boot flow is optimised, which means ATF/OP-TEE boots the Linux kernel directly instead of booting a boot loader on the A72.

    Best Regards,

    Keerthy 

  • Hi,

    Will the optimized and development boot flows lead to any differences in Linux after startup?

    Or does U-Boot configure some common Arm registers that ATF/OP-TEE does not configure, which would explain why Linux prints nothing when it dies?

    Thanks

  • Hi Renf,

    There should be no difference in the kernel image used in the optimised flow or the development flow; it's just that we avoid U-Boot in the middle.

    Best Regards,

    Keerthy 

  • Hi,

    We added U-Boot to the SBL flow (sbl_uboot), and it is still very easy to get the A72 to stop responding. Today we tested SPL boot, checking every 3 minutes for 5 hours, and the A72 never got stuck, so I think this is still an SBL boot bug.

    With SBL boot at high temperature, our board has to adjust the PMIC voltage for the A72 to start normally, so a difference between SPL and SBL is probably the cause of the A72 hang.

    Thanks

  • Hi Renf,

    So to summarize:

    • SPL boot works fine with your application
    • With the SBL boot flow, whether optimized boot or using U-Boot, you still see the A72 hang issue.
      • Our board sbl start-up at high temperature to adjust the pmic voltage to start the A72 normally,
      Is this a custom change?

    - Keerthy

  • Hi,

    Is this a custom change

    The PMIC voltage was adjusted by the FAE.

    Thanks

  • Hi Renf,

    Can you share the PMIC adjustment change? Can you try without that?

    Are you reducing the voltage or increasing the voltage? Also where is the code change done? Is it SBL or SPL?

    Best Regards,

    Keerthy 

  • Hi,

    This is the PMIC adjustment.

    Without raising the PMIC voltage, SBL boot gets stuck during BL31/BL32/Linux memory initialization. SPL does not have this problem.

    Thanks

  • Renf,

    This is a custom board change. Can you confirm whether SPL and SBL are setting the same voltages?

    Best Regards,

    Keerthy 

  • Hi,

    SPL uses the PMIC default voltage.

    Why doesn't the A72 print any abnormal information? Apart from a power cut, what else could cause this?

    Thanks

  • Renf,

    Since the A72 cannot be connected to a debugger, this will be harder. Can the voltage bump done in SBL be increased further?

    Best Regards,

    Keerthy 

  • Hi,

    If I remember correctly, the maximum should not exceed 1 V, so I believe I have already increased it to the maximum, but the effect is still the same.

    Thanks

  • Hi Renf,

    What is the VDD_CORE voltage?

    - Keerthy

  • Hi,

    VDD_CORE is 0.8 V; it has not been adjusted.

    Thanks

  • Hello,

    -0-

    What you describe matches the A72 likely hanging while trying to access a collapsed peripheral address range.  The fastest way to resolve this would be to use ETM to capture the instruction flow leading into the hang, then map it back to what is likely an access to a failed IO range.  ETM is somewhat complex to decode, and it requires a tool that can connect properly and extract the trace from a hung system.   A less complete but very useful alternative is to read (or poll and log) the A72's EDPCSR (the AArch64 register name) or DBGPCSR (the AArch32 register name).  A tool can do this for you, or you can set up a spare master (one that can read high addresses) to read the region where each core's debug space is mapped for system accesses.

    I've made an example video which shows how to do this on some raw code.  Doing it with Linux is also possible, but it can require more setup on your debugger so that it understands multiple address spaces.  If the hang is in kernel space, it's easier, as you are mostly working in the well-known kernel address space (a module can complicate this, but it can also be worked through).

    Either ETM's trace of waypoints or the PC sampling registers are very effective for this type of issue.   The best references are the Arm Architecture Reference Manual and the Linux-aware extension documents for the Lauterbach TRACE32 debugger.

    See video example:

    -1-

    One side note: when doing a voltage experiment like the one asked for above, it's good to physically probe and verify that what you have done has actually taken effect.  A low-level correlation with a signal (or with a register in the debugger) is always needed.  Logical tests at the DTB level should never be assumed correct without a correlation to a low-level signal in this kind of lab hang situation.

    Regards,
    Richard W.
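
The "poll and log" idea can be modelled on a host as below. This is a sketch of the logic only, not firmware: on a real spare core the two reader callbacks would be memory-mapped reads of the A72 debug registers, whose addresses are not given in this thread. `PcLogger`, `read_lo`, and `read_hi` are hypothetical names. Two behaviours are assumptions to verify against the Arm Architecture Reference Manual: that the low word reads as `0xFFFFFFFF` when no sample is available, and that the low word should be read before the high word. The combined value is kept raw because, depending on the architecture version, the upper bits may carry status fields rather than address bits.

```python
from collections import deque

NO_SAMPLE = 0xFFFFFFFF  # assumption: low word reads as all ones when no PC sample is available


class PcLogger:
    """Keep the most recent raw PC samples in a fixed-size ring buffer."""

    def __init__(self, read_lo, read_hi, depth=64):
        self.read_lo = read_lo  # hypothetical callback returning the low 32 bits
        self.read_hi = read_hi  # hypothetical callback returning the high 32 bits
        self.samples = deque(maxlen=depth)  # oldest entries drop out automatically

    def poll(self):
        """Take one sample; return the raw 64-bit value, or None if no sample was available."""
        lo = self.read_lo()  # low word first (assumed to latch the high word)
        if lo == NO_SAMPLE:
            return None
        hi = self.read_hi()
        raw = (hi << 32) | lo
        self.samples.append(raw)
        return raw


# Stand-in readers with arbitrary made-up values (not real addresses).
fake_lo = iter([0x10081234, NO_SAMPLE, 0x10081238, 0x10081238])
fake_hi = iter([0x8, 0x8, 0x8])
logger = PcLogger(lambda: next(fake_lo), lambda: next(fake_hi), depth=2)
results = [logger.poll() for _ in range(4)]
print([hex(s) for s in logger.samples])
```

A run of identical samples at the end of the buffer, as in this made-up data, is the pattern one would look for when a core has stopped making progress.
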
  • Hi,

    1: Can TRACE32 be used with the XDS110?

    2: Can CCS replace TRACE32?

    Thanks

  • Hello Renf,

    The TRACE32 debug system's HW and SW need to be used as a pair. This tool often works well in hang situations, as its accesses can be shaped so that they do not get hung up on stuck subcomponents.

    You should be able to use CCS to get the EDPCSR (PC for both cores) at the hang point if you connect to the hidden DAP target and use the 0x4cxxxxxxxx addresses from the video.  You can also write some code and place it on a spare core to get a similar effect to what TRACE32 did in the snooper video.

    Some mix of coding + debugger can make any route eventually work.  The time needed to get at the details is variable.
    Regards,
    Richard W.
  • Hi,

    Where can we find a detailed description of the "coding + CCS debugger" approach?

    coding + debugger

    Thanks

  • Hello Renf,

    What is implied in my high-level statement is that if you have a spare DSP or a spare R5 core, you could set it up to read the EDPCSR from the high system address and then write the samples into some spare SRAM.  At A72 hang time, you can then attach with the debugger and read the spare SRAM with the A72 PCs logged into it.  This would give a sample history leading into the hang.  You would then need to map that back to the active processes.  If you have some kind of 'in field' crash, creating this kind of logging thread, which samples key program counters and other debug status, is something customers do.  If the crash is in your lab, then hooking up JTAG and getting the values via DAP is typical.  How much manual effort is required to associate an EDPCSR value with a given Linux thread is mostly a function of the debug system in use.  Some debuggers do most of the work, and others require you to figure it out via SW decoding.  One quicker way to get information is what I showed in the video.

    Regards,
    Richard W.
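
For the "map that back" step, when the sampled PCs fall in kernel space, one manual route is a nearest-symbol lookup against the kernel's `System.map`, whose lines have the form `address type name`. The sketch below assumes the samples are full kernel virtual addresses and that the running kernel's addresses match the map (for example, no address randomization). The symbol names, addresses, and samples are made up for illustration, and `load_symbols` and `symbolize` are hypothetical helper names.

```python
import bisect


def load_symbols(map_text):
    """Parse 'address type name' lines (the System.map layout) into sorted lists."""
    table = []
    for line in map_text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            table.append((int(parts[0], 16), parts[2]))
    table.sort()
    addrs = [addr for addr, _ in table]
    names = [name for _, name in table]
    return addrs, names


def symbolize(pc, addrs, names):
    """Return 'symbol+0xoffset' for the nearest symbol at or below pc, or None."""
    i = bisect.bisect_right(addrs, pc) - 1
    if i < 0:
        return None  # pc lies below the first symbol in the map
    return f"{names[i]}+0x{pc - addrs[i]:x}"


# Made-up map entries and samples, for illustration only.
SAMPLE_MAP = """\
ffff800010080000 T example_entry
ffff800010081000 t example_mmio_read
ffff800010082000 T example_idle
"""
addrs, names = load_symbols(SAMPLE_MAP)
samples = [0xFFFF800010081010, 0xFFFF800010081010, 0xFFFF800010082004]
for pc in samples:
    print(hex(pc), symbolize(pc, addrs, names))
```

Samples that land in user space or in a loadable module would need the process maps or the module load address instead, which is the extra work mentioned above.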