TDA4VM-Q1: Custom TDA4VM board resets/reboots when running multiple applications plus a CPU stress test

Part Number: TDA4VM-Q1
Other Parts Discussed in Thread: TDA4VM, DRA829

Hi All,

We are facing an issue on our custom-designed TDA4VM hardware when we run the following applications together:

- run_app_srv.sh

- run_app_tidl_od.sh - with the display disabled in the cfg file

- stress-ng --cpu 2 

With the above tests we see two different behaviors on our custom TDA4VM boards.

Issue1:

When we run the above tests on the TDA4VM generic chip (non-HS version), the display breaks and the board hangs after some time (~1.5 hours).

Issue 2:

When we run the above tests on the TDA4VM HS version chip, the board reboots about 20 seconds after the tests start. In this scenario we see the SOC_SAFETY_ERRORn and MCU_SAFETY_ERRORn signals asserted; they are reported to the PMIC, which causes the board to reboot.

Kindly provide us with pointers to debug this.

Regards,

Chaitanya

  • Chaitanya,

    Let's continue the discussion on this thread and see how we can recreate this or get closer to the problem.

    It would help if you could provide more information on the points captured below.

    1. Can you see any logs from any of the UART instances available before the crash occurs that might indicate a failure signature?
    2. Have you attempted to connect a debugger to any of the cores? If yes, we can try to come up with a list of things to do.
    3. I'm assuming that you have more than one board for the GP and HS variants. Do you see this consistently across multiple boards? I just want to rule out any board-specific marginality here.
    4. Is there any watchdog setup on your system?
    5. PDN sensitivity - has the PDN gone through a review?
    6. Is there any software that invokes TISCI APIs to reset the main domain - has such code been added in your system?
    7. Another idea: maybe we should make a list of all the software entities that are running on the various cores and try to isolate the problem by removing the ones that we don't need for this use case.

    Regards

    Karthik

  • Other ideas/suggestions:

    1. We need the possible reasons for the assertion of MCU_SAFETY_ERRORn and SOC_SAFETY_ERRORn. The TRM doesn't have a description of this.
    2. Another data point: even when MCU_SAFETY_ERRORn and SOC_SAFETY_ERRORn are isolated (not reaching the PMIC), the reset from the PMIC still occurs. Should the debug focus on the PMIC? What is the trigger condition for RESET from the PMIC's perspective?
    3. I2C traffic to the PMIC will be monitored to check whether there is any transaction.

    We will need your help here.

  • Hi Karthik,

    Thanks for your inputs. Here is my feedback on your suggestions:

    1. Can you see any logs from any of the UART instances available before the crash occurs that might indicate a failure signature?
       [Chaitanya]: We have enabled both the SoC and wakeup UARTs; however, during the crash we don't see any logs on either UART that indicate the reset reason.
    2. Have you attempted to connect a debugger to any of the cores? If yes, we can try to come up with a list of things to do.
       [Chaitanya]: We have not connected a debugger during our experiments. We will connect one and update you with the results.
    3. I'm assuming that you have more than one board for the GP and HS variants. Do you see this consistently across multiple boards? I just want to rule out any board-specific marginality here.
       [Chaitanya]: Yes, we verified on multiple sample boards and observed the crash on the HS variant only.
    4. Is there any watchdog setup on your system?
       [Chaitanya]: The watchdog is disabled in the system for the PMIC. For the SoC, I will check with my software team and update you with the results.
    5. PDN sensitivity - has the PDN gone through a review?
       [Chaitanya]: The design was reviewed and verified.
    6. Is there any software that invokes TISCI APIs to reset the main domain - has such code been added in your system?
       [Chaitanya]: I need to check this with my software team and will come back with the results.
    7. Another idea: maybe we should make a list of all the software entities that are running on the various cores and try to isolate the problem by removing the ones that we don't need for this use case.
       [Chaitanya]: We tested the following combinations.

    Combination 1: When we run run_app_srv.sh and run_app_tidl_od.sh (with the display disabled in the cfg file), no crash is observed.

    Combination 2: When we run run_app_srv.sh and stress-ng --cpu 2, no crash is observed.

    Combination 3: When we run run_app_tidl_od.sh (with the display disabled in the cfg file) and stress-ng --cpu 2, no crash is observed.

    When we run all three applications together, the system reboots.

    Regards,

    Chaitanya

  • Hi Karthik,

    Please find the I2C sniffer data attached here. We monitored the I2C signals after boot-up and also during the crash.

    At the same time, we dumped the PMIC I2C registers after boot-up and again after the crash.

    Can you please let me know the further debugging steps?

    configured 0x4D reg to 0x80 in PMIC-A
    
    -----------------------------------------------------------------------
    	
    	After Power ON Before running applications
    
    --------------------------------------------------------------------------
    			PMIC-A
    --------------------------------------------------------------------------
    
         0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f    0123456789abcdef
    00: 00 82 13 02 31 2b 20 2b 30 2f 31 2b 31 1b 3a 37    .???1+ +0/1+1?:7
    10: 37 37 fd fd 41 41 b2 b2 1b 1b 1b 1b 1b 31 31 31    77??AA???????111
    20: 31 00 00 38 38 10 38 1b 1b 1b 1b 21 3f 00 00 00    1..88?8????!?...
    30: 00 20 40 58 c8 29 28 38 78 01 d8 43 19 00 01 cc    . @X?)(8x??C?.??
    40: 0f 5a 96 05 1e 01 55 55 15 00 00 00 00 80 00 ff    ?Z????UU?....?..
    50: ff 3f 11 02 20 00 00 00 00 3f 18 00 00 00 00 00    .??? ....??.....
    60: 00 00 00 00 00 02 01 00 00 00 00 00 00 00 00 00    .....??.........
    70: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00    ...?............
    80: 00 1b 06 00 0f 00 00 00 00 00 00 0b ff ff 00 00    .??.?......?....
    90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
    a0: 00 00 00 80 00 00 08 00 00 00 00 00 00 00 00 00    ...?..?.........
    b0: 00 00 00 00 00 00 00 00 01 01 00 00 00 00 00 00    ........??......
    c0: 00 00 00 f8 80 00 00 00 00 00 00 00 00 58 9d 00    ...??........X?.
    d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
    e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
    f0: 7a 28 ea a9 a4 45 ff b0 95 9f 5a e2 00 00 b6 df    z(???E.???Z?..??
    
    --------------------------------------------------------------------------
    			PMIC-B
    --------------------------------------------------------------------------
    
    
         0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f    0123456789abcdef
    00: 00 82 11 02 31 2b 20 2b 20 22 20 22 31 1b 37 00    .???1+ + " "1?7.
    10: 37 00 00 00 00 00 41 00 1b 1b 00 00 1b 31 31 31    7.....A.??..?111
    20: 31 00 00 f4 f4 38 38 1b 1b 1b 1b 21 3f 00 00 00    1..??88????!?...
    30: 00 00 1c 01 03 20 20 10 00 01 f8 01 19 04 04 06    ..???  ?.???????
    40: 0c 0a a2 06 1e 01 51 55 15 00 00 00 00 00 00 fd    ??????QU?......?
    50: fd 3f 11 02 e0 00 00 a0 09 3f 96 00 00 00 00 01    ?????..????....?
    60: 22 00 00 08 02 00 01 00 00 08 00 00 00 00 00 00    "..??.?..?......
    70: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00    ...?............
    80: 00 18 08 00 0f 00 00 00 00 00 00 0b ff ff 00 00    .??.?......?....
    90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
    a0: 00 00 00 80 00 00 08 00 00 00 00 00 00 00 00 00    ...?..?.........
    b0: 00 00 00 00 00 00 00 00 01 01 00 00 00 00 00 00    ........??......
    c0: 00 00 00 f8 80 00 00 00 00 00 00 00 00 00 1d 00    ...??.........?.
    d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
    e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
    f0: 73 bb 95 45 3b ee aa 79 c8 4b 9f 51 00 00 2d a4    s??E;??y?K?Q..-?
    
    
    -------------------------------------------------------------------------
    
    		After Reboot happened due to applications
    
    --------------------------------------------------------------------------
    			PMIC-A
    --------------------------------------------------------------------------
    
    
         0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f    0123456789abcdef
    00: 00 82 13 02 31 2b 20 2b 30 2f 31 2b 31 1b 3a 37    .???1+ +0/1+1?:7
    10: 37 37 fd fd 41 41 b2 b2 1b 1b 1b 1b 1b 31 31 31    77??AA???????111
    20: 31 00 00 38 38 10 38 1b 1b 1b 1b 21 3f 00 00 00    1..88?8????!?...
    30: 00 20 40 58 c8 29 28 38 78 01 d8 43 19 00 01 cc    . @X?)(8x??C?.??
    40: 0f 5a 96 05 1e 01 55 55 15 00 00 00 00 00 00 ff    ?Z????UU?.......
    50: ff 3f 11 02 20 00 00 00 00 3f 99 02 00 02 00 00    .??? ....???.?..
    60: 00 00 00 00 00 02 01 00 00 04 00 00 00 00 00 00    .....??..?......
    70: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00    ...?............
    80: 00 1b 06 01 0f 00 00 00 00 00 00 0b ff ff 00 00    .????......?....
    90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
    a0: 00 00 00 80 00 00 08 00 00 00 00 00 00 00 00 00    ...?..?.........
    b0: 00 00 00 00 00 00 00 00 01 01 00 00 00 00 00 00    ........??......
    c0: 00 00 00 f8 80 00 00 00 00 00 00 00 00 58 9d 00    ...??........X?.
    d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
    e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
    
    --------------------------------------------------------------------------
    			PMIC-B
    --------------------------------------------------------------------------
    
         0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f    0123456789abcdef
    00: 00 82 11 02 31 2b 20 2b 20 22 20 22 31 1b 37 00    .???1+ + " "1?7.
    10: 37 00 00 00 00 00 41 00 1b 1b 00 00 1b 31 31 31    7.....A.??..?111
    20: 31 00 00 f4 f4 38 38 1b 1b 1b 1b 21 3f 00 00 00    1..??88????!?...
    30: 00 00 1c 01 03 20 20 10 00 01 f8 01 19 04 04 06    ..???  ?.???????
    40: 0c 0a a2 06 1e 01 51 55 15 00 00 00 00 00 00 fd    ??????QU?......?
    50: fd 3f 11 02 e0 00 00 a0 09 3f 96 00 00 00 00 01    ?????..????....?
    60: 22 00 00 08 02 00 01 00 00 08 00 00 00 00 00 00    "..??.?..?......
    70: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00    ...?............
    80: 00 18 08 01 0f 00 00 00 00 00 00 0b ff ff 00 00    .????......?....
    90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
    a0: 00 00 00 80 00 00 08 00 00 00 00 00 00 00 00 00    ...?..?.........
    b0: 00 00 00 00 00 00 00 00 01 01 00 00 00 00 00 00    ........??......
    c0: 00 00 00 f8 80 00 00 00 00 00 00 00 00 00 1d 00    ...??.........?.
    d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
    e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
    f0: 73 bb 95 45 3b ee aa 79 c8 4b 9f 51 00 00 2d a4    s??E;??y?K?Q..-?
    
    
    ---------------------------------------------------------
    	After Power ON
    ---------------------------------------------------------
    write to 0x54 nak
    write to 0x50 nak
    write to 0x51 nak
    write to 0x50 nak
    write to 0x51 nak
    write to 0x48 ack data: 0x0E 
    read to 0x48 ack data: 0x37
    write to 0x48 ack data: 0x05 
    read to 0x48 ack data: 0x2B
    write to 0x48 ack data: 0x0E 0x3A 
    write to 0x50 nak
    write to 0x51 nak
    write to 0x50 nak
    write to 0x51 nak
    write to 0x51 nak
    write to 0x52 nak
    write to 0x54 nak
    write to 0x50 nak
    write to 0x51 nak
    write to 0x50 nak
    write to 0x51 nak
    write to 0x51 nak
    write to 0x52 nak
    write to 0x54 nak
    
    
    ---------------------------------------------------------
    	After reboot happened
    ---------------------------------------------------------
    
    write to 0x54 nak
    write to 0x50 nak
    write to 0x51 nak
    write to 0x50 nak
    write to 0x51 nak
    write to 0x48 ack data: 0x0E 
    read to 0x48 ack data: 0x37
    write to 0x48 ack data: 0x05 
    read to 0x48 ack data: 0x2B
    write to 0x48 ack data: 0x0E 0x3A 
    write to 0x50 nak
    write to 0x51 nak
    write to 0x50 nak
    write to 0x51 nak
    write to 0x51 nak
    write to 0x52 nak
    write to 0x54 nak
    write to 0x50 nak
    write to 0x51 nak
    write to 0x50 nak
    write to 0x51 nak
    write to 0x51 nak
    write to 0x52 nak
    write to 0x54 nak
    write to 0x54 nak
    
    
    
    
    
    ---------------------------------------------------------
    	After Removing SOC_SAFETY_ERROR 
    Got the same wakeup i2c_dump at power-on and at reboot time
    ---------------------------------------------------------
    write to 0x54 nak
    write to 0x50 nak
    write to 0x51 nak
    write to 0x50 nak
    write to 0x51 nak
    write to 0x48 ack data: 0x0E 
    read to 0x48 ack data: 0x37
    write to 0x48 ack data: 0x05 
    read to 0x48 ack data: 0x2B
    write to 0x48 ack data: 0x0E 0x3A 
    write to 0x50 nak
    write to 0x51 nak
    write to 0x50 nak
    write to 0x51 nak
    write to 0x51 nak
    write to 0x52 nak
    write to 0x54 nak
    write to 0x50 nak
    write to 0x51 nak
    write to 0x50 nak
    write to 0x51 nak
    write to 0x51 nak
    write to 0x52 nak
    write to 0x54 nak
    write to 0x54 nak
    
    
    
    
    

    Regards,

    Chaitanya

  • Need to provide the possible reason for the assertion of MCU_SAFETY_ERRORn & SOC_SAFETY_ERRORn. TRM doesn't have the description for this.

    The TRM has a description of these signals.

    Please see:

    The ESM “aggregates” the many error events that can happen (e.g., ECC) and if enabled … can assert interrupts or SAFETY_ERRORn output signals.

    And then there is a mapping table:

    Regards

    Karthik

  • Hello,

    The PMIC dumps indicate issues:

    (1) After power on, PMICB LDO1 and LDO2 UV errors are indicated.  Additionally, the SOC_PWR_ERR_INT is also set (0x69=0x08).

    (2) After reboot, PMICA BUCK3 has a UV error and there is a corresponding MCU_PWR_ERR_INT.

    (3) After reboot, PMICB still has UV errors on LDO1 and LDO2.  

    (4) Both PMICA and PMICB are pre-release NVM revisions (beta).  I will need to do some more investigating to understand what limitations if any are in these NVMs.  
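    A comparison like the one above can also be done mechanically. The following is a minimal Python sketch, not a TI tool: it parses two i2cdump-style listings and prints the registers whose values differ. The helper names (`parse_dump`, `diff_dumps`) are made up for illustration, and the sample rows are the PMIC-A rows 0x40-0x60 and 0x80 from the dumps posted in this thread with the ASCII column dropped.

```python
import re

# One i2cdump row: "40: 0f 5a ... ff", optionally followed by an ASCII column.
ROW = re.compile(r"^\s*([0-9a-f]{2}):((?: [0-9a-f]{2}){16})", re.IGNORECASE)


def parse_dump(text):
    """Return {register_address: value} for every data row in the dump."""
    regs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, separator or blank line
        base = int(match.group(1), 16)
        for offset, byte in enumerate(match.group(2).split()):
            regs[base + offset] = int(byte, 16)
    return regs


def diff_dumps(before, after):
    """Return (address, old, new) for registers present in both dumps that differ."""
    return [(addr, before[addr], after[addr])
            for addr in sorted(before)
            if addr in after and before[addr] != after[addr]]


# Rows copied from the PMIC-A dumps in this thread; the other rows that
# appear in both PMIC-A dumps are identical.
PMIC_A_BEFORE = """
40: 0f 5a 96 05 1e 01 55 55 15 00 00 00 00 80 00 ff
50: ff 3f 11 02 20 00 00 00 00 3f 18 00 00 00 00 00
60: 00 00 00 00 00 02 01 00 00 00 00 00 00 00 00 00
80: 00 1b 06 00 0f 00 00 00 00 00 00 0b ff ff 00 00
"""
PMIC_A_AFTER = """
40: 0f 5a 96 05 1e 01 55 55 15 00 00 00 00 00 00 ff
50: ff 3f 11 02 20 00 00 00 00 3f 99 02 00 02 00 00
60: 00 00 00 00 00 02 01 00 00 04 00 00 00 00 00 00
80: 00 1b 06 01 0f 00 00 00 00 00 00 0b ff ff 00 00
"""

changes = diff_dumps(parse_dump(PMIC_A_BEFORE), parse_dump(PMIC_A_AFTER))
for addr, old, new in changes:
    print(f"0x{addr:02x}: 0x{old:02x} -> 0x{new:02x}")
```

    On this data the sketch reports changes at 0x4d, 0x5a, 0x5b, 0x5d, 0x69 and 0x83. The meaning of each bit still has to be looked up in the PMIC register map; the script only narrows down where to look.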

    Can you confirm the amount of capacitance on LDO1 and LDO2 of PMICB?

    I do not understand the I2C logs.  For the PMICs, each transaction consists of the I2C address, the register address and the data.
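    One possible reading of the sniffer log, sketched in Python below, is that a one-byte write sets the register pointer, the following read returns that register, and a multi-byte write carries the register address followed by the data. This interpretation is an assumption about the sniffer's output format, and the `decode` helper is hypothetical; the sample lines are taken from the log posted above.

```python
import re

LINE = re.compile(
    r"(write|read) to (0x[0-9A-Fa-f]{2}) (ack|nak)"
    r"(?: data:((?: 0x[0-9A-Fa-f]{2})+))?"
)


def decode(log):
    """Turn sniffer lines into (operation, device, register, data) tuples."""
    pointer = {}  # device address -> register selected by the last 1-byte write
    result = []
    for line in log.splitlines():
        match = LINE.search(line)
        if not match or match.group(3) == "nak" or not match.group(4):
            continue  # no device answered, or there is no payload to interpret
        kind = match.group(1)
        device = int(match.group(2), 16)
        data = [int(byte, 16) for byte in match.group(4).split()]
        if kind == "write" and len(data) == 1:
            pointer[device] = data[0]  # register pointer only, no data written
        elif kind == "write":
            result.append(("write", device, data[0], data[1:]))
        else:
            result.append(("read", device, pointer.get(device), data))
    return result


SAMPLE = """
write to 0x50 nak
write to 0x48 ack data: 0x0E
read to 0x48 ack data: 0x37
write to 0x48 ack data: 0x05
read to 0x48 ack data: 0x2B
write to 0x48 ack data: 0x0E 0x3A
"""

decoded = decode(SAMPLE)
for op, device, register, data in decoded:
    reg = "??" if register is None else f"0x{register:02X}"
    values = " ".join(f"0x{v:02X}" for v in data)
    print(f"{op:5} dev 0x{device:02X} reg {reg}: {values}")
```

    Read this way, the only acknowledged traffic is to device 0x48: register 0x0E is read as 0x37, register 0x05 is read as 0x2B, and register 0x0E is then written with 0x3A. If 0x48 is PMIC-A, this agrees with the PMIC-A dump, which shows 0x2b at 0x05 and 0x3a at 0x0e.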

    Regards,

    Chris

  • Hi Chris,

    Thanks for the feedback. Please see my responses to your points below.

    (1) After power on, PMICB LDO1 and LDO2 UV errors are indicated.  Additionally, the SOC_PWR_ERR_INT is also set (0x69=0x08).

    [Chaitanya]: Yes, we see that; however, the board boots without any issues. Could you please confirm that 0x69 = 0x08 is caused by the LDO1 and LDO2 UV detection?

    (2) After reboot, PMICA BUCK3 has a UV error and there is a corresponding MCU_PWR_ERR_INT.

    [Chaitanya]: We observed that the feedback voltage on BUCK3 is 3.1 V. As per the NVM, the PMIC-A BUCK3 PG window is configured as 5% for UV. Is that the cause of the PMIC-A error?

    (3) After reboot, PMICB still has UV errors on LDO1 and LDO2.  

    [Chaitanya]: Yes, the PMIC-B errors do not change after reboot either. However, as I stated, we don't see any boot issues even though the LDO supplies are detected as under-voltage.

    (4) Both PMICA and PMICB are pre-release NVM revisions (beta).  I will need to do some more investigating to understand what limitations if any are in these NVMs.  

    Can you confirm the amount of capacitance on LDO1 and LDO2 of PMICB?

    [Chaitanya]: LDO1: 2.3 uF, LDO2: 16.9 uF.

     

    In addition to the above, I have a set of questions related to the PMIC. Could you please check and get back to us with your answers?

    Q1) When a shutdown or reboot happens due to a moderate or severe error and the PMIC performs the shutdown or orderly shutdown sequence, can we find out the reason for the shutdown by reading the PMIC INT registers once it returns to the active state?

    Q2) When we dump the PMIC registers during normal operation, PMIC-B (TPS65941111) register 0x5A shows 0x96, which means the SOC_PWR_ERR interrupts are generated due to PMIC-B LDO1 and LDO2 under-voltage. However, we don't see the PMIC shutting down the SoC power rails. As per the PMIC user guide, when SOC_PWR_ERR triggers it should shut down the SoC power rail groups in an orderly way. Can we get feedback on this?

    Q3) When we power on the board, we see that startup interrupts are generated by PMIC-A; however, we are not able to clear them over I2C. Could you please guide us on the sequence to clear the startup interrupts?

    Regards,

    Chaitanya

  • Hello,

    [Chaitanya]: Yes, we see that; however, the board boots without any issues. Could you please confirm that 0x69 = 0x08 is caused by the LDO1 and LDO2 UV detection?

    0x60 = 0x22

    https://www.ti.com/lit/ds/symlink/tps6594-q1.pdf#page=258

    0x69 = 0x08

    [Chaitanya]: LDO1: 2.3 uF, LDO2: 16.9 uF.

    This is OK. The maximum is 20 uF.

    Q1) When a shutdown or reboot happens due to a moderate or severe error and the PMIC performs the shutdown or orderly shutdown sequence, can we find out the reason for the shutdown by reading the PMIC INT registers once it returns to the active state?

    Yes.  This is correct.  The PMIC will attempt a recovery 15 times.  If any recovery attempt is successful then the PMIC will properly power the processor and indicate the reason for the recovery.  If the PMIC reaches 15 attempts it will no longer power the processor but can still be accessed via I2C to read out the interrupts.

    Q2) When we dump the PMIC registers during normal operation, PMIC-B (TPS65941111) register 0x5A shows 0x96, which means the SOC_PWR_ERR interrupts are generated due to PMIC-B LDO1 and LDO2 under-voltage. However, we don't see the PMIC shutting down the SoC power rails. As per the PMIC user guide, when SOC_PWR_ERR triggers it should shut down the SoC power rail groups in an orderly way. Can we get feedback on this?

    Your expectation is correct and I suspect the issue is related to the NVM revision but need to confirm and attempt to recreate.  

    Q3) When we power on the board, we see that startup interrupts are generated by PMIC-A; however, we are not able to clear them over I2C. Could you please guide us on the sequence to clear the startup interrupts?

    Before you clear the interrupts you must set the nsleep bits to the active mode.  This should not however prevent you from clearing the interrupts (write '1' to clear).  If you clear the interrupts without setting the nsleep bits then you will go to the retention state.

    https://www.ti.com/lit/ug/slvucf3/slvucf3.pdf#page=42
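    To illustrate the write-'1'-to-clear behaviour mentioned above, here is a small Python model of such a register. It is a generic simulation and not PMIC driver code: the function name and the flag values are hypothetical, and on real hardware a flag can be set again immediately if the underlying fault is still present.

```python
def write_one_to_clear(status, written):
    """Model an 8-bit write-1-to-clear register: each 1 written clears that flag."""
    return status & ~written & 0xFF


pending = 0b0000_0110                                  # two hypothetical flags set
after_zero = write_one_to_clear(pending, 0x00)         # writing 0 clears nothing
after_one = write_one_to_clear(pending, 0b0000_0010)   # clears bit 1 only
after_all = write_one_to_clear(pending, pending)       # write back the value read

print(f"{after_zero:#04x} {after_one:#04x} {after_all:#04x}")
```

    The last case shows the usual idiom: read the interrupt register and write the same value back, so that only the flags that were actually observed are cleared.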

    Regards,

    Chris

  • Hello,

    I am still working to recreate the issue on my side.  I will provide an update tomorrow.

    Regards,
    Chris

  • Hi Chris,

    We have made the following progress on the above issue. We ran two experiments on our side to solve the problem.

    EXPERIMENT1:

    1. We updated the PMIC-A register by issuing the following write after the board is powered on:

       # i2cset -y 2 0x48 0x1A 0x3F

    2. After changing the PMIC BUCK3 PG window register value to 10% instead of the 5% from NVM, we ran all the application tests (TI Deep Learning, TI SRV and stress-ng) and did not observe the board reboot.
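    As a cross-check of the register values in this experiment, the Python sketch below splits the old value (0x1B, read at 0x1a in the PMIC-A dump) and the new value (0x3F) into two 3-bit fields. The field layout (bits 5:3 for the UV threshold, bits 2:0 for the OV threshold) is an assumption that should be verified against the TPS6594-Q1 register map, and the table only lists the two codes implied by this thread (5% from NVM, 10% after the write).

```python
# Assumed layout, to be checked against the datasheet:
#   bits 5:3 = UV threshold code, bits 2:0 = OV threshold code.
# Only the two codes implied by this thread are listed here.
WINDOW_CODES = {0b011: "5%", 0b111: "10%"}


def decode_pg_window(value):
    """Split a PG window register value into its UV and OV threshold codes."""
    uv_code = (value >> 3) & 0x7
    ov_code = value & 0x7
    return {
        "uv_code": uv_code,
        "ov_code": ov_code,
        "uv": WINDOW_CODES.get(uv_code, "not covered in this sketch"),
        "ov": WINDOW_CODES.get(ov_code, "not covered in this sketch"),
    }


for label, value in (("NVM default", 0x1B), ("after i2cset", 0x3F)):
    fields = decode_pg_window(value)
    print(f"{label}: 0x{value:02X} -> UV {fields['uv']}, OV {fields['ov']}")
```

    Under that assumption the write widens both the UV and the OV window, not only the UV side that was tripping.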

    EXPERIMENT2:

    1. We removed the PMIC-A input protection MOSFET Q19 and shorted drain to source as shown below, then powered on the board and ran the applications.

    2. With this hardware modification we have not observed any board reboot.

    We have a few more questions about the PMIC that we would like clarified.

    Q1) As per the PMIC register description, the BUCK3_PG_WINDOW register is described as 5%/-50 mV or 5%/+50 mV for UV and OV detection. Should we consider 5% of the feedback voltage or 50 mV from the feedback voltage?

    Example:

    We have connected 3.3 V to the PMIC-A BUCK3 feedback. If we consider the 5% UV condition, the margin is 3.3 * 0.05 = 0.165 V, so the UV threshold is 3.3 - 0.165 = 3.135 V.

    Alternatively, if we consider only 50 mV from the feedback voltage, the UV threshold would be 3.3 - 0.050 = 3.25 V.

    Could you please clarify the above point?

    Q2) We see that the PMIC-A NVM revision is 02; however, as per the "Optimized TPS65941213-Q1 and TPS65941111-Q1 PMIC User Guide for Jacinto™ 7 J721E, PDN-0C" user guide, the PMIC BUCK3 PG window register threshold is configured as 10%/100 mV. Could you please clarify how these thresholds are defined?

     

    Regards,

    Chaitanya

  • Q1) As per the PMIC register description, the BUCK3_PG_WINDOW register is described as 5%/-50 mV or 5%/+50 mV for UV and OV detection. Should we consider 5% of the feedback voltage or 50 mV from the feedback voltage?

    Please use 5% for 3.3V output voltage.  The selection, % vs mV, is a function of the output voltage.  Refer to section 7.10 of the datasheet, https://www.ti.com/lit/ds/symlink/tps6594-q1.pdf#page=31.
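    As a worked example of the percentage rule, the Python sketch below computes the UV threshold for a 3.3 V rail with the 5% window from NVM and with the 10% window used in Experiment 1, and compares each against the 3.1 V feedback reading reported earlier in this thread. The function is illustrative only; whether the % or the mV limit applies is passed in explicitly rather than derived from the output voltage, since that rule is defined in the datasheet section referenced above.

```python
def uv_threshold(nominal_v, percent=None, millivolts=None):
    """Return the under-voltage threshold for a percentage or a millivolt window."""
    if (percent is None) == (millivolts is None):
        raise ValueError("give exactly one of percent or millivolts")
    if percent is not None:
        return nominal_v * (1 - percent / 100)
    return nominal_v - millivolts / 1000


NOMINAL_V = 3.3    # BUCK3 feedback rail from this thread
MEASURED_V = 3.1   # feedback voltage reported earlier in this thread

for pct in (5, 10):
    threshold = uv_threshold(NOMINAL_V, percent=pct)
    state = "UV fault" if MEASURED_V < threshold else "within window"
    print(f"{pct}% window: threshold {threshold:.3f} V, "
          f"{MEASURED_V} V reading -> {state}")
```

    With these numbers the 3.1 V reading falls below the 5% threshold (3.135 V) but stays above the 10% threshold (2.97 V), which is consistent with the reboot disappearing once the window was widened.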

    Could you please clarify how these thresholds are defined?

    The thresholds are defined based upon the requirements for the DRA829, which is +/-5%.  The output of the load switch, which is monitored by FB_B3, is unique because its input is VCCA, and the VCCA OV/UV window is +/-10%.

    Regards,

    Chris

  • EXPERIMENT2:

    1. We removed the PMIC-A input protection MOSFET Q19 and shorted drain to source as shown below, then powered on the board and ran the applications.

    If you bypass the protection FET, then also be sure to connect VSYS_SENSE to GND and leave OVPGDRV floating.  VSYS_SENSE is used by the charge pump that drives OVPGDRV and is part of the fail-short test.  The test briefly interrupts the charge pump to verify that VCCA drops and that the FET has not failed short.