TMS320F28032: Periodic reset while the program is running; the reset period is about 24 ms

Part Number: TMS320F28032

Hi everyone,

     Board: F28032

     The program runs from Flash and, after running for a while, starts resetting periodically. Can you suggest how to approach this?

     When single-stepping with the emulator connected, I found that the program in Flash enters an illegal interrupt.

     Normal operation is shown in Fig. 1, and the actual behavior of the code is shown in Fig. 2.

Fig.1

   The instruction following 3F39FB has changed from normal code to an illegal interrupt instruction.

Fig.2

  INT5 is not enabled in the program, but the illegal interrupt code invokes INT5, which leads to unpredictable behavior and risk.

  Experts, have you encountered this situation before? Is there a way to locate the cause from the software side?

  • While running program, there is a periodic reset signal, the reset period is about 24ms

    Do you have the watchdog or missing-clock detection enabled? It would be good to check the flags in WDCR, PLLSTS, and NMIFLG to see if any of the fault detection circuits asserted a reset.
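
    As a rough illustration of that flag check, the sketch below decodes raw snapshots of the three registers. The bit positions (WDFLAG at bit 7 of WDCR, MCLKSTS at bit 3 of PLLSTS, CLOCKFAIL at bit 1 of NMIFLG) are assumptions taken from my reading of the F2803x system control documentation and should be verified against the reference manual. The type and function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed bit positions; verify against the F2803x reference manual. */
#define WDCR_WDFLAG      (1u << 7)  /* set when the watchdog caused a reset */
#define PLLSTS_MCLKSTS   (1u << 3)  /* set when a missing clock was detected */
#define NMIFLG_CLOCKFAIL (1u << 1)  /* set when the clock-fail NMI condition fired */

/* Hypothetical container for the decoded result. */
typedef struct {
    bool watchdog_reset;
    bool missing_clock;
    bool nmi_clock_fail;
} reset_cause_t;

/* Decode register snapshots taken as early as possible after reset,
   before any initialization code clears the flags. */
static reset_cause_t decode_reset_cause(uint16_t wdcr, uint16_t pllsts, uint16_t nmiflg)
{
    reset_cause_t cause;
    cause.watchdog_reset = (wdcr   & WDCR_WDFLAG)      != 0u;
    cause.missing_clock  = (pllsts & PLLSTS_MCLKSTS)   != 0u;
    cause.nmi_clock_fail = (nmiflg & NMIFLG_CLOCKFAIL) != 0u;
    return cause;
}
```

    On the target, the three arguments would be the register values read through the device header files at the very start of the application; the result can be stored in RAM and inspected with the debugger.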

    Can you monitor the supplies to make sure that they are well regulated? The BOR can assert a reset if the voltage levels are out of spec.

       The instruction following 3F39FB has changed from normal code to an illegal interrupt instruction.

    This is a flash location so it is likely that there was an erroneous execution of Flash_Program() that took place. It's possible that the flash contents were corrupted by an electrical disturbance, but this is much less likely.

  • Can you monitor the supplies to make sure that they are well regulated? The BOR can assert a reset if the voltage levels are out of spec.

    Hi tlee,

        Thank you very much for your reply. Can you give a more detailed explanation?

         What happens is that the chip works normally for a while after power-on, and then the instruction following 3F39FB changes from normal code to a call to INT5, which is not enabled. The customer has checked that power-up is normal and confirmed that the reset is caused by the watchdog. The watchdog resets the device because the program has entered a loop through an interrupt that is not enabled.

        As shown in the red box below, the code executes in a loop, yet INT5 is not enabled. How can we locate where the program enters this unknown state, given that the code at the same address changes during execution?

  • This is a flash location so it is likely that there was an erroneous execution of Flash_Program() that took place. It's possible that the flash contents were corrupted by an electrical disturbance, but this is much less likely.

    Hi tlee,

         On the customer's faulty units, the fault described above is 100% reproducible on every run.

  • This is a flash location so it is likely that there was an erroneous execution of Flash_Program() that took place.

    Hi tlee,

         Yes, the code runs from Flash, but every time the fault occurs, the code has changed.

  • It's possible that the flash contents were corrupted by an electrical disturbance, but this is much less likely.

    Hi tlee,

        This is unlikely, because after erasing the Flash and reprogramming the code, the chip works normally.
        

  • Thank you very much for your reply. Can you give a more detailed explanation?

    Use an oscilloscope to monitor the VDD and VDDIO supplies to make sure that they are maintained within the datasheet recommended operating conditions.

         What happens is that the chip works normally for a while after power-on, and then the instruction following 3F39FB changes from normal code to a call to INT5, which is not enabled. The customer has checked that power-up is normal and confirmed that the reset is caused by the watchdog. The watchdog resets the device because the program has entered a loop through an interrupt that is not enabled.

    I recommend focusing on the problem of code corruption in flash. The undesired branch to INT5 is caused by the flash corruption.

    Can you confirm that there are calls to the Flash API to program flash contents during operation? If so, is this CAUTION from the Flash API UG observed?

  • I recommend focusing on the problem of code corruption in flash.

    Hi tlee,

         I suspected this as well. The customer read back the Flash contents of a normal unit and a faulty unit, and the two matched. After the faulty unit is powered off and its Flash is reprogrammed, it works normally.

  • The customer read back the Flash contents of a normal unit and a faulty unit, and the two matched.

    Do you mean to say that the failing device returns the same flash image as a working device without reprogramming? Or do you mean to say that the flash image on a failing device is consistently corrupted?

     After the faulty unit is powered off and its Flash is reprogrammed, it works normally.

    This observation is good evidence to help support or dispute root-cause theories, but it is not a strong indicator of root-cause failure by itself.

  • Do you mean to say that the failing device returns the same flash image as a working device without reprogramming? Or do you mean to say that the flash image on a failing device is consistently corrupted?

    Hi tlee,

       The code has not changed. We checked that the Flash contents of the two units (the faulty one and the normal one) are the same.

  • We checked that the Flash contents of the two units (the faulty one and the normal one) are the same.

    Can you clarify your original problem statement? You had attached an image showing different flash contents in the address range of 0x3F39FB - 0x3F39FD.

    Under what conditions will the differences exist? And under what conditions (or sequence of events) will the contents revert back to original?

  • Hi Tommy,

         The left side of the figure below shows the code running on the normal unit, and the right side shows the code running on the faulty unit. Why does the code running from Flash change? It enters INT5 even though INT5 is not enabled (this is not the illegal-instruction trap).

  • The code has not changed. We checked that the Flash contents of the two units (the faulty one and the normal one) are the same.

    Is the 0x3F39FB - 0x3F39FC memory part of the programmed flash image?

    If so, at what point does the erroneous content on a faulty unit revert back to the same content as a good unit?

  • Is the 0x3F39FB - 0x3F39FC memory part of the programmed flash image?

    Hi tlee,

        Yes, the 0x3F39FB - 0x3F39FC memory is part of the programmed flash image.

        Let me clarify my question again:

        In Fig. 1, the left side shows the code running on the normal unit, and the right side shows the code running on the faulty unit. It enters INT5 even though INT5 is not enabled (this is not the illegal-instruction trap).

    Fig.1

        It then enters INT5 (which is EQEP1_INT_ISR). When execution reaches line 3F2E43, that line has changed to 3F2E43 SUBU ACC,@0 as shown in Fig. 2, and INT5 is then entered again, so the code executes back and forth as shown in Fig. 3.

    Fig.2

    Fig.3

         For a problem like this, where do I start if I want to locate the fault? After the Flash is reprogrammed, the chip works normally.

        Under what circumstances can an interrupt that is not enabled, such as INT5, be entered?

        Looking forward to your reply.

  • For a problem like this, where do I start if I want to locate the fault?

    Build a knowledge base by organizing the fault behaviors in terms of observed cause-and-effect events with a description of repeatability.

      In Fig. 1, the left side shows the code running on the normal unit, and the right side shows the code running on the faulty unit. It enters INT5 even though INT5 is not enabled (this is not the illegal-instruction trap).

    The CPU is entering INT5 because it is executing the INTR "Emulate Hardware Interrupt" op code. You can read about it in the C28x Instruction Set Reference Guide:

    Emulate an interrupt. The INTR instruction transfers program control to the interrupt
    service routine that corresponds to the vector specified by the instruction. The INTR
    instruction is not affected by the INTM bit in status register ST1. It is also not affected by
    enable bits in the interrupt enable register (IER) or the debug interrupt enable register
    (DBGIER).
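
    To make the difference concrete, here is a deliberately simplified model of the rule quoted above. It is only an illustration of the logic, not CPU-accurate code: it covers INT1 to INT14 only, ignores DBGIER and real-time mode, and assumes that INTx is enabled by bit x-1 of IER. All names in it are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* How the request for INTx arrived in this toy model. */
typedef enum {
    SRC_HARDWARE_REQUEST,  /* a peripheral or PIE group raised the interrupt line */
    SRC_INTR_INSTRUCTION   /* the CPU executed an INTR INTx op code */
} int_source_t;

/* Returns true if the CPU would vector to INTx (x = 1..14).
   ier  : snapshot of the interrupt enable register
   intm : true when the global interrupt mask bit in ST1 is set */
static bool vector_taken(int_source_t src, unsigned intx, uint16_t ier, bool intm)
{
    if (intx < 1u || intx > 14u) {
        return false;  /* outside the range this sketch models */
    }
    if (src == SRC_INTR_INSTRUCTION) {
        return true;   /* INTR ignores both IER and INTM */
    }
    /* A hardware request needs its IER bit set and INTM cleared. */
    bool enabled = ((ier >> (intx - 1u)) & 1u) != 0u;
    return enabled && !intm;
}
```

    With IER = 0 and INTM set, a hardware request for INT5 is not serviced in this model, but an INTR op code for INT5 still vectors to the ISR. That matches what is seen when the corrupted word at the flash address happens to decode as INTR.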

    It then enters INT5 (which is EQEP1_INT_ISR). When execution reaches line 3F2E43, that line has changed to 3F2E43 SUBU ACC,@0 as shown in Fig. 2, and INT5 is then entered again, so the code executes back and forth as shown in Fig. 3.

    The crux of the problem is that something is corrupting the flash image.

    After the Flash is reprogrammed, the chip works normally.

    Do NOT reprogram the flash until the nature of the corruption is better understood. Does the flash image retain the same corrupted image between power-down and power-up? Is this seen on multiple devices? If so, do they have identically corrupted images?

  • Does the flash image retain the same corrupted image between power-down and power-up?

    Hi tlee,

        According to the customer’s previous description,
    1. They used CCS/Memory to extract the contents of Flash, and compared the code burned in flash after power-on with the code when the failure occurred, and the results were consistent.
    2. They also compared the Flash of other normal products and faulty products, and the results were the same.

       So I am a little skeptical about these results. Is there a more reliable way to verify the flash image?

       

  • Is this seen on multiple devices?

    Hi tlee,

        Yes, it is most likely seen on multiple devices.

        

  • Can you monitor the supplies to make sure that they are well regulated? The

    Hi tlee,

        The power supply is normal. I also measured and confirmed this on site.

  •     According to the customer’s previous description,
    1. They used CCS/Memory to extract the contents of Flash, and compared the code burned in flash after power-on with the code when the failure occurred, and the results were consistent.
    2. They also compared the Flash of other normal products and faulty products, and the results were the same.

       So I am a little skeptical about these results. Is there a more reliable way to verify the flash image?

    This inconsistency must be resolved in order to find the root cause.

    From the screenshots, they can clearly see flash corruption from the disassembly view. When in this failing state with CPU halted:

    • Are the flash contents the same between the disassembly view vs the memory browser view?
    • Do the contents change if alternating the memory browser view between Data and Program pages?
    • Do the contents change after performing a CPU reset through CCS?
    • Do the contents change after a power-down and power-up?
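
    One way to make these comparisons less dependent on reading screenshots is to save each readback to a file and compare it word by word against the image that was originally programmed. The sketch below is a host-side illustration under that assumption; how the dumps are loaded into arrays is left out, and both function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Returns the index of the first 16-bit word that differs between the
   reference image and the readback, or n if all n words are identical. */
static size_t first_mismatch(const uint16_t *golden, const uint16_t *readback, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (golden[i] != readback[i]) {
            return i;
        }
    }
    return n;
}

/* Plain 32-bit sum of all words. It is a quick fingerprint for log files,
   not a strong check; the word-by-word compare above is the real test. */
static uint32_t word_sum(const uint16_t *img, size_t n)
{
    uint32_t sum = 0u;
    for (size_t i = 0; i < n; i++) {
        sum += img[i];
    }
    return sum;
}
```

    Adding the returned index to the start address of the dump gives the first corrupted flash address, which can then be compared across the four conditions listed above and across different failing units.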
  • If so, at what point does the erroneous content on a faulty unit revert back to the same content as a good unit?

    Hi tlee,

        Reprogramming the Flash restores normal operation. Is the problem clear now?

  • The CPU is entering INT5 because it is executing the INTR "Emulate Hardware Interrupt" op code. You can read about it in the C28x Instruction Set Reference Guide:

    Hi tlee,

         Sorry, I do not understand this comment.

  • Hi,

    tlee is out of the office today, you should receive a reply tomorrow.

    regards, Joe

  • Sorry, I do not understand this comment.

    It seems like we are having some communication difficulties. Are there any local colleagues who might be able to assist with this discussion?

  • Hi tlee,

        I have talked with local colleagues before and they gave me some suggestions, but I still could not solve the problem.

        So, could you tell me what needs to be done, or what to check, to solve this problem?

        Thanks, tlee. This problem is very urgent and important, and it will affect the promotion of new C2K products.

        I will go to the customer site to check the items you mentioned above.

  • Shaoxing,

    Based on the information provided, I believe the following to be true:

    • Once the flash is corrupted, the CPU behavior is expected because it is executing corrupted instructions
    • The flash is not suffering from physical damage because it can be reprogrammed
    • Something at run-time is causing the flash to be corrupted, where the most likely causes are:
      • Faulty input parameters passed to the Flash API calls
      • Power disturbances while the Flash API is attempting to program the flash

    Can you confirm if the application is trying to program flash during operation? If so, please focus on the API calls and power stability while programming.

    -Tommy

  • Hi tlee,

        I need to clarify one thing. Last week I went to the customer site to debug. The fault appears before the customer's main function. However, Flash API functions are only called in the customer's main function. The fault occurs during peripheral initialization. So the customer does not think it is caused by the Flash API functions, and the chip power supply is normal.

  • Shaoxing,

    Please clarify....

    From my prior understanding, it sounded like this was the corruption sequence:

    1. Device is programmed with good firmware image
    2. Device runs application
    3. Device firmware is corrupted during operation
    4. Device halts in erroneous ISR
    5. Device firmware remains corrupted through power-off and power-on
    6. Device firmware can be reprogrammed successfully

    Can you confirm my understanding?

    When you say that "the fault appears before the customer's main function," are you describing the condition at (5)? If so, please confirm that the corrupted firmware image remains static at (5) between power-off and power-on. Please also check to see if other devices share the same corrupted firmware image when they arrive at (5).

    The auditing of the API calls and power stability is only valid at (2) before the firmware is corrupted.

    -Tommy

  • Hi tlee,

        Yes, your understanding is correct. The fault persists after the chip is powered off and on again.

         Other devices share the same corrupted firmware image when they arrive at (5).

         So, are there any suggestions that we can give the customer?

  • Shaoxing,

    Ideally, the next goal would be to reproduce the failure in the lab for further debug.

    I would recommend an audit of the Flash API calls to make sure that they are not susceptible to corner cases where the wrong address or size might be passed in.

    If the failure is reproducible, they can add range-checking to the Flash API calls for the purpose of halting the execution in place if the wrong flash address is being programmed. This would be helpful for tracing the error.
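
    A minimal sketch of such a range check is shown below. The window structure, the error code, and the function-pointer type are hypothetical stand-ins: the real Flash API programming call has a different signature, and the allowed address window has to come from the customer's own memory map.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical description of the only flash region the application
   is allowed to program (word addresses, end is inclusive). */
typedef struct {
    uint32_t start;
    uint32_t end;
} flash_window_t;

#define FLASH_RANGE_ERROR 0xFFFFu  /* hypothetical status for a rejected request */

/* True only if [addr, addr + len_words - 1] lies entirely inside the window. */
static bool flash_write_allowed(const flash_window_t *w, uint32_t addr, uint32_t len_words)
{
    if (len_words == 0u) {
        return false;
    }
    if (addr < w->start || addr > w->end) {
        return false;
    }
    /* Compare against the remaining space so addr + len cannot overflow. */
    return (len_words - 1u) <= (w->end - addr);
}

/* Simplified stand-in for the real programming routine. */
typedef uint16_t (*flash_program_fn)(uint32_t addr, const uint16_t *buf, uint32_t len_words);

/* Wrapper that refuses out-of-range requests before they reach the API.
   On the target, the rejection branch is where execution would be halted
   (for example with a software breakpoint) so the call stack can be inspected. */
static uint16_t checked_flash_program(const flash_window_t *w, flash_program_fn program,
                                      uint32_t addr, const uint16_t *buf, uint32_t len_words)
{
    if (!flash_write_allowed(w, addr, len_words)) {
        return FLASH_RANGE_ERROR;
    }
    return program(addr, buf, len_words);
}
```

    If every programming request in the application goes through one wrapper like this, a request that targets the code region (such as the 0x3F39FB area seen in the screenshots) is caught at the point of the call instead of being discovered later as corrupted instructions.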

    -Tommy

  • Hi tlee,

        The fault occurs before the Flash API functions are called, which means that no Flash API related function has been involved yet.

  • Shaoxing,

    Please clarify.

    From the agreed upon corruption sequence, the device begins in a good state (1-2) until something corrupts the flash (3). How are you certain that the Flash API function is never executed during operation (2-3)?

    -Tommy

  • Hi tlee,

        The fault appears before the customer's main function. However, Flash API functions are only called in the customer's main function. I don’t know if my statement is clear. If not, feel free to call me. Thanks, Tommy.

  • However, Flash API functions are only called in the customer's main function.

    In which state of the corruption sequence is this behavior that you are describing?

    To reemphasize:

    The auditing of the API calls and power stability is only valid at (2) before the firmware is corrupted.
  • Hi tlee,

         Okay, but the fault occurs during peripheral initialization, and what the customer wants to know is why an interrupt that is not enabled gets serviced.

  • what the customer wants to know is why an interrupt that is not enabled gets serviced.

    Have you shared the below comment with them?

    The CPU is entering INT5 because it is executing the INTR "Emulate Hardware Interrupt" op code. You can read about it in the C28x Instruction Set Reference Guide:

    Emulate an interrupt. The INTR instruction transfers program control to the interrupt
    service routine that corresponds to the vector specified by the instruction. The INTR
    instruction is not affected by the INTM bit in status register ST1. It is also not affected by
    enable bits in the interrupt enable register (IER) or the debug interrupt enable register
    (DBGIER).

    Do they agree or disagree with it?