BQ79616-Q1: Power Regulator Circuit Failure

Part Number: BQ79616-Q1
Other Parts Discussed in Thread: BQ79616

In the previous post, the power regulator failure was believed to be a one-off event, but it has since been discovered to be a recurring issue. After the board in the failed battery was replaced, the system worked properly for several days and then, without any warning, experienced the exact same failure as the previous board. More information was gathered from this occurrence and is outlined below. Any help in diagnosing why this is happening and what is causing it is greatly appreciated.

Problem Behavior

Capacitor C302 and BJT Q301 from the power regulation circuit below become extremely hot and the BQ79616 becomes inoperable. This behavior persists until power is removed from the board/chip but does not appear to return immediately upon reconnecting the board to the battery stack. When in the failed state, capacitor C302 measured over 190°F with an IR temp gun and Q301 measured over 130°F. Heat could be felt rising from the board and it smelled of overheating electrical components. The output voltage of the BJT was 3.85V.

Ref   Description                                  MPN
C302  CAP, CERM, 1uF, 100V, +/- 10%, X7R, 1210     C1210X105K101T
Q301  TRANS NPN 150V 1A TO252                      ZXTN4004KTC
R317  RES, 200 OHM, 1%, 2010, SMD                  RMCP2010FT200R
R318  RES, 100 OHM, 1%, 2010, SMD                  RMCP2010FT100R
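
For readers working in SI units, the temperatures reported above can be converted with a short sketch like the one below. The function and variable names are illustrative only, and both readings were reported as "over" the stated value, so the results are lower bounds.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


# IR temp gun readings taken in the failed state (lower bounds).
readings_f = {"C302": 190.0, "Q301": 130.0}

# Round to one decimal place; an IR gun is not more precise than that.
readings_c = {ref: round(fahrenheit_to_celsius(t), 1) for ref, t in readings_f.items()}

for ref, temp_c in readings_c.items():
    print(f"{ref}: over {temp_c} °C")
```

This puts C302 at roughly 88 °C and Q301 at roughly 54 °C.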

Problem Scenario

The board in question had been operating normally for some period of time (days/weeks). The chip was addressing properly and communicating without any apparent issues. Upon powering the host controller and attempting to address the stack chips, the chips failed to address and the first stack chip exhibited the overheating capacitor and BJT. The second stack chip did not show these symptoms, but it could not be addressed because the chip before it had failed. (This system does not implement the ring architecture.) When the issue occurred, the chip should have been in a sleep state. Once the problem occurs, it persists and the chip remains non-responsive and inoperable until physically disconnected from the battery stack.

This exact issue is confirmed to have occurred twice, both times in the same battery module and both times on the first stack board. The issue is suspected to have occurred 3 additional times, but in each of those 3 cases the boards were replaced before testing for overheated components and the issue did not recur.

First Occurrence Testing / Troubleshooting

  • The problem board was removed from the battery and tested for failed components.
    • The C302, Q301, R317, and R318 were all found to be within manufacturer spec and showed no signs of failure.
  • When connected to a benchtop power supply in a testing environment, the board functioned properly and the BQ79616 addressed properly with no apparent issues.
    • 60V was applied across BAT0 and BAT16 and the board was connected to a host via the cap-isolated serial lines.
    • There was no excessive heating of any components during this test.
  • C302 showed no deviations in continuity nor capacitance versus a known good part.
  • Q301 did not exhibit any signs of damage or defectiveness. Q301 was functional during bench tests.
  • There were no measured continuity issues between any of the BAT0 / GND and any other BAT pins on the boards.

Second Occurrence Testing / Troubleshooting

  • The problem board was removed from the battery and allowed to cool.
  • The problem board was reconnected to the battery stack while not mounted in the battery to determine if the components would overheat while the board was "free hanging".
    • There was no excessive heating of any components on the board
  • The metallic standoffs used to mount the board were attached to either side of the board to test the theory that the ground plane was being pinched and causing the issue. The board was still "free hanging" and otherwise not mounted into the battery and was totally isolated minus the cell tap connections. 
    • There was no excessive heating of any components on the board
  • The board was remounted into the battery and the standoffs intentionally overtightened to test the theory that the standoffs are the root of the issue.
    • There was no excessive heating of any components on the board
    • The board now addressed and functioned without issue.
  • The BAT0 pin was intentionally shorted to the metallic standoffs and the aluminum battery enclosure to test if this would cause the issue.
    • There was no excessive heating of any components on the board
    • The board now addressed and functioned without issue.
  • The system was power cycled and addressed 15+ times and exhibited no issues.

Ideas and Theories

  1. The C302 capacitor is shorting to ground and sinking an excessive amount of current causing the BJT not to function properly and the BQ79616 to become unresponsive.
  2. Metallic mounting hardware leads to a scenario where the board is shorted and problem behaviors occur.

Useful / Relevant Information

  • System architecture is comprised of a single host board with a BQ79161 acting as a COMM transceiver connected to one or more stack boards via a twisted pair wire with cap only isolation.
  • BAT voltage to the stack chips is between 44V and 68V depending on battery state of charge. 
    • Stack chips are fully populated with all 16 cell voltage connections and 8 GPIO connections to thermistors.
  • C302 is oversized relative to the TI reference schematic, which suggests a 0.22uF capacitor.
  • C302 is a single capacitor and not two capacitors in series as is suggested in the TI design reference.
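
As a rough sanity check on the BAT voltage range quoted above, the sketch below divides the stack voltage across the 16 cells each stack chip monitors. It assumes the cells are perfectly balanced, which the thread does not state, and the function name is illustrative only.

```python
CELLS_PER_DEVICE = 16  # each stack chip is fully populated with 16 cells


def average_cell_voltage(bat_voltage: float, cells: int = CELLS_PER_DEVICE) -> float:
    """Average per-cell voltage, assuming all cells are equally charged."""
    if cells <= 0:
        raise ValueError("cells must be positive")
    return bat_voltage / cells


low = average_cell_voltage(44.0)    # minimum BAT voltage from the list above
high = average_cell_voltage(68.0)   # maximum BAT voltage from the list above
bench = average_cell_voltage(60.0)  # bench supply used in the first-occurrence test

print(f"Per-cell range: {low:.2f} V to {high:.2f} V (bench test: {bench:.2f} V)")
```

Under that assumption the range works out to 2.75 V to 4.25 V per cell, and the 60V bench test corresponds to 3.75 V per cell, so the bench test sat inside the normal operating range rather than at either extreme.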

Questions

  1. Would a short to GND of C302 cause the BJT to become non-functional?
  2. What scenarios other than a short to GND would cause the C302 and Q301 to overheat even in a sleep or shutdown state?
    1. Where would we look schematically to find a potential design issue?
  3. Could C302 being oversized (1uF) versus the 0.22uF in the design reference cause any issues?
  4. Could C302 being a single capacitor, rather than two capacitors in series as suggested by the TI design reference document, cause any issues?
  5. Can you think of a scenario where a High Voltage isolation issue somewhere else in the battery or vehicle would cause an issue?
  6. Why would this exhibit as an intermittent failure and be very difficult (so far impossible) to reproduce?
  7. Do you have any ideas on how we can intentionally replicate this issue in a testing environment?
  8. Is there any logical explanation as to why the problem seems to occur either on startup or shutdown?
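
To make questions 3 and 4 concrete, the sketch below computes the equivalent capacitance of capacitors in series and what remains if one of them fails short. The 0.47uF values are hypothetical and not taken from the TI reference design. The point about fault tolerance is a general property of series capacitors and an assumption about why a reference design might use them, not something confirmed in this thread.

```python
def series_capacitance(*caps_farads: float) -> float:
    """Equivalent capacitance of capacitors in series: 1/C = sum(1/Ci)."""
    if not caps_farads or any(c <= 0 for c in caps_farads):
        raise ValueError("provide one or more positive capacitance values")
    return 1.0 / sum(1.0 / c for c in caps_farads)


def series_capacitance_after_short(caps_farads: list[float], shorted_index: int) -> float:
    """Equivalent capacitance when one capacitor in the string fails short.

    A shorted capacitor behaves like a wire, so it drops out of the string.
    If it was the only capacitor, the rail itself is shorted.
    """
    remaining = [c for i, c in enumerate(caps_farads) if i != shorted_index]
    if not remaining:
        raise ValueError("no capacitor left: the rail is shorted to GND")
    return series_capacitance(*remaining)


# Hypothetical pair of 0.47 uF parts in series (values are an assumption).
pair = [0.47e-6, 0.47e-6]
print(f"Healthy pair:  {series_capacitance(*pair) * 1e6:.3f} uF")
print(f"One shorted:   {series_capacitance_after_short(pair, 0) * 1e6:.3f} uF")

# A single capacitor, as fitted for C302, has no such fallback.
try:
    series_capacitance_after_short([1.0e-6], 0)
except ValueError as err:
    print(f"Single cap:    {err}")
```

With two parts in series, a short in one still leaves the other across the rail. With a single part, a short in that part is a direct short to GND.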

Known Design Issues

  • It has been noted that R320 should not be populated. The issue has occurred on boards that both did and did not have R320 populated.

Schematics

  • Blakely,

    I can answer some of the questions immediately, the rest we will need time to look into:

    1) Yes. If C302 was replaced with a short circuit, the device would not work.

    4) While two capacitors would be better at heat dissipation, these parts should not get this hot.

    I have some follow up questions:

    1) Is this happening with multiple different battery packs, or just the one?

    2) When it failed the other times was it also the first stack device that failed?

    3) I am unclear with what you mean by:

    a BQ79161 acting as a COMM transceiver

    I am unfamiliar with the BQ79161, perhaps it was a typo and you meant to say 616. If that is the case, is this device not connected to any cells? Is this just the device that is connected via UART to the host? Is this the "first stack device" that was causing problems or is the next device in the chain the problematic one? 

    Regards,

    Ben

  • Hi Benjamin thank you for the reply and for looking into this further.

    In response to your follow up questions:

    1) Is this happening with multiple different battery packs, or just the one?

    The behavior presents as a perfectly functioning battery module suddenly becoming unresponsive and unable to address. At this point no attempts to reestablish communications via software are successful. We have seen this behavior occur on 4 different battery modules, but we were able to confirm that the power regulator circuit had overheated on only one of them.

    1) On the first occurrence of this behavior, the stack chips behaved normally after being physically disconnected from the battery stack, which is likely attributed to the power off reset which would take place in this case. The entire stack of boards including the host board was replaced as a precaution and the issue has not since reoccurred and has been operating normally for over 1 year now.

    2) In the second case the entire battery module was replaced for the customer under warranty and, upon being returned to us, the problem battery module no longer exhibited the same issues. It is undetermined whether this module ever experienced the failure we are discussing in this thread or whether it was somehow entirely unrelated.

    3) In the third case, the vehicle in which the module was installed experienced a HV isolation issue which we believed at the time to have damaged the BMS host board. The entire stack of boards including the host board was replaced as a precaution and the issue has not since reoccurred. It was never determined whether or not this module showed the overheating on the power regulator of the stack board.

    4) In the fourth case, the symptoms and testing outlined above confirmed the overheating power regulator components on the first stack board. Both stack boards were replaced and the module reinstalled and the same issue occurred a second time within several weeks.

    2) When it failed the other times was it also the first stack device that failed?

    I cannot definitively confirm that it was the first stack device in instances 1-3 above, but I can confirm that the first stack device was not responding when queried. Whether this was due to failed addressing or the failure in question I cannot say. However, in the fourth case, both times the failure occurred it was the first stack device with the overheating components.

    3) I am unclear with what you mean by:

    This was a typo. I intended to say that we utilize a BQ79616 as the bridge between the Host UART and the first stack device. The BQ79616 acting as the bridge is not connected to any battery cells. It is powered off of the 12V regulated power supply on the Host PCB. It appears to be the "first stack device" causing the issues but as mentioned above this is only definitive for two instances of the failure. To clarify, the base device resides on the host PCB and is not connected to any battery cells. The "first stack device" is the first device connected to the battery stack and monitors the 16 most negative cells. The "second stack device" is identical to the first but monitors the 16 most positive cells in the module. In all of these cases this is a 120V module with one host board/base device and two stack devices. It may or may not be relevant but this would mean that VC16 on the first stack device is also VC0 on the second stack device since the cells are in series.

    In our analysis of the issue we have also identified a couple of potential issues with the PCB design but are unsure if these could be contributing to this failure or what other issues, if any, they may cause. Despite these issues, we have had these boards functioning as intended for several years with this current PCB design.


    1) TI notes a new recommendation for a 470nF capacitor between CB16 and BAT in the design recommendations document. Our PCB design does not include this capacitor.

    2) Several issues were identified with the UART layout on both base and stack devices. It is unclear what role, if any, the level shifters play in this.

    1. Base device RX is pulled to VIO with a 10k resistor instead of CVDD with a 100k resistor as recommended in the documentation.
    2. Base device TX is pulled to CVDD with a 10k resistor.
    3. Stack device RX and TX are both pulled high to CVDD instead of being left floating.
    4. NFAULT is pulled to CVDD with a 10k resistor instead of 100k on the base device.

    3) There are no parallel bypass capacitors alongside the decoupling caps on the LDOs.

  • Hi Benjamin, Just wanted to follow up and see if there was any other information I could provide?

  • Blakely,

    Sorry for the delayed response. 

    My colleagues and I have discussed this and the evidence here indicates that this is not an issue with the BQ79616 device. 

    - This only occurs in a small percentage of your battery packs

    - The only pack that has had repeated incidents had issues with two different BQ79616 devices

    - The BQ79616 device in that pack shows no sign of damage and is fine when placed back in the pack.

    - This issue is not reported from other customers

    I would recommend investigating other potential causes. 

    Regards,

    Ben