We have encountered an issue in the field that we have been unable to rectify via software, and we are struggling to identify the root cause. Our implementation of the BQ79616 uses a host PCB carrying the MCU and a single BQ79616 acting as a transceiver, plus several stack boards, each with a single BQ79616. The design does not use the ring architecture, and the BQ79616 acting as the host transceiver has only COMMH+ and COMMH- broken out to a connector. The implementation in question uses a single host and two stack boards. The system is sealed within the battery enclosure and is generally physically inaccessible. External communication with the host MCU is accomplished via either CAN bus or serial. Additionally, since the BQ79616 on the host board is powered from the board's power supply (13.6V nominal), it undergoes a POR with each power cycle of the host board.
The problem condition is that the two stack chips are completely unresponsive to any communication attempt to read, write, address, or reset the devices. The condition first appeared when the host board was powered up from an external supply to receive a firmware update. The issue occurred before the update was performed and has persisted both after the new host MCU firmware was installed and after rolling back to the previous working version.
We are unable to replicate this issue in bench testing. By scoping the UART and COMMH/COMML lines, we have confirmed that on the bench the HW_RESET procedure propagates from the MCU to the base device and through the stack as expected. Attached are some images of the scope data for reference. The firmware on the host MCU contains no provisions for manipulating the COMMH/COMML transceivers of any device directly, and since the condition occurred during an otherwise normal startup, it seems exceedingly unlikely that the stack chips would have had any transceivers disabled. The most likely explanation seems to be that the base device is not forwarding any commands to the stack chips, rather than the stack chips being truly unresponsive. However, a HW_RESET ping does nothing to remedy the issue, nor does manually enabling the COMMH/COMML transceivers via the COMM_CTRL registers on the base device.
The base BQ79616 does communicate with the host MCU via UART without issue and the registers can be read from and written to as expected. As mentioned, neither a HW_RESET ping nor a POR results in any change to this behavior. We have tried both of these reset methods followed by a double wake ping after the HW_RESET period, as well as manually enabling the VIF through the COMM_CTRL debug registers, all without any change in behavior. Given that these devices are inaccessible, we are unable to physically probe any of the LDO or other electrical signals.
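To make the order of operations concrete, the recovery attempts described above can be sketched roughly as follows. This is an illustrative outline only, not our production firmware and not TI sample code: the `bq_bus_t` hooks, the `bq_try_recover` function, the result names, and the fake bus used to trace the steps are all hypothetical, and the wait times are left as parameters rather than datasheet values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transport hooks. None of these names come from TI's
 * sample code; a real port would map them onto the UART driver. */
typedef struct {
    void (*hw_reset_ping)(void *ctx);           /* HW_RESET ping to the base device */
    void (*wake_ping)(void *ctx);               /* WAKE ping to the base device */
    void (*delay_ms)(void *ctx, uint32_t ms);   /* blocking delay */
    bool (*enable_stack_comm)(void *ctx);       /* COMM_CTRL write on the base device */
    bool (*stack_read_ok)(void *ctx);           /* true if any stack device answers a read */
    void *ctx;
} bq_bus_t;

typedef enum {
    BQ_RECOVERED_AFTER_RESET,
    BQ_RECOVERED_AFTER_COMM_ENABLE,
    BQ_STACK_UNRESPONSIVE
} bq_recovery_result_t;

/* Reset, wait, double wake ping, then fall back to the COMM_CTRL write.
 * The wait times are caller-supplied assumptions, not datasheet values. */
bq_recovery_result_t bq_try_recover(const bq_bus_t *bus,
                                    uint32_t reset_wait_ms,
                                    uint32_t wake_wait_ms)
{
    assert(bus != NULL);
    bus->hw_reset_ping(bus->ctx);
    bus->delay_ms(bus->ctx, reset_wait_ms);
    for (int i = 0; i < 2; i++) {               /* double wake ping */
        bus->wake_ping(bus->ctx);
        bus->delay_ms(bus->ctx, wake_wait_ms);
    }
    if (bus->stack_read_ok(bus->ctx)) {
        return BQ_RECOVERED_AFTER_RESET;
    }
    if (bus->enable_stack_comm(bus->ctx) && bus->stack_read_ok(bus->ctx)) {
        return BQ_RECOVERED_AFTER_COMM_ENABLE;
    }
    return BQ_STACK_UNRESPONSIVE;
}

/* Fake bus that records each step as one character, for tracing only. */
typedef struct {
    char log[16];
    size_t n;
    bool stack_alive;   /* what the simulated stack read should report */
} fake_bus_t;

static void fake_record(fake_bus_t *f, char c)
{
    if (f->n + 1 < sizeof f->log) {
        f->log[f->n++] = c;
        f->log[f->n] = '\0';
    }
}
static void fake_reset(void *ctx) { fake_record(ctx, 'R'); }
static void fake_wake(void *ctx) { fake_record(ctx, 'W'); }
static void fake_delay(void *ctx, uint32_t ms) { (void)ms; fake_record(ctx, 'd'); }
static bool fake_enable(void *ctx) { fake_record(ctx, 'E'); return true; }
static bool fake_read(void *ctx)
{
    fake_bus_t *f = ctx;
    fake_record(f, 'Q');
    return f->stack_alive;
}

bq_bus_t fake_bus_make(fake_bus_t *f)
{
    bq_bus_t bus = { fake_reset, fake_wake, fake_delay, fake_enable, fake_read, f };
    return bus;
}
```

In the field, every path through this sequence ends in the equivalent of `BQ_STACK_UNRESPONSIVE`, while reads and writes to the base device itself continue to succeed.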
At this point, the only physical condition we can think of that could have occurred is a temporary VBAT voltage below the 9V minimum on the base device. This would have been momentary, but it could have happened during the connection to the external power supply and possibly also during the device's internal startup routine. Is anyone aware of any other condition or state in which a device may become unresponsive and unrecoverable through the various reset attempts, or in which a base device may become unable to communicate with stack devices despite functional UART communication with the host?
This is a scope image of a stack device on the bench receiving the HW_RESET tone and shutting down the LDOIN regulator. In bench testing, all devices respond to the HW_RESET tones (or, in the case of the base device, the HW_RESET ping).