BQ40Z50-R2: Cannot communicate on SMBus after shutdown command / parameter file update, chip does not wake up

Cedric Milleret

Part Number: BQ40Z50-R2
Other Parts Discussed in Thread: EV2400, BQSTUDIO, BQ40Z50

Hello,

I was developing the FW of my master to drive the gauge (v5.04).

The gauge has a 4S (16V) setting, but with a double current (via the TI procedure to go from 30000mA to 60000mA, so calibrating with a factor of 2).

I had several issues that I could not resolve (SOH, abnormal temperature on a probe, FETs not responding to commands...), so I decided to do a reset (which did not change anything), then a "shutdown" via the bqStudio command.

From this moment on, it has been impossible to reconnect to the chip: the gauge is no longer detected, and if I choose the chip manually, the data remains at 0. Probing the SMBus with a scope, I can see that the chip does not respond (the bus is stuck neither at 0 nor at 3.3V, and the signals from the EV2400 are clean: edges, idle state...).

I applied a voltage of 16V on BAT, PACK, VCC to wake the chip up, and also forced these 3 pins to 0 (shunt to GND) before restoring the voltage. Note that the cells are glued/soldered and the PCB is sandwiched, so I cannot disconnect VC1, VC2, VC3, VC4 or access the signals.

The commands in bqStudio have no effect, and neither does a FW download.

I don't understand why the chip doesn't wake up
There is no reset pin, so what did I miss?

Regards

8 months ago

0 Cedric Milleret 8 months ago

Intellectual 260 points

Hello,

Continuing my investigations, I crashed my second prototype again, and now I have no more cards to work with.
The context is different.

On the second card,
1) I upgraded the FW to version v5.05 (the latest from last month)
2) I loaded the chemistry
3) I loaded my parameter file
4) I did the clear lifetime
5) export the parameter file
6) perform the scaling of parameters (/2 of the mA and cW to be able to use a double current up to 65000mA instead of 32500mA)
7) import the modified parameter file
8) current calibration (2000mA real for 1000mA in bq studio)
the card was not connected to anything, neither charger nor load
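
The scaling arithmetic in steps 6 and 8 can be sketched as below. This is only an illustration, assuming one scale factor of 2 applies to every mA and cW value. The function names are hypothetical, and nothing here reads or writes a real .gg.csv file.

```python
# Assumption: one scale factor for all mA and cW values, taken from the post
# (2000 mA real is calibrated as 1000 mA in bqStudio).
SCALE = 2


def to_gauge_units(real_value: int, scale: int = SCALE) -> int:
    """Convert a real mA or cW value to the scaled value stored in the gauge."""
    return round(real_value / scale)


def to_real_units(gauge_value: int, scale: int = SCALE) -> int:
    """Convert a value reported by the gauge back to real mA or cW."""
    return gauge_value * scale
```

With this convention, a calibration current of 2000 mA is entered as `to_gauge_units(2000)`, and any value read back from the gauge has to be passed through `to_real_units` by the master FW before it is used.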

Since then:
- the main FETs went OFF
- if I force the wake-up (sys voltage, pack, bat), the chip no longer responds, no more SMBus COM
I do not understand why the chip is stuck and does not restart, and how I can reset it
I specify that the card was working before, for days. It is from the last steps of setting/calibration that it no longer works, so purely software, but I did not see exactly the fateful step and the moment of the stop.

So I crashed 2 different cards whose common point is the setting/calibration.

Because I cannot communicate on SMBus (the circuit is silent), I cannot change parameters, erase, or download FW.

I am completely blocked.

How can I recover SMBus communication? (Does the gauge have a bootloader mode?)

To help with the analysis, I have attached the configuration file. (I tried to upload the .srec, but it seems not to be allowed.)

r5_250102_scaled.gg.csv

Regards,

0 Anthony 8 months ago in reply to Cedric Milleret

TI__Mastermind 28470 points

Hi CM,

Would it be possible to scope the TS1 pin of the device to see if there is any pulsing apparent from the device? This will give us some idea of what state the gauge is in during this time.

Regards,

Anthony

0 Cedric Milleret 8 months ago in reply to Anthony

Hello,

I put the scope on TS1 (pin 10) that is connected to a 10k NTC.

I put scope in single shot to capture anything since startup (200mV/div, 200µs and 100mV positive edge trigger)

I do not see anything; the signal stays at 0V.

I waited a few minutes (with Vsys powered at 16V).

The signal appears to remain at 0V permanently.

Regards,

(note: in order to be better in phase with the time zone difference, are you in Texas, i.e. -7h compared to France?
If this is the case, the usable time slot: 8-13h Texas --> 15h-20h France)

0 Cedric Milleret 8 months ago in reply to Cedric Milleret

To add more information for investigation, I continue today to search...

I took out the demo kit that I had used before developing my card: bq40z50EVM-651.
I upgraded it with FW v5.05

I tried to load a slightly older parameter file (one that worked, and that predates the file that coincides with the crash), and I got an error, which is already abnormal.
The error reports a failure on a parameter that does not seem to be a problem. (As a reminder, this file had been loaded in the card and worked.)

So :
- I put back the initial FW v5.05 from TI (to restart initial state)
- I change parameter by parameter the different values between the initial TI config file (exported to compare) and my parameter file.
- the parameter that was causing problems when writing the entire file is written without problems (very surprising)

- I continued and it is when writing "GasGauging \ State \ Update Status" that the writing does not work

I do not know in what order the parameters are written, but I suspect that the error announced in bqstudio does not correspond to reality, which adds to the confusion for understanding
for the moment, I do not dare to put the last suspect parameter file for fear of crashing the chip.

see attached file

- export param file from FW v5.05 just programmed

TI_initial_5.05.gg.csv

- file that fail to program just after

test_fail.gg.csv

Regards

0 Anthony 8 months ago in reply to Cedric Milleret

Hi CM,

Cedric Milleret said:
I continued and it is when writing "GasGauging \ State \ Update Status" that the writing does not work

What is being attempted to be written to the Update Status?

If possible can you try sending command 0x08 and read back 0x0D at this time and see what is received?

BqStudio will not open immediately and the device will have to be chosen from the pop-up list, however these commands should still be able to be sent from the device.

Also, yes our team is located in Texas.

Regards,

Anthony

0 Cedric Milleret 8 months ago in reply to Anthony

Hello,

For Update Status, the "golden file" (under construction) has "enable" set to 1, so the value is 0x06; the card is initially at 0x02.

For the requested test, I do the advanced Comm SMB action :

1) I connect the demo card that has only FW 5.05 with default parameters

-->read respond = 0x000F

Just to understand, what is the goal of this test?

According to the documentation, a read word at address 0x0D returns RSOC. The value 0x000F corresponds to 15%, which is consistent with the default 3-cell configuration supplied at 15V.

But what is cmd = 0x08, and why send it first? I tried other values (such as 0x00 and 0x44) and got the same result for the read word at 0x0D.

2) I leave bq studio on and I connect the EV2400 probe on the other card, which Vsys is forced by an external power supply to 15V to power the chip.
Cmd 0x08 fails with no ACK (as I said at the beginning of the ticket, the SMBus communication is silent).

- do you have any idea that explains the chip crash?

- is there a way to force the boot in bootloader?

- were you able to detect anything in the files I added to the ticket?

- Do you have a series of actions or investigations to suggest that I can carry out given the one-day time difference?

Regards

0 Anthony 8 months ago in reply to Cedric Milleret

Hi CM,

Cedric Milleret said:
For update Status, the "golden file" (in construction) contains "enable" to 1, then value is 0x06 and card is initially at 0x02

Understood, I do not believe there is any issue in leaving this at 0x06 since this shows that the learning cycle has already been completed.

Cedric Milleret said:
1) I connect the demo card that has only FW 5.05 with default parameters

-->read respond = 0x000F

Just for understood, what is the goal of this tests ?

Referred to doc, read word at adress 0x0D is for RSOC. Value is 0x000F and it correspond to 15% (because 15V supply of 3 cells by default, it is consistent)

But, what is the cmd = 0x08 ? (and why before ?). I try another value (like 0x00 and 0x44), I have the same result of read work 0x0D

Understood. The process of sending 0x08 and then reading back 0x0D was done to see whether the gauge was in ROM mode, which I believe it is not, based on the received value. Typically, if the gauge is in ROM mode, then the device will produce a much different value.
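
For anyone wanting to repeat this check from a host other than bqStudio, a rough sketch follows. Everything in it is an assumption rather than TI's procedure: `probe_gauge` is a hypothetical helper, "sending 0x08" is treated as a plain read-word, the 7-bit address 0x0B is assumed to be the gauge address, and `bus` is any object with a `read_word_data(address, command)` method (modelled on the smbus2 call of that name).

```python
from typing import Optional

GAUGE_ADDR = 0x0B    # assumption: 7-bit SMBus address of the gauge
CMD_FIRST = 0x08     # command sent first, as requested in the thread
CMD_READBACK = 0x0D  # RSOC in firmware mode, per the post above


def probe_gauge(bus) -> Optional[int]:
    """Send 0x08, then read back 0x0D. Return the word, or None on NACK."""
    try:
        bus.read_word_data(GAUGE_ADDR, CMD_FIRST)
        return bus.read_word_data(GAUGE_ADDR, CMD_READBACK)
    except OSError:
        # Assumption: the bus driver reports a missing ACK as OSError.
        return None
```

A return value of 0x000F would match the 15% RSOC seen on the EVM, while `None` would match the silent card.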

Cedric Milleret said:
I leave bq studio on and I connect the EV2400 probe on the other card, which Vsys is forced by an external power supply to 15V to power the chip.
Cmd 0x08 is failed. No ACK (as I said at the beginning of the ticket, the SMBus com is silent)

If the 0x0B is NACKing at this time, it does seem like there is an issue with the comms at this time as you stated before. In this case, when you say that the external power supply is connected to Vsys, is this the pack side or battery side connection? If it is from the battery side, is there any effect from a power on reset being implemented by removing all power to the device to try and restart the comms?

In the two files sent earlier, it seems that different firmware versions are being used (5.05 in the new file and 5.04 in the error file). We do not recommend using .gg files to program gauges running a different firmware version, in case there are differences in the data flash mapping. The .srec file avoids this because it contains both the firmware and the data flash configuration, so we recommend using it to prevent any version mix-up.

Regards,

Anthony

0 Cedric Milleret 8 months ago in reply to Anthony

Hello,

Investigations on FW v5.05 are subsequent to the problem, the problems explained in the tickets were carried out on FW v5.04.
I put FW 5.05 to see if there were functional differences. I have same result in v5.04.
The minor version evolution of the FW does not seem to change anything in the contents of the parameter file (verified by comparing), so indeed it is worth reporting it, but it does not change anything a priori.
(in my previous designs, with this gauge and others, I have indeed already seen that the mapping of major versions can change, so I am aware that the files should not be mixed).

Concerning powering up via Vsys, given that the FETs are OFF, this allows the PACK pin to be powered.
As I said, BAT and VCCFET are always powered. Given that the electronic card is not accessible (sandwiched and soldered / glued to the cells), I am not able to disconnect the signals (VC1 to VC4), but if you go back to the details of the manipulations made at the beginning of the ticket, you will see that I tried precisely to force the OFF state by shunting the PACK, BAT and VCCFET signals to GND. Unfortunately this did not change anything; the communication did not come back.
I therefore suspect that the FW starts on an immediate crash, hence my question on the operating mode of the bootloader.

As the ticket is starting to grow in information, I remind you that:
- I built the "golden file" from an export of my reference card on which I regularly (on weeks...) changed the parameters (as development progressed)
- when I reimported the file into my cards after scaling (to have > 30000mA), the cards never restarted after the reboot
- I tried to reboot them without success (via PACK, BAT, VCCFET), the SMbus com remains silent.

My questions are therefore:
- when the FW starts, is it the application that has a bootloader function (therefore a bootloader managed by the application as to manage the parameters), or does the bootloader start first and then load the application (and in this case is there a way to stay in bootloader)?
- Is it possible that a combination of parameter values can crash the execution of the FW at boot (e.g. div /0, blocking on a threshold...)
- is it possible that writing the file in one go (all the parameters at once) can crash the chip, while building the memory map parameter by parameter did not cause a crash (this is also a point that I raised during my last tests with you)?

- did you try to load my gg.csv file onto your board?

Regards

Cédric

0 Anthony 8 months ago in reply to Cedric Milleret

Hi Cedric,

Understood, please allow us to use the .gg file previously sent to try and recreate the issue. When programming firmware, bqStudio will send the device into ROM mode, then sends the necessary information to the device. I have not seen the changing of a single parameter cause the device to crash as a whole when programming, however was the input to the parameter within the thresholds of that parameter?

Regards,

Anthony

0 Cedric Milleret 8 months ago in reply to Anthony

Hello,

Of course I attached the files to you so that you can use them, otherwise what would be the point?
There are several files in order to do the different steps that I described.

For my part, the situation is critical, and already two weeks have passed without significant progress: my cards are unusable (all crashed), and I have no way of solving the problem without your help. All my development is on hold (because the BMUs power other cards), and since they are power BMUs integrated into a system, I cannot simulate them or replace them with something else.

From what I understand, the chip does not start in the bootloader (ROM mode?) and then move to application mode. It is the application that starts directly (so first); then, if bqStudio asks for a firmware download, the application launches the bootloader (ROM mode).

is that right?

If this is the case, it means that if the application is crashed, the bootloader no longer works,

is that right?

If this is the case, what is the method to force the boot in bootloader (ROM mode)?

Concerning the modification of the parameters, I have never had an alert or a restriction by the software, neither on import nor export. In addition, by checking all the parameters when I was looking for what the problem could be, I did not see anything abnormal.
It is precisely by loading the files that you will be able to see something since you have the source code to analyze.

To summarize (each exchange of the ticket raised several independent points to analyze)
- the import of the suspect file, followed by the reboot led to a crash of the card
- building the configuration parameter by parameter (before export / import) did not crash the card
- an import of another file failed on a parameter that is good (when modified manually)
So all this makes me suspect that there may be a problem when writing a file in one go, with a certain combination of parameters that would be poorly managed internally by the FW.

And the ultimate question: how do I get out of the crash? How do I recover SMBus communication so that I can load a clean FW again?

Being completely blocked, I am of course open to any actions that you ask me to do or files to provide you in order to unblock the situation as quickly as possible. My constraint is to have a very integrated, compact electronic card, with large soldered and glued cells, there are few accessible signals, and the chip is not accessible without destroying part of the assembly, which would be very expensive.
I had previously developed in "exploded", use demo-kit..., validated the HW and FW operation, but as it was in finalization, I did the real assembly, and it was during the final injection of the parameters that I lost the chip.

Thank you for helping me,

Regards

Cédric

0 Anthony 8 months ago in reply to Cedric Milleret

Hi Cedric,

Sorry for the delay, I received some time to test the fail case .gg file on one of our bq40z50 EVMs. First, I tried to reproduce the case using the 5.04 FW to get a baseline of what was occurring, where we attached a power supply with ~11V to mimic the case seen in the bqStudio image above. The gauge was able to be programmed with the .gg file, however it immediately fell into shutdown mode once the programming was done.

When this was done with the 5.05 FW, we were able to reproduce the same issue above where the ERETM Voltage Threshold error would be thrown, and the gauge would immediately fall into shutdown. However, since we are using a power supply instead of cells, we were able to raise the voltage to ~16V, which allowed the gauge to exit shutdown and communicate. Also, when we redid the process using 16V, there was no issue programming the fail case .gg file and the gauge could communicate fine.

Based on this, I do not believe the gauge is crashing but just stuck in shutdown since the SDV bit also gets set at this time. Even if the gauge is woken up, it will still fall into shutdown if the minimum cell voltage meets the threshold.
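
The behavior described above can be illustrated with a small sketch. It assumes the supply divides evenly across the configured series cells, and the 3000 mV threshold in the usage note is a placeholder, not a value taken from the .gg file. The function names are hypothetical.

```python
def cell_mv(supply_mv: int, series_cells: int) -> int:
    """Per-cell voltage, assuming the supply divides evenly across the cells."""
    return supply_mv // series_cells


def stays_in_shutdown(supply_mv: int, series_cells: int, threshold_mv: int) -> bool:
    """True if the per-cell voltage is at or below the shutdown threshold,
    so a woken gauge would fall straight back into shutdown."""
    return cell_mv(supply_mv, series_cells) <= threshold_mv
```

With a placeholder threshold of 3000 mV and a 4-cell configuration, `stays_in_shutdown(11000, 4, 3000)` is true (2750 mV per cell), while `stays_in_shutdown(16000, 4, 3000)` is false (4000 mV per cell).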

Has this process been done with the cell voltages being at a fully charged state? What is the voltage being used at this time as well?

Regards,

Anthony

0 Cedric Milleret 8 months ago in reply to Anthony

Hello,

My configuration is 4S, i.e. 4 x 3.7V, which corresponds to a 14.8V pack (16V at full charge) and not 11V (3S). The parameter file is therefore for 4S.

When the problem occurred (and this is still the current state), the cells were fully charged, i.e. approximately 4.10V each, which gives more than 16V. As the pack is inactive, they have not discharged, and the total voltage is still greater than 16V.

So, I do not see why the SDV bit would be activated in my case. On your side, if you have supplied 11V to a pack configured in 3S, this would be normal, but this does not correspond to the case studied.

The image with bqStudio at 11V does not correspond to my file but, as I said, to the default file of FW v5.05 (3S), before I loaded my parameter file. This screenshot corresponded to the test you asked me to run with the commands (on an EVM kit, then on my card, where the response is precisely the failure). I do not have a screenshot of my case since bqStudio can no longer connect.

To test my case, you have to use the files at the beginning of the ticket, therefore the settings files that have the 4S configuration.

The test you did therefore does not correspond to the ticket.

I need to unblock the situation. Currently, I can't do anything except wait...
Can you answer me about the bootloader?
So, I could also continue to investigate.

Regards

Cédric

0 Anthony 8 months ago in reply to Cedric Milleret

Hi Cedric,

Understood, thank you for the clarification. We were able to reproduce the issue of programming the .gg file, however from our side we are able to shut the device down and restart it with no issues. In this case as well, since a 4 cell formation is being used, there should be no reason to have to WAKE the device since there are 4 fully charged cells in use. Regarding the SMBus engine, the best way to restart this is to cause some sort of reset, to which in this case seems like the only possible way would be by a power-on reset, which is probably unreasonable given the difficulty of removing the cells from the card.

Cedric Milleret said:
From what I understand, there is no bootloader startup (ROM mode?), then in application mode. It is the application that starts directly (so first), then if bqstudio asks for a firmware download, it is then that the application launches the bootloader, (ROM mode)

is that right?

If this is the case, it means that if the application is crashed, the bootloader no longer works,

is that right?

If this is the case, what is the method to force the boot in bootloader (ROM mode)?

The gauge should start in FW mode then proceed to ROM mode if programming is requested. The process of entering and exiting ROM mode (if the SMBus engine was working correctly) can be found below:

We apologize for the difficulty, we will continue looking into the .gg file to recreate the state at this time.

Regards,

Anthony

0 Cedric Milleret 8 months ago in reply to Anthony

Hello,

1) When you say that you managed to recreate the problem, what problem are you talking about exactly?

I myself do not know which step precisely is causing the problem since I could not try again given that I no longer have any cards (all crashed). I had done several steps (see my previous descriptions)
- the parameter file that is not loaded completely and that aborts in progress when it is written in 1 go?
- the FW crash when loading the parameter file?
- the FW crash after calibration (SLUA760 procedure §2 for high current systems) ?
There must be a combination of several factors to obtain this issue.

2) So, this is what I suspected: the application starts first, and the switch to bootloader is only possible if the application is operational. In the event of an application crash, the bootloader is no longer accessible because the SMBus communication is managed by the application (and the "ROM mode" command is a message managed by the application).
Do you confirm?
In this case, what is the hardware way to enter the bootloader mode, or how to program by an external programmer? It is definitely possible.

3) To restart the SMBus engine, of course that would be the goal, but in the case I presented, it seems that it crashes upon startup.
Similarly, I tried to "force" the reset as I said by shunting (to GND) the signals that were accessible, namely BAT, PACK, VCCFET. I do not have access to the other signals.

Is doing this enough to force the reset?

4) In a previous message, you had me spy on the analog signal TS1, which apparently also has a logical functionality (in ROM mode). Do the other 3 channels TS2, TS3, TS4 also have transverse functionalities for example the ROM mode? (there is necessarily a way to program an empty chip).

My situation is very delicate because I have BMUs fully assembled and broken only because the software does not start anymore, and my only immediate need is to be able to start them, so I could analyze and test at the same time as you what makes everything crash. Currently I can not do anything (all project blocked) and I have to wait for you to do tests, but it can still last a very long time...

Regards

Cédric

0 Anthony 8 months ago in reply to Cedric Milleret

Hi Cedric,

When we try to reproduce this by following the steps above (inputting the fail case .gg file, sending either a reset or a shutdown to put the gauge into this state, then attempting to recover), we have been able to establish communication with the device. We have not seen a situation before of the gauge entering this state just from parameter configuration, especially with the SMBus engine being down. I believe that since this is reproducible, we should look into a customer return, since that team would be able to check whether there are any issues with the physical device at this time. The TI Representative or FAE should have more information on how to accomplish this; if more information is needed, please let me know.

Cedric Milleret said:
2) So, this is what I suspected: the application starts first, and the switch to bootloader is only possible if the application is operational. In the event of an application crash, the bootloader is no longer accessible because the SMBus communication is managed by the application (and the "ROM mode" command is a message managed by the application).
Do you confirm?
In this case, what is the hardware way to enter the bootloader mode, or how to program by an external programmer? It is definitely possible.

This is correct. The only way for the device to enter ROM is through the command being sent, there is no hardware entry to force it into this mode.

Cedric Milleret said:
3) To restart the SMBus engine, of course that would be the goal, but in the case I presented, it seems that it crashes upon startup.
Similarly, I tried to "force" the reset as I said by shunting (to GND) the signals that were accessible, namely BAT, PACK, VCCFET. I do not have access to the other signals.

Is doing this enough to force the reset?

Removing power by grounding BAT and VCC at the same time should be enough to cause a reset, since they are the main power sources of the device.

Cedric Milleret said:
4) In a previous message, you had me spy on the analog signal TS1, which apparently also has a logical functionality (in ROM mode). Do the other 3 channels TS2, TS3, TS4 also have transverse functionalities for example the ROM mode? (there is necessarily a way to program an empty chip).

The purpose of this test was to check how much pulsing is occurring, to see the activity of the gauge. Whether pulsing is present, and if so its period, can give insight into which mode the gauge is in when scoped. These pins cannot be used for programming.

Regards,

Anthony

0 Cedric Milleret 7 months ago in reply to Anthony

Hello,

Thanks again for your help.

1) To reproduce the problem, I think it is important to reproduce each step precisely. As I said, I do not know exactly when it was decisive, but what I know is that before that I had been using the gauge for several weeks with one-off configurations, calibration tests, I deleted and restarted certain steps several times (the usual process in development), everything was fine, and it was when I did everything at the same time in "production" mode (parameter file in 1 go and calibration / scaling) that the issue occurred.

From what you have already managed to reproduce, even if you have not yet had the complete problem, what have you identified? (so regarding the loading of the parameter file which generates a problem, this could also give me some leads to follow)

For the FAE, I looked for a contact in France, and I found no one. There is no directory available on the TI website. I tried to call TI France without success. You are the only ones to answer.

2) I am very surprised that the only way to enter ROM mode is only via the FW (in IAP), and that there is no alternative hardware way. So, at the slightest problem, it's over. In my opinion, this is a major flaw in the chip.

How do you program a blank chip?

3) OK, so if the SMbus engine does not start after my manipulation, there is no hope.

4) OK, so there is no solution to recover the programming?

From what I understand now:
- whatever the problem that will be identified, in the current state, there is no way to reprogram a crashed chip. So I have no solution to restore the initial operation. So I am forced to destroy my BMUs to access and unsolder the chip and replace it with a new one?
- then, once the chip is replaced (but the BMUs are very damaged), what am I going to do to avoid falling into the same problem again?

From now on, the time to supply other chips, disassemble a card, and replace the chip, it is at least 1 to 2 weeks of delay.

------additional information since this morning------

So, today I tried to move forward alone. I dismantled a BMU, so some damage to separate the cells in order to recover the PCB.

I analyzed the PCB, no HW problem, all components are in good condition (diodes, transistors...).
I connected the PCB to a laboratory power supply (16V) by simulating the 4 cells by a 1k divider bridge (4x 1k in series).
- BAT = ~15.5V
- VCC = ~15.7V
- PBI = ~15.2V
- PACK : switch to BAT to wakeup
- VC1 to VC4 : ~4V each
- no conduction of external balancing
- TS1 to TS4, PTC : no voltage
- chip is connected only to SMbus to EV2400
- SCL/SDA verified at scope (idle at 3.3, clock and data send by PC)
- no communication (still the same observation)
- the power supply provides a current of 20mA. Knowing that the divider bridge consumes 4mA, this would mean that the chip consumes 16mA (I removed the other circuits of the board by disconnecting their supply nets).
According to the documentation, the chip consumes <1mA in "normal" mode, so obviously there is something abnormal.
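
The current estimate above can be checked with a few lines. This is only the arithmetic from the measurements listed, assuming the 4 x 1k bridge sees the full 16V supply and everything else is drawn by the chip. The function names are hypothetical.

```python
def divider_current_ma(supply_v: float, resistors_ohm: list) -> float:
    """Current through a series resistor bridge, in mA."""
    return supply_v * 1000 / sum(resistors_ohm)


def residual_current_ma(total_ma: float, divider_ma: float) -> float:
    """Current left over once the bridge current is subtracted, in mA."""
    return total_ma - divider_ma


bridge_ma = divider_current_ma(16, [1000, 1000, 1000, 1000])  # 4 x 1k bridge
chip_ma = residual_current_ma(20, bridge_ma)                  # 20 mA measured
```

This gives 4 mA for the bridge and 16 mA for the remainder, which is well above the <1mA expected in normal mode.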

Can the chip in perpetual "reset" have a consumption of this type?

----------------------------------------------------

Regards,

0 Anthony 7 months ago in reply to Cedric Milleret

Hi Cedric,

Cedric Milleret said:
1) To reproduce the problem, I think it is important to reproduce each step precisely. As I said, I do not know exactly when it was decisive, but what I know is that before that I had been using the gauge for several weeks with one-off configurations, calibration tests, I deleted and restarted certain steps several times (the usual process in development), everything was fine, and it was when I did everything at the same time in "production" mode (parameter file in 1 go and calibration / scaling) that the issue occurred.

From what you have already managed to reproduce, even if you have not yet had the complete problem, what have you identified? (so regarding the loading of the parameter file which generates a problem, this could also give me some leads to follow)

For the FAE, I looked for a contact in France, and I found no one. There is no directory available on the TI website. I tried to call TI France without success. You are the only ones to answer.

From what we have recreated so far, we have been able to get bqStudio to produce the same error, where the ERETM Voltage Threshold will not be programmed correctly using the .gg file. After receiving counsel from our team, I was made aware that this can occasionally happen when programming using .gg files but will not occur when programming with an .srec. Based on this, we have yet to find a reason why this would send the gauge into the state being observed on your side, but we will re-attempt using the full production method to see if this reproduces it.

Regarding the FAE, I have reached out to our European team to try and find a name for you to reach out to regarding the return. Based on the symptoms being seen, I believe this is the best option if we cannot reproduce the issue from our end.

Cedric Milleret said:
So, today I tried to move forward alone. I dismantled a BMU, so some damage to separate the cells in order to recover the PCB.

I analyzed the PCB, no HW problem, all components are in good condition (diodes, transistors...).
I connected the PCB to a laboratory power supply (16V) by simulating the 4 cells by a 1k divider bridge (4x 1k in series).
- BAT = ~15.5V
- VCC = ~15.7V
- PBI = ~15.2V
- PACK : switch to BAT to wakeup
- VC1 to VC4 : ~4V each
- no conduction of external balancing
- TS1 to TS4, PTC : no voltage
- chip is connected only to SMbus to EV2400
- SCL/SDA verified at scope (idle at 3.3, clock and data send by PC)
- no communication (still the same observation)
- the power supply provides a current of 20mA. Knowing that the divider bridge consumes 4mA, this would mean that the chip consumes 16mA (I removed other circuits of the bord by disconnecting net supply).
According to the documentation, the chip consumes in "normal" <1mA, so obviously there is something abnormal.

So just to confirm: removing the cells would have put the device into a fully non-powered state, and it was then restarted using a PACK-BAT switch and connected to bqStudio. In most normal cases, this would have restarted the SMBus engine and the device should have been able to communicate. This, along with the higher consumption current, makes me believe there could be an internal problem with the physical device, which could be looked into with a return.

Regards,

Anthony

0 Cedric Milleret 7 months ago in reply to Anthony

Hello,

1) OK. So .gg file programming can corrupt something. Since there are already known issues, I probably encountered a new case. I am waiting for your analysis.

During the development phase, the srec file is not usable, so it will only be possible for "full" production. Even at the current stage, it is the gg file that I need. Especially since the question of calibration, if it is in question, is subsequent to the use of the srec file.

OK, I am waiting for FAE contact, thanks.

For the "return" you would therefore know how to mount the chip (which I would unsolder) on one of your cards and communicate with it to analyze the memory content?

2) what about it ?

3) For my part, I systematically crashed my cards with the manipulations, so you should be able to recreate the same problem as me

4) what about it ?

5) yes that's exactly it. A complete power down during disassembly, followed by a methodical power up with the lab tools, and a failed connection with bqStudio.

For your information, next week I will have difficulty answering you because I will not be in the office. I will nevertheless try to follow the ticket but I will not have the material available if manipulations are necessary.

Regards

Cédric

0 Anthony 7 months ago in reply to Cedric Milleret

TI__Mastermind 28470 points

Hi Cedric,

Cedric Milleret said:
During the development phase, the srec file is not usable, so it will only be possible for "full" production. Even at the current stage, it is the gg file that I need. Especially since the question of calibration, if it is in question, is subsequent to the use of the srec file.

OK, I am waiting for FAE contact, thanks.

For the "return" you would therefore know how to mount the chip (which I would unsolder) on one of your cards and communicate with it to analyze the memory content?

Understood, the .srec will contain calibration structures if pulled post calibration. Regarding the FAE contact, I have connected with the France team and should have a name for you shortly. They also want to confirm whether this was purchased through the TI store or through a vendor. When we do customer returns, there are multiple steps; the first would be to validate the issue using a socket board (which we can directly place your device into), then other tests would be conducted on the device to check the internal functionality.

For the production side, I believe chips are received in ROM mode then programmed with the necessary firmware. I can look into this for more info.

Currently, if the SMBus engine is down, I believe there is no way to communicate with the device and pull it from this state, and that there is something internally occurring here.

Regards,

Anthony

0 Cedric Milleret 7 months ago in reply to Anthony

Intellectual 260 points

Hello,

Thanks for feedback.

I will therefore have no choice but to replace all the chips, but as I said, what will I be able to do next? I will certainly fall back into the same problem, namely blocking the execution with a buggy setting.
a) identifying the problem is therefore crucial to not reproduce these effects, and so that I can continue working.
b) Analysis of the crash, which is independent and secondary for me, will then be the way to correct the FW so that it cannot crash with a bad setting

For the purchase of the chips, I do not have the answer immediately, but if it is not on the store, it is certainly digikey or mouser.

So I am waiting for France support

Regards

Cédric

0 Anthony 7 months ago in reply to Cedric Milleret

TI__Mastermind 28470 points

Hi Cedric,

Understood. If the parts were purchased through a distributor, then I believe you need to reach out to them to complete a customer return form. If they were purchased through TI.com, then we would be able to take care of it.

Regards,

Anthony

0 Cedric Milleret 7 months ago in reply to Anthony

Intellectual 260 points

Hello,
Here we are completely moving away from the technical problem.
No matter how you buy the component, it changes absolutely nothing to the problem. I don't understand why we are talking about this.
Can we go back to topics a) and b)?
regards
CM

0 Anthony 7 months ago in reply to Cedric Milleret

TI__Mastermind 28470 points

Hi Cedric,

The reason I ask about this information is to get an idea of how a customer return should be processed. Based on this situation, programming a single parameter or a .gg file, whether the gauge is in firmware mode or ROM mode, should in no situation cause the SMBus engine of the device to crash. Regarding the calibration step, this could maybe cause a protection or permanent fail to trigger and turn the FETs off if the thresholds are configured below the 2000mA being applied at the time; however, the gauge could still be communicated with at that point.

Regards,

Anthony

0 Cedric Milleret 7 months ago in reply to Anthony

Intellectual 260 points

Hello,
The customer feedback processing mode is a detail for me, and this only concerns topic b).
Topic b) is the second analysis step that would allow the FW to be hardened to correct the bug (in the long term for you for a release), but if you reproduce the problem, no need for my chip since you will have the same thing with yours.

At this point, it is topic a) that needs to be solved. Given that I easily crashed my cards with the files and the calibration, there is no reason why you should not have the same thing. To do this, you must reproduce my steps, the details of which I provided you.

At which step did you stop, and would you like further details?

I will soon receive new chips to resurrect a card, so I could redo a crash with detailed steps if necessary, but the ones I had already given seem to me to be already well detailed.

Regards,

Cédric

0 Cedric Milleret 7 months ago in reply to Cedric Milleret

Intellectual 260 points

Hello,

Today, I received a new chip from Mouser.

I replaced the chip on a card, powered it up, and connected the EV2400.

Communication with BQ studio works.

So just replacing the component was enough, which shows that the card has no problem.
So now that the card is usable again, what should I do? What did you find that would help me avoid reproducing the same crash?

I haven't programmed anything yet, (neither FW nor parameters), it's in the "factory" state

Regards,

Cédric

0 Anthony 7 months ago in reply to Cedric Milleret

TI__Mastermind 28470 points

Hi Cedric,

Cedric Milleret said:
On the second card,
1) I upgraded the FW to version v5.05 (the latest from last month)
2) I loaded the chemistry
3) I loaded my parameter file
4) I did the clear lifetime
5) export the parameter file
6) perform the scaling of parameters (/2 of the mA and cW to be able to use a double current up to 65000mA instead of 32500mA)
7) import the modified parameter file
8) current calibration (2000mA real for 1000mA in bq studio)
the card was not connected to anything, neither charger nor load
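
For readers following steps 6 to 8, the scaling idea can be sketched as below. This is only an illustration under assumptions: the factor of 2 and the scaled units (mA and cW) come from the post, while the row layout `(name, value, unit)` and the three example rows are hypothetical and do not reflect the real .gg file format or real parameter names.

```python
# Sketch of a x2 current scaling: the gauge is told half of the real current.
SCALE = 2                    # real current / current seen by the gauge (from the post)
SCALED_UNITS = {"mA", "cW"}  # units the post says were divided by 2


def scale_parameters(rows, scale=SCALE):
    """Return a copy of rows with current/power values divided by scale.

    rows is a list of (name, value, unit) tuples; this layout is hypothetical.
    """
    scaled = []
    for name, value, unit in rows:
        if unit in SCALED_UNITS:
            value = round(value / scale)
        scaled.append((name, value, unit))
    return scaled


def real_current_ma(reported_ma, scale=SCALE):
    """Host-side conversion of a current reported by the scaled gauge."""
    return reported_ma * scale


# Hypothetical rows, for illustration only.
example = [
    ("Hypothetical current threshold", 60000, "mA"),
    ("Hypothetical power threshold", 9000, "cW"),
    ("Hypothetical voltage threshold", 16800, "mV"),
]
scaled = scale_parameters(example)
for row in scaled:
    print(row)
```

With this convention the master firmware has to multiply every current it reads back by the same factor, which `real_current_ma` illustrates; the calibration in step 8 (2000mA real entered as 1000mA) follows the same ratio.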

I have restarted the testing with a completely new chip, programmed the R5 firmware, hand inputted each value from the test_fail.gg file, exported the .gg file, and restarted the process again while programming this .gg file directly. I did not see any issue throughout, and will scale the parameters and do calibration next. To confirm, are the scaled parameters the ones in the very first .gg file sent, r5_250102_scaled.gg.csv? To preserve integrity, what was the chemID used here?

Regards,

Anthony

0 Cedric Milleret 7 months ago in reply to Anthony

Intellectual 260 points

Hello,

Today, I also carried out similar tests on my side.
With a "clean" chip, by loading the FW of TI v5.05, when I load the "non-scaled" file, then "scaled", I have a Bq Studio error, but I could not go further until the crash.

The procedure you describe (yours, not mine from the beginning) follows the tests we did after the first posts during the investigations, with a still different methodology. You had then confirmed to me that it was a known problem, not necessarily in this form, but intimately linked, which is a clue to a more buried bug.

For the initial ticket (that you mention), I also tried to reproduce it on the new chip. I did not succeed in reproducing, I do not know why, but the initial conditions are very different, and I have the impression that everything is in the details. Indeed, here I am loading a "clean" chip while in December I had done a succession of updates, a unitary parameterization (1 by 1), imports/exports, commands sent (LT reset, reset, shutdown...). There must most certainly be a key step that I must be missing, it is very frustrating. I continue to investigate.

And indeed, the scaled parameters are indeed in the mentioned file.

For the chemistry, in December and January, I had just started with the default chemistry of the chip/FW, then loaded another one (0x1241, but I am not sure anymore, and I cannot query the chip to check because it does not respond to commands).

The problem seems really deep and vicious, I suspect a very specific sequence of actions, but I cannot put myself in the exact context for the moment.

Regards

Cédric

0 Anthony 7 months ago in reply to Cedric Milleret

TI__Mastermind 28470 points

Hi Cedric,

Cedric Milleret said:
For the initial ticket (that you mention), I also tried to reproduce it on the new chip. I did not succeed in reproducing, I do not know why, but the initial conditions are very different, and I have the impression that everything is in the details. Indeed, here I am loading a "clean" chip while in December I had done a succession of updates, a unitary parameterization (1 by 1), imports/exports, commands sent (LT reset, reset, shutdown...). There must most certainly be a key step that I must be missing, it is very frustrating. I continue to investigate.

Understood, thank you for confirming the chemID number. If I begin with a clean chip as well, I am unable to reproduce anything that resembles the issue that was previously seen. I will also continue with different actions to see if there is any difference in the process order that might make this appear.

Regards,

Anthony

0 Cedric Milleret 7 months ago in reply to Anthony

Intellectual 260 points

Hello,

Since the last message, I have continued my investigations.
I have dismantled my 2 other prototypes trying to imagine other ways of testing in order to cross-reference the methods given that I do not know what I am looking for exactly.
So I dismantle as I go along, and I regularly test the SMBus communication with the PC.
# on one card, I made the same observation as the first,
    - powering down (total --> cells dismantled), then powering up (with lab power supply / divider bridge) did not restore operation
    - "simply" replacing the chip with a new one was enough to restart the card, so no modification or repair of the HW was necessary.
    - The dismantled chip is stuck.

# on the other card, there it is more interesting
    - the card was stuck, no SMbus COM, so same symptom.
    - As I mentioned in the initial report (see start of the ticket) at the beginning of the month, I was unable to disconnect the cells (without causing damage)
    - So, VC1 to VC4 remained connected, but I had managed to shunt the PACK, VCC and BAT signals to 0 (at the same time). But the card remained silent...
    - there, I disassembled in stages (without any real specific goal, but I look at everything that comes to mind)
    - I ended up disconnecting the cells one by one from (VC4, VC1, VC3, VC2, in this order because of the mechanical assembly)
    - I had to disconnect the 4 cells for the SMbus communication to come back, I didn't believe it!

So
- I have 2 non-functional chips, replaced
- I have a chip that restarted by completely disconnecting all the cells

On 2 chips, I certainly did a combination of actions that deeply crashed the FW, but not on the 3rd which was able to restart. At this stage, I am unable to identify the details of actions that I was able to do between these 3 cards, which differentiate them.

Concerning the Chem-ID, as said previously, I had a doubt. On this card (read thanks to the command), it is the 0x1189 which corresponds to a cell of 8460mAh, in comparison to the ref 0x1241 (4000mAh). This is because I was doing some tests considering that I use the x2 calibration, and I was trying to see the difference of having an initial chemistry at the real value or scaled (this remains an open question, but which is only of interest after having solved the crash problem)

So I emphasize 2 points:
1) From this, despite having forced BAT, PACK and VCC to zero, it seems to me a fact that the permanent connection of the VCx prevents the reboot of the gauge, despite what was said.
I looked at the datasheet again, the VCx channels are connected to an ADC, but we do not see a clamp diode, so how is the internal circuitry really?

From my findings, the VCx inputs maintain an internal power supply to the gauge that prevents it from being powered down, and therefore from restarting

2) I see 2 failure outcomes: a complete and partial crash
On the card that restarted, I can read the parameters and export the .srec, which I did. So now we have something to analyze, even if this is clearly not the worst case (the total crash we are looking for).
This can however be very useful to understand what happened, and can be extrapolated what could have happened on the other 2 gauges.
I kept the 2 crashed gauges available if it is possible to analyze them.

What do you think about the next step, and how can I send you the srec file (not authorized to download)?

Regards

Cédric

params_survival_card.gg.csv

0 Anthony 7 months ago in reply to Cedric Milleret

TI__Mastermind 28470 points

Hi Cedric,

I reached out to our hardware team to check if there is any internal relationship between the VC1-VC4 pins of the device and the power pins, to which they confirmed there is no path internally between the two. Unless there is an external path between these pins, then it should not interact. However, the PBI pin of the device would need to be grounded at this time as well.

Regarding the second topic, I will send you a friend request to send the .srec in the private messenger, where it should be allowed to send. For the .gg file, can you confirm whether the Lifetime functionality was enabled? If it was, the gauge will track certain values in the data flash of the gauge, which are good for situations like this where the issue cannot be logged.

Regards,
Anthony

0 Cedric Milleret 7 months ago in reply to Anthony

Intellectual 260 points

Hello,

Concerning VC1 to VC4, I do not understand why their disconnections allowed the reboot since they were the only signals still connected for weeks, the others being powered off.
I do not have any external paths, the cells are connected respectively to the VCx via external balancing such as the app.note SLUA420
The PBI pin was not accessible before disassembly (short trace near the chip, very tight), so I could not force it to zero, but given that I waited a very long time (tested again after days), the capacitor voltage would have ended up decaying to zero.
To follow up on your PBI remark, although there are a priori no "direct" internal paths between the VCx pins and the power supplies, could there nevertheless be "parasitic" internal paths that could maintain a residual voltage on PBI with very little current?

I compared my schematic with
- datasheet schematic
- App note SLUA420 schematic (external balancing)
- SLUUAV7B schematic (EVM kit)
I can't find any difference except the PBI capacitor at 10µF instead of 2.2µF (so better). (In the SLUA420, there is an additional diode in series with VCC, but I haven't seen anything about its use...).

At the same time, I attached the srec file to the private messaging, as well as the VCx / power supply schematic.
I had activated LifeTime, but given that only a few seconds passed between the FW update, the loading of parameters, the loading of chemistry, the sending of commands and the crash, will the log have had time to record something?

Regards,

Cédric

0 Anthony 7 months ago in reply to Cedric Milleret

TI__Mastermind 28470 points

Hi Cedric,

I was not able to receive the .srec sent in the private messenger; I am getting the "plug in not supported" issue. Would it be possible to put it in a zip file and send it? There should be no problem with sending a zip file.

When the .srec is received, I will see if this can be recreated and if anything can be seen from the lifetimes at this moment.

Regards,

Anthony

0 Cedric Milleret 7 months ago in reply to Anthony

Intellectual 260 points

Hello,
OK, I didn't think that the web page would want to play a video instead of offering the download link. Too bad, I wasted time for nothing.
I have sent you the file again in private messaging in zip and txt format

Regarding the schematic, have you been able to see my remarks?
- PBI capacitor
- added diode on VCC (on app note)
- parasitic path (indirect) from VCx

Regards

Cédric

0 Anthony 7 months ago in reply to Cedric Milleret

TI__Mastermind 28470 points

Hi Cedric,

Thank you for the zip file, we were able to receive the .srec.

Cedric Milleret said:
FW update, the loading of parameters, the loading of chemistry, the sending of commands and the crash,

Just to confirm, this is the set of actions being completed that can reproduce the crash from your end? While following this we have not been able to reproduce it, however while looking into the parameters from the srec, it seems like there have been 6 full resets that have occurred prior to the .srec being pulled from the device. When the device is in application, how quickly is information being read from the device?

Regarding the PBI cap, the reason this was brought up by our hardware team is that PBI is able to hold charge for a very long time.

Information about the diode on VCC can be found below:

I do not believe there is a parasitic path that could cause this but can check with our team.

Regards,

Anthony

0 Cedric Milleret 7 months ago in reply to Anthony

Intellectual 260 points

Hello,

a) SREC file
- Indeed, the crash seems to be a combination of several conditions, and that is why its implementation is not obvious to you. I myself have had variants, and the difficulty is to identify the step or detail that is important.
As said in the ticket, I used the cards for several weeks without problems, with many independent iterations of parameter development (unitary), sending commands... When at the end I wanted to do everything at once to put myself in production conditions, that is where it crashed (parameter file at once, chemistry loading, current scaling, close sending of commands, all with BQ Studio...), without being able to tell you exactly the precise sequences.

- There must certainly be something in the parameter/chemistry file that is causing problems, or an incompatible command (calibration/scaling, Lifetime, reset, shutdown...) with the configuration...

- Regarding resets, it is possible, the behavior before the total shutdown was chaotic, I seem to have noticed reboots

- Regarding the frequency of SMBus requests, when the system boots, my µC quickly chains, but always one transaction after the previous one is finished (like BQ Studio), then when the system is active, the refresh is at most of the order of a second. This operation has existed for a long time, and I use it on other gauges developed for several years.

b) PBI
What do you mean by "very long"? For my part, I checked the system during/after several days... and until the last intervention

c) PACK , VCC and diode
Reading your excerpt, this caught my attention because it did not correspond to what I had. By analyzing more closely, you are referring to the oldest of the datasheets, and by comparing with the latest, we see differences:
- I respect the 10k on PACK, and 100R on VCC
- there would be an internal diode on the VCC input which does not require an external diode

What about the diode: is it actually there?

d) Parasitic path
In any case, I noticed that after several weeks where only the VCx were still connected, the simple fact of disconnecting them was enough to be able to then turn the card back on, so an electronic phenomenon internal to the chip must have happened.

Regards,

Cédric

0 Cedric Milleret 7 months ago in reply to Cedric Milleret

Intellectual 260 points

Hello,
Were you able to clarify the last points raised?

Regards

Cédric

0 Anthony 7 months ago in reply to Cedric Milleret

TI__Mastermind 28470 points

Hi Cedric,

Cedric Milleret said:
- Indeed, the crash seems to be a combination of several conditions, and that is why its implementation is not obvious to you. I myself have had variants, and the difficulty is to identify the step or detail that is important.
As said in the ticket, I used the cards for several weeks without problems, with many independent iterations of parameter development (unitary), sending commands... When at the end I wanted to do everything at once to put myself in production conditions, that is where it crashed (parameter file at once, chemistry loading, current scaling, close sending of commands, all with BQ Studio...), without being able to tell you exactly the precise sequences.

- There must certainly be something in the parameter/chemistry file that is causing problems, or an incompatible command (calibration/scaling, Lifetime, reset, shutdown...) with the configuration...

- Regarding resets, it is possible, the behavior before the total shutdown was chaotic, I seem to have noticed reboots

- Regarding the frequency of SMBus requests, when the system boots, my µC quickly chains, but always one transaction after the previous one is finished (like BQ Studio), then when the system is active, the refresh is at most of the order of a second. This operation has existed for a long time, and I use it on other gauges developed for several years.

I agree, this is difficult for us to reproduce since none of these functionalities should ever cause the gauge to crash, nor have we seen a combination of them causing a crash. As we discussed prior, this is extremely difficult to reproduce as well since we do not know the exact chain of actions as well as the time in between that caused this to happen on your end. Regarding the resets, was there any consistency in the time or action that caused them to occur or were they sporadic? Thank you for confirming the frequency as well.

Cedric Milleret said:
What do you mean by "very long"? For my part, I checked the system during/after several days... and until the last intervention

I am not sure about the exact timing, however since there is typically a 2.2uF capacitor attached, it would need to dissipate everything from that component while the other power inputs are being grounded.

Cedric Milleret said:
c) PACK , VCC and diode
Reading your excerpt, this caught my attention because it did not correspond to what I had. By analyzing more closely, you are referring to the oldest of the datasheets, and by comparing with the latest, we see differences:
- I respect the 10k on PACK, and 100R on VCC
- there would be an internal diode on the VCC input which does not require an external diode

I believe you are correct here regarding the internal diode, however I would need to confirm with our hardware team that this is the placement of it.

Cedric Milleret said:
In any case, I noticed that after several weeks where only the VCx were still connected, the simple fact of disconnecting them was enough to be able to then turn the card back on, so an electronic phenomenon internal to the chip must have happened.

I do not believe there should be any interaction between the powering system of the device and the VCx inputs. Confirming whether this is being caused by a parasitic path would require more extensive testing.

Regards,

Anthony

0 Cedric Milleret 7 months ago in reply to Anthony

Intellectual 260 points

Hello,

a) Resets
Concerning the resets, they happened just after having done all the sequences (FW update, parameters, chemistry, scaling, commands). From memory, they happened very quickly (all in less than a few seconds).
I lost control at the end of the sequence, I could only see the killing.
Aside question: can a command received at the gauge boot be misinterpreted if the gauge has not finished its internal boot?

b) PBI capacitor timing reserve
Yes of course, it is a reserve, but what order of magnitude are we talking about? (minutes, hours, days?) I suspect that it is linked to standby consumption, but not knowing the order of magnitude, nA, µA, you are the only ones who can answer. Furthermore, if we have an internal leakage current of the order of the PBI consumption, the voltage of the PBI capacitor can be maintained and never return to zero.
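
To put rough orders of magnitude on this question, here is a minimal sketch using the constant-current approximation t = C·ΔV / I. The capacitor values (2.2µF and 10µF) and the ~15.2V on PBI come from this thread; the three discharge currents are pure assumptions, since the real standby consumption on PBI is exactly what is unknown here.

```python
# Constant-current discharge estimate for the PBI capacitor: t = C * dV / I.
def discharge_time_s(capacitance_f, delta_v, current_a):
    """Time in seconds to discharge a capacitor by delta_v at constant current."""
    return capacitance_f * delta_v / current_a


def humanize(seconds):
    """Format a duration with a readable unit."""
    for unit, size in (("days", 86400.0), ("hours", 3600.0), ("minutes", 60.0)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} s"


PBI_START_V = 15.2                   # PBI voltage measured earlier in the thread
CAPACITORS_F = (2.2e-6, 10e-6)       # typical value vs. the value on this board
ASSUMED_CURRENTS_A = (1e-9, 1e-6, 1e-3)  # hypothetical drain currents

for cap in CAPACITORS_F:
    for current in ASSUMED_CURRENTS_A:
        t = discharge_time_s(cap, PBI_START_V, current)
        print(f"C={cap * 1e6:.1f} uF, I={current:.0e} A -> {humanize(t)}")
```

This model assumes nothing recharges the capacitor. If a parasitic path supplies a current of the same order as the drain, as suspected above, the voltage would never reach zero and the estimate no longer applies.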

c) PACK, VCC and diode
In view of this element and the previous points discussed (such as leakage current, PBI capacitor), it seems to me that there are several points in fact that would deserve to be clarified by the HW team because not everything is clear, and not everything is sufficiently detailed. As an aside, I suspect that leakage currents and the internal design of the chip can have an impact on the proper functioning of the reset.

d) Again, this remark is the responsibility of the HW team. From my experience, an input, here ADC inputs coupled to other types of inputs (therefore certainly quite complex interactions such as VC4 having multiple paths, with multiplexers) certainly have a high impedance, but we always find clamps somewhere with the rest of the circuitry. In most component docs, the clamp docs are indicated. When it is not clearly drawn on the block diagrams, we can assume it thanks to the indications of the "maximum ratings" when for example we have +/-0.3V indicated. Here, the upper limit of the VCx is conditional on both the VCx between them, and each VCx in relation to the power supply because VSS+30 corresponds to VCC (via BAT...).

Could the "VSS+30" have been "VCC" or "VCC+0.3V" ? in this case it changes everything when the power supply is < 30V....

z) And what's next?
Now, concretely, what should I do? I can't continue to wait, the project is blocked, I don't know how to delay anymore. I took everything apart, repaired what was damaged, restored the chips, I'm ready to reassemble everything to continue.
I would like to help you more by doing the manipulations you would ask me (if it is feasible), but in the state I do not see what else to do, I have already done so much research.
- Will the HW team be able to decide quickly? (and should I consider a schema evolution?)
- What to do for the FW and its configuration, the scaling procedure?
- What are the recommendations / restrictions for sending commands?

Regards

Cédric

0 Cedric Milleret 6 months ago in reply to Cedric Milleret

Intellectual 260 points

Hello,

These last few days I have been working on restoring my BMUs.
I have repaired / reassembled the cards, and re-prepared the golden file from:
- the latest FW 5.05
- an adequate parameter set
- the chemistry
- the default calibration
I did not do any scaling, I lowered the power of my load already to have a simpler case to manage, namely the "normal" 1:1 scale

I noticed something very interesting: I managed to crash a gauge (one of the three), that is to say that it was crashed (like in December), and it did not restart!
I had to disconnect the power signals to make it reboot. It works again. Removing the supply allowed a restart.

Just before assembling the cells, the gauge worked because I had tested the communication (with SW BqStudio) by simulating the 4x cells with a divider bridge on the laboratory power supply.
It was after soldering the cells that the gauge stopped working.

There is still a difference compared to December: here the failure was immediate, whereas last time the card had turned off (permanently) after operations had been carried out with the software.

I solder the cell terminals with a spot welder (specific to cells, so a real adapted machine). It is low voltage / high current welding (the classic welder for soldering nickel pads)

So I suspect that the gauge is sensitive to an EMC effect, such as an electric / magnetic field, what do I know ... and that its internal electronics are latching. This comes back to my previous posts where we wondered about the clamps of the inputs. This observation seems to me very linked to my suspicions ...

The Hardware (therefore the clamps) therefore seems very relevant to me...

Regards,

Cédric

0 Cedric Milleret 6 months ago in reply to Cedric Milleret

Intellectual 260 points

Hello,

Today I continued yesterday's work and noticed that a BMU was down again.
The symptoms allowed me to make a significant advance in the analysis.

Here is the observation:
- the power output was at 0, so the FETs were OFF
- the connection with BQStudio worked, fortunately because it allowed to understand what was happening
- the actions (commands) on the FETs have no effect; the FETs remain OFF even though their reported status is ON.
- the log (Power, PF, BBR) shows partial reset events, watchdog, T°, and data flash corruption. (I saved the configuration and srec files if necessary)

- impossible to re-program FW, immediate error

Programming is not possible, but I can read (very strange...). In the registers tab, I noticed abnormal values:
- TS3 at -53° while the probe is OK, real T° is about 24°
- BAT pin at 19V while all the cells are at 16V
- PACK pin at 4.7V while the FETs are OFF
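
The three abnormal readings above are the kind of thing a master could flag with a simple plausibility check before trusting the gauge. The sketch below is only an illustration: the snapshot values for TS3, BAT, and PACK come from this post, while the other probe temperatures, the tolerances, the temperature window, and the data structure are assumptions, and the PACK check further assumes that no charger is connected.

```python
from dataclasses import dataclass, field


@dataclass
class GaugeSnapshot:
    cell_mv: list        # per-cell voltages in mV
    bat_mv: int          # BAT pin reading in mV
    pack_mv: int         # PACK pin reading in mV
    output_live: bool    # True if the power output is actually live
    temps_c: dict = field(default_factory=dict)  # probe name -> degrees C


def find_anomalies(snap, bat_tol_mv=1000, pack_off_max_mv=1000,
                   temp_range_c=(-40.0, 85.0)):
    """Return a list of human-readable inconsistencies (thresholds are assumed)."""
    issues = []
    stack_mv = sum(snap.cell_mv)
    if abs(snap.bat_mv - stack_mv) > bat_tol_mv:
        issues.append(f"BAT {snap.bat_mv} mV vs cell sum {stack_mv} mV")
    if not snap.output_live and snap.pack_mv > pack_off_max_mv:
        issues.append(f"PACK {snap.pack_mv} mV with output off")
    low, high = temp_range_c
    for name in sorted(snap.temps_c):
        temp = snap.temps_c[name]
        if not low <= temp <= high:
            issues.append(f"{name} {temp} C outside {low}..{high} C")
    return issues


# Values from the post for TS3, BAT, PACK; the rest is assumed for illustration.
snapshot = GaugeSnapshot(
    cell_mv=[4000, 4000, 4000, 4000],
    bat_mv=19000,
    pack_mv=4700,
    output_live=False,
    temps_c={"TS1": 24.0, "TS2": 24.0, "TS3": -53.0, "TS4": 24.0},
)
issues = find_anomalies(snapshot)
for issue in issues:
    print(issue)
```

A check like this does not explain the fault, but it would let the master refuse to act on readings that contradict each other.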

It was then that by manipulating the NTC TS3 probe (probe with legs of a few cm folded back on the cell) that the 3 values returned to normal! The controls became functional again and the FETs were activated.

From an internal electronic design point of view of the gauge, there is therefore a link (clamp or other path) between several inputs, and the entire operation is disrupted.

I therefore investigated more precisely as to the origin of this problem. Unlike the other 3 NTC probes at 0V, the TS3 probe fluctuated at about 4.0V - 4.2V (which did not correspond to a cell voltage of 3.9V if you were wondering).
I use pouch cells, and the external envelope (although insulating) has a very low leakage current on the glued edge (protruding edge of the pouch junction) when the probe leg rubs against it, incredible!

When I put a voltmeter on the envelope edge, I do not read 4V; I see a variable voltage of 1V to 2V (depending on how the probe touches), so there is a leakage current. The 4V is probably created through an internal gauge path.
This very low leakage current is enough to fault the gauge.

With this observation, I wonder if this had not been the case previously, and depending on the probe(s) concerned by the leakage current, either it corrupted the gauge, or just crashed it.
In any case, the inputs are not as "floating" and independent as the datasheet mentions, there is indeed an internal electronic link between all the channels.

This HW hazard causes both other HW effects (other corrupted inputs), but also software effects because I could read but not write!

So, I can move forward again, I isolated the pins!
Could you forward this (and yesterday's remarks) to the HW BE and the workaround to be planned, thanks ?

As an aside, I looked for any recommendations to protect TSx inputs (eg ESD), but I found nothing. Do you have any information on this?

I feel that this will unblock the analysis now because it opens a tangible and reproducible analysis.

Regards,

Cédric

0 Anthony 6 months ago in reply to Cedric Milleret

TI__Mastermind 28470 points

Hi Cedric,

Sorry for the delay, and thank you for the details from your debug. I'll bring these details to our hardware team.

Cedric Milleret said:
I therefore investigated more precisely as to the origin of this problem. Unlike the other 3 NTC probes at 0V, the TS3 probe fluctuated at about 4.0V - 4.2V (which did not correspond to a cell voltage of 3.9V if you were wondering).
I use pouch cells, and the external envelope (although insulating) has a very low leakage current on the glued edge (protruding edge of the pouch junction) when the probe leg rubs against it, incredible!

On the other three NTC probes where 0V is being seen, is there any pulsing? When the gauge is in normal mode, there should be about a pulse every 1s for measurements. Just for my own understanding, the TS3 pin is being raised to ~4V by an external connection that could be backfeeding this back into the pin? I will check with our hw team to see what kind of state this situation can put the gauge in.

Also, I'll check if there are any ESD measures that can be put in place here.

Regards,

Anthony

0 Cedric Milleret 6 months ago in reply to Anthony

Intellectual 260 points

Hello,

Over the past two weeks, several questions have been raised, and the latest information suggests that parasitic HW re-powering paths exist, although the documentation does not mention them.
Reminder of unclear points:
- VCx input clamp (path scheme?)
- TSx input clamp (path scheme?)
- PBI capacitance, holding current (not returning to zero), recommended interval ?
- VCC internal diode ?
- paths between PACK, BAT, VCC, TSx, VCx ?

After digging deeper, I found that the TSx inputs are referenced to the internal 1.8V regulator. This "low" voltage suggests to me that the regulator may also supply the core, and therefore probably the flash memory. Given that the BBR has recorded flash corruptions, I wonder whether TSx input faults can compromise the integrity of the FW through its memory, by way of an excessive voltage carried over internal parasitic paths (core overvoltage, register corruption...).

If all ADC channels are referenced to Vreg (1.8V), then, given that an abnormal voltage on TS3 had an impact on PACK and BAT, internal paths must exist.

When TS3 was faulty, I did not pay attention to whether there was a pulsed signal on the other 3 NTC probes, so I cannot answer without recreating the failure. On the other hand, in BqStudio the other 3 measurements were valid (I am sure of this because I checked them; that is how I started to identify the problem), so we can assume that the measurements were being made.

The voltage observed on TS3 was about 4.0V-4.2V (and, by the way, it fluctuated with a period close to one second). I think it results from a combination of the gauge's internal paths and the leakage current from contact with the pouch edge. A direct measurement on the edge of the cell envelope gives a leakage voltage of around one volt that is difficult to measure, because the input impedance of the voltmeter alone is enough to "absorb" the leakage current. So the current seems to be very low indeed, yet sufficient to disturb the gauge.

So yes, I would like to know how to protect the inputs. As the NTC is measured directly, adding any resistance or filter could corrupt the measurements. An ESD diode would be ineffective without other components, and a protection voltage lower than Vreg would be required... Intuitively, I do not see a solution for the moment without knowing how the gauge works internally.

Regards,

Cédric

0 Anthony 6 months ago in reply to Cedric Milleret

TI__Mastermind 28470 points

Hi Cedric,

Cedric Milleret said:
- VCx input clamp (path scheme?)
- TSx input clamp (path scheme?)
- PBI capacitance, holding current (not returning to zero), recommended interval ?
- VCC internal diode ?
- paths between PACK, BAT, VCC, TSx, VCx ?

I am unsure if I can share the exact path schemes for the internal functionality of the gauge; however, I will reach out to our hardware team to see what is available.

Regarding the TS pin, I have seen issues when other attachments cause backfeeding into the TS pins. From a previous conversation with our hardware team, the TS pins have internal diodes to the internal 1.8V regulator; if an external source backfeeds into them, it can put the gauge into a bizarre state.
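
The backfeed condition described here can be put in rough numbers. The sketch below is only an illustration: the 1.8V regulator level and the 4.0V-4.2V reading on TS3 come from this thread, while the 0.6V diode forward drop and the function name `ts_pin_backfeeds` are assumptions and not datasheet values.

```python
# Rough check: does a voltage on a TS pin forward-bias the internal diode
# to the 1.8V regulator? Illustrative only, not a model of the real silicon.

V_REG = 1.8          # internal regulator voltage, as mentioned in the thread
V_DIODE_DROP = 0.6   # ASSUMPTION: generic silicon forward drop, not from the datasheet


def ts_pin_backfeeds(v_ts, v_reg=V_REG, v_f=V_DIODE_DROP):
    """Return True if v_ts is high enough to conduct into the regulator rail."""
    return v_ts > v_reg + v_f


# 0V is what the healthy probes showed; 4.0V and 4.2V are the TS3 readings.
for v in (0.0, 1.2, 4.0, 4.2):
    state = "conducts into Vreg" if ts_pin_backfeeds(v) else "diode stays off"
    print(f"TS pin at {v:.1f}V: {state}")
```

Under these assumptions, the 4.0V-4.2V seen on TS3 is well above the conduction threshold, which is consistent with the backfeeding explanation, while the other probes stay below it.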

Cedric Milleret said:
And so yes I would like to know how to protect the inputs. As the measurement of the NTC is direct, any introduction of resistance or filter could corrupt the measurements. An ESD diode would be ineffective without other components, and a protection voltage lower than Vreg would be required..

I agree with your statement here. Is it possible to use anything to block the potential connection point between the cell and the thermistor?

Regards,

Anthony

0 Cedric Milleret 6 months ago in reply to Anthony

Intellectual 260 points

Hello,
Thanks for your feedback.

First of all, the good news is that the hardware team agrees that what I observed and described does exist, and that it can cause the problems I have noticed! Identifying the cause is already very important: it tells us what happened and where to intervene.
From what I have observed with my tests:
- when the core is affected by the leakage current, depending on the level of feedback, the core can be quickly and permanently crashed,
- at a lower level (my hypothesis), and depending on the SMBus command received or on a periodic internal log write, the flash memory can be corrupted (data area, and also code?), resulting in a delayed fault.
In the end, the context was very complex and the effects themselves were difficult to interpret: a "rather basic" but hard-to-identify hardware cause generated both hardware and software hazards on distant functionalities. On top of that, I was afraid of not being taken seriously, given how hard the problems were to reproduce, and of the issue sinking into oblivion.

So I have already insulated the TSx connections. This protects against direct leakage currents.
On the other hand, it does not protect against EMC disturbances. This concerns both the VCx and TSx channels.
For example, what capacitance can I put in parallel with the thermistors to absorb an EMC transient without degrading the measurement (time constant)? This is only an example to illustrate the principle, and a very simplistic one; I am looking for means of protection. Do you have more effective solutions to suggest?
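
To make the time-constant trade-off concrete, here is a minimal sketch. It assumes the thermistor is biased through an internal pull-up to Vreg with the capacitor in parallel with the NTC, so the capacitor sees the pull-up and the NTC in parallel. The 18k pull-up, the 10k NTC value and the 0.1% settling target are assumptions chosen for illustration; the real values must come from the datasheet and the chosen thermistor.

```python
import math

# ASSUMPTIONS for illustration only, to be replaced by datasheet values:
R_PULLUP_OHM = 18_000.0   # assumed internal pull-up on the TS pin
R_NTC_OHM = 10_000.0      # assumed NTC resistance at 25 degC


def parallel(r1, r2):
    """Equivalent resistance of two resistors in parallel."""
    return r1 * r2 / (r1 + r2)


def settling_time(r_pullup_ohm, r_ntc_ohm, c_farad, rel_error=0.001):
    """Time for the TS node to settle within rel_error of its final value.

    The capacitor charges through the pull-up in parallel with the NTC,
    so tau = (R_pullup || R_ntc) * C and t = tau * ln(1 / rel_error).
    """
    tau = parallel(r_pullup_ohm, r_ntc_ohm) * c_farad
    return tau * math.log(1.0 / rel_error)


for c in (1e-9, 10e-9, 100e-9):
    t = settling_time(R_PULLUP_OHM, R_NTC_OHM, c)
    print(f"C = {c * 1e9:6.1f} nF: settles to 0.1% in {t * 1e6:8.1f} us")
```

The settling time has to stay well below the duration of the measurement pulse applied to the TS pin, which is not stated in this thread. Note also that the NTC resistance rises at low temperature, so the worst case tends towards the pull-up value alone multiplied by C.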
Also, what about the still open question about PBI and the internal diode?

Even though I understand that you cannot reveal confidential design schematics, block diagrams are still a necessary basis for understanding and implementing countermeasures and protections, especially as I am hardening my schematic for the CAD review.

Kind regards,
CM

0 Cedric Milleret 6 months ago in reply to Cedric Milleret

Intellectual 260 points

Hello,

I haven't heard from you.
Were you able to get any details from the hardware team?

Regards,
