BQ34110: I2C timeout

Cédric St-Amand

Part Number: BQ34110
Other Parts Discussed in Thread: BQSTUDIO

Hi,

We have a rare occurrence of I2C timeouts, with the timeout set at 10 ms on a 100 kHz I2C bus. We can't find any limitation on the I2C in the datasheet. Is there any other document that puts constraints on the I2C, such as required delays between commands?

Note: Previously we had the timeout at 1 ms and we were getting that error from time to time. Since we raised it to 10 ms it has become very reliable, but not perfect.

over 6 years ago

0 Batt over 6 years ago

TI__Mastermind 42535 points

A df write can cause a nack or a timeout, but it really shouldn't affect it unless the comms are continuous.

I2C as is has infinite clock stretch.

0 Michael Desjardins over 6 years ago in reply to Batt

Intellectual 580 points

Hi,

Sometimes after the timeout the BQ34110 doesn't respond at all... What are the conditions, with a DF write, that can cause a NACK or a timeout? Could you tell us more about this?

Thank you.

0 Michael Desjardins over 6 years ago in reply to Michael Desjardins

Intellectual 580 points

Also, on the first post above, when we have the communication timeout, it's on reading values like Current, SOC which are not in the data flash...and the device doesn't respond anymore. We are constantly having I2C communication timeout (timeout set at 10 ms).

We are using the DF write on the production bench to write the parameters, so it's really important for us to know the conditions that can cause communication timeouts and NACKs in order to avoid them.

Could you give us details on the conditions required to avoid timeouts and NACKs?

Thank you.

0 Batt over 6 years ago in reply to Michael Desjardins

TI__Mastermind 42535 points

how fast are you scanning? you don't need to read regs more than once a sec. that can overload the comm bus and cause nacks. also don't clock the i2c really fast. most users set it to 100khz and it works.

0 Michael Desjardins over 6 years ago in reply to Batt

Intellectual 580 points

We are reading different registers back to back once in a while and, as said before, our clock is set at 100 kHz. The BQ34110 is the only device communicating on the bus.

Are there any limitations? When we read a register, we obviously wait for an answer (ACK) before reading the next. The documentation makes no mention of a delay required after reading one register and before reading the next...

0 Batt over 6 years ago in reply to Michael Desjardins

TI__Mastermind 42535 points

Yes, we have specified no delay but there is no need to scan all registers faster than once a second. A min delay of 5ms between transactions won't overload the gauge comms.

0 Cédric St-Amand over 6 years ago in reply to Batt

Prodigy 150 points

Hi,

In our application we have an immunity test mode that constantly polls the various functions of the product, including the BQ device, to confirm they keep working during the perturbation. Therefore I cannot put an artificially large delay between accesses to the BMS unless there is a clear device limitation. I understand that voltage, temperature and SOC don't change quickly in normal operation, but I need to guarantee that every access works.

What are the limitations of the BQ device for I2C access (reads and writes), and do they depend on the command? This issue is blocking the qualification process.

0 Batt over 6 years ago in reply to Cédric St-Amand

TI__Mastermind 42535 points

Do leave a gap of about 10-15ms between blocks of commands. Do not hammer the gauge. Comms are prioritized over ADC and other computations but not in all cases. The recommended scanning interval for registers is 1s or you can even do 0.25s but if you do it any faster and get comm errors we recommend that your mcu implement error correction and read regs multiple times to confirm values before committing or using them.
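
The "read regs multiple times to confirm values" advice could be sketched roughly as below. This is only an illustration under assumptions: `confirm_state_t` and `confirm_feed` are hypothetical names, not part of any TI or MCU vendor library, and the host is assumed to feed in the outcome of each register read itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Tracks the last good sample so a value is only accepted once two
 * consecutive successful reads agree. Hypothetical helper, not a TI API. */
typedef struct {
    bool     have_prev;
    uint16_t prev;
} confirm_state_t;

/* Feed the outcome of one register read. `ok` is false on NACK/timeout.
 * Returns true and stores the value in *confirmed once the current sample
 * matches the previous successful one. */
bool confirm_feed(confirm_state_t *st, bool ok, uint16_t sample,
                  uint16_t *confirmed)
{
    if (!ok) {
        st->have_prev = false;      /* comm error: discard the candidate */
        return false;
    }
    if (st->have_prev && st->prev == sample) {
        *confirmed = sample;
        return true;
    }
    st->prev = sample;              /* remember as the new candidate */
    st->have_prev = true;
    return false;
}
```

Exact equality suits slowly changing values such as SOC; for values that legitimately move between reads (current, for example) a tolerance band would probably be needed instead.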

0 Cédric St-Amand over 6 years ago in reply to Batt

Prodigy 150 points

We will add a gap, but what gap is safe under all circumstances, so that readings never time out and measurements are always good? You mentioned ADC vs. comms priority.

If the required gap differs depending on the command, I can plan for the worst case to be safe.

0 Cédric St-Amand over 6 years ago in reply to Cédric St-Amand

Prodigy 150 points

We have tried various delays between commands, paced by the MCU. 1 ms, 5 ms, 50 ms and 100 ms all ended up producing spurious errors after a few minutes. We ran a test at 200 ms: no error within an hour, but what is the guarantee?

Can you tell us what pace will always guarantee that the BMS behaves?

Does using 100 kHz or 400 kHz on the I2C change anything?

This instability is holding up qualification testing because of false failures that all point toward firmware issues, not hardware.

0 Batt over 6 years ago in reply to Cédric St-Amand

TI__Mastermind 42535 points

The comms should be done at 100kHz with a delay of at least 10-20ms. Do not use 400kHz mode. It's only used for signaling.
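
One way a host might enforce such a minimum gap between transactions is sketched below. It assumes a free-running 32-bit millisecond tick (for instance `HAL_GetTick()` on an STM32 HAL); `gap_remaining_ms` is a hypothetical helper name, and the gap value itself is whatever the host chooses to apply.

```c
#include <stdint.h>

/* Milliseconds still to wait before the next transaction may start.
 * now_ms and last_end_ms come from a free-running 32-bit millisecond
 * tick; unsigned subtraction keeps the result correct across wraparound. */
uint32_t gap_remaining_ms(uint32_t now_ms, uint32_t last_end_ms,
                          uint32_t min_gap_ms)
{
    uint32_t elapsed = now_ms - last_end_ms;
    return (elapsed >= min_gap_ms) ? 0u : (min_gap_ms - elapsed);
}
```

For example, with a 20 ms gap, a transaction that ended at tick 1000 leaves 15 ms to wait at tick 1005 and nothing to wait at tick 1020.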

0 Cédric St-Amand over 6 years ago in reply to Batt

Prodigy 150 points

We have tried 10-20 ms and we were still seeing timeouts. We are now running with a 200 ms delay and we got one rare occurrence last week.

0 Bryan Kahler over 6 years ago in reply to Cédric St-Amand

TI__Mastermind 25955 points

Hi user4427506,

I’m sorry to hear that there are communication lock-up issues currently with the bq34110.

This training session is helpful with respect to the topic at hand:

training.ti.com/gauge-programming-fundamentals

Most of the information you will find helpful with respect to this topic is around the 40 minute mark in the above video

The rule of thumb given by TI is no more than 2 standard commands per second, but this is extremely conservative.

A good guideline to follow for minimum times are the delays given in the Filestream files (BQFS and DFFS). You can export these in the golden image tab of bqStudio. Following the delays associated with the specific commands in the file should help speed things up without going too fast.

For more information on parsing the flashstream, please refer to this app note: www.ti.com/.../slua801

Sincerely,
Bryan Kahler

0 Jerome Godbout86 over 6 years ago in reply to Cédric St-Amand

Intellectual 330 points

I work with Michael Desjardins. The clock is at 100 kHz, but from time to time the device seems to NAK a request to read from or write to it. After that the BQ34110 seems to be jammed in that state and no longer ACKs any request made to it (no recovery). Even if I break, wait many seconds and start the code again, the next request does the same and the BQ NACKs everything.

I can reproduce the behavior faster if I talk to it quickly. I can reproduce the bug even if I don't talk to the other device on the same bus. Since I cannot talk to it anymore, I would need a way to reset it by hardware or to avoid this condition. Is there any way to monitor this, or any firm condition or limitation (not an approximation or a "probably") for this chip? We cannot run a full day without hitting this problem, and we poll it every second. Is there any watchdog or setting we could use to make it recover or reboot if this happens?

0 Jerome Godbout86 over 6 years ago in reply to Jerome Godbout86

Intellectual 330 points

Here is another example of NAK message:

0 Jerome Godbout86 over 6 years ago in reply to Jerome Godbout86

Intellectual 330 points

It seems that polling the BQ34110 every 200 ms, alternating between the 7 properties that we read one at a time:
TEMPERATURE,
CURRENT,
VOLTAGE,
SOC,
SOH,
FULL_CHARGE_CAPACITY,
REMAINING_CAPACITY,
so that each property was updated every 1400 ms (200 ms x 7), was still problematic, even though we are far off the suggested 20 ms. It seems to be too much for the BQ. I changed the polling to 1000 ms (so every property gets updated every 7 s), which seems to work so far. I wish I had a true number for this, especially since it's not just a bad packet but a full, indefinite chip lock-up (I2C will NACK until the chip is reset), and I'm not sure every BQ34110 will behave the same way.
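
The round-robin arithmetic in this post can be written out as a small sketch. The enum and function names are hypothetical; only the list of seven properties and the 200 ms / 1000 ms tick values come from the post.

```c
#include <stdint.h>

/* The seven properties from the post, read one per polling tick. */
typedef enum {
    PROP_TEMPERATURE,
    PROP_CURRENT,
    PROP_VOLTAGE,
    PROP_SOC,
    PROP_SOH,
    PROP_FULL_CHARGE_CAPACITY,
    PROP_REMAINING_CAPACITY,
    PROP_COUNT
} gauge_prop_t;

/* Next property in the round-robin, wrapping after the last one. */
gauge_prop_t next_prop(gauge_prop_t p)
{
    return (gauge_prop_t)((p + 1) % PROP_COUNT);
}

/* How often any single property is refreshed for a given tick interval. */
uint32_t refresh_period_ms(uint32_t tick_ms)
{
    return tick_ms * (uint32_t)PROP_COUNT;
}
```

With a 200 ms tick, `refresh_period_ms` gives 1400 ms per property; with a 1000 ms tick it gives 7000 ms.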

0 Batt over 6 years ago in reply to Jerome Godbout86

TI__Mastermind 42535 points

I think it can just be done by trial and error unfortunately.

0 Jerome Godbout86 over 6 years ago in reply to Batt

Intellectual 330 points

Hi, we tried to see if we could use the chip enable (CE) pin to reset the device and recover from this. It doesn't seem to work: we held CE low for a whole second before pulling it back high, and the next request gets NAKed no matter what. I did not see any timing information for CE; maybe I also misunderstand what CE really does.

Is there a way to recover from this NAK state? I cannot talk to the chip over I2C, and it seems I cannot use CE. Since we are powered from the battery, I cannot open the device to disconnect the battery every time this happens (the battery is not really accessible in this design), and since this is for a critical use, I cannot just hope it won't happen too often.

0 Jerome Godbout86 over 6 years ago in reply to Jerome Godbout86

Intellectual 330 points

Hi, I also reduced the sleep current to 0 mA to make sure the chip won't go into sleep, to see if sleep could be the source of the bad state, but it just seems to take a little longer to end up in that bug. I guess we will need special circuitry to cut the chip's power and hard reset it when this occurs.

0 Bryan Kahler over 6 years ago in reply to Jerome Godbout86

TI__Mastermind 25955 points

Hi Jerome,

Please monitor the TS pin - when the device is in normal mode it will pulse ~1 sec. Does this pin continue to pulse after the device exhibits this NAK state? Does it pulse ~1 sec? ~20 sec? Not at all?

If not at all, please try to communicate with the device at address 0x16 instead of 0xAA. Is there a response?

If you reduce the comm rate from 100 kHz to 50 kHz, do you still see the error?

If the error persists after these tests, please let us know.

Sincerely,
Bryan Kahler

0 Jerome Godbout86 over 6 years ago in reply to Bryan Kahler

Intellectual 330 points

The TS pin is pulsing at a 20 sec interval (~19.88 sec). Do we have a configuration problem? Normal mode is 1 sec from what you said, so what mode gives a 20 s pulse? Do you have a table of pulse interval versus mode? I'm not the one who did the bqStudio setup for this, so maybe something is missing there that I'm not aware of, or our init phase should put the BQ into a different mode.

I can provide the I2C logic analyzer capture if it helps. It's not top notch: it's a Saleae8 with the Logic software, and I have to start it manually and can capture around 5 minutes at most before it uses too much memory. Right now I poll in a loop, reading the FCC (Full Charge Capacity) every 20 ms, to reproduce the bug. It doesn't occur too often, which makes debugging a little hard. Sometimes it takes a few seconds, sometimes a few hours. The faster the polling, the sooner I get a NAK; at 200 ms it takes a few hours to a few days. But from what I can monitor in the MCU, I do get a NAK.

0 Bryan Kahler over 6 years ago in reply to Jerome Godbout86

TI__Mastermind 25955 points

Hi Jerome,

Those tests were to determine if the device was going into ROM mode or if the device was executing firmware.

This is good news - the TS pin pulsing at 20 sec just means the device is in sleep mode. The firmware is executing and the device has not gone into ROM mode.

In firmware mode the device should respond at 0xAA, not 0x16.
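
The TS-pin diagnostic from this exchange (about 1 s pulses in normal mode, about 20 s in sleep mode, and probing address 0x16 for ROM mode if there are no pulses at all) could be captured in a small decision function like the sketch below. The names are hypothetical and the tolerance windows are illustrative assumptions, not TI figures.

```c
#include <stdint.h>

typedef enum {
    GAUGE_FW_NORMAL,     /* ~1 s TS pulses: firmware running, normal mode  */
    GAUGE_FW_LOW_POWER,  /* ~20 s TS pulses: firmware running, sleep mode  */
    GAUGE_TRY_0X16,      /* no pulses: probe address 0x16 to check for ROM */
    GAUGE_UNKNOWN        /* interval fits neither pattern                  */
} gauge_hint_t;

/* period_ms is the measured interval between TS pulses, 0 if none seen.
 * The tolerance windows below are assumptions chosen for illustration. */
gauge_hint_t classify_ts_period(uint32_t period_ms)
{
    if (period_ms == 0u) {
        return GAUGE_TRY_0X16;
    }
    if (period_ms >= 800u && period_ms <= 1200u) {
        return GAUGE_FW_NORMAL;
    }
    if (period_ms >= 15000u && period_ms <= 25000u) {
        return GAUGE_FW_LOW_POWER;
    }
    return GAUGE_UNKNOWN;
}
```

The ~19.88 s interval reported earlier would fall in the low-power window here.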

Please try reducing the rate at which commands are sent and/or the i2c clock frequency.

Sincerely,
Bryan Kahler

0 Jerome Godbout86 over 6 years ago in reply to Bryan Kahler

Intellectual 330 points

Hi,

Since we saw the problem while in sleep mode, we tried preventing sleep mode to see if that would help. We tried activating the snooze bit.

With the snooze bit set, it takes longer to hit the problem, but we still hit it (we rush the polling to see if the I2C falls into the same case; not sure this is a good idea, but it seems to reproduce the bug we see, at least from the external point of view).

Is it normal for the TS pin to pulse every 20 s in snooze, just like in sleep?

It seems to happen more often during environmental tests (heat/cold, 75/-20 °C); maybe this can help point to the source of the problem.

The fact that the chip can NAK a command because it is not ready is OK and can happen; the fact that it stays stuck in that state forever is the real problem. We are looking at adding routing to control the LDO power to the BQ, so the MCU can reset the BQ and try to communicate with it again. We are not sure what the impact on the reported values will be, or whether this will work properly.

How long should I power off the BQ so that it resets properly?

We are currently running a long test polling the BQ every 1000 ms (1 s) to see if this prevents the problem. Since this takes very long to test and we cannot be sure it really fixes the problem, we doubled the recommended interval, going from 2x/s to 1x/s, and hope this will be safe enough. I took a look at the Linux kernel driver, and it polls those chips every minute (at least for the BQ27..). That might explain the problem.

Thanks for the help,

Jerome

0 Jerome Godbout86 over 6 years ago in reply to Jerome Godbout86

Intellectual 330 points

One of my colleagues just ran a test making sure the power consumption was above the sleep current to prevent sleep mode, in case the problem was linked to the mode.

We observed a weird behavior in the capture: TS pulses every 1 sec, a glitch occurs when the bug arrives, and after 6 seconds SCL seems to start pulsing just like the TS pin should! Here is the acquisition we got (see attachment).

So we can exclude sleep mode as the source of the I2C non-recovery. I know we are polling it too fast, but this reproduces the problem in a decent time (otherwise we have to wait hours, nearly days).

I'm a little perplexed about what is happening here: during the NAK the TS pin gets a pulse, then 6 TS pulses follow, then the pulses start on the SCL pin?!? I'm not sure what is going on when this happens; it looks like an overflow somewhere. I'm also not sure if this is exactly the same bug we see in long usage or if this is "normal" when polling too fast.

If you have any information on this behavior, that would help us.

0 Jerome Godbout86 over 6 years ago in reply to Jerome Godbout86

Intellectual 330 points

We measured the TS signal and SCL on an oscilloscope and saw that there is a falling edge on TS (2.5 V dropping to 1.5 V, see CH1 in the figure below) every time, at the exact moment the first NACK happens. It looks like the beginning of a normal pulse on TS that was interrupted by something that also affects the I2C responses.

Ch1 TS

Ch2 SCL

We can now get some good readings after a bug (falling edge on TS + NACK) by resetting REGIN or resetting CE and reinitializing the I2C bus in the firmware. Reinitializing only the I2C bus, or even resetting the MCU (the BQ does not reset), is not enough; the BQ really needs to be reset. If we only reset the MCU, the BQ ends up not responding at all at some point.
We are able to get this bug every 2.5 seconds by reading values rapidly, which is not what we want for our application, but it is the only effective way we can reproduce the problem QA sees when they test the BMS for many days. In the QA application, the readings are at least 200 ms apart (a test with a 1000 ms pace is under way, but it may take a few days to complete).
The bug happens in every mode we tested (NORMAL with pulses every 1 s on TS, SNOOZE and SLEEP modes with pulses every 20-30 s on TS). There is always a falling edge, never a complete measurement pulse, on TS at the same moment.
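
A host-side policy for deciding when to apply the recovery described here (reset the gauge through REGIN or CE and reinitialize the I2C bus) might look like the sketch below. It is only an illustration: the names and the failure threshold are assumptions, and the actual pin toggling and bus reinitialization are left to the caller.

```c
#include <stdbool.h>

typedef enum {
    LINK_OK,            /* transaction succeeded                        */
    LINK_RETRY,         /* failure, but below the reset threshold       */
    LINK_RESET_GAUGE    /* caller should reset the gauge AND reinit I2C */
} link_action_t;

typedef struct {
    unsigned consecutive_failures;
    unsigned reset_threshold;   /* assumed policy value chosen by the host */
} link_monitor_t;

/* Update the monitor with the outcome of one I2C transaction and return
 * what the caller should do next. */
link_action_t link_monitor_update(link_monitor_t *m, bool transaction_ok)
{
    if (transaction_ok) {
        m->consecutive_failures = 0;
        return LINK_OK;
    }
    m->consecutive_failures++;
    if (m->consecutive_failures >= m->reset_threshold) {
        m->consecutive_failures = 0;   /* count afresh after the reset */
        return LINK_RESET_GAUGE;
    }
    return LINK_RETRY;
}
```

Counting consecutive failures keeps a single transient NACK from triggering a gauge reset while still catching the stuck state where every request is NACKed.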

So we are not 100% sure we are reproducing the same issue, but the symptoms are the same: the device starts to NAK every request made to it. That pulse on TS is not like the others and does not occur at the same time. It looks like something is going wrong and this is a side effect. In the previous capture you can see that it does not fall on the TS pulse period; it just happens when the bug happens, and it is not a square pulse but a spike that discharges slowly. The hard power reset is also a concern, since if a reset alone worked, we would have recovered by toggling CE on the BQ. Even when we do recover by powering down the BQ and resetting the I2C bus, the problem arises again pretty quickly, until we do a full power down. In real life this won't be possible, since the device will always be powered by the battery.

Maybe something is wrong with our design or timing. If it can help here is the electrical schematic:

0 Bryan Kahler over 6 years ago in reply to Jerome Godbout86

TI__Mastermind 25955 points

Hi Jerome,

Thank you for the detailed post. Will need to discuss this information internally with more team members as we try to root cause this issue.

With respect to the schematic, nothing sticks out offhand. How heavily loaded is REG25? I see a line disappearing off from it. The connection to the TS pin is also missing in that portion of the schematic. Assuming the thermistor is there, but can't see it.

Sincerely,
Bryan Kahler

0 Kang Kang over 6 years ago in reply to Bryan Kahler

TI__Mastermind 24870 points

Hello Jerome,

I'm not sure if the TS pulse falling signals anything for us.

We have a similar part here. The majority of clock stretches for the gas gauge is <= 4 milliseconds.

If your host processor does not support clock stretching, it will get a NACK if it tries to send a command when the gauge is clock stretching.

This matches the observation of increasing the delay from 1 millisecond to 10 milliseconds.
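
A rough back-of-the-envelope check of that reasoning, as a sketch: each byte on an I2C bus takes 9 SCL periods (8 data bits plus the ACK/NACK bit). Assuming a 2-byte register read is made of address+W, command, address+R and two data bytes (5 bytes on the wire), and assuming at most one stretch of about 4 ms per transaction, the helper names below being hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Each byte on the wire takes 9 SCL periods (8 data bits + ACK/NACK).
 * START/STOP overhead is ignored in this estimate. scl_hz must be
 * nonzero. */
uint32_t i2c_transfer_time_us(uint32_t bytes_on_wire, uint32_t scl_hz)
{
    return (uint32_t)(((uint64_t)bytes_on_wire * 9u * 1000000u) / scl_hz);
}

/* True if the host timeout leaves room for the transfer itself plus one
 * clock stretch of max_stretch_us (an assumed worst case). */
bool timeout_covers_stretch(uint32_t timeout_us, uint32_t transfer_us,
                            uint32_t max_stretch_us)
{
    return timeout_us > transfer_us + max_stretch_us;
}
```

At 100 kHz the 5-byte read takes about 450 us, so a 1 ms timeout cannot absorb a 4 ms stretch while 5 ms and 10 ms timeouts can. Since the 4 ms figure is described as covering only the majority of stretches, this is an estimate rather than a bound.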

I'm wondering if you see the same at 5 millisecond delay.

Thanks

0 Jerome Godbout86 over 6 years ago in reply to Kang Kang

Intellectual 330 points

Hi,

Yeah, we had trouble with that at first: the initial timeout was set at 1 ms and we were getting a lot of errors. After reading this, we now allow up to 10 ms per request for the transaction to finish, to make sure it's not the wake-up delay that is causing the problem.

The clock stretching seems to be around 170 us max (avg 140 us), so it's not an issue.

We also checked whether we would see the same behavior with the chip prevented from going to sleep, and we did. So sleep mode doesn't seem to affect the behavior. We also tried the Nucleo476L and the bq34110EVM board to rule out our own hardware.

We also upgraded our CubeMX and firmware to 1.13 (from 1.11); it seems more robust but we still have the issue.

But again, we test by sampling way faster than we are supposed to in order to see the behavior, since I cannot test and debug this when the bug occurs once every few days at 1000 ms polling. So I'm not sure my captures expose the actual bug, just that the symptoms are the same: the BQ starts NAKing and we can't communicate with it anymore. Resetting the MCU's I2C bus is not enough; I have to restart the BQ chip and the I2C bus (and all other devices on the I2C bus). In our testing we do initialize the other devices on the I2C bus, but we do not talk to them.

The fact that the BQ's I2C jams once in a blue moon when polling at 1000 ms is the actual problem, but that makes it nearly impossible to debug, and my capture device cannot record that long (only a few seconds). At room temperature the device seems to work flawlessly; the problem only shows up at 1000 ms polling when cycling in the thermal chamber (-40..+70 °C), making it even harder to capture.

Is there any internal clock that might drift a lot? Any timing that we should modify when the temperature changes? If I increase the polling interval, do I only make the bug less probable, so it would still happen over a long period, or can I avoid it entirely?

0 Bryan Kahler over 6 years ago in reply to Jerome Godbout86

TI__Mastermind 25955 points

Hi Jerome,

Please try increasing the timeout to > 78 ms and run the test at the 1s rate (assuming this will be the rate for production). Please let me know if the error persists after these modifications.

Sincerely,
Bryan Kahler

0 Jerome Godbout86 over 6 years ago in reply to Bryan Kahler

Intellectual 330 points

Hi,

We have modified our firmware to comply with the given delay and are currently running tests on 5 units to see if we encounter the problem again. In the meantime I was testing a few things we talked about over the phone: waiting around 3 seconds after flooding the device before continuing, to let it get back on its feet. Here is my main loop example code (note this is wrong since it does not respect the 66 us between stop and start frame):

float temperature;
while(1)
{
    error = Read_Temperature_Celsius(&temperature);
    if(error != BQ34110_DRIVER_SUCCESS)
    {
        HAL_Delay(3500); // in ms
    }
}

It does recover a few times before it jams entirely. I'm not sure why it works a few times and then stops working after a short while. I will continue my investigation with the delay and monitoring to see the effect of each. It seems relatively stable so far with the given delay on the other setups; I will need a few more days to make sure the fix works properly. (We are monitoring whether any operation takes between our previous 10 ms timeout and the current 78 ms one; that tells us we would previously have failed at that point, and we log it into our MCU flash for testing purposes.)
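
The monitoring described here, flagging operations that finish within the new timeout but would have exceeded the old one, could be sketched as follows. The names are hypothetical; the caller is assumed to measure the elapsed time around each I2C call and to do the actual logging.

```c
#include <stdint.h>

typedef enum {
    XFER_OK,             /* finished within the old, shorter timeout */
    XFER_SHORT_TIMEOUT,  /* would have failed before; worth logging  */
    XFER_TIMEOUT         /* exceeded even the new timeout            */
} xfer_class_t;

/* Classify one transaction by how long it took. */
xfer_class_t classify_duration(uint32_t elapsed_ms,
                               uint32_t old_timeout_ms,
                               uint32_t new_timeout_ms)
{
    if (elapsed_ms <= old_timeout_ms) {
        return XFER_OK;
    }
    if (elapsed_ms <= new_timeout_ms) {
        return XFER_SHORT_TIMEOUT;
    }
    return XFER_TIMEOUT;
}
```

With the 10 ms and 78 ms values from this thread, an operation that takes 23 ms would be classified as `XFER_SHORT_TIMEOUT`.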

Thanks for the information so far.

0 Jerome Godbout86 over 6 years ago in reply to Jerome Godbout86

Intellectual 330 points

I zoomed in to compare the first exchange when it succeeded and when it did not. I have a feeling that something is wrong between what the MCU expects at that point and what the BQ expects. First message after recovery:

And the last message before the problem:

0 Cédric St-Amand over 6 years ago in reply to Jerome Godbout86

Prodigy 150 points

Hi,

We have good news regarding the 80 ms delay and 500 ms between accesses. We have had a few units running since yesterday with new firmware, and we have had the chance to see accesses that took longer than 10 ms with the unit behaving properly. We are running more tests to build up confidence, but so far, with the modified firmware delay plus extra tracking, the I2C looks good.

119, 00/01/01 17:57:21.772, HAL_I2C_Mem_Read Short Timeout: Operation took 23 ms

162, 00/01/01 11:33:49.503, HAL_I2C_Mem_Read Short Timeout: Operation took 13 ms

161, 00/01/01 11:19:33.149, HAL_I2C_Mem_Read Short Timeout: Operation took 18 ms

160, 00/01/01 05:13:22.463, HAL_I2C_Mem_Read Short Timeout: Operation took 12 ms

We will keep you updated on it. I am posting the same information on the TI Forum.

Have a nice day,

0 Bryan Kahler over 6 years ago in reply to Cédric St-Amand

TI__Mastermind 25955 points

Hi Jerome and user4427506,

I am glad to hear that the increased 80ms delay and 500ms between access is working well thus far under test conditions. Thank you for the continued updates.

Sincerely,
Bryan Kahler
