AM625: I2C controller becomes stuck with SCL low

Part Number: AM625

On our TQMa62xx SoM, we've observed the I2C controller becoming stuck under high system load (a simultaneous stress test of CPU, RAM, and various peripherals). We've noticed several symptoms of this state:

  • SCL is permanently pulled low by the AM62 I2C controller
  • A spurious call to omap_i2c_receive_data(). In the current kernel (ti-linux-kernel 09.00.00.009-rt), this causes a buffer overrun and usually a crash, because the function never checks buf_len before writing to the buffer. I assume the cause is a spurious RRDY interrupt, since that code path passes omap->threshold as num_bytes without checking the remaining buffer size. As part of our debugging, we've patched our kernel to detect this case and print a warning instead of crashing.
  • Besides BB, the XUDF and BF flags were set in I2C_STAT (on at least one occurrence of this issue; the flags might not have been the same every time). I haven't been able to capture a proper trace of the register values when the issue occurred, so I'm not sure whether an underflow actually happens before the other symptoms become visible.
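The missing bounds check in the receive path can be illustrated with a small userspace model. This is a sketch, not a patch against i2c-omap.c: struct rx_state, fifo_read_byte() and receive_bounded() are hypothetical names, and the FIFO read is stubbed out. It only shows the idea of clamping num_bytes to the remaining buf_len and counting the event, instead of trusting the threshold value passed in by the RRDY path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified model of the receive state; NOT the real
 * struct omap_i2c_dev from i2c-omap.c. */
struct rx_state {
	uint8_t *buf;       /* next free byte in the message buffer */
	size_t buf_len;     /* bytes still expected for the current message */
	unsigned spurious;  /* RX events that asked for more bytes than expected */
};

/* Hypothetical stand-in for reading one byte from the data register. */
static uint8_t fifo_read_byte(void)
{
	return 0xA5;
}

/*
 * Copy at most buf_len bytes into the buffer. If the caller asks for more
 * (e.g. num_bytes == threshold on a spurious RRDY), warn and clamp instead
 * of writing past the end of the buffer. Returns the number of bytes copied.
 */
static size_t receive_bounded(struct rx_state *st, size_t num_bytes)
{
	assert(st != NULL);
	if (num_bytes > st->buf_len) {
		fprintf(stderr, "spurious RX: %zu bytes requested, only %zu expected\n",
			num_bytes, st->buf_len);
		st->spurious++;
		num_bytes = st->buf_len;
	}
	for (size_t i = 0; i < num_bytes; i++)
		*st->buf++ = fifo_read_byte();
	st->buf_len -= num_bytes;
	return num_bytes;
}
```

In the real driver, any bytes beyond buf_len would still sit in the RX FIFO; whether to drain and discard them or to treat the event as a hard error is a separate decision that this sketch does not address.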

Once the controller is in this state, omap_i2c_wait_for_bb() will always time out and no further transfers are possible. As SCL is low, omap_i2c_recover_bus() fails with -EBUSY.
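Since the flags seen in I2C_STAT may differ between occurrences, it may help to log the raw register value in decoded form whenever omap_i2c_wait_for_bb() times out. The sketch below assumes the bit positions from the OMAP_I2C_STAT_* definitions in drivers/i2c/busses/i2c-omap.c (worth cross-checking against the AM62x TRM); decode_stat() is a hypothetical helper, and only a subset of the flags is listed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Subset of I2C_STAT bits; positions assumed from the OMAP_I2C_STAT_*
 * macros in i2c-omap.c (verify against the AM62x TRM). */
static const struct {
	uint16_t mask;
	const char *name;
} stat_bits[] = {
	{ 1u << 0,  "AL"   },  /* arbitration lost */
	{ 1u << 1,  "NACK" },  /* no acknowledge */
	{ 1u << 2,  "ARDY" },  /* register access ready */
	{ 1u << 3,  "RRDY" },  /* receive data ready */
	{ 1u << 4,  "XRDY" },  /* transmit data ready */
	{ 1u << 8,  "BF"   },  /* bus free */
	{ 1u << 10, "XUDF" },  /* transmit underflow */
	{ 1u << 11, "ROVR" },  /* receive overrun */
	{ 1u << 12, "BB"   },  /* bus busy */
};

/* Write the names of all set flags, space-separated and in ascending bit
 * order, into out. Output is truncated if out is too small. */
static void decode_stat(uint16_t stat, char *out, size_t len)
{
	assert(out != NULL && len > 0);
	out[0] = '\0';
	for (size_t i = 0; i < sizeof(stat_bits) / sizeof(stat_bits[0]); i++) {
		if (!(stat & stat_bits[i].mask))
			continue;
		size_t used = strlen(out);
		snprintf(out + used, len - used, "%s%s", used ? " " : "",
			 stat_bits[i].name);
	}
}
```

Under these bit assignments, a status with BB, XUDF and BF set (0x1500) decodes to "BF XUDF BB".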

Recovery from the stuck state is possible if transfers are stopped for long enough for the controller to be suspended by runtime PM. I could also get it unstuck by manually triggering a Soft Reset of the controller using devmem.
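Given that a soft reset (or a runtime-PM suspend) clears the condition, one option is to fall back to a controller reset when bus recovery fails. The sketch below only models that decision on a simulated controller: struct sim_ctrl, sim_recover_bus(), sim_soft_reset() and recover_after_bb_timeout() are all hypothetical, and a real implementation would presumably reuse the driver's existing reset and re-initialization path rather than writing registers directly.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical simulation of the stuck controller, for illustration only. */
struct sim_ctrl {
	bool scl_held_low;  /* controller is holding SCL low */
	unsigned resets;    /* number of soft resets performed */
};

/* Models the observed behaviour of bus recovery: it cannot help while
 * SCL is held low and reports -EBUSY. */
static int sim_recover_bus(struct sim_ctrl *c)
{
	return c->scl_held_low ? -EBUSY : 0;
}

/* Models the observation that a soft reset of the controller releases SCL. */
static int sim_soft_reset(struct sim_ctrl *c)
{
	c->resets++;
	c->scl_held_low = false;
	return 0;
}

/* Proposed fallback after a bus-busy timeout: try bus recovery first and
 * only reset the controller if recovery reports -EBUSY. */
static int recover_after_bb_timeout(struct sim_ctrl *c)
{
	int ret = sim_recover_bus(c);

	if (ret != -EBUSY)
		return ret;
	fprintf(stderr, "bus recovery returned -EBUSY, resetting controller\n");
	ret = sim_soft_reset(c);
	if (ret == 0)
		ret = sim_recover_bus(c);  /* confirm the bus is free again */
	return ret;
}
```

Re-running recovery after the reset is a cheap way to confirm the bus is actually free before retrying the transfer; if it still fails, the error is returned instead of resetting again, which avoids an endless reset loop.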

Some of these issues should be easy to solve in software: resetting the controller when bus recovery is impossible, and improving the handling of buf_len. (The synchronization between omap_i2c_xfer_msg() and omap_i2c_xfer_data() also seems questionable: I suspect the barrier() in omap_i2c_xfer_msg() might be too weak if the IRQ is handled on a different core.)
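To illustrate the synchronization concern: in the kernel, barrier() only prevents the compiler from reordering memory accesses; it does not order them as observed by another CPU. The userspace sketch below shows the usual release/acquire publication pattern with C11 atomics and pthreads: the fields are written first, then a flag is stored with release semantics, and the other thread loads the flag with acquire semantics before reading buf_len. struct xfer_shared, the armed flag and all function names are hypothetical; the kernel counterparts would be smp_store_release()/smp_load_acquire() or simply a spinlock, and whether either is the right fix for the driver is an assumption.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state shared between the submitting thread (cf. omap_i2c_xfer_msg())
 * and an "IRQ handler" on another CPU (cf. omap_i2c_xfer_data()). */
struct xfer_shared {
	uint8_t *buf;
	size_t buf_len;
	atomic_bool armed;  /* set last, with release semantics */
};

static void publish_msg(struct xfer_shared *s, uint8_t *buf, size_t len)
{
	s->buf = buf;
	s->buf_len = len;
	/* Release store: the writes above are visible to any thread that
	 * observes armed == true with an acquire load. */
	atomic_store_explicit(&s->armed, true, memory_order_release);
}

static size_t consume_msg(struct xfer_shared *s)
{
	/* Acquire load pairs with the release store in publish_msg(). */
	while (!atomic_load_explicit(&s->armed, memory_order_acquire))
		;  /* busy-wait; acceptable for a demo only */
	return s->buf_len;
}

struct run_ctx {
	struct xfer_shared shared;
	size_t seen_len;
};

static void *irq_side(void *arg)
{
	struct run_ctx *ctx = arg;

	ctx->seen_len = consume_msg(&ctx->shared);
	return NULL;
}

/* Publish one message from this thread while another thread consumes it;
 * returns the buf_len observed by the consumer, or (size_t)-1 on error. */
static size_t run_once(size_t len)
{
	static uint8_t buf[64];
	struct run_ctx ctx = { .seen_len = 0 };
	pthread_t t;

	atomic_init(&ctx.shared.armed, false);
	if (pthread_create(&t, NULL, irq_side, &ctx) != 0)
		return (size_t)-1;
	publish_msg(&ctx.shared, buf, len);
	pthread_join(t, NULL);
	return ctx.seen_len;
}
```

A passing run does not prove the absence of a race; the guarantee comes from the release/acquire pairing itself, not from testing.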

However, I'm concerned about the spurious omap_i2c_receive_data() call, as it might also lead to data corruption. Is this caused by the hardware misbehaving, or is there a different explanation?

Our Device Tree configuration:

&main_i2c0 {
	pinctrl-names = "default";
	pinctrl-0 = <&main_i2c0_pins>;
	clock-frequency = <400000>;
	status = "okay";
	
	/* Not shown: several devices including PMIC, EEPROMs, RTC, ... */
};

Regards,
Matthias