Linux/AM3358: Linux kernel freeze after pulling I2C lines to GND

Part Number: AM3358

Tool/software: Linux

Hello, 

We would like help addressing this issue, since we don't know whether this kind of behavior is expected per the specs.

We first noticed hard freezes after connecting some I2C GPIO expanders. We then removed the kernel driver mappings from the device tree and started probing with i2cdetect -y -r <i2cdev-num>, with just one device attached.

We managed to recreate the problem by randomly and continuously shorting SDA/SCL to ground during the i2cdetect probes; in this way we may have reached an annoying situation.

Reducing the I2C clock seemed to reduce the number of freeze events.

We would like to understand whether

a) this is expected, and we should add protection to our external circuitry in case these events occur.

b) it is a kernel bug

c) it is a hardware bug

d) something else.

  • Hi,

    Pulling I2C signals to GND is definitely out of spec. This should be avoided. What Linux version are you using?
  • We are using 4.4-bone-rt. We also tested this on 4.9, both the ti and bone variants.
    I don't know very well how things work between the ARM core and the I2C peripheral. Does this kind of event generate an unrecoverable hardware fault (or rather, one recoverable only by reset) that halts the CPU?
    Or might the kernel still be running?

    Thanks,

    Leo

  • Can you try with the Processor SDK provided by TI: www.ti.com/.../PROCESSOR-SDK-AM335X ? We do not support the versions you use.
  • I have installed the latest LTS TI kernel. It did change something:

    The usr0 LED is still beating.

    I can no longer connect via Ethernet, but via the serial interface on top of the board I got:

    [  532.175272] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/160-4819c00:77]

    So I guess there's still hope that this bug can be solved via software?

  • I have asked the software team to look at this. They will respond here.
  • On the hardware side, one of our experts thought that you might be entering a clock stretching situation. Please check section 21.3.8 of the AM335x TRM Rev. P for details.
  • Hi Leonardo,

    Can you provide a scope trace when this occurs? This sounds like a hardware issue. I2C is pretty fundamental, and is getting exercised constantly because of the CPUFreq driver adjusting PMIC voltage as we change CPU Frequency under varying load conditions. We'd see a lot more issues if I2C had bugs, but anything is possible :)

    Can you also provide some information about your IO expander, and how it is connected to your beaglebone?


    Regards,
    Mike

  • Hi Mike,

    The problem with the scope trace is that it is sometimes really hard to capture the exact moment (we have to find out whether we have a digital scope with good buffering). We managed to save the signal chart with a digital analyzer. Or do you also need the voltage values?

    The I2C device is http://www.horter.de/doku/i2c-hs-output_db.pdf.

    Clock stretching looks like a possible reason. It seems that reducing the I2C clock speed makes the freeze impossible to reproduce.

    We think the I2C issue with the device is the same as the one seen when shorting SDA/SCL to ground, simply because the result (the kernel freeze) is the same.

    We tested this with multiple BeagleBones. We are definitely interested in finding a solution that doesn't freeze the Linux kernel.

  • More logs, from a kernel compiled with additional I2C debug messages:

    [ 680.885766] systemd-journald[118]: /dev/kmsg buffer overrun, some messages lost.
    [ 680.917773] systemd-journald[118]: /dev/kmsg buffer overrun, some messages lost.
    [ 683.490351] i2c i2c-2: SCL is stuck low, exit recovery
    [ 685.014028] sched: RT throttling activated

    and, after that:


    [ 984.154020] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/160-4819c00:83]
    [ 984.161973] Modules linked in: pru_rproc pruss_intc omap_aes_driver omap_sham pruss omap_rng rng_core evdev uio_pdrv_genirq uio 8021q garp mrp stp llc bnep usb_f_acm u_serial usb_f_ecm usb_f_rndis u_ether libcomposite bluetooth rfkill
    [ 984.183002] CPU: 0 PID: 83 Comm: irq/160-4819c00 Tainted: G L 4.4.68-ti-r108 #9
    [ 984.191473] Hardware name: Generic AM33XX (Flattened Device Tree)
    [ 984.197591] task: dc3b8680 ti: dc514000 task.ti: dc514000
    [ 984.203015] PC is at irq_finalize_oneshot.part.1+0xa4/0x110
    [ 984.208611] LR is at irq_gc_unmask_enable_reg+0x7c/0x8c
    [ 984.213858] pc : [<c00a949c>] lr : [<c00adf80>] psr: 600f0113
    [ 984.213858] sp : dc515ed0 ip : dc515e98 fp : dc515eec
    [ 984.225383] r10: c00a9508 r9 : dc4acfc0 r8 : dc209cc0
    [ 984.230628] r7 : dc209cd0 r6 : dc209d28 r5 : dc4acfc0 r4 : dc209cc0
    [ 984.237181] r3 : dc00d074 r2 : dc209cc0 r1 : fa200000 r0 : dc00d024
    [ 984.243735] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
    [ 984.250900] Control: 10c5387d Table: 9c728019 DAC: 00000051
    [ 984.256670] CPU: 0 PID: 83 Comm: irq/160-4819c00 Tainted: G L 4.4.68-ti-r108 #9
    [ 984.265141] Hardware name: Generic AM33XX (Flattened Device Tree)
    [ 984.271265] [<c001bab8>] (unwind_backtrace) from [<c0015a6c>] (show_stack+0x20/0x24)
    [ 984.279045] [<c0015a6c>] (show_stack) from [<c05be1bc>] (dump_stack+0x8c/0xa0)
    [ 984.286303] [<c05be1bc>] (dump_stack) from [<c0011c54>] (show_regs+0x1c/0x20)
    [ 984.293475] [<c0011c54>] (show_regs) from [<c010c7a4>] (watchdog_timer_fn+0x244/0x2ac)
    [ 984.301431] [<c010c7a4>] (watchdog_timer_fn) from [<c00be4f8>] (__hrtimer_run_queues+0x1b8/0x398)
    [ 984.310345] [<c00be4f8>] (__hrtimer_run_queues) from [<c00befe8>] (hrtimer_interrupt+0xd4/0x23c)
    [ 984.319174] [<c00befe8>] (hrtimer_interrupt) from [<c002c940>] (omap2_gp_timer_interrupt+0x38/0x40)
    [ 984.328263] [<c002c940>] (omap2_gp_timer_interrupt) from [<c00a814c>] (handle_irq_event_percpu+0xac/0x2b0)
    [ 984.337963] [<c00a814c>] (handle_irq_event_percpu) from [<c00a83a4>] (handle_irq_event+0x54/0x78)
    [ 984.346877] [<c00a83a4>] (handle_irq_event) from [<c00abc64>] (handle_level_irq+0xb8/0x150)
    [ 984.355268] [<c00abc64>] (handle_level_irq) from [<c00a76f0>] (generic_handle_irq+0x34/0x44)
    [ 984.363745] [<c00a76f0>] (generic_handle_irq) from [<c00a79fc>] (__handle_domain_irq+0x6c/0xc4)
    [ 984.372484] [<c00a79fc>] (__handle_domain_irq) from [<c0009500>] (omap_intc_handle_irq+0x44/0xa0)
    [ 984.381396] [<c0009500>] (omap_intc_handle_irq) from [<c0a85614>] (__irq_svc+0x54/0x70)
    [ 984.389431] Exception stack(0xdc515e80 to 0xdc515ec8)
    [ 984.394507] 5e80: dc00d024 fa200000 dc209cc0 dc00d074 dc209cc0 dc4acfc0 dc209d28 dc209cd0
    [ 984.402722] 5ea0: dc209cc0 dc4acfc0 c00a9508 dc515eec dc515e98 dc515ed0 c00adf80 c00a949c
    [ 984.410933] 5ec0: 600f0113 ffffffff
    [ 984.414439] [<c0a85614>] (__irq_svc) from [<c00a949c>] (irq_finalize_oneshot.part.1+0xa4/0x110)
    [ 984.423180] [<c00a949c>] (irq_finalize_oneshot.part.1) from [<c00a9564>] (irq_thread_fn+0x5c/0x64)
    [ 984.432181] [<c00a9564>] (irq_thread_fn) from [<c00a9954>] (irq_thread+0x170/0x234)
    [ 984.439875] [<c00a9954>] (irq_thread) from [<c0065e68>] (kthread+0x118/0x130)
    [ 984.447044] [<c0065e68>] (kthread) from [<c0010ee0>] (ret_from_fork+0x14/0x34)

  • Update:

    It seems that even attaching an oscilloscope probe produces a similar issue, so I thought of adding a small capacitance to SCL. The result is always the same freeze: the controller times out and the kernel attempts recovery. During the recovery (I'm still digging with printk), RT throttling is activated, and then omap_i2c_xfer is called repeatedly without delays, which seems to be the source of the deadlock.

    I don't know if I'm giving you enough information. We will send you a scope trace, and we will try a similar scenario on another architecture to see whether this behavior is expected.

    We will definitely design our I2C bus better, although we find it a bit strange for the Linux kernel to behave like this.

    thank you for your work,

    Leonardo

  • Hello, 

    I'm posting some traces.

    TITLE: SCL of i2c2, 100 kHz, scope 1x. No external pull-up, just internal.

    TITLE: SDA/SCL, 400 kHz, no external pull-up

    Comments: In this condition the kernel is very sensitive to what happens on the bus. If the signal is stretched too much, because of the rise time or because a 60 pF capacitor is placed on the bus, polling the I2C bus causes an immediate freeze.

    TITLE: SDA (green)/SCL (yellow), 4 kOhm external pull-up, 100 kHz

    Comments: The rise time is now "fixed": even with a capacitor added to torture the I2C bus, the waveform stays stable. No deadlocks when polling.

    Conclusion:
    It seems that there was indeed a communication issue. That said, I still don't know whether it is expected behavior for the kernel to freeze completely because of "bad communication".


    What do you think?

    Thank you,


    Leonardo

  • Hi Leonardo,

    Strong external pull ups are required on I2C lines per the spec (http://www.nxp.com/documents/user_manual/UM10204.pdf).   On our EVMs, we typically have values of 2.2K ohm.  It is possible the kernel driver is making some assumptions based on proper bus behavior, but it sounds like your issue is resolved now.

    TI has an app note that will help you calculate values for your system: 

    The Linux kernel driver was likely written with the assumption these resistors are in place, otherwise, I

    Regards,
    Mike
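
As a rough sketch of the pull-up sizing discussed in the reply above, the two bounds given in the I2C-bus specification (UM10204) can be computed as follows. The function names are hypothetical, and any bus capacitance plugged in is an assumption: the thread mentions a 60 pF test capacitor but never gives the total bus capacitance.

```c
#include <assert.h>

/*
 * Upper bound on the pull-up resistor.
 * The line charges through R into the bus capacitance C; the time to rise
 * from 0.3*VDD to 0.7*VDD is ln(0.7/0.3) * R * C, about 0.8473 * R * C.
 * Solving for R gives the largest resistor that still meets the rise time.
 */
double i2c_pullup_max_ohms(double rise_time_s, double bus_cap_f)
{
    assert(rise_time_s > 0.0 && bus_cap_f > 0.0);
    return rise_time_s / (0.8473 * bus_cap_f);
}

/*
 * Lower bound on the pull-up resistor.
 * A device pulling the line low must sink the pull-up current while
 * keeping the line at or below V_OL, so R >= (VDD - V_OL) / I_OL.
 */
double i2c_pullup_min_ohms(double vdd_v, double vol_max_v, double iol_a)
{
    assert(vdd_v > vol_max_v && iol_a > 0.0);
    return (vdd_v - vol_max_v) / iol_a;
}

/* Estimated 30%-to-70% rise time for a given pull-up and bus capacitance. */
double i2c_rise_time_s(double pullup_ohms, double bus_cap_f)
{
    assert(pullup_ohms > 0.0 && bus_cap_f > 0.0);
    return 0.8473 * pullup_ohms * bus_cap_f;
}
```

For example, with the Fast-mode rise-time limit of 300 ns and an assumed 100 pF bus, i2c_pullup_max_ohms(300e-9, 100e-12) gives roughly 3.5 kOhm. Assuming a 3.3 V bus with V_OL = 0.4 V and I_OL = 3 mA, i2c_pullup_min_ohms(3.3, 0.4, 0.003) gives just under 1 kOhm. Weak internal pull-ups fall well outside that window, which is consistent with the slow edges reported earlier in the thread.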

  • Hi,

    I had exactly the same issue with the non-RT kernel linux-image-4.4.68-ti-r115. I have a noisy I2C network with bus extenders, which sometimes triggers this issue.

    I managed to replicate the issue in my lab and added some debug messages to the code. It seems that the OMAP controller raises OMAP_I2C_STAT_XRDY when this happens. Because the controller is in receiver mode, the interrupt is never handled correctly, which then leads to an interrupt loop.

    With the receiver mode check in the interrupt handler commented out, I can no longer reproduce the issue in the lab.

    diff --git a/drivers/i2c/busses/i2c-omap.c b/drivers/i2c/busses/i2c-omap.c
    index ab1279b..65496c1 100644
    --- a/drivers/i2c/busses/i2c-omap.c
    +++ b/drivers/i2c/busses/i2c-omap.c
    @@ -1017,10 +1017,12 @@ omap_i2c_isr_thread(int this_irq, void *dev_id)
     		stat &= bits;

     		/* If we're in receiver mode, ignore XDR/XRDY */
    +		/*
     		if (omap->receiver)
     			stat &= ~(OMAP_I2C_STAT_XDR | OMAP_I2C_STAT_XRDY);
     		else
     			stat &= ~(OMAP_I2C_STAT_RDR | OMAP_I2C_STAT_RRDY);
    +		*/

     		if (!stat) {
     			/* my work here is done */
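
The interrupt loop described in the post above can be illustrated with a small toy model. This is a sketch, not the driver: the bit positions are only meant to mirror the names in i2c-omap.c and should be treated as illustrative, the function names are hypothetical, and the model assumes an event is acknowledged only if it survives the filter.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative status bits, named after the ones in the patch above. */
#define STAT_XDR   (1u << 14)
#define STAT_RDR   (1u << 13)
#define STAT_XRDY  (1u << 4)
#define STAT_RRDY  (1u << 3)

/*
 * The filtering step from the patch: in receiver mode the transmit events
 * are masked out, otherwise the receive events are masked out.
 * Returns the events the handler would go on to service.
 */
unsigned int filter_stat(unsigned int stat, bool receiver)
{
    if (receiver)
        stat &= ~(STAT_XDR | STAT_XRDY);
    else
        stat &= ~(STAT_RDR | STAT_RRDY);
    return stat;
}

/*
 * Counts handler passes until the raw status is clear, giving up after
 * max_passes. Only events that pass the filter are acknowledged, so a
 * masked event stays pending and the interrupt keeps firing.
 */
unsigned int passes_until_clear(unsigned int raw, bool receiver,
                                unsigned int max_passes)
{
    unsigned int passes = 0;

    while (raw != 0 && passes < max_passes) {
        unsigned int handled = filter_stat(raw, receiver);

        raw &= ~handled;  /* acknowledge only what was serviced */
        passes++;
    }
    return passes;
}
```

In this model, passes_until_clear(STAT_XRDY, true, 1000) runs out its budget of 1000 passes, because the XRDY event is filtered away and never acknowledged, while the same call with receiver set to false finishes in a single pass. That matches the reported symptom of the threaded IRQ handler spinning until the soft lockup watchdog fires.
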
  • Hi,

    thank you for your answer and work! I will try to test this patch as soon as I can.

    Will this patch be added to the kernel?

    -l

  • Hi,

    I'm not from TI. Someone from TI could comment on why this code was added 5 years ago.
    github.com/.../079d8af24b948261e1dae5d7df6b31b7bf205cb4

    This has also been noticed before, but the patch never made it into the Linux kernel tree:
    patchwork.kernel.org/.../

    Oskari
  • Oskari,

    Thank you for digging up these patches.

    Based on the commit log, the code was probably added to prevent unwanted interrupts from literally interrupting an active RX or TX. If you only needed to remove the code, and have not suffered any side effects, it was probably added to address a specific corner case.

    That said, perhaps the second patch you referenced is better or more correct. I will raise this internally with our Linux team.

    Thanks again,
    Mike