SK-AM62B: serial: 8250: 8250_omap: Fix possible interrupt storm on K3 SoCs

Part Number: SK-AM62B

Hi TI Expert,

Based on this article (e2e.ti.com/.../4730883), we are aware that applying the patch "b67e830d38fa9335d927fe67e812e3ed81b4689c" may result in loss of UART data.

I have some questions:

1. Do we have a final solution to stop the interrupt storm?
2. If not, what is the worst-case scenario if the UART goes into an interrupt storm? Will I get corrupted data?
3. Could the number of interrupts be large enough to crash the OS?
4. How do I know if my OS is caught in an interrupt storm?

I have a wireless module that communicates with an AM62x SoC via a UART interface running at 3M baud rate. Sometimes, when I enable the wireless module for Bluetooth, the OS crashes.

In the first stage, I wanted to determine where the problem was, so I separated the wireless module. I connected UART TX to RX for loopback testing. In this test, the AM62x operates normally, and at 3M baud rate, the number of interrupts is about 9000 per second.

If I don't apply the patch 'b67e830d38fa9335d927fe67e812e3ed81b4689c' and the OS enters an interrupt storm, could that cause the wireless module to crash? Or would the OS just receive useless interrupts that waste CPU time without harming anything?

Thanks,

Sean

  • Hi Sean,

    The UART irq storm issue you mention is i2310 in the AM62x Errata, but I don't think we have a real scenario that triggers the irq storm. Yes, if the issue did happen, it would generate a storm of useless interrupts and consume CPU time, leaving no bandwidth for other tasks. I cannot comment on whether it would cause the wireless module to crash, since I don't know how the module behaves when the data it expects does not arrive.

    I recommend not applying the patch "b67e830d38fa9335d927fe67e812e3ed81b4689c", to avoid the UART data loss issue.

  • Hi Bin,

    Thanks for your reply. Can I determine whether the OS is entering an interrupt storm by monitoring for an unusual number of interrupts? Or are there other symptoms I should look for?

    Thanks,

    Sean

  • Hi Sean,

    In general, when an irq storm happens, the CPU load will be high from servicing the irq, and Linux might become less responsive or not responsive at all. Of course the UART interrupt count would be abnormally high in this case, but if Linux is not responsive you won't be able to check the UART interrupt rate.
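
    One rough way to watch for this while the system still responds is to sample the UART's line in /proc/interrupts at a fixed interval and look at the rate. The sketch below is only an illustration: the helper names irq_line_total() and irq_rate() are made up for this example, and it assumes the usual line layout of an IRQ number, one counter per CPU, then a controller name that does not start with a digit.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Sum the per-CPU counters of one /proc/interrupts line such as
 * " 28:   4135   0   0   0   GICv3 211 Level   2810000.serial".
 * Returns 0 on success, -1 if the line has no "N:" prefix or no counters.
 * Assumption: the controller name after the counters starts with a non-digit.
 */
int irq_line_total(const char *line, unsigned long long *total)
{
    const char *p = line;
    char *end;
    int columns = 0;

    while (isspace((unsigned char)*p))
        p++;
    if (!isdigit((unsigned char)*p))
        return -1;
    (void)strtoull(p, &end, 10);    /* skip the IRQ number itself */
    if (*end != ':')
        return -1;
    p = end + 1;

    *total = 0;
    for (;;) {
        while (isspace((unsigned char)*p))
            p++;
        if (!isdigit((unsigned char)*p))
            break;                  /* reached the controller name */
        *total += strtoull(p, &end, 10);
        p = end;
        columns++;
    }
    return columns > 0 ? 0 : -1;
}

/* Interrupts per second between two samples taken interval_s seconds apart. */
double irq_rate(unsigned long long before, unsigned long long after,
                double interval_s)
{
    if (interval_s <= 0.0 || after < before)
        return 0.0;
    return (double)(after - before) / interval_s;
}
```

    What counts as abnormal depends on the baud rate and traffic. For reference, the loopback test earlier in this thread saw about 9000 interrupts per second at 3M baud.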

  • Hi Bin,

    During my testing today, I discovered that I do indeed encounter interrupt storms.


    I can easily reproduce the interrupt storm on the PCBA when I connect the UART wireless module. The number of interrupts is huge, approximately 10,000 per second, causing the operating system to crash.

    The picture below shows that the number of interrupts increased by 500,000 in just 2 seconds:

    It seems that we have to choose between losing data and mitigating the storm. Neither option works for me, because on my PCBA the serial port could lose data with the patch applied, and the WiFi module could trigger interrupt storms without it. Is it possible to solve both problems at the same time?

    In my observation, the WiFi module exhibits strange behavior: it sends strange data to the SoC when powered on. I suspect this may be the reason for triggering the interrupt storm.

    I also observed that the RTS has a peculiar jump before the interrupt storm occurs, as shown at point B, and point C marks the time when the interrupt storm occurs:

    A: When powering on
    B: Abnormal RTS jump
    C: Interrupt storm occurs

    Thanks,

    Sean

  • Hi Sean,

    Can you please test the kernel patch below to see if it resolves the irq storm problem?

    diff --git a/drivers/tty/serial/8250/8250_omap.c b/drivers/tty/serial/8250/8250_omap.c
    index 170639d12b2a..03404f14b1fc 100644
    --- a/drivers/tty/serial/8250/8250_omap.c
    +++ b/drivers/tty/serial/8250/8250_omap.c
    @@ -115,6 +115,11 @@
     /* RX FIFO occupancy indicator */
     #define UART_OMAP_RX_LVL               0x19
    
    +/* Timeout low and High */
    +#define UART_OMAP_TO_L                 0x26
    +#define UART_OMAP_TO_H                 0x27
    +
    +
     /*
      * Copy of the genpd flags for the console.
      * Only used if console suspend is disabled
    @@ -663,13 +668,22 @@ static irqreturn_t omap8250_irq(int irq, void *dev_id)
    
            /*
             * On K3 SoCs, it is observed that RX TIMEOUT is signalled after
    -        * FIFO has been drained, in which case a dummy read of RX FIFO
    -        * is required to clear RX TIMEOUT condition.
    +        * FIFO has been drained or erroneously.
    +        * So apply solution of Errata i2310 as mentioned in
    +        * https://www.ti.com/lit/er/sprz536a/sprz536a.pdf
             */
            if (priv->habit & UART_RX_TIMEOUT_QUIRK &&
    -           (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT &&
    -           serial_port_in(port, UART_OMAP_RX_LVL) == 0) {
    -               serial_port_in(port, UART_RX);
    +           (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT) {
    +                       unsigned char timeout_h, timeout_l;
    +                       timeout_h = serial_in(up, UART_OMAP_TO_H);
    +                       timeout_l = serial_in(up, UART_OMAP_TO_L);
    +                       serial_out(up, UART_OMAP_TO_H, 0xFF);
    +                       serial_out(up, UART_OMAP_TO_L, 0xFF);
    +                       serial_out(up, UART_OMAP_EFR2, 0x1);
    +                       serial_in(up, UART_IIR);
    +                       serial_out(up, UART_OMAP_EFR2, 0x0);
    +                       serial_out(up, UART_OMAP_TO_H, timeout_h);
    +                       serial_out(up, UART_OMAP_TO_L, timeout_l);
            }
    
            /* Stop processing interrupts on input overrun */
    

  • Hi Bin,

    This doesn't work for me. Even after patching, the operating system still enters a state of interrupt storm.

    Thanks,

    Sean

  • Hi Sean,

    Thanks for testing the patch. Can you please revert it, apply the following kernel patch instead, and check whether the 'dev_info()' message appears in the kernel log when you test your WiFi module? I think this would tell us whether the irq storm comes from the timeout interrupts.

    diff --git a/drivers/tty/serial/8250/8250_omap.c b/drivers/tty/serial/8250/8250_omap.c
    index 1393362df269..ae23f42e4826 100644
    --- a/drivers/tty/serial/8250/8250_omap.c
    +++ b/drivers/tty/serial/8250/8250_omap.c
    @@ -635,6 +635,12 @@ static irqreturn_t omap8250_irq(int irq, void *dev_id)
            serial8250_rpm_get(up);
            lsr = serial_port_in(port, UART_LSR);
            iir = serial_port_in(port, UART_IIR);
    +
    +       if (!uart_console(port) &&
    +           (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT &&
    +           serial_port_in(port, UART_OMAP_RX_LVL) == 0)
    +               dev_info(port->dev, "timeout irq happened\n");
    +
            ret = serial8250_handle_irq(port, iir);
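
    A note on the check in this patch: the comparison is written as (iir & mask) == mask rather than a plain (iir & mask) test because the RX TIMEOUT code shares a bit with the ordinary receive-data code. The user-space sketch below illustrates this, assuming the values from the kernel's include/uapi/linux/serial_reg.h (please verify them against your tree). is_spurious_rx_timeout() is a hypothetical helper for illustration, not a driver function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values assumed to match include/uapi/linux/serial_reg.h. */
#define UART_IIR_RDI         0x04  /* receiver data interrupt */
#define UART_IIR_RX_TIMEOUT  0x0c  /* OMAP RX timeout interrupt */

/*
 * True when IIR reports RX TIMEOUT while the RX FIFO level is zero,
 * which is the condition the debug patch logs. A plain
 * (iir & UART_IIR_RX_TIMEOUT) test would also be true for UART_IIR_RDI,
 * because 0x04 is one of the two bits in 0x0c.
 */
bool is_spurious_rx_timeout(uint8_t iir, uint8_t rx_lvl)
{
    return (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT && rx_lvl == 0;
}
```

    Here rx_lvl stands for the value read from UART_OMAP_RX_LVL in the handler.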

  • Hi Bin,

    After entering the "hciattach /dev/ttyS3 any 3000000 flow" command, a large number of timeout messages will be displayed.

    [  243.331560] Bluetooth: Core ver 2.22
    [  243.331699] NET: Registered protocol family 31
    [  243.331703] Bluetooth: HCI device and connection manager initialized
    Device setup complete
    [  243.331726] Bluetooth: HCI socket layer initialized
    [  243.331734] Bluetooth: L2CAP socket layer initialized
    [  243.331740] Bluetooth: SCO socket layer initialized
    [  243.343238] Bluetooth: HCI UART driver ver 2.3
    [  243.343261] Bluetooth: HCI UART protocol H4 registered
    [  243.343361] Bluetooth: HCI UART protocol LL registered
    [  243.343657] Bluetooth: HCI UART protocol Broadcom registered
    [  243.343690] Bluetooth: HCI UART protocol QCA registered
    [  243.526461] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
    [  243.526484] Bluetooth: BNEP filters: protocol multicast
    [  243.526506] Bluetooth: BNEP socket layer initialized
    [  243.551888] NET: Registered protocol family 38
    ** 8540 printk messages dropped **
    [  243.716407] omap8250 2810000.serial: timeout irq happened
    ** 852 printk messages dropped **
    [  243.724803] omap8250 2810000.serial: timeout irq happened

    Thanks,

    Sean

  • Hi Sean,

    Thanks for confirming this.

    Can you please apply both of my patches above, and see if the "timeout irq happened" message still gets printed in the kernel log?

  • Hi Bin,

    The "timeout irq happened" message is still printed with both of your patches applied. The diff below shows the two patches as I applied them for this test.

    @@ -650,6 +654,12 @@ static irqreturn_t omap8250_irq(int irq, void *dev_id)
            serial8250_rpm_get(up);
            lsr = serial_port_in(port, UART_LSR);
            iir = serial_port_in(port, UART_IIR);
    +
    +       if (!uart_console(port) &&
    +           (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT &&
    +           serial_port_in(port, UART_OMAP_RX_LVL) == 0)
    +               dev_info(port->dev, "timeout irq happened\n");
    +
            ret = serial8250_handle_irq(port, iir);
    
            /*
    @@ -658,9 +668,17 @@ static irqreturn_t omap8250_irq(int irq, void *dev_id)
             * is required to clear RX TIMEOUT condition.
             */
            if (priv->habit & UART_RX_TIMEOUT_QUIRK &&
    -           (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT &&
    -           serial_port_in(port, UART_OMAP_RX_LVL) == 0) {
    -               serial_port_in(port, UART_RX);
    +           (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT) {
    +               unsigned char timeout_h, timeout_l;
    +               timeout_h = serial_in(up, UART_OMAP_TO_H);
    +               timeout_l = serial_in(up, UART_OMAP_TO_L);
    +               serial_out(up, UART_OMAP_TO_H, 0xFF);
    +               serial_out(up, UART_OMAP_TO_L, 0xFF);
    +               serial_out(up, UART_OMAP_EFR2, 0x1);
    +               serial_in(up, UART_IIR);
    +               serial_out(up, UART_OMAP_EFR2, 0x0);
    +               serial_out(up, UART_OMAP_TO_H, timeout_h);
    +               serial_out(up, UART_OMAP_TO_L, timeout_l);
            }
    

    The message "timeout irq happened" will continue to print without stopping.

    Thanks,

    Sean

  • Hi Sean,

    I am not sure I fully understand your comments, so please confirm:

    Was this log printed with both patches applied or with just the second patch? If the latter, can you please attach the console log from the case with both patches?

  • Hi Bin,

    The log below was printed with only the message-printing patch applied. It indicates that an interrupt storm occurred:

    [  239.244272] Bluetooth: Core ver 2.22
    [  239.244423] NET: Registered protocol family 31
    [  239.244427] Bluetooth: HCI device and connection manager initialized
    [  239.244451] Bluetooth: HCI socket layer initialized
    [  239.244460] Bluetooth: L2CAP socket layer initialized
    [  239.244466] Bluetooth: SCO socket layer initialized
    Device setup complete
    [  239.278345] Bluetooth: HCI UART driver ver 2.3
    [  239.278372] Bluetooth: HCI UART protocol H4 registered
    [  239.278476] Bluetooth: HCI UART protocol LL registered
    [  239.278782] Bluetooth: HCI UART protocol Broadcom registered
    [  239.278815] Bluetooth: HCI UART protocol QCA registered
    [  239.456950] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
    [  239.456973] Bluetooth: BNEP filters: protocol multicast
    [  239.456993] Bluetooth: BNEP socket layer initialized
    [  239.483023] NET: Registered protocol family 38
    [  256.831967] omap8250 2810000.serial: timeout irq happened
    [  256.831994] omap8250 2810000.serial: timeout irq happened
    [  256.832002] omap8250 2810000.serial: timeout irq happened
    [  256.832010] omap8250 2810000.serial: timeout irq happened
    [  256.832017] omap8250 2810000.serial: timeout irq happened
    [  256.832024] omap8250 2810000.serial: timeout irq happened
    [  256.832032] omap8250 2810000.serial: timeout irq happened
    [  256.832039] omap8250 2810000.serial: timeout irq happened
    [  256.832046] omap8250 2810000.serial: timeout irq happened

    The log below is printed in the case with both patches applied:

    [   47.080665] Bluetooth: Core ver 2.22
    [   47.080804] NET: Registered protocol family 31
    [   47.080808] Bluetooth: HCI device and connection manager initialized
    [   47.080833] Bluetooth: HCI socket layer initialized
    [   47.080841] Bluetooth: L2CAP socket layer initialized
    [   47.080847] Bluetooth: SCO socket layer initialized
    Device setup complete
    [   47.115790] Bluetooth: HCI UART driver ver 2.3
    [   47.115813] Bluetooth: HCI UART protocol H4 registered
    [   47.115915] Bluetooth: HCI UART protocol LL registered
    [   47.116217] Bluetooth: HCI UART protocol Broadcom registered
    [   47.116246] Bluetooth: HCI UART protocol QCA registered
    [   47.125004] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.135742] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.146973] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.154484] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.161208] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.174169] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.180732] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.187220] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.193615] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.200012] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.212727] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.220957] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.223323] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.225393] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.227511] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.229695] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.232371] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.235017] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.242442] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.265473] omap8250 2810000.serial: apply solution of Errata i2310
    [   47.267479] omap8250 2810000.serial: apply solution of Errata i2310
    ** 578 printk messages dropped **
    [   47.271436] omap8250 2810000.serial: timeout irq happened
    ** 2572 printk messages dropped **
    [   47.288943] omap8250 2810000.serial: apply solution of Errata i2310
    ** 1467 printk messages dropped **
    [   47.298428] omap8250 2810000.serial: timeout irq happened
    ** 1242 printk messages dropped **
    [   47.306354] omap8250 2810000.serial: apply solution of Errata i2310

    The Errata i2310 workaround does not seem to prevent the 'timeout irq' from occurring.


    Here is the code in omap8250_irq() that I used, with both patches applied.

    static irqreturn_t omap8250_irq(int irq, void *dev_id)
    {
            struct uart_port *port = dev_id;
            struct omap8250_priv *priv = port->private_data;
            struct uart_8250_port *up = up_to_u8250p(port);
            unsigned int iir, lsr;
            int ret;
    
    #ifdef CONFIG_SERIAL_8250_DMA
            if (up->dma) {
                    ret = omap_8250_dma_handle_irq(port);
                    return IRQ_RETVAL(ret);
            }
    #endif
    
            serial8250_rpm_get(up);
            lsr = serial_port_in(port, UART_LSR);
            iir = serial_port_in(port, UART_IIR);
    
            if (!uart_console(port) &&
                (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT &&
                serial_port_in(port, UART_OMAP_RX_LVL) == 0)
                    dev_info(port->dev, "timeout irq happened\n");
    
            ret = serial8250_handle_irq(port, iir);
    
            /*
             * On K3 SoCs, it is observed that RX TIMEOUT is signalled after
             * FIFO has been drained or erroneously.
             * So apply solution of Errata i2310 as mentioned in
             * https://www.ti.com/lit/er/sprz536a/sprz536a.pdf
             */
            if (priv->habit & UART_RX_TIMEOUT_QUIRK &&
                (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT) {
                    unsigned char timeout_h, timeout_l;
                    timeout_h = serial_in(up, UART_OMAP_TO_H);
                    timeout_l = serial_in(up, UART_OMAP_TO_L);
                    serial_out(up, UART_OMAP_TO_H, 0xFF);
                    serial_out(up, UART_OMAP_TO_L, 0xFF);
                    serial_out(up, UART_OMAP_EFR2, 0x1);
                    serial_in(up, UART_IIR);
                    serial_out(up, UART_OMAP_EFR2, 0x0);
                    serial_out(up, UART_OMAP_TO_H, timeout_h);
                    serial_out(up, UART_OMAP_TO_L, timeout_l);
                    if (!uart_console(port))
                            dev_info(port->dev, "apply solution of Errata i2310\n");
            }
    

    Thanks,

    Sean

  • Hi Sean,

    Thanks for the details.

    In both cases (2nd patch only, and both patches), when you see the console print "timeout irq happened" and then disconnect the UART WiFi device, does the console continue to print this message, or does it stop?

  • Hi Bin,

    In both cases, when I see the console print "timeout irq happened" and then hot-unplug the UART WiFi device, the console stops printing the "timeout irq" message, and the OS remains operational.

    Thanks,

    Sean

  • Hi Sean,

    I think we are close to a conclusion, but please help with one more test.

    With either of the two cases (2nd patch only, or both patches) you have right now, please apply the following patch to remove the RX TIMEOUT irq handling:

    @@ -1218,7 +1235,8 @@ static struct omap8250_dma_params am33xx_dma = {
     
     static struct omap8250_platdata am654_platdata = {
            .dma_params     = &am654_dma,
    -       .habit          = UART_HAS_EFR2 | UART_HAS_RHR_IT_DIS |
    -                         UART_RX_TIMEOUT_QUIRK,
    +       .habit          = UART_HAS_EFR2 | UART_HAS_RHR_IT_DIS,
     };

    Then repeat the test from your previous reply, quoted here, to see whether the "timeout irq happened" message stops when the UART WiFi device is removed:

    In both cases, when I see the console print "timeout irq happened" and then hot-unplug the UART WiFi device, the console stops printing the "timeout irq" message, and the OS remains operational.

  • Hi Bin,

    Applying the patch to remove UART_RX_TIMEOUT_QUIRK (RX TIMEOUT IRQ handling) produces the same result in both cases: the "timeout irq happened" message stops printing when the UART WiFi device is removed.


    There are four test cases:

    Test A:

    1) Print debug messages.

    Test B:

    1) Print debug messages.
    2) Implement the solution for Errata i2310.

    Test C:

    1) Print debug messages.
    2) Remove RX TIMEOUT IRQ handling.

    Test D:

    1) Print debug messages.
    2) Implement the solution for Errata i2310.
    3) Remove RX TIMEOUT IRQ handling.

    Since "removing RX TIMEOUT IRQ handling" essentially means not applying the Errata i2310 workaround, tests A, C, and D are effectively the same.

    I'm just wondering what we can learn from tests C and D.

    Thanks,

    Sean

  • Hi Sean,

    Thanks for all the tests.

    Given all the results, I doubt that the issue with the UART WiFi device is i2310.

    All the test results show that the RX TIMEOUT interrupt occurs both with and without the i2310 handling in the driver, and that it stops when the WiFi device is removed. I don't think this is how i2310 behaves.

    It seems the UART WiFi device somehow keeps generating a UART bus condition that causes the AM62x UART controller to generate RX timeout interrupts. The interrupt storm stops when that bus condition disappears as the WiFi device is removed.

    I will do some homework on i2310. Meanwhile, can you please analyze the WiFi device and the UART bus to understand what causes the abnormal bus activity?

  • Hi Bin,

    As shown in the picture below, I removed the WiFi device at point A (marked in red). The change on the bus is that RX goes low. Is this what stops the interrupt storm?

    Before point A, RX and CTS operated normally. I did not see any unusual signal that could have driven the interrupt storm.

    The last data transmitted by the WiFi device also looks normal:

    Thanks,

    Sean

  • Hi Sean,

    Earlier I wrote:

    Given all the results, I doubt that the issue with the UART WiFi device is i2310.

    All the test results show that the RX TIMEOUT interrupt occurs both with and without the i2310 handling in the driver, and that it stops when the WiFi device is removed. I don't think this is how i2310 behaves.

    Sorry, I think my conclusion above is not correct.

    Test A:

    1) Print debug messages.

    Test B:

    1) Print debug messages.
    2) Implement the solution for Errata i2310.

    Test C:

    1) Print debug messages.
    2) Remove RX TIMEOUT IRQ handling.

    Test D:

    1) Print debug messages.
    2) Implement the solution for Errata i2310.
    3) Remove RX TIMEOUT IRQ handling.

    All four test cases show that the UART keeps getting RX TIMEOUT interrupts while the RX FIFO is empty (RX_LVL = 0). I guess the UART WiFi device somehow causes an electrical condition that makes the AM62x UART controller generate the RX TIMEOUT interrupts.

    I will check internally about this issue and get back to you.

  • Hi Sean,

    Sorry for the delayed response. I was in a full-day training for the past two weeks.

    Please remove all the kernel debug patches we have added so far, and apply the following kernel patch. It counts the RX TIMEOUT interrupts that arrive while the RX FIFO is empty, reusing the port's 'cts' counter for that purpose.

    diff --git a/drivers/tty/serial/8250/8250_omap.c b/drivers/tty/serial/8250/8250_omap.c
    index 1393362df269..0d978baa2ff3 100644
    --- a/drivers/tty/serial/8250/8250_omap.c
    +++ b/drivers/tty/serial/8250/8250_omap.c
    @@ -635,6 +635,13 @@ static irqreturn_t omap8250_irq(int irq, void *dev_id)
            serial8250_rpm_get(up);
            lsr = serial_port_in(port, UART_LSR);
            iir = serial_port_in(port, UART_IIR);
    +
    +       if (!uart_console(port) &&
    +           (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT &&
    +           serial_port_in(port, UART_OMAP_RX_LVL) == 0) {
    +               port->icount.cts++;
    +       }
    +
            ret = serial8250_handle_irq(port, iir);
     
            /*
    diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
    index f0ed30d0a697..be97b034d9ac 100644
    --- a/drivers/tty/serial/serial_core.c
    +++ b/drivers/tty/serial/serial_core.c
    @@ -3313,7 +3313,7 @@ void uart_handle_cts_change(struct uart_port *uport, unsigned int status)
     {
            lockdep_assert_held_once(&uport->lock);
     
    -       uport->icount.cts++;
    +       // uport->icount.cts++;
     
            if (uart_softcts_mode(uport)) {
                    if (uport->hw_stopped) {

    After booting the board with this patched kernel, run the command 'serialstats -i 1 -d /dev/ttySx' ('x' is the index of the UART port under test) on your board console. It prints the status of port 'x' at 1-second intervals; with this patch, the first variable, 'cts', is the count of RX TIMEOUT interrupts. Then start the UART communication with the modem to trigger the interrupt storm, and capture the UART interrupt count (/proc/interrupts) at 1-second intervals too. Finally, compare the 'cts' counts with the UART interrupt counts to see whether the RX TIMEOUT interrupt is the source of the interrupt storm.
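
    The comparison can also be scripted instead of read off the console. This is a minimal sketch, assuming serialstats prints one line per interval in a "cts: N dsr: N rng: N ..." layout; serialstats_field() and counter_deltas() are hypothetical helper names, not part of serialstats.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the value that follows "<name>:" in one serialstats line.
 * Returns 0 on success, -1 if the field is missing or not a number.
 * Note: this is a plain substring search, so it is only meant for the
 * unambiguous leading fields such as "cts", "dsr" and "rng".
 */
int serialstats_field(const char *line, const char *name,
                      unsigned long long *value)
{
    char key[32];
    const char *p;
    int n = snprintf(key, sizeof(key), "%s:", name);

    if (n < 0 || n >= (int)sizeof(key))
        return -1;
    p = strstr(line, key);
    if (!p)
        return -1;
    return sscanf(p + strlen(key), "%llu", value) == 1 ? 0 : -1;
}

/*
 * Turn n cumulative samples into n - 1 per-interval increments, so the
 * 'cts' column can be compared with /proc/interrupts interval by interval.
 */
void counter_deltas(const unsigned long long *samples, size_t n,
                    unsigned long long *deltas)
{
    for (size_t i = 0; i + 1 < n; i++)
        deltas[i] = samples[i + 1] - samples[i];
}
```

    If the per-interval 'cts' increments are close to the per-interval increments of the UART line in /proc/interrupts, the RX TIMEOUT interrupt accounts for most of the storm.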

  • Hi Bin,

    Here are the test results:

    The number of interrupts appears to track the 'cts' count.

    ./serialstats -i 1 -d /dev/ttyS3
    cts: 0 dsr: 0 rng: 0 dcd: 0 rx: 792 tx: 808 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 0 dsr: 0 rng: 0 dcd: 0 rx: 1584 tx: 1616 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 0 dsr: 0 rng: 0 dcd: 0 rx: 2376 tx: 2424 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 55170 dsr: 0 rng: 0 dcd: 0 rx: 3118 tx: 2656 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 222385 dsr: 0 rng: 0 dcd: 0 rx: 3168 tx: 3232 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 222385 dsr: 0 rng: 0 dcd: 0 rx: 3960 tx: 4040 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 278376 dsr: 0 rng: 0 dcd: 0 rx: 4481 tx: 4126 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 372711 dsr: 0 rng: 0 dcd: 0 rx: 4752 tx: 4848 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 372711 dsr: 0 rng: 0 dcd: 0 rx: 5544 tx: 5656 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 372711 dsr: 0 rng: 0 dcd: 0 rx: 6336 tx: 6464 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 372711 dsr: 0 rng: 0 dcd: 0 rx: 7128 tx: 7272 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 372711 dsr: 0 rng: 0 dcd: 0 rx: 7920 tx: 8080 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 372711 dsr: 0 rng: 0 dcd: 0 rx: 8712 tx: 8888 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 372711 dsr: 0 rng: 0 dcd: 0 rx: 8712 tx: 8892 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 372711 dsr: 0 rng: 0 dcd: 0 rx: 9504 tx: 9696 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 372711 dsr: 0 rng: 0 dcd: 0 rx: 10296 tx: 10504 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    

    28:       4135          0          0          0     GICv3 211 Level     2810000.serial
    28:       4138          0          0          0     GICv3 211 Level     2810000.serial
    28:     226682          0          0          0     GICv3 211 Level     2810000.serial
    28:     377333          0          0          0     GICv3 211 Level     2810000.serial
    28:     377333          0          0          0     GICv3 211 Level     2810000.serial

    Thanks,

    Sean

  • Hi Sean,

    Thanks for the testing. Yes, I think the result shows that most of the UART interrupts are RX TIMEOUT interrupts.

    I think we need a final test to confirm whether the RX TIMEOUT interrupt is cleared in the irq handler, which will tell us whether we should focus on the hardware rather than the software. The US has a holiday this Thursday, so I will create the final test patch early next week.
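
    As a rough cross-check of the numbers posted above, the share of RX TIMEOUT interrupts can be estimated by dividing the final 'cts' count by the growth of the UART interrupt counter. The sketch assumes that the 4138 sample from /proc/interrupts is the pre-storm baseline and that both logs cover the same period, which the logs themselves do not guarantee. timeout_share() is a hypothetical helper.

```c
/*
 * Fraction of the interrupts raised between two /proc/interrupts samples
 * that were counted as RX TIMEOUT with an empty FIFO.
 * Returns 0.0 if the interrupt counter did not grow.
 */
double timeout_share(unsigned long long timeout_count,
                     unsigned long long irq_before,
                     unsigned long long irq_after)
{
    if (irq_after <= irq_before)
        return 0.0;
    return (double)timeout_count / (double)(irq_after - irq_before);
}
```

    With the posted values, timeout_share(372711, 4138, 377333) comes out at about 0.9987, so under those assumptions nearly all interrupts in that window were RX TIMEOUT interrupts.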

  • Hi Sean,

    Sorry for my late response. Please apply the following kernel debug patch (it includes the previous patch), run the same test, and capture the serialstats log again. I expect the cts and dsr counts to be equal and close to the UART interrupt count in /proc/interrupts, while the rng count should be 0.

    diff --git a/drivers/tty/serial/8250/8250_omap.c b/drivers/tty/serial/8250/8250_omap.c
    index ae23f42e4826..7117e6230964 100644
    --- a/drivers/tty/serial/8250/8250_omap.c
    +++ b/drivers/tty/serial/8250/8250_omap.c
    @@ -638,8 +638,9 @@ static irqreturn_t omap8250_irq(int irq, void *dev_id)
     
            if (!uart_console(port) &&
                (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT &&
    -           serial_port_in(port, UART_OMAP_RX_LVL) == 0)
    -               dev_info(port->dev, "timeout irq happened\n");
    +           serial_port_in(port, UART_OMAP_RX_LVL) == 0) {
    +               port->icount.cts++;
    +       }
     
            ret = serial8250_handle_irq(port, iir);
     
    @@ -652,6 +653,11 @@ static irqreturn_t omap8250_irq(int irq, void *dev_id)
                (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT &&
                serial_port_in(port, UART_OMAP_RX_LVL) == 0) {
                    serial_port_in(port, UART_RX);
    +               port->icount.dsr++;
    +
    +               iir = serial_port_in(port, UART_IIR);
    +               if ((iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT)
    +                       port->icount.rng++;
            }
     
            /* Stop processing interrupts on input overrun */
    diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
    index 8efe31448df3..e2608fcd8f0b 100644
    --- a/drivers/tty/serial/8250/8250_port.c
    +++ b/drivers/tty/serial/8250/8250_port.c
    @@ -1879,10 +1879,10 @@ unsigned int serial8250_modem_status(struct uart_8250_port *up)
            up->msr_saved_flags = 0;
            if (status & UART_MSR_ANY_DELTA && up->ier & UART_IER_MSI &&
                port->state != NULL) {
    -               if (status & UART_MSR_TERI)
    -                       port->icount.rng++;
    -               if (status & UART_MSR_DDSR)
    -                       port->icount.dsr++;
    +               // if (status & UART_MSR_TERI)
    +                       // port->icount.rng++;
    +               // if (status & UART_MSR_DDSR)
    +                       // port->icount.dsr++;
                    if (status & UART_MSR_DDCD)
                            uart_handle_dcd_change(port, status & UART_MSR_DCD);
                    if (status & UART_MSR_DCTS)
    diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
    index f0ed30d0a697..be97b034d9ac 100644
    --- a/drivers/tty/serial/serial_core.c
    +++ b/drivers/tty/serial/serial_core.c
    @@ -3313,7 +3313,7 @@ void uart_handle_cts_change(struct uart_port *uport, unsigned int status)
     {
            lockdep_assert_held_once(&uport->lock);
     
    -       uport->icount.cts++;
    +       // uport->icount.cts++;
     
            if (uart_softcts_mode(uport)) {
                    if (uport->hw_stopped) {

  • Hi Bin,

    Sorry for the late reply; I’ve been busy with other things lately.

    I'm not sure which specific test you need, so I conducted two tests:

    Test Case 1: Perform a dummy read of the RX FIFO at RX TIMEOUT.

    In Test Case 1, performing a dummy read of the RX FIFO keeps the interrupt count down.

     28:      44200          0          0          0     GICv3 211 Level     2810000.seria
    
    cts: 128 dsr: 13696 rng: 121 dcd: 0 rx: 214976 tx: 219958 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 128 dsr: 13696 rng: 121 dcd: 0 rx: 214976 tx: 219958 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    

    This is the patch I used for Test Case 1:

    diff --git a/drivers/tty/serial/8250/8250_omap.c b/drivers/tty/serial/8250/8250_omap.c
    index 38ec53a7ce49d..e90a917d08035 100644
    --- a/drivers/tty/serial/8250/8250_omap.c
    +++ b/drivers/tty/serial/8250/8250_omap.c
    @@ -650,6 +650,13 @@ static irqreturn_t omap8250_irq(int irq, void *dev_id)
            serial8250_rpm_get(up);
            lsr = serial_port_in(port, UART_LSR);
            iir = serial_port_in(port, UART_IIR);
    +
    +       if (!uart_console(port) &&
    +               (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT &&
    +               serial_port_in(port, UART_OMAP_RX_LVL) == 0) {
    +               port->icount.cts++;
    +       }
    +
            ret = serial8250_handle_irq(port, iir);
    
            /*
    @@ -657,10 +664,14 @@ static irqreturn_t omap8250_irq(int irq, void *dev_id)
             * FIFO has been drained, in which case a dummy read of RX FIFO
             * is required to clear RX TIMEOUT condition.
             */
    -       if (priv->habit & UART_RX_TIMEOUT_QUIRK &&
    -           (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT &&
    +       if ((iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT &&
                serial_port_in(port, UART_OMAP_RX_LVL) == 0) {
                    serial_port_in(port, UART_RX);
    +               port->icount.dsr++;
    +
    +               iir = serial_port_in(port, UART_IIR);
    +               if ((iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT)
    +                       port->icount.rng++;
            }
    
    
    diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
    index 947737d0e46b2..52e8c7a3b2b12 100644
    --- a/drivers/tty/serial/8250/8250_port.c
    +++ b/drivers/tty/serial/8250/8250_port.c
    @@ -1851,10 +1851,10 @@ unsigned int serial8250_modem_status(struct uart_8250_port *up)
            up->msr_saved_flags = 0;
            if (status & UART_MSR_ANY_DELTA && up->ier & UART_IER_MSI &&
                port->state != NULL) {
    -               if (status & UART_MSR_TERI)
    -                       port->icount.rng++;
    -               if (status & UART_MSR_DDSR)
    -                       port->icount.dsr++;
    +               //if (status & UART_MSR_TERI)
    +               //      port->icount.rng++;
    +               //if (status & UART_MSR_DDSR)
    +               //      port->icount.dsr++;
                    if (status & UART_MSR_DDCD)
                            uart_handle_dcd_change(port, status & UART_MSR_DCD);
                    if (status & UART_MSR_DCTS)
    diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
    index 40fff38588d4f..b9d25dcaf3280 100644
    --- a/drivers/tty/serial/serial_core.c
    +++ b/drivers/tty/serial/serial_core.c
    @@ -3169,7 +3169,7 @@ void uart_handle_cts_change(struct uart_port *uport, unsigned int status)
     {
            lockdep_assert_held_once(&uport->lock);
    
    -       uport->icount.cts++;
    +       //uport->icount.cts++;
    
            if (uart_softcts_mode(uport)) {
                    if (uport->hw_stopped) {
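
    A note on the `(iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT` test used in these patches: the RX timeout ID shares a bit with the ordinary receive-data ID, so a plain non-zero mask test would also fire on normal RX interrupts. The standalone sketch below illustrates the difference. The two constants are the values I believe `include/uapi/linux/serial_reg.h` defines; treat them as assumptions and check your kernel tree. The function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Interrupt-ID values, assumed to match include/uapi/linux/serial_reg.h. */
#define UART_IIR_RDI		0x04	/* receiver data interrupt */
#define UART_IIR_RX_TIMEOUT	0x0c	/* OMAP RX timeout interrupt */

/* Naive test (hypothetical): true for any IIR value sharing a bit with the mask. */
bool rx_timeout_naive(uint8_t iir)
{
	return (iir & UART_IIR_RX_TIMEOUT) != 0;
}

/* Test used in the patches: true only when both ID bits are set. */
bool rx_timeout_exact(uint8_t iir)
{
	return (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT;
}
```

    With an IIR value of 0x04 (plain receive data) the naive test reports a timeout and the exact test does not, which is why the counters in these tests only move on real RX TIMEOUT events.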
    

    Test Case 2: Skip the dummy read and only accumulate DSR at RX TIMEOUT.

    In Test Case 2, the CTS, DSR, and RNG counters (which the patch below repurposes to count RX TIMEOUT events) all track the UART interrupt count closely.

    28:     412679          0          0          0     GICv3 211 Level     2810000.serial
    
    cts: 94425 dsr: 94476 rng: 94425 dcd: 0 rx: 806 tx: 1770 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 331765 dsr: 331816 rng: 331765 dcd: 0 rx: 806 tx: 1770 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    cts: 412570 dsr: 412670 rng: 412571 dcd: 0 rx: 1598 tx: 2578 frame error 0 overuns 0 parity: 0 break: 0 buffer overrun: 0
    

    This is the patch I used for Test Case 2:

    --- a/drivers/tty/serial/8250/8250_omap.c
    +++ b/drivers/tty/serial/8250/8250_omap.c
    @@ -650,6 +650,13 @@ static irqreturn_t omap8250_irq(int irq, void *dev_id)
            serial8250_rpm_get(up);
            lsr = serial_port_in(port, UART_LSR);
            iir = serial_port_in(port, UART_IIR);
    +
    +       if (!uart_console(port) &&
    +               (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT &&
    +               serial_port_in(port, UART_OMAP_RX_LVL) == 0) {
    +               port->icount.cts++;
    +       }
    +
            ret = serial8250_handle_irq(port, iir);
    
            /*
    @@ -657,10 +664,14 @@ static irqreturn_t omap8250_irq(int irq, void *dev_id)
             * FIFO has been drained, in which case a dummy read of RX FIFO
             * is required to clear RX TIMEOUT condition.
             */
    -       if (priv->habit & UART_RX_TIMEOUT_QUIRK &&
    -           (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT &&
    +       if ((iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT &&
                serial_port_in(port, UART_OMAP_RX_LVL) == 0) {
    -               serial_port_in(port, UART_RX);
    +               //serial_port_in(port, UART_RX);
    +               port->icount.dsr++;
    +
    +               iir = serial_port_in(port, UART_IIR);
    +               if ((iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT)
    +                       port->icount.rng++;
            }
    
    
    
    
    
    diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
    index 947737d0e46b2..52e8c7a3b2b12 100644
    --- a/drivers/tty/serial/8250/8250_port.c
    +++ b/drivers/tty/serial/8250/8250_port.c
    @@ -1851,10 +1851,10 @@ unsigned int serial8250_modem_status(struct uart_8250_port *up)
            up->msr_saved_flags = 0;
            if (status & UART_MSR_ANY_DELTA && up->ier & UART_IER_MSI &&
                port->state != NULL) {
    -               if (status & UART_MSR_TERI)
    -                       port->icount.rng++;
    -               if (status & UART_MSR_DDSR)
    -                       port->icount.dsr++;
    +               //if (status & UART_MSR_TERI)
    +               //      port->icount.rng++;
    +               //if (status & UART_MSR_DDSR)
    +               //      port->icount.dsr++;
                    if (status & UART_MSR_DDCD)
                            uart_handle_dcd_change(port, status & UART_MSR_DCD);
                    if (status & UART_MSR_DCTS)
    diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
    index 40fff38588d4f..b9d25dcaf3280 100644
    --- a/drivers/tty/serial/serial_core.c
    +++ b/drivers/tty/serial/serial_core.c
    @@ -3169,7 +3169,7 @@ void uart_handle_cts_change(struct uart_port *uport, unsigned int status)
     {
            lockdep_assert_held_once(&uport->lock);
    
    -       uport->icount.cts++;
    +       //uport->icount.cts++;
    
            if (uart_softcts_mode(uport)) {
    

    If you need a different test or additional information, please let me know.

    Thanks,

    Sean
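
    To compare the two test cases numerically, a small userspace helper can parse a counter line in the format posted above and report what share of the UART interrupts were RX TIMEOUT events with an empty FIFO (the repurposed `cts` counter in the test patches). This is only a sketch: the struct and function names are hypothetical, and it assumes the counter line and the interrupt count were sampled at roughly the same time.

```c
#include <stdio.h>

/* Counters taken from one line of the debug output (names are illustrative). */
struct uart_counters {
	unsigned long cts, dsr, rng, rx, tx;
};

/* Parse "cts: N dsr: N rng: N dcd: N rx: N tx: N ..."; returns 0 on success. */
int parse_counters(const char *line, struct uart_counters *c)
{
	unsigned long dcd;
	int n = sscanf(line, " cts: %lu dsr: %lu rng: %lu dcd: %lu rx: %lu tx: %lu",
		       &c->cts, &c->dsr, &c->rng, &dcd, &c->rx, &c->tx);

	return n == 6 ? 0 : -1;
}

/* Percentage of UART interrupts that were RX TIMEOUT with an empty FIFO. */
double empty_timeout_pct(const struct uart_counters *c, unsigned long irqs)
{
	return irqs ? 100.0 * (double)c->cts / (double)irqs : 0.0;
}
```

    Fed with the numbers above, Test Case 1 (cts 128 against 44200 interrupts) comes out well under 1%, while the last sample of Test Case 2 (cts 412570 against 412679 interrupts) comes out above 99%, which is what a storm looks like in these counters.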

  • Hi Sean,

    Bin is Out of Office through Wednesday. Please expect a delayed response.

  • Hi Sean,

    Thanks for running these tests. I think they prove that commit b67e830d38fa9335d927fe67e812e3ed81b4689c resolves the UART IRQ storm issue, although it can cause data loss.

    The patch I provided above on June 5th to fix the data loss caused by this commit appears to have a bug. Below is the complete new patch. Can you please test it?

    diff --git a/drivers/tty/serial/8250/8250_omap.c b/drivers/tty/serial/8250/8250_omap.c
    index 0d46fb89b3ab..cf68d9f6924f 100644
    --- a/drivers/tty/serial/8250/8250_omap.c
    +++ b/drivers/tty/serial/8250/8250_omap.c
    @@ -117,6 +117,10 @@
     /* RX FIFO occupancy indicator */
     #define UART_OMAP_RX_LVL		0x19
     
    +/* Timeout low and High */
    +#define UART_OMAP_TO_L                 0x26
    +#define UART_OMAP_TO_H                 0x27
    +
     /*
      * Copy of the genpd flags for the console.
      * Only used if console suspend is disabled
    @@ -657,13 +661,25 @@ static irqreturn_t omap8250_irq(int irq, void *dev_id)
     
     	/*
     	 * On K3 SoCs, it is observed that RX TIMEOUT is signalled after
    -	 * FIFO has been drained, in which case a dummy read of RX FIFO
    -	 * is required to clear RX TIMEOUT condition.
    +	 * FIFO has been drained or erroneously.
    +	 * So apply solution of Errata i2310 as mentioned in
    +	 * https://www.ti.com/lit/pdf/sprz536
     	 */
     	if (priv->habit & UART_RX_TIMEOUT_QUIRK &&
     	    (iir & UART_IIR_RX_TIMEOUT) == UART_IIR_RX_TIMEOUT &&
     	    serial_port_in(port, UART_OMAP_RX_LVL) == 0) {
    -		serial_port_in(port, UART_RX);
    +		unsigned char efr2, timeout_h, timeout_l;
    +
    +		efr2 = serial_in(up, UART_OMAP_EFR2);
    +		timeout_h = serial_in(up, UART_OMAP_TO_H);
    +		timeout_l = serial_in(up, UART_OMAP_TO_L);
    +		serial_out(up, UART_OMAP_TO_H, 0xFF);
    +		serial_out(up, UART_OMAP_TO_L, 0xFF);
    +		serial_out(up, UART_OMAP_EFR2, UART_OMAP_EFR2_TIMEOUT_BEHAVE);
    +		serial_in(up, UART_IIR);
    +		serial_out(up, UART_OMAP_EFR2, efr2);
    +		serial_out(up, UART_OMAP_TO_H, timeout_h);
    +		serial_out(up, UART_OMAP_TO_L, timeout_l);
     	}
     
     	/* Stop processing interrupts on input overrun */
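
    The register sequence in the patch above follows a save / reprogram / read / restore pattern. The sketch below replays that order against a fake register array so the ordering can be checked in userspace. It does not model what the hardware does on the IIR read (that part is described in errata i2310). `struct fake_uart`, `fake_in()`, `fake_out()` and `clear_rx_timeout()` are hypothetical names; the `UART_OMAP_EFR2` offset and the `TIMEOUT_BEHAVE` bit are what I believe the mainline driver uses, so treat them as assumptions.

```c
#include <stdint.h>

#define UART_IIR			0x02	/* standard 8250 IIR offset */
#define UART_OMAP_EFR2			0x23	/* assumed, from the mainline driver */
#define UART_OMAP_EFR2_TIMEOUT_BEHAVE	(1u << 6)	/* assumed, from the mainline driver */
#define UART_OMAP_TO_L			0x26	/* from the patch above */
#define UART_OMAP_TO_H			0x27	/* from the patch above */

/* Hypothetical stand-in for the UART register file. */
struct fake_uart {
	uint8_t regs[0x40];
	uint8_t efr2_at_iir_read;	/* register snapshot taken when IIR is read */
	uint8_t to_h_at_iir_read;
	uint8_t to_l_at_iir_read;
	int iir_reads;
};

static uint8_t fake_in(struct fake_uart *u, int off)
{
	if (off == UART_IIR) {
		/* Record what was programmed at the moment of the IIR read. */
		u->efr2_at_iir_read = u->regs[UART_OMAP_EFR2];
		u->to_h_at_iir_read = u->regs[UART_OMAP_TO_H];
		u->to_l_at_iir_read = u->regs[UART_OMAP_TO_L];
		u->iir_reads++;
	}
	return u->regs[off];
}

static void fake_out(struct fake_uart *u, int off, uint8_t val)
{
	u->regs[off] = val;
}

/* Same access order as the patch: save, max timeout, set behave bit, read IIR, restore. */
void clear_rx_timeout(struct fake_uart *u)
{
	uint8_t efr2 = fake_in(u, UART_OMAP_EFR2);
	uint8_t timeout_h = fake_in(u, UART_OMAP_TO_H);
	uint8_t timeout_l = fake_in(u, UART_OMAP_TO_L);

	fake_out(u, UART_OMAP_TO_H, 0xFF);
	fake_out(u, UART_OMAP_TO_L, 0xFF);
	fake_out(u, UART_OMAP_EFR2, UART_OMAP_EFR2_TIMEOUT_BEHAVE);
	fake_in(u, UART_IIR);
	fake_out(u, UART_OMAP_EFR2, efr2);
	fake_out(u, UART_OMAP_TO_H, timeout_h);
	fake_out(u, UART_OMAP_TO_L, timeout_l);
}
```

    The point of the model is that the IIR read happens while the timeout is at its maximum and the behave bit is set, and that EFR2 and both timeout registers hold their original values again afterwards.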
    

  • Hi Bin,

    Thanks for the solution. After applying the patch, I observed no interrupt storms during my stress tests. This seems to be the right solution.

    I will run additional tests to confirm that the data is sent correctly and that nothing is lost, and then post the results here.

    Thanks,

    Sean

  • Hi Sean,

    I am glad the issue is resolved. Thanks for the update.

    Looking forward to your final test result.

    By the way, this kernel patch is already in the SDK10.0 kernel.

  • Hi Bin,

    I have conducted extensive tests to verify that the data is transmitted correctly and that no data loss occurs.

    Thanks again for your help.

    Sean

  • Hi Sean,

    Great news! Thanks for the update.