
Spurious irq 95: 0xffffffdf, please flush posted write for irq 37 and Linux kernel thread "softirq" is consuming very high percentage of CPU - 95%

Dear TI/Forum members

  Hardware: OMAP 3503

  Software: Linux kernel v2.6.29 (from TI PSP linux-02.01.03.11)

  Ethernet: SMSC LAN9220

  Problem: Occasionally, when we restart the eth0 interface after changing the IPv4 configuration, we run into two issues.

        Issue 1: The message "Spurious irq 95: 0xffffffdf, please flush posted write for irq 37" is logged by the kernel (seen via the dmesg command). It appears 2 to 5 times and no more.

        Issue 2: The Linux kernel thread "softirq" consumes a very high percentage of CPU (95%), and any operation on the system becomes very slow. The only way to recover is to hard reboot the system.

Has anyone faced these two issues? Any pointers to resolve them would be great.

 

thanks

Pads

 

  • There have been at least 3 major PSP releases since 02.01.03.11.

    Do you have any specific reason to stay on the 2.6.29 kernel?

  • Thank you so much for the fast response, Sanjeev.

    >> Do you have any specific reason to stay on the 2.6.29 kernel?

    We are on v2.6.29 (PSP 02.01.03.11) because we have about 40+ kernel patches for our custom hardware based on OMAP 3503, and it will take quite some time before we can migrate to the latest release from TI.

    Can you please let me know if this problem is resolved in a newer release of the kernel? If yes, which release?

    Is this a bug in the softirq kernel thread that is fixed in a newer release?

     

    thanks

    Pads

  • Resmi said:
    Can you please let me know if this problem is resolved in a newer release of the kernel? If yes, which release?


    Can't point to a specific release, but here is the patch that should be fixing it.
    https://patchwork.kernel.org/patch/18244/
    Resmi said:
    Is this a bug in the softirq kernel thread that is fixed in a newer release?

    None that I am aware of; but you may want to check the changelog to find any.
    Resmi said:
    ... because we have about 40+ kernel patches for our custom hardware based on OMAP 3503,

    Personal opinion: you should consider the effort in porting against time spent in hitting the problems and debugging and finding that it has already been fixed; and back-porting it.
    Remember this kernel version is really old.

  • Dear Sanjeev

    Thank you very much once again for the prompt response.

    >>Can't point to a specific release, but here is the patch that should be fixing it

    >>https://patchwork.kernel.org/patch/18244/

    Even with this patch, I still see the problem: the spurious interrupt message is displayed, and the ksoftirqd kernel thread's CPU usage spikes and stays at 95%.

    Below are additional details.

    ------------- dmesg output after issue has occurred ------------------------

    .....

    Beginning of smsc911x_open method
    Beginning of smsc911x_soft_reset method
    End of smsc911x_soft_reset method
    net eth0: SMSC911x/921x identified at 0xa080c000, IRQ: 172
    End of smsc911x_open method
    ADDRCONF(NETDEV_UP): eth0: link is not ready
    ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
    eth0: duplicate address detected!
    Beginning of smsc911x_stop method
    Spurious irq 95: 0xffffffdf, please flush posted write for irq 37
    Spurious irq 95: 0xffffffdf, please flush posted write for irq 37        <-------------- see note below

    .......

    ----------------------------  end -------------------------------------------------------

    Note: the spurious irq message is printed while execution is inside the smsc911x_stop() method of the SMSC911x device driver, which is stuck in napi_disable(). The issue almost always occurs in the napi_disable() call made by smsc911x_stop() when eth0 is brought down to restart the interface.
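The spurious irq message in the dmesg output above can be picked apart mechanically. The sketch below is only an illustration and rests on assumptions that are not stated in this thread: that the hex value is the raw OMAP3 interrupt controller SIR_IRQ register, that its bits 6:0 hold the active IRQ number and bits 31:7 hold the spurious flag, and that the trailing "for irq N" is the interrupt handled just before the spurious one. The function name and dictionary keys are hypothetical.

```python
import re

# Assumption: SIR_IRQ layout is bits 6:0 = active IRQ, bits 31:7 = spurious flag.
ACTIVE_IRQ_MASK = 0x7F
SPURIOUS_SHIFT = 7

MSG_RE = re.compile(
    r"Spurious irq (\d+): (0x[0-9a-fA-F]+), please flush posted write for irq (\d+)"
)


def decode_spurious(line):
    """Return the fields of a 'Spurious irq' kernel message, or None if no match."""
    match = MSG_RE.search(line)
    if match is None:
        return None
    sir = int(match.group(2), 16)
    return {
        "reported_irq": int(match.group(1)),         # number printed after "Spurious irq"
        "sir": sir,                                  # raw register value from the message
        "active_irq_field": sir & ACTIVE_IRQ_MASK,   # low 7 bits of the register
        "spurious_flag": sir >> SPURIOUS_SHIFT,      # non-zero means flagged as spurious
        "previous_irq": int(match.group(3)),         # assumed: IRQ handled just before
    }


sample = "Spurious irq 95: 0xffffffdf, please flush posted write for irq 37"
decoded = decode_spurious(sample)
```

Under those assumptions the low 7 bits of 0xffffffdf are 0x5f, which matches the reported irq 95, and the "previous" irq 37 is the gp timer line in /proc/interrupts.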

    Below is the output of /proc/interrupts before and after the issue occurred.

    ------------------------ Before issue :  /proc/interrupts ----------------------------------------

               CPU0
     11:          0        INTC  prcm
     12:       4958        INTC  DMA
     24:          0        INTC  Omap 3 Camera ISP
     25:          0        INTC  OMAP DSS
     37:      49895        INTC  gp timer                     <------------------------this is low here
     56:        586        INTC  i2c_omap
     57:        175        INTC  i2c_omap
     61:         17        INTC  i2c_omap
     65:          0        INTC  omap_mcspi_isr
     72:          3        INTC  serial idle
     73:        448        INTC  serial idle
     74:       1649        INTC  serial idle, serial
     83:      15607        INTC  mmc0
     92:        170        INTC  musb_hdrc
     93:         63        INTC  musb_hdrc
    172:        882        GPIO  eth0
    174:          0        GPIO  maintenance_reset
    176:          0        GPIO  cpld-power
    369:          0     twl4030  twl4030_keypad
    378:          0     twl4030  twl4030_usb
    384:          0     twl4030  mmc0
    Err:          0
    -----------------------------------------end --------------------------------------

    and

    ----------------- After issue: /proc/interrupts -------------------------------

               CPU0
     11:          0        INTC  prcm
     12:       8574        INTC  DMA
     24:          0        INTC  Omap 3 Camera ISP
     25:          0        INTC  OMAP DSS
     37:     183364        INTC  gp timer                    <------------------------this is very high here
     56:       1084        INTC  i2c_omap
     57:        178        INTC  i2c_omap
     61:         17        INTC  i2c_omap
     65:          0        INTC  omap_mcspi_isr
     72:         11        INTC  serial idle
     73:        448        INTC  serial idle
     74:       3607        INTC  serial idle, serial
     83:      22277        INTC  mmc0
     92:        635        INTC  musb_hdrc
     93:        528        INTC  musb_hdrc
    172:       2305        GPIO  eth0
    174:          0        GPIO  maintenance_reset
    176:          0        GPIO  cpld-power
    369:          0     twl4030  twl4030_keypad
    378:          0     twl4030  twl4030_usb
    384:          0     twl4030  mmc0
    Err:          0
    ----------------------------------------------------------------------------
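Comparing the two /proc/interrupts snapshots by eye is error-prone, so a small script can compute the per-IRQ increase instead. This is a minimal sketch, assuming a single CPU column as in the listings above; the function names are hypothetical, and the sample strings contain only four rows copied from those listings.

```python
def parse_interrupts(text):
    """Map each IRQ label to (count, name), assuming a single CPU column."""
    counts = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip the "CPU0" header and anything without a numeric count.
        if not sep or not fields or not fields[0].isdigit():
            continue
        # fields[1] is the controller (INTC, GPIO, ...); the rest is the name.
        counts[label.strip()] = (int(fields[0]), " ".join(fields[2:]))
    return counts


def interrupt_deltas(before_text, after_text):
    """Return (delta, irq, name) tuples, largest increase first."""
    before = parse_interrupts(before_text)
    after = parse_interrupts(after_text)
    rows = [
        (after[irq][0] - before[irq][0], irq, after[irq][1])
        for irq in after
        if irq in before
    ]
    return sorted(rows, reverse=True)


BEFORE = """\
           CPU0
 12:       4958        INTC  DMA
 37:      49895        INTC  gp timer
 83:      15607        INTC  mmc0
172:        882        GPIO  eth0
"""

AFTER = """\
           CPU0
 12:       8574        INTC  DMA
 37:     183364        INTC  gp timer
 83:      22277        INTC  mmc0
172:       2305        GPIO  eth0
"""

deltas = interrupt_deltas(BEFORE, AFTER)
```

Note that the gp timer count also grows with uptime, so its delta is only meaningful relative to the time elapsed between the two snapshots.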

    My observation is that whenever the eth0 interface is restarted (i.e. stopped and started), the problem appears during the stop sequence. In the Ethernet device driver (drivers/net/smsc911x.c), smsc911x_stop() successfully invokes netif_stop_queue(dev) and then calls napi_disable(&pdata->napi). This is when the spurious irq message is printed and the kernel threads mmcqd and ksoftirqd suddenly start consuming a high percentage of CPU; in top I see 98.9% usage by the softirq thread.

    Any pointers or suggestions to resolve this are greatly appreciated.
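To put a number on the softirq load described above without relying on top, the aggregate "cpu" line of /proc/stat can be sampled twice and the softirq share of the elapsed ticks computed. This is a rough sketch, assuming the usual field order (user, nice, system, idle, iowait, irq, softirq, steal); whether time spent in the ksoftirqd thread is counted in the softirq column may depend on the kernel version. The function names are hypothetical and the two sample lines are made-up numbers, not measurements from this system.

```python
SOFTIRQ_INDEX = 6   # user, nice, system, idle, iowait, irq, softirq, steal
FIELDS_USED = 8     # ignore any later fields (e.g. guest) when summing


def cpu_ticks(stat_text):
    """Return the tick counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(value) for value in parts[1:1 + FIELDS_USED]]
    raise ValueError("no aggregate 'cpu' line found")


def softirq_share(stat_before, stat_after):
    """Percentage of elapsed CPU ticks spent in softirq between two samples."""
    before = cpu_ticks(stat_before)
    after = cpu_ticks(stat_after)
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total == 0 or len(deltas) <= SOFTIRQ_INDEX:
        return 0.0
    return 100.0 * deltas[SOFTIRQ_INDEX] / total


# Hypothetical samples for illustration only.
SAMPLE_BEFORE = "cpu  100 0 50 800 10 5 35 0 0\n"
SAMPLE_AFTER = "cpu  110 0 60 810 10 5 1005 0 0\n"

share = softirq_share(SAMPLE_BEFORE, SAMPLE_AFTER)
```

On a live system the two inputs would be the contents of /proc/stat read a few seconds apart, once before and once after restarting eth0.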

    >>Personal opinion: you should consider the effort in porting against time spent in hitting the problems and debugging and finding that it has already been fixed; and

    >>back-porting it. Remember this kernel version is really old.

    I completely agree with you. As of last evening, in parallel with finding a fix for the above issue, I have started porting our patches to the latest release.

    By the way, do you recommend moving to the latest release or the one before it? I am concerned about stability and need your expert input to decide.

     

    Thank you,

    Pads

  • Any updates?

  • I did mention earlier that this is a really old version. You can check the latest PSP updates at this URL:

    http://arago-project.org/git/projects/?p=linux-omap3.git;a=summary

    You may want to selectively backport patches. I have already pointed to a specific patch, but there could be other associated, though not directly related, patches that helped fix the issue(s).