Linux Kernel 3.8 Bug "scheduling while atomic" using tun module

Hello,

I am using a Gumstix Duovero, which uses the OMAP 4430 processor, and I have recently updated to Linux Kernel 3.8.0. I was having a similar problem with kernel 3.6. The upgrade to 3.8 seemed to make it more stable, although I can still reproduce the issue at will.

My application uses the tun kernel module pretty heavily to push data back and forth. It appears to be stable until I attempt to play a video on the system which causes it to lock up and the syslog keeps dumping out:

BUG: scheduling while atomic: swapper
Modules linked in: tun ipv6 (and one more that I can't seem to remember off hand)

Followed by a stack trace. The trace isn't very helpful as it is only 3 items deep with two of them being schedule_bug and schedule and the third being an unknown reference. I will post the full trace later today as I don't currently have it available.

Do any of you have any tips on debugging an issue like this? I haven't started enabling any of the kernel debugging features but that is the next step. You will have to forgive me as I am relatively new at this.  I can also post my board file and any other modifications that I have made to the 3.8 kernel to fix bugs I was seeing if you think it will be helpful.  

  • Here is what prints out in the syslog:

    BUG: scheduling while atomic: swapper/1/0/0xffff0000
    Modules linked in: tun libcomposite ipv6
    [<c0014e4c>] (unwind_backtrace+0x0/0x11c) from [<c03a0720>] (__schedule_bug+0x48/0x5c)
    [<c03a0720>] (__schedule_bug+0x48/0x5c) from [<c03a536c>] (__schedule+0x68/0x6e0)
    [<c03a536c>] (__schedule+0x68/0x6e0) from [<c000ef08>] (cpu_idle+0xe4/0xfc)
    [<c000ef08>] (cpu_idle+0xe4/0xfc) from [<8039bea8>] (0x8039bea8)
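
    As a rough aid to reading the header line of that report, here is a minimal userspace sketch in C that splits it into task name, PID and preempt count, then breaks the count into its fields. It assumes the 3.x message format (comm/pid/0xcount) and the bit layout from include/linux/hardirq.h of that era (bits 0-7 preempt depth, 8-15 softirq, 16-25 hardirq, bit 26 NMI); both are assumptions to check against your own kernel tree. The names parse_bug_line, decode_preempt_count and struct preempt_fields are hypothetical helpers, not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed field layout (3.x era include/linux/hardirq.h); verify for your tree. */
struct preempt_fields {
    unsigned preempt_depth;  /* bits 0-7  : preempt_disable() nesting */
    unsigned softirq_count;  /* bits 8-15 : softirq nesting           */
    unsigned hardirq_count;  /* bits 16-25: hardirq nesting           */
    unsigned nmi;            /* bit 26    : inside an NMI             */
};

/* Split a raw preempt count into the fields above. */
struct preempt_fields decode_preempt_count(uint32_t raw)
{
    struct preempt_fields f;

    f.preempt_depth = raw & 0xffu;
    f.softirq_count = (raw >> 8) & 0xffu;
    f.hardirq_count = (raw >> 16) & 0x3ffu;
    f.nmi           = (raw >> 26) & 0x1u;
    return f;
}

/*
 * Parse "BUG: scheduling while atomic: comm/pid/0xcount".
 * The task name may itself contain '/' (e.g. "swapper/1"), so the
 * line is split from the right. Returns 0 on success, -1 otherwise.
 */
int parse_bug_line(const char *line, char *comm, size_t comm_len,
                   int *pid, uint32_t *count)
{
    static const char prefix[] = "BUG: scheduling while atomic: ";
    const char *p, *last, *mid;
    size_t n;

    assert(line && comm && pid && count);

    p = strstr(line, prefix);
    if (!p)
        return -1;
    p += strlen(prefix);

    last = strrchr(p, '/');            /* slash before the hex count */
    if (!last || last == p)
        return -1;

    mid = last - 1;                    /* walk back to the slash before the pid */
    while (mid > p && *mid != '/')
        mid--;
    if (*mid != '/' || mid == p)
        return -1;

    n = (size_t)(mid - p);
    if (n >= comm_len)
        return -1;
    memcpy(comm, p, n);
    comm[n] = '\0';

    *pid = (int)strtol(mid + 1, NULL, 10);
    *count = (uint32_t)strtoul(last + 1, NULL, 16);
    return 0;
}
```

    Fed the line above, this yields task "swapper/1" (the idle task of CPU 1), PID 0, and a count of 0xffff0000. Under the assumed layout the preempt and softirq fields are zero while every higher bit is set, which does not look like ordinary lock or interrupt nesting.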

  • I found a kernel config option called CONFIG_DEBUG_ATOMIC_SLEEP, which is meant for debugging sleeps inside atomic sections. I thought it would be incredibly useful, but it actually masks the error. Now when the failure happens, the network device goes down and loses its IP address. An ifdown followed by an ifup seems to put it back in a good state where I can work again, but I would rather not have the error happen in the first place.

    I used ethtool and it indicates that there is no link detected after the failure occurs. I can provide further information if needed. Also, if you know of a better place to address this issue, please let me know.

  • Interestingly enough I have found out a bit more about this issue and we can take the tun module out of it completely. I can recreate the problem at will by typing the following command:

    ping -s 1400 -f [someknownipaddress]

    The flood test option of ping seems to take it down in under 5 minutes. I still have no idea where to go from here but at least I have made the recreation of the problem much simpler.

  • Hi Alex.

    The scheduling while atomic bug indicates that a routine is attempting to acquire a resource that isn't available and the OS is therefore calling schedule().  And that routine is something that should never call schedule, such as an interrupt handler.

    When the stack dump shows the modules that are linked in, it means all the ones inserted on the machine, not just the ones involved in the stack leading to the problem.

    I've generally gotten a more complete stack dump when experiencing this problem.  Since your ping test suggests that it might be related to IP, I'd start off by looking at the ISR for your network (Ethernet?) interface.  Does the ISR or any routine it calls look like it can block, for example by trying to obtain a mutex?

    Sometimes resolving the problem is as simple as using request_threaded_irq() instead of request_irq() in the device driver.  That way the ISR runs as a thread that can call routines that will invoke schedule().  But depending on your kernel config, request_irq() may already be mapped to request_threaded_irq().

    Regards,

        Steve
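
    To make the request_threaded_irq() suggestion above concrete, here is a minimal kernel-module sketch in C. It is not taken from the smsc911x driver or from anyone's code in this thread; the module parameter demo_irq, the mutex demo_lock and the handler demo_thread_fn are hypothetical names for illustration. It only builds against a kernel source tree, not as a userspace program. Passing NULL as the primary handler means only the threaded handler runs, and in that case the kernel expects IRQF_ONESHOT so the line stays masked until the thread has finished.

```c
#include <linux/module.h>
#include <linux/interrupt.h>
#include <linux/mutex.h>

/* Hypothetical: IRQ number supplied at load time, e.g. insmod demo.ko demo_irq=42 */
static int demo_irq = -1;
module_param(demo_irq, int, 0444);

static DEFINE_MUTEX(demo_lock);

/*
 * Threaded handler: runs in a kernel thread (process context),
 * so calls that may sleep, such as mutex_lock(), are allowed here.
 */
static irqreturn_t demo_thread_fn(int irq, void *dev_id)
{
	mutex_lock(&demo_lock);
	/* ... service the device ... */
	mutex_unlock(&demo_lock);
	return IRQ_HANDLED;
}

static int __init demo_init(void)
{
	if (demo_irq < 0)
		return -EINVAL;

	/* No hard-IRQ handler (NULL); all work happens in demo_thread_fn. */
	return request_threaded_irq(demo_irq, NULL, demo_thread_fn,
				    IRQF_ONESHOT, "demo_threaded", &demo_lock);
}

static void __exit demo_exit(void)
{
	free_irq(demo_irq, &demo_lock);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```

    The same mutex_lock() call inside a handler registered with plain request_irq() would run in hard interrupt context, which is the kind of situation that can end in a "scheduling while atomic" report.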

  • Hi Steve,

    Your reply was very informative. I have been poking around and here is what I have found.

    The ethernet device that I am using is the smsc911x. Here is the source:

    http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/smsc/smsc911x.c

    The ISR is registered at line 2464 with the request_irq function. Are you saying that if this problem is occurring that it could be fixed by changing to a request_threaded_irq function? It looks like this would require a partial rewrite of the source to create a thread that handles requests, correct?

    The ISR starts at line 1762. The only questionable calls that I see are smsc911x_rx_multicast_workaround, netif_wake_queue and napi_schedule[_prep]. The last two seem to be called in other ethernet ISRs so I figure they are safe and the first one is written specifically to be run in interrupt context.

    I tried reverting back to an earlier kernel version in 3.6 and I see the same problem where the interface goes down and loses its IP address. I think I'll add some debug to the driver routines to see if I can catch where it is crashing since I don't have the sleeping while atomic bug anymore.

  • Hi Alex.

    This will be moot if you are using the older kernel that doesn't have the scheduling bug, but...

    It is not necessary to redo the device driver if you change to request_threaded_irq().  The OS takes care of the details for you.  The difference is the _context_ that the handler gets called from (a kernel thread instead of interrupt context).

    One potential issue is that the driver may need interrupt context for the handler, for example because it needs to have interrupts disabled.  But this is relatively rare these days in my experience.

    Regards,

        Steve

  • Hi Steve,

    It appears I have found a workaround to my problem. I'm not sure why it started off as a sleeping while atomic problem, but it ended up as a bad hardware/software interaction perhaps due to a bad bus timing figure.

    Here is a much better summary of the problem that was occurring:

    http://gumstix.8.n6.nabble.com/Duovero-ethernet-dropping-connection-td4966633.html

    I appreciate all of the help you have given me to narrow down the problem and characterize it better.

    Thanks again,

    Alex

  • Hi Steve and Alex,

    I am observing the same problem with my DuoVero board

    BUG: scheduling while atomic: swapper/1/0/0xffff0000 

    but the workaround described in the link didn't help.

    Any other ideas?

    Regards,

    Vlad

  • Hi Vlad.

    Did you get a stack dump when the bug occurred?  It should show what driver called a routine which, in turn, called schedule().  From that, you can look at what context you are in and why schedule() is not appropriate.  Interrupt context is a common source for this problem, but you might be seeing another.

        Steve

  • Hi Vlad,

    I am not 100% sure on this, but I ran into this problem a few times and it seemed to resolve itself when I rebuilt the kernel and kernel modules then redeployed it to my flash card. I wasn't able to 100% reproduce this issue from build to build, but it popped up randomly a time or two. I can give you my defconfig for my 3.8 kernel if you want to compare it with yours from what I am assuming is the 3.6 kernel. I would try the rebuild and redeploy first though.

    Perhaps someone on here can shed some light on why this would appear to fix the problem. For a little more information, I had a .config that I was using, I made a backup copy and turned on the sleeping while atomic debug config option, rebuilt and redeployed. I then noticed the behavior go away so I  restored the backup copy and rebuilt and redeployed and I didn't see the issue again.

    Hope this helps!

  • Thank you for the prompt reply guys. The call stack is very similar to the one Alex initially posted:

    Disabling lock debugging due to kernel taint
    Init 8000sx
    8000sx major 249
    BUG: scheduling while atomic: swapper/1/0/0xffff0000
    Modules linked in: 8000sx(PO)
    [<c0014bc8>] (unwind_backtrace+0x0/0x11c) from [<c048ee6c>] (__schedule_bug+0x48/0x5c)
    [<c048ee6c>] (__schedule_bug+0x48/0x5c) from [<c0494c3c>] (__schedule+0x68/0x784)
    [<c0494c3c>] (__schedule+0x68/0x784) from [<c000f044>] (cpu_idle+0x100/0x11c)
    [<c000f044>] (cpu_idle+0x100/0x11c) from [<8048af34>] (0x8048af34)
    BUG: scheduling while atomic: swapper/1/0/0xffff0000
    Modules linked in: 8000sx(PO)
    [<c0014bc8>] (unwind_backtrace+0x0/0x11c) from [<c048ee6c>] (__schedule_bug+0x48/0x5c)
    [<c048ee6c>] (__schedule_bug+0x48/0x5c) from [<c0494c3c>] (__schedule+0x68/0x784)
    [<c0494c3c>] (__schedule+0x68/0x784) from [<c000f044>] (cpu_idle+0x100/0x11c)
    [<c000f044>] (cpu_idle+0x100/0x11c) from [<8048af34>] (0x8048af34)
    BUG: scheduling while atomic: swapper/1/0/0xffff0000
    Modules linked in: 8000sx(PO)
    ....


    and this fragment continuously repeats.


    The 8000sx is our driver, which is very simple. I initially suspected it, but after reviewing it several times, I couldn't spot a problem. This driver has been running on x86 platform for many years with no issues. I also have done minor kernel and uboot modifications to support our hardware (downloading and interfacing custom FPGA).
    The problem occurs randomly when I run our user space application, which makes the environment a bit more complex to analyse. I wasn't able to reproduce the issue with the ping test Alex suggested. I also don't observe any network problems (our application uses the network interface heavily).


    Yesterday I rebuilt the kernel (3.6) with the "Sleep inside atomic section checking (DEBUG_ATOMIC_SLEEP)" option and I observed some messages on the console that might link the problem to our driver. I need to investigate it a bit further. It is possible that the issue is not related to the one discussed in this thread.

  • Hi ,

    I am also getting the "scheduling while atomic: swapper" issue, in Linux 2.6.10.

    The call stack is copied here. Can someone help me with this?

      BUG: scheduling while atomic: swapper/0xefff0000/0
      caller is schedule+0x100/0x13c
      Call Trace:
     [<803042a8>] schedule+0x100/0x13c
     [<80303694>] __schedule+0xc0/0xb0c
     [<82550000>] ip_auto_config_setup+0x214/0x230
     [<80104100>] cpu_idle+0x58/0x60
     [<803042a8>] schedule+0x100/0x13c
     [<80104100>] cpu_idle+0x58/0x60
     [<82550000>] ip_auto_config_setup+0x214/0x230
     [<82550000>] ip_auto_config_setup+0x214/0x230
     [<80104100>] cpu_idle+0x58/0x60
     [<801040f8>] cpu_idle+0x50/0x60
     [<82534774>] start_kernel+0x230/0x244
     [<82534764>] start_kernel+0x220/0x244
     [<8253412c>] unknown_bootoption+0x0/0x2a

  • In my case the problem turned out to be in old proprietary module code. I had to change spin_lock() to spin_lock_irqsave() and spin_unlock() to spin_unlock_irqrestore().


    Hope this helps.

    Regards,
    Vlad
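
    For readers unfamiliar with the locking change Vlad describes, here is a minimal kernel-module sketch in C of the general pattern. It is not the 8000sx driver, and the thread does not say exactly why the change cured the report there; demo_irq, demo_lock, demo_events, demo_isr and demo_read_and_clear are hypothetical names. It only builds against a kernel source tree. The idea shown is that when a spinlock is shared between process context and an interrupt handler, the process-context side disables local interrupts while holding the lock, so the handler cannot interrupt the lock holder on the same CPU.

```c
#include <linux/module.h>
#include <linux/interrupt.h>
#include <linux/spinlock.h>

/* Hypothetical: IRQ number supplied at load time. */
static int demo_irq = -1;
module_param(demo_irq, int, 0444);

static DEFINE_SPINLOCK(demo_lock);
static unsigned int demo_events;	/* shared between ISR and process context */

/* Hard-IRQ handler: plain spin_lock() is enough in this context. */
static irqreturn_t demo_isr(int irq, void *dev_id)
{
	spin_lock(&demo_lock);
	demo_events++;
	spin_unlock(&demo_lock);
	return IRQ_HANDLED;
}

/*
 * Process-context side: spin_lock_irqsave() disables local interrupts
 * and saves the previous state in 'flags', so demo_isr() cannot run on
 * this CPU while the lock is held.
 */
static unsigned int demo_read_and_clear(void)
{
	unsigned long flags;
	unsigned int n;

	spin_lock_irqsave(&demo_lock, flags);
	n = demo_events;
	demo_events = 0;
	spin_unlock_irqrestore(&demo_lock, flags);
	return n;
}

static int __init demo_init(void)
{
	if (demo_irq < 0)
		return -EINVAL;
	return request_irq(demo_irq, demo_isr, 0, "demo_spin", &demo_lock);
}

static void __exit demo_exit(void)
{
	free_irq(demo_irq, &demo_lock);
	pr_info("demo_spin: %u events seen\n", demo_read_and_clear());
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```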