PROCESSOR-SDK-AM437X: Linux kernel stall

Part Number: PROCESSOR-SDK-AM437X

We have a custom board with an AM437X running TI Linux 4.14.79.

This problem happens quite often, but we don't have a reproducible scenario: suddenly, all processes and work queues stop running.

We can trigger SysRq and drop into kgdb over the serial console, where all threads are in context_switch:

  Id   Target Id         Frame

* 1    Thread 4294967294 (shadowCPU0) 0xc01ae060 in arch_kgdb_breakpoint () at kernel/debug/debug_core.c:1071

  2    Thread 1 (init)   context_switch (rf=<optimized out>, next=<optimized out>, prev=<optimized out>, rq=<optimized out>) at kernel/sched/core.c:2811

  3    Thread 2 (kthreadd) context_switch (rf=<optimized out>, next=<optimized out>, prev=<optimized out>, rq=<optimized out>) at kernel/sched/core.c:2811

  4    Thread 4 (kworker/0:0H) context_switch (rf=<optimized out>, next=<optimized out>, prev=<optimized out>, rq=<optimized out>) at kernel/sched/core.c:2811

One of the threads:

#0  context_switch (rf=<optimized out>, next=<optimized out>, prev=<optimized out>, rq=<optimized out>) at kernel/sched/core.c:2811

#1  __schedule (preempt=<optimized out>) at kernel/sched/core.c:3384

#2  0xc085ba70 in schedule () at kernel/sched/core.c:3428

#3  0xc085f760 in schedule_hrtimeout_range_clock (expires=0x0, delta=<optimized out>, mode=HRTIMER_MODE_ABS, clock=1) at kernel/time/hrtimer.c:1716

#4  0xc085f7f4 in schedule_hrtimeout_range (expires=<optimized out>, delta=<optimized out>, mode=<optimized out>) at kernel/time/hrtimer.c:1761

Otherwise everything looks OK: kdb/kgdb work over the serial console, but threads aren't being scheduled.

  • Hi,

    Can you try running your kernel on a TI board? Do you see the same thing there?

    It doesn't seem like you are seeing any kernel oops or panic?

    Thanks.

  • Yes, there is no kernel oops or panic; it just doesn't schedule anything anymore.

    Occasionally we've seen "INFO: rcu_preempt self-detected stall on CPU". I assume this happens when the system eventually starts scheduling again and rcu_preempt notices it wasn't scheduled for a long time. But that is rare; typically it just remains in this state.

    We are not running a particularly high load, and we are not raising thread priority beyond the default.

  • How about running your kernel on a TI board? This would help determine if there might be some hardware contribution here.

    Also, would you be able to attach a full log of a running kernel?

    Thanks.

  • We cannot run this kernel on a TI board, our custom board has a different set of hardware. But this kernel is derived from the TI SDK versions

    I'm attaching 2 logs: in stall1.log the stall was detected by rcu_preempt, and in stall2.log it wasn't detected and the system remained stuck until I did SysRq: stall.zip
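
When comparing captures like the two described above, it can help to check mechanically which logs contain the RCU stall report. The following is a minimal sketch, assuming the logs are plain-text console captures; only the message text is taken from this thread, and the function names and the rest of the line format are hypothetical.

```python
import re
import sys

# Message text as quoted in this thread. Anything before or after it on the
# line (timestamps, CPU details) is not assumed.
STALL_PATTERN = re.compile(r"INFO: rcu_preempt self-detected stall on CPU")


def find_rcu_stalls(log_text):
    """Return (line_number, line) pairs for lines that report an RCU stall."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if STALL_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


def summarize(path):
    """Print every RCU stall report found in one log file."""
    with open(path, "r", errors="replace") as handle:
        hits = find_rcu_stalls(handle.read())
    if not hits:
        print(f"{path}: no rcu_preempt stall report found")
    for number, line in hits:
        print(f"{path}:{number}: {line}")


if __name__ == "__main__":
    # Usage (file names are whatever the logs were saved as):
    #   python find_stalls.py stall1.log stall2.log
    for log_path in sys.argv[1:]:
        summarize(log_path)
```

A log with no match does not rule out a stall; as described above, the system can remain stuck without rcu_preempt ever reporting it.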

  • How did you configure your kernel options? It is recommended to start with tisdk_am437x-evm_defconfig, then adjust it based on your needs.

    Do you use the root filesystem provided in the Processor SDK package? The kernel log shows that the PM firmware am335x-pm-firmware.elf failed to load. Does this file exist in the root filesystem?

  • We build our own root filesystem. We are not loading CM3 firmware and not using any CM3 functionality.

    We also turned off CONFIG_NO_HZ and CONFIG_CPU_IDLE, but it doesn't help. Please see kernel configuration 3365.config.gz

  • I compared your kernel config with the default tisdk_am437x-evm_defconfig. There is not much difference; mainly you disabled some network-related options. I don't see anything suspicious there.

    What is the frequency of the oscillator used on your board? Do you use RTC?

    Can you test with the kernel v4.19 in the latest Processor SDK release to see if the issue still happens?
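
A comparison like the one described in this reply can be reproduced with a short script. This is a minimal sketch, not the tool that was actually used here: it assumes both inputs are full .config files (a defconfig would first have to be expanded with the kernel's make target), and the function names are hypothetical. The kernel tree also ships scripts/diffconfig for the same purpose.

```python
import gzip
import re

SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map option name -> value; options marked 'is not set' map to 'n'."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = SET_RE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = UNSET_RE.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(reference, custom):
    """Return {option: (reference_value, custom_value)} for differing options.

    An option absent from one file is treated as 'n' (an assumption that
    holds for boolean and tristate options, not for string or integer ones).
    """
    changes = {}
    for name in sorted(set(reference) | set(custom)):
        old = reference.get(name, "n")
        new = custom.get(name, "n")
        if old != new:
            changes[name] = (old, new)
    return changes


def read_config(path):
    """Read a .config file, transparently handling a .gz suffix."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return parse_config(handle.read())
```

For example, `diff_configs(read_config("reference.config"), read_config("3365.config.gz"))` would list every option whose value differs, where `reference.config` is a hypothetical name for the expanded default configuration.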

  • It is 24 MHz, and we are using the RTC:

    [    0.000000] OMAP clockevent source: timer2 at 24000000 Hz
    [    0.000013] sched_clock: 32 bits at 24MHz, resolution 41ns, wraps every 89478484971ns
    [    0.000033] clocksource: timer1: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635851949 ns
    [    0.000043] OMAP clocksource: timer1 at 24000000 Hz
    [    0.000517] clocksource: 32k_counter: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 58327039986419 ns
    [    0.000528] OMAP clocksource: 32k_counter at 32768 Hz

    Thanks
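
The sched_clock figures in the log above can be sanity-checked with simple arithmetic. The sketch below assumes a free-running 32-bit counter clocked at 24 MHz; the function names are hypothetical. The "wraps every" value in the log is close to half of the full 32-bit period, which would be consistent with the kernel reporting a conservative margin rather than the raw rollover time, but that interpretation is an assumption here.

```python
NSEC_PER_SEC = 1_000_000_000


def tick_resolution_ns(rate_hz):
    """Whole nanoseconds per counter tick (truncated, as in the log line)."""
    return NSEC_PER_SEC // rate_hz


def wrap_period_ns(rate_hz, counter_bits):
    """Nanoseconds until a free-running counter of this width rolls over."""
    return (1 << counter_bits) * NSEC_PER_SEC // rate_hz


if __name__ == "__main__":
    rate = 24_000_000  # timer rate reported in the log
    full = wrap_period_ns(rate, 32)
    print("resolution (ns):", tick_resolution_ns(rate))  # 41
    print("full 32-bit wrap (ns):", full)
    print("half of that (ns):", full // 2)
```

The full period works out to roughly 179 seconds and half of it to roughly 89.5 seconds, within a few hundred nanoseconds of the logged 89478484971 ns.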

  • Mikhail Shoykher said:
    it is 24 MHz, and we are using RTC:

    Okay, this is the same as on the EVM, so no software changes are needed.

    Can you test with the kernel v4.19 in the latest Processor SDK release to see if the issue still happens?

  • Mikhail Shoykher said:
    We build our own root filesystem. We are not loading CM3 firmware and not using any CM3 functionality.

    Please add the CM3 PM firmware am335x-pm-firmware.elf to your root filesystem under /lib/firmware/ and test again to see if the issue still happens. This firmware is required for all idle low-power modes, so I am not sure how the system behaves if the firmware is absent; we don't validate Linux without the firmware.

  • Thank you

    We are now running tests with 4.19 and the CM3 firmware.