AM3352: Kernel stall issue

Part Number: AM3352

Hello,

We are looking for advice on troubleshooting a stall seen in Linux on our AM3352 board.  We are testing with kernel v4.19 from PROCESSOR-SDK-LINUX-AM335 v06.01 (20-Oct-2019) to match the TI release, but we are also seeing the stalls with kernel v4.14.

Sometimes the board runs for a day with no stall, and sometimes it hangs after just an hour or two of runtime.

When the stall occurs, we are sending and receiving a lot of serial and Ethernet data.

After the stall, all of the normal I/O interfaces, including the serial ports, are unresponsive, so we have to read the __log_buf over an XDS110 to capture the stall info.  JTAG seems to be our only debug path after the stall.
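
For anyone repeating this, the sketch below is a minimal user-space parser for a raw __log_buf dump pulled over JTAG.  It assumes the v4.19-era struct printk_log record layout from kernel/printk/printk.c (kernels before the 5.10 lockless ringbuffer); the dump file handling and the simple stop-at-wrap behavior are our own simplifications.

/* logbuf_parse.c - walk struct printk_log records in a raw __log_buf dump.
 * Layout assumed from v4.19 kernel/printk/printk.c; verify against your tree.
 * Build: gcc -o logbuf_parse logbuf_parse.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

struct printk_log {           /* 16 bytes, no padding, in v4.19 */
    uint64_t ts_nsec;         /* timestamp in nanoseconds */
    uint16_t len;             /* total record length, including padding */
    uint16_t text_len;        /* length of the message text */
    uint16_t dict_len;        /* length of the appended key/value dictionary */
    uint8_t  facility;
    uint8_t  flags_level;     /* flags:5 and level:3 packed in one byte */
};

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <logbuf-dump-file>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    uint8_t *buf = malloc(size);
    if (!buf || fread(buf, 1, size, f) != (size_t)size) {
        perror("read");
        return 1;
    }
    fclose(f);

    long off = 0;
    while (off + (long)sizeof(struct printk_log) <= size) {
        struct printk_log msg;
        memcpy(&msg, buf + off, sizeof(msg));
        /* len == 0 marks the wrap point; a full tool would restart at
         * offset 0 and use log_first_idx from the kernel to order records. */
        if (msg.len < sizeof(msg) ||
            off + (long)sizeof(msg) + msg.text_len > size)
            break;
        printf("[%8llu.%06llu] %.*s\n",
               (unsigned long long)(msg.ts_nsec / 1000000000ULL),
               (unsigned long long)(msg.ts_nsec % 1000000000ULL) / 1000,
               (int)msg.text_len,
               (const char *)(buf + off + sizeof(struct printk_log)));
        off += msg.len;
    }
    free(buf);
    return 0;
}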

I am not seeing enough information in this log to fault-isolate the failure, so we are setting up to capture a program trace using the ETB.  I am sharing this log with the group in case anyone has insight from similar debugging experience.

Please send your thoughts/feedback/advice.  Thanks!  Chris Norris

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 0-...!: (1 ticks this GP) idle=8fe/1/0x40000004 softirq=69056/69056 fqs=0
rcu: (t=4682 jiffies g=181829 q=198)
rcu: rcu_preempt kthread starved for 4682 jiffies! g181829 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
rcu: RCU grace-period kthread stack dump:
rcu_preempt I 0 10 2 0x00000000
Backtrace:
[<c0925f38>] (__schedule) from [<c0926700>] (schedule+0x58/0xc4)
r10:c0e144c4 r9:c0e14ac0 r8:c0e14aa0 r7:db87dedc r6:c0e03048 r5:c0e14ac0
r4:ffffe000
[<c09266a8>] (schedule) from [<c092a348>] (schedule_timeout+0x194/0x284)
r5:c0e14ac0 r4:00014e0f
[<c092a1b4>] (schedule_timeout) from [<c0180a64>] (rcu_gp_kthread+0x76c/0xc28)
r9:c0e143f4 r8:00000005 r7:ffffe000 r6:c0e14aa0 r5:00000001 r4:db87c000
[<c01802f8>] (rcu_gp_kthread) from [<c0148a60>] (kthread+0x158/0x160)
r7:db87c000
[<c0148908>] (kthread) from [<c01010e8>] (ret_from_fork+0x14/0x2c)
Exception stack(0xdb87dfb0 to 0xdb87dff8)
dfa0: 00000000 00000000 00000000 00000000
dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c0148908
r4:db82ec00
Task dump for CPU 0:
myApp R running task 0 1242 683 0x00000002
Backtrace:
[<c010beb0>] (dump_backtrace) from [<c010c198>] (show_stack+0x18/0x1c)
r7:c0e143f4 r6:000002ab r5:c0e03048 r4:da8f2a00
[<c010c180>] (show_stack) from [<c014e824>] (sched_show_task.part.1+0xd4/0x104)
[<c014e750>] (sched_show_task.part.1) from [<c0153658>] (dump_cpu_task+0x38/0x3c)
r6:c0e03104 r5:00000000 r4:c0e143f4
[<c0153620>] (dump_cpu_task) from [<c0182254>] (rcu_dump_cpu_stacks+0x90/0xd0)
[<c01821c4>] (rcu_dump_cpu_stacks) from [<c0181abc>] (rcu_check_callbacks+0x608/0x8d0)
r10:c0e14aa0 r9:00000000 r8:c0e030f8 r7:c0e143f4 r6:c0e14814 r5:c0e14880
r4:c0e143f4 r3:16a3dd13
[<c01814b4>] (rcu_check_callbacks) from [<c01874f0>] (update_process_times+0x3c/0x64)
r10:200b0193 r9:c0197cfc r8:00000000 r7:00000117 r6:00000000 r5:da8f2a00
r4:ffffe000
[<c01874b4>] (update_process_times) from [<c0197ad4>] (tick_sched_handle+0x5c/0x60)
r7:00000117 r6:dd4d57c1 r5:da969eb0 r4:c0e16500
[<c0197a78>] (tick_sched_handle) from [<c0197d4c>] (tick_sched_timer+0x50/0xac)
[<c0197cfc>] (tick_sched_timer) from [<c0188158>] (__hrtimer_run_queues.constprop.4+0x18c/0x224)
r7:ffffe000 r6:c0e15c00 r5:c0e16500 r4:c0e15c40
[<c0187fcc>] (__hrtimer_run_queues.constprop.4) from [<c01888f8>] (hrtimer_interrupt+0xec/0x2fc)
r10:7fffffff r9:c0e15d30 r8:ffffffff r7:00000003 r6:200b0193 r5:ffffe000
r4:c0e15c00
[<c018880c>] (hrtimer_interrupt) from [<c0119f90>] (omap2_gp_timer_interrupt+0x30/0x38)
r10:c0e4c3fe r9:db807100 r8:00000010 r7:da969e08 r6:00000000 r5:00000010
r4:c0e07f80
[<c0119f60>] (omap2_gp_timer_interrupt) from [<c016f660>] (__handle_irq_event_percpu+0x6c/0x140)
[<c016f5f4>] (__handle_irq_event_percpu) from [<c016f768>] (handle_irq_event_percpu+0x34/0x84)
r10:00001cc4 r9:da968000 r8:db808000 r7:da969fb0 r6:db807100 r5:00000010
r4:c0e03048
[<c016f734>] (handle_irq_event_percpu) from [<c016f81c>] (handle_irq_event+0x64/0x90)
r6:00000000 r5:00000010 r4:db807100
[<c016f7b8>] (handle_irq_event) from [<c0172f50>] (handle_level_irq+0xc0/0x164)
r5:00000010 r4:db807100
[<c0172e90>] (handle_level_irq) from [<c016e7c8>] (generic_handle_irq+0x2c/0x3c)
r5:00000010 r4:c0e4abb0
[<c016e79c>] (generic_handle_irq) from [<c016ef88>] (__handle_domain_irq+0x5c/0xb0)
[<c016ef2c>] (__handle_domain_irq) from [<c0452dfc>] (omap_intc_handle_irq+0x3c/0x98)
r9:da968000 r8:db808000 r7:da969ee4 r6:ffffffff r5:200b0113 r4:c0e7f17c
[<c0452dc0>] (omap_intc_handle_irq) from [<c0101a0c>] (__irq_svc+0x6c/0xa8)
Exception stack(0xda969eb0 to 0xda969ef8)
9ea0: 00404040 c0a02a3c c0e4c1c0 00000000
9ec0: 00000040 00000014 00000000 00000000 db808000 00211c48 00001cc4 da969f5c
9ee0: da969f60 da969f00 c012e3dc c0102224 200b0113 ffffffff
r5:200b0113 r4:c0102224
[<c0102180>] (__do_softirq) from [<c012e3dc>] (irq_exit+0x114/0x118)
r10:00001cc4 r9:00211c48 r8:db808000 r7:00000000 r6:00000000 r5:00000014
r4:c0e4abb0
[<c012e2c8>] (irq_exit) from [<c016ef8c>] (__handle_domain_irq+0x60/0xb0)
[<c016ef2c>] (__handle_domain_irq) from [<c0452dfc>] (omap_intc_handle_irq+0x3c/0x98)
r9:00211c48 r8:10c53c7d r7:10c5387d r6:ffffffff r5:600b0030 r4:c0e7f17c
[<c0452dc0>] (omap_intc_handle_irq) from [<c0101e14>] (__irq_usr+0x54/0x80)
Exception stack(0xda969fb0 to 0xda969ff8)
9fa0: 00000000 0020fe38 ffffffff 0020fe38
9fc0: 00211c38 b6b7a7ac 00211c40 b6b7a7a4 00000000 00211c48 00001cc4 001b7c51
9fe0: 00000000 b628fbe8 0008d0b9 b6ae1926 600b0030 ffffffff
r5:600b0030 r4:b6ae1926

  • Hi Chris,

    This training that we put together might help:

    https://training.ti.com/node/1139125

    If you remove "myApp" from the system, does the problem still occur? If you run just the TI SDK, do you see any issues?

    Have you checked the power rails at the time of a stall?

    Thanks.

  • Hi RonB,

    Thank you for your suggestions.  We have not looked at the power rails yet; this product has been in service for years on a v3.14 mainline kernel, so we were not suspecting hardware.

    "myApp" is doing the heavy lifting in our system, but it is not the only process where the stall is observed.

    We enabled soft lockup detection in some recent runs.  With soft lockup detection turned on, the system is sometimes responsive enough to extract logs without JTAG.  In the most recent runs, the detector fires repeatedly in cpuidle_enter_state(), as shown in the log below (a small detector sanity-check sketch follows the log).

    [ 120.071511] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper:0]
    [ 120.078357] Modules linked in: xt_conntrack iptable_nat nf_nat_ipv4 nf_nat nf_conntrack sha256_arm sha256_generic nf_defrag_ipv6 nf_defrag_ipv4 cfg80211 libcrc32c arptable_filter arp_tables iptable_filter ip_tables ebtable_filter ebtables x_tables bridge usb_f_fs stp llc usb_f_mass_storage usb_f_acm u_serial usb_f_ecm usb_f_rndis u_ether libcomposite musb_dsps musb_hdrc udc_core usbcore phy_am335x phy_generic usb_common phy_am335x_control pm33xx wkup_m3_rproc wkup_m3_ipc remoteproc omap_aes_driver crypto_engine omap_crypto ti_emif_sram omap_sham rtc_omap musb_am335x omap_wdt overlay
    [ 120.130343] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.59-g5f8c1c6121da-dirty #4
    [ 120.137942] Hardware name: Generic AM33XX (Flattened Device Tree)
    [ 120.144085] PC is at cpuidle_enter_state+0x168/0x4a8
    [ 120.149070] LR is at cpuidle_enter_state+0x140/0x4a8
    [ 120.154054] pc : [<c07d65e8>] lr : [<c07d65c0>] psr: 60070013
    [ 120.160345] sp : c1001ec8 ip : c1001ec8 fp : c1001f0c
    [ 120.165589] r10: 68fa2975 r9 : dbe9ca00 r8 : 00000000
    [ 120.170834] r7 : 68fa5f02 r6 : 00000000 r5 : 00000016 r4 : dbeafc10
    [ 120.177387] r3 : 00000000 r2 : 05355555 r1 : 00000016 r0 : 00000000
    [ 120.183944] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
    [ 120.191108] Control: 10c5387d Table: 9aa78019 DAC: 00000051
    [ 120.196878] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.59-g5f8c1c6121da-dirty #4
    [ 120.204477] Hardware name: Generic AM33XX (Flattened Device Tree)
    [ 120.210632] [<c01119c4>] (unwind_backtrace) from [<c010d6f4>] (show_stack+0x20/0x24)
    [ 120.218419] [<c010d6f4>] (show_stack) from [<c0a08b58>] (dump_stack+0x24/0x28)
    [ 120.225691] [<c0a08b58>] (dump_stack) from [<c0109d8c>] (show_regs+0x1c/0x20)
    [ 120.232866] [<c0109d8c>] (show_regs) from [<c01d10c8>] (watchdog_timer_fn+0x1e0/0x214)
    [ 120.240835] [<c01d10c8>] (watchdog_timer_fn) from [<c019f368>] (__hrtimer_run_queues.constprop.4+0x1b0/0x3ac)
    [ 120.250799] [<c019f368>] (__hrtimer_run_queues.constprop.4) from [<c019fd28>] (hrtimer_interrupt+0xf4/0x304)
    [ 120.260680] [<c019fd28>] (hrtimer_interrupt) from [<c011e154>] (omap2_gp_timer_interrupt+0x38/0x40)
    [ 120.269773] [<c011e154>] (omap2_gp_timer_interrupt) from [<c017df3c>] (__handle_irq_event_percpu+0x84/0x26c)
    [ 120.279648] [<c017df3c>] (__handle_irq_event_percpu) from [<c017e160>] (handle_irq_event_percpu+0x3c/0x8c)
    [ 120.289345] [<c017e160>] (handle_irq_event_percpu) from [<c017e21c>] (handle_irq_event+0x6c/0x98)
    [ 120.298261] [<c017e21c>] (handle_irq_event) from [<c0181bb4>] (handle_level_irq+0xc8/0x16c)
    [ 120.306651] [<c0181bb4>] (handle_level_irq) from [<c017cfa4>] (generic_handle_irq+0x34/0x44)
    [ 120.315127] [<c017cfa4>] (generic_handle_irq) from [<c017d7e8>] (__handle_domain_irq+0x64/0xb8)
    [ 120.323866] [<c017d7e8>] (__handle_domain_irq) from [<c0102278>] (omap_intc_handle_irq+0x44/0xa0)
    [ 120.332781] [<c0102278>] (omap_intc_handle_irq) from [<c0101a0c>] (__irq_svc+0x6c/0xa8)
    [ 120.340818] Exception stack(0xc1001e78 to 0xc1001ec0)
    [ 120.345891] 1e60: 00000000 00000016
    [ 120.354105] 1e80: 05355555 00000000 dbeafc10 00000016 00000000 68fa5f02 00000000 dbe9ca00
    [ 120.362321] 1ea0: 68fa2975 c1001f0c c1001ec8 c1001ec8 c07d65c0 c07d65e8 60070013 ffffffff
    [ 120.370539] [<c0101a0c>] (__irq_svc) from [<c07d65e8>] (cpuidle_enter_state+0x168/0x4a8)
    [ 120.378669] [<c07d65e8>] (cpuidle_enter_state) from [<c07d6978>] (cpuidle_enter+0x24/0x28)
    [ 120.386979] [<c07d6978>] (cpuidle_enter) from [<c0160eb0>] (call_cpuidle+0x30/0x48)
    [ 120.394673] [<c0160eb0>] (call_cpuidle) from [<c0161290>] (do_idle+0x1c8/0x23c)
    [ 120.402016] [<c0161290>] (do_idle) from [<c01615fc>] (cpu_startup_entry+0x1c/0x20)
    [ 120.409632] [<c01615fc>] (cpu_startup_entry) from [<c0a1e034>] (rest_init+0xb8/0xbc)
    [ 120.417426] [<c0a1e034>] (rest_init) from [<c0f00fc8>] (start_kernel+0x47c/0x4ac)
    [ 144.061492] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper:0]
    [ 144.068312] Modules linked in: xt_conntrack iptable_nat nf_nat_ipv4 nf_nat nf_conntrack sha256_arm sha256_generic nf_defrag_ipv6 nf_defrag_ipv4 cfg80211 libcrc32c arptable_filter arp_tables iptable_filter ip_tables ebtable_filter ebtables x_tables bridge usb_f_fs stp llc usb_f_mass_storage usb_f_acm u_serial usb_f_ecm usb_f_rndis u_ether libcomposite musb_dsps musb_hdrc udc_core usbcore phy_am335x phy_generic usb_common phy_am335x_control pm33xx wkup_m3_rproc wkup_m3_ipc remoteproc omap_aes_driver crypto_engine omap_crypto ti_emif_sram omap_sham rtc_omap musb_am335x omap_wdt overlay
    [ 144.120247] CPU: 0 PID: 0 Comm: swapper Tainted: G L 4.19.59-g5f8c1c6121da-dirty #4
    [ 144.129242] Hardware name: Generic AM33XX (Flattened Device Tree)
    [ 144.135364] PC is at cpuidle_enter_state+0x168/0x4a8
    [ 144.140348] LR is at cpuidle_enter_state+0x140/0x4a8
    [ 144.145332] pc : [<c07d65e8>] lr : [<c07d65c0>] psr: 60070013
    [ 144.151622] sp : c1001ec8 ip : c1001ec8 fp : c1001f0c
    [ 144.156865] r10: 68fa2975 r9 : dbe9ca00 r8 : 00000000
    [ 144.162110] r7 : 68fa5f02 r6 : 00000000 r5 : 00000016 r4 : dbeafc10
    [ 144.168662] r3 : 00000000 r2 : 05355555 r1 : 00000016 r0 : 00000000
    [ 144.175215] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
    [ 144.182380] Control: 10c5387d Table: 9aa78019 DAC: 00000051
    [ 144.188149] CPU: 0 PID: 0 Comm: swapper Tainted: G L 4.19.59-g5f8c1c6121da-dirty #4
    [ 144.197142] Hardware name: Generic AM33XX (Flattened Device Tree)
    [ 144.203270] [<c01119c4>] (unwind_backtrace) from [<c010d6f4>] (show_stack+0x20/0x24)
    [ 144.211050] [<c010d6f4>] (show_stack) from [<c0a08b58>] (dump_stack+0x24/0x28)
    [ 144.218309] [<c0a08b58>] (dump_stack) from [<c0109d8c>] (show_regs+0x1c/0x20)
    [ 144.225480] [<c0109d8c>] (show_regs) from [<c01d10c8>] (watchdog_timer_fn+0x1e0/0x214)
    [ 144.233436] [<c01d10c8>] (watchdog_timer_fn) from [<c019f368>] (__hrtimer_run_queues.constprop.4+0x1b0/0x3ac)
    [ 144.243396] [<c019f368>] (__hrtimer_run_queues.constprop.4) from [<c019fd28>] (hrtimer_interrupt+0xf4/0x304)
    [ 144.253270] [<c019fd28>] (hrtimer_interrupt) from [<c011e154>] (omap2_gp_timer_interrupt+0x38/0x40)
    [ 144.262359] [<c011e154>] (omap2_gp_timer_interrupt) from [<c017df3c>] (__handle_irq_event_percpu+0x84/0x26c)
    [ 144.272230] [<c017df3c>] (__handle_irq_event_percpu) from [<c017e160>] (handle_irq_event_percpu+0x3c/0x8c)
    [ 144.281927] [<c017e160>] (handle_irq_event_percpu) from [<c017e21c>] (handle_irq_event+0x6c/0x98)
    [ 144.290840] [<c017e21c>] (handle_irq_event) from [<c0181bb4>] (handle_level_irq+0xc8/0x16c)
    [ 144.299229] [<c0181bb4>] (handle_level_irq) from [<c017cfa4>] (generic_handle_irq+0x34/0x44)
    [ 144.307705] [<c017cfa4>] (generic_handle_irq) from [<c017d7e8>] (__handle_domain_irq+0x64/0xb8)
    [ 144.316443] [<c017d7e8>] (__handle_domain_irq) from [<c0102278>] (omap_intc_handle_irq+0x44/0xa0)
    [ 144.325356] [<c0102278>] (omap_intc_handle_irq) from [<c0101a0c>] (__irq_svc+0x6c/0xa8)
    [ 144.333392] Exception stack(0xc1001e78 to 0xc1001ec0)
    [ 144.338463] 1e60: 00000000 00000016
    [ 144.346677] 1e80: 05355555 00000000 dbeafc10 00000016 00000000 68fa5f02 00000000 dbe9ca00
    [ 144.354891] 1ea0: 68fa2975 c1001f0c c1001ec8 c1001ec8 c07d65c0 c07d65e8 60070013 ffffffff
    [ 144.363107] [<c0101a0c>] (__irq_svc) from [<c07d65e8>] (cpuidle_enter_state+0x168/0x4a8)
    [ 144.371235] [<c07d65e8>] (cpuidle_enter_state) from [<c07d6978>] (cpuidle_enter+0x24/0x28)
    [ 144.379539] [<c07d6978>] (cpuidle_enter) from [<c0160eb0>] (call_cpuidle+0x30/0x48)
    [ 144.387231] [<c0160eb0>] (call_cpuidle) from [<c0161290>] (do_idle+0x1c8/0x23c)
    [ 144.394573] [<c0161290>] (do_idle) from [<c01615fc>] (cpu_startup_entry+0x1c/0x20)
    [ 144.402180] [<c01615fc>] (cpu_startup_entry) from [<c0a1e034>] (rest_init+0xb8/0xbc)
    [ 144.409963] [<c0a1e034>] (rest_init) from [<c0f00fc8>] (start_kernel+0x47c/0x4ac)
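
    As the sanity-check sketch mentioned above, here is a minimal, hypothetical kernel module (the name and timing are ours) for confirming on a bench unit that the soft-lockup detector is armed: it spins with preemption disabled but interrupts enabled for longer than the detector threshold (22 s in the logs above), so a "BUG: soft lockup" splat like those above should appear.

    /* softlockup_selftest.c - deliberately trip the soft-lockup detector.
     * Bench use only: loading this stalls the CPU for 30 seconds.
     */
    #include <linux/module.h>
    #include <linux/delay.h>
    #include <linux/preempt.h>

    static int __init softlockup_selftest_init(void)
    {
        pr_info("softlockup_selftest: spinning 30 s with preemption off\n");
        preempt_disable();
        mdelay(30 * 1000);   /* busy-wait: timer IRQs still fire, but no scheduling */
        preempt_enable();
        return 0;
    }

    static void __exit softlockup_selftest_exit(void)
    {
    }

    module_init(softlockup_selftest_init);
    module_exit(softlockup_selftest_exit);
    MODULE_LICENSE("GPL");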

  • Hi Chris,

    If CPUIDLE is actually causing the occasional lockups, have you tried
    disabling the CONFIG_CPU_IDLE and CONFIG_ARM_CPUIDLE options and checking
    whether that changes how long it takes to lock up? Just to see if we have
    more sources of lockup.

    - Keerthy

  • Good evening Keerthy,

    Thank you for the suggestion.  We have been making a number of changes and running several debug activities, which I will summarize here.

    We saw this errata, and it reminded us that a previous developer had patched arch/arm/mach-omap2/timer.c in our earlier v3.14 kernel to force non-posted synchronization mode on our timers.

    With devmem, I have confirmed that our running kernel now has the TSICR POSTED bit set to 0 for all timers.  After this change, the soft lockup I mentioned on Feb 17 went away completely.
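
    For reference, the same spot check can be done from user space without devmem; the sketch below mmaps /dev/mem and reads TSICR directly.  The DMTIMER2 base address and 0x54 register offset are assumptions taken from the AM335x TRM non-1ms DMTIMER register map (verify against your TRM revision), and the timer's interface clock must be running or the read will bus-fault.

    /* tsicr_check.c - read the TSICR POSTED bit of AM335x DMTIMER2 via /dev/mem.
     * Build: gcc -o tsicr_check tsicr_check.c ; run as root.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define DMTIMER2_BASE 0x48040000u  /* assumed base, AM335x TRM memory map */
    #define TSICR_OFFSET  0x54u        /* TSICR in the non-1ms DMTIMER map */
    #define TSICR_POSTED  (1u << 2)    /* POSTED bit: 0 = non-posted mode */

    int main(void)
    {
        int fd = open("/dev/mem", O_RDONLY | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ, MAP_SHARED,
                                       fd, DMTIMER2_BASE);
        if (regs == MAP_FAILED) { perror("mmap"); return 1; }

        uint32_t tsicr = regs[TSICR_OFFSET / 4];
        printf("DMTIMER2 TSICR = 0x%08x (POSTED = %u)\n",
               tsicr, (tsicr & TSICR_POSTED) ? 1u : 0u);

        munmap((void *)regs, 4096);
        close(fd);
        return 0;
    }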

    We are still seeing a lockup of an unknown nature, but we have found a way to resume operation.  If we connect to a locked-up unit with JTAG and Code Composer, we see a processor state of "Disconnected: Unknown", as shown in this screen shot:

    This is an unusual state; we normally see Running or Suspended for the Cortex-A8.  If we then connect to the Cortex-A8, it transitions to the Suspended state, and if we then click the Play button, the kernel continues executing normally with no lockup.  We have cleared the lockup about seven times this way: start the session, observe the Unknown state, connect to the Cortex-A8, observe the Suspended state, then resume.

    On several occasions after the resume, we have observed workqueue lockups.  Samples of the log buffer for these lockups follow:

    Test #1:

    [402014.105252] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 41s!
    [402014.113383] Showing busy workqueues and worker pools:
    [402014.118545] workqueue events: flags=0x0
    [402014.122493] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
    [402014.128624] pending: stop_one_cpu_nowait_workfn, musb_irq_work [musb_hdrc], switchdev_deferred_process_work
    [402014.139091] workqueue events_long: flags=0x0
    [402014.143465] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [402014.149594] pending: br_fdb_cleanup [bridge]
    [402014.154518] workqueue events_power_efficient: flags=0x80
    [402014.159951] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=6/256
    [402014.166079] pending: phy_state_machine, sync_hw_clock, check_lifetime, gc_worker [nf_conntrack], neigh_periodic_work, neigh_periodic_work
    [402014.179242] workqueue pm: flags=0x4
    [402014.182833] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [402014.188962] pending: pm_runtime_work
    [402014.193031] workqueue ipv6_addrconf: flags=0x40008
    [402014.197926] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
    [402014.203877] pending: addrconf_verify_work

    Test #2:
    [248220.667530] BUG: workqueue lockup - pool cpus=0 flags=0x4 nice=0 stuck for 81s!
    [248220.675236] Showing busy workqueues and worker pools:
    [248220.680486] workqueue events: flags=0x0
    [248312.359388] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=5/256
    [248312.365520] pending: stop_one_cpu_nowait_workfn, linkwatch_event, musb_irq_work [musb_hdrc], do_tty_hangup, usb_gadget_state_work [udc_core]
    [248312.379058] workqueue events_unbound: flags=0x2
    [248312.383782] pwq 2: cpus=0 flags=0x4 nice=0 active=2/512
    [248312.389296] pending: flush_to_ldisc, flush_to_ldisc
    [248312.394742] workqueue events_power_efficient: flags=0x80
    [248312.400252] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=5/256
    [248312.406379] pending: phy_state_machine, gc_worker [nf_conntrack], neigh_periodic_work, neigh_periodic_work, do_cache_clean
    [248312.418276] workqueue pm: flags=0x4
    [248312.421963] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [248312.428091] pending: pm_runtime_work
    [248312.432219] workqueue writeback: flags=0x4e
    [248312.436578] pwq 2: cpus=0 flags=0x4 nice=0 active=1/256
    [248312.442091] pending: wb_workfn
    [248312.445703] workqueue ipv6_addrconf: flags=0x40008
    [248312.450683] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
    [248312.456635] pending: addrconf_verify_work


    Test #3:
    BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 52s!
    Showing busy workqueues and worker pools:
    workqueue events: flags=0x0
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
    pending: stop_one_cpu_nowait_workfn, linkwatch_event, musb_irq_work [musb_hdrc], do_tty_hangup
    workqueue events_unbound: flags=0x2
    pwq 2: cpus=0 flags=0x4 nice=0 active=1/512
    pending: flush_to_ldisc
    workqueue events_power_efficient: flags=0x80
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=5/256
    pending: phy_state_machine, check_lifetime, gc_worker [nf_conntrack], neigh_periodic_work, neigh_periodic_work
    workqueue pm: flags=0x4
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    pending: pm_runtime_work
    workqueue ipv6_addrconf: flags=0x40008
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
    pending: addrconf_verify_work


    Test #4:
    [102776.124801] musb_g_ep0_irq 691: SetupEnd came in a wrong ep0stage idle
    [107681.135381] musb_g_ep0_irq 691: SetupEnd came in a wrong ep0stage idle
    [107681.142499] BUG: workqueue lockup - pool cpus=0 flags=0x4 nice=0 stuck for 61s!
    [107681.150051] Showing busy workqueues and worker pools:
    [107681.155315] workqueue events: flags=0x0
    [107681.159351] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [107681.165480] pending: stop_one_cpu_nowait_workfn
    [107681.170609] workqueue events_unbound: flags=0x2
    [107681.175324] pwq 2: cpus=0 flags=0x4 nice=0 active=1/512
    [107681.180838] pending: flush_to_ldisc
    [107681.184883] workqueue events_power_efficient: flags=0x80
    [107681.190390] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=6/256
    [107681.196516] pending: check_lifetime, phy_state_machine, gc_worker [nf_conntrack], neigh_periodic_work, neigh_periodic_work, do_cache_clean
    [107681.209828] workqueue pm: flags=0x4
    [107681.213491] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
    [107681.219618] pending: pm_runtime_work, pm_runtime_work
    [107681.225244] workqueue ipv6_addrconf: flags=0x40008
    [107681.230228] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
    [107681.236179] pending: addrconf_verify_work
    [107701.462890] Unhandled fault: external abort on non-linefetch (0x018) at 0xb6ffb000
    [107701.470622] pgd = 83be7282
    [107701.473427] [b6ffb000] *pgd=959b9831, *pte=00000303, *ppte=00000a33
    [107787.243722] rcu: INFO: rcu_preempt self-detected stall on CPU
    [107787.249647] rcu: 0-...!: (1 ticks this GP) idle=62a/0/0x1 softirq=6152295/6152295 fqs=0
    [107787.257943] rcu: (t=4658 jiffies g=16161773 q=2)
    [107787.262849] rcu: rcu_preempt kthread starved for 4658 jiffies! g16161773 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
    [107787.273675] rcu: RCU grace-period kthread stack dump:
    [107787.278833] rcu_preempt I 0 10 2 0x00000000
    [107787.284483] [<c0a1eec0>] (__schedule) from [<c0a1f50c>] (schedule+0x60/0xcc)
    [107787.291661] [<c0a1f50c>] (schedule) from [<c0a23298>] (schedule_timeout+0xa8/0x440)
    [107787.299460] [<c0a23298>] (schedule_timeout) from [<c0194fd0>] (rcu_gp_kthread+0xb2c/0x13e4)
    [107787.307954] [<c0194fd0>] (rcu_gp_kthread) from [<c0151d84>] (kthread+0x164/0x16c)
    [107787.315565] [<c0151d84>] (kthread) from [<c01010e8>] (ret_from_fork+0x14/0x2c)
    [107787.322904] Exception stack(0xdb941fb0 to 0xdb941ff8)
    [107787.328066] 1fa0: 00000000 00000000 00000000 00000000
    [107787.336367] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    [107787.344667] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
    [107787.351420] Task dump for CPU 0:
    [107787.354745] swapper R running task 0 0 0 0x00000002
    [107787.361933] [<c01119c4>] (unwind_backtrace) from [<c010d6f4>] (show_stack+0x20/0x24)
    [107787.369816] [<c010d6f4>] (show_stack) from [<c015aa74>] (sched_show_task.part.15+0xdc/0x10c)
    [107787.378388] [<c015aa74>] (sched_show_task.part.15) from [<c015fdd8>] (dump_cpu_task+0x40/0x44)
    [107787.387131] [<c015fdd8>] (dump_cpu_task) from [<c0197220>] (rcu_dump_cpu_stacks+0x98/0xd8)
    [107787.395525] [<c0197220>] (rcu_dump_cpu_stacks) from [<c0196a2c>] (rcu_check_callbacks+0x730/0x9f8)
    [107787.404623] [<c0196a2c>] (rcu_check_callbacks) from [<c019e648>] (update_process_times+0x44/0x6c)
    [107787.413639] [<c019e648>] (update_process_times) from [<c01b0478>] (tick_sched_handle+0x70/0x74)
    [107787.422469] [<c01b0478>] (tick_sched_handle) from [<c01b0708>] (tick_sched_timer+0x58/0xb4)
    [107787.430950] [<c01b0708>] (tick_sched_timer) from [<c019f530>] (__hrtimer_run_queues.constprop.4+0x1b0/0x3ac)
    [107787.440913] [<c019f530>] (__hrtimer_run_queues.constprop.4) from [<c019fef0>] (hrtimer_interrupt+0xf4/0x304)
    [107787.450881] [<c019fef0>] (hrtimer_interrupt) from [<c011e154>] (omap2_gp_timer_interrupt+0x38/0x40)
    [107787.460061] [<c011e154>] (omap2_gp_timer_interrupt) from [<c017df3c>] (__handle_irq_event_percpu+0x84/0x26c)
    [107787.470022] [<c017df3c>] (__handle_irq_event_percpu) from [<c017e160>] (handle_irq_event_percpu+0x3c/0x8c)
    [107787.479805] [<c017e160>] (handle_irq_event_percpu) from [<c017e21c>] (handle_irq_event+0x6c/0x98)
    [107787.488809] [<c017e21c>] (handle_irq_event) from [<c0181bb4>] (handle_level_irq+0xc8/0x16c)
    [107787.497286] [<c0181bb4>] (handle_level_irq) from [<c017cfa4>] (generic_handle_irq+0x34/0x44)
    [107787.505849] [<c017cfa4>] (generic_handle_irq) from [<c017d7e8>] (__handle_domain_irq+0x64/0xb8)
    [107787.514677] [<c017d7e8>] (__handle_domain_irq) from [<c0102278>] (omap_intc_handle_irq+0x44/0xa0)
    [107787.523676] [<c0102278>] (omap_intc_handle_irq) from [<c0101a0c>] (__irq_svc+0x6c/0xa8)
    [107787.531798] Exception stack(0xc1001e78 to 0xc1001ec0)
    [107787.536956] 1e60: 00000000 00006208
    [107787.545258] 1e80: 05355555 00000000 dbeafc60 000061fd 00000001 2c5d91d3 00000001 dbe9ca00
    [107787.553559] 1ea0: 5418d9b6 c1001f0c c1001ec8 c1001ec8 c07d6788 c07d67b0 600a0013 ffffffff
    [107787.561868] [<c0101a0c>] (__irq_svc) from [<c07d67b0>] (cpuidle_enter_state+0x168/0x4a8)
    [107787.570085] [<c07d67b0>] (cpuidle_enter_state) from [<c07d6b40>] (cpuidle_enter+0x24/0x28)
    [107787.578478] [<c07d6b40>] (cpuidle_enter) from [<c0160eb0>] (call_cpuidle+0x30/0x48)
    [107787.586258] [<c0160eb0>] (call_cpuidle) from [<c0161290>] (do_idle+0x1c8/0x23c)
    [107787.593689] [<c0161290>] (do_idle) from [<c01615fc>] (cpu_startup_entry+0x1c/0x20)
    [107787.601382] [<c01615fc>] (cpu_startup_entry) from [<c0a1e1f4>] (rest_init+0xb8/0xbc)
    [107787.609263] [<c0a1e1f4>] (rest_init) from [<c0f00fc8>] (start_kernel+0x47c/0x4ac)
    [107787.617271] musb_g_ep0_irq 691: SetupEnd came in a wrong ep0stage idle


    A recent test (Test #4 above) was accompanied by an RCU-detected stall in CPUIDLE.

    Although we are no longer seeing soft lockups in the CPUIDLE area, we are still seeing the occasional lockup there.

    We have started a new test with the CONFIG_CPU_IDLE and CONFIG_ARM_CPUIDLE options turned off, as you suggested.  The lockups sometimes take a few days to show up, so we are waiting for results.

    On another debug track, we have enabled CONFIG_HZ_PERIODIC and disabled CONFIG_NO_HZ_IDLE.  Eleven of our units have been in a long run with no lockups using CONFIG_HZ_PERIODIC.

    We have begun some analysis of the workqueue lockups, and a call that is common in the logs is stop_one_cpu_nowait_workfn().  Our senior Linux engineer is seeing this called from kernel/sched/fair.c, but only when CONFIG_HZ_PERIODIC is not being used.

    So to summarize: when we see the workqueue lockup (sometimes accompanied by an RCU stall detection), we are able to connect with JTAG and resume to get the OS running again.  The problem is never seen when we use CONFIG_HZ_PERIODIC.

    For our CONFIG_NO_HZ_IDLE kernel, we have a test in progress with CPUIDLE disabled.

    Any further suggestions or thoughts are appreciated!

    Regards,
    Chris Norris

  • Chris,

    I wanted to see if you could provide an update on your testing.  Have you gained any clarity on the issue from the new test runs?

    Also, can you do the test without USB? If so, what happens?

    If you feel USB is critical and is involved, can you please provide your USB configuration? Host/Device? Drivers used? Etc.?

    Thanks.

  • Hi Ron,

    For the 11 units that we changed to CONFIG_HZ_PERIODIC, we have not observed any lockups.  We are investigating one message in the log buffer that we see intermittently, to determine whether it could be related to the crash we see when using CONFIG_NO_HZ_IDLE:
    [ 4616.142541] gpio-55 (sysfs): _gpiod_direction_output_raw: tried to set a GPIO tied to an IRQ as output

    We are running two units with CONFIG_NO_HZ_IDLE and with the CONFIG_CPU_IDLE and CONFIG_ARM_CPUIDLE options turned off, as Keerthy suggested.  Those units ran over the weekend with no crash, but since the crash can take days to occur, we do not yet have enough test time in this configuration to know whether it made a difference.

    When the lockup occurs in our CONFIG_NO_HZ_IDLE configuration, connecting an XDS100 to the Cortex-A8 core and resuming always seems to clear the problem and get the unit running again.

    We have noticed, after connecting the JTAG, that the workqueue lockup messages are sometimes not present in the log prior to the resume.  So it is not clear to us whether the workqueue lockup resulted from the CPU being suspended by the JTAG connection or whether it occurred before we connected JTAG.

    Since we have observed the lockups without USB connected, it is not clear that USB is involved, and we have not done any USB debugging here.

    It sounds like TI suspects a USB issue, so I will re-familiarize myself with our USB usage and send some details.

    Regards,
    Chris Norris

  • Chris,

    Some of the logs had mentions of USB, but if you have seen lockups with USB inactive, then I don't believe the root cause is USB-related.  That said, USB is sometimes used for networking or local storage on the board even with no external connection, so I would want to disable USB altogether to be certain.

    Please keep us posted on your results.