AM335x Standby/Resume Stress Test

Hello,

We are using the AM3354BZCZD80 on a custom board similar to the BeagleBone Black design.  We are using EZSDK 6.00.00.00, which apparently uses PSP Kernel 04.06.00.11.  We have a customer requirement to place the device in standby until an external input triggers the device to resume.

We have successfully implemented standby/resume using a GPIO wake source, but the customer is complaining that sometimes the device doesn't resume properly.  They have to power cycle the device to recover.

We have been stress testing 7+ devices and have observed resume failures after 25 hours or so of testing.  We saw a note in the PSP 04.06.00.10 Release Notes in the "Known Issues" section which describes exactly what we are seeing:

  • SDOCM00099830 suspend/resume long hour stress test failed
  • SDOCM00099761 DUT could not resume after long hours of standby/resume test

We are wondering whether this issue is still open or whether it was resolved in EZSDK 6.00.00.00.  Is there any possible solution?  We have ruled out all of our application software: the crash still occurs when just using a shell script to manage the standby/resume process with only standard system daemons running.  We've also ruled out our attached USB hub and devices.

I've set up the same test on a BeagleBone Black running EZSDK 6.00.00.00 images for comparison, but I've not had it running very long and it hasn't crashed yet.
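
For readers who want to reproduce this kind of test, the shell-script approach mentioned above can be sketched roughly as follows. This is only an illustration, not the script used in the original test: the names `STATE_FILE`, `LOG_FILE`, `SLEEP_STATE`, `suspend_once` and `stress_loop` are hypothetical, and the sketch assumes that writing the state name to `/sys/power/state` blocks until the external wake source fires.

```shell
#!/bin/sh
# Sketch of a standby/resume stress loop. All names and paths below are
# illustrative assumptions, overridable from the environment.
STATE_FILE="${STATE_FILE:-/sys/power/state}"
LOG_FILE="${LOG_FILE:-/var/log/standby_stress.log}"
SLEEP_STATE="${SLEEP_STATE:-standby}"

# Enter the sleep state once. On real hardware the write returns after resume.
suspend_once() {
    echo "$SLEEP_STATE" > "$STATE_FILE"
}

# Run up to $1 cycles (0 = keep going until a suspend attempt fails).
# Prints the number of completed cycles when a finite run finishes.
stress_loop() {
    max_cycles="$1"
    count=0
    while [ "$max_cycles" -eq 0 ] || [ "$count" -lt "$max_cycles" ]; do
        suspend_once || return 1
        count=$((count + 1))
        # Log and sync so the last completed cycle survives a hang or power cycle.
        echo "$(date '+%Y-%m-%d %H:%M:%S') resume $count" >> "$LOG_FILE"
        sync
    done
    echo "$count"
}
```

Calling `stress_loop 0` on the target would cycle until something fails. After a hang, the last line of the log file shows how many cycles completed before the failure.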

  • I've seen hangs during suspend before, related to Bluetooth driver issues.  The general purpose solution is to enable the watchdog, e.g.

    # Extra safety against kernel hangs... this should cause any oops/panic to reboot (instead of hang).
    echo 1 > /proc/sys/kernel/panic
    echo 1 > /proc/sys/kernel/panic_on_oops

    # The Watchdog seems good at catching suspend/standby hangs (e.g. BT rfcomm related), and rebooting the system.
    # The watchdog is disabled just before standby, and re-enabled by the kernel when it wakes up.
    watchdog -T 10 -t 1 /dev/watchdog

  • We can try the watchdog and restart on panic, but it doesn't resolve the root cause. We don't have the Bluetooth driver module enabled at the moment. We are using the PVR module, as well as other kernel built-ins.

    Have you been able to determine if the hang usually occurs when going to standby, or when coming out of standby? If it is during the resume process, the watchdog won't help because we need the device up and running within seconds to show a live video feed. If it's going into standby, we might be able to make it work if we can hide the fact that it's rebooting.
  • Yeah, I think it would hang during the suspend part (not the resume), occasionally with a panic/oops during suspend, occasionally 'silently'. But we didn't do much of the stress type testing that you're doing, so no doubt your issue could be caused by something else. Good luck. I think there are some kernel debugging options for suspend (i.e. keeping kernel console messages going during the process).
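
One debugging option of the kind mentioned above is to stop the kernel from suspending the console, so messages keep printing during suspend and resume. The following is a minimal sketch, assuming the kernel exposes the printk `console_suspend` parameter in sysfs. The function name `keep_console_awake` and the `CONSOLE_SUSPEND` variable are hypothetical. The equivalent boot argument is `no_console_suspend`.

```shell
#!/bin/sh
# Path is overridable so the function can be tried against a scratch file.
CONSOLE_SUSPEND="${CONSOLE_SUSPEND:-/sys/module/printk/parameters/console_suspend}"

# Ask the kernel to keep the console active across suspend/resume.
keep_console_awake() {
    if [ -w "$CONSOLE_SUSPEND" ]; then
        echo N > "$CONSOLE_SUSPEND"
    else
        # Assumption: without the sysfs knob, the boot argument is the fallback.
        echo "console_suspend not writable; try booting with no_console_suspend" >&2
        return 1
    fi
}
```

Running `keep_console_awake` before starting a stress test should make it easier to tell whether a silent hang happens on the way into or out of standby.
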
  • I just confirmed this morning, the BeagleBone Black suffers from the same issue using stock EZSDK 6.00.00.00 binaries. I have disabled Matrix GUI prior to the test. It lasted just under 36 hours before it failed. I think this rules out our Kernel modifications for our custom board.
  • I have begun similar testing on BeagleBone Black with new EZSDK 8.00.00.00 images. Should have some results in a day (or 3).
  • So far, 3.14 Kernel is still running.

    Looks like we've found the problem in the 3.2 Kernel. After testing, we found a bad pointer dereference in am33xx_cpgmac_reset, which is called on each standby. It seems there was a note in the Kernel sources questioning whether this was needed due to an erratum. Later Kernels have this function entirely removed.

    We have removed the function assignment to .reset from am33xx_cpgmac0_hwmod_class structure in omap_hwmod_33xx_data.c. We have restarted testing with this modification to verify the results.
  • How is the testing going? Can you tell me what command you use to go into standby? Also, can you list all patches/code changes? I have attached another patch that was backported to the 3.2 kernel from the 3.12 kernel. This one helped another person with suspend/resume issues.

    Steve K.

    sleep.patch.gz

  • The 3.2 kernel minus am33xx_cpgmac_reset failed after about 16 hours. We didn't get a crash dump this time, so we'll try the provided patch and re-run.
  • Hello Steve,

    Regarding the patches and shell commands for standby, I've posted them below.  Note that we are using gpio143 through a GPIO expander to detect when to go to sleep (low = sleep, high = stay awake).  We are using the expander's INT pin connected to gpio1_29 of the CPU to wake (active low).

    standby_files.zip
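
The contents of the attachment are not reproduced here, but the sleep-request logic described above (gpio143 low = sleep, high = stay awake) might look something like the sketch below. It assumes gpio143 is already exported through the sysfs GPIO interface and that the gpio1_29 wake source is configured elsewhere. The function names and the `GPIO_VALUE` and `STATE_FILE` variables are hypothetical.

```shell
#!/bin/sh
# Paths are overridable; the defaults assume gpio143 has been exported in sysfs.
GPIO_VALUE="${GPIO_VALUE:-/sys/class/gpio/gpio143/value}"
STATE_FILE="${STATE_FILE:-/sys/power/state}"

# Succeeds (returns 0) when the sleep-request line reads low.
sleep_requested() {
    [ "$(cat "$GPIO_VALUE")" = "0" ]
}

# Enter standby only while the line is low; returns 1 if no request is pending.
standby_if_requested() {
    sleep_requested || return 1
    echo standby > "$STATE_FILE"
}
```

A supervising loop could call `standby_if_requested` periodically and sleep briefly between polls while the line stays high.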

  • We received a crash dump log from the unit which failed running the patched 3.2 kernel while running our application.  It seems the musb driver is the cause of this.  We are using 2 USB devices (video adapters) in this application.

    [ 4249.200988] PM: late suspend of devices complete after 24.200 msecs
    [ 4252.532012] GFX domain entered low power state
    [ 4252.536804] Successfully transitioned all domains to low power state
    [ 4252.547058] PM: early resume of devices complete after 3.021 msecs
    [ 4252.717376] d_can d_can.0: can0: setting CAN BT = 0x1c05
    [ 4252.723144] d_can d_can.0: can_put_echo_skb: BUG! echo_skb is occupied!
    [ 4252.730133] d_can d_can.0: can_put_echo_skb: BUG! echo_skb is occupied!
    [ 4252.737121] d_can d_can.0: can_put_echo_skb: BUG! echo_skb is occupied!
    [ 4252.744140] d_can d_can.0: can_put_echo_skb: BUG! echo_skb is occupied!
    [ 4252.751129] d_can d_can.0: can_put_echo_skb: BUG! echo_skb is occupied!
    [ 4252.758087] d_can d_can.0: can_put_echo_skb: BUG! echo_skb is occupied!
    [ 4252.765075] d_can d_can.0: can_put_echo_skb: BUG! echo_skb is occupied!
    [ 4252.772064] d_can d_can.0: can_put_echo_skb: BUG! echo_skb is occupied!
    [ 4252.779022] d_can d_can.0: can_put_echo_skb: BUG! echo_skb is occupied!
    [ 4252.786010] d_can d_can.0: can_put_echo_skb: BUG! echo_skb is occupied!
    [ 4252.828094] PVR: PVRSRVDriverResume(pDevice=df0b0400)
    [ 4252.833496] PVR: SysSystemPostPowerState: Entering state D0
    [ 4252.839385] PVR: EnableSystemClocks: Enabling System Clocks
    [ 4252.845367] PVR: GPTIMER11 clock is 24MHz
    [ 4252.849761] PVR: Installing device LISR SGX ISR on IRQ 37 with cookie de280e80
    [ 4252.903564] omap_timer omap_timer.7: omap2_dm_timer_set_src: 518: clk_get() sys_ck FAILED
    [ 4252.932159] Unable to handle kernel NULL pointer dereference at virtual address 00000048
    [ 4252.940673] pgd = c0004000
    [ 4252.943542] [00000048] *pgd=00000000
    [ 4252.947326] Internal error: Oops: 17 [#1] PREEMPT
    [ 4252.952239] Modules linked in: bufferclass_ti(O) omaplfb(O) pvrsrvkm(O) g_ether atmel_mxt_ts
    [ 4252.961181] CPU: 0    Tainted: G           O  (3.2.0 #32)
    [ 4252.966888] PC is at musb_start_urb+0x38/0x9d4
    [ 4252.971588] LR is at musb_urb_enqueue+0x554/0x638
    [ 4252.976531] pc : [<c02f0cdc>]    lr : [<c02f1bcc>]    psr: 60000193
    [ 4252.976562] sp : d82e7cc0  ip : df17c0e8  fp : 00000000
    [ 4252.988616] r10: 00000000  r9 : df17c33c  r8 : d829220c
    [ 4252.994110] r7 : d82e6000  r6 : df17c0e8  r5 : e081cc00  r4 : d8292200
    [ 4253.000976] r3 : df7b3c48  r2 : d8292200  r1 : 00000000  r0 : df17c0e8
    [ 4253.007843] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
    [ 4253.015625] Control: 10c5387d  Table: 98320019  DAC: 00000015
    [ 4253.021667] Process kworker/u:9 (pid: 1796, stack limit = 0xd82e62f0)
    [ 4253.028442] Stack: (0xd82e7cc0 to 0xd82e8000)
    [ 4253.033020] 7cc0: d82e6000 00000010 d82e7cdc c0472efc c00a9184 c00a91b4 00000000 c0013d38
    [ 4253.041625] 7ce0: 00000001 00000028 0000002c df003800 d8292200 60000113 df000140 00000000
    [ 4253.050231] 7d00: 00000000 df7b3c38 00000010 00000000 e081cc00 d82e7d28 c00a9184 d82606c0
    [ 4253.058837] 7d20: d8292200 df17c000 d82e6000 d829220c df17c33c 60000113 00000000 c02f1bcc
    [ 4253.067443] 7d40: c06ebdc0 a0000113 c074a140 c047312c a0000113 00000000 00200200 0003da0b
    [ 4253.076049] 7d60: c074a140 c004af90 00000000 ffffffff 00000064 d82e6000 d82e7dd8 d82606c0
    [ 4253.084655] 7d80: 00000000 df17c000 00000010 000003e8 00000001 d82606c8 00000023 c02d64a8
    [ 4253.093261] 7da0: c06ea740 de49e3b0 df0b5a00 c06ea778 00000000 df0b5a30 de49e3b0 df0b5a00
    [ 4253.101867] 7dc0: d82e7df4 c0038f50 de49e3b0 c06ea778 c0039290 d82606c0 d82e7e08 00000000
    [ 4253.110504] 7de0: d82e7e34 000003e8 00000001 df7b3c00 00000023 c02d8bc4 00000081 de49e380
    [ 4253.119110] 7e00: 00ffff3f d826b9c0 00000001 d82e7e0c d82e7e0c 00000000 d826b9c0 00000000
    [ 4253.127716] 7e20: 00000000 00000002 80000200 c02d8e48 00000000 00000002 df7a8780 00000010
    [ 4253.136322] 7e40: df786800 00000002 df1b70c0 c04a4ff4 00000010 df1b70fc 00000089 c02ceb98
    [ 4253.144927] 7e60: 00000002 00000002 00000000 00000000 000003e8 d82e6000 00000089 c02d2b78
    [ 4253.153533] 7e80: 00000002 00000000 d8294908 00000000 d82e7f0c c0473154 00000081 df150507
    [ 4253.162139] 7ea0: 00000010 df786800 df786800 c02ce264 c04a4ff4 d8294908 d82e6000 c02daf54
    [ 4253.170745] 7ec0: 00000000 00000000 df786868 c02dbab8 00000000 00000000 df786868 c02ce270
    [ 4253.179351] 7ee0: 00000000 c0267a08 00100100 00200200 00000000 00000000 df786868 df78689c
    [ 4253.187957] 7f00: 00000001 00000010 d8294900 c0268190 00000000 df786868 c075ad54 df014200
    [ 4253.196563] 7f20: 00000000 c02682b8 00000000 c0736a1c df310e80 c005fec8 00000002 c005e868
    [ 4253.205169] 7f40: 00000000 c06ea778 d82ec01c df0b5a00 d8294908 df310e80 df014200 00000000
    [ 4253.213775] 7f60: c005fe30 00000000 d82e6000 c0054154 de0abe00 c0054a74 00000000 df310e80
    [ 4253.222381] 7f80: c074b1f0 c06ec32c df310e90 c074b1e8 c074b1e8 d82e6000 00000089 c0054b74
    [ 4253.230987] 7fa0: df310e80 c0059350 00000013 d82e5f2c df310e80 c0054a04 00000013 00000000
    [ 4253.239593] 7fc0: 00000000 00000000 00000000 c005936c 00000000 00000000 df310e80 00000000
    [ 4253.248199] 7fe0: d82e7fe0 d82e7fe0 d82e5f2c c00592ec c0015080 c0015080 beffffdf f7b75f7d
    [ 4253.256835] [<c02f0cdc>] (musb_start_urb+0x38/0x9d4) from [<c02f1bcc>] (musb_urb_enqueue+0x554/0x638)
    [ 4253.266571] [<c02f1bcc>] (musb_urb_enqueue+0x554/0x638) from [<c02d64a8>] (usb_hcd_submit_urb+0xa4/0x7d4)
    [ 4253.276672] [<c02d64a8>] (usb_hcd_submit_urb+0xa4/0x7d4) from [<c02d8bc4>] (usb_start_wait_urb+0x40/0x134)
    [ 4253.286834] [<c02d8bc4>] (usb_start_wait_urb+0x40/0x134) from [<c02d8e48>] (usb_control_msg+0x98/0xcc)
    [ 4253.296630] [<c02d8e48>] (usb_control_msg+0x98/0xcc) from [<c02ceb98>] (clear_port_feature+0x44/0x4c)
    [ 4253.306365] [<c02ceb98>] (clear_port_feature+0x44/0x4c) from [<c02d2b78>] (usb_port_resume+0x64/0x4d0)
    [ 4253.316162] [<c02d2b78>] (usb_port_resume+0x64/0x4d0) from [<c02daf54>] (usb_resume_both+0x104/0x138)
    [ 4253.325866] [<c02daf54>] (usb_resume_both+0x104/0x138) from [<c02dbab8>] (usb_resume+0x34/0x80)
    [ 4253.335052] [<c02dbab8>] (usb_resume+0x34/0x80) from [<c02ce270>] (usb_dev_resume+0xc/0x10)
    [ 4253.343841] [<c02ce270>] (usb_dev_resume+0xc/0x10) from [<c0267a08>] (pm_op+0x98/0xb4)
    [ 4253.352203] [<c0267a08>] (pm_op+0x98/0xb4) from [<c0268190>] (device_resume+0xf8/0x208)
    [ 4253.360626] [<c0268190>] (device_resume+0xf8/0x208) from [<c02682b8>] (async_resume+0x18/0x44)
    [ 4253.369689] [<c02682b8>] (async_resume+0x18/0x44) from [<c005fec8>] (async_run_entry_fn+0x98/0x1d4)
    [ 4253.379241] [<c005fec8>] (async_run_entry_fn+0x98/0x1d4) from [<c0054154>] (process_one_work+0x128/0x37c)
    [ 4253.389312] [<c0054154>] (process_one_work+0x128/0x37c) from [<c0054b74>] (worker_thread+0x170/0x334)
    [ 4253.399047] [<c0054b74>] (worker_thread+0x170/0x334) from [<c005936c>] (kthread+0x80/0x88)
    [ 4253.407775] [<c005936c>] (kthread+0x80/0x88) from [<c0015080>] (kernel_thread_exit+0x0/0x8)
    [ 4253.416564] Code: e5b3a010 e15a0003 124aa014 03a0a000 (e59a5048) 
    [ 4253.423156] ---[ end trace 4ef325c296ef0a97 ]---
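
When several units are under test, it can help to pull the faulting function out of each saved log and compare. The following is a small sketch that reads an oops log on standard input and prints the function named on the first `PC is at` line. The helper name `oops_function` is hypothetical.

```shell
#!/bin/sh
# Print the function name from the first "PC is at <func>+<offset>" line on stdin.
oops_function() {
    sed -n 's/.*PC is at \([A-Za-z0-9_.]*\)+.*/\1/p' | head -n 1
}
```

For the dump above, `oops_function < crash.log` would print `musb_start_urb`, assuming the log was saved as `crash.log`.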
    

  • David Paden said:
    We have removed the function assignment to .reset from am33xx_cpgmac0_hwmod_class structure in omap_hwmod_33xx_data.c. We have restarted testing with this modification to verify the results.

    So is this the change you made:

    diff --git a/arch/arm/mach-omap2/omap_hwmod_33xx_data.c b/arch/arm/mach-omap2/om
    index 6c5ebc7..4f1c448 100644
    --- a/arch/arm/mach-omap2/omap_hwmod_33xx_data.c
    +++ b/arch/arm/mach-omap2/omap_hwmod_33xx_data.c
    @@ -568,7 +568,6 @@ static struct omap_hwmod_class_sysconfig am33xx_cpgmac_sysc
     static struct omap_hwmod_class am33xx_cpgmac0_hwmod_class = {
            .name           = "cpsw",
            .sysc           = &am33xx_cpgmac_sysc,
    -       .reset          = am33xx_cpgmac_reset,
     };

    I did some sanity checking against SDK 8.00 and that looks to be the same, but I thought it would be a good idea to confirm!

    FYI, in case it helps, I have a Tera Term macro that I use for automated testing of suspend/resume.  Here it is:

    for i 1 60000
        int2str countstr i
        statusbox countstr 'ATTEMPTS'
        pause 2
        sendln "echo mem > /sys/power/state"
        waitln "Suspending console"
        pause 2
        ; send a wake event
        sendln
        pause 1
        sendln
        Waitln "root@am335x-evm"
        sendln
    next

    sendln "echo 60000 iterations completed"
    end

    Currently I have SDK 8.00 running on my AM335x EVM.  It's been going for about 24 hours and 15,000+ iterations so far without issue.
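
As a rough cross-check of these figures, the average cycle time and the expected length of the full 60000-iteration macro run can be worked out with integer shell arithmetic. The inputs are approximate ("about 24 hours", "15,000+"), so the results are estimates. The helper names `cycle_ms` and `run_hours` are hypothetical.

```shell
#!/bin/sh
# Average time per cycle in milliseconds: cycle_ms <hours> <iterations>
cycle_ms() {
    echo $(( $1 * 3600 * 1000 / $2 ))
}

# Estimated run length in whole hours: run_hours <iterations> <ms_per_cycle>
run_hours() {
    echo $(( $1 * $2 / 3600000 ))
}

cycle_ms 24 15000      # about 5760 ms per suspend/resume cycle
run_hours 60000 5760   # about 96 hours for the full macro loop
```

The fixed pauses in the macro add up to 5 seconds per iteration, so an average near 5.8 seconds leaves well under a second for the suspend and resume transitions themselves.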

  • Hello Brad,

    Thanks for the update.

    Yes, that was the change we made to EZSDK 6.00.  The system would still freeze/crash intermittently, typically after around 25 hours of testing (it went to about 32 hours once).  Whenever we'd get a crash dump (which was almost never), it reported an error in the MUSB driver.  I completely disabled the MUSB driver in the kernel build, but the system would still crash without a crash dump when stress testing standby.

    We switched to EZSDK 8.00 (did not apply any patches to CPSW driver) and we were able to achieve 96+ hours of continuous standby cycling with no lockups on 5 units.  So, for applications that require standby, we will be using EZSDK 8.00.

  • FYI, my SDK 8.00 test is up to 48 hours and 32k+ iterations. Still going.