This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

AM62A7: SDK 09.01: Rare boot crash related to wave5_vdi_readl

Part Number: AM62A7

On the AM62A7, I have seen a pretty rare boot crash twice now.  The current failure rate seems to be about 1 in 5000 boots.

Both crashes pointed to

pc : wave5_vdi_readl+0x14/0x30 [wave5]
lr : wave5_vpu_timer_callback+0x2c/0x94 [wave5]

[35.642] [    5.056753] e5010 fd20000.e5010: Device registered as /dev/video2
[35.680] [    5.067818] rtc-ti-k3 2b1f0000.rtc: registered as rtc0
[35.690] [    5.073156] rtc-ti-k3 2b1f0000.rtc: setting system clock to 1970-01-01T00:00:14 UTC (14)
[35.698] [    5.084059] systemd-journald[167]: Time jumped backwards, rotating.
[35.718] [    5.352205] systemd-journald[167]: Oldest entry in /run/log/journal/717664b3b6bb40b7b1b7672fe0f9f884/system.journal is older than the configured file retention duration (1month), suggesting rotation.
[35.987] [    5.370293] systemd-journald[167]: /run/log/journal/717664b3b6bb40b7b1b7672fe0f9f884/system.journal: Journal header limits reached or header out-of-date, rotating.
[36.002] [[0;32m  OK  [0m] Started [0;1;39mNetwork Time Synchronization[0m.
[36.012] [[0;32m  OK  [0m] Reached target [0;1;39mSystem Initialization[0m.
[36.030] [[0;32m  OK  [0m] Started [0;1;39mDaily Cleanup of Temporary Directories[0m.
[36.047] [[0;32m  OK  [0m] Reached target [0;1;39mSystem Time Set[0m.
[36.061] [[0;32m  OK  [0m] Reached target [0;1;39mTimer Units[0m.
[36.077] [[0;32m  OK  [0m] Listening on [0;1;39mD-Bus System Message Bus Socket[0m.
[36.093] [[0;32m  OK  [0m] Listening on [0;1;39mdropbear.socket[0m.
[36.148] [[0;32m  OK  [0m] Reached target [0;1;39mSocket Units[0m.
[36.157] [[0;32m  OK  [0m] Reached target [0;1;39mBasic System[0m.
[36.169] [[0;32m  OK  [0m] Started [0;1;39mKernel Logging Service[0m.
[36.187] [    5.582705] Internal error: synchronous external abort: 0000000096000010 [#1] PREEMPT SMP
[36.208] [    5.590902] Modules linked in: dwc3(+) crct10dif_ce e5010_jpeg_enc dwc3_am62 k3_j72xx_bandgap rtc_ti_k3 wave5 j721e_csi2rx videobuf2_dma_contig v4l2_mem2mem videobuf2_memops ti_k3_r5_remoteproc imx219 ti_k3_dsp_remoteproc cfg80211 videobuf2_v4l2 virtio_rpmsg_bus v4l2_fwnode videobuf2_common v4l2_async rpmsg_ns videodev ti_k3_common ltc2945 rfkill cdns_dphy_rx sa2ul mc at24 optee_rng rng_core overlay fuse drm drm_panel_orientation_quirks ipv6
[36.247] [    5.630027] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.1.46-g798369be2e #1
[36.254] [    5.636977] Hardware name: Critical Link MitySOM-AM62A (DT)
[36.259] [    5.642536] pstate: 000000c5 (nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[36.266] [    5.649485] pc : wave5_vdi_readl+0x14/0x30 [wave5]
[36.271] [    5.654290] lr : wave5_vpu_timer_callback+0x2c/0x94 [wave5]
[36.277] [    5.659860] sp : ffff800008003e70
[36.280] [    5.663163] x29: ffff800008003e70 x28: ffff00087f80e740 x27: ffff00080307c200
[36.287] [    5.670286] x26: 0000000000000000 x25: 00000000000000c0 x24: ffff800009380008
[36.294] [    5.677408] x23: ffff8000096d99e0 x22: ffff00087f80e700 x21: ffff00087f80e760
[36.301] [    5.684532] x20: ffff00080307c080 x19: ffff00080307c200 x18: 0000000000000000
[36.308] [    5.691654] x17: ffff80087648a000 x16: ffff800008000000 x15: 000000000000034e
[36.316] [    5.698777] x14: ffffffffffffffff x13: 0000000000000000 x12: 0000000000000003
[36.323] [    5.705901] x11: 0000000000000040 x10: ffff000800005370 x9 : ffff80000113754c
[36.347] [    5.713023] x8 : ffff0008004004b8 x7 : 0000000000000000 x6 : 0000000000000000
[36.347] [    5.720146] x5 : 0000000000000000 x4 : ffff00080307c200 x3 : ffff00087f80ec50
[36.347] [    5.727267] x2 : 0000000000000000 x1 : 0000000000000044 x0 : ffff80000a520044
[36.351] [    5.734392] Call trace:
[36.354] [    5.736828]  wave5_vdi_readl+0x14/0x30 [wave5]
[36.358] [    5.741273]  __hrtimer_run_queues+0x180/0x364
[36.362] [    5.745625]  hrtimer_interrupt+0xf0/0x250
[36.366] [    5.749625]  arch_timer_handler_phys+0x3c/0x50
[36.371] [    5.754062]  handle_percpu_devid_irq+0x90/0x240
[36.375] [    5.758583]  generic_handle_domain_irq+0x34/0x50
[36.380] [    5.763193]  gic_handle_irq+0x58/0x130
[36.384] [    5.766933]  call_on_irq_stack+0x24/0x54
[36.388] [    5.770847]  do_interrupt_handler+0x88/0x94
[36.392] [    5.775020]  el1_interrupt+0x34/0x70
[36.395] [    5.778592]  el1h_64_irq_handler+0x18/0x2c
[36.399] [    5.782679]  el1h_64_irq+0x64/0x68
[36.403] [    5.786071]  arch_cpu_idle+0x18/0x2c
[36.406] [    5.789636]  default_idle_call+0x58/0x198
[36.410] [    5.793639]  do_idle+0x250/0x2c4
[36.414] [    5.796862]  cpu_startup_entry+0x30/0x3c
[36.418] [    5.800776]  rest_init+0xf8/0x100
[36.421] [    5.804081]  arch_post_acpi_subsys_init+0x0/0x28
[36.425] [    5.808690]  start_kernel+0x6bc/0x700
[36.429] [    5.812341]  __primary_switched+0xbc/0xc4
[36.433] [    5.816347] Code: d503201f f940ac00 d503233f 8b214000 (b9400000) 
[36.439] [    5.822427] ---[ end trace 0000000000000000 ]---
[36.444] [    5.827033] Kernel panic - not syncing: synchronous external abort: Fatal exception in interrupt
[36.453] [    5.835798] SMP: stopping secondary CPUs
[36.456] [    5.839716] Kernel Offset: disabled
[36.460] [    5.843192] CPU features: 0x00000,00800084,0000420b
[36.465] [    5.848058] Memory Limit: none
[36.468] [    5.851106] ---[ end Kernel panic - not syncing: synchronous external abort: Fatal exception in interrupt ]---

  • Hi Jonathan,

    Could you elaborate on the test you are running so we can reproduce the issue on our end? 
    Is it just power cycle test or are you running any application( Related to Video encode/decode)  in the background. 

    Best Regards,

    Suren

  • I was validating our test fixture using power cycle tests.  Basically the board is powered on, some quick tests are run in u-boot, then unit then boots into linux and runs some more interface tests before powering down. Then power gets cycled.

    Frankly, I'm quite happy that I've gotten to the point where the failure rate of the testing has dropped to 1 in 5000 instead of 1 in 200 with other issues we've fixed.  I just wanted to see if the error made sense to you guys and perhaps if others have seen it as well.  The wave5 code is a mystery to me.

  • Hi Jonathan,

    I am checking internally with the team. Will respond to you in a day or two. 

    Apologies for the delay.

    Best Regards,

    Suren

  • Hi Jonathan,

    Can you try your test by disabling the runtime PM or revert the below patch:

    https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/commit/drivers/media/platform/chips-media/wave5?h=ti-linux-6.1.y&id=6d36c80cddb0ed5787887085bdf3a52747e107b7

    We suspect during bootup GStreamer runs some code to start/stop all the devices.(which could potentially start the hrtimer polling) and runtime PM support was added in 9.1 which would power off on idle..and if polling thread is still active it may lead to this issue

    Let us know if this helps.

    Best Regards,

    Suren

  • A git revert didn't apply cleanly so I just went and commented out the relevant sections.  I will rerun my tests and I'll let you know probably next week when I have enough test runs.

    diff --git a/drivers/media/platform/chips-media/wave5/wave5-helper.c b/drivers/media/platform/chips-media/wave5/wave5-helper.c
    index b08afbdfe2c9..167bde0acdb1 100644
    --- a/drivers/media/platform/chips-media/wave5/wave5-helper.c
    +++ b/drivers/media/platform/chips-media/wave5/wave5-helper.c
    @@ -86,8 +86,8 @@ int wave5_vpu_release_device(struct file *filp,
     		mutex_unlock(&dev->dev_lock);
     	}
     
    -	if (!pm_runtime_suspended(dev->dev))
    -		pm_runtime_put_sync(dev->dev);
    +	// if (!pm_runtime_suspended(dev->dev))
    +	// 	pm_runtime_put_sync(dev->dev);
     
     	return ret;
     }
    diff --git a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
    index 5431d934fae7..f0e950fca5e4 100644
    --- a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
    +++ b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
    @@ -1651,12 +1651,12 @@ static int wave5_vpu_open_dec(struct file *filp)
     	}
     	mutex_init(inst->inst_lock);
     
    -	err = pm_runtime_resume_and_get(inst->dev->dev);
    -	if (err) {
    -		dev_err(inst->dev->dev, "runtime resume failed %d\n", err);
    -		ret = -EINVAL;
    -		goto cleanup_inst;
    -	}
    +	// err = pm_runtime_resume_and_get(inst->dev->dev);
    +	// if (err) {
    +	// 	dev_err(inst->dev->dev, "runtime resume failed %d\n", err);
    +	// 	ret = -EINVAL;
    +	// 	goto cleanup_inst;
    +	// }
     
     	ret = mutex_lock_interruptible(&dev->dev_lock);
     	if (ret)
    diff --git a/drivers/media/platform/chips-media/wave5/wave5-vpu-enc.c b/drivers/media/platform/chips-media/wave5/wave5-vpu-enc.c
    index 44f76533abd3..599f6e5e861f 100644
    --- a/drivers/media/platform/chips-media/wave5/wave5-vpu-enc.c
    +++ b/drivers/media/platform/chips-media/wave5/wave5-vpu-enc.c
    @@ -1664,12 +1664,12 @@ static int wave5_vpu_open_enc(struct file *filp)
     		goto cleanup_inst;
     	}
     
    -	err = pm_runtime_resume_and_get(inst->dev->dev);
    -	if (err) {
    -		dev_err(inst->dev->dev, "runtime resume failed %d\n", err);
    -		ret = -EINVAL;
    -		goto cleanup_inst;
    -	}
    +	// err = pm_runtime_resume_and_get(inst->dev->dev);
    +	// if (err) {
    +	// 	dev_err(inst->dev->dev, "runtime resume failed %d\n", err);
    +	// 	ret = -EINVAL;
    +	// 	goto cleanup_inst;
    +	// }
     
     	ret = mutex_lock_interruptible(&dev->dev_lock);
     	if (ret)
    diff --git a/drivers/media/platform/chips-media/wave5/wave5-vpu.c b/drivers/media/platform/chips-media/wave5/wave5-vpu.c
    index f562c9294cf3..632bcd508462 100644
    --- a/drivers/media/platform/chips-media/wave5/wave5-vpu.c
    +++ b/drivers/media/platform/chips-media/wave5/wave5-vpu.c
    @@ -422,8 +422,8 @@ static int wave5_vpu_remove(struct platform_device *pdev)
     {
     	struct vpu_device *dev = dev_get_drvdata(&pdev->dev);
     
    -	pm_runtime_put_sync(&pdev->dev);
    -	pm_runtime_disable(&pdev->dev);
    +	// pm_runtime_put_sync(&pdev->dev);
    +	// pm_runtime_disable(&pdev->dev);
     	dev_pm_opp_of_remove_table(&pdev->dev);
     	if (dev->irq < 0) {
     		kthread_destroy_worker(dev->worker);
    @@ -476,7 +476,7 @@ static struct platform_driver wave5_vpu_driver = {
     	.driver = {
     		.name = VPU_PLATFORM_DEVICE_NAME,
     		.of_match_table = of_match_ptr(wave5_dt_ids),
    -		.pm = &wave5_pm_ops,
    +		// .pm = &wave5_pm_ops,
     		},
     	.probe = wave5_vpu_probe,
     	.remove = wave5_vpu_remove,
    

  • Sounds good Jonathan. I will wait until next week for your test results.

    Best Regards,

    Suren

  • I should have looked at these results sooner. Looks like I broke something major with the above patch.

    Every test cycle is now crashing when trying to capture images from the camera. The only cycles that don't crash there, are the ones that crash early in boot which happened about 175 times out of 5900.

    [45.193] TESTFIXTURE:Capturing test patterns
    [45.196] 
    [45.196] [   22.770573] Internal error: synchronous external abort: 0000000096000010 [#1] PREEMPT SMP
    [45.757] [   22.778767] Modules linked in: rpmsg_ctrl xhci_plat_hcd rpmsg_char dwc3 cdns_csi2rx crct10dif_ce e5010_jpeg_enc k3_j72xx_bandgap rtc_ti_k3 dwc3_am62 wave5 ti_k3_r5_remoteproc j721e_csi2rx v4l2_mem2mem videobuf2_dma_contig imx219 videobuf2_memops videobuf2_v4l2 v4l2_fwnode ti_k3_dsp_remoteproc videobuf2_common v4l2_async virtio_rpmsg_bus rpmsg_ns ti_k3_common videodev sa2ul cdns_dphy_rx cfg80211 mc rfkill ltc2945 at24 optee_rng rng_core overlay fuse drm drm_panel_orientation_quirks ipv6
    [45.800] [   22.821806] CPU: 3 PID: 413 Comm: gst-plugin-scan Not tainted 6.1.46-00082-gd40bc9ff4f0b #9
    [45.809] [   22.830142] Hardware name: Critical Link MitySOM-AM62A (DT)
    [45.814] [   22.835699] pstate: 000000c5 (nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [45.821] [   22.842645] pc : wave5_vdi_readl+0x14/0x30 [wave5]
    [45.826] [   22.847451] lr : wave5_vpu_irq+0x3c/0x1f8 [wave5]
    [45.831] [   22.852154] sp : ffff800009973df0
    [45.834] [   22.855456] x29: ffff800009973df0 x28: ffff00087f86a740 x27: ffff0008025fba00
    [45.841] [   22.862580] x26: 0000000000000001 x25: ffff800009353008 x24: 00000000000000c0
    [45.848] [   22.869703] x23: ffff800009353008 x22: ffff8000096b99e0 x21: ffff0008025fb880
    [45.855] [   22.876827] x20: ffff0008032463c0 x19: ffff0008025fba00 x18: 0000000000000000
    [45.863] [   22.883951] x17: ffff800876513000 x16: ffff800009970000 x15: 0000000000000000
    [45.870] [   22.891076] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
    [45.877] [   22.898198] x11: 0000000000000040 x10: ffff000800005370 x9 : ffff800000ffdf94
    [45.884] [   22.905321] x8 : ffff0008004004b8 x7 : 0000000000000000 x6 : 0000000000000000
    [45.903] [   22.912445] x5 : 0000000000000000 x4 : ffff0008025fba00 x3 : ffff00087f86ac50
    [45.905] [   22.919569] x2 : 0000000000000000 x1 : 0000000000000044 x0 : ffff80000a4c0044
    [45.905] [   22.926694] Call trace:
    [45.908] [   22.929131]  wave5_vdi_readl+0x14/0x30 [wave5]
    [45.912] [   22.933576]  wave5_vpu_timer_callback+0x2c/0x88 [wave5]
    [45.917] [   22.938799]  __hrtimer_run_queues+0x150/0x328
    [45.922] [   22.943152]  hrtimer_interrupt+0xf0/0x250
    [45.926] [   22.947154]  arch_timer_handler_phys+0x34/0x50
    [45.930] [   22.951589]  handle_percpu_devid_irq+0x90/0x240
    [45.935] [   22.956112]  generic_handle_domain_irq+0x34/0x50
    [45.939] [   22.960718]  gic_handle_irq+0x58/0x13c
    [45.943] [   22.964459]  call_on_irq_stack+0x24/0x4c
    [45.947] [   22.968373]  do_interrupt_handler+0x88/0x94
    [45.951] [   22.972547]  el0_interrupt+0x50/0xd8
    [45.955] [   22.976116]  __el0_irq_handler_common+0x18/0x28
    [45.959] [   22.980635]  el0t_64_irq_handler+0x10/0x20
    [45.963] [   22.984722]  el0t_64_irq+0x18c/0x190
    [45.967] [   22.988292] Code: d503201f f940ac00 d503233f 8b214000 (b9400000) 
    [45.973] [   22.994373] ---[ end trace 0000000000000000 ]---
    [45.978] [   22.998979] Kernel panic - not syncing: synchronous external abort: Fatal exception in interrupt
    [45.986] [   23.007743] SMP: stopping secondary CPUs
    [45.990] [   23.011663] Kernel Offset: disabled
    [45.994] [   23.015139] CPU features: 0x00000,00800084,0000420b
    [45.999] [   23.020004] Memory Limit: none
    [46.002] [   23.023052] ---[ end Kernel panic - not syncing: synchronous external abort: Fatal exception in interrupt ]---
    

    I'm guessing I missed a pm call somewhere or some other part of the original commit which clock gates the wave5.  I'll try again.

  • Misses this, restarting test

    --- a/drivers/media/platform/chips-media/wave5/wave5-vpu.c
    +++ b/drivers/media/platform/chips-media/wave5/wave5-vpu.c
    @@ -397,8 +397,8 @@ static int wave5_vpu_probe(struct platform_device *pdev)
                    (match_data->flags & WAVE5_IS_DEC) ? "'DECODE'" : "",
                    dev->product_code);
     
    -       pm_runtime_enable(&pdev->dev);
    -       wave5_vpu_sleep_wake(&pdev->dev, true, NULL, 0);
    +       // pm_runtime_enable(&pdev->dev);
    +       // wave5_vpu_sleep_wake(&pdev->dev, true, NULL, 0);

  • Let us know how it goes with the latest test run of yours. 

    Best Regards,

    Suren

  • With the wave5 pm disabled, my testing so far has run ~11,600 boot cycles without a wave5 crash.  Assuming the boot crash is really 1 in 5000, I should have seen at least 2 crashes so far.

  • I am glad that the tests ran after disabling the runtime PM calls.  If this has solved your issue, can we have this closed. 

    Best Regards,

    Suren

  • Ran an additional 29k test runs and saw no wave5 crashes.

  • That's great Jonathan. I am glad you were able to validate. 

    I am going to close this thread, feel free to open a new thread for any further assistance.

    Best Regards,

    Suren