TDA4VH-Q1: Kernel panic - not syncing: Asynchronous SError Interrupt - Wave5 driver

Part Number: TDA4VH-Q1
Other Parts Discussed in Thread: J784S4XEVM

Hi,

Please point me in the right direction for debugging this System Error (SError).  I have a custom board with a TDA4VH-Q1 SoC.  This async error consistently occurs in the wave5_dec_clr_disp_flag() call while starting up a GStreamer pipeline with the v4l2h264dec decoder.  Thanks for the help...

[ 2833.600901] SError Interrupt on CPU5, code 0x00000000bf000000 -- SError
[ 2833.600920] CPU: 5 UID: 0 PID: 4129 Comm: queue1:src Not tainted 6.15.8-dirty #2 PREEMPT(voluntary)
[ 2833.600926] Hardware name: ---
[ 2833.600929] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 2833.600933] pc : wave5_dec_clr_disp_flag+0x40/0x80 [wave5]
[ 2833.600959] lr : wave5_dec_clr_disp_flag+0x40/0x80 [wave5]
[ 2833.600964] sp : ffff800095ebba30
[ 2833.600965] x29: ffff800095ebba30 x28: ffff0008021acd30 x27: 0000000000000000
[ 2833.600972] x26: ffff000805894010 x25: ffff800079a02e98 x24: ffff000807f1ba00
[ 2833.600977] x23: ffff800095ebbcc8 x22: ffff0008021acd50 x21: ffff0008065b8000
[ 2833.600981] x20: ffff000805894000 x19: ffff000805894000 x18: 0000000000000000
[ 2833.600986] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffff80006278
[ 2833.600990] x14: 0000000100000000 x13: 0000000000000000 x12: 0000000000000000
[ 2833.600995] x11: ffffffffffffffff x10: ffffffffffffffff x9 : 0000000000000000
[ 2833.600999] x8 : ffff000807f1baa0 x7 : 0000000000000000 x6 : 0000000000000001
[ 2833.601003] x5 : 0000000000000001 x4 : ffff000807f1b8b0 x3 : 0000000000000000
[ 2833.601007] x2 : 0000000000000000 x1 : ffff800082ea0118 x0 : ffff800082ea0000
[ 2833.601014] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 2833.601016] CPU: 5 UID: 0 PID: 4129 Comm: queue1:src Not tainted 6.15.8-dirty #2 PREEMPT(voluntary)
[ 2833.601020] Hardware name: ---
[ 2833.601022] Call trace:
[ 2833.601025] show_stack+0x18/0x30 (C)
[ 2833.601039] dump_stack_lvl+0x60/0x80
[ 2833.601046] dump_stack+0x18/0x24
[ 2833.601050] panic+0x168/0x360
[ 2833.601054] nmi_panic+0x88/0x90
[ 2833.601059] arm64_serror_panic+0x64/0x80
[ 2833.601064] do_serror+0x3c/0x70
[ 2833.601068] el1h_64_error_handler+0x30/0x50
[ 2833.601076] el1h_64_error+0x6c/0x70
[ 2833.601079] wave5_dec_clr_disp_flag+0x40/0x80 [wave5] (P)
[ 2833.601085] wave5_vpu_dec_clr_disp_flag+0x54/0x80 [wave5]
[ 2833.601090] wave5_vpu_dec_buf_queue+0x148/0x150 [wave5]
[ 2833.601095] __enqueue_in_driver+0x3c/0x80 [videobuf2_common]
[ 2833.601100] vb2_core_qbuf+0x438/0x5b0 [videobuf2_common]
[ 2833.601104] vb2_qbuf+0xac/0x190 [videobuf2_v4l2]
[ 2833.601111] v4l2_m2m_qbuf+0x6c/0x240 [v4l2_mem2mem]
[ 2833.601119] v4l2_m2m_ioctl_qbuf+0x18/0x490 [v4l2_mem2mem]
[ 2833.601123] v4l_qbuf+0x48/0x70 [videodev]
[ 2833.601136] __video_do_ioctl+0x3f4/0x470 [videodev]
[ 2833.601144] video_usercopy+0x1e4/0x690 [videodev]
[ 2833.601151] video_ioctl2+0x18/0x30 [videodev]
[ 2833.601159] v4l2_ioctl+0x40/0x60 [videodev]
[ 2833.601167] __arm64_sys_ioctl+0xac/0xe0
[ 2833.601176] invoke_syscall+0x48/0x110
[ 2833.601182] el0_svc_common.constprop.0+0x40/0xe0
[ 2833.601186] do_el0_svc+0x1c/0x30
[ 2833.601189] el0_svc+0x30/0xd0
[ 2833.601193] el0t_64_sync_handler+0x10c/0x140
[ 2833.601197] el0t_64_sync+0x198/0x19c
[ 2833.601201] SMP: stopping secondary CPUs
[ 2833.601215] Kernel Offset: disabled
[ 2833.601217] CPU features: 0x0400,00040050,01000400,8200421b
[ 2833.601221] Memory Limit: none
[ 2833.880806] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---

  • Hi, 

    What HLOS SDK version are you using? Is there a particular use-case that you are seeing this happen with? I have not noticed this kernel panic when starting a decoder pipeline yet.

    Thanks,
    Sarabesh S.

  • Hi,

    I am currently on the 6.15.8 Linux kernel.  The panic occurs when I start a single GStreamer decode pipeline with the v4l2h264dec plugin.  The panic seems to always occur in wave5_dec_clr_disp_flag().  I have been decoding a live H.264 video stream.  Today, I sourced video from an MP4 file and the panic occurred at the same point.  The panic occurs suspiciously close to wave5_vpu_dec_start_streaming(), but I am not sure how to isolate the source of the async error.

    Thanks,

    Jeff

  • Could you share the gstreamer pipeline?

    Thanks,
    Sarabesh S.

  • Hi Sarabesh,

    Below is a pipeline which induces the SError every time.  The stack trace always indicates the issue is encountered when wave5_dec_clr_disp_flag() is executing.  I have built a debug kernel and hit the breakpoint in wave5_dec_clr_disp_flag() using KGDB.  As I step into the code, the SError is triggered immediately.  I wanted to read the CFSR register at 0xE000ED28, but the address was not valid (I think because there are multiple cores).  I am not sure of the next steps for diagnosing the cause of the fault...

    Note that $1 is just the path to an MP4 file.

    /opt/GStreamer/bin/gst-launch-1.0 \
    filesrc location=${1} \
    ! qtdemux name=demux demux.video_0 \
    ! h264parse config-interval=0 \
    ! v4l2h264dec capture-io-mode=4 \
    ! videoconvert \
    ! videorate ! video/x-raw,max-rate=30/1 \
    ! queue \
    ! rtpvrawpay \
    ! 'application/x-rtp, media=(string)video, encoding-name=(string)RAW' \
    ! udpsink host=127.0.0.1 port=5000 sync=false async=false

    Thanks again,
    Jeff

  • Hi Jeff, 

    Thanks for the information. I'll review this and get back to you. 

    Regards,
    Sarabesh S.

  • Thanks Sarabesh. 

    I did a bit more digging.  I see that the ESR is 0x00000000bf000000.  This decodes to an SError with IDS==1.  The Arm A-profile Architecture Reference Manual indicates that in this case, ESR_EL1[23:0] are IMPLEMENTATION DEFINED.  I am not sure where to find the implementation-specific definition for ISS[23:0]==0.  Please let me know what you find.  I am stuck.

    Thanks, Jeff.
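
For readers following the ESR decoding above, here is a minimal shell sketch of the field extraction. The function name `decode_esr` is hypothetical; the bit positions (EC in [31:26], IL in [25], IDS in [24] for the SError class) follow the Arm architecture's ESR_ELx layout as described in the post.

```shell
#!/usr/bin/env bash
# Split an AArch64 ESR_ELx value into its top-level fields.
# decode_esr is an illustrative helper, not part of any kernel tooling.
decode_esr() {
    local esr=$(( $1 ))
    local ec=$(( (esr >> 26) & 0x3f ))   # exception class, bits [31:26]
    local il=$(( (esr >> 25) & 0x1 ))    # instruction length, bit [25]
    local ids=$(( (esr >> 24) & 0x1 ))   # SError class only: 1 = IMPLEMENTATION DEFINED syndrome
    local iss=$(( esr & 0xffffff ))      # remaining syndrome bits [23:0]
    printf 'EC=0x%02x IL=%d IDS=%d ISS[23:0]=0x%06x\n' "$ec" "$il" "$ids" "$iss"
}

# Value taken from the "SError Interrupt on CPU5, code ..." line of the panic log.
decode_esr 0x00000000bf000000
```

For the value in the log this prints `EC=0x2f IL=1 IDS=1 ISS[23:0]=0x000000`. EC 0x2f is the SError class, and with IDS set and the low 24 bits all zero, the syndrome by itself carries no further detail about the source of the error.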

  • Thanks Jeff, 

    Currently discussing with the team. I'll follow up soon.

    Regards,
    Sarabesh S.

  • Hi Sarabesh,

    Do you have any updates following discussion with the team? This is a high priority issue and your guidance is much appreciated.

    Best,

    Luke

  • Hi Luke, 

    Have you been able to reproduce this on a TI EVM at all?

  • Hello,

    Could you please share the input stream? I will try to replicate this on my setup with the latest SDK. 

    Thank you,
    Sarabesh S.

  • Hi.  Attached is the source video used in the above pipeline.  Note that the pipeline was working with the v6.12.17 Linux kernel.  When we stepped forward to the v6.15.8 kernel, we started seeing the issue.  The wave5 codec driver requires a 70MB allocation of CMA memory.  The CONFIG_ARCH_FORCE_MAX_ORDER kernel config option was set to 15 (assuming a 4k page size) in order to support this allocation.

    Thanks,
    Jeff
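
The relationship between the max order, the page size, and the 70MB allocation mentioned above can be checked with a little arithmetic. This is only a sketch under assumptions: it takes the max order to be inclusive (as I understand it to be in recent kernels), uses 10 as the comparison value for 4k pages (believed to be the arm64 default, not confirmed in this thread), and the helper name `max_block_mib` is made up for illustration.

```shell
#!/usr/bin/env bash
# Largest physically contiguous block for a given max order:
# (2^order pages) * page size. Arguments: order, page size in KiB.
# max_block_mib is a hypothetical helper for this illustration only.
max_block_mib() {
    echo $(( (1 << $1) * $2 / 1024 ))
}

# Assumed default order (10) versus the value used in this thread (15).
for order in 10 15; do
    printf 'order %d with 4 KiB pages: %d MiB\n' "$order" "$(max_block_mib "$order" 4)"
done
```

Under those assumptions, order 10 gives 4 MiB and order 15 gives 128 MiB, so only the latter is large enough to cover a single 70MB block.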

  • Hi Jeffrey, 

    Thanks for the stream. A few clarifying questions below:

    1. Did you already set the CMA allocation w/ CONFIG_ARCH_FORCE_MAX_ORDER=15 in the 6.12 kernel or is it new? What was the default?
    2. Can you confirm what CMA size is being passed in the U-Boot kernel command line (the cma= parameter)? 
    3. Does the kernel crash happen if you change the CMA in the u-boot cmdline as shown here:
      https://software-dl.ti.com/jacinto7/esd/processor-sdk-linux-j784s4/11_01_00_03/exports/docs/linux/How_to_Guides/Target/How_To_Carve_Out_CMA.html#how-to-configure-the-cma-size

    Thank you,
    Sarabesh S.

  • Hi Sarabesh,

    I have been working with Jeffrey on this issue, and we have a lot more detail now.

    First, the CMA configuration and CONFIG_ARCH_FORCE_MAX_ORDER=15 were required on both the upstream 6.12.17 kernel as well as the 6.16.9 kernel (or the 6.15.8 kernel, both behave the same).  Without this, the pipeline will not even start.  So, this is simply a requirement for now for us to make this start at all.

    Second, I was able to set up a minimal reproducer based on the gstreamer command line given above and the Big Buck Bunny video Jeff attached previously.  I was able to reproduce the crash on the TI J784S4XEVM board, so I am presently doing my testing on it.

    I bisected the issue with the mainline kernel versions on the J784S4XEVM board from 6.12.17 to 6.16.9 and found the commit which caused the issue to be:

    https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/commit?id=2092b3833487e5ce138f4303f98e46ba0f87f1d0

    Note that this went into the 6.13 kernel series, but is currently being backported by TI's SDK version 11 into TI's 6.12 vendor kernel.

    I also learned that if I simply revert this commit in the 6.16.9 mainline kernel the problem goes away.

    I found this patch series was just posted by the codec IP block vendor:

    https://lore.kernel.org/linux-media/20250922055255.116-1-jackson.lee@chipsnmedia.com/

    I tried applying those patches, and the problem continued to occur.  So, at this point I believe this to be an issue that still exists upstream.

    We have learned that turning up GStreamer debugging (GST_DEBUG=6) while playing the Big Buck Bunny video makes the crash occur reliably.  Interestingly, the extra debugging is not needed to trigger the crash in our production-intent use.  Since the commit that introduced the problem adds codec suspend support, I assume the trigger might just be related to the workload being pushed to the codec.

    Here is the command I am using to reproduce this issue:

    export GST_DEBUG=6
    gst-launch-1.0 \
        filesrc location=Big_Buck_Bunny_720_10s_30MB.mp4 \
        ! qtdemux name=demux demux.video_0 \
        ! h264parse config-interval=0 \
        ! v4l2h264dec capture-io-mode=4 output-io-mode=4 \
        ! videoconvert \
        ! videorate ! video/x-raw,max-rate=30/1 \
        ! queue \
        ! rtpvrawpay \
        ! 'application/x-rtp, media=(string)video, encoding-name=(string)RAW' \
        ! udpsink host=127.0.0.1 port=5000 sync=false async=false

    I plan to attempt this test with the latest tagged TI vendor kernel to see if it is happening there, but my guess based on what we are seeing is that it is.

    I will add an update to this ticket when I have those details.

  • A quick update on the ti-linux-kernel:

    I just tested TI's kernel (commit: ccfe8fee8026cbb23dcd9c69a2bd961c99c58567, tag: 11.01.14) on the board with the same conditions described in my previous post.

    TI's kernel does not exhibit the crash in my testing.  This surprised me, but it is good to know.  Either other patches that TI has applied to the wave5 codec may have fixed this issue, or perhaps some interplay of the power management code elsewhere in the kernel has changed since 6.12 that has caused the wave5 codec to crash in newer kernels.

    At this point I think we have a path forward, which is to simply revert the offending kernel commit.  Long-term, I expect this will be fixed upstream, but in the meantime we have a usable workaround.

    I think this ticket can now be closed.

  • Hello, 

    Glad you were able to find the fix in the 11.01.14 tag. I agree that this is related to the runtime PM support commit. The resolution is likely due to the longer autosuspend delay, which keeps the VPU from being powered down while frames are still in use. I noticed the CNM patch series you shared sets the delay to 500ms, but in our tree it is set to 5000ms (https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/commit/drivers/media/platform/chips-media/wave5/wave5-vpu.c?h=11.01.14&id=b4249050fcdb302f50ca047f4606dc8025cf27bb). GST_DEBUG=6 probably slowed things down enough for autosuspend to fire mid-operation, which caused the crash.

    Regards,
    Sarabesh S.
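
To see which autosuspend delay a running kernel is actually using, the generic runtime PM attributes in sysfs (`power/autosuspend_delay_ms` and `power/runtime_status`) can be read without rebuilding anything. The sketch below is an assumption-laden helper, not TI tooling: the function name `show_autosuspend` is hypothetical, and the sysfs name of the wave5 device on a given board is not stated in this thread, so it is left as a substring argument.

```shell
#!/usr/bin/env bash
# List runtime PM autosuspend settings for devices below a sysfs root.
# Usage: show_autosuspend [sysfs_root] [name_substring]
# show_autosuspend is an illustrative helper; the device name to filter on
# is board-specific and must be supplied by the caller.
show_autosuspend() {
    local root=${1:-/sys/devices}
    local match=${2:-}
    local f dev name delay status
    find "$root" -path '*/power/autosuspend_delay_ms' 2>/dev/null |
    while IFS= read -r f; do
        dev=${f%/power/autosuspend_delay_ms}
        name=${dev##*/}
        # Keep only devices whose name contains the requested substring.
        case $name in *"$match"*) ;; *) continue ;; esac
        # Reading fails for devices that do not use autosuspend; skip them.
        delay=$(cat "$f" 2>/dev/null) || continue
        status=$(cat "$dev/power/runtime_status" 2>/dev/null)
        printf '%s delay_ms=%s status=%s\n' "$name" "$delay" "$status"
    done
}
```

Calling `show_autosuspend /sys/devices codec` (the substring is a guess) would print one line per matching device. As far as I know the same `autosuspend_delay_ms` file is writable by root, so writing 5000 to it may be a quick way to try the longer delay before patching the driver.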

  • Hi Sarabesh,

    I was able to test the unmodified 6.16.9 kernel again, and only changed that one line to set the PM autosuspend delay to 5000 ms.  That also worked for the Big Buck Bunny test reproducer, so that may be the best long-term solution.

    Thanks for the idea to check into that, it was very helpful.

  • Glad to hear it. Let us know if you have any further questions.

    Regards,
    Sarabesh S.