TDA4VM: A72 crashed and RTI0 ESM 344 is triggered

Part Number: TDA4VM

Tool/software:

Hi,

    During operation of the TDA4, the RTI0 A72 watchdog timed out after the A72 crashed at 2024/12/31 9:51:59 (after running for 52 minutes), but the kernel log (journalctl) did not record any exceptions.

1. What causes this situation?
2. Could you provide debugging ideas and suggestions about the A72 crash?
We are using SDK 8.6. Watchdog driver source: psdkla/board-support/linux-5.10.162+gitAUTOINC+76b3e88d56-g76b3e88d56/drivers/watchdog/rti_wdt.c
kernel log (journalctl):
  • During operation of the TDA4, the RTI0 A72 watchdog timed out after the A72 crashed at 2024/12/31 9:51:59 (after running for 52 minutes), but the kernel log (journalctl) did not record any exceptions.

    1. What causes this situation?

    How is this log indicating that watchdog timed out? Did you see a reset of Linux?

    2. Could you provide debugging ideas and suggestions about the A72 crash?

    Please share the complete logs as a text file attachment.

    - Keerthy

  • "How is this log indicating that watchdog timed out? Did you see a reset of Linux?"

    Yes. We see that the ESM 344 event was triggered, and the system was reset.

    "Please share the complete logs as a text file attachment."

    We only found that the application log stopped (without any errors) and that ESM 344 was triggered.

    We have identified the following points that may need to be optimized. Could you please provide some suggestions?
    1. The watchdogd kworker task runs at FIFO priority 50, the same as the vxe_enc, mmc, and other IRQ threads. Can we set the watchdog priority to FIFO 99?

    2. All system interrupts are bound to core0 by default. Can we move the vxe-enc and cpsw9g IRQs to core1? Would that cause any performance issues, for example with cache synchronization?

  • Hi,

    https://www.geeksforgeeks.org/priority-of-process-in-linux-nice-value/

    Linux Nice value could be one way. We do not have expertise on the user space side.
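
    Because the question is about a task that already runs under SCHED_FIFO, where the real-time priority rather than the nice value determines scheduling order, the following is a minimal sketch using chrt from util-linux. It assumes the kernel thread is named "watchdogd", as described in the post above. The helper set_fifo_prio and the DRY_RUN variable are hypothetical names used only for illustration. Whether raising the priority helps with this particular crash has not been verified.

```shell
#!/bin/sh
# Sketch: raise the real-time priority of a SCHED_FIFO task with chrt.
# Assumption: the watchdog kernel thread is named "watchdogd".

# Set SCHED_FIFO with the given priority on a PID.
# With DRY_RUN=1 the command is only printed, not executed.
set_fifo_prio() {
    pid="$1"
    prio="$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "chrt -f -p $prio $pid"
    else
        chrt -f -p "$prio" "$pid"
    fi
}

# Example (run as root on the target):
#   pid=$(pgrep -x watchdogd)
#   chrt -p "$pid"             # show the current policy and priority
#   set_fifo_prio "$pid" 99    # switch to SCHED_FIFO priority 99
```

    A change made this way does not persist across reboots, so it would have to be reapplied at startup if it turns out to be useful.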

    "All system interrupts are bound to core0 by default. Can we move the vxe-enc and cpsw9g IRQs to core1? Would that cause any performance issues, for example with cache synchronization?"

    Yes.

    https://docs.kernel.org/core-api/irq/irq-affinity.html

    cd to /proc/irq/n, where n is the CPSW9g IRQ number, and run:

    echo 0x2 > smp_affinity
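
    The value written to smp_affinity is a hexadecimal CPU bitmask, so 0x2 selects core1 only. The steps above can be sketched as a small script. The function names and the IRQ_DIR override are hypothetical and exist only to make the sketch easy to try out; the IRQ number n has to be looked up in /proc/interrupts on the target.

```shell
#!/bin/sh
# Sketch: pin an IRQ to a single CPU core via /proc/irq/<n>/smp_affinity.

# Convert a CPU index to the hex bitmask that smp_affinity expects
# (CPU 0 -> 1, CPU 1 -> 2).
cpu_to_mask() {
    printf '%x\n' $((1 << $1))
}

# Write the mask for <cpu> to IRQ <irq>.
# IRQ_DIR is a hypothetical override for trying this outside /proc;
# it defaults to /proc/irq.
set_irq_affinity() {
    irq="$1"
    cpu="$2"
    dir="${IRQ_DIR:-/proc/irq}"
    cpu_to_mask "$cpu" > "$dir/$irq/smp_affinity"
}

# Example (run as root on the target):
#   cat /proc/interrupts             # find the IRQ number n
#   set_irq_affinity <n> 1           # move IRQ n to core1
#   cat /proc/irq/<n>/smp_affinity   # read back the mask
```

    Reading the mask back after the write is a simple way to check that the kernel accepted the new affinity.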

    - Keerthy
  • Hi,

        We have reproduced this bug: crash on cpsw interrupt...

  • Hi, 

    Is the crash consistently reproducible now?

    What are the active use cases that need to be run to reproduce this? 

    Best Regards,

    Keerthy 

  • "Is the crash consistently reproducible now?"

         We only caught this log once, but the crash reproduces with low probability.

    "What are the active use cases that need to be run to reproduce this?"

        We run the view tool, which sends a large amount of video data over the internet.

  • Hi,

    Okay. I am sharing a potential fix. Please check whether it fixes the issue.

    diff --git a/drivers/soc/ti/k3-ringacc.c b/drivers/soc/ti/k3-ringacc.c
    index 148f54d96..164d3999b 100644
    --- a/drivers/soc/ti/k3-ringacc.c
    +++ b/drivers/soc/ti/k3-ringacc.c
    @@ -1177,11 +1177,13 @@ static int k3_ringacc_ring_push_mem(struct k3_ring *ring, void *elem)
     
     static int k3_ringacc_ring_pop_mem(struct k3_ring *ring, void *elem)
     {
    -       void *elem_ptr;
    +       volatile dma_addr_t *elem_ptr;
     
            elem_ptr = k3_ringacc_get_elm_addr(ring, ring->state.rindex);
     
    -       memcpy(elem, elem_ptr, (4 << ring->elm_size));
    +       while (*elem_ptr == 0);
    +       memcpy_fromio(elem, elem_ptr, (4 << ring->elm_size));
    +       memset_io(elem_ptr, 0, (4 << ring->elm_size));
     
            ring->state.rindex = (ring->state.rindex + 1) % ring->size;
            ring->state.occ--;
    

    - Keerthy