TDA4VM: Capture Node locking up due to a streaming camera being disconnected

Stuart Burtner

Part Number: TDA4VM

Hello,

I have a use case where a camera will be asynchronously plugged into and unplugged from the TDA4. I have written an application to handle this: it detects a disconnect, reprobes the camera, and so on.

After extended testing, I'm finding what appears to be a race condition: after a high volume of connect / disconnect events, the CSIRX gets stuck in a state from which it cannot recover except via a reboot.

Below is some unorganized debugging information I've collected. There seem to be two "types" of crashes:

CSIRX Registers:
- During some of the crashes, I get this behavior:
  - During the lockup, status register 0x04504104 reports 0x80000111, 0x80000112 and 0x800001B3
  - The IRQ register 0x04504028 reports several errors. Upon clearing it, I noticed these errors are permanently stuck: 0x00001060
- During other crashes:
  - During the lockup, the status register 0x04504104 reports seemingly normal states, like 0x80000111, 0x80000112 and 0x80000133
Capture Node:
- During the crash event, the capture node will always time out and receive a callback indicating that there is no new frame:
  - in the function `tivxCaptureTimeout`, the event `frame_available` is never posted via the callback `captDrvCallback`
CSIRX Driver:
- Similarly, the CSIRX driver is not doing anything from what I can tell. I added printfs to the beginning of every function and see absolutely no activity after this happens
- (I added printfs to csirx_drv.c, csirx_event.c, and csirx_drvUdma.c)
- During one type of crash, the ECC and CRC IRQs get stuck high, and so the event service routine will endlessly process it on all available clock cycles when a camera is streaming
- During the other type of crash, I see no activity inside of the CSIRX driver
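
When comparing register dumps like the ones above, it can help to reduce each value to its set bit positions and to XOR two samples to see which bits differ. The following is a minimal sketch of that arithmetic only; the helper names `set_bit_positions` and `changed_bits` are hypothetical and not part of any TI driver, and the meaning of each bit still has to be looked up in the CSIRX register description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the positions of all set bits in `value`
 * to `out` (lowest bit first) and return how many were written. */
static size_t set_bit_positions(uint32_t value, unsigned out[32])
{
    size_t n = 0;

    assert(out != NULL);
    for (unsigned bit = 0; bit < 32U; bit++) {
        if ((value >> bit) & 1U) {
            out[n++] = bit;
        }
    }
    return n;
}

/* Hypothetical helper: bits that differ between two samples of one register. */
static uint32_t changed_bits(uint32_t a, uint32_t b)
{
    return a ^ b;
}
```

Applied to the values in this post, the stuck IRQ value 0x00001060 has bits 5, 6 and 12 set, and the status value 0x800001B3 differs from 0x80000133 only in bit 7.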

I believe the bug occurs the moment the camera stream is cut from the TDA4, and will not fix itself until it is rebooted. Other CSIRX ports will continue working. It takes hundreds of disconnects in order to reproduce this issue.

EDIT: I have reproduced the issue by simply disabling / enabling the CSI output of the deserializer. As far as I understand, this confirms the issue lies within the TDA4 somewhere - either the CSIRX driver chain or the CSIRX hardware.

I have a few questions:

1. Where is this thing getting stuck? How could I debug it further? Can it be fixed?
2. Is it possible to issue a soft reset to some device in the TDA4, such as the CSIRX or related nodes, to get operation up and running again?
  1. I've tried the following soft resets with no luck:
    1. CSI_RX_IF_VBUS2APB_SOFT_RESET
    2. CSI_RX_IF_VBUS2APB_STREAM0_CTRL
  2. I've also tried tearing down the OpenVX pipeline completely, then recreating it. Unfortunately, this doesn't fix anything.

over 1 year ago

0 Neehar Sawant over 1 year ago

TI__Mastermind 22420 points

Hi Stuart,

Our expert is currently out of office, please expect a delay in response. Thank you for your patience.

Thanks,

Neehar

0 Stuart Burtner over 1 year ago in reply to Neehar Sawant

Intellectual 301 points

Thank you Neehar - is there any update on TI's side on this issue?

0 Brijesh Jadav over 1 year ago

TI__Guru**** 485125 points

Hi Stuart,

Sorry for the late reply.

Most likely, the CSIRX module is getting stuck and has stopped capturing. Which SDK release are you using? A soft reset should have helped, but additional steps might be required. Could you please share the exact steps that you follow on detection of an unplug event and also on a plug-in event?

Regards,

Brijesh

0 Stuart Burtner over 1 year ago in reply to Brijesh Jadav

Intellectual 301 points

Hi Brijesh:

I am using version 8.02 right now.

The usecase was designed to detect hotplug events by querying the deserializer for the presence of the link. Because of this issue, I have disabled this feature and now it's acting similar to a single-cam usecase.

To recreate this problem easily - you need to do the following:

1. Start with a singlecam usecase
2. Edit the code to supply a valid "test frame" (`status = app_send_test_frame(obj->pipe1->capture_node, obj->pipe1->fs_test_raw_image);`) to the capture node to enable a timeout frame
3. Over I2C, execute a sequence to continuously enable / disable CSI output from the deserializer (toggling once a second is fine). Leave this running.
4. After 1 - 15 minutes, the CSIRX will hang (symptoms stated before)
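
The toggling step can be sketched roughly as follows. This is only an illustration of the loop structure: the callback types, `toggle_csi_output`, and the recording stub are all hypothetical, and the actual MAX9296 register write is deliberately left behind a callback because its address and value are not given in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callbacks: the real register write and delay are platform specific. */
typedef int (*csi_out_write_fn)(void *ctx, bool enable);
typedef void (*delay_ms_fn)(uint32_t ms);

/* Disable and then re-enable the deserializer CSI output `cycles` times,
 * waiting `period_ms` after each write.
 * Returns 0 on success, or the first non-zero write status. */
static int toggle_csi_output(csi_out_write_fn write_fn, void *ctx,
                             delay_ms_fn delay, uint32_t cycles,
                             uint32_t period_ms)
{
    for (uint32_t i = 0; i < cycles; i++) {
        int rc = write_fn(ctx, false);
        if (rc != 0) {
            return rc;
        }
        delay(period_ms);

        rc = write_fn(ctx, true);
        if (rc != 0) {
            return rc;
        }
        delay(period_ms);
    }
    return 0;
}

/* Recording stub standing in for the real I2C write. */
struct write_log {
    unsigned writes;   /* total writes seen */
    unsigned enables;  /* writes that enabled the output */
    bool last;         /* value of the most recent write */
};

static int log_write(void *ctx, bool enable)
{
    struct write_log *log = ctx;

    log->writes++;
    if (enable) {
        log->enables++;
    }
    log->last = enable;
    return 0;
}

/* Stub delay so the sketch runs instantly off-target. */
static void no_delay(uint32_t ms)
{
    (void)ms;
}
```

On a real system, `log_write` would be replaced by the I2C write that stops or starts CSI output, and `no_delay` by a sleep of about one second. The loop always ends with the output enabled, so a hang shows up as missing frames rather than as a deliberately disabled stream.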

This system is using a MAX9296 deserializer, so I haven't looked into how I'd do the toggling with an FPD Link.

0 Brijesh Jadav over 1 year ago in reply to Stuart Burtner

TI__Guru**** 485125 points

Hi Stuart,

But when you disconnect and reconnect the deserializer, are you not required to reconfigure the deserializer, and probably the full chain from the deserializer? Do you detect the disconnect event from the capture node in the application and take any action in the application?

Regards,

Brijesh

0 Stuart Burtner over 1 year ago in reply to Brijesh Jadav

Intellectual 301 points

Thank you for the reply,

Brijesh Jadav said:
But when you disconnect and reconnect the deserializer, are you not required to reconfigure the deserializer, and probably the full chain from the deserializer?

No - I'm writing to a register that stops CSI output from the deserializer.

I do not reset / turn off the deserializer at all in these tests, nor do I need to reconfigure it. (Similarly, in the normal application the deserializer remains connected, only the serializer & camera need to be reconfigured.)

What I'm doing is the direct equivalent of toggling the CSI_ENABLE bit of the UB960.

Brijesh Jadav said:
Do you detect the disconnect event from the capture node in the application and take any action in the application?

I do not need to - I leave the capture node alone. It will put up an error frame and time out because there are no images coming into the pipeline.

Some context on how I got here, if you are still curious:

My usecase involves the plugging / unplugging of cameras. It periodically queries the deserializer about the state of the link between the serdes pair. It is capable of distinguishing whether the link is connected, configured, or streaming, and will respond accordingly by configuring the serializer, configuring the camera, or enabling the stream, respectively.

The bug was found under these particular conditions: We have a camera being connected & disconnected, while the camera remains powered (which means the camera and serializer do not need to be reconfigured, but we do issue a link-reset on the deserializer to search for a new connection). After many connect / disconnect events, we found that the CSIRX seems to lock up - similar to the situation described above.

We did confirm that the other CSIRX instance (CSIRX1, instead of CSIRX0) does still work when this happens.

Eventually we tried toggling just the CSI output of the deserializer (as stated above) to see if that would reproduce the error, which it did.

0 Stuart Burtner over 1 year ago in reply to Brijesh Jadav

Intellectual 301 points

Brijesh, is there any other context you would need to support us with this issue?

0 Brijesh Jadav over 1 year ago in reply to Stuart Burtner

TI__Guru**** 485125 points

Hi Stuart,

I think in this case we do have to follow a certain sequence when restarting the CSIRX. When the camera is unplugged, it can be somewhere in the middle of a frame, so a short frame could be transmitted to the CSIRX, which is then left waiting for the end-of-frame marker. We should make sure that the CSIRX is in a clean state before restarting.

Regards,

Brijesh

0 Stuart Burtner over 1 year ago in reply to Brijesh Jadav

Intellectual 301 points

Hi Brijesh,

What would a clean slate for the CSIRX be? How can we do that?

I tried to execute a reset in the 1 second gap between the first stream cutting off and before the new stream comes in by doing the following:

I added a CSIRX RESET ioctl which does the same thing as the function:
`static int32_t CsirxDrv_resetStream(const CsirxDrv_InstObj *instObj, uint32_t strmIdx)`
in the file `csirx_event.c`.

I then added the following patch to the capture node:

commit 0e068dfc6d1c933f529564e05264e78d0bbba067
Author: Stuart Burtner <sburtner@d3engineering.com>
Date:   Tue Apr 30 16:36:51 2024 -0400

    Added stop -> soft reset -> start to frame invalid case

diff --git a/kernels_j7/hwa/capture/vx_capture_target.c b/kernels_j7/hwa/capture/vx_capture_target.c
index c2d88271..30a2c260 100755
--- a/kernels_j7/hwa/capture/vx_capture_target.c
+++ b/kernels_j7/hwa/capture/vx_capture_target.c
@@ -61,6 +61,7 @@
  */
 
 #include <stdio.h>
+#include <stdbool.h>
 #include "TI/tivx.h"
 #include "TI/j7.h"
 #include "VX/vx.h"
@@ -78,6 +79,9 @@
 #include <vx_reference.h>
 #include <vx_internal.h>
 
+#define DEBUG_PRINTF(...) VX_PRINT(VX_ZONE_WARNING, __VA_ARGS__)
+//#define DEBUG_PRINTF(...)
+
 #define CAPTURE_FRAME_DROP_LEN                          (4096U*4U)
 
 #define CAPTURE_INST_ID_INVALID                         (0xFFFFU)
@@ -179,6 +183,8 @@ struct tivxCaptureParams_t
     uint8_t enableErrorFrameTimeout;
     /**< Flag indicating if error frame has been sent and can use error timeout
      *   Error timeout is only used if this error frame is sent */
+    bool frame_valid;
+    /**< Flag indicating a timeout has already occurred*/
 };
 
 static tivx_target_kernel vx_capture_target_kernel1 = NULL;
@@ -335,6 +341,11 @@ static vx_status tivxCaptureEnqueueFrameToDriver(
             /* Only enqueue the frame if it is a valid frame */
             if (tivxFlagIsBitSet(prms->img_obj_desc[chId]->flags, TIVX_REF_FLAG_IS_INVALID) == 0U)
             {
+                if(!prms->frame_valid) {
+                    prms->frame_valid = true;
+                    DEBUG_PRINTF("Frame has transitioned to valid - flags=0x%x\n", prms->img_obj_desc[chId]->flags);
+                }
+
                 if ((uint32_t)TIVX_OBJ_DESC_RAW_IMAGE == prms->img_obj_desc[chId]->type)
                 {
                     tivx_obj_desc_raw_image_t *raw_image;
@@ -383,6 +394,15 @@ static vx_status tivxCaptureEnqueueFrameToDriver(
             }
             else
             {
+                if(prms->frame_valid ) {
+                    prms->frame_valid = false;
+                    //IOCTL_CSIRX_RESET
+                    //printf("Resetting camera: Status = %d\n",fvid2_status);
+                    fvid2_status = Fvid2_stop(instParams->drvHandle, NULL);
+                    fvid2_status = Fvid2_control(instParams->drvHandle, IOCTL_CSIRX_RESET, NULL, NULL);
+                    fvid2_status = Fvid2_start(instParams->drvHandle, NULL);
+                    DEBUG_PRINTF("Frame has transitioned to invalid - displaying error frame - flags=0x%x\n", prms->img_obj_desc[chId]->flags);
+                }
                 tivxQueuePut(&prms->errorFrameQ[chId], (uintptr_t)output_desc->obj_desc_id[chId], TIVX_EVENT_TIMEOUT_NO_WAIT);
             }
         }

I tried a variant without the soft reset, and another variant without the stop/start functions. None of these stopped the CSIRX from locking up.
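
One detail of the patch above is that each assignment to `fvid2_status` overwrites the previous one, so a failing stop or reset would go unnoticed. The sketch below shows the same stop -> reset -> start order with every step checked. It is written against a hypothetical `capture_ops` table rather than the real FVID2 calls, so that it can be compiled and run off-target; the fake driver is a test double, not TI code.

```c
#include <stddef.h>

/* Hypothetical driver operations; each returns 0 on success. */
struct capture_ops {
    int (*stop)(void *handle);
    int (*reset)(void *handle);
    int (*start)(void *handle);
};

/* Which step of the restart sequence failed, if any. */
enum restart_step {
    RESTART_OK = 0,
    RESTART_STOP_FAILED,
    RESTART_RESET_FAILED,
    RESTART_START_FAILED
};

/* Run stop, reset, start in order and abort at the first failure.
 * `*status` receives the return value of the last operation attempted. */
static enum restart_step restart_capture(const struct capture_ops *ops,
                                         void *handle, int *status)
{
    *status = ops->stop(handle);
    if (*status != 0) {
        return RESTART_STOP_FAILED;
    }
    *status = ops->reset(handle);
    if (*status != 0) {
        return RESTART_RESET_FAILED;
    }
    *status = ops->start(handle);
    if (*status != 0) {
        return RESTART_START_FAILED;
    }
    return RESTART_OK;
}

/* Fake driver that records the order of calls and can fail the reset. */
struct fake_drv {
    char trace[8];   /* 'S' = stop, 'R' = reset, 'T' = start */
    size_t n;        /* number of calls recorded */
    int fail_reset;  /* non-zero makes fake_reset return -1 */
};

static int fake_stop(void *h)
{
    struct fake_drv *d = h;
    d->trace[d->n++] = 'S';
    return 0;
}

static int fake_reset(void *h)
{
    struct fake_drv *d = h;
    d->trace[d->n++] = 'R';
    return d->fail_reset ? -1 : 0;
}

static int fake_start(void *h)
{
    struct fake_drv *d = h;
    d->trace[d->n++] = 'T';
    return 0;
}

static const struct capture_ops fake_ops = { fake_stop, fake_reset, fake_start };
```

Returning the failed step separately from the raw status makes it possible to log exactly where the sequence broke, which is useful when the lockup only reproduces after hundreds of cycles.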

When we lock up, sometimes the IRQs seem to be stuck, and other times it just sits there not capturing any frames.

+1 Brijesh Jadav over 1 year ago in reply to Stuart Burtner

TI__Guru**** 485125 points

Hi Stuart,

How is the control command IOCTL_CSIRX_RESET implemented in the code? Can you please share this part of the code? Also, why do you need to reset the CSIRX module? Is there any specific condition?

Can you please also refer to the E2E links below, where I shared a method for resetting the CSIRX?

TDA4VM: The ECC and CRC errors cause the csirx driver can not receive the image

TDA4VM: Camera Stream not working for different combination (Channel Mask 0xB)

TDA4VM: CSIRX receive data error

Regards,

Brijesh

0 Stuart Burtner over 1 year ago in reply to Brijesh Jadav

Intellectual 301 points

Brijesh,

My issue is most similar to this one:

https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1092960/tda4vm-the-ecc-and-crc-errors-cause-the-csirx-driver-can-not-receive-the-image

Thank you for your help.

I have applied the patch - but I did find a semaphore leak inside of those CSIRX changes (at least on version 8.2). I fixed it - see below:

commit c4e512a4ae122b8e6e4440d514fa4f3ea9469d6b
Author: Stuart Burtner <sburtner@d3engineering.com>
Date:   Mon May 20 14:56:48 2024 -0400

    Fixed semaphore leak

diff --git a/pdk_jacinto_08_02_00_21/packages/ti/drv/csirx/src/csirx_drvInit.c b/pdk_jacinto_08_02_00_21/packages/ti/drv/csirx/src/csirx_drvInit.c
index fdd9d64ba..6188b3478 100644
--- a/pdk_jacinto_08_02_00_21/packages/ti/drv/csirx/src/csirx_drvInit.c
+++ b/pdk_jacinto_08_02_00_21/packages/ti/drv/csirx/src/csirx_drvInit.c
@@ -179,6 +179,21 @@ int32_t Csirx_init(const Csirx_InitParams *initParams)
                 captObj->numVirtContUsed[instCnt]                  = 0U;
             }
 
+            /* Allocate instance semaphore */
+            SemaphoreP_Params semParams;
+            SemaphoreP_Params_init(&semParams);
+            semParams.mode = SemaphoreP_Mode_BINARY;
+            if (instObj[instCnt].lockSem == NULL)
+                instObj[instCnt].lockSem = SemaphoreP_create(1U, &semParams);
+
+            if (instObj[instCnt].lockSem == NULL)
+            {
+                GT_0trace(
+                    CsirxTrace, GT_ERR,
+                    "Instance semaphore create failed!!\r\n");
+                retVal = FVID2_EALLOC;
+                return retVal;
+            }
             retVal = CsirxDrv_modInstObjInit(&instObj[instCnt], instCnt);
             if (retVal != FVID2_SOK)
             {
@@ -442,7 +457,6 @@ int32_t CsirxDrv_modInstObjInit(CsirxDrv_InstObj *instObj, uint32_t instId)
     CsirxDrv_CommonObj *captObj;
     captObj = &gCsirxCommonObj;
     uint32_t loopCnt, eventId;
-    SemaphoreP_Params semParams;
 
     instObj->dpyCfgDone                 = 0U;
     /* Initialize D-PHY configuration parameters */
@@ -495,20 +509,7 @@ int32_t CsirxDrv_modInstObjInit(CsirxDrv_InstObj *instObj, uint32_t instId)
         }
 #endif
     }
-    if (retVal == FVID2_SOK)
-    {
-        /* Allocate instance semaphore */
-        SemaphoreP_Params_init(&semParams);
-        semParams.mode = SemaphoreP_Mode_BINARY;
-        instObj->lockSem = SemaphoreP_create(1U, &semParams);
-        if (instObj->lockSem == NULL)
-        {
-            GT_0trace(
-                CsirxTrace, GT_ERR,
-                "Instance semaphore create failed!!\r\n");
-            retVal = FVID2_EALLOC;
-        }
-    }
+
     /* Initialize event object */
     for (eventId = 0U ; eventId < CSIRX_EVENT_GROUP_MAX ; eventId++)
     {

0 Brijesh Jadav over 1 year ago in reply to Stuart Burtner

TI__Guru**** 485125 points

Hi Stuart,

OK, I'm not sure exactly why you are seeing a semaphore leak; even now you are creating two semaphores, which were created in the instance-specific API in the original code. As such, the changes look fine to me. So with this reset function, you are no longer seeing this issue, correct?

Regards,

Brijesh

0 Stuart Burtner over 1 year ago in reply to Brijesh Jadav

Intellectual 301 points

The semaphore leak was because we run `CsirxDrv_modInstObjInit` inside of the csirxDrv_init() function, and `CsirxDrv_modInstObjInit` will allocate 2 new semaphores without ever deleting the old ones. I moved the allocation outside of that function and everything is great now.
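
The leak pattern described here, and the create-once guard used in the fix, can be illustrated with a small off-target sketch. Everything in it is a stand-in: `fake_sem`, `fake_sem_create`, `inst_init_leaky` and `inst_init_guarded` are hypothetical names, and `malloc` takes the place of the RTOS semaphore allocation so the number of live handles can be counted.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for an RTOS semaphore; live_sems counts handles not yet deleted. */
typedef struct {
    int count;
} fake_sem;

static unsigned live_sems = 0;

static fake_sem *fake_sem_create(int initial)
{
    fake_sem *s = malloc(sizeof *s);

    if (s != NULL) {
        s->count = initial;
        live_sems++;
    }
    return s;
}

static void fake_sem_delete(fake_sem *s)
{
    if (s != NULL) {
        free(s);
        live_sems--;
    }
}

struct inst_obj {
    fake_sem *lock_sem;
};

/* Leaky variant: every call overwrites the handle without deleting the old one. */
static int inst_init_leaky(struct inst_obj *obj)
{
    obj->lock_sem = fake_sem_create(1);
    return (obj->lock_sem != NULL) ? 0 : -1;
}

/* Guarded variant: creates the semaphore only if none exists yet,
 * so repeated re-initialization reuses the same handle. */
static int inst_init_guarded(struct inst_obj *obj)
{
    if (obj->lock_sem == NULL) {
        obj->lock_sem = fake_sem_create(1);
    }
    return (obj->lock_sem != NULL) ? 0 : -1;
}
```

Calling the leaky variant three times leaves three live handles, while the guarded variant leaves one. The guard only works if the handle is initialized to NULL before the first call, which is an assumption worth verifying for the real instance objects.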

But yes, after that everything is working properly. Thank you for your help!