
TDA4VM: TIDL thread hangs in QNX -- follow-up question

Part Number: TDA4VM
Hello TI experts, We have developed a product called TDA with qnx+rots, SDK version 8.0.4. We have found that when using TIDL for a long time, there is an occasional freezing phenomenon. The tidl thread on A72 (same as QNX) will always be in a waiting state in vxWaitGraph, and at this time, there are no error messages on either the A72 or C71 side.
Regarding this issue, we added some special handling and extra logging within TIOVX and observed several phenomena. Our special handling includes:
  a. For the graph running on C71, we set a 5s timeout in the corresponding vxWaitGraph function and saved extra runtime log information in the tiovx framework. After the timeout occurs, we print the relevant information collected on the A72 side.
  b. Afterward, when the graph timeout occurs, we also start a test graph to confirm whether C71 can work normally.
Here are the details:
a. Adding a 5-second timeout to vxWaitGraph and extra logging within the tiovx framework:
  1. A 5-second timeout is added to vxWaitGraph. This duration far exceeds the execution time of TIDLNode within C71.
  2. After the aforementioned 5-second timeout occurs, we print the information previously recorded within the tiovx framework. This information represents the last execution information received for the specified graph.
      The steps are as follows:
      2.1 After each call to tivxObjDescGet in tivxTargetTaskMain, the graph name is obtained using vxQueryReference(VX_REFERENCE_NAME).
      2.2 In functions such as tivxTargetDequeueObjDesc, tivxTargetCmdDescHandleUserCallback, and ownCheckGraphCompleted, information about the code execution process is saved to temp_info.
      2.3 In the final portion of tivxTargetCmdDescHandleUserCallback, it is determined whether the current graph name matches the graph name of interest. If they match, the information from temp_info is saved to last_info. Subsequently, temp_info is cleared, while last_info is used for final printing.
      2.4 After the 5-second graph timeout occurs, the information from last_info is printed within the thread that calls vxWaitGraph.
After adding the above instrumentation, we encountered a 5-second graph timeout during actual product operation. We examined the printed information for this graph and analyzed it as follows:
Phenomenon 1:
Within the printed last_info, we observed that tivxEventPost(graph->all_graph_completed_event) in ownCheckGraphCompleted was not called. This means the last execution information received was not the "all graph execution completed" notification (all_graph_completed_event), which indicates that processing for this graph was still pending on C71 and had not fully completed.
Phenomenon 2:
The timestamp (cmd_obj_desc->timestamp_h and cmd_obj_desc->timestamp_l) of the last CMD sent from C71 in the printed last_info is at least 3.6 seconds earlier than the expiry of the 5-second timeout. This means there should have been ample time (3.6 seconds) to receive subsequent CMDs from C71 and update last_info again. However, up until the timeout occurred, last_info was never updated.
Based on the above analysis, we would like to ask the following questions:
Question 1:
Is there a bug in C71 that slows down execution on C71, so that not all node processing can complete within the 5-second timeout, and consequently the "all graph execution completed" notification (all_graph_completed_event) is never sent to the A72 side? What could cause this?
Question 2:
Regarding IPCs, is there a method to print only the IPCs belonging to a specific graph (e.g., filtered by graph name), similar to how the execution information is printed above?
Question 3:
If the answer to Question 2 is "No", can you provide a method to statistically summarize the order and number of IPCs sent and received? Analyzing IPC traffic directly seems challenging.
b. After the graph timeout occurs in (a), we also start a test graph to confirm whether C71 can work normally.
We utilized an additional test graph executed from app_c7x_kernel in vision_apps. This test graph is executed following the timeout of the aforementioned graph to verify whether C71 can operate normally after the timeout. During this process, the following observations were made:
Phenomenon 3:
Based on the corresponding logs, ownNodeKernelInit was executed successfully, indicating that the IPC communication for initialization between A72 and C71 was successful.
VX_ZONE_INFO:[ownGraphNodeKernelInit:578] kernel init for node 0, kernel app_c7x_kernel.img_add ...
VX_ZONE_INFO:[ownGraphNodeKernelInit:589] kernel init for node 0, kernel app_c7x_kernel.img_add ... done !!!
Phenomenon 4:
An error occurred during execution on C71, as indicated by the following messages in the C71 logs. Even with these errors, the CPU usage of C71 remained at 99%. At the same time, vxWaitGraph for this test graph on the A72 remained blocked, and the A72 side did not receive any "execution failure" notification.
[C7x_1 ] 10901.091091 s: UDMA : ERROR: UDMA channel open failed!!
[C7x_1 ] 10901.145457 s: UDMA : ERROR: ch_handle NULL Pointer!!!
[C7x_1 ] 10901.145495 s: UDMA : ERROR: SW trigger failed!!
Based on the above analysis, we would like to ask the following question:
Question 4:
As observed, the IPC channel between the A72 and C71 is functioning normally. However, why is the A72 unable to receive the graph execution failure notification? What could be the potential cause?

One last thing to note: this thread is a follow-up question related to a previous thread.

Thank you for your assistance!
  • Hi Frank,

    1. Could you clarify whether the SDK version is 8.0.4 or 8.2? In the related thread, I see that the SDK version is 8.2, but is this current thread for a different issue with C7x since it is a different SDK version?
    2. I think there were a couple of experiments suggested in the previous thread, but were those reviewed and attempted?
3. Could you clarify how long it takes for the issue to appear? Would you say it takes 5 minutes, 1 hour, 8 hours, 24 hours, a couple of days, or a couple of weeks? sir.ext.ti.com/.../EXT_EP-10713 <- In this, we were seeing issues around 8 hours of running. If it takes a couple of hours to manifest, we should try to find a way to reproduce the issue faster, so that we can have a quicker turnaround time for debug/experiments.

Otherwise, for the IPC question, it so far sounds like the C7x is becoming unresponsive, so there is no IPC communication happening.

    1. A72 is running QNX, but is A72 core responsive? For example, can you still access the terminal and execute commands?
    2. For the C7x, can you connect a debugger and see if you can connect to the C7x core to see what state the core is in? For example, whether the program counter is still running, stuck somewhere, or if it has gone into an abort state.

    Regards,

    Takuma

Thank you for your reply. Here are some additional explanations:

    >>1. Could you clarify whether the SDK version is 8.0.4 or 8.2? In the related thread, I see that the SDK version is 8.2, but is this current thread for a different issue with C7x since it is a different SDK version?
    --
    Sorry, it's my mistake, the SDK version is 8.2.

    >>2. I think there were a couple of experiments suggested in the previous thread, but were those reviewed and attempted?
    --
I've followed all the suggestions in the previous thread. I've checked c7x_1.cfg, and the following settings are correct: HwiC7x.dispatcherAutoNestingSupport and Task.idleTaskStackSize. Also, we are using production silicon. Regarding DDR, at this stage of the product it is not possible to change the DDR frequency from 4266 MT/s to 3733 MT/s.

    >>3.Could you clarify how long it takes for the issue to appear? Would you say it takes 5 minutes, 1 hour, 8 hours, 24 hours, a couple of days, or a couple of weeks? sir.ext.ti.com/.../EXT_EP-10713 <- In this, we were seeing issues around 8 hours of running. If it takes a couple of hours to manifest, we should try to find a way to reproduce the issue faster, so that we can have quicker turn around time for debug/experiments.
    --
    In the products where the issue actually occurs, most of them need to run continuously for more than 2 hours, and as described in the previous thread, both A72 and C71 are under high load.


    >>Otherwise, for the IPC question, it so far sounds like C7x is becoming unresponsive, so there are no IPC communication happening.
    >>1. A72 is running QNX, but is A72 core responsive? For example, can you still access the terminal and execute commands?
    --
QNX (on the A72) is working; we can get remote core load information by periodically running vx_app_arm_remote_log.out, so I think the REMOTE_SRV on C71 is working.

    >>2.For the C7x, can you connect a debugger and see if you can connect to the C7x core to see what state the core is in? For example, whether the program counter is still running, stuck somewhere, or if it has gone into an abort state.
    --
This issue can be reproduced in the actual product environment, but JTAG debuggers cannot be used in that environment. In our current on-table experimental environment, it is difficult to fully simulate the numerous parameters and the actual load, so this issue has not yet been reproduced there. We are also adjusting the parameters and load in the hope of reproducing this issue in the on-table environment.


    Thank you!

  • Hi Frank,

My understanding is that this issue will be hard to debug in depth, since we do not have access to a debugger and it takes a very long time to reproduce. Issues that take a very long time to reproduce tend to turn into month-long efforts, since the turnaround time for each experiment and each round of information gathering scales with the time to reproduce the issue.

With that expectation set, I assume you have access to a terminal for UART input/output to the A72 core. We can do some very high-level troubleshooting if that is the case. In the previous related issue, we found that halving the C7x speed made the failure disappear. To see whether the current issue behaves similarly to the previous related issue, could you try halving the C7x speed? As a disclaimer, this is purely for debug purposes and not a workaround, since it does halve C7x performance.

    You should be able to set C7x clock using k3conf where the syntax is "k3conf set clock <device ID> <clock ID> <desired clock frequency>". Device ID and clock ID can be obtained using "k3conf dump clock | grep -i c7" or by referencing the TISCI documentation: https://software-dl.ti.com/tisci/esd/latest/5_soc_doc/j721s2/clocks.html#clocks-for-compute-cluster0-c71ss0-0-device.
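For example, the steps above look like the following on the target shell (the device ID, clock ID, and frequency are placeholders -- substitute the values that the dump command actually reports on your board):

```shell
# 1) Find the C7x device ID, clock ID, and current frequency:
k3conf dump clock | grep -i c7

# 2) Set the C7x clock to half its nominal rate, using the IDs and
#    frequency reported by the dump above (placeholders shown here):
k3conf set clock <device ID> <clock ID> <half of nominal frequency in Hz>
```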

    Regards,

    Takuma

  • Hi Frank,

    Additionally, would you happen to have a number for failure rate? For example, do you see the issue on 4 out of 10 boards, 25 out of 25 boards, 1 out of 1 boards, 1 out of 2 boards, etc?

    For reference, during debug for similar issue we were seeing around 12 out of 15 boards seeing various failures.

    Regards,

    Takuma