PROCESSOR-SDK-AM62A: Runtime issue during AI model development on the AM62A development board; seeking assistance.

Part Number: PROCESSOR-SDK-AM62A

Tool/software:

I have an AM62A development board running SDK 9.2. My model has been successfully converted and deployed on the board, and it runs correctly and produces the expected results. However, I encountered an issue: if the program crashes or is forcefully terminated during inference (e.g., with Ctrl+C), so that the resource release code never executes, the model cannot run correctly the next time. The only workaround is to reboot the board. The error message is as follows:

libtidl_onnxrt_EP loaded 0xed32db0 

Final number of subgraphs created are : 1, - Offloaded Nodes - 132, Total Nodes - 135 

176789.252347 s:  VX_ZONE_ERROR:[ownContextSendCmd:875] Command ack message returned failure cmd_status: -1

176789.252405 s:  VX_ZONE_ERROR:[ownNodeKernelInit:590] Target kernel, TIVX_CMD_NODE_CREATE failed for node TIDLNode

176789.252424 s:  VX_ZONE_ERROR:[ownNodeKernelInit:591] Please be sure the target callbacks have been registered for this core

176789.252439 s:  VX_ZONE_ERROR:[ownNodeKernelInit:592] If the target callbacks have been registered, please ensure no errors are occurring within the create callback of this kernel

176789.252457 s:  VX_ZONE_ERROR:[ownGraphNodeKernelInit:608] kernel init for node 0, kernel com.ti.tidl:1:1 ... failed !!!

176789.252487 s:  VX_ZONE_ERROR:[vxVerifyGraph:2159] Node kernel init failed

176789.252500 s:  VX_ZONE_ERROR:[vxVerifyGraph:2213] Graph verify failed

TIDL_RT_OVX: ERROR: Verifying TIDL graph ... Failed !!!

TIDL_RT_OVX: ERROR: Verify OpenVX graph failed

After some investigation and preliminary analysis, it appears that TI's model inference does not invoke the hardware directly; instead, it communicates with other processes that are responsible for resource allocation. When the program crashes, the release signal is never sent to the corresponding process, so the resources remain occupied. The resulting lack of available resources causes errors the next time the model is run.

[C7x_1 ]  28392.238061 s: IPC: Echo status: a530-0[.] r5f0-0[P] c75ss0[s] 

[C7x_1 ] 176789.251932 s:  VX_ZONE_ERROR:[tivxAlgiVisionAllocMem:194] Failed to Allocate memory record 13 @ space = 17 and size = 4964028 !!! 

[C7x_1 ] 176789.251968 s:  VX_ZONE_ERROR:[tivxAlgiVisionCreate:358] tivxAlgiVisionAllocMem Failed

[C7x_1 ] 176789.251999 s:  VX_ZONE_ERROR:[tivxKernelTIDLCreate:926] tivxAlgiVisionCreate returned NULL

[C7x_1 ] 176789.548018 s:  VX_ZONE_ERROR:[tivxAlgiVisionAllocMem:194] Failed to Allocate memory record 13 @ space = 17 and size = 3828764 !!! 

[C7x_1 ] 176789.548052 s:  VX_ZONE_ERROR:[tivxAlgiVisionCreate:358] tivxAlgiVisionAllocMem Failed

[C7x_1 ] 176789.548082 s:  VX_ZONE_ERROR:[tivxKernelTIDLCreate:926] tivxAlgiVisionCreate returned NULL

[C7x_1 ] 176789.860206 s:  VX_ZONE_ERROR:[tivxAlgiVisionAllocMem:194] Failed to Allocate memory record 5 @ space = 17 and size = 2935104 !!! 

[C7x_1 ] 176789.860240 s:  VX_ZONE_ERROR:[tivxAlgiVisionCreate:358] tivxAlgiVisionAllocMem Failed

[C7x_1 ] 176789.860269 s:  VX_ZONE_ERROR:[tivxKernelTIDLCreate:926] tivxAlgiVisionCreate returned NULL

[C7x_1 ] 176790.194297 s:  VX_ZONE_ERROR:[tivxAlgiVisionAllocMem:194] Failed to Allocate memory record 5 @ space = 17 and size = 2826986 !!! 

[C7x_1 ] 176790.194331 s:  VX_ZONE_ERROR:[tivxAlgiVisionCreate:358] tivxAlgiVisionAllocMem Failed

[C7x_1 ] 176790.194361 s:  VX_ZONE_ERROR:[tivxKernelTIDLCreate:926] tivxAlgiVisionCreate returned NULL

Is there an API available to manually release these resources? If not, how can this issue be effectively resolved?

  • Hello,

    However, I encountered an issue: if the program crashes or is forcefully terminated during inference (e.g., with Ctrl+C), so that the resource release code never executes, the model cannot run correctly the next time. The only workaround is to reboot the board.
    After some investigation and preliminary analysis, it appears that TI's model inference does not invoke the hardware directly; instead, it communicates with other processes that are responsible for resource allocation. When the program crashes, the release signal is never sent to the corresponding process, so the resources remain occupied. The resulting lack of available resources causes errors the next time the model is run.

    This is known behavior, and your understanding is correct. When you close the application without telling the remote cores / accelerators to shut down, they are left in an uncertain state. Sometimes this is fine, but it often leaves memory allocations unreleased and causes failures, depending on the size of those allocations and the number of non-graceful shutdowns.

    I do not believe there is currently an API for releasing all the resources that may still be in use on these remote cores; once this state is reached, the only way to recover is to reset the SoC and start from a clean state. The preferred approach is to shut down your application gracefully by catching interrupt signals and releasing resources intentionally, as sketched below.
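    As an illustration only (not an official TI API), here is a minimal Python sketch of that pattern, assuming the onnxruntime build from TI's edgeai-tidl-tools with the TIDLExecutionProvider. The model path, the artifacts folder, and the get_next_frame() / handle_outputs() helpers are placeholders you would replace with your own code. The idea is to turn SIGINT/SIGTERM into a normal exception so the finally block always runs and the session, and with it the TIDL/OpenVX graph on the C7x, is destroyed before the process exits.

    import signal
    import onnxruntime as rt  # onnxruntime build from edgeai-tidl-tools (includes the TIDL EP)

    # Convert SIGINT/SIGTERM into a normal Python exception so the finally
    # block below always runs and the session is torn down before exit.
    def _raise_on_signal(signum, frame):
        raise KeyboardInterrupt(f"received signal {signum}")

    signal.signal(signal.SIGINT, _raise_on_signal)
    signal.signal(signal.SIGTERM, _raise_on_signal)

    # Placeholder path/options -- substitute your own model and compiled
    # artifacts; option names follow TI's edgeai-tidl-tools examples, so
    # check them against your SDK version.
    MODEL_PATH = "model.onnx"
    tidl_options = {"artifacts_folder": "model-artifacts/"}

    sess = None
    try:
        sess = rt.InferenceSession(
            MODEL_PATH,
            providers=["TIDLExecutionProvider", "CPUExecutionProvider"],
            provider_options=[tidl_options, {}],
            sess_options=rt.SessionOptions(),
        )
        input_name = sess.get_inputs()[0].name
        while True:
            frame = get_next_frame()      # placeholder: your input source
            outputs = sess.run(None, {input_name: frame})
            handle_outputs(outputs)       # placeholder: your post-processing
    except KeyboardInterrupt:
        print("Interrupted, shutting down gracefully ...")
    finally:
        # Dropping the last reference destroys the session, giving the TIDL
        # execution provider a chance to release the memory it allocated on
        # the C7x / remote cores before the process exits.
        del sess

    Whether every remote-core allocation is actually freed this way depends on the SDK; the key point is simply to make sure the session teardown path runs on every exit, including Ctrl+C.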

    BR,
    Reese