SK-TDA4VM: TIDL freeze when running Gemm op in ONNX Runtime

Olle Friman

Part Number: SK-TDA4VM

Dear TI experts,

My team and I are experiencing an issue when executing an ONNX model that contains a Gemm operator, using TIDL 08.04.00.06 on our TI TDA4VM with ONNX Runtime. The Python process freezes during execution. The problem seems to persist after the freeze: other models that previously ran successfully also freeze until the device is rebooted.

We have also tested many other parameter combinations for this operator, and with those the freeze does not occur.

This is the command we used to run the model with TIDL and ONNX Runtime, followed by its console output:

root@tda4vm-sk:~# python3 run_custom_model.py --model-path fail.onnx --model-artifacts fail/ --debug-level 3
libtidl_onnxrt_EP loaded 0x296bd560 
artifacts_folder                                = fail/ 
debug_level                                     = 3 
target_priority                                 = 0 
max_pre_empt_delay                              = 340282346638528859811704183484516925440.000000 
Final number of subgraphs created are : 1, - Offloaded Nodes - 1, Total Nodes - 1 
In TIDL_createStateInfer 
Compute on node : TIDLExecutionProvider_TIDL_0_0
************ in TIDL_subgraphRtCreate ************ 
 APP: Init ... !!!
MEM: Init ... !!!
MEM: Initialized DMA HEAP (fd=4) !!!
MEM: Init ... Done !!!
IPC: Init ... !!!
IPC: Init ... Done !!!
REMOTE_SERVICE: Init ... !!!
REMOTE_SERVICE: Init ... Done !!!
3002408.839539 s: GTC Frequency = 200 MHz
APP: Init ... Done !!!
3002408.839772 s:  VX_ZONE_INIT:Enabled
3002408.839845 s:  VX_ZONE_ERROR:Enabled
3002408.839911 s:  VX_ZONE_WARNING:Enabled
3002408.840618 s:  VX_ZONE_INIT:[tivxInitLocal:130] Initialization Done !!!
3002408.841894 s:  VX_ZONE_INIT:[tivxHostInitLocal:86] Initialization Done for HOST !!!
************ TIDL_subgraphRtCreate done ************ 
 *******   In TIDL_subgraphRtInvoke  ********

And this is the source code for the Python script used in the console output above:

"""run_custom_model.py"""

import argparse
import os

import numpy as np
import onnxruntime as rt


def run_model(*, model_path, model_artifacts, debug_level):
    delegate_options = {
        "tidl_tools_path": os.environ["TIDL_TOOLS_PATH"],
        "artifacts_folder": model_artifacts,
        "platform": "J7",
        "debug_level": debug_level,
    }

    so = rt.SessionOptions()
    sess = rt.InferenceSession(
        model_path,
        providers=["TIDLExecutionProvider"],
        provider_options=[delegate_options],
        sess_options=so,
    )

    input_details = sess.get_inputs()
    input_name = input_details[0].name
    input_data = np.zeros(input_details[0].shape, dtype=np.float32)

    sess.run(None, {input_name: input_data})


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", type=str, required=True)
    parser.add_argument("--model-artifacts", type=str, required=True)
    parser.add_argument("--debug-level", type=int, default=0)
    args = parser.parse_args()

    run_model(**vars(args))
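
To keep a frozen run from blocking a shell or a test harness, the script can be launched as a child process with a time limit. The following is only a minimal sketch and not part of the script above: `run_with_timeout` is a hypothetical helper name, and the time limit is an arbitrary choice. It only detects the hang. It does not address the state that persists until reboot, and if the child is blocked in a way the kernel cannot interrupt, the kill issued on timeout may itself not return.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd in a child process and return (timed_out, return_code).

    Hypothetical helper for illustration; not part of TIDL or ONNX Runtime.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so reaching this
        # branch means the command did not finish within timeout_s seconds.
        return True, None
    return False, completed.returncode
```

For example, `run_with_timeout([sys.executable, "run_custom_model.py", "--model-path", "fail.onnx", "--model-artifacts", "fail/", "--debug-level", "3"], timeout_s=60)` would report `(True, None)` if the run hangs for more than 60 seconds, and `(False, <exit code>)` otherwise.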

Here is a link to Google Drive for the model that causes the freeze:

https://drive.google.com/file/d/1vl23tDDLhXz71USfhQCCjHd1rnD6DP3D/view?usp=sharing

And here is a link to another model with different parameters for the Gemm operator that does succeed and does not freeze:

https://drive.google.com/file/d/1vtsfQq6LEo1DI88DaHe_WeltBu-75i0U/view?usp=sharing

If you could help us find a way to fix this freeze and run this model, we would be most thankful.

over 2 years ago
