SK-AM62A-LP: How to convert a model trained with a custom dataset

Part Number: SK-AM62A-LP

Tool/software:

Hello,

I'm contacting you because I want to convert a model from the model zoo using my own dataset. My dataset is multi-label, i.e. I have several labels (more than 3) per image. I have a checkpoint obtained with a PyTorch model (resnet50). Can I use these weights directly?

I was able to run the example given in the GitHub repository https://github.com/TexasInstruments/edgeai-tensorlab/tree/main?tab=readme-ov-file

But in all the examples there is only 1 label per image, and I don't know how or where I can change that. Do you have any suggestions?

I'm using the r9.1 branch of the repository.

Thanks,

Anaïs

  • Hello Anaïs,

    Good question. I'm interpreting this to mean you have one image, and you want 3 separate labels, such that you are effectively running 3 classifiers on the same input. Is this correct? This implies your model would have multiple outputs.

    The main resnet model from our model zoo would have a 1x1000 (or 1x1001) output for classifying 1000 separate classes. You can certainly modify a model like this to have multiple outputs, and you can start from the same set of pretrained weights (PTH file). This will require you to dig into some of the training code.

     

    Which example, exactly? edgeai-tensorlab organizes several of our repos into one to resolve some challenges with versioning and dependencies. Typically edgeai-modelmaker is used as the top-level tool for training, but it supports a limited set of models (unfortunately not including resnet50).

    Your task looks something like this:

    Summarized, you need to change the model and the dataset handling to allow 3 outputs.
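
    As a rough sketch of the model-side change (this is not our exact training code; the head sizes and checkpoint path below are placeholders), you can keep the pretrained backbone and swap the final fc for multiple classifier heads:

```python
import torch
import torch.nn as nn

class MultiHead(nn.Module):
    """Drop-in replacement for resnet50's final `fc`: one Linear per label."""
    def __init__(self, in_features, classes_per_head):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(in_features, n) for n in classes_per_head)

    def forward(self, x):
        # one logit vector per classification task
        return tuple(head(x) for head in self.heads)

# Hypothetical usage with a torchvision resnet50 and your existing checkpoint:
# model = torchvision.models.resnet50()
# model.load_state_dict(torch.load("checkpoint.pth"))    # load BEFORE replacing fc
# model.fc = MultiHead(model.fc.in_features, (5, 4, 3))  # e.g. 3 tasks
```

    Your loss then becomes a sum over heads (e.g. one CrossEntropyLoss per output), and the dataset's __getitem__ returns a tuple of labels instead of a single one.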

    Edit: I recently posted an FAQ on our model zoo and supporting tools. Your case is an extension of comment #3, in which you are working with a model-zoo model that TI has not actually modified.

    BR,
    Reese

  • Hello Reese,

    Thank you for your response. I'm currently having trouble understanding how to compile my ONNX model to generate the model-artifacts folder. I followed your advice and am now using the GitHub documentation at

    https://github.com/TexasInstruments/edgeai-tidl-tools/blob/master/docs/custom_model_evaluation.md .

    I was able to successfully install the Docker for edgeai-tidl-tools.

    From what I understand, the compilation steps are as follows:

    "[...]

    • Update the inference script to compile the model with TIDL acceleration by passing required compilation options. Refer here for detailed documentation on all the required and optional parameters.
    • Run the python code with compilation options using representative input data samples for model compilation and calibration.
      • Default options expects minimum 20 input data samples (calibration_frames) for calibration. User can set as minimum as 1 also for quick model compilation (This may impact the accuracy of fixed point inference).
    • At the end of model compilation step, model-artifacts for inference will be generated in user specified path.
    • Create OSRT inference session with TIDL acceleration option for running inference with generated model artifacts in the above step.
      • User can either update existing python code written for compilation or copy the compilation code to new file and update with accelerated inference option.
    • Refer the below tables for creating OSRT sessions with Compilation and Accelerated inference options.

    "

    However, I am unsure which script I need to run for the compilation step; the first point is not very clear to me. Is it /home/root/examples/osrt_python/ort/onnxrt_ep.py? If so, I'm encountering an error when trying to run this script: "AttributeError: 'InferenceSession' object has no attribute 'get_TI_benchmark_data'".

    Could you confirm whether I am working with the correct Python script?

    Best regards,

    Anaïs

  • Hello Anaïs,

    However, I am unsure which script I need to run for the compilation step; the first point is not very clear to me. Is it /home/root/examples/osrt_python/ort/onnxrt_ep.py? If so, I'm encountering an error when trying to run this script: "AttributeError: 'InferenceSession' object has no attribute 'get_TI_benchmark_data'".

    Yes, you are on the right track. This is the correct script to use within edgeai-tidl-tools. You can run this script with the -c option to compile, the -d option to run on CPU (so no TIDL in any form), or neither to run with TIDL (including emulation of the C7x if you're on an x86 PC).

    "'InferenceSession' object has no attribute 'get_TI_benchmark_data'"

    You may be encountering some similar issue as another active thread: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1458023/processor-sdk-am62a-edgeai-tidl-tool-compile-error

    get_TI_benchmark_data is a function that is only available in the TIDL build of onnxruntime. Perhaps you have the mainline upstream version of onnxruntime installed as well. Can you show me the output of the following command?

    pip3 freeze | grep -i "onnx"

    For example, my python3.10 virtual environment for TIDL 9.2 looks like the following:

    caffe2onnx==1.0.2
    onnx==1.13.0
    onnx_graphsurgeon @ git+https://github.com/NVIDIA/TensorRT@68b5072fdb9df6b6edab1392b02a705394b2e906#subdirectory=tools/onnx-graphsurgeon
    onnxruntime-tidl @ file:///home/reese/1-edgeai/1-ti-tools/1-tidl-tools/10.0-tidl-tools/onnxruntime_tidl-1.14.0%2B10000000-cp310-cp310-linux_x86_64.whl#sha256=5efb894e39d3ca988e0644a1d0e9e34eab34c1a1f374d0085b9900febbb9724d
    onnxsim==0.4.35
    -e git+https://github.com/TexasInstruments/edgeai-tidl-tools@b7b07738bcd9afc7f74580217e81c307668a84ed#egg=tidl_onnx_model_optimizer&subdirectory=scripts/osrt_model_tools/onnx_tools/tidl-onnx-model-optimizer

    Yours should have only onnxruntime-tidl, and not an ordinary onnxruntime. I recommend a virtual environment to keep our version of onnxruntime separate; otherwise, I think the default import will pick up upstream onnxruntime.
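
    If you want to check programmatically (a quick stdlib sketch, not part of our tools), you can list which onnxruntime distributions are installed in the active environment:

```python
from importlib.metadata import distributions

def onnxruntime_packages():
    """Names of installed distributions that look like an onnxruntime build."""
    names = (dist.metadata["Name"] or "" for dist in distributions())
    return sorted({n for n in names if "onnxruntime" in n.lower()})

print(onnxruntime_packages())  # want: ['onnxruntime-tidl'] and nothing else
```

    If both onnxruntime and onnxruntime-tidl show up, uninstall the upstream one inside the virtual environment before retrying.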

    BR,
    Reese

  • Hello Reese,

    I'm glad to hear that I'm on the right track, thank you!
    I'm using Docker to run the scripts and I successfully installed onnxruntime-tidl:

    However, I'm still encountering an issue when running the onnxrt_ep.py script.

    I have exported the TIDL_TOOLS_PATH, and when I check the directory, I can see the libtidl_onnxrt_EP.so file. Additionally, when I print all the environment variables using os.environ, the correct path for TIDL is shown:


    Do you have any suggestions?
    Thanks,

    Anaïs

  • I found the solution: I needed to export LD_LIBRARY_PATH.
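
    For reference, the exports that fixed it for me look like this (the tidl_tools location is just my setup; adjust the path to yours):

```shell
# point TIDL at its tools, then make the dynamic loader see libtidl_onnxrt_EP.so
export TIDL_TOOLS_PATH="${HOME}/tidl_tools"
export LD_LIBRARY_PATH="${TIDL_TOOLS_PATH}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
```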

  • I'm reaching out because the inference code runs perfectly on my dataset and custom model. However, when I add the compile argument, I encounter a segmentation fault.

    Do you know what might be causing this?

    Thanks,

    Anaïs

  • Hi Anaïs,

    Glad you were able to resolve some of the pathing issues above -- you found the right solution.

    However, when I add the compile argument, I encounter a segmentation fault.

    Hmm, hard to say based on these logs; I cannot tell at what point this failed.

    It looks like when you run without any option and it tries to use TIDL for inference, it is not finding the right files, such that the whole network runs on CPU. Some of the printouts in the lines before your compile command seem odd, like having 32687 subgraphs for your model (ideally 1, but 16 at maximum due to a software limit). Perhaps you tried to compile before and it failed, but still produced a few intermediate files... otherwise I'd have expected your initial inference to fail immediately due to missing artifacts.

    I'll need more logs from the compile command to suggest a solution. Please run your compilation with the following settings and share the log -- ideally, also share the artifacts, especially the SVGs under artifacts/tempDir.

    • export TIDL_RT_DEBUG=1  # in the Linux env
    • "debug_level": 2  # either in 'optional_options' as part of your model_config, or by setting the global variable in common_utils.py

    It might also be informative to run the compile command through gdb, and share the callstack / backtrace ('bt' in gdb shell) to see where we hit this seg fault.

    BR,
    Reese

  • Hello Reese,

    Thank you for your response. Thanks to your advice, I noticed that I had made a mistake in a folder path used to save the artifacts. However, I am still encountering a segmentation fault; here are the logs:

    error_compile_tidl.txt
    root@908b5978fca7:/home/root/examples/osrt_python/ort# gdb --args python3 onnxrt_ep.py --compile
    GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
    Copyright (C) 2022 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
    Type "show copying" and "show warranty" for details.
    This GDB was configured as "x86_64-linux-gnu".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    <https://www.gnu.org/software/gdb/bugs/>.
    Find the GDB manual and other documentation resources online at:
        <http://www.gnu.org/software/gdb/documentation/>.
    
    For help, type "help".
    Type "apropos word" to search for commands related to "word"...
    Reading symbols from python3...
    (No debugging symbols found in python3)
    (gdb) run
    Starting program: /usr/bin/python3 onnxrt_ep.py --compile
    warning: Error disabling address space randomization: Operation not permitted
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    [New Thread 0x75e69e000640 (LWP 184)]
    [New Thread 0x75e69b600640 (LWP 185)]
    [New Thread 0x75e69ac00640 (LWP 186)]
    [New Thread 0x75e698200640 (LWP 187)]
    [New Thread 0x75e693800640 (LWP 188)]
    [New Thread 0x75e692e00640 (LWP 189)]
    [New Thread 0x75e68e400640 (LWP 190)]
    [New Thread 0x75e68ba00640 (LWP 191)]
    [New Thread 0x75e689000640 (LWP 192)]
    [New Thread 0x75e686600640 (LWP 193)]
    [New Thread 0x75e683c00640 (LWP 194)]
    [New Thread 0x75e683200640 (LWP 195)]
    [New Thread 0x75e67e800640 (LWP 196)]
    [New Thread 0x75e67be00640 (LWP 197)]
    [New Thread 0x75e679400640 (LWP 198)]
    [New Thread 0x75e676a00640 (LWP 199)]
    [New Thread 0x75e674000640 (LWP 200)]
    [New Thread 0x75e673600640 (LWP 201)]
    [New Thread 0x75e670c00640 (LWP 202)]
    Available execution providers :  ['TIDLExecutionProvider', 'TIDLCompilationProvider', 'CPUExecutionProvider']
    /home/root/model-artifacts/model
    
    Running shape inference on model model 
    
    [New Thread 0x75e65f800640 (LWP 203)]
    [New Thread 0x75e65ee00640 (LWP 204)]
    [New Thread 0x75e65e400640 (LWP 205)]
    [New Thread 0x75e65da00640 (LWP 206)]
    [New Thread 0x75e65d000640 (LWP 207)]
    [New Thread 0x75e657e00640 (LWP 208)]
    [New Thread 0x75e657400640 (LWP 209)]
    [New Thread 0x75e656a00640 (LWP 210)]
    [New Thread 0x75e656000640 (LWP 211)]
    [New Thread 0x75e655600640 (LWP 212)]
    [New Thread 0x75e654c00640 (LWP 213)]
    [New Thread 0x75e64be00640 (LWP 214)]
    [New Thread 0x75e64b400640 (LWP 215)]
    tidl_tools_path                                 = /home/root/tidl_tools 
    artifacts_folder                                = /home/root/model-artifacts/model 
    tidl_tensor_bits                                = 8 
    debug_level                                     = 2 
    num_tidl_subgraphs                              = 16 
    tidl_denylist                                   = 
    tidl_denylist_layer_name                        = 
    tidl_denylist_layer_type                         = 
    tidl_allowlist_layer_name                        = 
    model_type                                      =  
    tidl_calibration_accuracy_level                 = 7 
    tidl_calibration_options:num_frames_calibration = 2 
    tidl_calibration_options:bias_calibration_iterations = 5 
    mixed_precision_factor = -1.000000 
    model_group_id = 0 
    power_of_2_quantization                         = 2 
    ONNX QDQ Enabled                                = 0 
    enable_high_resolution_optimization             = 0 
    pre_batchnorm_fold                              = 1 
    add_data_convert_ops                          = 3 
    output_feature_16bit_names_list                 =  
    m_params_16bit_names_list                       =  
    reserved_compile_constraints_flag               = 1601 
    ti_internal_reserved_1                          = 
    
    
     ****** WARNING : Network not identified as Object Detection network : (1) Ignore if network is not Object Detection network (2) If network is Object Detection network, please specify "model_type":"OD" as part of OSRT compilation options******
    
    Supported TIDL layer type --- [...]
    
    Preliminary subgraphs created = 1 
    Final number of subgraphs created are : 1, - Offloaded Nodes - 124, Total Nodes - 124 
    [Detaching after vfork from child process 216]
    Running runtimes graphviz - /home/root/tidl_tools/tidl_graphVisualiser_runtimes.out /home/root/model-artifacts/model/allowedNode.txt /home/root/model-artifacts/model/tempDir/graphvizInfo.txt /home/root/model-artifacts/model/tempDir/runtimes_visualization.svg 
    *** In TIDL_createStateImportFunc *** 
    Compute on node : TIDLExecutionProvider_TIDL_0_0
      [...]
    
    Input tensor name -  input 
    Output tensor name - 501 
    Output tensor name - output 
    Output tensor name - 499 
    [New Thread 0x75e63be00640 (LWP 221)]
    [New Thread 0x75e63b400640 (LWP 222)]
    [New Thread 0x75e63aa00640 (LWP 223)]
    [New Thread 0x75e63a000640 (LWP 224)]
    [New Thread 0x75e639600640 (LWP 225)]
    [New Thread 0x75e638c00640 (LWP 226)]
    [New Thread 0x75e628200640 (LWP 227)]
    [New Thread 0x75e627800640 (LWP 228)]
    [New Thread 0x75e626e00640 (LWP 229)]
    [New Thread 0x75e626400640 (LWP 230)]
    [New Thread 0x75e625a00640 (LWP 231)]
    [New Thread 0x75e625000640 (LWP 232)]
    [New Thread 0x75e624600640 (LWP 233)]
    [New Thread 0x75e623c00640 (LWP 234)]
    [New Thread 0x75e623200640 (LWP 235)]
    [New Thread 0x75e622800640 (LWP 236)]
    [New Thread 0x75e621e00640 (LWP 237)]
    [New Thread 0x75e621400640 (LWP 238)]
    [New Thread 0x75e620a00640 (LWP 239)]
     Graph Domain TO version : 11In TIDL_onnxRtImportInit subgraph_name=499output501
    Layer 0, subgraph id 499output501, name=501
    Layer 1, subgraph id 499output501, name=output
    Layer 2, subgraph id 499output501, name=499
    Layer 3, subgraph id 499output501, name=input
    In TIDL_runtimesOptimizeNet: LayerIndex = 128, dataIndex = 125 
    WARNING: [...]
    WARNING: [...]
    WARNING: [...]
    
     ************** Frame index 1 : Running float import ************* 
    In TIDL_runtimesPostProcessNet 
    In TIDL_runtimesPostProcessNet 1
    In TIDL_runtimesPostProcessNet 2
    In TIDL_runtimesPostProcessNet 3
    [Detaching after vfork from child process 240]
    [Detaching after vfork from child process 242]
    ****************************************************
    **                ALL MODEL CHECK PASSED          **
    ****************************************************
    
    In TIDL_runtimesPostProcessNet 4
    ************ in TIDL_subgraphRtCreate ************ 
     TIDL_RT_OVX: Set default TIDLRT params done
    Calling appInit() in TIDL-RT!
    The soft limit is 2048
    The hard limit is 2048
    MEM: Init ... !!!
    MEM: Init ... Done !!!
     0.0s:  VX_ZONE_INIT:Enabled
     0.5s:  VX_ZONE_ERROR:Enabled
     0.7s:  VX_ZONE_WARNING:Enabled
    [New Thread 0x75e616a00640 (LWP 249)]
    [New Thread 0x75e616000640 (LWP 250)]
    [New Thread 0x75e615600640 (LWP 251)]
    [New Thread 0x75e614c00640 (LWP 252)]
    [New Thread 0x75e614200640 (LWP 253)]
    [New Thread 0x75e613800640 (LWP 254)]
    [New Thread 0x75e612e00640 (LWP 255)]
    [New Thread 0x75e612400640 (LWP 256)]
    [New Thread 0x75e611a00640 (LWP 257)]
    [New Thread 0x75e611000640 (LWP 258)]
    [New Thread 0x75e610600640 (LWP 259)]
    [New Thread 0x75e60fc00640 (LWP 260)]
    [New Thread 0x75e60f200640 (LWP 261)]
    [New Thread 0x75e60e800640 (LWP 262)]
    [New Thread 0x75e60de00640 (LWP 263)]
    [New Thread 0x75e60d400640 (LWP 264)]
    [New Thread 0x75e60ca00640 (LWP 265)]
    [New Thread 0x75e60c000640 (LWP 266)]
    [New Thread 0x75e60b600640 (LWP 267)]
    [New Thread 0x75e60ac00640 (LWP 268)]
    [New Thread 0x75e60a200640 (LWP 269)]
    [New Thread 0x75e609800640 (LWP 270)]
    [New Thread 0x75e608e00640 (LWP 271)]
    [New Thread 0x75e608400640 (LWP 272)]
     0.6024s:  VX_ZONE_INIT:[tivxInit:185] Initialization Done !!!
    TIDL_RT_OVX: Init ... 
    TIDL_RT_OVX: Mapping config file ...
    TIDL_RT_OVX: Mapping config file ... Done. 37912 bytes
    TIDL_RT_OVX: Tensors, input = 1, output = 3
    Host kernel - 0x75e649c0f658 
    TIDL_RT_OVX: Mapping network file
    TIDL_RT_OVX: Mapping network file... Done 97299008 bytes
    TIDL_RT_OVX: Init done.
    TIDL_RT_OVX: Creating graph ... 
    TIDL_RT_OVX: input_sizes[0] = 896, dim = 224 padL = 0 padR = 0
    TIDL_RT_OVX: input_sizes[1] = 200704, dim = 224 padT = 0 padB = 0
    TIDL_RT_OVX: input_sizes[2] = 3, dim = 3 
    TIDL_RT_OVX: input_sizes[3] = 1, dim = 1 
    TIDL_RT_OVX: input_buffer = 0x75e683232000 150528
    TIDL_RT_OVX: Creating graph ... Done.
    
    --------------------------------------------
    TIDL Memory size requiement (record wise):
    MemRecNum   , Space               , Attribute   , Alignment   , Size(KBytes), BasePtr     
    0           , DDR Cacheable       , Persistent  ,  128, 15.25   , 0x00000000
    1           , DDR Cacheable       , Persistent  ,  128, 0.64    , 0x00000000
    2           , DDR Cacheable       , Scratch     ,  128, 16.00   , 0x00000000
    3           , DDR Cacheable       , Scratch     ,  128, 4.00    , 0x00000000
    4           , DDR Cacheable       , Scratch     ,  128, 56.00   , 0x00000000
    5           , DDR Cacheable       , Persistent  ,  128, 930.75  , 0x00000000
    6           , DDR Cacheable       , Scratch     ,  128, 34549.12, 0x00000000
    7           , DDR Cacheable       , Scratch     ,  128, 0.12    , 0x00000000
    8           , DDR Cacheable       , Scratch     ,  128, 4873.25 , 0x00000000
    9           , DDR Cacheable       , Scratch     ,  128, 6500.50 , 0x00000000
    10          , DDR Cacheable       , Persistent  ,  128, 929.20  , 0x00000000
    11          , DDR Cacheable       , Scratch     ,  128, 512.25  , 0x00000000
    12          , DDR Cacheable       , Persistent  ,  128, 0.12    , 0x00000000
    13          , DDR Cacheable       , Persistent  ,  128, 95018.69, 0x00000000
    14          , DDR Cacheable       , Persistent  ,  128, 0.08    , 0x00000000
    --------------------------------------------
    Total memory size requirement (space wise):
    Mem Space , Size(KBytes)
    DDR Cacheable, 143405.98
    --------------------------------------------
    NOTE: Memory requirement in host emulation can be different from the same on EVM
          To get the actual TIDL memory requirement make sure to run on EVM with 
          debugTraceLevel = 2
    
    --------------------------------------------
    TIDL init call from ivision API 
    
    --------------------------------------------
    TIDL Memory size requiement (record wise):
    MemRecNum   , Space               , Attribute   , Alignment   , Size(KBytes), BasePtr     
    0           , DDR Cacheable       , Persistent  ,  128, 15.25   , 0x9ec8e000
    1           , DDR Cacheable       , Persistent  ,  128, 0.64    , 0xa1b89000
    2           , DDR Cacheable       , Scratch     ,  128, 16.00   , 0x9e009000
    3           , DDR Cacheable       , Scratch     ,  128, 4.00    , 0xa1b88000
    4           , DDR Cacheable       , Scratch     ,  128, 56.00   , 0x92e34000
    5           , DDR Cacheable       , Persistent  ,  128, 930.75  , 0x7be17000
    6           , DDR Cacheable       , Scratch     ,  128, 34549.12, 0xffd77000
    7           , DDR Cacheable       , Scratch     ,  128, 0.12    , 0xa1940000
    8           , DDR Cacheable       , Scratch     ,  128, 4873.25 , 0x5c33d000
    9           , DDR Cacheable       , Scratch     ,  128, 6500.50 , 0xff71d000
    10          , DDR Cacheable       , Persistent  ,  128, 929.20  , 0x79417000
    11          , DDR Cacheable       , Scratch     ,  128, 512.25  , 0x86609000
    12          , DDR Cacheable       , Persistent  ,  128, 0.12    , 0xa193f000
    13          , DDR Cacheable       , Persistent  ,  128, 95018.69, 0xf9a52000
    14          , DDR Cacheable       , Persistent  ,  128, 0.08    , 0xa1098000
    --------------------------------------------
    Total memory size requirement (space wise):
    Mem Space , Size(KBytes)
    DDR Cacheable, 143405.98
    --------------------------------------------
    NOTE: Memory requirement in host emulation can be different from the same on EVM
          To get the actual TIDL memory requirement make sure to run on EVM with 
          debugTraceLevel = 2
    
    --------------------------------------------
    Alg Init for Layer # -    1
    [...]
    
    PREEMPTION: Adding a new priority object for targetPriority = 0, handle = 0x75e69ec8e000
    PREEMPTION: Now total number of priority objects = 1 at priorityId = 0,    with new memRec of base = 0x75e6a193f000 and size = 128
    PREEMPTION: Requesting context memory addr for handle 0x75e69ec8e000, return Addr = 0x75e64aa4e678
    TIDL_RT_OVX: Verifying TIDL graph ... Done.
    ************ TIDL_subgraphRtCreate done ************ 
     *******   In TIDL_subgraphRtInvoke  ******** 
    TIDL_RT_OVX: Set default TIDLRT tensor done
    TIDL_RT_OVX: Set default TIDLRT tensor done
    TIDL_RT_OVX: Set default TIDLRT tensor done
    TIDL_RT_OVX: Set default TIDLRT tensor done
    TIDL_RT_OVX: Running Graph ... 
    TIDL_RT_OVX: input_sizes[0] = 896, dim = 224 padL = 0 padR = 0
    TIDL_RT_OVX: input_sizes[1] = 200704, dim = 224 padT = 0 padB = 0
    TIDL_RT_OVX: input_sizes[2] = 3, dim = 3 
    TIDL_RT_OVX: input_sizes[3] = 1, dim = 1 
    TIDL_RT_OVX : Memcpy Input Buffer 
    TIDL_RT_OVX: input_buffer = 0x75e683232000 150528
    TIDL_RT_OVX: memset_out_tensor_tidlrt_tiovx  ... Done.
    TIDL_activate is called with handle : 9ec8e000 
    Core 0 Alg Process for Layer # [...]
    
    Thread 53 "python3" received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x75e616a00640 (LWP 249)]
    0x000075e64889109d in void TIDL_refInnerProductParamBitDepth<float, float, float>(TIDL_Obj*, int, void*, void*, void*, float*, float*, float*, int, tidlInnerProductBuffParams_t*) [clone .isra.0] () from /home/root/tidl_tools/libvx_tidl_rt.so
    
    (gdb) backtrace
    #0  0x000075e64889109d in void TIDL_refInnerProductParamBitDepth<float, float, float>(TIDL_Obj*, int, void*, void*, void*, float*, float*, float*, int, tidlInnerProductBuffParams_t*) [clone .isra.0] ()
       from /home/root/tidl_tools/libvx_tidl_rt.so
    #1  0x000075e648896e75 in TIDL_innerProductRefProcess(TIDL_Obj*, sTIDL_AlgLayer_t*, sTIDL_Layer_t*, sTIDL_InnerProductParams_t*, tidlInnerProductBuffParams_t*, void*, void*, void*) ()
       from /home/root/tidl_tools/libvx_tidl_rt.so
    #2  0x000075e648897f2c in TIDL_innerProductProcessNew(TIDL_NetworkCommonParams*, sTIDL_AlgLayer_t*, sTIDL_Layer_t*, void**, void**, int) () from /home/root/tidl_tools/libvx_tidl_rt.so
    #3  0x000075e6488dbae2 in WorkloadRefExec_Process(TIDL_Obj*, TIDL_NetworkCommonParams*, sWorkloadUnit_t*, sTIDL_AlgLayer_t*, sTIDL_Layer_t*, void**, void**, int, int) ()
       from /home/root/tidl_tools/libvx_tidl_rt.so
    #4  0x000075e648838904 in TIDL_process(IVISION_Obj*, IVISION_BufDescList*, IVISION_BufDescList*, IVISION_InArgs*, IVISION_OutArgs*) () from /home/root/tidl_tools/libvx_tidl_rt.so
    #5  0x000075e648835f7a in tivxKernelTIDLProcess () from /home/root/tidl_tools/libvx_tidl_rt.so
    #6  0x000075e648824411 in ownTargetKernelExecute () from /home/root/tidl_tools/libvx_tidl_rt.so
    #7  0x000075e648822bb7 in ownTargetNodeDescNodeExecuteTargetKernel () from /home/root/tidl_tools/libvx_tidl_rt.so
    #8  0x000075e6488235d9 in ownTargetTaskMain () from /home/root/tidl_tools/libvx_tidl_rt.so
    #9  0x000075e64883137c in tivxTaskMain () from /home/root/tidl_tools/libvx_tidl_rt.so
    #10 0x000075e6a1efcac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
    #11 0x000075e6a1f8da04 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
    
    

    It's coming from libvx_tidl_rt.so (the file exists in my tidl_tools path). But when I look in the output path (artifacts folder), several files/folders have been created:

    Does this mean that the compilation has still been completed and that I can use this model on the AM62A board?

    Thanks,

    Anaïs

  • In my previous message, I forgot to mention that I have hidden the architecture details in the log file. I replaced the architecture details with [...].

    Best regards,

    Anaïs

  • Hello,

    Thanks for the information and screenshots, very helpful. The backtrace especially tells me this is happening fairly deep within the TIDL import tool.

    Does this mean that the compilation has still been completed and that I can use this model on the AM62A board?

    No, it looks like compilation did not complete. Those tempDir files are a working directory; once compilation completes, a few of the files are copied back up into the artifacts/ directory. Sometimes these intermediate binaries are sufficient, but I doubt that is the case here.

    From the logs, TIDL hit an error during compilation while trying to run the floating-point implementation of an InnerProduct (or similar matrix-multiplication) layer. This is part of the calibration and quantization process. It's hard to say immediately why this failed.

    Core 0 Alg Process for Layer # [...]

    Is this the last layer that ran, layer 0? It may help to open the ...tidl_net.bin.svg in a browser and send a screenshot of the last layer named in the "Alg Process for Layer #" print. You can hover your mouse over the node -- this will provide more info on that layer only. I'm looking for something like the image at this link: https://github.com/TexasInstruments/edgeai-tidl-tools/blob/master/docs/tidl_osr_debug.md#example-visualization-1. Feel free to edit out anything that would expose more details than you are comfortable sharing.

    I will take the opportunity to note that the 10.0 SDK and TIDL tools made many improvements to robustness and logging during compilation and inference. If you can, I would suggest upgrading. It is quite likely the issue you are seeing has been resolved in a more recent release.

    BR,
    Reese

  • Hello Reese,

    Thanks for your answer. I’ve hidden the architecture, but the last layer to run is: Core 0 Alg Process for Layer # - 77.
    It’s a dense layer, and the information in the file ...tidl_net.bin.svg for this layer is:

    last_layer_executed.txt
    Layer 77: TIDL_InnerProductLayer "output_netFormat"
    weightdElementSizeInBits=32
    multiCoreMode=TIDL_NOT_MULTI_CORE
    strideOffsetMethod=TIDL_StrideOffsetTopLeft
    activationType=0 numInRows=1 numInCols=2048 numOutCols=6 transA=0 transB=1
    weightsQ=0 weightScale=1.000000 zeroWeightValue=0
    biasScale=1.000000 biasQ=0 inDataQ=0 interDataQ=0
    biasB=0
    weights:0x5ca6a40 bias:0x5cb2a40
    actParams:
       actType=TIDL_NoAct
       slopeScale=1.000000 clipMin/Max=(0.000000,0.000000)
    Inputs:
       [75][2]
    Outputs:
       [77] numDim=0 dims=[1,1,1,2048,1,6] elementType=TIDL_SinglePrecFloat padH/W=[0,0] batchPadH/W=[0,0] numBatchH/W=[1,1]
    pitch=[12288,12288,12288,6,6]
       dataQ=0 roundBits=0
       min/maxValue=(0,0) min/maxTensorValue=(0.000000,0.000000)
       tensorScale=1.000000
    


    However, I noticed that the shape has too many dimensions. I'm not sure why, but starting from the input the dimensions are [1,1,1,3,224,224], and the first two dimensions are not usually present. Do you think the error could be related to this?

    Regarding SDK 10.0, we can't use it. We’ve already tried it, but there are some incompatibilities with Python or certain libraries.


    Best regards,

    Anaïs

  • Hello Anaïs,

    Reese is out this week and won't be able to respond until next week.

    Regards,

    Jianzhong

  • Hi Anaïs,

    Thanks for your patience while I was out.

    Is this the last layer of your network? I notice this layer is called 'output_netFormat', and '_netFormat' is a suffix I know TIDL will sometimes add to a name.

    • Similarly, is this layer an output of the model? Is that output also used as an input to another layer?

    I see that the weight datatype is weightdElementSizeInBits=32, and the output datatype is similarly TIDL_SinglePrecFloat, so we're in floating point at this stage. Ordinarily, a compiled model would have those weights as 8 or 16 bits, with an output type of Char or Short depending on the quantization mode. This supports my theory that the model-import process is failing during the initial phase of calibration, in which it runs in 32-bit mode.

    The 6 dimensions can sometimes cause an issue when there are layers that need to run on the Arm core (unaccelerated), followed by more layers on the C7x with TIDL for acceleration. We use 6-D representations (and several other variables, like pitch) to program the accelerator's data-movement mechanisms, which inherently support 6D. I don't think this is the issue, though.

    Ignoring those first two [1,1,...], are the other dimensions consistent with your model? I have seen error modes in which intermediate tensor shapes are wrong (visible in the SVG), which causes issues later on. Looks like the output should be [2048,1,6] in the original model.

    I would also suggest deny-listing this layer to verify it is the offender. See the doc here:

    If you are comfortable, you can also share a version of your model; random weights are okay. You can share it with me via direct message to protect IP. Alternatively, share a screenshot of the configuration + tensor input/output shapes for this failing layer 77.

    BR,

    Reese

  • Hello Reese,

    Thanks for your answer. I'm okay with sending you an ONNX model in a direct message; I think that will make it easier for us to move forward.
    Thanks,

    Anaïs

  • Understood, I've sent you a message to kick off the process. Please share your model and the model_config python code associated with it. Thanks!