Tool/software:
Non-zero status code returned while running TIDL_5 node. Name:'TIDLExecutionProvider_TIDL_5_5' Status Message: TIDL Compute Import Failed. But if I remove this Split node, compilation passes.
split_node = onnx.helper.make_node(
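(The snippet above is truncated in the original post. For reference, a minimal sketch of how such a Split node can be built with onnx.helper; the tensor names, axis, and node name below are placeholders, not values from the actual model.)

import onnx

# Illustrative only: split one tensor into two outputs along a placeholder axis.
split_node = onnx.helper.make_node(
    "Split",
    inputs=["feature_tensor"],               # placeholder input tensor name
    outputs=["split_out_0", "split_out_1"],  # placeholder output tensor names
    axis=1,                                  # placeholder axis
    name="split_example",
)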
Output 1 can be post-processed on the CPU.
But the other output is more complex.
Slice with 2D :
[Quantization & Calibration for subgraph_1 Started]
2024-12-27 16:48:11.407701156 [E:onnxruntime:, sequential_executor.cc:494 ExecuteKernel] Non-zero status code returned while running Squeeze node. Name:'model/tf.__operators__.getitem_16/strided_slice__354' Status Message: /root/onnxruntime/onnxruntime/core/providers/cpu/tensor/squeeze.h:52 static onnxruntime::TensorShapeVector onnxruntime::SqueezeBase::ComputeOutputShape(const onnxruntime::TensorShape&, const TensorShapeVector&) input_shape[i] == 1 was false. Dimension of input 1 must be 1 instead of 0. shape={1,0,1,1,1,3}
Slice with 3D :
==================== [Optimization for subgraph_6 Started] ====================
[TIDL Import] [PARSER] UNSUPPORTED: All the input tensor dimensions has to be greater then zero. For tensor model/tf.__operators__.getitem_23/strided_slice3, id 0 - Dim 2 is 0 -- [tidl_import_common_model_check.cpp, 2290]
[TIDL Import] ERROR: Invalid input tensor dimension, aborting -- [tidl_import_core.cpp, 2556]
[TIDL Import] ERROR: Network Optimization failed - Failed in function: TIDL_runtimesOptimizeNet -- [tidl_runtimes_import_common.cpp, 1268]
[TIDL Import] [PARSER] ERROR: - Failed in function: TIDL_computeImportFunc -- [tidl_onnxRtImport_EP.cpp, 1713]
2024-12-27 16:40:48.794281915 [E:onnxruntime:, sequential_executor.cc:494 ExecuteKernel] Non-zero status code returned while running TIDL_6 node. Name:'TIDLExecutionProvider_TIDL_6_6' Status Message: TIDL Compute Import Failed.
Replacing the Slice with a Split also fails, for both output 1 and output 2.
Also, when I compile an ONNX model with 4 outputs, artifacts/param.yaml only contains one output node. When I concatenate all results into a single tensor of shape 1*71*3 or 1*1*71*3, the log shows this error:
RUNTIME_EXCEPTION : Non-zero status code returned while running TIDL_3 node. Name:'TIDLExecutionProvider_TIDL_3_3' Status Message: /root/onnxruntime/onnxruntime/core/providers/tidl/tidl_execution_provider.cc:430 void onnxruntime::populateOnnxRtInputParams(Ort::CustomOpApi, OrtKernelContext*, onnxruntime::tidl_ops*, OnnxTIDLSubGraphParams*) (TIDL_MAX_DIM-inputNumDims) >= 0 was false. TIDL_EP: Only tensors up to 6D
Hello,
You have several topics here, and I'll try to address them all. In your follow-up reply, please note which topics you have already found a solution for; it seems you have resolved some of them, but it is not clear to me which.
My understanding is that overall, you are facing challenges with Slice and Split operators in TIDL.
onnxruntime::SqueezeBase::ComputeOutputShape(const onnxruntime::TensorShape&, const TensorShapeVector&) input_shape[i] == 1 was false. Dimension of input 1 must be 1 instead of 0. shape={1,0,1,1,1,3}
ONNX will sometimes complain about TIDL's 6D tensor representation and this is often fixed by defining TIDL_RT_ONNX_VARDIM=1 in the calling environment.
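For example, the flag can be set from the Python script that creates the session, before the TIDL execution provider is loaded (setting it with export in the calling shell works equally well):

import os

# Should be present in the environment before the TIDL EP library is initialized.
os.environ["TIDL_RT_ONNX_VARDIM"] = "1"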
Replacing the Slice with a Split also fails, for both output 1 and output 2.
Allow me to be honest here -- I think this portion would be better off running with ONNXRT's Arm-based execution if TIDL is having a hard time with it. There is very little computation here, so acceleration is not worthwhile; the overhead is probably high enough that C7x acceleration does not give much benefit.
Otherwise, it seems like TIDL is not allowing you to use slice or split nodes as you like. Assuming those adhere to the supported_operators page, please share model+compilation logs for the Slice or Split configurations you are trying to use. The single line error message is not sufficient to suggest a fix.
It would help to see the model file and the SVGs for the network.
The param.yaml files are mainly useful for well-defined model types, like object detection (1 output for boxes, 1 for classes), segmentation (image mask), etc. For a custom model type like yours, I do not expect param.yaml to know how to encode the postprocessing information. It likely generated the outputs based on the 'model_type' within your model_configs.py for this model.
For the two images below (outputs), is one from CPU and one from TIDL? Please provide context or labels for the images.
Are you getting 'nan' for the TIDL version of the network?
RUNTIME_EXCEPTION : Non-zero status code returned while running TIDL_3 node. Name:'TIDLExecutionProvider_TIDL_3_3' Status Message: /root/onnxruntime/onnxruntime/core/providers/tidl/tidl_execution_provider.cc:430 void onnxruntime::populateOnnxRtInputParams(Ort::CustomOpApi, OrtKernelContext*, onnxruntime::tidl_ops*, OnnxTIDLSubGraphParams*) (TIDL_MAX_DIM-inputNumDims) >= 0 was false. TIDL_EP: Only tensors up to 6D
I will need to see model file and artifacts to give comment here. Looks like it found >6 dimensions in one of your tensors.
For models whose output does not match one of the tasks we outright enable (includes classification, object detection, segmentation, keypoint detection), you will need to include your own postprocessing code.
Overall suggestions:
BR,
Reese
If I set TIDL_RT_ONNX_VARDIM=1, compilation cannot even get past the front part of the model. The input shape is 1*1024*1*1, so why is it detected as 1*1*1*1? The ONNX model runs correctly on the CPU. Also, if I cut the model at these ops, whether I use Gemm or MatMul+Add, it all ends in: GEMM: Dimension mismatch, W: {1024,62} K: 1 N:62
ttt.zip: https://drive.google.com/file/d/1taPxxn-Von3IIDWJZkCJEpDg_WqQ1dlG/view?usp=sharing
Here is the mini model. When TIDL_RT_ONNX_VARDIM=0, the last Add node causes an error and must be replaced with Neg+Sub: onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running TIDL_1 node. Name:'TIDLExecutionProvider_TIDL_1_1' Status Message: TIDL Compute Import Failed.
If I set TIDL_RT_ONNX_VARDIM=1, the error becomes the one quoted in my previous reply.
I have tried more than a hundred combinations in TIDL, but all of them failed.
All of the ONNX models above produce correct results when run on the CPU.
I do not expect the param.yaml to know how to encode the postprocessing information for these. It likely generated the outputs based on the 'model_type' within your model_configs.py for this model.
For the model with 2 outputs that passes compilation, param.yaml only has 1 output, and it is incorrect. Evidently the output whose results vary is running on the CPU, while the output that always gives the same results (and even contains "nan") is from TIDL.
It likely generated the outputs based on the 'model_type' within your model_configs.py
The result is the same whether model_type is None or classification.
We will do the postprocessing ourselves, so is there any other way to get correct artifacts?
Also, the output model of the optimizer has the same problem:
from tidl_onnx_model_optimizer import optimize
optimize("/ti/modified_modified_22.onnx", "./test.onnx",simplify_mode=all)
Hello,
This is a frustrating situation, I understand. That is many configurations to try without clear solution. I am trying compilation on my side for the ttt.zip model you provided. Thank you.
We are releasing the 10.1 SDK very shortly (for some SoCs it is already released), and edgeai-tidl-tools can be updated to use these tools with the 10_01_00_02 tag (re-run the setup script).
I can replicate your error on 10.0 SDK, but I think we have patched part of this in 10.1
The input tensor cannot be reshaped to the requested shape. Input shape:{1,1,1,1}, requested shape:{-1,1024}
My further comments are based on this 10_01_00_02 tag for edgeai-tidl-tools.
I've tried compilation on my side and found a working configuration for the ttt.onnx model you sent, and I can clearly see that there is a bug in the model import tool (part of the optimization process, which is NOT the same as tidl_onnx_model_optimizer). This seems to be related to the GEMM node.
Would you try compiling with these additional options passed to the TIDLCompilationProvider?:
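(A rough sketch of how such options are typically passed to the compilation provider; the paths and the commented deny_list value below are placeholders.)

import onnxruntime as ort

compile_options = {
    "tidl_tools_path": "/path/to/tidl_tools",     # placeholder path
    "artifacts_folder": "./model-artifacts/ttt",  # placeholder path
    "max_num_subgraphs": 1,
    # "deny_list": "Slice",                       # optionally keep specific op types on Arm/CPU
}

so = ort.SessionOptions()
sess = ort.InferenceSession(
    "ttt.onnx",                                   # placeholder model path
    sess_options=so,
    providers=["TIDLCompilationProvider", "CPUExecutionProvider"],
    provider_options=[compile_options, {}],
)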
My intent here was to cut the last set of layers in the network. This max_num_subgraphs=1 config makes one TIDL subgraph ending with the Squeeze->Unsqueeze->Reshape, and the final layers with GEMM, Add, Slice run on Arm/CPU. This way, the model compiles and runs on PC with C7x emulation.
I noted that allowing a second subgraph to form in that bottom portion generated additional errors. That subgraph included only a GEMM node, and TIDL seems to complain that it does not have a bias tensor. I believe the MatMul -> Add needs to be replaced with GEMM. From your images, I think you have done this in some of your models.
The Slice layers are trying to slice along 2 axes at the same time. This is not supported.
Note: To run on target, we need the 10.1 SDK for AM62A to release. This should happen within the next week or so, once the New Year's holiday has passed.
Summary:
Hi,
I just ran git clone github.com/.../edgeai-tidl-tools.git -b 10_01_00_02 and source ./setup.sh. It ends with the same errors.
Would you try compiling with these additional options passed to the TIDLCompilationProvider?:
'max_num_subgraphs': 1 -- My intent here was to cut the last set of layers in the network. This max_num_subgraphs=1 config makes one TIDL subgraph ending with the Squeeze->Unsqueeze->Reshape, and the final layers with GEMM, Add, Slice run on Arm/CPU. This way, the model compiles and runs on PC with C7x emulation.
The model downloaded from Google Drive is only the part of the model that causes these errors, so this approach does not seem to be a good choice.
A 2D Slice can be changed to 1D Slices via the method below; this is not difficult. But the main problems are: 1. on 10.01 the Reshape op doesn't work; 2. on 10.00.08 some other ops don't work.
from tidl_onnx_model_optimizer import optimize
optimize("/ti/1111-op12.onnx", "./test1111-op12.onnx",simplify_mode=all)
Hello,
I see you have been busy and tried many options. Let me respond to these. I understand now why my suggestion to deny-list and reduce the number of subgraphs will not work -- you have passed me only a portion of the model.
I see one of your notes was to replace 2D slice with 1D slice.
A 2D Slice can be changed to 1D Slices via the method below; this is not difficult.
I shall assume the Slice layers are now operating fine, but please inform me if I misunderstand. I see from your screenshot that these have been unpacked from a 3-axis Slice into a 1-axis Slice replicated 3 times for the different axes.
1. on 10.01 the Reshape op doesn't work; 2. on 10.00.08 some other ops don't work.
1) Reshape -- I see the error from ONNX saying the input shape is (1,1,1,1) and the requested output is (-1, 1024). This is for the layer shown in part of your screenshot.
2) Other op in 10.0.0.8
Is this for the GEMM / MatMul->Add? I have noted this as failing during the optimization step of TIDL import (this happens internally; it is not the same as the optimizer Python scripts).
This topic is known by development team and will be addressed in the near future. I am tracking the progress on this with our team. I appreciate your patience here.
Please inform if there is another layer not functioning here. I will reproduce the issue if you pass me such a model, and provide to our dev team to fix.
BR,
Reese
If you have tried to use onnx-modifier to modify a model, you will run into a lot of these problems. But all of these ONNX models can be inferred correctly.
In TIDL 10.01:
If I change Squeeze -> Unsqueeze -> Unsqueeze -> Reshape -> MatMul to Squeeze -> MatMul, it sometimes comes back with an error like [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running TIDL_1 node. Name:'TIDLExecutionProvider_TIDL_1_1' Status Message: TIDL Compute Import Failed.
But in TIDL 10.00.08 this happens 100% of the time.
TIDL_RT_ONNX_VARDIM=0: RUNTIME_EXCEPTION : Non-zero status code returned while running Gemm node. Name:'gemm' Status Message: gemm_helper.h:14 onnxruntime::GemmHelper::GemmHelper(const onnxruntime::TensorShape&, bool, const onnxruntime::TensorShape&, bool, const onnxruntime::TensorShape&) left.NumDimensions() == 2 || left.NumDimensions() == 1 was false.
TIDL_RT_ONNX_VARDIM=1:INVALID_ARGUMENT : Non-zero status code returned while running Gemm node. Name:'gemm' Status Message: GEMM: Dimension mismatch, W: {1024,62} K: 1 N:62
For this Squeeze -> Unsqueeze -> Unsqueeze -> Reshape -> MatMul, why can't Squeeze -> MatMul be used?
Replacing MatMul -> Add with GEMM:
Both GEMM and MatMul+Add fail, even when using onnxslim to simplify.
7776.tempDir.zip -- Here is this model's tempDir.
Thank you for providing the TempDir. I see that there are multiple subgraphs that have not finished parsing and optimizing. As a result, 3 of the 4 subgraphs lack SVG graphs. Please share the onnx model as well if you are comfortable -- random weights is ok.
If you have tried to use onnx-modifier to modify a model, you will run into a lot of these problems.
You mean this graphical tool for changing models, yes? https://github.com/ZhangGe6/onnx-modifier
I too have had challenges with onnx-modifier tool, especially for layers with many default attributes or more complex initializers. onnx-modifier is a good tool, but modifying directly with onnx or onnx-graphsurgeon is more stable.
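As a rough sketch of the graphsurgeon route (the file and node names below are placeholders, not from your model), locating a node by name and editing it looks like this:

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))           # placeholder file name

# Example edit: change the axis attribute of a Concat node found by name.
for node in graph.nodes:
    if node.op == "Concat" and node.name == "concat_0":   # placeholder node name
        node.attrs["axis"] = 1

# Remove dangling tensors/nodes and re-sort before saving.
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_modified.onnx")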
TIDL_RT_ONNX_VARDIM=1:INVALID_ARGUMENT : Non-zero status code returned while running Gemm node. Name:'gemm' Status Message: GEMM: Dimension mismatch, W: {1024,62} K: 1 N:62
For the GEMM nodes, can you try representing the dimensions of the 'C' tensor as {1,62} instead of {62}? From what I see, TIDL is rejecting your GEMM (or the equivalent MatMul->Add) from running with acceleration, but then ONNX also hits an error when the tensor returned by TIDL reaches the CPU ONNX runtime and is checked against the constants C and W. If this is the problem, I will see that it gets fixed for a future release.
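If it helps, one way to make the {62} bias two-dimensional is to rewrite the initializer directly with the onnx Python API (the file and initializer names below are placeholders; they depend on how the Gemm's 'C' input is named in your model):

import onnx
from onnx import numpy_helper

model = onnx.load("model.onnx")                                   # placeholder file name
for init in model.graph.initializer:
    if init.name == "gemm_bias":                                  # placeholder initializer name
        c = numpy_helper.to_array(init).reshape(1, -1)            # {62} -> {1, 62}
        init.CopyFrom(numpy_helper.from_array(c, name=init.name))
onnx.save(model, "model_c_2d.onnx")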
And I think using TIDL_RT_ONNX_VARDIM=0 will generally cause issues with layers running on Arm. Let's use =1 here.
HOWEVER, the first subgraph SVG shows these shapes as the output. This suggests that TIDL is passing back a tensor of shape {1024,1,1} or even {1,1}, which ONNX probably does not like. This is probably why ONNX reads 'K' as 1 instead of 1024. I think we are best suited trying to resolve this GEMM/MatMul issue, since this is ultimately what we need.
BR,
Reese
Hello,
We decided to split face keypoints and head pose into separate models. Could your team help compile this ONNX model, which is based on mmpose RTMPose?
Results here: the CPU gets results similar to what we expect, but the NPU does not. Is there anything wrong with my model_configs?
Hello,
A different model architecture now, I understand. We will investigate this. Your model config looks reasonable.
I see in your artifacts that there are multiple subgraphs now, and the artifacts look complete. However, you have noted accuracy is poor from your spreadsheet.
For accuracy, can you try compiling for tensor_bits = 16 and 32 and check the output in each case? This should tell us if the issue is quantization-related.
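A rough sketch of where these settings typically go in the compile options used by the edgeai-tidl-tools osrt_python flow (the values and paths here are examples, not a recommendation):

compile_options = {
    "tidl_tools_path": "/path/to/tidl_tools",          # placeholder path
    "artifacts_folder": "./model-artifacts/rtmpose",   # placeholder path
    "tensor_bits": 16,                                 # try 8, 16, and 32 and compare outputs
    "advanced_options:calibration_frames": 20,
    "advanced_options:calibration_iterations": 20,
}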
Are you using TIDL tools for 10.1 or 10.0 SDK?
There are many subgraphs here as well. We should be able to reduce this. Some fixes are part of the tidl_onnx_model_optimizer already, although we may have to remove a few default rules that were problematic on a previous version of your model. At the least, we can use rules like:
I expect these to remove at least 4 of the 8 subgraphs you have now. Clip->Div is the only part of your model that I don't have an immediate solution for.
BR,
Reese
Hello,
Conclusion: after postprocessing the 2*1*106*512 results, all 106 points are the same, regardless of whether all GlobalAveragePool layers are replaced.
I replaced all GlobalAveragePool layers with Reshape+MatMul+Reshape, MaxPool with cascaded MaxPool, and Unsqueeze with Reshape, then retried all the NPU artifacts based on this model (just adding postprocessing in code to take the max over the 512 dimension: 2*1*106*512 ---> 2*1*106*1). Now there are 4 subgraphs. This ONNX model produces correct results when run on the CPU.
- convert_large_global_avg_pooling_to_matmul
- convert_reducemean_to_matmul
- convert_maxpool_to_cascaded_maxpool
- convert_unsqueeze_to_reshape
We decided to split face keypoints and head pose into separate models. Could your team help compile this ONNX model, which is based on mmpose RTMPose?
How do I do this? In edgeai-tidl-tools/examples/osrt_python/ort/onnxrt_ep.py I can't see anything about calibration epochs or tensor bits.
For accuracy, can you try compiling for tensor_bits = 16 and 32 and check the output in each case? This should tell us if the issue is quantization-related.
Hi Xiaojun
I am working on comparing the layer traces. But now I have a problem: setting the debug level to 4 requires additional memory and generates the following error:
root@am62axx-evm:/opt/edgeai/edgeai-tidl-tools/examples/osrt_python/ort# python3 onnxrt_ep_no_post.py -m tianma_model Available execution providers : ['TIDLExecutionProvider', 'TIDLCompilationProvider', 'CPUExecutionProvider'] Running 1 Models - ['tianma_model'] Running_Model : tianma_model libtidl_onnxrt_EP loaded 0x2de888b0 artifacts_folder = ../../../model-artifacts//tianma_model/artifacts debug_level = 4 target_priority = 0 max_pre_empt_delay = 340282346638528859811704183484516925440.000000 Final number of subgraphs created are : 5, - Offloaded Nodes - 186, Total Nodes - 198 In TIDL_createStateInfer Compute on node : TIDLExecutionProvider_TIDL_0_0 ************ in TIDL_subgraphRtCreate ************ APP: Init ... !!! 763.027503 s: MEM: Init ... !!! 763.027575 s: MEM: Initialized DMA HEAP (fd=5) !!! 763.027733 s: MEM: Init ... Done !!! 763.027759 s: IPC: Init ... !!! 763.044549 s: IPC: Init ... Done !!! REMOTE_SERVICE: Init ... !!! REMOTE_SERVICE: Init ... Done !!! 763.048551 s: GTC Frequency = 200 MHz APP: Init ... Done !!! 763.048684 s: VX_ZONE_INIT:Enabled 763.048699 s: VX_ZONE_ERROR:Enabled 763.048708 s: VX_ZONE_WARNING:Enabled 763.049615 s: VX_ZONE_INIT:[tivxPlatformCreateTargetId:124] Added target MPU-0 763.049949 s: VX_ZONE_INIT:[tivxPlatformCreateTargetId:124] Added target MPU-1 763.050212 s: VX_ZONE_INIT:[tivxPlatformCreateTargetId:124] Added target MPU-2 763.050449 s: VX_ZONE_INIT:[tivxPlatformCreateTargetId:124] Added target MPU-3 763.050481 s: VX_ZONE_INIT:[tivxInitLocal:136] Initialization Done !!! 763.050713 s: VX_ZONE_INIT:[tivxHostInitLocal:106] Initialization Done for HOST !!! ************ TIDL_subgraphRtCreate done ************ In TIDL_createStateInfer Compute on node : TIDLExecutionProvider_TIDL_1_1 ************ in TIDL_subgraphRtCreate ************ ************ TIDL_subgraphRtCreate done ************ In TIDL_createStateInfer Compute on node : TIDLExecutionProvider_TIDL_2_2 ************ in TIDL_subgraphRtCreate ************ 763.238992 s: MEM: ERROR: Alloc failed with status = 12 !!! 763.239040 s: VX_ZONE_ERROR:[tivxMemBufferAlloc:90] Shared mem ptr allocation failed 763.239052 s: VX_ZONE_ERROR:[ownAllocReferenceBufferGeneric:340] Memory allocation failed 763.239063 s: VX_ZONE_ERROR:[ownGraphAllocateDataObject:1031] Memory allocation for data reference failed 763.239074 s: VX_ZONE_ERROR:[vxVerifyGraph:2199] Memory alloc for data objects failed 763.239085 s: VX_ZONE_ERROR:[vxVerifyGraph:2311] Graph verify failed TIDL_RT_OVX: ERROR: Verifying TIDL graph ... Failed !!! TIDL_RT_OVX: ERROR: Verify OpenVX graph failed ************ TIDL_subgraphRtCreate done ************ In TIDL_createStateInfer Compute on node : TIDLExecutionProvider_TIDL_3_3 ************ in TIDL_subgraphRtCreate ************ 763.247303 s: MEM: ERROR: Alloc failed with status = 12 !!! 763.247348 s: VX_ZONE_ERROR:[tivxMemBufferAlloc:90] Shared mem ptr allocation failed 763.247360 s: VX_ZONE_ERROR:[ownAllocReferenceBufferGeneric:340] Memory allocation failed 763.247371 s: VX_ZONE_ERROR:[ownGraphAllocateDataObject:1031] Memory allocation for data reference failed 763.247382 s: VX_ZONE_ERROR:[vxVerifyGraph:2199] Memory alloc for data objects failed 763.247393 s: VX_ZONE_ERROR:[vxVerifyGraph:2311] Graph verify failed TIDL_RT_OVX: ERROR: Verifying TIDL graph ... Failed !!! 
TIDL_RT_OVX: ERROR: Verify OpenVX graph failed ************ TIDL_subgraphRtCreate done ************ In TIDL_createStateInfer Compute on node : TIDLExecutionProvider_TIDL_4_4 ************ in TIDL_subgraphRtCreate ************ 763.257352 s: MEM: ERROR: Alloc failed with status = 12 !!! 763.257401 s: VX_ZONE_ERROR:[tivxMemBufferAlloc:90] Shared mem ptr allocation failed 763.257413 s: VX_ZONE_ERROR:[ownAllocReferenceBufferGeneric:340] Memory allocation failed 763.257424 s: VX_ZONE_ERROR:[ownGraphAllocateDataObject:1031] Memory allocation for data reference failed 763.257435 s: VX_ZONE_ERROR:[vxVerifyGraph:2199] Memory alloc for data objects failed 763.257445 s: VX_ZONE_ERROR:[vxVerifyGraph:2311] Graph verify failed TIDL_RT_OVX: ERROR: Verifying TIDL graph ... Failed !!! TIDL_RT_OVX: ERROR: Verify OpenVX graph failed ************ TIDL_subgraphRtCreate done ************ ******* In TIDL_subgraphRtInvoke ******** [C7x_1 ] 763.330949 s: 0 1.00000 17.00000 255.00000 6 [C7x_1 ] 763.448313 s: 5 0.50000 12.00000 254.00000 1 [C7x_1 ] 763.709967 s: 6 0.04539 -2005.05115 980.49207 1 [C7x_1 ] 764.048756 s: 7 5.80933 0.00000 0.86068 0 [C7x_1 ] 764.377386 s: 8 0.13183 0.00000 853.37659 1 [C7x_1 ] 764.696369 s: 9 0.03640 -2444.87451 947.73218 1 [C7x_1 ] 765.035074 s: 10 4.65954 0.00000 0.85845 0 [C7x_1 ] 765.353711 s: 11 0.08481 0.00000 825.37476 1 [C7x_1 ] 765.672484 s: 12 0.01743 -3959.20020 1864.84058 1 [C7x_1 ] 766.249094 s: 13 2.23075 0.00000 0.89656 0 [C7x_1 ] 766.705231 s: 14 0.03888 0.00000 1659.07666 1 [C7x_1 ] 767.201182 s: 15 0.08849 -1192.18604 672.37036 1 [C7x_1 ] 767.539965 s: 16 11.32709 0.00000 0.97112 0 [C7x_1 ] 767.848516 s: 17 0.12530 0.00000 646.46973 1 [C7x_1 ] 768.146976 s: 18 0.09069 -1025.43579 82.69644 1 [C7x_1 ] 768.341678 s: 19 11.60872 0.00000 0.94756 0 [C7x_1 ] 768.536278 s: 20 1.05283 0.00000 83.10921 1 [C7x_1 ] 768.720892 s: 21 0.99571 -125.53882 46.19829 1 [C7x_1 ] 768.935700 s: 22 127.45061 0.00000 0.99646 0 [C7x_1 ] 769.130362 s: 23 1.98287 0.00000 46.39743 1 [C7x_1 ] 769.315001 s: 24 0.35777 -357.77322 120.18944 1 [C7x_1 ] 769.509613 s: 25 45.79437 0.00000 0.98265 0 [C7x_1 ] 769.714310 s: 26 1.02399 0.00000 114.25931 1 [C7x_1 ] 769.898952 s: 27 0.54613 -221.56052 162.96600 1 [C7x_1 ] 770.093589 s: 28 69.90415 0.00000 0.98707 0 [C7x_1 ] 770.298299 s: 29 0.59651 0.00000 160.93677 1 [C7x_1 ] 770.492997 s: 30 2.50239 -43.35855 25.77536 1 [C7x_1 ] 770.697802 s: 31 160.15294 0.00000 0.99905 0 [C7x_1 ] 770.912589 s: 32 3.13098 -0.31939 26.03021 1 [C7x_1 ] 771.117308 s: 33 1.98287 -0.50432 46.39743 1 [C7x_1 ] 771.322036 s: 34 2.10566 -0.94982 55.80190 1 [C7x_1 ] 771.630731 s: 35 2.10566 -0.94982 55.80190 1 [C7x_1 ] 771.939034 s: 36 2.10566 -0.94982 55.80190 1 [C7x_1 ] 772.247548 s: 1 524288.00000 0.00000 0.00000 1 [C7x_1 ] 772.349586 s: 37 4.21133 -0.11873 15.55330 1 [C7x_1 ] 772.450381 s: 38 4.21133 -0.11873 15.55330 1 [C7x_1 ] 772.551249 s: 39 4.21133 -0.11873 15.55330 1 [C7x_1 ] 772.652131 s: 40 3.09413 -40.23747 40.88386 1 [C7x_1 ] 772.752945 s: 41 99.01219 0.00000 0.99988 1 [C7x_1 ] 772.853839 s: 42 3.25760 -0.92092 36.68346 1 [C7x_1 ] 773.152374 s: 43 0.77269 -104.82794 65.35569 1 [C7x_1 ] 773.491508 s: 44 98.90493 0.00000 0.99085 0 [C7x_1 ] 773.810090 s: 45 1.19411 0.00000 64.90165 1 [C7x_1 ] 774.118805 s: 46 0.73371 -165.59564 99.49368 1 [C7x_1 ] 774.323545 s: 47 93.91551 0.00000 0.99025 0 [C7x_1 ] 774.518230 s: 48 1.07668 0.00000 97.98685 1 [C7x_1 ] 774.702852 s: 49 0.49600 -198.58832
Before the execution is killed by the error, 173 layer traces are stored and compared with the PC. All of those layer traces are the same. I will enlarge the shared memory tomorrow and try again. Changing the memory size and rebuilding is time-consuming, so please expect a delayed response.
Regards,
Adam
Hi Reese,
Seeking your help here. As my last reply says, a memory problem occurs when I try to dump layer traces, but I am not sure which part of memory it uses. I enlarged edgeai-core-heap-memory, but it does not help.
Regards,
Adam
I added the postprocess ops to the model, but that also failed. I have tried no fewer than 10 combinations, but all failed.
Is there any other way to solve these problems?
Here is the artifacts file.
Hi Adam,
Hmm, perhaps that is not the right memory region to increase. I have not run into this MALLOC error during trace dump.
From running a network with debug_level 2 and 5, I can see the memrec tables differ for entry 9, which is part of the DDR_C7X_1_SCRATCH region (address starts with 0xB900). I think that region needs an increase. We may be able to confirm this by looking at the memrec tables.
If 173 of 198 layers are the same as on PC, then the difference must be in some of the last layers.
I suggest running in a host-emulation mode at this stage. This is preferred when working on accuracy issues. This will also let us analyze traces without worrying about memory maps and allocation failures.
What I see so far is that the 8-bit and 16-bit models do indeed have substantially different output than CPU and 32-bit execution. 16-bit is less severe, but still different. When running the network with tensor_bits=32 through TIDL, the output is the same as with offload disabled (run on CPU, no TIDL at all). This tells us that the quantized version of the model has limited accuracy. The debugging steps below will help us understand at which layer the output for tensor_bits=8 and 16 starts to diverge from tensor_bits=32.
import numpy as np
import argparse
import matplotlib
import matplotlib.pyplot as plt
import os
import sys
import subprocess
import shutil


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('tracedir_fixed', type=str, default=None)
    parser.add_argument('tracedir_float', type=str, default=None)
    parser.add_argument('-s', '--save_trace_dir', type=str, default=None)
    parser.add_argument('-t', '--tensor_bits', type=int, default=8,
                        help='Tensor_bits used for these traces. Hybrid mode not supported yet')
    args = parser.parse_args()
    return args


def save_error_plot(float_data, fixed_data, axes):
    mx = np.max(float_data)
    mn = np.min(float_data)
    org_diff = (fixed_data - float_data)
    combined = np.vstack((float_data, fixed_data, org_diff)).T
    # np.savetxt("figs\\" + str(i).zfill(4) + "_float.txt", combined, fmt='%10.6f, %10.6f, %10.6f')
    abs_diff = abs(fixed_data - float_data)
    maxIndex = np.argmax(abs_diff)
    max_abs_diff = np.max(abs_diff)
    mean_abs_diff = np.mean(abs_diff)
    var_abs_diff = np.var(abs_diff)
    axes.hist(abs_diff, color='blue', edgecolor='black', bins=60)
    image_txt = "Hist; MeanAbsDiff=%7.4f, MaxAbsDiff=%7.4f, MaxVal=%7.3f" % (mean_abs_diff, max_abs_diff, mx)
    axes.set_title(image_txt, fontdict={'fontsize': 8})
    axes.set_xlabel('tensor element values')
    axes.set_ylabel('value frequency')


def save_pc_ref_plot(float_output, fixed_output, axes):
    axes.set_title("Float output Vs Fixed Output : Plot 1")
    axes.set_xlabel('Float Output (tensor_bits 32 / reference)')
    axes.set_ylabel('Fixed Output (dequantized to fp32)')
    axes.plot(float_output, fixed_output, '.')


def save_pc_ref_plot2(float_output, fixed_output, axes):
    axes.set_title("Float output Vs Fixed Output : Plot 2")
    axes.plot(float_output, "bs", label="Float")
    axes.plot(fixed_output, "c.", label="Fixed")
    axes.legend(loc='upper right', frameon=True)


fig, axs = plt.subplots(ncols=2)
plt.subplots_adjust(left=0.075, right=0.95)
fig.set_figwidth(12)


def compare_traces(float_tracefile, fixed_tracefile, save_pngs_dir=None):
    float_data = np.fromfile(float_tracefile, dtype=np.float32)
    fixed_data = np.fromfile(fixed_tracefile, dtype=np.float32)

    axs[0].clear()
    axs[1].clear()

    # trace names are like tidl_traceAAAA_BBBBB_CCCCC_DDDDDxEEEEE.y, where AAAA is dataId,
    # BBBBB is batch number, CCCCC is channel number, DDDDD is width and EEEEE is height
    layer_info = float_tracefile.split('/')[-1].split('_')[3:-1]
    print('subgraph | data ID | DIM0 | DIM1 | batch number | channel | width x height')
    print(layer_info)
    data_id = layer_info[1]
    print(f'data ID: {data_id}')

    save_pc_ref_plot(float_data, fixed_data, axs[0])
    save_pc_ref_plot2(float_data, fixed_data, axs[0])
    save_error_plot(float_data, fixed_data, axs[1])

    fig.suptitle(f'Analysis for data ID {data_id}')  # TODO: read layer_info file for string name of the layer
    plt.draw()
    if save_pngs_dir is not None:
        fig.savefig(os.path.join(save_pngs_dir, float_tracefile.split('/')[-1]) + '.png')
    else:
        print('PNG not saved')


def main():
    args = parse_args()

    files_fixed = os.listdir(args.tracedir_fixed)
    files_fixed.sort()
    traces_fixed = [f for f in files_fixed if '_float.bin' in f]
    traces_fixed.sort()
    num_files = len(traces_fixed)

    files_float = os.listdir(args.tracedir_float)
    files_float.sort()
    traces_float = [f for f in files_float if '_float.bin' in f]
    traces_float.sort()

    for i in range(num_files):
        filename_fixed = traces_fixed[i]
        filename_float = None
        for j in range(len(traces_float)):
            if filename_fixed in traces_float[j]:
                filename_float = traces_float[j]
                print(filename_float)
                break
        if filename_fixed is None or filename_float is None:
            print('skip %s / %s\n\n' % (filename_fixed, filename_float))
            continue
        print(filename_fixed)
        print(filename_float)
        print('found files; now compare traces')
        filename_float = os.path.join(args.tracedir_float, filename_float)
        filename_fixed = os.path.join(args.tracedir_fixed, filename_fixed)
        print(filename_fixed)
        print(filename_float)
        compare_traces(float_tracefile=filename_float, fixed_tracefile=filename_fixed,
                       save_pngs_dir=args.save_trace_dir)


if __name__ == '__main__':
    main()
I note that the same behavior is seen in the 10_00_00_08 and 10_01_02_00 tidl-tools versions.
Note that we can probably eliminate a subgraph here with a model change:
By setting the ArgMax axis to -3 instead of 1 and moving the Cast after the Concat, although I am unsure whether the Flatten before ArgMax will permit this axis setting.
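A small sketch of that ArgMax attribute change with the onnx API (the file names are placeholders):

import onnx

model = onnx.load("model.onnx")               # placeholder file name
for node in model.graph.node:
    if node.op_type == "ArgMax":
        for attr in node.attribute:
            if attr.name == "axis":
                attr.i = -3                   # previously 1
onnx.save(model, "model_argmax_axis.onnx")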
Hi Reese,
There are a few problems to solve with the model right now:
1. Too many subgraphs. I have suggested the customer modify the model structure so that the Conv and Mul operators form one group and other operators like Abs form another group, to reduce the number of subgraphs.
2. Problem with the Sigmoid layers. As you suggested, I used PC simulation to run 32-bit and 8-bit and found that there are problems with the BatchNorm+Sigmoid layers. All Sigmoid layers have bad accuracy:
I am using tools 10.0.8 since SDK 10.1 has not been released. I need your comment on whether the customer should change all Sigmoid layers to ReLU.
Regards,
Adam
Another question: can we run CPU + NPU at the same time, based on edgeai-gst-apps/app_cpp? We tried deleting the contents of allownodes.txt, and tried adding this code to the postprocess part:
auto allocator_info = Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeCPU);
Ort::Value input_tensor_ = Ort::Value::CreateTensor<float>(allocator_info, input_image_.data(), input_image_.size(), input_shape.data(), input_shape.size());
auto cpu_output = ort_session->Run(Ort::RunOptions{ nullptr }, &input_names[0], &input_tensor_, 1, output_names.data(), 1);
const float* output_cpu = cpu_output[0].GetTensorMutableData<float>();
But the FPS then drops to 5. Is that normal?
(We tried to use the usual OpenCV method but found that imread and imwrite couldn't work because of: [100%] Linking CXX executable
/usr/lib/gcc/aarch64-oe-linux/13.3.0/../../../../aarch64-oe-linux/bin/ld:(.text.startup+0x128): undefined reference to `cv:imread//imwrite)
moving the Cast after the Concat
This leads to [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running TIDL_1 node. Name:'TIDLExecutionProvider_TIDL_1_1' Status Message: TIDL Compute Import Failed. Maybe caused by the Concat of the ArgMax (int) output.
Two outputs fail; one output without postprocessing.
Conclusion: after postprocessing the 2*1*106*512 results, all 106 points are the same, regardless of whether all GlobalAveragePool layers are replaced.
Hi xiaojun,
The first question, about FPS, is a separate problem. Could you file a different ticket for that?
As for the model, there are other problems with it. Please allow us some time to make a workaround for that.
Regards,
Adam
Hi Xiaojun,
Reese is out this week and won't be able to respond to you until next week.
Regards,
Jianzhong
Hi Adam,
1. Too many subgraphs. I have suggested the customer modify the model structure so that the Conv and Mul operators form one group and other operators like Abs form another group, to reduce the number of subgraphs.
2. Problem with the Sigmoid layers. As you suggested, I used PC simulation to run 32-bit and 8-bit and found that there are problems with the BatchNorm+Sigmoid layers. All Sigmoid layers have bad accuracy:
Understood on the two points. For the first, let me know if help is needed to make these optimizations. I see several places where automated scripts might help. Additionally, some layers that were previously on CPU should run with TIDL now with 10.1 SDK, like Abs and Pow.
The sigmoid one deserves further investigation. As a start, I'd recommend trying the 10.1 tools; the 10.1 SDK released this week, so it is ready to try. I agree that the data shown in those traces is not good quantization. Can you provide the model + import config used here so I can reproduce and log this as an issue? Do you know if hard-sigmoid sees the same behavior? I see that this network uses both.
Is the c666-1.10 model above showing these sigmoid errors?
sure.
Can you provide model + import config used here so I can reproduce and log as an issue?
Here the artifacts file
'c666' :create_model_config(
preprocess=AttrDict(
resize=256,
crop=256,
data_layout="NCHW",
resize_with_pad=False,
reverse_channels=False,
),
session=AttrDict(
session_name="onnxrt", #_face_1x3x120x120 modified_ -op11 modified_sparse_face_me
model_path=os.path.join( "/home/zxb/Desktop/ti/final-0gmp1.onnx"),
input_mean=[0, 0, 0],
input_scale=[1, 1, 1],
),
task_type="classification",
extra_info=AttrDict(num_images=numImages, num_classes=1000),
),
Hello,
Thank you for supplying this. I have logged this sigmoid accuracy problem as an issue to resolve. In the meantime, please replace these layers with ReLU.
For this c666 model, what else do you need assistance with? I believe there are still some issues with subgraphs / performance. Please help me understand your current status -- it is not clear to me. Perhaps one of these challenges is the Cast / ArgMax / Concat issue from above?
This leads to [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running TIDL_1 node. Name:'TIDLExecutionProvider_TIDL_1_1' Status Message: TIDL Compute Import Failed. Maybe caused by the Concat of the ArgMax (int) output.
Two outputs fail; one output without postprocessing.
I assume you have changed the model for this.
BR,
Reese
Left: CPU; right: NPU.
Changed Sigmoid -> ReLU.
Removed Flatten and ArgMax.
What can I say...
Hello,
I understand this has been a frustrating experience, thank you for your perseverance -- it is much appreciated.
You have found a configuration that passes the compilation stage and can run on target. We are now facing an accuracy challenge, and your image makes this very obvious. Replacement with ReLU gives reasonable output on the CPU, but poor output on the NPU/C7x.
This will take investigation to understand which layer(s) cause accuracy issue.
As a first step, I ran the model with tensor_bits set to 8, 16, and 32, and can see a big difference (but the correct order of magnitude) between each quantization level. 32-bit is a reference floating-point mode and is within a very small error margin of the CPUExecutionProvider, so 32-bit with TIDL is a good reference to compare against.
The next step is layer-level analysis. We need to run the model with debug_level=4 and tensor_bits=8 and save the traces written under /tmp/tidl_trace...._float.bin. Then recompile and run the same with tensor_bits=32, and similarly save the traces. We can compare between 8 and 32 in the same way Adam did above.
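Since both runs write to the same /tmp location, a small helper to stash each run's traces into its own folder can be handy (the directory names and the comparison-script name below are placeholders):

import glob
import os
import shutil

def stash_traces(dest_dir):
    """Copy the per-layer float traces produced by a debug_level=4 run into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    for f in glob.glob("/tmp/tidl_trace*_float.bin"):
        shutil.copy(f, dest_dir)

# After the tensor_bits=8 run:   stash_traces("./traces_8bit")
# After the tensor_bits=32 run:  stash_traces("./traces_32bit")
# Then compare, e.g.: python3 compare_traces.py ./traces_8bit ./traces_32bit -s ./trace_pngs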
I have attached my script that includes these visualization functions:
Use the functions from the script above for the accuracy comparison.
As for reducing the number of subgraphs -- several layers you are using are not supported. 10.1 adds support for Abs and Pow layers, but your model still results in 5 subgraphs due to the ReduceSum, Div (with both inputs variable), and Max layers.
BR,
Reese
Hi
I have tried your new model on 10.1 in PC mode with 32-bit TIDL-RT and 8-bit TIDL-RT, and I found that accuracy is gradually lost after every convolution layer. Have you tried running the model in PC mode with tensor_bits 8 to see if the output is correct?
Regards,
Adam
I have a question. I don't fully understand how TI's tensor_bits works. My ONNX model is 32-bit; when I convert it directly using the TIDL tool, will it automatically convert it to bits=8, will it remain at bits=32 by default, or is it a mixed configuration?
I have tried your new model on 10.1 in PC mode with 32-bit TIDL-RT and 8-bit TIDL-RT, and I found that accuracy is gradually lost after every convolution layer. Have you tried running the model in PC mode with tensor_bits 8 to see if the output is correct?
Should I convert the ONNX model from float32 to int8 first and then compile it?
I noticed that there are extra config.yaml files in ModelMaker's output artifacts folder.
About 10.01: removing ReduceSum and Max also failed. c666-nomax.zip
Changing the variable Div to Pow(-1) + Mul (with the Concat) gives just one subgraph, but it also ends in failure.
Hello,
I have a question. I don't fully understand how TI's tensor_bits works. My ONNX model is 32-bit; when I convert it directly using the TIDL tool, will it automatically convert it to bits=8, will it remain at bits=32 by default, or is it a mixed configuration?
I understand; let me clarify tensor_bits usage.
The tensor_bits parameter can be 8, 16, or 32.
Values 8 and 16 mean that the model will run in fixed-point mode, i.e. it is quantized. You do not need to pre-quantize your model for this (although that is an option). TIDL will use post-training quantization (PTQ) on a set of calibration images. This allows TIDL to find a quantization for each layer, such that the 32-bit floating-point values can be represented as int8 (or int16) instead.
A value of tensor_bits=32 is different. This is considered a reference mode to check model functionality. The model will run in floating point -- it therefore skips calibration, meaning no float-->fixed-point conversion. This mode is intended to be used on PC.
________________________
About 10.01: removing ReduceSum and Max also failed. c666-nomax.zip
It is clear to me that accuracy is poor right now. Your source code looks okay in the screenshots. Let us figure out why the accuracy is poor.
For now, please continue testing with the c666-nomax model, and ignore the variable Div --> Pow(-1) + Mul change. Let us focus on diagnosing the accuracy issue with the no-max model, as this trick with Div could be further influencing accuracy. We can look at Div afterwards.
Could you recompile your model with tensor_bits=32 and share the output of the same test on your c666-nomax model? I want to compare CPU-based inference (which is good) against NPU-based inference running with TIDL.
If this tensorbits=32 reference mode with TIDL is good, then this issue is due to quantization. If the issue is still present with tensorbits=32, this may indicate a bug in the TIDL SW.
BR,
Reese
If this tensorbits=32 reference mode with TIDL is good, then this issue is due to quantization. If the issue is still present with tensorbits=32, this may indicate a bug in the TIDL SW.
Do you mean: in the edgeai-benchmark folder, run ./run_custom_pc.sh AM62A and set tensor_bits = 8/16/32 in settings_base.yaml?
However, in this case, benchmark and TIDL would be two completely unrelated concepts.
This is a keypoint model, but task_type only offers cls/seg/det, and we don't need postprocessing. How should we set the config?
Is there any other user-friendly way to convert an ONNX model into the format required by EdgeAI (similar to eIQ/RKNN/SNPE)?
Hello,
In TIDL, when I choose different amounts of data for compilation, the NPU results also differ. When I then go to edgeai-benchmark and run ./run_custom_pc.sh AM62A,
the .sh file uses settings_base.yaml, and if I set task_selection: keypoint_detection it reports an error: Traceback (most recent call last):
File "/home/zxb/Desktop/ti/edgeai-tensorlab/edgeai-benchmark/edgeai_benchmark/datasets/__init__.py", line 188, in get_datasets
dataset_cache[DATASET_CATEGORY_IMAGENET]['calibration_dataset'] = ImageNetDataSetType(**imagenet_cls_calib_cfg, download=download)
TypeError: 'NoneType' object is not subscriptable
The current model is a keypoint model. Although the model zoo has human-pose models, there are no examples for this in either the TIDL configs or the benchmark configs, and compiling it directly as classification is also problematic. It is not the case, as mentioned at the beginning, that a custom model compiled as classification lets the NPU get results close to the CPU.
We have never fully understood this part. We have already communicated our requirements and provided the model, but we have not received the answer we were looking for.
Hi Xiaojun,
As we suggested, your model may not get the expected output after quantization because you need to add BatchNorm after each Conv.
Or, you can create a new model with TI model-maker to do keypoint detection, which will accelerate your model deployment.
Regards,
Adam
Hi
Also, I found that your model's output is similar to mmpose face keypoint detection.
If you are using mmpose based model, you can create a new model with https://github.com/TexasInstruments/edgeai-tensorlab/tree/main/edgeai-mmpose
Regards,
Adam
Dear Adam,
edgeai-tensorlab/edgeai-mmpose/README.md
This repository is an extension of the popular mmpose open source repository for keypoint detection training. In edge-mmpose, we focus on yolox based keypoint detection models that are optimized for speed and accuracy so that they run efficiently on embedded devices. For this purpose, we have added a set of embedded friendly model configurations and scripts.
This seems to indicate that TI has not conducted research on other configurations of mmpose, and I believe this suggestion is not entirely right. Our model has 331 keypoints, but for the preliminary ONNX verification we used the 98-point base model of RTMPose for validation and have made the artifacts file publicly available.
By the way, we conducted some verification on the regnet.onnx (classification) within TI's Model Zoo. Similarly, the results from the CPU and NPU are not consistent. It is unclear whether this indicates that the issue does not lie with the Batch Normalization (BN) layer.
'kd-7060':utils.dict_update(common_cfg,
preprocess=preproc_transforms.get_transform_onnx(640, 640, reverse_channels=True, resize_with_pad=[True, "corner"], backend='cv2', pad_color=[114,114,114]),
session=onnx_session_type(**sessions.get_common_session_cfg(settings, work_dir=work_dir, input_optimization=False),
runtime_options=settings.runtime_options_onnx_p2(
det_options=True, ext_options={'object_detection:meta_arch_type': 6,
'object_detection:meta_layers_names_list': f'{settings.models_path}/vision/keypoint/coco/edgeai-yolox/yolox_s_pose_ti_lite_640_20220301_model.prototxt',
'advanced_options:output_feature_16bit_names_list': '/0/backbone/backbone/stem/stem.0/act/Relu_output_0, /0/head/cls_preds.0/Conv_output_0, /0/head/reg_preds.0/Conv_output_0, /0/head/obj_preds.0/Conv_output_0, /0/head/kpts_preds.0/Conv_output_0, /0/head/cls_preds.1/Conv_output_0, /0/head/reg_preds.1/Conv_output_0, /0/head/obj_preds.1/Conv_output_0, /0/head/kpts_preds.1/Conv_output_0, /0/head/cls_preds.2/Conv_output_0, /0/head/reg_preds.2/Conv_output_0, /0/head/obj_preds.2/Conv_output_0, /0/head/kpts_preds.2/Conv_output_0'},
fast_calibration=True),
model_path=f'{settings.models_path}/vision/keypoint/coco/edgeai-yolox/yolox_s_pose_ti_lite_640_20220301_model.onnx'),
postprocess=postproc_transforms.get_transform_detection_yolov5_pose_onnx(squeeze_axis=None, normalized_detections=False, resize_with_pad=True, formatter=postprocess.DetectionBoxSL2BoxLS(), keypoint=True),
metric=dict(label_offset_pred=1), #TODO: add this for other models as well?
model_info=dict(metric_reference={'accuracy_ap[.5:.95]%':49.6, 'accuracy_ap50%':78.0}, model_shortlist=10, compact_name='human-pose-yolox-s-640x640', shortlisted=True, recommended=True)
),
It may be necessary for us to rewrite TIDL/postprocess/humanpose in order to resolve this issue.
Hi Xiaojun
By the way, we conducted some verification on the regnet.onnx (classification) within TI's Model Zoo.
Which specific model are you referring to?
I tried this model with edgeai-tidl-tools and found no error between CPU and NPU:
I suggested adding BatchNorm after each Conv because I found accuracy loss after every Conv+ReLU layer even in PC mode. I compared PC-mode 8-bit results against 32-bit results and found great accuracy loss. So I think the problem is not a difference between PC and NPU, but accuracy loss due to quantization.
Adding BatchNorm is still worth trying. If retraining takes a long time, you can use an untrained model just to verify the results between ONNX mode and TIDL 8-bit mode.
Regards,
Adam
Alright, we'll give it a shot later this week.
Adding BatchNorm is still worth trying. If retraining takes a long time, you can use an untrained model just to verify the results between ONNX mode and TIDL 8-bit mode.