Tool/software:
Non-zero status code returned while running TIDL_5 node. Name:'TIDLExecutionProvider_TIDL_5_5' Status Message: TIDL Compute Import Failed. But if I remove this Split node, compilation passes.
split_node = onnx.helper.make_node(
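(The snippet above is truncated in the original post. For reference, a minimal sketch of how such a Split node can be built with onnx.helper; the tensor names, axis, and node name below are placeholders, not values from the actual model.)

import onnx

# Illustrative only: split one tensor into two outputs along a placeholder axis.
split_node = onnx.helper.make_node(
    "Split",
    inputs=["feature_tensor"],               # placeholder input tensor name
    outputs=["split_out_0", "split_out_1"],  # placeholder output tensor names
    axis=1,                                  # placeholder axis
    name="split_example",
)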
Output 1 can be post-processed on the CPU.
But the other output is more complex.
Slice with 2D :
[Quantization & Calibration for subgraph_1 Started]
2024-12-27 16:48:11.407701156 [E:onnxruntime:, sequential_executor.cc:494 ExecuteKernel] Non-zero status code returned while running Squeeze node. Name:'model/tf.__operators__.getitem_16/strided_slice__354' Status Message: /root/onnxruntime/onnxruntime/core/providers/cpu/tensor/squeeze.h:52 static onnxruntime::TensorShapeVector onnxruntime::SqueezeBase::ComputeOutputShape(const onnxruntime::TensorShape&, const TensorShapeVector&) input_shape[i] == 1 was false. Dimension of input 1 must be 1 instead of 0. shape={1,0,1,1,1,3}
Slice with 3D :
==================== [Optimization for subgraph_6 Started] ====================
[TIDL Import] [PARSER] UNSUPPORTED: All the input tensor dimensions has to be greater then zero. For tensor model/tf.__operators__.getitem_23/strided_slice3, id 0 - Dim 2 is 0 -- [tidl_import_common_model_check.cpp, 2290]
[TIDL Import] ERROR: Invalid input tensor dimension, aborting -- [tidl_import_core.cpp, 2556]
[TIDL Import] ERROR: Network Optimization failed - Failed in function: TIDL_runtimesOptimizeNet -- [tidl_runtimes_import_common.cpp, 1268]
[TIDL Import] [PARSER] ERROR: - Failed in function: TIDL_computeImportFunc -- [tidl_onnxRtImport_EP.cpp, 1713]
2024-12-27 16:40:48.794281915 [E:onnxruntime:, sequential_executor.cc:494 ExecuteKernel] Non-zero status code returned while running TIDL_6 node. Name:'TIDLExecutionProvider_TIDL_6_6' Status Message: TIDL Compute Import Failed.
Replacing the Slice with a Split also fails, for both output 1 and output 2.
Also, when I compile an ONNX model with 4 outputs, artifacts/param.yaml only contains one output node. When I concatenate all results into a single tensor of shape 1*71*3 or 1*1*71*3, the log shows this error:
RUNTIME_EXCEPTION : Non-zero status code returned while running TIDL_3 node. Name:'TIDLExecutionProvider_TIDL_3_3' Status Message: /root/onnxruntime/onnxruntime/core/providers/tidl/tidl_execution_provider.cc:430 void onnxruntime::populateOnnxRtInputParams(Ort::CustomOpApi, OrtKernelContext*, onnxruntime::tidl_ops*, OnnxTIDLSubGraphParams*) (TIDL_MAX_DIM-inputNumDims) >= 0 was false. TIDL_EP: Only tensors up to 6D
Hello,
You have several topics here, and I'll try to address them all. In your follow-up reply, please note which topics you have already found a solution for; it seems you have resolved some of them, but it is not clear to me which.
My understanding is that overall, you are facing challenges with Slice and Split operators in TIDL.
onnxruntime::SqueezeBase::ComputeOutputShape(const onnxruntime::TensorShape&, const TensorShapeVector&) input_shape[i] == 1 was false. Dimension of input 1 must be 1 instead of 0. shape={1,0,1,1,1,3}
ONNX will sometimes complain about TIDL's 6D tensor representation and this is often fixed by defining TIDL_RT_ONNX_VARDIM=1 in the calling environment.
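For example, the flag can be set from the Python script that creates the session, before the TIDL execution provider is loaded (setting it with export in the calling shell works equally well):

import os

# Should be present in the environment before the TIDL EP library is initialized.
os.environ["TIDL_RT_ONNX_VARDIM"] = "1"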
Replacing the Slice with a Split also fails, for both output 1 and output 2.
Allow me to be honest here -- I think this portion would be better off running with ONNXRT's Arm-based execution if TIDL is having a hard time with it. There is very little computation here, so acceleration is not worthwhile; the overhead is probably high enough that C7x acceleration does not give much benefit.
Otherwise, it seems like TIDL is not allowing you to use slice or split nodes as you like. Assuming those adhere to the supported_operators page, please share model+compilation logs for the Slice or Split configurations you are trying to use. The single line error message is not sufficient to suggest a fix.
It would help to see the model file and the SVGs for the network.
The param.yaml files are mainly useful for well-defined model types, like object detection (1 output for boxes, 1 for classes), segmentation (image mask), etc. For a custom model type like yours, I do not expect param.yaml to know how to encode the postprocessing information. It likely generated the outputs based on the 'model_type' within your model_configs.py for this model.
For the two images below (outputs), is one from CPU and one from TIDL? Please provide context or labels for the images.
Are you getting 'nan' for the TIDL version of the network?
RUNTIME_EXCEPTION : Non-zero status code returned while running TIDL_3 node. Name:'TIDLExecutionProvider_TIDL_3_3' Status Message: /root/onnxruntime/onnxruntime/core/providers/tidl/tidl_execution_provider.cc:430 void onnxruntime::populateOnnxRtInputParams(Ort::CustomOpApi, OrtKernelContext*, onnxruntime::tidl_ops*, OnnxTIDLSubGraphParams*) (TIDL_MAX_DIM-inputNumDims) >= 0 was false. TIDL_EP: Only tensors up to 6D
I will need to see model file and artifacts to give comment here. Looks like it found >6 dimensions in one of your tensors.
For models whose output does not match one of the tasks we outright enable (includes classification, object detection, segmentation, keypoint detection), you will need to include your own postprocessing code.
Overall suggestions:
BR,
Reese
If I set TIDL_RT_ONNX_VARDIM=1, compilation cannot even get past the front part of the model. The input shape is 1*1024*1*1, so why is it detected as 1*1*1*1? The ONNX model runs correctly on the CPU. Also, if I cut the model at these ops, whether I use Gemm or MatMul+Add, it all ends in: GEMM: Dimension mismatch, W: {1024,62} K: 1 N:62
ttt.zip: https://drive.google.com/file/d/1taPxxn-Von3IIDWJZkCJEpDg_WqQ1dlG/view?usp=sharing
Here is the mini model. When TIDL_RT_ONNX_VARDIM=0, the last Add node causes an error and must be replaced with Neg+Sub: onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running TIDL_1 node. Name:'TIDLExecutionProvider_TIDL_1_1' Status Message: TIDL Compute Import Failed.
If I set TIDL_RT_ONNX_VARDIM=1, the error becomes the one quoted in my previous reply.
I have tried more than a hundred combinations in TIDL, but all of them failed.
All of the ONNX models above produce correct results when run on the CPU.
I do not expect the param.yaml to know how to encode the postprocessing information for these. It likely generated the outputs based on the 'model_type' within your model_configs.py for this model.
For the model with 2 outputs that passes compilation, param.yaml only has 1 output, and it is incorrect. Evidently the output whose results vary is running on the CPU, while the output that always gives the same results (and even contains "nan") is from TIDL.
It likely generated the outputs based on the 'model_type' within your model_configs.py
The result is the same whether model_type is None or classification.
We will do the postprocessing ourselves, so is there any other way to get correct artifacts?
Also, the output model of the optimizer has the same problem:
from tidl_onnx_model_optimizer import optimize
optimize("/ti/modified_modified_22.onnx", "./test.onnx",simplify_mode=all)
Hello,
This is a frustrating situation, I understand. That is many configurations to try without clear solution. I am trying compilation on my side for the ttt.zip model you provided. Thank you.
We are releasing the 10.1 SDK very shortly (for some SoCs it is already released), and edgeai-tidl-tools can be updated to use these tools with the 10_01_00_02 tag (re-run the setup script).
I can replicate your error on 10.0 SDK, but I think we have patched part of this in 10.1
The input tensor cannot be reshaped to the requested shape. Input shape:{1,1,1,1}, requested shape:{-1,1024}
My further comments are based on this 10_01_00_02 tag for edgeai-tidl-tools.
I've tried compilation on my side and found a working configuration for the ttt.onnx model you sent, and I can clearly see that there is a bug in the model import tool (part of the optimization process, which is NOT the same as tidl_onnx_model_optimizer). This seems to be related to the GEMM node.
Would you try compiling with these additional options passed to the TIDLCompilationProvider?:
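(A rough sketch of how such options are typically passed to the compilation provider; the paths and the commented deny_list value below are placeholders.)

import onnxruntime as ort

compile_options = {
    "tidl_tools_path": "/path/to/tidl_tools",     # placeholder path
    "artifacts_folder": "./model-artifacts/ttt",  # placeholder path
    "max_num_subgraphs": 1,
    # "deny_list": "Slice",                       # optionally keep specific op types on Arm/CPU
}

so = ort.SessionOptions()
sess = ort.InferenceSession(
    "ttt.onnx",                                   # placeholder model path
    sess_options=so,
    providers=["TIDLCompilationProvider", "CPUExecutionProvider"],
    provider_options=[compile_options, {}],
)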
My intent here was to cut the last set of layers in the network. This max_num_subgraphs=1 config makes one TIDL subgraph ending with the Squeeze->Unsqueeze->Reshape, and the final layers with GEMM, Add, Slice run on Arm/CPU. This way, the model compiles and runs on PC with C7x emulation.
I noted that allowing a second subgraph to form in that bottom portion generated additional errors. That subgraph included only a GEMM node, and TIDL seems to complain that it does not have a bias tensor. I believe the MatMul -> Add needs to be replaced with GEMM. From your images, I think you have done this in some of your models.
The Slice layers are trying to slice along 2 axes at the same time. This is not supported.
Note: To run on target, we need the 10.1 SDK for AM62A to release. This should happen within the next week or so, once the New Year's holiday has passed.
Summary:
Hi,
I just ran git clone github.com/.../edgeai-tidl-tools.git -b 10_01_00_02 and source ./setup.sh. It ends with the same errors.
Would you try compiling with these additional options passed to the TIDLCompilationProvider?:
'max_num_subgraphs': 1 -- My intent here was to cut the last set of layers in the network. This max_num_subgraphs=1 config makes one TIDL subgraph ending with the Squeeze->Unsqueeze->Reshape, and the final layers with GEMM, Add, Slice run on Arm/CPU. This way, the model compiles and runs on PC with C7x emulation.
The model downloaded from Google Drive is only the part of the model that causes these errors, so this approach does not seem to be a good choice.
A 2D Slice can be changed to 1D Slices via the method below; this is not difficult. But the main problems are: 1. on 10.01 the Reshape op doesn't work; 2. on 10.00.08 some other ops don't work.
from tidl_onnx_model_optimizer import optimize
optimize("/ti/1111-op12.onnx", "./test1111-op12.onnx",simplify_mode=all)
Hello,
I see you have been busy and tried many options. Let me respond to these. I understand now why my suggestion to deny-list and reduce the number of subgraphs will not work -- you have passed me only a portion of the model.
I see one of your notes was to replace 2D slice with 1D slice.
A 2D Slice can be changed to 1D Slices via the method below; this is not difficult.
I shall assume the Slice layers are now operating fine, but please inform me if I misunderstand. I see from your screenshot that these have been unpacked from a 3-axis Slice into a 1-axis Slice replicated 3 times for the different axes.
1. on 10.01 the Reshape op doesn't work; 2. on 10.00.08 some other ops don't work.
1) Reshape -- I see the error from ONNX saying the input shape is (1,1,1,1) and the requested output is (-1, 1024). This is for the layer shown in part of your screenshot.
2) Other op in 10.0.0.8
Is this for the GEMM / MatMul->Add? I have noted this as failing during the optimization step of TIDL import (this happens internally; it is not the same as the optimizer Python scripts).
This topic is known by development team and will be addressed in the near future. I am tracking the progress on this with our team. I appreciate your patience here.
Please inform if there is another layer not functioning here. I will reproduce the issue if you pass me such a model, and provide to our dev team to fix.
BR,
Reese
If you have tried to use onnx-modifier to modify a model, you will run into a lot of these problems. But all of these ONNX models can be inferred correctly.
In TIDL 10.01:
If I change Squeeze -> Unsqueeze -> Unsqueeze -> Reshape -> MatMul to Squeeze -> MatMul, it sometimes comes back with an error like [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running TIDL_1 node. Name:'TIDLExecutionProvider_TIDL_1_1' Status Message: TIDL Compute Import Failed.
But in TIDL 10.00.08 this happens 100% of the time.
TIDL_RT_ONNX_VARDIM=0: RUNTIME_EXCEPTION : Non-zero status code returned while running Gemm node. Name:'gemm' Status Message: gemm_helper.h:14 onnxruntime::GemmHelper::GemmHelper(const onnxruntime::TensorShape&, bool, const onnxruntime::TensorShape&, bool, const onnxruntime::TensorShape&) left.NumDimensions() == 2 || left.NumDimensions() == 1 was false.
TIDL_RT_ONNX_VARDIM=1:INVALID_ARGUMENT : Non-zero status code returned while running Gemm node. Name:'gemm' Status Message: GEMM: Dimension mismatch, W: {1024,62} K: 1 N:62
For this Squeeze -> Unsqueeze -> Unsqueeze -> Reshape -> MatMul, why can't Squeeze -> MatMul be used?
Replacing MatMul -> Add with GEMM:
Both GEMM and MatMul+Add fail, even when using onnxslim to simplify.
7776.tempDir.zip -- Here is this model's tempDir.
Thank you for providing the TempDir. I see that there are multiple subgraphs that have not finished parsing and optimizing. As a result, 3 of the 4 subgraphs lack SVG graphs. Please share the onnx model as well if you are comfortable -- random weights is ok.
If you have tried to use onnx-modifier to modify a model, you will run into a lot of these problems.
You mean this graphical tool for changing models, yes? https://github.com/ZhangGe6/onnx-modifier
I too have had challenges with onnx-modifier tool, especially for layers with many default attributes or more complex initializers. onnx-modifier is a good tool, but modifying directly with onnx or onnx-graphsurgeon is more stable.
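As a rough sketch of the graphsurgeon route (the file and node names below are placeholders, not from your model), locating a node by name and editing it looks like this:

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))           # placeholder file name

# Example edit: change the axis attribute of a Concat node found by name.
for node in graph.nodes:
    if node.op == "Concat" and node.name == "concat_0":   # placeholder node name
        node.attrs["axis"] = 1

# Remove dangling tensors/nodes and re-sort before saving.
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_modified.onnx")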
TIDL_RT_ONNX_VARDIM=1:INVALID_ARGUMENT : Non-zero status code returned while running Gemm node. Name:'gemm' Status Message: GEMM: Dimension mismatch, W: {1024,62} K: 1 N:62
For the GEMM nodes, can you try representing the dimensions of the 'C' tensor as {1,62} instead of {62}? From what I see, TIDL is rejecting your GEMM (or the equivalent MatMul->Add) from running with acceleration, but then ONNX also hits an error when the tensor returned by TIDL reaches the CPU ONNX runtime and is checked against the constants C and W. If this is the problem, I will see that it gets fixed for a future release.
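If it helps, one way to make the {62} bias two-dimensional is to rewrite the initializer directly with the onnx Python API (the file and initializer names below are placeholders; they depend on how the Gemm's 'C' input is named in your model):

import onnx
from onnx import numpy_helper

model = onnx.load("model.onnx")                                   # placeholder file name
for init in model.graph.initializer:
    if init.name == "gemm_bias":                                  # placeholder initializer name
        c = numpy_helper.to_array(init).reshape(1, -1)            # {62} -> {1, 62}
        init.CopyFrom(numpy_helper.from_array(c, name=init.name))
onnx.save(model, "model_c_2d.onnx")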
And I think using TIDL_RT_ONNX_VARDIM=0 will generally cause issues with layers running on Arm. Let's use =1 here.
HOWEVER, the first subgraph SVG shows these shapes as the output. This suggests that TIDL is passing back a tensor of shape {1024,1,1} or even {1,1}, which ONNX probably does not like. This is probably why ONNX reads 'K' as 1 instead of 1024. I think we are best suited trying to resolve this GEMM/MatMul issue, since this is ultimately what we need.
BR,
Reese
Hello,
We decided to split face keypoints and head pose into separate models. Could your team help compile this ONNX model, which is based on mmpose RTMPose?
Results here: the CPU gets results similar to what we expect, but the NPU does not. Is there anything wrong with my model_configs?
Hello,
A different model architecture now, I understand. We will investigate this. Your model config looks reasonable.
I see in your artifacts that there are multiple subgraphs now, and the artifacts look complete. However, you have noted accuracy is poor from your spreadsheet.
For accuracy, can you try compiling for tensor_bits = 16 and 32 and check the output in each case? This should tell us if the issue is quantization-related.
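A rough sketch of where these settings typically go in the compile options used by the edgeai-tidl-tools osrt_python flow (the values and paths here are examples, not a recommendation):

compile_options = {
    "tidl_tools_path": "/path/to/tidl_tools",          # placeholder path
    "artifacts_folder": "./model-artifacts/rtmpose",   # placeholder path
    "tensor_bits": 16,                                 # try 8, 16, and 32 and compare outputs
    "advanced_options:calibration_frames": 20,
    "advanced_options:calibration_iterations": 20,
}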
Are you using TIDL tools for 10.1 or 10.0 SDK?
There are many subgraphs here as well. We should be able to reduce this. Some fixes are part of the tidl_onnx_model_optimizer already, although we may have to remove a few default rules that were problematic on a previous version of your model. At the least, we can use rules like:
I expect these to remove at least 4 of the 8 subgraphs you have now. Clip->Div is the only part of your model that I don't have an immediate solution for.
BR,
Reese
Hello,
Conclusion: after postprocessing the 2*1*106*512 results, all 106 points are the same, regardless of whether all GlobalAveragePool layers are replaced.
I replaced all GlobalAveragePool layers with Reshape+MatMul+Reshape, MaxPool with cascaded MaxPool, and Unsqueeze with Reshape, then retried all the NPU artifacts based on this model (just adding postprocessing in code to take the max over the 512 dimension: 2*1*106*512 ---> 2*1*106*1). Now there are 4 subgraphs. This ONNX model produces correct results when run on the CPU.
- convert_large_global_avg_pooling_to_matmul
- convert_reducemean_to_matmul
- convert_maxpool_to_cascaded_maxpool
- convert_unsqueeze_to_reshape
We decided to split face keypoints and head pose into separate models. Could your team help compile this ONNX model, which is based on mmpose RTMPose?
How do I do this? In edgeai-tidl-tools/examples/osrt_python/ort/onnxrt_ep.py I can't see anything about calibration epochs or tensor bits.
For accuracy, can you try compiling for tensor_bits = 16 and 32 and check the output in each case? This should tell us if the issue is quantization-related.
Hi Xiaojun
I am working on comparing the layer traces. But now I have a problem: setting the debug level to 4 requires additional memory and generates the following error:
root@am62axx-evm:/opt/edgeai/edgeai-tidl-tools/examples/osrt_python/ort# python3 onnxrt_ep_no_post.py -m tianma_model Available execution providers : ['TIDLExecutionProvider', 'TIDLCompilationProvider', 'CPUExecutionProvider'] Running 1 Models - ['tianma_model'] Running_Model : tianma_model libtidl_onnxrt_EP loaded 0x2de888b0 artifacts_folder = ../../../model-artifacts//tianma_model/artifacts debug_level = 4 target_priority = 0 max_pre_empt_delay = 340282346638528859811704183484516925440.000000 Final number of subgraphs created are : 5, - Offloaded Nodes - 186, Total Nodes - 198 In TIDL_createStateInfer Compute on node : TIDLExecutionProvider_TIDL_0_0 ************ in TIDL_subgraphRtCreate ************ APP: Init ... !!! 763.027503 s: MEM: Init ... !!! 763.027575 s: MEM: Initialized DMA HEAP (fd=5) !!! 763.027733 s: MEM: Init ... Done !!! 763.027759 s: IPC: Init ... !!! 763.044549 s: IPC: Init ... Done !!! REMOTE_SERVICE: Init ... !!! REMOTE_SERVICE: Init ... Done !!! 763.048551 s: GTC Frequency = 200 MHz APP: Init ... Done !!! 763.048684 s: VX_ZONE_INIT:Enabled 763.048699 s: VX_ZONE_ERROR:Enabled 763.048708 s: VX_ZONE_WARNING:Enabled 763.049615 s: VX_ZONE_INIT:[tivxPlatformCreateTargetId:124] Added target MPU-0 763.049949 s: VX_ZONE_INIT:[tivxPlatformCreateTargetId:124] Added target MPU-1 763.050212 s: VX_ZONE_INIT:[tivxPlatformCreateTargetId:124] Added target MPU-2 763.050449 s: VX_ZONE_INIT:[tivxPlatformCreateTargetId:124] Added target MPU-3 763.050481 s: VX_ZONE_INIT:[tivxInitLocal:136] Initialization Done !!! 763.050713 s: VX_ZONE_INIT:[tivxHostInitLocal:106] Initialization Done for HOST !!! ************ TIDL_subgraphRtCreate done ************ In TIDL_createStateInfer Compute on node : TIDLExecutionProvider_TIDL_1_1 ************ in TIDL_subgraphRtCreate ************ ************ TIDL_subgraphRtCreate done ************ In TIDL_createStateInfer Compute on node : TIDLExecutionProvider_TIDL_2_2 ************ in TIDL_subgraphRtCreate ************ 763.238992 s: MEM: ERROR: Alloc failed with status = 12 !!! 763.239040 s: VX_ZONE_ERROR:[tivxMemBufferAlloc:90] Shared mem ptr allocation failed 763.239052 s: VX_ZONE_ERROR:[ownAllocReferenceBufferGeneric:340] Memory allocation failed 763.239063 s: VX_ZONE_ERROR:[ownGraphAllocateDataObject:1031] Memory allocation for data reference failed 763.239074 s: VX_ZONE_ERROR:[vxVerifyGraph:2199] Memory alloc for data objects failed 763.239085 s: VX_ZONE_ERROR:[vxVerifyGraph:2311] Graph verify failed TIDL_RT_OVX: ERROR: Verifying TIDL graph ... Failed !!! TIDL_RT_OVX: ERROR: Verify OpenVX graph failed ************ TIDL_subgraphRtCreate done ************ In TIDL_createStateInfer Compute on node : TIDLExecutionProvider_TIDL_3_3 ************ in TIDL_subgraphRtCreate ************ 763.247303 s: MEM: ERROR: Alloc failed with status = 12 !!! 763.247348 s: VX_ZONE_ERROR:[tivxMemBufferAlloc:90] Shared mem ptr allocation failed 763.247360 s: VX_ZONE_ERROR:[ownAllocReferenceBufferGeneric:340] Memory allocation failed 763.247371 s: VX_ZONE_ERROR:[ownGraphAllocateDataObject:1031] Memory allocation for data reference failed 763.247382 s: VX_ZONE_ERROR:[vxVerifyGraph:2199] Memory alloc for data objects failed 763.247393 s: VX_ZONE_ERROR:[vxVerifyGraph:2311] Graph verify failed TIDL_RT_OVX: ERROR: Verifying TIDL graph ... Failed !!! 
TIDL_RT_OVX: ERROR: Verify OpenVX graph failed ************ TIDL_subgraphRtCreate done ************ In TIDL_createStateInfer Compute on node : TIDLExecutionProvider_TIDL_4_4 ************ in TIDL_subgraphRtCreate ************ 763.257352 s: MEM: ERROR: Alloc failed with status = 12 !!! 763.257401 s: VX_ZONE_ERROR:[tivxMemBufferAlloc:90] Shared mem ptr allocation failed 763.257413 s: VX_ZONE_ERROR:[ownAllocReferenceBufferGeneric:340] Memory allocation failed 763.257424 s: VX_ZONE_ERROR:[ownGraphAllocateDataObject:1031] Memory allocation for data reference failed 763.257435 s: VX_ZONE_ERROR:[vxVerifyGraph:2199] Memory alloc for data objects failed 763.257445 s: VX_ZONE_ERROR:[vxVerifyGraph:2311] Graph verify failed TIDL_RT_OVX: ERROR: Verifying TIDL graph ... Failed !!! TIDL_RT_OVX: ERROR: Verify OpenVX graph failed ************ TIDL_subgraphRtCreate done ************ ******* In TIDL_subgraphRtInvoke ******** [C7x_1 ] 763.330949 s: 0 1.00000 17.00000 255.00000 6 [C7x_1 ] 763.448313 s: 5 0.50000 12.00000 254.00000 1 [C7x_1 ] 763.709967 s: 6 0.04539 -2005.05115 980.49207 1 [C7x_1 ] 764.048756 s: 7 5.80933 0.00000 0.86068 0 [C7x_1 ] 764.377386 s: 8 0.13183 0.00000 853.37659 1 [C7x_1 ] 764.696369 s: 9 0.03640 -2444.87451 947.73218 1 [C7x_1 ] 765.035074 s: 10 4.65954 0.00000 0.85845 0 [C7x_1 ] 765.353711 s: 11 0.08481 0.00000 825.37476 1 [C7x_1 ] 765.672484 s: 12 0.01743 -3959.20020 1864.84058 1 [C7x_1 ] 766.249094 s: 13 2.23075 0.00000 0.89656 0 [C7x_1 ] 766.705231 s: 14 0.03888 0.00000 1659.07666 1 [C7x_1 ] 767.201182 s: 15 0.08849 -1192.18604 672.37036 1 [C7x_1 ] 767.539965 s: 16 11.32709 0.00000 0.97112 0 [C7x_1 ] 767.848516 s: 17 0.12530 0.00000 646.46973 1 [C7x_1 ] 768.146976 s: 18 0.09069 -1025.43579 82.69644 1 [C7x_1 ] 768.341678 s: 19 11.60872 0.00000 0.94756 0 [C7x_1 ] 768.536278 s: 20 1.05283 0.00000 83.10921 1 [C7x_1 ] 768.720892 s: 21 0.99571 -125.53882 46.19829 1 [C7x_1 ] 768.935700 s: 22 127.45061 0.00000 0.99646 0 [C7x_1 ] 769.130362 s: 23 1.98287 0.00000 46.39743 1 [C7x_1 ] 769.315001 s: 24 0.35777 -357.77322 120.18944 1 [C7x_1 ] 769.509613 s: 25 45.79437 0.00000 0.98265 0 [C7x_1 ] 769.714310 s: 26 1.02399 0.00000 114.25931 1 [C7x_1 ] 769.898952 s: 27 0.54613 -221.56052 162.96600 1 [C7x_1 ] 770.093589 s: 28 69.90415 0.00000 0.98707 0 [C7x_1 ] 770.298299 s: 29 0.59651 0.00000 160.93677 1 [C7x_1 ] 770.492997 s: 30 2.50239 -43.35855 25.77536 1 [C7x_1 ] 770.697802 s: 31 160.15294 0.00000 0.99905 0 [C7x_1 ] 770.912589 s: 32 3.13098 -0.31939 26.03021 1 [C7x_1 ] 771.117308 s: 33 1.98287 -0.50432 46.39743 1 [C7x_1 ] 771.322036 s: 34 2.10566 -0.94982 55.80190 1 [C7x_1 ] 771.630731 s: 35 2.10566 -0.94982 55.80190 1 [C7x_1 ] 771.939034 s: 36 2.10566 -0.94982 55.80190 1 [C7x_1 ] 772.247548 s: 1 524288.00000 0.00000 0.00000 1 [C7x_1 ] 772.349586 s: 37 4.21133 -0.11873 15.55330 1 [C7x_1 ] 772.450381 s: 38 4.21133 -0.11873 15.55330 1 [C7x_1 ] 772.551249 s: 39 4.21133 -0.11873 15.55330 1 [C7x_1 ] 772.652131 s: 40 3.09413 -40.23747 40.88386 1 [C7x_1 ] 772.752945 s: 41 99.01219 0.00000 0.99988 1 [C7x_1 ] 772.853839 s: 42 3.25760 -0.92092 36.68346 1 [C7x_1 ] 773.152374 s: 43 0.77269 -104.82794 65.35569 1 [C7x_1 ] 773.491508 s: 44 98.90493 0.00000 0.99085 0 [C7x_1 ] 773.810090 s: 45 1.19411 0.00000 64.90165 1 [C7x_1 ] 774.118805 s: 46 0.73371 -165.59564 99.49368 1 [C7x_1 ] 774.323545 s: 47 93.91551 0.00000 0.99025 0 [C7x_1 ] 774.518230 s: 48 1.07668 0.00000 97.98685 1 [C7x_1 ] 774.702852 s: 49 0.49600 -198.58832
Before the execution is killed by the error, 173 layer traces are stored and compared with the PC. All of those layer traces are the same. I will enlarge the shared memory tomorrow and try again. Changing the memory size and rebuilding is time-consuming, so please expect a delayed response.
Regards,
Adam
Hi Reese,
Seeking your help here. As my last reply says, a memory problem occurs when I try to dump layer traces, but I am not sure which part of memory it uses. I enlarged edgeai-core-heap-memory, but it does not help.
Regards,
Adam
I added the postprocess ops to the model, but that also failed. I have tried no fewer than 10 combinations, but all failed.
Is there any other way to solve these problems?
Here is the artifacts file.
Hi Adam,
Hmm, perhaps that is not the right memory region to increase. I have not run into this MALLOC error during trace dump.
From running a network with debug_level 2 and 5, I can see the memrec tables differ for entry 9, which is part of the DDR_C7X_1_SCRATCH region (address starts with 0xB900). I think that region needs an increase. We may be able to confirm this by looking at the memrec tables.
If 173 of 198 layers are the same as on PC, then the difference must be in some of the last layers.
I suggest running in a host-emulation mode at this stage. This is preferred when working on accuracy issues. This will also let us analyze traces without worrying about memory maps and allocation failures.
What I see so far is that the 8-bit and 16-bit models do indeed have substantially different output than CPU and 32-bit execution. 16-bit is less severe, but still different. When running the network with tensor_bits=32 through TIDL, the output is the same as with offload disabled (run on CPU, no TIDL at all). This tells us that the quantized version of the model has limited accuracy. The debugging steps below will help us understand at which layer the output for tensor_bits=8 and 16 starts to diverge from tensor_bits=32.
import numpy as np
import argparse
import matplotlib
import matplotlib.pyplot as plt
import os
import sys
import subprocess
import shutil


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('tracedir_fixed', type=str, default=None)
    parser.add_argument('tracedir_float', type=str, default=None)
    parser.add_argument('-s', '--save_trace_dir', type=str, default=None)
    parser.add_argument('-t', '--tensor_bits', type=int, default=8,
                        help='Tensor_bits used for these traces. Hybrid mode not supported yet')
    args = parser.parse_args()
    return args


def save_error_plot(float_data, fixed_data, axes):
    mx = np.max(float_data)
    mn = np.min(float_data)
    org_diff = (fixed_data - float_data)
    combined = np.vstack((float_data, fixed_data, org_diff)).T
    # np.savetxt("figs\\" + str(i).zfill(4) + "_float.txt", combined, fmt='%10.6f, %10.6f, %10.6f')
    abs_diff = abs(fixed_data - float_data)
    maxIndex = np.argmax(abs_diff)
    max_abs_diff = np.max(abs_diff)
    mean_abs_diff = np.mean(abs_diff)
    var_abs_diff = np.var(abs_diff)
    axes.hist(abs_diff, color='blue', edgecolor='black', bins=60)
    image_txt = "Hist; MeanAbsDiff=%7.4f, MaxAbsDiff=%7.4f, MaxVal=%7.3f" % (mean_abs_diff, max_abs_diff, mx)
    axes.set_title(image_txt, fontdict={'fontsize': 8})
    axes.set_xlabel('tensor element values')
    axes.set_ylabel('value frequency')


def save_pc_ref_plot(float_output, fixed_output, axes):
    axes.set_title("Float output Vs Fixed Output : Plot 1")
    axes.set_xlabel('Float Output (tensor_bits 32 / reference)')
    axes.set_ylabel('Fixed Output (dequantized to fp32)')
    axes.plot(float_output, fixed_output, '.')


def save_pc_ref_plot2(float_output, fixed_output, axes):
    axes.set_title("Float output Vs Fixed Output : Plot 2")
    axes.plot(float_output, "bs", label="Float")
    axes.plot(fixed_output, "c.", label="Fixed")
    axes.legend(loc='upper right', frameon=True)


fig, axs = plt.subplots(ncols=2)
plt.subplots_adjust(left=0.075, right=0.95)
fig.set_figwidth(12)


def compare_traces(float_tracefile, fixed_tracefile, save_pngs_dir=None):
    float_data = np.fromfile(float_tracefile, dtype=np.float32)
    fixed_data = np.fromfile(fixed_tracefile, dtype=np.float32)

    axs[0].clear()
    axs[1].clear()

    # trace names are like tidl_traceAAAA_BBBBB_CCCCC_DDDDDxEEEEE.y, where AAAA is dataId,
    # BBBBB is batch number, CCCCC is channel number, DDDDD is width and EEEEE is height
    layer_info = float_tracefile.split('/')[-1].split('_')[3:-1]
    print('subgraph | data ID | DIM0 | DIM1 | batch number | channel | width x height')
    print(layer_info)
    data_id = layer_info[1]
    print(f'data ID: {data_id}')

    save_pc_ref_plot(float_data, fixed_data, axs[0])
    save_pc_ref_plot2(float_data, fixed_data, axs[0])
    save_error_plot(float_data, fixed_data, axs[1])

    fig.suptitle(f'Analysis for data ID {data_id}')  # TODO: read layer_info file for string name of the layer
    plt.draw()
    if save_pngs_dir is not None:
        fig.savefig(os.path.join(save_pngs_dir, float_tracefile.split('/')[-1]) + '.png')
    else:
        print('PNG not saved')


def main():
    args = parse_args()

    files_fixed = os.listdir(args.tracedir_fixed)
    files_fixed.sort()
    traces_fixed = [f for f in files_fixed if '_float.bin' in f]
    traces_fixed.sort()
    num_files = len(traces_fixed)

    files_float = os.listdir(args.tracedir_float)
    files_float.sort()
    traces_float = [f for f in files_float if '_float.bin' in f]
    traces_float.sort()

    for i in range(num_files):
        filename_fixed = traces_fixed[i]
        filename_float = None
        for j in range(len(traces_float)):
            if filename_fixed in traces_float[j]:
                filename_float = traces_float[j]
                print(filename_float)
                break
        if filename_fixed is None or filename_float is None:
            print('skip %s / %s\n\n' % (filename_fixed, filename_float))
            continue
        print(filename_fixed)
        print(filename_float)
        print('found files; now compare traces')
        filename_float = os.path.join(args.tracedir_float, filename_float)
        filename_fixed = os.path.join(args.tracedir_fixed, filename_fixed)
        print(filename_fixed)
        print(filename_float)
        compare_traces(float_tracefile=filename_float, fixed_tracefile=filename_fixed,
                       save_pngs_dir=args.save_trace_dir)


if __name__ == '__main__':
    main()
I note that the same behavior is seen in the 10_00_00_08 and 10_01_02_00 tidl-tools versions.
Note that we can probably eliminate a subgraph here with a model change:
By setting the ArgMax axis to -3 instead of 1 and moving the Cast after the Concat, although I am unsure whether the Flatten before ArgMax will permit this axis setting.
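A small sketch of that ArgMax attribute change with the onnx API (the file names are placeholders):

import onnx

model = onnx.load("model.onnx")               # placeholder file name
for node in model.graph.node:
    if node.op_type == "ArgMax":
        for attr in node.attribute:
            if attr.name == "axis":
                attr.i = -3                   # previously 1
onnx.save(model, "model_argmax_axis.onnx")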
Hi Reese,
There are a few problems to solve with the model right now:
1. Too many subgraphs. I have suggested the customer modify the model structure so that the Conv and Mul operators form one group and other operators like Abs form another group, to reduce the number of subgraphs.
2. Problem with the Sigmoid layers. As you suggested, I used PC simulation to run 32-bit and 8-bit and found that there are problems with the BatchNorm+Sigmoid layers. All Sigmoid layers have bad accuracy:
I am using tools 10.0.8 since SDK 10.1 has not been released. I need your comment on whether the customer should change all Sigmoid layers to ReLU.
Regards,
Adam
Another question: can we run CPU + NPU at the same time, based on edgeai-gst-apps/app_cpp? We tried deleting the contents of allownodes.txt, and tried adding this code to the postprocess part:
auto allocator_info = Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeCPU);
Ort::Value input_tensor_ = Ort::Value::CreateTensor<float>(allocator_info, input_image_.data(), input_image_.size(), input_shape.data(), input_shape.size());
auto cpu_output = ort_session->Run(Ort::RunOptions{ nullptr }, &input_names[0], &input_tensor_, 1, output_names.data(), 1);
const float* output_cpu = cpu_output[0].GetTensorMutableData<float>();
But the FPS then drops to 5. Is that normal?
(We tried to use the usual OpenCV method but found that imread and imwrite couldn't work because of: [100%] Linking CXX executable
/usr/lib/gcc/aarch64-oe-linux/13.3.0/../../../../aarch64-oe-linux/bin/ld:(.text.startup+0x128): undefined reference to `cv:imread//imwrite)
moving the Cast after the Concat
This leads to [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running TIDL_1 node. Name:'TIDLExecutionProvider_TIDL_1_1' Status Message: TIDL Compute Import Failed. Maybe caused by the Concat of the ArgMax (int) output.
Two outputs fail; one output without postprocessing.
Conclusion: after postprocessing the 2*1*106*512 results, all 106 points are the same, regardless of whether all GlobalAveragePool layers are replaced.
Hi xiaojun,
The first question, about FPS, is a separate problem. Could you file a different ticket for that?
As for the model, there are other problems with it. Please allow us some time to make a workaround for that.
Regards,
Adam
Hi Xiaojun,
Reese is out this week and won't be able to respond to you until next week.
Regards,
Jianzhong
Hi Adam,
1. Too many subgraphs. I have suggested the customer modify the model structure so that the Conv and Mul operators form one group and other operators like Abs form another group, to reduce the number of subgraphs.
2. Problem with the Sigmoid layers. As you suggested, I used PC simulation to run 32-bit and 8-bit and found that there are problems with the BatchNorm+Sigmoid layers. All Sigmoid layers have bad accuracy:
Understood on the two points. For the first, let me know if help is needed to make these optimizations. I see several places where automated scripts might help. Additionally, some layers that were previously on CPU should run with TIDL now with 10.1 SDK, like Abs and Pow.
The sigmoid one deserves further investigation. As a start, I'd recommend trying the 10.1 tools; the 10.1 SDK released this week, so it is ready to try. I agree that the data shown in those traces is not good quantization. Can you provide the model + import config used here so I can reproduce and log this as an issue? Do you know if hard-sigmoid sees the same behavior? I see that this network uses both.
Is the c666-1.10 model above showing these sigmoid errors?
sure.
Can you provide model + import config used here so I can reproduce and log as an issue?
Here the artifacts file
'c666' :create_model_config(
preprocess=AttrDict(
resize=256,
crop=256,
data_layout="NCHW",
resize_with_pad=False,
reverse_channels=False,
),
session=AttrDict(
session_name="onnxrt", #_face_1x3x120x120 modified_ -op11 modified_sparse_face_me
model_path=os.path.join( "/home/zxb/Desktop/ti/final-0gmp1.onnx"),
input_mean=[0, 0, 0],
input_scale=[1, 1, 1],
),
task_type="classification",
extra_info=AttrDict(num_images=numImages, num_classes=1000),
),
Hello,
Thank you for supplying this. I have logged this sigmoid accuracy problem as an issue to resolve. In the meantime, please replace these layers with ReLU.
For this c666 model, what else do you need assistance with? I believe there are still some issues with subgraphs / performance. Please help me understand your current status -- it is not clear to me. Perhaps one of these challenges is the Cast / ArgMax / Concat issue from above?
This leads to [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running TIDL_1 node. Name:'TIDLExecutionProvider_TIDL_1_1' Status Message: TIDL Compute Import Failed. Maybe caused by the Concat of the ArgMax (int) output.
Two outputs fail; one output without postprocessing.
I assume you have changed the model for this.
BR,
Reese
Left: CPU; right: NPU.
Changed Sigmoid -> ReLU.
Removed Flatten and ArgMax.
What can I say...
Hello,
I understand this has been a frustrating experience, thank you for your perseverance -- it is much appreciated.
You have found a configuration that passes the compilation stage and can run on target. We are now facing an accuracy challenge, and your image makes this very obvious. Replacement with ReLU gives reasonable output on the CPU, but poor output on the NPU/C7x.
This will take investigation to understand which layer(s) cause accuracy issue.
As a first step, I ran the model with tensor_bits set to 8, 16, and 32, and can see a big difference (but the correct order of magnitude) between each quantization level. 32-bit is a reference floating-point mode and is within a very small error margin of the CPUExecutionProvider, so 32-bit with TIDL is a good reference to compare against.
The next step is layer-level analysis. We need to run the model with debug_level=4 and tensor_bits=8 and save the traces written under /tmp/tidl_trace...._float.bin. Then recompile and run the same with tensor_bits=32, and similarly save the traces. We can compare between 8 and 32 in the same way Adam did above.
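Since both runs write to the same /tmp location, a small helper to stash each run's traces into its own folder can be handy (the directory names and the comparison-script name below are placeholders):

import glob
import os
import shutil

def stash_traces(dest_dir):
    """Copy the per-layer float traces produced by a debug_level=4 run into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    for f in glob.glob("/tmp/tidl_trace*_float.bin"):
        shutil.copy(f, dest_dir)

# After the tensor_bits=8 run:   stash_traces("./traces_8bit")
# After the tensor_bits=32 run:  stash_traces("./traces_32bit")
# Then compare, e.g.: python3 compare_traces.py ./traces_8bit ./traces_32bit -s ./trace_pngs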
I have attached my script that includes these visualization functions:
Use the functions from the script above for the accuracy comparison.
As for reducing the number of subgraphs -- several layers you are using are not supported. 10.1 adds support for Abs and Pow layers, but your model still results in 5 subgraphs due to the ReduceSum, Div (with both inputs variable), and Max layers.
BR,
Reese
Hi
I have tried your new model on 10.1 in PC mode with 32-bit TIDL-RT and 8-bit TIDL-RT, and I found that accuracy is gradually lost after every convolution layer. Have you tried running the model in PC mode with tensor_bits 8 to see if the output is correct?
Regards,
Adam
I have a question. I don't fully understand how TI's tensor_bits works. My ONNX model is 32-bit; when I convert it directly using the TIDL tool, will it automatically convert it to bits=8, will it remain at bits=32 by default, or is it a mixed configuration?
I have tried your new model on 10.1 in PC mode with 32-bit TIDL-RT and 8-bit TIDL-RT, and I found that accuracy is gradually lost after every convolution layer. Have you tried running the model in PC mode with tensor_bits 8 to see if the output is correct?
Should I convert the ONNX model from float32 to int8 first and then compile it?
I noticed that there are extra config.yaml files in ModelMaker's output artifacts folder.
About 10.01: removing ReduceSum and Max also failed. c666-nomax.zip
Changing the variable Div to Pow(-1) + Mul (with the Concat) gives just one subgraph, but it also ends in failure.
Hello,
I have a question. I don't fully understand how TI's tensor_bits works. My ONNX model is 32-bit; when I convert it directly using the TIDL tool, will it automatically convert it to bits=8, will it remain at bits=32 by default, or is it a mixed configuration?
I understand; let me clarify tensor_bits usage.
The tensor_bits parameter can be 8, 16, or 32.
Values 8 and 16 mean that the model will run in fixed-point mode, i.e. it is quantized. You do not need to pre-quantize your model for this (although that is an option). TIDL will use post-training quantization (PTQ) on a set of calibration images. This allows TIDL to find a quantization for each layer, such that the 32-bit floating-point values can be represented as int8 (or int16) instead.
A value of tensor_bits=32 is different. This is considered a reference mode to check model functionality. The model will run in floating point -- it therefore skips calibration, meaning no float-->fixed-point conversion. This mode is intended to be used on PC.
________________________
About 10.01: removing ReduceSum and Max also failed. c666-nomax.zip
It is clear to me that accuracy is poor right now. Your source code looks okay in the screenshots. Let us figure out why the accuracy is poor.
For now, please continue testing with the c666-nomax model, and ignore the variable Div --> Pow(-1) + Mul change. Let us focus on diagnosing the accuracy issue with the no-max model, as this trick with Div could be further influencing accuracy. We can look at Div afterwards.
Could you recompile your model with tensor_bits=32 and share the output of the same test on your c666-nomax model? I want to compare CPU-based inference (which is good) against NPU-based inference running with TIDL.
If this tensorbits=32 reference mode with TIDL is good, then this issue is due to quantization. If the issue is still present with tensorbits=32, this may indicate a bug in the TIDL SW.
BR,
Reese
If this tensorbits=32 reference mode with TIDL is good, then this issue is due to quantization. If the issue is still present with tensorbits=32, this may indicate a bug in the TIDL SW.
Do you mean: in the edgeai-benchmark folder, run ./run_custom_pc.sh AM62A and set tensor_bits = 8/16/32 in settings_base.yaml?
However, in this case, benchmark and TIDL would be two completely unrelated concepts.
This is a keypoint model, but task_type only offers cls/seg/det, and we don't need postprocessing. How should we set the config?
Is there any other user-friendly way to convert an ONNX model into the format required by EdgeAI (similar to eIQ/RKNN/SNPE)?
Hello,
In TIDL, when I choose different amounts of data for compilation, the NPU results also differ. When I then go to edgeai-benchmark and run ./run_custom_pc.sh AM62A,
the .sh file uses settings_base.yaml, and if I set task_selection: keypoint_detection it reports an error: Traceback (most recent call last):
File "/home/zxb/Desktop/ti/edgeai-tensorlab/edgeai-benchmark/edgeai_benchmark/datasets/__init__.py", line 188, in get_datasets
dataset_cache[DATASET_CATEGORY_IMAGENET]['calibration_dataset'] = ImageNetDataSetType(**imagenet_cls_calib_cfg, download=download)
TypeError: 'NoneType' object is not subscriptable
The current model is a keypoint model. Although the model zoo has human-pose models, there are no examples for this in either the TIDL configs or the benchmark configs, and compiling it directly as classification is also problematic. It is not the case, as mentioned at the beginning, that a custom model compiled as classification lets the NPU get results close to the CPU.
We have never fully understood this part. We have already communicated our requirements and provided the model, but we have not received the answer we were looking for.
Hi Xiaojun,
As we suggested, your model may not get the expected output after quantization because you need to add BatchNorm after each Conv.
Or, you can create a new model with TI model-maker to do keypoint detection, which will accelerate your model deployment.
Regards,
Adam
Hi
Also, I found that your model's output is similar to mmpose face keypoint detection.
If you are using mmpose based model, you can create a new model with https://github.com/TexasInstruments/edgeai-tensorlab/tree/main/edgeai-mmpose
Regards,
Adam
Dear Adam,
edgeai-tensorlab/edgeai-mmpose/README.md
This repository is an extension of the popular mmpose open source repository for keypoint detection training. In edge-mmpose, we focus on yolox based keypoint detection models that are optimized for speed and accuracy so that they run efficiently on embedded devices. For this purpose, we have added a set of embedded friendly model configurations and scripts.
This seems to indicate that TI has not conducted research on other configurations of mmpose, and I believe this suggestion is not entirely right. Our model has 331 keypoints, but for the preliminary ONNX verification we used the 98-point base model of RTMPose for validation and have made the artifacts file publicly available.
By the way, we conducted some verification on the regnet.onnx (classification) within TI's Model Zoo. Similarly, the results from the CPU and NPU are not consistent. It is unclear whether this indicates that the issue does not lie with the Batch Normalization (BN) layer.
'kd-7060':utils.dict_update(common_cfg,
preprocess=preproc_transforms.get_transform_onnx(640, 640, reverse_channels=True, resize_with_pad=[True, "corner"], backend='cv2', pad_color=[114,114,114]),
session=onnx_session_type(**sessions.get_common_session_cfg(settings, work_dir=work_dir, input_optimization=False),
runtime_options=settings.runtime_options_onnx_p2(
det_options=True, ext_options={'object_detection:meta_arch_type': 6,
'object_detection:meta_layers_names_list': f'{settings.models_path}/vision/keypoint/coco/edgeai-yolox/yolox_s_pose_ti_lite_640_20220301_model.prototxt',
'advanced_options:output_feature_16bit_names_list': '/0/backbone/backbone/stem/stem.0/act/Relu_output_0, /0/head/cls_preds.0/Conv_output_0, /0/head/reg_preds.0/Conv_output_0, /0/head/obj_preds.0/Conv_output_0, /0/head/kpts_preds.0/Conv_output_0, /0/head/cls_preds.1/Conv_output_0, /0/head/reg_preds.1/Conv_output_0, /0/head/obj_preds.1/Conv_output_0, /0/head/kpts_preds.1/Conv_output_0, /0/head/cls_preds.2/Conv_output_0, /0/head/reg_preds.2/Conv_output_0, /0/head/obj_preds.2/Conv_output_0, /0/head/kpts_preds.2/Conv_output_0'},
fast_calibration=True),
model_path=f'{settings.models_path}/vision/keypoint/coco/edgeai-yolox/yolox_s_pose_ti_lite_640_20220301_model.onnx'),
postprocess=postproc_transforms.get_transform_detection_yolov5_pose_onnx(squeeze_axis=None, normalized_detections=False, resize_with_pad=True, formatter=postprocess.DetectionBoxSL2BoxLS(), keypoint=True),
metric=dict(label_offset_pred=1), #TODO: add this for other models as well?
model_info=dict(metric_reference={'accuracy_ap[.5:.95]%':49.6, 'accuracy_ap50%':78.0}, model_shortlist=10, compact_name='human-pose-yolox-s-640x640', shortlisted=True, recommended=True)
),
It may be necessary for us to rewrite TIDL/postprocess/humanpose in order to resolve this issue.
Hi Xiaojun
By the way, we conducted some verification on the regnet.onnx (classification) within TI's Model Zoo.
Which specific model are you referring to?
I tried this model with edgeai-tidl-tools and found no error between CPU and NPU:
I suggested adding BatchNorm after each Conv because I found accuracy loss after every Conv+ReLU layer even in PC mode. I compared PC-mode 8-bit results against 32-bit results and found great accuracy loss. So I think the problem is not a difference between PC and NPU, but accuracy loss due to quantization.
Adding BatchNorm is still worth trying. If retraining takes a long time, you can use an untrained model just to verify the results between ONNX mode and TIDL 8-bit mode.
Regards,
Adam
Alright, we'll give it a shot later this week.
Adding BatchNorm is still worth trying. If retraining takes a long time, you can use an untrained model just to verify the results between ONNX mode and TIDL 8-bit mode.