
SK-AM62A-LP: Compile ONNX: Split and Add nodes can't pass.

Part Number: SK-AM62A-LP


Non-zero status code returned while running TIDL_5 node. Name:'TIDLExecutionProvider_TIDL_5_5' Status Message: TIDL Compute Import Failed. But removing this Split lets compilation pass.

split_node = onnx.helper.make_node(
    'Split',
    inputs=[node.input[0]],
    outputs=[f'{node.output[0]}_0', f'{node.output[0]}_1', f'{node.output[0]}_2'],
    axis=2,
    split=[1, 1, 1],  # three size-1 outputs along axis 2
)
model.graph.node.append(split_node)
The results are the same as the model before the Slice node was replaced.
 

  • Output 1 can be postprocessed on the CPU, but the other output is more complex.

    Slice with 2D:

    [Quantization & Calibration for subgraph_1 Started]

    2024-12-27 16:48:11.407701156 [E:onnxruntime:, sequential_executor.cc:494 ExecuteKernel] Non-zero status code returned while running Squeeze node. Name:'model/tf.__operators__.getitem_16/strided_slice__354' Status Message: /root/onnxruntime/onnxruntime/core/providers/cpu/tensor/squeeze.h:52 static onnxruntime::TensorShapeVector onnxruntime::SqueezeBase::ComputeOutputShape(const onnxruntime::TensorShape&, const TensorShapeVector&) input_shape[i] == 1 was false. Dimension of input 1 must be 1 instead of 0. shape={1,0,1,1,1,3}

    Slice with 3D:

    ==================== [Optimization for subgraph_6 Started] ====================

    [TIDL Import] [PARSER] UNSUPPORTED: All the input tensor dimensions has to be greater then zero. For tensor model/tf.__operators__.getitem_23/strided_slice3, id 0 - Dim 2 is 0 -- [tidl_import_common_model_check.cpp, 2290]
    [TIDL Import] ERROR: Invalid input tensor dimension, aborting -- [tidl_import_core.cpp, 2556]
    [TIDL Import] ERROR: Network Optimization failed - Failed in function: TIDL_runtimesOptimizeNet -- [tidl_runtimes_import_common.cpp, 1268]
    [TIDL Import] [PARSER] ERROR: - Failed in function: TIDL_computeImportFunc -- [tidl_onnxRtImport_EP.cpp, 1713]
    2024-12-27 16:40:48.794281915 [E:onnxruntime:, sequential_executor.cc:494 ExecuteKernel] Non-zero status code returned while running TIDL_6 node. Name:'TIDLExecutionProvider_TIDL_6_6' Status Message: TIDL Compute Import Failed.

    Replacing Slice with Split also failed, for both output 1 and output 2.

  • Also, when I compile an ONNX model with 4 outputs, artifacts/param.yaml only contains one output node. When I concat all results into one tensor of shape 1*71*3 or 1*1*71*3, the log reports:

    RUNTIME_EXCEPTION : Non-zero status code returned while running TIDL_3 node. Name:'TIDLExecutionProvider_TIDL_3_3' Status Message: /root/onnxruntime/onnxruntime/core/providers/tidl/tidl_execution_provider.cc:430 void onnxruntime::populateOnnxRtInputParams(Ort::CustomOpApi, OrtKernelContext*, onnxruntime::tidl_ops*, OnnxTIDLSubGraphParams*) (TIDL_MAX_DIM-inputNumDims) >= 0 was false. TIDL_EP: Only tensors up to 6D

  • Hello,

    You have several topics here, and I'll try to address them. In your follow-up reply, please help me by noting which topics you have found a solution for. It seems that you have found resolution for some of them, but it is not clear to me. 

    My understanding is that overall, you are facing challenges with Slice and Split operators in TIDL.

    • What would help most is to pass along your model artifacts (or at least the SVG files under tempDir), compilation logs (debug_level=1 for verbosity), and the model file.

    ONNX complaining about tensor shapes  

    onnxruntime::SqueezeBase::ComputeOutputShape(const onnxruntime::TensorShape&, const TensorShapeVector&) input_shape[i] == 1 was false. Dimension of input 1 must be 1 instead of 0. shape={1,0,1,1,1,3}

    ONNX will sometimes complain about TIDL's 6D tensor representation, and this is often fixed by defining TIDL_RT_ONNX_VARDIM=1 in the calling environment (see the sketch after this note).

    • This will not be needed for TIDL tools version 10.1 and beyond
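
    For reference, a minimal sketch of setting this variable from Python, assuming it must be in the process environment before the TIDL execution provider loads:

        import os

        # Assumption: TIDL_RT_ONNX_VARDIM is read when the TIDL execution
        # provider initializes, so set it before creating the session.
        os.environ['TIDL_RT_ONNX_VARDIM'] = '1'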

    Slice and split on one of the output heads:

    Replacing Slice with Split also failed, for both output 1 and output 2.

    Allow me to be honest here -- I think this would be better off running with ONNXRT's Arm-based execution if TIDL is having a hard time with it. There is so little computation here that acceleration is not worthwhile; the overhead is probably high enough that C7x acceleration does not give much benefit.

    Otherwise, it seems like TIDL is not allowing you to use slice or split nodes as you like. Assuming those adhere to the supported_operators page, please share model+compilation logs for the Slice or Split configurations you are trying to use. The single line error message is not sufficient to suggest a fix.

    Multiple Outputs not being recognized

    It would help to see the model file and the SVGs for the network. 

    The param.yaml files are mainly useful for well-defined model types, like object detection (1 output for boxes, 1 for classes), segmentation (image mask), etc. For a custom model type like yours, I do not expect the param.yaml to know how to encode the postprocessing information for these. It likely generated the outputs based on the 'model_type' within your model_configs.py for this model.

    For the two images below (outputs), is one from CPU and one from TIDL? Please provide context or labels for the images.

    Are you getting 'nan' for the TIDL version of the network?

    RUNTIME_EXCEPTION : Non-zero status code returned while running TIDL_3 node. Name:'TIDLExecutionProvider_TIDL_3_3' Status Message: /root/onnxruntime/onnxruntime/core/providers/tidl/tidl_execution_provider.cc:430 void onnxruntime::populateOnnxRtInputParams(Ort::CustomOpApi, OrtKernelContext*, onnxruntime::tidl_ops*, OnnxTIDLSubGraphParams*) (TIDL_MAX_DIM-inputNumDims) >= 0 was false. TIDL_EP: Only tensors up to 6D

    I will need to see the model file and artifacts to comment here. It looks like TIDL found >6 dimensions in one of your tensors.

    For models whose output does not match one of the tasks we outright enable (includes classification, object detection, segmentation, keypoint detection), you will need to include your own postprocessing code. 

    Overall suggestions:

    • If the sizes of these tensors are really 1x3 and 1x71 and similar, I recommend running those layers on CPU with ONNX (see the deny-list sketch after this list).
    • Apply the env variable TIDL_RT_ONNX_VARDIM=1 in your Linux environment.
    • For testing your model with multiple outputs, check there is nothing task-specific (e.g. classification) affecting the values you receive from TIDL.
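
    As an illustration of the first bullet, a hedged sketch of forcing op types onto the Arm/CPU side via the TIDL deny list (the 'deny_list' option name follows the edgeai-tidl-tools documentation; the op choices here are placeholders):

        # Added to the options dict passed to TIDLCompilationProvider.
        compile_options = {
            'deny_list': 'Slice, Split',  # run these ONNX op types on CPU
        }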

    BR,
    Reese

  • If TIDL_RT_ONNX_VARDIM=1 is set, it can't even get past the front part of the model. The input shape is 1*1024*1*1, so why is 1*1*1*1 detected? The ONNX model runs correctly on CPU. Also, if I cut these ops out, whether GEMM or MatMul+Add, everything ends in: GEMM: Dimension mismatch, W: {1024,62} K: 1 N:62.

  • ttt.zip: https://drive.google.com/file/d/1taPxxn-Von3IIDWJZkCJEpDg_WqQ1dlG/view?usp=sharing

    Here is the mini model. When TIDL_RT_ONNX_VARDIM=0, the last Add node causes an error and must be replaced with Neg+Sub: onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running TIDL_1 node. Name:'TIDLExecutionProvider_TIDL_1_1' Status Message: TIDL Compute Import Failed.

    If TIDL_RT_ONNX_VARDIM=1 is set, the error becomes the one described in my previous reply.

    I have tried more than a hundred combinations in TIDL, but all failed.

    All of the ONNX models above produce correct results when run on CPU.

    I do not expect the param.yaml to know how to encode the postprocessing information for these. It likely generated the outputs based on the 'model_type' within your model_configs.py for this model. 

    For the model with 2 outputs that passes compilation, param.yaml only lists 1 output, and it is incorrect. Evidently the output whose results differ is running on CPU, while the output whose results stay the same, and which even contains "nan", is from TIDL.

    It likely generated the outputs based on the 'model_type' within your model_configs.py

    The result is the same whether model_type is None or classification.

    We will do the postprocessing ourselves, so is there any other way to get correct artifacts?

    Also, the output model from the optimizer has the same problem:

    If TIDL_RT_ONNX_VARDIM=1 is set, it can't even get past the front part of the model. The input shape is 1*1024*1*1, so why is 1*1*1*1 detected? The ONNX model runs correctly on CPU. Also, if I cut these ops out, whether GEMM or MatMul+Add, everything ends in: GEMM: Dimension mismatch, W: {1024,62} K: 1 N:62.

    from tidl_onnx_model_optimizer import optimize

    # Note: simplify_mode likely expects the string 'all'; passing the
    # built-in all() is almost certainly unintended.
    optimize("/ti/modified_modified_22.onnx", "./test.onnx", simplify_mode="all")

  • Hello,

    This is a frustrating situation, I understand. That is a lot of configurations to try without a clear solution. I am trying compilation on my side for the ttt.zip model you provided. Thank you.

    • I have requested access to the google drive link. 

    We are releasing the 10.1 SDK very shortly (for some SoCs it is already released), and edgeai-tidl-tools can be updated to use these tools with the 10_01_00_02 tag (rerun the setup script).

    I can replicate your error on 10.0 SDK, but I think we have patched part of this in 10.1

    The input tensor cannot be reshaped to the requested shape. Input shape:{1,1,1,1}, requested shape:{-1,1024}

    My further comments are based on this 10_01_00_02 tag for edgeai-tidl-tools.

    I've tried compilation on my side and found a working configuration for the ttt.onnx model you sent, and I can clearly see that there is a bug in the model import tool (part of the optimization process, which is NOT the same as tidl_onnx_model_optimizer). This seems to be related to the GEMM node.

    Would you try compiling with these additional options passed to the TIDLCompilationProvider?:

    • 'max_num_subgraphs': 1,

    My intent here was to cut the last set of layers in the network. This max_num_subgraphs=1 config makes one TIDL graph with the Squeeze->Unsqueeze->Reshape at the end, and the final layers with GEMM, Add, Slice run on Arm/CPU. This way, the model compiles and runs on PC with C7x emulation.
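
    For reference, a minimal sketch of how these options are typically passed when compiling, following the pattern in the edgeai-tidl-tools osrt_python examples (paths are placeholders):

        import onnxruntime as rt

        compile_options = {
            'tidl_tools_path': '/path/to/tidl_tools',     # placeholder path
            'artifacts_folder': './model-artifacts/ttt',  # placeholder path
            'max_num_subgraphs': 1,                       # keep the tail layers on Arm/CPU
        }
        sess = rt.InferenceSession(
            'ttt.onnx',
            providers=['TIDLCompilationProvider', 'CPUExecutionProvider'],
            provider_options=[compile_options, {}],
            sess_options=rt.SessionOptions(),
        )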

    I noted that allowing a second subgraph to form in that bottom portion generated additional errors. This only included a GEMM node, and TIDL seems to complain that it does not have a bias tensor. I believe the MatMul -> Add needs to be replaced with GEMM. From your images, I think you have done this in some of your models.

    The Slice layers are trying to slice on 2 axes at the same time. This is not supported.

    Note: To run on target, we need 10.1 SDK for AM62A to release. This should happen within the next week or so, once New Years holiday has passed

    Summary:

    • Please update to 10_01_00_02 edgeai-tidl-tools. This should fix some of the ONNX-related issues
    • Ensure your MatMul+Add pairs are replaced with GEMM, or try deny-listing those layers
      • I see a bug that occurs during optimization, probably from fusing MatMul with Add to create GEMM. I will log this with our dev team.
    • Slice layers should only slice on a single axis at a time
  • Hi,

    I just ran git clone github.com/.../edgeai-tidl-tools.git -b 10_01_00_02 and source ./setup.sh. It ends with the same problems.

    Would you try compiling with these additional options passed to the TIDLCompilationProvider?:

    • 'max_num_subgraphs': 1,

    My intent here was to cut the last set of layers in the network. This max_num_subgraphs=1 config makes one TIDL graph with the Squeeze->Unsqueeze->Reshape at the end, and the final layers with GEMM, Add, Slice run on Arm/CPU. This way, the model compiles and runs on PC with C7x emulation.

    The model downloaded from Google Drive is only the part of the full model that causes these errors, so cutting it out does not seem like a good option.

    Slice2D can be changed to Slice1D via the method below; this is not difficult. But the main problems are: 1. on 10.01, the Reshape op doesn't work; 2. on 10.00.08, some other ops don't work.

    from tidl_onnx_model_optimizer import optimize

    # Note: simplify_mode likely expects the string 'all', not the built-in all().
    optimize("/ti/1111-op12.onnx", "./test1111-op12.onnx", simplify_mode="all")

  • Hello,

    I see you have been busy and tried many options. Let me respond to these. I understand now why my suggestion to deny-list and reduce the number of subgraphs will not work -- you have passed me only a portion of the model.

    I see one of your notes was to replace 2D slice with 1D slice. 

    Slice2D can be changed to Slice1D via the method below; this is not difficult.

    I shall assume the Slice layers are now operating fine, but please inform me if I misunderstand. I see from your screenshot that these have been unpacked from a 3-axis Slice into a 1-axis Slice, replicated 3 times for the different axes.

    1. on 10.01, the Reshape op doesn't work; 2. on 10.00.08, some other ops don't work.

    1) Reshape -- I see the error from ONNX saying the input shape is (1,1,1,1) and the requested output is (-1, 1024). This is for the layer shown in part of your screenshot.

    • For this Squeeze -> unsqueeze -> unsqueeze -> reshape -> MatMul, why can it not use Squeeze -> MatMul? 
    • Otherwise, I'm not sure why the Reshape node is not accepted by TIDL. Is there a warning / error for this Reshape? 
      • Can you provide me the SVGs in the artifacts/tempDir directory here? I would like to see how TIDL parsed tensor shapes and which layers were marked for acceleration by TIDL
    • From my previous comment, I thought this was patched in 10.1. You are saying it is not. Can you please confirm if this error is with 10.0 or 10.1 tools?

    2) Other op in 10.0.0.8

    Is this for the GEMM / MatMul->Add? I have noted this as failing during the optimization step of TIDL import (this happens internally; it is not the same as the optimizer Python scripts).

    • In this scenario, you can also try replacing MatMul -> Add with a GEMM layer (sketch below). Your configuration looks to match the supported_ops page.
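
    For illustration, a minimal sketch of fusing a MatMul -> Add pair into a Gemm with the onnx Python API. The node names 'matmul_0'/'add_0' are hypothetical placeholders, and it assumes the Add's second input is the bias initializer:

        import onnx
        from onnx import helper

        model = onnx.load('model.onnx')
        graph = model.graph
        matmul = next(n for n in graph.node if n.name == 'matmul_0')  # placeholder name
        add = next(n for n in graph.node if n.name == 'add_0')        # placeholder name

        gemm = helper.make_node(
            'Gemm',
            inputs=[matmul.input[0], matmul.input[1], add.input[1]],  # A, B (weights), C (bias)
            outputs=[add.output[0]],
            name='gemm_fused_0',
        )
        # Insert at the MatMul's position to preserve topological order.
        idx = list(graph.node).index(matmul)
        graph.node.remove(add)
        graph.node.remove(matmul)
        graph.node.insert(idx, gemm)
        onnx.save(model, 'model_gemm.onnx')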

    This topic is known to the development team and will be addressed in the near future. I am tracking progress on this with our team. I appreciate your patience here.

    Please let me know if there is another layer not functioning here. I will reproduce the issue if you pass me such a model, and provide it to our dev team to fix.


    BR,
    Reese

  • If you have tried to use onnx-modifier to modify a model, you will run into lots of these problems. But all of these ONNX models can be inferred correctly.

    In TIDL 10.01:

    If I change Squeeze -> unsqueeze -> unsqueeze -> reshape -> MatMul to Squeeze -> MatMul, it sometimes comes back with an error like: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running TIDL_1 node. Name:'TIDLExecutionProvider_TIDL_1_1' Status Message: TIDL Compute Import Failed.

    But in TIDL 10.00.08 this happens 100% of the time.

    TIDL_RT_ONNX_VARDIM=0:  RUNTIME_EXCEPTION : Non-zero status code returned while running Gemm node. Name:'gemm' Status Message: gemm_helper.h:14 onnxruntime::GemmHelper::GemmHelper(const onnxruntime::TensorShape&, bool, const onnxruntime::TensorShape&, bool, const onnxruntime::TensorShape&) left.NumDimensions() == 2 || left.NumDimensions() == 1 was false.

    TIDL_RT_ONNX_VARDIM=1:INVALID_ARGUMENT : Non-zero status code returned while running Gemm node. Name:'gemm' Status Message: GEMM: Dimension mismatch, W: {1024,62} K: 1 N:62

    For this Squeeze -> unsqueeze -> unsqueeze -> reshape -> MatMul, why can it not use Squeeze -> MatMul?
    MatMul -> Add with GEMM

    Both GEMM and MatMul+Add failed, even when using onnxslim to simplify.

    Here is this model's tempDir: 7776.tempDir.zip

  • Thank you for providing the TempDir. I see that there are multiple subgraphs that have not finished parsing and optimizing. As a result, 3 of the 4 subgraphs lack SVG graphs. Please share the ONNX model as well if you are comfortable -- random weights are ok.

    If you have tried to use onnx-modifier to modify a model, you will run into lots of these problems.

    You mean this graphical tool for changing models, yes? https://github.com/ZhangGe6/onnx-modifier

    I too have had challenges with the onnx-modifier tool, especially for layers with many default attributes or more complex initializers. onnx-modifier is a good tool, but modifying directly with onnx or onnx-graphsurgeon is more stable.

    TIDL_RT_ONNX_VARDIM=1:INVALID_ARGUMENT : Non-zero status code returned while running Gemm node. Name:'gemm' Status Message: GEMM: Dimension mismatch, W: {1024,62} K: 1 N:62

    For the GEMM nodes, can you try representing the dimensions of tensor 'C' as {1,62} instead of {62}? From what I see, TIDL is rejecting your GEMM (or equivalent MatMul->Add) from running with acceleration, but then ONNX also hits an error when the tensor returned by TIDL reaches the CPU ONNX runtime and is matched against the constants C and W. If this is the problem, I will see that this gets fixed for a future release.
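
    A hedged sketch of giving the bias initializer an explicit 2D shape with the onnx Python API (the initializer name 'gemm_bias' is a placeholder for your model's 'C' tensor):

        import onnx
        from onnx import numpy_helper

        model = onnx.load('model.onnx')
        for init in model.graph.initializer:
            if init.name == 'gemm_bias':  # placeholder name
                c = numpy_helper.to_array(init).reshape(1, -1)  # {62} -> {1,62}
                init.CopyFrom(numpy_helper.from_array(c, init.name))
        onnx.save(model, 'model_c2d.onnx')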

    And I think using TIDL_RT_ONNX_VARDIM=0 will cause issues overall with layers running on Arm. Let's use =1 here.

    HOWEVER, the first subgraph SVG shows these shapes as the output. This suggests that TIDL is passing back a tensor of shape {1024,1,1} or even {1,1}, which ONNX probably does not like. This is probably why ONNX reads 'K' as 1 instead of 1024. I think we are best suited trying to resolve this GEMM/MatMul issue, since this is ultimately what we need.

    BR,
    Reese

  • Hello,

    We decided to split face keypoint and head pose into different models. Could your team help compile this ONNX model, which is based on mmpose RTMPose?

    Results here: CPU seems to get similar results, but the NPU is disappointing. Is there anything wrong with my model_configs?

  • Hello,

    Different model architecture now, I understand. We will investigate this. Your model config looks reasonable.

    I see in your artifacts that there are multiple subgraphs now, and the artifacts look complete. However, you have noted accuracy is poor from your spreadsheet.

    For accuracy, can you try compiling for tensor_bits = 16 and 32 and check the output in each case? This should tell us if the issue is quantization-related.

    • If so, we can use debug_level=4 to collect layer-level traces and see where fixed-point acceleration diverges from floating point. 

    Are you using TIDL tools for 10.1 or 10.0 SDK?

    There are many subgraphs here as well. We should be able to reduce this. Some fixes are part of the tidl_onnx_model_optimizer already, although we may have to remove a few default rules that were problematic on a previous version of your model. At the least, we can use rules like the following (a sketch follows after the list):

    • convert_large_global_avg_pooling_to_matmul 
    • convert_reducemean_to_matmul
    • convert_maxpool_to_cascaded_maxpool
    • convert_unsqueeze_to_reshape

    I expect these to remove at least 4 of the 8 subgraphs you have now. Clip->Div is the only part of your model that I don't have an immediate solution for.
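
    For illustration, a hedged sketch of invoking these rules with tidl_onnx_model_optimizer. The 'custom_optimizers' kwarg is an assumption -- the rule-selection mechanism varies by version, so check the optimize() signature in your checkout:

        from tidl_onnx_model_optimizer import optimize

        rules = {
            'convert_large_global_avg_pooling_to_matmul': True,
            'convert_reducemean_to_matmul': True,
            'convert_maxpool_to_cascaded_maxpool': True,
            'convert_unsqueeze_to_reshape': True,
        }
        # Assumption: rule flags can be passed through like this.
        optimize('/ti/model.onnx', './model_opt.onnx', custom_optimizers=rules)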

    BR,
    Reese

  • Hello,

    Conclusion: after postprocessing the 2*1*106*512 results, all 106 points are the same, regardless of whether all GlobalAveragePool layers are replaced.

    I replaced all GlobalAveragePool with Reshape+MatMul+Reshape, MaxPool with cascaded MaxPool, and Unsqueeze with Reshape, then retried all the NPU artifacts based on this model (just adding postprocessing in code to take the max over the 512 dimension: 2*1*106*512 ---> 2*1*106*1). There are now 4 subgraphs. This ONNX model gets correct results when run on CPU.

    • convert_large_global_avg_pooling_to_matmul 
    • convert_reducemean_to_matmul
    • convert_maxpool_to_cascaded_maxpool
    • convert_unsqueeze_to_reshape

    We decided to split face keypoint and head pose into different models. Could your team help compile this ONNX model, which is based on mmpose RTMPose?

    c666.zip

    How do I do this? In edgeai-tidl-tools/examples/osrt_python/ort/onnxrt_ep.py I can't see anything about calibration epochs or tensor bits (a sketch of the relevant options follows below).

    For accuracy, can you try compiling for tensor_bits = 16 and 32 and check the output in each case? This should tell us if the issue is quantization-related.
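
    For reference, these knobs are typically set in the shared compile options rather than in onnxrt_ep.py itself. A hedged sketch: the option names follow the edgeai-tidl-tools OSRT docs, and the exact file (e.g. examples/osrt_python/common_utils.py) may differ by version:

        compile_options = {
            'tensor_bits': 16,                            # 8, 16, or 32
            'advanced_options:calibration_frames': 20,    # number of calibration images
            'advanced_options:calibration_iterations': 5,
        }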
  • Hi Xiaojun

    I am working on comparing the layer traces. But now I have a problem: setting debug_level to 4 requires additional memory and generates the following error:

    root@am62axx-evm:/opt/edgeai/edgeai-tidl-tools/examples/osrt_python/ort# python3 onnxrt_ep_no_post.py -m tianma_model                   
    Available execution providers :  ['TIDLExecutionProvider', 'TIDLCompilationProvider', 'CPUExecutionProvider']
    
    Running 1 Models - ['tianma_model']
    
    
    Running_Model :  tianma_model  
    
    libtidl_onnxrt_EP loaded 0x2de888b0 
    artifacts_folder                                = ../../../model-artifacts//tianma_model/artifacts 
    debug_level                                     = 4 
    target_priority                                 = 0 
    max_pre_empt_delay                              = 340282346638528859811704183484516925440.000000 
    Final number of subgraphs created are : 5, - Offloaded Nodes - 186, Total Nodes - 198 
    In TIDL_createStateInfer 
    Compute on node : TIDLExecutionProvider_TIDL_0_0
    ************ in TIDL_subgraphRtCreate ************ 
     APP: Init ... !!!
       763.027503 s: MEM: Init ... !!!
       763.027575 s: MEM: Initialized DMA HEAP (fd=5) !!!
       763.027733 s: MEM: Init ... Done !!!
       763.027759 s: IPC: Init ... !!!
       763.044549 s: IPC: Init ... Done !!!
    REMOTE_SERVICE: Init ... !!!
    REMOTE_SERVICE: Init ... Done !!!
       763.048551 s: GTC Frequency = 200 MHz
    APP: Init ... Done !!!
       763.048684 s:  VX_ZONE_INIT:Enabled
       763.048699 s:  VX_ZONE_ERROR:Enabled
       763.048708 s:  VX_ZONE_WARNING:Enabled
       763.049615 s:  VX_ZONE_INIT:[tivxPlatformCreateTargetId:124] Added target MPU-0 
       763.049949 s:  VX_ZONE_INIT:[tivxPlatformCreateTargetId:124] Added target MPU-1 
       763.050212 s:  VX_ZONE_INIT:[tivxPlatformCreateTargetId:124] Added target MPU-2 
       763.050449 s:  VX_ZONE_INIT:[tivxPlatformCreateTargetId:124] Added target MPU-3 
       763.050481 s:  VX_ZONE_INIT:[tivxInitLocal:136] Initialization Done !!!
       763.050713 s:  VX_ZONE_INIT:[tivxHostInitLocal:106] Initialization Done for HOST !!!
    ************ TIDL_subgraphRtCreate done ************ 
     In TIDL_createStateInfer 
    Compute on node : TIDLExecutionProvider_TIDL_1_1
    ************ in TIDL_subgraphRtCreate ************ 
     ************ TIDL_subgraphRtCreate done ************ 
     In TIDL_createStateInfer 
    Compute on node : TIDLExecutionProvider_TIDL_2_2
    ************ in TIDL_subgraphRtCreate ************ 
        763.238992 s: MEM: ERROR: Alloc failed with status = 12 !!!
       763.239040 s:  VX_ZONE_ERROR:[tivxMemBufferAlloc:90] Shared mem ptr allocation failed
       763.239052 s:  VX_ZONE_ERROR:[ownAllocReferenceBufferGeneric:340] Memory allocation failed
       763.239063 s:  VX_ZONE_ERROR:[ownGraphAllocateDataObject:1031] Memory allocation for data reference failed
       763.239074 s:  VX_ZONE_ERROR:[vxVerifyGraph:2199] Memory alloc for data objects failed
       763.239085 s:  VX_ZONE_ERROR:[vxVerifyGraph:2311] Graph verify failed
    TIDL_RT_OVX: ERROR: Verifying TIDL graph ... Failed !!!
    TIDL_RT_OVX: ERROR: Verify OpenVX graph failed
    ************ TIDL_subgraphRtCreate done ************ 
     In TIDL_createStateInfer 
    Compute on node : TIDLExecutionProvider_TIDL_3_3
    ************ in TIDL_subgraphRtCreate ************ 
        763.247303 s: MEM: ERROR: Alloc failed with status = 12 !!!
       763.247348 s:  VX_ZONE_ERROR:[tivxMemBufferAlloc:90] Shared mem ptr allocation failed
       763.247360 s:  VX_ZONE_ERROR:[ownAllocReferenceBufferGeneric:340] Memory allocation failed
       763.247371 s:  VX_ZONE_ERROR:[ownGraphAllocateDataObject:1031] Memory allocation for data reference failed
       763.247382 s:  VX_ZONE_ERROR:[vxVerifyGraph:2199] Memory alloc for data objects failed
       763.247393 s:  VX_ZONE_ERROR:[vxVerifyGraph:2311] Graph verify failed
    TIDL_RT_OVX: ERROR: Verifying TIDL graph ... Failed !!!
    TIDL_RT_OVX: ERROR: Verify OpenVX graph failed
    ************ TIDL_subgraphRtCreate done ************ 
     In TIDL_createStateInfer 
    Compute on node : TIDLExecutionProvider_TIDL_4_4
    ************ in TIDL_subgraphRtCreate ************ 
        763.257352 s: MEM: ERROR: Alloc failed with status = 12 !!!
       763.257401 s:  VX_ZONE_ERROR:[tivxMemBufferAlloc:90] Shared mem ptr allocation failed
       763.257413 s:  VX_ZONE_ERROR:[ownAllocReferenceBufferGeneric:340] Memory allocation failed
       763.257424 s:  VX_ZONE_ERROR:[ownGraphAllocateDataObject:1031] Memory allocation for data reference failed
       763.257435 s:  VX_ZONE_ERROR:[vxVerifyGraph:2199] Memory alloc for data objects failed
       763.257445 s:  VX_ZONE_ERROR:[vxVerifyGraph:2311] Graph verify failed
    TIDL_RT_OVX: ERROR: Verifying TIDL graph ... Failed !!!
    TIDL_RT_OVX: ERROR: Verify OpenVX graph failed
    ************ TIDL_subgraphRtCreate done ************ 
     *******   In TIDL_subgraphRtInvoke  ******** 
    [C7x_1 ]    763.330949 s:    0         1.00000        17.00000       255.00000 6
    [C7x_1 ]    763.448313 s:    5         0.50000        12.00000       254.00000 1
    [C7x_1 ]    763.709967 s:    6         0.04539     -2005.05115       980.49207 1
    [C7x_1 ]    764.048756 s:    7         5.80933         0.00000         0.86068 0
    [C7x_1 ]    764.377386 s:    8         0.13183         0.00000       853.37659 1
    [C7x_1 ]    764.696369 s:    9         0.03640     -2444.87451       947.73218 1
    [C7x_1 ]    765.035074 s:   10         4.65954         0.00000         0.85845 0
    [C7x_1 ]    765.353711 s:   11         0.08481         0.00000       825.37476 1
    [C7x_1 ]    765.672484 s:   12         0.01743     -3959.20020      1864.84058 1
    [C7x_1 ]    766.249094 s:   13         2.23075         0.00000         0.89656 0
    [C7x_1 ]    766.705231 s:   14         0.03888         0.00000      1659.07666 1
    [C7x_1 ]    767.201182 s:   15         0.08849     -1192.18604       672.37036 1
    [C7x_1 ]    767.539965 s:   16        11.32709         0.00000         0.97112 0
    [C7x_1 ]    767.848516 s:   17         0.12530         0.00000       646.46973 1
    [C7x_1 ]    768.146976 s:   18         0.09069     -1025.43579        82.69644 1
    [C7x_1 ]    768.341678 s:   19        11.60872         0.00000         0.94756 0
    [C7x_1 ]    768.536278 s:   20         1.05283         0.00000        83.10921 1
    [C7x_1 ]    768.720892 s:   21         0.99571      -125.53882        46.19829 1
    [C7x_1 ]    768.935700 s:   22       127.45061         0.00000         0.99646 0
    [C7x_1 ]    769.130362 s:   23         1.98287         0.00000        46.39743 1
    [C7x_1 ]    769.315001 s:   24         0.35777      -357.77322       120.18944 1
    [C7x_1 ]    769.509613 s:   25        45.79437         0.00000         0.98265 0
    [C7x_1 ]    769.714310 s:   26         1.02399         0.00000       114.25931 1
    [C7x_1 ]    769.898952 s:   27         0.54613      -221.56052       162.96600 1
    [C7x_1 ]    770.093589 s:   28        69.90415         0.00000         0.98707 0
    [C7x_1 ]    770.298299 s:   29         0.59651         0.00000       160.93677 1
    [C7x_1 ]    770.492997 s:   30         2.50239       -43.35855        25.77536 1
    [C7x_1 ]    770.697802 s:   31       160.15294         0.00000         0.99905 0
    [C7x_1 ]    770.912589 s:   32         3.13098        -0.31939        26.03021 1
    [C7x_1 ]    771.117308 s:   33         1.98287        -0.50432        46.39743 1
    [C7x_1 ]    771.322036 s:   34         2.10566        -0.94982        55.80190 1
    [C7x_1 ]    771.630731 s:   35         2.10566        -0.94982        55.80190 1
    [C7x_1 ]    771.939034 s:   36         2.10566        -0.94982        55.80190 1
    [C7x_1 ]    772.247548 s:    1    524288.00000         0.00000         0.00000 1
    [C7x_1 ]    772.349586 s:   37         4.21133        -0.11873        15.55330 1
    [C7x_1 ]    772.450381 s:   38         4.21133        -0.11873        15.55330 1
    [C7x_1 ]    772.551249 s:   39         4.21133        -0.11873        15.55330 1
    [C7x_1 ]    772.652131 s:   40         3.09413       -40.23747        40.88386 1
    [C7x_1 ]    772.752945 s:   41        99.01219         0.00000         0.99988 1
    [C7x_1 ]    772.853839 s:   42         3.25760        -0.92092        36.68346 1
    [C7x_1 ]    773.152374 s:   43         0.77269      -104.82794        65.35569 1
    [C7x_1 ]    773.491508 s:   44        98.90493         0.00000         0.99085 0
    [C7x_1 ]    773.810090 s:   45         1.19411         0.00000        64.90165 1
    [C7x_1 ]    774.118805 s:   46         0.73371      -165.59564        99.49368 1
    [C7x_1 ]    774.323545 s:   47        93.91551         0.00000         0.99025 0
    [C7x_1 ]    774.518230 s:   48         1.07668         0.00000        97.98685 1
    [C7x_1 ]    774.702852 s:   49         0.49600      -198.58832 

    Before the execution was killed by the error, 173 layers were stored and compared with the PC run. All of those layer traces are the same. I will enlarge the shared memory tomorrow and try again. Changing the memory size and rebuilding is time consuming, so please expect a delayed response.

    Regards,

    Adam

  • Hi Reese,

    Seeking your help here. As my last reply says, a memory problem occurs when I try to dump layer traces, but I am not sure which part of memory it uses. I enlarged edgeai-core-heap-memory but it does not help.

    Regards,

    Adam

  • I added the postprocess ops to the model, but that also failed. I have tried no fewer than 10 combinations, and all failed.

    Is there any other way to solve these problems?

    Here is the artifacts file:

    c666-1.10.zip

  • Hi Adam, 

    Hmm, perhaps that is not the right memory region to increase. I have not run into this MALLOC error during trace dump.

    From running a network with debug_level 2 and 5, I can see the memrec tables differ for entry 9, which is part of the DDR_C7X_1_SCRATCH region (addresses start with 0xB900). I think that region needs to be increased. We may be able to confirm this by looking at the memrec tables.

    If 173 of 198 layers are the same on PC, then the difference must be in some of the last layers.

  • I suggest running in a host-emulation mode at this stage. This is preferred when working on accuracy issues. This will also let us analyze traces without worrying about memory maps and allocation failures. 

    What I see so far is that the 8-bit and 16-bit models do indeed have substantially different output than CPU and 32-bit execution. 16-bit is less severe, but still different. When running the network with tensor_bits=32 through TIDL, the output is the same as disable-offload (run on CPU, no TIDL at all). This tells us that the quantized version of the model has limited accuracy. The debugging steps below will help us understand at which layer the tensor_bits=32 run differs from 8 and 16.

    • https://github.com/TexasInstruments/edgeai-tidl-tools/blob/master/docs/tidl_osr_debug.md#feature-map-comparison-with-reference 
    • I have attached my script that includes these visualization functions: 
      import numpy as np
      import argparse
      import matplotlib
      import matplotlib.pyplot as plt
      import os
      import sys
      import subprocess
      import shutil
      
      def parse_args():
      
          parser = argparse.ArgumentParser()
          parser.add_argument('tracedir_fixed', type=str, default=None)
          parser.add_argument('tracedir_float', type=str, default=None)
          parser.add_argument('-s', '--save_trace_dir', type=str, default=None)
          parser.add_argument('-t','--tensor_bits', type=int, default=8, help='Tensor_bits used for these traces. Hybrid mode not supported yet')
          args = parser.parse_args()
          return args
      
      def save_error_plot(float_data, fixed_data, axes):
          mx = np.max(float_data)
          mn = np.min(float_data)
          org_diff = (fixed_data - float_data)
          combined = np.vstack((float_data, fixed_data, org_diff)).T
          # #np.savetxt("figs\\"+str(i).zfill(4)+"_float.txt", combined, fmt='%10.6f, %10.6f, %10.6f')
          abs_diff = abs(fixed_data - float_data)
          maxIndex = np.argmax(abs_diff)
          max_abs_diff = np.max(abs_diff)
          mean_abs_diff = np.mean(abs_diff)
          var_abs_diff = np.var(abs_diff)
      
          axes.hist(abs_diff, color='blue', edgecolor='black', bins=60)
          # image_txt = "mean = " + str(mean) +", Var = "+ str(var) +", MAx = "+ str(mx)
          image_txt = "Hist; MeanAbsDiff=%7.4f, MaxAbsDiff=%7.4f, MaxVal=%7.3f" % (mean_abs_diff, max_abs_diff, mx)
          #plt.title(image_txt)
          axes.set_title(image_txt, fontdict = {'fontsize' : 8})
          axes.set_xlabel('tensor element values')
          axes.set_ylabel('value frequency')
      
      
      
      def save_pc_ref_plot(float_output, fixed_output, axes):
          axes.set_title("Float output Vs Fixed Output : Plot 1")
          axes.set_xlabel('Float Output (tensor_bits 32 / reference)')
          axes.set_ylabel('Fixed Output (dequantized to fp32)')
          axes.plot(float_output, fixed_output, '.')
      
      def save_pc_ref_plot2(float_output, fixed_output, axes):
          axes.set_title("Float output Vs Fixed Output : Plot 2")
          axes.plot(float_output, "bs", label = "Float")
          axes.plot(fixed_output, "c.", label = "Fixed")
          axes.legend(loc='upper right', frameon=True)
      
      
      fig, axs = plt.subplots(ncols=2)
      plt.subplots_adjust(left=0.075, right=0.95)
      fig.set_figwidth(12)
      def compare_traces(float_tracefile, fixed_tracefile, save_pngs_dir=None):
          float_data = np.fromfile(float_tracefile, dtype=np.float32)
          fixed_data = np.fromfile(fixed_tracefile, dtype=np.float32)
      
          # plt.clf() #clear
          axs[0].clear()
          axs[1].clear()
          
          layer_info = float_tracefile.split('/')[-1].split('_')[3:-1]
          #( trace names will be like tidl_traceAAAA_BBBBB_CCCCC_DDDDDxEEEEE.y, AAAA is dataId, BBBBB is batch number, CCCCC is channel number, DDDDD is width and EEEEE is height)
          print('subgraph | data ID | DIM0 | DIM1 | batch number | channel | width x height')
          print(layer_info)
          data_id = layer_info[1]
          print(f'data ID: {data_id}')
          
          # save_error_plot(float_data, fixed_data, axes)
          save_pc_ref_plot(float_data, fixed_data, axs[0])
          save_pc_ref_plot2(float_data, fixed_data, axs[0])
          save_error_plot(float_data, fixed_data, axs[1])
          # plt.show()
          fig.suptitle(f'Analysis for data ID {data_id}') #TODO: read layer_info file for string name of the layer
          plt.draw()
          # while not plt.waitforbuttonpress(): pass
      
          if save_pngs_dir is not None:
              fig.savefig(os.path.join(save_pngs_dir, float_tracefile.split('/')[-1])+'.png')
          else:
              print('PNG not saved')
      
      
      def main():
          args = parse_args()
      
          files_fixed = os.listdir(args.tracedir_fixed)
          files_fixed.sort()
          traces_fixed = [f for f in files_fixed if '_float.bin' in f]
          traces_fixed.sort()
          num_files = len(traces_fixed)
      
          files_float = os.listdir(args.tracedir_float)
          files_float.sort()
          traces_float = [f for f in files_float if '_float.bin' in f]
          traces_float.sort()
      
          for i in range(num_files):
              filename_fixed = traces_fixed[i]
              # file_basename = filename_float.split('_float.bin')[0]
              # print(file_basename)
              filename_float = None
              for j in range(len(traces_float)):
                  if  filename_fixed in traces_float[j]:
                      filename_float = traces_float[j]
                      print(filename_float)
                      break
              if filename_fixed is None or filename_float is None: 
                  print('skip %s / %s\n\n' % (filename_fixed, filename_float))
                  continue
      
              print(filename_fixed)
              print(filename_float)
              print('found files; now compare traces')
              filename_float = os.path.join(args.tracedir_float, filename_float )
              filename_fixed = os.path.join(args.tracedir_fixed, filename_fixed)
              print(filename_fixed)
              print(filename_float)
              compare_traces(float_tracefile=filename_float, fixed_tracefile=filename_fixed, save_pngs_dir=args.save_trace_dir)
              
      
          # file_pairs[0:4]
      
      
      if __name__ == '__main__':
          main()
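
      A hypothetical invocation of the script above (its filename and the trace directories are placeholders; -t matches the tensor_bits of the fixed-point traces):

          python3 compare_traces.py /tmp/traces_fixed /tmp/traces_float -s ./trace_pngs -t 8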

     I note that the same behavior is seen in the 10_00_00_08 and 10_01_02_00 tidl-tools versions.

    • There are a few extra layers like abs, pow that use TIDL in 10_01 version, but same number of subgraphs, mostly from Clip, Div, ReduceSum (RS can be replaced with optimizer rule) 

    Note that we can probably eliminate a subgraph here with a model change: setting the ArgMax axis to -3 instead of 1 and moving the Cast after the Concat. Although I am unsure if the Flatten before ArgMax will permit this axis setting.

  • Hi Reese,

    There are a few problems to solve with the model right now:

    1. Too many subgraphs. I have suggested the customer modify the model structure to group the Conv and Mul operators together and other operators like Abs into another group, to reduce the number of subgraphs.

    2. Problem with the Sigmoid layer. As you suggested, I used PC simulation to run 32-bit and 8-bit, and found that there are problems with the batch+Sigmoid layers. All Sigmoid layers have bad accuracy:

    I am using tools 10.0.8 since SDK 10.1 has not been released. I need your comment on whether the customer should change all Sigmoid to ReLU.

    Regards,

    Adam

  • Another question: can we run CPU+NPU at the same time, based on edgeai-gst-apps/app_cpp? We tried deleting the contents of allownodes.txt, and tried adding this code to the postprocess part:

    // Wrap the preprocessed image buffer in an ORT CPU tensor (no copy).
    auto allocator_info = Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeCPU);
    Ort::Value input_tensor_ = Ort::Value::CreateTensor<float>(allocator_info, input_image_.data(), input_image_.size(), input_shape.data(), input_shape.size());
    // Run a second, CPU-only session alongside the TIDL-accelerated one.
    auto cpu_output = ort_session->Run(Ort::RunOptions{ nullptr }, &input_names[0], &input_tensor_, 1, output_names.data(), 1);
    const float* output_cpu = cpu_output[0].GetTensorMutableData<float>();

    But the FPS drops to 5. Is that normal?

    (We tried to use the usual OpenCV methods but found that imread and imwrite couldn't work because of: [100%] Linking CXX executable
    /usr/lib/gcc/aarch64-oe-linux/13.3.0/../../../../aarch64-oe-linux/bin/ld:(.text.startup+0x128): undefined reference to `cv::imread'/imwrite)

    moving the Cast after the Concat

    This leads to: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running TIDL_1 node. Name:'TIDLExecutionProvider_TIDL_1_1' Status Message: TIDL Compute Import Failed. Maybe caused by Concat on the ArgMax (INT) output.

    Two outputs failed; one output without postprocess.

    Conclusion: after postprocessing the 2*1*106*512 results, all 106 points are the same, regardless of whether all GlobalAveragePool layers are replaced.
  • Hi xiaojun,

    The first question, about FPS, is a separate problem. Could you file a different ticket for that?

    As for the model, there are other problems with it. Please allow us some time to make a workaround for that.

    Regards,

    Adam

  • Hi Xiaojun,

    Reese is out this week and won't be able to respond to you until next week.

    Regards,

    Jianzhong

  • Hi Adam,

    1. Too many subgraphs. I have suggested the customer modify the model structure to group the Conv and Mul operators together and other operators like Abs into another group, to reduce the number of subgraphs.

    2. Problem with the Sigmoid layer. As you suggested, I used PC simulation to run 32-bit and 8-bit, and found that there are problems with the batch+Sigmoid layers. All Sigmoid layers have bad accuracy:

    Understood on the two points. For the first, let me know if help is needed to make these optimizations. I see several places where automated scripts might help. Additionally, some layers that were previously on CPU should run with TIDL now with 10.1 SDK, like Abs and Pow.

    The sigmoid issue deserves further investigation. As a start, I'd recommend trying the 10.1 tools. The 10.1 SDK released this week, so it is ready to try. I agree that the data shown in those traces is not good quantization. Can you provide model + import config used here so I can reproduce and log as an issue? Do you know if hard-sigmoid sees the same? I see that this network is using both.

    • For the time being, let's switch to ReLU. It is not an ideal change and we'll work towards a fix. I suggest ReLU as a short-term workaround.
    • Is the c666-1.10 model above showing these sigmoid errors? If so, we can use this as a good test case for isolating the error. 
  • Is the c666-1.10 model above showing these sigmoid errors?

    sure.

    Can you provide model + import config used here so I can reproduce and log as an issue?

    Here is the artifacts file:

    c666-1.10.zip

    'c666': create_model_config(
        preprocess=AttrDict(
            resize=256,
            crop=256,
            data_layout="NCHW",
            resize_with_pad=False,
            reverse_channels=False,
        ),
        session=AttrDict(
            session_name="onnxrt",  # _face_1x3x120x120 modified_ -op11 modified_sparse_face_me
            model_path=os.path.join("/home/zxb/Desktop/ti/final-0gmp1.onnx"),
            input_mean=[0, 0, 0],
            input_scale=[1, 1, 1],
        ),
        task_type="classification",
        extra_info=AttrDict(num_images=numImages, num_classes=1000),
    ),

  • Hello,

    Thank you for supplying this. I have logged this sigmoid accuracy problem as an issue to resolve. In the meantime, please replace these with ReLU.

    For this C666 model, what else do you need assistance with? I believe there are still some issues with subgraphs / performance. Please help me understand your current status -- it is not clear to me. Perhaps one of these challenges is the Cast / ArgMax / Concat issue from above?

    This leads to: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running TIDL_1 node. Name:'TIDLExecutionProvider_TIDL_1_1' Status Message: TIDL Compute Import Failed. Maybe caused by Concat on the ArgMax (INT) output.

    Two outputs failed; one output without postprocess.

    I assume you have changed the model for this. 

    BR,
    Reese

  • Left: CPU. Right: NPU.

    Changed Sigmoid -> ReLU.

    Removed Flatten and ArgMax.

    What can I say...

    c666-relunoam.zip

  • Hello,

    I understand this has been a frustrating experience, thank you for your perseverance -- it is much appreciated.

    You have found a configuration that passes the compilation stage and can run on target. We are facing an accuracy challenge now, and your image makes this very obvious. Replacement with ReLU gives reasonable output on CPU, but poor output on NPU/C7x.

    This will take investigation to understand which layer(s) cause accuracy issue.

    As a first step, I ran the model with tensor_bits set to 8, 16, and 32, and can see a big difference (but the correct order of magnitude) between each quantization level. 32-bit is a reference floating-point mode, and is within a very small error margin of the CPUExecutionProvider. Therefore, 32-bit with TIDL is good to compare against.

    The next step is layer-level analysis. We need to run the model with debug_level=4 and tensor_bits=8. Save the traces under /tmp/tidl_trace...._float.bin. Then recompile and run the same with tensor_bits=32, and similarly save the traces. We can compare between 8 and 32 in the same way as Adam did above.

    I attached a script above that includes these visualization functions; use them for the accuracy comparison.

    As for reducing the number of subgraphs -- several layers you are using are not supported. 10.1 adds support for Abs and Pow layers, but this model still results in 5 subgraphs due to ReduceSum, Div (with both inputs variable), and Max layers.

    • ReduceSum could perhaps be replaced by ReduceMin or ReduceMax -- another alternative is ReduceMean, which we internally replace with Reshape and MatMul layers.
    • Div will be a challenge for the instances where both inputs are variable... unsure if there is a good way to replace that one without removing the skip connection for operand 2
    • Max layer should have been marked as supported... logs are not clear on why this was not treated as an elementwise layer.
      • The constant input you provided should have been identified as broadcastable. Perhaps try replacing the single constant value with that same value in the same shape as the variable input (e.g. custom_added_Max1 used a 1x98x1 variable vs. a 1x constant --> replace the constant with the same value in 1x98x1 shape); a hedged sketch follows this list.
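
    A minimal sketch of that constant replacement with the onnx Python API (the initializer name 'custom_added_Max1_const' is a hypothetical placeholder):

        import numpy as np
        import onnx
        from onnx import numpy_helper

        model = onnx.load('model.onnx')
        for init in model.graph.initializer:
            if init.name == 'custom_added_Max1_const':  # placeholder name
                val = numpy_helper.to_array(init).reshape(-1)[0]
                # Expand the scalar to match the variable input's 1x98x1 shape.
                expanded = np.full((1, 98, 1), val, dtype=np.float32)
                init.CopyFrom(numpy_helper.from_array(expanded, init.name))
        onnx.save(model, 'model_max_broadcast.onnx')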

    BR,
    Reese

  • Hi 

    I have tried your new model on 10.1 PC mode with 32-bit TIDL-RT and 8-bit TIDL-RT, and I found that accuracy degrades gradually after every convolution layer. Have you tried running the model in PC mode with tensor_bits=8 to see if the output is correct?

    Regards,

    Adam

• I have a question. I don't fully understand how TI's tensor_bits works. My ONNX model is 32-bit; when I convert it directly using the TIDL tool, will it automatically convert to bits=8, will it remain at bits=32 by default, or is it a mixed configuration?

    Have you tried running the model in PC mode with tensor_bits=8 to see if the output is correct?

    Should I convert the ONNX model from float32 to int8 first, and then compile it?

    I noticed that there are extra config.yaml files in ModelMaker's output artifacts folder.

    About 10.01: removing ReduceSum and Max also failed. c666-nomax.zip

    Changing the variable Div to Pow(-1) + Concat + Mul gives just one subgraph, but it also ends in failure.

    c666-noerror.zip

  • Hello,

    I have a question. I don't fully understand how TI's tensor_bits works. My ONNX model is 32-bit; when I convert it directly using the TIDL tool, will it automatically convert to bits=8, will it remain at bits=32 by default, or is it a mixed configuration?

    I understand, let me clarify tensorbits usage.

    The tensorbits parameter can be 8, 16, or 32.

    Values 8 and 16 mean that the model will run in fixed-point mode, so it is quantized. You do not need to pre-quantize your model for this (although that is an option). TIDL will use post-training quantization (PTQ) on a set of calibration images. This allows TIDL to find a quantization for each layer, so that the 32-bit floating-point values can be used as int8 (or int16) instead.

    • If you use int8, you can designate specific layers to run in 16-bit to improve accuracy -- the compile-time option is called "output_feature_16bit_names_list" (see the sketch below). If you use tensorbits=16, then all layers will be quantized to 16-bit, and this hybrid mode cannot be used.
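
    A hedged sketch of that hybrid setup (the option key follows the edgeai-tidl-tools docs; the layer names are hypothetical placeholders from your own model):

        compile_options = {
            'tensor_bits': 8,
            'advanced_options:output_feature_16bit_names_list': 'Conv_12_output, Gemm_3_output',
        }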

    A value of tensorbits=32 is different. This is considered a reference mode to check model functionality. The model will run in floating point -- it therefore skips calibration, meaning no float-->fixed-point conversion. This mode is intended to be used on PC.

    ________________________

    About 10.01: removing ReduceSum and Max also failed. c666-nomax.zip

    It is clear to me that accuracy is poor right now. Your source code looks okay in the screenshots. Let us figure out why the accuracy is poor. 

    For now, please continue testing with the c666-nomax model, and ignore the variable Div --> Pow^-1 and Mul. Let us focus on diagnosing the accuracy issue with the no-max model, as this trick with Div could be further influencing accuracy. We can look at Div afterwards.

    Could you recompile your model with tensorbits=32, and share the output of the same test on your c666-nomax model? I want to compare CPU-based inference (which is good) to NPU-based inference running with TIDL. 

    If this tensorbits=32 reference mode with TIDL is good, then this issue is due to quantization. If the issue is still present with tensorbits=32, this may indicate a bug in the TIDL SW.

    BR,
    Reese

  • If this tensorbits=32 reference mode with TIDL is good, then this issue is due to quantization. If the issue is still present with tensorbits=32, this may indicate a bug in the TIDL SW.

    Do you mean: in the edgeai-benchmark folder, run ./run_custom_pc.sh AM62A and set tensor_bits = 8/16/32 in settings_base.yaml?

    However, in that case, benchmark and TIDL would be two completely unrelated concepts.

    c666-ben32.zip

    This is a keypoint model, but task_type only allows cls/seg/det, and we don't need postprocessing. How should we set the config?

    c666-ben8-2.zip c666-ben16.zip

    Is there any other user-friendly way to convert an ONNX model into the format required by EdgeAI (similar to eIQ/RKNN/SNPE)?

  • Hello,

    In TIDL, when I choose different amounts of data for compilation, the NPU results also differ. When I then go to edgeai-benchmark and run ./run_custom_pc.sh AM62A,

    the sh file uses settings_base.yaml. If I set task_selection: keypoint_detection, it reports an error: Traceback (most recent call last):
    File "/home/zxb/Desktop/ti/edgeai-tensorlab/edgeai-benchmark/edgeai_benchmark/datasets/__init__.py", line 188, in get_datasets
    dataset_cache[DATASET_CATEGORY_IMAGENET]['calibration_dataset'] = ImageNetDataSetType(**imagenet_cls_calib_cfg, download=download)
    TypeError: 'NoneType' object is not subscriptable

    The current model is a keypoint model. Although the model zoo has a human-pose model, neither the TIDL configs nor the benchmark configs contain an example for it, and compiling directly as classification is also problematic. It is not the case, as mentioned at the beginning, that a custom model compiled as classification lets the NPU get results close to the CPU.

    We have never fully understood this part. The requirement has been communicated and the model has been provided, but we have not gotten the answer we were looking for.

  • Hi Xiaojun,

    As we suggested, the reason your model does not get the expected output after quantization may be that you need to add BatchNorm after each Conv.

    Or, you can create a new model with TI ModelMaker to do keypoint detection, which will accelerate your model deployment.

    Regards,

    Adam

  • Hi 

    Also, I found that your model's results are similar to mmpose face detection.

    If you are using an mmpose-based model, you can create a new model with https://github.com/TexasInstruments/edgeai-tensorlab/tree/main/edgeai-mmpose

    Regards,

    Adam

  • Dear Adam,

    edgeai-tensorlab/edgeai-mmpose/README.md

    EdgeAI-MMPose

    This repository is an extension of the popular mmpose open source repository for keypoint detection training. In edge-mmpose, we focus on yolox based keypoint detection models that are optimized for speed and accuracy so that they run efficiently on embedded devices. For this purpose, we have added a set of embedded friendly model configurations and scripts.

    This seems to indicate that TI has not conducted research on other configurations of MMPose, and I believe this suggestion is not entirely right. Our model comprises 331 key points, but for the preliminary ONNX verification, we utilized the 98-point base model of RTMPose for validation and have made the artifacts file publicly available.

    By the way, we conducted some verification on the regnet.onnx (classification) within TI's Model Zoo. Similarly, the results from the CPU and NPU are not consistent. It is unclear whether this indicates that the issue does not lie with the Batch Normalization (BN) layer.

    'kd-7060': utils.dict_update(common_cfg,
        preprocess=preproc_transforms.get_transform_onnx(640, 640, reverse_channels=True, resize_with_pad=[True, "corner"], backend='cv2', pad_color=[114,114,114]),
        session=onnx_session_type(**sessions.get_common_session_cfg(settings, work_dir=work_dir, input_optimization=False),
            runtime_options=settings.runtime_options_onnx_p2(
                det_options=True, ext_options={'object_detection:meta_arch_type': 6,
                    'object_detection:meta_layers_names_list': f'{settings.models_path}/vision/keypoint/coco/edgeai-yolox/yolox_s_pose_ti_lite_640_20220301_model.prototxt',
                    'advanced_options:output_feature_16bit_names_list': '/0/backbone/backbone/stem/stem.0/act/Relu_output_0, /0/head/cls_preds.0/Conv_output_0, /0/head/reg_preds.0/Conv_output_0, /0/head/obj_preds.0/Conv_output_0, /0/head/kpts_preds.0/Conv_output_0, /0/head/cls_preds.1/Conv_output_0, /0/head/reg_preds.1/Conv_output_0, /0/head/obj_preds.1/Conv_output_0, /0/head/kpts_preds.1/Conv_output_0, /0/head/cls_preds.2/Conv_output_0, /0/head/reg_preds.2/Conv_output_0, /0/head/obj_preds.2/Conv_output_0, /0/head/kpts_preds.2/Conv_output_0'},
                fast_calibration=True),
            model_path=f'{settings.models_path}/vision/keypoint/coco/edgeai-yolox/yolox_s_pose_ti_lite_640_20220301_model.onnx'),
        postprocess=postproc_transforms.get_transform_detection_yolov5_pose_onnx(squeeze_axis=None, normalized_detections=False, resize_with_pad=True, formatter=postprocess.DetectionBoxSL2BoxLS(), keypoint=True),
        metric=dict(label_offset_pred=1),  # TODO: add this for other models as well?
        model_info=dict(metric_reference={'accuracy_ap[.5:.95]%': 49.6, 'accuracy_ap50%': 78.0}, model_shortlist=10, compact_name='human-pose-yolox-s-640x640', shortlisted=True, recommended=True)
    ),

    edgeai-tensorlab/edgeai-benchmark/edgeai_benchmark/postprocess/__init__.py at main · TexasInstruments/edgeai-tensorlab

    edgeai-tensorlab/edgeai-benchmark/edgeai_benchmark/postprocess/keypoints.py at main · TexasInstruments/edgeai-tensorlab

    It may be necessary for us to rewrite the TIDL/postprocess/humanpose in order to resolve this issue.

  • Hi Xiaojun

    By the way, we conducted some verification on the regnet.onnx (classification) within TI's Model Zoo.

    Which specific model are you referring to?

    I tried this model with edgeai-tidl-tools and found no error between CPU and NPU:

    I suggested adding BatchNorm after each Conv because I found accuracy loss after every Conv+ReLU layer even in PC mode. I compared PC-mode 8-bit results with 32-bit results and found a large accuracy loss. So I think the problem is not a difference between PC and NPU, but accuracy loss due to quantization.

    Adding BatchNorm is still worth trying. If retraining takes a long time, you can use an untrained model just to verify the results between ONNX mode and TIDL 8-bit mode.

    Regards,

    Adam

  • Alright, we'll give it a shot later this week.

    Adding BatchNorm is still worth trying. If retraining takes a long time, you can use an untrained model just to verify the results between ONNX mode and TIDL 8-bit mode.