
AM62A7: Accuracy Issue - C7x vs CPU

Part Number: AM62A7

Greetings!

I am facing a strange accuracy issue when offloading the execution of my model to the hardware accelerators. I’ve checked the troubleshooting guides and experimented with the accuracy_level and tensor_bits parameters of the model, but that had only a marginal positive impact on the results. From what I have identified so far, the problem seems to be tightly related either to the number of inputs the model processes or to the order in which they are processed.

To illustrate the problem, consider my model, which processes an input with 4 metadata features and assigns it a label. The recurring issue I’ve been facing is that, given a batch of X inputs processed on the hardware accelerator, only the first input is consistently labeled correctly.

The figure below illustrates the problem: if I feed a batch of 5 inputs to the CPU, the accuracy rarely drops below 98%. If I feed the same batch to the offloaded model, only the first input is consistently labeled correctly, while the remaining 4 inputs suffer a drastic drop in accuracy. As highlighted in green, the first output of the offloaded model closely matches the value inferred on the CPU for the same data. This behavior is reproducible with other batch sizes as well, such as 20, 100, etc. As expected, when the batch size is 1, the accuracy is always on par with the CPU.
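In case it helps, this is roughly how I compare the two runs (simplified sketch; the array names and the helper are illustrative, not my actual script):

```python
import numpy as np

# Illustrative helper: cpu_out and c7x_out are (batch, num_classes) logit
# arrays from the CPU run and the offloaded run; labels is ground truth.
def per_row_report(cpu_out, c7x_out, labels):
    cpu_pred = np.argmax(cpu_out, axis=1)
    c7x_pred = np.argmax(c7x_out, axis=1)
    for i in range(len(labels)):
        print(f"row {i}: cpu={cpu_pred[i]} c7x={c7x_pred[i]} "
              f"label={labels[i]} agree={cpu_pred[i] == c7x_pred[i]}")
    # Per-run accuracy against the ground truth
    return np.mean(cpu_pred == labels), np.mean(c7x_pred == labels)
```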

Do you happen to have any insights on what could be happening? I’ve checked posts from other users, but their problems didn’t seem to have anything in common with what I am experiencing.

Thank you for your time!

Best regards,

Giann.

  • Hi Giann,

    Hmm, I'm not sure why this is. For ordinary CNN models, the batch dimension ('N' in NCHW format) is handled in a way that global layers (like GEMM, GlobalAvgPool, etc.) can struggle with, because batches are tiled along the width dimension rather than having a fully separate batch dimension.

    However, you are providing input with just 2 dimensions used (HW, in a sense). We don't see simple classifiers like this very often -- it's usually more complex CNNs.

    Could you also post the other SVG in your artifacts/tempDir? I'd like to see the dimensions for each tensor, which will show there.

    And using tensor_bits=32 doesn't change this behavior at all? It's consistently like this for all rows (in your case, batches) after the first for your model?
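    To illustrate what I mean by tiling (purely illustrative numpy, not the actual TIDL memory layout):

```python
import numpy as np

# A batch of N inputs laid out side by side along the width axis, instead
# of keeping a fully separate batch dimension. Purely illustrative numpy;
# not the actual TIDL internal layout.
n, c, h, w = 5, 1, 1, 4                      # e.g. 5 rows of 4 features
batch = np.arange(n * c * h * w, dtype=np.float32).reshape(n, c, h, w)
tiled = np.concatenate([batch[i] for i in range(n)], axis=-1)
print(batch.shape, "->", tiled.shape)        # (5, 1, 1, 4) -> (1, 1, 20)
```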

  • Hi Reese!

    Hmm, I'm not sure why this is. For ordinary CNN models, the batch dimension ('N' in NCHW format) is handled in a way that global layers (like GEMM, GlobalAvgPool, etc.) can struggle with, because batches are tiled along the width dimension rather than having a fully separate batch dimension.

    However, you are providing input with just 2 dimensions used (HW, in a sense). We don't see simple classifiers like this very often -- it's usually more complex CNNs.

    Could you also post the other SVG in your artifacts/tempDir? I'd like to see the dimensions for each tensor, which will show there.

    Indeed, I think it is an oddly specific issue for an oddly specific classifier. I was hoping that maybe I had missed some flag that could help in this case, or had overlooked some behavior of the hardware accelerators that I should have accounted for. In any case, here is the other .svg file of my model, designed to run batches of 5 inputs:

    And using tensor_bits=32 doesn't change this behavior at all?

    I did try compiling it with 32 bits, but running it on the board (setting tensor_bits to 16, right?) gives me the following:

    root@mitysom-am62ax:/opt/edgeai-tidl-tools/examples/osrt_python/ort# python3 onnxrt_ep_static.py -m mlp_pytorch_static5
    Available execution providers :  ['TIDLExecutionProvider', 'TIDLCompilationProvider', 'CPUExecutionProvider']
    
    Running Model - mlp_pytorch_static5
    
    
    Running_Model :  mlp_pytorch_static5  
    
    /usr/lib/python3.12/site-packages/sklearn/base.py:486: UserWarning: X has feature names, but StandardScaler was fitted without feature names
      warnings.warn(
    [[-0.24328645 -0.60652469 -0.94525457  1.46170644]
     [ 0.32166659 -0.29399856  0.39657165  1.31221415]
     [-2.50309862 -0.6060511  -1.71201241  1.69361362]
     ...
     [-0.80823949 -0.60439354 -1.13694403  0.52396768]
     [ 0.32166659  2.13850865 -0.17849673 -0.56858938]
     [-0.24328645 -0.60652469 -0.56187565 -0.74079855]]
    libtidl_onnxrt_EP loaded 0x192a63b0 
    Final number of subgraphs created are : 1, - Offloaded Nodes - 9, Total Nodes - 9 
    APP: Init ... !!!
      8386.257131 s: MEM: Init ... !!!
      8386.257508 s: MEM: Initialized DMA HEAP (fd=5) !!!
      8386.257824 s: MEM: Init ... Done !!!
      8386.257867 s: IPC: Init ... !!!
      8386.277730 s: IPC: Init ... Done !!!
    REMOTE_SERVICE: Init ... !!!
    REMOTE_SERVICE: Init ... Done !!!
      8386.282756 s: GTC Frequency = 200 MHz
    APP: Init ... Done !!!
      8386.282970 s:  VX_ZONE_INIT:Enabled
      8386.283010 s:  VX_ZONE_ERROR:Enabled
      8386.283020 s:  VX_ZONE_WARNING:Enabled
      8386.284521 s:  VX_ZONE_INIT:[tivxPlatformCreateTargetId:124] Added target MPU-0 
      8386.284902 s:  VX_ZONE_INIT:[tivxPlatformCreateTargetId:124] Added target MPU-1 
      8386.285310 s:  VX_ZONE_INIT:[tivxPlatformCreateTargetId:124] Added target MPU-2 
      8386.286244 s:  VX_ZONE_INIT:[tivxPlatformCreateTargetId:124] Added target MPU-3 
      8386.286294 s:  VX_ZONE_INIT:[tivxInitLocal:136] Initialization Done !!!
      8386.287143 s:  VX_ZONE_INIT:[tivxHostInitLocal:106] Initialization Done for HOST !!!
      8386.303663 s:  VX_ZONE_ERROR:[ownContextSendCmd:885] Command ack message returned failure cmd_status: -1
      8386.303721 s:  VX_ZONE_ERROR:[ownNodeKernelInit:592] Target kernel, TIVX_CMD_NODE_CREATE failed for node TIDLNode
      8386.303738 s:  VX_ZONE_ERROR:[ownNodeKernelInit:593] Please be sure the target callbacks have been registered for this core
      8386.303779 s:  VX_ZONE_ERROR:[ownNodeKernelInit:594] If the target callbacks have been registered, please ensure no errors are occurring within the create callback of this kernel
      8386.303798 s:  VX_ZONE_ERROR:[ownGraphNodeKernelInit:620] kernel init for node 0, kernel com.ti.tidl:1:1 ... failed !!!
      8386.303844 s:  VX_ZONE_ERROR:[vxVerifyGraph:2254] Node kernel init failed
      8386.303858 s:  VX_ZONE_ERROR:[vxVerifyGraph:2311] Graph verify failed
    TIDL_RT_OVX: ERROR: Verifying TIDL graph ... Failed !!!
    TIDL_RT_OVX: ERROR: Verify OpenVX graph failed
    5
      8386.312460 s:  VX_ZONE_ERROR:[ownContextSendCmd:885] Command ack message returned failure cmd_status: -1
      8386.312512 s:  VX_ZONE_ERROR:[ownNodeKernelInit:592] Target kernel, TIVX_CMD_NODE_CREATE failed for node TIDLNode
      8386.312528 s:  VX_ZONE_ERROR:[ownNodeKernelInit:593] Please be sure the target callbacks have been registered for this core
      8386.312541 s:  VX_ZONE_ERROR:[ownNodeKernelInit:594] If the target callbacks have been registered, please ensure no errors are occurring within the create callback of this kernel
      8386.312559 s:  VX_ZONE_ERROR:[ownGraphNodeKernelInit:620] kernel init for node 0, kernel com.ti.tidl:1:1 ... failed !!!
      8386.312582 s:  VX_ZONE_ERROR:[vxVerifyGraph:2254] Node kernel init failed
      8386.312594 s:  VX_ZONE_ERROR:[vxVerifyGraph:2311] Graph verify failed
      8386.312725 s:  VX_ZONE_ERROR:[ownGraphScheduleGraphWrapper:919] graph is not in a state required to be scheduled
      8386.312741 s:  VX_ZONE_ERROR:[vxProcessGraph:844] schedule graph failed
      8386.312753 s:  VX_ZONE_ERROR:[vxProcessGraph:849] wait graph failed
    ERROR: Running TIDL graph ... Failed !!!
    2025-04-10 10:32:07.251136938 [E:onnxruntime:, sequential_executor.cc:494 ExecuteKernel] Non-zero status code returned while running TIDL_0 node. Name:'TIDLExecutionProvider_TIDL_0_0' Status Message: TIDL Compute Invoke Failed.
    Traceback (most recent call last):
      File "/opt/edgeai-tidl-tools/examples/osrt_python/ort/onnxrt_ep_static.py", line 355, in <module>
        main()
      File "/opt/edgeai-tidl-tools/examples/osrt_python/ort/onnxrt_ep_static.py", line 351, in main
        run_model(args, so, model, 0)
      File "/opt/edgeai-tidl-tools/examples/osrt_python/ort/onnxrt_ep_static.py", line 268, in run_model
        output, proc_time, sub_graph_time = run_prediction(sess, input_rows)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/edgeai-tidl-tools/examples/osrt_python/ort/onnxrt_ep_static.py", line 96, in run_prediction
        predictions = session.run([output_name], {input_name: input_data})[0]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.12/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 200, in run
        return self._sess.run(output_names, input_feed, run_options)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running TIDL_0 node. Name:'TIDLExecutionProvider_TIDL_0_0' Status Message: TIDL Compute Invoke Failed.
      8386.601581 s:  VX_ZONE_INIT:[tivxHostDeInitLocal:120] De-Initialization Done for HOST !!!
      8386.606505 s:  VX_ZONE_INIT:[tivxDeInitLocal:206] De-Initialization Done !!!
    APP: Deinit ... !!!
    REMOTE_SERVICE: Deinit ... !!!
    REMOTE_SERVICE: Deinit ... Done !!!
      8386.607225 s: IPC: Deinit ... !!!
      8386.608028 s: IPC: DeInit ... Done !!!
      8386.608112 s: MEM: Deinit ... !!!
      8386.608133 s: DDR_SHARED_MEM: Alloc's: 8 alloc's of 13276020 bytes 
      8386.608144 s: DDR_SHARED_MEM: Free's : 8 free's  of 13276020 bytes 
      8386.608155 s: DDR_SHARED_MEM: Open's : 0 allocs  of 0 bytes 
      8386.608170 s: MEM: Deinit ... Done !!!
    APP: Deinit ... Done !!!

    It's consistently like this for all rows (in your case, batches) after the first for your model?

    It looks like it. For instance, I classified 900 inputs in batches of 5, and the results seem to be consistent across all batches. I randomized the order of the inputs in a couple of experiments, and the behavior was always the same. Here are arbitrary batches extracted from the beginning, middle, and end of the classification process. The results below are the outputs generated by the CPU and the C7x processor for the same input batches.

    CPU Outputs

    [[  0.5499512     29.86619      10.013045     -68.96837     -117.670074  ]
     [  -24.728174    2.1392264     11.749797     -43.16042     -94.035225   ]
     [  -0.31245813   23.115824     7.411965      -53.738945    -89.19313    ]
     [  -14.790332    34.75649      24.557858     -115.59277    -244.28152   ]
     [  -8.414772     23.445932     11.591913     -59.868694    -104.44454   ]]
    
    [[  0.5499512     29.86619      10.013045    -68.96837      -117.670074  ]
     [  -24.728174    2.1392264    11.749797     -43.16042      -94.035225   ]
     [  -0.31245813   23.115824      7.411965    -53.738945     -89.19313    ]
     [  -14.790332    34.75649      24.557858    -115.59277     -244.28152   ]
     [  -8.414772     23.445932     11.591913    -59.868694     -104.44454   ]]
    
    [[  -66.46856     -38.68299    16.216866   -6.0427346 -60.902885 ]
     [  -77.21364     -52.401382    4.6126523   9.821233   28.053577 ]
     [  -14.435523    -4.65252     5.4250026 -14.78532   -45.285408 ]
     [  -48.10428     -33.970505    8.2421255  11.033554  -12.873893 ]
     [  -32.18031     -34.3971      2.4281816  18.979115   -2.106932 ]]
    
    [[  -34.84538    -30.827648     3.8531241   16.92795      0.9875134   ]
     [  -76.036446   -51.94864      4.4317117    9.819406     27.643337   ]
     [  -53.555202   -36.47319      3.594068     7.919107     18.3411     ]
     [  -13.563389    -8.161517      2.2703397  -11.412899    -29.486507  ]
     [  -15.914622     1.830104      6.699764   -23.267334    -46.62388   ]]
     
    [[  -38.932514   -33.723167      2.9016192   14.818679      5.907353   ]
     [  -28.286034    -0.97361857    10.951331   -32.76529     -69.296974  ]
     [  -15.673623     8.680218     26.204927  -126.45664    -344.4251     ]
     [    8.507519    33.484344     13.493736  -101.173325   -208.67247    ]
     [   -7.4636      24.068132     11.171818   -60.85739    -104.539444   ]]
    
    [[  -60.612915   -42.289196      6.5389113   15.820601     12.33727    ]
     [  -14.525048    33.23467      21.618534   -99.91361    -200.91388    ]
     [  -54.04861    -26.687498     12.296658   -12.427855    -48.643497   ]
     [  -13.661114    27.674337     18.323404   -83.5391     -166.56775    ]
     [   -6.600704    21.46618       9.22357    -53.857243    -88.61777    ]]
     
    [[  -27.047228   -10.543996      8.479833    -7.575113    -33.604733  ]
     [  -15.542902    -3.6416762     5.1694474  -21.737648    -49.104355  ]
     [  -29.054012     9.54565      19.02267    -56.05999    -130.08748   ]
     [  -27.789953   -24.79303      3.1229627    8.065356     -9.083949   ]
     [  -35.77662    -30.80306      4.14476     16.828968      1.2244477  ]]

    C7x Outputs

    [[[[[[   0.5548318    29.866343     10.012192    -68.969376   -117.67478  ]
         [  10.050021      0.472868     -3.5433576   -15.869451   -33.25838   ]
         [  -7.3136916   -20.881851     -2.2508516    13.15834    -4.2116776  ]
         [   9.873484     34.04019      12.275653    -90.24213    -180.84995  ]
         [   6.4373097    11.254258      0.91421145  -23.441643   -36.656727  ]]]]]]
    
    [[[[[[   0.5548318    29.866343     10.012192    -68.969376   -117.67478  ]
         [  10.050021      0.472868     -3.5433576   -15.869451   -33.25838   ]
         [  -7.3136916   -20.881851     -2.2508516    13.15834    -4.2116776  ]
         [   9.873484     34.04019      12.275653    -90.24213    -180.84995  ]
         [   6.4373097    11.254258      0.91421145  -23.441643   -36.656727  ]]]]]]
    
    [[[[[[  -66.46002   -38.680603    16.21622    -6.0464053      -60.9054    ]
         [  -58.692375  -42.205044    2.6165364   8.032451        21.323195   ]
         [  -28.73146   -30.982311   -1.7086297   8.984492        9.7663      ]
         [  -32.74138   -30.175283   -0.7061496   7.1686788       12.193689   ]
         [  -23.391203  -29.626757   -3.1398435   8.385526        9.066456    ]]]]]]
    
    [[[[[[  -34.847218   -30.830994     3.852298    16.922369      0.98987037 ]
         [  -57.664677   -41.68804      2.496743     7.9883165     20.963814   ]
         [  -46.359978   -33.579933     1.9860456    6.4436145     16.85932    ]
         [   -6.777775   -19.847847    -3.7514195    8.9088335      0.6367956  ]
         [  -30.698591   -23.763193     0.813333     4.8863025     11.229039   ]]]]]]
    
    [[[[[[  -38.926495   -33.724945      2.900257     14.81653       5.9013925 ]
         [   -9.217773   -16.922369     -2.8182933     5.737465       3.4109545 ]
         [   34.601326    33.308823     11.134465   -116.98124     -206.59918   ]
         [   26.70128     30.18159       0.8511624   -72.97299     -131.16098   ]
         [    4.8232536    8.700771      0.71245444  -17.817667    -27.684845   ]]]]]]
    
    [[[[[[  -60.602764   -42.28701       6.5381885   15.819011      12.332397  ]
         [    6.159894    31.013836     10.308522   -72.74601     -134.63498   ]
         [  -39.33631    -37.987064     -1.418604     9.312347      14.721957   ]
         [    9.324957    25.00526       6.191418    -55.319252     -99.74362   ]
         [   -6.777775   -11.342527    -1.0466145     5.3024263      0.472868   ]]]]]]
    
    [[[[[[  -27.041744   -10.548109      8.473795    -7.572193     -33.598846  ]
         [    4.72868    -10.3022175    -3.9279568     1.1601028    -15.024592  ]
         [    4.5395327   20.068518      5.951832    -45.761013    -81.93226    ]
         [  -11.342527   -21.02056     -3.4992232     7.2128134      4.104494   ]
         [  -47.570522   -33.06293       2.4904382     6.0464053     17.23131    ]]]]]]
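    (For the comparisons above I squeeze the extra singleton axes off the C7x output first; quick sketch with placeholder arrays:)

```python
import numpy as np

# The C7x output prints with four extra singleton dimensions
# ([1, 1, 1, 1, 5, 5] here); squeeze them so it lines up with the
# (5, 5) CPU logits. Zeros are placeholders for the real outputs.
c7x_raw = np.zeros((1, 1, 1, 1, 5, 5))
cpu = np.zeros((5, 5))

c7x = np.squeeze(c7x_raw)                        # -> shape (5, 5)
row_agrees = np.all(np.isclose(cpu, c7x, atol=1e-2), axis=1)
print(c7x.shape, row_agrees.shape)               # (5, 5) (5,)
```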

    Thank you for the support!

    Best regards,

    Giann.

  • Hi Giann,

    Thanks for the detail here. Always appreciated! The problem is clear: the outputs for all batches/rows after the first look random.

    While debugging accuracy issues, it's often best to work from the PC side and use the emulation tools (the same Python code you would run on the EVM).

    • tensor_bits=32 is only supported on PC; this 'reference' mode is used to establish a floating-point baseline. No calibration is run in this context, and it will not run on the EVM.
    • tensor_bits=16 is a separate mode that does 16-bit fixed-point computation (including calibration from float to fixed).
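    For reference, the bit depth is just one of the compile options handed to the TIDL compilation provider on PC. A minimal sketch (paths are placeholders for your edgeai-tidl-tools setup):

```python
# Sketch of the compile-time options dict for the TIDL compilation
# provider on PC. Paths below are placeholders, not your actual setup.
compile_options = {
    "tidl_tools_path": "/opt/tidl_tools",       # placeholder path
    "artifacts_folder": "./model-artifacts",    # placeholder path
    "tensor_bits": 32,      # 32 = floating-point reference mode (PC only)
    "accuracy_level": 1,
}
# These are passed to onnxruntime.InferenceSession via
#   providers=["TIDLCompilationProvider", "CPUExecutionProvider"],
#   provider_options=[compile_options, {}]
print(compile_options["tensor_bits"])
```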

    If you are willing to share a version of your model (randomized weights okay), we can take a look. 

    Otherwise, it is hard to know the exact cause, so we'll need to go a bit deeper. Let's start by testing that reference mode (tensor_bits=32) on PC and checking the output. If it is not correct, then there is a functional error, which would be considered a bug.

    BR,
    Reese

  • Hi Reese!

    I recall testing it with 32 bits on the PC, and the result was marginally better.

    I am attaching a zip file containing the following files and folders:

    e2e_accuracy.zip

    • data: this folder goes inside the osrt_python/ort/ folder and contains the inputs to be inferred;
    • onnxrt_ep_mlp5.py: custom compilation script for my use case, which goes inside the osrt_python/ort/ folder;
    • mlp_pytorch_static5.onnx and mlp_pytorch_static5_label_map.json: model adapted for our testing purposes and the labels used to check the accuracy. Both of these files go in the edgeai-tidl-tools/models/public folder of the docker;

    You are probably more familiar than I am with the procedure, but:

    # to compile using my custom compilation script
    /home/root/examples/osrt_python/ort# python3 onnxrt_ep_mlp5.py -m mlp_pytorch_static5 -c
    
    # to infer simulating the c7x processor and check the accuracy
    /home/root/examples/osrt_python/ort# python3 onnxrt_ep_mlp5.py -m mlp_pytorch_static5
    
    # to infer simulating the cpu and check the accuracy
    /home/root/examples/osrt_python/ort# python3 onnxrt_ep_mlp5.py -m mlp_pytorch_static5 -d

    Please let me know if I can be of any help, or if you see something that you are not sure about.

    Thank you for helping out!

    Best regards,

    Giann.

  • Hi Giann

    I recall testing it with 32 bits on the PC, and the result was marginally better.

    Marginally better, but still wrong, correct? Is it slightly wrong or VERY wrong (i.e., unusable for batches after the first), like what you've shown previously?

    32-bit mode is supposed to be a reference/baseline, so if it is incorrect, that suggests an issue in the reference implementation that will also impact the hardware-optimized/accelerated implementations.

    e2e_accuracy.zip

    • data: this folder goes inside the osrt_python/ort/ folder and contains the inputs to be inferred;
    • onnxrt_ep_mlp5.py: custom compilation script for my use case, which goes inside the osrt_python/ort/ folder;
    • mlp_pytorch_static5.onnx and mlp_pytorch_static5_label_map.json: model adapted for our testing purposes and the labels used to check the accuracy. Both of these files go in the edgeai-tidl-tools/models/public folder of the docker;

    Thanks for including this. Give me a few days to provide an update here. I'll let you know sooner if I have some doubts about the scripts or data provided.

    • Please confirm SDK version and/or tag on edgeai-tidl-tools. Is this 10_01_00_04 / 10.1 SDK?

    Just as a heads up: if I confirm this is a reference-implementation issue, it will be logged as a bug and take time to address.

    BR,
    Reese

  • Hello Reese!

    Marginally better, but still wrong, correct? Is it a slightly wrong or VERY WRONG (i.e. unusable for batches after the first), like what you've shown previously?

    Correct. And to be honest, I think the ‘marginally better results’ I recalled were probably just a product of the randomness of the results. I just double-checked, and the accuracy with 32 bits this time was around 52%.

    Please confirm SDK version and/or tag on edgeai-tidl-tools. Is this 10_01_00_04 / 10.1 SDK?

    I was compiling and testing on the board using version 10_00_08_00, but I will allocate some time to test everything from scratch with version 10_01_00_04 as well and get back to you.

    Just as a heads up: if I confirm this is a reference-implementation issue, it will be logged as a bug and take time to address.

    No worries, I understand these things take time. Part of the reason I posted the issue was to make sure I wasn’t missing any crucial step, but also to perhaps help uncover one of these hard-to-track bugs.

    Thank you for the great support as always.

    Best regards,

    Giann.

  • Hi Giann,

    I should have asked earlier -- can you provide the model_config entry you are using for this? I think you probably modified some of the postprocessing code as well (common_utils.py?), so that would be very helpful to provide too.

    Perhaps the easiest thing is to tarball your examples/osrt_python directory and share that, so all the source is consistent.

    BR,
    Reese

  • Hello Reese,

    Yes, here it is:

        "mlp_pytorch_static5": create_model_config(
            source=AttrDict(
                infer_shape=True,
            ),
            preprocess=AttrDict(
                resize=None,
                crop=None,
                data_layout="NCHW",
                resize_with_pad=False,
                reverse_channels=False,
            ),
            session=AttrDict(
                session_name="onnxrt",
                model_path=os.path.join(models_base_path, "mlp_pytorch_static5.onnx"),
                input_optimization=True,
            ),
            task_type="other",
            extra_info=AttrDict(num_rows=180, num_classes=5)
        ),

    Where the models_base_path is

    models_base_path = '../../../models/public/'

    Perhaps the easiest thing is to tarball your examples/osrt_python directory and share that, so all the source is consistent.

    Let me do a cleanup of the folder before sending it to you. At the moment there are several files I've been using to test things, and I think it would be a bit too distracting to send it as is.

    Please let me know if any further information is needed.

    Best regards,

    Giann.

  • Hi Giann,

    I was able to hack around the code and generate what I needed. 

    I took a deeper look at the model's intermediate outputs at each stage using the reference model, and the original supposition is correct.

    The first chunk/batch/row of data is good throughout the network (the first 1/5th of the data, in your case). The rest of the data passed between layers is quite wrong. The root cause is not clear, but it is certainly happening in the floating-point reference mode, so expect further accuracy loss when quantized. I'll need to raise this with the development team.

    I experimented a little with using the deny_list to prevent some layers from being accelerated by TIDL. There is still error, but interestingly, the outputs are now at least correct from an argmax perspective; see the output data below under different conditions.

    [[ output ]] argmax class (1-indexed)
    
    [[ -19.854969      6.298686     18.908436    -76.92346    -201.2036    ] 3 
     [  15.466614     50.259216     12.97859    -112.03983    -198.6063    ] 2
     [   6.253457     -8.780747     -4.716157     -4.1485167   -19.801765  ] 1
     [ -58.118713    -43.002296      2.1930084     8.446512     21.171597  ] 5
     [   9.201027     10.682446     -0.30513966  -25.332159    -43.82667   ]] 2
    ^^ tensor_bits = 32
    
    
    [[ -19.854958      6.2986827    18.908432    -76.92345    -201.20358   ]  3
     [ -12.10481      46.394966     25.568953   -125.3672     -245.01732   ]  2
     [ -26.210058      0.28695372   10.868061    -37.725567    -79.95061   ]  3
     [ -78.44962     -51.275146     11.368432     17.634438      4.835573  ]  4
     [   6.612872     32.4673        7.5877786   -74.13031    -119.98943   ]] 3
    CPU offload
    
    
    [[ -19.854969     6.298686    18.908436   -76.92346   -201.2036   ] 3
     [ -10.618736    44.752426    24.46529   -123.04369   -241.65892  ] 2
     [ -23.809456     0.9467592    9.758838   -38.54927    -78.231735 ] 3
     [ -78.99292    -52.33345     10.288101    16.722866     8.058851 ] 4
     [   8.462003    31.874058     6.7729435  -70.71581   -115.059006 ]] 3
    tensor_bits = 32, deny first two layers
    'deny_list:layer_name': '/network/network.0/Gemm, /network/network.1/Relu'
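    (For completeness, the deny list shown above is just another key in the same compile-options dict; quick sketch with the layer names from this model:)

```python
# The deny list rides along in the same compile-time options dict; the
# layer names below come from this model's graph, as used above.
compile_options = {
    "tensor_bits": 32,
    "deny_list:layer_name": "/network/network.0/Gemm, /network/network.1/Relu",
}
denied = [n.strip() for n in compile_options["deny_list:layer_name"].split(",")]
print(denied)
```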

    Please note that I tested this on the 10.1 SDK (10_01_04_00 tag/release for edgeai-tidl-tools), so the error is present in this release as well as 10.0.

    BR,
    Reese