This thread has been locked.


AM68A: Compilation does not work properly on SDK 9.1

Part Number: AM68A

We compiled our model using SDK 8.6 and it works well for us. After switching to the newer version (SDK 9.1), we encountered the following issues:

* a large output mismatch after calibration compared to the ONNX model when add_data_ops: 0

* closer values, but still a large mismatch after calibration compared to the ONNX model when add_data_ops: 1

It feels like the default add_data_ops: 0 does not work and produces bad outputs even for a simple neural network (only Add, Conv, and Relu operations).
This model should compile easily on SDK 9.1, but it does not.

Here I provide the assets to compile, a README explaining how to run this example, the inference script, the ONNX model, and logs for both variants. I believe it should work with add_data_ops: 0 as well; could somebody help me determine what is wrong?
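For reference, a minimal sketch of the compilation options behind this setup (key names follow this thread's shorthand, e.g. add_data_ops, and the paths are placeholders; the documented key names in your SDK's OSRT example scripts may differ):

```python
# Sketch of the TIDL compilation options for this experiment.
# Key names follow this thread's shorthand and may not match the
# documented option names -- verify against the OSRT example scripts
# shipped with the SDK before use.
compile_options = {
    "tidl_tools_path": "/path/to/tidl_tools",  # placeholder path
    "artifacts_folder": "./model-artifacts",   # placeholder path
    "tensor_bits": 16,   # 16-bit quantization, as used in this thread
    "add_data_ops": 0,   # the default value under investigation
}
```

Per TI's OSRT flow, a dict like this is typically passed as the first entry of provider_options when creating an onnxruntime InferenceSession with the TIDLCompilationProvider.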

assets.zip

  • Hi,

Can you share the exact SDK tags you used for the 8.6 and 9.1 SDKs?

Moreover, does this issue occur for 16-bit quantization only? Can you share your observations for 8-bit?

  • Hi, the tags used are 09_01_00_02 and 08_06_00_05. For 8-bit I don't have results right now, but I will share them once I test it. Here you can see how large the degradation is between SDK 8.6 and 9.1 (add_data_ops: 0) for 16-bit compilation with the same setup (same number of frames, number of iterations, model, calibration frames, and inference images).
    Moreover, it also produces a large mismatch when add_data_ops: 1 and the denylist contains Transpose, Resize, Reshape, and MatMul. It feels like there is some problem with data handling in SDK 9.1, but that is a topic for a separate post; right now it would be good to understand why this simple neural network cannot be compiled properly on SDK 9.1 when add_data_ops equals 0.



  • Let me note the observations:

1. You are working with a 16-bit network

2. On the 9.1 SDK (09_01_00_02 tag) you are facing a model compilation issue when add_data_ops is set to 1; however, the same observation does not occur with the 08_06_00_05 tag?

Can you confirm my understanding of this issue is correct? In particular, I am trying to understand how inference worked if model compilation itself is failing, as there would be no model artifacts generated.

Can you elaborate on how you are comparing the results? Are the 16-bit quantized values dequantized to float32 and compared with the ONNX float32 values?

If you could add a comparison table of 8.6 and 9.1 with the add_data_ops flag set and not set, that would help our understanding.

  • 1. Yes, I am working with a 16-bit network

2. On the 9.1 SDK (09_01_00_02 tag) I am facing the issue when add_data_ops is set to 0, and yes, the same observation does not occur with the 08_06_00_05 tag

3. Yes, I take the 16-bit quantized values, convert them to float32, and compare them to the original ONNX float32 values

Here the name column gives the setup under test, and each error margin <number> column shows how similar the outputs of the original ONNX float32 model are to the 16-bit model for that setup, where <number> is the threshold (see compare_float_3d_arrays in the assets). The numbers are approximate and vary between samples, but you can clearly see that 09_01_00_02 (add_data_ops: 0) produces outputs that are not relevant at all. For all of these experiments I used the same ONNX model, the same calibration images, and the same options (except add_data_ops), so it feels like something is wrong under the hood in the new SDK when add_data_ops: 0.

| name | error margin 0.1 | error margin 0.01 | error margin 0.001 | error margin 0.0001 |
|------|------------------|-------------------|--------------------|---------------------|
| 08_06_00_05 | 99.70% | 51.25% | 5.87% | 0.59% |
| 09_01_00_02 (add_data_ops: 1) | 99.59% | 48.27% | 5.56% | 0.54% |
| 09_01_00_02 (add_data_ops: 0) | 10.35% | 1.10% | 0.11% | 0.01% |
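The comparison methodology described above can be sketched as follows (a minimal illustration assuming a symmetric dequantization scheme; compare_float_3d_arrays in the shared assets is the authoritative implementation):

```python
import numpy as np

def dequantize(q, scale, zero_point=0):
    """Map quantized integer values back to float32 (symmetric scheme assumed)."""
    return (q.astype(np.float32) - zero_point) * scale

def match_fraction(ref, out, margin):
    """Fraction of elements whose absolute difference from the ONNX
    float32 reference stays within `margin` -- one cell of the table above."""
    return float(np.mean(np.abs(ref - out) <= margin))
```

Each row of the table would then be something like `{m: match_fraction(onnx_out, tidl_out, m) for m in (0.1, 0.01, 0.001, 0.0001)}`.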

     

  • Thanks for sharing the info.

Let me check this internally and get back to you within a one-week time frame.

  • Hi Roman,

The observations you pointed out are correct; I have verified them at my end.

As I see it, when you set add_data_ops to 1 you get a 99.59% correlation percentage; however, with the default, i.e. 0 (the same default is reflected in the OSRT example script), it is poor. With the default add_data_ops set to 0 this operation is scheduled on the ARM core, and it shows a functional mismatch in the current case; with the flag set to 1 the same operation is scheduled on the DSP core and works fine.
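Until the fix lands, the workaround implied above is a one-line change to the compilation options (key name follows this thread's shorthand and should be checked against the SDK's documented option names):

```python
# Workaround sketch: schedule the data-convert ops on the DSP core instead
# of ARM by enabling the flag explicitly (key name per this thread).
compile_options = {
    "add_data_ops": 1,  # 0 (default) -> ARM scheduling, currently mismatching
}
```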

I have filed a JIRA to track this issue; it will tentatively be fixed in the upcoming 9.2 release.

    Adding JIRA link for TI's internal tracking purpose.

    jira.itg.ti.com/.../TIDL-3572

  • Hi Roman,

I would need the model that shows this behavior, so that I can try to reproduce the issue at my end and add a fix.

If you are not able to share the exact same model, you can share a toy model (dummy model) where this issue is reproducible.

    Thank you  

  • Sure! I have shared a dummy model in the assets, thank you

  • Thank you! Also, in case it helps, I wanted to point out that the same error occurred even with add_data_ops: 1 when the denyList is not empty (for example: Reshape, Transpose, MatMul, etc.; I am not sure about other operations). It feels like these are related and that scheduling to ARM does not work properly (maybe because of the 6-dimensional tensors?)

  • Sure! Shared a dummy model in assets, thank you

I didn't understand what "assets" refers to here! Generally, users attach the model as zip files here.

Could you please attach the model here so I can check it?

sub_graph.onnx is inside assets.zip. Also, is there a link to the JIRA so we can track its status? This is an important issue for us.
    4452.assets.zip

  • Hi,

Thanks for sharing the model.

I will attach it to the JIRA shared above so that the developer can reproduce this issue.

Shall we close this thread until then?