
TDA4VH-Q1: Artifacts built with 10.1 version do not work.

Part Number: TDA4VH-Q1

Tool/software:

Hello,

We found that the performance of the model built with tools version 10.1 is not acceptable compared with the performance of the same model built with tools version 10.05.

It started with some warnings received during the compilation stage indicating that some Slice and Transpose layers are not supported. The model has unacceptable accuracy when running on the device.

After setting rt.GraphOptimizationLevel.ORT_DISABLE_ALL during the compilation, the warnings disappear; however, this does not improve the model accuracy.
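
For context, this is roughly how the flag is set in the compilation session (a minimal sketch of the standard OSRT compilation flow, assuming TI's onnxruntime build with the TIDL providers; the model path, artifacts folder, and option values are illustrative, not our exact configuration):

```python
import onnxruntime as rt

# Illustrative TIDL compile options; the real values come from our configuration.
compile_options = {
    "tidl_tools_path": "/path/to/tidl_tools",
    "artifacts_folder": "./artifacts_10_1",
    "tensor_bits": 16,
}

so = rt.SessionOptions()
# The flag in question: disables all ONNX Runtime graph optimizations during compilation.
so.graph_optimization_level = rt.GraphOptimizationLevel.ORT_DISABLE_ALL

# Compilation session using TI's onnxruntime build with the TIDL compilation provider.
sess = rt.InferenceSession(
    "model.onnx",
    providers=["TIDLCompilationProvider", "CPUExecutionProvider"],
    provider_options=[compile_options, {}],
    sess_options=so,
)
```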

  • Hi Alex,

    Which versions of edgeai-tidl-tools are you using with SDK versions 10.1 and 10.05? Could you please confirm that the edgeai-tidl-tools tag and SDK version used for each compilation are compatible, and that no steps outlined in the "Notes" column of the version compatibility table were missed?

    Version Compatibility Table: https://github.com/TexasInstruments/edgeai-tidl-tools/blob/master/docs/version_compatibility_table.md

    Release Notes: https://github.com/TexasInstruments/edgeai-tidl-tools/releases

    Thank you,

    Fabiana

  • Hi Alex,

    I have run your model and am currently trying to reproduce the issue. I saw in the emails that you have your own script to test the accuracy (full_test.py). Is this something you can share?

    If not, which numbers in the email are the accurate ones? Are they the larger ones that range from 1.65 to 0.01?

    I am using edgeai-tidl-tools versions 10_01_04_00 and 10_00_05_00. Please let me know if you used different ones.

    Warm regards,

    Christina

  • Hi Christina,

    I'm using the same test for both versions - 10_00_05_00 (backward compatible with SDK 9.2) and 10_01_00_01. I'm running a single input and performing inference with both the 32-bit and the 16-bit quantized model. I'm comparing all outputs, 16-bit vs. 32-bit, just computing an absolute difference error. I receive very high differences with 10_01_00_01, while the errors with 10_00_05_00 are small, ~[1e-3 : 1e-2].
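
    The comparison itself is nothing more than the following kind of check (a simplified sketch, not the exact code from full_test.py; the argument names are placeholders for the output lists returned by the two runs):

    ```python
    import numpy as np

    def report_abs_error(outputs_fp32, outputs_q16):
        """Compare the outputs of the 32-bit reference run and the 16-bit quantized run
        on the same input, printing the mean and max absolute difference per output."""
        for idx, (ref, quant) in enumerate(zip(outputs_fp32, outputs_q16)):
            diff = np.abs(np.asarray(ref, dtype=np.float32) - np.asarray(quant, dtype=np.float32))
            print(f"output {idx}: mean abs err {diff.mean():.6e}, max abs err {diff.max():.6e}")
    ```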

    I attached the script so you can run it and reproduce the issue. It runs using the YAML file, where you need to set the paths to the artifacts and ONNX files.

    full_test.tar.gz

  • Hi Alex,

    I compared the results between 10_00_05_00 and 10_01_00_04 for the outputs Value and Key, as well as the absolute difference error for both. Everything ended up being the same between the two versions, with the same difference. I ran the onnxrt_ep script found under examples and checked through there.

    I tried to get your full_test script to work for me, but I kept getting a segmentation core dump whenever I ran it. I think there may be an issue with the math done in your script for the absolute error, as based on your screenshots, some of the numbers match up with the other version whereas others don't. It may be due to the negative values in some of the outputs.

    Could you try running the onnxrt_ep script included with TIDL and see if you still see different outputs, or see if there is indeed an error in the math of your script?

    Warm regards,

    Christina

  • Hi Christina, please run the script with cfg['get_artifacts'] set to False on line 223 and cfg['test_TIDL'] set to True on line 226. Just point it to the artifacts you received from me for both versions.

  • Hi Christina,

    Just to make sure we are aligned on how to reproduce the issue: you should have the input image and artifacts for both versions. The script I shared with you can do several things - model compilation, inference in CPU emulation, and inference on the device. You don't need to create the artifacts; the point is to test the average mean error between the quantized and 32-bit inference for both versions for a given input. Please also comment out line 74 (so.graph_optimization_level = rt.GraphOptimizationLevel.ORT_DISABLE_ALL) when running the 10_00_05_00 inference - that compilation was run without this parameter; it was advised by the TIDL team later, when I was running into some strange warnings during the 10_01_00_04 compilation.

    Thank you,

    Alex.

  • Hi Alex,

    I had followed all the instructions from the past Jiras, including uncommenting line 74 so the errors weren't seen. The settings on lines 223 and 226 were already in place, and it still didn't let me reproduce the issue with your script.

    I just did the model compilation and inference in CPU emulation outside your script, as this is our normal practice to ensure reproducibility. I also recompiled the artifacts for this reason. I compared the average mean error between the 16-bit and 32-bit inference runs this way.

    I ran everything again with another image from the ones you provided and got the same results for both versions for the outputs. Just to explain more thoroughly the steps I did with OSRT:

     1. Set common_utils.py tensor bits value to 32

     2. Ran onnx_rt.py compile and inference

     3. Saved the outputs for Value and Key, converting to .txt for 32-bit float

     4. Repeated for 16-bit

     5. Found the absolute error by subtracting the .txt files (see the sketch after this list)

     6. Repeated on 10.00.05.00

     7. Checked the difference between 10.01.00.04 and 10.00.05.00 (didn't have any difference)

    I have attached the artifacts I generated on 10.01.00.04, in case those work for you. Right now, I do think running through OSRT as a check on your end is needed, since I am not seeing the issues on my end. Are you seeing this issue both in emulation and on the device, or just on the device? I assumed it was both.

    Warm regards,

    Christina

    Artifacts_Leddartech_10.1.0.4.zip

  • Hi Christina,

    When you compile the artifacts using a single image, you overfit to it and will certainly get a decent result when testing with the same input. When I do the PTQ, I build the artifacts using a calibration set of 128 images, running 64 iterations over this calibration set. I have attached the artifacts here so you can test them. They are from the 10_01 version; please run the script I provided for inference and collect all the outputs, not just Key and Value. Please let me know what 16/32-bit MAE you receive.
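
    For reference, the calibration-related compile options in my setup look roughly like this (a sketch; the key names follow the edgeai-tidl-tools OSRT options documentation, and the values are the ones mentioned above):

    ```python
    # Calibration settings used during PTQ; these are merged into the usual TIDL
    # compile options dictionary before compilation.
    calibration_options = {
        "advanced_options:calibration_frames": 128,     # size of the calibration set
        "advanced_options:calibration_iterations": 64,  # iterations over that set
    }
    ```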

    Thank you,

    Alex.

    Artifacts_10_01.zip

  • Thanks Alex for the clarification on the artifacts.

    I have been using the artifacts you attached to generate the outputs. I will update you tomorrow on my results for both 16-bit and 32-bit with both 10.01.00.04 and 10.00.05.00.

    Warm regards,

    Christina

  • Hi Christina,

    Any updates on this?

    Thank you,

    Alex.

  • Hi Alex,

    Apologies for the late response. I was trying to validate my results to double-check, but here are the outputs I was able to gather for 10.1: Leddar_10.1_outputs.zip

    I was having some trouble getting the results for 10.00.05 with your artifacts, which seemed to be the opposite of your problem, since 10.00.05 was working on your end, so I had been trying to debug that. Is there another set of artifacts you use for 10.00.05? I assumed you used the same artifacts for both, but I tried both sets of artifacts you had shared with me in case it was an issue with one of them.

    I have also been trying to get your script to work on the side, but it is still giving a core dump whenever I use it. Because of this, I have been trying to create a script that does the same thing as yours to work around that problem.

    When I looked at these results, though I only checked some of the outputs (Key, TEgor and Value), I did notice that Value now had many zero outputs for 10.1. I am trying to replicate the results to see if this is consistent with the other images or if it may have been a user error on my end when switching around the artifacts and images.

    Also, I want to double-check that you were able to apply the additional C7x firmware patch described in the Notes section of the compatibility table. I assumed you had, since my colleague mentioned it (first response), but I want to make sure in case it was missed.

    However, since I am still in the process of making sure I can replicate your issue, there is a layer analysis you can do to further pinpoint where this mismatch may be happening in the model. The usual way is to set the debug level to 4 and compare the .y and .bin outputs. The instructions that further explain how to do this are here.
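
    In practice this just means adding the debug level to the compile/inference options, roughly like this (a sketch; the paths are placeholders):

    ```python
    compile_options = {
        "tidl_tools_path": "/path/to/tidl_tools",  # placeholder path
        "artifacts_folder": "./artifacts",
        "debug_level": 4,  # dumps per-layer traces (.bin/.y files) that can be diffed between runs
    }
    ```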

    There is another tool called layer_trace_inspector.py, which is a bit more complex to set up, but once done, it will help with debugging. The link for it is here.

    Since the main issue concerns the artifacts, just using these to confirm whether the artifacts are being passed through the model properly for both versions is a good test to do.

    Apologies again for the delay in the response. I was trying to make sure I included everything I had looked into even though they are still in progress.

    Warm regards,

    Christina

  • Hi Christina,

    I have two different sets of artifacts. The 10.00.05 set, which is backward compatible with SDK 9.2, is working correctly. I didn't send it to you, though, as I wanted you to focus on the non-working artifacts from 10.01.00.04.

    When do you receive the core dump while running my script? During the inference?

    What do you mean by "Value now had many zero outputs for 10.1"?

    I'm aware of the compatibility table; it is relevant only to the working 10.00.05. That version is working correctly on our side, with all the needed patches installed while building the SDK.

    Thank you,

    Alex.

  • Thank you for the clarification, Alex. If I could also receive the artifacts you have for 10.00.05, it would be useful for the Jira, both to confirm whether this issue is actually consistent when comparing the two and to see what the expected outputs are meant to be.

    Yes, whenever I ran your script, I would get a segmentation fault* during inference (I mistyped earlier).

    The Value.bin output, when converted to text for both the 32-bit and 16-bit runs, showed multiple zero outputs when I initially passed through it. The other outputs I checked didn't have this issue, and I wasn't sure if those were the correct outputs either (which is why the 10.00.05 artifacts, to create the right outputs for comparison, would help). I am trying to double-check whether this is consistent with the other images you included right now.

    I'm glad you checked against the compatibility table for 10.01.04.00. I just wanted to make sure, since this is something other customers miss.

    I will send another update soon.

    Warm regards,

    Christina

  • Hi Christina,

    Thank you for a quick update.

    Please find attached the 10.00.05 artifacts, per your request.

    Artifacts_10_00_05.tar.gz

  • Hi Alex,

    Here are the outputs for 10.1.4.0 after running a second time and ensuring everything was right: Leddar_trial2.zip

    Unfortunately, I couldn't get the new artifacts you shared for 10.0.5.0 to work on my end still as it would stop halfway during compiling. Chris is going to see if he can on his end. 

    Just a note that TI has a holiday tomorrow for Good Friday, and I will not be able to respond until Monday. But do know that I am working solely on your E2E thread at this point in time. I will give you an update on Monday on my progress.

    Warm regards,

    Christina

  • Hi Christina, 

    What do you mean by saying "Unfortunately, I couldn't get the new artifacts you shared for 10.0.5.0 to work on my end still as it would stop halfway during compiling"? Why are you compiling the artifacts? You have them so you can run inference with them and test the outputs against the 32-bit inference.

    Thank you,

    Alex.

  • Hi Alex,

    Sorry for the mistype; I meant that when running inference, the new artifacts for 10.0.5.0 were not working on my end. I was running inference on them, not compiling them.

    I am going to be working on this today and will keep you updated. 

    Thank you,

    Christina

  • Hi Alex, are you able to share the outputs you get on your end when running on 10.0.5.0 for both 16 and 32 bit?

    Thanks,

    Christina

  • Hi Christina,

    Yes, here they are; I don't know the order of the outputs, though, as I print all of them:

    By the way, I think I know why you are getting the core dump while running the 10_00_05 artifacts - you need to comment out so1.graph_optimization_level = rt.GraphOptimizationLevel.ORT_DISABLE_ALL, as the artifacts were compiled without this argument. I started using it on Chris's recommendation to get rid of some newly appearing warnings about Slice operations with tidl_tools 10.01.

    Thank you,

    Alex.

  • Hi Alex,

    Couple of updates and some steps to move forward.

    Updates: 

    1. I was able to fix the hanging issue I had with the 10_00_05_00 artifacts yesterday by commenting out rt.GraphOptimizationLevel.ORT_DISABLE_ALL, and I generated the outputs (your deduction was correct).

    2. I had asked for your outputs to compare, but since I wasn't sure which output was which, I couldn't really use them. I have attached the outputs I generated yesterday below.

    Leddar_trial2_100.zip

    3. When comparing the Value bins, I could see that 10.1 and 10.0 had different numbers for both the 16-bit and 32-bit outputs, with 10.1 having many zeros. So I was able to see the mismatch between them.

    Steps moving forward: 

    Based on my understanding, both versions (10.1 and 10.0) need to have rt.GraphOptimizationLevel.ORT_DISABLE_ALL either enabled or disabled when comparing. This setting causes the model to behave differently, which results in the mismatch when comparing. Could you create the artifacts for 10.0 with it enabled and then run the comparison between them? This would be a more accurate assessment of whether they are indeed different when it is enabled.
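
    In other words, whichever you choose, keep the setting identical for the two compilations, e.g. (a sketch):

    ```python
    import onnxruntime as rt

    DISABLE_ORT_OPTIMIZATIONS = True  # keep this identical for the 10.0 and 10.1 compilations

    so = rt.SessionOptions()
    if DISABLE_ORT_OPTIMIZATIONS:
        so.graph_optimization_level = rt.GraphOptimizationLevel.ORT_DISABLE_ALL
    ```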

    I can submit a Jira once you either check this or send over the new artifacts for 10.0 with rt.GraphOptimizationLevel.ORT_DISABLE_ALL enabled. 

    The Jira issue will depend on the results of this comparison:

    1. If the results of 10.0 and 10.1 are comparable after rt.GraphOptimizationLevel.ORT_DISABLE_ALL is enabled on both, then we will know the issue is that accuracy with this setting enabled is unacceptable for both versions.

    2. If the results of 10.0 and 10.1 are not comparable after rt.GraphOptimizationLevel.ORT_DISABLE_ALL is enabled on both, with 10.1 still having the worse values, then the issue is possibly a bug only in 10.1.

    Looking forward to hearing back. Please let me know if you have any questions.

    Warm regards,

    Christina

  • Hi Christina,

    I compiled the artifacts for 10.0 with rt.GraphOptimizationLevel.ORT_DISABLE_ALL enabled, as you requested. I receive the same accuracy results as before.

    Just to explain why I started using this flag with 10.1: it all began with some sudden warnings I received when compiling the same ONNX model with 10.1. The warnings were due to an unsupported Slice operation, which was strange as there was no such problem with 10.00.05. The artifacts didn't work either. After opening this bug with TI, they suggested adding this flag, which resolved the warnings but not the main accuracy issue.

    I attached the 10.00.05 artifacts compiled with the flag so that we are matched in this regard with 10.01.

    Artifacts_10.00.05_with_ORT_DISABLE_ALL .tar.gz 

    Thank you,

    Alex.

  • Hi Christina,

    I also tried to use the layer_trace_inspector tool you suggested in the thread above. However, I got stuck, as I don't understand what the contents of the ONNX traces folder should be. I used the debug_level=4 option to export the per-layer binary files for the quantized 16-bit model and also for the 32-bit original model, but it looks like the tool expects something else for the ONNX traces.

    Could you please help me understand what is needed to extract the ONNX traces?

    Thank you,

    Alex.

  • Hi Alex,

    Since you received the same error with these new artifacts, I just need to recreate it on my end and submit the Jira. I will let you know once I have completed this with the Jira number.

    The layer_trace_inspector tool has to be run under tidl_tools using the import and inference config files. I do think this may take too long to set up properly, so I recommend using the OSR debug flow: https://github.com/TexasInstruments/edgeai-tidl-tools/blob/master/docs/tidl_osr_debug.md
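
    If you do want to generate ONNX-side reference traces for a per-layer comparison, one common approach (a sketch only; the input name and output folder are placeholders, and the exact trace format the tools expect is described in the page above) is to expose the intermediate tensors as graph outputs and run the float model once on CPU:

    ```python
    import os
    import numpy as np
    import onnx
    import onnxruntime as rt
    from onnx import TensorProto, helper, shape_inference

    def dump_onnx_traces(model_path, input_name, input_tensor, out_dir="onnx_traces"):
        """Run the float ONNX model on CPU and save every intermediate tensor to disk."""
        model = shape_inference.infer_shapes(onnx.load(model_path))
        value_info = {vi.name: vi for vi in model.graph.value_info}
        existing = {o.name for o in model.graph.output}

        for node in model.graph.node:
            for name in node.output:
                if name and name not in existing:
                    # Expose the intermediate tensor as an additional graph output,
                    # falling back to a float tensor if shape inference gave no type.
                    vi = value_info.get(name) or helper.make_tensor_value_info(
                        name, TensorProto.FLOAT, None)
                    model.graph.output.append(vi)
                    existing.add(name)

        patched_path = model_path.replace(".onnx", "_all_outputs.onnx")
        onnx.save(model, patched_path)

        sess = rt.InferenceSession(patched_path, providers=["CPUExecutionProvider"])
        outputs = sess.run(None, {input_name: input_tensor})

        os.makedirs(out_dir, exist_ok=True)
        for meta, arr in zip(sess.get_outputs(), outputs):
            np.save(os.path.join(out_dir, meta.name.replace("/", "_") + ".npy"), np.asarray(arr))
    ```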

    Warm regards,

    Christina

  • Hi Alex,

    I have updated Jira-7414, and someone on the Dev team is now working on reproducing the issue on their end. I will keep you updated on any new information I receive.

    Warm regards, 

    Christina

  • Hi Alex, 

    Our Dev team has looked through your full_test.tar.gz file and observed that random data is being used for calibration while the images are being used for inference. Since this approach isn't suitable for accurately assessing performance degradation, could you fix this so it uses images for both calibration and inference?
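
    In the OSRT flow, calibration happens through the inference calls made during compilation, so the fix is roughly to feed the same preprocessed images there (a sketch; preprocess(), the compilation session, and the input name are placeholders for whatever the script already defines):

    ```python
    import glob

    def run_calibration(compile_sess, input_name, preprocess, image_dir="calibration_dataset"):
        """Feed real preprocessed images through the TIDL compilation session so the
        calibration statistics come from actual data rather than random tensors."""
        for path in sorted(glob.glob(f"{image_dir}/*.png")):
            inp = preprocess(path)                     # placeholder helper: resize/normalize to the model input
            compile_sess.run(None, {input_name: inp})  # each run during compilation feeds calibration
    ```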

    Warm regards,

    Christina

  • Hi Christina,

    The script is not used to build the real artifacts for inference. It is used to build artifacts for measuring inference time, and to run the real artifacts for accuracy tests. If the Dev team needs to build the artifacts themselves, I can share the calibration dataset I'm using to create the real artifacts.

    Thank you,

    Alex.  

  • Hi Alex,

    I have communicated your response to the Dev team. Thank you for the clarification; I have made an additional note on the proper use of your artifacts.

    Warm regards,

    Christina

  • Hi Alex, could you share your calibration dataset? The Dev team is going to double-check with it, but may need some time since they are also working on the 11.0 release.

    Warm Regards,

    Christina

  • Hi Christina, 

    Please find attached.

    calibration_dataset.tar.gz

    Thank you,

    Alex.

  • Thank you Alex. I have added it to the Jira.

    Warm regards,

    Christina

  • Hi Alex,

    Just an update: they have added your model to the test cases for the latest version of TIDL tools, v11.0, and the results are consistent with 10.05.

    Warm regards,

    Christina

  • Hi Alex, 

    TIDL 11.0 just got released. Here is the github: https://github.com/TexasInstruments/edgeai-tidl-tools/releases

    This should have the fixes for the accuracy issues you saw. Please reopen this thread if you continue to have holdups while testing this new version.

    Warm regards,

    Christina