TDA4VM: Questions about Quantization Aware Training (QAT) using edgeai-torchvision repository

KONG XIANGXU

Part Number: TDA4VM

Hi there,

We are currently using QAT to fine-tune a yolov5s model using edgeai-torchvision repository(https://github.com/TexasInstruments/edgeai-torchvision/blob/master/docs/pixel2pixel/Quantization.md)

The following is what we've tried so far,

Trained a yolov5s
Wrap the model with xnn module
load the checkpoint

But the training crashed at the first epoch because of NAN as you can see in the image below,

And then we changed the order of checkpoint loading and model wrapping,

Trained a yolov5s
load the checkpoint
Wrap the model with xnn module

The model training runs well and can converge at the end.

Questions:

Which order of checkpoint loading and model wrapping is correct?
We met the accuracy drop problem after we compiled the QAT model into TIDL model (sdk_version: 08_04, calibrationOption is set to be 64), is it because of the order changing?

Thanks in advance!

over 2 years ago

0 Manu Mathew over 2 years ago

TI__Genius 11226 points

1. When you load checkpoint state_dict into the model, you will get a list of layer parameters that are not correctly loaded if there is a mismatch. In the both order of checkpoint loading that you reported, which one gives the warning and which one doesn't?

2. To correctly measure the accuracy for object detection models, there are other parameters to be set such that it matches with what was used in training. For example confidence_threshold, top_k, nms_threshold etc.

Debugging accuracy issues are tricky and a step by step and methodical approach is needed.

First, using the float model (not QAT model) measure the accuracy with TIDL float model (tensor_bits=32). Are you getting the expected accuracy? If you are not, ask why are you not getting the accuracy? May be some resize/crop, mean/scale params are not as expected? May be the detection params confidence_threshold, top_k, nms_threshold etc are not correct?

Secondly,using the float model measure the accuracy with 16bit mode. 16bit mode accuracy should be close to float accuracy. If it is not, again find out the reason.

Thirdly, using the float model measure the accuracy in 8bit mode. If the accuracy in first two steps are good and the accuracy in this third step is poor then may be we need to adjust some parameters. Perhaps advanced calibration? Perhaps some layers in 16bits mixed precision).

Only if all this fails should we try QAT.

0 KONG XIANGXU over 2 years ago in reply to Manu Mathew

Expert 1571 points

Thank you for your reply!

>>> Neither of them gives the warning. Does it mean all of layer parameters are loaded correctly and the order of checkpoint loading does not matter when the model to be loaded is a pre-train model (not a QAT fine-tune model)? If so, why does one training got NAN problem, and the other one run well?

2. Debugging accuracy issues.

>>> Yes, the work flow you mentioned above is exactly the flow we have done so far and it doesn't work. That's why we switch to QAT.

0 Manu Mathew over 2 years ago in reply to KONG XIANGXU

TI__Genius 11226 points

Can you confirm if you got the required accuracy in 16bit mode?

QAT: If you can attach the log, we can see whether the checkpoint is loaded correctly. QAT has some additional parameters (they are called clips_act) - at least there should be a warning about those, because they are not in the original checkpoint.

PTQ with mixed precision: We have had good success with PTQ with mixed precision. The final prediction layers (The convolutions layers that produces the classification, regression, confidence outputs) are the right candidates to be put in 16bit. If you can share the .svg file generated during TIDL impoirt, we can look at it and confirm if those layers indeed are put correctly into 16bits.

Processors

Processors forum

TDA4VM: Questions about Quantization Aware Training (QAT) using edgeai-torchvision repository