Hi there,
We are currently using QAT to fine-tune a yolov5s model with the edgeai-torchvision repository (https://github.com/TexasInstruments/edgeai-torchvision/blob/master/docs/pixel2pixel/Quantization.md).
The following is what we've tried so far:
But the training crashed in the first epoch due to NaN losses, as you can see in the image below.
We then swapped the order of checkpoint loading and model wrapping,
and with that change the training runs well and converges.
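In toy form, the two orders look roughly like this. QATWrapper below is a stand-in for the actual edgeai-torchvision QAT wrapper (whose real name and key handling may differ); this is only a sketch of the ordering, not the repository's API:

```python
import torch
import torch.nn as nn

class QATWrapper(nn.Module):
    """Toy stand-in for a QAT wrapper module (not the real edgeai API)."""
    def __init__(self, m):
        super().__init__()
        self.module = m

def make_float_model():
    return nn.Linear(4, 2)

ckpt = make_float_model().state_dict()  # a float (pre-trained) checkpoint

# Order A (crashed with NaN for us): wrap first, then load the checkpoint.
# With this plain wrapper the checkpoint keys lack the 'module.' prefix,
# so nothing matches; strict=False reports rather than raises.
model_a = QATWrapper(make_float_model())
result_a = model_a.load_state_dict(ckpt, strict=False)

# Order B (trains and converges): load the checkpoint first, then wrap.
float_model = make_float_model()
float_model.load_state_dict(ckpt)
model_b = QATWrapper(float_model)
```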
Questions:
Thanks in advance!
1. When you load a checkpoint state_dict into the model, you get a list of layer parameters that were not loaded correctly if there is a mismatch. Of the two checkpoint-loading orders you reported, which one gives this warning and which one doesn't?
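The mismatch report can be inspected directly from the return value of `load_state_dict(..., strict=False)`. Below is a toy sketch; the `clips_act` buffer only mimics the kind of extra QAT parameter a float checkpoint would lack, and is an illustrative assumption, not the real wrapper's implementation:

```python
import torch
import torch.nn as nn

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)

class FakeQAT(nn.Module):
    """Toy QAT-style wrapper: adds an extra buffer the float checkpoint lacks."""
    def __init__(self, m):
        super().__init__()
        self.module = m
        self.register_buffer("clips_act", torch.tensor(8.0))

float_ckpt = Tiny().state_dict()
wrapped = FakeQAT(Tiny())

# strict=False returns the mismatches instead of raising an error:
result = wrapped.load_state_dict(float_ckpt, strict=False)
print(result.missing_keys)     # params in the model but not in the checkpoint
print(result.unexpected_keys)  # params in the checkpoint with no match
```

If both orders print empty lists, every parameter was matched; a non-empty list is the warning to look for.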
2. To measure the accuracy of object detection models correctly, other parameters such as confidence_threshold, top_k, nms_threshold, etc. must be set to match what was used in training.
Debugging accuracy issues is tricky, and a methodical, step-by-step approach is needed.
First, using the float model (not the QAT model), measure the accuracy with the TIDL float mode (tensor_bits=32). Are you getting the expected accuracy? If not, ask why: maybe some resize/crop or mean/scale params are not as expected? Maybe the detection params confidence_threshold, top_k, nms_threshold, etc. are not correct?
Second, using the float model, measure the accuracy in 16-bit mode. 16-bit accuracy should be close to float accuracy. If it is not, again find out the reason.
Third, using the float model, measure the accuracy in 8-bit mode. If the accuracy in the first two steps is good and the accuracy in this third step is poor, then we may need to adjust some parameters: perhaps advanced calibration, perhaps some layers in 16-bit mixed precision.
Only if all this fails should we try QAT.
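The three steps can be sketched as successive compile configurations. The option names below follow the style of edgeai-tidl-tools runtime options (`tensor_bits`, `accuracy_level`, `advanced_options:...`); verify the exact keys against your SDK version — this is a sketch, not a definitive config:

```python
# Sketch of the three debugging steps as successive compile configs.
# Option names follow edgeai-tidl-tools conventions; verify against your SDK.
base = {
    "artifacts_folder": "./artifacts",
    "advanced_options:calibration_frames": 25,
    "advanced_options:calibration_iterations": 25,
}

step1 = dict(base, tensor_bits=32)  # float reference: should match training accuracy
step2 = dict(base, tensor_bits=16)  # should be close to the float accuracy
step3 = dict(base, tensor_bits=8)   # if only this is poor, tune calibration

# For step 3, advanced calibration and selective 16-bit layers can help:
step3_tuned = dict(
    step3,
    **{
        "accuracy_level": 1,  # enables advanced (e.g. bias) calibration
        # comma-separated layer output names to keep in 16 bits:
        "advanced_options:output_feature_16bit_names_list": "",
    },
)
```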
Thank you for your reply!
1. When you load a checkpoint state_dict into the model, you get a list of layer parameters that were not loaded correctly if there is a mismatch. Of the two checkpoint-loading orders you reported, which one gives this warning and which one doesn't?
>>> Neither of them gives the warning. Does that mean all layer parameters are loaded correctly, and the order of checkpoint loading does not matter when the loaded model is a pre-trained model (not a QAT fine-tuned one)? If so, why does one training run hit the NaN problem while the other runs well?
2. Debugging accuracy issues.
>>> Yes, the workflow you mentioned above is exactly the flow we have followed so far, and it doesn't work. That's why we switched to QAT.
Can you confirm whether you got the required accuracy in 16-bit mode?
QAT: If you can attach the log, we can check whether the checkpoint is loaded correctly. QAT has some additional parameters (called clips_act); at the very least there should be a warning about those, because they are not in the original checkpoint.
PTQ with mixed precision: We have had good success with PTQ with mixed precision. The final prediction layers (the convolution layers that produce the classification, regression, and confidence outputs) are the right candidates to put in 16-bit. If you can share the .svg file generated during TIDL import, we can look at it and confirm whether those layers are indeed placed in 16-bit.
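As a plain-Python sketch of that selection rule (a stand-in for walking the actual ONNX graph; the node names below are invented for illustration, not from any real yolov5s export):

```python
def pick_16bit_candidates(nodes, graph_outputs):
    """Return outputs of Conv nodes that feed the graph outputs directly.

    nodes: list of (op_type, output_name) pairs, a toy stand-in for ONNX nodes.
    graph_outputs: set of graph output tensor names.
    """
    return [out for op, out in nodes if op == "Conv" and out in graph_outputs]

# Toy detection head: three prediction convs plus a backbone conv.
nodes = [
    ("Conv", "backbone_feat"),
    ("Conv", "cls_out"),   # classification head
    ("Conv", "reg_out"),   # box regression head
    ("Conv", "obj_out"),   # objectness/confidence head
]
print(",".join(pick_16bit_candidates(nodes, {"cls_out", "reg_out", "obj_out"})))
# -> cls_out,reg_out,obj_out
```

The resulting comma-separated list is the kind of value that would go into the 16-bit mixed-precision layer list at import time.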