
TDA4VM: Quantization Aware Training (QAT) using edgeai-torchvision repository

Part Number: TDA4VM
Hi,

The training flow I'm using is as follows (a minimal sketch is shown after the list):

  • Taking a trained model that I want to use and loading its weights
  • Replacing the relevant layers with the TI implementations (Upsample -> xnn.layers.ResizeWith, Concat -> xnn.layers.CatBlock)
  • Wrapping the modified model with the xnn.quantize.QuantTrainModule class
  • Training for 50 epochs with "small" learning rate values
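
A minimal sketch of this flow, assuming xnn comes from the edgeai-torchvision install (the exact import path, the ResizeWith/CatBlock call signatures and the QuantTrainModule arguments are assumptions based on the repository's examples, so please verify them there):

    import torch
    import torch.nn as nn
    # assumption: the xnn import path depends on the edgeai-torchvision release
    from torchvision.edgeailite import xnn

    class TinyNet(nn.Module):
        """Stand-in for the real model, only to illustrate the flow."""
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(3, 8, 3, padding=1)
            self.bn = nn.BatchNorm2d(8)
            self.relu = nn.ReLU()
            # manual TI-friendly replacements, done before wrapping:
            self.up = xnn.layers.ResizeWith(scale_factor=2, mode='nearest')  # instead of nn.Upsample
            self.cat = xnn.layers.CatBlock()                                 # instead of torch.cat

        def forward(self, x):
            y = self.relu(self.bn(self.conv(x)))
            return self.cat((self.up(y), self.up(y)))

    model = TinyNet()
    # model.load_state_dict(torch.load('float_model.pth'))  # trained float weights go here

    # wrap the modified model for QAT
    dummy_input = torch.rand(1, 3, 128, 128)
    qat_model = xnn.quantize.QuantTrainModule(model, dummy_input=dummy_input)

    # fine-tune for ~50 epochs with a small learning rate
    optimizer = torch.optim.SGD(qat_model.parameters(), lr=1e-4, momentum=0.9)
    for epoch in range(50):
        if epoch == 25:  # optionally freeze BN stats and quant ranges part-way, as in the guide
            xnn.utils.freeze_bn(qat_model)
            xnn.layers.freeze_quant_range(qat_model)
        # ... usual training loop over qat_model goes here ...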

These steps were repeated with many hyperparameter changes (LR changes, adding/removing weight decay):

  • Tried with/without freezing the BatchNorm layers and the quantization ranges as advised in the guide (using xnn.utils.freeze_bn(model) and xnn.layers.freeze_quant_range(model))
  • During training, the training loss is much larger than what I get without the QAT flow (about 5x larger), and none of the described experiments improves it
The inference flow I'm using with the trained QAT model (sketched after the list):
  • When testing the trained model, I use torch.load + load_state_dict, loading the weights into the model after the same layer replacements (concat, upsample) done before training
  • The tested model has severe localization problems; the accuracy drops dramatically
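
The loading side as a sketch (same assumptions about the xnn import and the wrapper as above; TinyNet is the stand-in model from the previous sketch):

    import torch
    from torchvision.edgeailite import xnn  # assumption: adjust to your install

    # rebuild the model exactly as before QAT training (same layer replacements),
    # wrap it, and only then load the QAT checkpoint
    model = TinyNet()
    dummy_input = torch.rand(1, 3, 128, 128)
    qat_model = xnn.quantize.QuantTrainModule(model, dummy_input=dummy_input)

    checkpoint = torch.load('qat_checkpoint.pth', map_location='cpu')
    qat_model.load_state_dict(checkpoint)  # or checkpoint['state_dict'], depending on how it was saved
    qat_model.eval()

    with torch.no_grad():
        out = qat_model(torch.rand(1, 3, 128, 128))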

I would appreciate your guidance on whether the described flow seems reasonable and whether there is anything else worth trying (hyperparameter tuning or other methods).
Thanks in advance!
  • Hi,

    Your explanation seems correct. But let us analyze further to see if you have missed anything.

    1. Can you try the QAT flow on a classification model and make sure the flow works? (A minimal sanity-check sketch follows below.)

    2. Please share the ONNX model exported with QAT (from the flow above).
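
    For point 1, a hedged sketch of such a sanity check on a stock torchvision classifier (the xnn import path, the QuantTrainModule arguments and the export call are assumptions; the QAT scripts in edgeai-torchvision are the reference):

    import torch
    import torchvision
    from torchvision.edgeailite import xnn  # assumption: adjust to your install

    model = torchvision.models.resnet18(pretrained=True)
    dummy_input = torch.rand(1, 3, 224, 224)
    qat_model = xnn.quantize.QuantTrainModule(model, dummy_input=dummy_input)

    # ... fine-tune qat_model for a few epochs on the classification data ...

    # export to ONNX for sharing and inspection
    qat_model.eval()
    torch.onnx.export(qat_model, dummy_input, 'resnet18_qat.onnx', opset_version=11)
    # note: some releases export qat_model.module instead; check the QAT documentation in the repo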

  • Hi,

    I have hit the same issue when trying to QAT a YOLOX model, although I didn't take step 2 of your flow. In my case, the loss decreased in the first few epochs, then blew up and the training soon crashed with NaNs. I don't think this is related to the LR, but I somehow fixed it; here are my attempts:

    - Using weight_decay for all params in the model optimizer (for QAT I use SGD with weight_decay, momentum and nesterov; see the optimizer sketch after this list)

    - Making the task easier (in my case, YOLOX uses multi-scale training; I fixed the scale to (640, 640), and the loss and the whole QAT process became much more stable)
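
    For illustration, the optimizer setup I mean (plain torch.optim; the exact values here are placeholders):

    import torch

    optimizer = torch.optim.SGD(
        model.parameters(),   # weight_decay applied to all params
        lr=1e-3,
        momentum=0.9,
        weight_decay=5e-4,
        nesterov=True,
    )
    # and the input scale kept fixed at (640, 640) instead of multi-scale training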

    But I'm just wondering whether I should do step 2, replacing layers with xnn.layers before wrapping the model, because from the EdgeAI_MMDet code it looks like you don't need to do this manually.

  • - Making the task easier (in my case, YOLOX uses multi-scale training; I fixed the scale to (640, 640), and the loss and the whole QAT process became much more stable)

    [Manu] This is a good observation. It makes sense: a fixed image resize may make the whole training easier, especially with QAT.

    - But I'm just wondering whether I should do step 2, replacing layers with xnn.layers before wrapping the model, because from the EdgeAI_MMDet code it looks like you don't need to do this manually.

    [Manu] I did not understand. Can you please explain this further?

    By "should I do step 2" I mean:

    Should my training flow include the step of replacing the relevant layers with the TI implementations (Upsample -> xnn.layers.ResizeWith, Concat -> xnn.layers.CatBlock), or not?

    I didn't do this before wrapping the model, but it still works, and the official code in Edgeai_MMdet doesn't seem to do this either.

    >>>I didn't do this before wrapping the model, but it still works

    The changes are for the model to work efficiently in TIDL and to get good accuracy. By saying "it still works" do you mean it worked as expected in TIDL?

    Regarding Upsample/interpolation, if the code is using torch.nn.functional.interpolate, the following may be sufficient: https://github.com/TexasInstruments/edgeai-mmdetection/blob/master/tools/train.py#L193

    But if it is using the torch.nn.Upsample module, then you may have to change it to use xnn.layers.ResizeWith.
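
    A sketch of what that module swap might look like (the ResizeWith constructor arguments are an assumption; check xnn.layers in edgeai-torchvision):

    import torch.nn as nn
    from torchvision.edgeailite import xnn  # assumption: adjust to your install

    def replace_upsample_with_resize(module):
        """Recursively replace nn.Upsample modules with xnn.layers.ResizeWith."""
        for name, child in module.named_children():
            if isinstance(child, nn.Upsample):
                setattr(module, name,
                        xnn.layers.ResizeWith(scale_factor=child.scale_factor, mode=child.mode))
            else:
                replace_upsample_with_resize(child)

    # call before wrapping with QuantTrainModule:
    # replace_upsample_with_resize(model)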

    Thanks. Yes, "it still works" means it worked as expected in TIDL (from the results of a few test images so far), and the ONNX graph looks clean, I think. I have checked the QAT documentation and I'm confused: it mentions that one also needs to use xnn.layers.AddBlock for add and xnn.layers.CatBlock for concat. Is all of this necessary? Or is it done automatically by wrapping with xnn.quantize.QuantTrainModule?

  •  Use of xnn.layers.AddBlock for add and xnn.layers.CatBlock for concat will make sure that there are Clip layers after those operators to inform correct quantization ranges to TIDL. This is for accuracy.

    In the absence of those Clip layers in the ONNX model, TIDL will have to compute the ranges during calibration, which may be suboptimal.
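
    A sketch of how those modules are used in a forward function (the call signatures of AddBlock/CatBlock are assumptions; verify against xnn.layers):

    import torch.nn as nn
    from torchvision.edgeailite import xnn  # assumption: adjust to your install

    class FuseBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
            # modules instead of bare operators, so the QAT wrapper can learn their
            # output ranges and the exported ONNX carries Clip layers after them
            self.add = xnn.layers.AddBlock()  # instead of: x1 + x2
            self.cat = xnn.layers.CatBlock()  # instead of: torch.cat((x1, x2), dim=1)

        def forward(self, x):
            x1 = self.conv1(x)
            x2 = self.conv2(x)
            y = self.add((x1, x2))            # assumed signature: a tuple/list of tensors
            return self.cat((x1, y))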

    Thanks a lot. Let me check whether I got your point right:

    >>>| Use of xnn.layers.AddBlock for add and xnn.layers.CatBlock for concat will make sure that there are Clip layers after those operators to inform correct quantization ranges to TIDL. This is for accuracy.

    Does that mean the output may be out of range, leading to an unknown result?

    >>>| In the absence of those Clip layers in the ONNX model, TIDL will have to compute the ranges during calibration, which may be suboptimal.

    Does that mean TIDL will fix this bug given more time?

    >>>Does that mean the output may be out of range, leading to an unknown result?

    Depends on the calibration set.

    >>>Does that mean TIDL will fix this bug given more time?

    This is not a bug, but rather a feature of TIDL. If Clip range is missing for a certain layer, TIDL will compute it during calibration. But if it is present, it will not compute it, but merely use what is available.

    Got it. So the standard process is:

    1. Train a model regularly

    2. Replace nn.functional.interpolate / nn.Upsample with xnn.layers.ResizeWith, the add operator with xnn.layers.AddBlock, and torch.cat with xnn.layers.CatBlock, etc.

    3. Wrap the model with xnn.quantize.QuantTrainModule

    4. Train the QAT model

    Thanks. Another ambiguous problem I have met is this:

    My network class contains a loss computation function; during training, the loss is computed internally in the forward function after the network output is obtained,

    e.g.:

    if self.training:
        out = self.head(x)
        loss = self.loss_fn(torch.cat(out, 1))
        return loss
    else:
        out_reg = self.head_reg(x)
        out_obj = self.head_obj(x)
        out_cls = self.head_cls(x)
        return torch.cat((out_reg, out_obj.sigmoid(), out_cls.sigmoid()), 1)

    I have checked that the score in my task drops a lot when I test on the TDA4 compared to PyTorch.

    One probable issue is the torch.cat, because the output includes reg, obj and cls: obj and cls are between (0, 1) while reg's range is much larger, so quantizing them together will be a problem. I'm trying not to use xnn.layers.CatBlock, but I don't know whether that will fix it; otherwise I'll try not to use torch.cat at all.

    Another tricky thing I want to ask about: should I move the loss computation out of the network? loss_fn contains lots of tensor operators, and I don't know whether keeping it inside the network would cause problems during QAT.
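
    For reference, the restructuring I am considering (the names here are from my own code, not from the TI repositories): keep loss_fn outside the model so that only the tensor-producing layers are wrapped for QAT, and concatenate the differently-ranged outputs only in post-processing, outside the quantized graph.

    import torch.nn as nn

    class Head(nn.Module):
        def __init__(self, in_ch, num_classes):
            super().__init__()
            self.head_reg = nn.Conv2d(in_ch, 4, 1)
            self.head_obj = nn.Conv2d(in_ch, 1, 1)
            self.head_cls = nn.Conv2d(in_ch, num_classes, 1)

        def forward(self, x):
            # the network only produces raw outputs: no loss, no post-processing
            return self.head_reg(x), self.head_obj(x), self.head_cls(x)

    # training: the loss is computed outside the QAT-wrapped model
    #   out_reg, out_obj, out_cls = qat_model(images)
    #   loss = loss_fn(out_reg, out_obj, out_cls, targets)

    # inference: concatenate only in post-processing, so reg (large range) and the
    # obj/cls sigmoids (0..1) are not forced into a single quantization range
    #   preds = torch.cat((out_reg, out_obj.sigmoid(), out_cls.sigmoid()), dim=1)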

    You can first try your float model in onnxruntime (with onnxruntime-tidl you can set tensor_bits: 32 for float simulation mode on PC). You can also use the 16-bit mode (tensor_bits: 16). This will rule out basic issues, including pre/post-processing. Once all that is sorted out, you can check the accuracy with 8-bit quantization.
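
    For example, a quick float sanity check of the exported ONNX model in plain onnxruntime (tensor_bits itself is a TIDL compile option used with onnxruntime-tidl and is not shown here):

    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession('model_qat.onnx', providers=['CPUExecutionProvider'])
    input_name = sess.get_inputs()[0].name
    x = np.random.rand(1, 3, 640, 640).astype(np.float32)  # replace with a real preprocessed image
    out = sess.run(None, {input_name: x})
    # compare `out` with the PyTorch output for the same preprocessed input to rule out
    # pre/post-processing mismatches before moving to the 16-bit and 8-bit TIDL runs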

    Is onnxruntime-tidl a package like onnxruntime that I can simply use in Python, or do I have to compile it and use it from C?

  • It is TI's fork of onnxruntime with the ability to offload to TIDL in the backend. It is a package that is installed when you run the setup of edgeai-tidl-tools or edgeai-benchmark

    https://github.com/TexasInstruments/edgeai-tidl-tools/blob/master/setup.sh#L293

    https://github.com/TexasInstruments/edgeai-benchmark/blob/master/setup_pc.sh#L109

    Could you please also help me check this out?

    In the edgeai-yolox repository, at https://github.com/TexasInstruments/edgeai-yolox/blob/main/yolox/models/yolo_head.py#L165,

    you can see that there is a lot of post-processing in the forward function. Will this post-processing affect the quantization? Should I move it all out of the forward function?

    Hi, could you please help me with my question? Thanks.

    edgeai-yolox is our repository and is supported in our model zoo and in the edgeai-benchmark (https://github.com/TexasInstruments/edgeai-benchmark/tree/master) compilation tool.

    After training in edgeai-yolox, you can export an ONNX model and prototxt using this export script. Then you can compile the model with edgeai-benchmark.

    Take a look at how a yolox config is to be defined for compilation: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/configs/detection_experimental.py#L59

    These are the scripts to be used for custom model compilation:

    https://github.com/TexasInstruments/edgeai-benchmark/blob/master/run_custom_pc.sh

    https://github.com/TexasInstruments/edgeai-benchmark/blob/master/scripts/benchmark_custom.py

    In benchmark_custom.py, add a yolox config to pipeline_configs (you can remove the existing ones) and then run run_custom_pc.sh.
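
    Roughly, the edit in benchmark_custom.py is along these lines (an illustrative outline only; copy the actual fields from the yolox entry in detection_experimental.py linked above):

    # inside scripts/benchmark_custom.py (outline only, not a working config)
    pipeline_configs = {
        'od-custom-yolox': dict(
            # fill in by copying the yolox config from configs/detection_experimental.py
            # and pointing it at your own exported onnx model and prototxt:
            #   - preprocessing (resize, mean/scale)
            #   - session / runtime options (model path, tensor_bits, ...)
            #   - postprocessing and metric
        ),
    }
    # then run: ./run_custom_pc.sh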

    What I explained above is the Post Training Quantization (PTQ) workflow. We also have mixed-precision support: it is possible to put certain layers alone into 16 bits. For example, see this: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/configs/detection_experimental.py#L65

    We have seen that putting the last convolution layers that produce the predictions into 16 bits improves the accuracy significantly.
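
    For example, the relevant fragment of the runtime/compile options in such a config might look like this (the layer names are placeholders; use the actual output convolution names from your ONNX graph and verify the option key against the linked config):

    # illustrative fragment of the TIDL runtime/compile options for mixed precision
    runtime_options = {
        'tensor_bits': 8,
        # keep the final prediction convolutions in 16 bits for better accuracy
        'advanced_options:output_feature_16bit_names_list': '<conv_out_1>, <conv_out_2>, <conv_out_3>',
    }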

    Yes, thanks for your detailed reply. But we mostly use QAT, and I'm just wondering about the post-processing in the model's forward function: will something like make_grid (which has nothing to do with the network but is still in the forward function) affect the quantized training?