This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

TDA4VM: training model on edgeai-torchvision

Part Number: TDA4VM

Dear authors,

When I try to train my model on edgeai-torchvision, I get the following error:

Traceback (most recent call last):
File "train4.py", line 141, in
args.func(config, output_dir, args)
File "train4.py", line 93, in train_joint
train_agent.train()
File "/media/data3/edgeai-torchvision/Train_model_frontend.py", line 277, in train
loss_out = self.train_val_sample(sample_train, self.n_iter, True)
File "/media/data3/edgeai-torchvision/Train_model_heatmap.py", line 466, in train_val_sample
name="original_gt",
File "/media/data3/edgeai-torchvision/Train_model_heatmap.py", line 542, in get_residual_loss
labels_2D, heatmap, labels_res, patch_size=5, device=self.device
File "/media/data3/edgeai-torchvision/Train_model_heatmap.py", line 599, in pred_soft_argmax
label_idx.to(device), heatmap.to(device), patch_size=patch_size
File "/media/data3/edgeai-torchvision/utils/losses.py", line 96, in extract_patches
patches = _roi_pool(image, rois, patch_size=patch_size)
File "/media/data3/edgeai-torchvision/utils/losses.py", line 43, in _roi_pool
patches = roi_pool(pred_heatmap, rois.float(), (patch_size, patch_size), spatial_scale=1.0)
File "/media/data3/edgeai-torchvision/torchvision/ops/roi_pool.py", line 46, in roi_pool
output_size[0], output_size[1])
NotImplementedError: Could not run 'torchvision::roi_pool' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'torchvision::roi_pool' is only available for these backends: [CPU, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].

The problem seems to be that torchvision::roi_pool cannot run on the CUDA backend. I have tested my code on standard torchvision and it passed with no errors.
I have noticed that there is a roi_pool_kernel.cu file under edgeai-torchvision/torchvision/csrc/ops/cuda/roi_pool_kernel.cu, but it does not seem to have been compiled and registered correctly. Is this the problem? How can I amend this so that torchvision::roi_pool runs on the CUDA backend?
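For context, torchvision's C++/CUDA extensions are only compiled when a CUDA toolkit is detected (or forced) at build time; if the build ran without it, only the CPU kernels get registered, which matches the backend list in the error. A minimal sketch of a clean rebuild that forces the CUDA kernels (such as roi_pool_kernel.cu) to be compiled, assuming edgeai-torchvision uses the standard torchvision setup.py and a working CUDA toolkit is installed:

```shell
# Remove previously built extension artifacts so setup.py rebuilds from scratch
cd edgeai-torchvision
rm -rf build/ torchvision.egg-info/

# FORCE_CUDA=1 tells torchvision's setup.py to compile the CUDA kernels
# even if CUDA was not auto-detected during the earlier build
FORCE_CUDA=1 python setup.py develop
```

As a temporary workaround, the inputs to the roi_pool call can also be moved to CPU (`.cpu()`) for that one operation and the result moved back to the GPU afterwards, at some performance cost.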

Thanks in advance for helping me with this!

  • Which model are you trying to train? Can you try to train the model with the original pytorch/vision repository? github.com/.../vision

  • Hi Manu,

    Thanks for your quick reply,

The model I am using is SuperPoint, and it is not one of the models provided in edgeai-torchvision. Please find its GitHub repository at https://github.com/eric-yyjau/pytorch-superpoint .

I have tested this model with standard pytorch==1.3.1 and torchvision==0.4.2, and it works well. But when I transferred the model to edgeai-torchvision and tried to train it, things went wrong. As I mentioned above, the error (the torchvision::roi_pool operator cannot run on the CUDA backend) seems to point to roi_pool_kernel.cu. Although I followed the setup steps described in the README.md of the edgeai-torchvision repository, roi_pool_kernel.cu does not seem to have been compiled correctly.

Could you please look into this? Thanks in advance for helping me with it.

  • Hi Chris,

I do not have much expertise with the roi_pool operator. The torchvision version that you are using (0.4.2) is quite different from the version in edgeai-torchvision, and that could cause a problem.

    Can you help me understand this statement:

    >>>when I transferred this model to edgeai-torchvision and tried to train the model

If you are able to train in the pytorch-superpoint repository, why do you want to transfer the model to edgeai-torchvision?

  • Hi Manu,

    Thanks for your reply!

The reason is that I want to use the quantization tools in the edgeai-torchvision repository during training.

So I copied the code from the pytorch-superpoint repository into the edgeai-torchvision repository, and then used Quantization Aware Training to wrap the SuperPoint model during training.

I have noticed that the torchvision version in edgeai-torchvision is quite different from standard torchvision, but I still have no idea how to fix the problem described above.

    Still thanks so much for your quick answer, please let me know if there is any solution!

We plan to separate the quantization tools into their own repository in order to remove this dependency on a specific torchvision version.

For now, one option is to copy the xnn folder from edgeai-torchvision/vision/edgeailite/xnn into your repository and import it there.
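A minimal sketch of that suggestion, using the xnn path given above; the destination directory `your_project/` is a hypothetical placeholder for the pytorch-superpoint working copy:

```shell
# Copy the quantization utilities out of edgeai-torchvision
# into the root of your own training project
cp -r edgeai-torchvision/vision/edgeailite/xnn your_project/xnn
```

With the folder at the top level of the project, `import xnn` should resolve to the local copy, and the same QAT wrapper classes previously imported from edgeai-torchvision can be imported from there instead.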

  • Ok, great, thanks for this valuable information. I will try this immediately!