This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

TDA4VM: training model on edgeai-torchvision

Part Number: TDA4VM

Dear authors,

When I try to train my model on edgeai-torchvision, I get the following error:

Traceback (most recent call last):
File "train4.py", line 141, in
args.func(config, output_dir, args)
File "train4.py", line 93, in train_joint
train_agent.train()
File "/media/data3/edgeai-torchvision/Train_model_frontend.py", line 277, in train
loss_out = self.train_val_sample(sample_train, self.n_iter, True)
File "/media/data3/edgeai-torchvision/Train_model_heatmap.py", line 466, in train_val_sample
name="original_gt",
File "/media/data3/edgeai-torchvision/Train_model_heatmap.py", line 542, in get_residual_loss
labels_2D, heatmap, labels_res, patch_size=5, device=self.device
File "/media/data3/edgeai-torchvision/Train_model_heatmap.py", line 599, in pred_soft_argmax
label_idx.to(device), heatmap.to(device), patch_size=patch_size
File "/media/data3/edgeai-torchvision/utils/losses.py", line 96, in extract_patches
patches = _roi_pool(image, rois, patch_size=patch_size)
File "/media/data3/edgeai-torchvision/utils/losses.py", line 43, in _roi_pool
patches = roi_pool(pred_heatmap, rois.float(), (patch_size, patch_size), spatial_scale=1.0)
File "/media/data3/edgeai-torchvision/torchvision/ops/roi_pool.py", line 46, in roi_pool
output_size[0], output_size[1])
NotImplementedError: Could not run 'torchvision::roi_pool' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'torchvision::roi_pool' is only available for these backends: [CPU, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].

The problem seems to be that torchvision::roi_pool cannot run on the CUDA backend. I have tested my code on standard torchvision and it passed with no errors.
I have noticed that there is a roi_pool_kernel.cu file under edgeai-torchvision/torchvision/csrc/ops/cuda/roi_pool_kernel.cu, but it does not seem to have been compiled and registered correctly. Is this the problem? How can I amend this so that torchvision::roi_pool runs on the CUDA backend?
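For context, torchvision's C++/CUDA extensions are only compiled when a CUDA toolkit is detected (or forced) at build time; if the build ran without it, only the CPU kernels get registered, which matches the backend list in the error. A minimal sketch of a clean rebuild that forces the CUDA kernels (such as roi_pool_kernel.cu) to be compiled, assuming edgeai-torchvision uses the standard torchvision setup.py and a working CUDA toolkit is installed:

```shell
# Remove previously built extension artifacts so setup.py rebuilds from scratch
cd edgeai-torchvision
rm -rf build/ torchvision.egg-info/

# FORCE_CUDA=1 tells torchvision's setup.py to compile the CUDA kernels
# even if CUDA was not auto-detected during the earlier build
FORCE_CUDA=1 python setup.py develop
```

As a temporary workaround, the inputs to the roi_pool call can also be moved to CPU (`.cpu()`) for that one operation and the result moved back to the GPU afterwards, at some performance cost.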

Thanks in advance for helping me with this!

  • Which model are you trying to train? Can you try to train the model with the original pytorch/vision repository? github.com/.../vision

  • Hi Manu,

    Thanks for your quick reply,

The model I am using is SuperPoint, and it is not one of the models provided in edgeai-torchvision. Please find its GitHub repository at https://github.com/eric-yyjau/pytorch-superpoint .

I have tested this model with standard pytorch==1.3.1 and torchvision==0.4.2, and it works well. But when I transferred the model to edgeai-torchvision and tried to train it, things went wrong. As I mentioned above, the error (the torchvision::roi_pool operator cannot run on the CUDA backend) seems to point to roi_pool_kernel.cu. Although I followed the setup steps described in the README.md of the edgeai-torchvision repository, roi_pool_kernel.cu does not seem to have been compiled correctly.

Could you please look into this? Thanks in advance for helping me with it.

  • Hi Chris,

I do not have much expertise with the roi_pool operator. The torchvision version that you are using (0.4.2) is quite different from the version in edgeai-torchvision, and that could cause a problem.

    Can you help me understand this statement:

    >>>when I transferred this model to edgeai-torchvision and tried to train the model

If you are able to train in the pytorch-superpoint repository, why do you want to transfer the model to edgeai-torchvision?

  • Hi Manu,

    Thanks for your reply!

The reason is that I want to use the quantization tools in the edgeai-torchvision repository during training.

So I copied the code from the pytorch-superpoint repository into the edgeai-torchvision repository, and then used Quantization Aware Training to wrap the SuperPoint model during training.

I have noticed that the torchvision version in edgeai-torchvision is quite different from standard torchvision, but I still have no idea how to fix the problem described above.

    Still thanks so much for your quick answer, please let me know if there is any solution!

We plan to separate the quantization tools into their own repository in order to remove this dependency on a specific torchvision version.

For now, one option is to copy the xnn folder from edgeai-torchvision/vision/edgeailite/xnn into your repository and import it there.
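A minimal sketch of that suggestion, using the xnn path given above; the destination directory `your_project/` is a hypothetical placeholder for the pytorch-superpoint working copy:

```shell
# Copy the quantization utilities out of edgeai-torchvision
# into the root of your own training project
cp -r edgeai-torchvision/vision/edgeailite/xnn your_project/xnn
```

With the folder at the top level of the project, `import xnn` should resolve to the local copy, and the same QAT wrapper classes previously imported from edgeai-torchvision can be imported from there instead.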

  • Ok, great, thanks for this valuable information. I will try this immediately!