Dear authors,
When I try to train my model on edgeai-torchvision, I get the following error:
Traceback (most recent call last):
  File "train4.py", line 141, in <module>
    args.func(config, output_dir, args)
  File "train4.py", line 93, in train_joint
    train_agent.train()
  File "/media/data3/edgeai-torchvision/Train_model_frontend.py", line 277, in train
    loss_out = self.train_val_sample(sample_train, self.n_iter, True)
  File "/media/data3/edgeai-torchvision/Train_model_heatmap.py", line 466, in train_val_sample
    name="original_gt",
  File "/media/data3/edgeai-torchvision/Train_model_heatmap.py", line 542, in get_residual_loss
    labels_2D, heatmap, labels_res, patch_size=5, device=self.device
  File "/media/data3/edgeai-torchvision/Train_model_heatmap.py", line 599, in pred_soft_argmax
    label_idx.to(device), heatmap.to(device), patch_size=patch_size
  File "/media/data3/edgeai-torchvision/utils/losses.py", line 96, in extract_patches
    patches = _roi_pool(image, rois, patch_size=patch_size)
  File "/media/data3/edgeai-torchvision/utils/losses.py", line 43, in _roi_pool
    patches = roi_pool(pred_heatmap, rois.float(), (patch_size, patch_size), spatial_scale=1.0)
  File "/media/data3/edgeai-torchvision/torchvision/ops/roi_pool.py", line 46, in roi_pool
    output_size[0], output_size[1])
NotImplementedError: Could not run 'torchvision::roi_pool' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'torchvision::roi_pool' is only available for these backends: [CPU, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].
CPU: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/cpu/roi_pool_kernel.cpp:239 [kernel]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:47 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:18 [backend fallback]
Negative: registered at ../aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradCPU: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradCUDA: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradXLA: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradLazy: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradXPU: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradMLC: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradHPU: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradNestedTensor: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradPrivateUse1: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradPrivateUse2: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradPrivateUse3: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/TraceTypeManual.cpp:291 [backend fallback]
UNKNOWN_TENSOR_TYPE_ID: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:466 [backend fallback]
Autocast: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:305 [backend fallback]
Batched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1016 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
The problem seems to be that torchvision::roi_pool cannot run on the CUDA backend. I have tested my code on standard torchvision and it passes with no error.
I have noticed that there is a CUDA kernel at edgeai-torchvision/torchvision/csrc/ops/cuda/roi_pool_kernel.cu, but it does not seem to have been compiled and registered correctly. Is this the problem? How could I fix it so that torchvision::roi_pool runs on the CUDA backend?
Thanks in advance for helping me with this!
Which model are you trying to train? Can you try to train the model with the original pytorch/vision repository? github.com/.../vision
Hi Manu,
Thanks for your quick reply,
The model I am using is SuperPoint, which is not one of the models provided in edgeai-torchvision. Its GitHub repository is https://github.com/eric-yyjau/pytorch-superpoint .
I have tested this model with standard pytorch==1.3.1 and torchvision==0.4.2, and it works well. But when I moved the model to edgeai-torchvision and tried to train it, it failed. As I mentioned above, the error (the torchvision::roi_pool operator cannot run on the CUDA backend) seems related to roi_pool_kernel.cu. Although I followed the setup steps described in the edgeai-torchvision README.md, roi_pool_kernel.cu does not seem to have been compiled correctly.
Could you please look into this? Thanks in advance for your help.
Hi Chris,
I do not have much expertise with the roi_pool operator. The torchvision version you are using (0.4.2) is very different from the version edgeai-torchvision is based on, and that could cause a problem.
Can you help me understand this statement:
>>>when I transferred this model to edgeai-torchvision and tried to train the model
If you are able to train in the pytorch-superpoint repository, why do you want to transfer the model to edgeai-torchvision?
Hi Manu,
Thanks for your reply!
The reason is that I want to use the quantization tools in the edgeai-torchvision repository during training.
So I copied the code from the pytorch-superpoint repository into the edgeai-torchvision repository, and then used Quantization Aware Training to wrap the SuperPoint model during training.
I have noticed that the torchvision version in edgeai-torchvision is quite different from standard torchvision, but I still have no idea how to fix the problem mentioned above.
Thanks again for your quick answer; please let me know if there is a solution!
We plan to separate the quantization tools out into their own repository, in order to remove this dependency on a specific torchvision version.
For now, one option is to copy the xnn folder from edgeai-torchvision/vision/edgeailite/xnn into your repository and import it.
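For reference, the "wrap the model for QAT" flow mentioned above follows the same prepare-then-train pattern as core PyTorch quantization. The sketch below illustrates that general pattern using only torch.quantization and a toy model (it is not SuperPoint and does not use the xnn API, which provides its own wrapper):

```python
import torch
import torch.nn as nn

# Toy stand-in for a convolutional backbone (hypothetical, for illustration only).
model = nn.Sequential(
    torch.quantization.QuantStub(),
    nn.Conv2d(1, 8, 3, padding=1),
    nn.ReLU(),
    torch.quantization.DeQuantStub(),
)

# QAT preparation must happen in training mode; fake-quant observers are
# inserted so quantization effects are simulated during training.
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
qat_model = torch.quantization.prepare_qat(model)

# Train qat_model with the usual loop; forward() now runs fake quantization.
out = qat_model(torch.rand(1, 1, 16, 16))
print(out.shape)  # torch.Size([1, 8, 16, 16])
```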