Dear authors,
when I try to train my model on edgeai-torchvision, I get the following error:
Traceback (most recent call last):
  File "train4.py", line 141, in <module>
    args.func(config, output_dir, args)
  File "train4.py", line 93, in train_joint
    train_agent.train()
  File "/media/data3/edgeai-torchvision/Train_model_frontend.py", line 277, in train
    loss_out = self.train_val_sample(sample_train, self.n_iter, True)
  File "/media/data3/edgeai-torchvision/Train_model_heatmap.py", line 466, in train_val_sample
    name="original_gt",
  File "/media/data3/edgeai-torchvision/Train_model_heatmap.py", line 542, in get_residual_loss
    labels_2D, heatmap, labels_res, patch_size=5, device=self.device
  File "/media/data3/edgeai-torchvision/Train_model_heatmap.py", line 599, in pred_soft_argmax
    label_idx.to(device), heatmap.to(device), patch_size=patch_size
  File "/media/data3/edgeai-torchvision/utils/losses.py", line 96, in extract_patches
    patches = _roi_pool(image, rois, patch_size=patch_size)
  File "/media/data3/edgeai-torchvision/utils/losses.py", line 43, in _roi_pool
    patches = roi_pool(pred_heatmap, rois.float(), (patch_size, patch_size), spatial_scale=1.0)
  File "/media/data3/edgeai-torchvision/torchvision/ops/roi_pool.py", line 46, in roi_pool
    output_size[0], output_size[1])
NotImplementedError: Could not run 'torchvision::roi_pool' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'torchvision::roi_pool' is only available for these backends: [CPU, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].

CPU: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/cpu/roi_pool_kernel.cpp:239 [kernel]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:47 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:18 [backend fallback]
Negative: registered at ../aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradCPU: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradCUDA: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradXLA: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradLazy: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradXPU: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradMLC: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradHPU: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradNestedTensor: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradPrivateUse1: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradPrivateUse2: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
AutogradPrivateUse3: registered at /media/data3/edgeai-torchvision/torchvision/csrc/ops/autograd/roi_pool_kernel.cpp:142 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/TraceTypeManual.cpp:291 [backend fallback]
UNKNOWN_TENSOR_TYPE_ID: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:466 [backend fallback]
Autocast: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:305 [backend fallback]
Batched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1016 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
The problem seems to be that torchvision::roi_pool cannot run on the CUDA backend. I have tested my code with the standard torchvision package and it passes without any error.
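For reference, here is a minimal snippet that should hit the same call path (the tensor sizes and ROI coordinates are just placeholders, not the ones from my training code):

```python
import torch
from torchvision.ops import roi_pool

# Placeholder single-channel heatmap and one ROI, both on the GPU,
# just to exercise torchvision::roi_pool on the CUDA backend.
heatmap = torch.rand(1, 1, 64, 64, device="cuda")
rois = torch.tensor([[0.0, 10.0, 10.0, 20.0, 20.0]], device="cuda")  # (batch_idx, x1, y1, x2, y2)

patches = roi_pool(heatmap, rois, output_size=(5, 5), spatial_scale=1.0)
print(patches.shape)  # runs fine with the standard torchvision wheel; with this build it should fail with the NotImplementedError above
```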
I have noticed that there is a roi_pool_kernel.cu at edgeai-torchvision/torchvision/csrc/ops/cuda/roi_pool_kernel.cu, but it seems it was not compiled and registered correctly. Is this the problem? How can I fix this so that torchvision::roi_pool runs on the CUDA backend?
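If it helps with diagnosing, this is the kind of check I could run to see whether the C++/CUDA extension was actually built with CUDA support (assuming the private helper torchvision.extension._check_cuda_version is present in this build):

```python
import torch
import torchvision

print(torch.__version__, torch.version.cuda)  # PyTorch build and the CUDA version it was built with
print(torchvision.__version__)

# Private helper: as far as I understand it reports the CUDA version the
# torchvision extension was compiled against, or -1 if it was built CPU-only.
from torchvision.extension import _check_cuda_version
print(_check_cuda_version())
```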
Thanks in advance for helping me with this!