SK-AM68: edgeai-modelmaker:Training failed with GPU enabled

csscyt

Part Number: SK-AM68
Other Parts Discussed in Thread: AM68, AM68A

I am using the AM68 SDK 10.1 with an NVIDIA GeForce RTX 5070.
Since this GPU requires CUDA 12.8+, I installed CUDA 12.8 and set up my environment with PyTorch 2.7.1+cu128, following this post:
https://github.com/lllyasviel/Fooocus/issues/4088

With PyTorch 2.7.1+cu128 and the GPU enabled (num_gpus 1), these two sample configurations train and compile successfully on the GPU:
./run_modelmaker.sh AM68A config_classification.yaml
./run_modelmaker.sh AM68A config_segmentation.yaml

However, training with the object detection configuration fails with the GPU enabled:

 ./run_modelmaker.sh AM68A config_detection.yaml
Number of AVX cores detected in PC: 32
AVX compilation speedup in PC     : 1
Target device                     : AM68A
PYTHONPATH                        : .:
TIDL_TOOLS_PATH                   : ../edgeai-benchmark/tools/tidl_tools_package/AM68A/tidl_tools
LD_LIBRARY_PATH                   : ../edgeai-benchmark/tools/tidl_tools_package/AM68A/tidl_tools:
argv: ['./scripts/run_modelmaker.py', 'config_detection_new.yaml', '--target_device', 'AM68A']
---------------------------------------------------------------------
INFO: ModelMaker - task_type:detection model_name:yolox_s_lite dataset_name:tiscapes2017_driving run_name:20250916-155615/yolox_s_lite
- Model: yolox_s_lite
- TargetDevices & Estimated Inference Times (ms): {'TDA4VM': 10.14, 'AM62A': 43.94, 'AM67A': '43.94 (with 1/2 device capability)', 'AM68A': 10.22, 'AM69A': '9.82 (with 1/4th device capability)'}
- This model can be compiled for the above device(s).
---------------------------------------------------------------------
INFO: ModelMaker - dataset split sizes {'train': 393, 'val': 107}
INFO: ModelMaker - max_num_files is set to: 10000
INFO: ModelMaker - dataset split sizes are limited to: {'train': 393, 'val': 107}
INFO: ModelMaker - dataset loading OK
loading annotations into memory...
Done (t=0.03s)
creating index...
index created!
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
INFO: ModelMaker - run params is at: /home/github/edgeai-tensorlab/edgeai-modelmaker/data/projects/tiscapes2017_driving/run/20250916-155615/yolox_s_lite/run.yaml
INFO: ModelMaker - running training - for detailed info see the log file: /home/github/edgeai-tensorlab/edgeai-modelmaker/data/projects/tiscapes2017_driving/run/20250916-155615/yolox_s_lite/training/run.log
TASKS TOTAL=1, NUM_RUNNING=1:   0%|                                                   | 0/1 [00:00<?, ?it/s, postfix={'RUNNING': ['20250916-155615/yolox_s_lite:training'], 'COMPLETED': []}]
ERROR:20250916-155618: Error occurred: 20250916-155615/yolox_s_lite:training - Error Code: 1 at /home/xilutek/github/edgeai-tensorlab/edgeai-benchmark/edgeai_benchmark/utils/parallel_runner.py
TASKS TOTAL=1, NUM_RUNNING=0: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.81s/it, postfix={'RUNNING': [], 'COMPLETED': ['yolox_s_lite']}]
Trained model is at: /home/github/edgeai-tensorlab/edgeai-modelmaker/data/projects/tiscapes2017_driving/run/20250916-155615/yolox_s_lite/training

WARNING: ModelMaker - Training completed with errors.

The contents of run.log are below. How can I fix this? Thanks.

 cat run.log
Traceback (most recent call last):
  File "/home/github/edgeai-tensorlab/edgeai-mmdetection/tools/train.py", line 23, in <module>
    from mmdeploy.utils import save_model_proto
  File "/home/github/edgeai-tensorlab/edgeai-mmdeploy/mmdeploy/__init__.py", line 4, in <module>
    from mmdeploy.utils import get_root_logger
  File "/home/github/edgeai-tensorlab/edgeai-mmdeploy/mmdeploy/utils/__init__.py", line 7, in <module>
    from .utils import get_file_path, get_root_logger, target_wrapper, build_model_from_cfg
  File "/home/github/edgeai-tensorlab/edgeai-mmdeploy/mmdeploy/utils/utils.py", line 15, in <module>
    from mmdet.apis import init_detector
  File "/home/github/edgeai-tensorlab/edgeai-mmdetection/mmdet/apis/__init__.py", line 2, in <module>
    from .det_inferencer import DetInferencer
  File "/home/github/edgeai-tensorlab/edgeai-mmdetection/mmdet/apis/det_inferencer.py", line 22, in <module>
    from mmdet.evaluation import INSTANCE_OFFSET
  File "/home/github/edgeai-tensorlab/edgeai-mmdetection/mmdet/evaluation/__init__.py", line 4, in <module>
    from .metrics import *  # noqa: F401,F403
  File "/home/github/edgeai-tensorlab/edgeai-mmdetection/mmdet/evaluation/metrics/__init__.py", line 5, in <module>
    from .coco_metric import CocoMetric
  File "/home/github/edgeai-tensorlab/edgeai-mmdetection/mmdet/evaluation/metrics/coco_metric.py", line 16, in <module>
    from mmdet.datasets.api_wrappers import COCO, COCOeval, COCOevalMP
  File "/home/github/edgeai-tensorlab/edgeai-mmdetection/mmdet/datasets/__init__.py", line 31, in <module>
    from .utils import get_loading_pipeline
  File "/home/github/edgeai-tensorlab/edgeai-mmdetection/mmdet/datasets/utils.py", line 5, in <module>
    from mmdet.datasets.transforms import LoadAnnotations, LoadPanopticAnnotations
  File "/home/github/edgeai-tensorlab/edgeai-mmdetection/mmdet/datasets/transforms/__init__.py", line 6, in <module>
    from .formatting import (ImageToTensor, PackDetInputs, PackReIDInputs,
  File "/home/github/edgeai-tensorlab/edgeai-mmdetection/mmdet/datasets/transforms/formatting.py", line 11, in <module>
    from mmdet.structures.bbox import BaseBoxes
  File "/home/github/edgeai-tensorlab/edgeai-mmdetection/mmdet/structures/bbox/__init__.py", line 2, in <module>
    from .base_boxes import BaseBoxes
  File "/home/xilutek/github/edgeai-tensorlab/edgeai-mmdetection/mmdet/structures/bbox/base_boxes.py", line 9, in <module>
    from mmdet.structures.mask.structures import BitmapMasks, PolygonMasks
  File "/home/github/edgeai-tensorlab/edgeai-mmdetection/mmdet/structures/mask/__init__.py", line 3, in <module>
    from .structures import (BaseInstanceMasks, BitmapMasks, PolygonMasks,
  File "/home/github/edgeai-tensorlab/edgeai-mmdetection/mmdet/structures/mask/structures.py", line 12, in <module>
    from mmcv.ops.roi_align import roi_align
  File "/home/.pyenv/versions/py310/lib/python3.10/site-packages/mmcv/ops/__init__.py", line 3, in <module>
    from .active_rotated_filter import active_rotated_filter
  File "/home/.pyenv/versions/py310/lib/python3.10/site-packages/mmcv/ops/active_rotated_filter.py", line 10, in <module>
    ext_module = ext_loader.load_ext(
  File "/home/.pyenv/versions/py310/lib/python3.10/site-packages/mmcv/utils/ext_loader.py", line 13, in load_ext
    ext = importlib.import_module('mmcv.' + name)
  File "/home/.pyenv/versions/3.10.18/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: /home/.pyenv/versions/py310/lib/python3.10/site-packages/mmcv/_ext.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationESs

WARNING: ModelMaker - Training completed with errors.@ubuntu2204:~/github/edgeai-tensorlab/edgeai-modelmaker/data/projects/tiscapes2017_driving/run/20250916-155615/yolox_s_lite/training$

4 months ago

Christina Kuruvilla 4 months ago

TI__Expert 6900 points

Hi,

Please give me some time to investigate. I will follow up with any questions and should have an update for you before the end of the week.

Warm regards,

Christina

Christina Kuruvilla 3 months ago in reply to Christina Kuruvilla

TI__Expert 6900 points

Hello,

I have had a heavy workload recently, so I will need some more time. I appreciate your patience. Have you tried running it without the GPU?

Warm regards,

Christina

csscyt 3 months ago in reply to Christina Kuruvilla

Prodigy 210 points

Before upgrading to PyTorch 2.7.1+cu128, CPU-only runs completed normally. After the upgrade, GPU and CPU runs both fail with the same error.
Please help analyze this issue. Thanks.

Christina Kuruvilla 3 months ago in reply to csscyt

TI__Expert 6900 points

Hello,

I am investigating this internally. We appreciate your patience.

Warm regards,

Christina

Manu Mathew 3 months ago in reply to csscyt

TI__Genius 11506 points

edgeai-modelmaker uses several components in edgeai-tensorlab for model training and compilation. Specifically, it uses edgeai-mmdetection for training object detection models.

Multiple setup scripts would have to be modified to make it work with a PyTorch build for a different CUDA version.

For example, note that all of these scripts include the following line:

pip3 install --no-input torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu124

https://github.com/TexasInstruments/edgeai-tensorlab/blob/main/edgeai-mmdetection/setup.sh

https://github.com/TexasInstruments/edgeai-tensorlab/blob/main/edgeai-torchvision/setup.sh

https://github.com/TexasInstruments/edgeai-tensorlab/blob/main/edgeai-tensorvision/setup.sh

All of these would have to be modified, and setup would have to be run again. Even then, I am not sure whether other changes would be needed, because we have not tried CUDA 12.8. I suggest staying with the default settings provided.
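
To illustrate the kind of edit described above, here is a minimal sketch that rewrites the pinned install line in a setup script's text. It is illustrative only: `repin` is a hypothetical helper, not part of edgeai-tensorlab, and the target versions used in the example (torch 2.7.1, torchvision 0.22.1, cu128) are assumptions taken from the versions mentioned in this thread, not a configuration that has been tested.

```python
import re

# Matches the pinned torch/torchvision install arguments quoted above.
PIN_PATTERN = re.compile(
    r"torch==(?P<torch>[\w.]+)\s+torchvision==(?P<tv>[\w.]+)\s+"
    r"--index-url\s+https://download\.pytorch\.org/whl/(?P<cuda>\w+)"
)


def repin(script_text, torch_ver, tv_ver, cuda_tag):
    """Return (new_text, count) with every pinned torch line rewritten.

    Hypothetical helper: it only edits text and does not install anything.
    """
    replacement = (
        f"torch=={torch_ver} torchvision=={tv_ver} "
        f"--index-url https://download.pytorch.org/whl/{cuda_tag}"
    )
    # A callable replacement avoids any escape processing of the new text.
    return PIN_PATTERN.subn(lambda match: replacement, script_text)


original = (
    "pip3 install --no-input torch==2.4.0 torchvision==0.19.0 "
    "--index-url https://download.pytorch.org/whl/cu124"
)
# Assumed target versions; verify the matching torchvision release yourself.
updated, count = repin(original, "2.7.1", "0.22.1", "cu128")
print(count)
print(updated)
```

The same function could be applied to the text of each setup.sh listed above before running setup again. As noted, other changes may still be needed afterwards.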

csscyt 3 months ago in reply to Manu Mathew

Prodigy 210 points

Manu Mathew said:
https://github.com/TexasInstruments/edgeai-tensorlab/blob/main/edgeai-mmdetection/setup.sh

https://github.com/TexasInstruments/edgeai-tensorlab/blob/main/edgeai-torchvision/setup.sh

https://github.com/TexasInstruments/edgeai-tensorlab/blob/main/edgeai-tensorvision/setup.sh

I have upgraded these components to the specified version, but it still does not work.

Manu Mathew said:
because we have not tried with CUDA12.8.

Could you try this version on both GPU and CPU? Both fail with the same error when using config_detection.yaml.

Manu Mathew 3 months ago in reply to csscyt

TI__Genius 11506 points

If you check your error:

File "/home/github/edgeai-tensorlab/edgeai-mmdetection/mmdet/structures/mask/structures.py", line 12, in <module>
from mmcv.ops.roi_align import roi_align

This problem is due to mmcv not being correctly installed.

https://mmcv.readthedocs.io/en/latest/

https://github.com/open-mmlab/mmcv

mmcv has not been updated for a couple of years. Installing it alongside torch built for CUDA 12.4 worked for us, but we have noticed that it does not work with several other versions.

We do not have a solution for this right now. If you can make the mmcv installation work so that the import line above succeeds, then you can use it. There is no other solution at the moment.
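
A quick way to check whether a given mmcv installation loads is to import its compiled extension directly, outside of modelmaker. The sketch below is a minimal example under stated assumptions: `check_mmcv_environment` is a hypothetical helper, and the module name `mmcv._ext` is taken from the traceback in this thread. An undefined `c10` symbol on import, as in the log above, usually suggests the extension was built against a different PyTorch build than the one currently installed, though this sketch does not prove that.

```python
import importlib
import importlib.util


def check_mmcv_environment():
    """Report the installed torch build and whether mmcv's extension loads.

    Hypothetical helper for diagnosis only; it never raises on failure.
    """
    report = {}

    # Record the PyTorch version and the CUDA version it was built for.
    try:
        import torch
        report["torch"] = torch.__version__
        report["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
    except Exception as exc:
        report["torch"] = f"unavailable: {exc}"

    # Try to load the compiled ops module named in the traceback.
    if importlib.util.find_spec("mmcv") is None:
        report["mmcv_ext"] = "mmcv not installed"
        return report
    try:
        importlib.import_module("mmcv._ext")
        report["mmcv_ext"] = "ok"
    except Exception as exc:
        report["mmcv_ext"] = f"failed: {exc}"
    return report


report = check_mmcv_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `mmcv_ext` reports a failure, the training script will hit the same error at `from mmcv.ops.roi_align import roi_align`, so this check can be repeated after each reinstall attempt before launching a full run.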
