SK-AM62A-LP: Issue with Training YOLOX using Model Maker on GPU

Amanda Vaitty

Part Number: SK-AM62A-LP

Hello,

I am reaching out because I am encountering an error while attempting to train a YOLOX model with Model Maker on my GPU.

I am receiving the following error message:

AttributeError: DataContainer has no attribute size for type <class 'list'>
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1)

I noticed that this issue has been reported by others on your GitHub repository (https://github.com/TexasInstruments/edgeai-modelmaker/issues/11).

Is there any workaround available to address this error and proceed with training using my GPU?

Thank you for your attention to this issue.

Best regards

over 1 year ago

Derek T, over 1 year ago

I have the same issue. As a workaround, I tried turning off distributed training (this is probably only valid for a single-GPU setup). It doesn't look like training is using my GPU yet, but at least it no longer throws the error. It may work fully on your setup.

edgeai-modelmaker/edgeai_modelmaker/ai_modules/vision/params.py

training=dict(distributed=False)
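The override above can be pictured as a nested dictionary merge. The following is a minimal sketch, not Model Maker's actual code: `DEFAULT_PARAMS`, its `num_gpus` field, and the `deep_update` helper are hypothetical names used only to show how setting `distributed=False` would leave the other training defaults untouched.

```python
import copy

# Hypothetical defaults for illustration; the real params.py has many more fields.
DEFAULT_PARAMS = {
    "training": {"distributed": True, "num_gpus": 1},
}


def deep_update(base, overrides):
    """Return a copy of base with nested keys from overrides applied."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so sibling keys in the nested dict are preserved.
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = value
    return merged


# Apply the same override as in the post above.
params = deep_update(DEFAULT_PARAMS, dict(training=dict(distributed=False)))
print(params["training"])
```

Only the `distributed` flag changes; whether this alone is enough to run on the GPU depends on the rest of the setup.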

Amanda Vaitty, over 1 year ago, in reply to Derek T

Thank you Derek.

I've managed to resolve my issue by updating MMCV. I now use MMCV 1.7.2, and I raised the maximum allowed version in edgeai-mmdetection/mmdet/__init__.py from

mmcv_maximum_version = '1.5.0'

to

mmcv_maximum_version = '1.7.2'

After that, I was able to train on two GPUs without any errors.
I hope this helps you as well.
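To illustrate why raising the limit matters, here is a minimal sketch of a version range check. It assumes the check is inclusive on both ends and uses a simplified parser; `parse_version`, `mmcv_in_range`, and the lower bound `'1.3.17'` are hypothetical placeholders, not the actual code or values in mmdet/__init__.py.

```python
def parse_version(version):
    """Turn '1.7.2' into (1, 7, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def mmcv_in_range(installed, minimum, maximum):
    """Return True when minimum <= installed <= maximum (assumed inclusive)."""
    return parse_version(minimum) <= parse_version(installed) <= parse_version(maximum)


# '1.3.17' is a placeholder lower bound, not taken from the repository.
print(mmcv_in_range("1.7.2", "1.3.17", "1.5.0"))  # False: 1.7.2 exceeds the old limit
print(mmcv_in_range("1.7.2", "1.3.17", "1.7.2"))  # True: the raised limit admits 1.7.2
```

Note that editing the limit only relaxes the check; it does not guarantee that every feature is compatible with the newer MMCV release.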

Derek T, over 1 year ago, in reply to Amanda Vaitty

That works for me, thank you!
