PROCESSOR-SDK-AM62A: Various errors when trying to run scripts from edgeai-modelmaker and edgeai-benchmark

Narayan Desai

Hello,

I've been trying to set up edgeai-modelmaker and run the example scripts, but I am running into dependency issues. I can create and activate the pyenv environment without any problems. I then run ./setup_all.sh in the edgeai-modelmaker repository with yolov5 enabled. Setup appears to complete, and the expected edgeai folders show up in the parent directory. However, ./run_modelmaker.sh fails with every config I have tried, each for a different reason. I am also having issues with edgeai-benchmark when running ./run_benchmarks_pc.sh. Could you please review my output and help me figure out what's happening? I am following the instructions exactly as they appear on the edgeai-modelmaker GitHub page, and I have also tried again on a fresh VM image, to no avail.

Output from running with config_classification.yaml:

UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 1, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.

cpuset_checked))

Creating model

=> The shape of the following weights did not match:

classifier.1.weight

classifier.1.bias

=> WARNING: weights could not be loaded completely.

Start training

./run_modelmaker.sh: line 73: 22019 Killed python ./scripts/run_modelmaker.py $2 --target_device $1
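
The warning in this log says 16 DataLoader workers were requested on a system where the suggested maximum is 1, and the run then ends with a bare "Killed" and no Python traceback. A bare "Killed" is often what the Linux out-of-memory killer leaves behind, though the thread does not confirm that this was the cause here. A minimal sketch of capping a requested worker count at the CPUs the process may actually use follows. The helper names `usable_cpu_count` and `clamp_workers` are hypothetical, and where modelmaker exposes its worker count is not shown in this thread.

```python
import os


def usable_cpu_count():
    """Return the number of CPUs this process is allowed to run on."""
    # sched_getaffinity respects cpusets/containers but only exists on some OSes.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def clamp_workers(requested, available=None):
    """Cap a requested worker count at the usable CPU count (never below 0)."""
    if available is None:
        available = usable_cpu_count()
    return max(0, min(requested, available))


# The case from the log: 16 workers requested, 1 suggested by the system.
print(clamp_workers(16, available=1))
```

Fewer workers also means fewer copies of the data-loading process in memory, which may matter on a small VM.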

Output from running with config_detection.yaml:

dataset split sizes are limited to: {'train': 393, 'val': 107}

loading annotations into memory...

Done (t=0.25s)

creating index...

index created!

loading annotations into memory...

Done (t=0.03s)

creating index...

index created!

Run params is at: /home/narayan/edgeai-modelmaker/data/projects/tiscapes2017_driving/run/20230713-140207/yolox_nano_lite/run.yaml

Traceback (most recent call last):

File "./scripts/run_modelmaker.py", line 140, in <module>

main(config)

File "./scripts/run_modelmaker.py", line 76, in main

model_runner.run()

File "/home/narayan/edgeai-modelmaker/edgeai_modelmaker/ai_modules/vision/runner.py", line 152, in run

self.model_training.run()

File "/home/narayan/edgeai-modelmaker/edgeai_modelmaker/ai_modules/vision/training/edgeai_mmdetection/detection.py", line 415, in run

__name__, force_import=True)

File "/home/narayan/edgeai-modelmaker/edgeai_modelmaker/utils/misc_utils.py", line 99, in import_file_or_folder

imported_module = importlib.import_module(basename, package_name or __name__)

File "/root/.pyenv/versions/3.6.15/lib/python3.6/importlib/__init__.py", line 126, in import_module

return _bootstrap._gcd_import(name[level:], package, level)

File "<frozen importlib._bootstrap>", line 994, in _gcd_import

File "<frozen importlib._bootstrap>", line 971, in _find_and_load

File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked

File "<frozen importlib._bootstrap>", line 665, in _load_unlocked

File "<frozen importlib._bootstrap_external>", line 678, in exec_module

File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed

File "/home/narayan/edgeai-mmdetection/tools/train.py", line 16, in <module>

from mmdet.apis import init_random_seed, set_random_seed, train_detector

File "/home/narayan/edgeai-mmdetection/mmdet/apis/__init__.py", line 2, in <module>

from .inference import (async_inference_detector, inference_detector,

File "/home/narayan/edgeai-mmdetection/mmdet/apis/inference.py", line 7, in <module>

from mmcv.ops import RoIPool

File "/root/.pyenv/versions/py36/lib/python3.6/site-packages/mmcv/ops/__init__.py", line 2, in <module>

from .active_rotated_filter import active_rotated_filter

File "/root/.pyenv/versions/py36/lib/python3.6/site-packages/mmcv/ops/active_rotated_filter.py", line 10, in <module>

['active_rotated_filter_forward', 'active_rotated_filter_backward'])

File "/root/.pyenv/versions/py36/lib/python3.6/site-packages/mmcv/utils/ext_loader.py", line 13, in load_ext

ext = importlib.import_module('mmcv.' + name)

File "/root/.pyenv/versions/3.6.15/lib/python3.6/importlib/__init__.py", line 126, in import_module

return _bootstrap._gcd_import(name[level:], package, level)

ImportError: libcudart.so.11.0: cannot open shared object file: No such file or directory

Output from running with config_segmentation.yaml:

INFO:20230713-164953: running - kd-7060_onnxrt_coco_edgeai-yolox_yolox_s_pose_ti_lite_49p5_78p0_onnx

INFO:20230713-164953: pipeline_config - {'task_type': 'human_pose_estimation', 'dataset_category': 'cocokpts', 'calibration_dataset': <edgeai_benchmark.datasets.coco_kpts.COCOKeypoints object at 0x7f5fdffaa110>, 'input_dataset': <edgeai_benchmark.datasets.coco_kpts.COCOKeypoints object at 0x7f5fdffaaf10>, 'postprocess': <edgeai_benchmark.postprocess.PostProcessTransforms object at 0x7f5fdfc16f50>, 'preprocess': <edgeai_benchmark.preprocess.PreProcessTransforms object at 0x7f5fc785cc10>, 'session': <edgeai_benchmark.sessions.onnxrt_session.ONNXRTSession object at 0x7f5fc7830050>, 'model_info': {'metric_reference': {'accuracy_ap[.5:.95]%': 49.5}, 'model_shortlist': 10}}

INFO:20230713-164953: infer - kd-7060_onnxrt_coco_edgeai-yolox_yolox_s_pose_ti_lite_49p5_78p0_onnx - this may take some time...libtidl_onnxrt_EP loaded 0x7f5f753e07e0

^CProcess NoDaemonPoolWorker-4:

Traceback (most recent call last):

File "/root/.pyenv/versions/3.6.15/lib/python3.6/multiprocessing/pool.py", line 720, in next

item = self._items.popleft()

IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "./scripts/benchmark_modelzoo.py", line 74, in <module>

tools.run_accuracy(settings, work_dir)

File "/home/narayan/edgeai-benchmark/edgeai_benchmark/tools/run_accuracy.py", line 88, in run_accuracy

pipeline_runner.run()

File "/home/narayan/edgeai-benchmark/edgeai_benchmark/pipelines/pipeline_runner.py", line 81, in run

return self._run_pipelines_parallel()

File "/home/narayan/edgeai-benchmark/edgeai_benchmark/pipelines/pipeline_runner.py", line 114, in _run_pipelines_parallel

results_list = parallel_exec.run()

File "/home/narayan/edgeai-benchmark/edgeai_benchmark/utils/parallel_run.py", line 87, in run

return self._run_parallel()

File "/home/narayan/edgeai-benchmark/edgeai_benchmark/utils/parallel_run.py", line 107, in _run_parallel

result = results_iterator.__next__(timeout=self.maxinterval)

File "/root/.pyenv/versions/3.6.15/lib/python3.6/multiprocessing/pool.py", line 724, in next

self._cond.wait(timeout)

File "/root/.pyenv/versions/3.6.15/lib/python3.6/threading.py", line 299, in wait

gotit = waiter.acquire(True, timeout)

Output from edgeai-benchmark ./run_benchmarks_pc.sh:

download_ok: True

configs to run: ['kd-7040_onnxrt_coco_edgeai-yolov5_yolov5s6_pose_640_ti_lite_54p9_82p2_onnx', 'kd-7050_onnxrt_coco_edgeai-yolov5_yolov5s6_pose_640_ti_lite_54p9_82p2_onnx', 'kd-7060_onnxrt_coco_edgeai-yolox_yolox_s_pose_ti_lite_49p5_78p0_onnx']

number of configs: 3

TASKS | | 0% 0/3| [< ]

INFO:20230713-164850: starting process on parallel_device - 0 0%| || 0/3 [00:00<?, ?it/s]

INFO:20230713-164856: starting - kd-7040_onnxrt_coco_edgeai-yolov5_yolov5s6_pose_640_ti_lite_54p9_82p2_onnx

INFO:20230713-164856: model_path - /home/narayan/edgeai-yolov5/pretrained_models/models/keypoint/coco/edgeai-yolov5/yolov5s6_pose_640_ti_lite_54p9_82p2.onnx

INFO:20230713-164856: model_file - /home/narayan/edgeai-benchmark/work_dirs/modelartifacts/AM62A/8bits/kd-7040_onnxrt_coco_edgeai-yolov5_yolov5s6_pose_640_ti_lite_54p9_82p2_onnx/model/yolov5s6_pose_640_ti_lite_54p9_82p2.onnx

Downloading 1/1: /home/narayan/edgeai-yolov5/pretrained_models/models/keypoint/coco/edgeai-yolov5/yolov5s6_pose_640_ti_lite_54p9_82p2.onnx

Downloading software-dl.ti.com/.../yolov5s6_pose_640_ti_lite_54p9_82p2.onnx to /home/narayan/edgeai-benchmark/work_dirs/modelartifacts/AM62A/8bits/kd-7040_onnxrt_coco_edgeai-yolov5_yolov5s6_pose_640_ti_lite_54p9_82p2_onnx/model/yolov5s6_pose_640_ti_lite_54p9_82p2.onnx

103481344it [00:14, 7271582.86it/s]

Download done for /home/narayan/edgeai-yolov5/pretrained_models/models/keypoint/coco/edgeai-yolov5/yolov5s6_pose_640_ti_lite_54p9_82p2.onnx

INFO:20230713-164940: starting process on parallel_device - 0

INFO:20230713-164945: starting - kd-7060_onnxrt_coco_edgeai-yolox_yolox_s_pose_ti_lite_49p5_78p0_onnx

Downloading software-dl.ti.com/.../kd-7060_onnxrt_coco_edgeai-yolox_yolox_s_pose_ti_lite_49p5_78p0_onnx.tar.gz to /home/narayan/edgeai-benchmark/work_dirs/modelartifacts/AM62A/8bits/kd-7060_onnxrt_coco_edgeai-yolox_yolox_s_pose_ti_lite_49p5_78p0_onnx.tar.gz

44916736it [00:06, 6948378.06it/s]

Extracting /home/narayan/edgeai-benchmark/work_dirs/modelartifacts/AM62A/8bits/kd-7060_onnxrt_coco_edgeai-yolox_yolox_s_pose_ti_lite_49p5_78p0_onnx.tar.gz to /home/narayan/edgeai-benchmark/work_dirs/modelartifacts/AM62A/8bits/kd-7060_onnxrt_coco_edgeai-yolox_yolox_s_pose_ti_lite_49p5_78p0_onnx

INFO:20230713-164953: model_path - /home/narayan/edgeai-modelzoo/models/vision/keypoint/coco/edgeai-yolox/yolox_s_pose_ti_lite_49p5_78p0.onnx

INFO:20230713-164953: model_file - /home/narayan/edgeai-benchmark/work_dirs/modelartifacts/AM62A/8bits/kd-7060_onnxrt_coco_edgeai-yolox_yolox_s_pose_ti_lite_49p5_78p0_onnx/model/yolox_s_pose_ti_lite_49p5_78p0.onnx

INFO:20230713-164953: running - kd-7060_onnxrt_coco_edgeai-yolox_yolox_s_pose_ti_lite_49p5_78p0_onnx

INFO:20230713-164953: infer - kd-7060_onnxrt_coco_edgeai-yolox_yolox_s_pose_ti_lite_49p5_78p0_onnx - this may take some time...libtidl_onnxrt_EP loaded 0x7f5f753e07e0

^CProcess NoDaemonPoolWorker-4:

Traceback (most recent call last):

File "/root/.pyenv/versions/3.6.15/lib/python3.6/multiprocessing/pool.py", line 720, in next

item = self._items.popleft()

IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "./scripts/benchmark_modelzoo.py", line 74, in <module>

tools.run_accuracy(settings, work_dir)

File "/home/narayan/edgeai-benchmark/edgeai_benchmark/tools/run_accuracy.py", line 88, in run_accuracy

pipeline_runner.run()

File "/home/narayan/edgeai-benchmark/edgeai_benchmark/pipelines/pipeline_runner.py", line 81, in run

return self._run_pipelines_parallel()

File "/home/narayan/edgeai-benchmark/edgeai_benchmark/pipelines/pipeline_runner.py", line 114, in _run_pipelines_parallel

results_list = parallel_exec.run()

File "/home/narayan/edgeai-benchmark/edgeai_benchmark/utils/parallel_run.py", line 87, in run

return self._run_parallel()

File "/home/narayan/edgeai-benchmark/edgeai_benchmark/utils/parallel_run.py", line 107, in _run_parallel

result = results_iterator.__next__(timeout=self.maxinterval)

File "/root/.pyenv/versions/3.6.15/lib/python3.6/multiprocessing/pool.py", line 724, in next

self._cond.wait(timeout)

File "/root/.pyenv/versions/3.6.15/lib/python3.6/threading.py", line 299, in wait

gotit = waiter.acquire(True, timeout)

over 2 years ago

0 Reese Grimsley over 2 years ago

TI__Genius 15056 points

Hi Narayan,

Thanks for the query. Let's diagnose these issues.

For the traces that mention 'popleft' or threading, are you pressing Ctrl-C to cancel what's running? I see ^C in the output, which suggests the process received a keyboard interrupt.

Is there an NVIDIA GPU with CUDA set up on your machine? I see that training is failing because it can't find libcudart.so.11.0, the CUDA runtime library, which is only needed when using a GPU. If you don't have a GPU, please comment out the line in the config YAML file that references "num_gpus".
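
A minimal sketch of the "comment out the num_gpus line" step, done as a plain text edit so no YAML library is needed. The helper name `comment_out_key` and the sample layout (including `batch_size`) are hypothetical; the real config file may nest or name things differently. Whether this change alone is enough to avoid the import error is not established in this thread.

```python
def comment_out_key(yaml_text, key="num_gpus"):
    """Prefix every line that sets `key` with '# ', keeping its indentation."""
    out = []
    for line in yaml_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(key + ":"):
            indent = line[: len(line) - len(stripped)]
            out.append(indent + "# " + stripped)
        else:
            out.append(line)
    return "\n".join(out)


# Hypothetical fragment; only the key name "num_gpus" comes from the thread.
sample = "training:\n    batch_size: 8\n    num_gpus: 1"
print(comment_out_key(sample))
```

Lines that are already commented out are left alone, because they start with "#" rather than with the key.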

I have also seen the messages about weights not loading completely. That shouldn't prevent training from running. You can let that continue.

On the config_segmentation, I'm not sure why that's set to a pose estimation model (yolo-x-pose). Could you change that to one of the other model names in the file, like deeplabv3_mobilenetv2 (or whichever name is most similar)?

Best,
Reese

+1 Reese Grimsley over 2 years ago in reply to Reese Grimsley

TI__Genius 15056 points

Hi Narayan,

Thanks for the query. Let's diagnose these issues.

Is there an NVIDIA GPU with CUDA set up on your machine? I see that training is failing for the detection YAML because it can't find a CUDA runtime library (libcudart.so.11.0), which is only needed when using a GPU. If you don't have a GPU, please comment out the line in the config YAML file that references "num_gpus".
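
One quick way to check the question above is to ask the dynamic loader directly whether it can open the library named in the ImportError. This is a minimal sketch using only the standard library; the helper name `shared_library_loads` is hypothetical, and a result of False only says the loader could not find or open the file on its search path, not why.

```python
import ctypes


def shared_library_loads(name):
    """Return True if the dynamic loader can open `name`, else False."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the file is missing or cannot be loaded.
        return False
    return True


# Library name taken from the ImportError in the detection log.
lib = "libcudart.so.11.0"
print(lib, "loadable:", shared_library_loads(lib))
```

If this prints False on a machine without a GPU, the installed mmcv build probably expects CUDA, and a CPU-only setup needs either a CPU build or the GPU settings disabled.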

I have also seen the messages about weights not loading completely. That shouldn't prevent training from running. You can let that continue. It just means the mismatched weights (here classifier.1.weight and classifier.1.bias) will be learned during training instead of being initialized from the pretrained checkpoint. All the other layers should still take advantage of pretrained weights.
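
To illustrate why only the classifier entries are reported, here is a minimal sketch of shape-based partial loading, written with plain dictionaries of shapes so it runs without PyTorch. It is not the modelmaker's actual loading code. The helper name `split_by_shape`, the `features.0.weight` entry, and all shape values are hypothetical; only the two classifier names come from the log.

```python
def split_by_shape(pretrained_shapes, model_shapes):
    """Split checkpoint entries into loadable ones and missing/mismatched ones."""
    loadable, mismatched = [], []
    for name, shape in pretrained_shapes.items():
        if model_shapes.get(name) == shape:
            loadable.append(name)
        else:
            mismatched.append(name)
    return loadable, mismatched


# Hypothetical shapes: a 1000-class checkpoint loaded into a 3-class model.
pretrained = {
    "features.0.weight": (32, 3, 3, 3),
    "classifier.1.weight": (1000, 1280),
    "classifier.1.bias": (1000,),
}
model = {
    "features.0.weight": (32, 3, 3, 3),
    "classifier.1.weight": (3, 1280),
    "classifier.1.bias": (3,),
}

loadable, mismatched = split_by_shape(pretrained, model)
print("loaded from checkpoint:", loadable)
print("did not match:", mismatched)
```

The final layer's shape depends on the number of classes in the dataset, so it is the usual place for a mismatch when fine-tuning on a custom dataset.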

On the config_segmentation, I'm not sure why that's set to a pose estimation model (yolo-x-pose). Is this the correct log output?

For the traces that mention 'popleft' or threading (benchmark scripts), are you pressing Ctrl-C to cancel what's running? I see ^C in the output, which suggests the process received a keyboard interrupt.
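
Some background on that traceback: in the standard library's pool iterator, `popleft()` raising IndexError is the ordinary "no result ready yet" path, after which `next()` waits on a condition with the given timeout. A trace that ends in `waiter.acquire` therefore means the main process was still waiting for a worker when it was interrupted, so the open question is why the worker never returned. A minimal sketch of the same polling pattern, using `ThreadPool` as a stand-in for the benchmark's own process pool (an assumption), follows. The names `slow_task` and `wait_with_timeout` are hypothetical.

```python
import time
from multiprocessing import TimeoutError as PoolTimeoutError
from multiprocessing.pool import ThreadPool


def slow_task(seconds):
    """Stand-in for one benchmark pipeline: sleep, then return."""
    time.sleep(seconds)
    return seconds


def wait_with_timeout(durations, poll_interval=0.05, max_polls=200):
    """Collect results, counting polls that timed out; fail if the budget runs out."""
    results, timeouts = [], 0
    with ThreadPool(processes=2) as pool:
        iterator = pool.imap_unordered(slow_task, durations)
        while len(results) < len(durations):
            try:
                # Same call shape as in the traceback: next() with a timeout.
                results.append(iterator.next(timeout=poll_interval))
            except PoolTimeoutError:
                timeouts += 1
                if timeouts >= max_polls:
                    raise RuntimeError("no result within the polling budget")
    return sorted(results), timeouts


results, timeouts = wait_with_timeout([0.2, 0.01])
print("results:", results, "timed-out polls:", timeouts)
```

A bounded polling budget turns a silent hang into an explicit error, which is easier to report than a Ctrl-C traceback.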

Best,
Reese

0 Narayan Desai over 2 years ago in reply to Reese Grimsley

Prodigy 70 points

Hi Reese,

Sorry for the late reply. I was able to get the scripts to run mostly fine by setting everything up in Docker instead. Both the pyenv method and the deprecated miniconda method (as expected) led to dependency conflicts, and for some reason the torchvision setup script kept failing: installing torch==1.10.0+cu113 would fail with a "killed" message. With Docker, however, setup was smooth, and there were no issues compiling models, running inference, or generating artifacts.
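
A "killed" message during a large package install is often a sign that the machine ran out of memory, although nothing in this thread confirms that for this case. A minimal sketch for checking available memory on Linux before starting the setup, reading the standard `/proc/meminfo` file, follows. The helper names are hypothetical, and the function returns None on systems without that file.

```python
def parse_mem_available_mib(meminfo_text):
    """Extract MemAvailable from /proc/meminfo text and convert kB to MiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) // 1024
    return None


def mem_available_mib(path="/proc/meminfo"):
    """Return available memory in MiB, or None if it cannot be determined."""
    try:
        with open(path) as handle:
            return parse_mem_available_mib(handle.read())
    except OSError:
        return None


print("MemAvailable (MiB):", mem_available_mib())
```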

Regarding segmentation, I believe I pasted the wrong output. Again though, using a Docker installation seemed to resolve my issues. Also, where you see ^C, I was killing the process because the script hung and wouldn't proceed.

I also made sure to disable CUDA acceleration as per your advice. Thanks for your help!

Best,

Narayan
