
PROCESSOR-SDK-AM68A: ONNX Runtime models do not work in docker in SDK 10.0

Part Number: PROCESSOR-SDK-AM68A

I am trying to launch the example edgeai gstreamer applications in SDK 10, and I am facing a problem. When using the default pre-built image from https://www.ti.com/tool/PROCESSOR-SDK-AM68A, I am able to boot the system and try the demo applications by going to /opt/edgeai-gst-apps/apps_python and running python3 app_edgeai.py ../configs/<some_config>.yaml, and everything works as expected.

However, when I build the prepared Docker image (by calling ./docker_build.sh in /opt/edgeai-gst-apps/docker) and then run that container, not all the models work. For example, I can run TFL-CL-0000-mobileNetV1-mlperf, but not ONR-CL-6360-regNetx-200mf. The problem seems to affect all the models that are launched using onnxruntime, as with all of them I get the following output:

[docker] root@am68-sk:/opt/edgeai-gst-apps/apps_python# python3 app_edgeai.py ../configs/image_classification.yaml

libtidl_onnxrt_EP loaded 0x154c95a0

Segmentation fault (core dumped)

I also observe the same behavior when using apps_cpp instead of apps_python, and with both SDK versions 10.00 and 10.01. Is there some workaround that would allow me to launch ONR models from Docker?
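For completeness, the same crash can also be surfaced without gdb by enabling Python's built-in faulthandler before the segfault happens (a generic CPython facility, not TI-specific; shown here only as a debugging sketch):

```python
import faulthandler

# Dump Python and native-thread tracebacks to stderr when the process
# receives a fatal signal such as SIGSEGV. The same effect is available
# without editing the app by launching it as:
#   python3 -X faulthandler app_edgeai.py ../configs/image_classification.yaml
faulthandler.enable()
```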

  • Hi,

    Thanks for the info.  Can you please tell me if some models fail in the Docker environment or if all fail?  I ask this because if all models fail, there is something wrong with the setup.  If only a few fail, there may be unsupported layers.   I am uncertain why there would be a delta between the host and Docker for the same version, but there may be an issue with the setup script.

    Can you please do a git status in edgeai-tidl-tools and send me the output to duplicate your exact environment? 

    Regards,

    Chris

  • From what I have tried, these models work:
    - TVM-CL-3090-mobileNetV2-tv
    - TFL-CL-0000-mobileNetV1-mlperf
    - TFL-OD-2020-ssdLite-mobDet-DSP-coco-320x320
    - TVM-OD-5120-ssdLite-mobDet-DSP-coco-320x320

    And these don't:
    - ONR-OD-8420-yolox-s-lite-mmdet-widerface-640x640
    - ONR-CL-6360-regNetx-200mf
    - ONR-KD-7060-human-pose-yolox-s-640x640
    - ONR-OD-8200-yolox-nano-lite-mmdet-coco-416x416

    With ONR, it's the same core dump every time. I also include the gdb backtrace after running app_edgeai from apps_cpp with ONR-CL-6360-regNetx-200mf:

    (gdb) bt
    #0  0x0000fffff1e67ea8 in TIDL_getSupportedNodesInfer () at /usr/lib/libtidl_onnxrt_EP.so
    #1  0x0000fffff3716740 in onnxruntime::TidlExecutionProvider::GetCapability(onnxruntime::GraphViewer const&, onnxruntime::IExecutionProvider::IKernelLookup const&) const ()
        at /usr/lib/libonnxruntime.so.1.14.0+10000000
    #2  0x0000fffff3cfcde4 in onnxruntime::GetCapabilityForEP(onnxruntime::(anonymous namespace)::GetCapabilityForEPParams const&)::{lambda(onnxruntime::IExecutionProvider const&, onnxruntime::GraphViewer const&, onnxruntime::IExecutionProvider::IKernelLookup const&)#1}::operator()(onnxruntime::IExecutionProvider const&, onnxruntime::GraphViewer const&, onnxruntime::IExecutionProvider::IKernelLookup const&) const [clone .isra.0] () at /usr/lib/libonnxruntime.so.1.14.0+10000000
    #3  0x0000fffff3cfd650 in onnxruntime::GetCapabilityForEP(onnxruntime::(anonymous namespace)::GetCapabilityForEPParams const&) () at /usr/lib/libonnxruntime.so.1.14.0+10000000
    #4  0x0000fffff3d01464 in onnxruntime::PartitionOnnxFormatModelImpl(onnxruntime::Graph&, onnxruntime::FuncManager&, onnxruntime::KernelRegistryManager&, onnxruntime::KernelRegistry&, onnxruntime::IExecutionProvider&, onnxruntime::GraphPartitioner::Mode, int&, std::function<onnxruntime::common::Status (onnxruntime::Graph&, bool&, onnxruntime::IExecutionProvider&)>) ()
        at /usr/lib/libonnxruntime.so.1.14.0+10000000
    #5  0x0000fffff3d02fe8 in onnxruntime::GraphPartitioner::Partition(onnxruntime::Graph&, onnxruntime::FuncManager&, std::function<onnxruntime::common::Status (onnxruntime::Graph&, bool&, onnxruntime::IExecutionProvider&)>, onnxruntime::GraphPartitioner::Mode) const () at /usr/lib/libonnxruntime.so.1.14.0+10000000
    #6  0x0000fffff36f5a0c in onnxruntime::InferenceSession::TransformGraph(onnxruntime::Graph&, onnxruntime::GraphTransformerManager const&, onnxruntime::ExecutionProviders const&, onnxruntime::KernelRegistryManager&, onnxruntime::InsertCastTransformer const&, onnxruntime::SessionState&, bool) () at /usr/lib/libonnxruntime.so.1.14.0+10000000
    #7  0x0000fffff3700260 in onnxruntime::InferenceSession::Initialize() () at /usr/lib/libonnxruntime.so.1.14.0+10000000
    #8  0x0000fffff369b874 in (anonymous namespace)::InitializeSession(OrtSessionOptions const*, std::unique_ptr<onnxruntime::InferenceSession, std::default_delete<onnxruntime::InferenceSession> >&, OrtPrepackedWeightsContainer*) () at /usr/lib/libonnxruntime.so.1.14.0+10000000
    #9  0x0000fffff36a280c in OrtApis::CreateSession(OrtEnv const*, char const*, OrtSessionOptions const*, OrtSession**) () at /usr/lib/libonnxruntime.so.1.14.0+10000000
    #10 0x0000aaaaaab8074c in ti::dl_inferer::ORTInferer::ORTInferer(ti::dl_inferer::InfererConfig const&) ()
    #11 0x0000aaaaaab73084 in ti::dl_inferer::DLInferer::makeInferer(ti::dl_inferer::InfererConfig const&) ()
    #12 0x0000aaaaaaaf07dc in ti::edgeai::common::ModelInfo::initialize (this=0xaaaaab380b40) at /opt/edgeai-gst-apps/apps_cpp/common/src/edgeai_demo_config.cpp:1667
    #13 0x0000aaaaaaac7244 in ti::edgeai::common::EdgeAIDemoImpl::setupFlows (this=0xaaaaab1e9370) at /opt/edgeai-gst-apps/apps_cpp/common/src/edgeai_demo.cpp:153
    #14 0x0000aaaaaaac7094 in ti::edgeai::common::EdgeAIDemoImpl::EdgeAIDemoImpl (this=0xaaaaab1e9370, yaml=...) at /opt/edgeai-gst-apps/apps_cpp/common/src/edgeai_demo.cpp:124
    #15 0x0000aaaaaaac7ec0 in ti::edgeai::common::EdgeAIDemo::EdgeAIDemo (this=0xaaaaab3bee70, yaml=...) at /opt/edgeai-gst-apps/apps_cpp/common/src/edgeai_demo.cpp:331
    #16 0x0000aaaaaaac0fac in main (argc=2, argv=0xfffffffff278) at /opt/edgeai-gst-apps/apps_cpp/app_edgeai/src/app_edgeai_main.cpp:73
    
    
    

  • Hi Denys,

    What is the output from "git status"?

    Chris

  • Hi Chris,
    From which repository do you need the information? I am using the prebuilt image (www.ti.com/tool/PROCESSOR-SDK-AM68A), so there is no .git directory anywhere in /opt/edgeai-gst-apps

  • Are you running this on the device or host?

    Chris

  • OK, then my next question is, why would you run Docker on the device?  It is meant for creating isolated environments on the host.  The environments on the device are fixed (i.e., whatever you installed on the SD card).  One thing that comes up is whether the ONNX models were compiled on the device. Can you please send me the link to the image (.wic.xz file) you are running? I will take a look from there.

    Chris

  • I am using Docker as suggested in the documentation: https://software-dl.ti.com/jacinto7/esd/processor-sdk-linux-am68a/10_01_00/exports/edgeai-docs/common/docker_environment.html

    It is a convenient development environment where I can install additional packages using apt, so I am not restricted to what is available in Yocto. I also had Docker working properly with the same SDK at versions 08.06.01 and 09.02.00.

    I am using the .wic.xz image from here: https://www.ti.com/tool/download/PROCESSOR-SDK-LINUX-AM68A/10.00.00.08. To build the Docker image, I followed the instructions from the documentation: building with docker_build.sh, running with docker_run.sh, and running setup_script.sh inside the container.

  • OK fair enough, but why do this on the target?  It's slow and tends to be error prone.  Here is an excerpt from the instructions.

    Note

    Building Docker image on target using the provided Dockerfile will take about 15-20 minutes to complete with good internet connection. Building Docker containers on target can be slow and resource constrained. The Dockerfile provided will build on target without any issues but if you add more packages or build components from source, running out of memory can be a common problem. As an alternative we highly recommend trying QEMU builds for cross-compiling the images for arm64 architecture on a PC and then load the compiled image on target.

    The bottom line is that it will work, but building isolated Docker environments on the host is the better/faster approach.  I can see doing it on the target if you do not have access to an Ubuntu 22.04 system and want to use Docker containers.  But even in that case, TIDL will generally not let you compile models on the target, so you will still have to do that on the host.

    Chris

  • Thanks for the clarification. I understand that building Docker on the device can be time-consuming and memory-hungry, but it worked in this case as a starting point, so there was no need for any other setup. I also assumed that there would be no difference between Docker images built on the device and ones built with QEMU on another system.

  • In general, the best place to compile (import/quantize, or whatever the current buzzword is) and test models is on the host.  The normal flow is:

    1. Find a model (preferably from the model zoo)

    2. Train it on the host (outside of TIDL, or use a pre-trained model)

    3. Compile it on the host (python3 ./onnxrt_ep.py -c -m model)

    4. Test the model on the host: cd to edgeai-tidl-tools/examples/osrt_python/ort and run inference (python3 ./onnxrt_ep.py -m model)

    5. Copy the artifacts/ and models/ directories to the device

    6. Run inference on the device (run with the .out file or OSRT Python)

    You will often get a better error message on the host than on the device.   The emulation on the host is good, but corner cases may depend on how the HW is configured (configured memory sizes come to mind).  

    Chris

  • Thanks for the answer. But my main goal right now is not to compile the model but to run it on the device. I experience the same problem in Docker with a custom model, where I get the segmentation fault right after "libtidl_onnxrt_EP loaded", yet the model works well outside of the Docker environment. As I mentioned in the issue description, I was able to reproduce the problem with all the ONR models from model_zoo.

    So, basically the problem is that any ONR model from model_zoo can be run outside of docker on the device, but can't be run inside of the prepared docker container on the device. This greatly complicates the development when I need, for example, ROS, which is available only inside the docker, but not in yocto.

    I suppose there is some discrepancy between libraries on the device host filesystem and inside the docker container on the device, but I wasn't able to identify the exact cause of the problem, unfortunately.
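    One way to hunt for such a discrepancy is to diff the candidate libraries between the two root filesystems (for example, the SD card rootfs and the container filesystem obtained via docker export or a bind mount). Below is a minimal stdlib-only sketch; the directory layout and the libonnxruntime* glob are assumptions to adapt:

```python
import hashlib
from pathlib import Path

def hash_file(path):
    """md5 of a file's contents, for a quick same/different check."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def diff_libs(host_root, docker_root, glob="libonnxruntime*"):
    """Compare matching libraries under two directory trees and report
    files that exist in both but differ, or exist on only one side."""
    host = {p.name: p for p in Path(host_root).glob(glob)}
    dock = {p.name: p for p in Path(docker_root).glob(glob)}
    report = {}
    for name in sorted(set(host) | set(dock)):
        if name not in host:
            report[name] = "docker only"
        elif name not in dock:
            report[name] = "host only"
        elif hash_file(host[name]) != hash_file(dock[name]):
            report[name] = "differs"
    return report
```

    For instance, diff_libs("/usr/lib", "/path/to/exported-container/usr/lib") would flag any libonnxruntime build mismatch between the host and the container.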

  • Hi Denys,

    I have posted a question to the dev team.  Unless the Dev team answers earlier, I should have an answer in the CCB by 1/27.  

    Regards,

    Chris

  • Hi Chris,

    I found some additional information that could help with identifying the issue. It appears that when I replace the in-docker /usr/lib/libonnxruntime.so (and libonnxruntime.so.1.14.0+10000000) with the library from the host filesystem (/usr/lib/libonnxruntime.so.1.14.0+10000005) and recompile apps_cpp inside the container:
      $ cd /opt/edgeai-gst-apps/apps_cpp/build
      $ rm -r *
      $ cmake ..
      $ make
    all the ONR models start to work when launched with /opt/edgeai-gst-apps/apps_cpp/bin/Release/app_edgeai. This library replacement still does not fix apps_python, however.
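    Since replacing the shared library fixes apps_cpp but not apps_python, one possibility is that the Python onnxruntime bindings load their own copy of the native library from somewhere other than /usr/lib (this is an assumption to verify, not a confirmed cause). A stdlib-only way to check which copy a running process actually mapped, by reading /proc/<pid>/maps (Linux only; the default pattern is just an example):

```python
import re

def loaded_libs(pid="self", pattern="libonnxruntime"):
    """List the shared-object paths matching `pattern` that a process
    has mapped, by parsing /proc/<pid>/maps (Linux only)."""
    paths = set()
    with open(f"/proc/{pid}/maps") as maps:
        for line in maps:
            m = re.search(r"(/\S+\.so\S*)", line)
            if m and pattern in m.group(1):
                paths.add(m.group(1))
    return sorted(paths)

# Example: with app_edgeai.py running under PID 1234, loaded_libs(1234)
# would show which libonnxruntime build the Python process really uses.
```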