AM62A3: Models compiled with edgeai-tidl-tools cause segmentation fault on AM62A

Part Number: AM62A3

Hello, I am using an AM62A target device where offloading models to the accelerator causes a segmentation fault.

- I have validated the "out-of-box" examples within the Docker container that I use to compile the artifacts

- I have compiled the osrt_python/tfl and osrt_python/ort examples successfully within the container (these are the files that I am transferring to the target)

- I have tested the inference in the docker container without offloading

- I have tested the inference without offloading on my target device

Only when I enable offloading on the AM62A device do I get the segmentation fault. This is the case for the osrt_python/tfl as well as the osrt_python/ort models.

- I am using the release tag 10_00_07_00 within the docker container as well as on the target device

- I have recently updated the target device's TIDL version following this explanation: edgeai-tidl-tools/docs/backward_compatibility.md at 10_00_07_00 · TexasInstruments/edgeai-tidl-tools

On a side note: is the description in edgeai-tidl-tools/docs/backward_compatibility.md at 10_00_07_00 · TexasInstruments/edgeai-tidl-tools sufficient to upgrade / install edgeai-tidl-tools on the target device? I've read something about an RTOS SDK version at some point. Is this also something I need to upgrade? If so, how?

What can I do about this? How do I approach this problem? Thanks for the help!

  • Output of python3 tflrt_delegate.py:

    root@am62dl:/opt/edgeai-tidl-tools/examples/osrt_python/tfl# python3 tflrt_delegate.py
    Running 4 Models - ['cl-tfl-mobilenet_v1_1.0_224', 'ss-tfl-deeplabv3_mnv2_ade20k_float', 'od-tfl-ssd_mobilenet_v2_300_float', 'od-tfl-ssdlite_mobiledet_dsp_320x320_coco']


    Running_Model : cl-tfl-mobilenet_v1_1.0_224

    Number of subgraphs:1 , 34 nodes delegated out of 34 nodes

    Segmentation fault (core dumped)


  • Hello Stefan,

    We will figure out where this is coming from. My first suspicion relates to tool versions:

    - I am using the release tag 10_00_07_00 within the docker container as well as on the target device

    TIDL Tools versions with odd numbers are designated as ones portable to the previous SDK, in this case 9.2. Is that the version of the SDK that you have on your AM62A installation?

    • You should have $EDGEAI_SDK_VERSION set to a string similar to 09_02 or 9.2 within your Linux environment (defined by an auto-run script on login); a quick check is sketched below this list
    • You should have updated the firmware, OSRT components (e.g. TFLite and ONNXRT libs), and other TI libraries like libtivision_apps.so using the steps mentioned in the backward_compatibility.md doc
      • it sounds like you have done this, but simply verifying here
      • Ensure the $SOC environment variable was set to 'am62a'. This should have also been handled by the auto-run script on login 
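
    As a quick sanity check, here is a minimal sketch in Python (the shell equivalent is simply echo $EDGEAI_SDK_VERSION $SOC):

    import os

    # Both values should have been exported by the auto-run login script;
    # if either is missing or unexpected, the update did not fully apply.
    print(os.environ.get("EDGEAI_SDK_VERSION"))  # expect something like "09_02" or "9.2"
    print(os.environ.get("SOC"))                 # expect "am62a"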

    If you are seeing seg faults on any model, then my estimation is that something was not correctly updated. Your approach of testing different components isolates this to the TIDL stack, which is helpful.

    I am also curious why your device's hostname is root@am62dl, but perhaps that was intentional and we can ignore.

    Suggested steps for collecting more info/logs: 

    • Pass debug_level=2 to the runtime when creating the model. It should be sufficient to set this in examples/osrt_python/common_utils.py (see the sketch after this list)
    • On target, run /opt/vx_app_arm_remote_log.out in the background before starting your script
    • Run the python application from gdb, and check the backtrace for the thread that seg-faulted
    • Run `pip3 freeze | grep -i "tflite\|onnx\|tidl"` and share the package versions
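
    For the debug_level step, here is a minimal sketch of how that option reaches the runtime (the paths and model name are illustrative; in the examples, common_utils.py builds this options dict for you):

    import tflite_runtime.interpreter as tflite

    # Illustrative delegate options -- only debug_level is the point here.
    delegate_options = {
        "artifacts_folder": "../../../model-artifacts/cl-tfl-mobilenet_v1_1.0_224/",
        "debug_level": 2,  # 0 = quiet; 2 prints runtime trace output that helps localize a crash
    }

    tidl_delegate = tflite.load_delegate("libtidl_tfl_delegate.so", delegate_options)
    interpreter = tflite.Interpreter(
        model_path="../../../models/public/mobilenet_v1_1.0_224.tflite",
        experimental_delegates=[tidl_delegate],
    )
    interpreter.allocate_tensors()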

    On a side note: is the description in edgeai-tidl-tools/docs/backward_compatibility.md at 10_00_07_00 · TexasInstruments/edgeai-tidl-tools sufficient to upgrade / install edgeai-tidl-tools on the target device? I've read something about an RTOS SDK version at some point. Is this also something I need to upgrade? If so, how?

    Yes, the instructions there are sufficient to upgrade the TIDL stack (not just edgeai-tidl-tools) on the previous SDK with the latest bugfixes and changes, with one caveat -- the memory map between the EVM and your hardware platform must be compatible. If you are on the starter kit EVM, ignore this point.

    I do not think the RTOS SDK (probably PSDK RTOS) is necessary here, but please point me towards this note if you happen across it again. If you needed to change the memory map for your custom hardware, this would be relevant. Note that for AM62A, we have a 'firmware-builder' tool that serves the same function as the PSDK RTOS.

    BR,
    Reese

  • Hello Reese,

    Thanks for the help, much appreciated!

    TIDL Tools versions with odd numbers are designated as ones portable to the previous SDK, in this case 9.2. Is that the version of the SDK that you have on your AM62A installation?

    The EDGEAI_SDK_VERSION is set to 09_00_00. Since I've tried to update the target device to 10_00_07_00, I guess this is wrong, no? I've checked the setup_target_device.sh script that we used to update the device and could not find anything related to updating this environment variable. Am I missing some steps to properly update the device to be compatible with the models compiled with edgeai-tidl-tools version 10_00_07_00? On the target device we have used the 10_00_07_00 tag of edgeai-tidl-tools:

    root@am62dl:/opt/edgeai-tidl-tools# git status
    HEAD detached at 10_00_07_00

    Ensure the $SOC environment variable was set to 'am62a'. This should have also been handled by the auto-run script on login

    The SOC variable is indeed set to am62a upon logging into the target device. This is also what we set when compiling the model artifacts in the Docker container.

    I am also curious why your device's hostname is root@am62dl, but perhaps that was intentional and we can ignore.

    Yes, this is just us renaming our device. This should not matter at all.

    Run the python application from gdb, and check the backtrace for the thread that seg-faulted

    This is the output of GDB when using `thread apply all bt`. Does anything suspicious come to mind here?

    (gdb) run tflrt_delegate.py
    Starting program: /usr/bin/python3 tflrt_delegate.py
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/libthread_db.so.1".
    warning: Cannot parse .gnu_debugdata section; LZMA support was disabled at compile time
    [New Thread 0xfffff57cf120 (LWP 81481)]
    [New Thread 0xfffff2fbf120 (LWP 81482)]
    [New Thread 0xfffff07af120 (LWP 81483)]
    Running 4 Models - ['cl-tfl-mobilenet_v1_1.0_224', 'ss-tfl-deeplabv3_mnv2_ade20k_float', 'od-tfl-ssd_mobilenet_v2_300_float', 'od-tfl-ssdlite_mobiledet_dsp_320x320_coco']
    Running_Model : cl-tfl-mobilenet_v1_1.0_224
    Number of subgraphs:1 , 34 nodes delegated out of 34 nodes
    Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
    0x0000000500000004 in ?? ()
    (gdb) thread apply all bt
    Thread 4 (Thread 0xfffff07af120 (LWP 81483) "python3"):
    [backtrace truncated]

    Run `pip3 freeze | grep -i "tflite\|onnx\|tidl"` and share the package versions

    root@am62dl:/opt/edgeai-tidl-tools/examples/osrt_python/tfl# pip3 freeze | grep -i "tflite\|onnx\|tidl"
    onnxruntime-tidl @ file:///home/root/arago_j7_pywhl/onnxruntime_tidl-1.14.0%2B10000005-cp310-cp310-linux_aarch64.whl
    tflite-runtime @ file:///home/root/arago_j7_pywhl/tflite_runtime-2.12.0-cp310-cp310-linux_aarch64.whl

    On target, run /opt/vx_app_arm_remote_log.out in the background before starting your script

    I am not sure if I have done this correctly, but here is the output after running the scripts once or twice.

    [C7x_1 ] 2322568.032723 s: UDMA: Init ... Done !!!
    [C7x_1 ] 2322568.032735 s: MEM: Init ... !!!
    [C7x_1 ] 2322568.032747 s: MEM: Created heap (DDR_LOCAL_MEM, id=0, flags=0x00000004) @ b2000000 of size 117440512 bytes !!!
    [C7x_1 ] 2322568.032776 s: MEM: Init ... Done !!!
    [C7x_1 ] 2322568.032788 s: IPC: Init ... !!!
    [C7x_1 ] 2322568.032800 s: IPC: 3 CPUs participating in IPC !!!
    [C7x_1 ] 2322568.033017 s: IPC: Waiting for HLOS to be ready ... !!!
    [C7x_1 ] 2322568.054528 s: IPC: HLOS is ready !!!
    [C7x_1 ] 2322568.054614 s: IPC: Init ... Done !!!
    [C7x_1 ] 2322568.054629 s: APP: Syncing with 2 CPUs ..
    [output truncated]

    pass debug_level=2 to the runtime when creating the model. It should be sufficient to set this in examples/osrt_python/common_utils.py

    I've compiled the models in our container again and set the logging level. This is the captured output. 

    logging.log

    Another question about the setup_target_device.sh script: I don't quite understand the instructions for the TISDK_IMAGE environment variable. How can I tell whether I need to set adas or edgeai here? What is the difference between EVM boards and SK boards?

    export TISDK_IMAGE=*adas or edgeai* // [adas for evm boards, edgeai for sk boards]

    Also, do I need to update the C7x firmware as well? I've used TISDK_IMAGE=edgeai and have not updated the C7x firmware so far.

    export UPDATE_FIRMWARE_AND_LIB=1

    Really appreciate the help. Is there any other information you need? Do you know if the TIDL installation on our target device is broken or has the wrong version? What are the next steps?

    Best Regards

  • Hello Reese, 

    I have run the setup_target_device.sh script again with the below environment variable to update the C7x firmware. 

    export UPDATE_FIRMWARE_AND_LIB=1


    The osrt_python/tfl example no longer gives a segmentation fault (which is great!), but it gets stuck when running inference on the model. This completely freezes the shell.

    root@am62dl:/opt/edgeai-tidl-tools/examples/osrt_python/tfl# python3 tflrt_delegate.py
    Running 4 Models - ['cl-tfl-mobilenet_v1_1.0_224', 'ss-tfl-deeplabv3_mnv2_ade20k_float', 'od-tfl-ssd_mobilenet_v2_300_float', 'od-tfl-ssdlite_mobiledet_dsp_320x320_coco']
    Running_Model : cl-tfl-mobilenet_v1_1.0_224
    ****** In DelegatePrepare ******
    Number of subgraphs:1 , 34 nodes delegated out of 34 nodes
    ****** In tidlDelegate::Init ******
    ************ in TIDL_subgraphRtCreate ************
    APP: Init ... !!!
    MEM: Init ... !!!
    MEM: Initialized DMA HEAP (fd=6) !!!
    MEM: Init ... Done !!!
    IPC: Init ... !!!
    IPC: Init ... Done !!!
    REMOTE_SERVICE: Init ... !!!
    REMOTE_SERVICE: Init ... Done !!!
    [output truncated]
     

    I captured the backtraces using gdb again:

    Type "apropos word" to search for commands related to "word"...
    Reading symbols from python3...
    (No debugging symbols found in python3)
    (gdb) run tflrt_delegate.py
    Starting program: /usr/bin/python3 tflrt_delegate.py
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/libthread_db.so.1".
    warning: Cannot parse .gnu_debugdata section; LZMA support was disabled at compile time
    [New Thread 0xfffff57cf120 (LWP 85454)]
    [New Thread 0xfffff2fbf120 (LWP 85455)]
    [New Thread 0xfffff07af120 (LWP 85456)]
    Running 4 Models - ['cl-tfl-mobilenet_v1_1.0_224', 'ss-tfl-deeplabv3_mnv2_ade20k_float', 'od-tfl-ssd_mobilenet_v2_300_float', 'od-tfl-ssdlite_mobiledet_dsp_320x320_coco']
    Running_Model : cl-tfl-mobilenet_v1_1.0_224
    ****** In DelegatePrepare ******
    Number of subgraphs:1 , 34 nodes delegated out of 34 nodes
    [output truncated]

    Any ideas what is going on here?

    Best Regards

  • Hi Stefan,

    Thanks for all the information here -- much appreciated and very helpful. I see the issue.

    The EDGEAI_SDK_VERSION is set to 09_00_00. Since I've tried to update the target device to 10_00_07_00, I guess this is wrong, no?

    Unfortunately yes, this is probably an incompatible combination. Please see the version_compatibility doc. We started this form of backwards compatibility at the 10.0 SDK and maintained compatibility (with the steps you found) for the 9.2 SDK. This does not apply to the 9.0 SDK.

    So this is a version compatibility issue: you are applying 10.0.0.7 firmware that is compatible with the 9.2 SDK to an actual 9.0 SDK installation.

    Are you able to move SDKs to either 9.2 or 10.0? Worth noting that the 10.1 SDK will release within the next couple of weeks. Otherwise, you would need to stick with edgeai-tidl-tools from 09_00_XX_YY

    BR,
    Reese

  • Hey Reese,

    I'm a colleague of Stefan's; we work on the same devboard (so all the information Stefan has given also holds for this post).

    Otherwise, you would need to stick with edgeai-tidl-tools from 09_00_XX_YY

    I tried your suggestion and changed the TIDL tools version in our devcontainer to 09_00_00_06:

    tidl-model-compilation/edgeai-tidl-tools$ git st
    HEAD detached at 09_00_00_06

    I think we also have some issues with updating the SDK on our devboard; Stefan tried to update to 10_00_07_00, but I think this was not successful (see the post above for what Stefan tried).

    However, when I try the example compilation, the Python script hangs (see the error message after Ctrl-C at the end).

    root@c9f23fa83205:/opt/edgeai-tidl-tools/examples/osrt_python/tfl# python tflrt_delegate.py -c
    Running 4 Models - ['cl-tfl-mobilenet_v1_1.0_224', 'ss-tfl-deeplabv3_mnv2_ade20k_float', 'od-tfl-ssd_mobilenet_v2_300_float', 'od-tfl-ssdlite_mobiledet_dsp_320x320_coco']
    Running_Model : cl-tfl-mobilenet_v1_1.0_224
    Running_Model : ss-tfl-deeplabv3_mnv2_ade20k_float
    Running_Model :
    Running_Model : od-tfl-ssdlite_mobiledet_dsp_320x320_coco
    od-tfl-ssd_mobilenet_v2_300_float
    Number of OD backbone nodes = 89
    Size of odBackboneNodeIds = 89
    TIDL Meta PipeLine (Proto) File : ../../../models/public/ssdlite_mobiledet_dsp_320x320_coco_20200519.prototxt
    Number of OD backbone nodes = 112
    Size of odBackboneNodeIds = 112
    Preliminary number of subgraphs:1 , 81 nodes delegated out of 81 nodes
    Preliminary number of subgraphs:1 , 34 nodes delegated out of 34 nodes
    [output truncated]

    This is what is being created:

    root@c9f23fa83205:/opt/edgeai-tidl-tools/models/public# l
    total 117M
    8.7M -rw-r--r-- 1 root root 8.7M Dec 20 10:24 deeplabv3_mnv2_ade20k_float.tflite
    17M -rw-r--r-- 1 root root 17M Dec 20 10:24 mobilenet_v1_1.0_224.tflite
    28M -rw-r--r-- 1 root root 28M Dec 20 10:24 ssdlite_mobiledet_dsp_320x320_coco_20200519.tflite
    4.0K -rw-r--r-- 1 root root 2.9K Dec 20 10:24 ssdlite_mobiledet_dsp_320x320_coco_20200519.prototxt
    65M -rw-r--r-- 1 root root 65M Dec 20 10:24 ssd_mobilenet_v2_300_float.tflite
    (3.10.16) root@c9f23fa83205:/opt/edgeai-tidl-tools/model-artifacts/cl-tfl-mobilenet_v1_1.0_224/tempDir# l
    total 20M
    12K -rw-r--r-- 1 root root 8.8K Dec 20 11:43 86_tidl_net.bin_netLog.txt
    19M -rw-r--r-- 1 root root 19M Dec 20 11:43 86_tidl_net.bin
    40K -rw-r--r-- 1 root root 37K Dec 20 11:43 86_tidl_io_1.bin
    4.0K -rw-r--r-- 1 root root 1.8K Dec 20 11:43 86_tidl_net.bin.layer_info.txt
    236K -rw-r--r-- 1 root root 236K Dec 20 11:43 86_tidl_net.bin.svg
    [output truncated]

    It runs locally, but on the devboard it tells me that "allowedNode.txt" is missing.

    Any ideas what went wrong here? 

    (Note: not urgent, I'll be returning from Christmas holidays in mid-January.)

  • Hi Dominic,

    Okay, so you are compiling against the 9.0 SDK tools now, got it. This is the correct approach if it is not feasible to upgrade the SDK otherwise.

    However, when I try the example compilation, the Python script hangs (see the error message after Ctrl-C at the end).

    I see that you are running the default example here for the models. This will fork multiple processes and may hang while it waits on one to return. Perhaps one of those failed; it is difficult to tell from the logs.

    • I noted on my side that the od-tfl-ssd_mobilenet_v2_300_float model hit a segfault during compilation.
      • If one model doesn't complete and hangs, then the whole run will hang as the main process waits for its workers to complete (see the sketch below).
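
    To illustrate the failure mode, here is a simplified sketch of that fork-and-wait pattern (the names are illustrative rather than the script's actual structure, and it assumes Linux's default fork start method):

    import multiprocessing

    sem = multiprocessing.Semaphore(0)

    def run_model(name):
        print("compiling", name)  # placeholder for the actual per-model compilation
        sem.release()             # a child that segfaults dies before reaching this line

    if __name__ == "__main__":
        models = ["model_a", "model_b"]  # illustrative names
        for m in models:
            multiprocessing.Process(target=run_model, args=(m,)).start()
        for _ in models:
            sem.acquire()  # blocks forever if any child died without releasing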

    Are you interested in a specific model or just trying to test the tools?

    You can run a single model by adding '-m MODEL_CONFIG_NAME' to the command line args, where MODEL_CONFIG_NAME is a key from the examples/osrt_python/model_configs.py. One of these is "cl-tfl-mobilenet_v1_1.0_224". 

    The files in your tempDir look correct, but those are intermediate files (and some debugging info). The directory one level up from that has the important artifact files: there should be 2 binaries, a model file, and a few supporting files like that allowedNode.txt.

    I'd recommend increasing the debug_level parameter to 1. You can change this globally from the common_utils.py file or by adding 'debug_level': 1 to an additional "optional_options" dictionary within a model_configs.py dict entry. Most likely one model is failing and causing the whole script to hang.
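
    As a sketch of the per-model variant (assuming your model_configs.py follows the usual dict-of-dicts layout; everything except the 'optional_options' line mirrors what is already in the file):

    # in examples/osrt_python/model_configs.py, after the existing definitions:
    model_configs["cl-tfl-mobilenet_v1_1.0_224"]["optional_options"] = {
        "debug_level": 1,  # per-model override of the global default from common_utils.py
    }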

    BR,
    Reese

  • Hi Reese,

    I'm back from my holidays and tried out your suggestions -- unfortunately, none of them seems to have worked.

    Are you interested in a specific model or just trying to test the tools?

    Currently, I'm only interested in completing the compilation / deployment workflow for an arbitrary model. Next would be to deploy a custom model and do evaluations (accuracy, inference time) with it. 

    You can run a single model by adding '-m MODEL_CONFIG_NAME' to the command line args, where MODEL_CONFIG_NAME is a key from the examples/osrt_python/model_configs.py. One of these is "cl-tfl-mobilenet_v1_1.0_224". 

    I think the `-m` option is not available (yet?) in the script I'm calling (`root@cd64c3ba8cc1:/opt/edgeai-tidl-tools/examples/osrt_python/tfl# python tflrt_delegate.py -c`), but I just manually edited the `models` list in line 240.

    I also set `ncpus = 1` (line 41), otherwise I still get the thread-related error (os.cpu_count() == 24 on my system); both edits are sketched after the traceback below:

    ^CTraceback (most recent call last):
    File "/opt/edgeai-tidl-tools/examples/osrt_python/tfl/tflrt_delegate.py", line 275, in <module>
    nthreads = join_one(nthreads)
    File "/opt/edgeai-tidl-tools/examples/osrt_python/tfl/tflrt_delegate.py", line 257, in join_one
    sem.acquire()
    KeyboardInterrupt
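
    Roughly, the two edits look like this (line numbers are from the 09_00_00_06 tag and may differ elsewhere):

    # tflrt_delegate.py, around line 240: replaced the full four-model list
    models = ["cl-tfl-mobilenet_v1_1.0_224"]

    # tflrt_delegate.py, around line 41: was derived from the host CPU count
    ncpus = 1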

    I tried all of the following models with the same result:

    'cl-tfl-mobilenet_v1_1.0_224'
    'ss-tfl-deeplabv3_mnv2_ade20k_float'
    'od-tfl-ssd_mobilenet_v2_300_float'
    'od-tfl-ssdlite_mobiledet_dsp_320x320_coco'

    Exemplary output with debug_level = 1 and ncpus = 1:

    (3.10.16) root@cd64c3ba8cc1:/opt/edgeai-tidl-tools/examples/osrt_python/tfl# python tflrt_delegate.py -c
    Running 1 Models - ['od-tfl-ssdlite_mobiledet_dsp_320x320_coco']
    Running_Model : od-tfl-ssdlite_mobiledet_dsp_320x320_coco
    tidl_tools_path = /opt/edgeai-tidl-tools/tidl_tools
    artifacts_folder = ../../../model-artifacts//od-tfl-ssdlite_mobiledet_dsp_320x320_coco/
    tidl_tensor_bits = 8
    debug_level = 1
    num_tidl_subgraphs = 16
    tidl_denylist =
    tidl_denylist_layer_name =
    tidl_denylist_layer_type =
    tidl_allowlist_layer_name =
    model_type =
    tidl_calibration_accuracy_level = 7
    tidl_calibration_options:num_frames_calibration = 2
    tidl_calibration_options:bias_calibration_iterations = 5
    mixed_precision_factor = -1.000000
    model_group_id = 0
    power_of_2_quantization = 2
    [output truncated]

    Not sure if this is of any interest, but we base our dev-container on 

    nvidia/cuda:12.3-devel-ubuntu22.04
    (3.10.16) root@cd64c3ba8cc1:/opt/edgeai-tidl-tools/examples/osrt_python/tfl# nvidia-smi
    Tue Jan 14 14:22:21 2025
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 555.58.02 Driver Version: 556.12 CUDA Version: 12.5 |
    |-----------------------------------------+------------------------+----------------------+
    | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
    | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
    | | | MIG M. |
    |=========================================+========================+======================|
    | 0 NVIDIA RTX A2000 12GB On | 00000000:01:00.0 On | Off |
    | 30% 26C P8 5W / 70W | 255MiB / 12282MiB | 0% Default |
    | | | N/A |
    +-----------------------------------------+------------------------+----------------------+
    +-----------------------------------------------------------------------------------------+
    | Processes: |
    | GPU GI CI PID Type Process name GPU Memory |
    | ID ID Usage |
    |=========================================================================================|
    | No running processes found |
    +-----------------------------------------------------------------------------------------+
    Thanks in advance,
    Dominic

  • Hi Dominic,

    Reese is out this week and won't be able to respond until next week.

    Regards,

    Jianzhong

  • Hi Dominic,

    Thanks for the patience while I was out.

    I realize that I recommended a CLI option -m that wasn't in this version of the tools. My apologies -- I had forgotten that in this release the set of models must be defined within the script itself.

    I see that you are getting the (particularly opaque) "bus error" as the script fails out. This is the case for all the models you try, correct? Generally there is an easy solution when TIDL fails on a bus error. This occurs when some shared memory under /dev/shm fails to clear, and TIDL is then unable to allocate more, resulting in the error. Try the line below to clear the /dev/shm files that TIDL would have created:

    rm /dev/shm/vashm*

    I noted that my compilation ran into an issue in the last stage for the models 'od-tfl-ssdlite_mobiledet_dsp_320x320_coco' and 'od-tfl-ssd_mobilenet_v2_300_float' (later than your logs), but the other two complete without issue and provide reasonable output. To be completely frank, 9.0 SDK was the least stable of releases between 8.6 and current (10.1) -- I recommend upgrading if possible.

    I think your container is fine. Ubuntu 22.04 is correct. SDK 9.0 did not have GPU-based tools for speeding up compilation, so GPU info / status should not play a role here. 

    BR,
    Reese

  • Hi Reese,

    I tried your suggestion -- unfortunately I get the same bus error one line of output later.

    ************ in TIDL_subgraphRtCreate ************
    The soft limit is 2048
    The hard limit is 2048
    MEM: Init ... !!!
    MEM: Init ... Done !!!
    0.0s: VX_ZONE_INIT:Enabled
    0.4s: VX_ZONE_ERROR:Enabled
    0.5s: VX_ZONE_WARNING:Enabled
    0.1520s: VX_ZONE_INIT:[tivxInit:185] Initialization Done !!!
    ************ TIDL_subgraphRtCreate done ************
    tidl_tfLiteRtImport_delegate.cpp Invoke 478
    ******* In TIDL_subgraphRtInvoke ********
    Bus error (core dumped)

    Container resources should be good (I started the container fresh; it is the only running container):

    CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
    f21fbd32371b loving_mclean 6.04% 1.688GiB / 31.19GiB 5.41% 86MB / 1.4MB 0B / 0B 106

    To be completely frank, 9.0 SDK was the least stable of releases between 8.6 and current (10.1) -- I recommend upgrading if possible. 

    I think this is what we'll do. Stefan has meanwhile managed to deploy a custom-trained YOLOX model on version 10.

    Related: I noticed that we'll likely meet on Feb 10, as I'll participate in the SICK workshop that you and Manuel Philippin are signed up for --> would it make sense for us to compile a list of questions / topics for you beforehand?

    best regards,

    Dominic

  • Hi Dominic, 

    Hmm, still experiencing that bus error. I'm surprised clearing the shared memory didn't resolve this, especially if you are compiling one small model as an initial test.

    I do think you'll have a much better experience in 10.0 or newer.

    • I will mention that a model similar to "od-tfl-ssdlite_mobiledet_dsp_320x320_coco" had an issue on 10.0/10.1. I confirmed the fix for that last week, and a bugfix release with it will go live in the next week or so (10.1.0.4 is the equivalent version string). I mention this as a quick warning in case you see an error with "ValueError: basic_string::_M_create" prominently printed.

    I think this is what we'll do. Stefan has meanwhile managed to deploy a custom-trained YOLOX model on version 10.

    Awesome, that's great to hear. 

    would it make sense for us to compile a list of questions / topics for you beforehand?

    Yes! That would be very helpful -- we can then get the content and discussion geared to be as practical as possible. Please send me / Manuel a list of questions so we can review and prepare. Looking forward to meeting you!

    BR,
    Reese