This thread has been locked.


SK-AM62A-LP: [edgeai-modelmaker] strange compilation run.log printout

Part Number: SK-AM62A-LP

Tool/software:

Background: I followed the edgeai-modelmaker instructions step by step; the model trained and was converted to an ONNX file, and the model artifacts were generated successfully. Everything seemed fine, but when I checked the compilation run.log, some printouts did not look right to me.

Also, I uploaded the same image to the SK-AM62A-LP and ran inference, and found that the results differ: inference results from the compiled model simulated on the host machine are significantly better than those produced by the SK-AM62A-LP. I suspect this is due to a potential compilation issue.

0726.run.log

Hello, 

I would like to ask what compilation run.log lines 47 and 48, "Unable to find initializer at index - 1 for node 93" and "Unable to find initializer at index - 1 for node 109", mean. I also see the same thing in my compilation run.log.

Another question: after "==================== [Optimization for subgraph_0 Started] ====================", "Invalid Layer Name" is printed repeatedly. I don't understand what it means.

The reason I raise these two questions is that I found the inference results of the compiled model artifacts simulated on the host machine are significantly better than the inference results produced by the SK-AM62A-LP with the SAME model artifacts (please refer to https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1498572/sk-am62a-lp-edgeai-modelmaker-which-shell-python-script-produces-outputs-folder?tisearch=e2e-sitesearch&keymatch=edgeai-tensorlab%20outputs%20folder# for more details). I suspect that the strange output in run.log is related to this issue: maybe some layers are not valid in the subgraph optimization, which affects the performance of the model artifacts on the SK-AM62A-LP.

I am looking forward to hearing from any of you. Thanks.

Regards,

Matt

  • Hi Matt,

     I understand you're running into challenges with edgeai-modelmaker, and it looks like you're using SDK 10.0 here. 

    Let me make sure I understand the specifics of your issue:

    I found out that their performances are different. Inference results of the compiled model that is simulated on the host machine are significantly better than inference results produced by SK-AM62A-LP.

    The host-side emulation and target-side accuracy are different, then? Please confirm.

    "Unable to find initializer at index - 1 for node 93
    Unable to find initializer at index - 1 for node 109"

    I expect this means that some layer failed to find constant values like weights.

    Probably related:

    "Invalid Layer Name" is printed repeatedly

    This tells me that the network's layers are not being parsed correctly for some reason. I see a list of 9 layers repeated:

    Invalid Layer Name  /multi_level_conv_obj.2/Conv_output_0
    Invalid Layer Name  /multi_level_conv_reg.2/Conv_output_0
    Invalid Layer Name  /multi_level_conv_cls.2/Conv_output_0
    Invalid Layer Name  /multi_level_conv_obj.1/Conv_output_0
    Invalid Layer Name  /multi_level_conv_reg.1/Conv_output_0
    Invalid Layer Name  /multi_level_conv_cls.1/Conv_output_0
    Invalid Layer Name  /multi_level_conv_obj.0/Conv_output_0
    Invalid Layer Name  /multi_level_conv_reg.0/Conv_output_0
    Invalid Layer Name  /multi_level_conv_cls.0/Conv_output_0

    I would not expect this to happen for a TI validated flow -- it looks like you are using YOLO-X-S.

    Now to work towards a resolution:

    If you are willing to pass along your exported ONNX model + compiled artifacts, that would be helpful. Alternatively, please show the full layer names that match above -- I am curious if a portion of the name was thrown away or has a strange, invalid character. 

    However, I have a suspicion that this is more closely related to the PyTorch-->ONNX export. What are your versions of the torch-related and onnx-related libraries in Python (see `pip3 freeze` output)?

    • How did you set up modelmaker?
    • Are you using any form of virtual environment to keep these segregated from the rest of your machine's packages? Could there have been other versions of libraries like PyTorch or ONNX already installed?
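    For reference, a small stdlib-only snippet that prints just the relevant versions rather than the whole freeze output (the package names here are the ones I would check first; adjust as needed):

```python
# Print versions of the torch/onnx-related packages only, instead of
# scanning the full `pip3 freeze` output. Package names are examples.
import importlib.metadata as md

for pkg in ("torch", "torchvision", "onnx", "onnxsim", "onnxruntime-tidl"):
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")
```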

    I think what's happened is that some weights / constant values in the model were not read correctly, and are being saved in such a way that the PC differs from the target when loading them from the artifacts. We'll need to learn why this happened.

    BR,
    Reese

  • Hello Reese,

    Thank you for your reply.


    1. To deal with the model artifacts' performance discrepancy between the x86 host PC and the SK-AM62A-LP, today I started studying the relevant documentation you mentioned:

    https://github.com/TexasInstruments/edgeai-tidl-tools/blob/10_00_08_00/docs/tidl_osr_debug.md#steps-to-debug-error-scenarios-for-targetevmdevice-execution.

    By the way, I will use edgeai-tidl-tools to generate artifacts instead of edgeai-benchmark (talking about model compilation/inference ONLY), as per your colleague's advice:

    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1498572/sk-am62a-lp-edgeai-modelmaker-which-shell-python-script-produces-outputs-folder/5790817#5790817

    I will ask you if I struggle to understand it.


    2. Yes, I am using the YOLOX-small-lite model.

    3. Here is the output folder after I ran ./run_modelmaker AM62A config_detection.yaml under edgeai-modelmaker:

    https://github.com/MattMak0402/od-8220-20250423.git 

    I uploaded it to my GitHub repository; please take a look. The folder is named "od-8220-20250423" and includes the outputs (the inference results of the generated artifacts on my x86 host PC), the compilation run.log, and other required components. The folder was originally generated in /local_data/home/mattmak/mattmak_20250314/edgeai-tensorlab/edgeai-modelmaker/data/projects/htt_Fan_Matt_basic_cam.v8i.coco_RAW/run/20250422-125939/yolox_s_lite/compilation/AM62A/work/od-8220-20250423.

    4. Here is the requirements.txt of my edgeai-modelmaker setup:

    absl-py==2.1.0
    addict==2.4.0
    aenum==3.1.15
    aliyun-python-sdk-core==2.15.2
    aliyun-python-sdk-kms==2.16.5
    attrs==24.2.0
    autocfg==0.0.8
    cachetools==5.5.0
    caffe2onnx @ https://github.com/TexasInstruments/edgeai-caffe2onnx/archive/refs/heads/tidl.zip#sha256=dbf46afa4a035fbf40829c7147fe959c1403ac54b00338fc5d00d5f6a893e82a
    certifi==2024.8.30
    cffi==1.17.1
    charset-normalizer==3.3.2
    click==8.1.7
    cloudpickle==3.0.0
    colorama==0.4.6
    colored==2.2.4
    coloredlogs==15.0.1
    contourpy==1.3.0
    crcmod==1.7
    cryptography==43.0.1
    cycler==0.12.1
    Cython==3.0.11
    dataclasses==0.6
    debugpy==1.8.6
    decorator==5.1.1
    dill==0.3.8
    distro==1.9.0
    dlr @ https://software-dl.ti.com/jacinto7/esd/tidl-tools/10_00_08_00/OSRT_TOOLS/X86_64_LINUX/UBUNTU_22_04/dlr-1.13.0-py3-none-any.whl#sha256=c979a93be924649a6b53cc98b10a4a052493c36155cfd781a8eb17562d87e527
    -e git+https://github.com/TexasInstruments/edgeai-tensorlab.git@3de61dfa503c408346c3bcd029f49a25e42a8a73#egg=edgeai_torchmodelopt&subdirectory=edgeai-modeloptimization/torchmodelopt
    -e git+https://github.com/TexasInstruments/edgeai-tensorlab.git@3de61dfa503c408346c3bcd029f49a25e42a8a73#egg=edgeai_benchmark&subdirectory=edgeai-benchmark
    -e git+https://github.com/TexasInstruments/edgeai-tensorlab.git@3de61dfa503c408346c3bcd029f49a25e42a8a73#egg=edgeai_modelmaker&subdirectory=edgeai-modelmaker
    -e git+https://github.com/TexasInstruments/edgeai-tensorlab.git@34607ac7547989dbbecfe52be212ef03ed6b6e1a#egg=edgeai_tensorvision&subdirectory=edgeai-tensorvision
    -e git+https://github.com/TexasInstruments/edgeai-tensorlab.git@3de61dfa503c408346c3bcd029f49a25e42a8a73#egg=edgeai_xvision&subdirectory=edgeai-torchvision/references/edgeailite
    einops==0.8.1
    exceptiongroup==1.2.2
    filelock==3.14.0
    flatbuffers==1.12
    fonttools==4.54.1
    fsspec==2024.2.0
    gluoncv==0.10.5.post0
    google-auth==2.35.0
    google-auth-oauthlib==0.4.6
    graphviz==0.20.3
    grpcio==1.66.1
    h5py==3.12.1
    huggingface-hub==0.29.3
    humanfriendly==10.0
    idna==3.10
    importlib_metadata==8.5.0
    iniconfig==2.0.0
    Jinja2==3.1.3
    jmespath==0.10.0
    joblib==1.4.2
    json-tricks==3.17.3
    kiwisolver==1.4.7
    loguru==0.7.3
    Markdown==3.7
    markdown-it-py==3.0.0
    MarkupSafe==2.1.5
    matplotlib==3.9.2
    mdurl==0.1.2
    ml_dtypes==0.5.1
    mmcv==2.2.0
    -e git+https://github.com/TexasInstruments/edgeai-tensorlab.git@3de61dfa503c408346c3bcd029f49a25e42a8a73#egg=mmdeploy&subdirectory=edgeai-mmdeploy
    -e git+https://github.com/TexasInstruments/edgeai-tensorlab.git@3de61dfa503c408346c3bcd029f49a25e42a8a73#egg=mmdet&subdirectory=edgeai-mmdetection
    mmengine==0.10.5
    model-index==0.1.11
    mpmath==1.3.0
    multiprocess==0.70.16
    munkres==1.1.4
    networkx==3.2.1
    ninja==1.11.1.1
    numpy==1.23.0
    nvidia-cublas-cu11==11.11.3.6
    nvidia-cublas-cu12==12.1.3.1
    nvidia-cuda-cupti-cu11==11.8.87
    nvidia-cuda-cupti-cu12==12.1.105
    nvidia-cuda-nvrtc-cu11==11.8.89
    nvidia-cuda-nvrtc-cu12==12.1.105
    nvidia-cuda-runtime-cu11==11.8.89
    nvidia-cuda-runtime-cu12==12.1.105
    nvidia-cudnn-cu11==9.1.0.70
    nvidia-cudnn-cu12==8.9.2.26
    nvidia-cufft-cu11==10.9.0.58
    nvidia-cufft-cu12==11.0.2.54
    nvidia-curand-cu11==10.3.0.86
    nvidia-curand-cu12==10.3.2.106
    nvidia-cusolver-cu11==11.4.1.48
    nvidia-cusolver-cu12==11.4.5.107
    nvidia-cusparse-cu11==11.7.5.86
    nvidia-cusparse-cu12==12.1.0.106
    nvidia-nccl-cu11==2.20.5
    nvidia-nccl-cu12==2.19.3
    nvidia-nvjitlink-cu12==12.8.93
    nvidia-nvtx-cu11==11.8.86
    nvidia-nvtx-cu12==12.1.105
    oauthlib==3.2.2
    onnx==1.12.0
    onnx_graphsurgeon @ git+https://github.com/NVIDIA/TensorRT@68b5072fdb9df6b6edab1392b02a705394b2e906#subdirectory=tools/onnx-graphsurgeon
    onnxruntime-tidl @ https://software-dl.ti.com/jacinto7/esd/tidl-tools/10_00_08_00/OSRT_TOOLS/X86_64_LINUX/UBUNTU_22_04/onnxruntime_tidl-1.14.0+10000005-cp310-cp310-linux_x86_64.whl#sha256=366c3af47bcfec87fe1c36155ab7226210303cde812acbb25d4b0509715f355e
    onnxscript==0.2.0
    onnxsim==0.4.35
    opencv-python==4.10.0.84
    opencv-python-headless==4.10.0.84
    opendatalab==0.0.10
    openmim==0.3.9
    openxlab==0.1.2
    ordered-set==4.1.0
    osrt_model_tools @ git+https://github.com/TexasInstruments/edgeai-tidl-tools.git@efae61031b31aa2ba5491c03bb808216d3baef14#subdirectory=scripts
    oss2==2.17.0
    packaging==24.1
    pandas==2.2.3
    pillow==10.4.0
    Pillow-SIMD==9.5.0.post2
    platformdirs==4.3.6
    pluggy==1.5.0
    plyfile==1.1
    portalocker==3.1.1
    prettytable==3.11.0
    progiter==2.0.0
    progressbar==2.5
    protobuf==3.20.1
    psutil==6.0.0
    pyasn1==0.6.1
    pyasn1_modules==0.4.1
    pybind11==2.13.6
    pybind11_global==2.13.6
    pycocotools==2.0.8
    pycparser==2.22
    pycryptodome==3.20.0
    pydot==3.0.2
    Pygments==2.18.0
    pyparsing==3.1.4
    pytest==8.3.3
    python-dateutil==2.9.0.post0
    pytz==2023.4
    PyYAML==6.0.2
    requests==2.28.2
    requests-oauthlib==2.0.0
    rich==13.4.2
    rsa==4.9
    safetensors==0.5.3
    scikit-learn==1.5.2
    scipy==1.13.1
    shapely==2.0.6
    six==1.16.0
    sympy==1.12
    tabulate==0.9.0
    tensorboard==2.11.2
    tensorboard-data-server==0.6.1
    tensorboard-plugin-wit==1.8.1
    termcolor==2.4.0
    terminaltables==3.1.10
    tflite==2.10.0
    tflite-runtime @ https://software-dl.ti.com/jacinto7/esd/tidl-tools/10_00_08_00/OSRT_TOOLS/X86_64_LINUX/UBUNTU_22_04/tflite_runtime-2.12.0-cp310-cp310-linux_x86_64.whl#sha256=e81213e441dc554a8bed2fc75fdd5ad1e25074d27bcb4348d4e31b1fb4eae085
    threadpoolctl==3.5.0
    -e git+https://github.com/TexasInstruments/edgeai-tensorlab.git@34607ac7547989dbbecfe52be212ef03ed6b6e1a#egg=tidl_tools_package&subdirectory=edgeai-benchmark
    timm==1.0.15
    tomli==2.0.1
    torch==2.4.0+cu118
    torchaudio==2.4.0+cu118
    torchinfo==1.8.0
    torchvision==0.19.0+cu118
    tornado==6.4.1
    tqdm==4.65.2
    triton==3.0.0
    tvm @ https://software-dl.ti.com/jacinto7/esd/tidl-tools/10_00_08_00/OSRT_TOOLS/X86_64_LINUX/UBUNTU_22_04/tvm-0.12.0-cp310-cp310-linux_x86_64.whl#sha256=b2a9793bb5f8fca509c90820596d4af14ab3e7fa6792a627a444da8e0692eaac
    typing_extensions==4.12.2
    tzdata==2024.2
    urllib3==1.26.20
    wcwidth==0.2.13
    Werkzeug==3.0.4
    wurlitzer==3.1.1
    yacs==0.1.8
    yapf==0.40.2
    zipp==3.20.2
    


    You asked

    • How did you set up modelmaker?
    • Are you using any form of virtual environment to keep these segregated from the rest of your machine's packages? Could there have been other versions of libraries like PyTorch or ONNX already installed?

    My answer: 


    I just walked through the instructions in https://github.com/TexasInstruments/edgeai-tensorlab/tree/main/edgeai-modelmaker . Besides, I use pyenv to separate each project's dependencies. I am quite confident that the virtual environment setup is fine.


    5. Since you suspect this is more closely related to the PyTorch-->ONNX export, here is the training log file: 1325.run.log

    Thank you for your help. Please tell me if any further info is needed.

    Regards,
    Matt


    P.S. I am using edgeai-tidl-tools to try to debug this issue; I will inform you if there is any news. Thank you, Reese.

  • Hi Matt,

    Thanks for the detailed response, this is helpful.

    My colleague was right to point you in the direction of edgeai-tidl-tools. This is best for doing standalone testing with TI Deep Learning. Edgeai-benchmark has a steeper learning curve than is worthwhile for your scenario. 

    I was able to download your model and see the logs.

    As the logs suggest, there is no layer or output named "/multi_level_conv_obj.2/Conv_output_0" (or any of the others listed), but these are similar to some of the layers feeding into the object-detection head / NMS (e.g. the layer named 193 in your ONNX model). This still suggests a parsing error.

    Another thing I notice in your model is the ONNX opset version: it is 17. TIDL supports up to 18, and I know from past releases that we've validated against 8, 11, and 18 (with the latest 10.1 SDK release supporting opset 19).

    Similarly, I notice that your set of Python packages uses ONNX 1.12.0 (which supports up to opset 17), whereas modelmaker was validated against ONNX==1.13.0 (opset 18).

    As you are testing with edgeai-tidl-tools, I would suggest you try converting the opset version from 17 to 18 for your model. 

    • https://onnx.ai/onnx/api/version_converter.html 
    • I am hopeful that this conversion is sufficient. If the issue persists, it might be worth upgrading the ONNX package version in your pyenv from 1.12.0 to 1.13.0 and rerunning modelmaker (a small number of epochs) to see if this is an artifact of the PyTorch export.


    Let me know how your progress goes!

    BR,
    Reese

  • Hello Reese,

    It is midnight in Hong Kong now, so I will follow your recommendation and try to implement it tomorrow.

    I just want to tell you that it seems I have solved the issue

    "Unable to find initializer at index - 1 for node 93
    Unable to find initializer at index - 1 for node 109"

    by updating the edgeai-tidl-tools to version 10_01_04_00.

    Here is the compilation log
    compilation_log_10_01_04_00_debug_level_1.txt
    However, I haven't deployed it onto my SK-AM62A-LP to test it out. I will tell you about the newly generated artifacts' performance tomorrow. I hope it solves the problem I mentioned:

    I found out that their performances are different. Inference results of the compiled model that is simulated on the host machine are significantly better than inference results produced by SK-AM62A-LP.

    I will get back to you soon. Thank you very much.

    Regards,
    Matt


    P.S. A little sidetrack: why are the layer cycles all 0s? It seems strange. (Please refer to the attached compilation run log.)
  • Hello Reese,

    I changed the opset version from 17 to 18 in edgeai-tidl-tools. But after I updated edgeai-tidl-tools from 10_00_08_00 to 10_01_04_00, the generated artifacts could not run inference on the SK-AM62A-LP (it seems there is a failure in the TIDL execution provider, maybe a version incompatibility). To deal with it, I will try the newest Processor SDK Linux for AM62x, version 11.00.09.04.

    To make things clearer, let me summarize what I have done in a few words.

    At the beginning, I trained and compiled models with edgeai-modelmaker.

    Then, I did model compilation only with edgeai-tidl-tools (ver 10_00_08_00). What I asked and mentioned in this forum before this message is based on edgeai-tidl-tools (ver 10_00_08_00).

    After this message, I will try to use edgeai-tidl-tools (ver 10_01_04_00) with opset 18 and onnx>=1.13.0, with Processor SDK Linux for AM62x version 11.00.09.04 on my SK-AM62A-LP. I will follow your instructions based on this scenario.

    Hope this message gives you a clear picture of what I have done so far.

  • Hello Reese,

    I have solved the issue and now the model artifacts run well on my custom SK-AM62A-LP. Below is the compilation log:
    good_model_V8.txt

    *Please note that although the log shows model opset version 17, I actually set the model opset version from 17 to 18 explicitly as per your suggestion; it really improves the performance, thanks.

    But I still want to ask: why are the layer cycles all 0s? It seems strange; I expect non-zero numbers there. (Please refer to the attached compilation run log.)

  • Hello Reese,

    I have solved most of the problems. Now I am struggling to put the original class name on the bounding box. When I deploy the model on the SK-AM62A-LP, the shown class name is always "detection/category_X", where X refers to the class_ID; I have 7 output classes, so 0 < X <= 7. However, I want the label to be the original output class name, e.g. "Person", "Coat", etc. How can I do it?

    Regards,
    Matt

  • Hello Matt,

    My apologies for the lack of response the last few days. I am glad you have been able to proceed in the meantime.

    I am glad that the opset version change improved your situation. 

    After this message, I will try to use edgeai-tidl-tools (ver 10_01_04_00) with opset 18 and onnx>=1.13.0, with Processor SDK Linux for AM62x version 11.00.09.04 on my SK-AM62A-LP. I will follow your instructions based on this scenario.

    I will briefly mention a couple things:

    1. As you've likely found, the SDK on target (AM62A) and the host-PC TIDL tools must have the same version (e.g. 10.1 Linux SDK and TIDL tools 10_01_04_00).
      1. More details at my FAQ on SDK versions: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1455079/faq-edge-ai-studio-is-sdk-version-important-for-edge-ai-and-ti-deep-learning-tidl-with-c7x-on-am6xa-socs-am62a-am67a-am68a-am68pa-am69a 
    2. You mention the AM62x SDK, version 11.0. This is different from the AM62A SDK (the part names are very similar, and this can be confusing, I know).
      1. Seems like you figured this out too :) AM62A with the 10.1 SDK should be used for the TIDL tools you mention.

    But I still want to ask: Why are they all 0s in the layer cycles? It is strange. I expect non-zero numbers there.  (Please refer to the attached compilation run log)

    The PC-side tools can emulate the C7xMMA AI accelerator with bit-accuracy. This enables accuracy testing. It is not cycle-accurate, so no performance testing here; as a result, the layer cycles are shown as 0. If you were to run the model on target with the same debug_level, you would see non-zero cycle counts.

    And last remaining question (for now)

    However, I want the label to be the original output class name e.g. "Person", "Coat", etc. How can I do it?

    The class names are generally read from a 'dataset.yaml' file that is present in your artifacts. edgeai-tidl-tools compilation does not have dataset info, so it will create a basic list of class names like 'category_N' for the Nth class index.

    For models compiled through edgeai-modelmaker / edgeai-benchmark, a more relevant dataset.yaml should be produced; these tools have details about your dataset from training. You can also manually edit the dataset.yaml.
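    If you want to script that edit, here is a hypothetical sketch. The 'categories' list with 'id'/'name' entries is an assumed layout, not a confirmed schema, so please verify it against your actual dataset.yaml first:

```python
# Hypothetical helper to replace placeholder class names in a
# dataset.yaml. The "categories" key holding id/name entries is an
# ASSUMED schema -- verify against your actual artifacts first.
import yaml

def rename_classes(yaml_path: str, id_to_name: dict) -> None:
    with open(yaml_path) as f:
        data = yaml.safe_load(f)
    for cat in data.get("categories", []):
        if cat.get("id") in id_to_name:
            cat["name"] = id_to_name[cat["id"]]
    with open(yaml_path, "w") as f:
        yaml.safe_dump(data, f)

# Example:
#   rename_classes("dataset.yaml", {1: "Person", 2: "Coat"})
```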

    BR,

    Reese

  • Hello Reese, 

    Thank you for your reply. I have fixed this:

    I want the label to be the original output class name e.g. "Person", "Coat", etc. How can I do it?

    However, I do not know how to do this:

    The PC-side tools can emulate the C7xMMA AI accelerator with bit-accuracy. This enables accuracy testing. It is not cycle-accurate, so no performance testing here. As a result, the layer-cycles are shown as 0. If you were to run the model on target with same debug_level, you will see non-zero cycles counts.

    How can I set the same debug_level on my SK-AM62A-LP? I added the debug enable mask in the .yaml file, but no output files report the layer cycle counts.

    Regards,
    Matt


  • Hi Matt,

    You are working with the edgeai-gst-apps examples, correct? I don't think there is a way in the application config.yaml files to set TIDL-specific options.

    Alternatively, you can directly edit the edgeai_dl_inferer.py file under /usr/lib/python3.12/site-packages

    • This is for SDK 10.1, and python3.X version will differ based on the SDK, just as FYI
    • Within the onnxrt class in this edgeai_dl_inferer.py script, there is a dictionary of 'runtime_options'.
      • within this, add 'debug_level': 1

        if enable_tidl:
            runtime_options = {
                "tidl_tools_path": "null",
                "artifacts_folder": artifacts,
                "core_number": core_number, 
                "debug_level": 1, #Add this line
            }
            sess_options = _onnxruntime.SessionOptions()
    

    For the CPP version of the scripts, this would need to be changed in a different location, and the gst-plugins rebuilt and reinstalled. There are some simple scripts to do this. Let me know if you need guidance for the CPP side as well.

    BR,
    Reese