SK-AM62A-LP: Tensorflow performance not using tidl acceleration / comparison between float- and int8-quantized models

Stefan Birkholz

Part Number: SK-AM62A-LP

Tool/software:

Hi,

(please excuse if this question is not on-topic for this forum!)

we are using the "Linux SDK for edge AI applications on AM62A", version 10.00.00.08. In order to gain a better understanding of Tensorflow-performance, we want to compare inference on float- and int8-quantized models *without* using the "tidl_tfl_delegate", that is we expect Tensorflow to run (multithreaded) the XNNPack delegate for the float-quantized models and "ruy"-enhanced computations for the int8-quantized models.

Using the "perf record" tool as well as runtime-inspection, we see that these assumptions are correct: the inference runs the respective operations for both model types, is multithreaded and utilizes the ARM+SIMD (Neon) instructions. However, inference on the int8-quantized models performs much slower than XNNPack for the other type.

We found a seemingly related issue #21698 in the tensorflow repository (on github) (INT TFLITE very much slower than FLOAT TFLITE · Issue #21698 · tensorflow/tensorflow), one explanation offered pointed to a less-optimized ISA (on x86), which does not cover our case.

Other than using the TIDL delegate, is there some other idea to improve the performance of inference on INT8 models?

Kind regards

Stefan Birkholz

10 months ago

+1 Reese Grimsley 10 months ago

TI__Mastermind 18526 points

Hello Stefan,

This topic for Arm-side inference in TFLite is outside our support -- I can help with general usage of tensorflow-lite in the SDK and anything TIDL related, but standard XNNPACK, including performance / profiling, is not.

I notice in that issue on github, that one of the users sees similar behavior on an Arm device (Android phone) -- quantized runs slower than unquantized: https://github.com/tensorflow/tensorflow/issues/21698#issuecomment-1216966119

Stefan Birkholz said:
Other than using the TIDL delegate, is there some other idea to improve the performance of inference on INT8 models?

No obvious ideas come to mind, unfortunately. W.r.t. TIDL, is there something missing (layer type or feature) that is taking you towards the CPU/XNNPACK topic?

I understand there is also an ARMNN backend for the Arm CPU (of course), but I do not believe this is provided within the SDK. I would not be able to support this delegate option either

BR,
Reese

0 Stefan Birkholz 9 months ago in reply to Reese Grimsley

Prodigy 80 points

Hello Reese,

thank you for your reply - we are testing different setups because in the future we might want to switch to devices that have no NPU/DSP, hence we explicitly test without the tidl-delegate. I basically wanted to make sure that we are not making any obvious (to people with more experience than us) mistakes leaving some performance-knobs unturned.

Consequently our approach was to disassemble the delegates (in order to verify that SIMD instructions are used), and do runtime-profiling of the inference (with "perf record" on the target hw). Btw., we noticed that in our Yocto build (based on the SDK version 10.01.00.01) the "tflite_2.12_aragoj7.tar.gz" and "tflite_runtime-2.12.0-cp312-cp312-linux_aarch64.whl" are downloaded from "">software-dl.ti.com/..." and are therefore apparently available as precompiled binaries - would it be possible to get the "debug" symbols files for these binaries? We tried compiling TFLite/LiteRT ourselves but got stuck trying to cross-compile the libraries.

0 Reese Grimsley 9 months ago in reply to Stefan Birkholz

TI__Mastermind 18526 points

Hi Stefan,

All makes sense, thanks for the context.

Support is limited on the topic of rebuilding the TI-TFLite [1] binaries / wheels. The most helpful resource I can share is a repo [0] for rebuilding wheels outside the internal workflow using Docker.

The instructions I've otherwise found (providing here, but cannot provide detailed guidance) follows for the pip wheels:

git checkout tidl-j7-2.12-latest
Dependencies
- sudo apt install swig libjpeg-dev zlib1g-dev python3-dev python3-numpy
- pip install numpy pybind11 wheel
Build with cmake
- https://www.tensorflow.org/lite/guide/build_cmake -- follow guidance here
- aarch64 wheel:
  - cd tensorflow/lite/tools/pip_package
  - ./build_pip_package_with_cmake.sh aarch64
  - Wheel will be generated in gen/tflite_pip/python3/dist folder.

[0] https://github.com/TexasInstruments-Sandbox/edgeai-osrt-libs-build

[1] https://github.com/TexasInstruments/tensorflow/tree/tidl-j7-2.12-latest

BR,
Reese

Processors

Processors forum

SK-AM62A-LP: Tensorflow performance not using tidl acceleration / comparison between float- and int8-quantized models

Processors

Processors forum

SK-AM62A-LP: Tensorflow performance *not* using tidl acceleration / comparison between float- and int8-quantized models

SK-AM62A-LP: Tensorflow performance not using tidl acceleration / comparison between float- and int8-quantized models