

SK-TDA4VM: Inference of multiple ML models in parallel on TDA4

Part Number: SK-TDA4VM
Other Parts Discussed in Thread: TDA4VM


I have a question about the capacity of the C7x DSP (with MMA) to run several models in parallel.

According to the data sheet, the core's capabilities are:

  • C7x floating point, vector DSP, up to 1.0 GHz, 80 GFLOPS, 256 GOPS
  • Deep-learning matrix multiply accelerator (MMA), up to 8 TOPS (8b) at 1.0 GHz
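As a sanity check on the data-sheet figure, the 8 TOPS number can be derived from the clock rate and the MAC array size. The 64×64 array size below is my understanding of the TDA4VM MMA, not something stated in this thread:

```python
# Back-of-envelope check of the quoted MMA peak.
# Assumption (not from this thread): the MMA performs a 64x64 array of
# 8-bit MACs per cycle, and each MAC counts as 2 ops (multiply + add).
clock_hz = 1.0e9             # 1.0 GHz
macs_per_cycle_8b = 64 * 64  # assumed MMA MAC-array size at 8-bit
ops_per_mac = 2              # multiply + accumulate

peak_tops_8b = clock_hz * macs_per_cycle_8b * ops_per_mac / 1e12
print(f"MMA peak (8b): {peak_tops_8b:.3f} TOPS")  # 8.192, quoted as "up to 8 TOPS"
```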

According to TI's documentation, the Yolov5s6_ti_lite_640 model requires 17.48 GFLOPs per inference.

1. Based on a rough estimate: 80 GFLOPS / 17.48 GFLOPs ≈ 4.58 ~ 4 -> up to 4 YOLOv5s instances can run in parallel.
However, this estimate does not take the MMA's capabilities into account.
Can you suggest an approach for a rough capacity estimate that considers both the C7x DSP and the MMA resources?
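To make the arithmetic behind this estimate explicit: 17.48 GFLOPs is work per inference while 80 GFLOPS is a rate, so the ratio is really inferences per second, which the estimate above treats as a budget of models running at ~1 fps each. A minimal sketch, where the efficiency factor is a hypothetical placeholder (real sustained utilization is below peak), not a TI figure:

```python
# Rough capacity estimate, made explicit.
# Only the two peaks come from TI documentation; the rest is illustrative.
peak_gflops_c7x = 80.0  # C7x floating-point peak from the data sheet (GFLOPS)
model_gflops = 17.48    # Yolov5s6_ti_lite_640, GFLOPs per inference
efficiency = 1.0        # hypothetical utilization factor (in practice < 1)

inferences_per_sec = peak_gflops_c7x * efficiency / model_gflops
parallel_models = int(inferences_per_sec)  # budget of models at ~1 fps each
print(f"{inferences_per_sec:.2f} inferences/s -> up to {parallel_models} models")
```

The same structure would apply to an MMA-based estimate, substituting the 8 TOPS (8-bit) peak and the model's integer-op count after quantization.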

2. In your answer to Q1, please address the difference between 16-bit and 8-bit model compilation.

3. Is there any way to get a log file of the utilized resources during a specific inference run on the TDA4 platform?