Other Parts Discussed in Thread: TDA4VM
This question concerns the capacity of the C7x DSP with MMA to run several models in parallel.
According to the data sheet, the cores' capabilities are:
- C7x floating point, vector DSP, up to 1.0 GHz, 80 GFLOPS, 256 GOPS
- Deep-learning matrix multiply accelerator (MMA), up to 8 TOPS (8b) at 1.0 GHz
According to TI's documentation, the Yolov5s6_ti_lite_640 model requires 17.48 GFLOPS.
1. A rough estimate: 80 GFLOPS / 17.48 GFLOPS = 4.57, rounded down to 4, so up to 4 YOLOv5s models could run in parallel.
However, this estimate doesn't take the MMA's capabilities into account.
Can you suggest an approach for a rough capacity estimate that considers both the C7x DSP and the MMA resources?
2. In your answer to Q1, please address the difference between 16-bit and 8-bit model compilation.
3. Is there any way to get a log file of the resources utilized during a specific inference run on the TDA4 platform?
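
For reference, the rough estimate in question 1 can be written out as a short Python sketch. It only divides the peak figures quoted above by the model's quoted cost; the function name `naive_ratio` is made up for illustration. Two assumptions are worth flagging: the MMA line treats one 8-bit operation as equivalent to one of the model's floating-point operations, and both lines assume 100% utilization, which real workloads are unlikely to reach. Also, if the 17.48 figure is actually an operation count per inference rather than a rate (an assumption on my side), the ratio would be closer to inferences per second than to a number of models running concurrently.

```python
import math

# Peak figures quoted from the data sheet in the post above.
C7X_PEAK_GFLOPS = 80.0       # C7x floating-point peak
MMA_PEAK_GOPS_8BIT = 8000.0  # MMA peak: 8 TOPS (8b) expressed in GOPS

# Model cost quoted from TI's documentation for Yolov5s6_ti_lite_640.
MODEL_GFLOPS = 17.48


def naive_ratio(peak: float, model_cost: float) -> float:
    """Peak throughput divided by model cost (hypothetical helper).

    This is an upper bound only: it ignores memory bandwidth,
    layer-by-layer utilization, and scheduling overhead.
    """
    return peak / model_cost


# Estimate using the C7x floating-point peak only (as in question 1).
c7x_ratio = naive_ratio(C7X_PEAK_GFLOPS, MODEL_GFLOPS)
c7x_models = math.floor(c7x_ratio)

# Estimate using the MMA 8-bit peak, ASSUMING 1 8-bit op ~ 1 model FLOP.
mma_ratio = naive_ratio(MMA_PEAK_GOPS_8BIT, MODEL_GFLOPS)

print(f"C7x only: {c7x_ratio:.2f} -> {c7x_models}")
print(f"MMA 8-bit peak (theoretical ceiling): {mma_ratio:.2f}")
```

The gap between the two ratios is the reason for asking question 1: the C7x-only number ignores the accelerator, while the MMA number is a theoretical ceiling that says nothing about achievable utilization or about 16-bit compilation.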