Hi,
I have downloaded the .caffemodel and deploy.prototxt from the caffe-jacinto-models GitHub repository and ran it on one EVE; below is the performance log I got:
Layer 1 : TSC Cycles = 2.02, SCTM Cycles = 1.39 ARP32 OVERHEAD = 45.52, MAC/CYCLE = 0.78 #MMACs = 1.57, 0.00, 1.57, Sparsity : 0.00, 100.00
Layer 2 : TSC Cycles = 23.32, SCTM Cycles = 22.24 ARP32 OVERHEAD = 4.85, MAC/CYCLE = 11.96 #MMACs = 314.57, 261.23, 278.92, Sparsity : 11.33, 16.96
Layer 3 : TSC Cycles = 22.32, SCTM Cycles = 21.34 ARP32 OVERHEAD = 4.60, MAC/CYCLE = 11.98 #MMACs = 301.99, 257.43, 267.39, Sparsity : 11.46, 14.76
Layer 4 : TSC Cycles = 44.57, SCTM Cycles = 43.13 ARP32 OVERHEAD = 3.34, MAC/CYCLE = 13.39 #MMACs = 603.98, 575.50, 596.90, Sparsity : 1.17, 4.71
Layer 5 : TSC Cycles = 23.33, SCTM Cycles = 22.54 ARP32 OVERHEAD = 3.47, MAC/CYCLE = 12.96 #MMACs = 301.99, 292.55, 302.38, Sparsity : -0.13, 3.13
Layer 6 : TSC Cycles = 42.80, SCTM Cycles = 42.35 ARP32 OVERHEAD = 1.08, MAC/CYCLE = 13.87 #MMACs = 603.98, 576.31, 593.63, Sparsity : 1.71, 4.58
Layer 7 : TSC Cycles = 22.73, SCTM Cycles = 21.59 ARP32 OVERHEAD = 5.30, MAC/CYCLE = 13.14 #MMACs = 301.99, 288.38, 298.75, Sparsity : 1.07, 4.51
Layer 8 : TSC Cycles = 2.72, SCTM Cycles = 0.22 ARP32 OVERHEAD = 1117.79, MAC/CYCLE = 0.10 #MMACs = 0.26, 0.00, 0.26, Sparsity : 0.00, 100.00
Layer 9 : TSC Cycles = 42.09, SCTM Cycles = 41.44 ARP32 OVERHEAD = 1.58, MAC/CYCLE = 13.88 #MMACs = 603.98, 567.18, 584.47, Sparsity : 3.23, 6.09
Layer 10 : TSC Cycles = 21.89, SCTM Cycles = 21.14 ARP32 OVERHEAD = 3.54, MAC/CYCLE = 13.54 #MMACs = 301.99, 287.60, 296.31, Sparsity : 1.88, 4.76
Layer 11 : TSC Cycles = 3.64, SCTM Cycles = 0.21 ARP32 OVERHEAD = 1647.75, MAC/CYCLE = 0.14 #MMACs = 0.52, 0.00, 0.52, Sparsity : 0.00, 100.00
Layer 12 : TSC Cycles = 187.58, SCTM Cycles = 185.97 ARP32 OVERHEAD = 0.87, MAC/CYCLE = 12.62 #MMACs = 2415.92, 2317.24, 2368.05, Sparsity : 1.98, 4.08
Layer 13 : TSC Cycles = 84.96, SCTM Cycles = 83.79 ARP32 OVERHEAD = 1.40, MAC/CYCLE = 13.91 #MMACs = 1207.96, 1143.95, 1181.61, Sparsity : 2.18, 5.30
Layer 14 : TSC Cycles = 23.87, SCTM Cycles = 23.28 ARP32 OVERHEAD = 2.55, MAC/CYCLE = 12.42 #MMACs = 301.99, 289.98, 296.44, Sparsity : 1.84, 3.98
Layer 15 : TSC Cycles = 1.76, SCTM Cycles = 0.59 ARP32 OVERHEAD = 195.59, MAC/CYCLE = 1.19 #MMACs = 2.10, 0.00, 2.10, Sparsity : 0.00, 100.00
Layer 16 : TSC Cycles = 22.40, SCTM Cycles = 21.92 ARP32 OVERHEAD = 2.17, MAC/CYCLE = 13.73 #MMACs = 301.99, 296.64, 307.49, Sparsity : -1.82, 1.77
Layer 17 : TSC Cycles = 0.84, SCTM Cycles = 0.31 ARP32 OVERHEAD = 170.51, MAC/CYCLE = 1.25 #MMACs = 1.05, 0.00, 1.05, Sparsity : 0.00, 100.00
Layer 18 : TSC Cycles = 22.06, SCTM Cycles = 21.72 ARP32 OVERHEAD = 1.58, MAC/CYCLE = 13.81 #MMACs = 301.99, 294.72, 304.61, Sparsity : -0.87, 2.41
Layer 19 : TSC Cycles = 21.98, SCTM Cycles = 21.62 ARP32 OVERHEAD = 1.67, MAC/CYCLE = 13.79 #MMACs = 301.99, 293.82, 303.04, Sparsity : -0.35, 2.70
Layer 20 : TSC Cycles = 22.30, SCTM Cycles = 21.93 ARP32 OVERHEAD = 1.65, MAC/CYCLE = 13.79 #MMACs = 301.99, 297.15, 307.56, Sparsity : -1.84, 1.60
Layer 21 : TSC Cycles = 21.97, SCTM Cycles = 21.61 ARP32 OVERHEAD = 1.68, MAC/CYCLE = 13.79 #MMACs = 301.99, 293.24, 302.97, Sparsity : -0.33, 2.90
Layer 22 : TSC Cycles = 2.95, SCTM Cycles = 2.71 ARP32 OVERHEAD = 8.67, MAC/CYCLE = 12.90 #MMACs = 37.75, 36.82, 38.01, Sparsity : -0.69, 2.45
Layer 23 : TSC Cycles = 0.57, SCTM Cycles = 0.30 ARP32 OVERHEAD = 92.95, MAC/CYCLE = 1.83 #MMACs = 1.05, 0.00, 1.05, Sparsity : 0.00, 100.00
Layer 24 : TSC Cycles = 1.68, SCTM Cycles = 1.19 ARP32 OVERHEAD = 41.16, MAC/CYCLE = 2.50 #MMACs = 4.19, 0.00, 4.19, Sparsity : 0.00, 100.00
Layer 25 : TSC Cycles = 6.03, SCTM Cycles = 4.76 ARP32 OVERHEAD = 26.71, MAC/CYCLE = 2.78 #MMACs = 16.78, 0.00, 16.78, Sparsity : 0.00, 100.00
Layer 26 : TSC Cycles = 1.91, SCTM Cycles = 0.82 ARP32 OVERHEAD = 133.47, MAC/CYCLE = 2.19 #MMACs = 4.19, 0.00, 4.19, Sparsity : 0.00, 100.00
TEST_REPORT_PROCESS_PROFILE_DATA : TSC cycles = 682373066, SCTM VCOP BUSY cycles = 650111108, SCTM VCOP Overhead = 0
The performance looks pretty reasonable. But when I reduce the input size of the network (say to 256x96 or 256x128), I noticed the following:
1. The ARP32 overhead becomes larger, i.e. the gap between TSC cycles and SCTM cycles grows. I read here that the running time can be estimated by dividing the SCTM cycles by the EVE frequency. My concern is: when the TSC cycles are far greater than the SCTM cycles (for example, I got 118M vs. 56M for a 256x128 input), does this affect the processing speed when running a batch of consecutive frames? Can I still use the SCTM cycles to estimate the processing time or fps?
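For reference, the ARP32 OVERHEAD column in the log appears to be the TSC/SCTM gap expressed as a percentage of the SCTM cycles; this small sketch (the formula is inferred from the per-layer numbers above, not taken from TIDL documentation) reproduces it:

```python
def arp32_overhead_pct(tsc_cycles, sctm_cycles):
    """Overhead of TSC cycles over SCTM cycles, as a percentage of SCTM cycles."""
    return (tsc_cycles - sctm_cycles) / sctm_cycles * 100.0

# Layer 1 from the log: TSC = 2.02, SCTM = 1.39 -> ~45.3
# (the log reports 45.52; the printed cycle values are rounded)
print(round(arp32_overhead_pct(2.02, 1.39), 2))

# The 256x128 case mentioned above: 118M TSC vs. 56M SCTM cycles -> ~110.7%
print(round(arp32_overhead_pct(118e6, 56e6), 2))
```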
2. Although the #MMACs decrease linearly with the input size, the decreases in TSC cycles and SCTM cycles are much smaller, which also means the MAC/CYCLE drops. I understand that the EVE divides the data into blocks for optimal parallelism. Is there a rule of thumb for choosing an input size that gives optimal performance? In line with this, does the input size have to be a multiple of 8, a power of 2, etc. to achieve better performance?
Apart from the above questions, I have also run (on EVE) a different deep learning network containing depthwise separable layers (layer 3 is depthwise separable, layer 4 is not) and saw that the actual ops are about 10 times the total ops:
Layer 3 : TSC Cycles = 1.51, SCTM Cycles = 0.80 ARP32 OVERHEAD = 88.04, MAC/CYCLE = 21.03 #MMACs = 3.32, 25.54, 31.75, Sparsity : -856.94, -669.79
Layer 4 : TSC Cycles = 2.24, SCTM Cycles = 1.62 ARP32 OVERHEAD = 38.35, MAC/CYCLE = 5.92 #MMACs = 11.80, 11.61, 13.27, Sparsity : -12.50, 1.56
Even so, this does not make the layer more time-consuming, since its MAC/CYCLE is considerably higher than that of the layers that are not depthwise separable. Can you help me understand how these numbers come about?
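From the per-layer prints, the two Sparsity values appear to be derived from the three #MMACs numbers (theoretical total first, then two actual counts); this sketch (my own reading of the log, not an official formula) reproduces the printed percentages and also explains the large negative values for the depthwise separable layer, where actual ops exceed total ops:

```python
def sparsity_pct(total_mmacs, actual_mmacs):
    """Percentage of MACs skipped relative to the theoretical total.
    Negative when the actual count exceeds the total, as for the
    depthwise separable layer above."""
    return (total_mmacs - actual_mmacs) / total_mmacs * 100.0

# Layer 2 of the first network: #MMACs = 314.57, 261.23, 278.92
print(round(sparsity_pct(314.57, 278.92), 2))  # ~11.33, the first Sparsity value
print(round(sparsity_pct(314.57, 261.23), 2))  # ~16.96, the second Sparsity value

# Layer 3 of the depthwise network: #MMACs = 3.32, 25.54, 31.75
print(round(sparsity_pct(3.32, 31.75), 2))     # large negative, ~ -856
```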
Thank you in advance.
William
(1 and 2) Yes. TIDL performance does not scale linearly with resolution; utilization is better at higher resolutions. Batch processing will improve fully connected layer performance to some extent, but convolution layer performance will not improve with batch processing. Performance is better if the processing resolution is a multiple of 32.
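A trivial helper for picking a processing resolution per the multiple-of-32 guidance above (a sketch; the 32-alignment rule is taken from this reply, not from TIDL documentation):

```python
def round_up_to(value, multiple=32):
    """Round a dimension up to the next multiple (32 per the guidance above)."""
    return ((value + multiple - 1) // multiple) * multiple

# The 256x96 and 256x128 inputs mentioned above are already 32-aligned;
# e.g. a 250x100 input would be padded to 256x128.
print(round_up_to(256), round_up_to(96))   # 256 96
print(round_up_to(250), round_up_to(100))  # 256 128
```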
We are working on improving dense convolution performance for small resolutions (1x1 and 3x3). This will be part of the 01.00 version.
(3) We have optimized the performance of depthwise separable convolution in the 00.08 release; I hope you are using that version. Regarding the total-ops print, it looks like there is an issue in the prints (as you can see, the TSC cycles for layer 3 are less than those for layer 4). We will check this trace print.
Thanks,
Kumar.D
Yes, TSC cycles should be used for the FPS computation.
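Per the answer above, the frame time follows from dividing the total TSC cycles by the EVE clock; a minimal sketch (the 650 MHz EVE frequency here is an assumption for illustration — substitute your device's actual clock):

```python
def fps_from_tsc(tsc_cycles, eve_hz=650e6):
    """Estimate frames per second from total TSC cycles per frame.
    eve_hz is an assumed EVE clock frequency; use your device's actual value."""
    return eve_hz / tsc_cycles

# Full-resolution run from the report above: TSC cycles = 682373066
print(round(fps_from_tsc(682373066), 2))  # ~0.95 fps at an assumed 650 MHz
```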
Thanks and regards,
Kumar. D
Hi Kumar,
When using dense convolution for better performance (small resolutions, 1x1, 3x3), how can I use sparsity for a speedup?
Without sparsity, the overall performance is even worse.
Thanks