Hi,
I downloaded the .caffemodel and deploy.prototxt from the caffe-jacinto-models GitHub repository and ran the network on one EVE. Below is the performance log I got:
Layer 1 : TSC Cycles = 2.02, SCTM Cycles = 1.39 ARP32 OVERHEAD = 45.52, MAC/CYCLE = 0.78 #MMACs = 1.57, 0.00, 1.57, Sparsity : 0.00, 100.00
Layer 2 : TSC Cycles = 23.32, SCTM Cycles = 22.24 ARP32 OVERHEAD = 4.85, MAC/CYCLE = 11.96 #MMACs = 314.57, 261.23, 278.92, Sparsity : 11.33, 16.96
Layer 3 : TSC Cycles = 22.32, SCTM Cycles = 21.34 ARP32 OVERHEAD = 4.60, MAC/CYCLE = 11.98 #MMACs = 301.99, 257.43, 267.39, Sparsity : 11.46, 14.76
Layer 4 : TSC Cycles = 44.57, SCTM Cycles = 43.13 ARP32 OVERHEAD = 3.34, MAC/CYCLE = 13.39 #MMACs = 603.98, 575.50, 596.90, Sparsity : 1.17, 4.71
Layer 5 : TSC Cycles = 23.33, SCTM Cycles = 22.54 ARP32 OVERHEAD = 3.47, MAC/CYCLE = 12.96 #MMACs = 301.99, 292.55, 302.38, Sparsity : -0.13, 3.13
Layer 6 : TSC Cycles = 42.80, SCTM Cycles = 42.35 ARP32 OVERHEAD = 1.08, MAC/CYCLE = 13.87 #MMACs = 603.98, 576.31, 593.63, Sparsity : 1.71, 4.58
Layer 7 : TSC Cycles = 22.73, SCTM Cycles = 21.59 ARP32 OVERHEAD = 5.30, MAC/CYCLE = 13.14 #MMACs = 301.99, 288.38, 298.75, Sparsity : 1.07, 4.51
Layer 8 : TSC Cycles = 2.72, SCTM Cycles = 0.22 ARP32 OVERHEAD = 1117.79, MAC/CYCLE = 0.10 #MMACs = 0.26, 0.00, 0.26, Sparsity : 0.00, 100.00
Layer 9 : TSC Cycles = 42.09, SCTM Cycles = 41.44 ARP32 OVERHEAD = 1.58, MAC/CYCLE = 13.88 #MMACs = 603.98, 567.18, 584.47, Sparsity : 3.23, 6.09
Layer 10 : TSC Cycles = 21.89, SCTM Cycles = 21.14 ARP32 OVERHEAD = 3.54, MAC/CYCLE = 13.54 #MMACs = 301.99, 287.60, 296.31, Sparsity : 1.88, 4.76
Layer 11 : TSC Cycles = 3.64, SCTM Cycles = 0.21 ARP32 OVERHEAD = 1647.75, MAC/CYCLE = 0.14 #MMACs = 0.52, 0.00, 0.52, Sparsity : 0.00, 100.00
Layer 12 : TSC Cycles = 187.58, SCTM Cycles = 185.97 ARP32 OVERHEAD = 0.87, MAC/CYCLE = 12.62 #MMACs = 2415.92, 2317.24, 2368.05, Sparsity : 1.98, 4.08
Layer 13 : TSC Cycles = 84.96, SCTM Cycles = 83.79 ARP32 OVERHEAD = 1.40, MAC/CYCLE = 13.91 #MMACs = 1207.96, 1143.95, 1181.61, Sparsity : 2.18, 5.30
Layer 14 : TSC Cycles = 23.87, SCTM Cycles = 23.28 ARP32 OVERHEAD = 2.55, MAC/CYCLE = 12.42 #MMACs = 301.99, 289.98, 296.44, Sparsity : 1.84, 3.98
Layer 15 : TSC Cycles = 1.76, SCTM Cycles = 0.59 ARP32 OVERHEAD = 195.59, MAC/CYCLE = 1.19 #MMACs = 2.10, 0.00, 2.10, Sparsity : 0.00, 100.00
Layer 16 : TSC Cycles = 22.40, SCTM Cycles = 21.92 ARP32 OVERHEAD = 2.17, MAC/CYCLE = 13.73 #MMACs = 301.99, 296.64, 307.49, Sparsity : -1.82, 1.77
Layer 17 : TSC Cycles = 0.84, SCTM Cycles = 0.31 ARP32 OVERHEAD = 170.51, MAC/CYCLE = 1.25 #MMACs = 1.05, 0.00, 1.05, Sparsity : 0.00, 100.00
Layer 18 : TSC Cycles = 22.06, SCTM Cycles = 21.72 ARP32 OVERHEAD = 1.58, MAC/CYCLE = 13.81 #MMACs = 301.99, 294.72, 304.61, Sparsity : -0.87, 2.41
Layer 19 : TSC Cycles = 21.98, SCTM Cycles = 21.62 ARP32 OVERHEAD = 1.67, MAC/CYCLE = 13.79 #MMACs = 301.99, 293.82, 303.04, Sparsity : -0.35, 2.70
Layer 20 : TSC Cycles = 22.30, SCTM Cycles = 21.93 ARP32 OVERHEAD = 1.65, MAC/CYCLE = 13.79 #MMACs = 301.99, 297.15, 307.56, Sparsity : -1.84, 1.60
Layer 21 : TSC Cycles = 21.97, SCTM Cycles = 21.61 ARP32 OVERHEAD = 1.68, MAC/CYCLE = 13.79 #MMACs = 301.99, 293.24, 302.97, Sparsity : -0.33, 2.90
Layer 22 : TSC Cycles = 2.95, SCTM Cycles = 2.71 ARP32 OVERHEAD = 8.67, MAC/CYCLE = 12.90 #MMACs = 37.75, 36.82, 38.01, Sparsity : -0.69, 2.45
Layer 23 : TSC Cycles = 0.57, SCTM Cycles = 0.30 ARP32 OVERHEAD = 92.95, MAC/CYCLE = 1.83 #MMACs = 1.05, 0.00, 1.05, Sparsity : 0.00, 100.00
Layer 24 : TSC Cycles = 1.68, SCTM Cycles = 1.19 ARP32 OVERHEAD = 41.16, MAC/CYCLE = 2.50 #MMACs = 4.19, 0.00, 4.19, Sparsity : 0.00, 100.00
Layer 25 : TSC Cycles = 6.03, SCTM Cycles = 4.76 ARP32 OVERHEAD = 26.71, MAC/CYCLE = 2.78 #MMACs = 16.78, 0.00, 16.78, Sparsity : 0.00, 100.00
Layer 26 : TSC Cycles = 1.91, SCTM Cycles = 0.82 ARP32 OVERHEAD = 133.47, MAC/CYCLE = 2.19 #MMACs = 4.19, 0.00, 4.19, Sparsity : 0.00, 100.00
TEST_REPORT_PROCESS_PROFILE_DATA : TSC cycles = 682373066, SCTM VCOP BUSY cycles = 650111108, SCTM VCOP Overhead = 0
The performance looks reasonable. But when I reduce the network's input size (say to 256x96 or 256x128), I notice the following:
1. The ARP32 overhead becomes larger, which means the gap between TSC cycles and SCTM cycles grows. I read from here that the running time can be estimated by dividing the SCTM cycles by the EVE frequency. My concern is that when the TSC cycle count is far greater than the SCTM cycle count (for example, I got 118M vs. 56M with a 256x128 input), does that affect the throughput when processing a batch of consecutive frames? Can I still use the SCTM cycles to estimate the processing time or fps? (See the first sketch after this list.)
2. Although the #MMACs decrease linearly with the input size, the TSC and SCTM cycle counts decrease by much less, which also means the MAC/CYCLE drops (see the second sketch after this list). I understand that the EVE divides the data into blocks for optimal parallelism. Is there a rule of thumb for choosing an input size that gives optimal performance? In line with this, does the input size have to be a multiple of 8, a power of 2, etc. to achieve better performance?
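For question 1, here is a minimal sketch of the estimate I have in mind; the 650 MHz EVE clock is my assumption (please substitute the actual device frequency), and my question is whether the SCTM-based or the TSC-based figure is the right one to use for fps:

# Rough frame-time estimate from the profiler counters.
# ASSUMPTION: EVE clock of 650 MHz; substitute the actual device frequency.
EVE_FREQ_HZ = 650e6

def frame_time_ms(cycles, freq_hz=EVE_FREQ_HZ):
    return cycles / freq_hz * 1e3

# Full-size run (totals from the log above):
print(frame_time_ms(650111108))   # SCTM-based: ~1000 ms/frame
print(frame_time_ms(682373066))   # TSC-based:  ~1050 ms/frame

# 256x128 run, where the two counters diverge much more:
print(frame_time_ms(56e6))        # SCTM-based: ~86 ms/frame
print(frame_time_ms(118e6))       # TSC-based:  ~182 ms/frame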
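For question 2, the MAC/CYCLE column appears to be the third #MMACs value divided by the TSC cycle count (both in millions); if that reading is right, a roughly fixed per-layer overhead would explain why MAC/CYCLE drops when a layer's MAC count shrinks:

# How MAC/CYCLE seems to relate to the other columns (my reading of the log):
def mac_per_cycle(actual_mmacs, tsc_mcycles):
    return actual_mmacs / tsc_mcycles

print(mac_per_cycle(2368.05, 187.58))  # Layer 12: ~12.62, matches the log
print(mac_per_cycle(1.57, 2.02))       # Layer 1:  ~0.78, overhead-dominated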
Apart from the above questions, I have also run (on EVE) a different deep-learning network containing depthwise separable convolutions (layer 3 is depthwise separable, layer 4 is not) and saw that the actualOps is about 10 times the totalOps:
Layer 3 : TSC Cycles = 1.51, SCTM Cycles = 0.80 ARP32 OVERHEAD = 88.04, MAC/CYCLE = 21.03 #MMACs = 3.32, 25.54, 31.75, Sparsity : -856.94, -669.79
Layer 4 : TSC Cycles = 2.24, SCTM Cycles = 1.62 ARP32 OVERHEAD = 38.35, MAC/CYCLE = 5.92 #MMACs = 11.80, 11.61, 13.27, Sparsity : -12.50, 1.56
Even so, this does not make the layer more time-consuming, as its MAC/CYCLE is noticeably higher than that of the layers that are not depthwise separable. Can you help me understand how these numbers come about?
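My current guess, which I would like confirmed or corrected, is illustrated by the minimal MAC-counting sketch below; the layer shape is hypothetical, and the way TIDL internally handles a depthwise (group) convolution may well differ from this:

# Minimal MAC-counting sketch to frame the depthwise question. The layer
# shape below is HYPOTHETICAL, not the real shape of layer 3, and TIDL's
# internal expansion of depthwise/group convolutions may differ.
def depthwise_macs(h, w, c, k):
    return h * w * c * k * k             # one k x k filter per channel

def standard_macs(h, w, c_in, c_out, k):
    return h * w * c_in * c_out * k * k  # every output sees every input channel

h, w, c, k = 64, 32, 32, 3               # hypothetical shape
print(depthwise_macs(h, w, c, k) / 1e6)     # theoretical MMACs: ~0.59
print(standard_macs(h, w, c, c, k) / 1e6)   # as a full convolution: ~18.87
# The ratio is the channel count, so if the engine internally executes the
# depthwise layer as a denser convolution, the "actual" MACs it reports
# could far exceed the theoretical count.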
Thank you in advance.
William