
TDA2SX: [TIDL/EVE] Questions Regarding Performance Evaluation

Part Number: TDA2SX

Hi,

I downloaded the .caffemodel and deploy.prototxt from the caffe-jacinto-models GitHub repository and ran the network on one EVE. Below is the performance log I got:


Layer    1 : TSC Cycles =   2.02, SCTM Cycles =   1.39  ARP32 OVERHEAD =    45.52, MAC/CYCLE =   0.78 #MMACs =     1.57,     0.00,     1.57, Sparsity :   0.00, 100.00

Layer    2 : TSC Cycles =  23.32, SCTM Cycles =  22.24  ARP32 OVERHEAD =     4.85, MAC/CYCLE =  11.96 #MMACs =   314.57,   261.23,   278.92, Sparsity :  11.33,  16.96

Layer    3 : TSC Cycles =  22.32, SCTM Cycles =  21.34  ARP32 OVERHEAD =     4.60, MAC/CYCLE =  11.98 #MMACs =   301.99,   257.43,   267.39, Sparsity :  11.46,  14.76

Layer    4 : TSC Cycles =  44.57, SCTM Cycles =  43.13  ARP32 OVERHEAD =     3.34, MAC/CYCLE =  13.39 #MMACs =   603.98,   575.50,   596.90, Sparsity :   1.17,   4.71

Layer    5 : TSC Cycles =  23.33, SCTM Cycles =  22.54  ARP32 OVERHEAD =     3.47, MAC/CYCLE =  12.96 #MMACs =   301.99,   292.55,   302.38, Sparsity :  -0.13,   3.13

Layer    6 : TSC Cycles =  42.80, SCTM Cycles =  42.35  ARP32 OVERHEAD =     1.08, MAC/CYCLE =  13.87 #MMACs =   603.98,   576.31,   593.63, Sparsity :   1.71,   4.58

Layer    7 : TSC Cycles =  22.73, SCTM Cycles =  21.59  ARP32 OVERHEAD =     5.30, MAC/CYCLE =  13.14 #MMACs =   301.99,   288.38,   298.75, Sparsity :   1.07,   4.51

Layer    8 : TSC Cycles =   2.72, SCTM Cycles =   0.22  ARP32 OVERHEAD =  1117.79, MAC/CYCLE =   0.10 #MMACs =     0.26,     0.00,     0.26, Sparsity :   0.00, 100.00

Layer    9 : TSC Cycles =  42.09, SCTM Cycles =  41.44  ARP32 OVERHEAD =     1.58, MAC/CYCLE =  13.88 #MMACs =   603.98,   567.18,   584.47, Sparsity :   3.23,   6.09

Layer   10 : TSC Cycles =  21.89, SCTM Cycles =  21.14  ARP32 OVERHEAD =     3.54, MAC/CYCLE =  13.54 #MMACs =   301.99,   287.60,   296.31, Sparsity :   1.88,   4.76

Layer   11 : TSC Cycles =   3.64, SCTM Cycles =   0.21  ARP32 OVERHEAD =  1647.75, MAC/CYCLE =   0.14 #MMACs =     0.52,     0.00,     0.52, Sparsity :   0.00, 100.00

Layer   12 : TSC Cycles = 187.58, SCTM Cycles = 185.97  ARP32 OVERHEAD =     0.87, MAC/CYCLE =  12.62 #MMACs =  2415.92,  2317.24,  2368.05, Sparsity :   1.98,   4.08

Layer   13 : TSC Cycles =  84.96, SCTM Cycles =  83.79  ARP32 OVERHEAD =     1.40, MAC/CYCLE =  13.91 #MMACs =  1207.96,  1143.95,  1181.61, Sparsity :   2.18,   5.30

Layer   14 : TSC Cycles =  23.87, SCTM Cycles =  23.28  ARP32 OVERHEAD =     2.55, MAC/CYCLE =  12.42 #MMACs =   301.99,   289.98,   296.44, Sparsity :   1.84,   3.98

Layer   15 : TSC Cycles =   1.76, SCTM Cycles =   0.59  ARP32 OVERHEAD =   195.59, MAC/CYCLE =   1.19 #MMACs =     2.10,     0.00,     2.10, Sparsity :   0.00, 100.00

Layer   16 : TSC Cycles =  22.40, SCTM Cycles =  21.92  ARP32 OVERHEAD =     2.17, MAC/CYCLE =  13.73 #MMACs =   301.99,   296.64,   307.49, Sparsity :  -1.82,   1.77

Layer   17 : TSC Cycles =   0.84, SCTM Cycles =   0.31  ARP32 OVERHEAD =   170.51, MAC/CYCLE =   1.25 #MMACs =     1.05,     0.00,     1.05, Sparsity :   0.00, 100.00

Layer   18 : TSC Cycles =  22.06, SCTM Cycles =  21.72  ARP32 OVERHEAD =     1.58, MAC/CYCLE =  13.81 #MMACs =   301.99,   294.72,   304.61, Sparsity :  -0.87,   2.41

Layer   19 : TSC Cycles =  21.98, SCTM Cycles =  21.62  ARP32 OVERHEAD =     1.67, MAC/CYCLE =  13.79 #MMACs =   301.99,   293.82,   303.04, Sparsity :  -0.35,   2.70

Layer   20 : TSC Cycles =  22.30, SCTM Cycles =  21.93  ARP32 OVERHEAD =     1.65, MAC/CYCLE =  13.79 #MMACs =   301.99,   297.15,   307.56, Sparsity :  -1.84,   1.60

Layer   21 : TSC Cycles =  21.97, SCTM Cycles =  21.61  ARP32 OVERHEAD =     1.68, MAC/CYCLE =  13.79 #MMACs =   301.99,   293.24,   302.97, Sparsity :  -0.33,   2.90

Layer   22 : TSC Cycles =   2.95, SCTM Cycles =   2.71  ARP32 OVERHEAD =     8.67, MAC/CYCLE =  12.90 #MMACs =    37.75,    36.82,    38.01, Sparsity :  -0.69,   2.45

Layer   23 : TSC Cycles =   0.57, SCTM Cycles =   0.30  ARP32 OVERHEAD =    92.95, MAC/CYCLE =   1.83 #MMACs =     1.05,     0.00,     1.05, Sparsity :   0.00, 100.00

Layer   24 : TSC Cycles =   1.68, SCTM Cycles =   1.19  ARP32 OVERHEAD =    41.16, MAC/CYCLE =   2.50 #MMACs =     4.19,     0.00,     4.19, Sparsity :   0.00, 100.00

Layer   25 : TSC Cycles =   6.03, SCTM Cycles =   4.76  ARP32 OVERHEAD =    26.71, MAC/CYCLE =   2.78 #MMACs =    16.78,     0.00,    16.78, Sparsity :   0.00, 100.00

Layer   26 : TSC Cycles =   1.91, SCTM Cycles =   0.82  ARP32 OVERHEAD =   133.47, MAC/CYCLE =   2.19 #MMACs =     4.19,     0.00,     4.19, Sparsity :   0.00, 100.00

TEST_REPORT_PROCESS_PROFILE_DATA : TSC cycles = 682373066, SCTM VCOP BUSY cycles = 650111108, SCTM VCOP Overhead = 0


The performance looks reasonable. However, when I reduce the input size of the network (say, to 256x96 or 256x128), I notice the following:

1. The ARP32 overhead becomes larger, which means the difference between TSC cycles and SCTM cycles grows. I read from here that we can estimate the running time by dividing the SCTM cycles by the EVE frequency. My concern is that when the TSC cycles are far greater than the SCTM cycles (for example, I got 118M vs. 56M with a 256x128 input), does this affect the processing speed when running a batch of consecutive frames? Can I still take the SCTM cycles to estimate the processing time or fps? (See the sketch after this list for how I am doing the estimate.)

2. Although the #MMACs decrease linearly with the input size, the TSC cycles and SCTM cycles decrease much less, which also means the MAC/CYCLE drops. I understand that EVE divides the data into blocks for optimal parallelism. Is there a rule of thumb for choosing an input size that gives optimal performance? In line with this, does the input size have to be a multiple of 8, a power of 2, etc., to achieve better performance?
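To make question 1 concrete, here is a minimal sketch of how I am converting the profile counters above into a frame time and fps. The 650 MHz EVE clock is only my assumption for illustration; please substitute the clock your device is actually configured for:

#include <stdio.h>

/* Rough frame-time / fps estimate from the TIDL profile counters.
 * NOTE: the 650 MHz EVE clock is an assumption for illustration only;
 * use the clock actually configured on your device. */
int main(void)
{
    const double eve_clock_hz = 650e6;       /* assumed EVE frequency       */
    const double tsc_cycles   = 682373066.0; /* from TEST_REPORT_... line   */
    const double sctm_cycles  = 650111108.0; /* SCTM VCOP BUSY cycles       */

    double t_tsc  = tsc_cycles  / eve_clock_hz; /* elapsed time per frame   */
    double t_sctm = sctm_cycles / eve_clock_hz; /* VCOP-busy time per frame */

    printf("TSC  : %.3f s/frame -> %.2f fps\n", t_tsc,  1.0 / t_tsc);
    printf("SCTM : %.3f s/frame -> %.2f fps\n", t_sctm, 1.0 / t_sctm);
    return 0;
}

When the TSC cycles are far larger than the SCTM cycles (as in my 118M vs. 56M case), the two estimates diverge, which is why I am asking which one reflects the real frame rate.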

Apart from the above questions, I have also run (on EVE) a different deep learning network consisting of depthwise separable layers (layer 3 is depthwise separable, layer 4 is not) and saw that the actualOps is about 10 times the totalOps:


Layer    3 : TSC Cycles =   1.51, SCTM Cycles =   0.80  ARP32 OVERHEAD =    88.04, MAC/CYCLE =  21.03 #MMACs =     3.32,    25.54,    31.75, Sparsity : -856.94, -669.79

Layer    4 : TSC Cycles =   2.24, SCTM Cycles =   1.62  ARP32 OVERHEAD =    38.35, MAC/CYCLE =   5.92 #MMACs =    11.80,    11.61,    13.27, Sparsity : -12.50,   1.56


Even so, this does not make the layer more time-consuming, as its MAC/CYCLE is noticeably higher than that of the layers that are not depthwise separable. Can you help me understand how these numbers come about?
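For reference, here is how I read the per-layer prints. This is my own inference from the numbers, not something taken from the documentation; the small sketch below reproduces the layer 3 and layer 4 figures above under that assumption (small differences are due to rounding in the printed values):

#include <stdio.h>

/* Reproduces the per-layer trace figures above, under my assumption (not
 * confirmed by documentation) that the three #MMACs values are
 * total / second / actual MMACs, and that:
 *   MAC/CYCLE      = actual_mmacs / tsc_mcycles
 *   Sparsity[0]    = 100 * (total_mmacs - actual_mmacs) / total_mmacs
 *   Sparsity[1]    = 100 * (total_mmacs - second_mmacs) / total_mmacs
 *   ARP32 OVERHEAD = 100 * (tsc_mcycles - sctm_mcycles) / sctm_mcycles */
static void check_layer(int id, double tsc_mcycles, double sctm_mcycles,
                        double total_mmacs, double second_mmacs,
                        double actual_mmacs)
{
    printf("Layer %d: MAC/CYCLE=%.2f Sparsity=%.2f,%.2f ARP32=%.2f\n",
           id,
           actual_mmacs / tsc_mcycles,
           100.0 * (total_mmacs - actual_mmacs) / total_mmacs,
           100.0 * (total_mmacs - second_mmacs) / total_mmacs,
           100.0 * (tsc_mcycles - sctm_mcycles) / sctm_mcycles);
}

int main(void)
{
    /* Values copied from the depthwise-separable network trace above. */
    check_layer(3, 1.51, 0.80,  3.32, 25.54, 31.75);
    check_layer(4, 2.24, 1.62, 11.80, 11.61, 13.27);
    return 0;
}

If that reading is correct, the large negative sparsity for layer 3 simply reflects the actualOps being roughly 10 times the totalOps, which is the inconsistency I am asking about.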

Thank you in advance.

William

  • (1 and 2) Yes. TIDL performance does not scale linearly with resolution; utilization is better at higher resolutions. Batch processing will improve fully connected layer performance to some extent, but it will not improve convolution layer performance. Performance will be better if the processing resolution is a multiple of 32.

    We are working on improving dense convolution performance for small resolutions (1x1 and 3x3). This will be part of the 01.00 version.

      

    (3) We have optimized the performance of depthwise separable convolution in the 00.08 release. I hope you are using that version. Regarding the total ops print, it looks like there is an issue in the prints (as you can see, the TSC cycles for layer 3 are lower than for layer 4). We will check this trace print.

    Thanks,

    Kumar.D

  • When you said yes to (1), do you mean we should use the TSC cycles to estimate the real fps?
  • Yes, TSC cycles shall be used for FPS computation.

    Thanks and regards,

    Kumar. D

  • Hi Kumar,

        When using dense convolution for better performance (small resolutions, 1x1, 3x3), how can I use sparsity for a speedup?

        Without sparsity, the overall performance is even worse.

    Thanks

        

  • Hi,
    The overhead of sparse convolution is high at small resolutions.

    So use dense convolution for small resolutions (sparsity is not possible there).
    Use sparse convolution when the resolution is larger than 64x64 or 32x32.

    Thanks and regards,
    Kumar.D
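Summarizing the rule of thumb from this thread as a sketch: only the "multiple of 32" and the "64x64 / 32x32" figures come from the replies above; the helper names and structure below are illustrative, not TIDL API.

#include <stdio.h>

/* Illustrative helper for the rule of thumb in this thread:
 *  - pad the processing resolution up to a multiple of 32, and
 *  - use sparse convolution only when the resolution is larger than
 *    roughly 64x64 (or 32x32 per the reply above); otherwise use dense.
 * Function names and structure are hypothetical, not TIDL API. */

static int round_up_to_32(int x)
{
    return (x + 31) & ~31;  /* next multiple of 32 */
}

static const char *pick_conv_mode(int width, int height)
{
    return (width > 64 && height > 64) ? "sparse" : "dense";
}

int main(void)
{
    int w = 256, h = 96;  /* one of the reduced input sizes from the question */
    printf("padded resolution: %dx%d, suggested mode: %s\n",
           round_up_to_32(w), round_up_to_32(h), pick_conv_mode(w, h));
    return 0;
}

For a 256x96 input, for example, both dimensions are already multiples of 32 and the resolution is above the stated threshold, so sparse convolution would apply under this rule.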