Hi,
I have downloaded the .caffemodel and deploy.prototxt from the caffe-jacinto-models GitHub repository and ran it on one EVE; below is the performance log I got:
Layer 1 : TSC Cycles = 2.02, SCTM Cycles = 1.39 ARP32 OVERHEAD = 45.52, MAC/CYCLE = 0.78 #MMACs = 1.57, 0.00, 1.57, Sparsity : 0.00, 100.00
Layer 2 : TSC Cycles = 23.32, SCTM Cycles = 22.24 ARP32 OVERHEAD = 4.85, MAC/CYCLE = 11.96 #MMACs = 314.57, 261.23, 278.92, Sparsity : 11.33, 16.96
Layer 3 : TSC Cycles = 22.32, SCTM Cycles = 21.34 ARP32 OVERHEAD = 4.60, MAC/CYCLE = 11.98 #MMACs = 301.99, 257.43, 267.39, Sparsity : 11.46, 14.76
Layer 4 : TSC Cycles = 44.57, SCTM Cycles = 43.13 ARP32 OVERHEAD = 3.34, MAC/CYCLE = 13.39 #MMACs = 603.98, 575.50, 596.90, Sparsity : 1.17, 4.71
Layer 5 : TSC Cycles = 23.33, SCTM Cycles = 22.54 ARP32 OVERHEAD = 3.47, MAC/CYCLE = 12.96 #MMACs = 301.99, 292.55, 302.38, Sparsity : -0.13, 3.13
Layer 6 : TSC Cycles = 42.80, SCTM Cycles = 42.35 ARP32 OVERHEAD = 1.08, MAC/CYCLE = 13.87 #MMACs = 603.98, 576.31, 593.63, Sparsity : 1.71, 4.58
Layer 7 : TSC Cycles = 22.73, SCTM Cycles = 21.59 ARP32 OVERHEAD = 5.30, MAC/CYCLE = 13.14 #MMACs = 301.99, 288.38, 298.75, Sparsity : 1.07, 4.51
Layer 8 : TSC Cycles = 2.72, SCTM Cycles = 0.22 ARP32 OVERHEAD = 1117.79, MAC/CYCLE = 0.10 #MMACs = 0.26, 0.00, 0.26, Sparsity : 0.00, 100.00
Layer 9 : TSC Cycles = 42.09, SCTM Cycles = 41.44 ARP32 OVERHEAD = 1.58, MAC/CYCLE = 13.88 #MMACs = 603.98, 567.18, 584.47, Sparsity : 3.23, 6.09
Layer 10 : TSC Cycles = 21.89, SCTM Cycles = 21.14 ARP32 OVERHEAD = 3.54, MAC/CYCLE = 13.54 #MMACs = 301.99, 287.60, 296.31, Sparsity : 1.88, 4.76
Layer 11 : TSC Cycles = 3.64, SCTM Cycles = 0.21 ARP32 OVERHEAD = 1647.75, MAC/CYCLE = 0.14 #MMACs = 0.52, 0.00, 0.52, Sparsity : 0.00, 100.00
Layer 12 : TSC Cycles = 187.58, SCTM Cycles = 185.97 ARP32 OVERHEAD = 0.87, MAC/CYCLE = 12.62 #MMACs = 2415.92, 2317.24, 2368.05, Sparsity : 1.98, 4.08
Layer 13 : TSC Cycles = 84.96, SCTM Cycles = 83.79 ARP32 OVERHEAD = 1.40, MAC/CYCLE = 13.91 #MMACs = 1207.96, 1143.95, 1181.61, Sparsity : 2.18, 5.30
Layer 14 : TSC Cycles = 23.87, SCTM Cycles = 23.28 ARP32 OVERHEAD = 2.55, MAC/CYCLE = 12.42 #MMACs = 301.99, 289.98, 296.44, Sparsity : 1.84, 3.98
Layer 15 : TSC Cycles = 1.76, SCTM Cycles = 0.59 ARP32 OVERHEAD = 195.59, MAC/CYCLE = 1.19 #MMACs = 2.10, 0.00, 2.10, Sparsity : 0.00, 100.00
Layer 16 : TSC Cycles = 22.40, SCTM Cycles = 21.92 ARP32 OVERHEAD = 2.17, MAC/CYCLE = 13.73 #MMACs = 301.99, 296.64, 307.49, Sparsity : -1.82, 1.77
Layer 17 : TSC Cycles = 0.84, SCTM Cycles = 0.31 ARP32 OVERHEAD = 170.51, MAC/CYCLE = 1.25 #MMACs = 1.05, 0.00, 1.05, Sparsity : 0.00, 100.00
Layer 18 : TSC Cycles = 22.06, SCTM Cycles = 21.72 ARP32 OVERHEAD = 1.58, MAC/CYCLE = 13.81 #MMACs = 301.99, 294.72, 304.61, Sparsity : -0.87, 2.41
Layer 19 : TSC Cycles = 21.98, SCTM Cycles = 21.62 ARP32 OVERHEAD = 1.67, MAC/CYCLE = 13.79 #MMACs = 301.99, 293.82, 303.04, Sparsity : -0.35, 2.70
Layer 20 : TSC Cycles = 22.30, SCTM Cycles = 21.93 ARP32 OVERHEAD = 1.65, MAC/CYCLE = 13.79 #MMACs = 301.99, 297.15, 307.56, Sparsity : -1.84, 1.60
Layer 21 : TSC Cycles = 21.97, SCTM Cycles = 21.61 ARP32 OVERHEAD = 1.68, MAC/CYCLE = 13.79 #MMACs = 301.99, 293.24, 302.97, Sparsity : -0.33, 2.90
Layer 22 : TSC Cycles = 2.95, SCTM Cycles = 2.71 ARP32 OVERHEAD = 8.67, MAC/CYCLE = 12.90 #MMACs = 37.75, 36.82, 38.01, Sparsity : -0.69, 2.45
Layer 23 : TSC Cycles = 0.57, SCTM Cycles = 0.30 ARP32 OVERHEAD = 92.95, MAC/CYCLE = 1.83 #MMACs = 1.05, 0.00, 1.05, Sparsity : 0.00, 100.00
Layer 24 : TSC Cycles = 1.68, SCTM Cycles = 1.19 ARP32 OVERHEAD = 41.16, MAC/CYCLE = 2.50 #MMACs = 4.19, 0.00, 4.19, Sparsity : 0.00, 100.00
Layer 25 : TSC Cycles = 6.03, SCTM Cycles = 4.76 ARP32 OVERHEAD = 26.71, MAC/CYCLE = 2.78 #MMACs = 16.78, 0.00, 16.78, Sparsity : 0.00, 100.00
Layer 26 : TSC Cycles = 1.91, SCTM Cycles = 0.82 ARP32 OVERHEAD = 133.47, MAC/CYCLE = 2.19 #MMACs = 4.19, 0.00, 4.19, Sparsity : 0.00, 100.00
TEST_REPORT_PROCESS_PROFILE_DATA : TSC cycles = 682373066, SCTM VCOP BUSY cycles = 650111108, SCTM VCOP Overhead = 0
The performance looks pretty reasonable. But when I reduce the input size of the network (say to 256x96 or 256x128), I noticed the following:
1. The ARP32 overhead becomes larger, i.e. the gap between TSC cycles and SCTM cycles grows. I read here that the running time can be estimated by dividing the SCTM cycles by the EVE frequency. My concern is: when the TSC cycles are far greater than the SCTM cycles (for example, I got 118M vs. 56M for a 256x128 input), does this affect the processing speed when running a batch of consecutive frames? Can I still use the SCTM cycles to estimate the processing time or fps?
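For reference, the ARP32 OVERHEAD column in the log appears to be the TSC/SCTM gap expressed as a percentage of the SCTM cycles; this small sketch (the formula is inferred from the per-layer numbers above, not taken from TIDL documentation) reproduces it:

```python
def arp32_overhead_pct(tsc_cycles, sctm_cycles):
    """Overhead of TSC cycles over SCTM cycles, as a percentage of SCTM cycles."""
    return (tsc_cycles - sctm_cycles) / sctm_cycles * 100.0

# Layer 1 from the log: TSC = 2.02, SCTM = 1.39 -> ~45.3
# (the log reports 45.52; the printed cycle values are rounded)
print(round(arp32_overhead_pct(2.02, 1.39), 2))

# The 256x128 case mentioned above: 118M TSC vs. 56M SCTM cycles -> ~110.7%
print(round(arp32_overhead_pct(118e6, 56e6), 2))
```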
2. Although the #MMACs decrease linearly with the input size, the decreases in TSC cycles and SCTM cycles are much smaller, which also means the MAC/CYCLE drops. I understand that the EVE divides the data into blocks for optimal parallelism. Is there a rule of thumb for choosing an input size that gives optimal performance? In line with this, does the input size have to be a multiple of 8, a power of 2, etc. to achieve better performance?
Apart from the above questions, I have also run (on EVE) a different deep learning network containing depthwise separable layers (layer 3 is depthwise separable, layer 4 is not) and saw that the actual ops are about 10 times the total ops:
Layer 3 : TSC Cycles = 1.51, SCTM Cycles = 0.80 ARP32 OVERHEAD = 88.04, MAC/CYCLE = 21.03 #MMACs = 3.32, 25.54, 31.75, Sparsity : -856.94, -669.79
Layer 4 : TSC Cycles = 2.24, SCTM Cycles = 1.62 ARP32 OVERHEAD = 38.35, MAC/CYCLE = 5.92 #MMACs = 11.80, 11.61, 13.27, Sparsity : -12.50, 1.56
Even so, this does not make the layer more time-consuming, since its MAC/CYCLE is considerably higher than that of the layers that are not depthwise separable. Can you help me understand how these numbers come about?
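From the per-layer prints, the two Sparsity values appear to be derived from the three #MMACs numbers (theoretical total first, then two actual counts); this sketch (my own reading of the log, not an official formula) reproduces the printed percentages and also explains the large negative values for the depthwise separable layer, where actual ops exceed total ops:

```python
def sparsity_pct(total_mmacs, actual_mmacs):
    """Percentage of MACs skipped relative to the theoretical total.
    Negative when the actual count exceeds the total, as for the
    depthwise separable layer above."""
    return (total_mmacs - actual_mmacs) / total_mmacs * 100.0

# Layer 2 of the first network: #MMACs = 314.57, 261.23, 278.92
print(round(sparsity_pct(314.57, 278.92), 2))  # ~11.33, the first Sparsity value
print(round(sparsity_pct(314.57, 261.23), 2))  # ~16.96, the second Sparsity value

# Layer 3 of the depthwise network: #MMACs = 3.32, 25.54, 31.75
print(round(sparsity_pct(3.32, 31.75), 2))     # large negative, ~ -856
```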
Thank you in advance.
William
(1 and 2) Yes. TIDL performance does not scale linearly with resolution; utilization is better at higher resolutions. Batch processing will improve fully connected layer performance to some extent, but convolution layer performance will not improve with batch processing. Performance is better if the processing resolution is a multiple of 32.
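A trivial helper for picking a processing resolution per the multiple-of-32 guidance above (a sketch; the 32-alignment rule is taken from this reply, not from TIDL documentation):

```python
def round_up_to(value, multiple=32):
    """Round a dimension up to the next multiple (32 per the guidance above)."""
    return ((value + multiple - 1) // multiple) * multiple

# The 256x96 and 256x128 inputs mentioned above are already 32-aligned;
# e.g. a 250x100 input would be padded to 256x128.
print(round_up_to(256), round_up_to(96))   # 256 96
print(round_up_to(250), round_up_to(100))  # 256 128
```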
We are working on improving dense convolution performance for small resolutions (1x1 and 3x3). This will be part of the 01.00 version.
(3) We have optimized the performance of depthwise separable convolution in the 00.08 release; I hope you are using that version. Regarding the total-ops print, it looks like there is an issue in the prints (as you can see, the TSC cycles for layer 3 are less than those for layer 4). We will check this trace print.
Thanks,
Kumar.D
Yes, TSC cycles should be used for the FPS computation.
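Per the answer above, the frame time follows from dividing the total TSC cycles by the EVE clock; a minimal sketch (the 650 MHz EVE frequency here is an assumption for illustration — substitute your device's actual clock):

```python
def fps_from_tsc(tsc_cycles, eve_hz=650e6):
    """Estimate frames per second from total TSC cycles per frame.
    eve_hz is an assumed EVE clock frequency; use your device's actual value."""
    return eve_hz / tsc_cycles

# Full-resolution run from the report above: TSC cycles = 682373066
print(round(fps_from_tsc(682373066), 2))  # ~0.95 fps at an assumed 650 MHz
```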
Thanks and regards,
Kumar. D
Hi Kumar,
When using dense convolution for better performance (small resolutions, 1x1, 3x3), how can I use sparsity for a speedup?
Without sparsity, the overall performance is even worse.
Thanks