TDA4VMXEVM: Performance of TIOVX kernels

Part Number: TDA4VMXEVM

Hello all,

SUMMARY:

I've been benchmarking the J721EX EVM running the automotive Linux SDK and OpenVX on the C66x DSPs (TIOVX library). In short, I'm observing at least one order of magnitude lower performance than I'd expect from the datasheets and manuals.

TI MEASURED PERFORMANCE ANALYSIS:

Please consider the table at http://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/tiovx/docs/user_guide/TIOVX_PERFORMANCE_J7ES_LINUX.html

The entry #80 for the Multiply OpenVX kernel states: 

Index | Kernel   | Variant         | Frame Size (Pixels) | Graph Performance (msec) | Node Performance (msec)
80    | Multiply | S16 x S16 = S16 | 640x480 (307200)    | 4.057000                 | 3.978000

This means that 307200 int16_t elements are multiplied element-wise by another 307200 int16_t elements, producing 307200 int16_t results.
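
As a reference for what is being measured, this is my understanding of the operation, written as a minimal scalar C loop (my own sketch, not the VXLIB source; I'm assuming the WRAP overflow policy and a scale of 1):

#include <stdint.h>

/* Scalar reference for the Multiply S16 x S16 = S16 case (sketch only):
 * out[i] = (int16_t)(in0[i] * in1[i]), i.e. the product truncated to 16 bits
 * (WRAP overflow policy, scale factor assumed to be 1). */
static void multiply_s16_ref(const int16_t *in0, const int16_t *in1,
                             int16_t *out, uint32_t num_pixels)
{
    uint32_t i;
    for (i = 0; i < num_pixels; i++) {
        int32_t prod = (int32_t)in0[i] * (int32_t)in1[i];
        out[i] = (int16_t)prod;   /* keep the low 16 bits */
    }
}

For 640x480, num_pixels = 307200.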

Then, I disassembled the released C66 firmware loaded by Linux:

~/psdk_rtos_auto_j7_06_01_01_12/c6000_7.4.24/bin/dis6x --all ~/psdk_rtos_auto_j7_06_01_01_12/vision_apps/out/J7/C66/SYSBIOS/release/vx_app_tirtos_linux_c6x_1.out > /tmp/c66.lst

And counted the number of cycles spent in the inner loop of VXLIB_multiply_i16s_i16s_o16s_core:

ad1e86d8             $C$L32:
ad1e86d8       0d66           SPLOOP        3
ad1e86da       5947 ||        MV.L2X        A18,B18
ad1e86dc   ec081000           .fphead       n, h, W, BU, nobr, nosat, 1100000b
ad1e86e0             $C$L33:
ad1e86e0       2ce7           SPMASK        L1,L2
ad1e86e2       1581 ||^       ADD.L2X       A19,8,B16
ad1e86e4   044c5765 ||        LDDW.D1T1     *A19++[2],A9:A8
ad1e86e8   024857e7 ||        LDDW.D2T2     *B18++[2],B5:B4
ad1e86ec   09490058 ||^       ADD.L1        8,A18,A18
ad1e86f0   02485764           LDDW.D1T1     *A18++[2],A5:A4
ad1e86f4   044057e6           LDDW.D2T2     *B16++[2],B9:B8
ad1e86f8   00004000           NOP           3
ad1e86fc   e0280003           .fphead       n, h, W, BU, nobr, nosat, 0000001b
ad1e8700   12209032           DMPY2.M2X       B5:B4,A9:A8,B7:B6:B5:B4
ad1e8704   12209030           DMPY2.M1X       A5:A4,B9:B8,A7:A6:A5:A4
ad1e8708   00002000           NOP           2
ad1e870c   0310a01b           PACK2.L2      B5,B4,B6
ad1e8710   0398eff2 ||        PACK2.S2      B7,B6,B7
ad1e8714   1440d033           DMPY2.M2X       B7:B6,A17:A16,B11:B10:B9:B8
ad1e8718   0310a019 ||        PACK2.L1      A5,A4,A6
ad1e871c   0398eff0 ||        PACK2.S1      A7,A6,A7
ad1e8720   1440c030           DMPY2.M1        A7:A6,A17:A16,A11:A10:A9:A8
ad1e8724       ac66           SPMASK        D2
ad1e8726       39d7 ||^       MV.D2X        A3,B17
ad1e8728   00430001           SPMASK        D1
ad1e872c   018d0940 ||^       ADD.D1        A3,0x8,A3
ad1e8730   0a21201b           PACK2.L2      B9,B8,B20
ad1e8734   0aa96ff2 ||        PACK2.S2      B11,B10,B21
ad1e8738   0a4457c7           STDW.D2T2     B21:B20,*B17++[2]
ad1e873c   e0400004           .fphead       n, l, W, BU, nobr, nosat, 0000010b
ad1e8740   0a212019 ||        PACK2.L1      A9,A8,A20
ad1e8744   0aa96ff0 ||        PACK2.S1      A11,A10,A21
ad1e8748   0c034001           SPKERNEL      3,0
ad1e874c   0a0c5744 ||        STDW.D1T1     A21:A20,*A3++[2]
ad1e8750             $C$L34:

Between SPLOOP and SPKERNEL there are 20 cycles, and each DMPY2 processes 4 elements, giving 8 multiplications per loop iteration (the 3rd and 4th DMPY2 apply a constant scaling factor). Therefore, as a first-order approximation, each element takes 20/8 = 2.5 cycles to process. Compared to http://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/vxlib_c66x_1_1_4_0/docs/VXLIB_c66x_TestReport.html, my estimate is 5 times larger than the best measured result of 0.5 cycles/element. I could not find a better inner loop in the disassembled firmware.

In total, the 307200 multiplications should take 2.5 * 307200 = 768000 cycles.

We can then estimate the effective DSP clock frequency as (768000 cycles) / (3.978 ms) ≈ 193 MHz.

The TDA4VM datasheet states that the C66x DSP can be clocked at up to 1.35 GHz, so it looks as if the evaluation kit's DSP is underclocked by roughly a factor of 7.

MY FP IMPLEMENTATION:

I've also implemented and benchmarked floating-point kernels myself (using the SIMD FP multiply QMPYSP) and observed a consistent apparent underclocking factor of 10. In other words, the measured performance is consistently 10x lower than I'd expect, even when multiplying 5120 x 3840 image planes (about 79 MB per plane).
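
For context, my FP kernel is structurally similar to the sketch below (simplified, not my exact code; _nassert() and MUST_ITERATE are TI C6000 compiler hints I use so the loop can be software-pipelined with SIMD multiplies such as QMPYSP):

#include <stdint.h>

/* Simplified single-precision element-wise multiply (illustrative sketch). */
void multiply_f32(const float *restrict in0, const float *restrict in1,
                  float *restrict out, uint32_t num_pixels)
{
    uint32_t i;
    _nassert((int)in0 % 8 == 0);    /* promise 8-byte aligned pointers        */
    _nassert((int)in1 % 8 == 0);
    _nassert((int)out % 8 == 0);
    #pragma MUST_ITERATE(8, , 8)    /* trip count is a non-zero multiple of 8 */
    for (i = 0; i < num_pixels; i++) {
        out[i] = in0[i] * in1[i];   /* compiler may vectorize this with QMPYSP */
    }
}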

Some hypotheses I can imagine are:

1) DSP is underclocked

2) L2 HW prefetcher in C66 not initialized

3) DDR bus not running at its full speed of 1866 MHz

Could someone please help here?

Thanks,

Fernando A. Endo

  • Hello again,

    Just a mistake in my computations:

    The 20 cycles refer to the dynamic length of the loop body; the iteration interval (the length of one stage) is 3 cycles, as indicated by SPLOOP 3.

    Because there is no data dependency between the first two DMPY2 instructions, they can be scheduled in the same stage, and the same holds for the last two DMPY2. So the cost is actually (3 cycles) / ((2 DMPY2) * (4 elements/DMPY2)) = 3/8 = 0.375 cycles/element.

    Following the same logic as in the previous message, the implied DSP frequency would be only about 29 MHz, almost 50x slower than the peak frequency!

    Regards,

    Fernando

  • Fernando,

    The loop you mentioned with 3/8 (0.375) cycles/element is the fastest inner loop, at line 163 of the source (labeled case 1B in the source code comments). It applies when the pointers are aligned and overflow_policy == VXLIB_CONVERT_POLICY_WRAP.

    This closely matches the curve-fit equation of the Mode 1 performance results in the test report you mentioned:

    http://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/vxlib_c66x_1_1_4_0/docs/VXLIB_c66x_TestReport.html


    Mode 1: scale is integer; width == stride; WRAP
          Test vectors run: 3
          Formula:
            Cycles:  0.36895*N + 142
            Where:   N = width * height
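
    For reference, plugging the 640x480 case into that formula gives the no-stall baseline to compare against the measured board time:

      Ideal cycles (all data in L1) = 0.36895 * 307200 + 142 ≈ 113483 cycles
      Ideal time at 1.35 GHz        = 113483 / 1.35 GHz ≈ 0.084 ms
      Measured node time on board   = 3.978 ms (roughly 47x longer)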

    Please be aware that this test report is intended to communicate the absolute best baseline achievable by the DSP core code: it is run on a simulator which assumes that all code and data are in L1 memory (no memory hierarchy, therefore no cache stalls). At the VXLIB level, we wanted to show this baseline so you can see what the core loops can achieve relative to each other, without considering memory hierarchy latencies.

    The actual TDA4x board performance from OpenVX is shared in the other table you mentioned:
    http://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/tiovx/docs/user_guide/TIOVX_PERFORMANCE_J7ES_LINUX.html

    As of now, these VXLIB kernels operate from DDR via the L1 and L2 caches. DMA optimizations (for example, using the BAM framework) bring on average about a 2x speed improvement and are available on TDA2/3, but a pending DMA library update is needed for TDA4 before we can enable the BAM DMA optimizations. So for now, these numbers reflect the cache-only mode for these kernels (as indicated at the top of the test report).

    So the discrepancy you are seeing is primarily due to stalls from cache misses through L1/L2 and out to DDR.

    DSP Speed = 1.35 GHz
    Pixels = 307200
    Time = 3.978 ms
    Effective Cycles Per Pixel with cache stall/memory latencies = 0.003978 s * 1.35 GHz / 307200 = 17.48 cycles / element

    Many of the kernels in VXLIB are highly optimized for the DSP, but are so simple that they are I/O bound (add/absdiff/multiply, etc.). Simply reading images from DDR through the cache, doing a pixel-wise multiply, and writing back to DDR heavily underutilizes the DSP, since most of the time the DSP is stalled on cache misses. The memory system behind the caches simply cannot feed this kernel fast enough to keep a DSP running at full speed free of stalls. However, you shouldn't think of this as a blanket (17.48/0.375) 47x degradation due to memory, because the effect is not linear.
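
    As a rough back-of-envelope illustration using the figures above (counting only this kernel's pixel traffic):

      Data moved per element (two s16 loads + one s16 store) = 6 bytes
      Rate needed to sustain 0.375 cycles/element at 1.35 GHz:
        (6 bytes / 0.375 cycles) * 1.35 GHz ≈ 21.6 GB/s

    For VGA-sized and larger images essentially all of that traffic misses the caches and has to come from (or go to) DDR, which is exactly the "cannot feed the kernel fast enough" situation described above.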
     
    Assuming you keep the interface the same, so the kernel reads and writes the same amount of data, but the compute did much more than a multiply and took 18 cycles per pixel, then the actual result with the memory system would still be about 18-19 cycles per element (estimate), because the compute and the I/O latencies can largely be overlapped and are much more balanced.

    This highlights how one should optimize loops and algorithms running on the DSP to get maximum performance. If your loop takes a relatively high number of cycles per element (like > 17 in this case), it may be worth spending time optimizing the loop to bring the cycle count down. If the optimized loop is still > 17 cycles per element, then using DMA to bring data from DDR into L2SRAM will not improve performance, since the compute is the bottleneck.

    However, if the optimized loop takes significantly fewer cycles than this, then memory I/O is the bottleneck, and using DMA to ping/pong blocks of data into L2SRAM in parallel with the compute can bring further improvement, since it hides the latencies of the memory hierarchy. This was the purpose of BAM, and we hope to enable that framework on TDA4 in the coming year.
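
    To illustrate the ping/pong idea, here is a minimal sketch. The dma_start()/dma_wait() calls are placeholders standing in for a real DMA driver (here they just memcpy synchronously so the sketch compiles); BAM packages this kind of scheme for you:

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_ELEMS  (8 * 1024)

    /* Placeholder "DMA" calls: synchronous memcpy stand-ins.  A real
     * implementation would issue asynchronous transfers and wait on them. */
    static void dma_start(void *dst, const void *src, uint32_t bytes) { memcpy(dst, src, bytes); }
    static void dma_wait(void) { /* wait for all outstanding transfers */ }

    /* Double-buffered (ping/pong) element-wise multiply.  The l2_* buffers are
     * assumed to live in L2SRAM and each hold 2 * BLOCK_ELEMS elements.
     * num_elems is assumed to be a multiple of BLOCK_ELEMS. */
    void multiply_s16_pingpong(const int16_t *in0, const int16_t *in1, int16_t *out,
                               uint32_t num_elems, int16_t *l2_in0, int16_t *l2_in1,
                               int16_t *l2_out)
    {
        uint32_t num_blocks = num_elems / BLOCK_ELEMS;
        uint32_t blk, i, pp = 0;

        /* Prime the pipeline: fetch block 0 into the "ping" half. */
        dma_start(&l2_in0[0], &in0[0], BLOCK_ELEMS * sizeof(int16_t));
        dma_start(&l2_in1[0], &in1[0], BLOCK_ELEMS * sizeof(int16_t));

        for (blk = 0; blk < num_blocks; blk++) {
            uint32_t cur = pp * BLOCK_ELEMS;         /* half to compute on        */
            uint32_t nxt = (pp ^ 1u) * BLOCK_ELEMS;  /* half to fill for blk + 1  */

            dma_wait();                              /* block blk is in L2SRAM    */

            if (blk + 1 < num_blocks) {              /* start fetching blk + 1    */
                dma_start(&l2_in0[nxt], &in0[(blk + 1) * BLOCK_ELEMS], BLOCK_ELEMS * sizeof(int16_t));
                dma_start(&l2_in1[nxt], &in1[(blk + 1) * BLOCK_ELEMS], BLOCK_ELEMS * sizeof(int16_t));
            }

            /* Compute on-chip while (with a real DMA) the next block streams in. */
            for (i = 0; i < BLOCK_ELEMS; i++) {
                l2_out[cur + i] = (int16_t)(l2_in0[cur + i] * l2_in1[cur + i]);
            }

            /* Write the finished block back to DDR (could also be double-buffered). */
            dma_start(&out[blk * BLOCK_ELEMS], &l2_out[cur], BLOCK_ELEMS * sizeof(int16_t));

            pp ^= 1u;
        }

        dma_wait();  /* drain the final write-back */
    }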

    Another optimization to consider: if you are cascading several kernels (loops) on the DSP over the whole image, you might tile the processing and keep intermediate results in L2SRAM, so you only pay L1 cache latencies rather than the more expensive L2-to-DDR path.
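
    A shortened sketch of that idea, with two cascaded element-wise stages and the intermediate tile kept on-chip (where the tile buffer is actually placed, e.g. in L2SRAM, is determined by your memory map / linker command file, so that part is only hinted at here):

    #include <stdint.h>

    #define TILE_ELEMS  (16 * 1024)

    /* Intermediate tile; place this buffer in L2SRAM via the linker command file. */
    static int16_t tile_tmp[TILE_ELEMS];

    /* Two cascaded element-wise stages processed tile by tile, so the
     * intermediate result never makes a round trip to DDR. */
    void multiply_then_add_tiled(const int16_t *in0, const int16_t *in1,
                                 const int16_t *in2, int16_t *out, uint32_t num_elems)
    {
        uint32_t base, i;
        for (base = 0; base < num_elems; base += TILE_ELEMS) {
            uint32_t n = (num_elems - base < TILE_ELEMS) ? (num_elems - base) : TILE_ELEMS;

            for (i = 0; i < n; i++)                 /* stage 1: multiply */
                tile_tmp[i] = (int16_t)(in0[base + i] * in1[base + i]);

            for (i = 0; i < n; i++)                 /* stage 2: add      */
                out[base + i] = (int16_t)(tile_tmp[i] + in2[base + i]);
        }
    }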

    If, in the end, you meet your real-time requirements and all of your loops are still I/O bound, you can save power by reducing the DSP clock speed until the compute time is better balanced with the I/O time.

    Please let me know if this makes sense and if you have any follow up questions.

    Regards,

    Jesse

  • Hello Jesse,

    Thanks for your detailed explanation. I still have some follow up questions:

    Jesse Villarreal said:
    As of now, these VXLIB kernels operate from DDR via the L1 and L2 caches. DMA optimizations (for example, using the BAM framework) bring on average about a 2x speed improvement and are available on TDA2/3, but a pending DMA library update is needed for TDA4 before we can enable the BAM DMA optimizations. So for now, these numbers reflect the cache-only mode for these kernels (as indicated at the top of the test report).

    So, basically, with BAM working on TDA4, we would get around a 2x speedup over a non-BAM sequence of kernels. Then, instead of the 47x degradation, we would see roughly a 23x slowdown compared to a 100% L1 hit rate. I'm not yet convinced that a 23x slowdown is a good result on average. Could you please share a full example of BAM results? If possible, the best case, with a long pipeline of kernels.

    Jesse Villarreal said:
    So the discrepancy you are seeing is primarily due to stalls from cache misses through L1/L2 and out to DDR.

    Your conclusion seems fair, but only if no hardware prefetchers are present in the cache hierarchy. Without prefetching, every cache line miss (64 bytes for L1, 128 bytes for L2) has to pay the DDR latency, which should be in the hundreds of CPU cycles.

    However, the C66 has an L2 hardware prefetcher. If it has been properly designed, then in a stream-processing kernel the DDR latency should be paid only a few times, until the prefetcher warms up and detects the two source streams and one destination stream.
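
    To make this concrete, here is the crude model I have in mind (the DDR round-trip value is a guess on my part, not a measurement, and the store stream is ignored):

    #include <stdio.h>

    /* Fully-serialized miss model for the two s16 input streams: assume every
     * 128-byte L2 line fetched from DDR stalls the DSP for 'penalty' cycles,
     * with no prefetching and no overlap.  All numbers are back-of-envelope. */
    int main(void)
    {
        const double ideal_cpe  = 0.375;            /* inner-loop cycles/element   */
        const double elems_line = 128.0 / 2.0;      /* s16 elements per 128 B line */
        const double lines_elem = 2.0 / elems_line; /* two input streams           */
        const double penalty    = 300.0;            /* GUESSED DDR round trip      */

        printf("cycles/element ~= %.2f\n", ideal_cpe + lines_elem * penalty);
        /* ~9.75 with these numbers; a well-behaved prefetcher should hide most
         * of the miss term for simple streaming kernels. */
        return 0;
    }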

    So, my next questions are:

    1. Is the L2 hardware prefetcher enabled by default in the TDA4 RTOS SDK?
    2. According to the C66x CorePac User's Guide (Rev. C), the prefetcher type seems to be a stride prefetcher. Is it possible to set up the prefetcher parameters? For instance, set the number of prefetch requests to DDR once a stream has been detected.
    3. As far as I understood, the L1 program cache is permanently disabled in the current TDA4 revision. Will the L1 program cache be enabled in future silicon revisions? What's the performance loss by not having it enabled?
    4. Do you have an estimate of when the BAM-plugin will be available for the TDA4?

    Thanks for your help,

    Fernando

  • Fernando,

    Here are some answers to your questions:

    Fernando Endo said:
    Is the L2 hardware prefetcher enabled by default in the TDA4 RTOS SDK?

    As far as I know, there is no way to turn it off in SW, so yes it is enabled.

    Fernando Endo said:

    According to the C66x CorePac User's Guide (Rev. C), the prefetcher type seems to be a stride prefetcher. Is it possible to set up the prefetcher parameters? For instance, set the number of prefetch requests to DDR once a stream has been detected.

    No this is not configurable.

    Fernando Endo said:
    As far as I understood, the L1 program cache is permanently disabled in the current TDA4 revision. Will the L1 program cache be enabled in future silicon revisions? What's the performance loss by not having it enabled?

    What documentation or discussion led to this conclusion?  The L1 program cache is not disabled in C66x on TDA4.

    Fernando Endo said:
    Do you have an estimate of when the BAM-plugin will be available for the TDA4?

    Current estimate given our priorities is end of year 2020.  Please let me know if you will need this earlier or later based on your schedule and we may be able to adjust the priority.  Our understanding is that since this is a performance optimization feature, it is typically needed after initial development but before production/optimization phases.

    Additional comments:

    1. The number of lines prefetched by the prefetcher does not cover the entire latency of a trip to DDR memory. In addition, the L2 controller itself can only see a few cache line misses at a time. So the prefetcher should not be expected to reduce the DDR latency penalty to zero over time.
    2. In the SDK, the L2 memory is configured as 64 KB of cache, with the rest set up as addressable RAM. We did this in anticipation of people using the L2 RAM as a scratchpad for DMA (either custom DMA or, for example, BAM in the future). In your experiments, if you are not using L2 RAM, then you can configure the full L2 memory as cache to get better cache performance.

    Regards,

    Jesse

  • Hello Jesse,

    Thanks again for your detailed answers. Here is some discussion and details requested:

    Jesse Villarreal said:
    What documentation or discussion led to this conclusion?  The L1 program cache is not disabled in C66x on TDA4.

    There is a note in "SPRUIL1A – May 2019 – Revised November 2019", the AM752x/DRA829/TDA4VM Technical Reference Manual, section 6.4.1.1 C66SS Features:

    "NOTE: The C66x L1P memory is disabled (not supported) in this device."

    Jesse Villarreal said:
    The number of lines prefetched by the prefetcher does not cover the entire latency of a trip to DDR memory. In addition, the L2 controller itself can only see a few cache line misses at a time. So the prefetcher should not be expected to reduce the DDR latency penalty to zero over time.

    Yes, I agree; that's why I asked whether it is possible to change the prefetcher configuration, especially the number of requests and/or the prefetch distance (i.e., prefetch more cache lines ahead, and/or prefetch the line that is expected to be accessed a configurable time in the future). By tuning these parameters per kernel, we could hide the DDR latency satisfactorily.

    Jesse Villarreal said:
    In the SDK, the L2 memory is configured as 64 KB of cache, with the rest set up as addressable RAM. We did this in anticipation of people using the L2 RAM as a scratchpad for DMA (either custom DMA or, for example, BAM in the future). In your experiments, if you are not using L2 RAM, then you can configure the full L2 memory as cache to get better cache performance.

    Good to know, thanks!

    Kind regards,

    Fernando A. Endo

  • Fernando,

    Fernando Endo said:
    "NOTE: The C66x L1P memory is disabled (not supported) in this device."

    This must be referring to the addressable RAM option for L1P. In some devices, the L1 and L2 memories can be configured as cache, as addressable RAM, or as a combination of the two. In the case of TDA4, the 32 KB L1P is fixed as full cache and cannot be configured as addressable RAM. I will file a ticket to see if this note can be clarified to avoid confusion.

    Thanks,

    Jesse