TDA2EVM5777: Vision SDK VXLIB API -> performance issue in image processing algorithms

Part Number: TDA2EVM5777
Other Parts Discussed in Thread: MATHLIB

Hi,

We have implemented a few image processing algorithms using the Vision SDK on TDA2x hardware, and we are facing a performance issue (high CPU usage).

You can find more details of the algorithms and other related parameters in the attached Excel sheet.

In the algorithm process function we are mainly setting some parameter values and calling the Vision SDK VXLIB APIs; there are no floating-point operations or busy loops. Could you please give us your feedback and any points we should take care of to improve the performance?

Let us know if you need more information or would like to see our code implementation.

Please reply to this post with priority, as this task is urgent for us.

Thanks & Regards,
Rajesh Rathod.

Performance.xls

  • Hi Rajesh,

    I have forwarded your question to a VXLIB expert.

    Regards,
    Yordan
  • Yes, can you please send me the exact VXLIB functions you are calling and the configuration parameters of each? You can see the best-case performance of each kernel in the test report that comes with the VXLIB release, and from there you can see how far you are from the best case. The best case in this report assumes that all code and data are in L1 with single-cycle access. This is typically not achievable in practice, but depending on the memory model you may be able to get close.

    One issue could be that the cache is not enabled for the DDR data regions where the buffers you pass to the VXLIB kernels are located, so be sure to check that the cache is configured properly. Furthermore, if you use DMA to bring part of the input image into L2 SRAM in a ping-pong fashion, you can get close to this best-case performance, since the DMA transfers run in parallel with the compute. Without DMA, memory access latency is probably your bottleneck, as many VXLIB kernels are more I/O bound than compute bound.

    If you are using only the cache and not DMA, you may still get good performance if you divide the image into slices and call your VXLIB kernels in series across these slices so that you get some cache locality benefit.

    Have you considered the above conditions? Let me know which VXLIB functions you are using and the measured cycles/pixel compared to the best case from the release, and I can confirm whether this is expected based on your answers to the DMA vs. cache vs. no-cache discussion above.
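    As a rough illustration of the slicing idea, here is a minimal sketch in C (the kernel call, buffer names, image dimensions, and slice height are placeholders, not your actual VXLIB configuration; real VXLIB kernels also take VXLIB_bufParams2D_t descriptors for each buffer):

        #include <stdint.h>

        #define IMG_W        1280U   /* assumed image width             */
        #define IMG_H        720U    /* assumed image height            */
        #define SLICE_LINES  32U     /* lines processed per kernel call */

        /* Placeholder for whichever VXLIB kernel you are calling. */
        extern void my_vxlib_kernel(const uint8_t *src, uint8_t *dst,
                                    uint32_t width, uint32_t height);

        /* Process a full-width image in horizontal slices so that a slice of
         * input/output data stays warm in cache across consecutive calls.   */
        void process_by_slices(const uint8_t *src, uint8_t *dst)
        {
            uint32_t y;
            for (y = 0U; y < IMG_H; y += SLICE_LINES)
            {
                uint32_t lines = (IMG_H - y < SLICE_LINES) ? (IMG_H - y) : SLICE_LINES;

                /* Each call touches only SLICE_LINES * IMG_W bytes per buffer. */
                my_vxlib_kernel(&src[y * IMG_W], &dst[y * IMG_W], IMG_W, lines);
            }
        }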
  • By the way, here is another post where I discuss such optimization techniques:
    e2e.ti.com/.../2177740
  • Hi,
    Thanks for your reply and the detailed explanation. As you requested, I have attached an Excel file that lists the VXLIB APIs we use for each algorithm, along with parameter information.

    Please let me know if something is missing in given file.

    In addition to this, I have some questions as below:

    1. You suggested processing a few lines of the image at a time (e.g. 32 lines). With that approach, how can we use a 2D convolution algorithm? If I am not wrong, this approach is not possible with 2D convolution; if so, is there an alternate approach we can apply for the 2D convolution algorithm?

    2. From the Vision SDK documentation we found that DSP1_L2_SRAM and DSP2_L2_SRAM have some unused portion. Can we increase the size of DSP1_L2_SRAM and DSP2_L2_SRAM and utilize that unused portion? What is your view on that?

    Thanks & Regards,
    API_LIST.xls

  • Hi,

    Any updates on this thread?

    In addition, while reading about performance optimization techniques on the DSP I came across the software pipelining concept and the use of the 'restrict' keyword. I have tried adding the 'restrict' keyword, but it made no difference.

    My questions are: does the Vision SDK use software pipelining internally? If not, how and where can we add the '-o3' and '-mt' compilation flags so that we can tell the compiler we want software pipelining? Is there any similar technique we can apply?

    Please reply to these questions, and the questions above, with priority.

    Any help would be appreciated.

    Thanks & Regards,

  • FYI,

    There are multiple online DSP optimization reference guides and training sessions. The following lists a subset of these that may be of interest to someone ramping up on DSP optimization.

    a. C6000 Optimizing Compiler (Reference Manual):

    1. Lists all of the intrinsic functions and the instructions they map to.
    2. Lists all compiler options
    3. http://www.ti.com/lit/ug/spru187u/spru187u.pdf

    b. C66 DSP CPU and Instruction Set (Reference Manual)

    1. Lists all the instructions and processing units they belong to
    2. www.ti.com/lit/ug/sprugh7/sprugh7.pdf

    c. C66 DSP Architecture (Training Session)

    1. https://training.ti.com/keystone-c66x-dsp-corepac-overview

    d. Tutorial for optimizing Vision Kernels on TI DSP (Application Note)

    1. http://www.ti.com/lit/an/spna165/spna165.pdf

    e. Optimizing loops on the C66 DSP (Application Note)

    1. http://www.ti.com/lit/an/sprabg7/sprabg7.pdf

    f. Introduction to TMS320C6000 DSP Optimization (Application Note)

    1. http://www.ti.com/lit/an/sprabf2/sprabf2.pdf?DCMP=leadership&HQS=ep-pro-dsp-leadership-problog-150507-mc-en

    I will make a separate post to answer your questions.

    Jesse

  • Rajesh Rathod said:

    1. You suggested processing a few lines of the image at a time (e.g. 32 lines). With that approach, how can we use a 2D convolution algorithm? If I am not wrong, this approach is not possible with 2D convolution; if so, is there an alternate approach we can apply for the 2D convolution algorithm?

    Splitting the image into tiles or blocks is still possible with 2D convolution, but additional precautions must be taken. For example, let's say we still read the entire width of the image, but we split the image height into pieces (e.g. 32 lines). Using the example of a 5x5 2D convolution, the number of output lines is equal to the number of input lines minus 4. So for the first slice you would read 32 lines and output 28 lines. When you read the next slice, you need to re-read the last 4 lines of the previous slice as the first 4 lines of the new slice, so that the output matches as if you had not divided up the image at all. The same applies if you divide the image vertically (by columns): you need to re-read the last columns of the previous block.

    So if you cascade multiple kernels that perform some kind of filtering that reduces the output image size, then the block reads need to account for the sum of the pixel reduction across the cascaded kernels.
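    To make the bookkeeping concrete, here is a minimal sketch for the 5x5 case (the convolution call, buffer names, and image dimensions are placeholders; a real VXLIB convolve kernel also takes the filter coefficients and 2D buffer descriptors):

        #include <stdint.h>

        #define IMG_W      1280                 /* assumed image width       */
        #define IMG_H      720                  /* assumed image height      */
        #define KERN       5                    /* 5x5 convolution           */
        #define OVERLAP    (KERN - 1)           /* 4 lines re-read per slice */
        #define SLICE_IN   32                   /* input lines per slice     */
        #define SLICE_OUT  (SLICE_IN - OVERLAP) /* 28 output lines per slice */

        /* Placeholder for the 5x5 convolution kernel you are calling. */
        extern void conv5x5_slice(const uint8_t *src, uint8_t *dst,
                                  int32_t width, int32_t in_lines);

        void conv5x5_by_slices(const uint8_t *src, uint8_t *dst)
        {
            int32_t in_y, out_y = 0;

            /* Step the input window by SLICE_OUT (28) lines, not SLICE_IN (32),
             * so each new slice re-reads the last OVERLAP (4) input lines of
             * the previous slice and no output lines are skipped.            */
            for (in_y = 0; in_y + SLICE_IN <= IMG_H; in_y += SLICE_OUT)
            {
                conv5x5_slice(&src[in_y * IMG_W], &dst[out_y * IMG_W],
                              IMG_W, SLICE_IN);
                out_y += SLICE_OUT;
            }
            /* Any remaining lines at the bottom of the image would be handled
             * with one final, shorter slice in the same way.                 */
        }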

    Best Regards,

    Jesse

  • When the Vision SDK is compiled in release mode, -o3 is enabled and software pipelining should be supported. I suggest reading some of the optimization techniques from the documents I posted earlier. They show how to inspect the assembly files generated by the compiler to see the compiler's feedback on whether a loop was pipelined or not, and why. For example, if a loop contains any function calls, the compiler cannot pipeline it; you would need to either mark the called functions as "inline" or manually move their code into the loop. Some math operations also implicitly call RTS library functions; for example, an integer divide or modulus internally calls an RTS library routine, which may prevent pipelining of the loop. Some of these math.h calls can be avoided by using MATHLIB from TI, which provides inline approximation functions.
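    As a small illustration of the divide/modulus point, here is a sketch (the loop is an arbitrary example, not taken from your code):

        #include <stdint.h>

        /* The modulus below compiles to a hidden RTS divide call,
         * which prevents the compiler from software-pipelining the loop. */
        void wrap_copy_slow(uint8_t *dst, const uint8_t *src,
                            int32_t n, int32_t period)
        {
            int32_t i;
            for (i = 0; i < n; i++)
            {
                dst[i] = src[i % period];
            }
        }

        /* Replacing the modulus with a running counter keeps the loop body
         * call-free, so it can be software-pipelined at -o3.              */
        void wrap_copy_fast(uint8_t *dst, const uint8_t *src,
                            int32_t n, int32_t period)
        {
            int32_t i, j = 0;
            for (i = 0; i < n; i++)
            {
                dst[i] = src[j];
                j++;
                if (j == period)
                {
                    j = 0;
                }
            }
        }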

    Another reason why pipelining may be disabled is that the loop has too many instructions. The assembly file generated by the compiler may indicate this. In that case, you would need to split the loop into smaller loops so that each of the smaller loops can be pipelined.

    The restrict keyword may or may not help. It usually helps in API definitions where the compiler does not know at compile time which addresses are being passed, and it may remove loop-carried dependencies.
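    For example, here is a minimal sketch of restrict on a function definition (a hypothetical function, not a VXLIB API):

        #include <stdint.h>

        /* Without restrict the compiler must assume dst may alias src0/src1,
         * which adds loop-carried dependencies and can block pipelining.    */
        void add_u8(uint8_t * restrict dst,
                    const uint8_t * restrict src0,
                    const uint8_t * restrict src1,
                    int32_t n)
        {
            int32_t i;
            for (i = 0; i < n; i++)
            {
                dst[i] = (uint8_t)(src0[i] + src1[i]);
            }
        }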

    All of these issues are explained in more detail in the Optimizing Loops document: www.ti.com/.../sprabg7.pdf

    Regards,
    Jesse
  • Rajesh Rathod said:

    2. From the Vision SDK documentation we found that DSP1_L2_SRAM and DSP2_L2_SRAM have some unused portion. Can we increase the size of DSP1_L2_SRAM and DSP2_L2_SRAM and utilize that unused portion? What is your view on that?

    Yes, you can increase them; however, you need to make sure that the unused portion you are adding is not already assigned to the L2 cache.
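    As a rough illustration of using such a section from code (the section name, alignment, and buffer size below are placeholders; the actual section names and sizes come from the Vision SDK memory map and linker command file, and the L2 cache partition is configured separately):

        #include <stdint.h>

        /* Place a scratch buffer into the DSP's L2 SRAM section after enlarging it.
         * ".bss:l2sram" is a hypothetical section name; use the one defined for
         * DSP1_L2_SRAM / DSP2_L2_SRAM in your linker command file.              */
        #pragma DATA_SECTION(gScratchLines, ".bss:l2sram")
        #pragma DATA_ALIGN(gScratchLines, 128)
        static uint8_t gScratchLines[32 * 1280];   /* e.g. 32 lines of a 1280-wide image */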