Linux/TMDSEVM572X: OpenCV performance with DSP offload behaves strangely.

Part Number: TMDSEVM572X

Tool/software: Linux

Hello,

I am using the AM5728 EVM with OpenCV (DSP offload enabled). I am wondering whether my test programs are working correctly.

My questions are as follows ("API" means an OpenCV C++ API function):

 1. With DSP enabled, processing is too fast:
   - I think it is implausibly fast that the test image (A4 at 300 dpi, about 2600 x 3700 px, JPEG) is processed in less than 1 ms.
   - Could this be asynchronous processing? Do other API calls include the time spent waiting to synchronize with it?

 2. After an accelerated API call, other API calls become slow:
   - For example, "Canny" is accelerated, but the next API call, such as "imwrite" on the current image (or "imread" on the next image, even when "imwrite" is skipped), is slow.
   - As a result, the total time with DSP enabled is longer than with DSP disabled.

 3. If 1 and 2 are normal behavior, is the following sequence an unsuitable use case?
   - Receive image from imaging unit -> Image read -> Image processing 1 -> Image processing 2 -> ... -> Image write -> Send image to host
   - (I think this flow is very common...)

The following table shows my test results.

- File processing order:
  enable DSP (by changing an environment variable) -> DSP + image1 -> DSP + image2 -> DSP + image3 -> disable DSP -> CPU + image1 -> CPU + image2 -> CPU + image3.
- Internal processing order (per image):
  imread -> cvtColor -> GaussianBlur -> Canny -> imwrite.

  • Hi,

    Instead of running 3 images in sequence before and after enabling DSP offload, could you try running one image at a time?

    OpenCL in OpenCV has a somewhat different call flow. It submits a task to the DSP OpenCL queue and returns immediately. This lets the accelerator (DSP) and the CPU work in parallel, as long as data dependencies are satisfied. So when an action that needs the result is requested, the CPU must wait for the submitted task to finish before it can actually perform that action. In many cases the DSP is slower than the A15 (using NEON). The time cost of an action offloaded to the DSP includes the waiting time (for the DSP to finish), plus the actual action time and any format conversion (done on the CPU). It also depends heavily on the data types used and on whether floating-point operations are involved. This can be accelerated if a DSP-optimized implementation of that action is created.

    Rex
  • Hi,

    Have you tried running the test separately with each image? Do you get different results, or are they still the same?
    If you don't have any more questions, I'll close this thread. Please open a new one if you have other issues.

    Rex
  • Hi,

    Thank you for your reply. I'll try the method you proposed.

    By the way, if I want to accelerate the use case I described*, what is the solution? I think the device can't keep many images open at the same time because of its memory size limit...

    I imagine something like this: divide the work into separate processes and use IPC? One process would only process image buffers with the DSP, and the other would only read/write image files with the CPU?

     *: Receive image from imaging unit -> Image read -> Image processing 1 -> Image processing 2 -> ... -> Image write -> Send image to host

  • Hi,

    The behavior you see comes from the difference between asynchronous (non-blocking) and synchronous (blocking) operations.
    With DSP OpenCL offload, the call submits a job and returns right away, as long as all input operands are available.
    If an input operand is not yet available, the call waits, as in the imwrite case, which I think is always a synchronous (blocking) operation.

    Rex