AM5728: OpenCV algorithm processing delay

Part Number: AM5728


Hi,

I'm trying to profile the time taken by the OpenCV background subtraction algorithm (BackgroundSubtractorMOG2) on an AM5728-based platform.
I have enabled DSP acceleration, and so I get the following debug logs from the OpenCL program:

...

[core 1] TIDSP Modified MOG2 clk=15294248 frame_row=480 frame_col=640 (80a580 80c980 808180) prune=-0.000100
[core 0] TIDSP Modified MOG2 clk=15297946 frame_row=480 frame_col=640 (80a580 80c980 808180) prune=-0.000100

The above logs seem to indicate that each 600 MHz DSP core took approximately 25.5 ms to process an image (15294248 / 600000000).
My understanding is that the OpenCL programs run in parallel on the two DSP cores, so the time taken by the algorithm to process the entire image (640x480 resolution) should be about 25.5 ms.
However, when I profile the time taken by the algorithm on the ARM side, it is almost double (~51 ms).
Are the OpenCL programs on the two DSP cores somehow serialized, or am I missing something here?

Regards,
Manu

  • The software team has been notified. They will respond here.
  • Hi Manu,

    Execution on the DSP cores is not serialized. The work is partitioned over the input data: the top half of the image is processed by the first core, and the bottom half by the second core.

    You can see the implementation details in: http://git.ti.com/opencv/tiopencv/blobs/tiopencvrelease_3.1/modules/video/src/opencl/bgfg_mog2.cl

    Regarding the difference in execution time between the DSP and the ARM measurements: the reported clocks come directly from the C66x core and do not include the data transfer between Linux user space and DSP memory.

    Please note that this background estimator has a very large state: the majority of the state variables are single-precision floating point (SP32 = 4 bytes).

    In fact, there are three SP32 matrices (each with the same number of elements as there are pixels): weights, variance, and mean, plus an additional uchar matrix for the modes.

    So, apart from transferring the input image, a lot of state data needs to be updated for each frame.

    These data transfers are not accounted for in the DSP clk values, but they are visible when you do ARM-based profiling.

    Hope this helps. Regards,

    djordje

  • Hello Djordje,

    Thanks for the detailed reply. This is helpful.
    What you have said makes a lot of sense, though it is surprising that the data transfer delay is as large as the processing delay!
    Anyway, thanks again for your input.

    Regards,
    Manu