memcpy is very slow

When my 138 board starts, the log shows:

ARM Clock : 456000000 Hz
DDR Clock : 198000000 Hz

In my code, processing frame data requires some memcpy calls, but I find that memcpy is very slow on the DSP side. In my program, the ARM communicates with the DSP through ListMP. On the DSP, I copy a D1 420 frame from the ARM side into a DSP buffer.

I don't know what is wrong. Can anyone help me?

Thanks.

  • Please refer to the following posts, which are similar to yours.

    http://e2e.ti.com/support/embedded/linux/f/354/t/146992

    http://e2e.ti.com/support/dsp/omap_applications_processors/f/447/p/29260/101628#101628

    http://e2e.ti.com/support/dsp/tms320c6000_high_performance_dsps/f/115/t/129433
  • Hi Changsheng Li,

    We would like to understand how you measure the execution time of the function "memcpy". Did you use the profile clock option in CCS, which measures the instruction cycles between lines of code?

    Also, please give the name of the package you use and the data rate of the memory copy.

    In which memory segment do the source buffer and the destination buffer lie?

    Regards,
    Shankari

    -------------------------------------------------------------------------------------------------------
    Please click the Verify Answer button on this post if it answers your question.
    --------------------------------------------------------------------------------------------------------
  • Hi Titus and Shankari G:
    Thanks very much!
    In my project, I am using MCSDK.
    I capture video frames on the ARM side and use ListMP (SysLink) to send D1 frame data to the DSP side. I find that on both the ARM and DSP sides, memcpy needs about 40 ms to copy a 608 KB buffer.

    The memory segment is SHARED_REGION_1.
    Thanks.
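
The figures in the post above can be checked with a little arithmetic. The sketch below assumes a PAL D1 frame of 720x576 with 8-bit samples (the thread only says "D1", but the 288-row loop later in the thread is consistent with 576 lines); the function names are hypothetical. It shows that a 4:2:0 frame is 622080 bytes, about 608 KB, and that copying it in 40 ms works out to roughly 15.5 MB/s.

```c
#include <stdint.h>

/* Bytes in one D1 YUV 4:2:0 frame. Assumes 720x576 (PAL D1) and 8-bit
   samples: a full-size Y plane plus quarter-size U and V planes. */
static uint32_t d1_420_frame_bytes(void)
{
    const uint32_t width = 720u, height = 576u;
    return width * height * 3u / 2u;
}

/* Effective copy rate in bytes per second for a copy of `bytes` bytes
   that took `millis` milliseconds. */
static uint32_t copy_rate_bytes_per_sec(uint32_t bytes, uint32_t millis)
{
    return (uint32_t)(((uint64_t)bytes * 1000u) / millis);
}
```

Calling `copy_rate_bytes_per_sec(d1_420_frame_bytes(), 40)` gives 15552000 bytes per second. A rate this low would be consistent with uncached access to the shared region, although the thread does not confirm that this is the cause.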

  • On the ARM side or the DSP side, I convert 422 UV to 420 UV:
    for (i = 0; i < 288; i++)
    {
        for (j = 0; j < 360; j++)
        {
            *uu++ = *ss++;
            *vv++ = *ss++;
        }
        ss += 720;
    }

    When I run this code, it takes about 58 ms. So slow!

  • Hi Changsheng Li,

    I thought you were using the "memcpy" library function.

    Have you tried the memcpy function instead of this routine?

    Regards,

    Shankari

  • No, I can't. Because the 422 UV data is interleaved, I must copy the UV samples one by one when converting to 420.
    Thanks.
  • Some suggestions:
    1) Turn up the compiler optimization.
    2) Access memory in the largest width possible. Accessing slow 32-bit wide memory 8 bits at a time is inefficient.
    3) Declare the most frequently used variables with the register keyword. If you have enough free registers, all your variables stay out of slow memory. Maximum compiler optimization might do this for you if the compiler knows your loop count.
    4) Enable data and instruction caching.
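
Suggestion 2 can be sketched for the UV split discussed earlier in the thread. The function below is a hypothetical helper, not code from the thread: it reads the interleaved chroma four bytes at a time and writes each plane two bytes at a time, instead of moving single bytes. It assumes a little-endian target, an even number of U/V pairs per row, and the same layout as the original loop (one row of interleaved UV kept, the next row skipped). The fixed-size memcpy calls stand in for word loads and stores; on the DSP a compiler intrinsic for aligned access would be the usual choice, which is not shown here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Split interleaved chroma (U0 V0 U1 V1 ...) into separate U and V planes,
   keeping every other source row. Hypothetical helper; assumes a
   little-endian CPU and an even pairs_per_row. */
static void split_uv_drop_rows(const uint8_t *src, uint8_t *u, uint8_t *v,
                               size_t pairs_per_row, size_t out_rows)
{
    for (size_t row = 0; row < out_rows; row++) {
        for (size_t j = 0; j < pairs_per_row; j += 2) {
            uint32_t w;
            memcpy(&w, src, 4);          /* one 32-bit read: U0 V0 U1 V1 */
            src += 4;
            /* Little-endian byte order: byte 0 = U0, 1 = V0, 2 = U1, 3 = V1. */
            uint16_t uu = (uint16_t)((w & 0xFFu) | ((w >> 8) & 0xFF00u));
            uint16_t vv = (uint16_t)(((w >> 8) & 0xFFu) | ((w >> 16) & 0xFF00u));
            memcpy(u, &uu, 2);           /* one 16-bit write: U0 U1 */
            memcpy(v, &vv, 2);           /* one 16-bit write: V0 V1 */
            u += 2;
            v += 2;
        }
        src += pairs_per_row * 2;        /* skip the next source row */
    }
}
```

For the frame in the thread, the call would be `split_uv_drop_rows(ss, uu, vv, 360, 288)`. Whether this is faster in practice depends on the compiler, the optimization level, and whether the buffers are cached, so it should be measured rather than assumed.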

    Your YUV422 to YUV420p code does not appear complete: it is missing the Y data and the averaging of the two U and V samples taken from each pair of rows.
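
The averaging mentioned in the last reply can be sketched as follows. This is a minimal illustration under assumptions the thread does not state: the chroma is stored as one interleaved UV plane (U0 V0 U1 V1 ...) with `pairs_per_row` pairs per row, and the Y plane is stored separately, so it can be copied unchanged. The function name and parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert interleaved 4:2:2 chroma to planar 4:2:0 chroma by averaging
   each vertical pair of rows (rounded to nearest). Hypothetical helper;
   the source must hold 2 * out_rows rows of pairs_per_row U/V pairs. */
static void uv422_to_420_average(const uint8_t *src, uint8_t *u, uint8_t *v,
                                 size_t pairs_per_row, size_t out_rows)
{
    const size_t stride = pairs_per_row * 2;   /* bytes per source row */

    for (size_t row = 0; row < out_rows; row++) {
        const uint8_t *top = src + (2 * row) * stride;
        const uint8_t *bot = top + stride;

        for (size_t j = 0; j < pairs_per_row; j++) {
            *u++ = (uint8_t)((top[2 * j] + bot[2 * j] + 1) >> 1);
            *v++ = (uint8_t)((top[2 * j + 1] + bot[2 * j + 1] + 1) >> 1);
        }
    }
}
```

For the D1 frame in the thread this would be called with 360 pairs per row and 288 output rows. It reads twice as much source data as the row-dropping loop, so it costs more time in exchange for smoother chroma.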