SK-AM62P-LP: About memcpy speed

Part Number: SK-AM62P-LP
Other Parts Discussed in Thread: AM62P

Tool/software:

Hello TI team

I'm using the MCU+ SDK display share demo, and I found that the speed of memcpy seems abnormal. The code is below:

        endMs = (ClockP_getTimeUsec() / 1000U);
        DebugP_log("Splash -> Gen 1 buffer: %u ms\r\n", (endMs - gBoot2SplashStartMs));
        {   // copy second frame
            uint8_t *dst  = (uint8_t*)&gFirstPipelineFrameBuf[1];
            uint8_t *src  = (uint8_t*)&gFirstPipelineFrameBuf[0];
            size_t   bytes = 5529600U; // 1920*720*4 = 5,529,600
            memcpy(dst, src, bytes);
            /* Ensure DDR visibility for the second frame */
            CacheP_wb(dst, (uint32_t)bytes, CacheP_TYPE_ALLD);
        }
        endMs = (ClockP_getTimeUsec() / 1000U);
        DebugP_log("Splash -> Gen 2 buffer: %u ms\r\n", (endMs - gBoot2SplashStartMs));

And I got the log below:

Splash -> Gen 1 buffer: 264 ms
Splash -> Gen 2 buffer: 408 ms

That means the memcpy of about 5,529,600 bytes took about 140 ms.

Why so slow?

  • Hi,
    To remove any variability introduced by other function calls apart from memcpy, can you please try something like:

    {   // copy second frame
        uint8_t *dst  = (uint8_t*)&gFirstPipelineFrameBuf[1];
        uint8_t *src  = (uint8_t*)&gFirstPipelineFrameBuf[0];
        size_t   bytes = 5529600U; // 1920*720*4 = 5,529,600
        startMs = (ClockP_getTimeUsec() / 1000U);
        memcpy(dst, src, bytes);
        endMs = (ClockP_getTimeUsec() / 1000U);
        /* Ensure DDR visibility for the second frame */
        CacheP_wb(dst, (uint32_t)bytes, CacheP_TYPE_ALLD);
    }
    DebugP_log("Splash -> Gen 2 buffer: %u ms\r\n", (endMs - startMs));

    Also, is this value consistent over multiple frames?

  • Hello TI team

    I changed the code and got the log below:

    Splash -> Gen 2 buffer: 78 ms

    The modification is in DispApp_splashThread(): the second buffer is now filled with memcpy instead of generating another frame:

    /* Update frame buffers for the pipeline before starting display */
    DispApp_updateSplashFrameBuffer((void*)&gFirstPipelineFrameBuf[0], DISP_SPLASH_IMAGE_XPOSTION, \
                                    DISP_SPLASH_IMAGE_YPOSTION, DISP_SPLASH_IMAGE_WIDTH, \
                                    DISP_SPLASH_IMAGE_HEIGHT, DISP_BYTES_PER_PIXEL);
  • Are you building the image in DEBUG mode or RELEASE mode?

  • It's Debug mode.

  • Please try with RELEASE mode, it is expected to reduce that latency.

  • It helps, but not by much:

    Before switching to Release (this is the Debug-mode log):

    Sciserver Testapp Built On: Apr  3 2025 09:26:45
    Sciserver Version: v2025.04.0.0-REL.MCUSDK.K3.11.00.00.16+
    RM_PM_HAL Version: v11.00.07
    Starting Sciserver..... PASSED
    DispApp_init() - DONE !!!
    Display create complete!!
    Starting display ... !!!
    Display in progress ... DO NOT HALT !!!
    Splash -> Start to gen 1 Frame buffer 56 ms
    Splash -> Gen 1 buffer: 166 ms
    Splash -> memcpy 2 buffer: 140 ms
    Splash -> Gen 2 buffer: 355 ms
    Splash-> Start Fvid2 driver & first flame displayed: 358 ms
    DSS display share Passed!!
    [IPC RPMSG ECHO] Version: REL.MCUSDK.K3.11.00.00.16+ (Aug 27 2025 10:43:02):  

    Release Mode log below:

    Sciserver Version: v2025.04.0.0-REL.MCUSDK.K3.11.00.00.16+
    RM_PM_HAL Version: v11.00.07
    Starting Sciserver..... PASSED
    DispApp_init() - DONE !!!
    Display create complete!!
    Starting display ... !!!
    Display in progress ... DO NOT HALT !!!
    Splash -> Start to gen 1 Frame buffer 51 ms
    Splash -> Gen 1 buffer: 103 ms
    Splash -> memcpy 2 buffer: 140 ms
    Splash -> Gen 2 buffer: 322 ms
    Splash-> Start Fvid2 driver & first flame displayed: 325 ms
    DSS display share Passed!!
    [IPC RPMSG ECHO] Version: REL.MCUSDK.K3.11.00.00.16+ (Aug 27 2025 15:52:05):
    First Number of elapsed frames = 300, elapsed msec = 5005, fps = 59.94

    The frame generation / RLE decode got faster: 110 ms => 52 ms; the memcpy time did not change: 140 ms; but the CacheP_wb time increased from 49 ms to 79 ms.

    This is the log, which is printed from:

     
                }
                endMs = (ClockP_getTimeUsec() / 1000U);
                DebugP_log("Splash -> Generation 1 frame buffer finish: %u ms\r\n", (endMs - gBoot2SplashStartMs));
                {   // copy second frame
                    uint8_t *dst  = (uint8_t*)&gFirstPipelineFrameBuf[1];
                    uint8_t *src  = (uint8_t*)&gFirstPipelineFrameBuf[0];
                    size_t   bytes = 5529600U; // 1920*720*4 = 5,529,600
                    uint32_t startMs = (ClockP_getTimeUsec() / 1000U);
                    memcpy(dst, src, bytes);
                    endMs = (ClockP_getTimeUsec() / 1000U);
                    DebugP_log("Splash -> memcpy 2 buffer time endMs-startMs: %u ms\r\n", (endMs - startMs));
                    /* Ensure DDR visibility for the second frame */
                    CacheP_wb(dst, (uint32_t)bytes, CacheP_TYPE_ALLD);
                }
                endMs = (ClockP_getTimeUsec() / 1000U);
                DebugP_log("Splash -> Gen 2 buffer finished: %u ms\r\n", (endMs - gBoot2SplashStartMs));
    The DDR read/write speed looks very strange: by my calculation the bandwidth is only about 78.99 MB/s (2 x 5,529,600 bytes moved in ~140 ms, counting both the read and the write), which is much lower than the SK-AM62PX DDR speed.
  • Hi,
    Let me look at it internally and get back to you by mid next week.

  • Hello  

    Has there been any progress on this issue?

  • Hi,

    I am working on this, please allow me some time to test this at my end.

    Best Regards,

    Meet.

  • Hi,

    I tested this at my end, and for the same number of bytes memcpy takes 71 ms for me. I will run some more tests to confirm this, but this is what I observe currently.

    One more thing to add here: could you make the startMs and endMs variables of type uint64_t? This is the return type of the ClockP_getTimeUsec function. I am not sure whether this has any impact, but you can test it once.
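
    A minimal sketch of the suggested change, using the variable names from the snippet earlier in this thread:

        uint64_t startMs, endMs;   /* match the uint64_t return type of ClockP_getTimeUsec() */
        startMs = ClockP_getTimeUsec() / 1000U;
        memcpy(dst, src, bytes);
        endMs   = ClockP_getTimeUsec() / 1000U;
        DebugP_log("memcpy took: %u ms\r\n", (uint32_t)(endMs - startMs));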

    Best Regards,

    Meet.

  • Thanks! I will wait for your further reply!

  • One more thing to add here: could you make the startMs and endMs variables of type uint64_t?

    Meanwhile, I request you to try this once.

    Best Regards,

    Meet.

  • Yes,

    I changed them to uint64_t:

    Splash -> Generation 1 frame buffer finish: 552 ms
    Splash -> memcpy 2 buffer time: 119 ms, dst=93DCA000, src=93500000, size=7680 bytes
    Splash -> Gen 2 buffer finished: 683 ms

    No change. Is some driver or device not initialized? Or does memcpy need to be customized for the TI board?

  • Hi,

    Splash -> memcpy 2 buffer time: 119 ms, dst=93DCA000, src=93500000, size=7680 bytes

    I am getting 71 ms for 5529600 bytes. In your code snippet I also see a DebugP_log call; CacheP_wb and some other operations are also included in this 140 ms, so if you want to measure accurate timing you should take timestamps immediately before and after your memcpy instruction.

    memcpy is expensive in time; you can use either DMA or Utils_memcpyWord if you think memcpy is taking too long for your system. Usually you would use a graphics engine to modify certain pixels in the frame, whereas updating the whole frame after each vsync is expected to be expensive.

    Could you elaborate on what the use case is here and in what time you expect the memcpy to complete?

    Best Regards,

    Meet.

  • Hello Pu Jia,

    The points discussed above are valid.
    We did not use any Cache API or Debug API around the memcpy operation during measurement.
    With Meet's current setup, we were able to transfer approximately 5 MB of data in ~71 msec.
    • If you are looking to further reduce this transfer time, the only viable option is to leverage DMA-based data movement.
    • For reference, in earlier tests on the AM62x device, the same 5 MB transfer took ~10 msec using DMA.
    • We recommend performing the same DMA-based test on the AM62P device to measure and validate the results in your setup.

    Regards,

    Anil.

  • Hello TI team

    So it is normal for a CPU-based memcpy on the AM62Px, right? I also tried UDMA; it takes about 40 ms to copy a 1920*1200*4 framebuffer.

    Is this normal or not?

  • Hello Pu Jia,

    There is no problem using memcpy on AM62P devices. However, for large transfers (for example 5 MB or 8.6 MB) memcpy will be noticeably slower than DMA, so we recommend using DMA for large blocks.

    On AM62x we’ve measured DMA throughput on the order of ~500 MB/s for bulk memory transfers. At that rate:
    • 5 MB → ~10.0 ms (decimal MB) / ~10.49 ms (5 MiB)
    • 8.6 MB → ~17.2 ms (decimal MB) / ~18.0 ms (8.6 MiB)
    So a transfer of ~8.6 MB completes in ~17–18 ms with DMA at ~500 MB/s.

    But your profiling shows much larger values for the 8.6 MB transfer, and I suspect you may not be measuring with the right method.

    Recommended DMA example sequence and measurement procedure:

    1. Start your timer immediately before the UDMA queue/start call.
    2. Start the DMA transfer (queue the descriptor / call the UDMA start API).
    3. Stop the timer in the DMA completion callback, after you perform the cache invalidate.
    4. Repeat the measurement multiple times and take the median/percentiles to avoid single-run noise (a minimal sketch follows below).
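
    A minimal sketch of this sequence, assuming a blocking submit-and-wait helper such as the queue_trpd_and_wait() function shared later in this thread (the names and placement are illustrative only, not SDK code):

        /* Hypothetical measurement helper: assumes the TR in g_trpd has already been
         * filled for dst/src/bytes, then times one DMA copy per the steps above and
         * returns the elapsed time in milliseconds. Call it several times and take
         * the median. g_trpd and queue_trpd_and_wait() come from the custom UDMA
         * code posted later in this thread; substitute your own submit/wait
         * mechanism (polled, semaphore, or completion callback). */
        static uint32_t measure_dma_copy_ms(void *dst, const void *src,
                                            uint32_t bytes, uint32_t timeout_ms)
        {
            CacheP_wb((void *)src, bytes, CacheP_TYPE_ALLD);   /* source data out to DDR      */
            uint64_t t1 = ClockP_getTimeUsec();                /* T1: just before queuing     */
            (void)queue_trpd_and_wait(g_trpd, timeout_ms);     /* queue the TR, wait for done */
            CacheP_inv(dst, bytes, CacheP_TYPE_ALLD);          /* destination visible to CPU  */
            uint64_t t2 = ClockP_getTimeUsec();                /* T2: after cache invalidate  */
            return (uint32_t)((t2 - t1) / 1000U);
        }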

    Regards,

    Anil.

  • Hello TI team

    With my code, it takes about 40 ms to copy 8.6 MB of data.

     

    int32_t udma_memcpy_2d(void *dst, const void *src,
                           uint32_t line_bytes, uint32_t lines,
                           uint32_t src_stride, uint32_t dst_stride,
                           uint32_t timeout_ms)
    {
        if (g_ch == NULL || dst == NULL || src == NULL)
            return SystemP_FAILURE;
        if (line_bytes == 0u || lines == 0u)
            return SystemP_FAILURE;
        if (src_stride < line_bytes || dst_stride < line_bytes)
            return SystemP_FAILURE;
    
        CSL_UdmapTR15 *tr = (CSL_UdmapTR15 *)UdmaUtils_getTrpdTr15Pointer(g_trpd, 0U);
        fill_tr15_2d(tr, src, dst, line_bytes, lines, (int32_t)src_stride, (int32_t)dst_stride);
    
        uint32_t src_span = (lines - 1u) * src_stride + line_bytes;
        uint32_t dst_span = (lines - 1u) * dst_stride + line_bytes;
    
        CacheP_wb((void*)src, src_span, CacheP_TYPE_ALLD);
        CacheP_inv(dst, dst_span, CacheP_TYPE_ALLD);
    
        int32_t ret = queue_trpd_and_wait(g_trpd, timeout_ms);
        if (ret != SystemP_SUCCESS) return ret;
    
        CacheP_inv(dst, dst_span, CacheP_TYPE_ALLD);
        return SystemP_SUCCESS;
    }

    Could you please help to improve my code speed?

  • Hello Pu Jia,

    • On which core are you running the above code? (A53, DM R5F, or MCU R5F)

    • If it is running on the A53 core, please confirm the OS being used — Linux or baremetal?

    • From the code flow, it looks more like custom code rather than the standard MCU+ SDK examples. Can you confirm?

    • Are you building and running this in Release mode or Debug mode?

    • How exactly are you taking the performance measurements — where are you starting and stopping the timers?

    • In your test case, which timer are you using in the setup, e.g. the PMU or the Generic Timer?

    Regards,

    Anil.

  • Hello Team

    This is the DM R5 core example.

    The example is located under the MCU+ SDK drivers/dss (display share). I think it uses FreeRTOS.

    The code is my customized memcpy function using UDMA.

    It's Release mode.

    For where I start/stop the timer, please see the top of my post.

    I don't know which timer; I just use the function ClockP_getTimeUsec().

  • Hello Pu Jia,

    If you are using custom code, then it is difficult for me to review your code flow without knowing where exactly you are starting and stopping the timer.

    In the custom UDMA code, how are you differentiating the flow from the MCU+SDK reference flow? Please clarify that part.

    For measurement, I suggest you try using the MCU+SDK application below:
    • Start the timer before the Udma_ringQueueRaw() call.
    • Stop the timer after the CacheP_inv() call.

    This sequence will give proper results. I have also measured in this way, and the numbers are consistent.

    C:\ti\mcu_plus_sdk_am62px_11_01_00_16\examples\drivers\udma\udma_memcpy_interrupt\am62px-sk

    Regards,

    Anil.

  • Hello Team

    You can try replacing lines 231~233 with my code in void DispApp_splashThread(void *args):

                endMs = (ClockP_getTimeUsec() / 1000U);
                DebugP_log("Splash -> Generation 1 frame buffer finish: %u ms\r\n", (endMs - gBoot2SplashStartMs));
                {   // copy second frame
                    uint8_t *dst  = (uint8_t*)&gFirstPipelineFrameBuf[1];
                    uint8_t *src  = (uint8_t*)&gFirstPipelineFrameBuf[0];
                    size_t   bytes = 1920*1200*4U;
                    uint32_t startMs = (ClockP_getTimeUsec() / 1000U);
                    memcpy(dst, src, bytes);
                    endMs = (ClockP_getTimeUsec() / 1000U);
                    DebugP_log("Splash -> memcpy 2 buffer time endMs-startMs: %u ms\r\n", (endMs - startMs));
                    /* Ensure DDR visibility for the second frame */
                    CacheP_wb(dst, (uint32_t)bytes, CacheP_TYPE_ALLD);
                }
                endMs = (ClockP_getTimeUsec() / 1000U);
                DebugP_log("Splash -> Gen 2 buffer finished: %u ms\r\n", (endMs - gBoot2SplashStartMs));

    and try it; the memcpy takes about 120 ms, that's confirmed.
    So I tried to use the UDMA copy, and it takes about 40 ms to copy about 8.7 MB of data.
    That's why I am looking for help from your side. Below is my UDMA copy main code area; could you please check why my code is so much slower than yours?
    static inline void fill_tr15_linear(CSL_UdmapTR15 *pTr, const void *src, void *dst, uint32_t length)
    {
        /* Generate completion event; end-of-packet set. */
        pTr->flags    = CSL_FMK(UDMAP_TR_FLAGS_TYPE,          CSL_UDMAP_TR_FLAGS_TYPE_4D_BLOCK_MOVE_REPACKING_INDIRECTION);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_STATIC,        0U);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_EOL,           CSL_UDMAP_TR_FLAGS_EOL_MATCH_SOL_EOL);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_EVENT_SIZE,    CSL_UDMAP_TR_FLAGS_EVENT_SIZE_COMPLETION);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_TRIGGER0,      CSL_UDMAP_TR_FLAGS_TRIGGER_NONE);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_TRIGGER0_TYPE, CSL_UDMAP_TR_FLAGS_TRIGGER_TYPE_ALL);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_TRIGGER1,      CSL_UDMAP_TR_FLAGS_TRIGGER_NONE);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_TRIGGER1_TYPE, CSL_UDMAP_TR_FLAGS_TRIGGER_TYPE_ALL);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_CMD_ID,        0x00U);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_SA_INDIRECT,   0U);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_DA_INDIRECT,   0U);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_EOP,           1U);
    
        pTr->icnt0    = length;
        pTr->icnt1    = 1U;
        pTr->icnt2    = 1U;
        pTr->icnt3    = 1U;
        pTr->dim1     = (int32_t)pTr->icnt0;
        pTr->dim2     = (int32_t)(pTr->icnt0 * pTr->icnt1);
        pTr->dim3     = (int32_t)(pTr->icnt0 * pTr->icnt1 * pTr->icnt2);
        pTr->addr     = (uint64_t)Udma_defaultVirtToPhyFxn((void *)src, 0U, NULL);
    
        pTr->dicnt0   = pTr->icnt0;
        pTr->dicnt1   = 1U;
        pTr->dicnt2   = 1U;
        pTr->dicnt3   = 1U;
        pTr->ddim1    = (int32_t)pTr->dicnt0;
        pTr->ddim2    = (int32_t)(pTr->dicnt0 * pTr->dicnt1);
        pTr->ddim3    = (int32_t)(pTr->dicnt0 * pTr->dicnt1 * pTr->dicnt2);
        pTr->daddr    = (uint64_t)Udma_defaultVirtToPhyFxn((void *)dst, 0U, NULL);
    
        pTr->fmtflags = 0x00000000U;
    }
    
    static inline void fill_tr15_2d(CSL_UdmapTR15 *pTr,
                                    const void *src, void *dst,
                                    uint32_t line_bytes, uint32_t lines,
                                    int32_t src_stride, int32_t dst_stride)
    {
        pTr->flags    = CSL_FMK(UDMAP_TR_FLAGS_TYPE,          CSL_UDMAP_TR_FLAGS_TYPE_4D_BLOCK_MOVE_REPACKING_INDIRECTION);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_STATIC,        0U);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_EOL,           CSL_UDMAP_TR_FLAGS_EOL_MATCH_SOL_EOL);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_EVENT_SIZE,    CSL_UDMAP_TR_FLAGS_EVENT_SIZE_COMPLETION);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_TRIGGER0,      CSL_UDMAP_TR_FLAGS_TRIGGER_NONE);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_TRIGGER0_TYPE, CSL_UDMAP_TR_FLAGS_TRIGGER_TYPE_ALL);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_TRIGGER1,      CSL_UDMAP_TR_FLAGS_TRIGGER_NONE);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_TRIGGER1_TYPE, CSL_UDMAP_TR_FLAGS_TRIGGER_TYPE_ALL);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_CMD_ID,        0x00U);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_SA_INDIRECT,   0U);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_DA_INDIRECT,   0U);
        pTr->flags   |= CSL_FMK(UDMAP_TR_FLAGS_EOP,           1U);
    
        pTr->icnt0    = line_bytes;
        pTr->icnt1    = lines;
        pTr->icnt2    = 1U;
        pTr->icnt3    = 1U;
        pTr->dim1     = src_stride;
        pTr->dim2     = (int32_t)(src_stride * (int32_t)pTr->icnt1);
        pTr->dim3     = (int32_t)(src_stride * (int32_t)pTr->icnt1 * (int32_t)pTr->icnt2);
        pTr->addr     = (uint64_t)Udma_defaultVirtToPhyFxn((void *)src, 0U, NULL);
    
        pTr->dicnt0   = pTr->icnt0;
        pTr->dicnt1   = pTr->icnt1;
        pTr->dicnt2   = 1U;
        pTr->dicnt3   = 1U;
        pTr->ddim1    = dst_stride;
        pTr->ddim2    = (int32_t)(dst_stride * (int32_t)pTr->dicnt1);
        pTr->ddim3    = (int32_t)(dst_stride * (int32_t)pTr->dicnt1 * (int32_t)pTr->dicnt2);
        pTr->daddr    = (uint64_t)Udma_defaultVirtToPhyFxn((void *)dst, 0U, NULL);
    
        pTr->fmtflags = 0x00000000U;
    }
    int32_t udma_memcpy_2d(void *dst, const void *src,
                           uint32_t line_bytes, uint32_t lines,
                           uint32_t src_stride, uint32_t dst_stride,
                           uint32_t timeout_ms)
    {
        if (g_ch == NULL || dst == NULL || src == NULL)
            return SystemP_FAILURE;
        if (line_bytes == 0u || lines == 0u)
            return SystemP_FAILURE;
        if (src_stride < line_bytes || dst_stride < line_bytes)
            return SystemP_FAILURE;
    
        CSL_UdmapTR15 *tr = (CSL_UdmapTR15 *)UdmaUtils_getTrpdTr15Pointer(g_trpd, 0U);
        fill_tr15_2d(tr, src, dst, line_bytes, lines, (int32_t)src_stride, (int32_t)dst_stride);
    
        uint32_t src_span = (lines - 1u) * src_stride + line_bytes;
        uint32_t dst_span = (lines - 1u) * dst_stride + line_bytes;
    
        CacheP_wb((void*)src, src_span, CacheP_TYPE_ALLD);
        CacheP_inv(dst, dst_span, CacheP_TYPE_ALLD);
    
        int32_t ret = queue_trpd_and_wait(g_trpd, timeout_ms);
        if (ret != SystemP_SUCCESS) return ret;
    
        CacheP_inv(dst, dst_span, CacheP_TYPE_ALLD);
        return SystemP_SUCCESS;
    }
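
    For reference, a minimal timing harness around the udma_memcpy_2d() helper above might look like the sketch below (my own illustration; the frame-buffer pointers are the ones discussed earlier in this thread, and the helper's internal cache maintenance is included in the measured time):

        /* Hypothetical usage: copy one 1920x1200 ARGB8888 frame (9,216,000 bytes,
         * ~8.8 MiB) with the 2D UDMA helper above, time it, and report the effective
         * bandwidth. 1 byte per microsecond equals 1 MB/s (decimal), so a 40 ms copy
         * of this frame corresponds to roughly 230 MB/s. */
        uint8_t *dst = (uint8_t *)&gFirstPipelineFrameBuf[1];   /* frame buffers from the display-share example */
        uint8_t *src = (uint8_t *)&gFirstPipelineFrameBuf[0];
        uint32_t lineBytes = 1920U * 4U;                         /* 7680 bytes per line */
        uint32_t lines     = 1200U;
        uint32_t totalB    = lineBytes * lines;

        uint64_t t1 = ClockP_getTimeUsec();                      /* just before the transfer            */
        int32_t  st = udma_memcpy_2d(dst, src, lineBytes, lines,
                                     lineBytes, lineBytes, 100U /* timeout, ms */);
        uint64_t t2 = ClockP_getTimeUsec();                      /* after completion + cache invalidate */

        if (st == SystemP_SUCCESS)
        {
            uint32_t elapsedUs = (uint32_t)(t2 - t1);
            DebugP_log("UDMA copy: %u bytes in %u us (~%u MB/s)\r\n",
                       totalB, elapsedUs, totalB / elapsedUs);
        }
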
  • Hello,

    To review and provide accurate feedback, we would kindly request the following:
    1. Full Project for Review
    Please share the complete project. With partial code, it is difficult for us to validate whether the timer is started/stopped at the correct locations or whether there are hidden dependencies in your setup.
    2. Timer Placement
    • The timer should be started immediately before queuing the transfer (before queue_trpd_and_wait).
    • The timer should be stopped immediately after the cache invalidate (CacheP_inv) or once the transfer completion is confirmed.
    This way, the measurement reflects the true transfer latency observed by the CPU.
    3. Clarification on queue_trpd_and_wait
     Could you please confirm whether this function only returns after DMA completion, or whether it supports polling mode or interrupt mode? Sharing the implementation would help us confirm.
    We understand that 71 ms was observed for a ~5MB test case, but not for the 8.6 MB transfer.

    Quoting Meet's earlier reply: "I am getting 71 ms for 5529600 bytes. In your code snippet I also see a DebugP_log call; CacheP_wb and some other operations are also included in this 140 ms time. If you want to measure accurate timing, you should take time stamps immediately before and after your memcpy instruction."

    Regards,

    Anil.

  • Yes. For the 6 MB test it was 71 ms, and 8.6 MB should be about 110 ms.

    So we need to try UDMA to do this?

  • Yes. If you go with DMA, you may get results close to the values below:

    • 5 MB →  ~10.49 ms 
    • 8.6 MB → ~18.0 ms 

    Regards,

    Anil.

  • Hello, could you please share DMA copy code with me? You can see my DMA copy code above. How can I improve my code?

  • Hello Pu Jia,

    Please use the example below.

    C:\ti\mcu_plus_sdk_am62px_11_01_00_16\examples\drivers\udma\udma_memcpy_interrupt\am62px-sk

    Configure the source and destination buffers to your required size.

    Before the DMA starts, read the timer to get the T1 value.

    Read the timer again to get the T2 value after the CacheP_inv() call.

    Finally, do T2 - T1.

    The testing needs to be done in the release build.

    Please look at the image below to take the T1 and T2 values.

    Regards,

    Anil.

  • Hello,

    The difference between my code and the example is that I set the CSL_UdmapTR15 fields as:

        pTr->icnt0    = line_bytes;
        pTr->icnt1    = lines;

    Do you know the limitation on pTr->icnt0 = length?
  • Hello Pu Jia,

    The ICNT0 length is a maximum of 64 KB only.

    So, for your requirement to transfer 5 MB and 9 MB, we need to configure both icnt0 and icnt1.

    Make sure that the testing is done in the release build only.

    Regards,

    Anil.

  • Hello

    If I use ICNT1/2/3, how do I handle the rest of the data?

    That means, if I want to copy 64K*15 + 15K of data, do I set ICNT0 = 64K, ICNT1 = 15, plus a second transfer with ICNT0 = 15K, ICNT1 = 1?

  • Hello Pu Jia,

    Yes, the above method works.

    Each ICNT value has a maximum of 64 KB, so you can configure the counts accordingly; one way to split a larger transfer is shown in the sketch below.
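
    To make the splitting concrete, here is a minimal sketch (my own illustration, not SDK code) that breaks a linear copy into fixed-size chunks plus a tail transfer, reusing the udma_memcpy_2d() helper posted earlier in this thread; the chunk size is kept well below the ICNT0 limit discussed above:

        /* Hypothetical decomposition: express the bulk of the copy as 'fullChunks'
         * lines of CHUNK bytes (icnt0 = CHUNK, icnt1 = fullChunks) and copy any
         * remainder as a second transfer (icnt0 = tail, icnt1 = 1). */
        #define CHUNK  (32U * 1024U)

        int32_t udma_memcpy_linear(void *dst, const void *src,
                                   uint32_t total, uint32_t timeout_ms)
        {
            uint32_t fullChunks = total / CHUNK;
            uint32_t tail       = total % CHUNK;
            int32_t  ret        = SystemP_SUCCESS;

            if (fullChunks > 0U)
            {
                /* Contiguous data viewed as a 2D block: 'fullChunks' lines of CHUNK bytes. */
                ret = udma_memcpy_2d(dst, src, CHUNK, fullChunks, CHUNK, CHUNK, timeout_ms);
            }
            if ((ret == SystemP_SUCCESS) && (tail > 0U))
            {
                uint32_t off = fullChunks * CHUNK;
                ret = udma_memcpy_2d((uint8_t *)dst + off, (const uint8_t *)src + off,
                                     tail, 1U, tail, tail, timeout_ms);
            }
            return ret;
        }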

    Regards,

    Anil.

  • Hello

    Is the maximum size of ICNT 64K or 64K - 1?

  • Hello Pu Jia,

    Did you face any problem with ICNT0 = 64K?

    The maximum size of ICNT0 is 64K.

    Regards,

    Anil.

  • Sorry, my AI always told me that the count should be 64K - 1, so I wanted to confirm it with you.

  • Hello Pu Jia,

    Please also confirm this with testing.

    Configure icnt0 to 64K or 63K and confirm the maximum size limit in your testing.

    Regards,

    Anil.