Hi,
I have created an openvx kernel that copies a vx_tensor to another. Each core is specified as a target for the kernel, the only difference is that the A72 uses a memcpy and the C7X, C66, R5F use UDMA.
Here are the numbers I got for copying a vx_tensor with a size of 864*128*sizeof(float) = 442368 bytes
Core | memcpy   | appUdmaCopy1D
A72  | 670 us   | N/A
C66  | 4889 us  | 2237 us
C7X  | 1616 us  | 2003 us
R5F  | 18097 us | 5794 us
The numbers I'm getting are quite disappointing: it seems that the UDMA is slower than doing a memcpy on the A72. On the C7x, it's even slower than doing a memcpy on the same core.
Are these numbers normal? I haven't found any spec for UDMA.
Here's my (simplified) kernel code:
bool TensorBlockCpyDma(const size_t nBytes, const tivx_obj_desc_tensor_t* const srcDescriptor, const tivx_obj_desc_tensor_t* const dstDescriptor) { bool ret = false; if ( (0U == nBytes) || (NULL == srcDescriptor) || (NULL == dstDescriptor)) { VX_PRINT(VX_ZONE_ERROR, "Invalid input pointer\n"); } else { uint64_t srcPhys = tivxMemShared2PhysPtr(srcDescriptor->mem_ptr.shared_ptr, VX_MEMORY_TYPE_HOST); uint64_t dstPhys = tivxMemShared2PhysPtr(dstDescriptor->mem_ptr.shared_ptr, VX_MEMORY_TYPE_HOST); app_udma_copy_1d_prms_t prms; appUdmaCopy1DPrms_Init(&prms); prms.dest_addr = dstPhys; prms.src_addr = srcPhys; prms.length = nBytes; if (0 == appUdmaCopy1D(NULL, &prms)) { ret = true; } } return ret; } static vx_status VX_CALLBACK tivxTensorcpyProcess( tivx_target_kernel_instance kernel, tivx_obj_desc_t *obj_desc[], uint16_t num_params, void *priv_arg) { vx_status status = (vx_status)VX_SUCCESS; const tivx_obj_desc_tensor_t *src_desc; const tivx_obj_desc_tensor_t *dst_desc; if ( (num_params != TIVX_KERNEL_TENSORCPY_MAX_PARAMS) || (NULL == obj_desc[TIVX_KERNEL_TENSORCPY_SRC_IDX]) || (NULL == obj_desc[TIVX_KERNEL_TENSORCPY_DST_IDX]) ) { status = (vx_status)VX_FAILURE; } if((vx_status)VX_SUCCESS == status) { src_desc = (const tivx_obj_desc_tensor_t *)obj_desc[TIVX_KERNEL_TENSORCPY_SRC_IDX]; dst_desc = (const tivx_obj_desc_tensor_t *)obj_desc[TIVX_KERNEL_TENSORCPY_DST_IDX]; } if((vx_status)VX_SUCCESS == status) { void *src_target_ptr; void *dst_target_ptr; src_target_ptr = tivxMemShared2TargetPtr(&src_desc->mem_ptr); tivxCheckStatus(&status, tivxMemBufferMap(src_target_ptr, src_desc->mem_size, (vx_enum)VX_MEMORY_TYPE_HOST, (vx_enum)VX_READ_ONLY)); dst_target_ptr = tivxMemShared2TargetPtr(&dst_desc->mem_ptr); tivxCheckStatus(&status, tivxMemBufferMap(dst_target_ptr, dst_desc->mem_size, (vx_enum)VX_MEMORY_TYPE_HOST, (vx_enum)VX_WRITE_ONLY)); { /* call kernel processing function */ uint32_t start = tivxPlatformGetTimeInUsecs(); #ifdef A72 memcpy(dst_target_ptr, src_target_ptr, 
src_desc->mem_size); #else if (!TensorBlockCpyDma(src_desc->mem_size, src_desc, dst_desc)) { VX_PRINT(VX_ZONE_ERROR, "TensorBlockCpyDma failed\n"); } #endif uint32_t delta = tivxPlatformGetTimeInUsecs() - start; VX_PRINT(VX_ZONE_WARNING, "TensorBlockCpyDma copied %u bytes in %u us\n", src_desc->mem_size, delta); /* kernel processing function complete */ } tivxCheckStatus(&status, tivxMemBufferUnmap(src_target_ptr, src_desc->mem_size, (vx_enum)VX_MEMORY_TYPE_HOST, (vx_enum)VX_READ_ONLY)); tivxCheckStatus(&status, tivxMemBufferUnmap(dst_target_ptr, dst_desc->mem_size, (vx_enum)VX_MEMORY_TYPE_HOST, (vx_enum)VX_WRITE_ONLY)); } return status; }
The A72 kernel calls memcpy() instead of appUdmaCopy1D().
Thank you,
Fred