
TDA4VM: UDMA on C7X is slower than memcpy on A72

Part Number: TDA4VM

Hi,

I have created an OpenVX kernel that copies one vx_tensor to another. Each core is specified as a target for the kernel; the only difference is that the A72 uses memcpy while the C7X, C66, and R5F use UDMA.

Here are the numbers I got for copying a vx_tensor with a size of 864 * 128 * sizeof(float) = 442368 bytes:

Core   memcpy      appUdmaCopy1D
A72    670 us      N/A
C66    4889 us     2237 us
C7X    1616 us     2003 us
R5F    18097 us    5794 us
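
In terms of effective bandwidth (442368 bytes per copy), those timings work out to roughly:

A72 memcpy           ~660 MB/s
C7X appUdmaCopy1D    ~221 MB/s
C66 appUdmaCopy1D    ~198 MB/s
R5F appUdmaCopy1D    ~76 MB/s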

The numbers I'm getting are quite disappointing; it seems that UDMA is slower than doing a memcpy on the A72. On the C7X, it's even slower than doing a local memcpy.

Are these numbers normal? I haven't found any spec for UDMA.

Here's my (simplified) kernel code:

bool TensorBlockCpyDma(const size_t nBytes,
                       const tivx_obj_desc_tensor_t* const srcDescriptor,
                       const tivx_obj_desc_tensor_t* const dstDescriptor)
{
    bool ret = false;
    if ( (0U == nBytes)
      || (NULL == srcDescriptor)
      || (NULL == dstDescriptor))
    {
        VX_PRINT(VX_ZONE_ERROR, "Invalid input pointer\n");
    }
    else
    {
        uint64_t srcPhys = tivxMemShared2PhysPtr(srcDescriptor->mem_ptr.shared_ptr, VX_MEMORY_TYPE_HOST);
        uint64_t dstPhys = tivxMemShared2PhysPtr(dstDescriptor->mem_ptr.shared_ptr, VX_MEMORY_TYPE_HOST);
        app_udma_copy_1d_prms_t prms;
        appUdmaCopy1DPrms_Init(&prms);
        prms.dest_addr = dstPhys;
        prms.src_addr  = srcPhys;
        prms.length    = nBytes;
        if (0 == appUdmaCopy1D(NULL, &prms)) /* NULL handle => default UDMA channel */
        {
            ret = true;
        }
    }
    return ret;
}

The A72 kernel calls memcpy() instead of appUdmaCopy1D().
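
For reference, the A72 kernel is the usual map-then-memcpy pattern. A minimal sketch (not the exact kernel code; the tivxMemShared2TargetPtr()/tivxMemBufferMap() signatures follow recent TIOVX releases and may differ in older SDKs):

#include <string.h>            /* memcpy */
#include <TI/tivx.h>
#include <TI/tivx_obj_desc.h>

static bool TensorBlockCpyCpu(const size_t nBytes,
                              const tivx_obj_desc_tensor_t* const srcDescriptor,
                              const tivx_obj_desc_tensor_t* const dstDescriptor)
{
    bool ret = false;
    if ((0U != nBytes) && (NULL != srcDescriptor) && (NULL != dstDescriptor))
    {
        /* Translate the shared pointers into CPU-visible target pointers. */
        void* srcPtr = tivxMemShared2TargetPtr(&srcDescriptor->mem_ptr);
        void* dstPtr = tivxMemShared2TargetPtr(&dstDescriptor->mem_ptr);

        /* Map for CPU access (handles cache maintenance for cached memory). */
        tivxMemBufferMap(srcPtr, (uint32_t)nBytes, VX_MEMORY_TYPE_HOST, VX_READ_ONLY);
        tivxMemBufferMap(dstPtr, (uint32_t)nBytes, VX_MEMORY_TYPE_HOST, VX_WRITE_ONLY);

        memcpy(dstPtr, srcPtr, nBytes);

        tivxMemBufferUnmap(dstPtr, (uint32_t)nBytes, VX_MEMORY_TYPE_HOST, VX_WRITE_ONLY);
        tivxMemBufferUnmap(srcPtr, (uint32_t)nBytes, VX_MEMORY_TYPE_HOST, VX_READ_ONLY);

        ret = true;
    }
    return ret;
}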

Thank you,

Fred

  • Hi Fred,

    Well, it is expected that the A72 memcpy beats the UDMA copy here, because the copy size is small. It is only about 440 KB, which can fit completely in the cache and therefore give better performance than even UDMA. Can you please try copying more than 2 MB of memory? In that case, I think UDMA will perform better than memcpy on the A72.

    On the C7x, can you please try using DRU channels instead of UDMA channels?

    Regards,

    Brijesh

  • Thanks for the answer, Brijesh.

    I tried a TensorCpy of 5 times the original size (2211840 bytes), but the A72 memcpy still outperforms the DMA by a large margin.

    Core             864 * 128 * 5 * sizeof(float) = 2211840 bytes
    A72 memcpy       2033 us
    C66 DMA          9460 us
    C7X DMA (DRU)    9074 us
    R5F DMA          13582 us

    I used DRU for C7X as you suggested by adding this piece of code in TensorBlockCpyDma():

    app_udma_ch_handle_t handle = NULL;
    #ifdef C71
    handle = appUdmaCopyNDGetHandle(8U);
    #endif
    if (0 == appUdmaCopy1D(handle, &prms))
    {
        ret = true;
    }

  • Hi Brijesh, any update on this?

  • Hi Brijesh, could you follow up, please?

    I spoke with Kai, and she said that DRU should be faster than a memcpy from the C7X, so we need to find where the problem lies. Do you have any indication of where to look?

    Thanks,

    Fred

  • Hi,

    Yes, DRU performance should be better than A72 memcpy performance. Are you sure that the test is really using DRU channels? Could you help me understand how you have enabled DRU channels for this test? If possible, could you please share the code snippet?

    I ask because the performance numbers for the non-DRU and DRU DMA copies look very similar (9.4 ms vs 9 ms), so I doubt that DRU is really being used.

    Also, on the A72 side, how are you doing the memcpy? How is the memory allocated for the src and dst buffers? If possible, can you please share that code here as well?

    Regards,

    Brijesh

  • Hi Brijesh,

    To enable DRU, I added a call to appUdmaCopyNDGetHandle() with channel 8U. Here's my TensorBlockCpyDma() function:

    typedef struct
    {
        uint32_t elems;
        uint32_t srcOffset;
        uint32_t dstOffset;
    } TensorCpyCfg;

    bool TensorBlockCpyDma(const TensorCpyCfg* const cfg,
                           const tivx_obj_desc_tensor_t* const srcDescriptor,
                           const tivx_obj_desc_tensor_t* const dstDescriptor,
                           const void* restrict const srcMapped,
                           void* restrict const dstMapped)
    {
        bool ret = false;
        if ( (NULL == cfg)
          || (NULL == srcDescriptor)
          || (NULL == dstDescriptor)
          || (NULL == srcMapped)
          || (NULL == dstMapped))
        {
            VX_PRINT(VX_ZONE_ERROR, "Invalid input pointer\n");
        }
        else
        {
            /* ... same prms setup and appUdmaCopy1D() call as in the first
             * snippet, with the DRU handle selection under #ifdef C71 ... */
        }
        return ret;
    }

    The new logic for DRU is in the block guarded by "#ifdef C71".

    As for the A72 memcpy, it's done in a TIOVX A72 kernel. The allocation is done in a C++ Catch2 test:

    TEST_CASE("TensorCpy - Performance comparison")
    {
        REQUIRE(0 == appInit());
        tivxRegisterMemoryTargetA72Kernels(); // Registers the A72 memcpy kernel
        std::string target = TIVX_TARGET_A72_0;
        std::string impl = "CPU";
        const size_t COLS = 864;
        const size_t ROWS = 128 * 5;
        std::vector<vx_size> dims{COLS, ROWS};
        TiovxUserKernelsTests::VxTensorWrapper src("TensorCpy src", dims);
        TiovxUserKernelsTests::VxTensorWrapper dst("TensorCpy dst", dims);
        TiovxUserKernelsTests::VxUserDataObjectWrapper<TensorCpyCfg> cfg("TensorCpyCfg");
        TensorCpyGraph graph(cfg, src, dst, target); // Calls vxCreateContext()
        src.Allocate(graph.ctx); // Calls vxCreateTensor()
        dst.Allocate(graph.ctx); // Calls vxCreateTensor()
        // ...
    }

    The allocation process for the C7X kernel is exactly the same.

  • Hi,

    As per the datasheet at the link below, when we use DRU channels from any core, we should get around 10 GB/s, which means the copies above should take well under a millisecond.

    https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/pdk_jacinto_08_05_00_36/docs/datasheet/jacinto/datasheet_j721e.html#udma
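
    As a rough sanity check at that rate:

    442368 bytes  / 10 GB/s ≈ 0.044 ms
    2211840 bytes / 10 GB/s ≈ 0.22 ms

    so the ~9 ms you are measuring is dominated by software overhead rather than by the transfer itself.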

    The reason it is taking a lot more time could be that interrupts are not enabled. In the API appUdmaCopyNDGetHandle, I see that interrupts are disabled by setting the flag udmaCreatePrms.enable_intr to 0; when it is set to 0, the task just busy-waits, calling TaskP_yield on every poll. This might not give good performance, because other tasks of the same or higher priority could be running and keep this task from getting unblocked.

    I think it is better to check the performance with interrupts enabled, but I am not sure if this is supported/validated for DRU channels.
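
    Something along these lines, just as a sketch (appUdmaCopyCreate() is a placeholder name for whichever app_udma call consumes app_udma_create_prms_t in your vision_apps version):

    app_udma_create_prms_t createPrms;
    app_udma_copy_1d_prms_t prms;

    appUdmaCreatePrms_Init(&createPrms);
    createPrms.enable_intr = 1U;      /* default is 0 => polling with TaskP_yield() */

    /* Placeholder create call - substitute the actual app_udma create API. */
    app_udma_ch_handle_t handle = appUdmaCopyCreate(&createPrms);

    if (NULL != handle)
    {
        appUdmaCopy1DPrms_Init(&prms);
        prms.src_addr  = srcPhys;     /* same physical addresses as in your kernel */
        prms.dest_addr = dstPhys;
        prms.length    = nBytes;
        (void)appUdmaCopy1D(handle, &prms);
    }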

    Regards,

    Brijesh

  • In appUdmaCreatePrms_Init() (which I use), there is the following comment:

    /* Interrupt mode not yet supported for C7x - use polling */

    I guess that means that the vision_apps API is not ready for efficient DRU transfers.
    Is there another API that you could refer me to for using DRU with C7X?
  • Hi,

    Exactly. Without interrupt mode, the performance of the default code will not be good.

    For the time being, just to measure the performance, can you please comment out the call to the TaskP_yield API in appUdmaTransfer and check the numbers again?
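
    For clarity, the change I mean is inside the polling loop. Paraphrasing from memory (not the exact vision_apps source, and the completion-check helper below is a placeholder name), the wait in appUdmaTransfer looks roughly like this when enable_intr is 0:

    while (0 != udmaTransferStillPending(chHandle))   /* placeholder for the actual completion check */
    {
        /* TaskP_yield(); */   /* <-- comment this out for the measurement, so the task
                                *     spins instead of giving up the CPU on every poll */
    }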

    Regards,

    Brijesh

  • Hi Brijesh, we found a workaround by using the ND API instead of the 1D/2D one (a rough sketch of the sequence follows after the list):

    appUdmaCopyNDPrms_Init()
    appUdmaCopyNDInit()
    appUdmaCopyNDTrigger()
    appUdmaCopyNDWait()
    appUdmaCopyNDDeinit()
    I think there's a performance issue in parts of the app_udma code that is shipped with vision_apps.
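
    For anyone hitting the same problem, the sequence we ended up with looks roughly like this. It is a sketch only: the meaning assumed for the icnt/dim fields of app_udma_copy_nd_prms_t (bytes per row, number of rows, pitches) and the "0 means success" convention come from my reading of app_udma.h and may need adjusting for your SDK version; srcPhys/dstPhys are the physical addresses obtained with tivxMemShared2PhysPtr() as in the earlier snippets.

    app_udma_copy_nd_prms_t ndPrms;
    app_udma_ch_handle_t handle = NULL;

    #ifdef C71
    handle = appUdmaCopyNDGetHandle(8U);                  /* DRU channel, as before */
    #endif

    appUdmaCopyNDPrms_Init(&ndPrms);
    ndPrms.src_addr  = srcPhys;
    ndPrms.dest_addr = dstPhys;
    ndPrms.icnt0     = (uint32_t)(864U * sizeof(float));  /* bytes per row (assumed field meaning) */
    ndPrms.icnt1     = 128U * 5U;                         /* number of rows                        */
    ndPrms.icnt2     = 1U;
    ndPrms.icnt3     = 1U;
    ndPrms.dim1      = (uint32_t)(864U * sizeof(float));  /* source row pitch in bytes             */
    ndPrms.dicnt0    = ndPrms.icnt0;                      /* destination shaped the same way       */
    ndPrms.dicnt1    = ndPrms.icnt1;
    ndPrms.dicnt2    = 1U;
    ndPrms.dicnt3    = 1U;
    ndPrms.ddim1     = ndPrms.dim1;

    if (0 == appUdmaCopyNDInit(handle, &ndPrms))          /* assuming 0 == success, as with appUdmaCopy1D() */
    {
        appUdmaCopyNDTrigger(handle);                     /* start the transfer        */
        appUdmaCopyNDWait(handle);                        /* block until it completes  */
        appUdmaCopyNDDeinit(handle);                      /* release channel resources */
    }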