Hi,
I have created an OpenVX kernel that copies one vx_tensor to another. Each core is specified as a target for the kernel; the only difference is that the A72 uses memcpy while the C7X, C66, and R5F use UDMA.
Here are the numbers I got for copying a vx_tensor of size 864 * 128 * sizeof(float) = 442368 bytes:
Core | memcpy   | appUdmaCopy1D
-----|----------|--------------
A72  | 670 us   | N/A
C66  | 4889 us  | 2237 us
C7X  | 1616 us  | 2003 us
R5F  | 18097 us | 5794 us
The numbers I'm getting are quite disappointing: it seems that UDMA is slower than an A72 memcpy. On the C7X, it's even slower than doing a memcpy.
Are these numbers normal? I haven't found any spec for UDMA.
Here's my (simplified) kernel code:
bool TensorBlockCpyDma(const size_t nBytes,
                       const tivx_obj_desc_tensor_t* const srcDescriptor,
                       const tivx_obj_desc_tensor_t* const dstDescriptor)
{
    bool ret = false;

    if ( (0U == nBytes) ||
         (NULL == srcDescriptor) ||
         (NULL == dstDescriptor))
    {
        VX_PRINT(VX_ZONE_ERROR, "Invalid input pointer\n");
    }
    else
    {
        uint64_t srcPhys = tivxMemShared2PhysPtr(srcDescriptor->mem_ptr.shared_ptr, VX_MEMORY_TYPE_HOST);
        uint64_t dstPhys = tivxMemShared2PhysPtr(dstDescriptor->mem_ptr.shared_ptr, VX_MEMORY_TYPE_HOST);

        app_udma_copy_1d_prms_t prms;
        appUdmaCopy1DPrms_Init(&prms);
        prms.dest_addr = dstPhys;
        prms.src_addr  = srcPhys;
        prms.length    = nBytes;

        if (0 == appUdmaCopy1D(NULL, &prms))
        {
            ret = true;
        }
    }
    return ret;
}

static vx_status VX_CALLBACK tivxTensorcpyProcess(
    tivx_target_kernel_instance kernel,
    tivx_obj_desc_t *obj_desc[],
    uint16_t num_params,
    void *priv_arg)
{
    vx_status status = (vx_status)VX_SUCCESS;
    const tivx_obj_desc_tensor_t *src_desc;
    const tivx_obj_desc_tensor_t *dst_desc;

    if ( (num_params != TIVX_KERNEL_TENSORCPY_MAX_PARAMS) ||
         (NULL == obj_desc[TIVX_KERNEL_TENSORCPY_SRC_IDX]) ||
         (NULL == obj_desc[TIVX_KERNEL_TENSORCPY_DST_IDX]) )
    {
        status = (vx_status)VX_FAILURE;
    }

    if ((vx_status)VX_SUCCESS == status)
    {
        src_desc = (const tivx_obj_desc_tensor_t *)obj_desc[TIVX_KERNEL_TENSORCPY_SRC_IDX];
        dst_desc = (const tivx_obj_desc_tensor_t *)obj_desc[TIVX_KERNEL_TENSORCPY_DST_IDX];
    }

    if ((vx_status)VX_SUCCESS == status)
    {
        void *src_target_ptr;
        void *dst_target_ptr;

        src_target_ptr = tivxMemShared2TargetPtr(&src_desc->mem_ptr);
        tivxCheckStatus(&status, tivxMemBufferMap(src_target_ptr, src_desc->mem_size,
            (vx_enum)VX_MEMORY_TYPE_HOST, (vx_enum)VX_READ_ONLY));

        dst_target_ptr = tivxMemShared2TargetPtr(&dst_desc->mem_ptr);
        tivxCheckStatus(&status, tivxMemBufferMap(dst_target_ptr, dst_desc->mem_size,
            (vx_enum)VX_MEMORY_TYPE_HOST, (vx_enum)VX_WRITE_ONLY));

        {
            /* call kernel processing function */
            uint32_t start = tivxPlatformGetTimeInUsecs();
#ifdef A72
            memcpy(dst_target_ptr, src_target_ptr, src_desc->mem_size);
#else
            if (!TensorBlockCpyDma(src_desc->mem_size, src_desc, dst_desc))
            {
                VX_PRINT(VX_ZONE_ERROR, "TensorBlockCpyDma failed\n");
            }
#endif
            uint32_t delta = tivxPlatformGetTimeInUsecs() - start;
            VX_PRINT(VX_ZONE_WARNING, "TensorBlockCpyDma copied %u bytes in %u us\n",
                src_desc->mem_size, delta);
            /* kernel processing function complete */
        }

        tivxCheckStatus(&status, tivxMemBufferUnmap(src_target_ptr, src_desc->mem_size,
            (vx_enum)VX_MEMORY_TYPE_HOST, (vx_enum)VX_READ_ONLY));
        tivxCheckStatus(&status, tivxMemBufferUnmap(dst_target_ptr, dst_desc->mem_size,
            (vx_enum)VX_MEMORY_TYPE_HOST, (vx_enum)VX_WRITE_ONLY));
    }
    return status;
}
The A72 kernel calls memcpy() instead of appUdmaCopy1D().
Thank you,
Fred
Hi Fred,
Well, A72 memcpy is better than the UDMA copy here because the copy is small. It is just ~450KB, which might fit completely in the cache and so can give better performance than even UDMA. Can you please try copying more than 2MB of memory? I think in that case, UDMA will perform better than memcpy on the A72.
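To check the cache effect, one option is to time a copy that is much larger than the caches. Below is a minimal standalone sketch (plain C on Linux, not TIOVX; the 8 MB buffer size is just an example picked to exceed typical cache sizes):

/* Time a memcpy over a buffer much larger than the last-level cache,
 * so the measurement is DDR-bound rather than cache-bound. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t nBytes = 8U * 1024U * 1024U; /* 8 MB, well above typical A72 L2 */
    uint8_t *src = malloc(nBytes);
    uint8_t *dst = malloc(nBytes);
    if ((NULL == src) || (NULL == dst))
    {
        return 1;
    }

    /* Touch both buffers first so page faults don't pollute the timing */
    memset(src, 0xA5, nBytes);
    memset(dst, 0x00, nBytes);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, nBytes);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("memcpy of %zu bytes took %.0f us (%.2f GB/s)\n",
           nBytes, us, nBytes / us / 1e3);

    free(src);
    free(dst);
    return 0;
}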
On the C7x, can you please try using DRU channels instead of UDMA channels?
Regards,
Brijesh
Thanks for the answer, Brijesh.
I tried a TensorCpy of 5 times the original size (2211840 bytes), but the A72 memcpy still outperforms the DMA by a large margin.
Core          | 864 * 128 * 5 * sizeof(float)
--------------|------------------------------
A72 memcpy    | 2033 us
C66 DMA       | 9460 us
C7X DMA (DRU) | 9074 us
R5F DMA       | 13582 us
I used DRU for C7X as you suggested by adding this piece of code in TensorBlockCpyDma():
app_udma_ch_handle_t handle = NULL;
#ifdef C71
handle = appUdmaCopyNDGetHandle(8U); /* use a DRU channel on C7x */
#endif
if (0 == appUdmaCopy1D(handle, &prms))
{
    ret = true;
}
Hi Brijesh, could you follow up, please?
I spoke with Kai, and she said that DRU should be faster than memcpy from the C7X, so we need to find where the problem lies. Any ideas?
Thanks,
Fred
Hi FredC_LT,
Yes, DRU performance should be better than A72 memcpy performance. Are you sure that the test is really using DRU channels? Could you help me understand how you have enabled DRU channels for this test? If possible, please share a code snippet.
This is because the performance numbers for the non-DRU and DRU DMA copies look very similar (9.4 ms vs 9.0 ms), so I doubt that DRU is really being used.
Also, on the A72 side, how are you doing the memcpy? How is the memory allocated for the src and dst buffers? If possible, can you please share that code here as well?
Regards,
Brijesh
Hi Brijesh,
To enable DRU, I added a call to appUdmaCopyNDGetHandle() with channel 8U. Here's my TensorBlockCpyDma function:
typedef struct
{
    uint32_t elems;
    uint32_t srcOffset;
    uint32_t dstOffset;
} TensorCpyCfg;

bool TensorBlockCpyDma(const TensorCpyCfg* const cfg,
                       const tivx_obj_desc_tensor_t* const srcDescriptor,
                       const tivx_obj_desc_tensor_t* const dstDescriptor,
                       const void* restrict const srcMapped,
                       void* restrict const dstMapped)
{
    bool ret = false;

    if ( (NULL == cfg) ||
         (NULL == srcDescriptor) ||
         (NULL == dstDescriptor) ||
         (NULL == srcMapped) ||
         (NULL == dstMapped))
    {
        VX_PRINT(VX_ZONE_ERROR, "Invalid input pointer\n");
    }
    else if (0U == cfg->elems)
    {
        VX_PRINT(VX_ZONE_ERROR, "Empty copy\n");
    }
    else
    {
        const uint32_t transferSize     = cfg->elems * srcDescriptor->stride[0];
        const uint32_t srcAvailableSize = srcDescriptor->mem_size - cfg->srcOffset * srcDescriptor->stride[0];
        const uint32_t dstAvailableSize = dstDescriptor->mem_size - cfg->dstOffset * dstDescriptor->stride[0];

        if ( (transferSize > srcAvailableSize) ||
             (transferSize > dstAvailableSize))
        {
            VX_PRINT(VX_ZONE_ERROR, "Invalid size/offset combination\n");
        }
        else
        {
            const uint32_t srcOffset = cfg->srcOffset * srcDescriptor->stride[0];
            const uint32_t dstOffset = cfg->dstOffset * dstDescriptor->stride[0];

            uint64_t srcPhys = tivxMemShared2PhysPtr(srcDescriptor->mem_ptr.shared_ptr, VX_MEMORY_TYPE_HOST) + srcOffset;
            uint64_t dstPhys = tivxMemShared2PhysPtr(dstDescriptor->mem_ptr.shared_ptr, VX_MEMORY_TYPE_HOST) + dstOffset;

            uint32_t start = tivxPlatformGetTimeInUsecs();
            app_udma_copy_1d_prms_t prms;
            appUdmaCopy1DPrms_Init(&prms);
            uint32_t delta = tivxPlatformGetTimeInUsecs() - start;
            VX_PRINT(VX_ZONE_WARNING, "appUdmaCopy1DPrms_Init = %u us\n", delta);

            prms.dest_addr = dstPhys;
            prms.src_addr  = srcPhys;
            prms.length    = transferSize;

            app_udma_ch_handle_t handle = NULL;
#ifdef C71
            start = tivxPlatformGetTimeInUsecs();
            handle = appUdmaCopyNDGetHandle(8U); /* DRU channel */
            delta = tivxPlatformGetTimeInUsecs() - start;
            VX_PRINT(VX_ZONE_WARNING, "appUdmaCopyNDGetHandle = %u us\n", delta);
#endif

            start = tivxPlatformGetTimeInUsecs();
            if (0 == appUdmaCopy1D(handle, &prms))
            {
                // void* dstVirt = (uint8_t*) dstMapped + dstOffset;
                // appMemCacheInv(dstVirt, transferSize);
                ret = true;
            }
            delta = tivxPlatformGetTimeInUsecs() - start;
            VX_PRINT(VX_ZONE_WARNING, "appUdmaCopy1D = %u us\n", delta);
        }
    }
    return ret;
}
The new DRU logic is in the block guarded by "#ifdef C71".
As for the A72 memcpy, it's done in a TIOVX A72 kernel. The allocation is done in a C++ Catch2 test:
TEST_CASE("TensorCpy - Performance comparison) { REQUIRE(0 == appInit()); tivxRegisterMemoryTargetA72Kernels(); // Registers the A72 memcpy kernel std::string target = TIVX_TARGET_A72_0; std::string impl = "CPU"; const size_t COLS = 864; const size_t ROWS = 128 * 5; std::vector<vx_size> dims{COLS, ROWS}; TiovxUserKernelsTests::VxTensorWrapper src("TensorCpy src", dims); TiovxUserKernelsTests::VxTensorWrapper dst("TensorCpy dst", dims); TiovxUserKernelsTests::VxUserDataObjectWrapper<TensorCpyCfg> cfg("TensorCpyCfg"); TensorCpyGraph graph(cfg, src, dst, target); // Calls vxCreateContext() src.Allocate(graph.ctx); // Calls vxCreateTensor() dst.Allocate(graph.ctx); // Calls vxCreateTensor() float* srcptr = src.Map(VX_WRITE_ONLY); // Calls tivxMapTensorPatch() float* dstptr = dst.Map(VX_WRITE_ONLY); // Calls tivxMapTensorPatch() for (size_t i = 0; i < COLS*ROWS; i++) { srcptr[i] = i; dstptr[i] = 0; } src.Unmap(); // Calls tivxUnMapTensorPatch() dst.Unmap(); // Calls tivxUnMapTensorPatch() TensorCpyCfg cfgData; cfgData.elems = COLS * ROWS; cfgData.srcOffset = 0; cfgData.dstOffset = 0; cfg.Allocate(graph.ctx, &cfgData); // Calls vxCreateUserDataObject() graph.Allocate(); // Calls vxCreateGraph() graph.Verify(); // Calls vxVerifyGraph() graph.Run(); // Calls vxProcessGraph() graph.PrintStats(); // Calls tivx_utils_graph_perf_print() srcptr = src.Map(VX_READ_ONLY); // Calls tivxMapTensorPatch() dstptr = dst.Map(VX_READ_ONLY); // Calls tivxMapTensorPatch() REQUIRE(0 == memcmp(srcptr, dstptr, COLS*ROWS*sizeof(float))); src.Delete(); // Calls vxReleaseTensor() dst.Delete(); // Calls vxReleaseTensor() cfg.Delete(); // Calls vxReleaseUserDataObject() graph.Delete(); // Calls vxReleaseGraph(), vxReleaseContext() tivxUnRegisterMemoryTargetA72Kernels(); // Unregisters A72 memcpy kernel REQUIRE(0 == appDeInit()); }
The allocation process for the C7X kernel is exactly the same.
Hi FredC_LT,
As per the datasheet at the link below, when we use DRU channels from any core, we should get around 10GB/s. That means the ~0.45MB of data above should take only around 44 us (442368 bytes / 10 GB/s).
The reason why it is taking a lot more time could be that interrupts are not enabled. In the API appUdmaCopyNDGetHandle, I see that interrupts are disabled by setting the flag udmaCreatePrms.enable_intr to 0, and when it is set to 0, the task just busy-waits by calling TaskP_yield. This might not give good performance, because another task of the same or higher priority could be running and preventing this one from getting unblocked.
I think it is better to check the performance with interrupts enabled, but I am not sure if this is supported/validated for DRU channels.
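For reference, a minimal sketch of what enabling interrupt mode could look like. Only appUdmaCreatePrms_Init() and the enable_intr flag are confirmed above; the app_udma_create_prms_t type name and the create call shown here are assumptions based on the utils naming and may differ in your SDK version:

app_udma_create_prms_t createPrms; /* type name assumed from the utils naming convention */
app_udma_ch_handle_t   chHandle;

appUdmaCreatePrms_Init(&createPrms);
createPrms.enable_intr = 1U; /* block on a completion interrupt instead of TaskP_yield() polling */

chHandle = appUdmaCopyCreate(&createPrms); /* hypothetical create call; check the exact API name in your SDK */
if (NULL != chHandle)
{
    /* pass chHandle to appUdmaCopy1D() instead of NULL */
}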
Regards,
Brijesh
In appUdmaCreatePrms_Init() (which I use), there is the following comment:
Hi FredC_LT,
Exactly. Without interrupt mode, the performance of the default code will not be good.
For the time being, just to measure the performance, can you please comment out the call to the TaskP_yield API in appUdmaTransfer and then check the performance?
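Purely as an illustration of the change meant (the actual loop in appUdmaTransfer may look different in your SDK version):

/* Illustrative only: hypothetical shape of the polling loop inside
 * appUdmaTransfer(). This is an assumption, not a copy of the SDK source. */
while (transferNotComplete) /* hypothetical completion check */
{
    /* TaskP_yield(); */    /* commented out for this measurement */
}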
Regards,
Brijesh
Hi Brijesh, we found a workaround by using the ND API instead of the 1D/2D one.
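For future readers, here is a rough sketch of what the ND-based copy can look like, assuming the appUdmaCopyNDInit()/Trigger()/Wait()/Deinit() flow and the app_udma_copy_nd_prms_t field names from the vision_apps UDMA utils; verify these names against your SDK headers, since versions differ. The tensor dimensions are the ones from the test above:

/* Describe the flat copy as rows-of-bytes so the inner count stays small.
 * Field names (icnt0..3/dim1 for the source walk, dicnt0..3/ddim1 for the
 * destination walk) are assumed from the vision_apps headers. */
const uint32_t rowBytes = 864U * sizeof(float); /* one tensor row        */
const uint32_t numRows  = 128U * 5U;            /* tensor row count      */

app_udma_copy_nd_prms_t ndPrms;
appUdmaCopyNDPrms_Init(&ndPrms);

ndPrms.src_addr  = srcPhys;
ndPrms.dest_addr = dstPhys;

/* Source walk: numRows contiguous rows of rowBytes each */
ndPrms.icnt0 = rowBytes;
ndPrms.icnt1 = numRows;
ndPrms.icnt2 = 1U;
ndPrms.icnt3 = 1U;
ndPrms.dim1  = rowBytes; /* byte pitch between consecutive rows */

/* Destination walk: same shape */
ndPrms.dicnt0 = rowBytes;
ndPrms.dicnt1 = numRows;
ndPrms.dicnt2 = 1U;
ndPrms.dicnt3 = 1U;
ndPrms.ddim1  = rowBytes;

app_udma_ch_handle_t ndHandle = appUdmaCopyNDGetHandle(8U); /* DRU channel, as before */
appUdmaCopyNDInit(ndHandle, &ndPrms); /* configure the transfer record once */
appUdmaCopyNDTrigger(ndHandle);       /* submit the transfer                */
appUdmaCopyNDWait(ndHandle);          /* wait for completion                */
appUdmaCopyNDDeinit(ndHandle);        /* release channel resources          */

One plausible reason this path performs better for us is that the transfer is configured once and then only triggered, avoiding per-copy setup overhead, but we have not profiled the internals to confirm that.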