Linux/TDA2: VXLIB Remap Performance

Part Number: TDA2


Tool/software: Linux

Hi. I'm using Vision SDK 3.4 to develop a 3-channel 1080p VIP capture -> panorama algorithm.

To make the panorama image fit the width of a 1080p frame (1920 pixels), I resized each channel to 640x360.

So a 1920x360 Y LUT and a 1920x180 CbCr LUT are used to build the panorama image.

I used the VXLIB function VXLIB_remapBilinear_bc_i8u_i32f_o8u.

According to the VXLIB test report, the remap function takes 7.5 * (dst width * dst height) + 139 cycles.

In my case, the remap function should take about 13 ms.

But it actually takes over 40 ms.
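
For reference, the 13 ms estimate can be reproduced from the test-report formula. The sketch below is only illustrative: the 600 MHz DSP clock is an assumption (the post does not state the clock), as is treating the CbCr plane as a 1920x180 8-bit image passed through the same kernel. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cycle model quoted from the VXLIB test report in the post:
 * cycles = 7.5 * (dst width * dst height) + 139.
 * 7.5 is written as 15/2 to keep the arithmetic in integers. */
static uint64_t remap_cycles(uint32_t width, uint32_t height)
{
    uint64_t pixels = (uint64_t)width * height;
    return (15u * pixels) / 2u + 139u;
}

/* Convert a cycle count to microseconds for a clock given in MHz.
 * The clock value is supplied by the caller; 600 MHz is only an assumption. */
static uint64_t cycles_to_us(uint64_t cycles, uint32_t clock_mhz)
{
    assert(clock_mhz > 0u);
    return cycles / clock_mhz;
}
```

Under these assumptions, cycles_to_us(remap_cycles(1920, 360), 600) gives 8640 us for the Y plane and cycles_to_us(remap_cycles(1920, 180), 600) gives 4320 us for the CbCr plane, about 13 ms in total.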

The LUT is allocated as follows:

pPanoramaObj->remapBuf = Utils_memAlloc(
    UTILS_HEAPID_DDR_CACHED_SR,
    ( 1920*360*8*2 ),
    MY_FRAME_ALIGN /* 32u */
);

Below is the remap call:

VXLIB_bufParams2D_t src_addr;
VXLIB_bufParams2D_t dst_addr;
VXLIB_bufParams2D_t remap_addr;
VXLIB_STATUS vx_status;

src_addr.dim_x = 1920;
src_addr.dim_y = 360;
src_addr.stride_y = 1920;
src_addr.data_type = VXLIB_UINT8;

dst_addr.dim_x = 1920;
dst_addr.dim_y = 360;
dst_addr.stride_y = 1920;
dst_addr.data_type = VXLIB_UINT8;

remap_addr.dim_x = 1920*2;
remap_addr.dim_y = 360;
remap_addr.stride_y = 1920*8;
remap_addr.data_type = VXLIB_FLOAT32;

vx_status = VXLIB_remapBilinear_bc_i8u_i32f_o8u(
    (UInt8*)((UInt32)pMosaicFrameBuffer->bufAddr[0] + 1920*360),
    &src_addr,
    (UInt8*)((UInt32)pOutFrameBuffer->bufAddr[0] + 1920*360),
    &dst_addr,
    (VXLIB_F32*)((UInt8*)pPanoramaObj->remapBuf),
    &remap_addr,
    0
);

if(vx_status!=VXLIB_SUCCESS)
    Vps_printf("remapY result:%d\n", vx_status);

The remap function itself works correctly, and the output image looks normal.

What should I do to get better performance?

Thanks in advance.

  • Thanks for your question.

    Explanation of difference you are seeing

    All of the performance numbers listed in the test report reflect the core DSP performance without considering memory stalls. They are therefore best-case figures that assume all of the code and data are in L1 memory. In reality this is not usually possible, so treat these numbers as a bound: actual performance will not be better than this.

    I ran some tests to see what the performance is when all code and data are accessed from cached DDR memory. With a randomly generated remap table, this kernel averaged between 23 and 82 cycles per output pixel. When I ran the same test with the code and data in DSP L2 memory, the performance improved to between 10 and 30 cycles per output pixel. Here is a summary:

    1. CPU only (no L2/DDR memory): 7.5 cycles per output pixel

    2. Code/Data in L2: 10 to 30 cycles per output pixel

    3. Code/Data in DDR: 23 to 82 cycles per output pixel

    The difference between case 1 and 2 is the L1 cache miss penalty.  The difference between case 2 and case 3 is the L2 cache miss penalty to DDR.
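
    To place a measured time on this scale, it can be converted back to cycles per output pixel. The helper below is a rough sketch with a hypothetical name; the 600 MHz clock is an assumption, and so is the idea that the reported 40 ms covers both the Y (1920x360) and CbCr (1920x180) planes.

```c
#include <assert.h>
#include <stdint.h>

/* Effective cycles per output pixel for a measured run time.
 * time_us:   measured time in microseconds
 * clock_mhz: DSP clock in MHz (assumed by the caller, not given in the thread)
 * pixels:    number of output pixels processed in that time */
static uint32_t effective_cycles_per_pixel(uint32_t time_us, uint32_t clock_mhz,
                                           uint32_t pixels)
{
    assert(pixels > 0u);
    return (uint32_t)(((uint64_t)time_us * clock_mhz) / pixels);
}
```

    With those assumptions, effective_cycles_per_pixel(40000, 600, 1920 * 360 + 1920 * 180) returns 23, which sits at the low end of the DDR range in case 3.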

    The cache miss rate will vary depending on the remap table. For example, a unity (identity) remap table will have a high cache hit rate, since subsequent input pixel accesses fall on the same cache line that was already fetched.
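
    As an illustration of the unity case, the sketch below fills a remap table with the identity mapping. It assumes the table holds one interleaved (x, y) float pair per output pixel, which is what the dim_x = 1920*2 setup in the question suggests, and that rows are packed with no padding; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fill a remap table with the identity mapping, so that output pixel (x, y)
 * reads input pixel (x, y). Layout assumption: interleaved (x, y) float
 * pairs, row-major, with no padding between rows. */
static void fill_identity_remap(float *table, size_t width, size_t height)
{
    assert(table != NULL);
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            float *entry = &table[2u * (y * width + x)];
            entry[0] = (float)x;  /* source column */
            entry[1] = (float)y;  /* source row */
        }
    }
}
```

    Timing the kernel once with such a table and once with the real panorama LUT gives a rough idea of how much of the cost comes from scattered input accesses.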

    Proposal for Improvement


    As seen above, the best thing you can do to improve performance is to process the data from L2 RAM. Since the whole input image probably does not fit into L2 RAM, DMA the input from DDR into L2 RAM before calling this function. This will probably require you to pipeline the processing in a ping/pong fashion across blocks of input/output. In other words, write a loop that initiates the transfer of one input block into L2 RAM and, while that DMA is in flight, calls the VXLIB function on a block that was transferred earlier. When the VXLIB function is done, initiate the next transfer, and continue until the end of the frame.
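
    The loop structure described above might look roughly like the following. This is a host-side sketch, not target code: dma_start and dma_wait are hypothetical helpers (memcpy stands in for an asynchronous EDMA transfer), process_block is a placeholder for the VXLIB call, and BLOCK_BYTES is an arbitrary block size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 1024u  /* hypothetical block size; tune to the free L2 RAM */

/* Stand-ins for the two on-chip L2 buffers (ping and pong). */
static uint8_t l2_buf[2][BLOCK_BYTES];

/* Hypothetical DMA helpers. On target these would submit an EDMA transfer
 * and wait for its completion; here memcpy models the copy synchronously. */
static void dma_start(uint8_t *dst, const uint8_t *src, size_t n) { memcpy(dst, src, n); }
static void dma_wait(void) { /* the modeled transfer has already completed */ }

/* Placeholder kernel; on target this would be the VXLIB call on the L2 block. */
static void process_block(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = (uint8_t)(255u - in[i]);
    }
}

static void process_frame_pingpong(const uint8_t *src, uint8_t *dst, size_t total)
{
    assert(src != NULL && dst != NULL);
    if (total == 0u) {
        return;
    }
    size_t offset = 0u;
    unsigned cur = 0u;
    size_t cur_len = total < BLOCK_BYTES ? total : BLOCK_BYTES;

    dma_start(l2_buf[cur], src, cur_len);           /* prime the first block */
    while (offset < total) {
        size_t next_off = offset + cur_len;
        size_t remaining = total - next_off;
        size_t next_len = remaining < BLOCK_BYTES ? remaining : BLOCK_BYTES;

        dma_wait();                                 /* current block is in L2 */
        if (next_len > 0u) {                        /* fetch next block early */
            dma_start(l2_buf[cur ^ 1u], src + next_off, next_len);
        }
        process_block(l2_buf[cur], dst + offset, cur_len);

        offset = next_off;
        cur_len = next_len;
        cur ^= 1u;                                  /* swap ping and pong */
    }
}
```

    In this sketch the output is written straight to its destination; on target, the output block could be double-buffered in L2 and written back with DMA in the same way.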

    Since you are doing a remap, which input blocks you transfer and how large each transfer must be depend on the mapping described by your remap table.
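
    One way to size the input transfer for a given output block is to scan that block's remap entries and take the bounding box of the source coordinates. The sketch below assumes interleaved (x, y) float pairs with packed rows and non-negative coordinates; SrcBox and remap_block_bounds are hypothetical names, and the caller is expected to clamp the result to the image.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive bounding box of the source pixels an output block reads. */
typedef struct { int x0, y0, x1, y1; } SrcBox;

/* table:       interleaved (x, y) floats, row-major, table_width pairs per row
 * bx, by:      top-left corner of the output block
 * bw, bh:      size of the output block
 * Bilinear interpolation reads floor(x), floor(x)+1 and floor(y), floor(y)+1,
 * so the upper bounds include the +1 neighbor. */
static SrcBox remap_block_bounds(const float *table, size_t table_width,
                                 size_t bx, size_t by, size_t bw, size_t bh)
{
    assert(table != NULL && bw > 0u && bh > 0u);
    const float *first = &table[2u * (by * table_width + bx)];
    SrcBox box = { (int)first[0], (int)first[1],
                   (int)first[0] + 1, (int)first[1] + 1 };

    for (size_t y = by; y < by + bh; y++) {
        for (size_t x = bx; x < bx + bw; x++) {
            const float *e = &table[2u * (y * table_width + x)];
            int sx = (int)e[0];  /* truncation equals floor for values >= 0 */
            int sy = (int)e[1];
            if (sx < box.x0) box.x0 = sx;
            if (sy < box.y0) box.y0 = sy;
            if (sx + 1 > box.x1) box.x1 = sx + 1;
            if (sy + 1 > box.y1) box.y1 = sy + 1;
        }
    }
    return box;
}
```

    Since the table is fixed for a static camera setup, these boxes could be computed once at initialization rather than per frame.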

    Final Questions

    After rereading your post, I'm curious why you are using remap for the whole panorama. Is most of the remap a plain copy, or does almost every pixel get shifted/warped relative to the input? If you are primarily just appending input images together, you can use a DMA for all or most of the copy, and that will be the fastest option.
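
    For the pure-copy case, appending images side by side is just a strided row copy. The sketch below uses memcpy as a stand-in for a 2D DMA transfer; stitch_rows is a hypothetical helper, and all sources are assumed to share the same width, stride, and height.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Place num_src images side by side in dst, row by row.
 * Each memcpy models one line of a 2D DMA transfer. */
static void stitch_rows(uint8_t *dst, size_t dst_stride,
                        const uint8_t *const src[], size_t num_src,
                        size_t src_width, size_t src_stride, size_t height)
{
    assert(dst != NULL && src != NULL);
    assert(num_src * src_width <= dst_stride);
    for (size_t i = 0; i < num_src; i++) {
        for (size_t y = 0; y < height; y++) {
            memcpy(dst + y * dst_stride + i * src_width,
                   src[i] + y * src_stride,
                   src_width);
        }
    }
}
```

    For the sizes in this thread, that would be three 640x360 sources copied into a destination with a stride of 1920 bytes.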

  • As proposed, I copied the LUT blocks from DDR to L2 using DMA.

    I didn't use a ping/pong buffer, but performance roughly doubled.

    If I use a ping/pong buffer, I expect an even better result.

    To answer your final question:

    In my case, almost every pixel is warped to a new position, so I can't simply copy image blocks.

    Thanks for the response.