TDA2EVM5777: Vision SDK VXLIB: Histogram algorithm performance improvement help required.

Part Number: TDA2EVM5777

Hi,

I have implemented a histogram algorithm using the Vision SDK VXLIB APIs. I am calling just the following two APIs, and together they take 32% load on the DSP.


The image format is YUV422. I have done profiling; here are the results for image width = 1280, height = 720, stride = 2560:

VXLIB_channelExtract_1of2_i8u_o8u (Takes approximately 3414337 CPU cycles)

VXLIB_histogram_i8u_o32u ( Takes approximately 2630963 CPU cycles)

Could you please guide me on how to optimize this algorithm?

If required, I can paste the code here. Please help.

Thanks,

  • Hi Rajesh,

    I have forwarded your question to a VXLIB expert.

    Regards,
    Yordan
  • You can improve the performance by
    (A) making sure the cache is enabled, or
    (B) using DMA to get the data into L2 and operating on it from L2.

    An example of DMA can be found in the DMAUtils package: <dmautils>\test\edma_utils_autoincrement_1d_test

    Thanks,
    With Regards,
    Pramod
  • I am out of the office this week, so I can't confirm your results on my board. In the meantime, you can try the following easy fix to take advantage of the cache and gain some performance:

    // Set bufParams for the proper width, but height = 1
    for (i = 0; i < height; i++)
    {
        // Adjust the pointers to the beginning of each line for the
        // channelExtract input/output and the histogram input
        VXLIB_channelExtract_1of2_i8u_o8u(...);
        VXLIB_histogram_i8u_o32u(...);
    }

    This should reduce the cache miss rate of histogram.

    When I return next week, I can check whether your cycle counts match what I expect.

    To get better performance, you may consider using DMA to bring in lines from DDR into L2SRAM in a ping pong fashion so that the data access overhead is hidden in parallel with the compute of different buffers.
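
    Conceptually, the ping-pong loop looks something like the sketch below. This is only an outline of the idea: dma_submit()/dma_wait() are placeholder names standing in for the real EDMA/DMAUtils calls (see Pramod's example for the actual API), the buffer and parameter names are just illustrative, and the two line buffers are assumed to be allocated in L2SRAM.

    uint8_t *lineBuf[2];   /* two one-line buffers, assumed to be in L2SRAM */
    int      ping = 0;
    int      row;

    /* Prime the pipeline: start fetching the first input line */
    dma_submit(lineBuf[ping], &srcFrame[0], lineSizeBytes);

    for (row = 0; row < height; row++)
    {
        int pong = ping ^ 1;

        /* Kick off the transfer of the NEXT line while the CPU works on this one */
        if (row + 1 < height)
            dma_submit(lineBuf[pong], &srcFrame[(row + 1) * pitchBytes], lineSizeBytes);

        /* Wait for the current line to land in L2SRAM, then run the kernels on it */
        dma_wait(lineBuf[ping]);
        VXLIB_channelExtract_1of2_i8u_o8u(lineBuf[ping], &src_addr, lumaLine, &dst_addr, 0);
        VXLIB_histogram_i8u_o32u(lumaLine, &srcAddr, distribution, scratch,
                                 OFFSET, RANGE, TOTAL_BINS, FINAL_CALL);

        ping = pong;   /* swap ping/pong buffers for the next iteration */
    }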

    Jesse

  • Hi,

    Thanks for your reply. I have tried your suggestions partly, but I think I don't understand how to implement "using DMA to bring in lines from DDR into L2SRAM in a ping pong fashion", so I have only made height = 1, and that increases the CPU usage.

    Could you please point us to some good example code for implementing "using DMA to bring in lines from DDR into L2SRAM in a ping pong fashion"?

    I am pasting my code sample here for your better understanding; please see below.

    Alg_Histogram_Obj * Alg_HistogramCreate(
            Alg_HistogramCreateParams *pCreateParams)
    {
    
        Alg_Histogram_Obj * pAlgHandle;
    
        pAlgHandle = (Alg_Histogram_Obj *) malloc(sizeof(Alg_Histogram_Obj));
    
        UTILS_assert(pAlgHandle != NULL);
    
        pAlgHandle->maxHeight   = pCreateParams->maxHeight;
        pAlgHandle->maxWidth    = pCreateParams->maxWidth;
    
        /* Temporary buffers for intermediate frame allocations */
        pAlgHandle->frameBuff[0] =  Utils_memAlloc(UTILS_HEAPID_DDR_CACHED_SR,
                                                   RES_720P,
                                                   UINT_32);
        pAlgHandle->frameBuff[1] =  Utils_memAlloc(UTILS_HEAPID_DDR_CACHED_SR,
                                                   RES_720P,
                                                   UINT_32);
    
        return pAlgHandle;
    }
    
    
    Int32 Alg_HistogramProcess(Alg_Histogram_Obj *algHandle,
                               UInt32            *inPtr[],
                               UInt32            *outPtr[],
                               UInt32             width,
                               UInt32             height,
                               UInt32             inPitch[],
                               UInt32             outPitch[],
                               UInt32             dataFormat
    )
    {
    
    
        const uint8_t *inputPtr = (const uint8_t *)inPtr[0];
        uint8_t *histInPtr = (uint8_t *)algHandle->frameBuff[0];
        UInt32 distribution[TOTAL_BINS] = {0};
        UInt32 scratch[SCRATCH_SIZE] = {0};
        UInt32 rowLine = 0;
    
    
        /*Histogram Algorithm for YUV422 interleaved format */
        if(dataFormat == SYSTEM_DF_YUV422I_YUYV)
        {
            //--- benchmarking initialization code start -----
            // In the variable declaration portion of the code:
            uint64_t start_time, end_time, overhead, cyclecount;
            // In the initialization portion of the code:
            TSCL = 0;
            //enable TSC
            start_time = _itoll(TSCH, TSCL);
            end_time = _itoll(TSCH, TSCL);
            overhead = end_time-start_time; //Calculating the overhead of the method.
            //------------------------------------------------
    
            VXLIB_bufParams2D_t    src_addr, dst_addr; //initializing parameters for the API
    
            //used during extraction of U and V planes
            src_addr.dim_x = width;
            src_addr.dim_y = 1;//height;
            src_addr.stride_y = inPitch[0];
            src_addr.data_type = VXLIB_UINT8;
    
            //used during extraction of U and V planes
            dst_addr.dim_x = width;
            dst_addr.dim_y = 1;//height;
            dst_addr.stride_y = outPitch[0]>>1;
            dst_addr.data_type = VXLIB_UINT8;
    
    
            start_time = _itoll(TSCH, TSCL);
    
            for(rowLine = 0; rowLine < height; ++rowLine)
            {
    
    
            VXLIB_channelExtract_1of2_i8u_o8u (
                    inputPtr,
                    &src_addr,
                    histInPtr,
                    &dst_addr,
                    0
            );
    
            end_time = _itoll(TSCH, TSCL);
            cyclecount = end_time-start_time-overhead;
            Vps_printf("\n\n\nThe VXLIB_channelExtract_1of2_i8u_o8u function took: %lld CPU cycles\n", cyclecount);
    
            /* Buffer information needed by Histogram equalization VxLib API */
            VXLIB_bufParams2D_t    srcAddr;
    
            srcAddr.dim_x = width;
            srcAddr.dim_y = 1;//height;
            srcAddr.stride_y = width;
            srcAddr.data_type = VXLIB_UINT8;
    
            /*
             * Histogram VxLib API
             * applied on luma component and save result in distribution buffer
             *
             */
    
            start_time = _itoll(TSCH, TSCL);
    
            VXLIB_histogram_i8u_o32u(( const uint8_t *)histInPtr,
                                     &srcAddr,
                                     distribution,
                                     (uint32_t *)scratch,
                                     OFFSET,
                                     RANGE,
                                     TOTAL_BINS,
                                     FINAL_CALL
            );
             inputPtr = inputPtr + width;
            }
    
            end_time = _itoll(TSCH, TSCL);
            cyclecount = end_time-start_time-overhead;
            Vps_printf("\n\n\nThe VXLIB_histogram_i8u_o32u function took: %lld CPU cycles\n", cyclecount);
        }
        else
        {
            Vps_printf("\nInvalid Frame Format\n");
            return SYSTEM_LINK_STATUS_EFAIL;
        }
    
        return SYSTEM_LINK_STATUS_SOK;
    }
    

    Thanks,

  • Thanks for attaching the code. It is very helpful.

    Making height equal to 1 should have reduced the CPU usage. One thing I noticed about your code is that there is now a print statement inside your for loop. This might be the reason why the cycles have gone up; try removing this statement. Also, there is a profile start and end inside the loop; if you remove these, then the total combined cycle count for the two functions should be accurate.
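
    As a rough sketch of what I mean (keeping your variable names and kernel arguments exactly as they are in your code; only the TSC reads and the print move):

    start_time = _itoll(TSCH, TSCL);

    for (rowLine = 0; rowLine < height; ++rowLine)
    {
        /* Only the two VXLIB calls inside the loop -- no prints, no TSC reads */
        VXLIB_channelExtract_1of2_i8u_o8u(inputPtr, &src_addr, histInPtr, &dst_addr, 0);
        VXLIB_histogram_i8u_o32u((const uint8_t *)histInPtr, &srcAddr, distribution,
                                 (uint32_t *)scratch, OFFSET, RANGE, TOTAL_BINS, FINAL_CALL);
        inputPtr = inputPtr + width;   /* pointer advance kept as in your code */
    }

    end_time   = _itoll(TSCH, TSCL);
    cyclecount = end_time - start_time - overhead;
    Vps_printf("Both kernels over the full frame took: %lld CPU cycles\n", cyclecount);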

    Regarding the DMA example, Pramod mentioned an example in a previous reply:

    "An example of DMA can be found in DMAUtils packages <dmautils>\test\edma_utils_autoincrement_1d_test"

    Jesse
  • Hi,

    Thanks for your reply, and I am sorry for some misunderstanding in the given code.

    Actually, without the print statements and the profiling code, with just the two VXLIB API calls, it is taking approximately 31%; that was the reason I added the profiling code later, to check the number of CPU cycles it takes.

    I am trying the ping-pong buffer example, but meanwhile, if you can check my code on your side, it will be a great help.

    Thanks,

  • Are you saying that the version of the code with height = 1, using these two kernels in a loop across the height (with the prints removed), is using 31%, and the code calling the full first kernel followed by the full second kernel is at 32% load? I would expect the height = 1 version to give at least 1/4 lower loading. The code looks correct, assuming the prints were removed. You might want to move the histogram initialization of bufParams outside of the loop, since it doesn't change for each iteration and it wastes cycles within the loop.

    I see that you also allocate the intermediate buffer using the stack. You may want to allocate it in the L2SRAM so that it doesn't get paged out, but I don't expect that to make a huge difference.
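
    For example (a sketch only, reusing the names from your code; UTILS_HEAPID_L2_LOCAL is my assumption for the local L2SRAM heap ID -- check which heap IDs your Vision SDK build actually provides):

    /* Hoisted out of the per-line loop: the histogram bufParams never change */
    VXLIB_bufParams2D_t srcAddr;
    srcAddr.dim_x     = width;
    srcAddr.dim_y     = 1;
    srcAddr.stride_y  = width;
    srcAddr.data_type = VXLIB_UINT8;

    /* In Alg_HistogramCreate(): place the intermediate line buffer in L2SRAM
     * instead of DDR; one luma line of maxWidth bytes is enough if you only
     * ever hold a single line there (heap ID is an assumption, see above) */
    pAlgHandle->frameBuff[0] = Utils_memAlloc(UTILS_HEAPID_L2_LOCAL,
                                              pCreateParams->maxWidth,
                                              UINT_32);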

    I have confirmed the initial results you posted with regard to the cycles for these two functions when running on full frames using only the cache. It does appear that the cache is enabled, which is good.

    1. The channel extract API is very data I/O bound since there is basically no compute. The simulator (as if data is in L1 already) shows 0.18 cycles per pixel since it is simply a SIMD load, pack, and store operation. It is bound on the D unit, and the other units are largely idle. When you run the kernel from cache, the performance is bound by the large cache miss rate and gets reduced to 3.5 - 4.5 cycles per pixel on average. If you had some compute function which took more than 4 or 5 cycles of simulator compute, then the cache miss rate would not impact the performance as much since the CPU would be loaded more. Given the simulator performance data for each kernel, you can estimate which ones can be done with minimal impact using cache, versus those like this one which will be largely impacted by cache miss rates.
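
    (For reference: your measured 3,414,337 cycles over 1280 x 720 = 921,600 pixels works out to about 3.7 cycles per pixel, right in that range.)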

    When the kernel is run on data already in L2SRAM, the performance improves to somewhere between 0.5 and 1.0 cycles per pixel. So the biggest improvement would come from using DMA to bring lines (or blocks) of input data into L2SRAM. You can refer to the example that Pramod mentioned. Alternatively, if you want to use the BAM framework, you can refer to what OpenVX uses: <VSDK 3.x>/ti_components/open_compute/TIOVX_01_00_00_00/kernels/openvx-core/c66x/bam. Note that these functions are for OpenVX, but if you include the bam_wrapper functions header file, you may be able to call the single/multi BAM graph functions.

    2. For the histogram kernel, the simulator shows 1.26 cycles per pixel. Using cache, the performance again is somewhat impeded by cache miss rate, and performance degrades to 2.5-2.8 cycles per pixel. However, when data is in L2SRAM, the performance is mostly recovered to 1.5 cycles per pixel. So again, putting this together with channel extract using DMA would help.
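
    (Again for reference: your measured 2,630,963 cycles / 921,600 pixels is about 2.85 cycles per pixel, right around the top of that cache-bound range.)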

    3. Finally, if you need more performance after doing this, one option would be to merge these two kernels into a single kernel. This means writing a new kernel which borrows the logic from the original two VXLIB kernels in order to take better advantage of the pipeline. The cycles per pixel of this new kernel has the potential to be lower than the sum of the two individual kernels.
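
    As a plain-C illustration of the idea (not an optimized, software-pipelined implementation, and ignoring the VXLIB scratch and bin-offset handling for simplicity):

    /* Fused channel extract + histogram over a YUYV (YUV422 interleaved) frame.
     * The luma byte is read directly from the interleaved input, so the
     * intermediate luma line buffer and its extra load/store traffic disappear.
     * dist[] (256 bins) is assumed to be zeroed by the caller. */
    static void lumaHistogram_yuyv(const uint8_t *src, uint32_t width, uint32_t height,
                                   uint32_t pitch, uint32_t dist[256])
    {
        uint32_t x, y;
        for (y = 0; y < height; y++)
        {
            const uint8_t *line = src + (y * pitch);
            for (x = 0; x < width; x++)
            {
                dist[line[2u * x]]++;   /* Y is every other byte in YUYV */
            }
        }
    }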

    Best Regards,
    Jesse