TDA2EVM5777: Need a help to develop the canny edge detection algorithm

parth Modi

Part Number: TDA2EVM5777

Hi All,

We are working on a project in which we use the TDA2xx EVM to develop algorithms. We are new to the TDA2x ADAS family.
We want to develop a Canny edge detection algorithm, so we have used the APIs from TI's VXLIB library.

Developing this algorithm involves the steps listed below:
1. Gradient magnitude and orientation computation using a noise-resistant operator (Sobel).
2. Non-maximum suppression of the gradient magnitude, using the gradient orientation information.
3. Applying a double threshold to determine potential edges.
4. Tracing edges in the modified gradient image using hysteresis thresholding to produce a binary result.
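
For reference, steps 3 and 4 can be sketched in plain C as below. This is only an illustrative sketch, not the VXLIB implementation: the function name `hysteresis_threshold`, the tightly packed buffers, and the use of `>=` for both thresholds are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Steps 3 and 4 in plain C (hypothetical helper, not a VXLIB API).
 * mag: gradient magnitude after non-maximum suppression, width*height,
 *      tightly packed.
 * out: receives 255 for edge pixels and 0 otherwise.
 * Returns 0 on success, -1 if the work stack cannot be allocated. */
int hysteresis_threshold(const uint16_t *mag, uint8_t *out,
                         int width, int height,
                         uint16_t low, uint16_t high)
{
    assert(mag != NULL && out != NULL);
    assert(width > 0 && height > 0 && low <= high);

    size_t count = (size_t)width * (size_t)height;
    int *stack = malloc(count * sizeof *stack);
    if (stack == NULL) {
        return -1;
    }
    size_t top = 0;

    /* Step 3: strong pixels (>= high) are edges and seed the trace. */
    memset(out, 0, count);
    for (size_t i = 0; i < count; i++) {
        if (mag[i] >= high) {
            out[i] = 255;
            stack[top++] = (int)i;
        }
    }

    /* Step 4: grow edges into 8-connected weak pixels (>= low).
     * Each pixel is pushed at most once, so the stack cannot overflow. */
    while (top > 0) {
        int idx = stack[--top];
        int x = idx % width, y = idx / width;
        for (int dy = -1; dy <= 1; dy++) {
            for (int dx = -1; dx <= 1; dx++) {
                int nx = x + dx, ny = y + dy;
                if (nx < 0 || ny < 0 || nx >= width || ny >= height) {
                    continue;
                }
                int n = ny * width + nx;
                if (out[n] == 0 && mag[n] >= low) {
                    out[n] = 255;
                    stack[top++] = n;
                }
            }
        }
    }
    free(stack);
    return 0;
}
```

A weak pixel is kept only if it is connected to a strong pixel; an isolated weak pixel stays at 0.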

VXLIB provides APIs for all of the above steps. When we use the Sobel filter API (the first step), the output frame's width is halved and the API consumes 90% of the DSP. We therefore have the following queries:

1. Why does this API take 90% of the CPU?
2. How can we get the full frame width from the Sobel API?

We have attached the C file with our Sobel API calls for your reference.

Please let us know if you need more information from our side.

Thanks,
Parth

Attached file: API_referance_file.c

1) To compute the Sobel-filtered image we used the VXLIB_sobel_3x3_i8u_o16s_o16s API.
The code below shows the parameters we set for the API.
src_addr.dim_x = width;
src_addr.dim_y = height;
src_addr.stride_y = inPitch[0];
src_addr.data_type = VXLIB_UINT8;

dst_x_addr.dim_x = width - 2;
dst_x_addr.dim_y = height - 2;
dst_x_addr.stride_y = inPitch[0];
dst_x_addr.data_type = VXLIB_INT16;

dst_y_addr.dim_x = width - 2;
dst_y_addr.dim_y = height - 2;
dst_y_addr.stride_y = inPitch[0];
dst_y_addr.data_type = VXLIB_INT16;

VXLIB_sobel_3x3_i8u_o16s_o16s((const uint8_t *)inputPtr,&src_addr,
                              (int16_t *)pBufGradX, &dst_x_addr,
                              (int16_t *)pBufGradY, &dst_y_addr);

2) To compute the L1 norm we used the VXLIB_normL1_i16s_i16s_o16u API.

The code below shows the parameters we set for the API.

src_x_addr.dim_x = width - 2;
src_x_addr.dim_y = height - 2;
src_x_addr.stride_y = inPitch[0];
src_x_addr.data_type = VXLIB_INT16;

src_y_addr.dim_x = width - 2;
src_y_addr.dim_y = height - 2;
src_y_addr.stride_y = inPitch[0];
src_y_addr.data_type = VXLIB_INT16;

dst_addr.dim_x = width;
dst_addr.dim_y = height;
dst_addr.stride_y = outPitch[0];
dst_addr.data_type = VXLIB_UINT16;

VXLIB_normL1_i16s_i16s_o16u((const int16_t *)pBufGradX, &src_x_addr,
                            (const int16_t *)pBufGradY, &src_y_addr,
                           (uint16_t *)outputPtr, &dst_addr);

over 8 years ago

0 Yordan Kamenov over 8 years ago

TI__Mastermind 42515 points

Hi Parth,

I have forwarded your question to VLIB experts.

Regards,
Yordan

0 Jesse Villarreal over 8 years ago

TI__Expert 5625 points

> 2. How can we get the full frame width from the Sobel API?

For integration debugging, I suggest calling the VXLIB_<kernel name>_checkParams() version of the APIs. These APIs report any error in the parameter list. For example, in your attached file I see that you use "inPitch[0]" for both the source and the destination stride. The stride value is in bytes, so this is probably an error. The output stride should probably be twice the input pitch, since the output is 2 bytes per pixel instead of 1 byte per pixel.
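
To illustrate the point about byte strides, here is a minimal plain C sketch. It is not the optimized VXLIB kernel: `buf_params_2d`, `stride_bytes` and `sobel_3x3_ref` are hypothetical names for this example, and the struct only mirrors the `dim_x`, `dim_y` and `stride_y` fields used in this thread. Tightly packed buffers are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the VXLIB 2D buffer descriptor; only the
 * fields used in this thread are mirrored here. */
typedef struct {
    uint32_t dim_x;     /* width in pixels */
    uint32_t dim_y;     /* height in lines */
    int32_t  stride_y;  /* distance between lines, in BYTES */
} buf_params_2d;

/* Stride in bytes for a tightly packed buffer. */
int32_t stride_bytes(uint32_t width_pixels, size_t bytes_per_pixel)
{
    return (int32_t)(width_pixels * bytes_per_pixel);
}

/* Plain C reference 3x3 Sobel (not the optimized VXLIB kernel).
 * The output has (width - 2) x (height - 2) valid pixels. */
void sobel_3x3_ref(const uint8_t *src, const buf_params_2d *src_p,
                   int16_t *dst_x, int16_t *dst_y, const buf_params_2d *dst_p)
{
    assert(src_p->dim_x >= 3 && src_p->dim_y >= 3);
    assert(dst_p->dim_x >= src_p->dim_x - 2);
    assert(dst_p->dim_y >= src_p->dim_y - 2);
    /* A full 16-bit output line must fit inside one output stride. */
    assert((size_t)dst_p->stride_y >= (src_p->dim_x - 2) * sizeof(int16_t));

    for (uint32_t y = 0; y + 2 < src_p->dim_y; y++) {
        const uint8_t *r0 = src + (size_t)y * src_p->stride_y;
        const uint8_t *r1 = r0 + src_p->stride_y;
        const uint8_t *r2 = r1 + src_p->stride_y;
        /* The stride is in bytes, so step through the output via a byte pointer. */
        int16_t *ox = (int16_t *)((uint8_t *)dst_x + (size_t)y * dst_p->stride_y);
        int16_t *oy = (int16_t *)((uint8_t *)dst_y + (size_t)y * dst_p->stride_y);
        for (uint32_t x = 0; x + 2 < src_p->dim_x; x++) {
            /* Gx: right column minus left column. */
            ox[x] = (int16_t)((r0[x + 2] + 2 * r1[x + 2] + r2[x + 2])
                            - (r0[x] + 2 * r1[x] + r2[x]));
            /* Gy: bottom row minus top row. */
            oy[x] = (int16_t)((r2[x] + 2 * r2[x + 1] + r2[x + 2])
                            - (r0[x] + 2 * r0[x + 1] + r0[x + 2]));
        }
    }
}
```

With a 1920-pixel-wide frame, for example, `stride_bytes(1920, sizeof(uint8_t))` is 1920 for the 8-bit input while `stride_bytes(1920, sizeof(int16_t))` is 3840 for a 16-bit output. Reusing the input pitch as the output stride would make consecutive output lines overlap.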

> 1. Why does this API take 90% of the CPU?

When you say 90% of CPU, I assume you have some frame-rate requirement and this API is taking 90% of that window. Is that what you mean? This depends on the frame-rate requirement and the image resolution. The performance listed in the release for the kernel assumes that all data and code are in L1 (single-cycle access). That is the best case; the actual performance depends on the memory hierarchy configuration and the data fetch scheme you use.

Many of the VXLIB APIs are I/O bound, meaning that the compute portion has been optimized so much that the bottleneck may be accessing the data from DDR. So if the DSP doesn't have the cache turned on, or the cache is not mapped to the appropriate DDR address range, this would be a problem. If the cache is configured properly, a further optimization is to use the DMA to move blocks of data into and out of the L2SRAM and run the API on one block at a time. For sobel_3x3, we have observed a 4x performance improvement when comparing DMA to cache-only usage.

You can see how VXLIB kernels are integrated in OpenVX in the following file: ti_components/open_compute/TIOVX_01_00_00_00/kernels/openvx-core/c66x/vx_canny_target.c. This file shows the integration of VXLIB kernels for Canny using cache only. If you want to see the version that uses BAM for DMA, here is the file: ti_components/open_compute/TIOVX_01_00_00_00/kernels/openvx-core/c66x/bam/vx_bam_canny_target.c. If you are not using OpenVX, you may still want to use this code as a reference for calling these VXLIB functions.
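
The block-at-a-time idea can be sketched in portable C as follows. This is a rough illustration only, not the BAM/DMA integration: `memcpy` stands in for the DMA transfer, the `scratch_in` buffer stands in for L2SRAM, and `sobel_rows` and `sobel_blockwise` are hypothetical names with a plain reference Sobel in place of the VXLIB kernel. Tightly packed buffers are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Plain C 3x3 Sobel over `rows` tightly packed input lines (reference only,
 * not the VXLIB kernel). Writes (rows - 2) lines of (width - 2) pixels. */
void sobel_rows(const uint8_t *src, int width, int rows,
                int16_t *gx, int16_t *gy)
{
    for (int y = 0; y + 2 < rows; y++) {
        const uint8_t *r0 = src + (size_t)y * width;
        const uint8_t *r1 = r0 + width;
        const uint8_t *r2 = r1 + width;
        for (int x = 0; x + 2 < width; x++) {
            size_t o = (size_t)y * (width - 2) + x;
            gx[o] = (int16_t)((r0[x + 2] + 2 * r1[x + 2] + r2[x + 2])
                            - (r0[x] + 2 * r1[x] + r2[x]));
            gy[o] = (int16_t)((r2[x] + 2 * r2[x + 1] + r2[x + 2])
                            - (r0[x] + 2 * r0[x + 1] + r0[x + 2]));
        }
    }
}

/* Block-wise variant: scratch_in stands in for L2SRAM and memcpy stands in
 * for the DMA transfer. scratch_in must hold (block_rows + 2) * width bytes. */
void sobel_blockwise(const uint8_t *src, int width, int height, int block_rows,
                     uint8_t *scratch_in, int16_t *gx, int16_t *gy)
{
    assert(width >= 3 && height >= 3 && block_rows >= 1);
    int out_rows = height - 2;
    for (int y0 = 0; y0 < out_rows; y0 += block_rows) {
        int n_out = (out_rows - y0 < block_rows) ? (out_rows - y0) : block_rows;
        int n_in = n_out + 2;  /* a 3x3 window needs two extra lines of overlap */
        /* "DMA in": fetch the input lines needed for this block. */
        memcpy(scratch_in, src + (size_t)y0 * width, (size_t)n_in * width);
        /* Run the kernel on the local block only. */
        sobel_rows(scratch_in, width, n_in,
                   gx + (size_t)y0 * (width - 2),
                   gy + (size_t)y0 * (width - 2));
    }
}
```

Note the two-line overlap between consecutive input blocks: without it, the output lines at block boundaries would be missing. On the DSP the output would normally also go through a local buffer and be transferred out by DMA; that step is left out here to keep the sketch short.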

0 parth Modi over 8 years ago in reply to Jesse Villarreal

Intellectual 310 points

Hi,

Thanks for the quick response.

As per your suggestion, we experimented with different stride values and checked the return values of "VXLIB_sobelX_3x3_i8u_o16s" and "VXLIB_sobelX_3x3_i8u_o16s_checkParams". Both returned 0 (i.e. SUCCESS, no error in the passed arguments). We passed the parameters listed below.

src_addr.dim_x = width;
src_addr.dim_y = height;
src_addr.stride_y = inPitch[0];
src_addr.data_type = VXLIB_UINT8;

dst_x_addr.dim_x = width;
dst_x_addr.dim_y = height-2;
dst_x_addr.stride_y = (outPitch[0]*2);
dst_x_addr.data_type = VXLIB_INT16;

The APIs return zero (success), but we are still unable to get proper output on the display. The output is highly distorted.

NOTE: We are applying the Sobel operator to the luma plane of a YUV420 video frame. The chroma plane has been masked out.

Thanks,

Parth Modi

0 Jesse Villarreal over 8 years ago in reply to parth Modi

TI__Expert 5625 points

Parth,

The parameter settings you posted look correct, as confirmed by the API returning SUCCESS. It is difficult for me to help you without more information. If you have a JTAG, I suggest stepping through the code and checking that the input/output buffers are being updated as you expect. The display corruption could be caused by some later process overwriting the output, or perhaps by improper pointers being passed to the functions. It is a good idea to try to narrow down where the problem is.

Jesse

0 parth Modi over 8 years ago in reply to Jesse Villarreal

Intellectual 310 points

Hi Jesse,

We appreciate your support.

We now get proper output for Canny edge detection, but we are facing a DSP processor overshoot problem.

As per your suggestion in your reply on Oct 13, 2017:

"if the DSP doesn't have the cache turned on or mapped to the appropriate DDR address range, then this would be a problem. If the cache configured properly, then a further optimization can be to use the DMA to move blocks of data into/out of the L2SRAM and operate this API on one block at a time"

Please address the queries below regarding that reply:

1) What do you mean by the cache being configured properly? Could you please suggest a better way to configure the cache so that we can optimise the algorithm's performance?
2) Could you also share the documentation for configuring the DSP1 L2SRAM cache, as well as for code optimization?

Please let me know if you need more information on our side.

Thanks,
Parth Modi

0 Jesse Villarreal over 8 years ago in reply to parth Modi

TI__Expert 5625 points

> 1) What do you mean by cache configured properly (could you please suggest a better way to configure the cache so that we can optimise algorithm process performance)?

The L2SRAM is 256KB + 32KB. The 32KB portion is always used as memory-mapped SRAM only. The remaining 256KB can be configured as cache, as memory-mapped SRAM, or as a mixture of both (see the TRM). If you configure a portion of this RAM as cache, a series of registers then determines which memory pages in the L3 memory map are "cacheable" and which are non-cacheable.

If you are using VLIB functions from within the VSDK from TI on a TI EVM, then the cache should already be configured "properly", that is, the L2SRAM has cache at least partially configured, and the memory used for shared image buffers is configured as cacheable. Some regions are non-cacheable; these typically hold smaller data structures used for synchronization and locks between cores.

If you are using the VLIB test bench stand-alone (bare metal), without the VSDK BIOS and applications running, the test bench has a function called at the beginning of main that configures these cache registers so that the heap from which the image buffers are allocated lies in cacheable memory address ranges.

The case you may need to worry about is when you have changed the memory map from the VSDK default, such as for a custom board design, or when the memory you are accessing for image buffers in VLIB was allocated within the non-cacheable memory regions.

In any of these cases, I suggest you refer to the document below, together with the relevant memory map in the TRM, and use JTAG to check the cache registers. Confirm that the memory address ranges where you access code or data fall within the "cacheable" memory regions as configured in these registers.
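
As a rough aid for that check, the sketch below tests a snapshot of the cacheability registers against a buffer address range. It rests on assumptions that should be verified against the TRM and the C66x CorePac documentation: that there are 256 MAR registers, that each covers a 16 MB region, and that bit 0 (PC) marks the region as cacheable. The names `mar_index` and `range_is_cacheable` are hypothetical, and the snapshot is assumed to have been copied out by hand, for example from a JTAG memory window.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAR_COUNT        256u  /* assumed: 256 MAR registers cover the 4 GB space */
#define MAR_REGION_SHIFT 24u   /* assumed: each MAR covers 16 MB (2^24 bytes) */
#define MAR_PC_BIT       0x1u  /* assumed: bit 0 (PC) marks the region cacheable */

/* Index of the MAR register that covers a given address. */
uint32_t mar_index(uint32_t addr)
{
    return addr >> MAR_REGION_SHIFT;
}

/* Returns 1 if every 16 MB region touched by [start, start + size) has its
 * PC bit set in mar_snapshot (MAR_COUNT values), otherwise 0. */
int range_is_cacheable(const uint32_t *mar_snapshot, uint32_t start, uint32_t size)
{
    assert(mar_snapshot != NULL && size > 0);
    assert(size - 1 <= UINT32_MAX - start);  /* the range must not wrap */

    uint32_t first = mar_index(start);
    uint32_t last = mar_index(start + (size - 1));
    assert(last < MAR_COUNT);

    for (uint32_t i = first; i <= last; i++) {
        if ((mar_snapshot[i] & MAR_PC_BIT) == 0) {
            return 0;  /* at least one region in the range is non-cacheable */
        }
    }
    return 1;
}
```

For example, under these assumptions an image buffer at 0x84000000 falls under MAR index 132, so that is the register whose PC bit needs to be set for the buffer to be cached.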

> 2) Could you also share the document for configuring DSP1 L2SRAM cache as well as code optimization?

You can refer to:
