
How to use DMA to improve the efficiency of format conversion from YCbCr422 to RGB888

Dear all:

Most image-processing algorithms work on RGB888, so our first step is a format conversion from YCbCr422 to RGB888. Why is the efficiency so low when we do this conversion? It takes almost 1 second, which rules out any real-time application. We have tried raising the optimization level in the build options, but the improvement is limited. Can anyone suggest how to improve the efficiency? The demo code is below.

 

void ycbcr2rgb(Uint8* src, Int32 width, Int32 height, Uint8* des)
{
        Int32 byte_count_line_rgb = width * 3;
        Int32 byte_count_line_yuv = width * 2;
        Int32 i, j, k;
        Uint8 temp[4];
        float p1, p2, p3;

        for (j = 0; j < height; j++) /* convert from DDR2 buffer_in to buffer_out */
        {
          k = 0;
          for (i = 0; i < byte_count_line_yuv; i = i + 4) /* one line, 2 pixels per step */
          {
            temp[0] = src[j * byte_count_line_yuv + i];     /* cb0 */
            temp[1] = src[j * byte_count_line_yuv + i + 1]; /* y0  */
            temp[2] = src[j * byte_count_line_yuv + i + 2]; /* cr0 */
            temp[3] = src[j * byte_count_line_yuv + i + 3]; /* y1  */

                p1 = (temp[1] - 16) * 1.164 + (temp[2] - 128) * 1.596;                           /* r0 */
                p2 = (temp[1] - 16) * 1.164 - (temp[2] - 128) * 0.813 - (temp[0] - 128) * 0.392; /* g0 */
                p3 = (temp[1] - 16) * 1.164 + (temp[0] - 128) * 2.017;                           /* b0 */
                des[j * byte_count_line_rgb + k]     = (Uint8)p1; /* r0 */
                des[j * byte_count_line_rgb + k + 1] = (Uint8)p2; /* g0 */
                des[j * byte_count_line_rgb + k + 2] = (Uint8)p3; /* b0 */

                p1 = (temp[3] - 16) * 1.164 + (temp[2] - 128) * 1.596;                           /* r1 */
                p2 = (temp[3] - 16) * 1.164 - (temp[2] - 128) * 0.813 - (temp[0] - 128) * 0.392; /* g1 */
                p3 = (temp[3] - 16) * 1.164 + (temp[0] - 128) * 2.017;                           /* b1 */
                des[j * byte_count_line_rgb + k + 3] = (Uint8)p1; /* r1 */
                des[j * byte_count_line_rgb + k + 4] = (Uint8)p2; /* g1 */
                des[j * byte_count_line_rgb + k + 5] = (Uint8)p3; /* b1 */

                k = k + 6;
          }
        }
}


 http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/99/p/121390/434701.aspx#434701

According to a suggestion in the thread above, using DMA can relieve the memory read/write bottleneck. Can anyone answer this question?

 

Best regards,

Alan

  • Hi Alan,

    Please refer to this User Guide for DMA setup:  http://focus.ti.com/lit/ug/spru987a/spru987a.pdf.

    Regards,

    Viet

  • Dear Viet:

We already know this document, but please give us some constructive suggestions or hints.

    Best regards,

    Alan

     

  • If you are using the DM6437 or a similar DSP without a floating-point unit, using the DMA isn't going to give you any speed benefit for this function. The bottleneck is the large number of floating-point multiplies and additions you do for every pixel in your image. Why do you want to convert to RGB888 for image processing?
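
    If you do need RGB888, you can also drop the floating point entirely. Below is a minimal fixed-point sketch of the same conversion with the coefficients pre-scaled by 256; it assumes the same Cb/Y0/Cr/Y1 byte order as the posted code, and the clip() helper and function name are illustrative, not from VLIB or any TI library:

    typedef unsigned char Uint8;   /* normally from tistdtypes.h */
    typedef int           Int32;

    static Uint8 clip(Int32 v)     /* saturate to the 0..255 range */
    {
        return (Uint8)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }

    void ycbcr2rgb_fixed(const Uint8 *src, Int32 width, Int32 height, Uint8 *des)
    {
        Int32 n;
        Int32 pairs = width * height / 2;   /* each 4-byte group holds 2 pixels */

        for (n = 0; n < pairs; n++) {
            Int32 cb = src[0] - 128;
            Int32 y0 = src[1] - 16;
            Int32 cr = src[2] - 128;
            Int32 y1 = src[3] - 16;

            Int32 rC = 409 * cr;             /* 1.596 * 256 */
            Int32 gC = -208 * cr - 100 * cb; /* -0.813 and -0.392, scaled by 256 */
            Int32 bC = 516 * cb;             /* 2.017 * 256 */

            des[0] = clip((298 * y0 + rC) >> 8); /* r0 (298 = 1.164 * 256) */
            des[1] = clip((298 * y0 + gC) >> 8); /* g0 */
            des[2] = clip((298 * y0 + bC) >> 8); /* b0 */
            des[3] = clip((298 * y1 + rC) >> 8); /* r1 */
            des[4] = clip((298 * y1 + gC) >> 8); /* g1 */
            des[5] = clip((298 * y1 + bC) >> 8); /* b1 */

            src += 4;
            des += 6;
        }
    }

    Integer multiplies and shifts map directly onto the C64x+ datapath, so this alone should give a large speedup before you touch the DMA.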

  • Dear MattLipsey:

    In fact, we found that the performance is poor even if we just move image data between external memories without doing any fixed-point calculation. That is why we thought the bottleneck is moving data between external memory and the L1/L2 cache, and why we assumed TI put so much effort into developing the EDMA3 and IDMA interfaces.

    As for why we need to convert to RGB888: the algorithm we developed is based on the RGB color model. Maybe it could also be adapted to the YCbCr model, I guess.

     

    Regards,

    Alan

  • So if it takes ~1 sec to do the conversion, how long does it take to do a comparably sized memory move? What are the dimensions of your image? Do you have cache enabled for both the source & destination addresses?

  • Dear MattLipsey:

    The image dimensions are 720x480. Because of the real-time requirement, we use VLIB instead of our own C code. We are new to the TI DM6437, especially the IDMA and EDMA3; we are trying to understand how to control them, and we do not know how to optimize the whole data flow between L1/L2/DDR2.

    So far we are able to use QDMA/EDMA to move an image from the VPFE buffer in DDR2 (source address) to the VPBE buffer in DDR2 (destination address). But we do not know how to combine the IDMA and EDMA interfaces to speed up image processing, i.e., to let the CPU fetch program and data from the L1/L2 memories efficiently.

     

    Regards,

    Alan

     

  • To understand using the EDMA to optimize image processing, you can look at the document SPRAAN4A, specifically the section on ping-pong buffering.

    To understand cache, you can look at the document SPRU862. Here is a function that does a very basic cache setup, making all DDR2 locations cacheable and setting the L1/L2 caches to their maximum size:

    void cacheInit()
    {
        volatile Uint32 *marPtr;
        Uint32 i;

        #define SIZE_DDR2        0x10000000
        #define MAR_STEP_SIZE    0x01000000

        CACHE_L1PINV = 1;    // L1P invalidated
        CACHE_L1PCFG = 7;    // L1P on, MAX size
        CACHE_L1DINV = 1;    // L1D invalidated
        CACHE_L1DCFG = 0;    // L1D off while the MARs are configured
        CACHE_L2INV  = 1;    // L2 invalidated
        CACHE_L2CFG  = 3;    // 128k L2 cache enabled
        i = CACHE_L2CFG;     // read back so the mode change completes

        marPtr = (volatile Uint32 *)0x01848200;    // MAR for base of DDR2 @ 0x80000000

        // set the cacheability bit for each 16 MB region of DDR2
        for (i = 0; i < SIZE_DDR2; i += MAR_STEP_SIZE) {
            *marPtr++ = 1;
        }

        CACHE_L1DCFG = 0x00000004; // grab 32k of cache
        i = CACHE_L1DCFG;    // read back to complete the cache mode change
    }

    You can do a quick calculation of your DDR2 bandwidth to see how fast you should be able to move data around in DDR2. If you don't get at least 80% of the theoretical max bandwidth for a simple block-move operation done by the processor, you haven't set something up correctly.
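
    If you want to actually measure that, here is a minimal sketch using the C64x+ cycle counter (TSCL, declared in the TI compiler's c6x.h); the buffer addresses and size are placeholders for your own DDR2 map:

    #include <c6x.h>    /* TI compiler header: declares the TSCL cycle counter */
    #include <stdio.h>

    /* Hypothetical DDR2 test buffers -- adjust addresses and size for your map. */
    #define BLK_WORDS (256 * 1024)                               /* 1 MB per buffer */
    volatile unsigned int *srcBuf = (unsigned int *)0x80000000;
    volatile unsigned int *dstBuf = (unsigned int *)0x81000000;

    void measureBlockMove(void)
    {
        unsigned int t0, t1, i;

        TSCL = 0;                       /* any write starts the counter */
        t0 = TSCL;
        for (i = 0; i < BLK_WORDS; i++)
            dstBuf[i] = srcBuf[i];
        t1 = TSCL;

        /* 4 bytes read + 4 bytes written per word moved */
        printf("%u cycles, %.2f bytes/cycle\n",
               t1 - t0, (8.0 * BLK_WORDS) / (t1 - t0));
    }

    Multiply the bytes/cycle figure by your CPU clock and compare it against the theoretical DDR2 bandwidth.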

    My recommendation is to optimize your system for cache first and see if that will meet your requirements. It is significantly less complicated than trying to do EDMA double buffering. But again, if you are doing multiple floating-point operations on every pixel during your processing, you are going to be limited by the overhead of function calls and processing instead of memory bandwidth.

  • Dear MattLipsey:

     Thanks for your suggestion; we still have other questions about this.

    (1)  What does the memory map look like with your cache settings? Below is our original setting, without the L2 cache enabled. Is the L2 physical start address 0x00800000 or 0x10800000? Do L1P and L1D also have physical addresses?

    (2)  How should we define the physical memory addresses in the MEMORY map below to split L2 between RAM and cache?

    (3)  How do we calculate the bandwidth of DDR2?

     

    ===============================================

     

    -l rts64plus.lib
    -l evmdm6437bsl.lib
    -l vlib.l64P

    -stack          0x00010000      /* Stack Size */
    -heap           0x00400000      /* Heap Size */

    MEMORY
    {
        INT:        o = 0x10800400  l = 0x00000400
        L2RAM:      o = 0x10800800  l = 0x00020000
        DDR2:       o = 0x80000000  l = 0x10000000
    }

    SECTIONS
    {
        .bss        >   L2RAM
        .cinit      >   L2RAM
        .cio        >   L2RAM
        .const      >   L2RAM
        .data       >   L2RAM
        .far        >   L2RAM
        .stack      >   L2RAM
        .switch     >   L2RAM
        .text       >   L2RAM
        .sysmem     >   DDR2
        .vectors    >   INT
    }

     

     

    Regards,

    Alan

     

  • Alan,

    1) I basically put all the code sections in DDR2, and put my stack, bss, & interrupt sections in L1DRAM.  That lets me use 100% of L2 memory for cache.  The internal memory sections are all double mapped, but in my linker command file I have defined:

    MEMORY
    {
        IRAM      : origin = 0x10800000,  len = 0x20000      // 128 kB
        CACHE_L1P : origin = 0x10e08000,  len = 0x8000       // 32 kB
        L1D       : origin = 0x10f04000,  len = 0x14000      // 80 kB
        FLASH     : origin = 0x42000000,  len = 0x01000000   // 16 MB
        DDR2      : origin = 0x80000000,  len = 0x10000000   // 256 MB
    }

    You can get up to 32k of the 80k space in L1D as cache, in which case you will need to reduce the size of the L1D section accordingly (cache sits at the high side of that memory).

    L1P has a physical address, but if you use it as cache you don't need to include it in your linker file.

    2) This question is a little confusing to me.  Think of cache like an area of memory that nothing is allocated to.  When the processor tries to access data (say in your .far space) that data is cached to speed up access.  This is different from using the area as SRAM.  If you want to use 100% IRAM section as cache, then just don't allocate anything to that section.  If you want to use 50% as cache, then just reduce the size by 50%... you have to pretend like cache isn't there since you can't allocate anything to it.  Then you have to make sure you have configured the appropriate registers correctly.  Things are more complicated if you want to switch cache/sram use dynamically.
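
    As a concrete example, to split the 128k L2 between SRAM and cache you would shrink the linker region and set the matching mode bits. This is only a sketch based on the register writes in cacheInit() above; verify the L2MODE encoding against SPRU862 for your device:

    /* Use half of the 128k L2 as cache, half as SRAM.  In the linker
     * command file, shrink the region so nothing is allocated to the
     * upper half (cache sits at the top of the memory):
     *     IRAM : origin = 0x10800000,  len = 0x10000    // lower 64 kB = SRAM
     */
    Uint32 i;
    CACHE_L2INV = 1;     /* invalidate L2 before changing its mode */
    CACHE_L2CFG = 2;     /* L2MODE = 010b: 64k of L2 as cache */
    i = CACHE_L2CFG;     /* read back so the mode change completes */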

    3)  To calculate ideal DDR2 bandwidth, figure out how fast your clock is & how many bytes you bring in per clock.  So if you had a 150MHz ddr2 clock, and a 32 bit wide data bus, you would get data on each edge for a rate of ~ (150 MHz x 2 edges x 4 bytes per edge) = 1.2 GBytes/second.  You can apply a fudge factor to account for refresh rates, page changes, etc., keeping in mind that this applies for sequential data access, not random data access.  That should give you a ballpark estimate for best case data access, which you can compare to your current data access speed.

  • Dear MattLipsey:

     

    Thanks for your kind help. This problem troubled us for a long time, and following your suggestion there has been a significant improvement in image data access with the cache enabled.

    It is now much clearer to us how to use the cache to speed up data access, and it reminds us of a document on the TI wiki (below) that describes the target we want to reach:

    http://processors.wiki.ti.com/index.php/C64x%2B_iUniversal_Codec_Creation_-_from_memcpy_to_Canny_Edge_Detector

     

    “The benchmarking of the slicing implementation is shown in Table 2 which also shows for comparison the caching data”

     

    Function                            Slice N=3 (ms)  Slice N=7 (ms)  Gaussian Cache, Slice N=3 (ms)  Cache (ms)
    Gaussian Filtering                  4.8             4.8             5.4                             7.0
    Gradient Calculation                0.7             0.8             0.7                             16.9
    Non-maximal Suppression             5.7             5.8             5.7                             12.4
    Double Thresholding                 2.4             2.9             2.4                             8.0
    Edge Relaxation                     2.1             3.0             2.0                             2.2
    Slice DMA/cache management          4.8             2.5             3.1                             -
    Canny Total                         20.5            19.8            19.3                            46.5
    Preprocessing - Luma Extraction     1.7             1.5             1.7                             3.9
    Pre-processing - Chroma insertion   8.3             8.0             8.3                             10.6
    Pre-Post Processing Total           10.0            9.5             10.0                            14.5
    Total                               31.2            28.8            30.3                            61.3

     

    We have some questions about this topic:

    (1)  What are the differences in L1 and L2 cache configuration between the "frame/cache" and "slicing" modes?

    (2)  Why is slicing more effective than frame/cache?

    (3)  What is the data-access flow for image processing? Is it right to move data between L1 and L2 using the IDMA and between L2 and DDR2 using the EDMA? Or should we move data directly between L1 and DDR2 using the EDMA?

     

    We do not really understand the principle discussed in this article; could you explain it in more detail for us? Thanks.

    Regards,

    Alan

     

  • 1)  In the "slicing" mode, I believe this example is configuring L1D as SRAM and using the EDMA to directly transfer data between DDR2 & L1DSRAM.  "Slicing" refers to the fact that a slice of the image is processed at a time: EDMA moves 3 lines into SRAM, cpu processes it in place, then EDMA moves processed data back to DDR2, all wrapped up in a rolling buffer scheme that minimizes the number of times a piece of data is accessed. 

    In the "frame/cache" mode, L1D (and maybe L2 also) is configured as cache.  The cpu processes the entire data frame and accesses to/from DDR2 are handled by the cache.

    2) Slicing is more effective because of the data access pattern of the algorithm. Notice this section of your linked article:

    Using Slices to process data in Internal L1 Memory

    As noted before all the functions except VLIB_hysteresisThresholding() operate on lines in the image. This means that they are suitable for processing in slices. The principle of using slices can be summarized as follows: [...]

    The benefit of this technique scales with the number of data operands that the function has to process. The cache is able to bring in lines of 128 bytes to on chip memory but there is still a significant penalty for the first cache miss. The slicing technique takes advantage of a priori knowledge of the algorithm which allows more efficient use of the DMA to bring in Kbytes of required data as efficiently as possible.

     

    3) I believe that in cases where your image processing algorithm is appropriate for this "ping pong" style buffer scheme, you will want to use the EDMA to move data directly between L1D & DDR2. I haven't seen examples of anyone using the IDMA on this forum. Remember, the CPU has to get data into L1D before it can process it. This is accomplished either by the cache controller or by moving it there explicitly. The more intermediate steps your data has to take, the longer it takes to process it.
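
    To make the flow concrete, here is a rough skeleton of that ping-pong scheme. It is only a sketch: edmaCopy()/edmaWait()/processSlice() are placeholders for your EDMA3 driver calls and per-slice algorithm, the "L1DSRAM" section name depends on your linker command file, and input/output slices are shown the same size for simplicity:

    typedef unsigned char Uint8;   /* normally from tistdtypes.h */
    typedef int           Int32;

    /* Placeholders for your EDMA3 driver and per-slice algorithm: */
    extern void edmaCopy(void *dst, const void *src, Int32 bytes); /* submit transfer  */
    extern void edmaWait(void);                                    /* block until done */
    extern void processSlice(Uint8 *buf, Int32 lines);             /* CPU work in place */

    #define SLICE_LINES 3
    #define LINE_BYTES  (720 * 2)
    #define SLICE_BYTES (SLICE_LINES * LINE_BYTES)

    /* Two slice buffers placed in L1D SRAM via the linker */
    #pragma DATA_SECTION(ping, "L1DSRAM")
    #pragma DATA_SECTION(pong, "L1DSRAM")
    Uint8 ping[SLICE_BYTES];
    Uint8 pong[SLICE_BYTES];

    void processFrame(Uint8 *ddrIn, Uint8 *ddrOut, Int32 lines)
    {
        Uint8 *cur = ping, *next = pong, *tmp;
        Int32 s, slices = lines / SLICE_LINES;

        edmaCopy(cur, ddrIn, SLICE_BYTES);      /* prime the first slice */
        edmaWait();

        for (s = 0; s < slices; s++) {
            if (s + 1 < slices)                 /* prefetch slice s+1 while... */
                edmaCopy(next, ddrIn + (s + 1) * SLICE_BYTES, SLICE_BYTES);

            processSlice(cur, SLICE_LINES);     /* ...the CPU works in L1D */

            edmaWait();                         /* make sure the prefetch is done */
            edmaCopy(ddrOut + s * SLICE_BYTES, cur, SLICE_BYTES); /* write back */
            edmaWait();

            tmp = cur; cur = next; next = tmp;  /* swap ping/pong */
        }
    }

    A real implementation would also overlap the write-back with processing, which is what the rolling scheme in the wiki article does.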

  • Dear MattLipsey:

     

    We really appreciate your help. We will now study the documents and try to establish the ping-pong buffer between L1D and DDR2. We may ask you about this topic again if we run into problems. Thanks again.

     

    Regards,

    Alan