CCS/TDA2EVM5777: Kernel porting on DSP

vikas patel

Part Number: TDA2EVM5777

Tool/software: Code Composer Studio

Hi,

We are developing a simple kernel (Sobel) and optimizing on DSP. We have taken reference from C66x Sobel code but after processing we are not getting expected result.

Could anyone help me to find the problem in code?

Thanks

Vikasdsp_sobel_kernel.rar

attaching the project for reference

over 6 years ago

0 Yordan Kamenov over 6 years ago

TI__Mastermind 42515 points

Hi Vikas,

I have forwarded your question to DSP experts.

Regards,
Yordabn

0 Shyam Jagannathan over 6 years ago in reply to Yordan Kamenov

TI__Genius 10345 points

Hi Vikas,

Can you please elaborate? It will help me identify the issue faster?

Regards,
Shyam

0 vikas patel over 6 years ago in reply to Shyam Jagannathan

Prodigy 170 points

Hi Vikas,

Can you please elaborate? It will help me identify the issue faster?

Regards,
Shyam

Hi Shyam,

Please check this code if you see anything wrong in sobel edge detection..

I also attaching output which i receive from this code ................

void sobel_dsp_kernel_optimised(uint8_t *inPtr, uint8_t *outPtr, uint32_t width, uint32_t height )
{

       uint8_t *inptr_a1[WIDTH],*inptr_a2[WIDTH],*inptr_a3[WIDTH]; // input buffer to process data
       uint8_t buffer[WIDTH]; // output buffer to process data
       uint8_t *temp1, *inptr1, *inptr2, *inptr3, *outpointer , *inpointer;

       inptr1 = (uint8_t *)inptr_a1 ;
       inptr2 = (uint8_t *)inptr_a2 ;
       inptr3 = (uint8_t *)inptr_a3;

       inpointer = inPtr;
       outpointer = outPtr;
       // copy first row
       memcpy(inptr1 , inpointer , WIDTH);
       inpointer += WIDTH;

       // copy second row
       memcpy(inptr2 , inpointer , WIDTH);
       inpointer += WIDTH;

       // copy third row
       memcpy(inptr3 , inpointer , WIDTH);
       inpointer += WIDTH;

       // copy first row of input directly to output buffer as we are not going to get processed first row
      memcpy(outpointer , inptr1 , WIDTH);
      outpointer += WIDTH;

        int32_t x,y;
        int64_t mask1_8 = 0x0101010101010101;
        int64_t mask2_8 = 0x0202020202020202;
        uint32_t mask1_4 = _loll(mask1_8);
        uint32_t mask2_4 = _loll(mask2_8);
        int64_t r0, r1, r2, r01_lo, r01_hi, r012_lo, r012_hi, r012_lo_d2, r012_hi_d2, c02_lo, c02_hi;
        __x128_t r0_2, r1_2, r2_2;


        for ( y = 1; y < (height-1); y++)
         {
            uint32_t x0_3210 = _mem4_const(inptr1);
            uint32_t x1_3210 = _mem4_const(inptr2);
            uint32_t x2_3210 = _mem4_const(inptr3);

            /* Convert from 8bpp to 16bpp so we can do SIMD addition of rows */
            int64_t x0_2 = _mpyu4ll(x0_3210, mask1_4);
            int64_t x1_2 = _mpyu4ll(x1_3210, mask2_4);
            int64_t x2_2 = _mpyu4ll(x2_3210, mask1_4);

                   /* Add rows 0+1+2, column-wise */
            int32_t x01    = _add2((int32_t)_loll(x0_2), (int32_t)_loll(x1_2));
            uint32_t xaxb  = (uint32_t)_add2(x01, (int32_t)_loll(x2_2));


            for(x=0; x < (width/8); x++)
            {
                /* Read 8 bytes from each of the 3 lines. */
                 r0 = _mem8_const(&inptr1[(x*8) + 2]);
                 r1 = _mem8_const(&inptr2[(x*8) + 2]);
                 r2 = _mem8_const(&inptr3[(x*8) + 2]);

                  /* Convert from 8bpp to 16bpp so we can do SIMD addition of rows */
                 r0_2 = _dmpyu4(r0, mask1_8);
                 r1_2 = _dmpyu4(r1, mask2_8);
                 r2_2 = _dmpyu4(r2, mask1_8);

                     /* Add rows 0+1, column-wise */
                 r01_lo = _dadd2(_lo128(r0_2), _lo128(r1_2));
                 r01_hi = _dadd2(_hi128(r0_2), _hi128(r1_2));

                           /* Add previous sum to row 2, column-wise */
                 r012_lo = _dadd2(r01_lo, _lo128(r2_2));
                 r012_hi = _dadd2(r01_hi, _hi128(r2_2));

                           /* Left shift this sum by 2 pixels, using history */
                 r012_lo_d2 = _itoll(_loll(r012_lo), xaxb);
                 r012_hi_d2 = _itoll(_loll(r012_hi), _hill(r012_lo));

                  c02_lo = _dsub2(r012_lo, r012_lo_d2);
                  c02_hi = _dsub2(r012_hi, r012_hi_d2);

                           /* save overlap data for next iteration */
                  xaxb = (uint32_t)_mvd((int32_t)_hill(r012_hi));

                           /* store 8 sobel filter values */
                  _amem8(&buffer[x*8])     = c02_lo;
                  _amem8(&buffer[(x*8)+4]) = c02_hi;

            }

            memcpy(outpointer, buffer ,WIDTH);
            outpointer += WIDTH;

            // reshuffle the pointer and read one more row
            temp1 = inptr2;
            inptr2 = inptr3;
            inptr1 = temp1;

             // copy third row
             memcpy(inptr3 , inpointer , WIDTH);
             inpointer += WIDTH;
         }
  }

0 Shyam Jagannathan over 6 years ago in reply to vikas patel

TI__Genius 10345 points

Hi Vikas,

I haven't tried this code yet, but at first glance, I see a potential problem here,

_amem8(&buffer[x*8]) = c02_lo;

_amem8(&buffer[(x*8)+4]) = c02_hi;

When you are using amem8 instead of mem8, you need to be sure that address is aligned to 64-bit boundary. Seeing that buffer[] is allocated on the stack, there is no guarantee that its aligned to 64-bit boundary. As a quick fix, can you use mem8 instead of amem8 for the stores?

Regards,
Shyam

0 vikas patel over 6 years ago in reply to Shyam Jagannathan

Prodigy 170 points

Hi Shyam,
This doesn' t work.

0 Jesse Villarreal over 6 years ago in reply to vikas patel

TI__Expert 5485 points

Vikas,

I took a look at your main.c, and it looks like you used VXLIB_sobelX_3x3_i8u_o16s.c as your reference, is that correct? If so, this code only generates the X gradient in int16_t format, but you are updating the output pointer assuming it is 8 bits per pixel, so that is one issue. Your buffer[WIDTH] array is 8 bits, so 16 bit writes to it will overflow this buffer by 2x. You can see that for 8 input pixels, 16 output bytes are being written.

Also, your reference C looks like it generates both X and Y gradients, and combines them into a single 8 bit output, which the optimized code doesn't do, so that is another issue. If you need both X and Y, I suggest you reference the VXLIB_sobel_3x3_i8u_o16s_o16s function, which computes both X and Y gradients in separate buffers.

Best Regards,

Jesse

0 vikas patel over 6 years ago in reply to Jesse Villarreal

Prodigy 170 points

Yes, we have taken reference from the same file and we were trying to find the edges in only one direction which is reflecting form the code itself and you are correct the problem was with storing the data . we have added this code at the end of the inner loop and it starts giving proper output.

int16_t pixel[8];
_mem8(&pixel[0]) = c02_lo;
_mem8(&pixel[0+4]) = c02_hi;

for (i=0;i<8;i++)
{
pixel[i] = abs(pixel[i]);
if(pixel[i] >255)
pixel[i] = 255;
buffer[(x*8)+ i] = pixel[i];
}

Any help to replace this piece of code with intrinsics .

Thanks for your support.
Vikas

0 Shyam Jagannathan over 6 years ago in reply to vikas patel

TI__Genius 10345 points

Vikas,

I have not tested this code, but you can try something like this to eliminate the loop and avoid too many load/stores to the memory.

===========================================
for (i=0;i<8;i++)
{
pixel[i] = abs(pixel[i]);
if(pixel[i] >255)
pixel[i] = 255;
buffer[(x*8)+ i] = pixel[i];
}
===========================================

//Replicate the clip value 4 times and store in a 64-bit register
uint32_t clip = (255 << 16) | 255;
uint64_t clip_255k = _itoll(clip, clip);

//Load 8, 16-bit values
uint64_t pixel_0123 = _mem8(&pixel[0]);
uint64_t pixel_4567 = _mem8(&pixel[0 + 4]);

//Compute absolute of 2 16-bit values at a time
uint32_t abs_01 = _abs2(_loll(pixel_0123));
uint32_t abs_23 = _abs2(_hill(pixel_0123));
uint32_t abs_45 = _abs2(_loll(pixel_4567));
uint32_t abs_67 = _abs2(_hill(pixel_4567));

//Pack them into groups of 4
uint64_t abs_0123 = _itoll(abs_01, abs_23);
uint64_t abs_4567 = _itoll(abs_45, abs_67);

//Check if values are greater than 255
uint32_t cmpgt_0123 = _dcmpgt2(abs_0123, clip_255k);
uint32_t cmpgt_4567 = _dcmpgt2(abs_4567, clip_255k);

//Expand the result bits to masks
uint32_t masks_0123 = _dxpnd2(cmpgt_0123);
uint32_t masks_4567 = _dxpnd2(cmpgt_4567);

//Use the mask to extract clip value where pixel > 255
uint32_t result_0123 = clip_255k & masks_0123;
uint32_t retuls_4567 = clip_255k & masks_4567;

//Use the mask to extract pixel value < 255
uint32_t result_0123 = result_0123 | (abs_0123 & ~masks_0123);
uint32_t retuls_4567 = result_4567 | (abs_4567 & ~masks_4567);

//Store the result in buffer (assuming buffer is 16-bit buffer)
_mem8(&buffer[0]) = result_0123;
_mem8(&buffer[0 + 4]) = result_4567;

Also, these documents will help you get familiar with the programming model, intrinsics and optimizations

www.ti.com/.../spru198k.pdf - TMS320C6000 Programmer's Guide
www.ti.com/.../sprugh7.pdf - TMS320C6600 CPU and Instruction Set
www.ti.com/.../sprabf2.pdf - Introduction to TMS320C6000 DSP Optimization
www.ti.com/.../sprabg7.pdf - Optimizing loops on the C66x DSP

Regards,
Shyam

0 vikas patel over 6 years ago in reply to Shyam Jagannathan

Prodigy 170 points

Thanks Shyam for taking time to understand the issue and giving appropriate suggestion.

I have converted the c code in less intrinsics by using _davg2 and _dmin2

//Compute absolute of 2 16-bit values at a time
abs_01 = _abs2(_loll(c02_lo));
abs_23 = _abs2(_hill(c02_lo));
abs_45 = _abs2(_loll(c02_hi));
abs_67 = _abs2(_hill(c02_hi));

//Pack them into groups of 4
c02_lo = _itoll(abs_23 , abs_01 );
c02_hi = _itoll(abs_67 , abs_45 );

//take the average of x direction and y direction edge
c012_lo = _davg2 ( c02_lo , c012_lo); where c012_lo is y direction edge
c012_hi = _davg2 ( c02_hi , c012_hi); where c012 _hi is y direction edge

// clamp the value with 0xff
c012_lo = _dmin2(c012_lo,MIN_VALUE);
c012_hi = _dmin2(c012_hi,MIN_VALUE);

/* store 8 sobel y filter values */
_mem8(&pixel[0]) = c012_lo;
_mem8(&pixel[0 + 4]) = c012_hi;

for (i=0;i<8;i++)
{
buffer[(x*8)+ i] = pixel[i];
}

Now only one thing is remaining , storing data in uint8_t array from int64_t variable.

Thanks

Vikas

0 Shyam Jagannathan over 6 years ago in reply to vikas patel

TI__Genius 10345 points

Vikas,

Since each uint64_t variable has 4 16-bit values you can direclty use SPACKU4 which will do both saturate (at 255) and pack results in 8-bit.
So you can avoid the dmin2(...) instruction.

Regards,
Shyam

0 vikas patel over 6 years ago in reply to Shyam Jagannathan

Prodigy 170 points

Yes Shyam, following lines convert and saturate result in 16 bits to 8 bits and store in 8 bits array.

_mem4( & buffer[(x*8)]) = _spacku4( _hill(c012_lo), _loll(c012_lo));
_mem4( & buffer[(x*8)+4]) = _spacku4( _hill(c012_hi), _loll(c012_hi));

Thanks for your support.

Vikas

Processors

Processors forum

CCS/TDA2EVM5777: Kernel porting on DSP