This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

CCS/TDA2EVM5777: Kernel porting on DSP

Part Number: TDA2EVM5777

Tool/software: Code Composer Studio

Hi,

We are developing a simple kernel (Sobel) and optimizing on DSP. We have taken reference from C66x Sobel code but after processing we are not getting expected result.

Could anyone help me to find the problem in code?

Thanks

Vikasdsp_sobel_kernel.rar

attaching the project for reference

  • Hi Vikas,

    I have forwarded your question to DSP experts.

    Regards,
    Yordabn
  • Hi Vikas,


    Can you please elaborate? It will help me identify the issue faster?

    Regards,
    Shyam

  • Hi Vikas,


    Can you please elaborate? It will help me identify the issue faster?

    Regards,
    Shyam

    Hi Shyam,
    Please check this code if you see anything wrong in sobel edge detection..

    I also attaching output which i receive from this code ................

    void sobel_dsp_kernel_optimised(uint8_t *inPtr, uint8_t *outPtr, uint32_t width, uint32_t height )
    {
    
           uint8_t *inptr_a1[WIDTH],*inptr_a2[WIDTH],*inptr_a3[WIDTH]; // input buffer to process data
           uint8_t buffer[WIDTH]; // output buffer to process data
           uint8_t *temp1, *inptr1, *inptr2, *inptr3, *outpointer , *inpointer;
    
           inptr1 = (uint8_t *)inptr_a1 ;
           inptr2 = (uint8_t *)inptr_a2 ;
           inptr3 = (uint8_t *)inptr_a3;
    
           inpointer = inPtr;
           outpointer = outPtr;
           // copy first row
           memcpy(inptr1 , inpointer , WIDTH);
           inpointer += WIDTH;
    
           // copy second row
           memcpy(inptr2 , inpointer , WIDTH);
           inpointer += WIDTH;
    
           // copy third row
           memcpy(inptr3 , inpointer , WIDTH);
           inpointer += WIDTH;
    
           // copy first row of input directly to output buffer as we are not going to get processed first row
          memcpy(outpointer , inptr1 , WIDTH);
          outpointer += WIDTH;
    
            int32_t x,y;
            int64_t mask1_8 = 0x0101010101010101;
            int64_t mask2_8 = 0x0202020202020202;
            uint32_t mask1_4 = _loll(mask1_8);
            uint32_t mask2_4 = _loll(mask2_8);
            int64_t r0, r1, r2, r01_lo, r01_hi, r012_lo, r012_hi, r012_lo_d2, r012_hi_d2, c02_lo, c02_hi;
            __x128_t r0_2, r1_2, r2_2;
    
    
            for ( y = 1; y < (height-1); y++)
             {
                uint32_t x0_3210 = _mem4_const(inptr1);
                uint32_t x1_3210 = _mem4_const(inptr2);
                uint32_t x2_3210 = _mem4_const(inptr3);
    
                /* Convert from 8bpp to 16bpp so we can do SIMD addition of rows */
                int64_t x0_2 = _mpyu4ll(x0_3210, mask1_4);
                int64_t x1_2 = _mpyu4ll(x1_3210, mask2_4);
                int64_t x2_2 = _mpyu4ll(x2_3210, mask1_4);
    
                       /* Add rows 0+1+2, column-wise */
                int32_t x01    = _add2((int32_t)_loll(x0_2), (int32_t)_loll(x1_2));
                uint32_t xaxb  = (uint32_t)_add2(x01, (int32_t)_loll(x2_2));
    
    
                for(x=0; x < (width/8); x++)
                {
                    /* Read 8 bytes from each of the 3 lines. */
                     r0 = _mem8_const(&inptr1[(x*8) + 2]);
                     r1 = _mem8_const(&inptr2[(x*8) + 2]);
                     r2 = _mem8_const(&inptr3[(x*8) + 2]);
    
                      /* Convert from 8bpp to 16bpp so we can do SIMD addition of rows */
                     r0_2 = _dmpyu4(r0, mask1_8);
                     r1_2 = _dmpyu4(r1, mask2_8);
                     r2_2 = _dmpyu4(r2, mask1_8);
    
                         /* Add rows 0+1, column-wise */
                     r01_lo = _dadd2(_lo128(r0_2), _lo128(r1_2));
                     r01_hi = _dadd2(_hi128(r0_2), _hi128(r1_2));
    
                               /* Add previous sum to row 2, column-wise */
                     r012_lo = _dadd2(r01_lo, _lo128(r2_2));
                     r012_hi = _dadd2(r01_hi, _hi128(r2_2));
    
                               /* Left shift this sum by 2 pixels, using history */
                     r012_lo_d2 = _itoll(_loll(r012_lo), xaxb);
                     r012_hi_d2 = _itoll(_loll(r012_hi), _hill(r012_lo));
    
                      c02_lo = _dsub2(r012_lo, r012_lo_d2);
                      c02_hi = _dsub2(r012_hi, r012_hi_d2);
    
                               /* save overlap data for next iteration */
                      xaxb = (uint32_t)_mvd((int32_t)_hill(r012_hi));
    
                               /* store 8 sobel filter values */
                      _amem8(&buffer[x*8])     = c02_lo;
                      _amem8(&buffer[(x*8)+4]) = c02_hi;
    
                }
    
                memcpy(outpointer, buffer ,WIDTH);
                outpointer += WIDTH;
    
                // reshuffle the pointer and read one more row
                temp1 = inptr2;
                inptr2 = inptr3;
                inptr1 = temp1;
    
                 // copy third row
                 memcpy(inptr3 , inpointer , WIDTH);
                 inpointer += WIDTH;
             }
      }

  • Hi Vikas,

    I haven't tried this code yet, but at first glance, I see a potential problem here,

        _amem8(&buffer[x*8])     = c02_lo;

        _amem8(&buffer[(x*8)+4]) = c02_hi;
    When you are using amem8 instead of mem8, you need to be sure that address is aligned to 64-bit boundary. Seeing that buffer[] is allocated on the stack, there is no guarantee that its aligned to 64-bit boundary. As a quick fix, can you use mem8 instead of amem8 for the stores?
    Regards,
    Shyam
  • Hi Shyam,
    This doesn' t work.
  • Vikas,

    I took a look at your main.c, and it looks like you used VXLIB_sobelX_3x3_i8u_o16s.c as your reference, is that correct?  If so, this code only generates the X gradient in int16_t format, but you are updating the output pointer assuming it is 8 bits per pixel, so that is one issue.  Your buffer[WIDTH] array is 8 bits, so 16 bit writes to it will overflow this buffer by 2x.  You can see that for 8 input pixels, 16 output bytes are being written.

    Also, your reference C looks like it generates both X and Y gradients, and combines them into a single 8 bit output, which the optimized code doesn't do, so that is another issue.  If you need both X and Y, I suggest you reference the VXLIB_sobel_3x3_i8u_o16s_o16s function, which computes both X and Y gradients in separate buffers.

    Best Regards,

    Jesse

  • Yes, we have taken reference from the same file and we were trying to find the edges in only one direction which is reflecting form the code itself and you are correct the problem was with storing the data . we have added this code at the end of the inner loop and it starts giving proper output.

    int16_t pixel[8];
    _mem8(&pixel[0]) = c02_lo;
    _mem8(&pixel[0+4]) = c02_hi;

    for (i=0;i<8;i++)
    {
    pixel[i] = abs(pixel[i]);
    if(pixel[i] >255)
    pixel[i] = 255;
    buffer[(x*8)+ i] = pixel[i];
    }

    Any help to replace this piece of code with intrinsics .

    Thanks for your support.
    Vikas
  • Vikas,

    I have not tested this code, but you can try something like this to eliminate the loop and avoid too many load/stores to the memory.

    ===========================================
    for (i=0;i<8;i++)
    {
    pixel[i] = abs(pixel[i]);
    if(pixel[i] >255)
    pixel[i] = 255;
    buffer[(x*8)+ i] = pixel[i];
    }
    ===========================================

    //Replicate the clip value 4 times and store in a 64-bit register
    uint32_t clip = (255 << 16) | 255;
    uint64_t clip_255k = _itoll(clip, clip);

    //Load 8, 16-bit values
    uint64_t pixel_0123 = _mem8(&pixel[0]);
    uint64_t pixel_4567 = _mem8(&pixel[0 + 4]);

    //Compute absolute of 2 16-bit values at a time
    uint32_t abs_01 = _abs2(_loll(pixel_0123));
    uint32_t abs_23 = _abs2(_hill(pixel_0123));
    uint32_t abs_45 = _abs2(_loll(pixel_4567));
    uint32_t abs_67 = _abs2(_hill(pixel_4567));

    //Pack them into groups of 4
    uint64_t abs_0123 = _itoll(abs_01, abs_23);
    uint64_t abs_4567 = _itoll(abs_45, abs_67);

    //Check if values are greater than 255
    uint32_t cmpgt_0123 = _dcmpgt2(abs_0123, clip_255k);
    uint32_t cmpgt_4567 = _dcmpgt2(abs_4567, clip_255k);

    //Expand the result bits to masks
    uint32_t masks_0123 = _dxpnd2(cmpgt_0123);
    uint32_t masks_4567 = _dxpnd2(cmpgt_4567);

    //Use the mask to extract clip value where pixel > 255
    uint32_t result_0123 = clip_255k & masks_0123;
    uint32_t retuls_4567 = clip_255k & masks_4567;

    //Use the mask to extract pixel value < 255
    uint32_t result_0123 = result_0123 | (abs_0123 & ~masks_0123);
    uint32_t retuls_4567 = result_4567 | (abs_4567 & ~masks_4567);

    //Store the result in buffer (assuming buffer is 16-bit buffer)
    _mem8(&buffer[0]) = result_0123;
    _mem8(&buffer[0 + 4]) = result_4567;

    Also, these documents will help you get familiar with the programming model, intrinsics and optimizations

    www.ti.com/.../spru198k.pdf - TMS320C6000 Programmer's Guide
    www.ti.com/.../sprugh7.pdf - TMS320C6600 CPU and Instruction Set
    www.ti.com/.../sprabf2.pdf - Introduction to TMS320C6000 DSP Optimization
    www.ti.com/.../sprabg7.pdf - Optimizing loops on the C66x DSP

    Regards,
    Shyam
  • Thanks Shyam for taking time to understand the issue and giving appropriate suggestion.

    I have converted the c code in less intrinsics by using _davg2 and _dmin2

    //Compute absolute of 2 16-bit values at a time
    abs_01 = _abs2(_loll(c02_lo));
    abs_23 = _abs2(_hill(c02_lo));
    abs_45 = _abs2(_loll(c02_hi));
    abs_67 = _abs2(_hill(c02_hi));

    //Pack them into groups of 4
    c02_lo = _itoll(abs_23 , abs_01 );
    c02_hi = _itoll(abs_67 , abs_45 );

    //take the average of x direction and y direction edge
    c012_lo = _davg2 ( c02_lo , c012_lo);   where  c012_lo is y direction edge
    c012_hi = _davg2 ( c02_hi , c012_hi);  where  c012 _hi is y direction edge

    // clamp the value with 0xff
    c012_lo = _dmin2(c012_lo,MIN_VALUE);
    c012_hi = _dmin2(c012_hi,MIN_VALUE);

    /* store 8 sobel y filter values */
    _mem8(&pixel[0])        =    c012_lo;
    _mem8(&pixel[0 + 4])   =   c012_hi;

    for (i=0;i<8;i++)
    {
    buffer[(x*8)+ i] = pixel[i];
    }

    Now only one thing is remaining , storing data in uint8_t array from int64_t variable.

    Thanks

    Vikas

  • Vikas,

    Since each uint64_t variable has 4 16-bit values you can direclty use SPACKU4 which will do both saturate (at 255) and pack results in 8-bit.
    So you can avoid the dmin2(...) instruction.

    Regards,
    Shyam
  • Yes Shyam, following lines convert and saturate result in 16 bits to 8 bits and store in 8 bits array.

    _mem4( & buffer[(x*8)])         =     _spacku4( _hill(c012_lo), _loll(c012_lo));
    _mem4( & buffer[(x*8)+4])     =    _spacku4( _hill(c012_hi), _loll(c012_hi));

    Thanks for your support.

    Vikas