Multiplication + Intrinsics

Filipe Alves49699

Intellectual 460 points

Hi,

I have a very basic question, but since I am just getting started in the C6000 DSP world I am going to post it anyway.

I have a C64x+ DSP, and I want to multiply an image by a scalar.

I know the result will never exceed 255, and my image data is unsigned char.

This is what i have done:

void IMG_mulS_8
(
    unsigned char * restrict imgR, /* Read pointer for the input image */
    unsigned char * restrict imgW,           /* Write pointer for the output image */
    unsigned char constData,                  /* Constant data */
    int count                        /* Number of samples in the image */
)
{
    int i;
    long long p7_p6_p5_p4_p3_p2_p1_p0;
    int p7_p6_p5_p4, p3_p2_p1_p0;
    double rp3_rp2_rp1_rp0, rp7_rp6_rp5_rp4;
    int cD_cD_cD_cD;

    cD_cD_cD_cD = (constData << 24) | (constData << 16) |
                   (constData << 8) | (constData);

    for (i = 0; i < count >> 4; i++) {

        p7_p6_p5_p4_p3_p2_p1_p0 = _amem8(imgR); // Read 8 bytes from memory

        p7_p6_p5_p4 = _hill (p7_p6_p5_p4_p3_p2_p1_p0); // Grab the MSBytes part
        p3_p2_p1_p0 = _loll (p7_p6_p5_p4_p3_p2_p1_p0); // Grab the LSBytes part

        imgR += 8; // Increase pointer

        rp3_rp2_rp1_rp0 = _mpyu4 (cD_cD_cD_cD, p7_p6_p5_p4); // Multiply 4 bytes at a time
        rp7_rp6_rp5_rp4 = _mpyu4 (cD_cD_cD_cD, p3_p2_p1_p0);

        *((double *)imgW) = rp3_rp2_rp1_rp0;
        imgW += 4;
        *((double *)imgW) = rp7_rp6_rp5_rp4;
        imgW += 4;

        // Do again for another 8
        p7_p6_p5_p4_p3_p2_p1_p0 = _amem8(imgR);
        p7_p6_p5_p4 = _hill (p7_p6_p5_p4_p3_p2_p1_p0);
        p3_p2_p1_p0 = _loll (p7_p6_p5_p4_p3_p2_p1_p0);
        imgR += 8;

        rp3_rp2_rp1_rp0 = _mpyu4 (cD_cD_cD_cD, p7_p6_p5_p4);
        rp7_rp6_rp5_rp4 = _mpyu4 (cD_cD_cD_cD, p3_p2_p1_p0);

        *((double *)imgW) = rp3_rp2_rp1_rp0;
        imgW += 4;
        *((double *)imgW) = rp7_rp6_rp5_rp4;
        imgW += 4;
    }
}

But this is not working. Can you give me a hint as to what is wrong?

Best Regards

Filipe Alves

over 14 years ago

0 Brad Griffis over 14 years ago

TI__Guru*** 125430 points

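Use "long long" instead of double.
Use _mpyu4ll instead of _mpyu4.
The data coming back is twice as large as the data going in, because each 8-bit by 8-bit product occupies 16 bits. It has to be packed back down to bytes before it is stored. Try this: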

Filipe Alves said:
    rp3_rp2_rp1_rp0 = _mpyu4 (cD_cD_cD_cD, p7_p6_p5_p4); // Multiply 4 bytes at a time
    rp7_rp6_rp5_rp4 = _mpyu4 (cD_cD_cD_cD, p3_p2_p1_p0);

    *((double *)imgW) = rp3_rp2_rp1_rp0;
    imgW += 4;
    *((double *)imgW) = rp7_rp6_rp5_rp4;
    imgW += 4;

    rp3_rp2_rp1_rp0 = _mpyu4ll (cD_cD_cD_cD, p7_p6_p5_p4); // Multiply 4 bytes at a time
    rp7_rp6_rp5_rp4 = _mpyu4ll (cD_cD_cD_cD, p3_p2_p1_p0);

    *((long long *)imgW) = _packl4(_hill(rp3_rp2_rp1_rp0),_loll(rp3_rp2_rp1_rp0));
    imgW += 4;
    *((long long *)imgW) = _packl4(_hill(rp7_rp6_rp5_rp4),_loll(rp7_rp6_rp5_rp4));

The above code uses _packl4 to grab the lower 8 bits of each 16-bit product that gets returned. That should be correct as long as your original statement about never getting a product over 255 holds. On the flip side, if you're doing fractional fixed-point math then you'd want to keep the top half of each product and use _packh4 instead.
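To make the lane behaviour described above easier to check on a desktop compiler, here is a minimal portable C sketch. The model_* functions and mul_scalar_u8 are hypothetical names introduced for illustration; they imitate what _mpyu4ll, _packl4 and _packh4 do to each lane and are not TI code. The sketch assumes the products fit in 8 bits, uses memcpy in place of the aligned load/store intrinsics, and advances by 4 bytes per packed word because each packed result is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of _mpyu4ll: four unsigned 8x8 multiplies.
   The product of byte k lands in 16-bit lane k of the 64-bit result. */
static uint64_t model_mpyu4ll(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int k = 0; k < 4; k++) {
        uint32_t prod = ((a >> (8 * k)) & 0xFFu) * ((b >> (8 * k)) & 0xFFu);
        r |= (uint64_t)prod << (16 * k);
    }
    return r;
}

/* Hypothetical model of _packl4: low byte of each 16-bit lane.
   The hi word fills bytes 3..2 of the result, the lo word bytes 1..0. */
static uint32_t model_packl4(uint32_t hi, uint32_t lo)
{
    return (((hi >> 16) & 0xFFu) << 24) | ((hi & 0xFFu) << 16) |
           (((lo >> 16) & 0xFFu) << 8)  | (lo & 0xFFu);
}

/* Hypothetical model of _packh4: high byte of each 16-bit lane,
   same byte ordering as model_packl4. */
static uint32_t model_packh4(uint32_t hi, uint32_t lo)
{
    return (((hi >> 24) & 0xFFu) << 24) | (((hi >> 8) & 0xFFu) << 16) |
           (((lo >> 24) & 0xFFu) << 8)  | ((lo >> 8) & 0xFFu);
}

/* Multiply count bytes by c, four pixels per step.
   Assumes every product fits in 8 bits and count is a multiple of 4. */
static void mul_scalar_u8(const uint8_t *src, uint8_t *dst, uint8_t c, int count)
{
    assert(count % 4 == 0);
    uint32_t c4 = 0x01010101u * c;      /* replicate c into all four bytes */
    for (int i = 0; i < count; i += 4) {
        uint32_t p;                     /* four packed input pixels */
        memcpy(&p, src + i, 4);
        uint64_t r = model_mpyu4ll(c4, p);
        uint32_t out = model_packl4((uint32_t)(r >> 32), (uint32_t)r);
        memcpy(dst + i, &out, 4);       /* 4 result bytes, so the stride is 4 */
    }
}
```

Calling mul_scalar_u8(in, out, 3, 8) on the bytes 1..8 should leave 3, 6, ..., 24 in out, which gives a known-good result to compare the intrinsic version against.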

0 Filipe Alves49699 over 14 years ago in reply to Brad Griffis

Intellectual 460 points

Hi Brad,

Thanks for your reply. Unfortunately it is not working: I get stripes in the output, as if the multiplication were applied to 4 or 8 bytes, then skipped for the next 4 or 8, and so on until the end.

I tried changing the cast in *((long long *)imgW) = _packl4(_hill(rp3_rp2_rp1_rp0),_loll(rp3_rp2_rp1_rp0)); to int. That was better, but the stripes still appeared, so maybe I am getting some kind of overlap.

What I want to do is basically this:

for(i=0;i<frame_size;i++)
    p_img_Buf2[i]=p_img_Buf[i]*255/DMAX;

DMAX is the maximum value in p_img_Buf.

Could you check it again, if possible?

Another question: suppose I have a short variable and I want to divide it by an unsigned char value. Is the best way to do that to use a Q format for fixed point and do a multiplication instead?
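To make the Q-format idea concrete, here is a minimal sketch of replacing the division with a multiply by a precomputed reciprocal. The names recip_q16 and div_by_recip are hypothetical, the Q16 format is an assumption chosen so the product fits in 32 bits, and the value being divided is assumed to be non-negative. Because the reciprocal is rounded up, the result can come out one higher than true integer division for some inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Q16 reciprocal of d, rounded up: ceil(65536 / d). d must be non-zero. */
static uint32_t recip_q16(uint8_t d)
{
    assert(d != 0);
    return (65536u + d - 1u) / d;
}

/* Approximate x / d with one multiply and one shift.
   x must be in 0..32767 so the product stays below 2^32. */
static uint16_t div_by_recip(int16_t x, uint32_t recip)
{
    assert(x >= 0);
    return (uint16_t)(((uint32_t)x * recip) >> 16);
}
```

The reciprocal only needs to be computed once per divisor, so the per-sample cost is the multiply and the shift.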

Best Regards

Filipe Alves

0 Brad Griffis over 14 years ago in reply to Filipe Alves49699

TI__Guru*** 125430 points

Filipe,

This is not something to do by visual inspection. You need to put some known values into the algorithm and look step by step at what happens to the values and adjust your algorithm/instructions accordingly.
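As a rough illustration of that approach, the sketch below builds a recognisable input, computes the expected output with a plain C loop, and reports where two buffers first differ. All three helper names are hypothetical, and the reference keeps only the low 8 bits of each product, which matches the assumption that products never exceed 255.

```c
#include <assert.h>
#include <stdint.h>

/* Plain C reference: multiply each pixel by c, keeping the low 8 bits. */
static void ref_mul_scalar_u8(const uint8_t *src, uint8_t *dst, uint8_t c, int count)
{
    assert(count >= 0);
    for (int i = 0; i < count; i++)
        dst[i] = (uint8_t)(src[i] * c);
}

/* Index of the first differing byte, or -1 when the buffers match. */
static int first_mismatch(const uint8_t *a, const uint8_t *b, int count)
{
    for (int i = 0; i < count; i++)
        if (a[i] != b[i])
            return i;
    return -1;
}

/* Fill a buffer with the ramp 0, 1, 2, ... so every position is recognisable. */
static void fill_ramp(uint8_t *buf, int count)
{
    for (int i = 0; i < count; i++)
        buf[i] = (uint8_t)i;
}
```

Running the intrinsic routine and ref_mul_scalar_u8 on the same 32-byte ramp and passing both outputs to first_mismatch gives the exact byte offset where they diverge, which is usually enough to tell a stride problem from a byte-ordering problem.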

Filipe Alves said:

What I want to do is basically this:

for(i=0;i<frame_size;i++)
    p_img_Buf2[i]=p_img_Buf[i]*255/DMAX;

You could make a lookup table of 256 bytes in which you precompute the value of 255/DMAX for each possible DMAX. Then you would just multiply your image by that scalar using the algorithm I mentioned previously, though you'll want _packh4 rather than _packl4 for the Q-math.
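For reference, the fixed-point arithmetic can be sketched in plain C as below. This is only an illustration under stated assumptions: normalize_q8 is a hypothetical name, and because 255/DMAX is at least 1 the factor is held here as a 16-bit Q8.8 value rather than an 8-bit scalar. The result is rounded to nearest, so it can differ by one from the truncating expression p_img_Buf[i]*255/DMAX.

```c
#include <assert.h>
#include <stdint.h>

/* Stretch pixels so that dmax maps to 255, using a Q8.8 fixed-point factor. */
static void normalize_q8(const uint8_t *src, uint8_t *dst, int count, uint8_t dmax)
{
    assert(dmax > 0);                        /* an all-zero image has no scale */
    uint32_t scale_q8 = (255u << 8) / dmax;  /* 255/dmax with 8 fractional bits */
    for (int i = 0; i < count; i++) {
        uint32_t v = (src[i] * scale_q8 + 128u) >> 8;  /* round to nearest */
        dst[i] = (uint8_t)(v > 255u ? 255u : v);       /* clamp if src[i] > dmax */
    }
}
```

Keeping the factor in 16 bits means the multiply no longer fits the four 8-bit lanes used earlier, so a vectorised version would need a 16-bit multiply or a different scaling of the factor.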

0 Rahul Prabhu over 14 years ago in reply to Brad Griffis

TI__Guru** 116170 points

Filipe,

I am assuming you are referring to the C intrinsic code in the appendix of the IMGLIB documentation as the basis for this function. You need to make one small change to obtain the correct output. Notice that your code increments the imgW and imgR pointers as it goes, so the output pointer imgW no longer points to the start of the image once processing has finished. Storing the start addresses in temporary variables should give you the correct output.

Here is the change you need to make:

void IMG_mulS_8
(
    unsigned char * restrict imgR, /* Read pointer for the input image */
    unsigned char * restrict imgW, /* Write pointer for the output image */
    char constData,                /* Constant data */
    int count                      /* Number of samples in the image */
)
{
    int i;
    long long p7_p6_p5_p4_p3_p2_p1_p0;
    int p7_p6_p5_p4, p3_p2_p1_p0;
    double rp3_rp2_rp1_rp0, rp7_rp6_rp5_rp4;
    int cD_cD_cD_cD;
    unsigned char *tmp1, *tmp2;

    tmp1=imgR;
    tmp2=imgW;

    cD_cD_cD_cD = (constData << 24) | (constData << 16) |
                   (constData << 8) | (constData);

    for (i = 0; i < count; i += 16) {
        p7_p6_p5_p4_p3_p2_p1_p0 = _amem8(imgR);
        p7_p6_p5_p4 = _hill (p7_p6_p5_p4_p3_p2_p1_p0);
        p3_p2_p1_p0 = _loll (p7_p6_p5_p4_p3_p2_p1_p0);
        imgR += 8;

        rp3_rp2_rp1_rp0 = _mpysu4 (cD_cD_cD_cD, p3_p2_p1_p0);
        rp7_rp6_rp5_rp4 = _mpysu4 (cD_cD_cD_cD, p7_p6_p5_p4);

        *((double *)imgW) = rp3_rp2_rp1_rp0;
        imgW += 4;
        *((double *)imgW) = rp7_rp6_rp5_rp4;
        imgW += 4;

        p7_p6_p5_p4_p3_p2_p1_p0 = _amem8(imgR);
        p7_p6_p5_p4 = _hill (p7_p6_p5_p4_p3_p2_p1_p0);
        p3_p2_p1_p0 = _loll (p7_p6_p5_p4_p3_p2_p1_p0);
        imgR += 8;

        rp3_rp2_rp1_rp0 = _mpysu4 (cD_cD_cD_cD, p3_p2_p1_p0);
        rp7_rp6_rp5_rp4 = _mpysu4 (cD_cD_cD_cD, p7_p6_p5_p4);

        *((double *)imgW) = rp3_rp2_rp1_rp0;
        imgW += 4;
        *((double *)imgW) = rp7_rp6_rp5_rp4;
        imgW += 4;
    }

    imgR=tmp1;
    imgW=tmp2;
}
