Some questions about the optimization of DSP

tianxing hou

Expert 2435 points

Hi,

I have some questions about optimization on the C6000 DSP.

1. I tested the performance of 16-bit * 16-bit and 32-bit * 32-bit multiplication on the C6678 and the DM648. My code is as follows:

short input_array1[256];
short input_array2[256];
int output_array[256];

for(i = 0; i < 256; i++)
{
    input_array1[i] = i;
    input_array2[i] = i + 10;
    output_array[i] = 0;
}

TSCLBegin = TSCL;

#pragma MUST_ITERATE(256, 256, 8)
#pragma UNROLL(4)
for(j = 0; j < 256; j++)
{
    output_array[j] = input_array1[j]*input_array2[j];
}

The performance on the C6678 and the DM648 is as follows:

However, the datasheet says that the multiply performance of the C66x DSP is 4x that of the C64x+ DSP.

Also, MUST_ITERATE has no effect on performance.

Could you give me some advice on the optimization?

Thank you.

Tianxing

over 13 years ago

0 HRi over 13 years ago

Guru 10750 points

Hi Tianxing,

Please use -o3, --debug_software_pipeline and --asm_listing, then go over the ASM listing and the pipeline information. The cycle count should be much, much lower.

Thanks,

0 Victor Kazmirenko over 13 years ago

Guru 13202 points

The datasheet is correct and does not contradict your observation. A Ford Mustang cruises at 200 km/h. Is that true? It depends on conditions. If you get on a highway and are not afraid of the police, you can do that. Now imagine a traffic jam somewhere in Manhattan. Still true?

So what is behind this story? The C66x core does have hardware multiplier units, and there are more of them than on the C64/C64+. That is what the datasheet says. Can you make all of them do useful work simultaneously? That is up to your application. You, as the programmer, have to ensure proper data alignment and organization so the compiler can produce the most efficient code for you. MUST_ITERATE is one of the useful tools, but it is not the only one and is by no means sufficient to produce high-performance code. I suggest walking through the optimization tutorial and paying attention to the explanation of the generated assembly.

0 ran35366 over 13 years ago in reply to Victor Kazmirenko

TI__Genius 12805 points

To add to the previous message -

Look at the optimized assembly output and see how many values are multiplied in the loop and how many cycles the loop takes. That is the fastest processing time, and it holds only IF the data is in L1 memory. Otherwise, what slows you down is the memory access. Where do you put your vectors?

Ran

0 tianxing hou over 13 years ago in reply to HRi

Expert 2435 points

Hi HRi,

I used -o3 as you said, and the cycle count is much lower.

I have a question: I used MUST_ITERATE, but the performance did not improve.

Also, how can I see the pipeline information, the functional unit information, etc.?

Thank you,

Tianxing

0 HRi over 13 years ago in reply to tianxing hou

Guru 10750 points

Hi Tianxing,

In your case the variables and the loop count are predefined, so I don't think MUST_ITERATE will help. You can see all the information in the asm file: just before the loop it shows which functional units were used, which registers were used, and the estimated cycle count.
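To illustrate the point about the loop count: MUST_ITERATE tends to matter when the trip count is a run-time value the compiler cannot see. The sketch below is not from the thread. The function name mul16 and the promise "n is at least 8 and a multiple of 8" are assumptions made for illustration, and the pragma is guarded so the same file also builds with a host compiler.

```c
/* Element-wise 16-bit x 16-bit -> 32-bit multiply with a run-time length.
 * Hypothetical helper, not code from the original post. */
void mul16(const short *a, const short *b, int *out, int n)
{
    int j;

#ifdef __TI_COMPILER_VERSION__
    /* Assumed contract with the caller: n >= 8 and n % 8 == 0.
     * With a literal bound such as 256 the compiler already knows this. */
    #pragma MUST_ITERATE(8, , 8)
#endif
    for (j = 0; j < n; j++)
    {
        out[j] = a[j] * b[j];
    }
}
```

If a caller breaks the stated contract, the pragma is a false promise to the compiler, so it only belongs where the contract is really guaranteed.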

Thanks,

0 Clement FR over 13 years ago in reply to HRi

Genius 4750 points

I highly suggest reading:

a) Hand-Tuning Loops and Control Code on the C6000

b) Optimizing Loops on the C66x DSP

For the compiler options, use:

-O3 ; --symdebug:none ; -k ; -mw

In your code, use whenever possible:

#pragma MUST_ITERATE ; restrict ; _nassert ; and intrinsics, of course
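A minimal sketch of how restrict and _nassert from the list above might be combined, under stated assumptions: the function vec_mac16 and the macro ASSUME_ALIGNED are hypothetical names, and the 8-byte and 4-byte alignments are assumed properties of the caller's buffers. On a host compiler the hint compiles away, so the code can be tried anywhere.

```c
#ifdef __TI_COMPILER_VERSION__
/* On the TI compiler, pass the alignment promise through _nassert. */
#define ASSUME_ALIGNED(p, n) _nassert(((unsigned int)(p) % (n)) == 0)
#else
/* Host build: the hint is dropped; behavior is unchanged. */
#define ASSUME_ALIGNED(p, n) ((void)0)
#endif

/* acc[i] += a[i] * b[i]; restrict promises that the three arrays do not
 * overlap, so loads of a and b may be scheduled across stores to acc. */
void vec_mac16(const short *restrict a, const short *restrict b,
               int *restrict acc, int n)
{
    int i;

    ASSUME_ALIGNED(a, 8);    /* assumption: inputs are 8-byte aligned */
    ASSUME_ALIGNED(b, 8);
    ASSUME_ALIGNED(acc, 4);  /* assumption: output is 4-byte aligned  */

    for (i = 0; i < n; i++)
    {
        acc[i] += a[i] * b[i];
    }
}
```

Both hints are promises rather than checks: if the buffers overlap or are misaligned on the target, the result is undefined.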

0 tianxing hou over 13 years ago in reply to Clement FR

Expert 2435 points

Hi,

My code is as follows:

5277.YUV422ItoRGB.rar

In the function DH_YUV422ItoRGB_Integer(), I convert a YUV422I image to RGB format.

In the function DH_YUV422ItoRGB_Integer_Optimization(), I used _amem8(), _amem4(), and _spacku4() to optimize the code, and I used MUST_ITERATE and UNROLL.

When I build the project with the -o3 option, DH_YUV422ItoRGB_Integer() takes about 63 ms and DH_YUV422ItoRGB_Integer_Optimization() takes about 28 ms. However, MUST_ITERATE and UNROLL seem to have no effect on the optimization.

Without the -o3 option, they take 1.05 s and 480 ms respectively.

Could you give me some advice about my code?

Thank you

Tianxing

0 Victor Kazmirenko over 13 years ago in reply to tianxing hou

Guru 13202 points

The first thing is that the compiler tries to optimize the inner loop. Decorating the outer loop with pragmas may give no benefit, as you have already observed. So, first try placing the pragmas on the inner loop.

The second observation is that your loop contains a lot of control code, I mean lots of if-else. Again, I suggest reading the generated assembly. Very often I have seen a message like "disqualified for pipelining - contain control code". You should not think that if-else always breaks pipelining, but it often does. Look in the assembly and find out whether your loop was pipelined and how many cycles a single iteration takes. That will give you an idea of how to improve the code. It will require some work, however. Knowledge of your application is the best aid.

For example, I see some range-limiting constructs. As far as I can see, there are multiplies and sums. If you are certain that there is no negative input data, then there is no need to check for (x < 0). Next, you may think about saturated multiplies and additions. SIMD instructions for that are available through intrinsic functions.
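The range-limiting point can be sketched as follows. This is a host-side reference model, not TI code: spacku4_ref is a hypothetical name, and it models my understanding of what the thread's _spacku4() call does (each of the four signed 16-bit halves is clamped to 0..255 and packed into one 32-bit word, upper half of the first argument in the top byte). On the DSP the intrinsic would replace the whole function, with no if-else left in the loop.

```c
#include <stdint.h>

/* Clamp one signed 16-bit value to the unsigned 8-bit range. */
static uint8_t sat_u8(int16_t v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Reference model (assumed semantics) of a saturating 4-way pack:
 * src1 = [hi16 | lo16] -> bytes 3 and 2, src2 = [hi16 | lo16] -> bytes 1 and 0.
 * The uint16_t -> int16_t casts assume two's complement conversion. */
uint32_t spacku4_ref(uint32_t src1, uint32_t src2)
{
    uint32_t b3 = sat_u8((int16_t)(uint16_t)(src1 >> 16));
    uint32_t b2 = sat_u8((int16_t)(uint16_t)(src1 & 0xFFFFu));
    uint32_t b1 = sat_u8((int16_t)(uint16_t)(src2 >> 16));
    uint32_t b0 = sat_u8((int16_t)(uint16_t)(src2 & 0xFFFFu));

    return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
}
```

A model like this is mainly useful for checking an optimized loop against a plain C version on a PC before measuring cycles on the target.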

0 tianxing hou over 13 years ago in reply to Victor Kazmirenko

Expert 2435 points

Hi rrlagic,

Thank you for your advice. I will try it.

In DH_YUV422ItoRGB_Integer(), g_i16CoeffMatrix[ ] contains negative coefficients, so I think the result could be negative.

I have optimized DH_YUV422ItoRGB_Integer() in DH_YUV422ItoRGB_Integer_Optimization() with _amem8(), _amem4(), and _spacku4(). Could you give me some advice on DH_YUV422ItoRGB_Integer_Optimization()?

Thank you,

Tianxing

0 Victor Kazmirenko over 13 years ago in reply to tianxing hou

Guru 13202 points

No offence, but I cannot do this job better than you. Your strong side is knowledge of the application.

Also, I don't have practical experience with C66x yet. My experience might be obsolete, so take it only as a direction, not a recommendation. In the mentioned function you also have to apply the pragmas to the inner loop, not the outer one.

i16YData = ((lUYVYData & 0x000000000000FF00) >> 8) + g_i16PreOffset[0];
i16UData =  (lUYVYData & 0x00000000000000FF)       + g_i16PreOffset[1];
i16VData = ((lUYVYData & 0x0000000000FF0000) >> 16)+ g_i16PreOffset[2];

Here you pull a byte from a packed word and add a constant. I don't know your data structure. If every byte in the packed word is unsigned, then you may try to apply a SIMD instruction. Your g_i16PreOffset contains single-byte constants, so I would build a constant like 0x00008080 and use the SADD4 instruction (remember, my experience is with C64x). It will produce a word where the most significant byte is meaningless and the other three contain the components of lUYVYData with g_i16PreOffset added to the respective component. Something similar could be done in the next lines with the multiply and add; you may try to view it as something like a dot product. Again, I am not saying this is correct or the only way, I am just giving you an idea of how to utilize SIMD instructions.
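The idea of adding all offsets in one operation can be tried on a host with a small reference model. The sketch below is an illustration under assumptions: add4_ref is a hypothetical name, it models a wrapping (non-saturating) byte-wise add rather than the saturating variant mentioned above, and the offset word in the usage note is made up. On the DSP the byte-wise add intrinsics I am aware of are _add4 (wrapping) and _saddu4 (saturating, unsigned); check the compiler guide for the exact semantics before relying on either.

```c
#include <stdint.h>

/* Add four packed bytes lane by lane, with no carry between lanes.
 * The low 7 bits of each lane are added normally; the top bit of each
 * lane is then fixed up with XOR so a carry never leaks into the
 * neighbouring byte. */
uint32_t add4_ref(uint32_t a, uint32_t b)
{
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);

    return low7 ^ ((a ^ b) & 0x80808080u);
}
```

For example, with a packed word 0x00102030 and a hypothetical offset word 0x00808080, add4_ref() returns 0x0090A0B0: three offsets applied with one operation instead of three mask-shift-add sequences.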

0 Alberto Chessa over 13 years ago in reply to tianxing hou

Mastermind 6670 points

Hi,

You can try to reduce the two nested loops to one and remove some of the calculation in accessing the input and output array items. It also seems to me that _amem8() and _amem4() are not required and the same result can be obtained in plain C.

If possible, enable the "-speculate_loads" runtime model option. A value of 64 should fit (look at the generated assembly for hints).

Just to show the idea:

typedef union //to let the compiler do the dirty work for you (shift and mask)
{
unsigned long long raw;
unsigned char b[sizeof(unsigned long long)/sizeof(unsigned char)];
} l2b_t;

const unsigned long long* const restrict pui8YCIn_=(unsigned long long*)pui8YCIn;
unsigned long* const restrict pui8ROut_=(unsigned long*)pui8ROut;
unsigned long* const restrict pui8GOut_=(unsigned long*)pui8GOut;
unsigned long* const restrict pui8BOut_=(unsigned long*)pui8BOut;

_nassert((int)pui8YCIn % 8 == 0); //some hints to the compiler
_nassert((int)pui8ROut % 4 == 0);
_nassert((int)pui8GOut % 4 == 0);
_nassert((int)pui8BOut % 4 == 0);

const int N=iHeight*(iWidth/4); //Warning: not tested, maybe it is not correct!!!

for(i=0; i<N; ++i)
{
//lUYVYData = _amem8(pui8YCIn + i * iWidth * 2 + j);
l2b_t lUYVYData;

lUYVYData.raw=pui8YCIn_[i];

    i16YData = (lUYVYData.b[1]) + g_i16PreOffset[0];
    i16UData = (lUYVYData.b[0]) + g_i16PreOffset[1];
    i16VData = (lUYVYData.b[2]) + g_i16PreOffset[2];

....

    pui8ROut[i]= _spacku4(((iRComponentTemp22<<16)|iRComponentTemp21), ((iRComponentTemp12<<16)|iRComponentTemp11));
    pui8GOut[i]= _spacku4(((iGComponentTemp22<<16)|iGComponentTemp21), ((iGComponentTemp12<<16)|iGComponentTemp11));
    pui8BOut[i]= _spacku4(((iBComponentTemp22<<16)|iBComponentTemp21), ((iBComponentTemp12<<16)|iBComponentTemp11));
}
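A self-contained version of the union trick, for trying it on a host. This is a sketch under assumptions: it uses the fixed-width types from stdint.h instead of unsigned long long and unsigned char, the helper names are hypothetical, and the byte order seen through the union matches the shift-and-mask version only on a little-endian target.

```c
#include <stdint.h>

/* Same idea as l2b_t above: view one 64-bit word as eight bytes. */
typedef union
{
    uint64_t raw;
    uint8_t  b[sizeof(uint64_t)];
} l2b_t;

/* Run-time probe, so callers can check the byte-order assumption. */
int is_little_endian(void)
{
    const uint16_t probe = 1;

    return *(const uint8_t *)&probe == 1;
}

/* Byte k of the word, read through the union (compiler picks the code). */
uint8_t byte_via_union(uint64_t raw, unsigned k)
{
    l2b_t u;

    u.raw = raw;
    return u.b[k];
}

/* Byte k of the word, extracted by hand with shift and mask. */
uint8_t byte_via_shift(uint64_t raw, unsigned k)
{
    return (uint8_t)((raw >> (8u * k)) & 0xFFu);
}
```

On a little-endian build both helpers return the same byte for every k from 0 to 7, which is the property the loop above relies on when it reads b[0], b[1] and b[2].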


0 Alberto Chessa over 13 years ago in reply to Alberto Chessa

Mastermind 6670 points

Hi,

I suppose you noticed that there is a mistake in my previous post: pui8ROut[i] should be pui8ROut_[i] (and I should name it oui16Rout...).

Another hint, prior to using SIMD intrinsics, is to try to reduce the number of generated assembly instructions in the loop so that the loop can be implemented as a software-pipelined loop.

In your case, consider removing the last shift used to compose the value for the pack intrinsic. If, instead of a scaling of 10 bits, you use a scaling of 16 bits (and use an int32 to hold the coefficient value), you can write:

iRComponentTemp22=iRComponentTemp22 & 0xFFFF0000; //an "&" instead of a shift.

...

pui8ROut_[i]= _spacku4((iRComponentTemp22|iRComponentTemp21), (iRComponentTemp12|iRComponentTemp11)); // iRComponentTemp22 and iRComponentTemp12 already sit in the upper 16 bits

...

I'm not sure the code produces the right results (and this is not a negligible problem...), but as a "case study" for the optimization, I tried compiling a version of your code modified in this way: it produces a software-pipelined loop of 13 execute packets, and for a 32x32 matrix it uses 3420 ticks against the 6137 of the original.
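The mask-instead-of-shift idea can be checked in isolation. The sketch below is an illustration with hypothetical names, not the modified code that was measured: it assumes an unsigned 32-bit accumulator in 16.16 fixed point, so the integer part of the result already sits in the upper 16 bits.

```c
#include <stdint.h>

/* Two operations: drop the fraction, then move the value back up. */
uint32_t upper_by_shifts(uint32_t acc_q16)
{
    return (acc_q16 >> 16) << 16;
}

/* One operation: the integer part is already in the upper half. */
uint32_t upper_by_mask(uint32_t acc_q16)
{
    return acc_q16 & 0xFFFF0000u;
}

/* Compose one argument for the pack step from two 16.16 results:
 * the first stays in the upper half, the second moves to the lower half. */
uint32_t pack_pair(uint32_t acc_hi_q16, uint32_t acc_lo_q16)
{
    return (acc_hi_q16 & 0xFFFF0000u) | (acc_lo_q16 >> 16);
}
```

With 10-bit scaling the same composition needs a right shift by 10 on both values plus a left shift by 16 on one of them, which is the extra instruction the post suggests removing.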
