Working with pointers

Lars

Hello there!

I use the 6416 DSP for face recognition in my diploma thesis and want to optimize my code. In lieu of

for(i=240;i>0;i-=2) {
for(j=320;j>0;j-=2) {
  if((i>=3 && i<=237) && (j>=3 && j <= 317)) {
    rgb[0][240-i][320-j]= ((*(SrcFrame + (320-j+1) + XMAX*(240-i+1)))+ (*(SrcFrame + (320-j+1) + XMAX*(240-i-1)))) /2;
    rgb[1][240-i][320-j]= ((*(SrcFrame + (320-j) + XMAX*(240-i-1))) + (*(SrcFrame + (320-j) + XMAX*(240-i+1)))) /2;
    rgb[2][240-i][320-j]= (*(SrcFrame + (320-j) + XMAX*(240-i)));

    rgb[0][240-i][320-j+1]= (*(SrcFrame + (320-j+1) + XMAX*(240-i+1) ));
    rgb[1][240-i][320-j+1]= (*(SrcFrame + (320-j) + XMAX*(240-i+1)));
    rgb[2][240-i][320-j+1]= ((*(SrcFrame + (320-j) + XMAX*(240-i))) + (*(SrcFrame + (320-j) + XMAX*(240-i+2))) )/2;

    rgb[0][240-i+1][320-j]= ((*(SrcFrame + (320-j+1) + XMAX*(240-i-1) ))+(*(SrcFrame + (320-j+1) + XMAX*(240-i+1))))/2;
    rgb[1][240-i+1][320-j]= (*(SrcFrame + (320-j+1) + XMAX*(240-i) ));
    rgb[2][240-i+1][320-j]= ((*(SrcFrame + (320-j) + XMAX*(240-i))) + (*(SrcFrame + (320-j+2) + XMAX*(240-i))) )/2;

    rgb[0][240-i+1][320-j+1]= (*(SrcFrame + (320-j+1) + XMAX*(240-i+1) ));
    rgb[1][240-i+1][320-j+1]= ((*(SrcFrame + (320-j) + XMAX*(240-i+1)))+ (*(SrcFrame + (320-j+1) + XMAX*(240-i) )) + (*(SrcFrame + (320-j+1) + XMAX*(240-i+2) )) + (*(SrcFrame + (320-j+2) + XMAX*(240-i+1))) )/4;
    rgb[2][240-i+1][320-j+1]= ((*(SrcFrame + (320-j) + XMAX*(240-i)))+ (*(SrcFrame + (320-j) + XMAX*(240-i+2)))+ (*(SrcFrame + (320-j+2) + XMAX*(240-i))) + (*(SrcFrame + (320-j+2) + XMAX*(240-i+2))))/4;
  }
}
}

I want to use three unsigned char pointers *r, *g, *b (the same element type as the 3-dimensional rgb array) and do

    r = &rgb[0][240-i][320-j];
    g = &rgb[1][240-i][320-j];
    b = &rgb[2][240-i][320-j];

at the beginning of the loop.

So now the question: why does the load6x simulator report more cycles when I use r+=240; than it does for r+=1; or r+=31; (the magic number where it switches is 32)?

P.S.: Any advice on optimizing my Bayer-to-RGB interpolation?

over 16 years ago

BrandonAzbell over 16 years ago

Lars said:

So now the question: why does the load6x simulator report more cycles when I use r+=240; than it does for r+=1; or r+=31; (the magic number where it switches is 32)?

I would suggest consulting the TMS320C64x/C64x+ DSP CPU and Instruction Set Reference Guide (SPRU732) to understand the Instruction Set Architecture. I would also suggest that you set the compiler build options to generate the listing files. This will give you an idea of the assembly code generated for the above function.

Some instructions allow a 5-bit constant operand to be embedded in the opcode; an unsigned 5-bit field holds values from 0 to 31. If the constant you are adding is larger than 31, the compiler must first place it in a register (for example with a move-constant instruction such as MVK) and then use a different form of the ADD instruction that adds two registers. Getting the constant into a register consumes additional cycles.
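As a rough illustration of the 0 to 31 limit, the sketch below classifies a pointer increment by whether it would fit in an unsigned 5-bit field. The helper name fits_ucst5 is hypothetical, and this is plain host-side C, not a model of the compiler's actual instruction selection.

```c
#include <stdbool.h>

/* Hypothetical helper: true if `inc` fits in an unsigned 5-bit
   immediate field, i.e. lies in the range 0..31. */
static bool fits_ucst5(long inc)
{
    return inc >= 0 && inc <= 31;
}
```

By this rule r+=1 and r+=31 fit the short form, while r+=32 and r+=240 do not, which matches the switch-over Lars observed.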

RandyP over 16 years ago

Lars said:

P.S.: Any advice to optimize my Bayer to RGB Interpolation?

What compiler switches are you using? The optimizer should be able to do a lot with this, but you did not mention the settings. At least use the Release configuration so you get -o3 selected.

It seems unorthodox to use down-counting "for" loops and then, at every use of the indexes, subtract them from the start value. I may have missed some subtle use case here, but it looks like you are really counting up by starting high and subtracting to reach the low value. Would you not get the same result counting up from 0 and dropping the 240- and 320-? In any case, the compiler optimizer should be able to sort this out, so it makes no difference.

The embedded "if" statement tightens the operating range of your loop. It is possible that the optimizer can figure this out and remove the extra executions of the inner loop, but you could also move the "if" tests to be the starting point and ending point of the two "for loops".
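The two suggestions above (counting up, and folding the "if" test into the loop bounds) can be sketched as follows. This is only a sketch for the loop shown in the question: with even i and j, the test i>=3 && i<=237 and j>=3 && j<=317 leaves y = 240-i in 4..236 and x = 320-j in 4..316. The function names and the checksum helper are made up for the comparison; the real loop body would go where mix() is called.

```c
/* Fold a visited block position into an order-sensitive checksum. */
static unsigned long mix(unsigned long h, int y, int x)
{
    return h * 31UL + (unsigned long)(y * 320 + x);
}

/* Loop structure from the question: down-counting indexes,
   range test inside the inner loop. */
static unsigned long walk_original(int *count)
{
    unsigned long h = 0;
    int i, j;
    *count = 0;
    for (i = 240; i > 0; i -= 2) {
        for (j = 320; j > 0; j -= 2) {
            if ((i >= 3 && i <= 237) && (j >= 3 && j <= 317)) {
                h = mix(h, 240 - i, 320 - j);
                (*count)++;
            }
        }
    }
    return h;
}

/* Same positions in the same order: up-counting indexes,
   range test folded into the loop bounds. */
static unsigned long walk_upcount(int *count)
{
    unsigned long h = 0;
    int x, y;
    *count = 0;
    for (y = 4; y <= 236; y += 2) {
        for (x = 4; x <= 316; x += 2) {
            h = mix(h, y, x);
            (*count)++;
        }
    }
    return h;
}
```

Both versions visit the same 2x2 blocks in the same order, but the second never enters the inner loop for positions the "if" would have rejected.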

Please look at the Optimizing C Compiler User's Guide and try some of the #pragmas such as MUST_ITERATE.
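A minimal sketch of how MUST_ITERATE is typically placed, directly before the loop it describes. The function row_sum and the trip-count values (at least 8, at most 320, a multiple of 8) are assumptions for illustration, not taken from the original code; the pragma is guarded so that other compilers skip it.

```c
/* Sum one row of pixels. The caller is assumed to pass an n that
   honours the promise made in the pragma below. */
static unsigned row_sum(const unsigned char *row, int n)
{
    unsigned sum = 0;
    int x;
#ifdef __TI_COMPILER_VERSION__
/* Promise to the TI optimizer: the loop runs at least 8 times,
   at most 320 times, and the trip count is a multiple of 8. */
#pragma MUST_ITERATE(8, 320, 8)
#endif
    for (x = 0; x < n; x++)
        sum += row[x];
    return sum;
}
```

The promise has to be true for every call; if n can violate it, the generated code may be wrong.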

Since several (or all?) SrcFrame pixels get used multiple times, you could read them all once at the top of the inner loop and then use the local copy in the calculations.
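One way this could look for a single 2x2 block, using the same formulas as the loop body in the question. This is a sketch under assumptions: XMAX is taken to be 320 (the original does not show its value), the function name demosaic_block and the local names are made up, and the caller must keep (x, y) far enough from the borders that rows y-1..y+2 and columns x..x+2 exist.

```c
#define XMAX 320  /* assumed row stride of SrcFrame */
#define YMAX 240  /* assumed number of rows */

static unsigned char rgb[3][YMAX][XMAX];

/* Interpolate one 2x2 block whose top-left pixel is (x, y). */
static void demosaic_block(const unsigned char *src, int x, int y)
{
    const unsigned char *p = src + x + XMAX * y;   /* pixel (x, y) */

    /* Read every needed source pixel exactly once.
       Naming: s<column offset><row offset>, "m" meaning -1. */
    int s0m = p[-XMAX],    s1m = p[1 - XMAX];
    int s00 = p[0],        s10 = p[1],            s20 = p[2];
    int s01 = p[XMAX],     s11 = p[1 + XMAX],     s21 = p[2 + XMAX];
    int s02 = p[2 * XMAX], s12 = p[1 + 2 * XMAX], s22 = p[2 + 2 * XMAX];

    rgb[0][y][x] = (unsigned char)((s11 + s1m) / 2);
    rgb[1][y][x] = (unsigned char)((s0m + s01) / 2);
    rgb[2][y][x] = (unsigned char)s00;

    rgb[0][y][x + 1] = (unsigned char)s11;
    rgb[1][y][x + 1] = (unsigned char)s01;
    rgb[2][y][x + 1] = (unsigned char)((s00 + s02) / 2);

    rgb[0][y + 1][x] = (unsigned char)((s1m + s11) / 2);
    rgb[1][y + 1][x] = (unsigned char)s10;
    rgb[2][y + 1][x] = (unsigned char)((s00 + s20) / 2);

    rgb[0][y + 1][x + 1] = (unsigned char)s11;
    rgb[1][y + 1][x + 1] = (unsigned char)((s01 + s10 + s12 + s21) / 4);
    rgb[2][y + 1][x + 1] = (unsigned char)((s00 + s02 + s20 + s22) / 4);
}
```

The block needs 11 distinct source pixels, where the original expressions spell out 23 reads, so the local copies make the reuse explicit even if the optimizer would have found some of it.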

If SrcFrame is stored as successive bytes, you could use multi-byte reads to fetch all the SrcFrame pixels in fewer memory accesses. Type-casting and/or unions let you read a 64-bit "long long" and then access the individual bytes from it; if the address is not suitably aligned, this might need an unaligned-access intrinsic such as _mem4 (32-bit) or _mem8 (64-bit), both described in the Compiler User's Guide.
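A portable sketch of the wide-read idea: fetch eight consecutive pixels in one 64-bit load, then pick bytes out of the word. The names load8 and pixel are hypothetical, little-endian byte order is assumed, and memcpy stands in for the unaligned load; on the TI compiler an intrinsic could take its place, which is not shown here.

```c
#include <stdint.h>
#include <string.h>

/* Load 8 consecutive pixels as one 64-bit word. memcpy is the
   portable way to express a possibly unaligned load. */
static uint64_t load8(const unsigned char *p)
{
    uint64_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Extract pixel k (0..7) from the packed word.
   Assumes little-endian byte order: pixel 0 is the low byte. */
static unsigned char pixel(uint64_t w, int k)
{
    return (unsigned char)(w >> (8 * k));
}
```

With this, one load8 per source row covers several neighbouring pixels, and the shifts replace the individual byte reads.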

Lars over 16 years ago in reply to RandyP

Ah, thank you both very much! Sorry for not mentioning the compiler settings: I selected -o3 as well as all the optimization options I'm familiar with. I have tested several variants of my loop counting with the load6x simulator, and since I like a picture where up is up, this was the variant that needed the fewest cycles. I hope I can count on the simulator.

Also sorry for not posting the whole function, there are some other if-cases in the loop to handle every image border individually so that I need the whole 320-1 and 240-1 loops. Maybe I should neglect the borders.

I'll try to pack my unsigned char *SrcFrame into a "long long"; this sounds very promising, thank you for that!
