
Optimizing FOR loop with multiple dependencies for C6748

Hello,

I am (hopefully) in the final stages of porting some Windows PC code to a C6748 platform (using the Experimenter Kit with a C6748 SOM) in C/C++.

The last obstacle is a FOR loop that I believe has tough dependency issues and is causing the process to run painfully slowly.

Background: C6748 (chosen because of the need for low power). L1 data and program caches are active, and 128k of L2 cache is enabled. We are using a fair portion of mDDR for both program and data. The CPU clock is 300 MHz; mDDR runs at 132 MHz.

The pointers and arrays used in the FOR loop are stored in local memory that is cacheable in L1 and L2. I have also tried placing them directly in L1 data memory, with no improvement. No compiler optimization settings have made any difference to this part of the program.

In one function call there is a WHILE loop that typically iterates about 256 times, and inside it a FOR loop that iterates 25-40 times. The function containing the WHILE loop is called about 183 times per processing cycle, so we are looking at roughly 1.4 million FOR-loop iterations per processing cycle. We would like a cycle to take about 5 seconds (or less), but it currently takes about 40 seconds. Based on time measurements using CLK_gethtime, summing the FOR-loop times within the WHILE loop, each WHILE loop (256 iterations) takes about 106 msec, of which the FOR loop consumes 100 msec. The other code in the WHILE loop is considerable in amount yet uses less than 6% of the time, which adds to the frustration.

Visual aid:

while ()                 // iterates ~256 times
{
    ... a bit of code ...

    for ()               // the bad loop: iterates 25-40 times
    {
        11 array value updates, each depending on a previous result
    }

    ... a lot of code ...
}  // end of while

 

In the FOR loop there are 11 array updates that require the previous (m-1) result from the same array as part of the answer. I will give one as an example:

Defined in the .h file:

#define CSqMod(ar,ai) (((ar)*(ar))+((ai)*(ai)))

Sample of one dependent calculation in the FOR loop, in the .cpp file:

for (m = 1; m <= M; m++)
{
    mM1 = m - 1;
    AR[m] = AR[mM1] - (AR[mM1] * AR[mM1] * CSqMod(BR[mM1], BI[mM1]) / RBR[mM1]);
}

AR has an n-1 dependency on itself, and BR, BI, and RBR are values also calculated in the loop on the previous iteration. When I eliminated this one calculation, the overhead dropped by 50 msec (of the 100 msec) per loop. There are 11 calculations similar to this one in total. The individual calculations can be separated into individual FOR loops, so long as any loop that produces a value precedes the loop that depends on it, but trying this made no difference to the performance... it may even have added time.

If someone has any ideas that will save us, it would be much appreciated.

Thank you very much in advance.

Dan.

  • DanB said:

    AR[m] = AR[mM1] - (AR[mM1] * AR[mM1] * CSqMod(BR[mM1], BI[mM1]) / RBR[mM1]);

    This expression has a divide operation in it.  Unless the divisor is an integer constant, this is going to result in a function call, which will disqualify the loop from software pipelining.

    Look at the compiler-generated assembly code for the function in question.  For each inner loop, you should see a comment block entitled SOFTWARE PIPELINE INFORMATION which should give you a lot more information about what the bottleneck is.

  • Hello Archaeologist,

    Sounds like this is definitely the right direction. As you suspected, the ASM did indicate an inability to pipeline due to the function call; with the divide removed, I saw successful pipelining information and it ran much faster. The divide is by a float value, by the way.

    Now that this appears to be the right direction, should I try to create a separate reciprocal (i.e. 1/x) RBR array before running this loop?

    From a convenience point of view I would probably want to use a FOR loop for this reciprocal array as well, which would presumably lead back to the same problem? Since M is only between 25 and 40, the inversions could conceivably be done individually if that would help.

    Would intrinsic calls be of use here?

    Thanks very much .

    Dan.

     

  • DanB said:

    Now that this appears to be the right direction, should I try to create a separate reciprocal (i.e. 1/x) RBR array before running this loop?

    Yes, if the values are re-used a lot, you should try pre-computing the 1/x values, since multiplication is an actual instruction and does not require a function call.

    DanB said:

    From a convenience point of view I would probably want to use a FOR loop for this reciprocal array as well, which would presumably lead back to the same problem?

    Perhaps, but even if that loop fails to software pipeline, you might still be ahead.

    DanB said:

    Would intrinsic calls be of use here?

    Intrinsic calls will not get around the fact that accurate floating-point division requires a call. If you only require an estimate, you can get much faster. In particular, you could use _rcpsp or _rcpdp to get a good approximation of 1/x.

  • Hi Archaeologist,

    It seems the floating-point divides in general are the focal point of the slowness. I calculated reciprocal arrays where needed and replaced the divides with multiplies. Replacing just a single divide in a self-contained FOR loop brought no improvement, but when a reciprocal was used more than once there was a benefit. I also found that after the first iteration, previously computed reciprocal arrays could be reused in computing the next one (rather than re-calculating another inversion)... that is where things really started to pick up. We went from 107 msec per loop to 36 msec per loop.

    I tried using the intrinsic _rcpsp to calculate some of the reciprocal arrays; while the net result was not generally accurate enough, the improvement in performance was large. Calculating just one of the reciprocal arrays with the intrinsic single-precision reciprocal estimate cut the time three-fold.

    So my next question:

    Is there a "fast" single-precision float divide function out there that may compromise a little accuracy, but do better than _rcpsp?

    Thanks again very much for your help!

    Dan.

  • Hi,

    Scratch the last question: I just found the fast RTS library link!

    Thanks for your help.

    Dan.