Hello,
I am (hopefully) in the final stages of porting some Windows PC code to a C6748 platform (the Experimenter Kit with a C6748 SOM) using C/C++.
The last obstacle is a FOR-loop that I believe has tough dependency issues and is causing the process to run painfully slowly.
Background: we chose the C6748 because we need low power. The L1 data and program caches are active, and 128 KB of L2 cache is enabled. Parts of the program keep a fair portion of both code and data in mDDR. The CPU clock is 300 MHz and the mDDR runs at 132 MHz.
The pointers and arrays in question in the FOR-loop are stored in local memory that is enabled to be cached in L1 and L2 data caches. I have also tried putting them directly into L1 data memory with no improvement. No optimization settings on the compiler have made any difference to this part of the program.
In one function call there is a WHILE-loop that typically iterates about 256 times. Inside that WHILE-loop there is a FOR-loop that iterates between 25 and 40 times. The function that contains this WHILE-loop gets called about 183 times per processing cycle. So we are looking at about 1.4 million iterations of the FOR-loop in total per processing cycle. We would like a processing cycle to take about 5 seconds or less, but it currently takes about 40 seconds. Based on time measurements using CLK_gethtime, summing the FOR-loop times within the WHILE-loop, each WHILE-loop (256 iterations) takes about 106 msec, and the FOR-loop consumes 100 msec of that. The other code in the WHILE-loop is considerable, yet it uses less than 6% of the time, which adds to the frustration.
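To put those measurements in per-iteration terms, here is a rough back-of-the-envelope sketch. The function names are made up for illustration, and the figure of 30 FOR-loop iterations is an assumed midpoint of the 25-40 range, not a measured value.

```c
#include <assert.h>

/* Hypothetical helper: average CPU cycles per FOR-loop iteration,
   given the time the FOR-loop consumes inside one WHILE-loop. */
static double cycles_per_for_iteration(double for_loop_seconds,
                                       int while_iters,
                                       int for_iters,
                                       double cpu_hz)
{
    assert(while_iters > 0 && for_iters > 0);
    return for_loop_seconds * cpu_hz /
           ((double)while_iters * (double)for_iters);
}

/* Hypothetical helper: total FOR-loop iterations per processing cycle. */
static long total_for_iterations(int calls, int while_iters, int for_iters)
{
    return (long)calls * while_iters * for_iters;
}
```

With 100 msec per WHILE-loop, 256 WHILE iterations, the assumed 30 FOR iterations, and a 300 MHz clock, cycles_per_for_iteration(0.100, 256, 30, 300e6) comes out near 3,900 cycles per FOR-loop iteration, and total_for_iterations(183, 256, 30) gives about 1.4 million iterations per processing cycle.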
Visual aid:
while() // iterates about 256 times
{
    ... a bit of code
    Bad for-loop() // iterates 25-40 times
    {
        11 array value updates that depend on a previous result
    }
    A lot of code..
} // end of while
In the for-loop there are 11 array updates, each of which needs the n-1 result from the same array as part of its answer. I will give one as an example:
Defined in h-file:
#define CSqMod(ar, ai) (((ar) * (ar)) + ((ai) * (ai)))
Sample of one dependent calculation in the for loop:
CPP-file:
for(m = 1; m <= M; m++)
{
mM1 = m-1;
AR[m] = AR[mM1] - (AR[mM1] * AR[mM1] * CSqMod(BR[mM1], BI[mM1]) / RBR[mM1]);
}
AR has an n-1 dependency on itself, and BR, BI, and RBR are the previous-iteration values of quantities that are also calculated in the loop. When I eliminated this one calculation, the overhead dropped by 50 msec (of the 100 msec) per loop. There are a total of 11 calculations similar to this one. They can be separated into individual for-loops so long as each one precedes any other that depends on it, but trying this made no difference to the performance; it may even have added time.
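For illustration, here is a minimal sketch of the example update with the n-1 value carried in a local variable instead of being re-read from AR on every pass. It rests on assumptions not stated above: the arrays are taken to be single-precision float, they are assumed not to overlap (hence restrict), and BR, BI, and RBR are treated as already-computed inputs even though the real loop also produces them. The function name update_ar and its parameters are hypothetical. Whether this changes the timing would have to be measured.

```c
#include <assert.h>

/* Hypothetical sketch of the AR update. Assumes float data and
   non-overlapping arrays; AR must hold M + 1 elements and the
   input arrays at least M elements each. */
static void update_ar(float *restrict AR,
                      const float *restrict BR,
                      const float *restrict BI,
                      const float *restrict RBR,
                      int M)
{
    float a_prev;
    int m;

    assert(M >= 0);
    a_prev = AR[0];              /* carried n-1 value, kept in a local */

    for (m = 1; m <= M; m++) {
        float br = BR[m - 1];
        float bi = BI[m - 1];
        float sq = (br * br) + (bi * bi);   /* same as CSqMod(br, bi) */

        /* Same arithmetic as the original expression. */
        a_prev = a_prev - (a_prev * a_prev * sq / RBR[m - 1]);
        AR[m] = a_prev;
    }
}
```

The arithmetic is unchanged from the loop shown earlier; only the way the previous value is carried between iterations differs. If the real arrays are double rather than float, the types in the sketch would need to change to match.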
If someone has any ideas that will save us, it would be much appreciated.
Thank you very much in advance.
Dan.