8284.UpdateTDDBasicStats.asm

Hi Jim,
I took a look at it. You can parallelize the MACF32 and MOV32. I made the changes and it builds with no pipeline issues, but I haven't tested it. Let me know if it works.
J Lubbe said: Do you see any other opportunities to substantially reduce the compute time?
Not really. The loop is pretty tight. There is nothing you can do about the loads/stores that are not parallelized. From the looks of it, you are buffering the ADC input in an array x and passing the pointer. If you were doing this on a per-sample basis, you could skip the store before the function and just load the ADC value and convert it to float in the first two lines of the function.
So you would put this at the top of your .asm file:
.cdecls C, LIST, NOWARN, "F2837xD_Adc.h"
Then the first two lines of code in the function would be:
MOVW DP, #_AdcaResultRegs
UI16TOF32 R1H, @_AdcaResultRegs.ADCRESULT0
So basically, if you don't need to buffer the sampled input, only the processed results, you can save a couple more cycles by reading the ADC inside the function.
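To make it concrete, the change at the call site would look roughly like the C-level sketch below. The prototypes, the buffer length, and the per-sample entry point name are just placeholders for illustration, not your actual code in the attached UpdateTDDBasicStats.asm:

#include <stdint.h>
#include "F2837xD_device.h"   /* TI F2837xD device headers (AdcaResultRegs, etc.), or your project's shared header */

#define NSAMP 32              /* illustrative buffer length */

/* Assumed prototypes for the assembly routines. */
extern void UpdateTDDBasicStats(float *x, uint16_t n);   /* buffered version            */
extern void UpdateTDDBasicStatsPerSample(void);          /* hypothetical per-sample version */

float x[NSAMP];

void ProcessBuffered(uint16_t i)
{
    /* Current flow: store each conversion into x[], then pass the pointer. */
    x[i] = (float)AdcaResultRegs.ADCRESULT0;
    if (i == (NSAMP - 1))
    {
        UpdateTDDBasicStats(x, NSAMP);
    }
}

void ProcessPerSample(void)
{
    /* Suggested flow: no input buffer at all. The routine itself does the
       MOVW DP, #_AdcaResultRegs / UI16TOF32 pair shown above, so the store
       here (and the matching load inside the function) goes away. */
    UpdateTDDBasicStatsPerSample();
}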
Thanks for all of your help, Vishal. My next goal is to get the basic stats code running on the 28377 CLA.
The CLA has fewer instructions and fewer floating-point and auxiliary registers than the C28x + FPU, so I have discovered that there are some extra hoops to jump through to perform the same calculations in assembly code. I don't yet have a correctly working assembly function, but I did run some tests with C code in a CLA task.
With 32 points, it takes 7us to execute a standard for loop.

#define N 32

zSqr = z * z;
for (j = 0; j < N; j++)
{
    zSqrSum[j] += zSqr;
    hzSum[j] += h[j] * z;
}
I manually unrolled the loop in C and the time it took to execute 32 points dropped to 1.7us.
zSqrSum[0] += zSqr;
hzSum[0] += h[0] * z;
zSqrSum[1] += zSqr;
hzSum[1] += h[1] * z;
.
.
.
zSqrSum[N-1] += zSqr;
hzSum[N-1] += h[N-1] * z;
When running the same stats calculations on the C28x + FPU, it takes 1.15us for 32 points. If possible, I would like to achieve this same performance on the CLA. I am looking at the assembly code generated by the compiler to see if there are more opportunities for optimization. I've attached that file to this post. The stats calculations code runs in Cla1Task1.
Do you have any suggestions?
Jim
Hi Jim,
It would be helpful if you could isolate the code you want to optimize in Task 1; there is some CRC code mixed in there. Also, can you turn optimization up to O2? Keep the for loop, and just above it add the following:
#pragma UNROLL(32)
Now, on O2 I think the compiler will automatically unroll the loop, but the pragma will prod it to do so. Also, I see in the assembly that it's making a lot of use of MR0 and MR1 and frequently reading and writing to the scratchpad. It should be making more use of MR2 and MR3 and less use of memory; O2 should help with that.
We can then try to optimize the compiler-optimized code.
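Putting the two together, the task would look something like the sketch below. The array names are just placeholders for your actual buffers:

#include <stdint.h>

#define N 32

/* Placeholder data -- substitute your actual arrays. */
float zSqrSum[N];
float hzSum[N];
float h[N];
float z;

__interrupt void Cla1Task1 ( void )
{
    uint16_t j;
    float zSqr;

    zSqr = z * z;

    /* With --opt_level=2 (-O2) the compiler may unroll this on its own;
       the pragma makes the request explicit. */
    #pragma UNROLL(N)
    for (j = 0; j < N; j++)
    {
        zSqrSum[j] += zSqr;
        hzSum[j] += h[j] * z;
    }
}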
I set optimization to O2 and added the UNROLL pragma. Now the for-loop code executes as fast as the manually unrolled C code. One other thing that I tried is moving the h variable into the structure instead of having a separate array for it. After I did that, the execution time for 32 points dropped from 1.7us to 1.25us. That is getting very close to what I was achieving in the optimized assembly routine on the C28x + FPU. With everything in the array of structures, I think the compiler is better able to make use of indexing. I've attached a file with the generated assembly code for just CLA Task 1.
typedef struct {
    float zSqrSum;
    float hzSum;
    float h;
} t_BasicStats;

t_BasicStats TDD_BasicStats[SIZE];
float h[SIZE];
float r;

__interrupt void Cla1Task1 ( void )
{
    uint16_t j;
    float rSqr;

    rSqr = r * r;

    #pragma UNROLL(SIZE)
    for (j = 0; j < SIZE; j++)
    {
        TDD_BasicStats[j].zSqrSum += rSqr;
        TDD_BasicStats[j].hzSum += TDD_BasicStats[j].h * r;
    }
}
Yeah, that unrolled loop looks pretty compact to me. I don't think there's much room for improvement there. Getting rid of the GPIO set and clear would get rid of about 16 instructions, so there is some saving there, but then you wouldn't be able to measure the elapsed time on a scope.
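For reference, that set/clear pair would sit around the stats loop roughly as in the sketch below. The pin (GPIO11 here) is just a placeholder, and I'm assuming the GPIO data registers are reachable from the CLA task the same way they are in your current code:

#include <stdint.h>
#include "F2837xD_device.h"   /* TI F2837xD device headers (GpioDataRegs, etc.), or your project's shared header */

#define SIZE 32

typedef struct {
    float zSqrSum;
    float hzSum;
    float h;
} t_BasicStats;

t_BasicStats TDD_BasicStats[SIZE];
float r;

__interrupt void Cla1Task1 ( void )
{
    uint16_t j;
    float rSqr;

    /* Scope marker: pin high at task entry. This set/clear pair is the
       ~16 instructions that could be removed once timing is no longer needed. */
    GpioDataRegs.GPASET.bit.GPIO11 = 1;

    rSqr = r * r;

    #pragma UNROLL(SIZE)
    for (j = 0; j < SIZE; j++)
    {
        TDD_BasicStats[j].zSqrSum += rSqr;
        TDD_BasicStats[j].hzSum += TDD_BasicStats[j].h * r;
    }

    /* Scope marker: pin low at task exit; the high time on the scope is the
       task's execution time. */
    GpioDataRegs.GPACLEAR.bit.GPIO11 = 1;
}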