
Optimizing Vector Math Calculations on C28x + FPU

Other Parts Discussed in Thread: CONTROLSUITE

I've attached an assembly function that I created to perform some basic statistics calculations on incoming analog data on a 2837x.  At each sample time, a sliding window of stats for N points is updated (I am currently using N=64).  The organization of the function is based on techniques used in similar functions found in the Vector Math examples and source in the FPU library in controlSUITE.  I tried to take advantage of parallel instructions and executing instructions in the delay cycles of 2p instructions whenever possible.
 
I also coded up the calculations in C but did not attempt any specific optimizations other than what the compiler performs.
 
#define SIZE 64
zSqr = z * z;
for (j = 0; j < SIZE; j++)
{
    zSqrSum[j] = zSqrSum[j] + zSqr;
    hzSum[j] = hzSum[j] + h[j] * z;
}
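
For reference, here is the same fragment written out as a self-contained routine with the declarations filled in (the function wrapper and the names added here are just for illustration):

#include <stdint.h>

#define SIZE 64

float zSqrSum[SIZE];   /* running sum of z squared for each point */
float hzSum[SIZE];     /* running sum of h[j] * z for each point  */
float h[SIZE];         /* per-point coefficients                  */

void UpdateBasicStats(float z)
{
    float zSqr = z * z;
    uint16_t j;

    for (j = 0; j < SIZE; j++)
    {
        zSqrSum[j] = zSqrSum[j] + zSqr;
        hzSum[j] = hzSum[j] + h[j] * z;
    }
}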
 
With both the assembly and the C code, the routine is run out of RAM instead of Flash to improve speed.  Here are the profiling results when running the code on the 28377D Experimenter's Kit:
 
Every 1 ADC sample / 64 points:
C - 14us
Optimized Assembly - 2.5us
 
Based on my hand counting of system cycles for each instruction in the assembly function, it should take 6N + 27 cycles (N = number of points) to complete all of the instructions, including the call and return.  For 64 points that is 6 * 64 + 27 = 411 cycles, which at a 5ns system clock works out to 2.055us, in the ballpark of the number that I get from profiling on the micro.
 
I think I've optimized the assembly function about as much as I can given the C28x + FPU architecture but would like someone at TI who is very familiar with the FPU to look over my code to see if I am missing something. 
 
Thanks!
Jim
    8284.UpdateTDDBasicStats.asm

  • Hi Jim,

    I took a look at it. You can parallelize the MACF32 and the MOV32. I made the changes and it builds with no pipeline issues, but I haven't tested it. Let me know if it works.

  • Hi Vishal,

    I implemented your suggested changes to take greater advantage of the pipeline and was able to knock off 0.2us from the overall time and go from 2.5us to 2.3us for processing 64 points. Thanks!

    Do you see any other opportunities to substantially reduce the compute time?

    Jim
  • J Lubbe said:
    Do you see any other opportunities to substantially reduce the compute time?

    Not really. The loop is pretty tight. There is nothing you can do about the loads/stores that are not parallelized. From the looks of it, you are buffering the ADC input in an array x and passing the pointer. If you were doing this on a per-sample basis, you could skip the store before the function and just load the ADC value and convert it to float in the first two lines of the function.

    So you would put this at the top of your .asm file

       .cdecls C, LIST, NOWARN, "F2837xD_Adc.h"

    Then the first two lines of code in the function would be:

        MOVW       DP, #_AdcaResultRegs
        UI16TOF32  R1H, @_AdcaResultRegs.ADCRESULT0

    So basically, if you don't need to buffer the sampled input, only the processed data, you can save a couple more cycles by reading the ADC inside the function.
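
    In C terms that read is just something like the following (this assumes the sample you want is sitting in ADCA result register 0, i.e. ADCRESULT0, as in the asm lines above):

        #include "F2837xD_device.h"          /* pulls in AdcaResultRegs */

        /* read the latest conversion result and convert it to float,
           instead of storing it to a buffer array first */
        float z = (float)AdcaResultRegs.ADCRESULT0;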

  • Thanks for all of your help, Vishal.  My next goal is to get the basic stats code running on the 28377 CLA.  

    The CLA has fewer instructions and fewer floating-point and auxiliary registers than the C28x + FPU, so I have discovered that there are some extra hoops to jump through to perform the same calculations in assembly code.  I don't yet have a correctly working assembly function, but I did run some tests with C code in a CLA task.

    #define N 32

    zSqr = z * z;

    With 32 points, it takes 7us to execute a standard for loop:

    for (j = 0; j < N; j++)
    {
       zSqrSum[j] += zSqr;
       hzSum[j] += h[j] * z;
    }

    I manually unrolled the loop in C and the time it took to execute 32 points dropped to 1.7us.

    zSqrSum[0] += zSqr;
    hzSum[0] += h[0] * z;

    zSqrSum[1] += zSqr;
    hzSum[1] += h[1] * z;

    ...

    zSqrSum[N-1] += zSqr;
    hzSum[N-1] += h[N-1] * z;

    When running the same stats calculations on the C28x + FPU, it takes 1.15us for 32 points.  If possible,  I would like to achieve this same performance on the CLA.  I am looking at the assembly code generated by the compiler to see if there are more opportunities for optimization.  I've attached that file to this post.  The stats calculations code runs in Cla1Task1.

    Do you have any suggestions?

    Jim

    crc8.asm

  • Hi Jim,

    It would be helpful if you could isolate the code you want to optimize in Task 1; there is some CRC code mixed in there. Also, can you turn optimization up to O2? Keep the for loop, and just above it add the following:

    #pragma UNROLL(32)

    Now, on O2 I think the compiler will automatically unroll the loop, but the pragma will kinda prod the compiler to do it. Also, I see in the assembly that it's making a lot of use of MR0 and MR1 and frequently reading and writing to the scratchpad. It should be making more use of MR2 and MR3 and less use of memory. O2 should help with that.

    We can then try to optimize the compiler optimized code.
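
    Just to be explicit, what I have in mind is the loop from your earlier post with the pragma placed directly above it:

    #pragma UNROLL(32)
      for (j = 0; j < N; j++)
      {
         zSqrSum[j] += zSqr;      /* at O2 the compiler should fully unroll this */
         hzSum[j] += h[j] * z;
      }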

  • I set optimization to O2 and added the UNROLL pragma.  Now the for loop executes as fast as the manually unrolled C code.  One other thing that I tried is moving the h variable into the structure instead of having a separate array for it.  After I did that, the execution time for 32 points dropped from 1.7us to 1.25us.  That is getting very close to what I was achieving in the optimized assembly routine on the C28x + FPU.  With everything in the array of structures, I think the compiler is better able to make use of indexing.  I've attached a file with the generated assembly code for just CLA Task 1.

    typedef struct {
       float zSqrSum;
       float hzSum;
       float h;
    } t_BasicStats;

    t_BasicStats TDD_BasicStats[SIZE];

    float h[SIZE];

    float r;

    __interrupt void Cla1Task1 ( void )
    {
      uint16_t j;
      float rSqr;

      rSqr = r * r;

    #pragma UNROLL(SIZE)
      for (j = 0; j < SIZE; j++)
      {
         TDD_BasicStats[j].zSqrSum += rSqr;
         TDD_BasicStats[j].hzSum += TDD_BasicStats[j].h * r;
      }
    }

    TDD.asm

  • Yeah, that unrolled loop looks pretty compact to me. I don't think there's much room for improvement there. Getting rid of the GPIO set and clear should get rid of about 16 instructions, so there is some saving there, but then you wouldn't be able to measure the elapsed time on a scope.
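
    (The GPIO set and clear I'm referring to is the usual scope-timing bracket around the calculations, something like the lines below; the pin number here is just an example.)

        GpioDataRegs.GPASET.bit.GPIO12 = 1;     /* mark the start of the calculations        */

        /* ... stats calculations ... */

        GpioDataRegs.GPACLEAR.bit.GPIO12 = 1;   /* mark the end; pulse width = elapsed time  */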

  • Ok. Thanks for all of your help!