Problem in Speeding Up Floating Point Computations on DM3730

Hi,

I have code that requires around (10E+6)*(10E+6) floating-point
multiplications and as many floating-point additions, because I am
computing an autocorrelation over 10E+6 samples using circular
shifting.

When I run my code on the ARM core of the DM3730 (since I don't
know how to use the DSP core), around 50E+6 multiplications and
additions take more than 5 minutes, which is far too slow for my
requirement. My BB runs Angstrom and I am not able to use the hard FPU.

Can somebody suggest how to speed up the computation on the BB so
that I can build a feasible system?

A snippet of my code follows. Please help, I am stuck.

The maximum value of k = 1000000.

        for(i=1;i<=k;i++)
        {
                sum=0;
                for (j=1;j<=k;j++)
                {
                        sum = sum + (*(prdc_pulse_out_store + j))*(*(prdc_pulse_out + j));
                }

                *(Rxy + i) = (sum/k);

                // circular shifting
                temp = *(prdc_pulse_out + k);
                for(j=k;j>=2;j--)
                {
                        *(prdc_pulse_out + j) = *(prdc_pulse_out + j -1);
                }
                *(prdc_pulse_out + 1) = temp;

        }

  • I don't think there is much you can do about the floating point operations. I am assuming you are using double throughout. If you are using float, you might be able to avoid promotion to double by carefully casting or ordering your operations. It depends on the compiler.
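As a minimal sketch of what avoiding promotion can look like in C: an unsuffixed constant such as `1.0` has type double, so mixing it with float operands converts the expression to double, while `1.0f` keeps it in float. The function name `mean_product_f` and its 0-based arrays are hypothetical, chosen only for illustration. Whether this helps on a given target depends on the compiler and on how floating point is implemented there.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of element-wise products, kept in single precision.
   Hypothetical helper; not taken from the original post. */
float mean_product_f(const float *a, const float *b, size_t n)
{
    assert(n > 0);
    float sum = 0.0f;                 /* 0.0f is a float constant; 0.0 would be double */
    for (size_t j = 0; j < n; j++) {
        /* float * float has type float. Writing a[j] * b[j] * 1.0 would
           convert the operands to double because 1.0 is a double constant. */
        sum += a[j] * b[j];
    }
    return sum * (1.0f / (float)n);   /* explicit cast and float constant */
}
```

The same care applies to library calls: `sqrtf()` takes and returns float, whereas `sqrt()` works in double.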

    You could avoid shuffling your circular buffer by using a base index. Something like:

    j0 = 1;
    for(i=1;i<=k;i++)
    {
      sum=0;
      for (j=1;j<=k;j++)
      {
        jout = j0+j-1;
        if(jout > k)
          jout = jout - k;
        sum += prdc_pulse_out_store[j] * prdc_pulse_out[jout];
      }

      Rxy[i] = (sum/k);

      j0--;
      if(j0 <= 0)
        j0 = k;
    }

    Placing variables in registers might help but I think the FP ops will take the majority of the time.

    I noticed that you are indexing from 1. Was this ported from FORTRAN code? Your arrays should be allocated with one extra element to avoid out-of-bounds accesses.
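For comparison, here is a self-contained sketch of the same base-index idea written with C's usual 0-based indexing, so the arrays need exactly n elements. The function name `circ_corr` and the parameter names are assumptions made for illustration; the arithmetic is intended to match the original shifting loop, where lag s corresponds to s right rotations of the second buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Circular correlation without physically rotating the buffer:
   r[s] = (1/n) * sum over j of ref[j] * x[(j - s) mod n], 0-based.
   Hypothetical names; a sketch rather than the poster's final code. */
void circ_corr(const float *ref, const float *x, float *r, size_t n)
{
    assert(n > 0);
    size_t base = 0;                    /* element of x aligned with ref[0] */
    for (size_t s = 0; s < n; s++) {
        float sum = 0.0f;
        size_t idx = base;
        for (size_t j = 0; j < n; j++) {
            sum += ref[j] * x[idx];
            if (++idx == n)             /* wrap around instead of shifting */
                idx = 0;
        }
        r[s] = sum / (float)n;
        /* Moving the base back by one has the same effect as rotating
           x right by one position. */
        base = (base == 0) ? n - 1 : base - 1;
    }
}
```

This removes the n element moves per outer iteration, but the n*n multiply-add operations remain, so it cannot change the overall order of the run time.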

    Maybe the better question is: are there other autocorrelation algorithms?

     

  • Hi Norman,

    Actually, I am a VHDL programmer for FPGAs, so I do not have hands-on experience with C.

    Thanks for the hint about removing the circular buffer shuffling; this will help.

    I replicated the MATLAB code, which is why the indexing from 1 remains as it is.

    Yes, I will have to look for other autocorrelation algorithms.

    Regards,

    Mohit

  • If you know the dynamic range and precision of your data, you could go with fixed-point integer math. It is just a matter of keeping track of the decimal point. With integer math, restricting k to 2^n allows the division to be turned into a right shift.
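A rough sketch of the fixed-point idea, under assumptions the thread does not state: samples are taken to fit in 16-bit signed integers (for example Q15), the length is n = 2^log2n, and indexing is 0-based. The name `circ_corr_q15` and its parameters are hypothetical. With a power-of-two length, both the wrap-around and the division by n reduce to bit operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Circular correlation on 16-bit samples, n = 2^log2n.
   If the inputs are Q15, each result is in Q30 (product of two Q15 values).
   Hypothetical sketch; the scaling must be chosen for the real data. */
void circ_corr_q15(const int16_t *ref, const int16_t *x,
                   int32_t *r, unsigned log2n)
{
    assert(log2n < 31);
    const size_t n = (size_t)1 << log2n;
    const size_t mask = n - 1;          /* valid because n is a power of two */
    for (size_t s = 0; s < n; s++) {
        int64_t acc = 0;                /* 64-bit so n products cannot overflow */
        for (size_t j = 0; j < n; j++) {
            /* (j - s) wraps as unsigned; masking yields (j - s) mod n. */
            acc += (int32_t)ref[j] * (int32_t)x[(j - s) & mask];
        }
        /* Division by n as a right shift. Right-shifting a negative value
           is implementation-defined in C; GCC uses an arithmetic shift. */
        r[s] = (int32_t)(acc >> log2n);
    }
}
```

The 64-bit accumulator is a conservative choice; if the data range is known to be small, a 32-bit accumulator may be enough and is cheaper on a 32-bit core.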

     

  • Hi Norman:

      When you say there is not much we can do about floating point, are you assuming that the customer is already using the VFP in the Cortex-A8? If not, could you suggest how to use the VFP in the ARM core? NEON can also handle floating-point calculations; how could we use it on an Android system? Currently our software team is complaining that math calculations are 50% slower on the DM3730 than on the Qualcomm QSD8250. We are using Benchmark-Release.apk from Google to verify the performance. Perhaps you could shed some light on why we are seeing such a big difference in math performance.

  • Sorry, I have little experience with HW FPUs, VFP or the DM3730. I let the compiler take care of that by selecting a HW FPU if available. I did not see any way to reduce the number of FP (HW or SW) operations in the algorithm presented by Mohit Hada.
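One way to check what the compiler actually selected is to look at the macros it predefines. The sketch below assumes a GCC-compatible ARM toolchain, where, to my understanding, `__SOFTFP__` is defined for software floating point and `__ARM_NEON__` (or `__ARM_NEON`) when NEON is enabled. The function name `fp_build_mode` is hypothetical. With GCC, hardware floating point on a Cortex-A8 is typically requested with options such as `-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp`; whether the Angstrom or Android toolchains discussed here pass them is not stated in the thread.

```c
#include <assert.h>
#include <string.h>

/* Describes the floating-point mode this file was compiled for, based on
   macros predefined by GCC-compatible compilers (an assumption; other
   toolchains may use different macros). */
const char *fp_build_mode(void)
{
#if !defined(__arm__)
    return "not a 32-bit ARM build";
#elif defined(__SOFTFP__)
    return "ARM, software floating point (no FPU instructions emitted)";
#elif defined(__ARM_NEON__) || defined(__ARM_NEON)
    return "ARM, hardware floating point with NEON enabled";
#else
    return "ARM, hardware floating point (VFP), NEON not enabled";
#endif
}
```

Printing this string once at program start, for example with `puts(fp_build_mode())`, shows whether a slow build is simply falling back to software floating point.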