profiling and Floating point multiplications on DM6446

I have a few basic questions.

1. Is it possible to obtain the number of cycles taken by a function without using the profiler? Is there a hardware counter or something similar on the DM6446?

2. Given the number of cycles, how do I calculate the time taken on a DM6446 running at, say, 513 MHz?

3. Can I perform floating-point multiplications on the DM6446? I have a 3x3 filter with floating-point coefficients. On the CCSv3.3 simulator I am able to multiply two floating-point values. Will the same code work on the hardware?

Thank you

  • Ramprasad said:

    3. Can I perform floating-point multiplications on the DM6446? I have a 3x3 filter with floating-point coefficients. On the CCSv3.3 simulator I am able to multiply two floating-point values. Will the same code work on the hardware?

    Thank you

    I think the IQMath library will help you deal with the floating-point calculations on the DM6446.

  • 1.  This is a complex question because the DM6446 has both an ARM core and a DSP core, so the answer depends on where your function runs. A tool you may find very helpful is our SoC Analyzer; it is a post-mortem tool that shows where time is being spent in your system, down to the function level, on both the ARM and DSP sides (see http://wiki.davincidsp.com/index.php?title=SoC_Analyzer for more details).

    2.  As part of the SoC Analyzer tool setup, you will enter the speed you are running at (e.g. 513 MHz), and the tool will give you the time.

    3. Yes, but floating-point calculations will be done in software, because the DM644x, unlike some of our other DSPs, has no floating-point hardware. As Lorry mentioned above, there are libraries that may help optimize this.
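
    For questions 1 and 2, a minimal sketch of measuring cycles by hand and converting them to time. It assumes the DSP side is a C64x+ core, whose free-running time-stamp counter is exposed as the TSCL/TSCH control registers through the TI compiler's c6x.h header; that counter is not mentioned elsewhere in this thread, so treat the target branch as an assumption to check against the CPU reference guide. The function names are hypothetical, and the host branch is only a stand-in so the arithmetic can be tried off-target.

```c
#include <assert.h>
#include <stdint.h>

#if defined(_TMS320C6400_PLUS)
#include <c6x.h>   /* TI header declaring the TSCL/TSCH control registers */

/* Assumption: writing any value to TSCL starts the free-running counter. */
void cycle_counter_start(void) { TSCL = 0; }
uint32_t cycle_counter_read(void) { return TSCL; }
#else
/* Host stand-in so the sketch compiles off-target; it counts nothing real. */
static uint32_t host_counter;
void cycle_counter_start(void) { host_counter = 0; }
uint32_t cycle_counter_read(void) { return host_counter; }
#endif

/* Cycles between two 32-bit readings. Unsigned subtraction stays correct
 * across a single wrap of the low counter word. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* time [s] = cycles / clock frequency [Hz], e.g. cycles / 513e6 at 513 MHz. */
double cycles_to_seconds(uint64_t cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)cycles / clock_hz;
}
```

    In use, one would call cycle_counter_start() once, read the counter before and after the function of interest, and pass the difference to cycles_to_seconds() with 513.0e6 as the clock. At 513 MHz the low 32 bits wrap after roughly 8 seconds, so longer measurements would need the high word as well.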

  • 1.       The function will run on the DSP. I am developing the algorithm as a library using the simulator and have made it xDAIS compliant (using the Hyperception's Component Wizard and have verified it using QualiTI). Do I have to worry about ARM?

    2.       From the profiler’s view under ‘cycles.Total:Incl.Total’ I have taken the number of cycles. Will “cycles/(513*1000000)” give the time taken in seconds?

    3.       Currently I have just multiplied the ‘float’ values directly using ‘*’. The compiler did not warn me. The code is running as expected (though very slowly). Does that mean the compiler has added code to perform the floating-point multiplication in software? If yes, is the compiler’s implementation not as good as the IQMath library’s?

    4.       ‘float’ addition also is done in software, am I right?

    5.       I see that the profiler is able to display all the details including the line numbers, even though the code was compiled in release mode, without debug info. How is that possible? Is the profiling intrusive?

    Currently I don’t have the hardware, so I am fully dependent on the simulator for design and implementation.

    Thank you very much for your support.

  • For a fixed-point device (the C64x+ in this case), it is always preferable to modify the algorithm to use fixed-point arithmetic. The floating-point emulation kernels called from the RTS support library will be slow. IQMath is a fixed-point library and will run faster than the floating-point emulation library. On a fixed-point device, here is what you should try:

    • Implement natively in fixed point for best performance
    • Implement using the IQMath library, which gives better control over precision while still staying in fixed point
      • Inline the IQMath APIs for better performance. The standard release includes source for the most commonly used APIs, which you can inline.
      • Call the IQMath APIs from the library if you can't inline
    • For still higher dynamic range, or a genuine need to stay in floating point, use the fastRTS library
      • Use the inline versions of the fastRTS APIs. The source is provided with the release.
      • Call the fastRTS APIs from the library if you can't inline
    • For float kernels that are called only once or are not performance critical, use the regular RTS library
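
    As an illustration of the first option, here is a rough sketch of a 3x3 filter written natively in fixed point. It assumes 8-bit pixels and coefficients in the range [-1, 1) stored in Q15 (value times 2^15 in a 16-bit integer). The helper names are hypothetical and this is not the IQMath or fastRTS API; the float-to-Q15 conversion is meant to run once at setup, so the per-pixel loop uses only integer multiplies and adds.

```c
#include <assert.h>
#include <stdint.h>

#define Q15_SHIFT 15

/* Convert a float coefficient to Q15 with rounding and saturation.
 * Intended for one-time setup, not for the per-pixel loop. */
int16_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f)  return 32767;
    if (scaled <= -32768.0f) return -32768;
    return (int16_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Apply a 3x3 Q15 kernel at interior pixel (x, y) of an 8-bit image.
 * k holds the nine coefficients in row-major order. */
uint8_t filter3x3_q15(const uint8_t *img, int stride, int x, int y,
                      const int16_t k[9])
{
    int32_t acc = 1 << (Q15_SHIFT - 1);   /* rounding offset of 0.5 in Q15 */
    int i, j;

    assert(stride >= 3);
    for (j = -1; j <= 1; j++) {
        for (i = -1; i <= 1; i++) {
            acc += (int32_t)k[(j + 1) * 3 + (i + 1)]
                 * (int32_t)img[(y + j) * stride + (x + i)];
        }
    }

    if (acc < 0) {
        return 0;                         /* clamp negative results */
    }
    acc >>= Q15_SHIFT;                    /* back from Q15 to integer pixels */
    return (uint8_t)(acc > 255 ? 255 : acc);
}
```

    The worst-case accumulator magnitude here is about 9 x 32768 x 255, which fits comfortably in 32 bits. A coefficient of exactly 1.0 saturates to 32767, so kernels that need gains of 1 or more would call for a different Q format.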

    Regards,

    Gagan