This thread has been locked.

66AK2H14 C66x DSP Lib and Floating Point performance

Other Parts Discussed in Thread: 66AK2H14, 66AK2H12

I am in the midst of testing the floating-point performance of the C66x core on the 66AK2H14. The claimed capability is around 19.2 GFLOPS, which I assume relies on the SIMD operations. Obviously that is a marketing-optimal figure, but I am only getting around 3 GFLOPS, both with the DSP library FFT functions and with my own test loops. So the questions are:

1) Does the C66x DSP library use any of the vector instructions? (It doesn't seem like it does.)

2) Besides turning on all optimizations and options in the C compiler, is there anything that would get the compiler to actually use these SIMD instructions, or do they have to be coded in assembly?

Thanks.

  • Regarding GFLOPS:

    The C66x DSP architecture supports 8 floating-point multiplies per cycle (4 on each of the .M1 and .M2 units) and 8 floating-point additions (2 on each of the .L1, .L2, .S1 and .S2 units). That makes 16 FLOPS per cycle, or 16 GFLOPS for a 1 GHz machine; the 66AK2H12 runs at 1.2 GHz, so the device can achieve 16 x 1.2 = 19.2 GFLOPS. There are specific instructions, QMPYSP and CMPYSP, which achieve 4 multiplies on the .M1 or .M2 unit, and DADDSP, which achieves 2 additions on each of the .L1, .L2, .S1 and .S2 units. Refer to the DSPF_sp_fir_cmplx example in DSPLIB, which uses the CMPYSP (complex multiply) and DADDSP (dual addition) instructions to achieve 16 FLOPS per cycle.
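    The peak-rate arithmetic quoted here is easy to sanity-check with a few lines of portable C. Nothing below is device-specific; the constants are simply the per-unit figures from this post:

    ```c
    #include <stdio.h>

    /* Peak GFLOPS = FLOPS issued per cycle x core clock in GHz. */
    static double peak_gflops(int flops_per_cycle, double clock_ghz)
    {
        return flops_per_cycle * clock_ghz;
    }

    int main(void)
    {
        /* From the post: 4 SP multiplies on each of .M1/.M2 = 8,
           plus 2 SP additions on each of .L1/.L2/.S1/.S2 = 8. */
        int flops_per_cycle = 4 * 2 + 2 * 4;   /* 16 */
        printf("%.1f GFLOPS per core\n", peak_gflops(flops_per_cycle, 1.2));
        return 0;
    }
    ```

    Note that this is a per-core figure; it assumes every cycle issues the full complement of multiplies and additions with no stalls.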

  • Also refer to the thread below:
    e2e.ti.com/.../430140
  • Thank you. Do you know if any of the FFT DSP lib routines use these accelerations? Those are the routines I used, and they only seem to manage around 3 GFLOPS.
  • They tend to ignore threads that are marked with a green tick, so don't hold your breath waiting for TI.

    As for 19 GFLOPS: you have to recognize that it's not something you can achieve with an arbitrary algorithm. Indeed, 19 GFLOPS means you have to have 4 complex [single-precision] numbers to multiply and add every cycle. Now consider matrix multiplication, for example. Can you load 4 complex numbers per cycle? No, you can load only 2, and so you would be able to achieve only 9.5 GFLOPS. An algorithm can be more sophisticated than a straightforward multiply-and-add, which might claim an additional cycle in the loop kernel for just a pair of FLOPs, and so you are suddenly down to 5.94 GFLOPS. And there can be algorithmic dependencies between iterations (like in FFT), which in combination with short enough inputs make the inner loops' epilogues count. Or simply imagine that the data is laid out in a manner that causes cache thrashing... The bottom line is that the advertised peak FLOPS is not really a helpful metric for estimating the performance of an arbitrary algorithm implementation.

    To the question itself. Are the magic instructions that allow you to achieve 19 GFLOPS in that very specific case used in DSPF_sp_fftSPxSP? Yes, both CMPYSP and DADDSP are used in the linear assembly module. But as just said, that doesn't simply give you 19 GFLOPS. Also, as implied above, when it comes to top-notch performance just one extra cycle can drag you down, and I'd go ahead and say that linear assembly doesn't actually guarantee that an extra cycle is unavoidable. I'm not saying that the generated code is suboptimal, nor am I saying that 3 GFLOPS is an adequate result in this case. All I'm saying is that there is no easy answer, and finding the answer takes a very sharp understanding of every detail of the implementation. The real problem is that obtaining that kind of understanding can be equivalent to implementing the algorithm yourself. Which means that you can't actually expect a definitive answer to whether X GFLOPS is adequate in some specific case.
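    The peak-versus-achieved point above can be made concrete with a back-of-envelope calculation. A minimal sketch in portable C, assuming the conventional 5·N·log2(N) FLOP estimate for a complex radix-2 FFT and a purely hypothetical cycle count (not a measurement of DSPF_sp_fftSPxSP):

    ```c
    #include <math.h>
    #include <stdio.h>

    /* Conventional FLOP estimate for an N-point complex radix-2 FFT. */
    static double fft_flops(int n)
    {
        return 5.0 * n * log2((double)n);
    }

    /* Achieved GFLOPS given a measured cycle count and the clock in GHz. */
    static double achieved_gflops(double flops, double cycles, double clock_ghz)
    {
        return flops / cycles * clock_ghz;
    }

    int main(void)
    {
        int n = 1024;
        double flops = fft_flops(n);   /* 5 * 1024 * 10 = 51200 */
        /* Hypothetical: a 1K-point FFT taking 20000 cycles at 1.2 GHz. */
        double g = achieved_gflops(flops, 20000.0, 1.2);
        printf("%.0f FLOPs, ~%.2f GFLOPS\n", flops, g);
        return 0;
    }
    ```

    With those made-up numbers the result is about 3.07 GFLOPS, i.e. the roughly 3 GFLOPS reported in this thread is what a per-FFT cycle count in that ballpark would imply, well below the 19.2 GFLOPS peak.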

  • Hi Andy and Edmund,
    Apologies for the delay. I have asked the appropriate expert to answer this thread. They will get back to you soon.
    Thank you.
  • The DSP library has an optimized version of the code under \packages\ti\dsplib\src\DSPF_sp_fftSPxSP\c66. Those functions use the _daddsp and _dmpysp intrinsics, which map onto the vector operations and take advantage of the architecture.

    Regards,
    Asheesh
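    For readers unfamiliar with those intrinsics: _daddsp and _dmpysp each operate on a packed pair of single-precision floats held in a 64-bit register pair, performing two operations per instruction. The sketch below is only a behavioral model in portable C (it shows what they compute, not how fast); the real intrinsics are declared by the C6000 compiler and take its packed 64-bit types:

    ```c
    #include <stdio.h>

    /* Behavioral model of a packed pair of single-precision floats.
       On the C66x this is a 64-bit register pair, not a struct. */
    typedef struct { float lo, hi; } float2;

    /* Model of _daddsp: two independent SP additions per instruction. */
    static float2 model_daddsp(float2 a, float2 b)
    {
        float2 r = { a.lo + b.lo, a.hi + b.hi };
        return r;
    }

    /* Model of _dmpysp: two independent SP multiplies per instruction. */
    static float2 model_dmpysp(float2 a, float2 b)
    {
        float2 r = { a.lo * b.lo, a.hi * b.hi };
        return r;
    }

    int main(void)
    {
        float2 x = { 1.0f, 2.0f };
        float2 y = { 3.0f, 4.0f };
        float2 s = model_daddsp(x, y);  /* { 4, 6 } */
        float2 p = model_dmpysp(x, y);  /* { 3, 8 } */
        printf("sum = {%g, %g}, product = {%g, %g}\n", s.lo, s.hi, p.lo, p.hi);
        return 0;
    }
    ```

    On the C66x itself, the compiler can often generate these vector forms from plain C loops when pointers are restrict-qualified and high optimization levels are enabled; inspecting the generated assembly listing shows whether it succeeded.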
  • Can you point to particular functions that will achieve this performance please?

  • If you read the response from Raja above, the FIR function DSPF_sp_fir_cmplx achieves the performance you are looking for, and it is part of the DSP library.
    Regards,
    Asheesh