Compiler/TMS320C6678: 6678 optimized code performance issue

Part Number: TMS320C6678

Tool/software: TI C/C++ Compiler

Hello

I have two questions to ask.

We tried different sizes of FFT (double, float, int32) from the C6678 DSPLIB (version 3.4.0.0) and used the DSP's TSCL register to measure the FFT execution time. In all cases our measured time was 2 to 2.5 times greater than the numbers reported in the DSPLIB benchmarks. Here is the first question:

1- What is the source of this mismatch between our measured execution times and the times TI reports? (We implemented our test according to all of TI's recommendations.)

We also compared the TI DSPLIB benchmark results for the 6455 with those for the 6678. Here is the second question:

2- Considering all of the improvements in the 6678, for example the addition of more multipliers and ..., how come the 6455 and 6678 perform approximately the same (based on TI's reported numbers)?
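
For reference, the TSCL-based measurement described in the first question can be sketched roughly as below. This is only a minimal sketch, not the poster's actual test code: `measure_cycles`, `timing_result` and `dummy_kernel` are hypothetical names, `dummy_kernel` merely stands in for a DSPLIB FFT call, and the `clock()` fallback exists only so that the sketch also builds on a host PC. It reports the first (cold-cache) run separately from the best run, since a large gap between the two usually points at memory and cache effects rather than at the kernel itself.

```c
#include <stdint.h>
#include <stddef.h>

#ifdef _TMS320C6X
#include <c6x.h>   /* TI compiler header that declares the TSCL/TSCH registers */
static void start_counter(void) { TSCL = 0; }   /* any write to TSCL starts the counter */
static uint64_t read_cycles(void)
{
    /* Reading TSCL latches the upper half into TSCH, so TSCL is read first. */
    uint32_t lo = TSCL;
    uint32_t hi = TSCH;
    return ((uint64_t)hi << 32) | lo;
}
#else
#include <time.h>  /* host fallback (assumption): clock ticks, not DSP cycles */
static void start_counter(void) { }
static uint64_t read_cycles(void) { return (uint64_t)clock(); }
#endif

typedef void (*kernel_fn)(void *ctx);

typedef struct {
    uint64_t first;  /* first call: caches are cold */
    uint64_t best;   /* minimum over all calls: caches are warm */
} timing_result;

/* Hypothetical stand-in for the FFT call being benchmarked. */
static volatile float sink;
void dummy_kernel(void *ctx)
{
    float s = 0.0f;
    int i;
    (void)ctx;
    for (i = 0; i < 1000; i++)
        s += (float)i * 0.5f;
    sink = s;
}

/* Times `runs` calls of fn and subtracts the cost of an empty measurement. */
timing_result measure_cycles(kernel_fn fn, void *ctx, int runs)
{
    timing_result r = { 0, 0 };
    uint64_t t0, t1, overhead, best = UINT64_MAX, first = 0;
    int i;

    if (runs <= 0)
        return r;

    start_counter();
    t0 = read_cycles();
    t1 = read_cycles();
    overhead = t1 - t0;          /* cost of two back-to-back counter reads */

    for (i = 0; i < runs; i++) {
        uint64_t elapsed;
        t0 = read_cycles();
        fn(ctx);
        t1 = read_cycles();
        elapsed = t1 - t0;
        if (i == 0)
            first = elapsed;
        if (elapsed < best)
            best = elapsed;
    }
    r.first = first > overhead ? first - overhead : 0;
    r.best  = best  > overhead ? best  - overhead : 0;
    return r;
}
```

A call such as `measure_cycles(dummy_kernel, NULL, 10)` returns both figures; on hardware, the `best` value is the one most comparable to a flat-memory simulator number.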

  • Malek,

    There are numerous E2E posts that discuss this topic. In summary, the DSPLIB benchmarks are published using a simulator model that assumes flat memory, while the DSP devices have 3-4 different memory tiers, plus cache and interrupt latencies that are not accounted for in those benchmarks. If you run the code from MSMC memory or DDR memory without turning on the cache, your numbers will likely be slower.

    The best way to reproduce numbers close to the ones provided in DSPLIB is to put all of your code and data in L2 and enable the L1D and L2 caches. This is what we did when producing the DSP core benchmarks discussed here:

    http://www.ti.com/lit/an/sprac13/sprac13.pdf

    Or in the Audio benchmark starterkit demo that we provide in Processor SDK RTOS:

    http://software-dl.ti.com/processor-sdk-rtos/esd/docs/latest/rtos/index_examples_demos.html#audio-benchmark-starterkit
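
The L2 placement and cache setup suggested above might look roughly like the following. This is a sketch under assumptions, not the configuration used for the published benchmarks: the section name `.l2_data`, the buffer size and the chosen cache sizes are hypothetical, the section still has to be mapped to L2 SRAM in the linker command file, and the `CACHE_set*Size` calls and enum names are assumed to match the C66x CSL header `csl_cacheAux.h`, so please check them against your PDK version.

```c
#include <stddef.h>

#ifdef _TMS320C6X
#include <ti/csl/csl_cacheAux.h>   /* C66x CSL cache helpers (assumed PDK path) */
#endif

#define FFT_N 1024                 /* hypothetical FFT length */

/* Complex input buffer (interleaved re/im), placed in a section that the
 * linker command file is assumed to map to local L2 SRAM. */
#ifdef _TMS320C6X
#pragma DATA_SECTION(fft_in, ".l2_data")
#pragma DATA_ALIGN(fft_in, 8)
#endif
float fft_in[2 * FFT_N];

/* Enables the L1 caches and part of L2 as cache.
 * Returns 1 when the DSP cache registers were programmed, 0 on a host build. */
int configure_caches(void)
{
#ifdef _TMS320C6X
    CACHE_setL1PSize(CACHE_L1_32KCACHE);  /* all 32 KB of L1P as cache */
    CACHE_setL1DSize(CACHE_L1_32KCACHE);  /* all 32 KB of L1D as cache */
    CACHE_setL2Size(CACHE_256KCACHE);     /* assumption: 256 KB cache, rest stays SRAM */
    return 1;
#else
    return 0;                             /* nothing to configure off-target */
#endif
}
```

Calling `configure_caches()` once at start-up, before the timed FFT calls, keeps the cache setup out of the measured interval.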

    malek alashiri said:
    2- Considering all of the improvements in the 6678, for example the addition of more multipliers and ..., how come the 6455 and 6678 perform approximately the same (based on TI's reported numbers)?

    Can you provide some context for this comment? If it is specific to the int32 numbers, then it is possible that the hand-assembly code written for the C6455 is the same as for the C66x, since the C66x is backward compatible. For single- and double-precision floating point, however, the numbers should definitely look better when both DSPs and their associated memory run at the same speeds.

    Based on your comment, I checked the numbers reported for DSP_fft32x32 in the C64x DSPLIB and the C66x DSPLIB, and they seem to indicate a 20% improvement.

    C64x DSPLIB: (benchmark image not preserved)

    C66x DSPLIB: (benchmark image not preserved)

    Regards,

    Rahul

  • Hello,

    Thanks for your response.

    It seems that TI's results were measured in the simulator, so they have nothing to do with where in memory the data resides.

    As a side note, we measured cycles in simulator mode using the "clock tool" in CCS under two conditions:

    1- with the clock setting set to CPU.Cycle

    2- with the clock setting set to CPU.Total

    The numbers reported in TI's documents possibly come from the first test (CPU.Cycle).

    Interestingly, the results of the second test matched our hardware measurements (nearly 2 times TI's reported cycles).
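
The gap between the two clock settings can be illustrated with a small worked example. It assumes that CPU.Cycle counts only core execute cycles while CPU.Total also includes stall cycles (memory accesses, cache misses), which is what the observation above suggests but is not confirmed in this thread. The struct, the function names and the cycle counts are hypothetical, chosen only to show how an equal number of stall cycles produces the factor of 2.

```c
/* Split of a measurement into execute and stall cycles (hypothetical model). */
typedef struct {
    unsigned long execute;  /* what a flat-memory simulator would report */
    unsigned long stall;    /* extra cycles spent waiting on memory */
} cycle_breakdown;

/* Cycles seen on hardware, or with the simulator clock set to count everything. */
unsigned long total_cycles(cycle_breakdown c)
{
    return c.execute + c.stall;
}

/* Ratio of total cycles to the flat-memory figure. */
double slowdown_vs_flat(cycle_breakdown c)
{
    return (double)total_cycles(c) / (double)c.execute;
}
```

With a made-up breakdown of 10000 execute cycles and 10000 stall cycles, `total_cycles` gives 20000 and `slowdown_vs_flat` gives 2.0, the same ratio as reported above.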

    As for the comparison between the 6678 and the 6455, we compared them for INT32, as you mentioned. For the 6455, however, we used "sprueb8b.pdf" as the reference, which shows less than 20% improvement. (Honestly, we expected an improvement of over 30% on the 6678.)
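
Since "20% improvement" can be read either as a reduction in cycles or as a speedup factor, the two definitions are sketched below. The function names are hypothetical and the cycle counts in the note after the code are placeholders, not values taken from either DSPLIB document.

```c
/* Percentage of cycles saved when going from old_cycles to new_cycles. */
double cycle_reduction_percent(double old_cycles, double new_cycles)
{
    return 100.0 * (old_cycles - new_cycles) / old_cycles;
}

/* Speedup factor: how many times faster the new figure is. */
double speedup(double old_cycles, double new_cycles)
{
    return old_cycles / new_cycles;
}
```

For placeholder counts of 10000 and 8000 cycles, the reduction is 20% while the speedup is 1.25x, so it is worth stating which of the two a comparison refers to.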