Performance considerations, DSP

Other Parts Discussed in Thread: OMAP-L137, TMS320C6747

I have the OMAP-L137 running at 300 MHz. Assuming the clock gives 300,000,000 instructions/sec and the sample rate is 20 kHz, that leaves 15,000 cycles per sample to take the sample and run the analysis. I have added a buffer to hold the samples I can't process within the 50 µs, turned on several compiler optimizations and changed the software in several ways to improve performance, but I still can't get the performance I expected. My program analyses for 10 sec, and the DSP must run the analysis throughout that period. Looking at the program in assembly in CCS 5.3, I count roughly 3,700 steps per analysis by single-stepping. When I run the analysis and measure how long those 3,700 steps take, 20,000 samples take about 1 sec. With a budget of 15,000 steps per sample, 3,700 steps per sample should let the 20,000 samples finish in about 1/4 sec. I have checked the clock and it is at 300 MHz. Where is the missing 75% in performance, or is my calculation wrong?

How is the 3648 MIPS figure found (from tms320c6747.pdf)?

Claus.
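
The budget arithmetic in the question can be written out as a small C sketch. It assumes, as the question does, that one single step costs one CPU cycle; the function names are illustrative only and are not taken from any TI library.

```c
#include <assert.h>

/* CPU cycles available for each sample at a given clock and sample rate. */
static double cycles_per_sample(double cpu_hz, double sample_rate_hz)
{
    assert(sample_rate_hz > 0.0);
    return cpu_hz / sample_rate_hz;
}

/* Seconds needed to process `samples` samples at `cycles_each` cycles apiece.
 * Assumption: every counted step takes exactly one CPU cycle. */
static double processing_seconds(double cpu_hz, double samples, double cycles_each)
{
    assert(cpu_hz > 0.0);
    return samples * cycles_each / cpu_hz;
}
```

With the numbers from the post, `cycles_per_sample(300e6, 20e3)` gives the 15,000-cycle budget, and `processing_seconds(300e6, 20000.0, 3700.0)` gives roughly a quarter of a second. The measured 1 sec therefore suggests that the one-cycle-per-step assumption does not hold.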

  • Hi Claus,

    Thanks for your post.

    I don't quite follow how you measured the 3,700 steps and the 1 sec needed to process 20,000 samples. Please elaborate on your calculation in detail.

    I have a few clarifying questions:

    How are you evaluating the performance of the C6747 in terms of number of cycles? Can you describe the procedure in detail?

    Have you enabled the clock in CCS via Run --> Clock --> Enable, or set up a CPU cycle count via Run --> Clock --> Setup --> Count: CPU Execute Cycles?

    How are you measuring the CPU cycles? Is it per function, by putting breakpoints around a small chunk of code, or is your code size huge?

    Please check the C6747 datasheet, which lists 3648/2736 MIPS/MFLOPS (page 1):

    http://www.ti.com/lit/ds/symlink/tms320c6747.pdf

    If you are looking for the CPU cycles of each assembly instruction, look in the Instruction Set Reference Guide for the device you are interested in: http://www.ti.com/lit/ug/sprufe8b/sprufe8b.pdf

    Please refer to the wiki page below for the compiler user's guide and the assembly language tools guide:

    http://processors.wiki.ti.com/index.php/Before_asking_for_CGT_support

    Also, please refer to the TMS320C6000 DSP optimization document:

    http://www.ti.com/lit/an/sprabf2/sprabf2.pdf

    Thanks & regards,

    Sivaraj K
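
Besides the CCS clock, cycles can also be counted from inside the program. The sketch below assumes the TI C6000 compiler, whose `c6x.h` header declares the `TSCL`/`TSCH` time-stamp counter registers on C64x+/C674x cores. Off-target it falls back to the standard `clock()` so it still builds, in which case the result is clock ticks, not CPU cycles. `analyse_one_sample` is a hypothetical stand-in for the real analysis routine.

```c
#include <assert.h>
#include <stdint.h>

#ifdef _TMS320C6X
#include <c6x.h>                /* TI header: declares TSCL and TSCH */

static void cycle_counter_start(void)
{
    TSCL = 0;                   /* a write to TSCL starts the free-running counter */
}

static uint64_t cycle_counter_read(void)
{
    uint32_t lo = TSCL;         /* read the low half first, then the high half */
    uint32_t hi = TSCH;
    return ((uint64_t)hi << 32) | lo;
}
#else
#include <time.h>               /* host fallback so the sketch builds off-target */

static void cycle_counter_start(void) { }

static uint64_t cycle_counter_read(void)
{
    return (uint64_t)clock();   /* clock ticks, NOT CPU cycles */
}
#endif

/* Hypothetical stand-in for the per-sample analysis being measured. */
static double analyse_one_sample(double x)
{
    return x * 0.5 + 1.0;
}

/* Returns the counter ticks spent in `runs` calls of the analysis. */
static uint64_t measure_analysis(unsigned runs, volatile double *sink)
{
    uint64_t t0, t1;
    unsigned i;

    assert(runs > 0u);
    cycle_counter_start();
    t0 = cycle_counter_read();
    for (i = 0; i < runs; i++)
        *sink = analyse_one_sample((double)i);  /* volatile sink keeps the call alive */
    t1 = cycle_counter_read();
    return t1 - t0;
}
```

Dividing the returned count by `runs` gives an average cost per analysis that includes stalls and multicycle instructions, which a count of single steps does not.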

  • Thanks Sivaraj K.

    My question arose because we can't complete the required analysis in the time available. I am going back to a point where things are visible and measuring the instruction cycles per second, to see what can be expected in instructions/second.

    We have 8 filters of order 11. By single-stepping through the code of one filter and multiplying by 8 and 11, I get the expected number of steps for one analysis: about 3,700 single steps in CCS 5.3, with instructions and wait states all counted. We need to analyse 20,000 samples/second, so that is the reference for the measurement, and it takes about one second to finish 20,000 analyses, measured by toggling a port pin at start and finish.

    20,000 * 3,700 = 74,000,000 single steps/second, or instruction cycles/second.

    This is equal to an instruction clock of 74 MHz!

    Is 74,000,000 instructions/sec what I can expect? (It might be 75,000,000; the counting is not 100% accurate.)

    I know that I can optimise with -g to make it run faster, but this was to get an idea of the performance, in instructions/second, from a point where the instructions per sample are known. Once you optimise with -g you can't see what is going on any more.

    The question is then:

    Is it correct that a 300 MHz clock gives 75,000,000 instructions/second?

    Claus.
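
The measurement above can be turned around to ask how many CPU cycles each counted step costs on average. This is a sketch of that arithmetic only; the function names are illustrative, and the inputs are the figures quoted in the post.

```c
#include <assert.h>

/* Single steps executed per second, from the measured analysis rate. */
static double steps_per_second(double analyses_per_s, double steps_per_analysis)
{
    return analyses_per_s * steps_per_analysis;
}

/* Average CPU cycles consumed by each counted step. */
static double cycles_per_step(double cpu_hz, double steps_per_s)
{
    assert(steps_per_s > 0.0);
    return cpu_hz / steps_per_s;
}
```

For 20,000 analyses/second at 3,700 steps each, `steps_per_second` gives 74,000,000, and `cycles_per_step(300e6, 74e6)` comes out at a little over 4. Read this way, the 300 MHz clock is not lost; each counted step takes about four cycles on average.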

  • Hi Claus,

    Thanks for your explanation.

    By your calculation, it would not be correct for a 300 MHz clock to give only 75,000,000 CPU instruction cycles/second, but I suspect there is a mistake in the count of steps required for one analysis, the 3,700 steps.

    In general, when counting CPU instruction cycles there are performance considerations such as the overhead for all CPU exceptions (9 cycles), exception latency (13 cycles), pipeline effects such as program and data memory stalls, which inherently add CPU cycles, multicycle NOPs, etc. A memory stall lengthens all of the pipeline phases beyond a single clock cycle, so execution takes additional clock cycles to finish.

    Taking the above into account, one analysis will have consumed more CPU cycles than the steps you counted, and I guess that is where the calculation went wrong. In any case, if you need more clarification, I can move this thread to the TI C/C++ Compiler forum on request.

    Thanks & regards,

    Sivaraj K
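
The gap between counted steps and elapsed cycles can be illustrated with a toy model: each step issues for one or more cycles (a multicycle NOP issues for several) and may incur extra stall cycles. The struct, the function and any trace values fed to it are hypothetical, not measurements from the device.

```c
#include <assert.h>
#include <stddef.h>

/* One single step as seen in the debugger (hypothetical model). */
typedef struct {
    unsigned issue_cycles;  /* 1 for a normal step, n for a "NOP n" */
    unsigned stall_cycles;  /* extra cycles from memory stalls, if any */
} step_cost;

/* Total CPU cycles consumed by a trace of `count` steps. */
static unsigned long total_cycles(const step_cost *trace, size_t count)
{
    unsigned long sum = 0;
    size_t i;

    assert(trace != NULL || count == 0);
    for (i = 0; i < count; i++)
        sum += trace[i].issue_cycles + trace[i].stall_cycles;
    return sum;
}
```

For example, a made-up four-step trace `{1,0}, {4,0}, {1,2}, {1,0}` is four steps in the debugger but nine cycles on the clock, so counting steps alone understates the real cost.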

  • Hi Sivaraj K.

    Thanks for your explanation; I am aware of most of that. You may have a point about ASM steps versus DSP cycles: some instructions take more cycles. Eliminating all the stalls and wait states is part of the optimization.

    Let me then put my question another way.

    Is it realistic for a 300 MHz DSP to run this kind of filtering: 8 filters of order 11, plus 1 timer interrupt and an ADC sample interrupt on SPI, with everything finished within the 50 µs? Furthermore, there is some detection on the filter results that takes up what looks like 1 additional filter (8*11).

    The 8 filters use 64-bit floating point and have a total of 196 additions and 192 multiplications.

    The code and arrays are all located in internal RAM.

    Claus.
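
The workload described above can be put into numbers with a short sketch. It counts only the 196 additions and 192 multiplications quoted in the post; the interrupts and the detection step are not included, and the function names are illustrative.

```c
#include <assert.h>

/* Floating-point operations needed per sample. */
static unsigned long flops_per_sample(unsigned long adds, unsigned long muls)
{
    return adds + muls;
}

/* Sustained rate needed, in millions of floating-point operations per second. */
static double required_mflops(unsigned long flops, double sample_rate_hz)
{
    return (double)flops * sample_rate_hz / 1e6;
}

/* CPU cycles available for each floating-point operation. */
static double cycles_per_flop(double cpu_hz, double sample_rate_hz, unsigned long flops)
{
    assert(sample_rate_hz > 0.0 && flops > 0UL);
    return cpu_hz / sample_rate_hz / (double)flops;
}
```

With 388 operations per sample at 20 kHz, the filters alone need a little under 8 MFLOPS sustained, and the 300 MHz clock leaves roughly 38 cycles for each operation. Whether that is enough depends on how many cycles the 64-bit instructions and their memory accesses actually take, which this sketch does not model.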

  • Hi Claus,

    To get further assistance, I am moving your post to the TI C/C++ Compiler forum.

    Regards,

    Shankari

  • Claus Rimestad said:
    How is the 3648 MIPS figure found (from tms320c6747.pdf)?

    Peak performance is 3648 MIPS = 8 instructions per issue * 456 MHz

    That assumes that you can keep all 8 functional units busy every cycle, and does not count memory stalls.

    Peak performance at 300 MHz would be 300 * 8 = 2400 MIPS

    Not every functional unit can perform floating-point operations, so peak floating-point performance is 2736 MFLOPS = 6 instructions per issue * 456 MHz

    Peak float performance at 300 MHz would be 300 * 6 = 1800 MFLOPS
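
The peak figures above all follow from one multiplication, sketched here in C. The function name is illustrative; the unit counts (8 in total, 6 floating-point capable) are the ones given in this reply.

```c
#include <assert.h>

/* Peak millions of instructions per second:
 * clock in MHz times instructions issued per cycle. */
static unsigned peak_millions_per_second(unsigned clock_mhz, unsigned issued_per_cycle)
{
    assert(issued_per_cycle > 0u);
    return clock_mhz * issued_per_cycle;
}
```

Calling it with (456, 8) and (456, 6) reproduces the datasheet's 3648 MIPS and 2736 MFLOPS, and with (300, 8) and (300, 6) the 2400 MIPS and 1800 MFLOPS quoted for a 300 MHz clock. These are upper bounds that assume every unit is busy every cycle with no memory stalls.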

  • Claus Rimestad said:
    The 8 filters use 64-bit floating point and have a total of 196 additions and 192 multiplications.

    The theoretical peak MFLOPS numbers are for single-precision (32-bit) floating-point instructions.  Double-precision (64-bit) instructions take more cycles, and you will not get the theoretical peak MFLOPS for 64-bit floating-point code.  I'm sorry, I don't have a reference for theoretical peak MFLOPS for strictly 64-bit floating-point.  That would be a question to ask on the C6000 forum.

    We on the compiler forum cannot say much about the theoretical performance of your algorithm without seeing the code.  Performance on C6000 depends heavily on software-pipelined loops.  Look at the assembly code generated by the compiler for your key loops.  If the compiler was able to software pipeline them at near the minimum ii, and the kernel looks dense, that's probably about as good as the compiler is going to do without changing your source code.

    So... I don't think this is a compiler issue yet.  We need to find out what the expected and peak MFLOPS for 64-bit floating-point instructions are, and that's more of a C6000 forum question.  If you find your algorithm is not achieving the expected MFLOPS, then the compiler team can look at the source code to determine if the compiler is failing to efficiently exploit the hardware.
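
As an illustration of the kind of key loop meant here, the sketch below is a double-precision multiply-accumulate kernel written to give the compiler a good chance of software pipelining it. It is not the poster's code. The 12-tap length is an assumption (order 11 taken to mean 12 coefficients), `restrict` tells the compiler the arrays do not overlap, and the TI-specific `MUST_ITERATE` pragma is guarded so the code also builds with other compilers.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TAPS 12   /* assumption: order 11 means 12 coefficients */

/* Multiply-accumulate over NUM_TAPS coefficient/history pairs. */
static double dot12(const double *restrict coeff, const double *restrict hist)
{
    double acc = 0.0;
    int i;

    assert(coeff != NULL && hist != NULL);
#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(12, 12, 12)   /* trip count is always exactly 12 */
#endif
    for (i = 0; i < NUM_TAPS; i++)
        acc += coeff[i] * hist[i];
    return acc;
}
```

After building with optimization on, the generated assembly for this loop is the place to check whether it was software pipelined and what iteration interval (ii) was reached.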

  • Thanks, but that isn't helping me. I have found that I can't make the designed filter solution run fast enough, and I need something to support that conclusion: a reference project that looks similar, or something I can scale up to make the calculation.

    How long does a known filter design with a known number of MUL and ADD operations in 64-bit float take if code, filter data and array data are all placed in shared RAM? There must be something I can use to verify my conclusion. I am not that interested in MIPS and MFLOPS; they say something, but getting 64-bit data in and out of memory is another matter.

    Claus.

  • I agree that having something similar to compare it to would be very helpful.  Unfortunately, I am not familiar with a suitable benchmark.  While the compiler team has benchmarks, we don't have any that explore peak float performance for particular classes of algorithms.  I think you need to ask that specific question on the C6000 forum to get the attention of the applications team.  In fact, this whole thread should probably be moved there.