Future Floating Point DSP Speeds from TI and performance question

PaulSchoenke


Hello:

We're currently using the TI TMS320C6713 and TMS320C6748 in designs and are looking for higher performance for our next-generation products. It sounds like the roadmap will not have higher-speed parts until 2014, which will not meet my project schedule, so I'm looking to decrease the cycle time of the algorithm instead. Our existing design runs at 288MHz, and scaling our current timing analysis to the 456MHz clock rate of the higher speed grade part shows that I'll be about 1% over our timing budget.

Since the C66x part won't be available for a while, let me ask a few additional questions. First, a description of what I'm doing. The application in question will be EDMAing pairs of 16-bit unsigned integers from an external FIFO into buffers in IRAM. The data pairs are scaled and ratioed, and then a histogram bin (also in IRAM) is incremented. So for each data pair there are a few multiplies, one addition and a divide, then an array dereference and increment. I compared the performance of our 6713 design (core clocked at 300MHz) and our 6746-based design (core clocked at 288MHz) and scaled both to the 456MHz clock of the higher speed grade 6746. I know the 6713 isn't available in the higher speed grade, but it was what I had on hand, so I used it for my initial benchmark. When I finally got around to running the code on the 6746, I was surprised at the results. The 6713 code was compiled with cl6x V5.1.0 under CCStudio V3.1, and the 6746 code was compiled with cl6x V7.32.6 in CCStudio V5.2. Both were compiled at -O3 (no -g) with the -mv6710 or -mv6740 flag set (for the 6713 and 6746, respectively). Here are the questions:

1. The normalized performance results give me about an 83ns cycle time for the 6713 and a 101ns cycle time for the 6746. This shows the 6713 performing about 18% better than the 6746. This surprised me, as I would have expected the two cores to have at least equivalent performance at the same clock rate, if not better performance for the C6746. Is this what you would expect as well?
2. I did set the compiler flags to output the annotated assembler with optimization info. Interestingly, the output for the main processing loop is significantly different. I would have expected much more similar output from the two compilers. Or, to ask the question another way: would you expect the V7.x compiler to give equivalent or better performance than the V5.x compiler?
3. I did try modifying a number of things, such as decrementing the main for() loop instead of incrementing and changing from 16-bit to 32-bit variables, but with all my tweaking, the above results were the best I could achieve. Is there a single document for the latest version of the compiler that describes coding for performance, or is the Wiki the best place to get this info these days? Most of what I have is a couple of years old at this point.
4. Finally, is the 456MHz speed grade going to be the highest offered for the 6746, or is a faster part coming in the near future?
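
For readers following along, the per-pair processing described above might look roughly like the C sketch below. This is only an illustration under stated assumptions: the function name, the calibration struct and its fields, and the bin count are all hypothetical, since the thread does not show the actual code or constants.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_BINS 256 /* hypothetical histogram size, not given in the thread */

/* Hypothetical calibration constants; the real scaling is not in the thread. */
typedef struct {
    float gain_a;    /* scale for the first value of each pair  */
    float gain_b;    /* scale for the second value of each pair */
    float offset_b;  /* offset added to the scaled second value */
    float bin_scale; /* maps the ratio onto a histogram bin     */
} ratio_cal_t;

/* pairs holds 2 * n_pairs interleaved 16-bit unsigned values. */
void histogram_pairs(const uint16_t *pairs, size_t n_pairs,
                     const ratio_cal_t *cal, uint32_t *hist)
{
    size_t i;
    for (i = 0; i < n_pairs; i++) {
        /* A few multiplies and one addition per pair. */
        float a = (float)pairs[2 * i] * cal->gain_a;
        float b = (float)pairs[2 * i + 1] * cal->gain_b + cal->offset_b;
        float ratio;
        int bin;

        if (b <= 0.0f) {
            continue; /* skip pairs that would divide by zero */
        }
        ratio = a / b; /* the one divide per pair */
        bin = (int)(ratio * cal->bin_scale);
        if (bin < 0) {
            bin = 0;
        }
        if (bin >= NUM_BINS) {
            bin = NUM_BINS - 1; /* clamp out-of-range ratios to the last bin */
        }
        hist[bin]++; /* array dereference and increment */
    }
}
```

The divide and the data-dependent histogram update are the two parts of this loop most likely to limit how well a compiler can pipeline it.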

Thanks much for any additional info you can provide.

Best regards,

Paul

over 12 years ago

0 RandyP over 12 years ago

TI__Guru* 84110 points

Paul,

The C6671/2/4/8 devices are available now. Which C66x-based device are you waiting for?

Your comparison results are reasoned and logical, but it is impossible to tell from the outside what explains them or what other results you could get. The C674x core has instruction enhancements over the C6713 that can improve its effective instruction rate per clock cycle. The C6746 device also has architectural differences, needed to keep up with the higher clock speeds, that can work in the other direction depending on how much infrastructure activity goes on (moving data from one place to another outside of the DSP core).

The Wiki is the best place to look, along with the E2E forums.

I may not be in a position to know what our future plans are for the C6746, but with C66x devices running more than twice that fast with up to 8 DSP cores in a single package, I doubt the C6746 will be released in a higher speed version. Other devices may be released that have similar features, so you should be able to find whatever you need.

Regards,
RandyP

0 Asheesh Bhardwaj over 12 years ago

TI__Expert 4680 points

There are multiple levels of code optimization on TI DSPs, and these are documented in the wikis and articles.

The application note sprabf2.pdf, "Introduction to TMS320C6000 DSP Optimization", details the steps you can use for code optimization. The detailed documents come with the compiler releases, and the application note has links to the compiler optimization guides:

1. TMS320C6000 Programmer’s Guide (SPRU198)

2. TMS320C6000 Optimizing Compiler v 7.4 User's Guide (SPRU187u)

3. TMS320C674x DSP CPU and Instruction Set Reference Guide (SPRUFE8)

The application note guides you through basic loop optimization by loop unrolling, which helps the compiler leverage the parallelism in the architecture. The compiler also provides feedback on how the loop has been optimized and where further optimizations are possible when more information is supplied. For details of the compiler feedback for loop optimizations, refer to chapter 4 of the Programmer's Guide (SPRU198).
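
As a rough illustration of unrolling and of supplying more information to the compiler, here is a minimal sketch of a multiply-accumulate loop unrolled by two. The function name and the `RESTRICT` macro are hypothetical. The `MUST_ITERATE` pragma and the `restrict` qualifier are the kinds of hints the compiler documentation describes, and the pragma is guarded so the code also builds with other compilers. Whether this form actually schedules better on a given loop has to be checked in the compiler feedback.

```c
#include <stddef.h>

/* Hypothetical helper macro: use restrict where the compiler accepts it. */
#if defined(__TI_COMPILER_VERSION__) || \
    (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L)
#define RESTRICT restrict
#else
#define RESTRICT
#endif

/*
 * Sum of a[i] * b[i], unrolled by two with separate accumulators so the
 * two multiply-add chains are independent of each other.
 * Assumption: n is even and at least 2.
 */
float scaled_sum_unrolled(const float *RESTRICT a,
                          const float *RESTRICT b, int n)
{
    float acc0 = 0.0f;
    float acc1 = 0.0f;
    int i;

#ifdef __TI_COMPILER_VERSION__
    /* Promise the compiler that the loop body runs at least once. */
    #pragma MUST_ITERATE(1)
#endif
    for (i = 0; i < n; i += 2) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }
    return acc0 + acc1;
}
```

Note that splitting the sum across two accumulators changes the order of the floating-point additions, so results can differ in the last bits from a straightforward loop.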

On your question about the advantages of moving to the newer compiler: multiple bug fixes and enhancements have been made over time to get better performance from the compiler in different scenarios, which is why it is better to move to the newer compiler for development.

Regards,

Asheesh

0 Andy Polyakov over 12 years ago in reply to Asheesh Bhardwaj

Expert 1340 points

But no manual can replace imagination. Imagine that you have to perform a division. Imagine that it takes only A-registers to perform it. Is it possible to perform two independent divisions in parallel by performing the same operation on A- and B-registers? Yes. On top of that, consider double-precision floating-point division on C674x. Basically it's a reciprocal approximation followed by a bunch of dependent multiplications and additions. In this context "dependent" means that the next operation has to be issued X cycles later to consume the result of the previous operation. But an independent operation can be issued ~2 times earlier. Meaning that if you had two divisions to perform, you'd be able to overlap the calculations and achieve close to 2x improvement [per register bank half]. In other words, the processor has the capacity to perform 4 independent double-precision divisions at a cost slightly exceeding the cost of 1. Would the compiler help? Not if your C code uses / to denote division. Would the compiler allow you to meet the goal if you don't use /? I don't know. But if it would, it would still be a certain gamble. Surely the newer compiler is better, but there is no guarantee that it's better at everything. Assembly implementation is the only way to secure the outcome.
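
To make the structure described above concrete, here is a minimal C sketch of two independent divisions, each computed as a reciprocal seed refined by Newton-Raphson steps, with the two dependency chains interleaved. It is not code from the post. The function name and the `RECIP_SEED` macro are hypothetical. On the TI compiler the seed is assumed to come from the `_rcpdp` intrinsic. Off-target, a single-precision divide stands in for it purely so the sketch can be run on a PC. The results are not guaranteed to be correctly rounded, and zero, infinite, or extreme-magnitude divisors are not handled.

```c
#ifdef __TI_COMPILER_VERSION__
#include <c6x.h>
/* Assumption: _rcpdp returns a low-precision reciprocal estimate. */
#define RECIP_SEED(x) _rcpdp(x)
#else
/* Stand-in seed for off-target builds; not what the DSP would execute. */
#define RECIP_SEED(x) ((double)(1.0f / (float)(x)))
#endif

/*
 * Computes q0 = n0 / d0 and q1 = n1 / d1 without the / operator on
 * doubles. Each Newton-Raphson step x = x * (2 - d * x) roughly doubles
 * the number of correct bits in the reciprocal estimate. The two chains
 * never depend on each other, so their operations can overlap in a
 * pipelined schedule.
 */
void div2_newton(double n0, double d0, double n1, double d1,
                 double *q0, double *q1)
{
    double x0 = RECIP_SEED(d0);
    double x1 = RECIP_SEED(d1);
    int k;

    for (k = 0; k < 3; k++) {
        x0 = x0 * (2.0 - d0 * x0); /* chain 0 */
        x1 = x1 * (2.0 - d1 * x1); /* chain 1, independent of chain 0 */
    }
    *q0 = n0 * x0;
    *q1 = n1 * x1;
}
```

Whether a C compiler turns this into the overlapped schedule the post has in mind is exactly the gamble the post describes, so the generated assembly would still need to be inspected.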
