C6657 EVM FFT benchmarks

ReinierC

Expert 2425 points

Good day experts,

I am having trouble achieving the specified FFT benchmarks of the C66x DSPLIB.

My setup is as follows:

- TMDSEVM6657LS EVM, using only core 0, clocked at 1000 MHz

- C66x DSPLIB v3.4.0.0

- CCS v5.5

- Codegen tools v7.6.0 (-o3 optimization)

- No operating system

I am benchmarking the DSP_fft16x32() routine with all code and data in L2 internal RAM, and with L1P and L1D configured as cache. I execute the routine 100000 times in a for-loop, using the MUST_ITERATE pragma to ensure that it is not optimized out, and measure the total duration with a timer.
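For readers who want to reproduce this kind of measurement, here is a minimal sketch of such a timing loop. It is not the code used in this post. The names `kernel_fn`, `average_ticks`, `read_ticks`, `start_ticks` and `count_call` are hypothetical, and the kernel is passed as a function pointer, so on the target it would be a small wrapper that calls DSP_fft16x32() with your own buffers. When built with the TI C6000 compiler the sketch assumes the TSCL time-stamp counter from `c6x.h`; on any other compiler it falls back to `clock()` only so that it still builds.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#ifdef _TMS320C6X
#include <c6x.h>                       /* declares the TSCL time-stamp register */
uint32_t read_ticks(void) { return TSCL; }
void start_ticks(void) { TSCL = 0; }   /* any write starts the free-running counter */
#else
/* Host fallback (assumption): coarse ticks, only to keep the sketch buildable. */
uint32_t read_ticks(void) { return (uint32_t)clock(); }
void start_ticks(void) { }
#endif

/* Hypothetical kernel type: a wrapper around the routine under test. */
typedef void (*kernel_fn)(void *ctx);

/* Stand-in kernel for trying the harness: counts how often it is called. */
void count_call(void *ctx) { ++*(int *)ctx; }

/* Average ticks per call over `iterations` calls, after one warm-up call. */
uint32_t average_ticks(kernel_fn kernel, void *ctx, uint32_t iterations)
{
    uint32_t i, t0, t1, t2, read_cost, elapsed;

    assert(kernel != 0 && iterations > 0);

    start_ticks();
    kernel(ctx);                 /* warm-up: brings code and data into L1 */

    t0 = read_ticks();
    t1 = read_ticks();           /* t1 - t0 is the cost of reading the counter */
    for (i = 0; i < iterations; i++) {
        kernel(ctx);
    }
    t2 = read_ticks();

    read_cost = t1 - t0;
    elapsed = t2 - t1;
    if (elapsed < read_cost) {   /* guard against coarse host timers */
        read_cost = 0;
    }
    return (elapsed - read_cost) / iterations;
}
```

The loop and indirect-call overhead stay inside the measured interval, so the result is an upper bound on the kernel's own cycle count. The warm-up call keeps first-touch cache misses out of the average.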

From the DSPLIB specified benchmarks:

- 128 point FFT takes 813 cycles, i.e. 0.813 us at 1000 MHz CPU clock

- 256 point FFT takes 1469 cycles, i.e. 1.469 us at 1000 MHz CPU clock

However, I measure the following:

- 128 point FFT (100000 iterations) takes 116 ms, i.e. 1.16 us per FFT

- 256 point FFT (100000 iterations) takes 213 ms, i.e. 2.13 us per FFT
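The comparison between these measurements and the DSPLIB figures can be checked with a few lines of C. This is only a sketch of the arithmetic, and the helper names `cycles_per_call` and `slowdown_percent` are made up for illustration.

```c
#include <assert.h>

/* Average CPU cycles per call, from the total loop time in milliseconds. */
double cycles_per_call(double total_ms, long iterations, double cpu_mhz)
{
    assert(iterations > 0);
    /* ms -> us, and us * MHz = cycles */
    return (total_ms * 1000.0 * cpu_mhz) / (double)iterations;
}

/* How much slower the measurement is than the specified cycle count, in percent. */
double slowdown_percent(double measured_cycles, double spec_cycles)
{
    assert(spec_cycles > 0.0);
    return (measured_cycles / spec_cycles - 1.0) * 100.0;
}
```

With the numbers above, `cycles_per_call(116.0, 100000, 1000.0)` gives 1160 cycles against the specified 813, about 43% slower, and `cycles_per_call(213.0, 100000, 1000.0)` gives 2130 cycles against 1469, about 45% slower.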

Now I understand that there are some overheads involved in my setup, which are not present in the DSPLIB supplied benchmarks, but I am more than 40% slower in both instances. I did similar benchmarks for the C64x+ DSPLIB on the C6748 and the measurements were within less than 5% of the benchmarks.

Are there perhaps some setup settings I am missing here? Can someone please provide advice?

Thanks in advance.

over 9 years ago

0 ReinierC over 9 years ago

Expert 2425 points

Does someone have any ideas?

0 Raja over 9 years ago in reply to ReinierC

TI__Guru* 81335 points

Hi,

We are working with an expert to answer this post. Thank you for your patience.

0 ReinierC over 9 years ago in reply to Raja

Expert 2425 points

Thank you Rajasekaran, I am eagerly awaiting the reply from your experts.

0 Asheesh Bhardwaj over 9 years ago in reply to ReinierC

TI__Expert 4680 points

The DSPLIB performance figures are based on the functional simulator with a flat memory model, so there will be a difference in the performance numbers measured on the EVM. Have you measured the DSPLIB cycles by running the DSPLIB kernel as is on the EVM, without your function overheads?

Regards

Asheesh

0 ReinierC over 9 years ago in reply to Asheesh Bhardwaj

Expert 2425 points

Asheesh,

You are restating what I already said in my first post.

Yes, I am aware that the actual performance measurements shall be worse than the simulator cycle measurements due to function overheads, but surely the function overheads cannot decrease the measured performance by more than 40%? Especially since everything is executing from internal L2 SRAM.

As I already mentioned, for the C6748 I did the exact same benchmarking for the DSP_fft16x32() routine using the C64x+ DSPLIB, and my measured results were actually within 2% of the expected results. I repeated this benchmarking on the C6657 EVM using the C66x DSPLIB, and my measured results were more than 40% slower than the expected results. I am trying to determine what the reason(s) might be for this large discrepancy.

Are you able to perform the benchmarking on actual hardware on your side? If so, I can provide you with the source code I used for benchmarking.

0 ReinierC over 9 years ago in reply to ReinierC

Expert 2425 points

I am just checking if there is any feedback yet?

0 Asheesh Bhardwaj over 9 years ago in reply to ReinierC

TI__Expert 4680 points

The performance you have measured with the layered memory on the EVM will differ from the flat memory measurement. It depends on how a particular algorithm accesses the L1 cache while its data is in L2 SRAM. The performance degradation caused by the cache access pattern changes from one algorithm to another. There will also be cases where you see the same performance as with flat memory, which is what you might have observed on the previous DSPs.

For this particular algorithm I can confirm that I am observing similar behavior, but note that the performance degradation of the Radix-2 (128 point) FFT is not the same as that of the Radix-4 (256 point) FFT.

Regards

Asheesh

0 ReinierC over 9 years ago in reply to Asheesh Bhardwaj

Expert 2425 points

Asheesh,

I can understand that there would be differences using a flat memory model vs a layered memory model, but since this benchmarking application is so small and in L2SRAM, everything is probably already loaded into L1 cache before it even starts executing. From my understanding the C66x fixed point core is essentially the C64x+ fixed point core (employed on the C674x) with a few improvements, and the C66x cache is also simply an enhancement of the C64x cache, so I am struggling to understand why this would have such a drastic effect on the measured benchmarks.

What is the point of providing benchmarks if they are apparently not closely achievable on actual hardware?

Is there not something we are missing here? What is the opinion of your other C66x experts?

0 ReinierC over 9 years ago in reply to ReinierC

Expert 2425 points

Asheesh,

I have gone through the source code provided for the C66x DSPLIB (v3.4.0.0) and I have found no hand-optimized Assembler routines for the DSP_fft16x32() function, only a C-implementation which uses intrinsics.

For the C64x+ DSPLIB (v3.1.0.0) there are in fact hand-optimized Assembler routines for DSP_fft16x32().

This might explain why the C64x+ version of this routine actually executes faster than the C66x version on the C6657 platform!?

Are there any hand-optimized Assembler routines available for the C66x, which specifically employs the enhanced C66x architecture and its new instructions?

0 Asheesh Bhardwaj over 9 years ago in reply to ReinierC

TI__Expert 4680 points

The C66x compiler gives the most optimal results with code written in C and intrinsics. The use of intrinsics utilizes all the features of the architecture. Refer to the optimization guide provided with the compiler for guidance on using the intrinsics. It is not necessary to write the code in assembly.

Regards

Asheesh

0 ReinierC over 9 years ago in reply to Asheesh Bhardwaj

Expert 2425 points

I seriously doubt that C code using intrinsics will yield the same performance as pure hand-optimized Assembler. I agree that the optimizing compiler has significantly closed the gap in recent years, but a gap nonetheless still exists.

How would you explain the C64x+ DSP_fft16x32() executing faster than the C66x DSP_fft16x32() on the C6657 EVM?
