TMS320C6674: Delay time and processing speed for TMS320C6674

Li Chen92

Part Number: TMS320C6674
Other Parts Discussed in Thread: MATHLIB

Hi TI Expert,

I am using the TMS320C6674 to implement my algorithm, which consists mainly of matrix operations. The computational load is about 1 trillion addition operations. The code was converted from MATLAB to C (double-precision floating point). The processing delay, estimated from the instruction cycle count on the TMS320C6674, is 287 ms, which is longer than expected. I have several questions:

1. Could you suggest ways to reduce the processing delay and increase the operation speed?

2. How does the processing speed of the TMS320C6674 differ between fixed-point and floating-point arithmetic? The algorithm currently uses floating point; is it necessary to convert it to fixed point?

3. The algorithm currently runs on only one core. Do the other TMS320C6674 cores start automatically, or do they need to be configured manually? Can you provide a demo or tutorial on enabling multiple cores to improve the operation speed?

4. DDR3 cannot be read or written. How should it be configured? Is there a tutorial or demo?

Thanks and Best regards,

Thomas

over 3 years ago

lding (TI Guru), over 3 years ago

Hi,

For this kind of performance issue:

First, check that L1D and L1P are each enabled as 32KB cache, and run the CPU as fast as possible, e.g. 1.0GHz or 1.25GHz. Then try to place code and data in L2 first; if L2 is not big enough, use MSMC next. Only then consider DDR3. If you have to use DDR3, configure it as cacheable with prefetch enabled, and configure part of L2 as cache.
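On the C66x, cacheability of external memory is controlled per 16 MB window through the MAR registers, so a useful first step is working out which MAR indices cover your DDR3 region. Below is a minimal host-side sketch of that arithmetic. It assumes DDR3 is mapped at 0x80000000 (check your device's memory map), and `mar_index` and `mar_range` are hypothetical helper names, not part of any TI library.

```c
#include <stdint.h>

/* Each MAR register covers one 16 MB window, so the register index
   is the top 8 bits of the 32-bit address. */
static unsigned mar_index(uint32_t addr)
{
    return (unsigned)(addr >> 24);
}

/* Hypothetical helper type: first and last MAR index of a region. */
typedef struct {
    unsigned first;
    unsigned last;
} mar_range_t;

/* MAR indices covering [base, base + size_bytes). size_bytes must be > 0
   and the region must not wrap past the end of the 32-bit address space. */
static mar_range_t mar_range(uint32_t base, uint32_t size_bytes)
{
    mar_range_t r;
    r.first = mar_index(base);
    r.last  = mar_index(base + size_bytes - 1u);
    return r;
}
```

For example, a 512 MB region starting at 0x80000000 spans MAR128 through MAR159. Those are the registers whose cacheable and prefetchable bits would need to be set.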

Then check your compiler options; use at least -O3.
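Optimization flags work best when the source gives the compiler enough information to software-pipeline the inner loops. The sketch below shows one common way to do that for a dot product: `restrict`-qualified pointers plus a trip-count hint that is only seen by the TI compiler. The function name `dot_f32`, the use of single precision, and the trip-count bounds are assumptions for illustration only.

```c
#include <stddef.h>

/* Dot product written so an optimizing compiler can pipeline the loop.
   `restrict` promises the compiler that a and b do not overlap. */
float dot_f32(const float *restrict a, const float *restrict b, size_t n)
{
    float acc = 0.0f;
    size_t i;

#ifdef __TI_COMPILER_VERSION__
    /* TI compiler only. Assumed contract: 4 <= n <= 4096, n a multiple of 4.
       Adjust or remove if your data does not guarantee this. */
    #pragma MUST_ITERATE(4, 4096, 4)
#endif
    for (i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

On other compilers the pragma is skipped and the function behaves as plain C, so the same source can be checked on a PC first.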

"It is convert to C language (floating-point double) through MATLAB"

I don't have experience converting MATLAB code to C, so I can't say how efficient the generated code is. Most people use MATLAB only for simulation and then write the C implementation by hand.

I suggest trying the above first to see how much improvement you get. If that still does not meet your goal, the rest are advanced topics: whether to use 32-bit or 16-bit fixed-point C, or single- or double-precision floating point, and how much quantization error you can tolerate in exchange for execution speed. You can look at the KeyStone architecture training here: https://training.ti.com/keystone-arm-dsp-multicore-device-training-series?context=15819
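To get a feel for the quantization trade-off, you can measure the round-trip error of a candidate fixed-point format on representative data before committing to a port. The following is a minimal sketch for 16-bit Q15 (values in [-1, 1)). The function names are hypothetical and the choice of Q15 is an assumption; your data range may need a different format.

```c
#include <stdint.h>

/* Convert a value in [-1, 1) to Q15 with rounding and saturation. */
int16_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;
    scaled += (scaled >= 0.0f) ? 0.5f : -0.5f;  /* round half away from zero */
    if (scaled > 32767.0f)  return 32767;       /* saturate high */
    if (scaled < -32768.0f) return -32768;      /* saturate low */
    return (int16_t)scaled;                     /* truncation finishes the rounding */
}

/* Convert a Q15 value back to floating point. */
float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}

/* Worst-case absolute round-trip error over n samples. */
float max_q15_error(const float *x, int n)
{
    float worst = 0.0f;
    int i;
    for (i = 0; i < n; i++) {
        float err = x[i] - q15_to_float(float_to_q15(x[i]));
        if (err < 0.0f) err = -err;
        if (err > worst) worst = err;
    }
    return worst;
}
```

For inputs inside the representable range, the error is bounded by half of one Q15 step (1/65536). Whether that is acceptable depends on how errors accumulate through your matrix operations.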

There are C66x intrinsics you can use in place of generic C code to improve performance. Also, if your algorithm relies heavily on math functions and FFT-type operations, consider using the TI-optimized MATHLIB and DSPLIB.

Finally, if you want parallel processing across cores, there is OpenMP.
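As a rough illustration of what OpenMP looks like for matrix code, here is a sketch of a matrix-vector product whose outer loop is split across cores. The function name `matvec_f32` and the row-major layout are assumptions, and the toolchain setup needed to build with OpenMP on the C6674 is not shown.

```c
#include <stddef.h>

/* y = A * x for a row-major matrix with `rows` rows and `cols` columns.
   Each row is independent, so rows can be shared out among the cores. */
void matvec_f32(const float *A, const float *x, float *y, int rows, int cols)
{
    int r;

#ifdef _OPENMP
    /* Only active when the compiler is run with OpenMP support enabled. */
    #pragma omp parallel for
#endif
    for (r = 0; r < rows; r++) {
        float acc = 0.0f;
        int c;
        for (c = 0; c < cols; c++) {
            acc += A[(size_t)r * cols + c] * x[c];
        }
        y[r] = acc;
    }
}
```

Without OpenMP enabled the loop simply runs serially on one core, which makes it easy to compare single-core and multi-core timings from the same source.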

DDR initialization is a separate topic; please open a new thread for it. We need to know whether this is a TI EVM or a custom board, how DDR initialization is done during board manufacturing, and how the DDR timing was calculated. Do you use a GEL file?

Regards, Eric
