Optimize processing speed for C6748 for math operations.

Tuyen Nguyen

Other Parts Discussed in Thread: SYSBIOS, OMAPL138

Dear TI Experts,

Our image processing device was previously built on a 500 MHz ARM11. On that platform our image processing library is optimized well enough to do live evaluation on a video stream.

We have decided to develop a similar device on the 300 MHz C6748. A TI expert kindly advised us that for a heavily math-based library like ours, the C6748 would be a good fit and should give competitive performance.

We have successfully run our library on the C6748, but unfortunately it is about 6 times slower than on the ARM11. I followed some of TI's optimization suggestions, such as --opt_level (3), --program_level_compile, --opt_for_speed (5), and --call_assumptions (0). The speed improved a little, but nowhere near our expectations. We know that TI provides IMGLIB, DSPLIB, etc., but after reviewing their APIs we decided to keep our code as it is, since we would not use much of them.

We think there must be something wrong with the way we use the DSP's fixed- and floating-point features. Do we need any special settings to enable them, or are they already enabled by default in TI's compiler/linker?

One more thing: when I change the ABI (application binary interface) from eabi to coff, the speed improves quite a bit (around 2 to 3 times), but the system becomes unstable with coffabi (algorithm execution suddenly goes wrong, garbage numbers appear, etc.). Does the ABI affect speed?

We would really appreciate any advice on optimization approaches to improve our processing speed.

Thank you very much,

Best regards,

Tuyen Nguyen

over 13 years ago

0 Sam Kuzbary over 13 years ago

Expert 2655 points

Greetings,

Depending on where your PS and DS are located (on chip or off chip) and on how caching is used in your system, you will need to apply optimization techniques beyond -o3 to reach maximum performance on this device.

You may want to download and do a quick review of the TMS320C6000 Optimization Workshop from TI.

Good Luck,

Sam

0 Rahul Prabhu over 13 years ago in reply to Sam Kuzbary

TI Guru 116770 points

Hi,

Since you are using natural C code, you are relying on the compiler to get the best out-of-the-box performance from the C674x device. A quick introduction to optimizing your code for the C674x is given in the following document:

http://www.ti.com/lit/an/sprabf2/sprabf2.pdf

Ensure that you use appropriate linker command scripts and cache settings while evaluating the performance on the hardware.

Regards,

Rahul

0 George Mock over 13 years ago

TI Guru 251970 points

Tuyen Nguyen said:
when I change the ABI (application binary interface) from eabi to coff, the speed improves quite a bit (around 2 to 3 times), but the system becomes unstable with coffabi (algorithm execution suddenly goes wrong, garbage numbers appear, etc.). Does the ABI affect speed?

This doesn't make any sense. All other things being equal, changing the ABI will have very little impact on performance. Something else must be changing at the same time, and it only appears that the ABI is the cause. I'm not sure what could make such a difference. But whatever it is, it is worth finding, because you can probably apply it to the EABI build and get all that performance back.
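One way to narrow this down is to time the same routine in both builds and see which functions actually change. The following is only a minimal sketch in portable C: `average_ticks`, `kernel_fn`, and `sample_kernel` are hypothetical names for illustration, and `clock()` is used simply because it is standard C. On the target a finer-grained cycle counter may be preferable.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical signature for the routine under test. */
typedef void (*kernel_fn)(void *ctx);

/* Placeholder kernel for illustration: counts how often it was called.
 * Replace it with a wrapper around the real image-processing routine. */
void sample_kernel(void *ctx)
{
    int *calls = (int *)ctx;
    (*calls)++;
}

/* Run the kernel 'runs' times and return the average number of
 * clock() ticks per run. Averaging hides one-off start-up costs. */
double average_ticks(kernel_fn fn, void *ctx, int runs)
{
    clock_t start, stop;
    int i;

    assert(fn != 0 && runs > 0);

    start = clock();
    for (i = 0; i < runs; i++) {
        fn(ctx);
    }
    stop = clock();

    return (double)(stop - start) / (double)runs;
}
```

Running this with identical inputs in the EABI and COFF builds, one function at a time, should show whether the speed-up is spread evenly or concentrated in a few routines.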

Thanks and regards,

-George

0 Sam Kuzbary over 13 years ago in reply to George Mock

Expert 2655 points

Greetings,

The first thing that comes to mind is a difference in the DSP PLL/DDR etc. initialization. You may want to inspect the registers in both cases and compare.

Good Luck,

Sam

0 Sam Kuzbary over 13 years ago in reply to Sam Kuzbary

Expert 2655 points

Also compare the state of L1P and L1D caching (left over from the Boot ROM).
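To make that comparison concrete, here is a small sketch that turns an L1 cache configuration register value into a size. The register addresses and the mode encoding are written from my recollection of the C674x megamodule reference guide and should be treated as assumptions to verify there. `l1_cache_kbytes` is a hypothetical helper, not a TI API.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed register addresses (verify against the megamodule guide). */
#define L1PCFG_ADDR 0x01840020u  /* L1P cache configuration */
#define L1DCFG_ADDR 0x01840040u  /* L1D cache configuration */

/* Decode the low 3 bits (mode field) of an L1PCFG/L1DCFG value.
 * Assumed encoding: 0 = cache off, 1 = 4K, 2 = 8K, 3 = 16K,
 * 4 through 7 = 32K. Returns the cache size in KB. */
unsigned l1_cache_kbytes(uint32_t cfg)
{
    unsigned mode = cfg & 0x7u;

    if (mode == 0u) {
        return 0u;
    }
    if (mode >= 4u) {
        return 32u;
    }
    return 2u << mode;  /* 1 -> 4, 2 -> 8, 3 -> 16 */
}
```

On the target, the argument would be the register contents, for example `*(volatile uint32_t *)L1PCFG_ADDR`, read at the same point in both builds. A result of 0 in one build and 32 in the other would go a long way toward explaining a large speed difference.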

0 Tuyen Nguyen over 13 years ago in reply to Rahul Prabhu

Prodigy 180 points

Thanks a lot,

Because of the urgency of the project, I wanted quick techniques to optimize the speed without touching the code. But apparently I cannot. So I spent time digging into the code to apply compiler optimizations to functions and loops, and it improved a little.

ABI change: yes, I found it weird too. All I did was set "output format" to "legacy COFF" and --abi to coffabi; all other settings were kept the same. Suddenly the algorithms ran faster, but with unstable behavior. I read in one of TI's documents that COFF is obsolete but gives higher performance due to some reduced overhead. I am not sure whether that is true. In any case, I cannot use COFF because of the instability.

Cache: I have been reading documents about TI's cache concepts and settings. To be honest, because I am new to TI chips, I don't know exactly how to enable the cache using CCS 5. I am using SYS/BIOS. I tried to configure it through the .cfg file, but there seems to be no change in speed, so I suspect I am doing it wrong. There is a document that explains cache settings in CCS4 through a .tcf file, but I don't know how to do this in CCS 5 because my CCS 5 version has no tool to generate a .tcf. I am using the default ti.platforms.evmOMAPL138 platform, but when the project is finalized I will have to create my own custom platform, and I don't know exactly which tool I should use for that.

It would be very kind of you to give me some initial instructions for the cache. Also, do I need to configure the cache for the .lib project, or only for the application project that uses that .lib?

Thank you very much for all your support,
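For readers wondering what loop-level hints of this kind can look like, here is a minimal sketch. It is not code from this project: `scale_add` and its parameters are hypothetical. It assumes the buffers never overlap and that the caller can guarantee the trip-count promise. The `MUST_ITERATE` pragma is TI-specific, so it is guarded and other compilers ignore it.

```c
#include <assert.h>

/* Hypothetical kernel: out[i] = a[i] * gain + b[i].
 * 'restrict' promises the compiler that the buffers do not overlap,
 * which gives it more freedom to reorder loads and stores. */
void scale_add(float *restrict out,
               const float *restrict a,
               const float *restrict b,
               float gain, int n)
{
    int i;

    /* The pragma below promises n >= 8 and n % 4 == 0,
     * so check that promise in debug builds. */
    assert(n >= 8 && n % 4 == 0);

#ifdef __TI_COMPILER_VERSION__
    /* TI compiler hint: at least 8 iterations, a multiple of 4. */
    #pragma MUST_ITERATE(8, , 4)
#endif
    for (i = 0; i < n; i++) {
        out[i] = a[i] * gain + b[i];
    }
}
```

Hints like these only help if they are true for every call. A wrong `restrict` or trip-count promise can produce incorrect results rather than a compile error.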

Best regards,

Tuyen Nguyen

0 Rahul Prabhu over 13 years ago in reply to Tuyen Nguyen

TI Guru 116770 points

Tuyen,

The instability of the binary with COFF is weird. Can you list the compiler flags you are using and the compiler version? If you can extract the piece of code that shows the unstable behavior when compiled in COFF format, we can examine what is causing the issue.

With regard to cache, you need to apply cache settings only in the application project, not in the lib project. SYSBIOS enables the device cache by default when configuring the platform, but if you wish to turn on the cache explicitly, use the following lines in your .cfg file:

/* Enable MAR bits for Cache */
var Cache = xdc.useModule('ti.sysbios.family.c64p.Cache');
Cache.MAR128_159 = 0xFFFFFFFF;

By default, the OMAPL138 EVM uses 32K of L1P program cache and 32K of L1D data cache. You can also use the Cache_enable() API to enable all levels of cache in SYSBIOS.

Please refer to section 5.6 and section 6.4 for all SYSBIOS Cache configuration options.
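As a side note on what a setting like MAR128_159 covers, the address arithmetic can be sketched in C. This assumes one MAR register per 16 MB of address space, which is my understanding of the C64x+/C674x cache controller. `mar_index` and `mar_base` are hypothetical helper names, not SYSBIOS functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: each MAR register controls the cacheability of one
 * 16 MB region, so the MAR index is the top 8 bits of the address. */
unsigned mar_index(uint32_t addr)
{
    return (unsigned)(addr >> 24);
}

/* First address of the 16 MB region controlled by MAR 'n'. */
uint32_t mar_base(unsigned n)
{
    assert(n < 256u);
    return (uint32_t)n << 24;
}
```

With this arithmetic, MAR128 through MAR159 cover 0x80000000 through 0x9FFFFFFF. It is worth checking that the memory holding your code and data actually falls in that range. If your external DDR is mapped elsewhere (on OMAP-L138 I believe it starts at 0xC0000000, which would correspond to MAR192 onward, but please confirm against the datasheet), a different MAR range would be needed for the cache to have any effect on it.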

Regards,

Rahul

0 Sam Kuzbary over 13 years ago in reply to Rahul Prabhu

Expert 2655 points

Greetings,

Here is where you set them in CCS5.x SYS/BIOS

Good Luck,

Sam
