Optimize processing speed for C6748 for math operations.

Other Parts Discussed in Thread: SYSBIOS, OMAPL138

Dear TI Experts,

Our image processing device was previously developed on a 500 MHz ARM11. Our image processing library is optimized well enough to do live evaluation on a video stream.

We have decided to develop a similar device on the 300 MHz C6748. A TI expert kindly advised us that, for a heavily math-based library like ours, the C6748 would be a good fit and deliver competitive performance.

We have successfully run our library on the C6748, but unfortunately it is 6 times slower than on the ARM11. I applied some optimization options suggested by TI, such as --opt_level (3), --program_level_compile, --opt_for_speed (5), and --call_assumptions (0). The speed improved a little, but not nearly to our expectation. We know that TI provides IMGLIB, DSPLIB, etc., but after reviewing their APIs we decided to keep our code as it is, since we would not use much of them.

We think there must be something wrong with the way we use the DSP's fixed- and floating-point features. Do we need special settings to enable them, or are they enabled by default in TI's compiler/linker?

One more thing: when I change the ABI (application binary interface) from eabi to coff, the speed improves quite a bit (about 2 to 3 times), but the system becomes unstable with coffabi (algorithm execution suddenly goes wrong, garbage numbers appear, etc.). Does the ABI affect the speed?

We would really appreciate it if you could advise us on optimization approaches to improve our performance.

Thank you very much,

Best regards,

Tuyen Nguyen

  • Greetings,

    Depending on where your program and data (PS and DS) are located, on chip or off chip, and on how caching is used in your system, you will need to apply other optimization techniques besides -o3 to reach maximum performance on this device.

    You may want to download and do a quick review of the TMS320C6000 Optimization Workshop from TI.

    Good Luck,

    Sam

  • Hi,

    Since you are using natural C code, you are relying on the compiler to get the best out-of-the-box performance from the C674x device. The following document gives a quick introduction to optimizing your code for the C674x:

    http://www.ti.com/lit/an/sprabf2/sprabf2.pdf

    Ensure that you use appropriate linker command scripts and cache settings while evaluating the performance on the hardware.
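
    As an illustration of the kind of source-level hint that helps the C6000 compiler pipeline a loop, here is a minimal sketch in plain C. The function name dot_product and the promise that the trip count is a multiple of 4 are assumptions made for this example, not something taken from your library. The MUST_ITERATE pragma is guarded so the same file also builds with a host compiler.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Dot product of two float arrays.
     * 'restrict' promises the compiler that a and b never alias, which lets
     * it schedule loads from both arrays freely. */
    float dot_product(const float *restrict a, const float *restrict b, size_t n)
    {
        float sum = 0.0f;
        size_t i;

        /* The pragma below is a promise to the compiler, not a check, so
         * verify it here in debug builds (assumption: n >= 4, multiple of 4). */
        assert(n >= 4 && n % 4 == 0);

    #ifdef __TI_COMPILER_VERSION__
        /* Trip count is at least 4 and a multiple of 4; max is left blank. */
        #pragma MUST_ITERATE(4, , 4)
    #endif
        for (i = 0; i < n; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
    ```

    Whether hints like these pay off depends on the loop, so it is worth checking the compiler's software-pipelining feedback before and after adding them.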

    Regards,

    Rahul

  • Tuyen Nguyen said:
    when I change the ABI (application binary interface) from eabi to coff, the speed improves quite a bit (about 2 to 3 times), but the system becomes unstable with coffabi (algorithm execution suddenly goes wrong, garbage numbers appear, etc.). Does the ABI affect the speed?

    This doesn't make any sense.  All other things being equal, changing the ABI will have very little impact on performance.  Something else must be changing at the same time, and it only appears that ABI is the cause.  I'm not sure what could make such a difference.  But, whatever it is, it is worth finding.  Because you can probably apply it to the EABI build, and get all that performance back.

    Thanks and regards,

    -George 

  • Greetings,

    The first thing that comes to mind is a difference in the DSP PLL/DDR (etc.) initialization. You may want to inspect the registers in both cases and compare them.

    Good Luck,

    Sam

  • Also the state of L1P and L1D caching (left over from Boot ROM).

  • Thanks a lot,

    Because of the urgency of the project, I wanted quick techniques to optimize the speed without touching the code. But apparently I cannot. So I spent time digging into the code to apply compiler optimizations to functions and loops, and it improved a little.

    ABI change: yes, I found it weird too. All I did was set "output format" to "legacy COFF" and --abi to coffabi; the other settings were kept the same. Suddenly the algorithms ran faster, but with unstable behavior. I read in one of TI's documents that COFF is obsolete but gives higher performance due to some reduced overhead. I am not sure whether that is true. In any case, I cannot use COFF because of the instability.

    Cache: I have been reading documents about TI's cache concept and settings. To be honest, because I am new to TI chips, I don't know exactly how to enable the cache using CCS 5. I am using SYS/BIOS. I tried to configure it through the .cfg file, but there seems to be no change in speed, so I suspect that I am doing it wrong. There is a document that explains cache setup in CCS4 through a .tcf file, but I don't know how to do it in CCS 5 because my CCS 5 version has no tool to generate a .tcf. I am using the default ti.platforms.evmOMAPL138, but when the project is finalized I will have to create my own customized platform, and I don't know exactly which tool I should use for that.

    It would be very kind of you to give me some initial instructions for cache setup. Also, do I need to configure the cache for the .lib project, or only for the application project that uses that .lib?

    Thank you very much for all your support,

    Best regards,

    Tuyen Nguyen

  • Tuyen,

    The unstable nature of the binary with COFF is weird. Can you specify the compiler flags and the compiler version that you are using? If you can extract the piece of code that shows the unstable behavior when compiled in COFF format, we can examine what is causing this issue.

    With regard to cache, you need to configure cache settings only in the application project, not in the lib project. SYSBIOS enables the device cache by default when configuring the platform, but if you wish to turn on the cache explicitly, use the following lines in your .cfg file:

    /* Enable MAR bits for Cache */
    var Cache = xdc.useModule('ti.sysbios.family.c64p.Cache');
    Cache.MAR128_159 = 0xFFFFFFFF;
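
    To see which MAR field and bits cover a given external-memory range, the arithmetic can be sketched in plain C. This is a minimal sketch, assuming that each MAR covers one 16 MB block of the address space and that bit 0 of a field such as MAR128_159 corresponds to the lowest MAR in that group. The helper names are hypothetical, and the addresses in the note below are only examples to check against your device's memory map.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Each MAR controls cacheability of one 16 MB block, so the MAR index
     * is simply the top 8 bits of the 32-bit address. */
    unsigned mar_index(uint32_t addr)
    {
        return (unsigned)(addr >> 24);
    }

    /* The config fields group 32 MARs each (MAR128_159, MAR160_191, ...).
     * Returns the first MAR index of the group that contains addr. */
    unsigned mar_group_base(uint32_t addr)
    {
        return mar_index(addr) & ~31u;
    }

    /* Bitmask, within one group's field, that covers [start, end] inclusive.
     * Assumption: bit 0 of the field maps to the lowest MAR of the group. */
    uint32_t mar_group_mask(uint32_t start, uint32_t end)
    {
        unsigned lo = mar_index(start) & 31u;
        unsigned hi = mar_index(end) & 31u;
        unsigned count;
        uint32_t ones;

        assert(start <= end);
        /* Both ends must fall in the same 32-MAR group. */
        assert(mar_group_base(start) == mar_group_base(end));

        count = hi - lo + 1u;
        /* Avoid an undefined 32-bit shift when the whole group is covered. */
        ones = (count == 32u) ? 0xFFFFFFFFu : ((1u << count) - 1u);
        return ones << lo;
    }
    ```

    With these helpers, the range 0x80000000 to 0x9FFFFFFF maps to MAR 128 through 159 and a mask of 0xFFFFFFFF, which matches the setting above. An address such as 0xC0000000 maps to MAR 192, which falls in a different field, so it is worth confirming where your external memory actually sits.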

    By default OMAPL138 EVM uses 32K L1P program cache and 32K L1D data cache. You can also use Cache_enable() API to enable all levels of cache in sysbios.

    Please refer to section 5.6 and section 6.4 for all SYSBIOS Cache configuration options.

    Regards,

    Rahul

  • Greetings,

    Here is where you set them in CCS5.x SYS/BIOS

    Good Luck,

    Sam