OMAPL138 / C674x execution times

jimj2713

Other Parts Discussed in Thread: TMS320F28335

I am looking to port a TMS320F28335 application to the OMAP. As a first step, I ported a small floating-point and pointer-intensive body of code in order to measure the improvement in execution time. My linker command file is just the default created by the hello-world example app, so I know there is much to improve, but I was surprised that the OMAP took over 3 times as many clock cycles as the 28335 to execute an identical piece of code. My 28335 is running at 150 MHz and the OMAP is running at the default clock, which I believe is 300 MHz (not verified yet).

My initial question is general -- where should I start to improve the OMAP performance? I know the OMAP code is running from L2 RAM, which is not the fastest, but from the docs it is unclear how much better L1 will be (and I haven't figured out the tcf syntax). I enabled full optimization, which did not help much. Is it more likely the RAM, or should I investigate cache use or something else?

Also, I am using the CCS debugger clock-cycle counter to evaluate the execution times, which I assume is valid.

Thanks.

over 13 years ago

jimj2713 over 13 years ago


I have some additional information about this, so let me ask the question differently. I found the app note on hand-tuning control loops and applied it to my representative body of code, which improved the C674x core performance a bit (from ~5100 cycles to ~3100 cycles); however, my 28335 runs this same body of code in ~1600 cycles. With the OMAP running at twice the clock, the net performance improvement is minimal.
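To put those cycle counts in wall-clock terms, here is a small sketch. It assumes the OMAP really is running at 300 MHz (not verified in this thread), and the helper name `cycles_to_us` is hypothetical.

```c
/* Convert a cycle count to elapsed microseconds.
 * clock_mhz is the core clock in MHz, so cycles / MHz gives microseconds. */
double cycles_to_us(double cycles, double clock_mhz)
{
    return cycles / clock_mhz;
}

/* Ratio of 28335 time to C674x time; values above 1.0 mean the C674x is faster. */
double speedup(double c28_cycles, double c28_mhz,
               double c674_cycles, double c674_mhz)
{
    return cycles_to_us(c28_cycles, c28_mhz) /
           cycles_to_us(c674_cycles, c674_mhz);
}
```

With the figures above, 3100 cycles at 300 MHz is about 10.3 µs and 1600 cycles at 150 MHz is about 10.7 µs, so the gain is only a few percent.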

This body of code I am using is mostly without loops, and the pipelining documentation suggests that pipelining only occurs on loops -- is that true?

Also, I turned on the -mw flag (generate verbose SW pipeline info), and the .asm output file shows there is almost no parallelism. And without parallelism, the added NOPs (delay slots) are excessive -- almost 700 of the 3100 cycles are NOPs.

This particular body of code is representative of the code base, so it cannot be completely rewritten, so finally my question -- in non-looping, floating-point and pointer-intensive code, is the 28335 actually more efficient than the C674x?

AC2150 over 13 years ago in reply to jimj2713


hi Jim,

jimj2713 said:

This body of code I am using is mostly without loops, and the pipelining documentation suggests that pipelining only occurs on loops -- is that true?

Correct. Pipelining is all about getting the next iteration started as quickly as possible (the term 'ii', iteration interval, is used) and clearing the hurdles to doing so (e.g. dependency bounds, where you tell the compiler that x doesn't depend on y later in the loop).
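As a rough illustration of a dependency bound, compare the two loops below. The function names are hypothetical and the code is not taken from Jim's application; it is only a sketch of the kind of dependency that limits how early the next iteration can start.

```c
#include <stddef.h>

/* Independent iterations: out[i] depends only on in[i]. restrict promises
 * the two arrays do not overlap, so successive iterations can be overlapped. */
void scale(float * restrict out, const float * restrict in, float k, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = k * in[i];
}

/* Loop-carried dependency: each result needs the previous one (prev), so the
 * next iteration cannot start its multiply until this one has finished.
 * This recurrence puts a lower bound on the iteration interval. */
void one_pole(float * restrict y, const float * restrict x, float a, size_t n)
{
    size_t i;
    float prev = 0.0f;
    for (i = 0; i < n; i++) {
        prev = a * prev + x[i];
        y[i] = prev;
    }
}
```

The first loop is the kind the software pipeliner handles well; the second is limited by the recurrence no matter how many functional units are free.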

jimj2713 said:

This particular body of code is representative of the code base, so it cannot be completely rewritten, so finally my question -- in non-looping, floating-point and pointer-intensive code, is the 28335 actually more efficient than the C674x?

If all your code is "control code" then I agree that a different processor may be better. I know you're looking at AM335x; since that's Cortex-A8 (+Neon), it will execute control code very well.

My suggestion is to give the C674x angle another shot, and if you then conclude it's not the right play, evaluate e.g. AM335x (or stick with the 28335).

For C674x optimization, check out:

- http://processors.wiki.ti.com/index.php/C6000_Compiler:_Recommended_Compiler_Options

- http://processors.wiki.ti.com/index.php/Restrict_Type_Qualifier

In particular, since you mentioned your code is pointer-intensive, read up on the restrict qualifier.
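A minimal sketch of what restrict changes, with hypothetical function and parameter names (this is not Jim's code, and the actual benefit depends on the compiler options and the surrounding code):

```c
#include <stddef.h>

/* Without restrict the compiler must assume out may overlap a or b, so it
 * has to keep each store to out[i] ordered against the loads that follow. */
void scale_add(float *out, const float *a, const float *b, float k, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = k * a[i] + b[i];
}

/* With restrict the caller promises that out, a and b never overlap, which
 * leaves the compiler free to reorder and overlap the loads and stores. */
void scale_add_restrict(float * restrict out, const float * restrict a,
                        const float * restrict b, float k, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = k * a[i] + b[i];
}
```

Both versions compute the same result when the arrays do not overlap; the restrict version is only valid if that promise actually holds at every call site.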

Regards, Alan
