Optimizing threshold function on DSP C64x+

GregoireGentil

Other Parts Discussed in Thread: CODECOMPOSER, SYSBIOS

I'm running a relatively simple threshold function for a 320x240 NV12 image on a DSP C64x+. I have read the various optimization recommendations and I have tried to optimize as much as possible but the performance is still disappointing. The code looks like the following:

(I'm inserting it as an image so that it can be read more clearly.)

Is there any other recommendation? Am I missing something obvious?

over 11 years ago

0 Titusrathinaraj Stalin over 11 years ago

TI__Guru** 116100 points

Hi,

Is this your own code, or was it provided by TI?

Are you using any TI libraries for image processing?

Please refer to the following TI wikis on DSP performance optimization.

http://processors.wiki.ti.com/index.php/Optimization_Techniques_for_the_TI_C6000_Compiler

http://focus.ti.com/general/docs/video/Portal.tsp?lang=en&entryid=0_1wypv8jm

0 Sivaraj Kuppuraj over 11 years ago

TI__Mastermind 35645 points

Hi Gregoire,

Thanks for your post.

I am not sure whether you have used the loop buffer (SPLOOP), which improves performance and reduces code size for software-pipelined loops. To learn more, please refer to section 3.4 of the C6000 Optimizing Compiler User's Guide:

http://www.ti.com/lit/ug/spru187v/spru187v.pdf

There are also C6000 optimization workshops and materials that cover C64x+ optimization techniques. Please refer to the wikis below:

http://processors.wiki.ti.com/index.php/Optimization_Techniques_for_the_TI_C6000_Compiler

http://processors.wiki.ti.com/index.php/TMS320C6000_DSP_Optimization_Workshop

http://processors.wiki.ti.com/index.php/Optimized_Sort_Algorithms_For_DSP#Sort_Algorithms

There are also techniques for getting the C64x+ compiler to produce efficient code, including:

Restrict keyword & usage with structure fields

Impact of supplying trip count info

Balancing resources

Optimizing “if” statements

Handling function calls

Understanding optimizer output etc.

For more details, please refer to the wiki presentation below:

https://processors.wiki.ti.com/images/e/e8/C64plus_cgt_overview.pdf

There are also C64x/C64x+ compiler optimization tricks. To learn more, please refer to the wiki presentation below:

http://processors.wiki.ti.com/images/6/6e/C64p_cgt_optimization.pdf

Thanks & regards,

Sivaraj K

-------------------------------------------------------------------------------------------------------
Please click the Verify Answer button on this post if it answers your question.
-------------------------------------------------------------------------------------------------------

0 GregoireGentil over 11 years ago in reply to Sivaraj Kuppuraj

Expert 2330 points

Thanks Titusrathinaraj and Sivaraj for your guidance. I appreciate that.

It's my own code. We have VLIB up and running. We have DSPLIB too, though we haven't used it. I searched those libraries for an API that does what we need, but there was no good match: close, but not close enough.

I have read the various documents, but obviously I haven't implemented all possible optimizations.

I have further optimized the C code so that the inner loop is as simple as possible. I believe I have exhausted the algorithmic and standard C optimizations.

Is there any pragma/trick that could help me at this point? Or is the next stage assembly with SPLOOP? Can I do SPLOOP or something similar in standard C?

Many thanks in advance.

0 RandyP over 11 years ago in reply to GregoireGentil

TI__Guru* 84110 points

Gregoire,

You have been given several articles and workshops to look at for advice on optimization. You have not applied those in your code, so what can we do for you? There is no use of 'restrict' and no use of '#pragma' for optimizing your functions and loops. You must apply these tools to get the best optimization in your code.

Please look at the materials referenced above. There is a lot of reading, but you have been given some good hints of what to apply first. Please let us know how using those tools improves your code performance.

Regards,
RandyP

0 RandyP over 11 years ago in reply to RandyP

TI__Guru* 84110 points

Use int, not short, for individual variables. Int (32-bit) is the native register size in the C64x+ architecture.

0 Andy Polyakov over 11 years ago in reply to GregoireGentil

Expert 1340 points

Consider if (i%w==0) in the loop. % results in a call to a subroutine that calculates the remainder. Any call to any subroutine kills the chances for SPLOOP. Given that i is reset to zero, if (i==w) is sufficient.
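
The following is a minimal sketch of that idea, not the original code (which was posted as an image). The function names and the row-counting task are hypothetical; the only point is that a column counter reset at the end of each row can replace the % test.

```c
#include <assert.h>

/* Hypothetical baseline: '%' with a run-time divisor, which (as noted
   above) ends up as a subroutine call inside the loop. */
int row_starts_mod(int n, int w)
{
    int i, rows = 0;
    assert(w > 0);
    for (i = 0; i < n; i++) {
        if (i % w == 0) {
            rows++;
        }
    }
    return rows;
}

/* Same result using a column counter that is reset at each row end,
   so the loop body contains only compares and adds. */
int row_starts_counter(int n, int w)
{
    int i, col = 0, rows = 0;
    assert(w > 0);
    for (i = 0; i < n; i++) {
        if (col == 0) {
            rows++;
        }
        col++;
        if (col == w) {
            col = 0;
        }
    }
    return rows;
}
```

Both functions return the same count for any n >= 0 and w > 0; only the second avoids the remainder.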

Consider if ((uv[posuv]>=umin) && (uv[posuv+1]>=vmin)) {...}. In C, logical expressions are evaluated "literally" (with short-circuiting), meaning that it's equivalent to if (uv[posuv]>=umin) { if (uv[posuv+1]>=vmin) { ... }}. Note that the second load becomes conditional, i.e. you have to wait for the result of the first load and comparison to decide whether to perform the second load and comparison. Replacing && with & should allow the compiler to perform the loads and comparisons in parallel.
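
As a rough illustration, assuming interleaved U,V bytes and a hypothetical counting task (the function name and the counting are not from the original code), both loads can be made unconditional and the two 0/1 comparison results combined with &:

```c
#include <assert.h>

/* Counts UV pairs whose U and V components both reach their minimum.
   uv holds interleaved U,V bytes; npairs is the number of pairs. */
int count_both_ge(const unsigned char *uv, int npairs, int umin, int vmin)
{
    int i, count = 0;
    for (i = 0; i < npairs; i++) {
        int u = uv[2 * i];      /* both loads are unconditional */
        int v = uv[2 * i + 1];
        /* '&' evaluates both comparisons; each one yields 0 or 1 */
        if ((u >= umin) & (v >= vmin)) {
            count++;
        }
    }
    return count;
}
```

Because each relational operator yields exactly 0 or 1 in C, the bitwise & gives the same truth value as && here; it only drops the short-circuit ordering.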

And as suggested, make your types match the natural register width. The trouble is that a mismatch can cause an extra instruction on the critical path.

0 Sivaraj Kuppuraj over 11 years ago in reply to Andy Polyakov

TI__Mastermind 35645 points

Hi Gregoire,

Thanks for your update.

If you don't have any further questions on C64x+ DSP optimization techniques, please close this thread by doing the action below:

Thanks & regards,
Sivaraj K

0 GregoireGentil over 11 years ago in reply to Sivaraj Kuppuraj

Expert 2330 points

I have made some progress but I'm still not satisfied.

I have read, and keep reading, all the resources, and I am trying to apply similar techniques.

I'm using the following compiler flags: -on2 -o3 -mt -s -al -mw and I have discovered the lst file, which is very handy.

Thanks to Andy, I have seen one improvement: the loop is no longer disqualified from pipelining after removing the %, which was indeed the external call.

The code looks like this now:

I also have another version without if, but the performance is exactly the same.

Nevertheless, the performance is still not great. It takes around 40~50ms while the VLIB Pyramid8 API takes 10ms. I compare against this API because the job seems quite similar.

What more can I do? Here is the lst file output: http://pastebin.com/KVVykDNi

PS: I have searched very hard in the compilation script and I don't see where this "-g" option is coming from. I'm on the "release" profile.

0 Andy Polyakov over 11 years ago in reply to GregoireGentil

Expert 1340 points

Don't use multiply; stick to if. There is no (or shouldn't be any) actual branch in the generated code. Instead, instructions are predicated, which is more efficient.
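
A small sketch of the two styles, assuming the version "without if" multiplied a value by a 0/1 comparison result. The function names and the summing task are hypothetical, not taken from the posted code:

```c
#include <assert.h>

/* "Without if" style: multiply each sample by a 0/1 test result. */
int sum_above_mul(const unsigned char *p, int n, int min)
{
    int i, sum = 0;
    for (i = 0; i < n; i++) {
        int test = (p[i] >= min);
        sum += p[i] * test;
    }
    return sum;
}

/* 'if' style: a simple conditional update, which the compiler is
   expected to turn into a predicated add rather than a branch. */
int sum_above_if(const unsigned char *p, int n, int min)
{
    int i, sum = 0;
    for (i = 0; i < n; i++) {
        if (p[i] >= min) {
            sum += p[i];
        }
    }
    return sum;
}
```

Both give the same sum; whether the second actually becomes predicated code is something to confirm in the lst file.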

There is no SPLOOP generated. Do you use -mv6400+?

0 GregoireGentil over 11 years ago in reply to Andy Polyakov

Expert 2330 points

Thanks Andy. This is very useful again. I think that we are touching the core of my problem.

Initially, there was no "SPLOOP" in the lst file.

I'm on the OMAP4430 DSP Tesla. I don't use CodeComposer because it's a specific C64x+ and TI provides this huge DSP tarball which includes a lot of scripts, compilers and everything.

Usually, I edit this package.bld file which includes:

var testArray = [
{name: 'tesla-dsp', sources: ["baseimage", "../main_module"], config: "track_dsp", copts: "-on2 -o3 -mt -s -al -mw", lopts: "-l /work/ai.private/aiTesla/track/ai/track/vlib/lib/dsp/release/WTSD_TESLAMMSW.alg.vlib.ae64T", buildPlatforms: ["ti.platform.omap4430.dsp"]/* COMMENTED OUT, buildTargets: ["C64P"]*/},
];

If I add -mv64+ to the copts string, I start to see a SPLOOPD in the lst file! I also see "Loop will be splooped". But linking fails with:

fatal error: file "/work/ai.private/aiTesla/OPBU_Ducati_GLP1.6.7/ducati-build/titools/bios_6_34_02_18/packages/ti/sysbios/family/c62/lib/sysbios/debug/ti.sysbios.family.c62.ae64T<IntrinsicsSupport.oe64T>" specifies ISA revision "C6x - Tesla", which is not compatible with ISA revision "(unknown)" specified in a previous file or on the command line

When I integrated the TI VLIB, I had a lot of similar trouble which I solved by getting the VLIB version for Tesla.

I think that my problem boils down to enabling SPLOOP on Tesla. I guess that without it, I will never get good performance. How can I do that?

0 Andy Polyakov over 11 years ago in reply to GregoireGentil

Expert 1340 points

GregoireGentil said:

I think that my problem boils down to enabling SPLOOP on Tesla. I guess that without it, I will never get good performance. How can I do that?

I can't help with this, so please don't expect anything from me. I don't even know if the DSP in question is in fact a C64x+. But in either case it's not like SPLOOP is the Answer to the Ultimate Question of Life, the Universe, and Everything, in the sense that a conventional loop can deliver the same performance under certain circumstances. The original code was effectively using loop fusion, which might be the way to achieve the goal. Indeed, if a single loop fires an iteration every N cycles, so that M loops would take M*N*pixels cycles, a fused loop might do it in (N+M)*pixels...
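
To make the loop-fusion idea concrete, here is a minimal sketch with a hypothetical task (finding the minimum and maximum of a buffer); the function names are illustrative only. Plugging made-up numbers into the estimate above, M = 3 loops at N = 4 cycles per iteration would cost about 12 cycles per pixel unfused versus about 7 fused. These are not measurements.

```c
#include <assert.h>

/* Unfused: two passes, each loading every pixel once. */
void min_then_max(const unsigned char *p, int n, int *lo, int *hi)
{
    int i, mn = 255, mx = 0;
    assert(n > 0);
    for (i = 0; i < n; i++) {
        if (p[i] < mn) { mn = p[i]; }
    }
    for (i = 0; i < n; i++) {
        if (p[i] > mx) { mx = p[i]; }
    }
    *lo = mn;
    *hi = mx;
}

/* Fused: one pass; a single load per pixel feeds both computations. */
void min_and_max_fused(const unsigned char *p, int n, int *lo, int *hi)
{
    int i, mn = 255, mx = 0;
    assert(n > 0);
    for (i = 0; i < n; i++) {
        int v = p[i];
        if (v < mn) { mn = v; }
        if (v > mx) { mx = v; }
    }
    *lo = mn;
    *hi = mx;
}
```

The fused version shares the load and the loop overhead between the two computations, which is where the saving in the estimate comes from.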

0 Jesse Villarreal over 11 years ago in reply to Andy Polyakov

TI__Expert 5655 points

The OMAP4 Tesla DSP doesn't have SPLOOP hardware support, which is why it is removed from the compiler when compiling for this target.

0 GregoireGentil over 11 years ago in reply to Andy Polyakov

Expert 2330 points

OMAP4430 datasheet says: "The device includes the digital signal processor (DSP) subsystem, based on a derivative of the TMS320DMC64x+™ very long instruction word (VLIW) DSP core."

At the top of the LST file, I have:

;******************************************************************************
;* G3 TMS320C6x C/C++ Codegen Unix v7.2.5 *
;* Date/Time created: Thu Jun 5 13:31:19 2014 *
;******************************************************************************
.compiler_opts --abi=eabi --c64p_l1d_workaround=default --endian=little --hll_source=on --long

;******************************************************************************
;* GLOBAL FILE PARAMETERS *
;* *
;* Architecture : C6x - Tesla *
;* Optimization : Enabled at level 3 *
;* Optimizing for : Speed *
;* Based on options: -o3, no -ms *
;* Endian : Little *
;* Interrupt Thrshld : 10 *
;* Data Access Model : Far Aggregate Data *
;* Pipelining : Enabled *
;* Speculate Loads : Disabled *
;* Memory Aliases : Presume not aliases (optimistic) *
;* Debug Info : DWARF Debug *
;* *
;******************************************************************************

I have tried other optimizations, but I keep hitting this invisible 50ms floor that I can't manage to break through. Once again, there are some VLIB APIs that do a similar job much faster (around 10ms) on the same chip. So I'm missing either a flag or a code trick.

0 RandyP over 11 years ago in reply to GregoireGentil

TI__Guru* 84110 points

Gregoire,

The first thing we teach in DSP optimization is to know your performance requirement and to stop optimizing when you reach there. What frame rate do you need to support with this algorithm? Video analytic applications generally can easily run at 5-10 fps, which would be 100-200ms per frame.

Setting your goal to "make it run as fast as some other video routine" is not a valid requirement, but more of an ego goal. There may be significant differences between your algorithm and the ones you are referencing in VLIB that make those routines more readily optimized. In addition, the engineers on that project may have significantly greater experience at DSP programming and knowledge of the features of the Tesla DSP.

Why are you mixing boolean and numerical values? I doubt that the compiler is constrained to only evaluate a boolean to 0 or 1, but maybe I am wrong about that. I would have expected a warning to be generated for this, where you use test1 in math operations.

I am not sure I agree with replacing && with &. I would have tried copying the two pixel values to temporary variables then testing with those variables. But the right solution is whatever gets you to fewer loads and faster performance.

You should use _nassert to tell the compiler about the alignment of the starting address of the arrays. You will find examples in the training materials and the compiler documentation.

You should try the unroll pragma to get to at least a single LDW read for every pair of passes through your loops. If you have LDBs in there, you are wasting a lot of time.
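
A hedged sketch of how these hints might be applied, using a hypothetical threshold_mask function that is not the original code. The 8-byte alignment, the multiple-of-8 trip count, and the unroll factor are assumptions chosen for illustration. The TI-specific parts are guarded so the same file also builds with a host compiler, where they compile to nothing.

```c
#include <assert.h>

#ifdef __TI_COMPILER_VERSION__
/* TI compiler: promise that the pointer is 8-byte aligned (assumption). */
#define ASSUME_ALIGNED8(p) _nassert(((int)(p) & 7) == 0)
#else
/* Host compiler: no-op so the sketch still builds. */
#define ASSUME_ALIGNED8(p) ((void)0)
#endif

/* Writes 255 where src[i] >= min and 0 elsewhere.
   'restrict' promises that src and dst do not overlap. */
void threshold_mask(const unsigned char *restrict src,
                    unsigned char *restrict dst, int n, int min)
{
    int i;
    assert(n >= 8 && n % 8 == 0);   /* matches the trip-count promise below */
    ASSUME_ALIGNED8(src);
    ASSUME_ALIGNED8(dst);
#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(8, , 8)
#pragma UNROLL(4)
#endif
    for (i = 0; i < n; i++) {
        dst[i] = (src[i] >= min) ? 255 : 0;
    }
}
```

On the target, the lst file shows whether the byte loads were merged into wider loads after these hints.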

Did you try the Insert Code button? If you use that, then we can copy from your code inline to make comments. With the inserted image, we cannot copy-and-paste from your code, only your text.

Since there are not any comments in your code, it is very difficult to figure out what you are trying to do in the algorithms. It seems you are parsing through the top 1/4 of a 320x240 image, with the assumption that the input image is UV data only, not the Y component. Some of the other code is equally confusing, since it does not follow a pattern that is recognizable to me - that may just be my ignorance.

Do you have cache enabled, or are you pre-loading your image data into L1D or L2? Where do you get your VLIB performance numbers from, measurements or documentation? If from documentation, make sure you are running in the same conditions in terms of cache and buffer locations.

Regards,
RandyP

0 GregoireGentil over 11 years ago in reply to RandyP

Expert 2330 points

Randy, Thanks.

Good news! I have reached my goal. Based on Jesse's comment, I took a look at imglib, whose source code is available. There are a couple of very similar APIs: threshold and boundary. The key gain comes from using:

_cmpgtu4(_amem4_const(&uv[i]), threshold);

and also unrolling the loop x4.
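
For readers without the imglib source at hand, here is a minimal sketch of the packed-compare idea. It is not the final code from this thread: count_above, LOAD4 and CMPGTU4 are hypothetical names, and the host-side emulation exists only so the sketch can be compiled and checked off-target. It assumes _TMS320C6X is predefined by the TI C6000 compiler and that uv is 4-byte aligned on the DSP. Note that _cmpgtu4 tests strictly greater than, not greater than or equal.

```c
#include <assert.h>
#include <string.h>

#ifdef _TMS320C6X
/* On the DSP: aligned 4-byte load and packed unsigned byte compare. */
#define LOAD4(p)      _amem4_const(p)
#define CMPGTU4(a, b) _cmpgtu4((a), (b))
#else
/* Host-side emulation (illustration only). */
static unsigned load4_host(const unsigned char *p)
{
    unsigned v;
    memcpy(&v, p, 4);
    return v;
}
static int cmpgtu4_host(unsigned a, unsigned b)
{
    int k, m = 0;
    for (k = 0; k < 4; k++) {
        if (((a >> (8 * k)) & 0xFFu) > ((b >> (8 * k)) & 0xFFu)) {
            m |= 1 << k;            /* bit k set when byte k of a > b */
        }
    }
    return m;
}
#define LOAD4(p)      load4_host(p)
#define CMPGTU4(a, b) cmpgtu4_host((a), (b))
#endif

/* Counts bytes strictly greater than thr, four bytes per iteration. */
int count_above(const unsigned char *uv, int n, unsigned char thr)
{
    unsigned thr4 = thr * 0x01010101u;  /* threshold replicated in 4 bytes */
    int i, count = 0;
    assert(n % 4 == 0);
    for (i = 0; i < n; i += 4) {
        int m = CMPGTU4(LOAD4(&uv[i]), thr4);
        count += (m & 1) + ((m >> 1) & 1) + ((m >> 2) & 1) + ((m >> 3) & 1);
    }
    return count;
}
```

Each iteration replaces four byte loads and four compares with one word load and one packed compare, which is the kind of saving described above.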

I didn't know about the "Insert Code" button, and I was using an image for clarity of indentation. I will definitely use it next time.

I'm now at 10ms, which starts to be acceptable. It was not an ego quest ;-). I'm at 50fps, so the initial 50ms per frame was not acceptable.

For the record, I'm parsing an NV12 frame, which is why I have 1/2 in a few places.

To fully recap, the first thing was to add the right set of flags, then analyze the LST file, and then leverage the intrinsics. The next step would be to write everything in assembly. There is another sample in the imglib source code, but I'm not too proficient in assembly.

Thanks to all for your help! CLOSED NOW.

0 Sivaraj Kuppuraj over 11 years ago in reply to GregoireGentil

TI__Mastermind 35645 points

Hi Gregoire,

Please close this thread if your issue is resolved.

Thanks & regards,
Sivaraj K
