C6678 OpenCV Code Optimization

Anish Mathew

TI Experts-

We are porting OpenCV to C66x. We have replaced TI memory allocation functions with "smart malloc" and others that
allow us to conform with OpenCV (for example, control over alignment, use local mem or external mem depending on the
situation, etc), and we're using VLIB functions inside OpenCV functions to optimize where possible (with no change in
results).

In addition, we're optimizing key C/C++ code, which is what my questions are about.

I had this original line of code, where pRowSrcf is a float pointer and pRowDstu is a uint8_t pointer (this line is
executed in a 'for' loop):

*pRowDstu++ = (uint8_t)*pRowSrcf++;

When I replaced the original line with the two lines below:

int iVal = (int)*pRowSrcf++;
*pRowDstu++ = iVal;

there was a 10x increase in performance. I have attached the generated .asm code for each case and a screenshot of the replaced asm code. I think I understand what the compiler did here and why. My questions are:

- Can this code be further optimized without resorting to hand asm coding?

- Currently the pointers are aligned to 4 bytes; would 16 bytes help?

- Is there a way to read four (4) floats at one time, convert them into 4 uint8_t values, then store all 4 uint8_t bytes at once?

The reason this level of discussion is important is that copying images, with possible conversions of pixel data (both
numerical representation and pixel format), is fundamental to OpenCV -- such operations are everywhere in OpenCV
codes. I'm hoping to develop optimized methods embedded in a C++ class that I can add to opencv source and include in
the c66x branch on the opencv repository.

Thanks.

-Anish
HPC Engineer, Signalogic

over 10 years ago

0 Anish Mathew over 10 years ago

Intellectual 395 points

Below are the original and replaced asm files. Also I'm using C6678 with 64 KB L2 cache, CGT 7.4.2, and -O2.

1185.originalcode.asm 0537.replacedcode.asm

0 George Mock over 10 years ago in reply to Anish Mathew

TI__Guru**** 253380 points

Thank you for informing us about this problem.

For reasons I cannot explain, the fast loop uses the instruction SPTRUNC to convert from float to int, and the slow loop calls an RTS function to convert from float to unsigned char. Any function call, even to an internal RTS support function like this, prevents a loop from being software pipelined. And software pipelining a loop can cause a big performance boost. The 10X you see is typical.

I filed SDSCM00052060 in the SDOWP system to have this investigated. It is not a typical defect report, but a performance issue. The code generated is not wrong, but slow. You are welcome to track it with the SDOWP link below in my signature.

Thanks and regards,

-George

0 Jeff Brower73 over 10 years ago in reply to George Mock

Genius 3420 points

George-

It's not a problem, we're happy with TI compiler performance, and it's straightforward for us to observe coding procedures that recognize this behavior. I wouldn't even worry about it if I were you.

But, can you please answer our questions? Thanks.

-Jeff
Signalogic

0 George Mock over 10 years ago

TI__Guru**** 253380 points

Anish Mathew said:
-can this code be further optimized without resorting to hand asm coding ?

Consider applying some of the techniques discussed in this wiki article. There may also be some intrinsics worth trying. The full list of available intrinsics is in the section titled Using Intrinsics to Access Assembly Language Statements in the C6000 compiler manual.

Anish Mathew said:
currently the pointers are aligned to 4 bytes, would 16 bytes help ?

An alignment of 8 bytes could help. That could enable use of LDDW instructions. The wiki article referred to above has more detail.
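As a rough illustration of how such an alignment promise can be passed to the compiler, here is a minimal sketch. The function name float_to_u8 is hypothetical. The TI-specific hints (_nassert and #pragma MUST_ITERATE) are guarded by the _TMS320C6X macro so the code also builds on a host compiler. The sketch assumes the caller really does provide 8-byte-aligned buffers, a count that is a nonzero multiple of 8, and inputs in [0, 255]. Whether the compiler then selects LDDW has not been checked here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert n floats (assumed to lie in [0, 255]) to bytes.
 * Assumption: on the TI build, src and dst are 8-byte aligned and n is a
 * nonzero multiple of 8. The hints below are promises to the compiler,
 * not runtime checks. */
void float_to_u8(const float *restrict src, uint8_t *restrict dst, size_t n)
{
    size_t i;

#ifdef _TMS320C6X
    /* TI-only: tell the optimizer both pointers are 8-byte aligned. */
    _nassert(((uintptr_t)src & 7) == 0);
    _nassert(((uintptr_t)dst & 7) == 0);
    /* TI-only: trip count is at least 8 and a multiple of 8. */
    #pragma MUST_ITERATE(8, , 8)
#endif
    for (i = 0; i < n; i++) {
        /* float -> int -> uint8_t, the form discussed in this thread. */
        dst[i] = (uint8_t)(int)src[i];
    }
}
```

The restrict qualifiers tell the compiler that src and dst do not overlap, which it generally needs to know before it can combine loads and stores.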

Anish Mathew said:
is there a way to read four (4) floats at one time, convert into 4 uint8_t values, then store all 4 uint8_t bytes at once?

A float is 32 bits. It is possible to load or store 64 bits at once. So, you could load 2 floats in one instruction. There might be an instruction that converts 2 floats to 2 integers. You can store 4 uint8_t bytes at once. But given how the leading instructions limit you to generating two values, I'm not sure you could stage things to take advantage of it.
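One way to express the "four in, one packed store out" idea in portable C, without intrinsics, is sketched below. The name float_to_u8_x4 is hypothetical. The sketch assumes inputs in [0, 255] and a little-endian target, so the first pixel lands in the lowest byte of the packed word. It only shows the staging; whether the C6000 compiler schedules it any better than the scalar loop would have to be checked in the generated asm.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert floats to bytes four at a time.
 * Assumptions: every input lies in [0, 255]; the target is little-endian. */
void float_to_u8_x4(const float *restrict src, uint8_t *restrict dst, size_t n)
{
    size_t i = 0;

    /* Main loop: convert four floats, pack them into one 32-bit word. */
    for (; i + 4 <= n; i += 4) {
        uint32_t b0 = (uint32_t)(int)src[i];
        uint32_t b1 = (uint32_t)(int)src[i + 1];
        uint32_t b2 = (uint32_t)(int)src[i + 2];
        uint32_t b3 = (uint32_t)(int)src[i + 3];
        uint32_t packed = b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);

        /* A 4-byte memcpy lets the compiler emit a single word store
         * without violating aliasing or alignment rules. */
        memcpy(dst + i, &packed, sizeof packed);
    }

    /* Tail: handle the last 0 to 3 elements one by one. */
    for (; i < n; i++) {
        dst[i] = (uint8_t)(int)src[i];
    }
}
```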

Thanks and regards,

-George

0 George Mock over 10 years ago in reply to George Mock

TI__Guru**** 253380 points

Another optimization idea to consider ... Check out DSPLIB. It is a library of highly optimized DSP routines for C6000 devices. Even if there is no function which does exactly what you need, you can learn some good techniques by reading the source code.

Thanks and regards,

-George

0 Jeff Brower73 over 10 years ago in reply to George Mock

Genius 3420 points

George-

Thanks for your help. Yes, we are studying all available TI docs and applying a variety of recommended techniques. We're using #ifdef _TI66X to apply TI compiler directives, re-organize loops, add the "restrict" keyword (which OpenCV is not using), etc. We've also noticed that places where x86 intrinsics are used typically seem to be good opportunities to use c66x intrinsics.

For certain key functions in OpenCV we have reached "parity": one x86 core ~= one c66x core, so we continue to make progress.

-Jeff

0 CHIP Smarter over 10 years ago in reply to Jeff Brower73

Prodigy 125 points

Hi Jeff,

We are optimizing OpenCV3 on C66x too, so maybe we could talk about it. The following is the performance of the core module:

The entries in red are the ones where the C66x is faster than the i7. CVD_O1 means optimization at the inner-loop level only, typically with VLIW assembly.

My email address is: rex@cvdsp.com.

-Rex Chou

0 Todd Snider over 10 years ago

TI__Intellectual 2435 points

Hi Anish,

The C6x compiler is able to use SPTRUNC in the proposed replacement code because the replacement code does two type conversions.

First "int itmp = (int)*srcf++;" performs a signed 32-bit float to signed 32-bit int conversion which is exactly what SPTRUNC is for.

The statement "*dstu++ = itmp;" performs a second conversion from signed 32-bit int to unsigned 8-bit int.

This is not equivalent to the original statement "*dstu++ = *srcf++;" which performs a conversion from signed 32-bit float into unsigned 32-bit int before truncating the result to an unsigned 8-bit int.

If the range of the incoming floats is [0..255], then you can rewrite the original code as "*dstu++ = (uint8_t)(int)*srcf++;" to get the SPTRUNC loop. If the range of the incoming floats is [0..127], you could use int8_t instead of uint8_t for the type of the destination array. Then your original statement would generate the SPTRUNC loop.
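To make the comparison concrete, here is a small sketch with both forms side by side. The function names convert_direct and convert_via_int are hypothetical. For inputs in [0..255] the two produce the same bytes. Which instruction the TI compiler picks for each is as described in this post and is not something the sketch itself can show.

```c
#include <stddef.h>
#include <stdint.h>

/* Original form: direct float -> uint8_t conversion.
 * (Reported in this thread to compile to an RTS call on C6x.) */
void convert_direct(const float *srcf, uint8_t *dstu, size_t n)
{
    while (n--) {
        *dstu++ = (uint8_t)*srcf++;
    }
}

/* Rewritten form: float -> signed int -> uint8_t.
 * Only equivalent to the form above when inputs lie in [0, 255]. */
void convert_via_int(const float *srcf, uint8_t *dstu, size_t n)
{
    while (n--) {
        *dstu++ = (uint8_t)(int)*srcf++;
    }
}
```

Outside [0..255] the two forms are not interchangeable, so the rewrite is only safe when the input range is known or clamped beforehand.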

The C6x compiler calls the RTS function __c6xabi_fixfu in your original code because the incoming float is signed and the compiler must assume that it could be negative. The __c6xabi_fixfu RTS function will convert negative float values to 0xffffffff, whereas the SPTRUNC instruction performs a float to signed int conversion.

Hope this helps to clarify the compiler behavior a little bit.

Sincerely,

Todd Snider

Compiler Support Team

Texas Instruments Incorporated

0 Jeff Brower73 over 10 years ago in reply to Todd Snider

Genius 3420 points

All-

Our c6678 OpenCV port was very successful. Here is a wiki describing our results:

c66x OpenCV

-Jeff
