Linear Assembly / Assembly optimization

Hi All,

              I am working with a TMS320C64x+ processor in Code Composer Studio version 4.2.0.10018. My goal is to optimize a piece of code so that its load drops from roughly 140 MIPS to around 50 MIPS. Using C-level optimization techniques and compiler options I was able to reduce the cycle count, and I am now at around 80 MIPS. I am trying to optimize further with linear assembly, which seems to be the last remaining option. I converted the code to linear assembly (a .sa file), but the MIPS count has not decreased.

 

             The code consists of a loop that calls 3 functions, each of which contains a loop of its own. I have tried most of the obvious ways to reduce the MIPS.

 

            My doubt is: if the code generation tools already perform software pipelining, loop unrolling, and the other optimization techniques on the C code itself, why do we need linear assembly at all? I have seen many materials/documents stating the 3 techniques for optimizing code: 1) optimizing the C code, 2) using compiler options, 3) using linear assembly. But I have not seen any material explaining how to apply these techniques sequentially (using the output of one technique as the input to the next).

 

            So, could anybody help me clear these doubts and reduce the MIPS of the code using linear assembly? If I am not clear, or if you need more information, please let me know and I will try to explain it better.

 

Regards,

Pandiyarajan.P

  • 1) optimizing C code
    This is basically a statement that you should write your C code in a manner which is efficient for the task you are trying to accomplish.  An example would be using a bubble sort vs. a quick sort algorithm to sort an array.  Clearly one is faster than the other by its means of implementation.  You can turn on all the optimization you like for the bubble sort and you will still get a bubble sort.  In effect, the optimizer will not modify your algorithm, rather it will rearrange the statements that make up your algorithm so the processor can execute the algorithm in the most efficient manner possible.
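
    To make the sorting analogy concrete, here is a small sketch (generic C, not from this thread) of the same task written two ways; no optimizer flag will turn the first into the second:

    ```c
    #include <stdlib.h>

    /* O(n^2) bubble sort: the optimizer can schedule it better, but it stays O(n^2) */
    void bubble_sort(int *a, int n)
    {
        for (int i = 0; i < n - 1; i++)
            for (int j = 0; j < n - 1 - i; j++)
                if (a[j] > a[j + 1]) {
                    int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t;
                }
    }

    /* Comparison callback for the C library's O(n log n) qsort() */
    int cmp_int(const void *p, const void *q)
    {
        int x = *(const int *)p, y = *(const int *)q;
        return (x > y) - (x < y);
    }
    ```

    Sorting the same array with qsort(a, n, sizeof a[0], cmp_int) produces the same result from a fundamentally better algorithm; that choice is the programmer's, not the compiler's.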

    2) using compiler options
    This includes enabling the optimizer but also doing things like managing code and data placement in a manner which will utilize cache operations most effectively or simply placing data and code in memory which has the fewest number of access cycles, etc.

    3) linear assembly
    This is in reference to writing your assembly code directly or, as you indicated, using the assembly code generated by the compiler.  In this case, it is expected that you would further hand optimize the code by looking for additional interactions between assembler statements that the optimizer cannot identify.  This can include such things as the need to check for a valid value before using it or short circuiting a generalized operation because you know the values that will be used are limited to a specific domain, etc.
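
    For reference, a rough sketch of what a linear-assembly (.sa) routine looks like on C6000 — the assembly optimizer handles register allocation, instruction scheduling, and software pipelining. The routine name and symbols here are illustrative, and the exact syntax should be checked against the C6000 Programmer's Guide:

    ```
    _dotp:  .cproc  a_ptr, b_ptr, count   ; arguments; registers are tool-assigned
            .reg    a, b, prod, sum       ; symbolic registers, no manual allocation

            zero    sum                   ; sum = 0
    loop:   ldh     *a_ptr++, a           ; load next 16-bit element from a
            ldh     *b_ptr++, b           ; load next 16-bit element from b
            mpy     a, b, prod            ; prod = a * b
            add     prod, sum, sum        ; sum += prod
            sub     count, 1, count       ; decrement trip counter
    [count] b       loop                  ; branch while count != 0
            .return sum
            .endproc
    ```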

    Each of these approaches has its merits.  The first one maintains code in a much more portable form, but it assumes whichever platform you compile it for will have the necessary processing bandwidth for the algorithm to be useful.

    The second one usually includes modifying the source code slightly to coerce the optimizer to make a more effective decision than without the source code modification.  This is good for the given compiler as you can get similar results from one build to the next but can be bad if you need to migrate to another build tool or the build tool you are using gets upgraded in a manner where the optimizer makes decisions in a different way.  This is usually an approach used where you are quite familiar with the optimizer and what types of decisions it generally makes.

    The third option is generally a last resort when you need to get the absolute maximum throughput for your system.  Starting with pre-generated code can be a big advantage but can also hinder you if you don't keep an open mind: it tends to focus you on optimizing in the same direction the compiler did, whereas starting from scratch and taking a different direction may yield even better code.  It also entails getting very familiar with the assembly instruction set, the architecture of the processor, and the environment the processor works within, such as cache type, memory spaces, etc.

    In general, the options in the order given trade ease/speed of implementation against throughput of the resulting code.  Just like any engineering endeavor, there are trade-offs to consider.

    In my experience, increasing the throughput of your already optimized code by 3x is rather unlikely.  Usually speed increases of this magnitude require a new algorithmic approach or some form of reduced precision/accuracy.  These can sometimes be identified if you look at the absolute minimum requirements of your output instead of what you want to achieve.

    Jim Noxon

  • Hi Jim,

            The suggestion/post was useful. I was on a long vacation, so I could not reply earlier.

             On analyzing the code further, I found that some single statements take more cycles than an entire function containing loops. For example, a single statement involving a double variable takes about 300 cycles, but a function with a for loop containing many statements takes just 120 cycles. What could be the reason behind this? Could it be the double-variable manipulation? If so, can it be minimized?

     

    Regards,

    Pandiyarajan.P

  • Pandiyarajan,

    Can you provide some more details, please?
    Which DSP device is this for?
    Which EVM or simulator are you using?
    Which compiler version and CCS version?

    Can you be more specific about "some single statements"? What memory or peripheral is being accessed? Is cache enabled?

    Regards,
    RandyP

  • Hi RandyP

             I am working with a TMS320C64x+ processor in Code Composer Studio version 4.2.0.10018. Here is the information about my environment: Simulator - Texas Instruments Simulator, C64x+ CPU Cycle Accurate Simulator, Little Endian; Code Generation Tools - TI v7.0.3; DSP/BIOS version - 5.41.07.24.

     

            I hope the example below explains what I meant by single statements.

     

    int s32_Val1 = 0;   // an integer variable

    double d_Val2 = 0;  // a double variable

    s32_Val1 = (int)(d_Val2 * 256);  // this is the single statement that takes about 350 cycles

     

            Many such single statements involving double variables take more cycles.

     

    Regards,

    Pandiyarajan.P


  • Pandiyarajan,

    The C64x+ processor is a fixed-point processor, so double precision floating point operations will be slow on this processor.

    I ran this on the simulator (Debug build configuration) and found the d_Val2*256 line ran in 77 cycles. What method did you use to measure 350 cycles?

    Do you have cache enabled? Are you running from internal memory? Both of those would be good.

    Regards,
    RandyP

  • Hi RandyP.

              I used the TSCL register to get the cycle counts. I have not enabled the cache. Thanks for the suggestion about the cache and internal memory.

             As I have been working on this processor for only about a month, I am new to many of its concepts and still in the learning phase. I will look into enabling the cache and running from internal memory. Meanwhile, if you would like me to try something else, I am eager to hear it.


    Regards,

    Pandiyarajan.P

  • With just the Cycle Accurate Simulator, it does not seem to matter if everything is placed in internal or external memory. Here is the test program that I used:

    main.c said:

    #include <c6x.h>

    unsigned long long ullStart, ullEnd, ullCal, ullTime1, ullTime2;

    void main()
    {
        int s32_Val1 = 0;    // an integer variable.
        double d_Val2 = 0; // a double variable
       
        TSCL = 0;
        ullStart = _itoll( TSCH, TSCL );
        ullEnd = _itoll( TSCH, TSCL );
        ullCal = ullEnd - ullStart;
       
        ullStart = _itoll( TSCH, TSCL );
        s32_Val1  = (int) (d_Val2* 256);  // this is the single statement that takes about 350 cycles.
        ullEnd = _itoll( TSCH, TSCL );
        ullTime1 = ullEnd - ullStart - ullCal;

    }

    I get 77 cycles, not 350. And I get this whether I use a BIOS5 tcf file that has cache defined (but not MAR-enabled for DDR) with the program code in IRAM or DDR. And I get 77 if I use a non-BIOS project with everything in L2 or DDR.

    Please try the test program above and see what results you get.

    Regards,
    RandyP

  • Hi RandyP,

                I tried the above code in a standalone project and got the same result: 77 cycles for d_Val2 = 0 and 100 cycles for any other value of d_Val2.

                I then checked my SRC project code and found that enabling the optimization level option -O (set to 3 or another value) caused that statement's cycle count to shoot up to 350 instead of 77/100. Disabling the option brought the statement back down to 77/100, but the overall cycle count of the project increased. Toggling the -O option in the standalone project showed different behavior than in my original SRC code.

               Could you comment on this behavior?


    Regards,

    Pandiyarajan.P

  • Pandiyarajan,

    It is not always easy to benchmark code in a meaningful manner. When I turn on -o3 the cycle count goes down to 1, probably because the optimizer sees that my simple test case is a trivial 0 * X which is always 0.
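
    One generic way to keep the optimizer from folding a benchmarked statement away (a sketch of the standard technique, not code from this thread) is to make the operand volatile, so its value is unknown at compile time:

    ```c
    /* volatile forces a real run-time read of d_val, so the compiler
       cannot reduce d_val * 256 to a compile-time constant, even at -o3 */
    volatile double d_val = 0.0;

    int scale_once(void)
    {
        return (int)(d_val * 256);  /* multiply and float-to-int conversion survive optimization */
    }
    ```

    Timing scale_once() between two TSCL reads then measures the actual multiply and conversion rather than a folded constant.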

    The same floating point Run-Time Support library functions get called with optimization on or off, so the actual time to do those functions will not change. This means that what is getting measured has changed.

    The 77 vs. 100 difference is because the floating point routines are not fixed-cycle: the time depends on the operand values. You may get other measurements for other values, although I got 99 for a non-zero test that I tried with optimization on, which suggests that 77 for 0.0 may be a special case and all other numbers give 100 or so.

    There has been a lot of academic work over the past 30-40 years concerned with how to duplicate the results of floating point operations within a fixed point processing environment. Generally, you have fixed point input data from ADCs and generate fixed point output data to DACs or display screens. That implies that the advantage of floating point in the processing of that data is the dynamic range and/or the resolution. In a large number of applications, this can be handled with scaling of data and coefficients plus the use of 32-bit accumulators and 32-bit intermediate data storage. It takes more work during the design phase but can result in a higher-performing product.
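
    As a minimal sketch of that scaling idea (assuming the value in the earlier d_Val2 example fits a Q16.16 range; the type and helper names are illustrative), the multiply-by-256-and-truncate becomes pure integer arithmetic:

    ```c
    #include <stdint.h>

    /* Q16.16 fixed point: real value = q / 65536.0, so 1.5 is stored as 98304 */
    typedef int32_t q16_16;

    /* Equivalent of (int)(x * 256) for non-negative Q16.16 input:
       multiplying by 256 is << 8 and truncating the fraction is >> 16,
       which combine into a single >> 8 */
    int32_t scale_by_256(q16_16 q)
    {
        return q >> 8;
    }
    ```

    On a fixed-point C64x+ this is a single shift instead of two RTS library calls; the extra design work is choosing a Q format whose range and resolution cover the data.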

    Benchmarking your project's algorithm will give you more meaningful results than benchmarking a single operation or a single line. For example, in this case you are measuring the time it takes to call two different floating point functions from the RTS library: one to do the multiplication and one to convert from that float result into an integer. Your algorithm would be quite inefficient if that combination makes up a majority of the computational time.

    Those are my comments. I have no other guess as to why your SRC numbers do not match your test numbers.

    Regards,
    RandyP

  • Hi RandyP,

                  Thanks for the valuable comments. I will look into duplicating floating-point results in a fixed-point environment.


    Regards,

    Pandiyarajan.P