RTOS/TMS320C6678: C language -O3 optimization is better than the assembly optimization?

Frank Zach

Prodigy 200 points

Part Number: TMS320C6678

Tool/software: TI-RTOS

Hello everyone,

There is something I don't understand.

The computation is A*B + C*D (A, B, C, and D are floating point).

When I compile the C code with -O3 optimization, its execution time is shorter than that of my hand-optimized assembly.

In principle, hand-optimized assembly should run faster than C.

I don't know whether my assembly is simply badly written, or whether C with -O3 optimization now outperforms hand-written assembly.

Please help me clear up this confusion.

My code is attached:

C language :

assembly language:

Best Regards.

Thank you.

over 8 years ago

0 Sasha Slijepcevic over 8 years ago

TI__Mastermind 37985 points

Frank,
this question is better suited for the compiler forum (e2e.ti.com/.../). I think you should be able to move it there, but if you can't, please let me know and I'll get someone to do it.

0 Robert Tivy over 8 years ago

TI__Mastermind 18260 points

This loop is about as quick as you can do without using the SPLOOP mechanism (and maybe as quick):

.global sum

.asmfunc

sum:
        LDDW    *A4++[1], A9:A8         ; preloop load since LDDW in epilogue
||      LDDW    *B4++[1], B9:B8         ; preloop load since LDDW in epilogue
        LDDW    *A6++[1], A1:A0         ; preloop load since LDDW in epilogue
        MVK     2, A3
        MV      A8, B1
        NOP     2

loop:
        DMPYSP A9:A8, B9:B8, B9:B8
||      DMPYSP A9:A8, A1:A0, A1:A0
        SUB     B1, A3, B1              ; do useful stuff in delay slots
  [ B1] B       loop
||[!B1] B       B3
||[ B1] LDDW    *A4++[1], A9:A8         ; do useful stuff in B delay slots
||[ B1] LDDW    *B4++[1], B9:B8         ; start LDDWs early
  [ B1] LDDW    *A6++[1], A1:A0         ; A1:A0 lands *after* last B delay slot
        DADDSP B9:B8, A1:A0, A1:A0
        NOP     2
        STDW    A1:A0, *B6++[1]         ; LDDW A1:A0 still in delay slots

.endasmfunc

This code uses "multiple assignment" to A1:A0 at the end of the loop, which is safe to do here since the "B loop" and its 5 delay slots can't be interrupted.

The C compiler might be using SPLOOP to make this quicker. You might want to ask the compiler team as well (as suggested by Sasha).

FYI, in your asm code you forgot the NOP 5 after the B B3. Also, your code (and this code) relies on the count in A8 to be an even number, and won't ever terminate if it's odd.
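
To illustrate the even-count caveat: the loop counter steps down by 2 and the loop exits only when it reaches exactly zero, so an odd count (5, 3, 1, -1, ...) skips zero and never exits. Below is a minimal C sketch of one way to handle that, assuming a hypothetical function name and argument order (sum_pairs is not the signature of the assembly routine): process elements in pairs, then handle at most one leftover element.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C model of the pairwise loop: out[i] = a[i]*b[i] + a[i]*c[i].
 * The name and argument order are illustrative only. */
void sum_pairs(float *out, const float *a, const float *b, const float *c,
               size_t count)
{
    size_t i = 0;

    assert(count == 0 || (out && a && b && c));

    /* Main loop: two elements per pass, like one LDDW/DMPYSP/DADDSP/STDW round.
     * The test "i + 1 < count" stops before a partial pair instead of waiting
     * for a counter to hit exactly zero. */
    for (; i + 1 < count; i += 2) {
        out[i]     = a[i]     * b[i]     + a[i]     * c[i];
        out[i + 1] = a[i + 1] * b[i + 1] + a[i + 1] * c[i + 1];
    }

    /* Tail: at most one leftover element when count is odd. */
    if (i < count) {
        out[i] = a[i] * b[i] + a[i] * c[i];
    }
}
```

In assembly the same idea would mean either rounding the count up to an even number before the loop (and padding the arrays), or testing for "less than or equal to zero" rather than "equal to zero".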

Regards,

- Rob

0 Frank Zach over 8 years ago in reply to Sasha Slijepcevic

Prodigy 200 points

hi,
Sasha Slijepcevic.
thank you for your answer.

Do you know of an article that introduces the forums, so that I can post in the right one next time?

thanks again.
best regards.
Frank Zach.

0 Frank Zach over 8 years ago in reply to Robert Tivy

Prodigy 200 points

HI,
Robert Tivy.
Thank you very much for your answer. I'm sorry to be back so late.

I have only just learned assembly language, so I make many mistakes. Your answer is very helpful to me.

I ran your code: it takes 77502 clock cycles, while mine takes 133287 clock cycles. Your code is really fast.

But the following C code takes 28582 clock cycles.

#pragma DATA_ALIGN(F_E, 8); // align to 8 bytes to meet the precondition of sum1()
float F_E[LENGTH];

#pragma DATA_ALIGN(F_F, 8); // align to 8 bytes to meet the precondition of sum1()
float F_F[LENGTH];

#pragma DATA_ALIGN(F_A1, 8);
float F_A1[LENGTH];

#pragma DATA_ALIGN(F_B1, 8);
float F_B1[LENGTH];

#pragma DATA_ALIGN(F_C1, 8);
float F_C1[LENGTH];

#pragma FUNC_CANNOT_INLINE(sum1);
void sum1(float* E, const float* A1, const float* B1, const float* C1)
{
    int i;

    // restrict tells the compiler that the four arrays do not overlap
    float* const restrict ARRAY_E = E;
    const float* const restrict ARRAY_A1 = A1;
    const float* const restrict ARRAY_B1 = B1;
    const float* const restrict ARRAY_C1 = C1;

    _nassert((int)ARRAY_E % 8 == 0);  // precondition, not enforced by the compiler
    _nassert((int)ARRAY_A1 % 8 == 0); // ...
    _nassert((int)ARRAY_B1 % 8 == 0);
    _nassert((int)ARRAY_C1 % 8 == 0);

    #pragma MUST_ITERATE(10000, 10000); // not required: LENGTH is constant, so the compiler can work out the trip count itself
    for (i = 0; i < LENGTH; i++) // LENGTH is 10000
    {
        ARRAY_E[i] = ARRAY_A1[i]*ARRAY_B1[i] + ARRAY_A1[i]*ARRAY_C1[i];
    }
}

Is assembly not as good as C? Has assembly language become obsolete? Or does assembly language still have advantages in some respects?
I just don't know. If you know, please tell me.

thanks again.
best regards.
Frank Zach.

0 Frank Zach over 8 years ago in reply to Robert Tivy

Prodigy 200 points

Also, do any official TI documents or related articles explain the trade-offs between C and assembly language? If you know of any, please tell me.

thanks.

Frank Zach.

0 Robert Tivy over 8 years ago in reply to Frank Zach

TI__Mastermind 18260 points

Frank Zach said:
I ran your code: it takes 77502 clock cycles, while mine takes 133287 clock cycles. Your code is really fast.

I'm a little surprised by these numbers. Your code has a 22-cycle loop, so I would expect 10000 loops to take at least 220000 cycles. My code has an 8-cycle loop, which I would expect to take at least 80000 cycles for 10000 loops.
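
The estimate above is just passes times cycles per pass. Here is a small C sketch of that arithmetic, with hypothetical helper names, assuming (as the estimate does) 10000 passes and ignoring setup code, stalls and cache misses:

```c
#include <assert.h>

/* Lower bound on total cycles: every pass through the loop body costs at
 * least cycles_per_pass. Setup code and memory stalls can only add to it. */
unsigned long min_loop_cycles(unsigned long passes,
                              unsigned long cycles_per_pass)
{
    return passes * cycles_per_pass;
}

/* Returns 1 when a measured cycle count is below the lower bound, which
 * suggests the measurement (or the assumed pass count) needs a recheck. */
int looks_too_small(unsigned long measured, unsigned long passes,
                    unsigned long cycles_per_pass)
{
    assert(passes > 0 && cycles_per_pass > 0);
    return measured < min_loop_cycles(passes, cycles_per_pass);
}
```

With the figures in this thread, min_loop_cycles(10000, 22) gives 220000 against a reported 133287, and min_loop_cycles(10000, 8) gives 80000 against a reported 77502, so looks_too_small() flags both.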

Frank Zach said:
But the following C code takes 28582 clock cycles.

That's very fast for 10000 loops. Can you recheck your measurements? Given that your numbers appear too small above, I also wonder about this one.
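
For rechecking, here is a minimal timing sketch. It assumes the TI compiler's c6x.h header and its TSCL time-stamp counter register when built for C6000, and falls back to clock() elsewhere so the same file also builds on a PC. The helper names (ticks_init, ticks_now, time_kernel, sample_kernel) are hypothetical, and the code under test has to be wrapped in a void(void) function.

```c
#include <assert.h>
#include <limits.h>
#include <time.h>

#ifdef _TMS320C6X
#include <c6x.h>   /* TI header; declares the TSCL control register */
#endif

typedef void (*kernel_fn)(void);   /* code under test, wrapped as void(void) */

/* Start the free-running counter. On C66x a write to TSCL starts it. */
void ticks_init(void)
{
#ifdef _TMS320C6X
    TSCL = 0;
#endif
}

/* Current tick count: CPU cycles on C66x, clock() ticks elsewhere. */
unsigned int ticks_now(void)
{
#ifdef _TMS320C6X
    return TSCL;
#else
    return (unsigned int)clock();
#endif
}

/* Run fn `repeats` times and return the smallest tick count seen for one
 * run, with the cost of the two counter reads subtracted. */
unsigned int time_kernel(kernel_fn fn, int repeats)
{
    unsigned int t0, t1, overhead, elapsed, best = UINT_MAX;
    int r;

    assert(fn != 0 && repeats > 0);

    t0 = ticks_now();
    t1 = ticks_now();
    overhead = t1 - t0;

    for (r = 0; r < repeats; r++) {
        t0 = ticks_now();
        fn();
        t1 = ticks_now();
        elapsed = t1 - t0;          /* unsigned wrap-around is harmless here */
        elapsed = (elapsed > overhead) ? elapsed - overhead : 0;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Placeholder kernel; replace its body with a call to the routine to time. */
volatile unsigned int g_kernel_calls = 0;
void sample_kernel(void)
{
    g_kernel_calls++;
}
```

Call ticks_init() once at startup, then time_kernel(sample_kernel, 5). Taking the minimum over several runs is a design choice: comparing the first run with later runs can show whether cache warm-up is affecting a figure.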

I compiled your C code with -O3 -k (to keep .asm after compile) and looked at the resulting .asm file, and by my visual inspection I would expect that .asm code to take much longer than your original assembly loop. It contained calls to a _c6xabi_mpyf function. You must be using other compiler options to achieve such a fast loop. My compile didn't have any HW SPLOOP in the resulting assembly since there was the function call in the loop, so any compile option that would eliminate that function call would allow it to be SPLOOPed and therefore run much faster.

What exactly are your compile options?

What version of the compiler are you using?

Frank Zach said:

Is assembly not as good as C? Has assembly language become obsolete? Or does assembly language still have advantages in some respects?
I just don't know. If you know, please tell me.

It's not a matter of assembler vs C speed, since the C compiler just produces assembly language that could otherwise have been written by hand. It's more that the C compiler is producing code that is near-optimal. The only reason to go to assembly language is to write faster code, and if the C compiler does as good a job as an expert assembly language author (and a better job than a non-expert), it's a good idea to stick with C.

My improvements to your asm code just made your loop faster. If the loop were reconstructed to do more multiplies inline and reduce the number of loops then I could probably achieve 2x-4x speed improvement, but it would take some time to write, be more complicated and less maintainable. The C66 assembly language has amazing potential for multiple operations per cycle (up to 8 parallel instructions in best case).
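
As a rough picture of "more multiplies inline and fewer loops", here is the same computation unrolled by four in C. This is only a sketch under stated assumptions: the function name is hypothetical, the length is assumed to be a multiple of 4, and the arrays are assumed not to overlap. It is not the reworked assembly itself.

```c
#include <assert.h>
#include <stddef.h>

/* e[i] = a[i]*b[i] + a[i]*c[i], four elements per pass.
 * Assumes n is a multiple of 4 and that the arrays do not overlap. */
void sum_unrolled4(float *restrict e, const float *restrict a,
                   const float *restrict b, const float *restrict c,
                   size_t n)
{
    size_t i;

    assert(n % 4 == 0);

    for (i = 0; i < n; i += 4) {
        /* Load everything first so the multiplies below are independent
         * of each other and can be scheduled side by side. */
        float a0 = a[i], a1 = a[i + 1], a2 = a[i + 2], a3 = a[i + 3];
        float b0 = b[i], b1 = b[i + 1], b2 = b[i + 2], b3 = b[i + 3];
        float c0 = c[i], c1 = c[i + 1], c2 = c[i + 2], c3 = c[i + 3];

        e[i]     = a0 * b0 + a0 * c0;
        e[i + 1] = a1 * b1 + a1 * c1;
        e[i + 2] = a2 * b2 + a2 * c2;
        e[i + 3] = a3 * b3 + a3 * c3;
    }
}
```

Each pass now pays the branch and counter overhead once for four results instead of once per pair, at the cost of more registers and more code.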

Regards,

- Rob

0 Frank Zach over 8 years ago in reply to Robert Tivy

Prodigy 200 points

hi,

-Rob.

Thank you for your answer.

My compiler version is TI v8.1.2.

I believe my compile settings are correct. If you find a mistake in them, please tell me.

My project is attached.

It is compiled with "-O3, --opt_for_speed=5, --optimize_with_debug=on, --speculative_loads=auto, --debug_software_pipeline"

1616.ti_asm_test.zip

thanks again.

best regards.

Frank Zach

0 Robert Tivy over 8 years ago in reply to Frank Zach

TI__Mastermind 18260 points

OK, thanks for the project. I was able to see the efficient C-generated assembly for your sum() function.

If you add -k to your compile then a .asm file will be generated that shows the assembly code. I encourage you to do that and take a look at the assembly that is generated.

The resulting assembly code is quite impressive. It uses the SPLOOP mechanism and does many loads, multiplies, adds and stores in each loop. It takes advantage of the large GP register set by keeping lots of in-flight values during the loop, and as a result the asm code is rather huge. It would be quite difficult to write this by hand.
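
The idea of keeping values in flight can be modelled in plain C. The sketch below is a conceptual illustration only, with a hypothetical function name; it is not the compiler's output and does not use SPLOOP. The loads for element i+1 are issued in the same pass that computes and stores element i, with a prologue to start the first loads and an epilogue to drain the last element.

```c
#include <assert.h>
#include <stddef.h>

/* e[i] = a[i]*b[i] + a[i]*c[i], written so that each pass overlaps the
 * loads for the next element with the arithmetic for the current one. */
void sum_pipelined(float *e, const float *a, const float *b, const float *c,
                   size_t n)
{
    float a_cur, b_cur, c_cur;   /* values loaded one pass ahead */
    size_t i;

    if (n == 0)
        return;
    assert(e && a && b && c);

    /* Prologue: start the first loads before any arithmetic. */
    a_cur = a[0];
    b_cur = b[0];
    c_cur = c[0];

    /* Kernel: compute element i while the loads for i + 1 are issued. */
    for (i = 0; i + 1 < n; i++) {
        float a_next = a[i + 1];
        float b_next = b[i + 1];
        float c_next = c[i + 1];

        e[i] = a_cur * b_cur + a_cur * c_cur;

        a_cur = a_next;
        b_cur = b_next;
        c_cur = c_next;
    }

    /* Epilogue: drain the last in-flight element. */
    e[n - 1] = a_cur * b_cur + a_cur * c_cur;
}
```

A real software-pipelined loop overlaps several stages (loads, multiplies, adds, stores) across many iterations at once, which is where the large register usage and code size come from.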

The C-generated assembly sacrifices code size in order to achieve its high speed. If you need to keep code size small, you would benefit from a smaller, less efficient hand-written assembly function. The one I wrote could be expanded to be faster, so that multiplies happen in parallel with loads, but it would be hard to match the compiler-generated code for speed.

Given all the tradeoffs, I would stick with C code for this.

Regards,

- Rob

0 Frank Zach over 8 years ago in reply to Robert Tivy

Prodigy 200 points

hi,

-Rob.

I don't understand the concern about "code size". NOR flash and NAND flash are very large nowadays, so why does code size need to be considered? Can you explain?

I am now learning the C6000 DSP. In the image processing field, GPUs are developing fast. As a student, I feel confused: where is the future of the C6000 DSP?

Can you talk about your experience?

Regards,

Frank

0 Robert Tivy over 8 years ago in reply to Frank Zach

TI__Mastermind 18260 points

Frank Zach said:
I don't understand the concern about "code size". NOR flash and NAND flash are very large nowadays, so why does code size need to be considered? Can you explain?

I was only mentioning "if code size is a concern". Some platforms have lots of space for code, others don't. I don't know your system and was simply pointing out that the faster code comes at a cost of larger code size.

Frank Zach said:
I am now learning the C6000 DSP. In the image processing field, GPUs are developing fast. As a student, I feel confused: where is the future of the C6000 DSP?

That question is perhaps better answered by some other e2e forum, or perhaps at another web site altogether. I don't know GPUs so I can't really comment on how they compare to DSP. If you ask a DSP manufacturer which is better then the answer will likely be biased toward the DSP.

DSPs are more generalized than a GPU. One aspect of a DSP is a fast math engine, but it can also support lots of I/O and control code for OS-type processing. Cost of the chip will also be another major factor, in addition to the cost of the rest of the system that is needed around the chip.

Regards,

- Rob

0 Frank Zach over 8 years ago in reply to Robert Tivy

Prodigy 200 points

hi,
-Rob.
thank you for your answer.
Regards.
Frank Zach.
