
DSPLIB performance on complex multiply

Good day!

I'm using the EVMC6678L.

I have a problem with the performance of complex multiplication (working with floats).

For example:

#define N_ 2048

#define M_ 192

#define complex_size 2   /* interleaved re/im floats per complex sample (assumed) */

#pragma DATA_ALIGN(bufA, 8)
#pragma DATA_SECTION(bufA, "SHARED_MEM")
float bufA[M_][N_*complex_size];

#pragma DATA_ALIGN(bufC, 8)
#pragma DATA_SECTION(bufC, "SHARED_MEM")
float bufC[M_][M_*complex_size];

......

DSPF_sp_dotp_cplx(&bufA[0][0], &bufA[0][0], N_, &bufC[0][0], &bufC[0][1]); takes about 1.04 cycles per iteration (one complex multiply) when I use the precompiled library; the expected performance is about 0.75-0.8 cycles per iteration (from the TI benchmark).

But to solve my task I had to change the source code of this function slightly (replacing _complex_mpysp with _complex_conjugate_mpysp).

Built from source (not from the library) it takes 2.56 cycles per iteration (with -O2 or -O3); I get the same result for the unmodified source of DSPF_sp_dotp_cplx.
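
For reference, the operation I need is a conjugate complex dot product; a plain-C sketch (a functional reference only, not the optimized intrinsic kernel, and the function name is just an example) would be:

void dotp_conj_cplx_ref(const float *x, const float *y, int nx,
                        float *re, float *im)
{
    /* (re, im) = sum over i of x[i] * conj(y[i]), with the DSPLIB data
     * layout: x[2*i] = Re(x[i]), x[2*i+1] = Im(x[i]).                  */
    float sum_re = 0.0f, sum_im = 0.0f;
    int i;

    for (i = 0; i < nx; i++) {
        float xr = x[2*i], xi = x[2*i + 1];
        float yr = y[2*i], yi = y[2*i + 1];

        /* (xr + j*xi) * (yr - j*yi) */
        sum_re += xr * yr + xi * yi;
        sum_im += xi * yr - xr * yi;
    }

    *re = sum_re;
    *im = sum_im;
}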

So, the main question: could you explain how to build properly optimized code from the source? The other option would be to rebuild the whole DSPLIB with my changes, but I don't know how to do that.

 

Thanks!

  •  

    I haven't tested that particular function, but I suppose the TI benchmarks are run with the data mapped directly into L1 or L2 (used as very fast RAM, not as cache).

    About the compilation options: have you verified that the debug model is "--symdebug:none" or, alternatively, that the compiler option "--optimize_with_debug" (Runtime Model Options) is on?

    With my installation the default is symdebug ON even in the Release configuration, and I have verified that in this case the optimizer cannot do its best.

    Also, I suppose the optimization option --opt_for_speed=5 could help.
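
    For example, when building outside of CCS the same options can be passed directly on the compiler command line; roughly something like this (the device switch and the include path are placeholders for your setup):

        cl6x -mv6600 -O3 --opt_for_speed=5 --abi=eabi --symdebug:none -k --include_path=<dsplib>/inc DSPF_sp_dotp_cplx.c

    (-k keeps the generated .asm so you can compare it against the one shipped with DSPLIB.)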

     

  • Thanks, Alberto.

    I have checked all of that: --symdebug:none -O3 --optimize_with_debug --big_endian --abi=eabi --opt_for_speed=5

    As a result I get more stable timings (2.5400 cycles per iteration for my custom code and for the original source built by me, and 1.0400 for the library version); before, the numbers varied by about ±0.0100 per iteration.

    The library function also executes from L2SRAM.

    I think the only remaining solution is to build my custom function as a static library with the same build properties that DSPLIB was built with (I have already tried that, but I don't know those properties).

    So I don't understand how to get to about 1.04 cycles when building from source.

  •  

    Sorry, I cannot help. I also want to extract some functions from that library and rebuild them, so probably I'm going to run into the same problem.

    Have you already checked your options and the generated code against DSPF_sp_dotp_cplx.asm in the Release directory of the DSPLIB package?

     

  •  

    I have rebuilt DSPF_sp_dotp_cplx(). On my EVM (and on the simulator as well) it performs exactly like the library version. Note that the first call performs more or less like your rebuilt version, due to the cache fill time.

    My test code simply does the following (DSPF_sp_dotp_cplx_2 is the rebuilt one):

        DSPF_sp_dotp_cplx_2(&bufA[0][0], &bufA[0][0], N_, &bufC[0][0], &bufC[0][1]);               //first call, data and code not in cache
        DSPF_sp_dotp_cplx_2(&bufA[0][0], &bufA[0][0], N_, &bufC[0][0], &bufC[0][1]);               //data and code in cache
        DSPF_sp_dotp_cplx(&bufA[0][0], &bufA[0][0], N_, &bufC[0][0], &bufC[0][1]);                   //code not in cache, data already in cache
        DSPF_sp_dotp_cplx(&bufA[0][0], &bufA[0][0], N_, &bufC[0][0], &bufC[0][1]);                   //data and code in cache

    and the clock shows:

    - first call, rebuilt: 4576 ticks / 2048 = 2.2 ticks per iteration

    - second call, rebuilt: 2117 ticks / 2048 = 1.03 ticks per iteration

    - first call, library: 2130 ticks / 2048 = 1.04 ticks per iteration

    - second call, library: 2129 ticks / 2048 = 1.03 ticks per iteration
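
    For reference, one way to take this kind of per-call tick count on C66x is the free-running TSCL counter from c6x.h; a minimal sketch (assuming the dsplib.h header from the DSPLIB package, and not necessarily the exact measurement code used here):

        #include <c6x.h>              /* TSCL core cycle counter register   */
        #include <ti/dsplib/dsplib.h> /* DSPF_sp_dotp_cplx() prototype      */

        /* Returns raw core clock ticks spent in one call; the measurement
         * overhead itself (a few tens of cycles) must be subtracted to get
         * the net figure.                                                  */
        static unsigned int time_dotp(const float *x, const float *y, int n,
                                      float *re, float *im)
        {
            unsigned int t0, t1;

            TSCL = 0;                 /* any write enables the counter      */
            t0 = TSCL;
            DSPF_sp_dotp_cplx(x, y, n, re, im);
            t1 = TSCL;

            return t1 - t0;           /* raw ticks, overhead included       */
        }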

     

    With CCS5 and:

      Compiler: 7.4.4

      dsplib: 3.0.0.8

     

  • Great result!

    But I still can't achieve this. Could you send me your test project?

    krechetov_ivan@mail.ru

     

    Many thanks!!!!!

  • I got it!

    for L2SRAM:

    [C66xx_0] Cycles RUR_dotp_conj_cplx iteration #1 = 1.053711 Cycles per DSPF_dotp_cplx iteration #2 = 1.046387
    [C66xx_0] Cycles RUR_dotp_conj_cplx iteration #1 = 1.053711 Cycles per DSPF_dotp_cplx iteration #2 = 1.046387
    [C66xx_0] Cycles RUR_dotp_conj_cplx iteration #1 = 1.053711 Cycles per DSPF_dotp_cplx iteration #2 = 1.046387

    My timing function itself takes 45 cycles, so:

    => (1.053711 * 2048 - 45) / 2048 = 1.0317 cycles per iteration

    for MSMCSRAM:

    [C66xx_0] Cycles RUR_dotp_conj_cplx iteration #1 = 1.060547 Cycles per DSPF_dotp_cplx iteration #2 = 1.050293

    L2SRAM vs. MSMCSRAM => about 0.65% difference

    It was my fault: during all the testing I had slightly damaged the original source, so a few extra code lines were left in (only in the big-endian part). ))))

     

    => Thanks, Alberto <=

     

    Summary:

    To get full performance from custom code based on DSPLIB (and other hand-optimized code) you need:

    optimization level -O2 (or -O3)

    --opt_for_speed=5

    --optimize_with_debug

    CODE_SECTION placement => L2SRAM or MSMCSRAM (see the example below)

    --symdebug:none does not influence the result, so you can keep full debug info
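
    For the section placement, something along these lines can be used (a sketch only; the memory region names must match the ones defined in your linker command file or platform, L2SRAM and MSMCSRAM being the usual names for the C6678 EVM):

        /* In the C source: put the hot function into its own named section. */
        #pragma CODE_SECTION(RUR_dotp_conj_cplx, ".fast_code")
        void RUR_dotp_conj_cplx(const float *x, const float *y, int n,
                                float *re, float *im);

        /* In the linker command file, inside the SECTIONS directive: */
        .fast_code > L2SRAM      /* or > MSMCSRAM */
        SHARED_MEM > MSMCSRAM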

  • Hello,
    can you tell me where to add these optimization flags, since I am not using Code Composer?
    And what does "placement for CODE_SECTION => L2SRAM or MSMCSRAM" mean?
    Regards