
DSPLIB performance on complex multiply

Good day!

I'm using the EVMC6678L.

I have a problem with the performance of complex multiplication (working with floats).

For example:

#define N_ 2048

#define M_ 192

#define complex_size 2   /* interleaved re/im floats per complex sample (assumed) */

#pragma DATA_ALIGN(bufA, 8)
#pragma DATA_SECTION(bufA, "SHARED_MEM")
float bufA[M_][N_*complex_size];

#pragma DATA_ALIGN(bufC, 8)
#pragma DATA_SECTION(bufC, "SHARED_MEM")
float bufC[M_][M_*complex_size];

......

DSPF_sp_dotp_cplx(&bufA[0][0], &bufA[0][0], N_, &bufC[0][0], &bufC[0][1]); takes about 1.04 cycles per iteration (one complex multiply) when I use the precompiled library; the expected performance is about 0.75-0.8 cycles per iteration (from the TI benchmark).

But to solve my task I had to change the source code of this function slightly (replacing _complex_mpysp with _complex_conjugate_mpysp).

Built from source (not from the library) it takes 2.56 cycles per iteration (with -O2 or -O3); I get the same result for the unmodified source of DSPF_sp_dotp_cplx.
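
For reference, the operation I need is a conjugate complex dot product; a plain-C sketch (a functional reference only, not the optimized intrinsic kernel, and the function name is just an example) would be:

void dotp_conj_cplx_ref(const float *x, const float *y, int nx,
                        float *re, float *im)
{
    /* (re, im) = sum over i of x[i] * conj(y[i]), with the DSPLIB data
     * layout: x[2*i] = Re(x[i]), x[2*i+1] = Im(x[i]).                  */
    float sum_re = 0.0f, sum_im = 0.0f;
    int i;

    for (i = 0; i < nx; i++) {
        float xr = x[2*i], xi = x[2*i + 1];
        float yr = y[2*i], yi = y[2*i + 1];

        /* (xr + j*xi) * (yr - j*yi) */
        sum_re += xr * yr + xi * yi;
        sum_im += xi * yr - xr * yi;
    }

    *re = sum_re;
    *im = sum_im;
}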

So, the main question: could you explain how to build properly optimized code from the source? The other option would be to rebuild the whole DSPLIB with my changes, but I don't know how to do that.

 

Thanks!

  •  

    I haven't tested that particular function, but I suppose the TI benchmarks are run with the data mapped directly into L1 or L2 (used as very fast RAM, not as cache).

    About the compilation options: have you verified that the debug model is "--symdebug:none" or, alternatively, that the compiler option "--optimize_with_debug" (Runtime Model Options) is on?

    With my installation the default is symdebug ON even in the Release configuration, and I have verified that in this case the optimizer cannot do its best.

    Also, I suppose the optimization option --opt_for_speed=5 could help.
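
    For example, when building outside of CCS the same options can be passed directly on the compiler command line; roughly something like this (the device switch and the include path are placeholders for your setup):

        cl6x -mv6600 -O3 --opt_for_speed=5 --abi=eabi --symdebug:none -k --include_path=<dsplib>/inc DSPF_sp_dotp_cplx.c

    (-k keeps the generated .asm so you can compare it against the one shipped with DSPLIB.)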

     

  • Thanks, Alberto.

    I have checked all of that: --symdebug:none -O3 --optimize_with_debug --big_endian --abi=eabi --opt_for_speed=5

    As a result I get more stable timings (2.5400 cycles per iteration for my custom code and for the original source built by me, and 1.0400 for the library version); before, the numbers varied by about ±0.0100 per iteration.

    The library function also executes from L2SRAM.

    I think the only remaining solution is to build my custom function as a static library with the same build properties that DSPLIB was built with (I have already tried that, but I don't know those properties).

    So I don't understand how to get to about 1.04 cycles when building from source.

  •  

    Sorry, I cannot help. I also want to extract some functions from that library and rebuild them, so probably I'm going to run into the same problem.

    Have you already checked your options and the generated code against DSPF_sp_dotp_cplx.asm in the Release directory of the DSPLIB package?

     

  •  

    I have rebuilt DSPF_sp_dotp_cplx(). On my EVM (and on the simulator as well) it performs exactly like the library version. Note that the first call performs more or less like your rebuilt version, due to the cache fill time.

    My test code simply does the following (DSPF_sp_dotp_cplx_2 is the rebuilt one):

        DSPF_sp_dotp_cplx_2(&bufA[0][0], &bufA[0][0], N_, &bufC[0][0], &bufC[0][1]);               //first call, data and code not in cache
        DSPF_sp_dotp_cplx_2(&bufA[0][0], &bufA[0][0], N_, &bufC[0][0], &bufC[0][1]);               //data and code in cache
        DSPF_sp_dotp_cplx(&bufA[0][0], &bufA[0][0], N_, &bufC[0][0], &bufC[0][1]);                   //code not in cache, data already in cache
        DSPF_sp_dotp_cplx(&bufA[0][0], &bufA[0][0], N_, &bufC[0][0], &bufC[0][1]);                   //data and code in cache

    and the clock shows:

    - first call, rebuilt: 4576 ticks / 2048 = 2.2 ticks per iteration

    - second call, rebuilt: 2117 ticks / 2048 = 1.03 ticks per iteration

    - first call, library: 2130 ticks / 2048 = 1.04 ticks per iteration

    - second call, library: 2129 ticks / 2048 = 1.03 ticks per iteration
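
    For reference, one way to take this kind of per-call tick count on C66x is the free-running TSCL counter from c6x.h; a minimal sketch (assuming the dsplib.h header from the DSPLIB package, and not necessarily the exact measurement code used here):

        #include <c6x.h>              /* TSCL core cycle counter register   */
        #include <ti/dsplib/dsplib.h> /* DSPF_sp_dotp_cplx() prototype      */

        /* Returns raw core clock ticks spent in one call; the measurement
         * overhead itself (a few tens of cycles) must be subtracted to get
         * the net figure.                                                  */
        static unsigned int time_dotp(const float *x, const float *y, int n,
                                      float *re, float *im)
        {
            unsigned int t0, t1;

            TSCL = 0;                 /* any write enables the counter      */
            t0 = TSCL;
            DSPF_sp_dotp_cplx(x, y, n, re, im);
            t1 = TSCL;

            return t1 - t0;           /* raw ticks, overhead included       */
        }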

     

    With CCS5 and:

      Compiler: 7.4.4

      dsplib: 3.0.0.8

     

  • Great result!

    But I still can't achieve this. Could you send me your test project?

    krechetov_ivan@mail.ru

     

    Many thanks!!!!!

  • I got it!

    for L2SRAM:

    [C66xx_0] Cycles RUR_dotp_conj_cplx iteration #1 = 1.053711 Cycles per DSPF_dotp_cplx iteration #2 = 1.046387
    [C66xx_0] Cycles RUR_dotp_conj_cplx iteration #1 = 1.053711 Cycles per DSPF_dotp_cplx iteration #2 = 1.046387
    [C66xx_0] Cycles RUR_dotp_conj_cplx iteration #1 = 1.053711 Cycles per DSPF_dotp_cplx iteration #2 = 1.046387

    My timing function itself takes 45 cycles, so:

    => (1.053711 * 2048 - 45) / 2048 = 1.0317 cycles per iteration

    for MSMCSRAM:

    [C66xx_0] Cycles RUR_dotp_conj_cplx iteration #1 = 1.060547 Cycles per DSPF_dotp_cplx iteration #2 = 1.050293

    L2SRAM vs. MSMCSRAM => about 0.65% difference

    It was my fault: during all the testing I had slightly damaged the original source, so a few extra code lines were left in (only in the big-endian part). ))))

     

    => Thanks, Alberto <=

     

    Summary:

    To get full performance from custom code based on DSPLIB (and other hand-optimized code) you need:

    optimization level -O2 (or -O3)

    --opt_for_speed=5

    --optimize_with_debug

    CODE_SECTION placement => L2SRAM or MSMCSRAM (see the example below)

    --symdebug:none does not influence the result, so you can keep full debug info
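
    For the section placement, something along these lines can be used (a sketch only; the memory region names must match the ones defined in your linker command file or platform, L2SRAM and MSMCSRAM being the usual names for the C6678 EVM):

        /* In the C source: put the hot function into its own named section. */
        #pragma CODE_SECTION(RUR_dotp_conj_cplx, ".fast_code")
        void RUR_dotp_conj_cplx(const float *x, const float *y, int n,
                                float *re, float *im);

        /* In the linker command file, inside the SECTIONS directive: */
        .fast_code > L2SRAM      /* or > MSMCSRAM */
        SHARED_MEM > MSMCSRAM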

  • Hello,
    can you tell me where to add these optimization flags, since I am not using Code Composer?
    And what does "placement for CODE_SECTION => L2SRAM or MSMCSRAM" mean?
    Regards