Execution time of complex matrix multiplication

Hi,

I measured the execution time of a complex matrix multiplication using the TSC.

The C6678 runs slower than I expected.

Is my measurement correct?

===== environment: EVM6678LE, CCS 5.2.1, C6000 compiler 7.4, DSPLIB 3.1.0

===== code:

#define LOOP_COUNT       10
#define CPU_FREQ             1e9               /* CPU freq = 1 GHz */
#define MSEC                        1e3               /* 1 sec = 1000 msec */

#define antenna_length       (12)
#define sample_length        (1024)

/* uca_output dimension = antenna_length x sample_length */
float uca_output[2 * antenna_length * sample_length];

/* uca_output_trans dimension = sample_length x antenna_length */
float uca_output_trans[2 * antenna_length * sample_length];

/* uca_hermitian dimension = antenna_length x antenna_length */
float uca_hermitian[2 * antenna_length * antenna_length];

    TSCL = 0;
    t_start = TSCL;
    for (loop = 0; loop < LOOP_COUNT; loop++)
    {
        DSPF_sp_mat_mul_cplx(uca_output, antenna_length, sample_length, uca_output_trans, antenna_length, uca_hermitian);
    }
    t_stop = TSCL;

    printf("[complex multiply] loop = %d, row = %d, column = %d, execution time = %f msec\n", \
                LOOP_COUNT, antenna_length, sample_length, \
                1 / CPU_FREQ * (t_stop - t_start) / LOOP_COUNT * MSEC);
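
For readers checking the dimensions in the call above, here is a minimal reference sketch of what the multiplication computes. It assumes row-major storage with interleaved real/imaginary floats (as the `2 * ...` array sizes suggest) and takes the argument order from the call in this post. `ref_mat_mul_cplx` is a hypothetical name used only for illustration; it is not the DSPLIB source.

```c
/* Hypothetical reference routine, not the DSPLIB implementation.
 * Computes y = x1 * x2 where
 *   x1 is r1 x c1, x2 is c1 x c2, y is r1 x c2,
 * all row-major, each complex element stored as {re, im}. */
void ref_mat_mul_cplx(const float *x1, int r1, int c1,
                      const float *x2, int c2, float *y)
{
    int i, j, k;

    for (i = 0; i < r1; i++) {
        for (j = 0; j < c2; j++) {
            float re = 0.0f;
            float im = 0.0f;

            for (k = 0; k < c1; k++) {
                float a_re = x1[2 * (i * c1 + k)];
                float a_im = x1[2 * (i * c1 + k) + 1];
                float b_re = x2[2 * (k * c2 + j)];
                float b_im = x2[2 * (k * c2 + j) + 1];

                /* (a_re + j a_im) * (b_re + j b_im), accumulated */
                re += a_re * b_re - a_im * b_im;
                im += a_re * b_im + a_im * b_re;
            }

            y[2 * (i * c2 + j)]     = re;
            y[2 * (i * c2 + j) + 1] = im;
        }
    }
}
```

With the sizes in this post, `ref_mat_mul_cplx(uca_output, antenna_length, sample_length, uca_output_trans, antenna_length, uca_hermitian)` would fill the 12 x 12 result, which is a convenient way to cross-check the library output on a host PC.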

===== results:

[C66xx_0] [complex multiply] loop = 10, row = 12, column = 1024, execution time = 0.368105 msec
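
To put the result in perspective, the reported time can be converted back to cycles and divided by the number of complex multiply-accumulates. This is only a back-of-the-envelope sketch: the helper names are hypothetical, and it assumes the 1 GHz clock from `CPU_FREQ` and one complex MAC per (row, inner, column) triple. With the numbers above it gives about 368105 cycles per call for 147456 complex MACs, so roughly 2.5 cycles per complex MAC.

```c
/* Hypothetical helpers for sanity-checking the measurement. */

/* Number of complex multiply-accumulates in an (r1 x c1) * (c1 x c2) product. */
long complex_mac_count(long r1, long c1, long c2)
{
    return r1 * c1 * c2;
}

/* Convert a time in milliseconds to CPU cycles at the given clock rate. */
double msec_to_cycles(double msec, double cpu_hz)
{
    return msec * 1e-3 * cpu_hz;
}

/* Average cycles spent per complex multiply-accumulate. */
double cycles_per_complex_mac(double msec, double cpu_hz,
                              long r1, long c1, long c2)
{
    return msec_to_cycles(msec, cpu_hz) / (double)complex_mac_count(r1, c1, c2);
}
```

For example, `cycles_per_complex_mac(0.368105, 1e9, 12, 1024, 12)` uses the figures from this post.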

===================================================================================

 

  • Hi,

    What are your compilation options?

    In particular, the optimization level and debug type.

    CM

  • The performance also depends on where the data arrays are located. If they are in DDR3, for example, there may be significant cache penalties that slow down the performance. What kind of performance are you expecting?

    Xiaohui

  • Hi,

    I have pasted the build console output below.

    Thanks for your comments.

    'Building file: ../make_hermitian.c'
    'Invoking: C6000 Compiler'
    "C:/ti/ccsv5/tools/compiler/c6000_7.4.0/bin/cl6x" -mv6600 --abi=coffabi -O3 --symdebug:none --include_path="C:/ti/ccsv5/tools/compiler/c6000_7.4.0/include" --include_path="C:/ti/dsplib_c66x_3_1_0_0/inc" --include_path="C:/ti/dsplib_c66x_3_1_0_0/packages" --display_error_number --diag_warning=225 --preproc_with_compile --preproc_dependency="make_hermitian.pp"  "../make_hermitian.c"
    'Finished building: ../make_hermitian.c'
    ' '
    'Building target: make_hermitian.out'
    'Invoking: C6000 Linker'
    "C:/ti/ccsv5/tools/compiler/c6000_7.4.0/bin/cl6x" -mv6600 --abi=coffabi -O3 --symdebug:none --display_error_number --diag_warning=225 -z -m"make_hermitian.map" -i"C:/ti/ccsv5/tools/compiler/c6000_7.4.0/lib" -i"C:/ti/ccsv5/tools/compiler/c6000_7.4.0/include" --reread_libs --warn_sections --display_error_number --rom_model -o "make_hermitian.out"  "./make_hermitian.obj" "./DSPF_sp_mat_mul_cplx_cn.obj" -l"libc.a" -l"C:\ti\dsplib_c66x_3_1_0_0\lib\dsplib.a66" "../lnk.cmd"
    <Linking>
    'Finished building target: make_hermitian.out'

     

  • Hi,

    The data arrays are not in DDR3.

    The execution time was 0.168 msec in the following environment:

    cpu = intel i7-920 2.6GHz, chipset = x58, os = w7 32-bit, c compiler 32-bit build.

    I have pasted lnk.cmd below.

    Thanks for the comment.

    ===============================

    -heap  0x8000
    -stack 0xC000

    MEMORY
    {
        L2SRAM (RWX) : org = 0x800000, len = 0x100000
        MSMCSRAM (RWX) : org = 0xc000000, len = 0x200000
    }

    SECTIONS
    {

        .text: load >> L2SRAM
        .text:touch: load >> L2SRAM
       
        GROUP (NEAR_DP)
        {
        .neardata
        .rodata
        .bss
        } load > L2SRAM
      
        .far: load >> L2SRAM
        .fardata: load >> L2SRAM
        .data: load >> L2SRAM
        .switch: load >> L2SRAM
        .stack: load > L2SRAM
        .args: load > L2SRAM align = 0x4, fill = 0 {_argsize = 0x200; }
        .sysmem: load > L2SRAM
        .cinit: load > L2SRAM
        .const: load > L2SRAM START(const_start) SIZE(const_size)
        .pinit: load > L2SRAM
        .cio: load >> L2SRAM
        xdc.meta: load >> L2SRAM, type = COPY
    }

  • Well, where is your data then? MSMCRAM or L2?

    Is L2 configured as full cache or full RAM?
    Same question for L1.

    Did you use memory alignment?

    CM

  • Text is in L1P (full cache), data is in L2 (full RAM).

    L1D is too small to hold the matrices.

    For memory alignment:

    =================================

    #pragma DATA_ALIGN(uca_output, 8);
    #pragma DATA_ALIGN(uca_output_trans, 8);
    #pragma DATA_ALIGN(uca_hermitian, 8);

    =================================

    I modified lnk.cmd, but it showed no improvement.

    Thanks,

    ========================

    -heap  0x8000
    -stack 0xC000

    MEMORY
    {
        /* ddr3 */
        //DDR3 (RWX) : org = 0x80000000, len = 0x1000000

        /* fake shared ram */
        //L2SRAM (RWX) : org = 0xc000000, len = 0x200000

        L2SRAM (RWX) : org = 0x800000, len = 0x70000

        /* shared ram */
        //MSMCSRAM (RWX) : org = 0xc000000, len = 0x200000

        L1PRAM (RWX) : org = 0xe00000, len = 0x8000
    }

    SECTIONS
    {
        .text: load > L1PRAM
       
        .bss > L2SRAM
        .far: load > L2SRAM
        .data: load > L2SRAM
        .stack: load > L2SRAM
        .sysmem: load > L2SRAM
        .cinit: load > L2SRAM
        .const: load > L2SRAM START(const_start) SIZE(const_size)
        .cio: load > L2SRAM
        //xdc.meta: load >> L2SRAM, type = COPY
    }

    ===================================