Execution time of complex matrix multiplication

heung yong kang

Prodigy 40 points

Hi,

I measured the execution time of a complex matrix multiplication using the TSC.

The C6678 runs slower than I expected.

Is my measurement correct?

===== environment: EVM6678LE, CCS 5.2.1, C6000 compiler 7.4, DSPLIB 3.1.0

===== codes:

#define LOOP_COUNT       10
#define CPU_FREQ             1e9               /* CPU freq = 1 GHz */
#define MSEC                        1e3               /* 1 sec = 1000 msec */

#define antenna_length (12)
#define sample_length (1024)

/* uca_output dimension = antenna_length x sample_length */
float uca_output[2 * antenna_length * sample_length];

/* uca_output_trans dimension = sample_length x antenna_length */
float uca_output_trans[2 * antenna_length * sample_length];

/* uca_hermitian dimension = antenna_length x antenna_length */
float uca_hermitian[2 * antenna_length * antenna_length];

    TSCL = 0;
    t_start = TSCL;
    for (loop = 0; loop < LOOP_COUNT; loop++)
    {
        DSPF_sp_mat_mul_cplx(uca_output, antenna_length, sample_length, uca_output_trans, antenna_length, uca_hermitian);
    }
    t_stop = TSCL;

    printf("[complex multiply] loop = %d, row = %d, column = %d, execution time = %f msec\n", \
                LOOP_COUNT, antenna_length, sample_length, \
                1 / CPU_FREQ * (t_stop - t_start) / LOOP_COUNT * MSEC);

===== results:

[C66xx_0] [complex multiply] loop = 10, row = 12, column = 1024, execution time = 0.368105 msec
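
As a rough cross-check of the reported figure, here is a minimal sketch of the conversion done in the printf above, plus the cycle cost per complex multiply-accumulate. The helper names `tsc_delta_to_msec` and `cycles_per_cmac` are hypothetical and not part of DSPLIB, and the total of 3,681,050 cycles is back-computed from the printed 0.368105 msec rather than measured.

```c
#include <stdint.h>

/* Convert a TSCL delta to milliseconds per call.
 * uint32_t subtraction is modular, so a single wrap of the 32-bit TSCL
 * between t_start and t_stop still yields the correct delta. */
static double tsc_delta_to_msec(uint32_t t_start, uint32_t t_stop,
                                unsigned loops, double cpu_hz)
{
    uint32_t delta = t_stop - t_start;
    return (double)delta / cpu_hz * 1e3 / (double)loops;
}

/* Cycles spent per complex multiply-accumulate for an
 * (r1 x c1) * (c1 x c2) complex matrix product. */
static double cycles_per_cmac(double cycles_per_call,
                              unsigned r1, unsigned c1, unsigned c2)
{
    double cmacs = (double)r1 * (double)c1 * (double)c2;
    return cycles_per_call / cmacs;
}
```

With the numbers from the post, `tsc_delta_to_msec(0, 3681050, 10, 1e9)` gives about 0.368 msec, and `cycles_per_cmac(368105.0, 12, 1024, 12)` gives about 2.5 cycles for each of the 147,456 complex multiply-accumulates. Note that 10 calls at this rate stay well below the roughly 4.29 s wrap period of the 32-bit TSCL at 1 GHz.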

===================================================================================

over 12 years ago

Clement FR over 12 years ago

Genius 4750 points

Hi,

What are your compilation options?

In particular, the optimization level and debug type.

Xiaohui Li over 12 years ago

TI__Intellectual 1870 points

The performance also depends on where the data arrays are located. If they are in DDR3, for example, cache miss penalties may slow execution significantly. What kind of performance are you expecting?

Xiaohui

heung yong kang over 12 years ago in reply to Clement FR

Prodigy 40 points

Hi,

I have pasted the build console output below.

Thanks for your comments.

'Building file: ../make_hermitian.c'
'Invoking: C6000 Compiler'
"C:/ti/ccsv5/tools/compiler/c6000_7.4.0/bin/cl6x" -mv6600 --abi=coffabi -O3 --symdebug:none --include_path="C:/ti/ccsv5/tools/compiler/c6000_7.4.0/include" --include_path="C:/ti/dsplib_c66x_3_1_0_0/inc" --include_path="C:/ti/dsplib_c66x_3_1_0_0/packages" --display_error_number --diag_warning=225 --preproc_with_compile --preproc_dependency="make_hermitian.pp" "../make_hermitian.c"
'Finished building: ../make_hermitian.c'
' '
'Building target: make_hermitian.out'
'Invoking: C6000 Linker'
"C:/ti/ccsv5/tools/compiler/c6000_7.4.0/bin/cl6x" -mv6600 --abi=coffabi -O3 --symdebug:none --display_error_number --diag_warning=225 -z -m"make_hermitian.map" -i"C:/ti/ccsv5/tools/compiler/c6000_7.4.0/lib" -i"C:/ti/ccsv5/tools/compiler/c6000_7.4.0/include" --reread_libs --warn_sections --display_error_number --rom_model -o "make_hermitian.out" "./make_hermitian.obj" "./DSPF_sp_mat_mul_cplx_cn.obj" -l"libc.a" -l"C:\ti\dsplib_c66x_3_1_0_0\lib\dsplib.a66" "../lnk.cmd"
<Linking>
'Finished building target: make_hermitian.out'

heung yong kang over 12 years ago in reply to Xiaohui Li

Prodigy 40 points

Hi,

The data arrays are not in DDR3.

For comparison, the execution time was 0.168 msec in the following environment:

CPU = Intel i7-920 2.6 GHz, chipset = X58, OS = Windows 7 32-bit, C compiler with a 32-bit build.

I have pasted my lnk.cmd below.

Thanks for the comment.

===============================

-heap 0x8000
-stack 0xC000

MEMORY
{
L2SRAM (RWX) : org = 0x800000, len = 0x100000
MSMCSRAM (RWX) : org = 0xc000000, len = 0x200000
}

SECTIONS
{

    .text: load >> L2SRAM
    .text:touch: load >> L2SRAM

    GROUP (NEAR_DP)
    {
    .neardata
    .rodata
    .bss
    } load > L2SRAM

    .far: load >> L2SRAM
    .fardata: load >> L2SRAM
    .data: load >> L2SRAM
    .switch: load >> L2SRAM
    .stack: load > L2SRAM
    .args: load > L2SRAM align = 0x4, fill = 0 {_argsize = 0x200; }
    .sysmem: load > L2SRAM
    .cinit: load > L2SRAM
    .const: load > L2SRAM START(const_start) SIZE(const_size)
    .pinit: load > L2SRAM
    .cio: load >> L2SRAM
    xdc.meta: load >> L2SRAM, type = COPY
}
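
The two timings quoted in this thread come from CPUs with different clock rates, so it may help to compare them in cycles as well as in wall-clock time. The following is a minimal sketch under stated assumptions: the i7-920 is taken to run at the 2.6 GHz quoted above with no frequency scaling, and the helper names `msec_to_cycles` and `ratio` are hypothetical.

```c
/* Convert a wall-clock time in msec to CPU cycles at a given clock rate. */
static double msec_to_cycles(double msec, double cpu_hz)
{
    return msec * 1e-3 * cpu_hz;
}

/* Plain ratio a / b, used for both the time and the cycle comparison. */
static double ratio(double a, double b)
{
    return a / b;
}
```

Under those assumptions, `msec_to_cycles(0.168, 2.6e9)` is about 436,800 cycles for the i7 against about 368,105 cycles for the C6678 at 1 GHz. The wall-clock ratio `ratio(0.368105, 0.168)` is roughly 2.19, while the cycle ratio `ratio(368105.0, 436800.0)` is roughly 0.84, which suggests that most of the gap in msec follows from the clock rates rather than from the per-cycle efficiency of the kernel.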

Clement FR over 12 years ago in reply to heung yong kang

Genius 4750 points

Well, where is your data then? MSMCSRAM or L2?

Do you have L2 configured as full cache or as full RAM?
Same question for L1.

Did you use memory alignment?

heung yong kang over 12 years ago in reply to Clement FR

Prodigy 40 points

The text is in L1P (full cache), and the data is in L2 (full RAM).

L1D is too small to hold the matrices.

For memory alignment, I use:

=================================

#pragma DATA_ALIGN(uca_output, 8);
#pragma DATA_ALIGN(uca_output_trans, 8);
#pragma DATA_ALIGN(uca_hermitian, 8);

=================================

I modified lnk.cmd, but it showed no improvement.

Thanks,

========================

-heap 0x8000
-stack 0xC000

MEMORY
{
/* ddr3 */
//DDR3 (RWX) : org = 0x80000000, len = 0x1000000

/* fake shared ram */
//L2SRAM (RWX) : org = 0xc000000, len = 0x200000

L2SRAM (RWX) : org = 0x800000, len = 0x70000

/* shared ram */
//MSMCSRAM (RWX) : org = 0xc000000, len = 0x200000

L1PRAM (RWX) : org = 0xe00000, len = 0x8000
}

SECTIONS
{
    .text: load > L1PRAM

    .bss > L2SRAM
    .far: load > L2SRAM
    .data: load > L2SRAM
    .stack: load > L2SRAM
    .sysmem: load > L2SRAM
    .cinit: load > L2SRAM
    .const: load > L2SRAM START(const_start) SIZE(const_size)
    .cio: load > L2SRAM
    //xdc.meta: load >> L2SRAM, type = COPY
}

===================================
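
The statement above that L1D is too small for the matrices can be checked with a short footprint calculation. This is a minimal sketch, assuming a 32 KB L1D per core (the C66x CorePac figure for the C6678) and the 0x70000-byte L2SRAM region from the modified lnk.cmd. The helper names `cplx_matrix_bytes` and `is_aligned` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define antenna_length (12)
#define sample_length  (1024)

#define L1D_BYTES (32u * 1024u)   /* assumed: 32 KB L1D per C66x core */
#define L2_BYTES  (0x70000u)      /* L2SRAM length from the modified lnk.cmd */

/* Bytes used by a rows x cols complex matrix stored as interleaved
 * single-precision real/imaginary pairs. */
static size_t cplx_matrix_bytes(size_t rows, size_t cols)
{
    return 2u * rows * cols * sizeof(float);
}

/* Total footprint of the two inputs and the 12 x 12 output. */
static size_t total_matrix_bytes(void)
{
    return cplx_matrix_bytes(antenna_length, sample_length)     /* uca_output */
         + cplx_matrix_bytes(sample_length, antenna_length)     /* uca_output_trans */
         + cplx_matrix_bytes(antenna_length, antenna_length);   /* uca_hermitian */
}

/* Nonzero if p is aligned to `align` bytes (align must be a power of two),
 * e.g. the 8-byte alignment requested with DATA_ALIGN above. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (uintptr_t)(align - 1u)) == 0u;
}
```

Each 12 x 1024 complex matrix takes 98,304 bytes (96 KB), so the three arrays total 197,760 bytes. That is about six times the assumed L1D size but fits comfortably in the 0x70000-byte L2SRAM region.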
