Different cycle counts when using SHRAM and DDR3 for data/code sections (C6678)

I am working with a TMDSEVM6678L. I measured the cycle count of a 512-sample convolution and compared two builds: one with the data/code memory sections in DDR3 and one with them in SHRAM.

I expected the SHRAM build to take fewer cycles than the DDR3 build.

However, the cycle counts are as follows:

[Memory Sections]
data memory:SHRAM
code memory:SHRAM
stack memory:L2SRAM
⇒5282154 cycles

[Memory Sections]
data memory:DDR3
code memory:DDR3
stack memory:L2SRAM
⇒3324412 cycles
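To put the two totals on a per-operation basis: the loop nest in the code below executes SAMPLE × M × L × N = 64 × 16 × 4 × 512 = 2,097,152 multiply-accumulates (MACs), which works out to roughly 2.52 cycles per MAC with SHRAM and roughly 1.59 with DDR3. The small C sketch below only reproduces that arithmetic; the helper names (mac_count, cycles_per_mac, report) are illustrative and not part of the original project.

```c
#include <stdio.h>

/* Number of multiply-accumulates executed by the benchmark loop nest:
 * SAMPLE x M x L x N iterations of "temp += x[k][j][i]*y[k][j][i]". */
unsigned long long mac_count(unsigned samples, unsigned m,
                             unsigned l, unsigned n)
{
    return (unsigned long long)samples * m * l * n;
}

/* Average cycles spent per multiply-accumulate for a measured total. */
double cycles_per_mac(unsigned long long cycles, unsigned long long macs)
{
    return (double)cycles / (double)macs;
}

/* Print the per-MAC figures for the two measurements reported above. */
void report(void)
{
    unsigned long long macs = mac_count(64, 16, 4, 512);
    printf("MACs             = %llu\n", macs);
    printf("SHRAM cycles/MAC = %.3f\n", cycles_per_mac(5282154ULL, macs));
    printf("DDR3  cycles/MAC = %.3f\n", cycles_per_mac(3324412ULL, macs));
}
```

Calling report() prints the total MAC count followed by the per-MAC cost of each build.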

Why does this happen? And how can I reduce the cycle count when using SHRAM?


The code is as follows (the base project is omp_1_02_00_05\packages\ti\omp\examples (omp_matvec.c)):

-------------------------------------------------------------------------------------------------------
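#include <ti/omp/omp.h>
#include <stdio.h>
#include <math.h>
#include <ti/csl/soc.h>
#include <ti/csl/csl.h>
#include <ti/csl/csl_tsc.h>      /* CSL_tscEnable(), CSL_tscRead() */
#include <ti/csl/csl_cache.h>
#include <ti/csl/csl_cacheAux.h>

#define NTHREADS 8

#define N (512)
#define M (16)
#define L (4)
#define SAMPLE (64)

double x[L][M][N], y[L][M][N];

/* Prototypes, so main() does not call getcycles() through an implicit
   int-returning declaration (which would truncate the 64-bit count). */
void wcycles(unsigned long long *c);
unsigned long long getcycles(void);

int main(void)
{
    int i, j, k, n;
    int L1Dsize, L1Psize, L2size;
    CSL_Uint64 start, end;

    CACHE_setL1DSize(CACHE_L1_32KCACHE);
    CACHE_setL1PSize(CACHE_L1_32KCACHE);
    CACHE_setL2Size(CACHE_256KCACHE);
    L1Dsize = CACHE_getL1DSize();
    L1Psize = CACHE_getL1PSize();
    L2size = CACHE_getL2Size();
    printf("L1Dsize = %d \n", L1Dsize);
    printf("L1Psize = %d \n", L1Psize);
    printf("L2size = %d \n", L2size);

    /* Each MAR register controls one 16 MB region, so MAR16..MAR23
       cover addresses 0x10000000-0x17FFFFFF. */
    CACHE_enableCaching(16);
    CACHE_enableCaching(17);
    CACHE_enableCaching(18);
    CACHE_enableCaching(19);
    CACHE_enableCaching(20);
    CACHE_enableCaching(21);
    CACHE_enableCaching(22);
    CACHE_enableCaching(23);

    start = getcycles();
    for (n = 0; n < SAMPLE; n++)
    {
        for (j = 0; j < M; j++)
        {
            for (k = 0; k < L; k++)
            {
                /* temp is never used after this loop, so an optimizing
                   build may eliminate the whole nest; store the result
                   (e.g. in a volatile) if optimization is turned on. */
                double temp = 0.0;
                for (i = 0; i < N; i++)
                {
                    temp += x[k][j][i] * y[k][j][i];
                }
            }
        }
    }
    end = getcycles();
    printf("Elapsed time(1 core ) =\t%llu\tcycles\n",
           (unsigned long long)(end - start));
    return 0;
}

/* Reads the 64-bit time-stamp counter, enabling it on first use. */
void wcycles(unsigned long long *c)
{
    static int first = 1;
    if (first)
    {
        CSL_tscEnable();
        first = 0;
    }
    *c = CSL_tscRead();
}

unsigned long long getcycles(void)
{
    unsigned long long cycles;
    wcycles(&cycles);
    return cycles;
}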

-------------------------------------------------------------------------------------------------------
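A quick footprint check helps when reading these numbers: x and y each hold L × M × N = 4 × 16 × 512 doubles, i.e. 256 KB apiece, so one pass of the loop nest touches 512 KB. That is twice the 256 KB L2 cache and 16 times the 32 KB L1D cache configured in main(), so neither cache level can keep the arrays resident across the 64 outer iterations, and essentially every pass has to stream them again from SHRAM or DDR3. The sketch below only reproduces this arithmetic; it assumes an 8-byte double (as on the C66x compiler), and the function names are illustrative.

```c
#include <stddef.h>

/* Dimensions copied from the benchmark above. */
#define N (512)
#define M (16)
#define L (4)

/* Cache sizes requested in main(): 32 KB L1D and 256 KB L2. */
enum { L1D_CACHE_BYTES = 32 * 1024, L2_CACHE_BYTES = 256 * 1024 };

/* Bytes occupied by one array declared as double a[L][M][N]. */
size_t array_bytes(void)
{
    return (size_t)L * M * N * sizeof(double);
}

/* Bytes touched by one pass over both x and y. */
size_t working_set_bytes(void)
{
    return 2 * array_bytes();
}

/* Nonzero if a working set of the given size fits in a cache of the given size. */
int fits_in_cache(size_t bytes, size_t cache_bytes)
{
    return bytes <= cache_bytes;
}
```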

My development environment is as follows:
・CCS v5.2.1
・MCSDK(2.1.2.6)
・OpenMP(1.2.0.5)

Many thanks.

  • Interesting post. MSMC memory should of course be faster, but your numbers show the opposite.

    So let's start with the obvious things:
    Verify that the properties of the two projects are the same, namely the optimization level, speed vs. space level, and debug level.

    Verify that the alignment of the data in memory is the same.
    Verify that the alignment of the code is the same.

    Verify the above, repeat the measurements, and report the results to the forum.

    Ran
  • Thank you for your reply, Ran.

    I measured the cycles of the two projects again. Both projects have the same properties and the same data and code alignment: I copied one project and changed only the platform configuration (Properties -> General -> RTSC -> Platform). All other properties are at their defaults; I did not change the optimization level, debug level, and so on. The results are as follows:

    [Memory Sections]
    data memory:SHRAM
    code memory:SHRAM
    stack memory:L2SRAM
    ⇒5454720 cycles

    [Memory Sections]
    data memory:DDR3
    code memory:DDR3
    stack memory:L2SRAM
    ⇒3355529 cycles

    The results are the same as before.

    The memory maps are as follows:

    -------------------------------------[using SHRAM]---------------------------------------------

    MEMORY CONFIGURATION

    name origin length used unused attr fill
    ---------------------- -------- --------- -------- -------- ---- --------
    L2SRAM 00800000 00080000 0002bd4c 000542b4 RW X
    MSMCSRAM 0c000000 00400000 000fe426 00301bda RW X
    DDR3 80000000 20000000 00000000 20000000 RW X
    MSMCSRAM_NOCACHE a03c0000 00040000 0003996e 00006692 RW X


    SEGMENT ALLOCATION MAP

    run origin load origin length init length attrs members
    ---------- ----------- ---------- ----------- ----- -------
    00800000 00800000 0002bd4c 0002ac28 rw-
    00800000 00800000 0002ac28 0002ac28 rw- .localfar
    0082ac28 0082ac28 00001000 00000000 rw- .stack
    0082bc28 0082bc28 00000120 00000000 rw- .cio
    0082bd48 0082bd48 00000004 00000000 rw- .tls_tp
    0c080000 0c080000 00048210 00000000 rw-
    0c080000 0c080000 00048210 00000000 rw- .far
    0c0c8210 0c0c8210 00000010 00000010 r--
    0c0c8210 0c0c8210 00000010 00000010 r-- .const.1
    0c0c8220 0c0c8220 000351d2 000351d2 r-x
    0c0c8220 0c0c8220 00030640 00030640 r-x .text
    0c0f8860 0c0f8860 00004b92 00004b92 r-- .const.2
    0c0fd3f4 0c0fd3f4 00000004 00000000 rw-
    0c0fd3f4 0c0fd3f4 00000004 00000000 rw- .bss
    0c0fd400 0c0fd400 00000200 00000200 r-x
    0c0fd400 0c0fd400 00000200 00000200 r-x .vecs
    0c0fd600 0c0fd7b8 00000015 00000015 rw-
    0c0fd600 0c0fd7b8 00000015 00000015 rw- .neardata
    0c0fd618 0c0fd7d0 0000019c 0000019c rw-
    0c0fd618 0c0fd7d0 0000019c 0000019c rw- .fardata
    0c0fd96c 0c0fd96c 000000cc 000000cc r--
    0c0fd96c 0c0fd96c 000000b0 000000b0 r-- .switch
    0c0fda1c 0c0fda1c 0000001c 0000001c r-- .binit
    0c0fdc00 0c0fdc00 000009fc 000009fc r-x
    0c0fdc00 0c0fdc00 000000a0 000000a0 r-x .text:_c_int00
    0c0fdca0 0c0fdca0 0000095c 0000095c r-- .cinit
    a03e5000 a03e5000 000120ac 000120ac rw-
    a03e5000 a03e5000 000120ac 000120ac rw- .qmss.1
    a03f70b0 a03f70b0 00000050 00000050 rw-
    a03f70b0 a03f70b0 00000050 00000050 rw- gomp_data.1
    a03f7100 a03f7100 00002672 00002672 rw-
    a03f7100 a03f7100 00002400 00002400 rw- .qmss.2
    a03f9500 a03f9500 00000272 00000272 rw- gomp_data.2
    a03f9780 a03f9780 00000200 00000000 rw-
    a03f9780 a03f9780 00000200 00000000 rw- .cppi
    ----------------------------------------------------------------------------------------------------------


    -------------------------------------[using DDR3]--------------------------------------------------

    MEMORY CONFIGURATION

    name origin length used unused attr fill
    ---------------------- -------- --------- -------- -------- ---- --------
    L2SRAM 00800000 00080000 0002bd4c 000542b4 RW X
    MSMCSRAM 0c000000 00400000 00000000 00400000 RW X
    DDR3 80000000 20000000 000fe426 1ff01bda RW X
    MSMCSRAM_NOCACHE a03c0000 00040000 0003996e 00006692 RW X


    SEGMENT ALLOCATION MAP

    run origin load origin length init length attrs members
    ---------- ----------- ---------- ----------- ----- -------
    00800000 00800000 0002bd4c 0002ac28 rw-
    00800000 00800000 0002ac28 0002ac28 rw- .localfar
    0082ac28 0082ac28 00001000 00000000 rw- .stack
    0082bc28 0082bc28 00000120 00000000 rw- .cio
    0082bd48 0082bd48 00000004 00000000 rw- .tls_tp
    80080000 80080000 00048210 00000000 rw-
    80080000 80080000 00048210 00000000 rw- .far
    800c8210 800c8210 00000010 00000010 r--
    800c8210 800c8210 00000010 00000010 r-- .const.1
    800c8220 800c8220 000351d2 000351d2 r-x
    800c8220 800c8220 00030640 00030640 r-x .text
    800f8860 800f8860 00004b92 00004b92 r-- .const.2
    800fd3f4 800fd3f4 00000004 00000000 rw-
    800fd3f4 800fd3f4 00000004 00000000 rw- .bss
    800fd400 800fd400 00000200 00000200 r-x
    800fd400 800fd400 00000200 00000200 r-x .vecs
    800fd600 800fd7b8 00000015 00000015 rw-
    800fd600 800fd7b8 00000015 00000015 rw- .neardata
    800fd618 800fd7d0 0000019c 0000019c rw-
    800fd618 800fd7d0 0000019c 0000019c rw- .fardata
    800fd96c 800fd96c 000000cc 000000cc r--
    800fd96c 800fd96c 000000b0 000000b0 r-- .switch
    800fda1c 800fda1c 0000001c 0000001c r-- .binit
    800fdc00 800fdc00 000009fc 000009fc r-x
    800fdc00 800fdc00 000000a0 000000a0 r-x .text:_c_int00
    800fdca0 800fdca0 0000095c 0000095c r-- .cinit
    a03e5000 a03e5000 000120ac 000120ac rw-
    a03e5000 a03e5000 000120ac 000120ac rw- .qmss.1
    a03f70b0 a03f70b0 00000050 00000050 rw-
    a03f70b0 a03f70b0 00000050 00000050 rw- gomp_data.1
    a03f7100 a03f7100 00002672 00002672 rw-
    a03f7100 a03f7100 00002400 00002400 rw- .qmss.2
    a03f9500 a03f9500 00000272 00000272 rw- gomp_data.2
    a03f9780 a03f9780 00000200 00000000 rw-
    a03f9780 a03f9780 00000200 00000000 rw- .cppi

    ----------------------------------------------------------------------------------------------------------

    Regards,
    user1432743

  • Before I dive into your memory maps, please do one more thing (actually two):
    1. Let's see whether this is a cache issue. Disable all caches and disable prefetch (you can do this with the MAR registers; search for them in the user guide if you need help, and look for the corresponding API in the release). Report the results.
    2. Disable the caches but enable prefetch in the MAR registers, and record the performance.
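    For these MAR experiments it helps to know which MAR registers cover each region. On the C66x each MAR register controls one 16 MB slice of the 32-bit address space, so MAR n covers the range starting at n × 16 MB. Applied to the memory configuration above, MSMCSRAM (0x0C000000, 4 MB) falls under MAR12 and DDR3 (0x80000000, 512 MB) under MAR128 to MAR159, whereas the CACHE_enableCaching(16) to CACHE_enableCaching(23) calls in the original code touch MAR16 to MAR23 (0x10000000 to 0x17FFFFFF). The resulting indices could then be passed to CSL's CACHE_disableCaching() in csl_cacheAux.h, the counterpart of CACHE_enableCaching(); the prefetch (PFX) bit also lives in the MAR, and the exact CSL call for setting it should be checked in that header. mar_index and mar_range below are hypothetical helpers, a sketch rather than a TI API.

```c
#include <stdint.h>

/* Each C66x MAR register controls one 16 MB (2^24-byte) region,
 * so the MAR index of an address is simply its top 8 bits. */
unsigned mar_index(uint32_t addr)
{
    return (unsigned)(addr >> 24);
}

/* First and last MAR indices touched by [base, base + len).
 * len must be greater than zero. */
void mar_range(uint32_t base, uint32_t len, unsigned *first, unsigned *last)
{
    *first = mar_index(base);
    *last  = mar_index(base + (len - 1u));
}
```

    For example, mar_range(0x80000000u, 0x20000000u, &first, &last) yields first = 128 and last = 159 for the DDR3 region.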

    Best regards

    Ran