C6678 EVM MSM RAM test question

Hello,

In my project, if I put the data to be processed in DDR with a 128 KB L2 cache, the CorePac processes it at the same speed as when the data is in MSM RAM. I wondered whether something was wrong, so I wrote a test. Below is my test case:

void MSM_DDR_test()
{
    int i;

    /* Configure 128 KB of L2 as cache */
    CACHE_setL2Size(CACHE_128KCACHE);
    for (i = 128; i < 129; i++) {
        /* enables caching for a specific memory region */
        CACHE_enableCaching(i);
    }
    printf("***********TEST START***********\r\n");
    memset(g_p_DDR, 0, 0x400000);
    memset(g_p_MSM, 0, 0x400000);

    /* DDR write */
    g_ll_startcycle = CSL_tscRead();
    for (i = 0; i < 0x100000; i++) {
        g_p_DDR[i] = i;
    }
    g_ll_endcycle = CSL_tscRead();
    g_ll_overhead = g_ll_endcycle - g_ll_startcycle;
    printf("DDR write g_ll_overhead = %lld\r\n", g_ll_overhead);

    /* DDR read */
    g_ll_startcycle = CSL_tscRead();
    for (i = 0; i < 0x100000; i++) {
        mytemp = g_p_DDR[i];
    }
    g_ll_endcycle = CSL_tscRead();
    g_ll_overhead = g_ll_endcycle - g_ll_startcycle;
    printf("DDR read g_ll_overhead = %lld\r\n", g_ll_overhead);

    /* MSM write */
    g_ll_startcycle = CSL_tscRead();
    for (i = 0; i < 0x100000; i++) {
        g_p_MSM[i] = i;
    }
    g_ll_endcycle = CSL_tscRead();
    g_ll_overhead = g_ll_endcycle - g_ll_startcycle;
    printf("MSM write g_ll_overhead = %lld\r\n", g_ll_overhead);

    /* MSM read */
    g_ll_startcycle = CSL_tscRead();
    for (i = 0; i < 0x100000; i++) {
        mytemp = g_p_MSM[i];
    }
    g_ll_endcycle = CSL_tscRead();
    g_ll_overhead = g_ll_endcycle - g_ll_startcycle;
    printf("MSM read g_ll_overhead = %lld\r\n", g_ll_overhead);

    printf("***********TEST OVER***********\r\n");
}

The result with L2 cache set to 128 KB:

***********TEST START***********
DDR write g_ll_overhead = 23068699
DDR read g_ll_overhead = 28939522
MSM write g_ll_overhead = 23068699
MSM read g_ll_overhead = 28803127
***********TEST OVER***********

The result with L2 cache set to 0 KB:

***********TEST START***********
DDR write g_ll_overhead = 23068699
DDR read g_ll_overhead = 59362879
MSM write g_ll_overhead = 23068699
MSM read g_ll_overhead = 28803128
***********TEST OVER***********

1. From the test results, I conclude that, whether or not L2 cache is enabled, the write speed to DDR and to MSM RAM is almost the same, and that with L2 cache enabled at 128 KB, both the write and read speeds to DDR and MSM RAM are almost the same. This would explain why my project runs at the same speed whether the data is in DDR or in MSM RAM. I am not sure whether this is right. If not, please tell me why.

2. During my test, I found that when I write data to MSM SRAM, it is written to MSM SRAM directly, not through the L1D cache. With L2 cache disabled, the same is true for DDR writes. My question: in this situation, how does the CPU write the data to MSM or DDR? Does it go through the L1D cache, or does it bypass L1D and write directly to MSM or DDR?

 

Best Regards,

Si

  • My MSM RAM is at the default setting, as L2.

  • Si,

    On writes, unless you follow up w/ a read of the same locations, it's fire and forget (i.e., the CPU does not stall waiting for the data to reach its final location). So the cycle count will be the same no matter which memory you send it to, assuming no stalling, which is what a loop w/ only writes and no other device activity would produce.

    On the read from DDR3 w/ L2 enabled, you're going to have much better prefetching going on, and it shows: w/ L2 cache enabled, the cycle count is less than half (28939522 vs. 59362879). The data is prefetched into L2 cache, and the access time is then that of the L2 cache. The L2 cache has much bigger cache lines than L1D, so it does a much better job of prefetching.

    On the read from MSM w/ default MSMC settings, the values would be expected to be the same whether L2 cache is enabled or not, as MSMC RAM by default is not cached by L2. (MSMC RAM by default is an L2 memory. It can be cached by L2 if configured as an L3, but that's not the default setting.)

    Note that the DDR3 read w/ L2 enabled took only slightly more cycles than the read from MSMC RAM (28939522 vs. 28803127), so the prefetching was keeping up.

    You should see more variation on the reads if you 'unroll' the loops so that each loop iteration does multiple reads instead of one.

    Best Regards,

    Chad

  • Chad,

    You said that the DDR3 read w/ L2 enabled took only slightly more cycles than the read from MSMC RAM, with the prefetching keeping up. I have a question about the MSM RAM. As we know, MSM RAM is on-chip SRAM and DDR is off-chip. If reading from MSM RAM is only slightly faster than reading from DDR, why did TI put the MSM RAM on chip rather than off chip when designing the C6678? What is the advantage of having the MSM SRAM on chip? In my project, the processing speed is almost the same whether the data is in DDR or in MSM RAM.

    Best Regards,

    Si

  • Si,

    Multiple reasons for this.

    1.) You may not be using L2 cache in the first place - MSMC was much faster in the results w/ L2 disabled.

    2.) You may not be accessing items in a serial manner, and thus prefetching isn't going to help out much.

    3.) You are most likely accessing more than one object every 6 CPU cycles. (I'm assuming the generated assembly code is a 6-cycle loop w/ one read access. Nice tight code is going to have two reads per cycle, and the prefetch from DDR3, while giving better results than w/o it, isn't going to get as close to MSMC.) This is why I suggested trying it and seeing the difference in performance.

    There are plenty of other areas where you're going to see a difference; you just happened to create a test that doesn't show much of one.

    Best Regards,

    Chad

  • Chad,

    I ran several more tests. The results are:

    Test 1, core0 only

    optimization (-O3):
    L2 cache disabled
    ***********TEST START***********
    DDR write g_ll_overhead = 4021887
    DDR read g_ll_overhead = 25057412
    MSM write g_ll_overhead = 524311
    MSM read g_ll_overhead = 986437
    ***********TEST OVER***********

    L2 cache enabled, 128 KB
    ***********TEST START***********
    DDR write g_ll_overhead = 5039691
    DDR read g_ll_overhead = 2177260
    MSM write g_ll_overhead = 524322
    MSM read g_ll_overhead = 986421
    ***********TEST OVER***********

    From this test result, I have a question:

    Q1: Why, when I build with -O3, are both the read and write speeds of MSM RAM faster than DDR3, whether the L2 cache is disabled or enabled? (Only core0 runs the project.)

    I used the C code below to test random address access with all 8 cores running at the same time:

    g_ll_startcycle = CSL_tscRead();
    for (i = 0; i < 0x100000; i++) {
        mytemp = *(int*)(g_p_DDR + rand() % 0x100000);
    }
    g_ll_endcycle = CSL_tscRead();
    g_ll_overhead = g_ll_endcycle - g_ll_startcycle;
    printf("DDR read g_ll_overhead = %lld\r\n", g_ll_overhead);

    g_ll_startcycle = CSL_tscRead();
    for (i = 0; i < 0x100000; i++) {
        mytemp = *(int*)(g_p_MSM + rand() % 0x100000);
    }
    g_ll_endcycle = CSL_tscRead();
    g_ll_overhead = g_ll_endcycle - g_ll_startcycle;
    printf("MSM read g_ll_overhead = %lld\r\n", g_ll_overhead);

    test results:

    *********************************************************
    core0-7:
    no optimization:
    L2 cache disabled:

    [C66xx_0] DDR read g_ll_overhead = 182241275
    [C66xx_1] DDR read g_ll_overhead = 182235471
    [C66xx_2] DDR read g_ll_overhead = 182991167
    [C66xx_3] DDR read g_ll_overhead = 182235611
    [C66xx_4] DDR read g_ll_overhead = 185485905
    [C66xx_5] DDR read g_ll_overhead = 182237667
    [C66xx_6] DDR read g_ll_overhead = 182235691
    [C66xx_7] DDR read g_ll_overhead = 185485429

    [C66xx_0] MSM read g_ll_overhead = 79199887
    [C66xx_1] MSM read g_ll_overhead = 79456549
    [C66xx_2] MSM read g_ll_overhead = 79590121
    [C66xx_3] MSM read g_ll_overhead = 79593668
    [C66xx_4] MSM read g_ll_overhead = 79503946
    [C66xx_5] MSM read g_ll_overhead = 79359653
    [C66xx_6] MSM read g_ll_overhead = 78745559
    [C66xx_7] MSM read g_ll_overhead = 78750037


    L2 cache enabled, 128 KB

    [C66xx_0] DDR read g_ll_overhead = 69102760
    [C66xx_1] DDR read g_ll_overhead = 69102244
    [C66xx_2] DDR read g_ll_overhead = 69102456
    [C66xx_3] DDR read g_ll_overhead = 69102692
    [C66xx_4] DDR read g_ll_overhead = 69103128
    [C66xx_5] DDR read g_ll_overhead = 69101970
    [C66xx_6] DDR read g_ll_overhead = 69102486
    [C66xx_7] DDR read g_ll_overhead = 69103742

    [C66xx_0] MSM read g_ll_overhead = 79224192
    [C66xx_1] MSM read g_ll_overhead = 79488368
    [C66xx_2] MSM read g_ll_overhead = 79599546
    [C66xx_3] MSM read g_ll_overhead = 79594841
    [C66xx_4] MSM read g_ll_overhead = 79517403
    [C66xx_5] MSM read g_ll_overhead = 79374221
    [C66xx_6] MSM read g_ll_overhead = 78780542
    [C66xx_7] MSM read g_ll_overhead = 78778309

    **********************************************************
    core0-7:
    optimization(-O3):

    L2 cache disabled:

    [C66xx_0] DDR read g_ll_overhead = 180781462
    [C66xx_1] DDR read g_ll_overhead = 181028146
    [C66xx_2] DDR read g_ll_overhead = 180776248
    [C66xx_3] DDR read g_ll_overhead = 180776968
    [C66xx_4] DDR read g_ll_overhead = 184167664
    [C66xx_5] DDR read g_ll_overhead = 180779084
    [C66xx_6] DDR read g_ll_overhead = 180777110
    [C66xx_7] DDR read g_ll_overhead = 184167236

    [C66xx_0] MSM read g_ll_overhead = 77093239
    [C66xx_1] MSM read g_ll_overhead = 77361328
    [C66xx_2] MSM read g_ll_overhead = 77498327
    [C66xx_3] MSM read g_ll_overhead = 77502841
    [C66xx_4] MSM read g_ll_overhead = 77421591
    [C66xx_5] MSM read g_ll_overhead = 77271424
    [C66xx_6] MSM read g_ll_overhead = 76642935
    [C66xx_7] MSM read g_ll_overhead = 76652356


    L2 cache enabled, 128 KB

    [C66xx_0] DDR read g_ll_overhead = 66999428
    [C66xx_1] DDR read g_ll_overhead = 67001706
    [C66xx_2] DDR read g_ll_overhead = 67000144
    [C66xx_3] DDR read g_ll_overhead = 67000286
    [C66xx_4] DDR read g_ll_overhead = 66999732
    [C66xx_6] DDR read g_ll_overhead = 66999872
    [C66xx_5] DDR read g_ll_overhead = 67000218
    [C66xx_7] DDR read g_ll_overhead = 66999546

    [C66xx_0] MSM read g_ll_overhead = 77131099
    [C66xx_1] MSM read g_ll_overhead = 77399198
    [C66xx_2] MSM read g_ll_overhead = 77500053
    [C66xx_3] MSM read g_ll_overhead = 77506983
    [C66xx_4] MSM read g_ll_overhead = 77484238
    [C66xx_5] MSM read g_ll_overhead = 77401469
    [C66xx_6] MSM read g_ll_overhead = 76669513
    [C66xx_7] MSM read g_ll_overhead = 77007213

    From the test results, we can see that, with L2 cache enabled and all 8 cores running at the same time, random-address access to MSM is slower than random-address access to DDR, whether or not the build uses -O3.

    Q2: I am not clear on why this happens. What is the reason? Can you explain it?

    Q3: If the data in my project will be accessed randomly rather than sequentially, should I put the data to be processed in DDR instead of MSM RAM?

    Best Regards,

    Si

  • Comments inserted inline, marked A1 and A2 & 3.

    -Chad

    Q1: Why, when I build with -O3, are both the read and write speeds of MSM RAM faster than DDR3, whether the L2 cache is disabled or enabled? (Only core0 runs the project.)

    A1: This goes along with what I was saying before. With -O3, the compiler is going to unroll the loop for you, hence more accesses per cycle. The prefetching will do a fair amount to improve performance on DDR3 accesses, but it will not be superior to having the data in MSMC RAM.

    Q2: I am not clear on why this happens. What is the reason? Can you explain it?

    Q3: If the data in my project will be accessed randomly rather than sequentially, should I put the data to be processed in DDR instead of MSM RAM?

    A2 & 3: There are multiple variables coming into play here. That said, I think it's primarily that accesses by all cores to the MSMC RAM every cycle are causing stalls on the MSMC RAM, while the prefetching into L2 at least gives a partial fill of L2, w/ no contention from other cores for access to that L2.

    In reality, this test is not very realistic: you're not going to have random contention on MSMC every cycle.

    Randomly accessed data in a typical system is going to be better off in MSMC than in DDR3 under normal operating conditions.
