c6747 External Memory Touch Program



I have tested a memory touch program on external memory with the cache enabled, and I got the following results.

Questions:

1) When should the touch function be used? How can we optimize the access times?

2) Is direct access to external memory (with cache enabled) better than touching the buffer before access?

3) The results here show that the sum of touch + access is greater than the access time without touch. Is there a reason for this?

4) Read and write access times also differ. Why is this so?

5) Can I rely on these cycle counts (the optimization level in the build options is 0)? If I relate them to accesses, 26,667 cycles / 8K bytes is approximately 3 cycles per byte, and similarly read is less than 3 cycles per byte? How should these results be interpreted?

 

 

SDRAM:Read
Iteration    With Touch (cycles)    Without Touch (cycles)
touch         9,030                 -
1            26,667                 33,223
2            26,667                 26,667
3            26,667                 26,667
touch           140                 -

SDRAM:Write
Iteration    With Touch (cycles)    Without Touch (cycles)
touch         9,073                 -
1            22,577                 27,282
2            22,575                 22,575
3            22,575                 22,575
touch           140                 -

Cache Settings
L1D = 16K
L1P = 16K
L2D = 128K

 

Reference L1D access for the same function:

L1D:Read
Iteration
1 26,667
2 26,667
3 26,667
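
To put question 5 in numbers, here is a small C sketch of the arithmetic (not part of the original post). The function names `per_unit` and `extra_cycles` are hypothetical, and the 64-byte L1D line size is an assumption based on the "step by two cache lines" comment in the touch routine posted later in this thread. The cycle counts are the ones from the tables above.

```c
#include <stdio.h>

/* Average cost in cycles per unit (byte, word, or cache line). */
double per_unit(long cycles, long units)
{
    return (double)cycles / (double)units;
}

/* Difference between two measured cycle counts. */
long extra_cycles(long measured, long baseline)
{
    return measured - baseline;
}

/* Prints the derived figures for the SDRAM read case. */
void report_read_case(void)
{
    const long bytes = 8L * 1024L;   /* SIZE_OF_ARR */
    const long words = bytes / 4;    /* 32-bit loads per testRead() call */
    const long lines = bytes / 64;   /* assumed 64-byte L1D lines */
    const long steady = 26667;       /* iterations 2 and 3, also the L1D figure */
    const long first_no_touch = 33223;
    const long touch_cost = 9030;

    printf("cycles per byte: %.2f\n", per_unit(steady, bytes));   /* about 3.26 */
    printf("cycles per word: %.2f\n", per_unit(steady, words));   /* about 13.02 */
    printf("first-pass miss cost: %ld cycles, %.2f per line\n",
           extra_cycles(first_no_touch, steady),
           per_unit(extra_cycles(first_no_touch, steady), lines));
    printf("touch + first access vs. no touch: %ld extra cycles\n",
           extra_cycles(touch_cost + steady, first_no_touch));
}
```

Calling `report_read_case()` shows that each 32-bit load costs about 13 cycles in steady state, the same as the L1D reference run. That suggests the unoptimized loop itself, rather than memory latency, accounts for most of the 26,667 cycles, so a per-byte figure taken at optimization level 0 says little about the memory system.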

#define SIZE_OF_ARR (1024*8)

#pragma DATA_ALIGN(Externbuf,256)
#pragma DATA_SECTION(Externbuf, ".DDRData:Externbuf")
char Externbuf[SIZE_OF_ARR];

 

#pragma CODE_SECTION(testWrite, ".L1Code:testWrite")
void testWrite(char *pBuf, int len)
{
    /* Write the buffer one 32-bit word at a time. */
    register int i, len2 = len/4;
    register int *ptr = (int *) pBuf;
    register int val = 0x12345678;

    for (i=0;i<len2;i++)
    {
        ptr[i]=val;
    }
}

 

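#pragma CODE_SECTION(testRead, ".L1Code:testRead")
int testRead(char *pBuf, int len)
{
    /* Read the buffer one 32-bit word at a time.
       sum starts at 0 so the return value is defined even when len < 4. */
    register int i, sum = 0, len2 = len/4;
    register int *ptr = (int *) pBuf;

    for (i=0;i<len2;i++)
    {
        sum = ptr[i]; /* load only; the last word read is returned */
    }

    return(sum);
}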

void test(void)
{
    /* Read benchmark: start from a cold cache. */
    BCACHE_wbInvAll(); // 8011 cycles
    BCACHE_inv(Externbuf, SIZE_OF_ARR, TRUE); // 2811 cycles for 8K

    if(touchenable) touch(Externbuf,SIZE_OF_ARR);
    testRead(Externbuf,SIZE_OF_ARR);
    testRead(Externbuf,SIZE_OF_ARR);
    testRead(Externbuf,SIZE_OF_ARR);
    testRead(Externbuf,SIZE_OF_ARR);
    if(touchenable) touch(Externbuf,SIZE_OF_ARR);

    /* Write benchmark: start from a cold cache again. */
    BCACHE_wbInvAll(); // 8011 cycles
    BCACHE_inv(Externbuf, SIZE_OF_ARR, TRUE); // 2811 cycles for 8K

    if(touchenable) touch(Externbuf,SIZE_OF_ARR);
    testWrite(Externbuf,SIZE_OF_ARR);
    testWrite(Externbuf,SIZE_OF_ARR);
    testWrite(Externbuf,SIZE_OF_ARR);
    testWrite(Externbuf,SIZE_OF_ARR);
    if(touchenable) touch(Externbuf,SIZE_OF_ARR);
}
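
The `touch()` routine called in `test()` is not defined in this post; an assembly version appears later in the thread. As a rough illustration of the idea only, the C sketch below reads one byte from every cache line that the buffer covers, so that each line is allocated in the cache before the real access loop runs. The names `touch_c` and `L1D_LINE_BYTES` are hypothetical, the 64-byte line size is an assumption, and this is not the TI routine: it issues a single load stream and makes no attempt to pipeline misses.

```c
#include <stddef.h>
#include <stdint.h>

#define L1D_LINE_BYTES 64u  /* assumed L1D cache line size */

/* Reads one byte per cache line in [buf, buf + len).
   Returns the number of lines touched. */
size_t touch_c(const void *buf, size_t len)
{
    const volatile unsigned char *p = (const volatile unsigned char *)buf;
    size_t lines = 0;
    size_t off;

    if (len == 0) {
        return 0;
    }

    /* Touch the line that holds the first byte. */
    (void)p[0];
    lines++;

    /* Offset of the next line boundary after buf, then one load per line. */
    off = L1D_LINE_BYTES - (size_t)((uintptr_t)buf % L1D_LINE_BYTES);
    for (; off < len; off += L1D_LINE_BYTES) {
        (void)p[off];
        lines++;
    }

    return lines;
}
```

The `volatile` qualifier keeps the compiler from discarding loads whose values are never used. For the line-aligned 8K `Externbuf` above, the function touches 8192 / 64 = 128 lines.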

  • Can you give more details on what a "touch memory program" is and does? What is the functionality of the touch() function?

    Jeff

  • It is taken from one of the TI documents on cache.

     

    .global _touch
    .sect ".text"

    _touch:
    B .S2 loop ; Pipe up the loop
    || MVK .S1 128, A2 ; Step by two cache lines
    || ADDAW .D2 B4, 31, B4 ; Round up # of iters

    B .S2 loop ; Pipe up the loop
    || CLR .S1 A4, 0, 6, A4 ; Align to cache line
    || MV .L2X A4, B0 ; Twin the pointer

    B .S1 loop ; Pipe up the loop
    || CLR .S2 B0, 0, 6, B0 ; Align to cache line
    || MV .L2X A2, B2 ; Twin the stepping constant

    B .S2 loop ; Pipe up the loop
    || SHR .S1X B4, 7, A1 ; Divide by 128 bytes
    || ADDAW .D2 B0, 17, B0 ; Offset by one line + one word

    [A1] BDEC .S1 loop, A1 ; Step by 128s through array
    || [A1] LDBU .D1T1 *A4++[A2], A3 ; Load from [128*i + 0]
    || [A1] LDBU .D2T2 *B0++[B2], B4 ; Load from [128*i + 68]
    || SUB .L1 A1, 7, A0

    loop:
    [A0] BDEC .S1 loop, A0 ; Step by 128s through array
    || [A1] LDBU .D1T1 *A4++[A2], A3 ; Load from [128*i + 0]
    || [A1] LDBU .D2T2 *B0++[B2], B4 ; Load from [128*i + 68]
    || [A1] SUB .L1 A1, 1, A1
    BNOP .S2 B3, 5 ; Return
    .end

  • Harikrishna,

    Referencing section 3.1.2 of http://www.ti.com/lit/ug/sprug82a/sprug82a.pdf:

    In the two-level DSP cache system, the L1D->L2 interface supports "pipelining of read misses". However, the L2->External interface does not. This is why you are not seeing a marked difference between the two software implementations (with and without the touch loop).

    To speed up the benchmark, you should enable compiler optimization (via the -o3 flag). The touch loop attempts to bring the relevant data into L1D. Once it is there, the DSP should be able to perform 2 loads per cycle as long as the accesses hit in L1D.

    In addition, to see a real benefit from the touch loop, you could try manually copying the data into L2 SRAM via the EDMA. This will allow the touch loop to pipeline accesses between L1D and L2 SRAM.

    Regards
    Kyle