CCS/AM5728: DSP cache problem

p aither

Intellectual 500 points

Part Number: AM5728

Tool/software: Code Composer Studio

Hi, kind TI and everyone,

I followed section 2.2 of the "TMS320C66x DSP Cache User Guide" and tested VLIB_xyGradients() on the AM5728 IDK with CCS 8.2, as below:

main.c:

...

// for allocate into DDR

#pragma DATA_SECTION(ppin, ".ddr")
#pragma DATA_SECTION(ppgradx_opt, ".ddr")
#pragma DATA_SECTION(ppgrady_opt, ".ddr")
#pragma DATA_SECTION(ppgradx_cn, ".ddr")
#pragma DATA_SECTION(ppgrady_cn, ".ddr")

// for align with L2 cache

#pragma DATA_ALIGN(ppin, CACHE_L2_LINESIZE)
#pragma DATA_ALIGN(ppgradx_opt, CACHE_L2_LINESIZE)
#pragma DATA_ALIGN(ppgrady_opt, CACHE_L2_LINESIZE)
#pragma DATA_ALIGN(ppgradx_cn, CACHE_L2_LINESIZE)
#pragma DATA_ALIGN(ppgrady_cn, CACHE_L2_LINESIZE)

uint8_t ppin[HH * WW];
int16_t ppgradx_opt[HH * WW], ppgrady_opt[HH * WW];
int16_t ppgradx_cn[HH * WW], ppgrady_cn[HH * WW];

...

void main()

{

VLIB_cache_init();

CACHE_setL1PSize(CACHE_L1_32KCACHE);

CACHE_setL1DSize(CACHE_L1_32KCACHE);

CACHE_setL2Size(CACHE_128KCACHE); //USer defined.

// enable DDR caching

CACHE_enableCaching(128);

CACHE_enableCaching(129);

printf("\n\n");

printf(" +---------------------------------------+\n");

printf(" | TESTING: VLIB_xyGradients |\n");

printf(" +---------------------------------------+\n\n");

VLIB_xyGradients_d(LEVEL_OF_FEEDBACK);

}

link.cmd:

.cinit : load > L2_SRAM
.cio : load >> L2_SRAM
.text : load >> L2_SRAM
.stack : load > L2_SRAM
.const : load > L2_SRAM START(const_start) SIZE(const_size)
.far : load >> L2_SRAM
.switch : load >> L2_SRAM
.fardata : load >> L2_SRAM
.data : load >> L2_SRAM
.neardata : load >> L2_SRAM
.rodata : load >> L2_SRAM
.sysmem : load > L2_SRAM
.ddr : load >> EXT_MEM

My problem is: the result with DDR cache enable has no difference with the result with no using cache.

I think that DDR cache does not run right for my test, and perhaps DMA between L2SRAM and DDR would not run.

Please tell me how to use DDR cache for AM5728 C66x bare-metal firmware.

Thanks.

Best Regards.

Aither.

over 6 years ago

0 lding over 6 years ago

TI__Guru* 95265 points

Hi,

I see that you have configured L1D and L1P as 32KB cache, made part of L2 cache, and enabled caching for the first two DDR blocks. This is the correct cache setting for the DSP core. The linker command file also looks right.

"the result with DDR cache enable has no difference with the result with no using cache." ======> Can you elaborate on what this means? Do you mean that VLIB_xyGradients() produced something like a performance number, and that the numbers are exactly the same with DDR cached and non-cached?

"and perhaps DMA between L2SRAM and DDR would not" =======> The DMA and the DSP CPU are not cache coherent, so you need explicit cache operations. Is it possible that the benefit you got from the cache is overshadowed by the cost of those cache operations, so you didn't see any improvement?

Regards, Eric

0 p aither over 6 years ago in reply to lding

Intellectual 500 points

Hi, Eric
First thanks for your reply.
Your question 1: With DDR cached and DDR non-cached, are the numbers exactly the same?

Yes, the reported cycle counts are the same.

Your question 2: Is it possible that the benefit you got from the cache is overshadowed by the cache operations, so you didn't see any improvement?

Yes.

I think that the cache operations are not taking effect in my code.

Please tell me how to speed up DDR access using the cache.

Thanks again.

Regards.

Aither.

0 lding over 6 years ago in reply to p aither

TI__Guru* 95265 points

Hi,

Your DDR cache usage is right. Is the EDMA used inside VLIB_xyGradients(), or outside the function? If no EDMA is involved, data access from DDR should be faster because it is cached.
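For reference, the values 128 and 129 passed to CACHE_enableCaching() are MAR indices: on the C66x each MAR register governs one 16MB window of the address space, so the index is the top 8 bits of the address. The sketch below shows that arithmetic only. The helper names are made up for illustration and are not CSL functions, and the MAR base address 0x01848000 is taken from the C66x register map as I understand it.

```c
#include <stdint.h>

#define MAR_WINDOW_SHIFT 24u            /* each MAR covers 2^24 bytes = 16MB */
#define MAR_BASE_ADDR    0x01848000u    /* MAR0; MARn sits at base + 4*n     */

/* Hypothetical helper: index of the MAR that governs a byte address. */
static uint32_t mar_index(uint32_t addr)
{
    return addr >> MAR_WINDOW_SHIFT;
}

/* Hypothetical helper: memory-mapped address of MARn. */
static uint32_t mar_reg_addr(uint32_t index)
{
    return MAR_BASE_ADDR + 4u * index;
}

/* Hypothetical helper: number of MARs that a buffer [addr, addr + bytes)
 * touches. Assumes the buffer does not wrap past the top of the
 * 32-bit address space. */
static uint32_t mar_count(uint32_t addr, uint32_t bytes)
{
    if (bytes == 0u) {
        return 0u;
    }
    return mar_index(addr + bytes - 1u) - mar_index(addr) + 1u;
}
```

With these, mar_index(0x80000000) is 128 and mar_index(0x81000000) is 129, so enabling MAR128 and MAR129 makes the first 32MB of DDR cacheable. Data placed above 0x81FFFFFF would need further MARs enabled.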

Regards, Eric

0 p aither over 6 years ago in reply to lding

Intellectual 500 points

Hi, Eric.
Thanks for your reply.

I think that all cache run AUTOMATICALLY.
Should I configure and start the EDMA between L2SRAM and DDR? Then how to use the EDMA?

Thanks again.
Regards.
Aither.

0 lding over 6 years ago in reply to p aither

TI__Guru* 95265 points

Hi,

"I think that all cache run AUTOMATICALLY." =====> The cache is on once you have enabled it.

Should I configure and start the EDMA between L2SRAM and DDR? Then how to use the EDMA?=======> I don't know your data flow, or whether the EDMA is used inside the function or outside it because your application needs it.

Using EDMA is beneficial if you move a big data block. The typical way to do this is:
1. CPU writes data into the source buffer
2. CPU does a cache write-back and invalidate of the source buffer
3. EDMA moves the data from the source buffer into the destination buffer; you can use CSL or the EDMA LLD for this
4. CPU does a cache invalidate of the destination buffer
5. CPU reads the data from the destination buffer for consumption
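The ordering of the cache operations around the transfer can be sketched as follows. This is only an illustration of the sequence, not TI code: the `dma_ops_t` hooks are hypothetical, and on the target they would wrap the CSL cache calls (for example CACHE_wbInvL2 and CACHE_invL2) and an EDMA transfer. The final invalidate is applied to the buffer the EDMA wrote, since that is the one the CPU reads afterwards. Host stand-ins are included so the sketch compiles off-target.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hooks: cache write-back/invalidate, DMA copy, cache invalidate. */
typedef struct {
    void (*wb_inv)(void *buf, size_t bytes);
    void (*copy)(void *dst, const void *src, size_t bytes);
    void (*inv)(void *buf, size_t bytes);
} dma_ops_t;

/* Steps 2 to 4 of the list above, in order. */
static void dma_move_coherent(const dma_ops_t *ops,
                              void *dst, void *src, size_t bytes)
{
    ops->wb_inv(src, bytes);     /* step 2: push CPU writes out to memory  */
    ops->copy(dst, src, bytes);  /* step 3: DMA moves src -> dst           */
    ops->inv(dst, bytes);        /* step 4: drop stale cached lines of dst */
}

/* Host stand-ins. They only record the call order, because a PC keeps
 * its caches coherent in hardware. */
static char host_trace[4];
static int  host_calls;

static void host_record(char c)
{
    if (host_calls < 3) {
        host_trace[host_calls++] = c;
    }
}

static void host_wb_inv(void *buf, size_t bytes)
{
    (void)buf; (void)bytes;
    host_record('W');
}

static void host_copy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
    host_record('C');
}

static void host_inv(void *buf, size_t bytes)
{
    (void)buf; (void)bytes;
    host_record('I');
}

static const dma_ops_t host_ops = { host_wb_inv, host_copy, host_inv };
```

On the target, the buffers passed to the cache calls should be aligned to the cache line size and sized in whole lines, as the DATA_ALIGN pragmas earlier in the thread already do for the start addresses.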

Regards, Eric

0 p aither over 6 years ago in reply to lding

Intellectual 500 points

Hi, Eric.

First, thanks for your reply.

Please tell me: when the cache is ON, does the EDMA hardware start automatically between the L2 cache and DDR?

And when the CPU accesses data in DDR, does the EDMA automatically transfer the data between the L2 cache and DDR without any CPU operation or instruction?

Thanks again.

Regards.

Aither.

0 lding over 6 years ago in reply to p aither

TI__Guru* 95265 points

Hi,

First, the EDMA is not started automatically. It can be triggered by one of the following:
• Event-triggered transfer request (this is the more typical usage of EDMA3): A
peripheral, system, or externally-generated event triggers a transfer request.
• Manually-triggered transfer request: The DSP manually triggers a transfer by
writing a 1 to the corresponding bit in the event set register (ESR/ESRH).
• Chain-triggered transfer request: A transfer is triggered on the completion of
another transfer or sub-transfer.

Once triggered, it moves data between L2 (NOT L2 cache) and DDR.

When the CPU accesses data in DDR, it reads the data through the cache. If the EDMA has written that data, you need to invalidate the cache from the CPU before reading it.

Regards, Eric

0 p aither over 6 years ago in reply to lding

Intellectual 500 points

Hi, Eric
First, thanks for your effort and reply.
I tested the DDR cache on the AM5728 IDK as below:

main.c:

#define TEST_BUFF_SZ (8*1024*1024)
#pragma DATA_ALIGN(a, CACHE_L2_LINESIZE)
#pragma DATA_ALIGN(b, CACHE_L2_LINESIZE)
#pragma DATA_ALIGN(c, CACHE_L2_LINESIZE)

#pragma DATA_SECTION(ppin, ".ddr")
#pragma DATA_SECTION(ppgradx_opt, ".ddr")
#pragma DATA_SECTION(ppgrady_opt, ".ddr")
#pragma DATA_SECTION(ppgradx_cn, ".ddr")
#pragma DATA_SECTION(ppgrady_cn, ".ddr")
static short a[TEST_BUFF_SZ], b[TEST_BUFF_SZ], c[TEST_BUFF_SZ];

int main ()
{
	int i, j ;
	clock_t t_start, t_stop, t_overhead, t_opt, t_i, t_cn;

#ifndef IO_CONSOLE
	Board_initCfg boardCfg;
#if defined(SOC_K2E) || defined(SOC_C6678)|| defined(SOC_K2H)|| defined(SOC_C6657)
	boardCfg = BOARD_INIT_MODULE_CLOCK |
	BOARD_INIT_UART_STDIO;
#else
	boardCfg = BOARD_INIT_PINMUX_CONFIG |
	BOARD_INIT_MODULE_CLOCK |
	BOARD_INIT_UART_STDIO;
#endif
	Board_init(boardCfg);
#endif

	for (i = 0; i < TEST_BUFF_SZ; i++)
	{
		a[i] = b[i] = i << 2;
	}
	
	TSCL= 0,TSCH=0;
	/* Compute the overhead of calling _itoll(TSCH, TSCL) twice to get timing info */
	/* ---------------------------------------------------------------- */
	t_start = _itoll(TSCH, TSCL);
	t_stop = _itoll(TSCH, TSCL);
	t_overhead = t_stop - t_start;

	t_start = _itoll(TSCH, TSCL);
	for (i = 0; i < TEST_BUFF_SZ; i++)
	{
		c[i] = a[i] + b[i];
	}

	t_stop = _itoll(TSCH, TSCL);
	t_cn = (t_stop - t_start) - t_overhead;
	printf("test1:%d,%d\n", t_cn, c[1]);

	CACHE_setL1PSize(CACHE_L1_32KCACHE);
	CACHE_setL1DSize(CACHE_L1_32KCACHE);
	CACHE_setL2Size(CACHE_128KCACHE); //USer defined.

	CACHE_enableCaching(64);
	CACHE_enableCaching(128);
	CACHE_enableCaching(129);

	CACHE_invL2((void *)0, 128*1024, CACHE_WAIT);
	CACHE_invL2Wait();

	for (i = 0; i < TEST_BUFF_SZ; i++)
	{
		a[i] = b[i] = i << 2;
	}
	TSCL= 0,TSCH=0;
	/* Compute the overhead of calling _itoll(TSCH, TSCL) twice to get timing info */
	/* ---------------------------------------------------------------- */
	t_start = _itoll(TSCH, TSCL);
	t_stop = _itoll(TSCH, TSCL);
	t_overhead = t_stop - t_start;

	t_start = _itoll(TSCH, TSCL);
	for (i = 0; i < TEST_BUFF_SZ; i++)
	{
		c[i] = a[i] + b[i];
	}

	t_stop = _itoll(TSCH, TSCL);
	t_cn = (t_stop - t_start) - t_overhead;
	printf("test2:%d,%d\n", t_cn, c[1]);
}

link.cmd:

MEMORY
{
  L2SRAM (RWX)  : org = 0x0800000, len = 0x040000 /* 128KB */
  DDR0:      o = 0x80000000 l = 0x40000000   /* 1GB external DDR Bank 0 */
}

SECTIONS
{
  .kernel: {
    dsplib*<*.o*> (.text:optimized) { SIZE(_kernel_size) }
  } 
  
  .text:            load >> DDR0
  .text:touch:      load >> DDR0

  GROUP (NEAR_DP)
  {
    .neardata
    .rodata 
    .bss
  } load > DDR0
   
  .init_array: load >> DDR0 
  .far:        load >> DDR0
  .fardata:    load >> DDR0
  .neardata    load >> DDR0
  .rodata      load >> DDR0
  .data:       load >> DDR0 
  .switch:     load >> DDR0
  .stack:      load >  DDR0
  .args:       load >  DDR0 align = 0x4, fill = 0 {_argsize = 0x200; }
  .sysmem:     load >  DDR0
  .cinit:      load >  DDR0
  .const:      load >  DDR0 START(const_start) SIZE(const_size)
  .pinit:      load >  DDR0
  .cio:        load >> DDR0
  .csl_vect:   load >  DDR0
}

result:

test1:1731405405,8
test2:1731397407,8

As you see, the run with cache is only slightly faster than the run without, but I would expect caching to make it much faster.

Please tell me what is wrong.

Thanks again.

Regards, Aither.

0 lding over 6 years ago in reply to p aither

TI__Guru* 95265 points

Hi,

I don't have your CCS project, so I created one and tested it. You need to make sure the initial test conditions are known before you run the comparison. On the C66x DSP core, look at these registers and set them as follows:
L1DCFG 0x01840040 change to 0
L1PCFG 0x01840020 change to 0
L2CFG 0x01840000 change to 0
MAR128 0x01848200 change to 0xc
MAR129 0x01848204 change to 0xc

Then run:
CACHE_setL1PSize(CACHE_L1_32KCACHE);
CACHE_setL1DSize(CACHE_L1_32KCACHE);
CACHE_setL2Size(CACHE_128KCACHE); //USer defined.

CACHE_enableCaching(64);
CACHE_enableCaching(128);
CACHE_enableCaching(129);

Make sure the registers above then read 4/4/3/0xD/0xD. Then benchmark again and let me know if you get different results. Note that for some reason _itoll() didn't work properly; I didn't debug it. But from TSCL and TSCH, I could tell that it is faster.
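To spell out what 4/4/3/0xD mean, here is a small decoding sketch. It assumes the usual C66x encodings as I understand them: the low three bits of L1PCFG/L1DCFG and L2CFG select the cache size, and bit 0 of a MAR (the PC bit) marks its 16MB window as cacheable. The function names are made up for illustration, and the encodings should be checked against the device documentation.

```c
#include <stdint.h>

/* L1PCFG/L1DCFG mode field (bits 2:0) -> cache size in KB.
 * Returns -1 for modes outside the 0..4 range handled here. */
static int l1_cache_kb(uint32_t cfg)
{
    uint32_t mode = cfg & 0x7u;
    if (mode == 0u) return 0;
    if (mode > 4u)  return -1;
    return (int)(2u << mode);      /* 1 -> 4, 2 -> 8, 3 -> 16, 4 -> 32 */
}

/* L2CFG mode field (bits 2:0) -> cache size in KB.
 * Returns -1 for modes outside the 0..6 range handled here. */
static int l2_cache_kb(uint32_t cfg)
{
    uint32_t mode = cfg & 0x7u;
    if (mode == 0u) return 0;
    if (mode > 6u)  return -1;
    return (int)(16u << mode);     /* 1 -> 32, 2 -> 64, 3 -> 128, ... */
}

/* MAR bit 0 (PC) set means the window may be cached. */
static int mar_is_cacheable(uint32_t mar)
{
    return (int)(mar & 0x1u);
}
```

Under these assumptions, 4/4/3 decode to 32KB L1D, 32KB L1P and 128KB L2 cache, matching the CACHE_L1_32KCACHE and CACHE_128KCACHE arguments, and the change from 0xC to 0xD is the cacheable bit being set.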

Regards, Eric

0 p aither over 6 years ago in reply to lding

Intellectual 500 points

Hi, Eric.

First, thanks for your reply, and sorry for my late reply.

I tested as you instructed, but the result is the same:

t_start_hi = TSCH;
t_start_lo = TSCL;
t_start = _itoll(t_start_hi, t_start_lo);

for (i = 0; i < TEST_BUFF_SZ; i++)
{
c[i] = a[i] + b[i];
}

t_stop_hi = TSCH;
t_stop_lo = TSCL;
t_stop = _itoll(t_stop_hi, t_stop_lo);
t_cn = (t_stop - t_start) - t_overhead;
AUDIO_log("test1:%d,%X,%X~%X,%X\n", t_cn, t_start_hi, t_start_lo, t_stop_hi, t_stop_lo);

test1:1732320338,5,C4CC9980~5,2C0DB5D6

test2:1732253510,6,3297C265~6,99D7D9AF

I could not verify _itoll(), but in any case the difference is very small.
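The raw TSCL values printed above can be checked without relying on _itoll(). A minimal sketch follows, with hypothetical helper names. It relies on unsigned 32-bit subtraction wrapping modulo 2^32, so it is valid only for intervals shorter than 2^32 cycles. For the 64-bit form, note that on the C66x TSCH returns the value latched by the most recent TSCL read, so TSCL should be read before TSCH.

```c
#include <stdint.h>

/* Elapsed cycles from two TSCL samples. The subtraction wraps modulo
 * 2^32, so a single roll-over of TSCL between the samples is handled. */
static uint32_t tsc_elapsed32(uint32_t start_lo, uint32_t stop_lo)
{
    return stop_lo - start_lo;
}

/* Combine the two halves for longer intervals. Assumes hi and lo form a
 * consistent pair, i.e. TSCL was read first on the target. */
static uint64_t tsc_combine(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}
```

Applied to the values above, tsc_elapsed32(0xC4CC9980, 0x2C0DB5D6) gives 1732320342 and tsc_elapsed32(0x3297C265, 0x99D7D9AF) gives 1732253514. Both match the reported counts once a 4-cycle overhead is subtracted, so the two runs really do differ by only about 67 thousand cycles.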

Please tell me how to solve this problem.

Thanks again.

Regards.

Aither.

0 lding over 6 years ago in reply to p aither

TI__Guru* 95265 points

Hi,

I tested on the AM5728 IDK EVM, using the standard GEL file to initialize the SoC, including the DSP PLL. I can see that with the cache on it is about 13x faster. One difference is that I built your code with -O3 optimization; the other changes are minor.

The CCS project is attached and the printout is below.

Cache off

start 488821, stop 580438111, diff 579949281, TSCH 0, c[1] 8

Cache on

start 635032235, stop 678885963, diff 43853725, TSCH 0, c[1] 8
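The "about 13x" figure follows directly from the two diff values. As a small worked check, here is an integer-only ratio helper; the function name is made up for illustration.

```c
#include <stdint.h>

/* Ratio of two cycle counts, scaled by 100 to stay in integer math
 * (a result of 1322 means 13.22x). Returns 0 if the divisor is 0. */
static uint32_t speedup_x100(uint64_t cycles_slow, uint64_t cycles_fast)
{
    if (cycles_fast == 0u) {
        return 0u;
    }
    return (uint32_t)((cycles_slow * 100u) / cycles_fast);
}
```

With the printout above, speedup_x100(579949281, 43853725) evaluates to 1322, i.e. roughly 13.2x.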

Regards, Eric

DSP_cache.zip
