CCS/AM5728: DSP cache problem

Part Number: AM5728
Tool/software: Code Composer Studio

Hi TI and everyone,

I followed section 2.2 of the "TMS320C66x DSP Cache User Guide" and tested VLIB_xyGradients() on the AM5728 IDK with CCS 8.2 as below:

main.c:

...

// place the buffers in DDR

#pragma DATA_SECTION(ppin, ".ddr")
#pragma DATA_SECTION(ppgradx_opt, ".ddr")
#pragma DATA_SECTION(ppgrady_opt, ".ddr")
#pragma DATA_SECTION(ppgradx_cn, ".ddr")
#pragma DATA_SECTION(ppgrady_cn, ".ddr")

// align the buffers to the L2 cache line size

#pragma DATA_ALIGN(ppin, CACHE_L2_LINESIZE)
#pragma DATA_ALIGN(ppgradx_opt, CACHE_L2_LINESIZE)
#pragma DATA_ALIGN(ppgrady_opt, CACHE_L2_LINESIZE)
#pragma DATA_ALIGN(ppgradx_cn, CACHE_L2_LINESIZE)
#pragma DATA_ALIGN(ppgrady_cn, CACHE_L2_LINESIZE)

uint8_t ppin[HH * WW];
int16_t ppgradx_opt[HH * WW], ppgrady_opt[HH * WW];
int16_t ppgradx_cn[HH * WW], ppgrady_cn[HH * WW];

... 

void main()
{
VLIB_cache_init();
CACHE_setL1PSize(CACHE_L1_32KCACHE);
CACHE_setL1DSize(CACHE_L1_32KCACHE);
CACHE_setL2Size(CACHE_128KCACHE); // user defined
// enable DDR caching
CACHE_enableCaching(128);
CACHE_enableCaching(129);
 
printf("\n\n");
printf(" +---------------------------------------+\n");
printf(" | TESTING: VLIB_xyGradients     |\n");
printf(" +---------------------------------------+\n\n");
VLIB_xyGradients_d(LEVEL_OF_FEEDBACK);
}

link.cmd:

.cinit : load > L2_SRAM
.cio : load >> L2_SRAM
.text : load >> L2_SRAM
.stack : load > L2_SRAM
.const : load > L2_SRAM START(const_start) SIZE(const_size)
.far : load >> L2_SRAM
.switch : load >> L2_SRAM
.fardata : load >> L2_SRAM
.data : load >> L2_SRAM
.neardata : load >> L2_SRAM
.rodata : load >> L2_SRAM
.sysmem : load > L2_SRAM
.ddr : load >> EXT_MEM

My problem is that the result with the DDR cache enabled is no different from the result without the cache.
I suspect that the DDR cache is not working in my test, and perhaps the DMA between L2 SRAM and DDR is not running.
Please tell me how to use the DDR cache in AM5728 C66x bare-metal firmware.
Thanks.
Best Regards.
Aither.

  • Hi,

    I see that you have configured 32 KB each of L1D and L1P cache, made part of L2 a cache, and enabled caching for the first two 16 MB DDR blocks. This is the correct cache setting for the DSP core. The linker command file also looks right.

    "the result with the DDR cache enabled is no different from the result without the cache." ======> Can you elaborate on what this means? Do you mean that VLIB_xyGradients() reports something like a performance number, and that the numbers are exactly the same with DDR cached and non-cached?

    "and perhaps the DMA between L2 SRAM and DDR is not running" =======> DMA and the DSP CPU are not cache coherent, so you need explicit cache operations. Is it possible that the benefit of the cache is overshadowed by the cost of those cache operations, so that you see no improvement?

    Regards, Eric
  • Hi, Eric
    First, thanks for your reply.
    Your question 1: With DDR cached and DDR non-cached, are the numbers exactly the same?

    Yes, the reported cycle counts are the same.

    Your question 2: Is it possible that the benefit of the cache is overshadowed by the cost of the cache operations, so that you see no improvement?

    Yes.

    I think the cache operations are not running for my code.

    Please tell me how to speed up DDR access using the cache.

    Thanks again.

    Regards.

    Aither.

  • Hi,

    Your DDR cache usage is right. Is the EDMA usage part of VLIB_xyGradients(), or is it outside the function? If EDMA is not involved, data access from DDR should be faster because it is cached.
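
    As a side note on why indices 128 and 129 are the ones passed to CACHE_enableCaching(): each C66x MAR register controls cacheability for one 16 MB window of the address map, so the MAR index of an address is the address shifted right by 24 bits. A minimal sketch of that arithmetic follows; `mar_index` and `mar_span` are hypothetical helper names, not CSL functions.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Each C66x MAR register covers one 16 MB window: index = address / 16 MB. */
    #define MAR_WINDOW_SHIFT 24u

    /* Hypothetical helper (not part of CSL): MAR index covering a byte address. */
    static uint32_t mar_index(uint32_t addr)
    {
        return addr >> MAR_WINDOW_SHIFT;
    }

    /* Hypothetical helper: number of MAR windows touched by [addr, addr + size). */
    static uint32_t mar_span(uint32_t addr, uint32_t size)
    {
        /* The region must be non-empty and must not wrap the 32-bit address space. */
        assert(size > 0u && (size - 1u) <= (UINT32_MAX - addr));
        return mar_index(addr + (size - 1u)) - mar_index(addr) + 1u;
    }
    ```

    With DDR starting at 0x80000000, `mar_index(0x80000000u)` gives 128, and a 32 MB region from that address spans exactly two windows (MAR128 and MAR129). Data placed at or above 0x82000000 would fall under MAR130 and later, which the calls shown in this thread do not enable.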

    Regards, Eric
  • Hi, Eric.
    Thanks for your reply.

    I think that all caches run AUTOMATICALLY.
    Should I configure and start the EDMA between L2 SRAM and DDR? If so, how do I use the EDMA?

    Thanks again.
    Regards.
    Aither.
  • Hi,

    "I think that all caches run AUTOMATICALLY." =====> The cache is ON once you have enabled it.

    "Should I configure and start the EDMA between L2 SRAM and DDR? If so, how do I use the EDMA?" =======> I don't know your data flow, or whether EDMA is used inside the function or is something your application wants to do outside of it.

    Using EDMA is beneficial if you move a big data block. The typical way to do this is:
    1. CPU writes data into the source buffer
    2. CPU does a cache write back (optionally with invalidate) for the source buffer
    3. EDMA moves the data from the source buffer into the destination buffer; you can use CSL or the EDMA LLD for this
    4. CPU does a cache invalidate of the destination buffer
    5. CPU reads the data for consumption
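
    The ordering of these steps can be sketched as below, with the invalidate applied to the destination buffer, since that is the buffer the CPU reads after the transfer. This is only a host-side illustration under stated assumptions: `cache_writeback`, `cache_invalidate` and `dma_copy` are hypothetical stand-ins that merely record call order. On the DSP they would be replaced by the CSL calls CACHE_wbL2() and CACHE_invL2() and by a real EDMA transfer.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define L2_LINE 128u  /* C66x L2 cache line size in bytes */

    /* Bookkeeping so the order of operations can be inspected. */
    static int g_step, g_wb_step, g_dma_step, g_inv_step;

    /* Hypothetical stand-in for CACHE_wbL2(): push dirty lines out to memory. */
    static void cache_writeback(const void *p, size_t n)
    {
        (void)p; (void)n;
        g_wb_step = ++g_step;
    }

    /* Hypothetical stand-in for CACHE_invL2(): drop cached lines. */
    static void cache_invalidate(const void *p, size_t n)
    {
        (void)p; (void)n;
        g_inv_step = ++g_step;
    }

    /* Hypothetical stand-in for an EDMA transfer (memcpy on the host). */
    static void dma_copy(void *dst, const void *src, size_t n)
    {
        memcpy(dst, src, n);
        g_dma_step = ++g_step;
    }

    /* Copy n bytes by DMA while keeping the CPU view of both buffers coherent. */
    static void coherent_dma_copy(void *dst, const void *src, size_t n)
    {
        /* Whole cache lines only, so no unrelated data shares a line. */
        assert(((uintptr_t)src % L2_LINE) == 0u);
        assert(((uintptr_t)dst % L2_LINE) == 0u);
        assert((n % L2_LINE) == 0u);

        cache_writeback(src, n);   /* make CPU-written source data visible to the DMA */
        dma_copy(dst, src, n);     /* the DMA works on memory and bypasses the caches */
        cache_invalidate(dst, n);  /* force the CPU to refetch dst from memory */
    }
    ```

    The alignment checks mirror the DATA_ALIGN(..., CACHE_L2_LINESIZE) pragmas used earlier in the thread: cache operations act on whole lines, so buffers shared with a DMA are best kept line-aligned and sized in multiples of the line size.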

    Regards, Eric
  • Hi, Eric.

    First, thanks for your reply.

    Please tell me: when the cache is ON, does the EDMA hardware start automatically between the L2 cache and DDR?

    And when the CPU accesses data in DDR, does the EDMA automatically transfer the data between the L2 cache and DDR without any CPU operation or instruction?

    Thanks again.

    Regards.

    Aither.

  • Hi,

    First, EDMA is not started automatically. It can be triggered by one of the following:
    • Event-triggered transfer request (this is the more typical usage of EDMA3): A peripheral, system, or externally-generated event triggers a transfer request.
    • Manually-triggered transfer request: The DSP manually triggers a transfer by writing a 1 to the corresponding bit in the event set register (ESR/ESRH).
    • Chain-triggered transfer request: A transfer is triggered on the completion of another transfer or sub-transfer.
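
    For the manually-triggered case, the "corresponding bit" can be worked out as in the sketch below: channels 0 to 31 live in ESR and channels 32 to 63 in ESRH, one bit per channel. The type and function names are hypothetical, not CSL or EDMA LLD identifiers, and the sketch only computes the register and bit mask; it does not touch hardware.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Which event set register holds a given channel's bit. */
    typedef enum { EDMA_REG_ESR = 0, EDMA_REG_ESRH = 1 } edma_esr_reg;

    typedef struct {
        edma_esr_reg reg;   /* ESR for channels 0-31, ESRH for channels 32-63 */
        uint32_t     mask;  /* single bit to write in order to trigger the channel */
    } edma_trigger;

    /* Hypothetical helper: register and bit mask for a manual trigger. */
    static edma_trigger edma_manual_trigger(uint32_t channel)
    {
        edma_trigger t;
        assert(channel < 64u);  /* ESR and ESRH together cover 64 channels */
        t.reg  = (channel < 32u) ? EDMA_REG_ESR : EDMA_REG_ESRH;
        t.mask = (uint32_t)1u << (channel & 31u);
        return t;
    }
    ```

    For example, channel 5 maps to bit 5 of ESR (mask 0x20), and channel 40 maps to bit 8 of ESRH (mask 0x100).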

    Once triggered, it moves data between L2 memory (NOT L2 cache) and DDR.

    When the CPU accesses data in DDR, it reads the data through the cache. If the EDMA has written that data, you need to invalidate the cache from the CPU before reading it.

    Regards, Eric
  • Hi, Eric
    First, thanks for your effort and reply.
    I tested the DDR cache on the AM5728 IDK as below:

    main.c:

    #define TEST_BUFF_SZ (8*1024*1024)
    #pragma DATA_ALIGN(a, CACHE_L2_LINESIZE)
    #pragma DATA_ALIGN(b, CACHE_L2_LINESIZE)
    #pragma DATA_ALIGN(c, CACHE_L2_LINESIZE)
    
    #pragma DATA_SECTION(ppin, ".ddr")
    #pragma DATA_SECTION(ppgradx_opt, ".ddr")
    #pragma DATA_SECTION(ppgrady_opt, ".ddr")
    #pragma DATA_SECTION(ppgradx_cn, ".ddr")
    #pragma DATA_SECTION(ppgrady_cn, ".ddr")
    static short a[TEST_BUFF_SZ], b[TEST_BUFF_SZ], c[TEST_BUFF_SZ];
    
    int main ()
    {
    	int i, j ;
    	clock_t t_start, t_stop, t_overhead, t_opt, t_i, t_cn;
    
    #ifndef IO_CONSOLE
    	Board_initCfg boardCfg;
    #if defined(SOC_K2E) || defined(SOC_C6678)|| defined(SOC_K2H)|| defined(SOC_C6657)
    	boardCfg = BOARD_INIT_MODULE_CLOCK |
    	BOARD_INIT_UART_STDIO;
    #else
    	boardCfg = BOARD_INIT_PINMUX_CONFIG |
    	BOARD_INIT_MODULE_CLOCK |
    	BOARD_INIT_UART_STDIO;
    #endif
    	Board_init(boardCfg);
    #endif
    
    	for (i = 0; i < TEST_BUFF_SZ; i++)
    	{
    		a[i] = b[i] = i << 2;
    	}
    	
    	TSCL= 0,TSCH=0;
    	/* Compute the overhead of calling _itoll(TSCH, TSCL) twice to get timing info */
    	/* ---------------------------------------------------------------- */
    	t_start = _itoll(TSCH, TSCL);
    	t_stop = _itoll(TSCH, TSCL);
    	t_overhead = t_stop - t_start;
    
    	t_start = _itoll(TSCH, TSCL);
    	for (i = 0; i < TEST_BUFF_SZ; i++)
    	{
    		c[i] = a[i] + b[i];
    	}
    
    	t_stop = _itoll(TSCH, TSCL);
    	t_cn = (t_stop - t_start) - t_overhead;
    	printf("test1:%d,%d\n", t_cn, c[1]);
    
    	CACHE_setL1PSize(CACHE_L1_32KCACHE);
    	CACHE_setL1DSize(CACHE_L1_32KCACHE);
    	CACHE_setL2Size(CACHE_128KCACHE); // user defined
    
    	CACHE_enableCaching(64);
    	CACHE_enableCaching(128);
    	CACHE_enableCaching(129);
    
    	CACHE_invL2((void *)0, 128*1024, CACHE_WAIT);
    	CACHE_invL2Wait();
    
    	for (i = 0; i < TEST_BUFF_SZ; i++)
    	{
    		a[i] = b[i] = i << 2;
    	}
    	TSCL= 0,TSCH=0;
    	/* Compute the overhead of calling _itoll(TSCH, TSCL) twice to get timing info */
    	/* ---------------------------------------------------------------- */
    	t_start = _itoll(TSCH, TSCL);
    	t_stop = _itoll(TSCH, TSCL);
    	t_overhead = t_stop - t_start;
    
    	t_start = _itoll(TSCH, TSCL);
    	for (i = 0; i < TEST_BUFF_SZ; i++)
    	{
    		c[i] = a[i] + b[i];
    	}
    
    	t_stop = _itoll(TSCH, TSCL);
    	t_cn = (t_stop - t_start) - t_overhead;
    	printf("test2:%d,%d\n", t_cn, c[1]);
    }
    

    link.cmd:

    MEMORY
    {
      L2SRAM (RWX)  : org = 0x0800000, len = 0x040000 /* 256KB */
      DDR0:      o = 0x80000000 l = 0x40000000   /* 1GB external DDR Bank 0 */
    }
    
    SECTIONS
    {
      .kernel: {
        dsplib*<*.o*> (.text:optimized) { SIZE(_kernel_size) }
      } 
      
      .text:            load >> DDR0
      .text:touch:      load >> DDR0
    
      GROUP (NEAR_DP)
      {
        .neardata
        .rodata 
        .bss
      } load > DDR0
       
      .init_array: load >> DDR0 
      .far:        load >> DDR0
      .fardata:    load >> DDR0
      .neardata    load >> DDR0
      .rodata      load >> DDR0
      .data:       load >> DDR0 
      .switch:     load >> DDR0
      .stack:      load >  DDR0
      .args:       load >  DDR0 align = 0x4, fill = 0 {_argsize = 0x200; }
      .sysmem:     load >  DDR0
      .cinit:      load >  DDR0
      .const:      load >  DDR0 START(const_start) SIZE(const_size)
      .pinit:      load >  DDR0
      .cio:        load >> DDR0
      .csl_vect:   load >  DDR0
    }
    

    result:

    test1:1731405405,8
    test2:1731397407,8

    As you can see, the run with the cache is only slightly faster than the run without it, but I expected the cache to make it much faster.

    Please tell me what is wrong.

    Thanks again.

    Regards, Aither.

  • Hi,

    I don't have your CCS project, so I created one and tested it. You need to make sure of the initial test condition before you run the comparison. On the DSP C66x core, look at the following registers:
    L1DCFG 0x01840040 change to 0
    L1PCFG 0x01840020 change to 0
    L2CFG 0x01840000 change to 0
    MAR128 0x01848200 change to 0xc
    MAR129 0x01848204 change to 0xc

    Then run:
    CACHE_setL1PSize(CACHE_L1_32KCACHE);
    CACHE_setL1DSize(CACHE_L1_32KCACHE);
    CACHE_setL2Size(CACHE_128KCACHE); // user defined

    CACHE_enableCaching(64);
    CACHE_enableCaching(128);
    CACHE_enableCaching(129);

    Make sure the registers above now read 4/4/3/0xD/0xD, then benchmark again. Let me know if you get different results. Note that for some reason _itoll() didn't work properly for me and I didn't debug it, but from TSCL and TSCH I could tell that the cached run is faster.
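
    The MAR addresses and the 0xC versus 0xD values listed above follow a simple pattern: MARn sits at 0x01848000 + 4*n, and bit 0 of each MAR is the "permit caching" (PC) bit, so 0xD is 0xC with caching turned on. A small sketch of that check, using hypothetical helper names that are not part of CSL:

    ```c
    #include <assert.h>
    #include <stdint.h>

    #define MAR_BASE   0x01848000u  /* address of MAR0 in the C66x CorePac */
    #define MAR_PC_BIT 0x1u         /* bit 0: permit caching for this 16 MB window */

    /* Hypothetical helper: address of MARn (one 32-bit register per window). */
    static uint32_t mar_reg_addr(uint32_t n)
    {
        assert(n < 256u);  /* MAR0..MAR255 cover the 4 GB address space */
        return MAR_BASE + 4u * n;
    }

    /* Hypothetical helper: does a MAR readback value have caching enabled? */
    static int mar_is_cacheable(uint32_t mar_value)
    {
        return (mar_value & MAR_PC_BIT) != 0u;
    }
    ```

    `mar_reg_addr(128u)` and `mar_reg_addr(129u)` give 0x01848200 and 0x01848204, matching the addresses above, and `mar_is_cacheable()` is false for 0xC and true for 0xD.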

    Regards, Eric
  • Hi, Eric.

    First, thanks for your reply, and sorry for my late reply.

    I tested as you instructed, but the result is the same:

    t_start_hi = TSCH;
    t_start_lo = TSCL;
    t_start = _itoll(t_start_hi, t_start_lo);

    for (i = 0; i < TEST_BUFF_SZ; i++)
    {
    c[i] = a[i] + b[i];
    }

    t_stop_hi = TSCH;
    t_stop_lo = TSCL;
    t_stop = _itoll(t_stop_hi, t_stop_lo);
    t_cn = (t_stop - t_start) - t_overhead;
    AUDIO_log("test1:%d,%X,%X~%X,%X\n", t_cn, t_start_hi, t_start_lo, t_stop_hi, t_stop_lo);

    test1:1732320338,5,C4CC9980~5,2C0DB5D6

    test2:1732253510,6,3297C265~6,99D7D9AF

    I could not verify _itoll(), but in any case the difference is very small.
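
    One thing that may explain the odd high words in the log: on the C66x, reading TSCL is what latches the matching upper half into TSCH, so the low word should be read first and the high word second. In the snippet above the high word is read first, and in test1 the logged high word stays at 5 even though the low word wraps around, which would be consistent with a stale TSCH value. The sketch below shows the 64-bit join and a low-word-only difference that stays valid for intervals shorter than 2^32 cycles. The function names are hypothetical, and the register reads themselves are left out so the code builds on a host.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Join the two 32-bit halves of the time stamp counter.
     * On the DSP, read TSCL first and TSCH second before calling this. */
    static uint64_t tsc_join(uint32_t hi, uint32_t lo)
    {
        return ((uint64_t)hi << 32) | lo;
    }

    /* Elapsed cycles between two full 64-bit time stamps. */
    static uint64_t tsc_elapsed(uint64_t start, uint64_t stop)
    {
        assert(stop >= start);
        return stop - start;
    }

    /* Elapsed cycles from the low words only. Unsigned subtraction wraps
     * modulo 2^32, so this holds whenever the interval is under 2^32 cycles. */
    static uint32_t tsc_elapsed_lo(uint32_t start_lo, uint32_t stop_lo)
    {
        return stop_lo - start_lo;
    }
    ```

    Applied to the logged low words, `tsc_elapsed_lo(0xC4CC9980u, 0x2C0DB5D6u)` is 1732320342 and `tsc_elapsed_lo(0x3297C265u, 0x99D7D9AFu)` is 1732253514. Both are 4 cycles above the printed test1 and test2 values, a gap that would correspond to the subtracted t_overhead, so the printed cycle counts look usable even though the high words do not.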

    Please tell me how to solve this problem.

    Thanks again.

    Regards.

    Aither.

  • Hi,

    I tested on the AM5728 IDK EVM, using the standard GEL file to initialize the SoC, including the DSP PLL. I can see that the run with the cache on is about 13 times faster. One difference is that I built your code with -O3 optimization; the other changes are minor.

    The CCS project is attached, and the printout is below.

    Cache off

    start 488821, stop 580438111, diff 579949281, TSCH 0, c[1] 8

    Cache on

    start 635032235, stop 678885963, diff 43853725, TSCH 0, c[1] 8
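
    The "about 13 times" figure follows directly from the two diff values. A short sketch of the arithmetic, which also gives the average cost per loop iteration, assuming the loop still runs 8*1024*1024 iterations as in the posted code (the helper names are hypothetical):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* How many times faster the second run is than the first. */
    static double speedup(uint64_t cycles_before, uint64_t cycles_after)
    {
        assert(cycles_after > 0u);
        return (double)cycles_before / (double)cycles_after;
    }

    /* Average cycles spent per loop iteration. */
    static double cycles_per_element(uint64_t cycles, uint64_t elements)
    {
        assert(elements > 0u);
        return (double)cycles / (double)elements;
    }
    ```

    `speedup(579949281, 43853725)` is roughly 13.2. Under the iteration-count assumption, that is about 69 cycles per element with the cache off and about 5.2 with it on.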

    Regards, Eric

    DSP_cache.zip