CCS/AM5728: DSP cache problem

Part Number: AM5728

Tool/software: Code Composer Studio

Hi, kind TI and everyone!

I tested the ti\processor_sdk_rtos_am57xx_5_00_00_15\demos\audio-benchmark-starterkit demo using CCS 8.2 on an AM5728 IDK.

I mapped all memory sections into DDR and tested the L2SRAM cache as below:

CACHE_setL1PSize(CACHE_L1_32KCACHE);
CACHE_setL1DSize(CACHE_L1_32KCACHE);
CACHE_setL2Size(CACHE_128KCACHE);

CACHE_enableCaching(64);
CACHE_enableCaching(128);
CACHE_enableCaching(129);

The results are good:

DSPF_sp_fftSPxSP Iter#: 1 Intrinsic Successful SA Successful N = 8 radix = 2 natC: 1387 optC: 2953 SA: 355
DSPF_sp_fftSPxSP Iter#: 2 Intrinsic Successful SA Successful N = 16 radix = 4 natC: 606 optC: 2491 SA: 206
DSPF_sp_fftSPxSP Iter#: 3 Intrinsic Successful SA Successful N = 32 radix = 2 natC: 1273 optC: 6336 SA: 284
DSPF_sp_fftSPxSP Iter#: 4 Intrinsic Successful SA Successful N = 64 radix = 4 natC: 2326 optC: 13252 SA: 424
DSPF_sp_fftSPxSP Iter#: 5 Intrinsic Successful SA Successful N = 128 radix = 2 natC: 5378 optC: 34218 SA: 891
DSPF_sp_fftSPxSP Iter#: 6 Intrinsic Successful SA Successful N = 256 radix = 4 natC: 10577 optC: 71518 SA: 1600

But after I inserted the DDR test below, the cycle counts increased to the levels I saw previously with no cache enabled.

#define TEST_BUFF_SZ (8*1024*1024)

#pragma DATA_ALIGN(a, CACHE_L2_LINESIZE)
#pragma DATA_ALIGN(b, CACHE_L2_LINESIZE)
#pragma DATA_ALIGN(c, CACHE_L2_LINESIZE)

static short a[TEST_BUFF_SZ], b[TEST_BUFF_SZ], c[TEST_BUFF_SZ];

main()

{

Board_init(boardCfg);

for (i = 0; i < TEST_BUFF_SZ; i++)
{
    a[i] = b[i] = i << 2;
}
TSCL= 0,TSCH=0;
/* Compute the overhead of calling _itoll(TSCH, TSCL) twice to get timing info */
/* ---------------------------------------------------------------- */
t_start = _itoll(TSCH, TSCL);
t_stop = _itoll(TSCH, TSCL);
t_overhead = t_stop - t_start;

t_start = _itoll(TSCH, TSCL);
for (i = 0; i < TEST_BUFF_SZ; i++)
{
    c[i] = a[i] + b[i];
}

t_stop = _itoll(TSCH, TSCL);
t_cn = (t_stop - t_start) - t_overhead;
AUDIO_log("DDR test:%d,%d\n", t_cn, c[1]);

CACHE_setL1PSize(CACHE_L1_32KCACHE);
CACHE_setL1DSize(CACHE_L1_32KCACHE);
CACHE_setL2Size(CACHE_128KCACHE); //USer defined.

CACHE_enableCaching(64);
CACHE_enableCaching(128);
CACHE_enableCaching(129);

CACHE_invL2((void *)0, 128*1024, CACHE_WAIT); // invalidate entire L2SRAM
CACHE_invL2Wait();

for (N = 8, k = 1; N <= MAXN; N = N * 2, k++)
{

// run benchmark

...

}

DSPF_sp_fftSPxSP Iter#: 1 Intrinsic Successful SA Successful N = 8 radix = 2 natC: 5120 optC: 29480 SA: 4999
DSPF_sp_fftSPxSP Iter#: 2 Intrinsic Successful SA Successful N = 16 radix = 4 natC: 8438 optC: 58262 SA: 7156
DSPF_sp_fftSPxSP Iter#: 3 Intrinsic Successful SA Successful N = 32 radix = 2 natC: 25211 optC: 168056 SA: 16102
DSPF_sp_fftSPxSP Iter#: 4 Intrinsic Successful SA Successful N = 64 radix = 4 natC: 49016 optC: 348002 SA: 29614
DSPF_sp_fftSPxSP Iter#: 5 Intrinsic Successful SA Successful N = 128 radix = 2 natC: 133709 optC: 913610 SA: 69847
DSPF_sp_fftSPxSP Iter#: 6 Intrinsic Successful SA Successful N = 256 radix = 4 natC: 265388 optC: 1882187 SA: 140677

Please tell me why the benchmark performance dropped.

Thanks.

Regards.

Aither.

  • Part Number: AM5728

    Tool/software: Code Composer Studio

    Hi, kind TI and everyone,

    I'd like to know the possible reasons for the difference in performance with and without cache in ti\vlib_c66x_3_3_2_0\examples\Regression.

    Thanks.

    Regards.

    Aither.

  • Aither,

    Can you provide the map file from your build for the DDR test case with the cache enabled? Your cache code enables MAR bits 64, 128 and 129, which correspond to the following regions:

    4000 0000h - 40FF FFFFh
    8000 0000h - 80FF FFFFh
    8100 0000h - 81FF FFFFh

    So you need to ensure that the code/data sections are placed in those regions. Also, how big is your code? Is it larger than 128 KB?
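
The mapping from a MAR index to the address range it controls can be sketched in plain C as below. This is only an illustrative host-side sketch, assuming each MAR register covers one 16 MB window as described in this thread. The helper names (`mar_region_start`, `mar_region_end`, `mar_index_for_address`) are hypothetical and are not part of the CSL.

```c
#include <stdint.h>

/* Assumption: each MAR register controls one 16 MB window. */
#define MAR_REGION_SIZE 0x01000000u

/* First byte address covered by MAR[index], for index 0..255 (hypothetical helper). */
uint32_t mar_region_start(unsigned index)
{
    return (uint32_t)index * MAR_REGION_SIZE;
}

/* Last byte address covered by MAR[index] (hypothetical helper). */
uint32_t mar_region_end(unsigned index)
{
    return mar_region_start(index) + (MAR_REGION_SIZE - 1u);
}

/* MAR index whose window contains addr: the top 8 address bits. */
unsigned mar_index_for_address(uint32_t addr)
{
    return (unsigned)(addr >> 24);
}
```

With these helpers, a section address taken from the map file can be passed to `mar_index_for_address` to check whether it falls in one of the windows that were enabled with `CACHE_enableCaching`.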

    There are several great training materials for understanding the impact of cache on performance and strategies for using cache optimally. You can check out the cache section of the C6000 Embedded Design Workshop:
    processors.wiki.ti.com/.../C6000_Embedded_Design_Workshop

    The C66x Cache user guide is also a great resource for this topic.

    Regards
    Rahul
  • Hi, Rahul

    First, thanks for your reply.

    C:\ti\processor_sdk_rtos_am57xx_5_00_00_15\demos\audio-benchmark-starterkit\BenchmarkProjects\Benchmark_FFT_idkAM572x_c66ExampleProject\Release\Benchmark_FFT_idkAM572x_c66ExampleProject.map:

    MEMORY CONFIGURATION

    name origin length used unused attr fill
    ---------------------- -------- --------- -------- -------- ---- --------
    L2SRAM 00800000 00040000 00000480 0003fb80 RW X
    OCMC_RAM1 40300000 00080000 00000000 00080000 RWIX
    OCMC_RAM2 40400000 00100000 00000000 00100000 RWIX
    OCMC_RAM3 40500000 00100000 00000000 00100000 RWIX
    DDR0 80000000 40000000 00022560 3ffddaa0 RWIX


    SEGMENT ALLOCATION MAP

    run origin load origin length init length attrs members
    ---------- ----------- ---------- ----------- ----- -------
    00800000 00800000 00000480 00000480 r-x
    00800000 00800000 00000480 00000480 r-x .kernel
    80000000 80000000 0000c000 00000000 rw-
    80000000 80000000 0000c000 00000000 rw- .stack
    8000c000 8000c000 00009c80 00009c80 r-x
    8000c000 8000c000 00009c80 00009c80 r-x .text
    80015c80 80015c80 0000c0ac 00000000 rw-
    80015c80 80015c80 00008000 00000000 rw- .sysmem
    8001dc80 8001dc80 00003cd0 00000000 rw- .far
    80021950 80021950 000003dc 00000000 rw- .fardata
    80021d30 80021d30 000002f8 000002f8 r--
    80021d30 80021d30 000002f8 000002f8 r-- .const
    80022028 80022028 00000174 00000000 rw-
    80022028 80022028 00000120 00000000 rw- .cio
    80022148 80022148 00000054 00000000 rw- .bss
    8002219c 8002219c 00000048 00000048 r--
    8002219c 8002219c 00000048 00000048 r-- .switch
    800221e8 800221e8 00000160 00000160 r--
    800221e8 800221e8 00000160 00000160 r-- .cinit
    80022400 80022400 00000220 00000220 r-x
    80022400 80022400 00000220 00000220 r-x .csl_vect

    ...

    My map file is:

    /cfs-file/__key/communityserver-discussions-components-files/791/Benchmark_5F00_FFT_5F00_idkAM572x_5F00_c66ExampleProject.map.txt

    As shown above, all code is in .text, and the .text size is 0x9c80 (39.125 KB).

    I have gone through the C6000_Embedded_Design_Workshop several times, but I could not find the problem.

    Please help me to solve this problem.

    Thanks again.

    Regards.

    Aither.

  • p aither said:

    #define TEST_BUFF_SZ (8*1024*1024)

    #pragma DATA_ALIGN(a, CACHE_L2_LINESIZE)
    #pragma DATA_ALIGN(b, CACHE_L2_LINESIZE)
    #pragma DATA_ALIGN(c, CACHE_L2_LINESIZE)

    static short a[TEST_BUFF_SZ], b[TEST_BUFF_SZ], c[TEST_BUFF_SZ];

    These 3 buffers will consume 48MB of memory.  Why don't I see them in your map file?  The MAR sections correspond to 16MB regions of DDR, so you would need to have 5 minimum to cover all of your code and data with those giant arrays defined.
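
The arithmetic behind this can be sketched on a host machine as below. This is a hedged illustration, not the demo's code: it assumes 16-bit `short` elements and 16 MB per MAR window, and the names `MarRange`, `mar_range_for_span`, `mar_count` and `test_buffers_total_bytes` are hypothetical. The actual start address of the arrays has to be read from the map file, since it depends on where the linker places them.

```c
#include <stdint.h>

#define TEST_BUFF_SZ (8u * 1024u * 1024u)   /* elements per array, as in the post */

/* Inclusive range of MAR indices (hypothetical type). */
typedef struct {
    unsigned first;
    unsigned last;
} MarRange;

/* Total bytes used by the three arrays a, b and c, assuming 16-bit elements. */
uint32_t test_buffers_total_bytes(void)
{
    return 3u * TEST_BUFF_SZ * (uint32_t)sizeof(int16_t);
}

/* MAR indices touched by the span [start, start + size_bytes - 1].
 * Assumes size_bytes >= 1 and 16 MB per MAR window. */
MarRange mar_range_for_span(uint32_t start, uint32_t size_bytes)
{
    MarRange r;
    uint64_t last_addr = (uint64_t)start + size_bytes - 1u;
    r.first = (unsigned)(start >> 24);
    r.last  = (unsigned)(last_addr >> 24);
    return r;
}

/* Number of MAR windows in the range. */
unsigned mar_count(MarRange r)
{
    return r.last - r.first + 1u;
}

/* On the target, each index in the range could then be enabled, e.g.:
 *   for (i = r.first; i <= r.last; i++) CACHE_enableCaching(i);
 */
```

Feeding in the real start address and total size of the code and data from the map file gives the set of MAR indices that need caching enabled for the whole footprint.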