66AK2H14: DDR3 cacheability (L1D cache issue)

Part Number: 66AK2H14

Hi all,


I don't understand something about a CCS project I wrote. In my test I only write data to DDR3. I ran two tests: one with DDR3 cacheable and another with DDR3 non-cacheable (configured through the MAR registers).

For this test I don't use any OS; L2 is configured as all SRAM (no cache) and L1D as all cache.


I expect no data in the cache in either test.


In this test, each write goes to a different memory block, so every write is a miss (no data is rewritten), if I have understood the DSP cache and DDR3 memory controller documentation for the 66AK2H14 correctly:


L1D is read-allocate/writeback. In my case I don't do any reads (only writes), so I expect no data in the cache: on a write miss, the documentation says the data passes through the write buffer and goes directly to L2 or external memory (in my case DDR3).

Yet when I look at the cache view in CCS, the cache is in use once exactly 262 142 bytes have been written.

Why?
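
To make the expectation above concrete, here is a minimal sketch of a read-allocate, no-write-allocate cache. It is a toy model, not the real L1D: the geometry (64-byte lines, 2 ways, 256 sets for 32 KB) and all `toy_*` names are assumptions chosen for illustration, and the write buffer is not modeled at all.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed geometry, loosely resembling a 32 KB 2-way cache with 64-byte lines. */
#define LINE_BYTES 64u
#define NUM_WAYS   2u
#define NUM_SETS   256u   /* 32 KB / (64 B * 2 ways) */

typedef struct {
    uint32_t tag[NUM_SETS][NUM_WAYS];
    uint8_t  valid[NUM_SETS][NUM_WAYS];
    uint8_t  lru[NUM_SETS];              /* way to replace next in each set */
} toy_cache_t;

static void toy_cache_reset(toy_cache_t *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof *c);
}

/* Returns the way holding addr, or -1 on a miss. */
static int toy_lookup(const toy_cache_t *c, uint32_t addr)
{
    uint32_t line = addr / LINE_BYTES;
    uint32_t set  = line % NUM_SETS;
    uint32_t tag  = line / NUM_SETS;
    for (unsigned w = 0; w < NUM_WAYS; w++) {
        if (c->valid[set][w] && c->tag[set][w] == tag) {
            return (int)w;
        }
    }
    return -1;
}

/* Read: a miss allocates a line (read-allocate). */
static void toy_read(toy_cache_t *c, uint32_t addr)
{
    uint32_t line = addr / LINE_BYTES;
    uint32_t set  = line % NUM_SETS;
    int way = toy_lookup(c, addr);
    if (way < 0) {
        way = c->lru[set];
        c->tag[set][way]   = line / NUM_SETS;
        c->valid[set][way] = 1;
    }
    c->lru[set] = (uint8_t)(1 - way);    /* the other way becomes the victim */
}

/* Write: a miss allocates nothing (no write-allocate); a hit only touches LRU. */
static void toy_write(toy_cache_t *c, uint32_t addr)
{
    int way = toy_lookup(c, addr);
    if (way >= 0) {
        uint32_t set = (addr / LINE_BYTES) % NUM_SETS;
        c->lru[set] = (uint8_t)(1 - way);
    }
}

/* Number of valid lines currently held by the model. */
static unsigned toy_valid_lines(const toy_cache_t *c)
{
    unsigned n = 0;
    for (unsigned s = 0; s < NUM_SETS; s++) {
        for (unsigned w = 0; w < NUM_WAYS; w++) {
            n += c->valid[s][w];
        }
    }
    return n;
}
```

In this model, calling `toy_write` over a 128 KB range starting at 0xB0000000 leaves `toy_valid_lines` at 0, while calling `toy_read` over the same range fills every line. Under these assumptions, lines can only appear in the cache if something reads the buffer.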


Regards,

François

  • Hi,

    I assume you have:

    L1DCFG = 4 (32KB cache)

    L2CFG = 0 (all SRAM)

    DDR3A: starting from MAR128, what are the MAR values in both cases?

    "In this test, I wrote data in different memory block each time and each write will be 'miss' (no rewritting data), " =====> can you provide a code snippet for this, or the CCS project?

    The CCS memory view can be shown with L1D and L2 checked or unchecked. Can you attach screenshots to explain what you saw and what you expected?

    Regards, Eric

  • Hi,

    Yes, you're right: L1DCFG=4 and L2CFG=0.

    DDR3A is cacheable from MAR128 to MAR255; my start address for the test is 0xB0000000.

    Here is the code I use for the test:

    /*
     * Author:  F. Poulain
     * Date:    05/06/2019
     *
     * The goal of this test is to see DDR3 behaviour in DSP one core writing.
     *
     */
    
    #include <memory.h>
    
    /* 1 MiB = 1024 KiB = 1048576  */
    #define DDR_SIZE_TO_TEST_MAX_BYTES    (0x100000)
    
    #define FLOAT_MB (1048576.0)
    
    /* Highest MDMA bus priority */
    #define BUS_PRIORITY_MAX (0x0)
    
    /* Lowest MDMA bus priority */
    #define BUS_PRIORITY_MIN (0x7)
    
    /*
     * Function prototypes
     */
    void ddr_write_access(Uint32* p_tab, Uint32 size_bytes);
    Uint32 verif(Uint32* p_tab, Uint32 size_bytes);
    
    /* Table test pointer in DDR3A */
    Uint32* ptab0 = NULL;
    
    /* Convert a cycle count measured at 1.2 GHz into a throughput in MB/s (1 MB = 1048576 bytes) */
    float calculate_throughput(Uint32 size_bytes, UInt64 nb_cycles_elapsed)
    {
        float f32_speed_mb_per_sec = 0.0;
        double f64_speed_b_per_sec = 0.0;
        double f64_speed_mb_per_sec = 0.0;
    
        /* double keeps full precision for large cycle counts */
        double execution_time_ns = 0.0;
    
        /* one cycle lasts 1/1.2 ns at 1.2 GHz */
        execution_time_ns = (double) nb_cycles_elapsed * (1/1.2) ;
    
        f64_speed_b_per_sec = ( (double) 1000000000.0 * (double) size_bytes) / execution_time_ns;
    
        f64_speed_mb_per_sec = f64_speed_b_per_sec / (double) FLOAT_MB;
    
        f32_speed_mb_per_sec = (float) (f64_speed_mb_per_sec);
    
        return f32_speed_mb_per_sec;
    }
    
    
    int main(void)
    {
        // memory initialization
        memory_init();
    
        volatile int *next=(volatile int*)0xFFFF0000;
        *next=0;
        //printf("@ of loop continue variable: %x\n\n", &next);
    
        // set priority
        Uint32 priority = BUS_PRIORITY_MIN;
        CSL_XMC_setMDMAPriority(priority);
    
        CSL_Uint64 timestamp_start      = 0;
        CSL_Uint64 timestamp_end        = 0;
        CSL_Uint64 timestamp_overhead   = 0;
        CSL_Uint64 nb_cycles_elapsed    = 0;
        Uint32 core_id = 0;
        Uint32 err=0;
    
    
        float f32_speed_mb_per_sec = 0.0;
    
        Uint32 size_to_test_bytes = 131072; //Uint32 size_to_test_bytes = 4;
    
        core_id = DNUM;
    
        /* Enable time stamp clock */
        CSL_tscEnable();
    
        /* counting delay */
        timestamp_start     = CSL_tscRead();
        timestamp_end       = CSL_tscRead();
        timestamp_overhead    = timestamp_end - timestamp_start;
    
        // pointer in DDR3A
        ptab0 = (Uint32*) DDR_TEST_START_ADDR;
    
        if (core_id == 0)
        {
            /* Only Core 0 do the test */
    
            do
            {
    
                // DDR accesses tick measurement
                timestamp_start = CSL_tscRead();
                ddr_write_access(&ptab0[0], size_to_test_bytes);
    
                //CACHE_wbL1d(&ptab0[0], size_to_test_bytes, CACHE_WAIT); // commented because L1D is not write-allocate & cache is free at beginning.
                timestamp_end = CSL_tscRead();
    
                // Invalidate cache line (security to ensure empty cache)
                CACHE_invL1d(&ptab0[0], size_to_test_bytes, CACHE_WAIT);
    
                // Verification
                err = verif(&ptab0[0], size_to_test_bytes);
    
                // Throughput Calculation
                nb_cycles_elapsed = (timestamp_end - timestamp_start) - timestamp_overhead;
                f32_speed_mb_per_sec = calculate_throughput(size_to_test_bytes, nb_cycles_elapsed);
    
                if (err == 0)
                {
                    printf("Throughput for size bytes %d = %f MB/s \n", size_to_test_bytes, f32_speed_mb_per_sec);
                }
                else
                {
                    printf("Error: %d\n", err);
                }
    
                // reset all updated memory blocks in DDR3 memory
                memset(&ptab0[0], 0, size_to_test_bytes);
    
                // Writeback Invalidate cache (empty cache) due to verification
                CACHE_wbInvL1d(&ptab0[0], size_to_test_bytes, CACHE_WAIT);
    
                *next=0;
                do
                {
    
                } while(*next==0);
    
                // Increment test size
                size_to_test_bytes *= 2;
    
            } while(size_to_test_bytes <= DDR_SIZE_TO_TEST_MAX_BYTES);
    
        }
    
        return 0;
    }
    
    void ddr_write_access(Uint32* p_tab, Uint32 size_bytes)
    {
        Uint32 i_word = 0;
        Uint32 nb_words = size_bytes / 4;
    
        for (i_word = 0; i_word < nb_words; i_word++)
        {
            p_tab[i_word] = i_word;
        }
    }
    
    Uint32 verif(Uint32* p_tab, Uint32 size_bytes)
    {
        Uint32 i_word = 0;
        Uint32 err = 0;
        Uint32 nb_words = size_bytes / 4;
    
        for (i_word = 0; i_word < nb_words; i_word++)
        {
            if (p_tab[i_word] != i_word)
                err++;
        }
    
        return err;
    }
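
The throughput calculation in the code above can be checked by hand with a small standalone sketch. The helper name `throughput_mib_per_s` is hypothetical; the 1.2 GHz clock and the 1048576-byte megabyte are taken from the posted code.

```c
#include <assert.h>
#include <stdint.h>

#define CPU_CLOCK_HZ  1200000000.0   /* 1.2 GHz, as implied by the (1/1.2) ns cycle time */
#define BYTES_PER_MIB 1048576.0      /* same constant as FLOAT_MB in the posted code */

/* Throughput in MiB/s for size_bytes transferred in the given number of CPU cycles. */
static double throughput_mib_per_s(uint32_t size_bytes, uint64_t cycles)
{
    assert(cycles > 0);                               /* avoid dividing by zero */
    double seconds = (double)cycles / CPU_CLOCK_HZ;   /* elapsed time */
    return ((double)size_bytes / seconds) / BYTES_PER_MIB;
}
```

For example, 131072 bytes written in 1,200,000 cycles (1 ms at 1.2 GHz) works out to about 125 MiB/s.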

  • This is the cache view in the infinite loop after the writeback-invalidate.

    In the memory browser view, we can see all zeros from the memset call, but the L1D cache and DDR3 memory are not coherent.

    I am currently looking for a solution, but any help would be much appreciated.

    Best Regards,

    François

  • Hi,

    Sorry for the late response. I created a CCS project for your code. I used CCS 8.3 (CGT 8.3.2); this shouldn't matter. I used a TI K2H EVM in no-boot mode and connected to DSP core 0. The GEL file's initial setup is L1D 32K cache, L1P 32K cache, L2 no cache, and DDR non-cacheable.

    I added some code for your case:

    L1DCFG = 0x4;
    L2CFG = 0x0;

    for(i=0; i < 128; i++) {
    *(unsigned int*)(0x1848200+4*i) = 0xD;
    }

    I also added some include paths for the CSL headers and linked with the CSL library from pdk_k2hk_4_0_13. The CSL release/version should not matter, as cache operations are basic functionality. I attached my CCS project so you can compare it with yours.

    In the main code:

    After ddr_write_access() is called, I looked at the cache view; it is clear that DDR is NOT cached in L1D.

    Then, after CACHE_invL1d(&ptab0[0], size_to_test_bytes, CACHE_WAIT); still NO DDR cached in L1D.

    Then, after err = verif(&ptab0[0], size_to_test_bytes); =====> this is a DDR read, so it is cached in L1D.

    Then, after memset(&ptab0[0], 0, size_to_test_bytes);

    Finally, after CACHE_wbInvL1d(&ptab0[0], size_to_test_bytes, CACHE_WAIT); I didn't see DDR cached in L1D.


    I attached my CCS project.

    L1D_cache.zip

    Regards, Eric

  • Hi Eric,

    Thanks for your answer. I had time to look into my problem while waiting for your reply. I don't know how to resolve the problem, but I know where it comes from.

    First of all, with DDR3 non-cacheable there is nothing in the L1D cache; the problem occurs with DDR3 cacheable (have you tried making DDR3 cacheable in your test?).

    Secondly, the issue is due to the verification function, at a size of 262144 bytes: memset modifies values in the cache, but after the writeback-invalidate, the cache and DDR3 are not coherent.

    Thirdly, if the verification function is not used, there is no coherency problem between cache and DDR3 after memset and writeback-invalidate.

    Also, I don't understand why you are doing this:

    for(i=0; i < 128; i++) {
    *(unsigned int*)(0x1848200+4*i) = 0xD;
    }

    I can't find 0x1848200 in the 66AK2H14 documentation.


    I am adding my configuration in 4 new posts, to give you all the information needed to help me.

    Regards,

    François

  • #ifndef INC_PARAMETERS_H_
    #define INC_PARAMETERS_H_
    
    /* Kepler Architecture */
    #define CPU_CLOCK_KHZ           1200000     // CPU frequency (kHz)
    #define NB_DSP_CORE_MAX         8           // Number of DSP cores in Kepler
    #define DDR_TEST_START_ADDR     0xB0000000  /* */
    
    /* Define in DDR3 Initialization */
    #define PAGE_SIZE           4096        // PAGE_SIZE
    #define BANKS               8           // Number of Banks
    #define ROWS                65536       // Number of rows
    #define DSP_DDR_SIZE        402653184   // Size of DSP DDR3 Region: 0x18000000
    
    /* TEST OPTIONS */
    #define DDR_CACHEABILITY    0           // DDR3 full cacheability : 1, DDR3 non-cacheable : 0
    
    #endif /* INC_PARAMETERS_H_ */
    

  • MEMORY
    {
        LOCAL_L2_SRAM:  o = 0x00800000 l = 0x00100000   /* 1MB LOCAL L2/SRAM */
    
    	DDR3:      		o = 0xB0000000 l = 0x18000000   /* DDR3 SDRAM */
    }
    
    SECTIONS
    {
        .text          >  LOCAL_L2_SRAM
        .stack         >  LOCAL_L2_SRAM
        .bss           >  LOCAL_L2_SRAM
        .cio           >  LOCAL_L2_SRAM
        .const         >  LOCAL_L2_SRAM
        .data          >  LOCAL_L2_SRAM
        .switch        >  LOCAL_L2_SRAM
        .sysmem        >  LOCAL_L2_SRAM
        .far           >  LOCAL_L2_SRAM
        .args          >  LOCAL_L2_SRAM
        .ppinfo        >  LOCAL_L2_SRAM
        .ppdata        >  LOCAL_L2_SRAM
    
        /* COFF sections */
    
        .pinit         >  LOCAL_L2_SRAM
        .cinit         >  LOCAL_L2_SRAM
    
        /* EABI sections */
        .binit         >  LOCAL_L2_SRAM
        .init_array    >  LOCAL_L2_SRAM
        .neardata      >  LOCAL_L2_SRAM
        .fardata       >  LOCAL_L2_SRAM
        .rodata        >  LOCAL_L2_SRAM
        .c6xabi.exidx  >  LOCAL_L2_SRAM
        .c6xabi.extab  >  LOCAL_L2_SRAM
    
        /* project specific sections */
    	.l2sram			>	LOCAL_L2_SRAM
    	.ddr			>	DDR3
    }
    

  • #ifndef __MEMORY_H
    #define __MEMORY_H
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <math.h>
    #include <c6x.h>
    
    #include <dsplib.h>
    #include <csl_tsc.h>
    #include <csl_cache.h>
    #include <csl_cacheAux.h>
    
    #include <parameters.h>
    
    
    /* extern declaration */
    extern Uint32 *ptab0;
    
    /* API reference */
    
    /**
     * @brief memory mapping initialization
     * @param void
     */
    void memory_init (void);
    
    #endif /* __MEMORY_H */
    

  • #include <memory.h>
    
    
    
    
    void memory_init (void)
    {
        int k=0;
        Uint8 pcx; // Reserved bit, do not touch
        Uint8 pfx; // prefetchability
    
    
    
        if (DDR_ACCESS == 1)
        {
            /*
             * MAR 128: DDR3 start @    : 0x8000 0000
             * MAR 255: DDR3 end @      : 0xFFFF FFFF
             *
             * DDR3 cacheabilty is enable on the entire DDR3 memory space.
             */
            for (k=128; k<=255; k++)
            {
                // Set PC at '1'
                CACHE_enableCaching(k);
                // Get the memory region information for MAR k
                CACHE_getMemRegionInfo (k, &pcx, &pfx);
                // prefetch, 0 disable, 1 enable
                pfx = 1;
                CACHE_setMemRegionInfo(k, pcx, pfx);
            }
        }
    
        else if (DDR_ACCESS == 0)
        {
            /*
             * MAR 128: DDR3 start @    : 0x8000 0000
             * MAR 255: DDR3 end @      : 0xFFFF FFFF
             *
             * DDR3 cacheability is disabled on the entire DDR3 memory space.
             */
            for (k=128; k<=255; k++)
            {
                // Set PC at '0'
                CACHE_disableCaching(k);
                // Get the memory region information for MAR k
                CACHE_getMemRegionInfo (k, &pcx, &pfx);
                // prefetch, 0 disable, 1 enable
                pfx = 1;
                CACHE_setMemRegionInfo(k, pcx, pfx);
            }
        }
    
        else
        {
            printf("MAR INITIALIZATION FAILED\n");
        }
    
        /* cache initialization */
        CACHE_setL2Size(CACHE_0KCACHE);
        CACHE_setL1DSize(CACHE_L1_32KCACHE);
        CACHE_setL1PSize(CACHE_L1_32KCACHE);
    
    }
    

  • That is all of my test code. These are my compiler options:

    -mv6600 -O3 --include_path="${TI_PDK_INCLUDE_PATH}" --include_path="${COM_TI_MAS_DSPLIB_C66X_INCLUDE_PATH}" --include_path="D:/APP/TI_CCSV8/dsplib_c66x_3_4_0_2/packages/ti/dsplib" --include_path="D:/APP/TI_CCSV8/dsplib_c66x_3_4_0_2/packages" --include_path="D:/APP/TI_CCSV8/pdk_k2hk_4_0_12/packages/ti/csl" --include_path="D:/APP/TI_CCSV8/pdk_k2hk_4_0_12/packages" --include_path="${PROJECT_ROOT}" --include_path="${workspace_loc:/${ProjName}/inc}" --include_path="${CG_TOOL_ROOT}/include" --advice:performance=all --define=${TI_PDK_SYMBOLS} --define=${COM_TI_MAS_DSPLIB_C66X_SYMBOLS} --define=DEVICE_K2K -g --diag_warning=225 --diag_wrap=off --display_error_number --issue_remarks --debug_software_pipeline -k --gen_opt_info=2 --optimizer_interlist

    If you need other information, don't hesitate to ask me.

    Best Regards,

    François

  • Hi,

    I looked at your code, and I think what I did is the same as yours: L1D = 32KB cache, L1P = 32KB cache (GEL default), L2 all RAM, and DDR3A made cacheable through MAR128-MAR255. This is my code:

    for(i=0; i < 128; i++) {
    *(unsigned int*)(0x1848200+4*i) = 0xD; /* DDR cacheable, prefetchable */
    }

    These registers are documented in the TMS320C66x DSP CorePac User Guide:

    0184 8200h MAR128 Memory Attribute Register 128 8000 0000h - 80FF FFFFh
    0184 8204h MAR129 Memory Attribute Register 129 8100 0000h - 81FF FFFFh

    ....

    This should be equivalent to your (DDR_ACCESS == 1) case using the CSL calls. I also placed the code in L2.
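
As a rough sketch of how the register addresses listed above relate to memory addresses: each MAR covers a 16 MB region, and the registers are 4 bytes apart. The base address 0x01848000 is inferred from MAR128 sitting at 0x01848200, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Inferred from the listing above: MAR128 at 0x01848200, minus 128 * 4 bytes. */
#define MAR_BASE_ADDR    0x01848000u
#define MAR_REGION_SHIFT 24u          /* each MAR covers 16 MB (0x0100 0000 bytes) */

/* Index of the MAR that controls the region containing addr. */
static uint32_t mar_index_for(uint32_t addr)
{
    return addr >> MAR_REGION_SHIFT;
}

/* Address of the MAR register with the given index (0..255). */
static uint32_t mar_reg_addr(uint32_t index)
{
    assert(index < 256u);
    return MAR_BASE_ADDR + 4u * index;
}
```

With these assumptions, the test buffer at 0xB0000000 falls under MAR176, whose register sits at 0x018482C0, inside the MAR128-MAR255 range written by the loop.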

    Since I couldn't see the issue, I don't know what the difference is. If you replace your memory_init() with my code:

    L1DCFG = 0x4;
    L2CFG = 0x0;

    for(i=0; i < 128; i++) {
    *(unsigned int*)(0x1848200+4*i) = 0xD;
    }

    Do you see any difference?

    Also, you mentioned "there is no coherency between cache and DDR3" and "no problem of coherency between cache and DDR3". What is your criterion for cache coherency between L1D and DDR3A? For example:

    • In the CCS memory window, when you look at DDR3A with L1D checked and unchecked, is the DDR3A content the same? If it is the same, is that coherent?
    • Or, in CCS View > Other > Debug > Cache, do you check whether a DDR3A address is cached in L1D? If it is not cached, is that coherent? For me, after CACHE_wbInvL1d(&ptab0[0], size_to_test_bytes, CACHE_WAIT); I didn't see DDR cached in L1D.

    Regards, Eric


     

  • François, any update?

    Regards, Eric

  • Hi!

    Sorry for the delay. I tested your code, but the result is the same. What did you do for testing?

    The procedure to see the issue is the following:

    1°) Click on the green arrow to resume

    2°) Click on pause (in the infinite loop after writeback-invalidate)

    3°) In the cache view, look at L1D to see what it contains (there is nothing for 131072 bytes)

    4°) In the memory browser, enter address 0xFFFF0000 and write 1

    5°) Resume

    6°) Then pause

    7°) Look at the cache view again (there is data in the cache, depending on the DDR3 addresses)

    If you have further questions on the procedure, do not hesitate to ask me.

    Regards,

    François

  • François,

    I tested the cache issue with your code and provided my observations. I am closing this for now, and I will be out of office for a few weeks. If you have further questions, please re-open it and my colleague will be happy to help you.

    Regards, Eric 

  • Eric,

    Thanks for your answer, but have you looked at the 262144-byte case? I do not understand: with both your code and mine, I get different observations from yours.

    Anyway, have a good holiday. I'll re-open the post later.

    Regards,

    François

  • Hi!

    I am re-opening the post on the L1D issue I encountered with DDR3 cacheability.

    I captured the result of your code by following the procedure I described above:

    I got the same result as with my code.

    Can you retry using the procedure I proposed?

  • Francois

    Eric is out for next 2 weeks. We will likely not be able to provide further guidance till he is back. 

    Sorry for the delay.

    Regards

    Mukul 

  • Mukul,

    OK, but could any of you try the code posted by Eric, follow the procedure I described in my previous post, and take a screenshot of the L1D cache view?

    If not, I will wait for Eric.

    Best Regards,

    François

  • Hi,

    I am back in the office, and thanks for your patience! Here is what I observed. I attached a Word document showing the L1D cache view for each step:

    Load program.docx

    Do you mean that in the second loop the DDR3 is still cached in L1D and marked as dirty, and that this is a problem?

    Regards, Eric

  • Hi!

    I have the same view of the L1D cache as in your document. Yes, the issue appears in the second loop: DDR3 is still cached in L1D as dirty.

    The resulting issue is coherency between DDR and the L1D cache after the memset and the cache writeback-invalidate.

    Here is the issue as seen in the DDR and L1D cache views:

    DDR view:

    L1D cache view:

    No coherency between DDR and L1D cache.

    Regards,

    François

  • In my understanding, the issue is due to dirty lines.

    Here is the definition given by TI in the sprugw0c documentation:

    "In a writeback cache, writes that reach a given level in the memory hierarchy may update
    that level, but not the levels below it. Thus, when a cache line is valid and contains
    updates that have not been sent to the next lower level, that line is said to be dirty. The
    opposite state for a valid cache line is clean."

    Am I right? Why are the lines dirty in my case?
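
One possible reading of the quoted definition, applied to the test sequence (verif reads, then memset writes, then writeback-invalidate), can be sketched as a per-line state model. This is an assumption-laden illustration, not a description of the actual hardware: `line_state_t` and the `line_*` helpers are hypothetical names, and eviction is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State of a single cache line in a read-allocate, writeback cache. */
typedef struct {
    bool valid;
    bool dirty;
} line_state_t;

/* Read: a miss allocates the line in the clean state. */
static void line_read(line_state_t *l)
{
    assert(l != NULL);
    if (!l->valid) {
        l->valid = true;
        l->dirty = false;
    }
}

/* Write: a hit makes the line dirty; a miss allocates nothing. */
static void line_write(line_state_t *l)
{
    assert(l != NULL);
    if (l->valid) {
        l->dirty = true;
    }
}

/* Writeback-invalidate: returns true if the line had to be written back. */
static bool line_wb_inv(line_state_t *l)
{
    assert(l != NULL);
    bool wrote_back = l->valid && l->dirty;
    l->valid = false;
    l->dirty = false;
    return wrote_back;
}
```

In this model, a line that is read by verif() and then written by memset() is valid and dirty until the writeback-invalidate, whereas a line that is only written never becomes valid. That matches the observation that the dirty lines appear only when the verification function is used.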

    Regards,

    François

  • Hi,

    Do you have any news?

    Regards,

    François

  • Hi,

    This question follows on from the answer given in the related post about DDR cacheability. I would now like to show you another strange difference between DDR cacheable and non-cacheable.

    I kept the code of the related post but without the verification function, which (a priori) caused the cache issue. My test performs only write accesses from the DSP to DDR, cacheable and non-cacheable; there is nothing in the L1 cache, and all writes pass through the write buffer.

    Although both configurations are expected to give the same results, the results differ, which is very strange:

    This figure was obtained with the -o3 option. With optimization off, cacheable and non-cacheable give the same results:

    The only difference I saw between DDR cacheable and non-cacheable is the number of accesses performed in the DDR memory controller:

    opt: -o3

    dbg: -oOff

    C: DDR cacheable

    NC: DDR non-cacheable

    SIZE (bytes)    opt, C    dbg, C   opt, NC   dbg, NC
    4                    1         1         1         1
    8                    2         2         2         2
    16                   4         4         4         4
    32                   8         8         8         8
    64                  16        16        16        16
    128                 25        32        32        32
    256                 35        64        64        64
    512                 58       128       128       128
    1024                76       256       256       256
    2048               111       512       512       512
    4096               182      1024      1024      1024
    8192               318      2046      2046      2046
    16384              609      4096      4096      4096
    32768             1178      8191      8191      8191
    65536             2316     16382     16382     16382
    131072            4580     32761     32761     32761
    262144            9120     65536     65536     65536
    524288           18197    131061    131061    131061
    1048576          36377    262126    262126    262126
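
One way to read the table is to divide each test size by the counted accesses, which gives the average number of bytes per controller access. The sketch below does only that arithmetic; the helper name is hypothetical and it says nothing about why the numbers differ.

```c
#include <assert.h>
#include <stdint.h>

/* Average number of bytes transferred per counted DDR controller access. */
static double bytes_per_access(uint32_t size_bytes, uint32_t accesses)
{
    assert(accesses > 0);   /* avoid dividing by zero */
    return (double)size_bytes / (double)accesses;
}
```

For the 1048576-byte row, this gives about 4 bytes per access in the three matching columns (262126 accesses), which corresponds to one 32-bit store per access, and about 28.8 bytes per access in the "opt, C" column (36377 accesses).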

    If anyone could give me an explanation, it would really help me!



    Regards,

    François

    PS: if you need any other information, tell me.

  • Francois,

    I have merged your two threads, as the question in the second thread needs the context from the first. Eric seems to have reached out to the C6x design team and also has your code set up on a K2H EVM, so he may be the best person to help with this issue.

    Regards,

    Rahul

  • Hi,

    What is the code to reproduce and analyze the behavior? E.g., how do you change the input data size? How do you get the "number of accesses performed in the DDR memory controller"?

    Regards, Eric

  • #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <math.h>
    #include <c6x.h>
    
    #include <dsplib.h>
    #include <csl_tsc.h>
    #include <csl_cache.h>
    #include <csl_cacheAux.h>
    
    
    #define DDR_TEST_START_ADDR     0xB0000000
    
    /* 1 MiB = 1024 KiB = 1048576  */
    #define DDR_SIZE_TO_TEST_MAX_BYTES    (0x100000)
    
    /* */
    #define BUS_PRIORITY_MAX (0x0)
    #define BUS_PRIORITY_MIN (0x7)
    
    #define MEM_CTRL_BASE_ADDR  (0x21010000)
    #define REG_SEL_CORE0   (0x00000000)
    #define REG_SEL_CORE1   (0x01000000)
    #define REG_SEL_CORE2   (0x02000000)
    #define REG_SEL_CORE3   (0x03000000)
    #define REG_SEL_CORE4   (0x04000000)
    #define REG_SEL_CORE5   (0x05000000)
    #define REG_SEL_CORE6   (0x06000000)
    #define REG_SEL_CORE7   (0x07000000)
    #define REG_SEL_LINUX   (0x08000000)
    
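    /*
     * Function prototypes
     */
    void ddr_write_access(Uint32* p_tab, Uint32 size_bytes);
    Uint32 verif(Uint32* p_tab, Uint32 size_bytes);
    void init_cnt(void);
    
    /* Test table pointer in DDR3A */
    Uint32* ptab0 = NULL;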
    
    #pragma DATA_ALIGN(ptab0, 8);
    
    int* cnt_1_addr;
    int* cnt_2_addr;
    
    
    int main(void)
    {
        int k;
        Uint8 pcx; // Reserved bit, do not touch
        Uint8 pfx; // prefetchability
    
        // memory initialization (DDR3 non-cacheable)
        for (k=128; k<=255; k++)
        {
           // Set PC at '0'
           CACHE_disableCaching(k);
           // Get the memory region information for MAR k
           CACHE_getMemRegionInfo (k, &pcx, &pfx);
           // prefetch, 0 disable, 1 enable
           pfx = 1;
           CACHE_setMemRegionInfo(k, pcx, pfx);
        }
    
        /* cache initialization */
        CACHE_setL2Size(CACHE_0KCACHE);
        CACHE_setL1DSize(CACHE_L1_32KCACHE);
        CACHE_setL1PSize(CACHE_L1_32KCACHE);
    
        // Variable used for the test
        Uint32 start_1=0;
        Uint32 start_2=0;
        Uint32 end_1=0;
        Uint32 end_2=0;
        Uint32 duration_1=0;
        Uint32 duration_2=0;
        Uint32 err=0;
    
        // Performance counter initialization
        init_cnt();
    
        // set priority
        Uint32 priority = BUS_PRIORITY_MIN;
        CSL_XMC_setMDMAPriority(priority);
    
        Uint32 core_id = 0;
    
        Uint32 size_to_test_bytes = 4;
    
        core_id = DNUM;
    
        // pointer in DDR3A
        ptab0 = (Uint32*) DDR_TEST_START_ADDR;
    
        if (core_id == 0)
        {
            /* Only Core 0 do the test */
    
            do
            {
                start_1 = *cnt_1_addr; start_2 = *cnt_2_addr;
                ddr_write_access(&ptab0[0], size_to_test_bytes);
                // Writeback last data in cache
                //CACHE_wbL1d(&ptab0[0], size_to_test_bytes, CACHE_WAIT); // commented because L1D is not write-allocate & cache is free at beginning.
                end_1 = *cnt_1_addr; end_2 = *cnt_2_addr;
    
                // Invalidate cache line (security to ensure empty cache)
                CACHE_invL1d(&ptab0[0], size_to_test_bytes, CACHE_WAIT);
    
                // Verification
                err = verif(&ptab0[0], size_to_test_bytes);
    
                if (err != 0)
                {
                    printf("Error: %d\n", err);
                }
    
                // reset all updated memory blocks in DDR3 memory
                memset(&ptab0[0], 0, size_to_test_bytes);
    
                // Writeback Invalidate cache (empty cache) due to verification
                CACHE_wbInvL1d(&ptab0[0], size_to_test_bytes, CACHE_WAIT);
    
                // Total read and write accesses by the DSP core #0
                duration_1 = end_1 - start_1; duration_2 = end_2 - start_2;
    
                printf("Data input: %dB, Read Accesses: %d, Write Accesses: %d\n", size_to_test_bytes, duration_1, duration_2);
    
                size_to_test_bytes *= 2;
            } while(size_to_test_bytes <= DDR_SIZE_TO_TEST_MAX_BYTES);
    
        }
    
        return 0;
    }
    
    void ddr_write_access(Uint32* p_tab, Uint32 size_bytes)
    {
        Uint32 i_word = 0;
        Uint32 nb_words = size_bytes / 4;
    
        for (i_word = 0; i_word < nb_words; i_word++)
        {
           p_tab[i_word] = i_word;
        }
    }
    
    void init_cnt(void)
    {
        // PERF_CNT_CFG
        int offset_cfg = 0x88;
        int *reg_cfg = (int*)(MEM_CTRL_BASE_ADDR + offset_cfg);
        *reg_cfg = 0x80038002; // CNT1: R, CNT2: W, DSP CORE0 MASTER ID
    
        // PERF_CNT_SEL
        int offset_sel = 0x8C;
        int *reg_sel = (int*)(MEM_CTRL_BASE_ADDR + offset_sel);
        *reg_sel = REG_SEL_CORE0; // DSP CORE0 MASTER ID
    
        // PERF_CNT_1
        int offset_cnt_1 = 0x80;
        cnt_1_addr = (int*)(MEM_CTRL_BASE_ADDR + offset_cnt_1);
    
        // PERF_CNT_2
        int offset_cnt_2 = 0x84;
        cnt_2_addr = (int*)(MEM_CTRL_BASE_ADDR + offset_cnt_2);
    }
    
    Uint32 verif(Uint32* p_tab, Uint32 size_bytes)
    {
        Uint32 i_word = 0;
        Uint32 err = 0;
        Uint32 nb_words = size_bytes / 4;
    
        for (i_word = 0; i_word < nb_words; i_word++)
        {
            if (p_tab[i_word] != i_word)
                err++;
        }
    
        return err;
    }

    You can test this one.


    Regards,

    François

  • Please refer to the spruhn7c DDR memory controller documentation: page 35 for the PERF_CNT performance counter accesses and pages 62 to 65 for the register configuration.
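
As a side note on reading such counters, here is a hedged sketch of a start/end measurement helper. It assumes the counters are free-running 32-bit registers (the posted code reads them through plain int pointers); the helper names are hypothetical, and on the target the pointer would be the PERF_CNT address cast to `const volatile uint32_t *`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a memory-mapped counter; volatile forces a real load on every call. */
static uint32_t perf_cnt_read(const volatile uint32_t *reg)
{
    assert(reg != NULL);
    return *reg;
}

/* Events between two samples; unsigned subtraction stays correct across one wrap. */
static uint32_t perf_cnt_delta(uint32_t start, uint32_t end)
{
    return end - start;
}
```

Sampling with `perf_cnt_read` before and after the write loop and passing both values to `perf_cnt_delta` gives the access count even if the counter wraps once during the measurement.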

  • Hi,

    Thanks for providing the code. I can reproduce the issue. With DDR non-cacheable, the DDR writes access the controller once every 4 bytes, whether built with -O3 or -O0.

    When DDR is cacheable, the DDR writes access the controller much less often with -O3 than with -O0.

    Whether the cache is ON or OFF, -O3 produced the same assembly code:

    I will check and update why.

    Regards, Eric 

  • Hi Eric,

    Thanks for your answer. In fact, I already asked a related question on this topic: https://e2e.ti.com/support/processors/f/791/t/810423

    I get the same result as you for the asm code, which is why I was wondering what differs between the DDR accesses in the two cases.

    Thanks for your time,

    Best Regards,

    François

  • Hi,

    I looked at the referenced thread. All those people (Rahul, George and Tom) are my colleagues in different functional units. This is not caused by the optimization, because with -O3 the disassembled code is the same regardless of whether DDR is cached or non-cached.

    Tom is our expert on the DDR controller, and I think he has already provided the explanation.

    Regards, Eric

  • Hi Eric,

    Thanks for your answer. I will stop searching for information on this subject and will keep in mind everything Tom said in his last answer.

    Regards,

    François