66AK2H12: DDR3 to L2 memcpy throughput

Part Number: 66AK2H12

Hello all,

I have a question about the throughput of memcpy.

I measured memcpy throughput for transfer sizes from 16 bytes to 1 MB and between different memory locations (DDR to L2, L2 to DDR, MSMC to L2, L2 to MSMC).

I noticed that DDR to L2 throughput is very poor, saturating at around 85-90 MB/s, whereas the other cases are far higher (for example, L2 to DDR can reach 2.4 GB/s). With DMA there is no such problem: the speed goes up to 6.6 GB/s.

What could be the reason for that? Are those results wrong?

In my setup the CPU runs at 1.22 GHz and the DDR GEL file setting is ddr3A_64bit_DDR1600_setup().
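
For scale, the theoretical peak of the DDR3 interface in this setup can be worked out from the GEL setting. This is only a back-of-envelope sketch: it assumes that ddr3A_64bit_DDR1600_setup() configures a 64-bit data bus at 1600 MT/s, and `ddr_peak_mb_per_s` is a name made up for illustration.

```c
#include <assert.h>

/* Theoretical peak bandwidth in MB/s (1 MB = 10^6 bytes) of a DDR
 * interface: mega-transfers per second times bytes per transfer.
 * Assumption: "DDR1600" means 1600 MT/s and "64bit" means an 8-byte bus. */
static double ddr_peak_mb_per_s(double mega_transfers_per_s, unsigned bus_bits)
{
    assert(bus_bits % 8 == 0);   /* bus width must be a whole number of bytes */
    return mega_transfers_per_s * (bus_bits / 8.0);
}
```

Under those assumptions `ddr_peak_mb_per_s(1600.0, 64)` gives 12800 MB/s, so a memcpy that saturates at about 90 MB/s is using well under 1% of the interface's peak.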

Best,

samseytani

  • Sam,

    Please look at the expected memory throughput numbers for the device that we have published in the KeyStone II throughput app note:

    http://www.ti.com/lit/an/sprabk5b/sprabk5b.pdf

    85-90 MB/s seems fairly low, but I can check internally whether something could cause the number to be this slow. It would help if you could describe how you are measuring this and share the code so we can review it for any issues.

    Regards,

    Rahul

  • Hi,

    I have attached the project; here are some sample results:

    For DDR to L2 memcpy:

    59.041393 MB/s with size 16 bytes
    81.242905 MB/s with size 32 bytes
    87.869431 MB/s with size 64 bytes
    92.466942 MB/s with size 128 bytes
    70.690460 MB/s with size 256 bytes
    94.707977 MB/s with size 512 bytes
    95.188004 MB/s with size 1024 bytes
    91.792397 MB/s with size 2048 bytes
    89.014740 MB/s with size 4096 bytes
    91.251534 MB/s with size 8192 bytes
    90.612068 MB/s with size 16384 bytes
    90.640015 MB/s with size 32768 bytes
    90.045395 MB/s with size 65536 bytes
    89.969650 MB/s with size 131072 bytes
    89.973618 MB/s with size 262144 bytes
    90.096855 MB/s with size 524288 bytes
    85.062866 MB/s with size 1048566 bytes

    For L2 to DDR memcpy:

    333.233612 MB/s with size 16 bytes
    624.151855 MB/s with size 32 bytes
    1139.755615 MB/s with size 64 bytes
    1655.644897 MB/s with size 128 bytes
    2296.149902 MB/s with size 256 bytes
    2808.683594 MB/s with size 512 bytes
    757.549744 MB/s with size 1024 bytes
    1792.436279 MB/s with size 2048 bytes
    1691.818726 MB/s with size 4096 bytes
    1609.066772 MB/s with size 8192 bytes
    1473.623535 MB/s with size 16384 bytes
    1602.662231 MB/s with size 32768 bytes
    1565.432983 MB/s with size 65536 bytes
    1579.464478 MB/s with size 131072 bytes
    1533.411865 MB/s with size 262144 bytes
    1497.515381 MB/s with size 524288 bytes
    752.195862 MB/s with size 1048566 bytes

    For MSMC to L2 memcpy:

    132.843140 MB/s with size 16 bytes
    538.651611 MB/s with size 32 bytes
    995.482788 MB/s with size 64 bytes
    1299.886475 MB/s with size 128 bytes
    2016.490723 MB/s with size 256 bytes
    2557.500488 MB/s with size 512 bytes
    2905.981934 MB/s with size 1024 bytes
    3141.798340 MB/s with size 2048 bytes
    3278.932129 MB/s with size 4096 bytes
    3345.404297 MB/s with size 8192 bytes
    3381.932129 MB/s with size 16384 bytes
    3401.358887 MB/s with size 32768 bytes
    3351.390869 MB/s with size 65536 bytes
    2753.982178 MB/s with size 131072 bytes
    2756.668945 MB/s with size 262144 bytes
    2758.061768 MB/s with size 524288 bytes
    2096.768555 MB/s with size 1048566 bytes

    For L2 to MSMC memcpy:

    139.438187 MB/s with size 16 bytes
    504.122681 MB/s with size 32 bytes
    936.227905 MB/s with size 64 bytes
    1442.993408 MB/s with size 128 bytes
    2097.150146 MB/s with size 256 bytes
    2688.654297 MB/s with size 512 bytes
    3069.000488 MB/s with size 1024 bytes
    3391.617676 MB/s with size 2048 bytes
    3574.688232 MB/s with size 4096 bytes
    3649.863037 MB/s with size 8192 bytes
    3687.297363 MB/s with size 16384 bytes
    3709.718750 MB/s with size 32768 bytes
    3725.507813 MB/s with size 65536 bytes
    3139.532471 MB/s with size 131072 bytes
    3142.625977 MB/s with size 262144 bytes
    3144.205811 MB/s with size 524288 bytes
    3019.879395 MB/s with size 1048566 bytes

    For DDR to L2 DMA
    33.379936 MB/s with size 16 bytes
    86.611382 MB/s with size 32 bytes
    173.222763 MB/s with size 64 bytes
    311.457977 MB/s with size 128 bytes
    622.915955 MB/s with size 256 bytes
    1131.556030 MB/s with size 512 bytes
    1912.295166 MB/s with size 1024 bytes
    2919.466797 MB/s with size 2048 bytes
    4128.926270 MB/s with size 4096 bytes
    5073.750977 MB/s with size 8192 bytes
    5729.266602 MB/s with size 16384 bytes
    6077.778809 MB/s with size 32768 bytes
    5828.790527 MB/s with size 65536 bytes
    6169.270508 MB/s with size 131072 bytes
    6038.924805 MB/s with size 262144 bytes
    6216.956543 MB/s with size 524288 bytes
    6336.812988 MB/s with size 1048566 bytes

    For L2 to DDR DMA
    48.786064 MB/s with size 16 bytes
    97.572128 MB/s with size 32 bytes
    195.144257 MB/s with size 64 bytes
    390.288513 MB/s with size 128 bytes
    780.577026 MB/s with size 256 bytes
    1385.782104 MB/s with size 512 bytes
    2263.112061 MB/s with size 1024 bytes
    3311.289795 MB/s with size 2048 bytes
    4309.212891 MB/s with size 4096 bytes
    5207.616211 MB/s with size 8192 bytes
    5813.642090 MB/s with size 16384 bytes
    6172.816895 MB/s with size 32768 bytes
    5960.812500 MB/s with size 65536 bytes
    6242.437988 MB/s with size 131072 bytes
    6505.417969 MB/s with size 262144 bytes
    6475.080566 MB/s with size 524288 bytes
    6423.819336 MB/s with size 1048566 bytes

    For MSMC to L2 DMA
    55.854504 MB/s with size 16 bytes
    111.709007 MB/s with size 32 bytes
    223.418015 MB/s with size 64 bytes
    390.288513 MB/s with size 128 bytes
    780.577026 MB/s with size 256 bytes
    1385.782104 MB/s with size 512 bytes
    2263.112061 MB/s with size 1024 bytes
    3311.289795 MB/s with size 2048 bytes
    4505.962891 MB/s with size 4096 bytes
    5348.735840 MB/s with size 8192 bytes
    5900.540039 MB/s with size 16384 bytes
    6172.816895 MB/s with size 32768 bytes
    6369.577637 MB/s with size 65536 bytes
    6459.499023 MB/s with size 131072 bytes
    6505.417969 MB/s with size 262144 bytes
    6528.623535 MB/s with size 524288 bytes
    6541.919434 MB/s with size 1048566 bytes

    For L2 to MSMC DMA
    55.854504 MB/s with size 16 bytes
    111.709007 MB/s with size 32 bytes
    223.418015 MB/s with size 64 bytes
    390.288513 MB/s with size 128 bytes
    780.577026 MB/s with size 256 bytes
    1385.782104 MB/s with size 512 bytes
    2263.112061 MB/s with size 1024 bytes
    3311.289795 MB/s with size 2048 bytes
    4505.962891 MB/s with size 4096 bytes
    5348.735840 MB/s with size 8192 bytes
    5813.642090 MB/s with size 16384 bytes
    6172.816895 MB/s with size 32768 bytes
    6369.577637 MB/s with size 65536 bytes
    6459.499023 MB/s with size 131072 bytes
    6505.417969 MB/s with size 262144 bytes
    6528.623535 MB/s with size 524288 bytes
    6538.534180 MB/s with size 1048566 bytes

    The CPU core clock is 1228799 kHz. DDR runs at its maximum speed, which I set in the GEL file with ddr3A_64bit_DDR1600_setup().
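
The throughput figures above are derived from TSC cycle counts. As a minimal sketch of that conversion, assuming the TSC ticks at the CPU clock (1228799 kHz here) and that 1 MB means 10^6 bytes; `cycles_to_mb_per_s` is an illustrative name, not a function from the attached project:

```c
#include <assert.h>
#include <stdint.h>

/* Convert an elapsed cycle count into throughput in MB/s (1 MB = 10^6 bytes).
 * cpu_khz is the CPU clock in kHz, so cycles / cpu_khz is the elapsed time in ms. */
static double cycles_to_mb_per_s(uint64_t cycles, uint32_t bytes, double cpu_khz)
{
    assert(cycles > 0 && cpu_khz > 0.0);
    double time_ms = (double)cycles / cpu_khz;
    return ((double)bytes / time_ms) * 1e-3;   /* bytes per ms -> MB/s */
}
```

This sketch does the arithmetic in double rather than float, which keeps more precision when the cycle count grows large for the biggest transfers.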

    DDR caching is also disabled, and L2 is used as RAM. EDMA 0 with channel 0 is used and manually triggered. Transfer completion is detected by polling the IPR bit.
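
One way to read the roughly 90 MB/s plateau: if the CPU's reads from non-cached DDR are serviced one access at a time, throughput is bounded by access size divided by round-trip time. The sketch below inverts that relation to estimate the implied cycles per access. It is an estimate, not a diagnosis: the 8-byte access width is an assumption (suggested only by the 8-byte alignment hints in the code), and `implied_cycles_per_access` is a hypothetical helper.

```c
#include <assert.h>

/* Estimate how many CPU cycles each access takes if a copy that moves
 * access_bytes per access sustains mb_per_s (1 MB = 10^6 bytes).
 * Assumption: accesses are fully serialized, with no overlap between them. */
static double implied_cycles_per_access(double mb_per_s, unsigned access_bytes,
                                        double cpu_khz)
{
    assert(mb_per_s > 0.0 && cpu_khz > 0.0);
    double seconds_per_access = (double)access_bytes / (mb_per_s * 1e6);
    return seconds_per_access * cpu_khz * 1e3;   /* kHz -> cycles per second */
}
```

With 90 MB/s, 8-byte accesses and a 1228799 kHz clock this comes out at roughly 109 cycles per access, which is the order of magnitude one might expect for an uncached round trip to external memory.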

    Thank you.

    Best,

    #include <stdio.h>   /* printf */
    #include <string.h>  /* memcpy */
    #include <memory.h>
    
    /*PROTOTYPES*/
    CSL_Uint64 test(int size, int cases);
    void edma_init();
    
    /*Starting addresses of mems*/
    volatile Uint8 *ptab0 = (volatile Uint8*)ddr_copy;
    volatile Uint8 *ptab1 = (volatile Uint8*)msmc_copy;
    volatile int *edma_cc0_ipr = (volatile int*)0x02701068;
    
    CSL_Uint64 start, end, duration=0;
    
    int main()
    {
    
        Uint32 priority = 7;
        Uint32 size_byte,cases;
        float cpu_ms = 1/(float)CPU_CLOCK_KHZ, time_ms=0, throughput=0;
    
        memory_init(); // memory initialization
        edma_init();
    
        // Csl Timer Enable
        CSL_tscEnable();
        CSL_XMC_setMDMAPriority(priority);
    
        printf("DSP CorePac priority: %d\n", (int)CSL_XMC_getMDMAPriority());
    
        _nassert( ((int)ptab0 & 0x7) == 0 );
        _nassert( ((int)ptab1 & 0x7) == 0 );
        _nassert( ((int)l2_copy & 0x7) == 0 );
    
    #pragma MUST_ITERATE (MAX_CASE,MAX_CASE,MAX_CASE)
    
        for(cases=0; cases<MAX_CASE; cases++) //0 8 ++
        {
            memory_set(cases);
            // 17 sizes: 16 bytes up to MAX_SIZE, doubling each time (assumes MAX_SIZE is 1 MB)
    #pragma MUST_ITERATE (17,17,17)
            for(size_byte=16; size_byte<MAX_SIZE+1; size_byte=size_byte*2)
            {
                // 10 is an arbitrary choice: MAX_SIZE is 1 MB, but L2 cannot handle it properly, so a small reduction is needed
                if(size_byte== MAX_SIZE){size_byte=MAX_SIZE-10;}
    
                duration= test(size_byte, cases);
    
                time_ms = (float)(duration*cpu_ms);
                throughput = ((size_byte)/(float)time_ms)*1e-3;
                printf("%f MB/s with size %u bytes\n", throughput, (unsigned)size_byte);
            }
        }
        return 0;
    }
    
    CSL_Uint64 test(int size ,int cases) // returns the elapsed cycles
    {
        if(cases == 0 || cases == 1 || cases == 2 || cases == 3)
        {
            start = CSL_tscRead();
        }
    
        if(cases == 0) //ddr to l2
        {
            memcpy(l2_copy, ptab0, size);
        }
        else if(cases == 1) //l2 to ddr
        {
            memcpy(ptab0, l2_copy,  size);
        }
        else if(cases == 2) //msmc to l2
        {
            memcpy(l2_copy, ptab1, size);
        }
        else if(cases == 3) //l2 to msmc
        {
            memcpy(ptab1, l2_copy, size);
        }
        else if(cases == 4) //ddr to l2 dma
        {
    
            CSL_Edma3ParamSetup myParamSetup =
            {
             CSL_EDMA3_OPT_MAKE(CSL_EDMA3_ITCCH_DIS,
                                CSL_EDMA3_TCCH_DIS,
                                CSL_EDMA3_ITCINT_DIS,
                                CSL_EDMA3_TCINT_EN,
                                CSL_EDMA3_CHA_4,
                                CSL_EDMA3_TCC_NORMAL,
                                CSL_EDMA3_FIFOWIDTH_NONE,
                                CSL_EDMA3_STATIC_DIS,
                                CSL_EDMA3_SYNC_AB,
                                CSL_EDMA3_ADDRMODE_INCR,
                                CSL_EDMA3_ADDRMODE_INCR
             ),
             (Uint32)ptab0, //src
             CSL_EDMA3_CNT_MAKE(16,size/16), //acnt bcnt  size/8 8
             (Uint32)l2_global_address ((Uint32)l2_copy), //dst
             CSL_EDMA3_BIDX_MAKE(16,16), //srcbidx dstbidx size/8 size/8
             CSL_EDMA3_LINKBCNTRLD_MAKE(0xFFFF,3),
             CSL_EDMA3_CIDX_MAKE(0,0),
             1
            };
            status = CSL_edma3ParamSetup(hParam,&myParamSetup);
            start = CSL_tscRead();
            status = CSL_edma3HwChannelControl(hChannel,CSL_EDMA3_CMD_CHANNEL_SET,NULL);
            while(*edma_cc0_ipr != 1<< CSL_EDMA3_CHA_4){};
            end = CSL_tscRead();
            duration = end - start;
            status = CSL_edma3HwControl(hModule, CSL_EDMA3_CMD_INTRPEND_CLEAR, &edmaIntr);
        }
        else if(cases == 5) //l2 to ddr dma
        {
            CSL_Edma3ParamSetup myParamSetup =
            {
             CSL_EDMA3_OPT_MAKE(CSL_EDMA3_ITCCH_DIS,
                                CSL_EDMA3_TCCH_DIS,
                                CSL_EDMA3_ITCINT_DIS,
                                CSL_EDMA3_TCINT_EN,
                                CSL_EDMA3_CHA_4,
                                CSL_EDMA3_TCC_NORMAL,
                                CSL_EDMA3_FIFOWIDTH_NONE,
                                CSL_EDMA3_STATIC_DIS,
                                CSL_EDMA3_SYNC_AB,
                                CSL_EDMA3_ADDRMODE_INCR,
                                CSL_EDMA3_ADDRMODE_INCR
             ),
             (Uint32)l2_global_address ((Uint32)l2_copy), //src
             CSL_EDMA3_CNT_MAKE(16,size/16), //acnt bcnt
             (Uint32)ptab0, //dst
             CSL_EDMA3_BIDX_MAKE(16,16), //srcbidx dstbidx
             CSL_EDMA3_LINKBCNTRLD_MAKE(0xFFFF,3),
             CSL_EDMA3_CIDX_MAKE(0,0),
             1
            };
            status = CSL_edma3ParamSetup(hParam,&myParamSetup);
            start = CSL_tscRead();
            status = CSL_edma3HwChannelControl(hChannel,CSL_EDMA3_CMD_CHANNEL_SET,NULL);
            while(*edma_cc0_ipr != 1<<CSL_EDMA3_CHA_4){};
            end = CSL_tscRead();
            duration = end - start;
            status = CSL_edma3HwControl(hModule, CSL_EDMA3_CMD_INTRPEND_CLEAR, &edmaIntr);
    
        }
        else if(cases == 6) //msmc to l2 dma
        {
            CSL_Edma3ParamSetup myParamSetup =
            {
             CSL_EDMA3_OPT_MAKE(CSL_EDMA3_ITCCH_DIS,
                                CSL_EDMA3_TCCH_DIS,
                                CSL_EDMA3_ITCINT_DIS,
                                CSL_EDMA3_TCINT_EN,
                                CSL_EDMA3_CHA_4,
                                CSL_EDMA3_TCC_NORMAL,
                                CSL_EDMA3_FIFOWIDTH_NONE,
                                CSL_EDMA3_STATIC_DIS,
                                CSL_EDMA3_SYNC_AB,
                                CSL_EDMA3_ADDRMODE_INCR,
                                CSL_EDMA3_ADDRMODE_INCR
             ),
             (Uint32)ptab1, //src
             CSL_EDMA3_CNT_MAKE(16,size/16), //acnt bcnt
             (Uint32)l2_global_address ((Uint32)l2_copy), //dst
             CSL_EDMA3_BIDX_MAKE(16,16), //srcbidx dstbidx
             CSL_EDMA3_LINKBCNTRLD_MAKE(0xFFFF,3),
             CSL_EDMA3_CIDX_MAKE(0,0),
             1
            };
            status = CSL_edma3ParamSetup(hParam,&myParamSetup);
            start = CSL_tscRead();
            status = CSL_edma3HwChannelControl(hChannel,CSL_EDMA3_CMD_CHANNEL_SET,NULL);
            while(*edma_cc0_ipr != 1<<CSL_EDMA3_CHA_4){};
            end = CSL_tscRead();
            duration = end - start;
            status = CSL_edma3HwControl(hModule, CSL_EDMA3_CMD_INTRPEND_CLEAR, &edmaIntr);
    
        }
        else if(cases == 7) //l2 to msmc dma
        {
            CSL_Edma3ParamSetup myParamSetup =
            {
             CSL_EDMA3_OPT_MAKE(CSL_EDMA3_ITCCH_DIS,
                                CSL_EDMA3_TCCH_DIS,
                                CSL_EDMA3_ITCINT_DIS,
                                CSL_EDMA3_TCINT_EN,
                                CSL_EDMA3_CHA_4,
                                CSL_EDMA3_TCC_NORMAL,
                                CSL_EDMA3_FIFOWIDTH_NONE,
                                CSL_EDMA3_STATIC_DIS,
                                CSL_EDMA3_SYNC_AB,
                                CSL_EDMA3_ADDRMODE_INCR,
                                CSL_EDMA3_ADDRMODE_INCR
             ),
             (Uint32)l2_global_address ((Uint32)l2_copy), //src
             CSL_EDMA3_CNT_MAKE(16,size/16), //acnt bcnt
             (Uint32)ptab1, //dst
             CSL_EDMA3_BIDX_MAKE(16,16), //srcbidx dstbidx
             CSL_EDMA3_LINKBCNTRLD_MAKE(0xFFFF,3),
             CSL_EDMA3_CIDX_MAKE(0,0),
             1
            };
            status = CSL_edma3ParamSetup(hParam,&myParamSetup);
            start = CSL_tscRead();
            status = CSL_edma3HwChannelControl(hChannel,CSL_EDMA3_CMD_CHANNEL_SET,NULL);
            while(*edma_cc0_ipr != 1<<CSL_EDMA3_CHA_4){};
            end = CSL_tscRead();
            duration = end - start;
            status = CSL_edma3HwControl(hModule, CSL_EDMA3_CMD_INTRPEND_CLEAR, &edmaIntr);
    
        }
    
        if(cases == 0 || cases == 1 || cases == 2 || cases == 3)
        {
            end = CSL_tscRead();
            duration = end - start;
        }
        return duration;
    }
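
The completion polls above compare the whole IPR value against a single bit, which would never match if any other pending bit happened to be set at the same time. A masked test is a common alternative. The sketch below takes the register value as a plain argument so it can run anywhere; `tcc_pending` is a hypothetical helper, not a CSL function.

```c
#include <assert.h>
#include <stdint.h>

/* Return nonzero when the interrupt-pending bit for `tcc` is set,
 * regardless of any other bits in the register value. */
static int tcc_pending(uint32_t ipr_value, unsigned tcc)
{
    assert(tcc < 32);   /* only the low 32 completion codes are covered here */
    return (ipr_value & ((uint32_t)1 << tcc)) != 0;
}
```

On hardware the wait would then read `while (!tcc_pending(*edma_cc0_ipr, CSL_EDMA3_CHA_4)) {}`, assuming the completion code is below 32 and is therefore reported in IPR rather than IPRH.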
    
    

    test_priority_metu_e2e.tar.gz

  • Can you please indicate whether this issue has been resolved, or are you still trying to root-cause the memcpy performance issue between L2 and DDR?

  • I don't know why, but the results are now normal, as I expected.

    Best,