RTOS/TMS320C6678: EDMA3 does not copy and runs slowly

Part Number: TMS320C6678

Tool/software: TI-RTOS

I made a small project using the EDMA3 LLD. The project copies data from DDR3 memory to L2SRAM. However, the copy is not performed: the array in L2SRAM stays empty. In addition, the test shows that copying an array of 128 elements takes about 2000 cycles. I use CCS 7.3 and EDMA LLD v.2.12.5.

What could be the problem?

Why is the copy speed so low?

  • EDMA3.c
    #include <xdc/std.h>
    #include <stdio.h>
    #include <time.h>
    #include <stdlib.h>
    #include <limits.h>
    #include <math.h>
    #include <c6x.h>
    
    #include <ti/csl/csl_cacheAux.h>
    #include <ti/sdo/edma3/drv/sample/bios6_edma3_drv_sample.h>
    
    
    extern cregister volatile unsigned int DNUM;
    
    
    /* OPT Field specific defines */
    #define OPT_SYNCDIM_SHIFT                   (0x00000002u)
    #define OPT_TCC_MASK                        (0x0003F000u)
    #define OPT_TCC_SHIFT                       (0x0000000Cu)
    #define OPT_ITCINTEN_SHIFT                  (0x00000015u)
    #define OPT_TCINTEN_SHIFT                   (0x00000014u)
    #define OPT_ITCCHEN_SHIFT                   (0x00000017u)
    #define OPT_TCCHEN_SHIFT                    (0x00000016u)
    #define OPT_STATIC_SHIFT                    (0x00000003u)
    
    #define N 16384
    #define M 128
    
    /* ======================================================================== */
    /*  Kernel-specific alignments                                              */
    /* ======================================================================== */
    #pragma DATA_SECTION(x,  ".my_sect_ddr");
    #pragma DATA_ALIGN(x,  8);
    float x [2*N];
    float *const ptr_x  = x;
    
    #pragma DATA_SECTION(y, ".my_sect_l2sram");
    #pragma DATA_ALIGN(y,  64);
    float y [2*M];
    float *const ptr_y  = y;
    
    
    int main () {
        /* -------------------------------------------------------------------------------- */
        /*                                        Variables                                 */
        /* -------------------------------------------------------------------------------- */
        int i;
        clock_t t_start, t_stop, t_overhead, t_opt;
        /* -------------------------------------------------------------------------------- */
    
    
        /* -------------------------------------------------------------------------------- */
        /*                                      Initialization                              */
        /* -------------------------------------------------------------------------------- */
    
        // Initialize hardware timers
        TSCL = 0; TSCH = 0;
    
        // Compute the overhead of calling clock twice to get timing info
    
        t_start = _itoll(TSCH, TSCL);
        t_stop  = _itoll(TSCH, TSCL);
        t_overhead = t_stop - t_start;
    
        // Initialize input vector
        for (i = 0; i < 2*N; i++) {
            x[i] = (float)(i + 1);
        }
        CACHE_wbInvAllL2(CACHE_WAIT);
        /* -------------------------------------------------------------------------------- */
    
    
        /* -------------------------------------------------------------------------------- */
        /*                                          EDMA                                    */
        /* -------------------------------------------------------------------------------- */
        // EDMA variables
        EDMA3_DRV_Handle hEDMA;
        EDMA3_DRV_Result EDMA_Result = EDMA3_DRV_SOK;
        uint32_t EDMA_ID = 0;
        EDMA3_DRV_PaRAMRegs EDMA_PaRAM = {0,0,0,0,0,0,0,0,0,0,0,0};
    
        // Channel options
        uint32_t EDMA_TCC  = EDMA3_DRV_TCC_ANY; // Transfer complete code (TCC)
        uint32_t EDMA_chID = EDMA3_DRV_DMA_CHANNEL_ANY; // Channel ID
    
        // Initialize EDMA
        hEDMA = edma3init(EDMA_ID, &EDMA_Result);
    
        // Flush the Source Buffer
        if (EDMA_Result == EDMA3_DRV_SOK) {
            EDMA_Result = Edma3_CacheFlush((unsigned int)ptr_x, 2*N*sizeof(float));
        }
    
        /* Invalidate the Destination Buffer */
        if (EDMA_Result == EDMA3_DRV_SOK) {
            EDMA_Result = Edma3_CacheInvalidate((unsigned int)ptr_y, 2*M*sizeof(float));
        }
    
        // Request a Channel from Resource Manager
        EDMA_Result = EDMA3_DRV_requestChannel (hEDMA, &EDMA_chID, &EDMA_TCC, (EDMA3_RM_EventQueue)0, NULL, NULL);
        if (EDMA_Result == EDMA3_DRV_SOK) {
            printf("DMA channel %d request successful.\n", EDMA_chID);
        } else {
            printf("DMA channel %d request failed!\n", EDMA_chID);
        }
    
        if (EDMA_Result == EDMA3_DRV_SOK) {
            // OPT field of PaRAM Set
            EDMA_PaRAM.opt &= 0xFFFFFFFCu; // Src & Dest are in INCR modes
            EDMA_PaRAM.opt |= ((EDMA_TCC << OPT_TCC_SHIFT) & OPT_TCC_MASK);   // Program the TCC
            EDMA_PaRAM.opt |= (1 << OPT_TCINTEN_SHIFT); // Enable Final transfer completion interrupt
            EDMA_PaRAM.opt |= (1 << OPT_SYNCDIM_SHIFT); // AB-Sync Transfer Mode
            EDMA_PaRAM.opt |= (1 << OPT_STATIC_SHIFT); // Set the static bit
    
            EDMA_PaRAM.srcAddr    = (unsigned int)(ptr_x);
            EDMA_PaRAM.destAddr   = (unsigned int)(ptr_y);
            EDMA_PaRAM.srcBIdx    = (short)(2*M*sizeof(float));
            EDMA_PaRAM.destBIdx   = (short)(2*sizeof(float));
            EDMA_PaRAM.srcCIdx    = 0;
            EDMA_PaRAM.destCIdx   = 0;
            EDMA_PaRAM.aCnt       = (unsigned short)(2*sizeof(float));
            EDMA_PaRAM.bCnt       = (unsigned short)(M);
            EDMA_PaRAM.cCnt       = 1;
            EDMA_PaRAM.bCntReload = 0;
            EDMA_PaRAM.linkAddr   = 0xFFFFu;
    
            // Write the PaRAM Set.
            EDMA_Result = EDMA3_DRV_setPaRAM(hEDMA, EDMA_chID, &EDMA_PaRAM);
            if (EDMA_Result != EDMA3_DRV_SOK) {
                printf("EDMA3 set PaRAM Failed with error code: %d!\n", EDMA_Result);
            }
        }
    
        t_start = _itoll(TSCH, TSCL);
        EDMA_Result = EDMA3_DRV_enableTransfer (hEDMA, EDMA_chID, EDMA3_DRV_TRIG_MODE_MANUAL);
        EDMA_Result = EDMA3_DRV_waitAndClearTcc(hEDMA, EDMA_TCC);
        t_stop = _itoll(TSCH, TSCL);
        t_opt  = (t_stop - t_start) - t_overhead;
        if (EDMA_Result != EDMA3_DRV_SOK) {
            printf("EDMA3 Transfer Failed with error code: %d!\n", EDMA_Result);
        } else {
            printf("\tEDMA3 Transfer cycles: %d\n", t_opt);
        }
        EDMA_Result = EDMA3_DRV_freeChannel (hEDMA, EDMA_chID);
        if (EDMA_Result != EDMA3_DRV_SOK) {
                printf("EDMA3 Free Channel Failed with error code: %d!\n", EDMA_Result);
        }
        /* -------------------------------------------------------------------------------- */
    
        printf("STOP\n");
    }
    

    EDMA_WS.rar

  • Hi,

    Please post the Processor SDK RTOS version that you are using.

    Best Regards,
    Yordan
  • Hi,

    There are existing examples in EDMA LLD, check processors.wiki.ti.com/.../II_devices

    Q. What are the software building blocks: EDMA LLD, EDMA CSL, and StarterWare?
    Try the example first to confirm that it moves data from A to B, then add your TSCL/TSCH profiling code.

    Besides, several RTOS driver examples, such as PCIE and Hyperlink, use EDMA; they all use PCIE LLD. You can check how they move data.

    Regards, Eric
  • Hi, I use CCS v7.3 with Processor SDK RTOS v04.02.00.

  • Have you tried any of the EDMA examples mentioned in the 01/02 post?

    Regards, Eric
  • Hi, Eric.
    Thank you for reply!
    Yes, I tried the EDMA examples. However, they use A-synchronization, while I use AB-synchronization. My project is based on these examples, but for some reason it does not work. My first question is: why is the copy not happening? My second question is: why is the copy so slow (about 15 clock cycles per element)? And yes, I took the measurements by adding TSCL/TSCH profiling to my code.
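    To make the AB-synchronized addressing concrete, here is a minimal software model of a single AB-sync frame (cCnt = 1). It is only a host-side sketch, not EDMA3 LLD code: the function name `edma_ab_sync_model` and its parameters are hypothetical and simply mirror the PaRAM fields aCnt, bCnt, srcBIdx and destBIdx.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical host-side model of one AB-synchronized frame (cCnt == 1).
 * One trigger moves bcnt arrays of acnt contiguous bytes each; the start
 * of array b is offset by b * src_bidx in the source and b * dst_bidx in
 * the destination (all values in bytes, as in the PaRAM set).
 */
static void edma_ab_sync_model(unsigned char *dst, const unsigned char *src,
                               size_t acnt, size_t bcnt,
                               ptrdiff_t src_bidx, ptrdiff_t dst_bidx)
{
    size_t b;

    for (b = 0; b < bcnt; b++) {
        memcpy(dst + (ptrdiff_t)b * dst_bidx,
               src + (ptrdiff_t)b * src_bidx,
               acnt);
    }
}
```

    With the values from the original post (aCnt = 8, bCnt = 128, srcBIdx = 1024, destBIdx = 8), this model picks the first two floats out of every 256 floats of x and packs them back to back into y.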

  • Hi,

    If the example is A-sync, you can change the OPT field to AB-sync. First, did the existing example work as expected and move the data? Then, how are you moving the data: inside the chip, or between two devices over an interface link such as PCIE or Hyperlink? If the address is in L2, did you convert it to a global address? As for the TSCL/TSCH slowness, did you program the DSP main PLL so that the time per CPU cycle is what you expect?
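    As an illustration of the global-address point, the sketch below converts a core-local L2 address into the global alias that the EDMA can reach. The helper name `l2_to_global` is hypothetical, and the constants (local L2 at 0x00800000, 512 KB, global alias at 0x10000000 plus the core number shifted left by 24) are my reading of the C6678 memory map; please verify them against the device data manual. On the target, the core number would come from DNUM.

```c
#include <stdint.h>

/* Assumed C6678 memory-map constants; check the data manual. */
#define L2_LOCAL_BASE  0x00800000u  /* core-local L2 SRAM start        */
#define L2_LOCAL_SIZE  0x00080000u  /* 512 KB of L2 SRAM per CorePac   */
#define GLOBAL_ALIAS   0x10000000u  /* base of the global L2 aliases   */

/*
 * Hypothetical helper: map a core-local L2 address to its global alias.
 * Addresses outside local L2 (MSMSRAM, DDR3, ...) are already global
 * and are returned unchanged.
 */
static uint32_t l2_to_global(uint32_t addr, uint32_t core_num)
{
    if (addr >= L2_LOCAL_BASE && addr < L2_LOCAL_BASE + L2_LOCAL_SIZE) {
        return GLOBAL_ALIAS | (core_num << 24) | addr;
    }
    return addr;
}
```

    In the posted code this would be applied to the L2 buffer before it is written to the PaRAM set, for example `EDMA_PaRAM.destAddr = l2_to_global((uint32_t)ptr_y, DNUM);`.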

    Regards, Eric
  • Hello!

    1. Yes, the existing example worked as expected by moving the data.
    2. I move the data inside the chip.
    3. Yes, I forgot to make the address in L2 global. Now the data is copied correctly. Thank you, lding!
    4. I do the measurement of time in CPU cycles in the following way:
        // Initialize hardware timers
        TSCL = 0; TSCH = 0;
    
        // Compute the overhead of calling clock twice to get timing info
        t_start = _itoll(TSCH, TSCL);
        t_stop  = _itoll(TSCH, TSCL);
        t_overhead = t_stop - t_start;
    
        t_start = _itoll(TSCH, TSCL);
        EDMA_Result = EDMA3_DRV_enableTransfer (hEDMA, EDMA_chID, EDMA3_DRV_TRIG_MODE_MANUAL);
        EDMA_Result = EDMA3_DRV_waitAndClearTcc(hEDMA, EDMA_TCC);
        t_stop = _itoll(TSCH, TSCL);
        t_opt  = (t_stop - t_start) - t_overhead;

    Now 256 floating-point samples are copied in 1300 cycles, which is approximately 5 cycles per sample.

    The resulting copy speed is 4 bytes / 5 cycles * 1 GHz = 800 MB/s. Table 15 of the document "Throughput Performance Guide for C66x KeyStone Devices" states 10664 MB/s.

    Why do I get such a low EDMA speed?

  • I made a project that copies 2x4096 floating-point samples from one memory area to another and measures the copy speed. I ran the code with the Blackhawk XDS560v2 Emulator on the TMDSEVM6678LE evaluation board.
    With different memory combinations I got the following results:

    Source     Destination   Measured Speed, MB/s   Speed in Document, MB/s
    DDR3       MSMSRAM       8000                   10664
    MSMSRAM    DDR3          9000                   10664
    DDR3       L2SRAM        4985                   10664
    L2SRAM     DDR3          5033                   10664

    I estimated the copy speed as follows: speed[MB/s] = 2*4096*sizeof(float)*1000[MHz]/t_opt.
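    The estimate above can be written as a small helper. This is a sketch only: the function name `throughput_mb_per_s` is hypothetical, the 1000 MHz clock is the value assumed in the formula, and MB here means 10^6 bytes.

```c
#include <stdint.h>

/*
 * Hypothetical helper: throughput in MB/s (1 MB = 10^6 bytes) from the
 * number of bytes moved, the measured CPU cycles and the CPU clock in MHz.
 * bytes / cycles gives bytes per cycle; multiplying by cycles per
 * microsecond (the clock in MHz) gives bytes per microsecond, i.e. MB/s.
 */
static double throughput_mb_per_s(uint64_t bytes, uint64_t cycles,
                                  double cpu_mhz)
{
    if (cycles == 0) {
        return 0.0;  /* avoid dividing by zero on a bad measurement */
    }
    return (double)bytes * cpu_mhz / (double)cycles;
}
```

    For the earlier single-buffer measurement (256 floats, i.e. 1024 bytes, in 1300 cycles at 1000 MHz) this gives roughly 788 MB/s, in line with the 800 MB/s estimate.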

    My project is EDMA3.zip

    Why are the results very different from the results in the document "Throughput Performance Guide for C66x KeyStone Devices"?

    Why does the speed depend on the direction of the copy?

  • Hi,

    How many EDMA transfers run in parallel in your test? One EDMA channel is not enough; you need at least 3 EDMA transfers in parallel to reach the speed in the document.

    What is your goal here: to reproduce the benchmark in the document, or to get the best throughput for your single-channel application? You can also build the code with "-O3" optimization.

    Regards, Eric
  • Hi,

    The results were obtained with "-O3" optimization.

    Yes, my goal is to get the maximum bandwidth when using one EDMA channel.

    • Are the results I received the best achievable?
    • Why does the copy speed from DDR3 differ between L2SRAM and MSMSRAM?
    • How is a parallel transfer over multiple channels carried out?
  • Hi,

    I believe the numbers you quote came from www.ti.com/.../sprabk5b.pdf. Those numbers were obtained by using several channels in parallel. With just one channel, the typical throughput is about 5000-6000 MB/s. Your results look good.

    To run multiple channels in parallel, you can set up several channels first, then enable the transfers one after another from a single core. As long as the previous transfer has not finished, several EDMA transfers overlap. Alternatively, you can use multiple cores, with each core starting a transfer and doing so continuously.

    Regards, Eric
  • Do I understand correctly that, to use two channels in parallel, I should split the data into 2 blocks, so that the first channel copies the first block and the second channel copies the second?
  • Yes, that is right.
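    A minimal sketch of that split, assuming the data is a contiguous array of elements: the `block_t` type and `split_into_blocks` function are hypothetical names, not part of the EDMA3 LLD. Each resulting block would then be described by its own PaRAM set on its own channel.

```c
#include <stddef.h>

/* Hypothetical descriptor of one block: start element and element count. */
typedef struct {
    size_t offset;
    size_t count;
} block_t;

/*
 * Split `total` elements into `num_blocks` contiguous blocks of nearly
 * equal size; the first (total % num_blocks) blocks get one extra element.
 * Returns 0 on success, -1 on invalid arguments.
 */
static int split_into_blocks(size_t total, size_t num_blocks, block_t *blocks)
{
    size_t base, extra, offset = 0, i;

    if (num_blocks == 0 || blocks == NULL) {
        return -1;
    }
    base = total / num_blocks;
    extra = total % num_blocks;
    for (i = 0; i < num_blocks; i++) {
        blocks[i].offset = offset;
        blocks[i].count = base + (i < extra ? 1u : 0u);
        offset += blocks[i].count;
    }
    return 0;
}
```

    For the 2x4096-sample buffer and two channels, this yields two blocks of 4096 elements each, starting at elements 0 and 4096.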

    Regards, Eric
  • Thank you, Eric!