C6747 EDMA: processing times from external memory to L2 memory

Hi,

  I am using the EDMA3 LLD library (edma3_lld_01_10_00_01). I need to optimize a copy from external to internal memory, for example a 16K byte array. What do you suggest?

If this array is contiguous (1-dimensional), what should ACNT, BCNT, and CCNT be? What is the basis for choosing these values for a single 1-D array?

My results are also a bit confusing:

Why does a 1K~2K byte DMA take more processing cycles than a 4K one?

Why does a 128-byte DMA take about 6K cycles?

Why does an L2-to-external DMA take fewer cycles than an external-memory-to-L2 DMA?

Also, which is the best way to DMA? Is it QDMA? Can QDMA be chained?

 

-------------------- External Memory to Internal L2 -------------------------

Bytes   From Setup to DMA completion   From Enable to DMA completion
16K     51,042                         24,514
8K      45,918                         10,274
4K      43,365                         7,966
1K      34,662                         8,210
128     30,551                         5,699

DMA from External L2D to SDRAM

Bytes   Clock Cycles from DMA setup to DMA Complete/teardown   From Enable to DMA completion
16K     47,819                                                 16,684
8K      45,302                                                 19,309
4K      41,040                                                 8,049
2K      40,152                                                 8,746
1K      30,299                                                 8,220
128     25,757                                                 5,744

 

-------------------- L2 Internal to External Memory -------------------------

 

 

    unsigned int acnt = MAX_ACOUNT;  /* Byte count */
    unsigned int bcnt = MAX_BCOUNT;  /* 1 */
    unsigned int ccnt = MAX_CCOUNT;  /* 1 */
    EDMA3_DRV_SyncType syncType = EDMA3_DRV_SYNC_A;
    EDMA3_DRV_Result result = EDMA3_DRV_SOK;
    EDMA3_DRV_PaRAMRegs paramSet = {0,0,0,0,0,0,0,0,0,0,0,0};
    unsigned int Istestpassed;
    unsigned int chId = 0;
    unsigned int tcc = 0;
    int i;
    unsigned int numenabled = 0;
    unsigned int BRCnt = 0;
    int srcbidx = 0, desbidx = 0;
    int srccidx = 0, descidx = 0;
    unsigned int *srcuintptr, *dstuintptr;

    EDMA3_DRV_Result edmaResult = EDMA3_DRV_SOK;

    srcBuff1 = (signed char*)_srcBuff1;
    dstBuff1 = (signed char*)_dstBuff1;

    /* Initialize EDMA3 first */
    edmaResult = edma3init();

    /* Fill the source buffer with a known pattern */
    for (i = 0; i < (acnt*bcnt*ccnt); i++)
    {
        srcBuff1[i] = (char) (i%(0xFF));
    }

    BCACHE_wbInvAll(); // 8011 cycles for 8K
    BCACHE_inv(srcBuff1, (acnt*bcnt*ccnt), TRUE); // 2811 cycles for 8K
    BCACHE_inv(dstBuff1, (acnt*bcnt*ccnt), TRUE); // 2811 cycles for 8K

    /* ===== MCPS estimate START: "From Setup to DMA completion" ===== */

    /* Flush the Source Buffer */
    if (result == EDMA3_DRV_SOK)
    {
        result = Edma3_CacheFlush((unsigned int)srcBuff1, (acnt*bcnt*ccnt));
    }

    /* Invalidate the Destination Buffer */
    if (result == EDMA3_DRV_SOK)
    {
        result = Edma3_CacheInvalidate((unsigned int)dstBuff1, (acnt*bcnt*ccnt));
    }

    /* Set B count reload as B count. */
    BRCnt = bcnt;

    /* Setting up the SRC/DES Index */
    srcbidx = (int)acnt;
    desbidx = (int)acnt;

    if (syncType == EDMA3_DRV_SYNC_A)
    {
        /* A Sync Transfer Mode */
        srccidx = (int)acnt;
        descidx = (int)acnt;
    }
    else
    {
        /* AB Sync Transfer Mode */
        srccidx = ((int)acnt * (int)bcnt);
        descidx = ((int)acnt * (int)bcnt);
    }


    /* Setup for Channel 1*/
    tcc = EDMA3_DRV_TCC_ANY;
    chId = EDMA3_DRV_DMA_CHANNEL_ANY;

    /* Request any DMA channel and any TCC */
    if (result == EDMA3_DRV_SOK)
    {
        result = EDMA3_DRV_requestChannel (hEdma, &chId, &tcc,
                                        (EDMA3_RM_EventQueue)0,
                                            &callback1, NULL);
    }
    if (result == EDMA3_DRV_SOK)
    {
        /* Fill the PaRAM Set with transfer specific information */
        paramSet.srcAddr    = (unsigned int)(srcBuff1);
        paramSet.destAddr   = (unsigned int)(dstBuff1);

        /**
         * Be Careful !!!
         * Valid values for SRCBIDX/DSTBIDX are between -32768 and 32767
         * Valid values for SRCCIDX/DSTCIDX are between -32768 and 32767
         */
        paramSet.srcBIdx    = srcbidx;
        paramSet.destBIdx   = desbidx;
        paramSet.srcCIdx    = srccidx;
        paramSet.destCIdx   = descidx;

        /**
         * Be Careful !!!
         * Valid values for ACNT/BCNT/CCNT are between 0 and 65535.
         * ACNT/BCNT/CCNT must be greater than or equal to 1.
         * Maximum number of bytes in an array (ACNT) is 65535 bytes
         * Maximum number of arrays in a frame (BCNT) is 65535
         * Maximum number of frames in a block (CCNT) is 65535
         */
        paramSet.aCnt       = acnt;
        paramSet.bCnt       = bcnt;
        paramSet.cCnt       = ccnt;

        /* For AB-synchronized transfers, BCNTRLD is not used. */
        paramSet.bCntReload = BRCnt;

        paramSet.linkAddr   = 0xFFFFu;

        /* Src & Dest are in INCR modes */
        paramSet.opt &= 0xFFFFFFFCu;
        /* Program the TCC */
        paramSet.opt |= ((tcc << OPT_TCC_SHIFT) & OPT_TCC_MASK);

        /* Enable Intermediate & Final transfer completion interrupt */
        paramSet.opt |= (1 << OPT_ITCINTEN_SHIFT);
        paramSet.opt |= (1 << OPT_TCINTEN_SHIFT);

        if (syncType == EDMA3_DRV_SYNC_A)
        {
            paramSet.opt &= 0xFFFFFFFBu;
        }
        else
        {
            /* AB Sync Transfer Mode */
            paramSet.opt |= (1 << OPT_SYNCDIM_SHIFT);
        }

        /* Now, write the PaRAM Set. */
        result = EDMA3_DRV_setPaRAM(hEdma, chId, &paramSet);
    }

 

    /*
     * Since the transfer is going to happen in Manual mode of EDMA3
     * operation, we have to 'Enable the Transfer' multiple times.
     * Number of times depends upon the Mode (A/AB Sync)
     * and the different counts.
     */
    if (result == EDMA3_DRV_SOK)
    {
        /*Need to activate next param*/
        if (syncType == EDMA3_DRV_SYNC_A)
        {
            numenabled = bcnt * ccnt;
        }
        else
        {
            /* AB Sync Transfer Mode */
            numenabled = ccnt;
        }

        for (i = 0; i < numenabled; i++)
        {
            irqRaised1 = 0;

            /* ===== MCPS estimate START: "From Enable to DMA completion" ===== */

            /*
             * Now enable the transfer as many times as calculated above.
             */
            result = EDMA3_DRV_enableTransfer (hEdma, chId,
                                                EDMA3_DRV_TRIG_MODE_MANUAL);
            if (result != EDMA3_DRV_SOK)
            {
                printf ("edma3_test: EDMA3_DRV_enableTransfer " \
                                    "Failed, error code: %d\r\n", result);
                break;
            }

            /* Wait for the Completion ISR. */
            while (irqRaised1 == 0u)
            {
                /** Wait for the Completion ISR on Master Channel.
                 * You can insert your code here to do something
                 * meaningful.
                 */
            }

            /* Check the status of the completed transfer */
            if (irqRaised1 < 0)
            {
                /* Some error occurred, break from the FOR loop. */
                printf ("\r\nedma3_test: Event Miss Occurred!!!\r\n");

                /* Clear the error bits first */
                result = EDMA3_DRV_clearErrorBits (hEdma, chId);
                break;
            }

            /* ===== MCPS estimate END ===== */

        }
    }
    /* Release the channel if every step above succeeded. */
    if (EDMA3_DRV_SOK == result)
    {
        /* Free the previously allocated channel. */
        result = EDMA3_DRV_freeChannel (hEdma, chId);
        if (result != EDMA3_DRV_SOK)
        {
           printf("edma3_test: EDMA3_DRV_freeChannel() FAILED, " \
                                "error code: %d\r\n", result);
        }
    }
}
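
The range comments in the listing above can be turned into a small pre-flight check. The following is only a sketch: `xfer_shape_t` and `xfer_shape_valid` are hypothetical names, not part of the EDMA3 LLD, and the limits are the ones quoted in the listing's comments. One thing it makes visible: the listing sets `srcbidx = acnt`, which no longer fits the signed 16-bit index range once `acnt` exceeds 32767 (presumably harmless when `bcnt` is 1, since the B index is then never applied, but worth checking).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the values being checked; not an LLD type. */
typedef struct {
    uint32_t acnt, bcnt, ccnt;
    int32_t  src_bidx, dst_bidx, src_cidx, dst_cidx;
} xfer_shape_t;

/* Counts must lie in 1..65535 (range quoted in the listing's comments). */
static bool count_ok(uint32_t c) { return c >= 1u && c <= 65535u; }

/* Indexes must lie in -32768..32767 (range quoted in the listing's comments). */
static bool index_ok(int32_t i) { return i >= -32768 && i <= 32767; }

/* Returns true only if every count and every index is within range. */
bool xfer_shape_valid(const xfer_shape_t *s)
{
    assert(s != NULL);
    return count_ok(s->acnt) && count_ok(s->bcnt) && count_ok(s->ccnt) &&
           index_ok(s->src_bidx) && index_ok(s->dst_bidx) &&
           index_ok(s->src_cidx) && index_ok(s->dst_cidx);
}
```

Calling such a check just before `EDMA3_DRV_setPaRAM` would flag an out-of-range value before it is written to the PaRAM set.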

  • Hi,

    If your memory is contiguous, you can put the whole size in ACNT and set BCNT and CCNT to 1. The maximum value of ACNT is 65535.
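
As a rough illustration of that rule, the sketch below picks counts for a contiguous buffer. The names `edma_counts_t` and `counts_for_linear` are hypothetical. The single-array case follows the advice above; the fallback for buffers larger than 65535 bytes is an assumption that goes beyond this thread: it splits the buffer into equal arrays of at most 32767 bytes, so that a B index equal to ACNT would still fit the signed 16-bit index range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_MAX_COUNT 65535u  /* maximum ACNT/BCNT quoted in the thread */
#define EX_MAX_INDEX 32767u  /* largest positive BIDX quoted in the thread */

/* Hypothetical result type; not part of the EDMA3 LLD. */
typedef struct { uint32_t acnt, bcnt, ccnt; } edma_counts_t;

/* Choose counts for a contiguous buffer of nbytes.
 * Returns 0 on success, -1 if no split was found. */
int counts_for_linear(uint32_t nbytes, edma_counts_t *out)
{
    uint32_t a;

    assert(out != NULL);
    if (nbytes == 0u) {
        return -1;
    }
    if (nbytes <= EX_MAX_COUNT) {
        /* Whole buffer as one array: ACNT = size, BCNT = CCNT = 1. */
        out->acnt = nbytes;
        out->bcnt = 1u;
        out->ccnt = 1u;
        return 0;
    }
    /* Assumed fallback: largest array size that divides the buffer evenly. */
    for (a = EX_MAX_INDEX; a > 0u; a--) {
        if (nbytes % a == 0u) {
            uint32_t b = nbytes / a;
            if (b > EX_MAX_COUNT) {
                return -1;  /* smaller arrays would need even more of them */
            }
            out->acnt = a;
            out->bcnt = b;
            out->ccnt = 1u;
            return 0;
        }
    }
    return -1;
}
```

For the 16K byte case in the question this yields ACNT = 16384, BCNT = 1, CCNT = 1, which matches the suggestion above.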

    As for QDMA and DMA channels: the programming, the channel triggering, the priority, and the associated PaRAM values differ. However, requests from both channel types are passed to the transfer controller, which performs the actual transfer, so there is no difference in the transfer itself.

    QDMA channels support linking (if STATIC = 0 in OPT). In this case, when the EDMA3CC copies the linked PaRAM set (including the trigger word), the current PaRAM set is recognized as valid and another transfer is initiated.
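
To make the STATIC bit concrete, here is a minimal sketch that assembles an OPT word on the host. The bit positions are assumptions taken from my reading of the EDMA3 controller documentation (SAM bit 0, DAM bit 1, SYNCDIM bit 2, STATIC bit 3, TCC bits 17:12, TCINTEN bit 20); please verify them against the user guide for your device. `make_opt` and the `EX_` macros are hypothetical names, not LLD symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed OPT field positions; check the EDMA3 guide for your device. */
#define EX_OPT_SYNCDIM_SHIFT  2u
#define EX_OPT_STATIC_SHIFT   3u
#define EX_OPT_TCC_SHIFT      12u
#define EX_OPT_TCC_MASK       (0x3Fu << EX_OPT_TCC_SHIFT)
#define EX_OPT_TCINTEN_SHIFT  20u

/* Build an OPT word with SAM = DAM = 0 (incrementing addresses).
 * is_static = 0 leaves the PaRAM set linkable, as described above. */
uint32_t make_opt(uint32_t tcc, int ab_sync, int is_static, int final_irq)
{
    uint32_t opt = 0u;

    opt |= (tcc << EX_OPT_TCC_SHIFT) & EX_OPT_TCC_MASK;
    if (ab_sync) {
        opt |= 1u << EX_OPT_SYNCDIM_SHIFT;
    }
    if (is_static) {
        opt |= 1u << EX_OPT_STATIC_SHIFT;
    }
    if (final_irq) {
        opt |= 1u << EX_OPT_TCINTEN_SHIFT;
    }
    return opt;
}
```

In a linked list of QDMA transfers, my understanding is that the non-final sets would use `is_static = 0` so the link is followed, and the final set would use `is_static = 1`; treat that as a hedged reading of the documentation rather than something confirmed in this thread.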

    Regards,
    Prasad


  • Hi,

     

    1. If this array is contiguous (1-dimensional), what should ACNT, BCNT, and CCNT be? What is the basis for choosing these values for a single 1-D array?

    [h]  For an 8K byte array, acnt = 8000, bcnt = 1, ccnt = 1, as I understand from your suggestion. Is this correct? Does it matter if I change to acnt = 1, bcnt = 8000, ccnt = 1?

     

    Could you please comment on the cycle count observations too?

    2. Why does a 1K~2K byte DMA take more processing cycles than a 4K one?

    3. Why does a 128-byte DMA take about 6K cycles?

    4. Why does an L2-to-external DMA take fewer cycles than an external-memory-to-L2 DMA?

     

    Regards,

    Hari

     

  • Hi Hari

    What device are you using? Your tags say C6747, can you confirm?

     

    Harikrishna Vuppaladhadiam said:

    [h]  For an 8K byte array, acnt = 8000, bcnt = 1, ccnt = 1, as I understand from your suggestion. Is this correct? Does it matter if I change to acnt = 1, bcnt = 8000, ccnt = 1?

    For a linear array, an A-sync transfer with ACNT = 8K and BCNT = CCNT = 1 will give you the same performance as an AB-sync transfer with ACNT = 1, BCNT = 8000, CCNT = 1.
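
The two configurations can be compared by counting triggers, using the same rule as the `numenabled` calculation in the original listing (A-sync needs BCNT × CCNT triggers, AB-sync needs CCNT). This is a sketch only; `ex_sync_t`, `triggers_needed`, and `bytes_per_trigger` are hypothetical helpers, not LLD functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's sync type. */
typedef enum { EX_SYNC_A, EX_SYNC_AB } ex_sync_t;

/* Number of manual triggers needed to move the whole block:
 * one per array for A-sync, one per frame for AB-sync. */
uint32_t triggers_needed(ex_sync_t sync, uint32_t bcnt, uint32_t ccnt)
{
    return (sync == EX_SYNC_A) ? (bcnt * ccnt) : ccnt;
}

/* Bytes moved by each trigger: one array (ACNT) for A-sync,
 * one frame (ACNT * BCNT) for AB-sync. */
uint32_t bytes_per_trigger(ex_sync_t sync, uint32_t acnt, uint32_t bcnt)
{
    return (sync == EX_SYNC_A) ? acnt : (acnt * bcnt);
}
```

Both cases in the question come out as one trigger moving 8000 bytes. By contrast, ACNT = 1, BCNT = 8000 in A-sync mode would need 8000 triggers of one byte each, so the sync type matters as much as the counts.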

    Harikrishna Vuppaladhadiam said:

    2. Why does a 1K~2K byte DMA take more processing cycles than a 4K one?

    3. Why does a 128-byte DMA take about 6K cycles?

    Yes. Small transfers show lower overall bus utilization than larger ones. For transfers of 1K/2K bytes or less, the initial latency of starting a transfer dominates the cycle count. This latency depends on several factors (primarily chip topology) and is a fixed cost incurred for every transfer, so if your transfer size/TRP is small, it dominates your bytes/cycle calculation.

    We don't have any documented, externally available EDMA throughput or cycle data for C674x-based devices. However, your trends closely match what is documented for another device that has the same EDMA module and a similar bus architecture.

     http://www.ti.com/lit/an/spraaw4b/spraaw4b.pdf

    Section 4.1 of the above app note should help with your questions on EDMA throughput, and the general data trend there should match what you are currently seeing.
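
The fixed-latency effect can be seen in the numbers posted at the top of the thread. The sketch below models the cycle count as a fixed cost plus a per-byte cost and fits it through two of the posted "From Enable to DMA completion" points for external memory to L2 (128 bytes in 5,699 cycles and 16K bytes in 24,514 cycles). It assumes that "16K" means 16384 bytes, and the helper names are hypothetical. The posted data is noisy (4K completed faster than 1K), so the fit is illustrative only.

```c
#include <assert.h>

/* cycles ~= fixed_cycles + cycles_per_byte * bytes */
typedef struct {
    double fixed_cycles;
    double cycles_per_byte;
} latency_fit_t;

/* Straight line through two (bytes, cycles) measurements. */
latency_fit_t fit_two_points(double bytes0, double cycles0,
                             double bytes1, double cycles1)
{
    latency_fit_t fit;

    assert(bytes1 != bytes0);
    fit.cycles_per_byte = (cycles1 - cycles0) / (bytes1 - bytes0);
    fit.fixed_cycles = cycles0 - fit.cycles_per_byte * bytes0;
    return fit;
}

/* Effective throughput of a single measurement. */
double bytes_per_cycle(double bytes, double cycles)
{
    return bytes / cycles;
}
```

With those two points the fixed cost comes out at roughly 5,500 cycles, which is nearly all of the 128-byte measurement. Since the timed region in the listing also includes the driver call and the wait for the completion callback, that fixed cost presumably contains software overhead as well as hardware latency.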

    Harikrishna Vuppaladhadiam said:

    4. Why does an L2-to-external DMA take fewer cycles than an external-memory-to-L2 DMA?

    I don't have any data points or reference points for transfers greater than 4K on C674x. I see higher variability in your 8K/16K data points, but the result is plausible: in general, overall EDMA performance is more heavily influenced by the latency of the source (in this case, L2). If the speed of the external memory interface is lower than the speed of the SDMA port (DSP/2), which is the port through which the EDMA accesses L2 memory, you could see differences. Additionally, L2 is generally a "faster" source than external memory, because inside the C674x megamodule (where L2 resides) there is slightly better buffering, a wider internal bus (256 bits), and better response time, since you are not susceptible to things like refresh cycles, as is the case with external memory. So we have seen chip topologies where L2-to-external-memory transfers do better than external-memory-to-L2/L1 transfers.

    Hope this helps.

    Regards

    Mukul