Why is EDMA3 slower than a for-loop when performing a matrix transposition?

Other Parts Discussed in Thread: TMS320C6455

Hi,

I'm using a TMS320C6455 and trying to perform a matrix transposition. The original matrix has 512 rows with 16 elements per row, and all elements are of type short.

I tried both an EDMA3 method and a for-loop method, and both produce the correct transposed result. But EDMA3 is slower than the for-loop: about 80000 cycles versus 60000 cycles.

I don't understand why EDMA3 is slower here, because it was much faster than a for-loop in a block-move task.

Here are the code snippets I'm using. How can I improve the performance of EDMA3?

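(1) EDMA3 configuration code:

int temp = 0;
DMA_EMCR = 0xFFFFFFFF;                /* clear any pending event-missed flags */
DCHMAP0 = 0x0;                        /* channel 0 uses PaRAM set 0 */
DMAQNUM0 = 0x0;                       /* channels 0-7 submit to event queue 0 */

PaRAM0_OPT = 0x00900004;              /* AB-sync, TCINTEN, ITCCHEN, TCC = 0 (chains to itself) */
PaRAM0_SRC = srcAddr;
PaRAM0_BCNT_ACNT = 0x00100002;        /* BCNT = 16 arrays, ACNT = 2 bytes (one short) */
PaRAM0_DST = dstAddr;
PaRAM0_DSTBIDX_SRCBIDX = 0x04000002;  /* DSTBIDX = 1024 bytes, SRCBIDX = 2 bytes */
PaRAM0_BCNTRLD_LINK = 0xFFFF;         /* BCNTRLD = 0, LINK = 0xFFFF (no link) */
PaRAM0_DSTCIDX_SRCCIDX = 0x00020020;  /* DSTCIDX = 2 bytes, SRCCIDX = 32 bytes */
PaRAM0_RSV_CCNT = 0x200;              /* CCNT = 512 frames */

DMA_EESR = 0x01;                      /* enable event 0 */
DMA_ESR = 0x01;                       /* manually trigger channel 0 */

/* Busy-wait until the completion flag for TCC 0 is set. */
while (temp == 0)
{
    temp = (*(volatile int *) DMA_IPR) & 0x01;
}
DMA_ICR = 0x01;                       /* clear the completion flag */
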
(2) FOR-LOOP code:

for (i=0; i<512; i++)
{
    for (j=0; j<16; j++)
    {
        newData[j*512+i] = oldData[i*16+j];
    }
}

Thanks!

Ricky

  • Hi Ricky,

    How can I improve the performance of EDMA3?


    I am not an expert on EDMA3, but I can give you some hints.

    Please refer to the following TI wiki page on increasing EDMA3 throughput.

    It is not written specifically for the C6455, but it covers typical EDMA3 throughput on OMAP-L13x and C674x devices and the relevant registers (RDRATE, EDMA configuration registers), so it should give you some hints about increasing EDMA3 throughput:

    http://processors.wiki.ti.com/index.php/EDMA_Background_Activity_for_OMAP-L1x/C674x/AM1x_Throughput_Measurements

    Where are your source and destination addresses located (internal memory, external memory, or a peripheral)?

    What code are you basing your development on: TI-provided code or your own?

    Please use the link below to download CSL code examples for EDMA on C6455 devices.

    Use this EDMA example code as a reference and modify it to fit your requirements (source and destination addresses, synchronization type, indexes, etc.).

    http://www.ti.com/tool/sprc234

    Meanwhile, I will check your configuration and get back to you soon.

  • Ricky,

    Where are the source and destination memories located?

    From where to where in your code are you taking the time measurements?

    Is this a time-critical operation in your system or is this just something you noticed being less efficient in this case using EDMA3?

    If this is something that is done repeatedly, then the setup can be minimized by reusing previous settings.

    If this is something that can be started and the DSP can go do other things, then you can avoid most or all of the time wasted in the loop waiting for the IPR bit to be set. This is the most common way that EDMA3 is used to improve overall efficiency. Otherwise, QDMA could be used, and that might save 1 register write.

    I do not think the write to EESR is needed. Am I wrong about that?

    If you are counting the register setup time, this will have a big impact on your EDMA3 timing. Of course, if the size of the transfer were much larger, that effect would be greatly reduced as a percentage of the transfer time.

    The source is being read very efficiently, but the destination is being written in 2-byte sizes. If the destination is external SDRAM, then this will be a very inefficient write for the EDMA.

    We can discuss other options when you reply back with answers, if you choose to.

    Regards,
    RandyP
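
RandyP's point about the destination being written in 2-byte pieces can be checked on a host PC. The following is a minimal sketch in plain C, not device code. `ParamModel`, `model_transfer`, and `count_mismatches` are hypothetical names introduced here; the field values are decoded from the register writes in the question. It assumes AB-synchronized addressing (the frame index is applied from the start of each frame) and a 2-byte short. It models addressing only and says nothing about cycle counts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROWS 512   /* rows in the original matrix */
#define COLS 16    /* elements per row */

/* Host-side stand-in for the PaRAM fields in the question (hypothetical type). */
typedef struct {
    uint16_t acnt, bcnt, ccnt;   /* bytes per array, arrays per frame, frames */
    int16_t  srcbidx, dstbidx;   /* byte step between arrays within a frame */
    int16_t  srccidx, dstcidx;   /* byte step between frames (AB-sync assumed) */
} ParamModel;

/* ACNT=2, BCNT=16, CCNT=512, SRCBIDX=2, DSTBIDX=1024, SRCCIDX=32, DSTCIDX=2 */
static const ParamModel transpose_param = { 2, 16, 512, 2, 1024, 32, 2 };

static short oldData[ROWS * COLS];
static short dmaResult[ROWS * COLS];
static short loopResult[ROWS * COLS];

/* Copy data the way the index fields describe it and return the number of
   separate ACNT-byte writes made to the destination. */
static unsigned long model_transfer(const ParamModel *p,
                                    const void *src, void *dst)
{
    unsigned long writes = 0;
    unsigned c, b;

    assert(p->acnt > 0);
    for (c = 0; c < p->ccnt; c++) {
        for (b = 0; b < p->bcnt; b++) {
            const uint8_t *s = (const uint8_t *)src
                             + (long)c * p->srccidx + (long)b * p->srcbidx;
            uint8_t *d = (uint8_t *)dst
                       + (long)c * p->dstcidx + (long)b * p->dstbidx;
            memcpy(d, s, p->acnt);
            writes++;
        }
    }
    return writes;
}

/* Run the model and the for-loop from the question on the same input.
   Returns the number of elements where the two results differ. */
static int count_mismatches(unsigned long *writes_out)
{
    int i, j, bad = 0;

    assert(sizeof(short) == 2);          /* assumption stated in the text */
    for (i = 0; i < ROWS * COLS; i++)
        oldData[i] = (short)i;           /* 0..8191 fits in a short */

    *writes_out = model_transfer(&transpose_param, oldData, dmaResult);

    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            loopResult[j * ROWS + i] = oldData[i * COLS + j];

    for (i = 0; i < ROWS * COLS; i++)
        if (dmaResult[i] != loopResult[i])
            bad++;
    return bad;
}
```

Calling `count_mismatches` from a `main()` should report no mismatches, which agrees with Ricky's statement that both methods give the correct result. The write count comes out at 512 x 16 = 8192 separate 2-byte writes, and within a frame consecutive writes land 1024 bytes apart while the matching reads are contiguous. That strided destination pattern is what the reply above flags as costly if the destination is external SDRAM.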