edma3 lld performance insufficient

Christoph Sulzbachner

Other Parts Discussed in Thread: TMS320DM6467

I am using the EDMA3 low level driver (edma3_lld_01_06_00_01) and DSP/BIOS 5.33.06 with CCS3.3 on a DSK6455. I extracted the DAT functions from the CSL2_DAT_DEMO example included in the LLD and built a library (DAT_ll_*). The functionality of the DAT functions in my application has been verified manually with the debugger and at runtime. The problem is the performance of the DAT functions, even though the DAT overhead is minimal and consists almost entirely of EDMA3_DRV calls.

Measuring the performance of copying 1GiB of data from external to internal memory in chunks gives a throughput of about 230MiB/s. All sections are mapped to internal memory except the "dext" data section, so the memory interface should not be stressed. Here are some code snippets from my test project:

#pragma DATA_ALIGN(dstBuff, 8)
#pragma DATA_SECTION(dstBuff, "dint")
unsigned int dstBuff[SIZE];

#pragma DATA_ALIGN(srcBuff1, 8)
#pragma DATA_SECTION(srcBuff1, "dext")
unsigned int srcBuff1[SIZE];

...

{
float time_cycles, time_ms;
unsigned long long int u64T1, u64T2;

TSCH = 0; TSCL = 0;
u64T1 = _itoll(TSCH, TSCL);

for (i = 0; i < (1024 * 1024 * (1024)) / SIZE; i++)
{
    waitId[0] = DAT_ll_copy2d(DAT_2D2D, srcBuff1, dstBuff, 4, SIZE / 8, 8);
    DAT_ll_wait(waitId[0]);
}

u64T2 = _itoll(TSCH, TSCL);

time_cycles = (float)(u64T2 - u64T1);
time_ms = (float)time_cycles / (float)GBL_getFrequency(); // in msec

printf("performance DAT_ll_copy2d: %fMiB/s\n", 1024. * (1024.) / time_ms);
}

I expected much higher rates than 230MiB/s. How can I increase the performance?
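A quick cross-check of the arithmetic in the snippet above, written as a host-side sketch rather than target code. It assumes that the `lineLen` argument of the 2D copy is a byte count, as in the CSL `DAT_copy2d`, and that `GBL_getFrequency()` returns kHz; the helper names are made up for illustration. If that reading is right, each call in the loop moves 4 * SIZE/8 = SIZE/2 bytes, so the loop transfers 512MiB rather than 1GiB, and the `1024. * 1024. / time_ms` scaling uses 1024 where 1000 ms per second is meant.

```c
#include <stdint.h>

/* Bytes moved by one 2D copy: lineLen bytes per line times lineCnt lines.
 * The line pitch only affects addressing, not the amount of data moved. */
static uint64_t bytes_per_2d_copy(uint32_t line_len, uint32_t line_cnt)
{
    return (uint64_t)line_len * line_cnt;
}

/* Throughput in MiB/s from a cycle count and a CPU frequency given in kHz. */
static double mib_per_second(uint64_t total_bytes, uint64_t cycles,
                             uint32_t freq_khz)
{
    double seconds = (double)cycles / ((double)freq_khz * 1000.0);
    return ((double)total_bytes / (1024.0 * 1024.0)) / seconds;
}
```

With SIZE = 0x4000, `bytes_per_2d_copy(4, SIZE / 8)` is 8192 bytes per call, and the 65536 loop iterations add up to 512MiB.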

Regards, Christoph

over 14 years ago

0 RandyP over 14 years ago

TI__Guru* 84110 points

Questions for you:

Why did you extract the commands from the LLD library and make your own?

Have you benchmarked the overhead of your DAT_ll_copy2d() function call? How do you support your statement that the overhead is low?

The code snippet is missing key information such as SIZE. What value do you use for SIZE?

I do not find a reference to the function GBL_getFrequency. Where did this come from? What value does it return in your system?

What is the performance of the memory port where you placed the dext section?

Why are you doing a 2D2D transfer?

If I understand your arg list, you are doing a bunch of 1 word copies skipping every other word. Why do you do that, or am I misunderstanding the arg list to your DAT function?

Other than improving the overhead in your DAT function, you will not be able to improve the system performance when doing the copies that you are doing. But perhaps you do not want to do the copies that you are doing?

Also, it defeats the primary value of EDMA3 when you have to wait for the transfer to complete. Ideally, you will start a transfer then go do important DSP work and the copy will be finished by the time you need the data.

The secondary value of EDMA3 is that it is more efficient at doing copies than having the DSP do the copies in code. Proper use of EDMA3 is definitely the fastest way to get blocks of data copied from one place to another, unless it is just a few words.
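The start-then-work-then-wait pattern described above could be sketched as a ping-pong scheme along the following lines. This is a host-runnable illustration only: `xfer_start` and `xfer_wait` are hypothetical stand-ins for `DAT_copy`/`DAT_wait` and simply use `memcpy`, and `process` is a placeholder for the real DSP work.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_WORDS 256u

/* Hypothetical stand-ins for DAT_copy()/DAT_wait(): on the target these would
 * start an EDMA3 transfer and block on its completion; here the "start"
 * copies immediately so that the sketch runs on a host. */
static int xfer_start(const uint32_t *src, uint32_t *dst, size_t words)
{
    memcpy(dst, src, words * sizeof *src);
    return 0; /* pretend transfer id */
}

static void xfer_wait(int id) { (void)id; }

/* Placeholder for the real work on one chunk. */
static uint64_t process(const uint32_t *buf, size_t words)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < words; i++)
        sum += buf[i];
    return sum;
}

/* Ping-pong: while chunk k is processed from one buffer, chunk k+1 is
 * already being transferred into the other buffer. */
static uint64_t run_pipeline(const uint32_t *ext, size_t chunks)
{
    static uint32_t ping_pong[2][CHUNK_WORDS];
    uint64_t total = 0;
    if (chunks == 0)
        return 0;

    int id = xfer_start(ext, ping_pong[0], CHUNK_WORDS);
    for (size_t k = 0; k < chunks; k++) {
        xfer_wait(id);                  /* chunk k is now in buffer k % 2 */
        if (k + 1 < chunks)             /* start chunk k+1 before working */
            id = xfer_start(ext + (k + 1) * CHUNK_WORDS,
                            ping_pong[(k + 1) & 1], CHUNK_WORDS);
        total += process(ping_pong[k & 1], CHUNK_WORDS);
    }
    return total;
}
```

On real hardware the benefit comes from the transfer of chunk k+1 running in the background while `process` works on chunk k; the wait then costs little if the processing takes longer than the copy.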

The code you are using for benchmarking is obviously trying to measure the capabilities of the chip. But what do you need the chip to actually do? A real-life benchmark is much more useful for your evaluation and will get you finished quicker.

For a good reference on throughput and optimization, you may want to take a look at the TMS320DM6467 SoC Architecture and Throughput Overview document, SPRAAW4.

0 Christoph Sulzbachner over 14 years ago in reply to RandyP

Prodigy 120 points

Maybe I can skip some of your questions regarding the application. The application is only used for benchmarking the DAT functions. The target application is described later. To avoid modifying the original sources, I renamed the DAT functions to DAT_ll. However, the DAT_ll functions are identical to the DAT functions of the TI demo.

>> How do you support your statement that the overhead is low?
The copy, copy2d and fill functions request a channel and set up the transfer directly through the EDMA3_DRV functions. The function overhead of DAT_copy, DAT_copy of DAT_copy2d is about 35000ticks and DAT_wait is lower that 50 ticks.

>> The code snippet is missing key information such as SIZE. What value do you use for SIZE?
The value is 0x4000. In my target application, where the DAT functions will be integrated, the chunk size will be at most 0x8000. Using 0x8000, the copy performance is 474MB/s and the copy2d performance using the parameters from the example is 313MB/s. Increasing SIZE to e.g. 0x40000 results in 7339MB/s for the copy function and 480MB/s for the copy2d function when transferring 1GiB. Is 7339MB/s a realistic value?

>> I do not find a reference to the function GBL_getFrequency. Where did this come from? What value does it return in your system?
GBL_getFrequency is a DSP/BIOS function (spru403o) that returns the current frequency of the CPU in kHz. The return value in my system is 1000000.

>> What is the performance of the memory port where you placed the dext section?
The dext section is mapped to DDR2 and dint is mapped to IRAM. The DSP/BIOS configuration uses utils.loadPlatform("ti.platforms.dsk6455"). Attached you will find the tcf file and cmd file of my project.

>> But what do you need the chip to actually do? A real-life benchmark is much more useful for your evaluation and will get you finished quicker.
The target application uses a C6455 for data acquisition and a C6474 for data processing. The C6455 sends data to the C6474 using SRIO. Therefore the C6474 has a reserved memory segment that implements an input ring buffer for multiple 4KiB SRIO bursts. Afterwards a doorbell indicates that the data is completely transmitted and the C6474 starts processing. To keep data from being overwritten, the C6474 copies the received data using double buffering to a separate memory where it will be further processed. Therefore, I intended using DAT_copy.

8203.demo.zip

0 RandyP over 14 years ago in reply to Christoph Sulzbachner

TI__Guru* 84110 points

The more real-life information you can provide (in small doses, not 35000 lines of code), the better my answers will be. Although you may see me like a car salesman asking for your total income and your home mortgage payments - you just want answers, not questions. C'est la vie?

Christoph Sulzbachner said:
The function overhead of DAT_copy, DAT_copy of DAT_copy2d is about 35000ticks and DAT_wait is lower that 50 ticks.

Some words and letters may be misplaced here, but if the numbers are right, then I would suggest that 35000 ticks is very high overhead, and it sounds inordinately high even for an LLD call. This tells you 1) I question the validity of the 35000 value and would like more information, 2) the LLD-based DAT_copy2d is not the most efficient way to implement a DAT_copyXX function. The EDMA3 LLD (in my opinion) is very easy to use, uses intuitive function names, handles resources well, and offers working examples. It is not the perfect solution for every situation, but can be molded by the wise user (you) into a practical system.

Christoph Sulzbachner said:
Is 7339MB/s a realistic value?

Take a look at the datasheet's section on DDR2 Memory Controller, SPRS276 section 7.0 page 157. It discusses maximum data rate, which is less than 7 GB/s. So, no, this is not realistic.

[GBL_getFrequency - my desktop search failed me, must have needed a reboot. I find it now, sorry for the oversight.]

Christoph Sulzbachner said:
The target application ... C6455 sends data to the C6474 using SRIO. ... a doorbell indicates that the data is completely transmitted .... the C6474 copies the received data using double buffering to a separate memory .... Therefore, I intended using DAT_copy.

Comments with the new understanding of the application:

1) DAT_copy can work here, if you want an easy-to-use implementation. The fact that you are taking the time to benchmark the function calls tells me time is more important than ease-of-use, but perhaps you can clarify that.
2) You can kick off the DAT_copy in your SRIO-response ISR, then let the DAT_copy termination generate another interrupt, then you can continue doing processing during the transfer time. A bit more context-switching overhead, but that is one more piece to add to the puzzle.
3) I think the SRIO doorbell can generate an EDMA3 event (through CIC3 in the C6474), so you could have a series of transfers programmed into PARAM Link Sets that could be triggered by the SRIO doorbell, then the completion of the transfer would generate your ISR. This would indicate to the DSP that the data is available and has already been copied to a better place.
4) Since you are using DirectIO to send the data (I hope), you can just write the data directly to the place you would have transferred it, saving a copy on the C6474. It might be a bit more overhead on the C6455, but it would be equivalent to a 0-cycle copy on the C6474 target.

Some of these may not be practical. For example, you might need to gather some data after the doorbell to determine how much data came in and where it should go. If you determine that you do need to do a QDMA, you may want to use the LLD to allocate a QDMA channel then work with that channel outside the DAT_* code. You can set it up once with most of the parameters, then just write the variable parameters to start the transfer when you need to do it in the ISR, and let another ISR respond to the transfer completion interrupt rather than polling in the SRIO's ISR.

Regards,
RandyP

0 Christoph Sulzbachner over 14 years ago in reply to RandyP

Prodigy 120 points

The DAT interface is used as a memory abstraction interface in the application. Thus, I need a compatible API and I thought of using the EDMA3 LLD DAT implementation. In my application I need to use DAT_fill, DAT_copy and DAT_copy2d. The inefficient DAT_copy2d parameters in the example are only used for testing purposes.

>> [...] 35000 ticks is very high overhead [...]
>> Is 7339MB/s a realistic value?

35000 ticks +/- 100 ticks was the overhead measured for the DAT_copy, DAT_copy2d and DAT_fill function calls using the benchmarking method shown in the code snippet. This was the value measured with the benchmarking application. Attached you will find the benchmark code.

>> Comments with the new understanding of the application[...]

The C6474 has a reserved memory segment for storing N 4KiB bursts that is used as a ring buffer. The C6455 transmits a burst of 4KiB using DirectIO followed by a doorbell that triggers a HWI on the C6474. In the HWI, the pointer and length of the received data are pushed into a queue that stores the n latest receptions, with a counting semaphore tracking the number of elements. Thus, not all received data has to be processed; only the latest n receptions are handled. The parameters N and n can be modified. As long as there are entries in the queue, a task calling SEM_pend processes the data.

The DAT functions are used to copy a chunk of the data from external to internal memory, so that it is not overwritten in external memory while the C6455 keeps streaming. The DAT functions are also used in the processing function. The following code snippet shows the HWI and the task. Because some received data can be skipped, another EDMA transfer for copying data from external to internal memory cannot be chained. In future, the DAT_copy shown in the example should be used in a more efficient way.

HWI()
{
HWI_disable()
if(SEM_count(sem) < (n - 1))
{
    push(ptr, len)
    SEM_ipost(sem)
}
HWI_enable()
}

TASK()
{
for(;;)
{
    if(SEM_pend(sem))
    {
      HWI_disable()
      ptr, len = pop()
      HWI_enable()

      id = DAT_copy()
      DAT_wait()

      processing()
    }
}
}
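The `push`/`pop` pair in the pseudocode above is not shown in the thread. As a minimal sketch of what such a bounded queue of (pointer, length) entries might look like, here is a fixed-depth ring; all names and the depth of 4 are hypothetical, and on the target both functions would still have to run with interrupts disabled, as the pseudocode does.

```c
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 4u  /* hypothetical value for n */

typedef struct {
    const void *ptr;    /* start of the received burst */
    uint32_t    len;    /* length of the burst in bytes */
} Reception;

typedef struct {
    Reception slot[QUEUE_DEPTH];
    unsigned  head;     /* index of the next entry to pop */
    unsigned  count;    /* number of entries currently queued */
} RxQueue;

static void rx_init(RxQueue *q)
{
    q->head = 0;
    q->count = 0;
}

/* Meant for the doorbell HWI: returns 0 and drops the entry when full. */
static int rx_push(RxQueue *q, const void *ptr, uint32_t len)
{
    if (q->count == QUEUE_DEPTH)
        return 0;
    unsigned tail = (q->head + q->count) % QUEUE_DEPTH;
    q->slot[tail].ptr = ptr;
    q->slot[tail].len = len;
    q->count++;
    return 1;
}

/* Meant for the task after SEM_pend() succeeds: returns 0 when empty. */
static int rx_pop(RxQueue *q, Reception *out)
{
    if (q->count == 0)
        return 0;
    *out = q->slot[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    return 1;
}
```

Keeping the depth check inside `rx_push` means the HWI only has to post the semaphore when the push succeeds, which keeps the semaphore count and the queue contents in step.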

3568.main.zip

0 RandyP over 14 years ago in reply to Christoph Sulzbachner

TI__Guru* 84110 points

Christoph Sulzbachner said:
>> [...] 35000 ticks is very high overhead [...]

This is what you have to fix to get better performance, especially since you are not trying to transfer 1GB at a time. Some of the steps I outlined above were to fix this systemically rather than rewriting the DAT_copy functions. But if that is not the direction you want to go, then you will have to re-write the DAT_* functions to avoid all of the read-modify-writes to the EDMA3 registers.
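To see why a fixed per-call cost matters so much at these chunk sizes, here is a rough model, not a measurement. It assumes the 1GHz clock implied by the GBL_getFrequency value reported in this thread and treats the raw transfer rate as a free parameter; the 2GB/s used in the note below is a placeholder, and the function name is made up.

```c
/* Effective throughput in bytes/s when every chunk pays a fixed setup cost
 * (in CPU cycles) before the data moves at the raw transfer rate. */
static double effective_rate(double chunk_bytes, double overhead_cycles,
                             double cpu_hz, double raw_bytes_per_s)
{
    double t_setup = overhead_cycles / cpu_hz;      /* seconds per call */
    double t_xfer  = chunk_bytes / raw_bytes_per_s; /* seconds per chunk */
    return chunk_bytes / (t_setup + t_xfer);
}
```

With 16KiB chunks and 35000 cycles of overhead, the model gives roughly 380MB/s for a 2GB/s raw rate, and it can never exceed about 468MB/s however fast the raw transfer is, because the setup time alone is 35 microseconds per chunk.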

The easiest way to do the rewrite is to manually inline all of the LLD functions that are called in your DAT_copy routine. Then start combining things to get to where you just do a single EDMA3_DRV_setPaRAM with all 8 words in one struct being written at one time to the PARAM. For example, this is how I rewrote it once, but this does not have all the instrumentation and parameter checking:

/*
* ======== DAT_copy =========
* One dimensional copy from source to destination of byteCnt bytes
*/
Uint32 DAT_copy(void *src, void *dst, Uint16 byteCnt ) {
    Uint32 chNum = 0;
    Uint32 tccNum = 0;
    /*
     * An alternate way to setup the params
     * EDMA3_DRV_PaRAMRegs param;
     * const EDMA3_DRV_PaRAMRegs *newPaRAM = &param;
     */
    EDMA3_DRV_PaRAMRegs param;
    const EDMA3_DRV_PaRAMRegs *newPaRAM = &param;

    /*
     * Obtain a free channel
     * This call spins till a free channel is obtained
     */
    chNum = _getFreeChannel(&tccNum);

    /*
     * Original three-call setup of the transfer parameters for this channel,
     * left commented out; the single EDMA3_DRV_setPaRAM() call below replaces it.
    EDMA3_DRV_setTransferParams(DAT_EDMA3LLD_hEdma, chNum, byteCnt, 1, 1, 0,
            EDMA3_DRV_SYNC_AB);
    EDMA3_DRV_setDestParams(DAT_EDMA3LLD_hEdma, chNum, (unsigned int)dst,
            EDMA3_DRV_ADDR_MODE_INCR,(EDMA3_DRV_FifoWidth)0);
    EDMA3_DRV_setSrcParams(DAT_EDMA3LLD_hEdma, chNum, (unsigned int)src,
            EDMA3_DRV_ADDR_MODE_INCR,(EDMA3_DRV_FifoWidth)0);
     */

    /*
     * To set up all the parameters in a single call, populate an
     * EDMA3_DRV_PaRAMRegs structure as shown here
     */

     param.srcAddr = (Uint32)src;
     param.aCnt = (unsigned short) byteCnt;
     param.bCnt = 1;
     param.destAddr = (Uint32)dst;
     param.srcBIdx = (short)0;
     param.destBIdx = (short)0;
     param.linkAddr = DAT_NULL_LINK;
     param.bCntReload = 0x0 ;
     param.srcCIdx = 0x0 ;
     param.destCIdx = 0x0;
     param.cCnt = 0x1;

     param.opt = DAT_OPT_TCC(DAT_OPT_DEFAULT, tccNum);

     if (EDMA3_DRV_SOK != EDMA3_DRV_setPaRAM(DAT_EDMA3LLD_hEdma, chNum,
         newPaRAM)) {
         LOG_printf(&LOG0,"Error setting up transfer \n");
//         HWI_enable();
         return DAT_INVALID_ID;
     }

//     return _setupTransferOptions(chNum, tccNum);
    // write to ESR for this channel
//    drvInst = (EDMA3_DRV_Instance *)hEdma;
//    drvObject = drvInst->pDrvObjectHandle;
    ((EDMA3_DRV_Instance *)DAT_EDMA3LLD_hEdma)->shadowRegs->ESR = 1 << chNum;

    return tccNum;
}

It is still not the most efficient because of redundant steps that get repeated every time. And it looks like I did not reverse the _getFreeChannel [oops!], so if you make a good version of it that you can use, please post it back to this thread for everyone's benefit.

Christoph Sulzbachner said:
>> Is 7339MB/s a realistic value?

Did my answer make sense? There was an error in the calculation, since the number is not realistic.
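The thread does not say where the calculation went wrong. One thing that may be worth ruling out, given the `Uint16 byteCnt` parameter of the DAT_copy shown above and the 16-bit ACNT field it feeds, is silent truncation of large byte counts. The helper below is hypothetical and only mirrors that conversion on a host; whether this is what happened in the benchmark is an assumption, not something confirmed here.

```c
#include <stdint.h>

/* Mirrors what happens when a larger count is passed through a 16-bit
 * byteCnt parameter: only the low 16 bits survive. */
static uint16_t as_byte_cnt(uint32_t requested)
{
    return (uint16_t)requested;
}
```

A request of 0x8000 passes through unchanged, while 0x40000 becomes 0, so a transfer sized that way would finish almost immediately and the computed rate would look far better than the memory can deliver.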

Let us know if you find better performance this way. Then we can start looking at improving the efficiency of the transfer itself, once the huge overhead is fixed.

Regards,
RandyP

0 Christoph Sulzbachner over 14 years ago in reply to RandyP

Prodigy 120 points

I have modified the fill, copy and copy2d functions in order to avoid the read-modify-writes to the EDMA3 registers following your example. I also replaced the edma3MemCpy call in EDMA3_DRV_setPaRAM by a statement that is represented by LDNDW and STNDW statements assigning the parameters.

*(EDMA3_CCRL_ParamentryRegs *)(&(globalRegs->PARAMENTRY[paRAMId].OPT)) = *(EDMA3_CCRL_ParamentryRegs *)newPaRAM;

I also removed the EDMA3_DRV_setOptField calls in the _setupTransferOptions, because the registers have already been assigned. Further, I replaced the TSK_disable and TSK_enable statements in the dat_critical_section functions by _disable_interrupts and _restore_interrupts due to the lower function overhead.

The DAT function overhead for fill, copy and copy2d could be reduced to 1800 ticks. However, simplifying the DAT exception handling only results in minor performance improvements. Using 0xffff chunks for transferring 1GiB results in a total copying performance of 920MiB/s. How can I further improve the performance?

Regards,
Christoph

0 RandyP over 14 years ago in reply to Christoph Sulzbachner

TI__Guru* 84110 points

Christoph,

Christoph Sulzbachner said:
I also replaced the edma3MemCpy call in EDMA3_DRV_setPaRAM by a statement that is represented by LDNDW and STNDW statements assigning the parameters.

*(EDMA3_CCRL_ParamentryRegs *)(&(globalRegs->PARAMENTRY[paRAMId].OPT)) = *(EDMA3_CCRL_ParamentryRegs *)newPaRAM;

IDMA0 would be a good way to do this, too. LDNDW is not as fast as LDDW, so some compiler optimization _nasserts may help you improve on that.

Christoph Sulzbachner said:
Further, I replaced the TSK_disable and TSK_enable statements in the dat_critical_section functions by _disable_interrupts and _restore_interrupts due to the lower function overhead.

The DAT function overhead for fill, copy and copy2d could be reduced to 1800 ticks.

If you own the resources so that you do not have to do any resource allocation in your personalized DAT functions, then you may be able to get rid of the disable/enable functions altogether. Improving the DAT overhead from 35000 to 1800 looks pretty great! Is that applied for every 64KB block?

Christoph Sulzbachner said:
Using 0xffff chunks for transferring 1GiB results in a total copying performance of 920MiB/s.

0xffff does not look like a very good number for optimization, but then, your block factors in your original post are not very efficient anyway.
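One way to pick a friendlier chunk size than 0xffff is to round down to a multiple of some alignment while staying within the 16-bit ACNT limit. The sketch below does that on a host; the 64-byte alignment used in the note is an assumption standing in for whatever burst or cache-line size applies on the device, and the type and function names are made up.

```c
#include <stdint.h>

#define ACNT_MAX 0xFFFFu  /* ACNT is a 16-bit field */

typedef struct {
    uint32_t chunk;        /* bytes per full chunk */
    uint32_t full_chunks;  /* number of full chunks */
    uint32_t tail;         /* bytes left for a final short transfer */
} ChunkPlan;

/* Largest chunk <= ACNT_MAX that is a multiple of `align` (a power of two),
 * plus how many such chunks cover `total` and what is left over. */
static ChunkPlan plan_chunks(uint32_t total, uint32_t align)
{
    ChunkPlan p;
    p.chunk = ACNT_MAX & ~(align - 1u);
    p.full_chunks = total / p.chunk;
    p.tail = total % p.chunk;
    return p;
}
```

For 1GiB and a 64-byte alignment this yields chunks of 0xFFC0 bytes, 16400 full chunks and a 1024-byte tail, so every chunk starts on an aligned address if the buffers themselves are aligned.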

If you are doing everything that you need to do in your application, then quadrupling your performance from 230MiB/s to 920MiB/s looks pretty good to me.

The biggest question here is, what do you need to do for your application? One benchmark never tells the whole story, and one call to a function is rarely the sum of the system requirements.

If you are just looking for a maximum throughput for one thing, try searching the E2E forum and Wiki pages for keywords like "EDMA3 optimization" or throughput or bandwidth or maximum, or combinations of those. Others have asked about these things on many of the processors for one reason or another, so you may find someone else who has come up with what you want to do.

If you have an application that needs some certain performance, there will be many factors in addition to EDMA3 performance, and all of these factors will play together. If you get one EDMA3 transfer running at 80-90% of the 2.1GBps DDR2 bandwidth, once you add other components of your application that number may go way down once they all start sharing the DDR2 bus. And conversely, if you are only using 45% of the available DDR2 bandwidth for this EDMA3 transfer, then you might not slow it down at all when you add the other system components. It depends on all the components and not just one item.
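To put the measured numbers next to the bandwidth figure above, here is a small host-side sketch. It assumes a 32-bit DDR2 interface at 533 mega-transfers per second, which is one way to arrive at roughly 2.1GB/s; the actual clock on a given board may be lower, and the function names are hypothetical.

```c
/* Theoretical peak of a DDR interface: bus width in bytes times transfers/s. */
static double ddr_peak_bytes_per_s(unsigned bus_bits,
                                   double mega_transfers_per_s)
{
    return (bus_bits / 8.0) * mega_transfers_per_s * 1e6;
}

/* Fraction of the peak used by a rate measured in MiB/s. */
static double utilization(double measured_mib_per_s, double peak_bytes_per_s)
{
    return measured_mib_per_s * 1024.0 * 1024.0 / peak_bytes_per_s;
}
```

Under these assumptions 920MiB/s corresponds to about 45% of the peak, while a figure such as 7339MiB/s would be several times the peak and therefore cannot be a real transfer rate.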

If you want to take this further, there are still several unanswered questions that I have asked earlier in this thread. You can decide where you want to go next with this.

Regards,
RandyP
