TMS320F28377D: Reducing DMA overhead / latency

Part Number: TMS320F28377D

Hello,

I have a customer using DMA to reduce memory transfer overhead on the CPU. Right now, 128 bytes (32x32 bits) of data are being transferred from GS memory to EMIF1.

Using the CPU, this takes approximately 8 us to complete. In 16-bit DMA mode, the same transfer takes 32.8 us, and in 32-bit DMA mode it takes 27.4 us. Since using 32 bit mode cuts down the transfers by half, this suggests that the actual DMA transaction takes ~5 us.

Timing is done using the IPC counter, starting from right before the DMA is triggered, and ending once the DMA complete interrupt is triggered. 

The following are the values used to configure DMA ch1:

DmaRegs.CH1.BURST_SIZE.all = 31U;   // 32 16 bit words(X-1) x-ferred in a burst.
DmaRegs.CH1.SRC_BURST_STEP = 1U;    // Increment source addr between each word x-ferred.
DmaRegs.CH1.DST_BURST_STEP = 1U;    // Increment dest addr between each word x-ferred.

// Set up TRANSFER registers:
DmaRegs.CH1.TRANSFER_SIZE = 7U;         // 8 bursts (X-1) per transfer, DMA interrupt will occur after completed transfer.
DmaRegs.CH1.SRC_TRANSFER_STEP = 1U;     // TRANSFER_STEP is ignored when WRAP occurs.
DmaRegs.CH1.DST_TRANSFER_STEP = 1U;     // TRANSFER_STEP is ignored when WRAP occurs.

DmaClaSrcSelRegs.DMACHSRCSEL1.bit.CH1 = 0U; // Source select
DmaRegs.CH1.MODE.bit.PERINTSEL = 1U;        // Should be hard coded to channel, above now selects source
DmaRegs.CH1.MODE.bit.PERINTE = 1U;          // Peripheral interrupt enable
DmaRegs.CH1.MODE.bit.ONESHOT = 1U;          // Oneshot enable
DmaRegs.CH1.MODE.bit.CONTINUOUS = 1U;       // Continuous enable
DmaRegs.CH1.MODE.bit.OVRINTE = 0U;          // Enable/disable the overflow interrupt
DmaRegs.CH1.MODE.bit.DATASIZE = 1U;         // 16-bit/32-bit data size transfers
DmaRegs.CH1.MODE.bit.CHINTMODE = 1U;        // Generate interrupt to CPU at beginning/end of transfer
DmaRegs.CH1.MODE.bit.CHINTE = 1U;           // Channel Interrupt to  CPU enable

// Clear any spurious flags: Interrupt flags and sync error flags
DmaRegs.CH1.CONTROL.bit.PERINTCLR = 1U;
DmaRegs.CH1.CONTROL.bit.ERRCLR = 1U;

Any input on reducing this transfer time would be appreciated, thanks!

Munan
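
A quick cross-check on the channel setup above. The code comments indicate that BURST_SIZE and TRANSFER_SIZE are both programmed as a count minus one. Assuming that encoding, and assuming both sizes count 16-bit words as the BURST_SIZE comment states, the sketch below works out how much data one trigger moves. The helper names are hypothetical, and this is host-side arithmetic only, not device code.

```c
#include <stdint.h>

/* Hypothetical helper: 16-bit words moved per DMA transfer, assuming
 * BURST_SIZE and TRANSFER_SIZE both hold (count - 1), as the comments
 * in the posted configuration indicate. */
static uint32_t dma_words_per_transfer(uint32_t burst_size_reg,
                                       uint32_t transfer_size_reg)
{
    uint32_t words_per_burst = burst_size_reg + 1U;
    uint32_t bursts = transfer_size_reg + 1U;
    return words_per_burst * bursts;
}

/* Hypothetical helper: TRANSFER_SIZE register value needed to move
 * total_words with the given BURST_SIZE register value. Assumes
 * total_words is a non-zero multiple of the burst length. */
static uint32_t dma_transfer_size_for(uint32_t total_words,
                                      uint32_t burst_size_reg)
{
    return (total_words / (burst_size_reg + 1U)) - 1U;
}
```

Under those assumptions, the posted values (BURST_SIZE = 31, TRANSFER_SIZE = 7) give dma_words_per_transfer(31, 7) = 256 words, or 512 bytes, while a 128-byte payload is 64 words, for which dma_transfer_size_for(64, 31) returns 1 (2 bursts). It may be worth confirming which TRANSFER_SIZE value was in effect when the timings were taken.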

12 Replies

  • Hi,

    Since using 32 bit mode cuts down the transfers by half, this suggests that the actual DMA transaction takes ~5 us. 

    This is not true. In this case most of the time is taken by the EMIF interface, and that time stays almost the same in 32-bit mode because the external interface is still 16 bits wide. Even if the DMA does a 32-bit transfer, the EMIF splits each 32-bit access into two 16-bit accesses (the same accesses it performs in 16-bit mode).

    Hope it is clear.

    Regards,

    Vivek Singh

     

    If my reply answers your question please click on the green button "Verify Answer".

  • In reply to Vivek Singh:

    In either case, why would using the DMA system to do the transfer take more than 3x longer than using the CPU to copy the same data? The latency from EMIF should be the same in either case... Is there a way to reduce this time?
    Also, then what explains the difference in the 32 bit vs 16 bit DMA?
  • In reply to Munan Xu:

    Hi,

    In either case, why would using the DMA system to do the transfer take more than 3x longer than using the CPU to copy the same data?

    Data transfer via the CPU is much more efficient compared to the DMA. Please note that DMA transfers are not pipelined. But it should definitely not take 3x the time. I will check your DMA code, but you may want to check these numbers again.

    Also, then what explains the difference in the 32 bit vs 16 bit DMA?

    There are two components here: time taken by the DMA plus time taken by the EMIF. When you switch from 16-bit to 32-bit, the time taken by the DMA is reduced by half. So even though the EMIF accesses take almost the same time (with some cycles saved in 32-bit mode), there is an improvement in the overall time.

    Regards,

    Vivek Singh

  • In reply to Vivek Singh:

    Hello Vivek,

    The DMA has a four-cycle pipeline, with the last cycle being the EMIF write. The number of cycles saved by switching from 16-bit to 32-bit should only be around 130 (approx. 650 ns), not 5 us. Yes, the CPU might be a little more efficient, but the transfer time would still be dominated by the EMIF write cycle. The EMIF transfers would take approximately 6 us with our setup, which makes sense given that we see transfer times of just over 7 us when using the CPU, so the DMA should be in that ballpark. No other DMA channels are enabled, the CPU is running out of LS memory, and the DMA transfers are from GS memory to EMIF. The register setup is shown at the top of this thread. The IPC counter is read, then a software trigger is issued to the DMA. The interrupt is generated at the end of the transfer, where the IPC counter is read again to determine the execution time.

    I'll do some further investigation to see if I can determine where the majority of the time is being spent, and if I find anything I will post the results. In the meantime, is there any sample code with timings for accessing the EMIF asynchronously?

    Regards,

    Orval
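
The measurement method described above (read the counter, trigger the DMA, read the counter again in the ISR) reduces to a tick difference and a unit conversion. The sketch below shows that arithmetic on the host. It assumes a free-running 32-bit counter value and that the counter is clocked at the 200 MHz system clock mentioned in this thread; both the helper names and that clocking assumption are illustrative, not taken from the device documentation.

```c
#include <stdint.h>

/* Elapsed ticks between two reads of a free-running 32-bit counter.
 * Unsigned subtraction gives the right answer across a single wrap. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert ticks to nanoseconds for a counter clocked at clk_hz.
 * The multiply is done in 64 bits so it cannot overflow. */
static uint64_t ticks_to_ns(uint32_t ticks, uint32_t clk_hz)
{
    return ((uint64_t)ticks * 1000000000ULL) / clk_hz;
}
```

Note that a figure obtained this way also includes the software trigger and the interrupt entry latency, so it is an upper bound on the DMA transfer time itself.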

  • In reply to Orval Neil:

    Hi Neil,

    I agree that the DMA should not take many more cycles than the CPU when the overall time is dominated by EMIF access time. The total time also includes some overhead for DMA startup and for the ISR at the end, but even that should not amount to such a big number. The GSx RAM that the DMA is reading from is not accessed by any other master, right?

    We have an EMIF ASYNC example in controlSUITE, but not with timing. CCS has a profiling feature which the customer can use to check the timing.

    Regards,

    Vivek Singh

  • In reply to Vivek Singh:

    Vivek,

    No other master is accessing the GS ram, the CPU is running out of LS ram and is spinning waiting for the DMA to complete. I've run further tests doing the same transfers from GS to GS ram in both 16 and 32 bit modes. The 32 bit mode took 2.570us and the 16 bit mode took 4.475us. 

    The datasheets show that 16-bit mode should take 258 sysclk cycles and 32-bit mode should take 130 cycles. That would mean that, with a 5 ns cycle time, we should see 1.29 us in 16-bit mode and 650 ns in 32-bit mode. The measurements are off by almost a factor of 4. Any idea why this is so?

    Regards,

    Orval
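
The cycle counts quoted above can be reproduced with a simple model: 4 cycles per access plus 1 cycle of start-up per burst. That model is an assumption chosen here because it matches the 258 and 130 cycle figures; the function names are hypothetical. The sketch also compares the estimate with the measured GS-to-GS times from this post.

```c
#include <stdint.h>

/* Estimated DMA cycles, assuming 4 SYSCLK cycles per access plus
 * 1 cycle of start-up per burst. This model is an assumption that
 * reproduces the 258 and 130 cycle figures quoted in the thread. */
static uint32_t dma_est_cycles(uint32_t bursts, uint32_t accesses_per_burst)
{
    return bursts * ((4U * accesses_per_burst) + 1U);
}

/* Cycles to nanoseconds at a given cycle time (5 ns at 200 MHz). */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t ns_per_cycle)
{
    return cycles * ns_per_cycle;
}

/* Measured time over estimated time, in percent (integer math). */
static uint32_t slowdown_percent(uint32_t measured_ns, uint32_t est_ns)
{
    return (measured_ns * 100U) / est_ns;
}
```

With 2 bursts of 32 accesses (16-bit mode) the model gives 258 cycles, or 1290 ns; with 2 bursts of 16 accesses (32-bit mode) it gives 130 cycles, or 650 ns. Against the measured 4475 ns and 2570 ns, slowdown_percent returns 346 and 395, so the gap is roughly 3.5x in 16-bit mode and just under 4x in 32-bit mode.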

  • In reply to Orval Neil:

    Hi Orval,

    Please share your EMIF setting code.

    Vivek Singh

  • In reply to Vivek Singh:

    Vivek,

    I've gone through the calculations of what I'd expect the execution time to be.

    2 bursts * [ ( ( 3 cycles/word + (8 EMIF cycles * 2 ))  * 32 words/burst )+ 1 ] = 1218 cycles * 5ns = 6.09us

    Our EMIF bus uses 8 cycles / write and runs at sysClk/2. The calculation above is for 16 bit mode. The results I see are 4 times this, the same as when I do a memory to memory transfer.

    I've included the EMIF configuration code below.

        EALLOW;
    
        // Perform a Module soft reset on EMIF1
        DevCfgRegs.SOFTPRES1.all = 0x1U;
        __asm (" nop");
        DevCfgRegs.SOFTPRES1.all = 0x0U;
    
        //This bit selects whether the EMIF1 module run with a /1 or /2 clock.
        ClkCfgRegs.PERCLKDIVSEL.bit.EMIF1CLKDIV = EMIF_CLOCK_SEL_DIV_BY_2;
    
        //Grab EMIF1 For Core1
        // lvd_lint -(C1) Defines from TI header file.
        Emif1ConfigRegs.EMIF1MSEL.all = EMIF1_KEY | EMIF1_CORE_GRABBED;
    
        //Check if the write to Emif1ConfigRegs is successful
        if (Emif1ConfigRegs.EMIF1MSEL.all != 0x1u)
        {
              errCount++;
        }
    
        //Disable access protection
        Emif1ConfigRegs.EMIF1ACCPROT0.all = 0x0u;
    
        //Check if the write to Emif1ConfigRegs is successful
        if(Emif1ConfigRegs.EMIF1ACCPROT0.all != 0x0u)
        {
            errCount++;
        }
    
        //Permanently Locks the write to access protection and master select
        //fields for EMIF1
        Emif1ConfigRegs.EMIF1COMMIT.all = 0x0u;
    
        //Check if the write to Emif1ConfigRegs is successful
        if(Emif1ConfigRegs.EMIF1COMMIT.all != 0x0u)
        {
            errCount++;
        }
    
        //Locks the write to access protection and master select fields for
        //EMIF1
        Emif1ConfigRegs.EMIF1LOCK.all = 0x0u;
    
        //Check if the write to Emif1ConfigRegs is successful
        if(Emif1ConfigRegs.EMIF1LOCK.all != 0x0u)
        {
            errCount++;
        }
    
        //Configure ASYNC_CS2_CR register
        //EMIF_ASYNC_SS_ENABLE  - Enable strobe selection mode
        //EMIF_ASYNC_ASIZE_16   - 16-bit databus interface
        //EMIF_ASYNC_TA_2       - Turn around time 2 Emif clock
        //EMIF_ASYNC_RHOLD_1    - Hold time of 1 Emif clock
        //EMIF_ASYNC_RSTROBE_6  - Read strobe time of 6
        //EMIF_ASYNC_RSETUP_1   - Read Setup time of 1 Emif Clock
        //EMIF_ASYNC_WHOLD_1    - Write Hold time of 1 Emif Clock
        //EMIF_ASYNC_WSTROBE_3  - Write Strobe time of 3
        //EMIF_ASYNC_WSETUP_1   - Write Setup time of 1 Emif Clock
        //EMIF_ASYNC_EW_DISABLE - Extended Wait Disable.
    
        //         |      Write    |      Read     |
        //  SS |EW | SU | STRB |HLD| SU | STRB |HLD|TA|SZ|
        // | - | - |----|------|---|----|------|---|--|--|
        //   1   0  0000 000011 000 0000 000110 000 01 01
        //Emif1Regs.ASYNC_CS2_CR.all = (0x80300305u);
    
        // This fixes the 32 bit read issues - Chris is looking into a permanent fix.
        //         |      Write    |      Read     |
        //  SS |EW | SU | STRB |HLD| SU | STRB |HLD|TA|SZ|
        // | - | - |----|------|---|----|------|---|--|--|
        //   1   0  0000 000100 001 0000 000111 001 01 01
        Emif1Regs.ASYNC_CS2_CR.all = (0x80420395u);
    
        EDIS;
    

    Regards,

    Orval
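
The 8 cycles per write and the 1218-cycle estimate in this post can be cross-checked against the ASYNC_CS2_CR value that is actually written. The sketch below decodes the write timing fields using the bit layout given in the code comments, and then evaluates the estimate formula. It assumes each timing field encodes the cycle count minus one, which is consistent with the 8 cycles per write stated above but is not confirmed in this thread; the function names are hypothetical.

```c
#include <stdint.h>

/* Field positions follow the bit-layout comment in the posted code:
 * SS(1) EW(1) W_SETUP(4) W_STROBE(6) W_HOLD(3)
 * R_SETUP(4) R_STROBE(6) R_HOLD(3) TA(2) ASIZE(2), MSB first. */
static uint32_t cs_w_setup(uint32_t cr)  { return (cr >> 26) & 0xFU;  }
static uint32_t cs_w_strobe(uint32_t cr) { return (cr >> 20) & 0x3FU; }
static uint32_t cs_w_hold(uint32_t cr)   { return (cr >> 17) & 0x7U;  }

/* EMIF clocks per write, ASSUMING each field encodes (cycles - 1). */
static uint32_t emif_write_clocks(uint32_t cr)
{
    return (cs_w_setup(cr) + 1U) + (cs_w_strobe(cr) + 1U)
         + (cs_w_hold(cr) + 1U);
}

/* The estimate from the post, in SYSCLK cycles:
 * bursts * ((dma_cycles + emif_clocks * clk_div) * words + 1) */
static uint32_t gs_to_emif_est_cycles(uint32_t bursts,
                                      uint32_t words_per_burst,
                                      uint32_t dma_cycles_per_word,
                                      uint32_t emif_clocks_per_write,
                                      uint32_t emif_clk_div)
{
    uint32_t per_word = dma_cycles_per_word
                      + (emif_clocks_per_write * emif_clk_div);
    return bursts * ((per_word * words_per_burst) + 1U);
}
```

For 0x80420395 the decoded fields are setup 0, strobe 4 and hold 1, so emif_write_clocks returns 8. Plugging that into gs_to_emif_est_cycles(2, 32, 3, 8, 2) gives 1218 cycles, which is 6.09 us at 5 ns per cycle, matching the hand calculation.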

  • In reply to Orval Neil:

    Hi Neil,

    Just wanted to check whether you still have this issue, so that I can look into it further. Sorry for the late response.

    Vivek Singh

  • In reply to Vivek Singh:

    Vivek,

    There is still an issue. I've also tested the same size transfers from GS to GS memory: execution time is 2.570 us in 32-bit mode and 4.475 us in 16-bit mode, running at 200 MHz. These numbers still seem very slow. The datasheets show that 16-bit mode should take 258 sysclk cycles and 32-bit mode should take 130 cycles. That would mean that, with a 5 ns cycle time, we should see 1.29 us in 16-bit mode and 650 ns in 32-bit mode. The measurements are off by almost a factor of 4. This is the same slowdown factor that I see when transferring from GS to the EMIF1 bus. Any idea why this is so?

     

    Thanks in Advance,

    Orval