DMA on DM6437

John3588

Hi all,

I am working on the DM6437. I intend to use DMA to transfer data
from external memory (DDR2) to internal memory (L1DSRAM).
The DDR2 runs at 162 MHz and is configured with a 32-bit bus width.

I am using the DMAN3 and ACPY3 interfaces with QDMA channels to accomplish
this task. I am exploring ways to configure the transfers to achieve
the maximum possible throughput.

The minimum transfer size required by my application is 672 bytes, and
I need four such transfers (because the source addresses are different).
Therefore my total transfer size is 2688 bytes.

I am using the time stamp counter (TSC) to profile the DMA transfer time.

Scenario #1

A single logical channel with four linked transfers is used.
So the DMA overhead for the whole set is one ACPY3_start() and one ACPY3_wait()
call. About 3800 cycles were consumed to transfer 2688 bytes.

Scenario #2

In this case, four logical channels, each with a single transfer, were used. So
the DMA overhead is four ACPY3_start() and four ACPY3_wait() calls. Only about
2800 cycles were consumed in transferring 2688 bytes. The cycles mentioned here
were profiled from the end of the first ACPY3_start() call. Please note that all
the logical channels have the same priority level.

Could somebody explain why the transfer time in scenario #2 is less
than in scenario #1 for the same transfer size?

DM643x devices have three TCs. Is it because multiple channels get
distributed among the TCs, so the transfers possibly happen in parallel? However,
this seems unlikely, as all the TCs have to access the same DDR2, resulting
in port conflicts. Please correct me if my understanding is wrong.

Another question related to the TCs: each TC has a different configuration
with respect to FIFO size, bus width, burst size, etc. How does the user control
the routing of a transfer request to the desired TC? Is it by assigning priority
levels to the logical channels?

Regards,
John

over 17 years ago

Guy over 17 years ago

IMHO,

Scenario 1: the fetch of the configuration (by the DMA engine) is interleaved with your transfers by the driver.

Scenario 2: the configuration for all transfers has already been fetched at the start.

The last three transfers are started while the first one is in progress, so they add no overhead.

Any other explanations or corrections?

I'm interested in this. Thanks,

Guy

John3588 over 17 years ago in reply to Guy

Hi Guy,

Thanks for the reply.

I agree with you that transfer #1 happens in parallel with the
start of the last three transfers. However, the total transfer time is
the sum of the times taken by the individual transfers to complete. I say
this because the transfers happen sequentially. Don't they?

Moreover, I dug further into the configuration done by DMAN3 (DMAN3_init()).
All the QDMA events are mapped to event queue 1, so my assumption is
that all the transfer requests are serviced by TC1 alone.

The overhead of the ACPY3_start() function is about 100 cycles. In both cases
there is at least one ACPY3_start() call before the actual DMA transfer is
initiated, so the overhead should remain the same.

So both cases should consume a similar, if not identical, number of cycles.
Please let me know if I have missed something.

Regards,
John

Brad Griffis over 17 years ago in reply to John3588

There is a FIFO that stores the transfer requests for the queue. In your second scenario you are putting all 4 events in that FIFO at the start. Hence, as soon as one transfer completes, the next is ready to go immediately. Your first scenario, on the other hand, will have some idle time between transfers: when one transfer completes, the DMA controller must copy/link in a new set of parameters, that copy in turn causes an event to be queued in the FIFO, and only then does the next transfer begin.

The EDMA is built for massive throughput, not necessarily low latency. Therefore your second case ends up being faster, because the EDMA can "pipeline" the requests. Your first scenario exposes the latency between transfers.

From the hardware perspective, the DMAQNUM and QDMAQNUM registers are what map a channel to a specific queue/TC.

Brad