Parallel DMA transfers triggered by one request

Felipe Rubo

Hi,

Is it possible to trigger two parallel DMA transfers with one DMA request (from a DMTimer, for instance) on AM335x microprocessors? I see that there are more PaRAM sets than available channels.

Thanks!

over 11 years ago

0 Biser Gatchev-XID over 11 years ago

TI Guru 393215 points

Hi Felipe,

I have asked the factory team about this.

0 Wolfgang Muees1 over 11 years ago

Genius 3685 points

Triggering 2 parallel DMA transfers is useless, because there is only ONE DMA engine, and all transfers are executed in sequence.

You can use channel chaining to do several transfers in sequence, starting from one event. Read the manual!

BEWARE:

a) I was not able to trigger a DMA transfer from several timers.

b) A timer with a "working" DMA event is timer 4.

c) In order to reset the DMA trigger from the timer, you need to write 0x07 to the timer IRQSTATUS register AND write 0x00 to the timer IRQ_EOI register. Otherwise the DMA event will not be reset after the first trigger.

I have used DMA chaining to write to these 2 registers.
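For reference, the two register writes from point (c) can be sketched as CPU code. This is only a sketch: the register offsets below are assumptions to be checked against the AM335x TRM, and the helper name and base-pointer parameter are hypothetical. With DMA chaining, the same two values would instead be the sources of two chained 4-byte transfers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed DMTimer register offsets in bytes; verify against the AM335x TRM. */
#define DMTIMER_IRQ_EOI    0x20u
#define DMTIMER_IRQSTATUS  0x28u

/* Hypothetical helper: re-arm the timer's DMA event after it has fired. */
static void dmtimer_rearm_dma_event(volatile uint32_t *timer_base)
{
    assert(timer_base != 0);
    /* Values taken from the post above: 0x07 to IRQSTATUS, then 0x00 to IRQ_EOI. */
    timer_base[DMTIMER_IRQSTATUS / 4] = 0x07u;
    timer_base[DMTIMER_IRQ_EOI / 4]   = 0x00u;
}
```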

regards

Wolfgang

0 Matthijs van Duin over 11 years ago in reply to Wolfgang Muees1

Mastermind 8040 points

Wolfgang Muees1 said:
Triggering 2 parallel DMA transfers is useless, because there is only ONE DMA engine, and all transfers are executed in sequence

That is incorrect, EDMA on the am335x has 3 TCs (transfer controllers) capable of performing transfers in parallel.

To trigger parallel transfers, you use 2 channels as follows:

channel A is triggered by the DMA event, it has chaining enabled to channel B and "early completion" to trigger this chaining as soon as the transfer is submitted to the TC rather than when it completes.
channel B is configured to a different event queue and associated TC

You can get a completion irq for the channel B transfer, but transfer A may still be running in parallel. If you need a completion irq for the combined transfer, you can have the channel B transfer issue, on completion, a chaining event to a third channel C configured to the same event queue as channel A. There it should trigger some trivial transfer, which gets queued behind the original channel A transfer and hence acts like a memory barrier. The completion of this barrier transfer therefore indicates that both the A and B transfers have completed.

Note that "trivial transfer" means e.g. a 4-byte transfer from anywhere to some place harmless (e.g. the revision register of the EMIF). Zero-byte transfers are unfortunately not supported by the EDMA TCs.

Depending on the details of the desired DMA setup, you may be able to use linking to have channel A alternate between the real transfer and the barrier transfer and avoid the need for a third channel.
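A rough sketch of the OPT words for the three channels described above. The bit positions are assumptions based on the usual EDMA3 PaRAM OPT layout and should be verified against the TRM; the helper names and channel numbers are hypothetical. The channel-to-queue mapping (B on a different queue, C on the same queue as A) is configured separately and is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed EDMA3 OPT bit positions; verify against the TRM. */
#define OPT_TCCMODE   (1u << 11)                        /* 1 = early completion */
#define OPT_TCC(n)    (((uint32_t)(n) & 0x3Fu) << 12)   /* completion code / chain target */
#define OPT_TCINTEN   (1u << 20)                        /* completion interrupt enable */
#define OPT_TCCHEN    (1u << 22)                        /* completion chaining enable */

/* Channel A: chain to channel B as soon as A's request is submitted to its TC. */
static uint32_t opt_channel_a(unsigned chan_b)
{
    assert(chan_b < 64);
    return OPT_TCCHEN | OPT_TCCMODE | OPT_TCC(chan_b);
}

/* Channel B: on normal completion, chain to the barrier channel C. */
static uint32_t opt_channel_b(unsigned chan_c)
{
    assert(chan_c < 64);
    return OPT_TCCHEN | OPT_TCC(chan_c);
}

/* Channel C: trivial barrier transfer that raises the completion interrupt. */
static uint32_t opt_channel_c(unsigned irq_tcc)
{
    assert(irq_tcc < 64);
    return OPT_TCINTEN | OPT_TCC(irq_tcc);
}
```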

0 Matthijs van Duin over 11 years ago in reply to Matthijs van Duin

Mastermind 8040 points

A picture may help here, especially for the completion irq trick (blue dotted lines are events, solid black lines transfer requests):

0 Matthijs van Duin over 11 years ago in reply to Matthijs van Duin

Mastermind 8040 points

BTW, even if the two transfers are submitted to the same TC (thus avoiding the complicated barrier trick to get a completion irq), if "early" chaining is used then they will still be partly parallelized since the TC is pipelined and can already begin reading data for the second transfer while still busy writing data for the first transfer.

0 Felipe Rubo over 11 years ago in reply to Wolfgang Muees1

Prodigy 40 points

Wolfgang,

Is DMTimer 4 the only one with a DMA event? I read the TRM and there are DMA requests for all DMTimers. Are there problems with the others?

regards,

Rubo.

0 Felipe Rubo over 11 years ago in reply to Matthijs van Duin

Prodigy 40 points

Thanks, Matthijs van Duin! I understood your suggestion.

A question: I have a 64-byte buffer to be transferred (source) and a 4-byte memory area to be updated (destination). On each DMA request, I have to transfer the next 4 bytes from the buffer to the memory area. How can I do this?

0 Matthijs van Duin over 11 years ago in reply to Felipe Rubo

Mastermind 8040 points

Parameter	Value	Comment
SYNCDIM	0	A-synchronized, i.e. a 1-dimensional transfer per DMA event
ACNT	4	bytes per transfer
SRCBIDX	4	src pointer adjustment after each transfer
DSTBIDX	0	dst pointer adjustment after each transfer
BCNT	16	number of transfers = 64 / 4
BCNTRLD	-	irrelevant when CCNT = 1
SRCCIDX	-	(ditto)
DSTCIDX	-	(ditto)
CCNT	1

Or an alternative I think I might prefer:

Parameter	Value	Comment
SYNCDIM	1	AB-synchronized, i.e. a 2-dimensional transfer per DMA event
ACNT	4	bytes per transfer
SRCBIDX	-	src pointer adjustment after each transfer
DSTBIDX	-	dst pointer adjustment after each transfer
BCNT	1	number of transfers per DMA event
BCNTRLD	-	not used when AB-synchronized
SRCCIDX	4	src pointer adjustment after each DMA event
DSTCIDX	0	dst pointer adjustment after each DMA event
CCNT	16	number of events = 64 / 4
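As a sketch, the second (AB-synchronized) table could be written into a PaRAM set like this. The struct layout is an assumption based on the usual 32-byte EDMA3 PaRAM format and should be checked against the TRM; the struct and function names are hypothetical, and all OPT bits other than SYNCDIM are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of one EDMA3 PaRAM set (8 words); verify against the TRM. */
struct edma_param {
    uint32_t opt;
    uint32_t src;
    uint16_t acnt;
    uint16_t bcnt;
    uint32_t dst;
    int16_t  srcbidx;
    int16_t  dstbidx;
    uint16_t link;
    uint16_t bcntrld;
    int16_t  srccidx;
    int16_t  dstcidx;
    uint16_t ccnt;
    uint16_t reserved;
};

/* Compile-time check that the struct has no padding (32 bytes). */
typedef char edma_param_size_check[(sizeof(struct edma_param) == 32) ? 1 : -1];

/* Hypothetical helper: 4 bytes per event from a 64-byte source to a fixed destination. */
static void param_ab_sync_4_of_64(struct edma_param *p, uint32_t src, uint32_t dst)
{
    assert(p != 0);
    p->opt      = 1u << 2;   /* SYNCDIM = 1 (AB-synchronized); other options omitted */
    p->src      = src;
    p->dst      = dst;
    p->acnt     = 4;         /* bytes per transfer */
    p->bcnt     = 1;         /* transfers per DMA event */
    p->srcbidx  = 0;         /* irrelevant when BCNT = 1 */
    p->dstbidx  = 0;         /* irrelevant when BCNT = 1 */
    p->link     = 0xFFFF;    /* assumed null link: no reload after the last event */
    p->bcntrld  = 0;         /* not used when AB-synchronized */
    p->srccidx  = 4;         /* advance source by 4 bytes after each event */
    p->dstcidx  = 0;         /* destination stays fixed */
    p->ccnt     = 16;        /* number of events = 64 / 4 */
    p->reserved = 0;
}
```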

[Update] To elaborate on the second alternative: my preference for AB-synchronized is because it is conceptually simpler. The per-transfer parameters (ACNT, BCNT, SRCBIDX, DSTBIDX), together with SRC, DST, and some options, directly specify the transfer request that the channel controller (CC) submits to the transfer controller (TC) in response to a DMA event. The TC then performs a simple copy loop:

while( BCNT ) {
    memcpy( DST, SRC, ACNT );
    SRC += SRCBIDX;
    DST += DSTBIDX;
    --BCNT;
}

The variables here are the TC's private copy thereof, the original is only modified by the CC, and in a way which is also easy to describe: after submitting the transfer request to the TC it does

SRC += SRCCIDX;
DST += DSTCIDX;
--CCNT;

and if CCNT decremented to 0 declares the job done and loads the next (if any, or clears the parameters otherwise).

The A-synchronized case makes the state-updating by the CC a lot messier. It's very useful if you need it of course, but in simple cases such as this one where you're free to choose between the two options, I'd default to AB.
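The division of work between TC and CC described above can be modelled in plain C. This is a software model only, not driver code; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Software model of an AB-synchronized channel's state (hypothetical names). */
struct ab_job {
    const uint8_t *src;
    uint8_t *dst;
    unsigned acnt, bcnt, ccnt;
    int srcbidx, dstbidx, srccidx, dstcidx;
};

/* Model one DMA event; returns the remaining CCNT (0 means the job is done). */
static unsigned ab_event(struct ab_job *job)
{
    assert(job != 0 && job->ccnt > 0);

    /* TC part: works on private copies of SRC, DST and BCNT. */
    const uint8_t *src = job->src;
    uint8_t *dst = job->dst;
    unsigned bcnt = job->bcnt;
    while (bcnt) {
        memcpy(dst, src, job->acnt);
        src += job->srcbidx;
        dst += job->dstbidx;
        --bcnt;
    }

    /* CC part: updates the original parameters after submitting the request. */
    job->src += job->srccidx;
    job->dst += job->dstcidx;
    return --job->ccnt;
}
```

With ACNT=4, BCNT=1, SRCCIDX=4, DSTCIDX=0 and CCNT=16, each call copies the next 4 bytes of a 64-byte buffer to the same 4-byte destination.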

0 Felipe Rubo over 11 years ago in reply to Matthijs van Duin

Prodigy 40 points

Matthijs van Duin,

Do you know how to write a module in Linux kernel space to do this parallel transfer with EDMA?

0 Matthijs van Duin over 11 years ago in reply to Felipe Rubo

Mastermind 8040 points

Sorry, I'm not familiar with the Linux EDMA driver.

0 Suchit Bhatt over 9 years ago in reply to Matthijs van Duin

Intellectual 315 points

Hi Matthijs

Can we use ACNT=3? When I set ACNT=3, the DMA still always transfers 4 bytes. Please confirm. After debugging, I observed that the ACNT value apparently has to be a power of 2.

0 Matthijs van Duin over 9 years ago in reply to Suchit Bhatt

Mastermind 8040 points

Seems to work fine here. I started with the buffer:

40318000 11 22 33 44 55 66 77 88 99 00 00 00 00 00 00 00
40318010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
40318020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
40318030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Performed a transfer with src=0x40318000, dst=0x40318010, acnt=3, srcbidx=3, dstbidx=0x10, bcnt=3. Result:

40318000 11 22 33 44 55 66 77 88 99 00 00 00 00 00 00 00
40318010 11 22 33 00 00 00 00 00 00 00 00 00 00 00 00 00
40318020 44 55 66 00 00 00 00 00 00 00 00 00 00 00 00 00
40318030 77 88 99 00 00 00 00 00 00 00 00 00 00 00 00 00

As expected.
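The expected result can be cross-checked with a small software model of the copy loop, applied to a 64-byte buffer with the same parameters. This only models the intended semantics and says nothing about the hardware; the function name is hypothetical and offsets are relative to 0x40318000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of a single 2-dimensional transfer inside one memory region. */
static void tc_copy(uint8_t *mem, size_t mem_len, size_t src, size_t dst,
                    size_t acnt, unsigned bcnt, size_t srcbidx, size_t dstbidx)
{
    while (bcnt--) {
        assert(src + acnt <= mem_len && dst + acnt <= mem_len);
        memmove(mem + dst, mem + src, acnt);  /* copy ACNT bytes */
        src += srcbidx;                       /* step source by SRCBIDX */
        dst += dstbidx;                       /* step destination by DSTBIDX */
    }
}
```

Calling `tc_copy(mem, 64, 0x00, 0x10, 3, 3, 3, 0x10)` on the initial buffer gives the three 3-byte rows shown in the dump above.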

0 Suchit Bhatt over 9 years ago in reply to Matthijs van Duin

Intellectual 315 points

Let me explain my problem in detail. I have a 240*320 (W*H) LCD connected to the AM335x LIDD interface with an 8-bit bus. The data format on the LCD is set to 24-bit RGB 888. My Android build doesn't support 24-bit RGB 888, so I have selected 32-bit ARGB 8888 mode. My frame buffer therefore has 32 bits per pixel, while the LCD expects 24 bits per pixel. In this case, I have to transfer three bytes and skip every fourth byte. I am using DMA to transfer the frame buffer data to the LCD. My DMA configuration is shown below:
ABSYNC Transfer mode
acnt = 3;
bcnt = 240;
ccnt = 320;
srcbidx = acnt+1;
dstbidx = 0;
srccidx = (acnt+1)*bcnt;
dstcidx = 0;

If I dump the frame buffer data, I can see the proper image in an image viewer. On the LCD, however, the image is repeated and displayed nine times on a single screen. If I use a 16-bit data format both on the LCD and in Android, everything works fine with DMA.

Please help me to solve this problem. Thanks for your help.

Regards,
Suchit

0 Matthijs van Duin over 9 years ago in reply to Suchit Bhatt

Mastermind 8040 points

LIDD is not memory, and writing to a data port always requires care. Every write to the LIDD register will result in a bus transaction.

I don't really understand the result you're reporting, but then again I'd consider attempting a "3-byte write" to a data port to have undefined behaviour anyhow.

If the bus width is 8-bit then presumably you need three 8-bit transfers per pixel, which needs something awkward like:

acnt = 1
srcbidx = 1
dstbidx = 0
bcnt = 3
srccidx = 4
dstcidx = 0
ccnt = 240 * 320

and then using intermediate transfer completion chaining to transfer the whole frame at once. This makes incredibly inefficient use of EDMA but I'm not sure what alternative there is. If I understand correctly the integrated DMA controller of LCDC is 16-bit oriented so when used with an 8-bit bus it would require supplying the data as an array of u16 with each upper byte unused. Using the CPU to repack the data from android's framebuffer to a more suitable format might actually be a good option.
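A sketch of the CPU repacking idea, assuming the target format really is one u16 per bus byte with the upper byte unused (as guessed above) and assuming the three wanted bytes are the first three of each 4-byte framebuffer pixel. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Repack 4-byte pixels into three u16 entries each (low byte = data, high byte unused).
 * Returns the number of u16 entries written. */
static size_t repack_for_8bit_bus(const uint8_t *fb, uint16_t *out, size_t pixels)
{
    assert(fb != 0 && out != 0);
    size_t n = 0;
    for (size_t i = 0; i < pixels; i++) {
        out[n++] = fb[4 * i + 0];
        out[n++] = fb[4 * i + 1];
        out[n++] = fb[4 * i + 2];
        /* fb[4 * i + 3] is skipped */
    }
    return n;
}
```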

0 Suchit Bhatt over 9 years ago in reply to Matthijs van Duin

Intellectual 315 points

Hi matthijs

We tried your solution but it did not work. We are getting a DMA error. Is there any other solution?

0 Matthijs van Duin over 9 years ago in reply to Suchit Bhatt

Mastermind 8040 points

Please be a lot more specific about what sort of error was reported, and by what. An error reported by the EDMA channel controller simply means you misconfigured it. An error reported by the EDMA transfer controller would indicate a more serious problem.

How are you configuring EDMA specifically? I'm quite certain the configuration I outlined in my previous post is valid.

Can you confirm that writing the framebuffer manually works correctly? (be sure to use volatile u8 for the LIDD target port, or use the ACCESS_ONCE macro in the linux kernel)
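A minimal sketch of such a manual test, assuming three byte writes per pixel with every fourth framebuffer byte skipped. The function name is hypothetical, and the address of the LIDD data port has to come from your own setup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write a frame to an 8-bit data port by CPU, one bus write per byte.
 * Returns the number of port writes performed. */
static size_t lidd_write_frame(volatile uint8_t *data_port, const uint8_t *fb, size_t pixels)
{
    assert(data_port != 0 && fb != 0);
    size_t writes = 0;
    for (size_t i = 0; i < pixels; i++) {
        for (size_t b = 0; b < 3; b++) {
            /* volatile keeps the compiler from merging or dropping these writes */
            *data_port = fb[4 * i + b];
            writes++;
        }
    }
    return writes;
}
```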

0 Matthijs van Duin over 9 years ago in reply to Matthijs van Duin

Mastermind 8040 points

BTW it just occurred to me that when configured like this, EDMA would actually perform the transfer slower than the cpu would, possibly even a lot slower. It would perform 2 separate reads (and 3 writes) for each pixel of the framebuffer. It would also generate 8 times as much DDR3 traffic compared to an efficiently written transfer routine on the cpu. The only benefit would be that EDMA works in the background while the CPU can do other stuff.

Since you're using Android I'm assuming you're using the SGX. I asked someone who works at imgtec about it and he confirmed the SGX530 does not natively support RGB888. He mentioned, however, that it could probably be done with some creativity:

"You might be able to do some terrible shader shenanigans, outputting RGBA RGBA RGBA as R1G1B1R2 G2B2R3G3 B3R4G4B4.

Modifying the SurfaceFlinger gles2 shaders using the 'normal' gles2 api should be do-able. You will have to modify the gralloc allocation when importing the EGLImage, so it thinks it's 4-channel of the correspondingly smaller width. I haven't dug through that code recently, but I know such things are do-able. We did something similar to fake YUV pixel format output through SurfaceFlinger."

Since I know pretty much nothing about shaders or Android this is all greek to me, but perhaps it is meaningful to you or other people.

0 Matthijs van Duin over 9 years ago in reply to Suchit Bhatt

Mastermind 8040 points

Suchit Bhatt said:
Can we use ACNT=3? When I set ACNT=3, the DMA still always transfers 4 bytes. Please confirm. After debugging, I observed that the ACNT value apparently has to be a power of 2.

Sorry for kicking the thread one last time, but it seems it may be useful for future searchers to give a final addendum on bus transactions by EDMA on the L3 interconnect: testing shows that it natively supports

naturally aligned power-of-two size transactions up to 16 bytes (the bus width of the EDMA TCs)
16-byte aligned bursts of one or more 16-byte transfers up to the configured maximum burst size, not crossing a 4 KB boundary.

Transfer requests which do not meet these restrictions are supported by enlarging them to fit, using byte-enables to correctly mask writes. Transfers are never split except to prevent crossing a page boundary.

Some examples, where the addresses are offsets from any 4 KB aligned base address:

2 bytes at 0x000: supported natively
2 bytes at 0x001: becomes 4 bytes at 0x000
2 bytes at 0x002: supported natively
2 bytes at 0x003: becomes 8 bytes at 0x000
2 bytes at 0x0FF: becomes 2*16 bytes at 0x0F0
2 bytes at 0x7FF: becomes 2*16 bytes at 0x7F0
2 bytes at 0xFFF: split into 1 byte at 0xFFF and 1 byte at 0x1000
3 bytes at 0x000 or 0x001: becomes 4 bytes at 0x000
3 bytes at 0x002 or 0x003: becomes 8 bytes at 0x000
3 bytes at 0x004 or 0x005: becomes 4 bytes at 0x004
3 bytes at 0x006 or 0x007: becomes 16 bytes at 0x000
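The rules above can be written down as a small model that reproduces the listed examples. This is my reading of the stated rules, not a description of the hardware; it ignores the configured maximum burst size and assumes small requests. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t addr; uint32_t size; } bus_txn;

/* Enlarge a request that does not cross a 4 KB boundary. */
static bus_txn enlarge(uint32_t addr, uint32_t len)
{
    uint32_t end = addr + len;
    /* Smallest naturally aligned power-of-two block (up to 16 bytes) that covers it. */
    for (uint32_t s = 1; s <= 16; s <<= 1) {
        uint32_t base = addr & ~(s - 1);
        if (end <= base + s)
            return (bus_txn){ base, s };
    }
    /* Otherwise a 16-byte aligned burst of whole 16-byte transfers. */
    uint32_t base = addr & ~15u;
    uint32_t top  = (end + 15u) & ~15u;
    return (bus_txn){ base, top - base };
}

/* Model one request; splits only at a 4 KB boundary. Returns the number of transactions. */
static int model_access(uint32_t addr, uint32_t len, bus_txn out[2])
{
    assert(len > 0 && len <= 4096);
    uint32_t page_end = (addr | 0xFFFu) + 1u;
    if (addr + len <= page_end) {
        out[0] = enlarge(addr, len);
        return 1;
    }
    out[0] = enlarge(addr, page_end - addr);
    out[1] = enlarge(page_end, addr + len - page_end);
    return 2;
}
```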

For memory all this is really not relevant, but when aimed at the L4 interconnect things are different. As far as I can tell a single access to one of LIDD's ADDR or DATA registers causes a single transaction on the external bus, hence if you use ACNT = bytes per pixel then I would expect every pixel to result in exactly one bus transaction. Since LIDD's external bus width is limited to 16 bits, using ACNT=3 or ACNT=4 would use only the first two bytes of each pixel.

(Note that using LIDD's integrated DMA controller instead of EDMA effectively behaves like ACNT=2)

P.S. Suchit, next time you have a question, start a new topic instead of replying to an unrelated old thread. Also, when people ask for clarification it would be nice to actually pay attention and give a response. By digging a bit I noticed that a similar suggestion (but for 16-bit framebuffer) was given to you in an earlier thread and you never gave any further response to that either.
