Hi,
Is it possible to trigger two parallel DMA transfers with one DMA request (from a DMTimer, for instance) on AM335x microprocessors? I see that there are more PaRAM sets than available channels.
Thanks!
Hi Felipe,
I have asked the factory team about this.
Triggering 2 parallel DMA transfers is useless, because there is only ONE DMA engine, and all transfers are executed in sequence.
You can use channel chaining to do several transfers in sequence, starting from one event. Read the manual!
BEWARE:
a) I was not able to trigger a DMA transfer from several timers.
b) A timer with a "working" DMA event is timer 4.
c) To reset the timer's DMA trigger, you need to write 0x07 to the timer's IRQSTATUS register AND write 0x00 to the timer's IRQ_EOI register. Otherwise the DMA event will not be reset after the first trigger.
I have used DMA chaining to write to these 2 registers.
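For testing from the CPU side (rather than via DMA chaining), the two register writes from point c) could look roughly like the sketch below. The register offsets are assumptions recalled from the AM335x TRM and should be checked against your copy; the function name is made up for illustration, and the caller is assumed to have mapped the timer's register block already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offsets inside one DMTimer instance. ASSUMPTION: values recalled
 * from the AM335x TRM register map; verify before use. */
#define DMTIMER_IRQ_EOI    0x20u
#define DMTIMER_IRQSTATUS  0x28u

/* Hypothetical helper: acknowledge the timer so that its DMA event can
 * fire again. timer_base points at the first register of the timer. */
void dmtimer_rearm_dma_event(volatile uint32_t *timer_base)
{
    assert(timer_base != NULL);
    timer_base[DMTIMER_IRQSTATUS / 4u] = 0x07u; /* clear all three status bits */
    timer_base[DMTIMER_IRQ_EOI / 4u]   = 0x00u; /* signal end of interrupt */
}
```

With DMA chaining, the same two values would be written by two small chained transfers instead of by the CPU.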
regards
Wolfgang
Wolfgang Muees1 said: Triggering 2 parallel DMA transfers is useless, because there is only ONE DMA engine, and all transfers are executed in sequence
That is incorrect, EDMA on the am335x has 3 TCs (transfer controllers) capable of performing transfers in parallel.
To trigger parallel transfers, you use 2 channels as follows:
You can get a completion irq for the channel B transfer, but transfer A may still be running in parallel. If you need a completion irq for the combined transfer, you can have the channel B transfer issue a chaining event on completion to a third channel C that is configured to the same event queue as channel A. Channel C should trigger some trivial transfer, which gets queued behind the original channel A transfer and hence acts like a memory barrier. The completion of this barrier transfer therefore indicates that both the A and B transfers have completed.
Note that "trivial transfer" means e.g. a 4-byte transfer from anywhere to some place harmless (e.g. the revision register of the EMIF). Zero-byte transfers are unfortunately not supported by the EDMA TCs. Depending on the details of the desired DMA setup, you may be able to use linking to have channel A alternate between the real transfer and the barrier transfer and avoid the need for a third channel.
BTW, even if the two transfers are submitted to the same TC (thus avoiding the complicated barrier trick to get a completion irq), if "early" chaining is used then they will still be partly parallelized since the TC is pipelined and can already begin reading data for the second transfer while still busy writing data for the first transfer.
Wolfgang,
Is DMTimer 4 the only one with a working DMA event? I read the TRM and there are DMA requests for all DMTimers. Are there problems with the others?
regards,
Rubo.
Thanks, Matthijs van Duin! I understood your suggestion.
A question: I have a 64-byte buffer to be transferred (source) and a 4-byte memory area to be updated (destination). On each DMA request, I have to transfer the next 4 bytes from the buffer to the memory area. How can I do this?
| Parameter | Value | Comment |
|---|---|---|
| SYNCDIM | 0 | A-synchronized, i.e. a 1-dimensional transfer per DMA event |
| ACNT | 4 | bytes per transfer |
| SRCBIDX | 4 | src pointer adjustment after each transfer |
| DSTBIDX | 0 | dst pointer adjustment after each transfer |
| BCNT | 16 | number of transfers = 64 / 4 |
| BCNTRLD | - | irrelevant when CCNT = 1 |
| SRCCIDX | - | (ditto) |
| DSTCIDX | - | (ditto) |
| CCNT | 1 | |
Or an alternative I think I might prefer:
| Parameter | Value | Comment |
|---|---|---|
| SYNCDIM | 1 | AB-synchronized, i.e. a 2-dimensional transfer per DMA event |
| ACNT | 4 | bytes per transfer |
| SRCBIDX | - | src pointer adjustment after each transfer |
| DSTBIDX | - | dst pointer adjustment after each transfer |
| BCNT | 1 | number of transfers per DMA event |
| BCNTRLD | - | not used when AB-synchronized |
| SRCCIDX | 4 | src pointer adjustment after each DMA event |
| DSTCIDX | 0 | dst pointer adjustment after each DMA event |
| CCNT | 16 | number of events = 64 / 4 |
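As a rough sketch, the AB-synchronized table could be packed into the eight 32-bit words of a PaRAM set as shown below. The struct, macro and function names are made up for illustration; the field packing (two 16-bit values per word, SYNCDIM as bit 2 of OPT, 0xFFFF as the null link) is how I recall the EDMA3 PaRAM layout and should be checked against the TRM. Completion code, interrupt and chaining bits in OPT are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* One EDMA3 PaRAM set (8 words). ASSUMPTION: layout as recalled from
 * the TRM; names are illustrative, not taken from any driver. */
struct edma_param {
    uint32_t opt;
    uint32_t src;
    uint32_t a_b_cnt;      /* ACNT in bits 15:0, BCNT in bits 31:16 */
    uint32_t dst;
    uint32_t src_dst_bidx; /* SRCBIDX in 15:0, DSTBIDX in 31:16 */
    uint32_t link_bcntrld; /* LINK in 15:0, BCNTRLD in 31:16 */
    uint32_t src_dst_cidx; /* SRCCIDX in 15:0, DSTCIDX in 31:16 */
    uint32_t ccnt;
};

#define EDMA_OPT_SYNCDIM (1u << 2) /* 1 = AB-synchronized */
#define EDMA_LINK_NULL   0xFFFFu   /* no linked parameter set */

/* Build an AB-synchronized set with BCNT = 1: each DMA event copies
 * acnt bytes, then src/dst advance by srccidx/dstcidx. */
struct edma_param edma_param_ab_sync(uint32_t src, uint32_t dst,
                                     uint16_t acnt, int16_t srccidx,
                                     int16_t dstcidx, uint16_t ccnt)
{
    struct edma_param p;

    assert(acnt > 0 && ccnt > 0);
    p.opt          = EDMA_OPT_SYNCDIM;
    p.src          = src;
    p.a_b_cnt      = (uint32_t)acnt | (1u << 16);
    p.dst          = dst;
    p.src_dst_bidx = 0; /* irrelevant when BCNT = 1 */
    p.link_bcntrld = EDMA_LINK_NULL;
    p.src_dst_cidx = (uint32_t)(uint16_t)srccidx |
                     ((uint32_t)(uint16_t)dstcidx << 16);
    p.ccnt         = ccnt;
    return p;
}
```

For the case in the table this would be called as `edma_param_ab_sync(buffer_addr, target_addr, 4, 4, 0, 16)`, where both addresses are placeholders for your own physical addresses.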
[Update] To elaborate on the second alternative: my preference for AB-synchronized is because it is conceptually simpler. The parameters ACNT, BCNT, SRCBIDX and DSTBIDX (together with SRC, DST, and some options) directly specify the transfer request that the channel controller (CC) submits to the transfer controller (TC) in response to a DMA event. The TC then performs a simple copy loop:
while (BCNT) {
    memcpy(DST, SRC, ACNT);
    SRC += SRCBIDX;
    DST += DSTBIDX;
    --BCNT;
}
The variables here are the TC's private copies; the originals are only modified by the CC, and in a way which is also easy to describe: after submitting the transfer request to the TC it does
    SRC += SRCCIDX; DST += DSTCIDX; --CCNT;
and if CCNT has decremented to 0, it declares the job done and loads the next parameter set (if any; otherwise it clears the parameters).
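To make the CC/TC split concrete, here is a small software model of the AB-synchronized behaviour described above. It is only a sketch in plain C that runs on ordinary memory; the struct and function names are invented for illustration and have nothing to do with the real hardware interface or any driver API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Software stand-in for the relevant PaRAM fields (illustrative names). */
struct xfer_state {
    uint8_t *src;
    uint8_t *dst;
    uint16_t acnt, bcnt;
    int16_t  srcbidx, dstbidx, srccidx, dstcidx;
    uint16_t ccnt;
};

/* TC side: receives the request by value, i.e. works on a private copy. */
void tc_execute(struct xfer_state tr)
{
    while (tr.bcnt) {
        memmove(tr.dst, tr.src, tr.acnt);
        tr.src += tr.srcbidx;
        tr.dst += tr.dstbidx;
        --tr.bcnt;
    }
}

/* CC side: handle one DMA event. Submits the request, then updates the
 * original parameters. Returns 1 while more events are expected and 0
 * once CCNT has reached zero. */
int cc_on_event(struct xfer_state *p)
{
    assert(p != NULL && p->ccnt > 0);
    tc_execute(*p);
    p->src += p->srccidx;
    p->dst += p->dstcidx;
    --p->ccnt;
    return p->ccnt != 0;
}
```

With ACNT = 4, BCNT = 1, SRCCIDX = 4, DSTCIDX = 0 and CCNT = 16, calling `cc_on_event` sixteen times walks through a 64-byte buffer in 4-byte steps while always writing to the same 4-byte destination.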
The A-synchronized case makes the state-updating by the CC a lot messier. It's very useful if you need it of course, but in simple cases such as this one where you're free to choose between the two options, I'd default to AB.
Matthijs van Duin,
Do you know how to write a module in Linux kernel space to do this parallel transfer with EDMA?
Hi Matthijs
Can we use ACNT=3? When I set ACNT=3, the DMA always seems to transfer 4 bytes. Please confirm. After debugging, I observed that the ACNT value apparently has to be a power of 2.
Seems to work fine here. I started with the buffer:
40318000 11 22 33 44 55 66 77 88 99 00 00 00 00 00 00 00
40318010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
40318020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
40318030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Performed a transfer with src=0x40318000, dst=0x40318010, acnt=3, srcbidx=3, dstbidx=0x10, bcnt=3. Result:
40318000 11 22 33 44 55 66 77 88 99 00 00 00 00 00 00 00
40318010 11 22 33 00 00 00 00 00 00 00 00 00 00 00 00 00
40318020 44 55 66 00 00 00 00 00 00 00 00 00 00 00 00 00
40318030 77 88 99 00 00 00 00 00 00 00 00 00 00 00 00 00
As expected.
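The result above is exactly what the simple copy loop predicts. As a sanity check, here is a plain C model of that loop (a software sketch only, with an invented function name) that can be run against an ordinary 64-byte array standing in for the memory at 0x40318000:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of one 2-dimensional transfer: bcnt arrays of acnt bytes each,
 * with independent source and destination strides. */
void model_copy_arrays(uint8_t *src, uint8_t *dst, uint16_t acnt,
                       int16_t srcbidx, int16_t dstbidx, uint16_t bcnt)
{
    assert(src != NULL && dst != NULL);
    while (bcnt) {
        memmove(dst, src, acnt); /* copies exactly acnt bytes, no rounding */
        src += srcbidx;
        dst += dstbidx;
        --bcnt;
    }
}
```

Calling `model_copy_arrays(mem + 0x00, mem + 0x10, 3, 3, 0x10, 3)` on a buffer that starts with 11 22 33 44 55 66 77 88 99 places 11 22 33 at offset 0x10, 44 55 66 at 0x20 and 77 88 99 at 0x30, and leaves the fourth byte of each row untouched.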
LIDD is not memory, and writing to a data port always requires care. Every write to the LIDD register will result in a bus transaction.
I don't really understand the result you're reporting, but then again I'd consider attempting a "3-byte write" to a data port to have undefined behaviour anyhow.
If the bus width is 8-bit then presumably you need three 8-bit transfers per pixel, which needs something awkward like:
acnt = 1
srcbidx = 1
dstbidx = 0
bcnt = 3
srccidx = 4
dstcidx = 0
ccnt = 240 * 320
and then using intermediate transfer completion chaining to transfer the whole frame at once. This makes incredibly inefficient use of EDMA, but I'm not sure what alternative there is. If I understand correctly, the integrated DMA controller of LCDC is 16-bit oriented, so when used with an 8-bit bus it would require supplying the data as an array of u16 with each upper byte unused. Using the CPU to repack the data from Android's framebuffer to a more suitable format might actually be a good option.
Please be a lot more specific about what sort of error was reported by what. An error reported by the EDMA channel controller simply means you misconfigured it. An error reported by the EDMA transfer controller would indicate a more serious problem.
How are you configuring EDMA specifically? I'm quite certain the configuration I outlined in my previous post is valid.
Can you confirm that writing the framebuffer manually works correctly? (Be sure to use volatile u8 for the LIDD target port, or use the ACCESS_ONCE macro in the Linux kernel.)
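A manual write loop for such a test could look roughly like the sketch below. It assumes a 32-bit-per-pixel framebuffer in which the first three bytes of every pixel are the ones to send (matching srcbidx = 1, bcnt = 3, srccidx = 4 above) and an 8-bit data port that the caller has already mapped; the function name is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Push a frame to an 8-bit data port, three bytes per pixel.
 * ASSUMPTION: fb holds 4 bytes per pixel and the fourth byte is unused. */
void lidd_write_frame(volatile uint8_t *port, const uint8_t *fb,
                      size_t pixels)
{
    assert(port != NULL && fb != NULL);
    for (size_t i = 0; i < pixels; i++) {
        const uint8_t *px = fb + 4 * i;
        /* volatile keeps the compiler from merging or dropping writes,
         * so each store becomes its own access to the port */
        *port = px[0];
        *port = px[1];
        *port = px[2];
    }
}
```

For a 240 x 320 panel this would be called with `pixels = 240 * 320`. If this loop produces a correct image, the port and panel setup are fine and any remaining problem is in the EDMA configuration.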
BTW it just occurred to me that when configured like this, EDMA would actually perform the transfer slower than the CPU would, possibly even a lot slower. It would perform 2 separate reads (and 3 writes) for each pixel of the framebuffer. It would also generate 8 times as much DDR3 traffic compared to an efficiently written transfer routine on the CPU. The only benefit would be that EDMA works in the background while the CPU can do other stuff.
Since you're using Android, I'm assuming you're using the SGX. I asked someone who works at imgtec about it and he confirmed the SGX530 does not natively support RGB888. He mentioned, however, that it could probably be done with some creativity:
"You might be able to do some terrible shader shenanigans, outputting RGBA RGBA RGBA as R1G1B1R2 G2B2R3G3 B3R4G4B4.
Modifying the SurfaceFlinger gles2 shaders using the 'normal' gles2 api should be do-able. You will have to modify the gralloc allocation when importing the EGLImage, so it thinks it's 4-channel of the correspondingly smaller width. I haven't dug through that code recently, but I know such things are do-able. We did something similar to fake YUV pixel format output through SurfaceFlinger."
Since I know pretty much nothing about shaders or Android, this is all Greek to me, but perhaps it is meaningful to you or other people.
Suchit Bhatt said: Can we use ACNT=3? When I set ACNT=3, the DMA always seems to transfer 4 bytes. Please confirm. After debugging, I observed that the ACNT value apparently has to be a power of 2.
Sorry for kicking the thread one last time, but it seems it may be useful for future searchers to give a final addendum on bus transactions by EDMA on the L3 interconnect: testing shows that it natively supports
Transfer requests which do not meet these restrictions are supported by enlarging them to fit, using byte-enables to correctly mask writes. Transfers are never split except to prevent crossing a page boundary.
Some examples, where the addresses are offsets from any 4 KB aligned base address:
For memory all this is really not relevant, but when aimed at the L4 interconnect things are different. As far as I can tell a single access to one of LIDD's ADDR or DATA registers causes a single transaction on the external bus, hence if you use ACNT = bytes per pixel then I would expect every pixel to result in exactly one bus transaction. Since LIDD's external bus width is limited to 16 bits, using ACNT=3 or ACNT=4 would use only the first two bytes of each pixel.
(Note that using LIDD's integrated DMA controller instead of EDMA effectively behaves like ACNT=2)
P.S. Suchit, next time you have a question, start a new topic instead of replying to an unrelated old thread. Also, when people ask for clarification it would be nice to actually pay attention and give a response. By digging a bit I noticed that a similar suggestion (but for 16-bit framebuffer) was given to you in an earlier thread and you never gave any further response to that either.