I've run into a serious performance problem on my Davinci project. For my initial testing I had a planar YUV (888) video buffer coming in, laid out as y0y1... u0u1... v0v1.... For each row in the input buffer (minus 7) I DMA'd 7 rows of the Y data into internal SRAM, ran a conv7x7 algorithm on it, and then DMA'd the output row back. I got the performance up to a reasonable point.

Then I switched over to the real data, an interleaved YUV (422) buffer laid out as vy0uy1, so I had to change my DMA to transfer every other byte from the input buffer to the internal SRAM buffer, run the conv7x7 function on it, and DMA it back. To get this to work I had to switch from a 1D1D transfer to a 2D2D transfer, and the performance went out the door. The following timings are for 10 video frames:
1D1D DMA's 0.074097s 7x7 i8_c8
DMA .058274s 7x7 i8_c8
2D2D DMA 0.926687s
As you can see, with the 1D1D DMA the performance was pretty good; changing to 2D2D really brought it to a screeching halt. Here are the DMA parameters for both transfers; hopefully I'm just doing something wrong. If it's not possible to do DMAs like this quickly, is it possible to tell the Davinci front end (VPFE) to deliver the data planar rather than interleaved? I couldn't see anything in the manual.
Parameters for 1D DMA
params.elementSize = width;
params.numElements = 11;
params.srcElementIndex = width;
params.dstElementIndex = width;
params.srcFrameIndex = 0;
params.dstFrameIndex = 0;
params.waitId = 0;
params.srcAddr = (void*)in1Ptr;
params.dstAddr = (void*)dma1Ptr;
ACPY3_configure(dma1, &params, 0);
Parameters for 2D DMA
params.transferType = ACPY3_2D2D;
params.elementSize = 1;
params.numElements = 11*width;
params.srcElementIndex = 2;
params.dstElementIndex = 1;
params.numFrames = 1;
params.srcFrameIndex = 0;
params.dstFrameIndex = 0;
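To make the intent of the 2D parameters above concrete, here is the access pattern written as a plain C loop. This is only a CPU-side sketch of the byte shuffle, not ACPY3 code. The function name extract_luma and its y_offset argument are hypothetical, and it assumes two bytes per pixel with the Y samples at every other byte (offset 1 for the vy0uy1 ordering).

```c
#include <stddef.h>
#include <stdint.h>

/* CPU-side illustration only; this is not ACPY3 code.
 * Assumption: interleaved 4:2:2 with two bytes per pixel, so one source
 * row is 2 * width bytes, and the luma bytes are every other byte
 * starting at y_offset (1 for a "v y0 u y1" ordering). */
void extract_luma(const uint8_t *src, uint8_t *dst,
                  size_t width, size_t rows, size_t y_offset)
{
    size_t r, x;

    for (r = 0; r < rows; r++) {
        const uint8_t *s = src + r * 2 * width + y_offset; /* start of row's first Y byte */
        uint8_t *d = dst + r * width;                      /* packed Y-only output row */

        for (x = 0; x < width; x++) {
            d[x] = s[2 * x]; /* source step of 2 bytes, destination step of 1 byte */
        }
    }
}
```

Calling extract_luma(in, out, width, 11, 1) would gather the same 11 * width luma bytes that the 2D transfer is set up to move, one byte per element.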
BTW - Is there a limit to the number of DMA channels an algorithm can have? I've been using 2, but when I switched to 3 I couldn't get it to run without crashing, so I ended up going back to 2.
Thanks,