Codec Engine DMA performance

Gaston Schelotto

Hi,

After evaluating some CE examples without DMA, I've started testing them with it. For this purpose I'm evaluating the video1_copy example, but I cannot see any performance improvement from using the DMA resource. Here's a comparison table of the first two frames after executing:

CE_DEBUG=1 CE_DSP0TRACE="ti.sdo.ce.VISA=5" ./app_remote.xv5T -s xe674

IOFRAMESIZE [bytes]	1024
DSP [MHz]	800

		ticks [hex]	ticks [dec]	delta [ticks]	delta [us]	codec	frame
DMA	enter VISA	26c6fcd	40660941			encoder	frame0
	exit VISA	26e87fc	40798204	137263	171.58	encoder
	enter VISA	278afb3	41463731	665527	831.91	decoder
	exit VISA	27a7d3e	41581886	118155	147.69	decoder
	enter VISA	285c67d	42321533	739647	924.56	encoder	frame1
	exit VISA	287cefe	42454782	133249	166.56	encoder
	enter VISA	290d9f7	43047415	592633	740.79	decoder
	exit VISA	292a636	43165238	117823	147.28	decoder


		ticks [hex]	ticks [dec]	delta [ticks]	delta [us]	codec	frame
No DMA	enter VISA	2e4b84f	48543823			encoder	frame0
	exit VISA	2e69ed9	48668377	124554	155.69	encoder
	enter VISA	2f1edd7	49409495	741118	926.40	decoder
	exit VISA	2f3cca7	49532071	122576	153.22	decoder
	enter VISA	3025a33	50485811	953740	1192.18	encoder	frame1
	exit VISA	3045643	50615875	130064	162.58	encoder
	enter VISA	30f8d1f	51350815	734940	918.68	decoder
	exit VISA	31163f9	51471353	120538	150.67	decoder
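
For reference, the time column in both tables is just the tick delta divided by the DSP clock in MHz. Here is a minimal C sketch of that conversion, assuming the trace ticks count at the 800 MHz DSP clock listed above (the helper name `ticks_to_us` is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Convert the difference between two hex tick strings to microseconds.
 * cpu_mhz is the tick rate in MHz (assumed to be the 800 MHz DSP clock). */
static double ticks_to_us(const char *start_hex, const char *end_hex, double cpu_mhz)
{
    assert(cpu_mhz > 0.0);
    uint64_t start = strtoull(start_hex, NULL, 16); /* e.g. "26c6fcd" -> 40660941 */
    uint64_t end = strtoull(end_hex, NULL, 16);     /* e.g. "26e87fc" -> 40798204 */
    return (double)(end - start) / cpu_mhz;         /* ticks / (ticks per us) */
}
```

Calling `ticks_to_us("26c6fcd", "26e87fc", 800.0)` reproduces the first encoder entry of the DMA table (137263 ticks).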

In both cases the results are almost the same for the enter/exit VISA calls (about 160 us). I've double-checked the output of the DMA example using the CE_DEBUG=3 option and everything seems to work fine: I can see the DMAN3 and ACPY3 calls as expected. What could be missing?

CE 3.22.01.06
EZSDK 5.04.00.11
dm816x evm

Regards,
gaston

over 13 years ago

Ramsey, over 13 years ago


Gaston,

I asked around regarding your question and the short answer is that DMA doesn't help much unless you perform the memory movement in parallel with CPU processing. Here is the longer answer.

The bottleneck is typically the EMIF. It doesn't matter whether you access external memory using DMA or the CPU; the external memory accesses can only go so fast.
Typically codecs get a performance lift from DMA when they run this loop:
    1. DMA a subset of the input data buffer into internal memory
    2. Process it from that [fast] internal memory
    3. DMA the results from internal mem to the output buffer in external memory
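
To make the loop above concrete, here is a rough C sketch. It is not taken from any TI codec: `memcpy` stands in for the DMA transfers (a real codec would submit ACPY3 transfers instead), and `SCRATCH_BYTES`, `dma_copy`, `process_block` and `process_frame` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SCRATCH_BYTES 256 /* hypothetical size of the internal-memory scratch buffer */

/* Stand-in for a blocking DMA transfer; a real codec would submit an
 * ACPY3 transfer here and wait for it to complete. */
static void dma_copy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Hypothetical per-block processing: invert every byte in place. */
static void process_block(unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (unsigned char)~buf[i];
}

static void process_frame(const unsigned char *in, unsigned char *out, size_t len)
{
    /* On the DSP this buffer would be placed in internal RAM. */
    static unsigned char scratch[SCRATCH_BYTES];

    assert(in != NULL && out != NULL);
    for (size_t off = 0; off < len; off += SCRATCH_BYTES) {
        size_t n = (len - off < SCRATCH_BYTES) ? len - off : SCRATCH_BYTES;
        dma_copy(scratch, in + off, n);  /* 1. external input -> internal memory */
        process_block(scratch, n);       /* 2. process from fast internal memory */
        dma_copy(out + off, scratch, n); /* 3. internal memory -> external output */
    }
}
```

A pure copy codec has no step 2 worth speaking of, so there is little fast internal-memory work for the DMA to pay off against.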

Here is another response I received.

The benefit of DMA over a CPU copy for synchronous raw transfers of large 1-dimensional blocks to/from external memory is typically bounded by the DDR/EMIF bandwidth of the target device. For 2D transfers, DMA should outperform a CPU copy because it eliminates the CPU overhead of offset/skip calculations, etc.
The real performance gain from DMA comes from careful scheduling: have the CPU do useful work while DMA transfers run in the background, instead of blocking on the completion of each transfer.
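
A rough sketch of that kind of scheduling is a ping-pong (double-buffer) loop: prefetch block k+1 while the CPU works on block k. The C below only illustrates the ordering. `dma_start` and `dma_wait` are hypothetical stand-ins, not ACPY3 calls, and here the copy completes synchronously inside `dma_start`, so the code shows the structure rather than any real overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_BYTES 256 /* hypothetical block size that fits internal memory */

/* Hypothetical asynchronous DMA stand-ins. On real hardware dma_start()
 * would queue a transfer and return at once, and dma_wait() would block
 * until all queued transfers finish. Here the copy is done immediately. */
static void dma_start(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
static void dma_wait(void) { }

static size_t block_len(size_t len, size_t off)
{
    return (len - off < BLOCK_BYTES) ? len - off : BLOCK_BYTES;
}

static void process_frame_pingpong(const unsigned char *in, unsigned char *out, size_t len)
{
    static unsigned char buf[2][BLOCK_BYTES]; /* two internal-memory buffers */
    int cur = 0;

    assert(in != NULL && out != NULL);
    if (len == 0)
        return;

    dma_start(buf[cur], in, block_len(len, 0)); /* prime the first block */
    for (size_t off = 0; off < len; off += BLOCK_BYTES) {
        size_t n = block_len(len, off);
        size_t next = off + BLOCK_BYTES;

        /* Current block has arrived and the previous write-back is done,
         * so the other buffer is free to be refilled. */
        dma_wait();
        if (next < len)
            dma_start(buf[1 - cur], in + next, block_len(len, next));

        /* CPU work on the current block; on real hardware this would run
         * while the prefetch above is still in flight. */
        for (size_t i = 0; i < n; i++)
            buf[cur][i] = (unsigned char)~buf[cur][i];

        dma_start(out + off, buf[cur], n); /* write results back */
        cur = 1 - cur;
    }
    dma_wait(); /* drain the last write-back */
}
```

The gain from this structure depends on how much CPU work there is per block to hide the transfers behind.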

The codec author should be the one to comment on the expected benefit of enabling DMA.

I hope this answers your question.

~Ramsey
