VICP: moving input(s)/output from image buffer to coefficient memory

Orjan Friberg

Hi,

Page 67 in SPRUGN1C.pdf (VICP Computation Unit Library) specifies the memory conflict factor for imxenc_array_op depending on the location of the input buffers and the output buffer. In the VICP signal processing library, the arrayOp implementation puts all three in the image buffer, resulting in a memory conflict factor of 3. I'd like to get rid of the memory conflict factor by moving the input vectors to the coefficient memory instead.

I haven't tried modifying _CPIS_setDmaInTransfers to put the data in the coefficient memory instead, but I thought I would at least be able to change (in _arrayOp.c):

info.cmdlen += imxenc_array_op(
(Int16*)info.imgbufptr,
(Int16*)(info.imgbufptr + block_size),
(Int16*)(info.imgbufptr) [...]

to:

memcpy((Int16*)info.coefptr, (Int16*)(info.imgbufptr + block_size), block_size);

info.coeflen += block_size;

info.cmdlen += imxenc_array_op(
(Int16*)info.imgbufptr,
(Int16*)info.coefptr,
(Int16*)(info.imgbufptr) [...]

but that results in verification errors.

over 15 years ago

Paul.Yin over 15 years ago

TI__Genius 14405 points

Hi, what is your question? Would you like comments on your modification?

Orjan Friberg over 15 years ago in reply to Paul.Yin

Expert 1385 points

Hi,

My question is: What do I need to do to use the VICP's coefficient memory to hold the input parameters?

Orjan Friberg over 15 years ago in reply to Orjan Friberg

Expert 1385 points

Some partial progress; the DMA copy to imgbufptr probably happens later, so the memcpy should be from srcBuf directly:

memcpy((Int16*)info.coefptr, (Int16*)((unsigned)base->srcBuf[0].ptr + block_size), block_size);

or, equivalently,

memcpy((Int16*)info.coefptr, (Int16*)(base->srcBuf[1].ptr), block_size);

It only works with a 1x1 block (at 64*32), though, not with the default 10x15 blocks, so it's probably still not correct.

Is it possible to do memory-to-memory-DMA for the data that is put in the coefficient memory?

(The VICP library filter function uses manual memory writes, which makes me think it may not be possible.)

Gagan Maur over 15 years ago in reply to Orjan Friberg

TI__Expert 8150 points

Hello Orjan,

First, some background on VICP operation. A file such as _arrayOp.c contains C code that executes on the DSP. The DSP uses the function _CPIS_setArrayOpProcessing to 'create SW' for the VICP accelerator: the function sets up the VICP for execution by loading the right binary code into the accelerator's instruction memory. The function does not itself perform the processing. Also note that VICP processing happens in blocks, and the data for each block is read in using EDMA. Thus, after _CPIS_setArrayOpProcessing returns, the VICP and the EDMA work in tandem to process the whole image, block by block.

If you add a memcpy to the _CPIS_setArrayOpProcessing code, it is executed by the DSP as part of that setup. As the description above implies, such a memcpy has no effect on the block-by-block processing that follows.

As for your question about improving the processing time of this function: the function is currently memory I/O limited. That is, its execution time is determined by how fast the EDMA can move data in and out of the system. The observed performance will therefore not improve with what you are attempting. You can, however, combine more than one iMX API call to break the memory I/O bottleneck and get better overall performance.
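The I/O-bound argument can be illustrated with a very simplified model, assuming the EDMA transfer for one block fully overlaps the VICP compute for another, so that the slower of the two sets the per-block pace. The function name `block_time` and all cycle counts are hypothetical and are not taken from the VICP library or its documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified per-block time, assuming EDMA transfers fully overlap
 * VICP compute: whichever is slower sets the pace.
 * All names and cycle counts here are illustrative assumptions. */
static uint32_t block_time(uint32_t dma_cycles,
                           uint32_t compute_cycles_per_op,
                           uint32_t chained_ops)
{
    assert(chained_ops > 0);
    uint32_t compute = compute_cycles_per_op * chained_ops;
    return dma_cycles > compute ? dma_cycles : compute;
}
```

With a hypothetical 3000 DMA cycles per block, `block_time(3000, 768, 1)` and `block_time(3000, 256, 1)` both give 3000, so a faster kernel alone changes nothing in this model. Chaining ten operations on the same block, `block_time(3000, 768, 10)`, gives 7680, at which point compute rather than I/O becomes the limit.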

Regards,
Gagan

Orjan Friberg over 15 years ago in reply to Gagan Maur

Expert 1385 points

Hi Gagan,

Right; I do need to chain commands together to amortize the cost of the initial EDMA transfers; chaining 10 commands gets me very close to the cycle time for imxenc_array_op:

The estimated number of VICP cycles to perform the operation (except overhead time) is:
amount_of_work × memory_conflict_factor / speedup_factor

After doing this, another 3x performance increase looks possible by putting the input vectors into coefficient memory instead, which would reduce the memory conflict factor.
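As a worked instance of the quoted formula, here is a minimal sketch. The helper name `vicp_estimated_cycles` is hypothetical, and the speedup factor of 8 used in the usage note is only a placeholder, not a value taken from SPRUGN1C.

```c
#include <assert.h>
#include <stdint.h>

/* Estimated VICP cycles, excluding overhead:
 *   amount_of_work * memory_conflict_factor / speedup_factor
 * Hypothetical helper; the result is rounded up to a whole cycle. */
static uint32_t vicp_estimated_cycles(uint32_t amount_of_work,
                                      uint32_t memory_conflict_factor,
                                      uint32_t speedup_factor)
{
    assert(memory_conflict_factor > 0);
    assert(speedup_factor > 0);
    /* Use 64-bit arithmetic so the product cannot overflow. */
    uint64_t num = (uint64_t)amount_of_work * memory_conflict_factor;
    return (uint32_t)((num + speedup_factor - 1) / speedup_factor);
}
```

For a 64*32 block (2048 elements) and the placeholder speedup factor of 8, a conflict factor of 3 gives `vicp_estimated_cycles(2048, 3, 8)` = 768, while a conflict factor of 1 gives 256. The ratio between the two is the 3x mentioned above, whatever the real speedup factor is.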

Thanks for the VICP setup explanation. You lost me at memcpy meaning nothing though; _filter.c does a similar setup for its coefficient data:

/* Below code takes care of 8bit elements too. We just copy more data */
for (i = 0; i < coeff_size; i++) {
    info.coefptr[i] = ((Int16 *)(params->coeffPtr))[i];
    info.coeflen += 1;
}

Does this work differently depending on the underlying imxenc_ function?

Thanks,

Orjan

Orjan Friberg over 15 years ago in reply to Orjan Friberg

Expert 1385 points

Gagan,

If you meant that the memcpy from info.imgbufptr meant nothing, I agree.

I looked at the memcpy from srcPtr to coefptr again and noticed I only copied one single block; copying all of srcBuf[1].ptr makes no difference though.

Since I don't intend to use the coefficient memory for coefficients, but want to keep the block-based processing, maybe I need to write a custom DmaIn function?

Gagan Maur over 15 years ago in reply to Orjan Friberg

TI__Expert 8150 points

Orjan, each VICP kernel can read from and write to coefficient memory. However, the EDMAs in the VICP library are programmed to DMA in to the image buffer and then DMA out from the image buffer, to take advantage of the image buffer's ping-pong buffering capability. The VICP coefficient memory has no such ping-pong capability. A function that needs to process an entire frame has to process it block by block. Thus, both input and output are located in the image buffer to enable this ping-pong processing.

> Thanks for the VICP setup explanation. You lost me at memcpy meaning nothing though; _filter.c does a similar setup for its coefficient data

Yes. Please note that a function such as _CPIS_setArrayOpProcessing only does setup. Part of the setup can be to initialize the coefficients. The _filter.c example you give sets up the coefficient memory with the filter coefficients that the kernel will use. This is OK because it is a one-time setup and doesn't need to be repeated for every block that is processed.

What you were doing was to move processing data into coefficient memory using memcpy. Now this is not valid because this step needs to be done again and again for every block that is processed. After the setup function returns and the VICP processing starts, who will move the processing data into coefficient memory for every block?
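The difference between one-time setup data and per-block data can be shown with a small host-side simulation. This is plain C, not VICP or EDMA code: `img` stands in for the image buffer refreshed for every block, `coef` stands in for coefficient memory filled once at setup, and `count_correct_blocks`, `BLOCK_LEN` and `NUM_BLOCKS` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_LEN  4
#define NUM_BLOCKS 3

/* Adds frames a and b block by block and returns how many blocks
 * produce the expected a[i] + b[i]. Operand a is refreshed per block;
 * operand b is read from `coef`, which is filled only once. */
static int count_correct_blocks(const int16_t *a, const int16_t *b)
{
    int16_t img[BLOCK_LEN];
    int16_t coef[BLOCK_LEN];
    int correct = 0;

    assert(a != NULL && b != NULL);

    /* One-time copy, like a memcpy placed in the setup function. */
    memcpy(coef, b, sizeof coef);

    for (int blk = 0; blk < NUM_BLOCKS; blk++) {
        /* Per-block transfer: only the image buffer is refreshed. */
        memcpy(img, a + blk * BLOCK_LEN, sizeof img);

        int ok = 1;
        for (int i = 0; i < BLOCK_LEN; i++) {
            int idx = blk * BLOCK_LEN + i;
            int16_t out  = (int16_t)(img[i] + coef[i]);
            int16_t want = (int16_t)(a[idx] + b[idx]);
            if (out != want)
                ok = 0;
        }
        correct += ok;
    }
    return correct;
}
```

When `b` differs from block to block, only the first block comes out right, because `coef` still holds block 0 of `b`. When `b` repeats the same values in every block, as filter coefficients do, all blocks are correct. This would be consistent with the earlier observation that only the single-block configuration verified, though that is an inference from the simulation and not something confirmed in the thread.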

> Since I don't intend to use the coefficient memory for coefficients, but want to keep the block-based processing, maybe I need to write a custom DmaIn function

A couple of issues. First, it is a lot of change. Second, because the coefficient memory doesn't support ping-pong buffering, you can't fetch the next chunk of data using EDMA in parallel with the VICP processing the previous chunk. Thus, the VICP and the EDMA will conflict for the memory resource. I recommend against doing this.
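The cost of losing ping-pong buffering can be sketched with a rough whole-frame timing model. It is deliberately pessimistic: it assumes that without ping-pong the entire per-block transfer is serialized with compute, whereas in practice only part of the data would move to coefficient memory. The function names and cycle counts are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* With ping-pong: the transfer for one block overlaps compute on
 * another, so after the first fill the slower stage sets the pace. */
static uint64_t frame_cycles_overlapped(uint32_t blocks,
                                        uint32_t dma, uint32_t compute)
{
    assert(blocks > 0);
    uint32_t pace = dma > compute ? dma : compute;
    return (uint64_t)dma + (uint64_t)(blocks - 1) * pace + compute;
}

/* Without ping-pong: transfer and compute take turns on the memory. */
static uint64_t frame_cycles_serialized(uint32_t blocks,
                                        uint32_t dma, uint32_t compute)
{
    assert(blocks > 0);
    return (uint64_t)blocks * ((uint64_t)dma + compute);
}
```

For 150 blocks (the 10x15 default mentioned earlier) and a hypothetical 3000 transfer cycles per block, the overlapped case with a 768-cycle kernel takes 450768 cycles, while the serialized case with a faster 256-cycle kernel takes 488400. Under these assumed numbers, the lower memory conflict factor does not make up for the lost overlap.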

Gagan
