VICP: moving input(s)/output from image buffer to coefficient memory



Hi,

Page 67 in SPRUGN1C.pdf (VICP Computation Unit Library) specifies the memory conflict factor for imxenc_array_op depending on the location of the input buffers and the output buffer.  In the VICP signal processing library, the arrayOp implementation puts all three in the image buffer, resulting in a memory conflict factor of 3.  I'd like to get rid of the memory conflict factor by moving the input vectors to the coefficient memory instead.

I haven't tried modifying _CPIS_setDmaInTransfers to put the data in the coefficient memory instead, but I thought I would at least be able to change (in _arrayOp.c):

info.cmdlen += imxenc_array_op(
  (Int16*)info.imgbufptr,
  (Int16*)(info.imgbufptr + block_size),
  (Int16*)(info.imgbufptr) [...]

to

memcpy((Int16*)info.coefptr, (Int16*)(info.imgbufptr + block_size), block_size);

info.coeflen += block_size;

info.cmdlen += imxenc_array_op(
  (Int16*)info.imgbufptr,
  (Int16*)info.coefptr,
  (Int16*)(info.imgbufptr) [...]

but that results in verification errors.

  • Hi, what is your question? Are you asking for comments on your modification?

  • Hi,

    My question is: What do I need to do to use the VICP's coefficient memory to hold the input parameters?

  • Some partial progress; the DMA copy to imgbufptr probably happens later, so the memcpy should be from srcBuf directly:

      memcpy((Int16*)info.coefptr, (Int16*)((unsigned)base->srcBuf[0].ptr + block_size), block_size);

    or, equivalently,

      memcpy((Int16*)info.coefptr, (Int16*)(base->srcBuf[1].ptr), block_size); 

     

    It only works with a 1x1 block (at 64*32), though, not with the default 10x15 blocks, so it's probably still not correct.

     

    Is it possible to do memory-to-memory-DMA for the data that is put in the coefficient memory?

    (The VICP library filter function uses manual memory writes, which makes me think it may not be possible.)

  • Hello Orjan,

    First, let me give some background on VICP operation. A file such as _arrayOp.c contains C code that executes on the DSP. The DSP uses the function _CPIS_setArrayOpProcessing to 'create SW' for the VICP accelerator. That is, the function sets up the VICP for execution by loading the right binary code for the accelerator into its instruction memory. The function does not itself perform the processing. Also note that VICP processing happens in blocks. For each block, the data is read in using EDMA. Thus, after _CPIS_setArrayOpProcessing returns, the VICP and the EDMA work in tandem to process the whole image, block by block.

    If you add a memcpy in the _CPIS_setArrayOpProcessing code, it will be executed by the DSP. As you understand from the above description, this memcpy doesn't mean anything.

    Now, coming to your question about improving the processing time of this function: please note that the function is currently memory-I/O limited. That is, its run time is determined by how fast the EDMA can move data in and out of the system. Thus, the observed performance of the function will not improve with what you are attempting. However, you can combine more than one iMX API to break the memory I/O bottleneck and thus get better overall performance.

    Regards,
    Gagan

  • Hi Gagan,

    Right; I do need to chain commands together to amortize the cost of the initial EDMA transfers; chaining 10 commands gets me very close to the cycle time for imxenc_array_op:

      The estimated number of VICP cycles to perform the operation (except overhead time)
      is:
      amount_of_work × memory_conflict_factor / speedup_factor

    After doing this, another 3x performance increase looks possible by putting the input vectors into coefficient memory instead, reducing the memory conflict factor.
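To make the arithmetic behind the quoted estimate concrete, here is a minimal sketch in plain C. The helper name vicp_estimated_cycles is hypothetical and not part of the VICP library; it only evaluates the formula quoted above and ignores overhead time, as the formula itself does.

```c
#include <stdint.h>

/* Hypothetical helper (not a VICP library function): evaluates
 *   amount_of_work * memory_conflict_factor / speedup_factor
 * exactly as quoted from the documentation, overhead excluded. */
static uint32_t vicp_estimated_cycles(uint32_t amount_of_work,
                                      uint32_t memory_conflict_factor,
                                      uint32_t speedup_factor)
{
    if (speedup_factor == 0) {
        return 0; /* guard against division by zero; the value is arbitrary */
    }
    /* Widen to 64 bits so the intermediate product cannot overflow. */
    uint64_t product = (uint64_t)amount_of_work * memory_conflict_factor;
    return (uint32_t)(product / speedup_factor);
}
```

For example, assuming an amount of work of 64*32 = 2048 and a speedup factor of 1 (both illustrative values), a conflict factor of 3 gives 6144 cycles and a conflict factor of 1 gives 2048, which is where the hoped-for 3x comes from. This only describes VICP compute cycles; it says nothing about EDMA transfer time.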

     

    Thanks for the VICP setup explanation.  You lost me at memcpy meaning nothing though; _filter.c does a similar setup for its coefficient data:


      /* Below code takes care of 8bit elements too. We just copy more data */
      for (i = 0; i <coeff_size; i++) {
        info.coefptr[i] = ((Int16 *)(params->coeffPtr)) [i];
        info.coeflen += 1;
      }

    Does this work differently depending on the underlying imxenc_ function?

     

    Thanks,

    Orjan

     

     

  • Gagan,

    If you meant that the memcpy from info.imgbufptr meant nothing, I agree.

    I looked at the memcpy from srcBuf to coefptr again and noticed that I only copied a single block; copying all of srcBuf[1].ptr makes no difference, though.

    Since I don't intend to use the coefficient memory for coefficients, but want to keep the block-based processing, maybe I need to write a custom DmaIn function?

     

     

     

  • Orjan, each VICP kernel is capable of reading from and writing to coefficient memory. However, the EDMAs in the VICP library are programmed to DMA in to the image buffer and then DMA out from the image buffer, to take advantage of the image buffer's ping-pong buffering capability. The VICP coefficient memory has no such ping-pong buffering capability. A function that needs to process an entire frame has to process it block by block. Thus, both input and output are located in the image buffer to enable this ping-pong processing.

    > Thanks for the VICP setup explanation.  You lost me at memcpy meaning nothing though; _filter.c does a similar setup for its coefficient data

    Yes. Please note that a function such as _CPIS_setArrayOpProcessing only does setup. Part of the setup can be to initialize the coefficients. The _filter.c example you give sets up the coefficient memory with the filter coefficients that the kernel will use. This is OK because it is a one-time setup and doesn't need to be repeated for every block that is processed.

    What you were doing was moving processing data into coefficient memory using memcpy. This is not valid, because that step needs to be repeated for every block that is processed. After the setup function returns and the VICP processing starts, who will move the processing data into coefficient memory for every block?
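The point above can be illustrated with a small host-side simulation. This is only a sketch in plain C: it does not call any VICP or EDMA API, the buffers are ordinary arrays standing in for coefficient memory and the image buffer, and the function name and sizes are made up for illustration. One operand is copied once at "setup", the other is refreshed per block, and the function counts how many blocks produce the expected element-wise sum.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_LEN  4   /* elements per block (illustrative value) */
#define NUM_BLOCKS 3   /* blocks in the "frame" (illustrative value) */

/* Hypothetical simulation, not VICP code. Returns the number of blocks
 * whose output equals a[i] + b[i] when operand b is copied into the
 * stand-in coefficient memory only once, at setup time. */
static int blocks_correct_with_one_time_copy(const int16_t *a, const int16_t *b)
{
    int16_t coef[BLOCK_LEN]; /* stands in for coefficient memory */
    int16_t img[BLOCK_LEN];  /* stands in for the image buffer */
    int correct = 0;

    /* Setup phase: runs once, so only block 0 of b is ever loaded. */
    memcpy(coef, b, sizeof coef);

    for (int blk = 0; blk < NUM_BLOCKS; blk++) {
        /* Per-block transfer of operand a; this is the only step repeated. */
        memcpy(img, a + blk * BLOCK_LEN, sizeof img);

        int ok = 1;
        for (int i = 0; i < BLOCK_LEN; i++) {
            int16_t out  = (int16_t)(img[i] + coef[i]);
            int16_t want = (int16_t)(a[blk * BLOCK_LEN + i] + b[blk * BLOCK_LEN + i]);
            if (out != want) {
                ok = 0; /* stale coefficient data for this block */
            }
        }
        correct += ok;
    }
    return correct;
}
```

With two inputs whose blocks differ, only the first block comes out right, which is consistent with the earlier observation that the modification only verified with a single block. If operand b happens to be identical in every block, all blocks match, which is the filter-coefficient case.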

    > Since I don't intend to use the coefficient memory for coefficients, but want to keep the block-based processing, maybe I need to write a custom DmaIn function

    A couple of issues. First, it is a lot of change. Second, because the coefficient memory doesn't support ping-pong buffering, you can't fetch the next chunk of data using EDMA in parallel with the VICP processing the previous chunk. Thus, the VICP and the EDMA will conflict over the memory resource. I recommend against doing this.

    Gagan
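As a rough way to reason about the ping-pong argument above, the following sketch compares two simplified timing models in plain C. Both models and all names are assumptions for illustration, not measurements or VICP library code: the transfer and compute cost per block are taken as constant, the serialized model assumes transfer and compute never overlap, and the ping-pong model assumes the transfer for one block fully overlaps the compute of the previous one.

```c
#include <stdint.h>

static uint32_t max_u32(uint32_t a, uint32_t b)
{
    return a > b ? a : b;
}

/* Assumed model for a single (non ping-pong) buffer:
 * each block's transfer and compute run back to back. */
static uint32_t serialized_cycles(uint32_t num_blocks,
                                  uint32_t dma_per_block,
                                  uint32_t compute_per_block)
{
    return num_blocks * (dma_per_block + compute_per_block);
}

/* Assumed model for ping-pong buffering: fill the first buffer, then each
 * steady-state step costs the slower of transfer and compute, and the last
 * block's compute drains the pipeline. */
static uint32_t pingpong_cycles(uint32_t num_blocks,
                                uint32_t dma_per_block,
                                uint32_t compute_per_block)
{
    if (num_blocks == 0) {
        return 0;
    }
    return dma_per_block
         + (num_blocks - 1) * max_u32(dma_per_block, compute_per_block)
         + compute_per_block;
}
```

With illustrative values of 4 blocks, 300 cycles of transfer, and 300 cycles of compute per block, the ping-pong model gives 1500 cycles against 2400 serialized. Cutting compute to 100 cycles only lowers the ping-pong figure to 1300, because the transfer time dominates each step; under these assumptions that is the sense in which an I/O-limited function gains little from a lower memory conflict factor.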