DM648 VICP vs. assembly

DM648 VICP vs. assembly: which is faster?

I tested the DCT transform using both the VICP and assembly, but I found that the assembly is faster than the VICP.

  • Assembly may still be faster than VICP. But note that VICP can run in parallel with the c64x, so its purpose is to offload the c64x. Also, VICP is more efficient when it is applied to a large frame of data. The larger the frame, the better; otherwise setup and startup overhead may dominate.
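
The overhead point above can be sketched with a simple fixed-overhead model. This is only an illustration: the function name `cycles_per_point` and the model itself (total cycles = fixed overhead + per-point cost × number of points) are assumptions, not something taken from TI documentation.

```c
#include <assert.h>

/* Hypothetical model, not a TI API:
 *   total cycles = fixed setup/startup overhead
 *                + per-point processing cost * number of points
 * Returns the resulting average cost in cycles per point. */
static double cycles_per_point(double overhead_cycles,
                               double per_point_cycles,
                               long num_points)
{
    assert(num_points > 0);
    return overhead_cycles / (double)num_points + per_point_cycles;
}
```

With illustrative (not measured) values of 60000 overhead cycles and 2 cycles per point, 1000 points cost 62 cycles/point on average, while larger frames approach the 2 cycles/point floor. That is the sense in which a larger frame amortizes the startup cost.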

  • I tested the assembly routine (IMG_fdct_8x8) and the VICP routine (imxenc_dct8x8row). The VICP (imxenc_dct8x8row) result is 'Execution Time = 90421 or 70.64 CPU cycles/point', but the assembly (IMG_fdct_8x8) result is 'Execution Time = 6339 or 4.95 CPU cycles/point'. The VICP is far slower than the assembly. Is my test correct? Can imxenc_dct8x8row run faster?

    My test array is 800*16*sizeof(short). There is no example for imxenc_dct8x8row, so I wrote the code based on TI's example.

    My VICP code is as follows:

    if (CPIS_DCT8x8Row(&handle, &base, &params, CPIS_ASYNC) == -1)
    {
        printf("\nCPIS_DCT8x8Row() error %d\n", CPIS_errno);
        exit(-1);
    }

    // CPIS_reset(handle);
    // CPIS_start(handle);

    timerStart = timerReadStart();
    // CPIS_wait(handle);
    IMG_fdct_8x8((short *)0xe0000000, 10*2);
    timerEnd = timerReadEnd();
    execTimerDiff = timerEnd - timerStart;
    printf("Verification successful. Execution Time = %ld or %.2f CPU cycles/point !\n", execTimerDiff, (float)execTimerDiff/(WIDTH*HEIGHT));

    The setup function behind CPIS_DCT8x8Row is as follows:

    Int32 _CPIS_setDCT8x8RowProcessing(
      CPIS_IpRun *ipRun,
      CPIS_BaseParms *base,
      void *p)
    {

     CPIS_Info info;
     CPIS_dct8x8Parms *params;

     Int32 i;
     Int16 typeSrc, typeMat, typeDst;

     params= (CPIS_dct8x8Parms *)p;

     info.imgbufptr= IMGBUF_A_BASE + ipRun->imgbufInOfst;
     info.imgbuflen= ipRun->imgbufLen;
     info.cmdptr= (Int16*) (CMDBUF_BASE + ipRun->cmdOfst);
     info.cmdlen= 0;
     info.coefptr= (Int16*) (COEFFBUF_BASE + ipRun->coefOfst);
     info.coeflen= 0;
     info.procBlockSize= base->procBlockSize;

     IMGBUF_switch(SELALLBUF, ALLBUFDSP);

     /* Below code takes care of 8bit elements too. We just copy more data */
     for (i = 0; i < params->matWidth*params->matHeight; i++)
     {
      info.coefptr[i] = ((Int16 *)(params->matPtr)) [i];
      info.coeflen += 1;
     }

     typeSrc= _CPIS_translateInputFormat(base->srcFormat[0]);
     typeMat= _CPIS_translateInputFormat(params->matFormat);
     typeDst= _CPIS_translateOutputFormat(base->dstFormat[0]);
     
     //info.cmdlen += imxenc_set_saturation(32767, 32767, -32768, -32768, info.cmdptr + info.cmdlen);
                  
     info.cmdlen += imxenc_dct8x8row(
      (Int16*)info.imgbufptr,
      (Int16*)info.coefptr,
      (Int16*)(info.imgbufptr + info.imgbuflen),
      base->roiSize.width,
      base->roiSize.height,
      base->roiSize.width,
      base->roiSize.height,
      base->roiSize.width>>3,
      base->roiSize.height>>3,
      typeSrc,
      typeMat,
      typeDst,
      params->qShift,
      info.cmdptr + info.cmdlen
     );

     info.cmdlen+= imxenc_sleep(info.cmdptr + info.cmdlen);

     ipRun->imgbufLen= info.imgbuflen; /* info.imgbuflen is in number of bytes */
     ipRun->cmdLen= info.cmdlen<<1;  /* info.cmdlen is in number of words */
     ipRun->coefLen= info.coeflen<<1; /* info.coeflen is in number of words */


     ipRun->imgbufOutOfst[0]= ipRun->imgbufInOfst + info.imgbuflen;

     return 0;
    }

  • Actually, in theory you should get close to 16 to 32 CPU cycles/point on VICP. I believe that if your test array is bigger, you will get closer to these numbers.

    regards,

    Victor

  • The power domain is set and the PLL (CLKDIV4 and CLKDIV2) is set. Is there any other configuration needed?

    My test array is 64*16*sizeof(short).

    The result is 'Execution Time = 68363 or 66.76 CPU cycles/point !'.

     

    In the file SPRUGN1C.pdf, the entry for imxenc_dct8x8row says:

    The estimated number of cycles to perform the operation (except overhead time) is:
    amount_of_work × memory_conflict_factor
    • amount_of_work = 64 × calc_Hblks × calc_Vblks
    • memory_conflict_factor:

    Location of data   Location of coeff   Location of output   memory_conflict_factor
    IMGBUF             IMGBUF              IMGBUF               2 + 1/8
    IMGBUF             IMGBUF              COEFF                2
    IMGBUF             COEFF               IMGBUF               1 + 1/8
    IMGBUF             COEFF               COEFF                1
    COEFF              IMGBUF              IMGBUF               1 + 1/8
    COEFF              IMGBUF              COEFF                1
    COEFF              COEFF               IMGBUF               1
    COEFF              COEFF               COEFF                1 + 1/8

     

    The theoretical figure is 64 × 1 × 1 × (1 + 1/8) / 64 = 1 + 1/8 CPU cycles/point (if we choose calc_Hblks = 1 and calc_Vblks = 1).
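
The arithmetic above can be checked with a short sketch that evaluates the quoted formula and compares it with a measurement. The helper names `dct8x8row_est_cycles` and `implied_overhead` are hypothetical. It is also an assumption that the cycles the formula does not account for are overhead, and that the 1 + 1/8 factor applies (data in IMGBUF, coefficients in COEFF, output in IMGBUF, which is what the posted setup function appears to use).

```c
#include <assert.h>

/* Estimated cycles for imxenc_dct8x8row, excluding overhead, using the
 * formula quoted from SPRUGN1C:
 *   64 * calc_Hblks * calc_Vblks * memory_conflict_factor
 * Hypothetical helper, not part of the TI library. */
static double dct8x8row_est_cycles(int calc_hblks, int calc_vblks,
                                   double memory_conflict_factor)
{
    assert(calc_hblks > 0 && calc_vblks > 0);
    return 64.0 * calc_hblks * calc_vblks * memory_conflict_factor;
}

/* Cycles in a measurement that the formula does not explain
 * (assumed here to be setup and startup overhead). */
static double implied_overhead(double measured_cycles, double est_cycles)
{
    return measured_cycles - est_cycles;
}
```

For the 64*16 test, assuming calc_Hblks = 64/8 = 8 and calc_Vblks = 16/8 = 2, the estimate is 64 × 8 × 2 × 1.125 = 1152 cycles, which is again 1.125 cycles/point over 1024 points. Against the measured 68363 cycles, that leaves 67211 cycles unexplained by the formula, so under these assumptions the measurement is dominated by something other than the DCT computation itself.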