DM648VICP VS assembly

tristonewang

Prodigy 170 points

DM648 VICP vs. assembly: which is faster?

I tested the DCT transform with the VICP and with assembly, but I found that the assembly is faster than the VICP.

over 13 years ago

0 Victor Cheng over 13 years ago

TI__Expert 6335 points

Assembly may still be faster than VICP. But note that VICP can run in parallel with the C64x, so its purpose is to offload the C64x. Also, VICP is more efficient when it is applied to a large frame of data. The larger the frame, the better; otherwise setup and startup overhead may dominate.
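To make the overhead point concrete, here is a minimal sketch of a cost model, assuming a fixed setup cost plus a constant per-point cost. The function names and any numbers fed to them are hypothetical illustrations, not measured DM648 figures.

```c
#include <stdio.h>

/* Hypothetical cost model (an assumption, not a TI formula):
 * total cycles = fixed setup overhead + steady per-point cost * points.
 * Returns the effective cycles per point for a frame of `points` samples. */
double effective_cycles_per_point(double overhead_cycles,
                                  double steady_cycles_per_point,
                                  unsigned long points)
{
    return (overhead_cycles + steady_cycles_per_point * (double)points)
           / (double)points;
}

/* Print how the effective cost falls as the frame grows by 4x each step. */
void print_amortization(double overhead_cycles, double steady_cycles_per_point)
{
    unsigned long points;

    for (points = 1024UL; points <= 1024UL * 256UL; points *= 4UL) {
        printf("%8lu points: %.2f cycles/point\n", points,
               effective_cycles_per_point(overhead_cycles,
                                          steady_cycles_per_point, points));
    }
}
```

Under this model the effective cost approaches the steady per-point cost only when the frame is large compared with the overhead, which is why a small test array can make the VICP look worse than it is.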

0 tristonewang over 13 years ago in reply to Victor Cheng

Prodigy 170 points

I tested the assembly routine (IMG_fdct_8x8) and the VICP routine (imxenc_dct8x8row). The VICP result is 'Execution Time = 90421 or 70.64 CPU cycles/point', but the assembly result is 'Execution Time = 6339 or 4.95 CPU cycles/point'. The VICP is far slower than the assembly. Is my test correct? Can imxenc_dct8x8row run faster?

My test array is 800*16*sizeof(short). There is no example for imxenc_dct8x8row, so I wrote the code based on TI's examples.

My VICP code is as follows:

if (CPIS_DCT8x8Row(&handle, &base, &params, CPIS_ASYNC) == -1)
{
    printf("\nCPIS_DCT8x8Row() error %d\n", CPIS_errno);
    exit(-1);
}

// CPIS_reset(handle);
// CPIS_start(handle);

timerStart = timerReadStart();
// CPIS_wait(handle);
IMG_fdct_8x8((short *)0xe0000000, 10*2);
timerEnd = timerReadEnd();
execTimerDiff = timerEnd - timerStart;
printf("Verification successful. Execution Time = %ld or %.2f CPU cycles/point !\n", execTimerDiff, (float)execTimerDiff/(WIDTH*HEIGHT));

The setup function for CPIS_DCT8x8Row is as follows:

Int32 _CPIS_setDCT8x8RowProcessing(
    CPIS_IpRun *ipRun,
    CPIS_BaseParms *base,
    void *p)
{
    CPIS_Info info;
    CPIS_dct8x8Parms *params;

    Int32 i;
    Int16 typeSrc, typeMat, typeDst;

    params = (CPIS_dct8x8Parms *)p;

    info.imgbufptr = IMGBUF_A_BASE + ipRun->imgbufInOfst;
    info.imgbuflen = ipRun->imgbufLen;
    info.cmdptr = (Int16*) (CMDBUF_BASE + ipRun->cmdOfst);
    info.cmdlen = 0;
    info.coefptr = (Int16*) (COEFFBUF_BASE + ipRun->coefOfst);
    info.coeflen = 0;
    info.procBlockSize = base->procBlockSize;

    IMGBUF_switch(SELALLBUF, ALLBUFDSP);

    /* Below code takes care of 8bit elements too. We just copy more data */
    for (i = 0; i < params->matWidth*params->matHeight; i++)
    {
        info.coefptr[i] = ((Int16 *)(params->matPtr))[i];
        info.coeflen += 1;
    }

    typeSrc = _CPIS_translateInputFormat(base->srcFormat[0]);
    typeMat = _CPIS_translateInputFormat(params->matFormat);
    typeDst = _CPIS_translateOutputFormat(base->dstFormat[0]);

    //info.cmdlen += imxenc_set_saturation(32767, 32767, -32768, -32768, info.cmdptr + info.cmdlen);

    info.cmdlen += imxenc_dct8x8row(
        (Int16*)info.imgbufptr,
        (Int16*)info.coefptr,
        (Int16*)(info.imgbufptr + info.imgbuflen),
        base->roiSize.width,
        base->roiSize.height,
        base->roiSize.width,
        base->roiSize.height,
        base->roiSize.width>>3,
        base->roiSize.height>>3,
        typeSrc,
        typeMat,
        typeDst,
        params->qShift,
        info.cmdptr + info.cmdlen
    );

    info.cmdlen += imxenc_sleep(info.cmdptr + info.cmdlen);

    ipRun->imgbufLen = info.imgbuflen; /* info.imgbuflen is in number of bytes */
    ipRun->cmdLen = info.cmdlen<<1; /* info.cmdlen is in number of words */
    ipRun->coefLen = info.coeflen<<1; /* info.coeflen is in number of words */

    ipRun->imgbufOutOfst[0] = ipRun->imgbufInOfst + info.imgbuflen;

    return 0;
}

0 Victor Cheng over 13 years ago in reply to tristonewang

TI__Expert 6335 points

Actually, in theory you should get close to 16 to 32 CPU cycles/point on VICP. I believe that if your test array is bigger, you will get closer to these numbers.

regards,

Victor

0 tristonewang over 13 years ago in reply to Victor Cheng

Prodigy 170 points

The power domain is set and the PLL (CLKDIV4 and CLKDIV2) is set. Is there anything else that needs to be configured?

My test array is 64*16*sizeof(short).

The result is 'Execution Time = 68363 or 66.76 CPU cycles/point !'.
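As a quick check of the arithmetic behind these figures, the per-point cost is just the measured cycle count divided by the number of samples. The helper below is a hypothetical sketch, not part of the posted test code.

```c
/* Effective cost per sample: total measured cycles divided by the
 * number of samples processed. The name is hypothetical. */
double measured_cycles_per_point(unsigned long cycles, unsigned long points)
{
    return (double)cycles / (double)points;
}
```

With 68363 cycles and 64*16 = 1024 points this gives about 66.76, matching the printed result. The earlier figures (90421 cycles at 70.64 and 6339 cycles at 4.95) are reproduced with 1280 points, which is what 10*2 blocks of 64 samples would be; that divisor is inferred from the numbers and is not stated in the thread.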

In the file SPRUGN1C.pdf, the entry for imxenc_dct8x8row says:

The estimated number of cycles to perform the operation (except overhead time) is:

    amount_of_work × memory_conflict_factor

• amount_of_work = 64 × calc_Hblks × calc_Vblks
• memory_conflict_factor:

Location of data    Location of coeff    Location of output    memory_conflict_factor
IMGBUF              IMGBUF               IMGBUF                2 + 1/8
IMGBUF              IMGBUF               COEFF                 2
IMGBUF              COEFF                IMGBUF                1 + 1/8
IMGBUF              COEFF                COEFF                 1
COEFF               IMGBUF               IMGBUF                1 + 1/8
COEFF               IMGBUF               COEFF                 1
COEFF               COEFF                IMGBUF                1
COEFF               COEFF                COEFF                 1 + 1/8

So the theoretical CPU cycles/point is 64 × 1 × 1 × (1 + 1/8) / 64 = 1 + 1/8 (if we choose calc_Hblks = 1 and calc_Vblks = 1).
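The same estimate can be applied to the whole 64*16 test array. This is a minimal sketch, assuming calc_Hblks and calc_Vblks are the width and height divided by 8 (as in the posted setup code, which passes roiSize.width>>3 and roiSize.height>>3) and that the IMGBUF / COEFF / IMGBUF row of the table applies. The function name is hypothetical.

```c
/* Estimated cycles for imxenc_dct8x8row, excluding overhead, following
 * the quoted formula: 64 * calc_Hblks * calc_Vblks * memory_conflict_factor.
 * The factor must be taken from the table for the chosen buffer placement. */
double dct8x8row_estimated_cycles(unsigned calc_hblks,
                                  unsigned calc_vblks,
                                  double memory_conflict_factor)
{
    return 64.0 * (double)calc_hblks * (double)calc_vblks
           * memory_conflict_factor;
}
```

For a 64*16 array (calc_Hblks = 8, calc_Vblks = 2) with a factor of 1 + 1/8, the estimate is 1152 cycles, or 1.125 cycles/point, against the 68363 cycles measured. The quoted text does not say which clock the estimate is counted in, so the comparison is only indicative, but the gap is far larger than the memory conflict factor alone could explain.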
