DM648 VICP Affine Transform Performance

James Gort


(I originally posted this on DSP Developer's Network, apparently wrong place)

Hi-

I am trying to use the DM648 VICP Affine Transform function (v3.3 VICP library) for a product. It seems that a single value of QSHIFT is used for both the matrix coefficient generation and the resulting transformed coordinates. Hence, as SPRUGJ3E suggests, there are conflicting requirements for QSHIFT: the integer portion must have enough bits to represent all of the transformed coordinates (e.g., 10 bits minimum for a 640x480 image with Scale=1), but the fractional portion must be large enough not to overly quantize the matrix coefficients. In SPRUGJ3E, a value of 3 for QSHIFT is recommended.

However, a value of 3 for QSHIFT results in quantization of pi/16 for the angle of rotation, and quantization of 1/8 for scales! (That is, the matrix values themselves can only take on the values 0, +/-1, +/-2, ..., +/-8 as the angle goes from 0 to 360 degrees, or the scale goes from 0 to 1.) This is unacceptable in our application (we want to use the Affine Transformation for subtle alignment and gain correction between multiple cameras), and I'm guessing in almost every application of the Affine Transformation. A much better implementation would have provided two "QSHIFTs", one for the matrix coefficients and another for the transformed X and Y prior to interpolation.
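To make the quantization concrete, here is a minimal sketch of how a real coefficient maps to a fixed-point integer with q fractional bits. The helper names `quantize_coeff` and `dequantize_coeff` are hypothetical and are not part of the VICP library; round-to-nearest is an assumption about how the coefficients would be generated.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: quantize a real matrix coefficient to an integer
   with q fractional bits, rounding to nearest (ties away from zero). */
static int32_t quantize_coeff(double v, unsigned q)
{
    assert(q < 31);  /* keep 1 << q inside the int32_t range */
    double scaled = v * (double)((int32_t)1 << q);
    return (int32_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Hypothetical helper: the real value a quantized coefficient represents. */
static double dequantize_coeff(int32_t c, unsigned q)
{
    assert(q < 31);
    return (double)c / (double)((int32_t)1 << q);
}
```

For a 1 degree rotation, cos is about 0.99985 and sin is about 0.01745. With q = 3 these quantize to 8 and 0, which is the identity matrix, so the rotation disappears entirely. With q = 10 they quantize to 1024 and 18, which preserves the rotation to within a small fraction of a degree.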

So, my question is: Am I missing something here? Some hidden parameter(s)? If not, has this been addressed in some newer version of VICP library? If not, can I get advice on how to implement this change and rebuild the VICP library?

Thank you in advance, and best regards,

Jim

over 15 years ago

0 Gagan Maur over 15 years ago

TI__Expert 8150 points

Hello Jim, your understanding of the function's operation is correct. For now, we have no plan to update the function, and there is no later release with an update like the one you request. But we are definitely willing to understand your request better and try to update the library.

Can you please look at the C equivalent provided for the function here: C:\CCStudio_v3.3\c64plus\vicplib_v330\src\src_natc\_GPP_affineTransform.c
And suggest what change you would ideally like to see. This is to ensure we have correct understanding of the enhancement. Based on the needed change, we can suggest if the change can be done by just modifying the provided SW and rebuilding or if a change to core kernel is required that may not be released in source.

Regards,
Gagan

0 James Gort over 15 years ago in reply to Gagan Maur

Intellectual 590 points

Hi Gagan-

Thank you for your response. After spending some time going for broke (trying to modify the VICP library wrapper function), I took the step of modifying the native C (_GPP_affineTransform.c) to meet my needs. I succeeded; the modified code is pasted below (I couldn't figure out how to attach a file to this post; if you email me I will send it to you).

Basically, the param passed to the function (qshift) is now ONLY used for the matrix value shift, whereas the coordinate shift is hard-coded (COORDQSHIFT, which I have set to 4, which is more than enough). Ideally both would be parameterized for general-purpose use. I use a matrix shift of 10, so I can rotate to within a fraction of a degree.
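The idea of keeping the matrix precision independent of the coordinate range can be sketched as follows. This is only an illustration under stated assumptions, not the library's implementation: `MATRIX_QSHIFT`, `Mat2Q`, `SplitCoord`, `floor_div_pow2` and `map_split` are hypothetical names, and translation terms are left out for brevity.

```c
#include <assert.h>
#include <stdint.h>

#define MATRIX_QSHIFT 10  /* hypothetical name: fractional bits of the matrix */

/* Hypothetical 2x2 matrix type; coefficients carry MATRIX_QSHIFT fractional bits. */
typedef struct { int32_t m0, m1, m2, m3; } Mat2Q;

/* Integer pixel position plus the fraction (0 .. 2^MATRIX_QSHIFT - 1). */
typedef struct { int32_t xi, yi, xf, yf; } SplitCoord;

/* Floor division by 2^shift that does not rely on shifting negative values. */
static int32_t floor_div_pow2(int32_t v, unsigned shift)
{
    assert(shift < 31);
    int32_t d = (int32_t)1 << shift;
    int32_t q = v / d;              /* C division truncates toward zero */
    if (v % d < 0)
        q -= 1;                     /* step down to the floor for negatives */
    return q;
}

/* Map (x, y) through the matrix and split the result into integer and
   fractional parts, both derived from the full-precision product. */
static SplitCoord map_split(const Mat2Q *m, int32_t x, int32_t y)
{
    int32_t one = (int32_t)1 << MATRIX_QSHIFT;
    int32_t tx = m->m0 * x + m->m1 * y;   /* carries MATRIX_QSHIFT fraction bits */
    int32_t ty = m->m2 * x + m->m3 * y;
    SplitCoord r;
    r.xi = floor_div_pow2(tx, MATRIX_QSHIFT);
    r.yi = floor_div_pow2(ty, MATRIX_QSHIFT);
    r.xf = tx - r.xi * one;               /* always in [0, one) */
    r.yf = ty - r.yi * one;
    return r;
}
```

With 10 fractional bits and coordinates up to a few hundred pixels, the products stay well inside 32 bits, while the integer results still fit comfortably in 16 bits.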

The changes to the native C code were very simple, as you can see; all of my changes are flagged with "JRG" in the comments. However, I am now going back to trying to modify the VICP library to match the modified native C, which will be easier now that I have the native C as a reference, but will still not be a trivial task, at least for me (this is my first use of the VICP at all).

I would appreciate any help/advice you could give me in modifying the library function to match the modified native C.

Thank you and Best Regards,

Jim Gort

/* ======================================================================== */
/* TEXAS INSTRUMENTS, INC.                                                 */
/*                                                                          */
/* VICP Signal Processing Library                                         */
/*                                                                          */
/* This library contains proprietary intellectual property of Texas        */
/* Instruments, Inc. The library and its source code are protected by     */
/* various copyrights, and portions may also be protected by patents or    */
/* other legal protections.                                                */
/*                                                                          */
/* This software is licensed for use with Texas Instruments TMS320         */
/* family DSPs. This license was provided to you prior to installing      */
/* the software. You may review this license by consulting the file       */
/* TI_license.PDF which accompanies the files in this library.             */
/*                                                                          */
/* ------------------------------------------------------------------------ */
/*                                                                          */
/*    NAME                                                                  */
/*     GPP_affineTransform - affineTransform C module                       */
/*                                                                          */
/*     DESCRIPTION                                                          */
/*     This file demonstrates the usage of affineTransform routine          */
/*                                                                          */
/*     REV                                                                  */
/*        version 0.0.1: May 13th 2009                                     */
/*        Initial version                                                   */
/*                                                                          */
/* ------------------------------------------------------------------------ */
/*            Copyright (c) 2008 Texas Instruments, Incorporated.           */
/*                           All Rights Reserved.                           */
/* ======================================================================== */

/* Include the lib interface header files */
#include "gpp_vicplib.h"
#include "_gpp_vicplib.h"
#include <math.h>
#include <stdio.h>

#define ROUND 0
#define FLOOR 1
#define CEIL 2
#define TRUNC 3
#define NOSHIFT 4

#define COORDQSHIFT 4 //JRG decimal point shift for Coordinates

typedef struct {
Int16 validFlag;
Int16 inBlock_x;
Int16 inBlock_y;
Int16 outBlock_x;
Int16 outBlock_y;
} BlockCoord;

static CPIS_Info info;
static _CPIS_AffineTransformPrivate* privateVars;
static CPIS_AffineTransformParms *params;
static Int16 incX, incY;
static Uint16 blockIndex;
static Uint16 scaleFactor;

Int32 _GPP_CPIS_setAffineTransformFunc();

static Int16 truncShift(Int32 x, Uint16 shift){

return (x >> shift);
}

static Int16 roundShift(Int32 x, Uint16 shift){

return ( ((x >> shift-1)+1) >> 1);
}

static Int16 floorShift(Int32 x, Uint16 shift){

   if (x >= 0)
     return (x >> shift);
   else
     return ( ((x >> shift-1)-1) >> 1);

}

static Int16 ceilShift(Int32 x, Uint16 shift){

/* ceil(x)= -floor(-x) */
return (-floorShift(-x, shift));
}

static void mapCoord(Int16 *X, Int16 *Y, Int16 x, Int16 y, CPIS_AffineTransformParms *params, Int16 roundingX, Int16 roundingY){

Int32 tempX;
Int32 tempY;

tempX= params->m0 * x + params->m1 * y + params->tx;
tempY= params->m2 * x + params->m3 * y + params->ty;

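if (params->qShift) {
   if (roundingX== ROUND) {
     *X= roundShift(tempX, params->qShift);
     }
   else if (roundingX== FLOOR) {
     *X= floorShift(tempX, params->qShift);
     }
   else if (roundingX== CEIL) {
     *X= ceilShift(tempX, params->qShift);
     }

   /* The Y branch and the no-shift fallback are missing from the posted
      listing; reconstructed on the assumption that they mirror the X branch */
   if (roundingY== ROUND) {
     *Y= roundShift(tempY, params->qShift);
     }
   else if (roundingY== FLOOR) {
     *Y= floorShift(tempY, params->qShift);
     }
   else if (roundingY== CEIL) {
     *Y= ceilShift(tempY, params->qShift);
     }
}
else {
   *X= tempX;
   *Y= tempY;
}
}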

/* Map to the inverse coordinates */
static void mapInvCoord(Int16 *X, Int16 *Y, Int16 x, Int16 y, Int16 negY, CPIS_AffineTransformParms *params, Int16 roundingX, Int16 roundingY, Int16 extra_tx, Int16 extra_ty){
//JRG--the only time this is used with "extra_tx" and "extra_ty" non-zero is when it is used for the bilinear interpolation.
//In the previous implementation, these extra params were assumed to have qshift applied to them. However, they now have COORDQSHIFT applied to
//them. Since they are being used to calculate the integer portion, the shift is not needed--it is simply undone by the rounding/truncation.
//So, modify for the correct shift.
Int32 tempX;
Int32 tempY;

#if 0
tempX= params->m0inv * (x + params->txinv) + params->m1inv * (y + params->tyinv) + extra_tx;
if (negY)
   tempY= - (params->m2inv * (x + params->txinv) + params->m3inv * (y + params->tyinv) + extra_ty);
else
   tempY= (params->m2inv * (x + params->txinv) + params->m3inv * (y + params->tyinv) + extra_ty);
#else
tempX= params->m0inv * (x + params->txinv) + params->m1inv * (y + params->tyinv);
if (negY)
   tempY= - (params->m2inv * (x + params->txinv) + params->m3inv * (y + params->tyinv));
else
   tempY= (params->m2inv * (x + params->txinv) + params->m3inv * (y + params->tyinv));
#endif

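if (params->qShift) {
   /* The X branch is missing from the posted listing; reconstructed on the
      assumption that it mirrors the Y branch below */
   if (roundingX== ROUND) {
     *X= roundShift(tempX, params->qShift);
   }
   else if (roundingX== FLOOR) {
     *X= floorShift(tempX, params->qShift);
   }
   else if (roundingX== CEIL) {
     *X= ceilShift(tempX, params->qShift);
   }
   else if (roundingX== TRUNC)
     *X= truncShift(tempX, params->qShift);
   else
     *X= tempX;

   if (roundingY== ROUND) {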
     *Y= roundShift(tempY, params->qShift);
   }
   else if (roundingY== FLOOR) {
     *Y= floorShift(tempY, params->qShift);
   }
   else if (roundingY== CEIL) {
     *Y= ceilShift(tempY, params->qShift);
   }
   else if (roundingY== TRUNC)
     *Y= truncShift(tempY, params->qShift);
   else
     *Y= tempY;
}
else {
   *X= tempX;
   *Y= tempY;
}
//JRG note--the exclusion of the NOSHIFT case is because NOSHIFT is only used for fraction computation, and the integer
//portion is masked off anyway in that case (masked using QSHIFT, so if we included the "extra" it would have to
//be shifted up to qshift prior to masking it off, but this would overflow).
if(roundingX!=NOSHIFT)
{
    // now add in the integer shifts
    *X=*X+(extra_tx>>COORDQSHIFT);
}
if(roundingY!=NOSHIFT)
{
    if(negY)
        *Y=*Y-(extra_ty>>COORDQSHIFT);
    else
        *Y=*Y+(extra_ty>>COORDQSHIFT);
}

}

static Int16 max4(Int16 a0, Int16 a1, Int16 a2, Int16 a3){

Int16 max= a0;

if (a1 > max)
   max= a1;

if (a2 > max)
   max= a2;

if (a3 > max)
   max= a3;

return max;
}

static Int16 min4(Int16 a0, Int16 a1, Int16 a2, Int16 a3){

Int16 min= a0;

if (a1 < min)
   min= a1;

if (a2 < min)
   min= a2;

if (a3 < min)
   min= a3;

return min;
}

Int32 _GPP_CPIS_setAffineTransformProcessingFunc();

/*
* This function assists the application in determining the dimensions of the output ROI
* after affine transform is applied.
* It also finds the optimum block width and block height for the algorithm to use.
*/
Int32 _GPP_CPIS_affineTransformGetSize(CPIS_BaseParms *base,
CPIS_AffineTransformParms *params, CPIS_AffineTransformOutputROI *outputROI ){

   Uint16 scaleFactorSrc, inRoiWidth, inRoiHeight, outerRoiWidth, outerRoiHeight;
   Uint16 blockWidth, blockHeight, outBlockWidth, outBlockHeight;
   Int16 inRoi_xul, inRoi_xlr, inRoi_yul, inRoi_ylr;
   Int16 outRoi_xul, outRoi_xur, outRoi_xlr, outRoi_xll, outRoi_yul, outRoi_yur, outRoi_ylr,outRoi_yll;
   Int16 xRoiCenter, yRoiCenter;
   _CPIS_AffineTransformPrivate* privateVars;
   Uint16 numHorzBlocks, numVertBlocks;
   Uint16 max_affine_blocksize;

   privateVars= (_CPIS_AffineTransformPrivate*)params->privateVars;
   scaleFactorSrc = _CPIS_sizeof(base->srcFormat[0]);

   inRoiWidth= base->roiSize.width;
   inRoiHeight= base->roiSize.height;
   /*
   Adjust size of ROI so center is really at the middle of the
   block. To that end, we force the dimensions to be odd.
   */
   if ((base->roiSize.width & 1)==0)
      inRoiWidth= inRoiWidth + 1;

   if ((base->roiSize.height & 1)==0)
      inRoiHeight= inRoiHeight + 1;

   xRoiCenter= inRoiWidth/2;
   yRoiCenter= inRoiHeight/2;

   /* Coordinates of all corners in input block coordinate space where center is (0,0) */
   inRoi_xul= -xRoiCenter;
   inRoi_xlr= xRoiCenter;
   inRoi_yul= yRoiCenter;
   inRoi_ylr= -yRoiCenter;

   /* Find Coordinates of all corners of the output block, which is no longer rectangular */
   mapCoord(&outRoi_xul, &outRoi_yul, inRoi_xul, inRoi_yul, params, ROUND, ROUND);
   mapCoord(&outRoi_xur, &outRoi_yur, inRoi_xlr, inRoi_yul, params, ROUND, ROUND);
   mapCoord(&outRoi_xll, &outRoi_yll, inRoi_xul, inRoi_ylr, params, ROUND, ROUND);
   mapCoord(&outRoi_xlr, &outRoi_ylr, inRoi_xlr, inRoi_ylr, params, ROUND, ROUND);

   /* Find corners of the outer box as the shape is no longer rectangular */
   privateVars->outerRoi_xul= min4(outRoi_xul, outRoi_xur, outRoi_xll, outRoi_xlr);
   privateVars->outerRoi_xlr= max4(outRoi_xul, outRoi_xur, outRoi_xll, outRoi_xlr);
   privateVars->outerRoi_yul= max4(outRoi_yul, outRoi_yur, outRoi_yll, outRoi_ylr);
   privateVars->outerRoi_ylr= min4(outRoi_yul, outRoi_yur, outRoi_yll, outRoi_ylr);

   outerRoiWidth= privateVars->outerRoi_xlr - privateVars->outerRoi_xul + 1;
   outerRoiHeight= privateVars->outerRoi_yul - privateVars->outerRoi_ylr + 1;

   outputROI->width= outerRoiWidth;
   outputROI->height= outerRoiHeight;

   /*
   The way the optimum blockWidth and blockHeight is found depends on whether the outerRoiArea
   is greater or smaller than inputRoiArea.

   If outerRoiArea smaller than inputRoi Area, back map an output square into the input.
   The resulting shape will be a parallelogram. Next find the area of the bounding box around the parallelogram
   and calculate the area scale factor.
   Note that the input square's coordinates are scaled by 1<<params->qShift, in order to prevent
   computation results < 1.
   */
   {
     Int16 square_xul, square_yul, square_xur, square_yur, square_xlr, square_ylr, square_xll, square_yll;
     Int16 invSquare_xul, invSquare_yul, invSquare_xur, invSquare_yur, invSquare_xlr, invSquare_ylr, invSquare_xll, invSquare_yll;

     Int32 edge0_x, edge0_y, edge1_x, edge1_y, edge2_x, edge2_y, edge3_x, edge3_y, len_x, len_y;
     Uint16 areaScale;

     square_xul= 0;
     square_yul= (1<<params->qShift);
     square_xur= (1<<params->qShift);
     square_yur= (1<<params->qShift);
     square_xlr= (1<<params->qShift);
     square_ylr= 0;
     square_xll= 0;
     square_yll= 0;
      /* Find inv coordinates of all corners of the output square, the output corresponding shape is not rectangular
      but a parallelogram */
     mapInvCoord(&invSquare_xul, &invSquare_yul, square_xul, square_yul, 0, params, ROUND, ROUND, 0, 0);
     mapInvCoord(&invSquare_xur, &invSquare_yur, square_xur, square_yur, 0, params, ROUND, ROUND, 0, 0);
     mapInvCoord(&invSquare_xll, &invSquare_yll, square_xll, square_yll, 0, params, ROUND, ROUND, 0 ,0);
     mapInvCoord(&invSquare_xlr, &invSquare_ylr, square_xlr, square_ylr, 0, params, ROUND, ROUND, 0 ,0);

     /* Calculate vector coordinates of the parallelogram's edges and take their absolute value*/
     edge0_x= abs(invSquare_xur - invSquare_xul);
     edge0_y= abs(invSquare_yur - invSquare_yul);
     edge1_x= abs(invSquare_xlr - invSquare_xur);
     edge1_y= abs(invSquare_ylr - invSquare_yur);
     edge2_x= abs(invSquare_xll - invSquare_xlr);
     edge2_y= abs(invSquare_yll - invSquare_ylr);
     edge3_x= abs(invSquare_xul - invSquare_xll);
     edge3_y= abs(invSquare_yul - invSquare_yll);

     /* Calculate the length of each vector projected to the X axis or Y axis. Each projected vector
     represents the edge of the bounding box around the parallelogram */
     len_x= edge0_x + edge1_x + edge2_x + edge3_x;
     len_y= edge0_y + edge1_y + edge2_y + edge3_y;

     /* Calculate the scaling factor of the area
     We can replace the following 2 lines
     area= ceilShift(len_x * len_y, 2);
     areaScale= ceilShift(area, params->qShift);

     with the single line:
     areaScale= ceilShift(len_x * len_y, 2 + params->qShift)
     */
     areaScale= roundShift(len_x * len_y, 2 + 2*params->qShift);

     if (!areaScale)
       areaScale= 1;

     /* Readjust max size of the block */
     max_affine_blocksize= MAX_AFFINE_BLOCKSIZE/areaScale;

   }

   /*
     Calculate optimum blockWidth and blockHeight dimensions.
     The calculation depends on whether the output area is smaller or greater than the input area.
   */

   /* Initialize blockWidth by taking square root of max blocksize */
   blockWidth= (Uint16)sqrt((double)max_affine_blocksize/scaleFactorSrc);
   blockWidth= (blockWidth>>3)<<3; /* round down to multiple of 8 */
   while (((outerRoiWidth+blockWidth-1)/blockWidth)*blockWidth - outerRoiWidth > blockWidth && (blockWidth-8) >= 8 )
       blockWidth -= 8;

   blockHeight= (max_affine_blocksize/scaleFactorSrc)/blockWidth;
   while (((outerRoiHeight+blockHeight-1)/blockHeight)*blockHeight - outerRoiHeight > 2 && blockHeight >= 2 )
       blockHeight -= 1;

   numHorzBlocks= (outerRoiWidth+blockWidth-1)/blockWidth;
   numVertBlocks= (outerRoiHeight+blockHeight-1)/blockHeight;

   outBlockWidth= blockWidth;
   outBlockHeight= blockHeight;

   /* stride is not necessarily equal to outerRoiWidth, which might result in some garbage data to ignore */
   outputROI->stride= outBlockWidth*numHorzBlocks;

   outputROI->blockWidth= outBlockWidth;
   outputROI->blockHeight= outBlockHeight;

   /* Number of bytes to allocate in scratch */
   params->scratchSize= (numHorzBlocks*numVertBlocks + 2) * sizeof(TferParamEntry);
   params->scratchSize+= (numHorzBlocks*numVertBlocks + 2) * sizeof(BlockCoord);

   return 0;
}

Int32 _GPP_CPIS_checkAffineTransformParams(
CPIS_BaseParms *base,
void *p){

    Uint16 srcFormat1 = base->srcFormat[0];

    if ((srcFormat1 != CPIS_U16BIT) && (srcFormat1 != CPIS_U8BIT) && (srcFormat1 != CPIS_16BIT) && (srcFormat1 != CPIS_8BIT)) {
        CPIS_errno= CPIS_NOSUPPORTFORMAT_ERROR;
        return -1;
    }

return 0;

}

Int32 _GPP_CPIS_setAffineTransformDmaIn(
CPIS_IpRun *ipRun,
CPIS_BaseParms *base,
void *p) {

CPIS_AffineTransformParms *params;
Uint16 roiWidth, roiHeight, xRoiCenter, yRoiCenter;
Uint16 outerRoi_width, outerRoi_height, size;
Int16 x, y, incX, incY;
Int16 inBlock_xul, inBlock_yul, inBlock_xur, inBlock_yur;
Int16 inBlock_xll, inBlock_yll, inBlock_xlr, inBlock_ylr;
Int16 outerInBlock_xul, outerInBlock_yul;
Int16 outerInBlock_xlr, outerInBlock_ylr;
Int16 outerInBlock_xulAbs, outerInBlock_yulAbs;
Uint16 outerInBlock_width,outerInBlock_height;
TferParamEntry* inTferParamTable;
Uint32 index;
Uint16 scaleFactorSrc;
BlockCoord *blockCoord;
_CPIS_AffineTransformPrivate* privateVars;

IP_run *ipRunObj= &ipRun->ipRunObj;

params= (CPIS_AffineTransformParms *)p;
privateVars= (_CPIS_AffineTransformPrivate*)params->privateVars;

scaleFactorSrc = _CPIS_sizeof(base->srcFormat[0]);

roiWidth= base->roiSize.width;
roiHeight= base->roiSize.height;
/*
Adjust size of ROI so center is really at the middle of the
block. To that end, we force the dimensions to be odd.
*/
if ((base->roiSize.width & 1)==0)
roiWidth= roiWidth + 1;

if ((base->roiSize.height & 1)==0)
roiHeight= roiHeight + 1;

xRoiCenter= roiWidth/2;
yRoiCenter= roiHeight/2;

outerRoi_width= privateVars->outerRoi_xlr - privateVars->outerRoi_xul + 1;
outerRoi_height= privateVars->outerRoi_yul - privateVars->outerRoi_ylr + 1;

ipRunObj->numHorzBlocks= (outerRoi_width + base->procBlockSize.width -1)/base->procBlockSize.width;
ipRunObj->numVertBlocks= (outerRoi_height+ base->procBlockSize.height-1)/base->procBlockSize.height;

incX= base->procBlockSize.width;
incY= base->procBlockSize.height;

ipRunObj->numDmaIn= 2;
ipRunObj->dmaIn= &GPP_CPIS_obj.dmaIn[0];
/* First DMA will read the 16 lsb output from stage 0 */
ipRunObj->dmaIn[0].dmaChNo = 0;
ipRunObj->dmaIn[0].useTferParamTable= 1;
inTferParamTable= (TferParamEntry*)params->scratch;
ipRunObj->dmaIn[0].tferParamTable= inTferParamTable;
blockCoord= (BlockCoord*)((Int8*)inTferParamTable + (ipRunObj->numHorzBlocks*ipRunObj->numVertBlocks + 2) * sizeof(TferParamEntry));

index= 0;

privateVars->maxInBlockWidth= 0;
privateVars->maxInBlockHeight= 0;

/* This loop will construct the EDMA transfer parameters for each input block.
In total, privateVars->numHorzBlocks*privateVars->numVertBlocks parameter sets are written.
Note that 2 extra parameter sets were allocated in memory as they will be used
by the VICP scheduling unit library to flush the pipeline.
*/
for (y=privateVars->outerRoi_yul; y >= privateVars->outerRoi_ylr; y-= incY)
   for (x=privateVars->outerRoi_xul; x <= privateVars->outerRoi_xlr; x+= incX) {

     /* Map input rectangle's corners into output corners.
     The segments connecting the output corners will form a shape not
     necessarily rectangular. */
     mapInvCoord(&inBlock_xul, &inBlock_yul, x, y, 0, params, ROUND, ROUND, 0, 0);
     mapInvCoord(&inBlock_xur, &inBlock_yur, x + incX - 1, y, 0, params, ROUND, ROUND, 0, 0);
     mapInvCoord(&inBlock_xll, &inBlock_yll, x , y - incY + 1, 0, params, ROUND, ROUND, 0, 0);
     mapInvCoord(&inBlock_xlr, &inBlock_ylr, x + incX - 1, y - incY + 1, 0, params, ROUND, ROUND, 0 ,0);

    /* The blockCoord[index].validFlag tells whether any corner of the mapped shape falls inside the valid area of the input ROI.
     It will be used to speed up EDMA and bilinear processing by skipping blocks that fall outside.
    */
     if (params->skipOutside) {
     blockCoord[index].validFlag= ((inBlock_xul < xRoiCenter) && (inBlock_xul > -xRoiCenter) && (inBlock_yul >= -yRoiCenter) && (inBlock_yul <= yRoiCenter) ) \
    || ((inBlock_xur < xRoiCenter) && (inBlock_xur > -xRoiCenter) && (inBlock_yur >= -yRoiCenter) && (inBlock_yur <= yRoiCenter)) \
    || ((inBlock_xlr < xRoiCenter) && (inBlock_xlr > -xRoiCenter) && (inBlock_ylr >= -yRoiCenter) && (inBlock_ylr <= yRoiCenter)) \
    || ((inBlock_xll < xRoiCenter) && (inBlock_xll > -xRoiCenter) && (inBlock_yll >= -yRoiCenter) && (inBlock_yll <= yRoiCenter));
     }
     else
       blockCoord[index].validFlag=1;

     /* Find corners of the smallest rectangle that contains the mapped shape */
     outerInBlock_xul= min4(inBlock_xul, inBlock_xur, inBlock_xll, inBlock_xlr) - 1;
     outerInBlock_xlr= max4(inBlock_xul, inBlock_xur, inBlock_xll, inBlock_xlr) + 1;
     outerInBlock_yul= max4(inBlock_yul, inBlock_yur, inBlock_yll, inBlock_ylr) + 1;
     outerInBlock_ylr= min4(inBlock_yul, inBlock_yur, inBlock_yll, inBlock_ylr) - 1;

     /* calculate width and height */
     outerInBlock_width= outerInBlock_xlr - outerInBlock_xul + 1;
     outerInBlock_height= outerInBlock_yul - outerInBlock_ylr + 1;

     blockCoord[index].inBlock_x= -(outerInBlock_xul<<COORDQSHIFT); // JRG -(outerInBlock_xul<<params->qShift);
     blockCoord[index].inBlock_y= -(outerInBlock_yul<<COORDQSHIFT); //JRG -(outerInBlock_yul<<params->qShift);
     blockCoord[index].outBlock_x= x;
     blockCoord[index].outBlock_y= y;

     if (outerInBlock_width > privateVars->maxInBlockWidth)
       privateVars->maxInBlockWidth= outerInBlock_width;

     if (outerInBlock_height > privateVars->maxInBlockHeight)
       privateVars->maxInBlockHeight= outerInBlock_height;

     /* calculate upper left corner address */

     /* First convert to coordinate space where (0,0) is upper left corner of input ROI*/
     outerInBlock_xulAbs= outerInBlock_xul + xRoiCenter;
     outerInBlock_yulAbs= -outerInBlock_yul + yRoiCenter;

     /* Add to DDR base address */
     /* If block is inside ROI, normal transfer */
     if (blockCoord[index].validFlag) {
       inTferParamTable[index].in.ddrAddr= (Int32)base->srcBuf[0].ptr + (outerInBlock_xulAbs + outerInBlock_yulAbs*base->srcBuf[0].stride)*scaleFactorSrc;
       inTferParamTable[index].in.imgBufAddr= (Int32)GPP_CPIS_imgBuf + sizeof(BlockCoord);
       }
     else { /* Otherwise transfer nothing */
       inTferParamTable[index].in.ddrAddr= (Int32)base->srcBuf[0].ptr;
       inTferParamTable[index].in.imgBufAddr= (Int32)GPP_CPIS_imgBuf;
       inTferParamTable[index].in.blockWidth= 0; /* if block falls outside ROI, don't transfer */
       inTferParamTable[index].in.blockHeight= 1;
     }
     inTferParamTable[index].in.ddrWidth= base->srcBuf[0].stride*scaleFactorSrc;

     /* We initialize blockWidth and blockHeight later, once we have the final values of
     private->maxInBlockWidth and private->maxInBlockHeight
     */
     index++;
   }

/* Fill two last entries for epilog (not necessary for GPP side though) */
blockCoord[index].validFlag= 0;
blockCoord[index+1].validFlag= 0;

size= privateVars->maxInBlockWidth*scaleFactorSrc*privateVars->maxInBlockHeight;
if (size > MAX_AFFINE_BLOCKSIZE) {
       //printf(" Error in CPIS_affineTransform: block size too big\n");
       //return -1;
     }

index= 0;
for (y=privateVars->outerRoi_yul; y >= privateVars->outerRoi_ylr; y-= incY)
   for (x=privateVars->outerRoi_xul; x <= privateVars->outerRoi_xlr; x+= incX) {

     if (blockCoord[index].validFlag) {
       inTferParamTable[index].in.blockWidth= privateVars->maxInBlockWidth*scaleFactorSrc;
       inTferParamTable[index].in.blockHeight= privateVars->maxInBlockHeight;
       }

     inTferParamTable[index].in.imgBufWidth= inTferParamTable[index].in.blockWidth;

     index++;
   }

CACHE_writeBack(inTferParamTable, params->scratchSize, 1);

ipRun->imgbufLen= size;

ipRunObj->dmaIn[1].dmaChNo = 0;
ipRunObj->dmaIn[1].useTferParamTable= 0;
ipRunObj->dmaIn[1].ddrAddr= (Uint32)blockCoord;
ipRunObj->dmaIn[1].imgBufAddr= (Int32)GPP_CPIS_imgBuf;
ipRunObj->dmaIn[1].blockWidth= sizeof(BlockCoord);
ipRunObj->dmaIn[1].imgBufWidth= sizeof(BlockCoord);
ipRunObj->dmaIn[1].blockHeight= 1;
ipRunObj->dmaIn[1].ddrWidth = 0;
ipRunObj->dmaIn[1].ddrOfstNextBlock= sizeof(BlockCoord);
/* Since the DMA assumes it is a 2-D transfer, need to set ddrOfstNextBlockRow
correctly even though the data is in 1-D form */
ipRunObj->dmaIn[1].ddrOfstNextBlockRow= sizeof(BlockCoord)*ipRunObj->numHorzBlocks;

return 0;
}

Int32 _GPP_CPIS_setAffineTransformProcessing(
CPIS_IpRun *ipRun,
CPIS_BaseParms *base,
void *p,
GPP_CPIS_Func *func){

params= (CPIS_AffineTransformParms *)p;
privateVars= (_CPIS_AffineTransformPrivate*)params->privateVars;

info.imgbufptr= (Int32)(GPP_CPIS_imgBuf + ipRun->imgbufInOfst);
info.imgbuflen= ipRun->imgbufLen;
info.coefptr= (Int16*) (GPP_CPIS_coef + ipRun->coefOfst);
info.coeflen= 0;
info.procBlockSize= base->procBlockSize;

*(Int8*)info.imgbufptr= 0; /* Set the empty block flag to 0 */

scaleFactor = _CPIS_sizeof(base->srcFormat[0]);
*func= &_GPP_CPIS_setAffineTransformProcessingFunc;
blockIndex= 0;
incX= base->procBlockSize.width;
incY= base->procBlockSize.height;

ipRun->imgbufLen= info.imgbuflen;
ipRun->coefLen= info.coeflen << 1;
ipRun->imgbufOutOfst[0]= sizeof(BlockCoord);

return 0;
}

/*
Int32 _GPP_CPIS_setAffineTransformDmaOut(
    CPIS_IpRun *ipRun,
    CPIS_BaseParms *base,
    void *p) {

    Uint16 scaleFactorSrc;

    IP_run *ipRunObj= &ipRun->ipRunObj;

    scaleFactorSrc = _CPIS_sizeof(base->srcFormat[0]);

ipRunObj->dmaOut= &GPP_CPIS_obj.dmaOut[0];
ipRunObj->numDmaOut= 1;

    ipRunObj->dmaOut[0].dmaChNo = 0;
    ipRunObj->dmaOut[0].useTferParamTable= 0;
    ipRunObj->dmaOut[0].ddrAddr= (Uint32)base->dstBuf[0].ptr;
    ipRunObj->dmaOut[0].imgBufAddr= (Int32)(ipRun->imgbufOutOfst[0] + GPP_CPIS_imgBuf);
    ipRunObj->dmaOut[0].ddrWidth = scaleFactorSrc*base->dstBuf[0].stride;
    ipRunObj->dmaOut[0].blockWidth= scaleFactorSrc*base->procBlockSize.width;
    ipRunObj->dmaOut[0].ddrOfstNextBlock= scaleFactorSrc*base->procBlockSize.width;
    ipRunObj->dmaOut[0].imgBufWidth= scaleFactorSrc*base->procBlockSize.width;
    ipRunObj->dmaOut[0].blockHeight= base->procBlockSize.height;
    ipRunObj->dmaOut[0].ddrOfstNextBlockRow= ipRunObj->dmaOut[0].ddrWidth * ipRunObj->dmaOut[0].blockHeight;

return 0;
}
*/

Int32 _GPP_CPIS_setAffineTransformProcessingFunc(){

Int16 x, y, i, j;
Int16 Xint, Yint;
Int16 Xfrac, Yfrac;
Int32 b1, b2, b3, b4;
Int32 XfracYfrac, P;
BlockCoord *blockCoord;

Uint8 *src;
Uint8 *dst;

blockCoord= (BlockCoord *)info.imgbufptr;

/* Copy block into coef memory */
src= (Uint8*)(info.imgbufptr + sizeof(BlockCoord));
dst= (Uint8*)info.coefptr;

for (y=0;y<privateVars->maxInBlockHeight;y++)
   for(x=0;x<scaleFactor*privateVars->maxInBlockWidth;x++){
      *((Uint8*)dst + x + y*scaleFactor*privateVars->maxInBlockWidth)= *((Uint8*)src + x + y*scaleFactor*privateVars->maxInBlockWidth);
   }

src= (Uint8*)info.coefptr;
dst= (Uint8*)(info.imgbufptr + sizeof(BlockCoord));

/* if empty block flag set
   just fill with a value
*/
if (params->skipOutside && blockCoord->validFlag== 0) {

   *((Uint8*)info.imgbufptr)= 0;
   for (y= 0; y < incY; y++)
     for (x= 0; x < incX; x++) {
       if (scaleFactor== 1)
         *((Uint8*)dst + x + y*incX)= 0;
       else
         *((Uint16*)dst + x + y*incX)= 0;
     }
}
else
#if 0 /* Put #if 1 to disable algorithm and to see just the input data */
   for (y= 0; y < incY; y++)
     for (x= 0; x < incX; x++) {
       if (scaleFactor== 1)
         *((Uint8*)dst + x + y*incX)= *((Uint8*)src + x + y*privateVars->maxInBlockWidth);
       else
         *((Uint16*)dst + x + y*incX)= *((Uint16*)src + x + y*privateVars->maxInBlockWidth);
     }
#else
/*
   Do bilinear interpolation
*/
   for (j=0, y= blockCoord->outBlock_y; y > blockCoord->outBlock_y - incY; j++, y--)
     for (i=0, x= blockCoord->outBlock_x; x < blockCoord->outBlock_x + incX; i++, x++) {
     /* Back-map the coordinate
       First get the shifted 'integer' values Xint and Yint
     */
       mapInvCoord(&Xint, &Yint, x, y, 1, params, TRUNC, TRUNC, blockCoord->inBlock_x, blockCoord->inBlock_y); // JRG--int portions have COORDQSHIFT.

     /* Now get the fractional part
     */

//JRG NOTE--integer portions are ignored in the mapInvCoord NOSHIFT case. They make no contribution to the fractional result anyway,
//as they are masked off below. By not including them in the math, we avoid dealing with the problem that inBlock_x and inBlock_y
//have COORDQSHIFT, yet the fractional result is masked with qShift (if we included them, we would either have to shift them up to qshift,
//which would overflow, or do 2 levels of masking, but this is all a waste of CPU, as they do not contribute to the fraction result!)
       mapInvCoord(&Xfrac, &Yfrac, x, y, 1, params, NOSHIFT, NOSHIFT, blockCoord->inBlock_x, blockCoord->inBlock_y);

       Xfrac= Xfrac & ((1<<params->qShift) - 1);    // JRG--since above matrix multiply done with NOSHIFT, and matrix has qshift, the
       Yfrac= Yfrac & ((1<<params->qShift) - 1);    // result has qshift, so fraction mask is qshift.

       /* Calculate the bilinear coefficients */
       if (scaleFactor== 1) {
         b1= *((Uint8*)src + Xint + Yint*privateVars->maxInBlockWidth);
         b2= *((Uint8*)src + Xint + 1 + Yint*privateVars->maxInBlockWidth) - b1;
         b3= *((Uint8*)src + Xint + (Yint+1)*privateVars->maxInBlockWidth) - b1;
         /*
         b4= b1 \
           - *((Uint8*)src + Xint + (Yint+1)*privateVars->maxInBlockWidth) \
           - *((Uint8*)src + Xint + 1 + Yint*privateVars->maxInBlockWidth) \
           + *((Uint8*)src + Xint + 1 + (Yint+1)*privateVars->maxInBlockWidth) ;
         */
         b4= *((Uint8*)src + Xint + 1 + (Yint+1)*privateVars->maxInBlockWidth) \
             - *((Uint8*)src + Xint + 1 + Yint*privateVars->maxInBlockWidth) - b3 ;

         }
       else {
         b1= *((Uint16*)src + Xint + Yint*privateVars->maxInBlockWidth);
         b2= *((Uint16*)src + Xint + 1 + Yint*privateVars->maxInBlockWidth) - b1;
         b3= *((Uint16*)src + Xint + (Yint+1)*privateVars->maxInBlockWidth) - b1;
         /*
         b4= b1 \
           - *((Uint16*)src + Xint + (Yint+1)*privateVars->maxInBlockWidth) \
           - *((Uint16*)src + Xint + 1 + Yint*privateVars->maxInBlockWidth) \
           + *((Uint16*)src + Xint + 1 + (Yint+1)*privateVars->maxInBlockWidth) ;
           */
         b4= *((Uint16*)src + Xint + 1 + (Yint+1)*privateVars->maxInBlockWidth) \
             - *((Uint16*)src + Xint + 1 + Yint*privateVars->maxInBlockWidth) - b3 ;
         }

       XfracYfrac= Xfrac*Yfrac;
       if (params->qShift)
        XfracYfrac= ( ((XfracYfrac >> params->qShift-1)+1) >> 1);

       /* Bilinear interpolation */
       P= b2*Xfrac + b3*Yfrac + b4*XfracYfrac;
       if (params->qShift)
        P= ( ((P >> (params->qShift-1))+1) >> 1);

       P+= b1;

       if (P >= params->sat_high)
         P = params->sat_high_set;
       else if (P < params->sat_low)
         P = params->sat_low_set;

       if (scaleFactor== 1)
         *((Uint8*)dst + i + j*incX)= P;
       else
         *((Uint16*)dst + i + j*incX)= P;

} // for
#endif

blockIndex++;

return 0;
}

/* ======================================================================== */
/*                       End of file                                        */
/* ------------------------------------------------------------------------ */
/*            Copyright (c) 2008 Texas Instruments, Incorporated.           */
/*                           All Rights Reserved.                           */
/* ======================================================================== */
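The interpolation step in the listing above can be restated as a small standalone function. This is only a minimal sketch, not the library code: the names `round_shift` and `bilinear_q` are hypothetical, the source is assumed to be an 8-bit image with row stride `stride`, and right shifts of negative values are assumed to be arithmetic (true on common compilers, but implementation-defined in C).

```c
#include <assert.h>
#include <stdint.h>

/* Round-to-nearest right shift by q bits; q == 0 leaves v unchanged.
   Same idiom as ((v >> (q-1)) + 1) >> 1 in the listing. */
static int32_t round_shift(int32_t v, int q)
{
    return q ? (((v >> (q - 1)) + 1) >> 1) : v;
}

/* Hypothetical helper: bilinear sample of an 8-bit image at integer
   position (xi, yi) with fractional offsets xf, yf in Q(q) format,
   i.e. 0 <= xf, yf < (1 << q). The caller must make sure that
   (xi + 1, yi + 1) is still inside the image. */
static int32_t bilinear_q(const uint8_t *src, int stride,
                          int xi, int yi, int32_t xf, int32_t yf, int q)
{
    assert(q >= 0 && q <= 15);                 /* keeps all products in 32 bits */
    const uint8_t *p = src + xi + yi * stride;
    int32_t b1 = p[0];                         /* top-left pixel              */
    int32_t b2 = p[1] - b1;                    /* horizontal difference       */
    int32_t b3 = p[stride] - b1;               /* vertical difference         */
    int32_t b4 = p[stride + 1] - p[1] - b3;    /* cross term                  */
    int32_t xy = round_shift(xf * yf, q);      /* xf*yf brought back to Q(q)  */
    return b1 + round_shift(b2 * xf + b3 * yf + b4 * xy, q);
}
```

For a 2x2 image {10, 20, 30, 40} with q = 3, sampling at xf = yf = 4 (one half) gives 25, the average of the four pixels. Saturation against sat_high/sat_low is left out of the sketch.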

0 Victor Cheng over 15 years ago in reply to James Gort

TI__Expert 6335 points

Hello Jim,

Thanks for modifying the C code.

I need one more clarification to help my understanding of your changes. It appears that the shift COORDQSHIFT that you apply to outerInBlock_xul and outerInBlock_yul will eventually get canceled by mapInvCoord(). In other words, could we just set COORDQSHIFT= 0 and get the same result?

I think the key point of your modification is that when we call mapInvCoord() for the fractional part, we need to make sure that extra_tx and extra_ty are not added. Meaning that the original implementation could have worked if instead of calling:

mapInvCoord(&Xfrac, &Yfrac, x, y, 1, params, NOSHIFT, NOSHIFT, blockCoord->inBlock_x, blockCoord->inBlock_y);

we would have called:

mapInvCoord(&Xfrac, &Yfrac, x, y, 1, params, NOSHIFT, NOSHIFT, 0,0);

Is my understanding correct?

regards,

Victor

0 James Gort over 15 years ago in reply to James Gort

Intellectual 590 points

Hi Victor-

Yes, you are correct that COORDQSHIFT could be zero--I noticed the same thing after doing the modifications to "fix" it.

The upshot is that, in the original implementation, for the computation of C' (Xint, Yint, Xfrac, Yfrac), the block offsets (blockCoord->inBlock_x and inBlock_y) are ALL INTEGER. Hence, there is no need to shift them up when they are computed (they were previously shifted up by QSHIFT when they were computed, which overflowed when QSHIFT was large). In computing C' for Xint,Yint, they are now added in AFTER the matrix multiply, as integers, rather than being shifted up and then shifted back down after the matrix multiply. In computing C' for Xfrac,Yfrac, they are irrelevant, because they are all integer.

Stated in the context of the matrix multiplies, the current implementation has a 2x3 M matrix, ALL shifted up by QSHIFT, so that C'=(M*C)>>qshift. This fails because M[2] and M[5] are integer coordinate offsets that get shifted up and then back down, which accomplishes nothing except to overflow them!

The modified implementation uses a 2x2 M matrix, all shifted by QSHIFT, but leaves the integer offsets out of the matrix, and hence they don't get shifted up and back down. I.e., C'=((M*C)>>qshift)+[inblock_x,inblock_y].
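The difference between the two schemes can be sketched in plain C for the X coordinate alone. This is only an illustration under stated assumptions: the function names are hypothetical, the matrix entries are assumed to be stored as signed 16-bit values (the thread uses IMXTYPE_SHORT), the accumulator is assumed to be 32 bits wide, and the final shift truncates rather than rounds.

```c
#include <assert.h>
#include <stdint.h>

/* Model of storing a value in a signed 16-bit matrix entry (wraps mod 2^16). */
static int16_t wrap16(int32_t v)
{
    uint16_t u = (uint16_t)v;
    return (int16_t)(u >= 0x8000u ? (int32_t)u - 0x10000 : (int32_t)u);
}

/* Original scheme: the integer offset is pre-shifted by qshift, stored as
   the third matrix column, and shifted back down after the multiply. */
static int32_t map_x_offset_in_matrix(int16_t m0, int16_t m1,
                                      int16_t x, int16_t y,
                                      int32_t offset, int qshift)
{
    assert(qshift >= 0 && qshift <= 15);
    int16_t m2 = wrap16(offset * (1 << qshift));   /* wraps for large qshift */
    int32_t acc = (int32_t)m0 * x + (int32_t)m1 * y + m2;
    return acc >> qshift;
}

/* Modified scheme: 2x2 multiply, shift down, then add the integer offset. */
static int32_t map_x_offset_after(int16_t m0, int16_t m1,
                                  int16_t x, int16_t y,
                                  int32_t offset, int qshift)
{
    assert(qshift >= 0 && qshift <= 15);
    int32_t acc = (int32_t)m0 * x + (int32_t)m1 * y;
    return (acc >> qshift) + offset;
}
```

With an identity transform, x = 5 and an offset of 300, both functions return 305 at qshift = 3. At qshift = 8 the pre-shifted offset (300 << 8 = 76800) no longer fits in 16 bits, so the first function returns 49 while the second still returns 305.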

Thank you for your attention to this. I am still struggling to modify the actual VICP library wrapper function to accomplish the above, and appreciate any help!

Best Regards,

Jim Gort

0 Victor Cheng over 15 years ago in reply to James Gort

TI__Expert 6335 points

Hi Jim,

Thanks for pointing out the overflow issue of M[2] and M[5] to us. To modify the VICP implementation, replace the second call to imxenc_mat_mul():

cmdlen += imxenc_mat_mul (
   blnr->M,
   blnr->C,
   blnr->Cprime_frac,
   3, /* M width */
   2, /* M height */
   numRowCoord*base->procBlockSize.width, /* C width */
   3, /* C height */
   numRowCoord*base->procBlockSize.width, /* Cprime width */
   2, /* Cprime height */
   3, /* M width */
   2, /* M height */
   numRowCoordCompute*base->procBlockSize.width, /* C width */
   3, /* C height */
   IMXTYPE_SHORT,
   IMXTYPE_SHORT,
   IMXOTYPE_SHORT,
   0,
   cmdptr + cmdlen
   );

with:

cmdlen += imxenc_mat_mul (
   blnr->M,
   blnr->C,
   blnr->Cprime_frac,
   3, /* M width */
   2, /* M height */
   numRowCoord*base->procBlockSize.width, /* C width */
   3, /* C height */
   numRowCoord*base->procBlockSize.width, /* Cprime width */
   2, /* Cprime height */
   2, /* M width */
   2, /* M height */
   numRowCoordCompute*base->procBlockSize.width, /* C width */
   2, /* C height */
   IMXTYPE_SHORT,
   IMXTYPE_SHORT,
   IMXOTYPE_SHORT,
   0,
   cmdptr + cmdlen
   );

Let me know how it goes. I actually tried to run this new implementation of the affine transform in a live demo but didn't see any improvement. In this demo, my qshift was limited to 6 but even after this change, setting qshift to 7 showed artifacts so I wonder whether there is any other overflow occurring somewhere else.

regards,

Victor

0 James Gort over 15 years ago in reply to James Gort

Intellectual 590 points

Hi Victor-

I haven't tried it yet, but I think that is only half the fix. The modification you posted takes the (useless) integer stuff out of the computation of Xfrac,Yfrac. However, the other modification that needs to be made is to not shift up and then back down the offsets for the computation of Xint,Yint. The changes should include:

1) Removing <<qshift from blockCoord->inBlock_x and inBlock_y

2) Changing to a 2x2 M matrix for the computation of Xint, Yint

3) Adding inBlock_x and inBlock_y in AFTER the computation in 2).

Best Regards,

Jim

0 Victor Cheng over 15 years ago in reply to James Gort

TI__Expert 6335 points

Hello Jim,

Thanks for these additional suggestions. I was able to modify the VICP implementation and saw an improvement in quality, since I can now set a higher qShift.

Basically, in addition to removing the <<qshift, I replaced the first call to imxenc_mat_mul with:

/* Matrix multiply to find Cprime_x and Cprime_y */
cmdlen += imxenc_mat_mul (
   blnr->M,
   blnr->C,
   blnr->Cprime /*+ 2*rowCoord*base->procBlockSize.width*/,
   3, /* M width */
   2, /* M height */
   numRowCoord*base->procBlockSize.width, /* C width */
   3, /* C height */
   numRowCoord*base->procBlockSize.width, /* Cprime width */
   2, /* Cprime height */
   2, /* M width */
   2, /* M height */
   numRowCoordCompute*base->procBlockSize.width, /* C width */
   2, /* C height */
   IMXTYPE_SHORT,
   IMXTYPE_SHORT,
   IMXOTYPE_SHORT,
   params->qShift,
   cmdptr + cmdlen
   );

cmdlen += imxenc_array_scalar_op (
   blnr->Cprime_X,
   &blnr->M[2],
   blnr->Cprime_X,
   numRowCoord*base->procBlockSize.width,
   1,
   1,
   1,
   numRowCoord*base->procBlockSize.width,
   1,
   numRowCoordCompute*base->procBlockSize.width,
   1,
   IMXTYPE_SHORT,
   IMXTYPE_SHORT,
   IMXOTYPE_SHORT,
   0,
   IMXOP_ADD,
   cmdptr + cmdlen
   );

cmdlen += imxenc_array_scalar_op (
   blnr->Cprime_Y,
   &blnr->M[5],
   blnr->Cprime_Y,
   numRowCoord*base->procBlockSize.width,
   1,
   1,
   1,
   numRowCoord*base->procBlockSize.width,
   1,
   numRowCoordCompute*base->procBlockSize.width,
   1,
   IMXTYPE_SHORT,
   IMXTYPE_SHORT,
   IMXOTYPE_SHORT,
   0,
   IMXOP_ADD,
   cmdptr + cmdlen
   );

Please find attached a zip containing the files that have been modified. In addition to the changes related to the issue you brought up, there are a couple of bug fixes that have not been released yet. Also, the upper 8 bits of the parameter params->skipOutside are now used as the pixel value to fill any part outside of the ROI. Previously, no filling was done if params->skipOutside= 1 and you would see some garbage displayed.

Let me know if this patch fixes your issue.

regards,

Victor

_affineTransform.zip

0 James Gort over 15 years ago in reply to Victor Cheng

Intellectual 590 points

Hi Victor-

THANK YOU! That is exactly what I was looking for, and a great improvement to the function's general purpose use.

As you can see, there was no downside to the changes, and the upside is the only limit on angular and zoom resolution is the amount of zoom you want to do (i.e., set qshift to 15 if your max zoom is 2), and the image size can now be very large without impacting transform resolution.
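To see how qshift limits the resolution of the matrix coefficients, here is a rough sketch of the quantization. The helper names are hypothetical, and it assumes the coefficients are stored as signed 16-bit values in Q(qshift) format with round-to-nearest; the library's own coefficient generation may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Quantize a real matrix coefficient to Q(qshift), rounding to nearest.
   Asserts if the result does not fit in a signed 16-bit value. */
static int16_t quantize_coeff(double c, int qshift)
{
    assert(qshift >= 0 && qshift <= 15);
    double scaled = c * (double)(1 << qshift);
    long q = (long)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
    assert(q >= INT16_MIN && q <= INT16_MAX);
    return (int16_t)q;
}

/* Value actually represented by a quantized coefficient. */
static double dequantize_coeff(int16_t q, int qshift)
{
    return (double)q / (double)(1 << qshift);
}
```

For a 1 degree rotation, sin(theta) is about 0.01745. With qshift = 3 this quantizes to 0, so the rotation disappears entirely; with qshift = 12 it quantizes to 71, i.e. about 0.01733. The range check also shows the trade-off: the larger qshift is, the smaller the largest coefficient that still fits in 16 bits.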

Also, as a side note, I was trying to do the same as you did, but I got hung up on trying to do both the M[2] and M[5] additions in a single call to imxenc_array_scalar_op. I couldn't get it to work! I guess maybe it is impossible....

Again, thank you very much.

Best Regards,

Jim Gort

0 Victor Cheng over 15 years ago in reply to James Gort

TI__Expert 6335 points

Hi Jim,

Thank you for raising the issue and spending time investigating it on your side. With your guidance, we could come up with a solution very quickly.

Indeed, I don't see any function in the VICP computation library that would allow computing the two additions in one operation. A custom function could be written, but the speedup wouldn't be that much anyway.

regards,

Victor

0 MohammadH Ghaemi over 12 years ago in reply to Victor Cheng

Prodigy 120 points

Hi,

I am developing an application on the DM6446 EVM board with ARM9, DSP, VICP and some peripherals.
I have a problem and cannot find a solution for it. Would you mind helping me solve it?
First, please note my design details:

    1 - The ARM is the master and controls the other parts of the system
    2 - The boot process is executed from NAND (BTSEL == 00), so after power on, the RBL (ROM Boot Loader) copies
        the second-level boot loader from NAND to ARM IRAM and then my application is loaded to DDR
    3 - Based on the "spraai4.pdf" and "sprue14c.pdf" documents, I used the UBL to load my own application to DDR
        (note that my application does not use Linux, so I don't use U-Boot; I developed a specific application)
    4 - The binary file of my ARM application was generated from the .out with tms470.exe in CCS v3.3
    5 - Settings for NAND are J4=NAND, S3[1..4] = 0000 (NAND boot, 8-bit AEMIF, ARM boots DSP)
    6 - In my application, first some peripherals of the system (PINMUX, UART, I2C, VPSS) are configured, then the DSP and
        VICP codes are copied to DDR, then DSPBOOTADDR is programmed and finally the ARM releases the C64x+ DSP from reset
    7 - The DSP and VICP codes are included in the ARM binary file as a header
    8 - To program the NAND flash I use flash_burn_utility, based on the following web site
        (http://wiki.davincidsp.com/index.php?title=Serial_Boot_and_Flash_Loading_Utility)
    9 - The value of the PSC_MDSTAT_IMCOP register (0x01C418A0) is 0x1E03 and the value of the PSC_MDCTL_IMCOP register (0x01C41AA0) is 0x0003
        (I tested them with 0x1F03 and 0x0103 respectively but there is no difference)

All parts of my design work correctly except the VICP (specifically, the IMX does not generate an interrupt on completion of the procedure). For example, I checked the above procedure with a simple DSP example ("blinking an LED") and the DSP boots and works correctly.

I also checked the program with a JTAG emulator when DSP_BT == 1 (J4=NAND, S3[1..4] = 0001), and the VICP works correctly in this condition.
Under these circumstances, I flashed the NAND, but the ARM does not boot when DSP_BT == 1 (J4=NAND, S3[1..4] = 0001).

I really don't know why the VICP does not work when DSP_BT == 0. Does it need EDMA, INTC, MCBSP, ... to be initialized?
Does the EDMA3LLD configuration after system power on conflict with the VICP?
(I don't initialize and configure EDMA3 in the ARM application because I think the DSP program initializes it automatically before using the VICP in the CPIS_Init method)
