I converted a C code segment that works fine on the BeagleBoard to NEON intrinsics.
The relevant code fragments are given below.
1. C Code Segment
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
2. Neon Code Segment
AryPred[0] = *Predicted++;
AryPred[1] = *Predicted++;
AryPred[2] = *Predicted++;
AryPred[3] = *Predicted++;
neonResidue = vld1q_s32(Residue);
/* Code for SHIFTR */
neonResidue = vaddq_s32(neonResidue, addconst);
neonResidue = vshrq_n_s32(neonResidue, 6);
neonPredict = vld1q_s32(AryPred);
addedResult = vaddq_s32(neonResidue, neonPredict);
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 0));
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 1));
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 2));
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 3));
Residue += 4;
The version without NEON operations uses the following data types:
kvalue -> int
Residue -> int
Predicted -> unsigned char
Original -> unsigned char
The version with NEON operations uses the following data types:
AryPred -> int
addedResult, neonPredict, neonResidue -> int32x4_t
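The SHIFTR and CLIPS macros are not shown above. To make the scalar segment reproducible, here is a minimal sketch of it under two assumptions: that SHIFTR(x, n) is a rounding right shift (which would match the addconst add followed by vshrq_n_s32 in the NEON version, with addconst presumably holding 1 << 5 in each lane), and that CLIPS(max, v) clamps v to the range [0, max]. The helper name reconstruct_scalar is hypothetical and only used for illustration.

```c
typedef unsigned char byte;

/* Assumption: rounding right shift, n >= 1. Right-shifting a negative int
   is implementation-defined in C; GCC performs an arithmetic shift. */
#define SHIFTR(x, n) (((x) + (1 << ((n) - 1))) >> (n))

/* Assumption: clamp v to the range [0, max]. */
#define CLIPS(max, v) ((v) < 0 ? 0 : ((v) > (max) ? (max) : (v)))

/* Hypothetical helper: same arithmetic as the four unrolled scalar
   statements above, written as a loop over count pixels. */
static void reconstruct_scalar(const int *Residue, const byte *Predicted,
                               byte *Original, int count)
{
    for (int i = 0; i < count; i++) {
        int kvalue = SHIFTR(Residue[i], 6) + Predicted[i];
        Original[i] = (byte) CLIPS(255, kvalue);
    }
}
```

Calling reconstruct_scalar(Residue, Predicted, Original, 4) corresponds to one group of four statements in the C code segment, which makes it a convenient baseline for checking the output of the NEON version and for timing both paths.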
The Makefile settings are as follows:
RESULT =-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp -flax-vector-conversions
Usually NEON intrinsics give better execution times, but here I get the opposite effect: the NEON version is slower than the plain C version. What is the main reason for this?