NEON intrinsics overhead on BeagleBoard

I converted a C code segment that works fine on the BeagleBoard to NEON intrinsics.

The code fragments are given below.

1. C Code Segment

kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);

2. NEON Code Segment

AryPred[0] = *Predicted++; AryPred[1] = *Predicted++; AryPred[2] = *Predicted++; AryPred[3] = *Predicted++;
neonResidue = vld1q_s32(Residue);
/* Code for SHIFTR */
neonResidue = vaddq_s32(neonResidue, addconst);
neonResidue = vshrq_n_s32(neonResidue, 6);

neonPredict = vld1q_s32(AryPred);
addedResult = vaddq_s32(neonResidue, neonPredict);
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 0));
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 1));
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 2));
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 3));
Residue += 4;

The following data types are used in the version without NEON operations:

kvalue -> int
Residue -> int
Predicted -> unsigned char
Original -> unsigned char
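
The SHIFTR and CLIPS macros are not shown in the C segment. As a minimal scalar reference for checking a NEON version against, here is a sketch that assumes SHIFTR(x, 6) is a rounding right shift (add 32, then shift by 6, which is what the add-then-shift in the NEON segment suggests) and that CLIPS(255, x) clamps x to the range 0..255. Both macro definitions and the function name reconstruct4 are assumptions made for illustration, not the original code.

```c
typedef unsigned char byte;

/* Assumed definitions, not taken from the original source. */
/* Rounding right shift: add half of 2^n, then shift right by n.
   Relies on arithmetic right shift for negative ints (as GCC provides). */
#define SHIFTR(x, n)  (((x) + (1 << ((n) - 1))) >> (n))
/* Clamp x to the range [0, max]. */
#define CLIPS(max, x) ((x) < 0 ? 0 : ((x) > (max) ? (max) : (x)))

/* Hypothetical helper: reconstructs four pixels, equivalent to the
   four unrolled statements of the C code segment under the
   assumptions above. */
void reconstruct4(const int *Residue, const byte *Predicted, byte *Original)
{
    for (int i = 0; i < 4; ++i) {
        int kvalue = SHIFTR(Residue[i], 6) + Predicted[i];
        Original[i] = (byte) CLIPS(255, kvalue);
    }
}
```

Running the same inputs through this function and through the NEON segment is one way to confirm that both versions produce identical output before comparing their timings.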

The following additional data types are used in the version with NEON operations:

AryPred -> int
addedResult, neonPredict, neonResidue -> int32x4_t

The makefile settings are as follows:

RESULT =-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp -flax-vector-conversions

NEON intrinsics usually give a shorter execution time, but here I get the opposite effect: the NEON version is slower than the plain C version. What is the main reason for this?