Buffers from VFCC are 6 times slower than locally allocated buffers

I have a simple OMX pipeline set up with the VFCC feeding CMUX, which then calls my own function to process each video frame. The VFCC keeps up with the 45 frames per second my imager produces (a budget of about 22 ms per frame), but I cannot access the data in the frame buffer at anywhere near that rate. My video frames are 16-bit YUV, 1280x960 (2467840 bytes), and it currently takes around 77 ms to perform a fairly simple operation on every pixel in the frame:

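/* Source: the frame buffer delivered by the pipeline, read 64 bits at a time. */
src = (uint64_t*)be->pEncodeBuffer;

/* Destination: a scratch buffer, allocated and zeroed once on the first call. */
static uint16_t *shiftBuffer = NULL;
if (shiftBuffer == NULL) {
        shiftBuffer = (uint16_t*)malloc(1280*960*sizeof(uint16_t));
        memset(shiftBuffer, 0, 1280*960*sizeof(uint16_t));
}
dst = shiftBuffer;

/* Shift and mask each 64-bit word of the frame. */
for (x = 0; x < pMetaData->framesize; x += sizeof(uint64_t)) {
        temp_word = *src;
        *dst = ((temp_word>>12) & 0x0FFF0FFF);
        src++;
        dst++;
}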

If, instead of reading from the frame buffer, I read from a buffer allocated locally in the same way as shiftBuffer, the same block of code takes around 14 ms to execute:

/* Stand-in source: a locally malloc'd buffer used in place of the frame buffer. */
static uint16_t *fakeBuffer = NULL;
if (fakeBuffer == NULL) {
        fakeBuffer = (uint16_t*)malloc(1280*960*sizeof(uint16_t));
}
src = (uint64_t*)fakeBuffer;

static uint16_t *shiftBuffer = NULL;
if (shiftBuffer == NULL) {
        shiftBuffer = (uint16_t*)malloc(1280*960*sizeof(uint16_t));
        memset(shiftBuffer, 0, 1280*960*sizeof(uint16_t));
}
dst = shiftBuffer;

for (x = 0; x < pMetaData->framesize; x += sizeof(uint64_t)) {
        temp_word = *src;
        *dst = ((temp_word>>12) & 0x0FFF0FFF);
        src++;
        dst++;
}
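
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the per-pass time could be measured. It assumes a POSIX system with clock_gettime(). The helper names (shift_words, elapsed_ms, time_pass_ms) are hypothetical and not part of my pipeline. The sketch also writes full 64-bit words to the destination, which is a simplification and may not match the destination type in the code above.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict C modes */
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: apply the shift-and-mask from the post to n_words
 * 64-bit words. Returns a checksum so the compiler cannot discard the loop. */
static uint64_t shift_words(const uint64_t *src, uint64_t *dst, size_t n_words)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n_words; i++) {
        dst[i] = (src[i] >> 12) & 0x0FFF0FFF;
        sum += dst[i];
    }
    return sum;
}

/* Elapsed milliseconds between two CLOCK_MONOTONIC samples. */
static double elapsed_ms(const struct timespec *t0, const struct timespec *t1)
{
    return (double)(t1->tv_sec - t0->tv_sec) * 1e3 +
           (double)(t1->tv_nsec - t0->tv_nsec) / 1e6;
}

/* Time a single pass over a source buffer of n_words words.
 * The checksum of the pass is stored through *checksum. */
static double time_pass_ms(const uint64_t *src, uint64_t *dst,
                           size_t n_words, uint64_t *checksum)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *checksum = shift_words(src, dst, n_words);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ms(&t0, &t1);
}
```

Calling time_pass_ms() once with the pipeline's frame buffer as src and once with a malloc'd buffer of the same size should show the same kind of gap as the 77 ms and 14 ms figures above, if the difference really comes from the source memory rather than from the loop itself.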

Why is this happening? Is it because the framebuffers were allocated by one of the M3 cores