This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

VPBE/VPFE PSP driver usage

Other Parts Discussed in Thread: TVP5146

Hi,
  After much frustration attempting to "code from scratch" (e.g., no DSP/BIOS) some simple video/interrupt-based applications, I decided to give DSP/BIOS a shot on my EVM6437 kit.  I've read the user guide/API guide and went through the tutorials.  I'm now at the point where I'd like to implement the silliest of video "processing" algorithms, e.g. just some algorithm that loops through a frame of video and modifies some or all of the pixels.  My current setup is a camera input to the kit and component output to a monitor.  To avoid wasting too much time getting set up, I've taken the video_preview example and attempted to modify it.  The example ships with a while loop that calls the FVID wrapper function FVID_exchange twice: once for the front end and once for the back end.  As I understand it, the driver simply modifies the register overlay so that the buffer pointer for the VPBE/VPFE is changed to the appropriate memory address.  The code reads like this:

while (status == 0) {
    FVID_exchange(hGioVpfeCcdc, &frameBuffPtr);
    FVID_exchange(hGioVpbeVid0, &frameBuffPtr);
}

Since I want to modify every pixel in a frame, I inserted a for loop between the two API calls that loops through each pixel and multiplies it by 0.5:

for (j = 0; j < num_pixels; j++) {
    *((Uint8 *)frameBuffPtr->frame.frameBufferPtr + j) *= 0.5;
}

This certainly resulted in a modified image, but not at all what I expected.  First, the frame rate appears to have dropped to a near standstill.  Second, the actual image doesn't look lighter; it's distorted with vertical "lines" (spaced at what appears to be every other column) and saturated on the white end.  As I understand the driver architecture and API calls, what should happen after the first exchange is that frameBuffPtr points to the address of the frame dequeued from the VPFE.  I modify that frame with my loop and then queue it onto the output frame buffer.  The dequeued output frame buffer is queued onto the input on the next iteration of the loop.  Since the VPBE just reads data out of memory continuously, I expect that if the "algorithm" between the two FVID_exchange calls takes a while, the VPBE will just display the same frame over and over until a queue/dequeue operation.  Is that the way the driver is in fact implemented, or is the driver interrupt-based such that it just rolls through the VPBE frame buffers continuously regardless of queue/dequeue operations?  Clearly what I'm trying to do is not working as hoped.  Is this the right approach?  What am I missing?

The driver documentation doesn't really explain the inner workings of the driver, though it states that it's interrupt based.  I assume that every time a new frame comes in, the driver places it at the tail of the queue.  Does it overwrite data automatically?  In other words, if my algorithm is REALLY slow and I'm processing a frame, is it possible for the VPFE to overwrite the frame I'm processing?  I.e., if I don't dequeue a frame, is its data overwritten?  On the other hand, if I repeatedly call FVID_exchange on the VPFE and then the VPBE and a new frame is NOT ready, does the driver block until a new frame arrives?  Any help/input would be much appreciated.  Thanks,


D

  • You have to look at the format the per-pixel data is stored in in RAM - probably YCbCr, e.g. 24 bits per pixel - and if you multiply each byte by 0.5 you may be doing quite a lot of operations per frame of 40 ms or so, and it will definitely not make the image "lighter".  The TI datasheets are not too clear about the YUV 4:2:2 / YCbCr format, and I only got a clear picture after finding some other sites via a Google search.  msdn.microsoft.com has some programmer's information, as do indopedia and lots of other sites.

  • Hi Hinj,
      Thanks for your reply.  I did indeed consider the data format in RAM, as well as the time the loop consumed, as possible issues.  It looks to me like the example actually uses 8-bit-per-pixel YUV 4:2:2.  As for the time required for the loop - I actually started with the silliest of "algorithms", e.g.

    for (i = 0; i < num_pixels; i++)
        old_pixel[i] = 0;

    This also doesn't result in what I'd expect: the same issue of vertical lines and "ghosting" of video behind/in front of what I'd expect to see as a solid block.  Since I started worrying about time, I tried looping over num_pixels/4.  Same result.  I added some code around the loop to examine how much time it was consuming.  In particular, I call TSK_time() before and after the loop and, using LOG_printf, get the delta time.  I calculate that the loop itself takes roughly 0.4 ms for 1/4 of the image.  Even at 30 frames a second this is a fraction of the time required to clock in a frame.  Does anyone have any ideas?  Can someone please help steer me in the right direction?  Thanks,

    D

    ps Please find below my modified video_preview.c:

     

    /*
     * ======== video_preview.c ========
     *
     */


    /* runtime include files */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdarg.h>

    /* BIOS include files */
    #include <std.h>
    #include <log.h>
    #include <gio.h>
    #include <tsk.h>
    #include <trc.h>
    #include <gbl.h>
    #include <clk.h>

    /* PSP include files */
    #include <psp_i2c.h>
    #include <psp_vpfe.h>
    #include <psp_vpbe.h>
    #include <fvid.h>
    #include <psp_tvp5146_extVidDecoder.h>

    /* CSL include files */
    #include <soc.h>
    #include <cslr_sysctl.h>

    /* BSL include files */
    #include <evmdm6437.h>
    #include <evmdm6437_dip.h>

    /* Video Params Defaults */
    #include <vid_params_default.h>

    /* ****** */
    #include "video_previewcfg.h"

    /* This example supports either PAL or NTSC depending on position of JP1 */
    #define STANDARD_PAL  0
    #define STANDARD_NTSC 1

    #define FRAME_BUFF_CNT 6

    static int read_JP1(void);

    static CSL_SysctlRegsOvly sysModuleRegs = (CSL_SysctlRegsOvly )CSL_SYS_0_REGS;


    /*
     * ======== main ========
     */
    void main() {

        printf("Video Preview Application\n");
        fflush(stdout);

        /* Initialize BSL library to read jumper switches: */
        EVMDM6437_DIP_init();

        /* VPSS PinMuxing */
        /* CI10SEL   - No CI[1:0]                       */
        /* CI32SEL   - No CI[3:2]                       */
        /* CI54SEL   - No CI[5:4]                       */
        /* CI76SEL   - No CI[7:6]                       */
        /* CFLDSEL   - No C_FIELD                       */
        /* CWENSEL   - No C_WEN                         */
        /* HDVSEL    - CCDC HD and VD enabled           */
        /* CCDCSEL   - CCDC PCLK, YI[7:0] enabled       */
        /* AEAW      - EMIFA full address mode          */
        /* VPBECKEN  - VPBECLK enabled                  */
        /* RGBSEL    - No digital outputs               */
        /* CS3SEL    - LCD_OE/EM_CS3 disabled           */
        /* CS4SEL    - CS4/VSYNC enabled                */
        /* CS5SEL    - CS5/HSYNC enabled                */
        /* VENCSEL   - VCLK,YOUT[7:0],COUT[7:0] enabled */
        /* AEM       - 8bEMIF + 8bCCDC + 8 to 16bVENC   */
        sysModuleRegs -> PINMUX0    &= (0x005482A3u);
        sysModuleRegs -> PINMUX0    |= (0x005482A3u);

        /* PCIEN    =   0: PINMUX1 - Bit 0 */
        sysModuleRegs -> PINMUX1 &= (0xFFFFFFFEu);
        sysModuleRegs -> VPSSCLKCTL = (0x18u);
       
        LOG_printf(&trace, "Returning from main.");
        return;
    }

    /*
     * ======== video_preview ========
     */


    void video_preview(void) {

      FVID_Frame *frameBuffTable[FRAME_BUFF_CNT];
      FVID_Frame *frameBuffPtr;
      GIO_Handle hGioVpfeCcdc;
      GIO_Handle hGioVpbeVid0;
      GIO_Handle hGioVpbeVenc;
      int status = 0;
      int result;
      int i;
      int standard;
      int width;
      int height;
      Uint32 num_pixels;
      Uint32 j;
      LgUns freq;
      Float milliSecsPerIntr, cycles;
      Int curr_time;

     

      /* Set video display/capture driver params to defaults */
      PSP_VPFE_TVP5146_ConfigParams tvp5146Params =
           VID_PARAMS_TVP5146_DEFAULT;
      PSP_VPFECcdcConfigParams      vpfeCcdcConfigParams =
           VID_PARAMS_CCDC_DEFAULT_D1;
      PSP_VPBEOsdConfigParams vpbeOsdConfigParams =
           VID_PARAMS_OSD_DEFAULT_D1;
      PSP_VPBEVencConfigParams vpbeVencConfigParams;

      /* Clock Info */
      freq = GBL_getFrequency();
      cycles = CLK_cpuCyclesPerLtime();
      milliSecsPerIntr = cycles/(Float)freq;

      LOG_printf(&trace, "freq = %d", freq);
      LOG_printf(&trace, "cycles = %f", cycles);
      LOG_printf(&trace, "milliSecs = %f", milliSecsPerIntr);

      standard = read_JP1();
     
      /* Update display/capture params based on video standard (PAL/NTSC) */
      if (standard == STANDARD_PAL)  {
           width  = 720;
           height = 576;
           vpbeVencConfigParams.displayStandard = PSP_VPBE_DISPLAY_PAL_INTERLACED_COMPOSITE;
      }
      else {
           width  = 720;
           height = 480;
           vpbeVencConfigParams.displayStandard = PSP_VPBE_DISPLAY_NTSC_INTERLACED_COMPOSITE;
      }
      vpfeCcdcConfigParams.height = vpbeOsdConfigParams.height = height;
      vpfeCcdcConfigParams.width = vpbeOsdConfigParams.width = width;
      vpfeCcdcConfigParams.pitch = vpbeOsdConfigParams.pitch = width * 2;
     
      num_pixels = 2 * width * height;
      /* init the frame buffer table */
      for (i=0; i<FRAME_BUFF_CNT; i++) {
        frameBuffTable[i] = NULL;
      }

      /* create video input channel */
      if (status == 0) {
        PSP_VPFEChannelParams vpfeChannelParams;
        vpfeChannelParams.id     = PSP_VPFE_CCDC;
        vpfeChannelParams.params = (PSP_VPFECcdcConfigParams*)&vpfeCcdcConfigParams;
        hGioVpfeCcdc = FVID_create("/VPFE0",IOM_INOUT,NULL,&vpfeChannelParams,NULL);
        status = (hGioVpfeCcdc == NULL ? -1 : 0);
      }

      /* create video output channel, plane 0 */
      if (status == 0) {
        PSP_VPBEChannelParams vpbeChannelParams;
        vpbeChannelParams.id     = PSP_VPBE_VIDEO_0;
        vpbeChannelParams.params = (PSP_VPBEOsdConfigParams*)&vpbeOsdConfigParams;
        hGioVpbeVid0 = FVID_create("/VPBE0",IOM_INOUT,NULL,&vpbeChannelParams,NULL);
        status = (hGioVpbeVid0 == NULL ? -1 : 0);
      }

      /* create video output channel, venc */
      if (status == 0) {
        PSP_VPBEChannelParams vpbeChannelParams;
        vpbeChannelParams.id     = PSP_VPBE_VENC;
        vpbeChannelParams.params = (PSP_VPBEVencConfigParams *)&vpbeVencConfigParams;
        hGioVpbeVenc = FVID_create("/VPBE0",IOM_INOUT,NULL,&vpbeChannelParams,NULL);
        status = (hGioVpbeVenc == NULL ? -1 : 0);
      }

      /* configure the TVP5146 video decoder */
      if (status == 0) {
        result = FVID_control(hGioVpfeCcdc, VPFE_ExtVD_BASE+PSP_VPSS_EXT_VIDEO_DECODER_CONFIG, &tvp5146Params);
        status = (result == IOM_COMPLETED ? 0 : -1);
      }

      /* allocate some frame buffers */
      if (status == 0) {
        for (i=0; i<FRAME_BUFF_CNT && status == 0; i++) {
          result = FVID_allocBuffer(hGioVpfeCcdc, &frameBuffTable[i]);
          status = (result == IOM_COMPLETED && frameBuffTable[i] != NULL ? 0 : -1);
        }
      }

      /* prime up the video capture channel */
      if (status == 0) {
        FVID_queue(hGioVpfeCcdc, frameBuffTable[0]);
        FVID_queue(hGioVpfeCcdc, frameBuffTable[1]);
        FVID_queue(hGioVpfeCcdc, frameBuffTable[2]);
      }

      /* prime up the video display channel */
      if (status == 0) {
        FVID_queue(hGioVpbeVid0, frameBuffTable[3]);
        FVID_queue(hGioVpbeVid0, frameBuffTable[4]);
        FVID_queue(hGioVpbeVid0, frameBuffTable[5]);
      }

      /* grab first buffer from input queue */
      if (status == 0) {
        FVID_dequeue(hGioVpfeCcdc, &frameBuffPtr);
      }
     
      /* loop forever performing video capture and display */
      while ( status == 0 ) {

        /* grab a fresh video input frame */
      
        status = FVID_exchange(hGioVpfeCcdc, &frameBuffPtr);
        curr_time = (Int)TSK_time();
        if(status == 0){
          for (j = 0; j < num_pixels/4; j++) {
            *((Uint8 *)frameBuffPtr->frame.frameBufferPtr + j) = 0;
          }
         
          LOG_printf(&trace, "Delta = %d", (Int)TSK_time() - curr_time);
        }
      
        else
          LOG_printf(&trace, "Got status not 0");
           
        /* display the video frame */
        status = FVID_exchange(hGioVpbeVid0, &frameBuffPtr);
       
        if(status != 0)
          LOG_printf(&trace, "2nd exch");
      }
    }

    /*
     * ======== read_JP1 ========
     * Read the PAL/NTSC jumper.
     *
     * Retry, as I2C sometimes fails:
     */
    static int read_JP1(void)
    {
        int jp1 = -1;

        while (jp1 == -1) {
          jp1 = EVMDM6437_DIP_get(JP1_JUMPER);
          TSK_sleep(1);
        }
        return(jp1);
    }      

  • you are using the DM6437 - this is a DSP-only SoC design with a C64x+ core inside.

    so you are fine, first of all, with using DSP-BIOS as your foundation. it provides a rich set of drivers and OS-style features, including multithreading.

     

    approaching your problem: FVID is a fine thing for all standard cases and even covers a fair number of special cases, so i doubt there is a major problem in it; more likely there are some concepts that you, as a starter in this area, have not yet fully understood.

    the color format on the DM6437 is typically YUV, meaning two pixels are encoded with two luminance samples but only one pair of color samples.

    if you debug your application, you will see that the video buffer might contain something like this: "0x80 0x43 , 0x80 0x76 , 0x80 0x43 , 0x80 0x76 ..."

    (that's just a sample from memory - the true ordering might differ in a real-world scenario.)

    As you can see the first element repeats every second byte. The other components repeat every 4th byte.

    (check this in your setup with some test image that has some color in the top left corner. further, try with different colors.)

     

    As you are doing byte-by-byte access you are horribly damaging the performance of the C64x+ core, since it has data paths that can transfer up to 2x32 bits in a single load instruction.

    Further, byte-wide writes impose repeated read-modify-write operations, since the minimum transfer granularity is 32 bits.

    Beyond this, you are reading from not-yet-cached external memory, which incurs quite a lot of wait states.

     

    dig through the DaVinci wiki for the EDMA3 LLD and find the document from a full-scale DSP-BIOS course (posted there by one of the TI TTO representatives) explaining what EDMA3 can do for you in transferring data in the background to some very fast memory close to the processor.

     

    what else?

    as images are not necessarily the same width as the offset between the starts of lines (the pitch, or stride), you should rather use two nested loop counters: one for the lines and one for the pixels within a line.

    do not use any floating-point operations or divides (like multiplying by 0.5), since the C64x+ does not have a floating-point unit. use byte, 2-byte and 4-byte integer operations instead. there are nice "packed-mode" operations for that core; see the processor instruction set and compiler manuals (intrinsic commands) for the fastest solution.

    turn on compiler optimisation for this single source file. turning optimisation for speed up to function level should help, and tuning optimisation for size at some of the lower levels might improve speed as well.

     

    maybe you want to attend one of the advanced training courses from TI. there is one for DSP-BIOS, another for C6x optimisation techniques, and a few more.

  • Hi Alexander,

    Thanks a lot for your reply (and sorry for taking a bit to get back to you).  Please see my comments inline...  Thanks,

     

    Alexander Stohr said:

    you are using the DM6437 - this is a DSP-only SoC design with a C64x+ core inside.

    so you are fine, first of all, with using DSP-BIOS as your foundation. it provides a rich set of drivers and OS-style features, including multithreading.

     

    approaching your problem: FVID is a fine thing for all standard cases and even covers a fair number of special cases, so i doubt there is a major problem in it; more likely there are some concepts that you, as a starter in this area, have not yet fully understood.

    Yes, the documentation on the PSP drivers isn't amazing, so my questions in the original post still remain open - for example, the issue of resource blocking, etc.  Any thoughts?  See the original post for the specific questions.

    Alexander Stohr said:

    the color format on the DM6437 is typically YUV, meaning two pixels are encoded with two luminance samples but only one pair of color samples.

    if you debug your application, you will see that the video buffer might contain something like this: "0x80 0x43 , 0x80 0x76 , 0x80 0x43 , 0x80 0x76 ..."

    (that's just a sample from memory - the true ordering might differ in a real-world scenario.)

    As you can see the first element repeats every second byte. The other components repeat every 4th byte.

    (check this in your setup with some test image that has some color in the top left corner. further, try with different colors.)

    I'm aware of the fact that the example I'm using utilizes the YUV format.  It in fact appears to be using a packed 8-bit YUV format, i.e. not 16 bits per pixel, but one byte per pixel.  That said, I'm attempting to set every byte in the buffer to 0, which in YUV format is black.  As mentioned in my original post, my current goal is simply to modify every pixel value in a frame of video and see that modification.  After I have the "infrastructure" set up, I'll work on actually implementing the algorithms that I've already coded in Matlab/C.  You allude to the fact that I need to optimize memory bandwidth, etc., by making use of the C64x+ instruction set.  You're 100% right, and I indeed planned on doing so, but the immediate goal is just to perform a simple operation on each pixel.

    Alexander Stohr said:

    As you are doing byte-by-byte access you are horribly damaging the performance of the C64x+ core, since it has data paths that can transfer up to 2x32 bits in a single load instruction.

    Further, byte-wide writes impose repeated read-modify-write operations, since the minimum transfer granularity is 32 bits.

    Beyond this, you are reading from not-yet-cached external memory, which incurs quite a lot of wait states.

     

    dig through the DaVinci wiki for the EDMA3 LLD and find the document from a full-scale DSP-BIOS course (posted there by one of the TI TTO representatives) explaining what EDMA3 can do for you in transferring data in the background to some very fast memory close to the processor.

    Absolutely correct.  I also thought of this and do intend to optimize a bit.  I'll check out that wiki/document.  Thanks

    Alexander Stohr said:

    what else?

    as images are not necessarily the same width as the offset between the starts of lines (the pitch, or stride), you should rather use two nested loop counters: one for the lines and one for the pixels within a line.

    do not use any floating-point operations or divides (like multiplying by 0.5), since the C64x+ does not have a floating-point unit. use byte, 2-byte and 4-byte integer operations instead. there are nice "packed-mode" operations for that core; see the processor instruction set and compiler manuals (intrinsic commands) for the fastest solution.

    turn on compiler optimisation for this single source file. turning optimisation for speed up to function level should help, and tuning optimisation for size at some of the lower levels might improve speed as well.

     

    maybe you want to attend one of the advanced training courses from TI. there is one for DSP-BIOS, another for C6x optimisation techniques, and a few more.

    Indeed, I stopped using floating-point instructions; instead of multiplying by 0.5, I now just set each byte to zero.  I still get the same results.  I think that beyond the optimizations you mention, there is a fundamental misuse and misunderstanding of the PSP driver on my part.  My original questions remain: how does the PSP driver handle overwriting of data?  Is the way I modify the frame buffer the "proper" approach?  I'll try to take a screen capture of what I see when I try to "black out" the frame and upload it later today.  In short, I see nothing close to black.  I see some "ghosting" of video behind dark vertical lines that "dance" (in other words, it looks like only parts of the frame get modified, and that it's not the same part each time; this is what leads me to ask the questions about resource blocking and data overwriting in the driver implementation).  Thanks a lot for your reply...any additional help would be much appreciated.

    D

  • as you indicate you are seeing some "ghost" images, i can think of two possibilities:

    a) there is some analogue coupling on the CVBS (in case both data paths are analog video), e.g. by insufficient cable shielding or bad ground designs

    b) only the Y component is set to zero whilst the U and V components keep their values - some color will remain, and even some Y to U/V channel crosstalk can appear on the path to the analog display.

    BTW, the U and V neutral value is 0x80, meaning 50% on an unsigned scale, or "0" on a symmetrical signed scale.

     

    hmm, 8-bit YUV formats are rather rare, since they would mean pretty low quality. an effective encoding rate of 16 bits per pixel is more common.

    having 8-bit Y (and another 8 bits for U/V) is more common with planar formats, but the DM6437 prefers interleaved formats. some codecs might still source or sink planar data.

     

    the deepest performance impact is still the single-byte access, as it triggers the rather expensive read-modify-write sequence. try reading and writing in aligned 32-bit chunks for your processing.

  • Hi Alexander,

    Thanks for the response.  I too thought that 8-bit YUV was a little awkward, but it looks like that's indeed the case.  The option for 16 bits does exist, but the video_preview example application doesn't appear to utilize it.  Regarding the ghosting, it's definitely not a result of any analog noise, since the unmodified video_preview application (which just calls FVID_exchange on the input/output repeatedly) works fine.  The second option you suggest also doesn't seem to be right, since I checked the frame buffer size and found that it's exactly width * height * 2 bytes (2 bytes per pixel in my case).  Since I just write 0x00 to every byte, the U/V should also be overwritten with 0x00.

    I read the EDMA3 document that you suggested I look at.  I am thinking about using the ACPY3 functions to grab a chunk of a video frame, modify it, and write it back.  Does that sound reasonable?  I realize that there is a huge bottleneck in having the CPU read/write the data, but I'm still unsure whether or not my approach here is in fact valid.  I tried doing something even simpler.  After the exchange call I do:

    *((Uint32*)frameBuffPtr->frame.frameBufferPtr) = 0x00000000;

    Sure, the CPU has to do a read/write, but it's one per frame, which should be easy enough.  I would think that this should black out the first 32 bits of the frame (e.g. the first 4 pixels).  This seemingly has NO effect on the frame.  That said, do you (or somebody else) have experience with using the VPSS PSP drivers in conjunction with software algorithms that modify pixels in a frame?  How about the whole issue of resource blocking?  Any thoughts?  Thanks!

    D

     

  • if caches are enabled, then they must be flushed (written back) before the frame is given back to the hardware.

    otherwise some flicker might happen, or only partial results will be visible.

  • Ok...I'll give the ACPY3 calls a shot.  Any thoughts/ideas as to why, when I attempt to modify a single pixel or a very small number of pixels, I see no change?  I would expect that the CPU would be able to fetch a few bytes from memory and write them back in the time between two frames of video....

    I also discovered the Codec Engine framework.  I'll try to dig through the way they perform memory access and mimic it, but in general, is ACPY3 the right way to go?  Any personal experience with implementing real-time video processing algorithms with DSP/BIOS?

    Thanks,

    D

  • try this....

    http://wiki.davincidsp.com/index.php?title=Accessing_pixels_in_a_frame_on_DM643x

     

    regards,

    mani