How do I capture a raw video frame and process it on ARM, DSP and HDVICP (DM8168)?

I want to be able to capture a frame of video data and do 3 separate types of processing on it. My platform is the DM8168. The 3 things are:

1) Do some processing with it on the ARM.

2) Get the HDVICPs to compress it to H264.

3) Use the DSP core (C674x) of the DM8168 to process the frame.

What is my best option for doing all this? I would like to keep the captured raw video frame in ONE place in memory and not have to copy it for each separate bit of processing.

Should I be using OpenMAX all the way and modify the VLPB component to perform processing on the DSP? Or should I use Codec Engine to create a codec? Or C6accel?

I would greatly value TI's advice on where to go with this. I don't know what is most efficient and most feasible. I've read countless forum posts but can't come to a conclusion.

Thanks,
Ralph

  • Hi,

    I would also be very interested to know how to process video frames efficiently on the A8 ARM. I am working with an OMX application that allocates its buffers in shared region 2, which is not cacheable by the A8, so performance is very poor. Is there any way to boost A8 performance?

    Thanks,
    Gabi 

  • I'll ramble a bit at ya... maybe it'll help.

    I don't know OMX, but assuming your data is managed in physically contiguous buffers that all cores can see, what you're describing seems doable.

    I assume you know how to handle the capture and #1. And I'd assume you have an OMX-based solution for #2 (again, not sure about anything OMX-related).

    As for the DSP, I frankly haven't seen any examples on extending OMX with custom algorithms on the DSP, but there are some for CE.  Also, I know C6accel is no longer actively maintained - so I'd advise against that.

    I may be biased because I'm very familiar with it, but I'd recommend using CE for #3.  The CE APIs allow you to pass buffer pointers to 'somewhat' arbitrary memory, so long as it's physically contiguous. CE will internally do the necessary address translation as the buffers move from the A8 to the DSP and back. And if the buffer is cached from the DSP side, it will also manage that cache for you.

    This article may be an interesting read, too:
       
    http://e2e.ti.com/blogs_/b/codec_engine/archive/2010/11/03/finer-details-physically-contiguous-memory.aspx

    Chris
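
To illustrate why physical contiguity matters for the hand-off Chris describes: when a buffer occupies one contiguous physical range, turning an ARM-side virtual pointer into the address another core sees is a single base-plus-offset calculation. The sketch below is plain C, not Codec Engine code. `region_t` and `virt_to_phys` are hypothetical names used only to show the idea, and the real translation inside CE may well work differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one physically contiguous region. */
typedef struct {
    uintptr_t virt_base;   /* where the ARM process has the region mapped */
    uint32_t  phys_base;   /* where the region starts in physical memory */
    size_t    size;        /* length of the region in bytes */
} region_t;

/*
 * Translate the buffer [virt, virt + len) to a physical address.
 * Returns 0 on success, -1 if the buffer is not fully inside the region.
 */
int virt_to_phys(const region_t *r, uintptr_t virt, size_t len, uint32_t *phys)
{
    assert(r != NULL && phys != NULL);

    if (virt < r->virt_base)
        return -1;

    size_t off = (size_t)(virt - r->virt_base);
    if (off > r->size || len > r->size - off)
        return -1;

    /* Contiguity means one offset is valid for the whole buffer. */
    *phys = r->phys_base + (uint32_t)off;
    return 0;
}
```

A buffer that is scattered across non-adjacent physical pages has no such single offset, which is why a plain heap allocation on the Linux side is generally not something you can hand to another core as one pointer and a length.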

  • @Gabi, which release of SysLink are you using?  Recent releases do support A8-side cacheable SharedRegions.  Another option might be to use CMEM which also supports Linux-side cached buffers.

    Turning on cache is not a magic bullet, and can add significant complexity (and sometimes more overhead!), but in some applications it can help.

    Chris

  • _Ralph_ said:

    I want to be able to capture a frame of video data and do 3 separate types of processing on it. My platform is the DM8168. The 3 things are:

    1) Do some processing with it on the ARM.

    2) Get the HDVICPs to compress it to H264.

    3) Use the DSP core (C674x) of the DM8168 to process the frame.

    I think the key question that needs to be answered is:  Assuming that OMX needs to be used to do anything with HDVICP2, does OMX use the DSP?  If so, you may have a problem, as it is more than likely that the DSP code that OMX uses is not thread-safe.

    What I have done is to hand-craft my code around SysLink: an image comes in to the ARM using DC1394, it is copied to shared memory, the DSP does something with it and produces a list of info, which the ARM then processes.  The DSP activities are well defined in time so that there are no conflicts.

    Lee
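
A minimal sketch of the kind of hand-off Lee describes, assuming a single block in shared memory with an explicit owner field so that only one core touches it at a time. Everything here is hypothetical (`shared_block_t`, `arm_submit`, `dsp_process`, the tiny sizes, the threshold test), and both "sides" run in one process so the example stays self-contained. Real code would also need cache writeback/invalidate and an inter-processor notification, which are only marked by comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { FRAME_BYTES = 64, MAX_RESULTS = 8 };   /* tiny sizes for illustration */

typedef enum { OWNER_ARM, OWNER_DSP } owner_t;

/* Hypothetical layout of one block placed in memory both cores can see. */
typedef struct {
    volatile owner_t owner;          /* which core may touch the block now */
    uint8_t  frame[FRAME_BYTES];     /* image data written by the ARM */
    uint32_t result_count;           /* number of valid entries in results[] */
    uint16_t results[MAX_RESULTS];   /* list of info produced by the DSP */
} shared_block_t;

/* ARM side: copy a captured image in, then hand the block to the DSP. */
int arm_submit(shared_block_t *blk, const uint8_t *image, size_t len)
{
    assert(blk != NULL && image != NULL);
    if (blk->owner != OWNER_ARM || len > FRAME_BYTES)
        return -1;
    memcpy(blk->frame, image, len);
    blk->result_count = 0;
    blk->owner = OWNER_DSP;          /* real code: cache writeback, then notify */
    return 0;
}

/* DSP side: list the index of every sample above a threshold, hand back. */
int dsp_process(shared_block_t *blk, uint8_t threshold)
{
    assert(blk != NULL);
    if (blk->owner != OWNER_DSP)
        return -1;
    for (size_t i = 0; i < FRAME_BYTES && blk->result_count < MAX_RESULTS; i++)
        if (blk->frame[i] > threshold)
            blk->results[blk->result_count++] = (uint16_t)i;
    blk->owner = OWNER_ARM;          /* real code: cache writeback, then notify */
    return 0;
}
```

Because each function refuses to run unless it owns the block, the two sides never work on the data at the same time, which is the "no conflicts" property the post relies on.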

     

  • Hi Chris,

    Thank you very much for the answer. I am on vacation now and can't check the SysLink version, but I am working with EZSDK ver 5.04.
    The A8 has NEON inside, which is a considerable computational unit, and I really need this computational power now that I have used the DSP down to the last MIPS.
    Both the DSP and the A8 need to process the image frame buffer in shared region 2. I am using a DSP OMX component that transfers data from shared region 2 to L2 cache on the DSP using EDMA, and with double-buffered slice processing in L2 I am getting good performance. Now I need to do something similar with NEON on the A8.
    The problem is that the A8 is running Linux. I don't know how to access the A8 L2 cache, or whether that is even possible, and image processing performance on the A8 is very poor when the image buffer is in shared region 2.
    Can you please help with that?

    BR,
    Gabi 
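
The double-buffered slice processing Gabi mentions can be sketched roughly as follows. This is not the OMX component's code: `dma_copy` is a hypothetical stand-in that simply calls `memcpy`, whereas a real EDMA transfer runs asynchronously and has to be waited on; `SLICE_BYTES`, the static arrays standing in for on-chip L2, and the invert kernel are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { SLICE_BYTES = 1024 };   /* assumed slice size that fits fast memory */

/* Stand-ins for two input and two output slices in on-chip memory. */
static uint8_t in_fast[2][SLICE_BYTES];
static uint8_t out_fast[2][SLICE_BYTES];

/* Placeholder for a DMA transfer; a real EDMA copy is asynchronous. */
static void dma_copy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Length of the slice starting at byte offset off. */
static size_t slice_len(size_t frame_bytes, size_t off)
{
    size_t left = frame_bytes - off;
    return left < SLICE_BYTES ? left : SLICE_BYTES;
}

/* Example per-slice kernel: invert every sample. */
static void process_slice(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(255 - src[i]);
}

/* Process a frame in external memory slice by slice with ping-pong buffers. */
void process_frame(uint8_t *dst, const uint8_t *src, size_t frame_bytes)
{
    assert(dst != NULL && src != NULL);
    if (frame_bytes == 0)
        return;

    size_t nslices = (frame_bytes + SLICE_BYTES - 1) / SLICE_BYTES;
    dma_copy(in_fast[0], src, slice_len(frame_bytes, 0));   /* prefetch slice 0 */

    for (size_t s = 0; s < nslices; s++) {
        size_t cur = s & 1, nxt = cur ^ 1;
        size_t off = s * SLICE_BYTES;
        size_t len = slice_len(frame_bytes, off);

        /* Start fetching the next slice before working on the current one. */
        if (s + 1 < nslices) {
            size_t noff = off + SLICE_BYTES;
            dma_copy(in_fast[nxt], src + noff, slice_len(frame_bytes, noff));
        }

        process_slice(out_fast[cur], in_fast[cur], len);
        dma_copy(dst + off, out_fast[cur], len);             /* write result back */
    }
}
```

With a truly asynchronous DMA engine the fetch of slice `s + 1` overlaps the processing of slice `s`, which is where the gain comes from. With the `memcpy` stand-in the structure is the same but nothing actually overlaps.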

  • Gabi:

        I found some interesting posts related to A8 Neon image processing performance and L2 cache on the ARM forums:

        http://forums.arm.com/index.php?/topic/15464-cortex-a8-preload-engine-ple-error/

        http://forums.arm.com/index.php?/topic/15140-differences-between-neon-in-cortex-a8-and-a9/

        This is independent of SysLink SharedRegions, and there seem to be some good tips on that forum that may help.

    - Gil
     

  • Hi Chris,

    Thanks for replying. I suppose my fundamental question (for now) is: is it feasible to write an algorithm to run on the DSP using the VLPB component as a starting point, or will I find that much more difficult than using Codec Engine?

    Thanks,
    Ralph

  • Ralph, I don't have any experience with VLPB, so I can't speak to it.  A quick search turned up this related thread you may want to read.

    Also, I found this article which includes a brief description of VLPB.  Given these two data points, it looks possible to create a DSP side algorithm that plugs in via VLPB.

    There are a few ways to skin this cat.  If you already have VLPB integrated, that may be the way to go - often getting the framework up and running is unfortunately the hardest part.  If you _don't_ have VLPB integrated, and you have no other need for OpenMAX on the DSP, it may be easier to use Codec Engine (and maybe IUNIVERSAL?) as it's a little less insulation/abstraction around your algorithm's API.

    Chris

  • Hi Chris,

    That related thread helped a bit. I think I'm going to go with Codec Engine and avoid using a modified VLPB component for a couple of reasons; the support at TI seems to be much better regarding using the established Codec Engine, while almost no one at TI _on the forums_ seems to be knowledgeable about running OpenMAX components on the DSP.

    There is also a risk I perceive of the OpenMAX component on the DSP having a high overhead/latency which is not what we want.

    Thanks,

    Ralph

    I am working with OMX on the TI8168, using VLPB on a DSP OMX component for video processing, and I need more processing power. TI put a remarkable A8 with NEON on the TI8168, but I still haven't found a way to do efficient video processing on the A8. If I use EDMA to copy video buffers to my own buffers in shared region 2 and process them with the A8, performance is very poor, and making shared region 2 cacheable for the A8 doesn't seem to help. Please see http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/717/p/206929/735841.aspx#735841

    If I try to use the A8 OMX VLPB and add it to the OMX chain, performance might be better, but it does not allocate its buffers in shared region 2, so I can't process the video buffers. Please see http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/717/t/210109.aspx

    So what is the use case for performing video processing on the A8? Is there any example for that? Did TI ever give this any thought?
    And yes, OMX support could be much better!!!

  • Gabi Gvili said:
    So what is the use case for performing video processing on the A8? Is there any example for that?

    You probably do not want to do any significant image processing on the ARM; the DSP is much more efficient for that sort of thing.  When push comes to shove, you could always employ SysLink to provide some shared buffers between the ARM and DSP and hand-craft your own code to do what you want on the DSP.  There are many things that you can do to optimize DSP code.

    Lee

  • Lee Holeva said:
    You probably do not want to do any significant image processing on the ARM 

    The A8 has NEON inside, which is a powerful DSP (not as powerful as the C674x), and I do want to use it.

    Lee Holeva said:
    There are many things that you can do to optimize DSP code.

    I am using all the DSP tricks: writing with intrinsics, slice processing, and using EDMA to copy double-buffered input and output to DSP L2 cache. Performance has indeed improved significantly, but I need more. According to benchmarks I have seen on the web, NEON is equivalent to 50-70% of a TI C64x+ DSP, and it would be a waste not to use it.

  • Gabi Gvili said:

    The A8 has NEON inside, which is a powerful DSP (not as powerful as the C674x), and I do want to use it.

    It is my understanding that neon is simply a co-processor, it doesn't have anything like the DSP's pipeline.  I use -O3 and -mfpu=neon when I build my code, so I assume that floating point operations get mapped to neon.  Also, there are a bunch of neon intrinsics:

    http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html

    I haven't tried to use these, but I imagine that if you rewrite your code to take advantage of them you should see some performance improvement.  I wouldn't put much faith in published benchmarks.

    Lee
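
For anyone who wants to try the intrinsics Lee links to, here is a small sketch of what using them looks like. It adds a constant to every 8-bit sample with saturation, processing 16 samples per iteration when the compiler targets NEON and falling back to plain C otherwise. The function name `add_offset_sat` and the choice of operation are assumptions picked for illustration; nothing in the thread uses this exact code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Add a constant to every 8-bit sample, clamping the result at 255. */
void add_offset_sat(uint8_t *dst, const uint8_t *src, size_t n, uint8_t offset)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* NEON path: 16 samples per iteration with a saturating add. */
    uint8x16_t voff = vdupq_n_u8(offset);
    for (; i + 16 <= n; i += 16) {
        uint8x16_t v = vld1q_u8(src + i);
        vst1q_u8(dst + i, vqaddq_u8(v, voff));
    }
#endif

    /* Scalar path: handles the tail, or everything when NEON is unavailable. */
    for (; i < n; i++) {
        unsigned sum = (unsigned)src[i] + offset;
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

On GCC for ARMv7 the NEON path is only compiled in when the target flags enable it (for example `-mfpu=neon` together with a suitable `-mfloat-abi`). How much the compiler vectorizes ordinary C on its own depends on the optimization flags and GCC version, so it is worth checking the generated assembly rather than assuming.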

     

    Anyway, NEON performance on the A8 is not the issue here. I believe that slow access to shared region 2 is the issue. The HD video frames are allocated by OMX components in shared region 2. For my A8 performance test I am doing a simple average of two 4:2:2 HD video frames (image1>>1 + image2>>1), roughly 12 million operations per frame in total (not counting memory accesses), which should not be a big deal for a 1 GHz processor. The problem is that the processing rate drops to 1 frame per second (instead of the 60 fps I get without the averaging).
     

    Gabi
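
As a sanity check on the figure above, assuming "HD" means 1920x1080 and 4:2:2 means 2 bytes per pixel (the post does not state either): one frame is 1920 x 1080 x 2 = 4,147,200 bytes, and two shifts plus one add per byte gives 12,441,600 operations, which matches the "roughly 12 million". The sketch below is a plain scalar version of that test; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes in one packed frame; 4:2:2 is assumed to use 2 bytes per pixel. */
size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Two shifts and one add for every output byte. */
size_t ops_per_frame(size_t nbytes)
{
    return 3 * nbytes;
}

/* dst = (a >> 1) + (b >> 1), byte by byte; the sum never exceeds 254. */
void average_frames(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)((a[i] >> 1) + (b[i] >> 1));
}
```

Under the same assumptions, reading two frames and writing one at 60 fps means about 3 x 4,147,200 x 60 = 746,496,000 bytes per second of memory traffic. That is consistent with the suspicion that memory access, not arithmetic, limits this test when the buffers are uncached.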

  • Hello,

    Chris.

    I am working with EZSDK 5.05 on the DM8168 EVM.

    I have modified saLoopBack and video copy examples to do video capture and codecs.

    To do video processing, I added the steps below to the codec:

    1. allocate new memory using "malloc" to store captured video frames;

    2. transfer the captured video (in the CMEM buffer) to the malloc'ed buffer;

    3. process;

    4. transfer the malloced buffer back to cmem buffers.

    But when I run the application, nothing is displayed on the LCD.

    So where does the problem lie?

    Could you give me any guidance?

    Best Regards,

    Yang.
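
The four steps above can be sketched in plain C as follows. This is only an illustration of the sequence with error checking, not the codec's actual code: `process_via_temp` is a hypothetical name, the invert loop is a placeholder for the real processing, and an ordinary array stands in for the CMEM buffer. Whether `malloc` is available or large enough in the environment where the codec runs is a separate question that this sketch does not answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy a frame to a temporary heap buffer, process it there, and copy it back.
 * Returns 0 on success, -1 if the temporary buffer could not be allocated.
 */
int process_via_temp(uint8_t *frame, size_t n)
{
    assert(frame != NULL);

    uint8_t *tmp = malloc(n);              /* step 1: allocate */
    if (tmp == NULL)
        return -1;                         /* report failure, frame untouched */

    memcpy(tmp, frame, n);                 /* step 2: copy in */

    for (size_t i = 0; i < n; i++)         /* step 3: placeholder processing */
        tmp[i] = (uint8_t)(255 - tmp[i]);

    memcpy(frame, tmp, n);                 /* step 4: copy back */
    free(tmp);                             /* release so repeated calls don't leak */
    return 0;
}
```

Note that the two copies touch every byte of the frame twice on top of the processing itself, so working in place on the original buffer, where the algorithm allows it, avoids both the allocation and the extra traffic.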

  • Hi,

    Chris.

    Some more information:

    tmpBuf =(XDAS_Int8 *)malloc(720*1280*2);

    After I add the code below,

    if(tmpBuf==NULL) return (-1);

    I could not run the application correctly.

    Here is what it returns:

    [t=0x001cee64] [tid=0x400ae000] xdc.runtime.Main: [+1] App-> Processing frame 0...
    [t=0x001cf4e4] [tid=0x400ae000] xdc.runtime.Main: [+2] App-> Encoder frame 0 process returned - 0xffffffff
    [t=0x001cf51d] [tid=0x400ae000] xdc.runtime.Main: [+7] App-> Encoder frame 0 processing FAILED, status = 0xffffffff, extendedError = 0x2d51d08a
    VPSS_DCTRL: failed to disable the venc.
    VPSS_DCTRL: failed to disable hdmi venc
    VPSS_DCTRL: stop venc before changing mode

    Hope for your reply.

    Regards,

    Yang

  • Hi Ralph,
    I am working with ezsdk_5.05 for the DM8168 processor. At present I want to add video frame capture and display to the Codec Engine framework. I have seen that you were also working on the same thing. It would be very helpful if you could give some suggestions or example code regarding this matter.

    Thanks & Regards
    Srikanta