OpenMax (OMX) output buffers

Hi,

We are developing an application which decodes high-resolution video (>1080p) and performs video analytics on the decoded frames.  It appears that accessing (reading, copying, etc.) the output buffer passed to us from the OMX decoder component is very slow.  We believe this is because the buffer is located in memory that has been mapped as non-cached.

1) Is it true that the OMX decoder output buffer is non-cached?

2) What is the best way to speed this up?  We need a significant increase in speed.

Thanks,

Joel

  • Any ideas out there?  Surely someone else has run into this issue...

  • Hi Joel,

    This is something I'd also like to hear an answer on, as I too am trying to understand a bit more about what's going on under the hood in the OpenMax implementation for this device.

    While I don't know the answer, perhaps starting some discussion here and keeping the thread alive will catch someone's attention. (Feel free to tell me to just go away if you think this doesn't help ;) )

    I've been under the assumption that the OpenMax buffers reside in an mmap'd (and uncached?) contiguous memory region, IPC_SR_FRAME_BUFFERS (http://processors.wiki.ti.com/index.php/EZSDK_Memory_Map). If I understand things correctly, these buffers are in contiguous memory regions because the M3s don't have MMUs.

    One thing I toyed with was using CMEM to allocate my own contiguous memory regions and using those when I need to copy and work with frames.  I was hoping that circumventing virtual memory might gain me a bit of speedup (although I suppose this might be a dirty hack). I also haven't been working with >1080p frames, so I'm not sure what kind of performance you're seeing.

    Your thoughts?

  • Hi Jon,

    Thanks for the response.  You mention: 

    "One thing I toyed with was using CMEM to allocate my own contiguous memory regions and using those when I need to copy and work with frames.  I was hoping that circumventing virtual memory might gain me a bit of speedup (although I suppose this might be a dirty hack)."

    Can you elaborate on what you had in mind here?  How exactly were you planning on getting the data out of the IPC_SR_FRAME_BUFFERS area and into your other area?  DMA?  If so, how will you know where within the IPC_SR_FRAME_BUFFERS area the buffer is?

    I have also thought about this path, but it is a huge hack.  I'm concerned that it might not work at all, or that if it did it would be very 'fragile'.  I'm really hoping there is a nice way to do this.  Ideally, there would be a way to have OMX mmap the area as cacheable and manage the cache-coherency aspects, or have OMX internally DMA-copy to a cacheable region.

    Can TI please comment on their recommended way of achieving quick access to OMX component buffers?


  • Hi Joel,

    I'll try to elaborate quickly here before I head home...let me know if you still want more details. Unfortunately, I don't think I'll be of much help, as I basically have the same question.

    If I remember correctly, one of the earlier EZSDK OMX demos was using contiguous memory reserved for /dev/fbX. I think it was making OMX_UseBuffer() calls, passing offsets into this region (which it obtained from one of the standard framebuffer ioctl()s). This wouldn't work for me, as I'm using all the frame buffer device nodes. I had dug into CMEM just enough to figure out how to use it to allocate a sufficiently large contiguous memory region and pass offsets into this region to OMX_UseBuffer() calls.
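
    Roughly, the CMEM + OMX_UseBuffer() path I had in mind looks like the untested sketch below. The port index, buffer size, decoder handle, and the assumption that the TI decoder will actually accept a buffer from outside SR2 are all mine, and I'm assuming the stock CMEM user library from linuxutils (one buffer per allocation rather than carving offsets out of one big region, just to keep it short):

        #include <cmem.h>            /* TI linuxutils CMEM user library */
        #include <OMX_Core.h>
        #include <OMX_Component.h>

        /* Untested sketch: allocate one contiguous, cached buffer with CMEM
         * and hand it to the decoder's output port via OMX_UseBuffer().
         * hDecoder, OUT_PORT_INDEX and BUF_SIZE are placeholders for whatever
         * your IL client already uses. */
        #define OUT_PORT_INDEX  1
        #define BUF_SIZE        (1920 * 1088 * 3 / 2)   /* one NV12 1080p frame */

        OMX_BUFFERHEADERTYPE *use_cmem_buffer(OMX_HANDLETYPE hDecoder)
        {
            CMEM_AllocParams params = CMEM_DEFAULTPARAMS;
            OMX_BUFFERHEADERTYPE *pHdr = NULL;
            OMX_U8 *buf;

            if (CMEM_init() < 0)
                return NULL;

            params.flags = CMEM_CACHED;            /* cached on the A8 side */
            buf = CMEM_alloc(BUF_SIZE, &params);
            if (buf == NULL)
                return NULL;

            /* Ask the component to use our buffer instead of allocating its own. */
            if (OMX_UseBuffer(hDecoder, &pHdr, OUT_PORT_INDEX, NULL,
                              BUF_SIZE, buf) != OMX_ErrorNone) {
                CMEM_free(buf, &params);
                return NULL;
            }
            return pHdr;
        }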

    I suppose that perhaps you could get away with just a memcpy(cmem_buf[i], omx_buf->pBuffer, omx_buf->nFilledLen) (pBuffer being the data in the IPC_SR_FRAME_BUFFERS region)...but I'm guessing the speedup you'd get there would be minuscule, considering that you're dealing with >1080p frames.

    I had started trying to figure out if I could in fact implement a little driver that could resolve the mapped userspace pointers to their physical addresses and effectively perform a memcpy-like operation, leveraging the existing EDMA driver (which I still haven't even looked at). Unfortunately, I never ended up pursuing this.  It felt like I'd be hacking together a nasty solution -- what ideas did you have in mind?

    Just because this is still the main question, requoting:

    Joel Keller said:

    Can TI please comment on their recommended way of achieving quick access to OMX component buffers?

  • Hi Jon,

    Sorry for the late reply; I was out for a long weekend (Canada Day).  It sounds like we're basically in the same place.  I am going to be actively working on this part of our problem over the next two weeks, so if I discover anything interesting I'll post it here.  Hopefully someone from TI will comment as well.

    -Joel

  • Can anyone from TI shed some light here?

  • Hi Joel,

    I am not an OMX expert, but this thread looks similar : http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/717/t/121558.aspx

    Does it help?

    BR

    Pavel

  • I found one more thread which seems related to yours:

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/717/p/158219/578047.aspx

    BR

    Pavel

  • Hi Pavel,

    Thanks for your posts.  Those two threads are indeed related, and they seem to confirm that the OMX buffers are memory-mapped as non-cached; however, they do not offer a solution.  It is mentioned that if these buffers are used from another processor (i.e. the DSP), then this lack of caching on the A8 is not a problem.  However, I do want to use these buffers from the A8.  I think the correct solution would be for the OMX framework to perform the DMA necessary to get the OMX component's output into a buffer which is cached (and therefore usable, performance-wise).

    Thanks,

    Joel


  • Hi Joel,

    I made some consultations within TI.

    We need to know what "significant increase in speed" is required.

    Currently we can support 60 frames per second for a complete video processing cycle (capture->scaling->encode->decode->sc->display).  Do you require more than 60 fps?

    Regards,

    Pavel

  • Hi Pavel,

    30 fps would be great for us.  The difference here is that I am interested in a different "video processing cycle".  Instead of doing (capture->scaling->encode->decode->sc->display) I am doing (stream->decode->myCode).  I need to access the raw decoded frames, not send them to the OMX display component.  This is why the non-cached mapping of the OMX buffers creates such a problem.  One thing you could do to observe this issue is to verify what FPS you are able to achieve with the OMX chain (decode-><file write>) at 1080p or above (our resolution is 1536x1536).  I believe if you try this you will run into the same issue.
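
    For what it's worth, the measurement itself is nothing fancy; it is roughly the snippet below in the FillBufferDone callback (the counters are illustrative globals, and the rest of the IL client -- writing the buffer to file and returning it to the component -- is omitted):

        #include <stdio.h>
        #include <sys/time.h>
        #include <OMX_Core.h>

        /* Rough sketch of how decoder output FPS can be counted: bump a
         * counter each time the decoder returns a filled output buffer and
         * print the rate about once a second. */
        static unsigned long frame_count = 0;
        static struct timeval last_ts;

        OMX_ERRORTYPE FillBufferDone(OMX_HANDLETYPE hComponent, OMX_PTR pAppData,
                                     OMX_BUFFERHEADERTYPE *pBuffer)
        {
            struct timeval now;
            double elapsed;

            frame_count++;
            gettimeofday(&now, NULL);
            elapsed = (now.tv_sec - last_ts.tv_sec) +
                      (now.tv_usec - last_ts.tv_usec) / 1e6;
            if (elapsed >= 1.0) {
                printf("decode output: %.1f fps\n", frame_count / elapsed);
                frame_count = 0;
                last_ts = now;
            }

            /* ... write pBuffer->pBuffer (pBuffer->nFilledLen bytes) to file,
             * then hand the buffer back with OMX_FillThisBuffer() ... */
            return OMX_ErrorNone;
        }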

    Thanks,

    Joel

  • Hi Joel,

    Could you please provide more details about your method to observe/measure the FPS with this OMX chain? I need to stay as close as possible to your working environment.

    BR

    Pavel

  • Hi Pavel,

    I am reading this thread and I think I am facing a similar problem: I need to transfer 1080p frames from the OMX components to other buffers on the A8 in order to perform some processing on them, and for these transfers to be fast enough, EDMA must be used. See the following posts:
    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/164186.aspx

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/717/p/155685/564349.aspx#564349

    However, when I try to follow the approach suggested in the threads above, I run into a problem, which I have posted here:

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/717/t/203642.aspx

    May I seize the opportunity, since you are reading this thread and willing to help, and ask for help as well?

    Thanks,
    Gabi 

  • Hi Pavel,

    I am out of the office for the next two weeks. I will respond when I am back. Thanks.

  • Hi,

    If cache enabling of the shared region for the buffers is required, the caching properties are set in the firmware loader. Please see board-support/media-controller-utils_2_05_00_17/src/firmware_loader/memsegdef_default.c.

    To enable caching, the A8 bit would need to be set.  Also please note that the user needs to take care of cache invalidation.

     (1 << LDR_CORE_ID_VM3) | (1 << LDR_CORE_ID_DM3) | (1 << LDR_CORE_ID_A8),

    Regards

    Vimal

  • Hi Joel/Vimal,

    1. The memory area from which OMX buffers are allocated is called Shared Region 2, and it is not cacheable by default. See Archith John Bency's answer here:

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/717/t/155547.aspx

    "One issue I see is that with the current EZSDK is that the Shared Region #2 from where data buffers are allocated is not cached on A8. So if you do buffer manipulations on A8, performance will not be good. The next EZSDK release should have this exposed as a configuration. Also, you would have to do Cache invalidation operations on A8 in your application. once Cache gets enabled."

    2. The best way to speed up data transfers on the A8 is not making this area cacheable but using EDMA3, which performed wonderfully for me on the DSP. Unfortunately, I didn't succeed in using EDMA3 for data transfers on the A8, because the OMX code runs on Linux and uses virtual addresses while EDMA3 needs physical addresses, and I have not found a way to translate the virtual addresses of OMX buffers into physical addresses (a rough sketch of the kind of translation I mean is below the link).
    Maybe someone from TI (Vimal?) will please answer me here or in my thread here:

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/717/p/202453/723882.aspx#723882
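
    For illustration, the kind of translation I have been attempting looks roughly like the sketch below. It reads Linux's /proc/self/pagemap, which in general maps a user virtual address to a physical page frame; whether this is valid for the SR2 mappings, and whether the result is safe to program into EDMA3, is exactly what I have not been able to verify:

        #include <stdint.h>
        #include <fcntl.h>
        #include <unistd.h>

        /* Rough sketch: translate a user-space virtual address to a physical
         * address by reading /proc/self/pagemap (needs sufficient privileges,
         * typically root). Returns 0 on failure. NOT verified against the
         * OMX/SR2 buffer mappings. */
        static uint64_t virt_to_phys(const void *virt)
        {
            uint64_t entry, phys = 0;
            long page = sysconf(_SC_PAGESIZE);
            off_t offset = ((uintptr_t)virt / page) * sizeof(entry);
            int fd = open("/proc/self/pagemap", O_RDONLY);

            if (fd < 0)
                return 0;
            if (pread(fd, &entry, sizeof(entry), offset) == sizeof(entry) &&
                (entry & (1ULL << 63)))                     /* page present */
                phys = (entry & ((1ULL << 55) - 1)) * page  /* PFN bits 0-54 */
                       + ((uintptr_t)virt % page);
            close(fd);
            return phys;
        }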

    Thanks,
    Gabi 

  • Hi Vimal,

    Thanks for your reply.  I apologize for not getting back to this issue sooner; I had to move on to other parts of the project.  Currently I have a workaround where I use the EDMA API in a kernel driver to DMA the OMX buffer to another region of memory which I can later access with caching enabled (a rough sketch of the user-space side is below).  It is not an elegant solution and has security implications, which thankfully don't matter for our project.
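
    The user-space side looks roughly like this; the device node, ioctl and struct names are our own invention (nothing from a TI release), and the kernel side, which resolves the addresses and programs the EDMA3 copy into a cached destination region, is omitted:

        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>

        /* Hypothetical interface to our custom copy driver. The real work
         * (EDMA3 transfer from the OMX output buffer into a cached
         * destination buffer) happens in the kernel module. */
        struct omxcopy_req {
            void          *src;     /* user virtual address of the OMX output buffer */
            unsigned long  length;  /* bytes to copy (one decoded frame)             */
            unsigned long  dst_off; /* offset into the driver's cached region        */
        };
        #define OMXCOPY_IOC_COPY  _IOW('x', 1, struct omxcopy_req)

        /* Blocks until the driver reports the EDMA transfer is complete;
         * afterwards the frame is read through a cached mmap() of the device. */
        static int copy_frame(int fd, void *omx_buf, unsigned long len)
        {
            struct omxcopy_req req = { omx_buf, len, 0 };
            return ioctl(fd, OMXCOPY_IOC_COPY, &req);
        }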

    If I get the time I would like to switch to the solution you suggest, and if I do so I will report back here.  

    One question I had: when you say 'the user needs to take care of cache invalidation' - is this done by OMX if the shared region is cached on the A8?  The OMX IL client code that I am writing is, of course, just user-mode code and cannot perform cache invalidation.

    Thanks,


    Joel

  • Hi,

    It is possible to use cache APIs in user mode. They should be used in the IL client if the user enables caching.
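
    For illustration (the exact API depends on your setup): with buffers that TI's CMEM library knows about, the user-mode cache calls look like this. Whether CMEM covers your SR2 buffers is not something I have confirmed here:

        #include <stddef.h>
        #include <cmem.h>   /* TI linuxutils CMEM user library */

        /* Illustrative only: user-mode cache maintenance on a buffer that
         * CMEM knows about (e.g. allocated with CMEM_alloc()). Invalidate
         * before the A8 reads data another core wrote; write back before
         * handing a buffer the A8 filled to another core. */
        void before_cpu_read(void *buf, size_t len)
        {
            CMEM_cacheInv(buf, len);
        }

        void after_cpu_write(void *buf, size_t len)
        {
            CMEM_cacheWb(buf, len);
        }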

    Regards

    Vimal