
iUniversal copy example is slow

I'm working with the simple iUniversal copy codec in XDAIS 7.23.00.06 with Codec Engine 3.23.00.07 (the latest version), and I'm finding that it is slow at processing data. I am using the DSP core of the DM8168 chip.

I've modified the example so that the input buffer is ignored entirely; all the codec does is loop over the output buffer, which the application allocates from shared memory and passes in, and set each byte individually to 1.
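Roughly, the loop looks like this (buf and len here are just stand-ins for the output-buffer pointer and size that the codec's process() call receives, not the exact names from the example):

```c
/* Per-byte fill, as described above. buf and len stand in for the
 * output buffer pointer and size handed to the codec by the app. */
static void fill_ones_bytewise(unsigned char *buf, int len)
{
    int i;
    for (i = 0; i < len; i++) {
        buf[i] = 1;   /* one store per byte */
    }
}
```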

I'm trying to process 1080p video frames, so I'm effectively looping over 6,220,800 bytes (1920 x 1080 x 3) per frame, and I'm finding that I only achieve about 16 frames per second. That's only about 100,000,000 bytes of the output buffer being set to 1 per second. Why is the processing so slow? Could someone please give me a hint on how I can optimise my loop?

By the way, I don't think "intrinsics" will help me, because eventually I want to do some proper processing on the video pixels rather than just set them to 1. The way I see it, I would like to get my basic loop running at 60fps or more and then work on an actual algorithm.

Thanks,
Ralph

  • I got it to 30fps by using the _amem4 intrinsic to write 4 bytes to memory at a time, and 43fps using _amemd8. It strikes me that the cache might be operating in "write-through" mode, given the speed increases I'm seeing from writing several bytes at a time.
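    For reference, the _amemd8 version of my loop looks roughly like this (it assumes the buffer is 8-byte aligned and the length is a multiple of 8, which both hold for a 1920x1080x3 frame buffer):

    ```c
    #include <c6x.h>  /* TI C6000 intrinsics: _amemd8, _itod */

    /* 8 bytes per iteration via one aligned 64-bit store.
     * Assumes buf is 8-byte aligned and len is a multiple of 8. */
    static void fill_ones_d8(unsigned char *buf, int len)
    {
        const double ones = _itod(0x01010101u, 0x01010101u); /* 0x0101...01 */
        int i;
        for (i = 0; i < len; i += 8) {
            _amemd8(&buf[i]) = ones;
        }
    }
    ```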

    Anyone know how I can speed it up further than writing 8 bytes at a time?

    Thanks,

    Ralph

  • Ralph,

    I'm hoping codec experts will chime in to point you to some optimization techniques that your algorithm could benefit from. If you are specifically concerned about Codec Engine overhead, there's a wiki article that you might find useful:

    http://processors.wiki.ti.com/index.php/Codec_Engine_Overhead

    Gunjan.

  • Ralph,

    If you are doing a huge data transfer like a frame copy, it is better to use EDMA instead of a CPU memcpy. EDMA transfers can run in parallel with computation, hiding the data-transfer time and cache penalties.

    regards

    Yashwant

    Thanks for the prompt responses. Sorry, I've been caught up with something else at work, but I'll get back to this thread soon.

    Ralph

  • Hi, I'm back on this now. Sorry for disappearing.

    Gunjan, thanks for the link. I'd already read that article and it is useful, but it doesn't point me to any speed-ups beyond the ones I already have.

    Yashwant, I am not using a memcpy at present, so how do you suggest I apply EDMA to my current setup? I don't think I quite understand what you mean. Do you mean use EDMA to transfer data from main memory into the cache? Surely it already does this? Or do you mean set up some of my L2 cache as fast scratch memory?

    Thanks,
    Ralph

  • Anyone have any pointers?

  • You might be able to improve performance by using DMA to copy data sitting in slow (DDR) memory into a fast scratch memory (L1 or L2), performing the CPU-intensive operations and reads/writes there, and then using DMA again to copy out the processed data.

    By using DMA you might be able to achieve (1) optimal, or at least better, DDR throughput, because the DMA hardware takes care of bursting, and (2) more deterministic operation, since you wouldn't rely on caching behavior or CPU DDR access patterns.

    But perhaps more importantly, you can free the CPU from doing synchronous I/O by scheduling chunked data transfers via DMA in the background, and use that freed-up time to perform CPU-intensive processing, in parallel, on previously fetched data sitting in fast scratch memory. Using a ping-pong buffering mechanism, as soon as there is input data to process you initiate a background DMA transfer to bring in the next input chunk and proceed, in parallel, with processing the current chunk. When you finish processing, you immediately start a DMA transfer to copy out the results, and then only wait for the next chunk, whose transfer was scheduled earlier (if it hasn't already finished). You repeat this cycle until all input is processed. By carefully overlapping CPU processing with background memory transfers you can achieve better overall throughput.
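    Here is a rough skeleton of that ping-pong scheme. The dma_submit()/dma_wait() calls are placeholders for whatever EDMA interface you end up using (e.g. ECPY), not real API names, and dma_wait() is assumed to block until all outstanding transfers have completed:

    ```c
    #define CHUNK (16 * 1024)  /* bytes per transfer; tune to your scratch size */

    /* Placeholders, NOT real APIs: bind these to your actual EDMA layer. */
    extern void dma_submit(void *dst, const void *src, int len);
    extern void dma_wait(void); /* blocks until all submitted transfers finish */

    extern void process_chunk(unsigned char *buf, int len); /* your kernel */

    /* Assumes total is a multiple of CHUNK, for brevity. */
    void process_frame(unsigned char *ddr_in, unsigned char *ddr_out, int total)
    {
        static unsigned char ping[CHUNK], pong[CHUNK]; /* place in L1/L2 SRAM */
        unsigned char *cur = ping, *next = pong, *tmp;
        int off;

        dma_submit(cur, ddr_in, CHUNK);            /* prefetch the first chunk */
        for (off = 0; off < total; off += CHUNK) {
            dma_wait();                            /* current input has landed */
            if (off + CHUNK < total)               /* prefetch the next chunk  */
                dma_submit(next, ddr_in + off + CHUNK, CHUNK);
            process_chunk(cur, CHUNK);             /* CPU work overlaps DMA    */
            dma_submit(ddr_out + off, cur, CHUNK); /* copy results back out    */
            tmp = cur; cur = next; next = tmp;     /* swap ping and pong       */
        }
        dma_wait();                                /* drain the last write-back */
    }
    ```

    In practice the ping/pong buffers would be placed in internal L1/L2 SRAM (via the linker command file or a scratch-memory allocator) and the chunk size tuned to balance transfer time against compute time.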

    Of course, doing DMA introduces more complexity to the algorithm with respect to resource management (acquiring internal scratch memory, DMA resources), interfacing with the DMA hardware, and parallelizing the algorithm to overlap CPU processing with background DMA transfers. TI's Framework Components supplies EDMA resource management as well as ECPY (functional EDMA programming APIs). I am not sure how deep you need to go, but there would be some learning curve.

    Best regards,

    Murat


  • Hi Murat,

    Thanks for your reply. What you describe is exactly what I'm trying to do. Are there any documents that I can use to help me achieve this? I have no idea where to find the relevant information that tells me how to set up cache as internal SRAM or indeed how to do the EDMA from the DSP side.

    Ralph