memset eats up processor time

rect

Hi,

I am working with the DM637. I have an algorithm that runs on every incoming frame. The algorithm has many functions, and all of them require some arrays. So instead of passing a pointer to an array into each function, I decided to keep everything global. Because of this, all the arrays need to be set to zero before each frame is processed, or else the algorithm picks up stale values from past frames and hangs. To set the arrays to zero I use the memset() function. The problem is that memset() eats up a lot of processor time.

For now it takes around 30 ms just to set all values to zero. The arrays and their sizes are:

1-> 2 char arrays of size 1280x800

2-> 5 integer arrays of size 640x480

Kindly help me minimise this time. Is there any other way to do this?

over 12 years ago

0 Shankari G over 12 years ago

TI__Mastermind 43955 points

Hi rect,

Moving your post to "DM64x DaVinci Video Processor Forum"

Regards,

Shankari.

0 Victor Kazmirenko over 12 years ago

Guru 13202 points

Hello,

First of all, I would suggest reviewing your algorithm. Setting the whole buffer to zero and then writing useful values over the zeros means double work. Perhaps you can avoid that. The next point is that memset() might be inefficient. You may want to look at the assembly it produces to make sure. If it really writes byte by byte, then you can improve on that too. If I were you I would do two things: 1) ensure array alignment, 2) use intrinsics. I am not familiar with the C64x+ core; my experience is mostly with the C64x. On the latter we have _amem8(), which allows me to set 8 bytes at a time. And finally, you don't need the CPU's MIPS just to zero a buffer. If you really need to write zeros, consider using EDMA. Instruct the DMA controller to "write this many zeros from here" and let the hardware do the job; the CPU is then free to do useful work.
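A minimal sketch of the "aligned buffer, wide stores" idea, written in portable C so it can be tried on any host. The function name `zero_aligned8` is made up for illustration. On the C64x compiler the store inside the loop is where an intrinsic such as _amem8() would be used; a good memset() may already do something equivalent, which is why checking the generated assembly comes first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Zero `bytes` bytes starting at `buf` using 8-byte stores.
 * Assumptions (checked with assert): buf is 8-byte aligned and
 * bytes is a multiple of 8. The name zero_aligned8 is hypothetical. */
static void zero_aligned8(void *buf, size_t bytes)
{
    uint64_t *p = (uint64_t *)buf;
    size_t n = bytes / 8;
    size_t i;

    assert(((uintptr_t)buf % 8) == 0);
    assert((bytes % 8) == 0);

    for (i = 0; i < n; i++) {
        p[i] = 0;   /* one 64-bit store instead of eight byte stores */
    }
}
```

To use it, declare the global arrays so that they are 8-byte aligned (how to do that depends on the compiler) and pad their sizes to a multiple of 8.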

Update:

I've just noticed how large your buffers are. They definitely do not fit into the DSP's internal memory, so they reside in external (read: slo-o-o-w) memory. There are some ways to deal with that, but I don't feel I can comment further without more details about your application.
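To put numbers on "how large", here is a small sketch that adds up the sizes given in the question and turns the reported 30 ms into an average write rate. It assumes that "integer" means a 32-bit int; the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Total bytes cleared per frame, from the sizes in the question.
 * Assumption: each "integer" element is 32 bits wide. */
static size_t bytes_cleared_per_frame(void)
{
    size_t char_arrays = (size_t)2 * 1280 * 800 * sizeof(uint8_t);
    size_t int_arrays  = (size_t)5 * 640 * 480 * sizeof(int32_t);
    return char_arrays + int_arrays;
}

/* Average write rate in bytes per second implied by a clear time in ms. */
static double implied_write_rate(size_t bytes, double milliseconds)
{
    assert(milliseconds > 0.0);
    return (double)bytes / (milliseconds / 1000.0);
}
```

Under that assumption the total is 8,192,000 bytes per frame, and clearing it in 30 ms works out to roughly 273 MB/s of writes on average.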

0 rect over 12 years ago in reply to Victor Kazmirenko

Prodigy 170 points

Hi rrlagic,

Thanks for your suggestion. Yes, you are correct. The whole algorithm runs from DDR2. Since there is not enough space to process from internal memory, I have made everything run from DDR2. I had planned to move part of the data from DDR2 into L2 RAM for processing, with ping-pong buffering, in a later stage if processing from DDR2 cannot be optimised to meet the timing requirements. I put this off because it requires the whole algorithm to be modified: right now the algorithm has lots of functions, and each function works on the whole image. Kindly suggest any other possible ways.

As for memset, I will try your suggestions of aligning the memory and using EDMA.

Also, the reason for using memset is that I have to clear all these arrays. Otherwise, while the next frame is processed, values from the previous frame remain at array positions that are not touched in the current frame, and this causes the system to hang. This is the sole reason I am forced to clear all the arrays before processing each frame. Kindly throw some light on any alternative to this.

Thanks,

0 Victor Kazmirenko over 12 years ago in reply to rect

Guru 13202 points

Hello,

The idea of a cache rests on the assumption that one often needs adjacent data but rarely needs distant data points. If you can modify your algorithm to process a portion of the image at a time, say a subframe, line or column, then you can keep ping-pong buffers in your fast L2 memory and use (E)DMA to load and unload those buffers. However, if your algorithm may use coordinates (0;0) and (1000;1000) simultaneously, then I have no good news. Loading a subframe of the image is the right job for the DMA controller too. There was an example in the EDMA user guide for the C64x. I am not strong in image processing, but I suspect the subframe extraction example is given for EDMA for good reason.

As to your last question, I suspect there might be some trouble in the processing algorithm. Let me explain with a simple example.

int a = 1, b = 2, c = 5;

c = a + b;

Regardless of c's initial value, this algorithm finally produces c equal to 3, although c might hold other values while the algorithm runs. Consider another one:

int a = 1, b = 2, c;

c = a + b + c;

The result of the latter algorithm depends on c's previous value. If you make sure c = 0 before execution, it looks fine, but if c holds some leftover value, we're in trouble. So I deduce that your algorithm may be looking in some places where it should not. When you set all buffers to zero, an access to the wrong place makes no difference. So I'm afraid that zeroing the buffers not only fails to cure the problem, but also makes it harder to find.

0 rect over 12 years ago in reply to Victor Kazmirenko

Prodigy 170 points

Hi ,

Coming to the last question first, I found the place where the algorithm hangs. It is a place where it shouldn't be looking, as you suspected. That part of the problem is now solved, which eliminates the need to memset all the arrays. Some arrays still have to be reset for the algorithm to work, but this doesn't take as much time as before, and I am looking into memory alignment and intrinsics to reduce it further.

Coming to the EDMA part. I don't take (0,0) and (1000,1000) simultaneously. The algorithm goes through the pixels sequentially. But the challenge here is that the algorithm doesn't work on a rectangular subframe. We have a region of interest within which we do the processing. There is no break in the rows: say the algorithm starts at row 220 and ends at row 600, then no row in between is skipped. The column values for each row are different, though. For example, for row 220 we start processing at col 250 and end at col 800, for row 221 we start at col 358 and end at col 900, and so on. The example in the EDMA guide is a rectangular subframe extraction.

Now, since I cannot place all of the image data to be processed in internal memory, I need to use some kind of ping-pong buffering to transfer data from DDR2 to internal memory for processing. But then:

1) How can I set up an EDMA transfer where the amount of data transferred is different each time?

2) The algorithm can be split into 2 parts. The first part doesn't depend on the orientation of the data; it just acts on the data. So if I could find a way to transfer the data to internal memory, this part of the algorithm just goes through the data and processes it.

3) The second part of the algorithm, though it works on region-of-interest pixels, works as if the data were a rectangular subframe. This is the part of the code that accessed data from invalid locations. For this part of the algorithm I have no choice but to keep the data laid out as in a frame, i.e. as a rectangle. How can this be achieved through EDMA?

Thanks,

0 Victor Kazmirenko over 12 years ago in reply to rect

Guru 13202 points

Hello,

Rectangular subframe extraction is just an impressive example of hitting a bunch of targets with one bullet. Your case might be different, and you might not achieve that level of benefit. Still, you can use ping-pong buffering. While the DSP is processing ping, the DMA can load pong, and vice versa. The advantage of this approach is that the data gets loaded by the hardware controller while the DSP's MIPS are used for processing. Variable start position and variable data count mean that you have to program your DMA controller each time. It's not that big a deal if you prepare the DMA channel configurations in advance and only update fields like source, destination and count. Perhaps your processor has QDMA, which might submit transfers even faster. Anyway, you'll have to submit transfer requests one by one for every subrow.
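A rough sketch of "one transfer request per subrow, updating only source, destination and count". Everything here is hypothetical: `xfer_desc` is not the real EDMA parameter layout, `row_span` is a made-up per-row table for the region of interest, and `xfer_submit()` uses memcpy() as a stand-in for handing the request to the DMA controller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical transfer descriptor; NOT the real EDMA parameter layout. */
typedef struct {
    const uint8_t *src;   /* source address in external memory */
    uint8_t       *dst;   /* destination address in internal memory */
    size_t         count; /* bytes to move */
} xfer_desc;

/* One ROI row: first column and one-past-last column (hypothetical). */
typedef struct {
    int col_start;
    int col_end;
} row_span;

/* Stand-in for submitting the request to the DMA controller. */
static void xfer_submit(const xfer_desc *d)
{
    memcpy(d->dst, d->src, d->count);
}

/* Pack the ROI pixels of rows [row0, row0 + nrows) back to back into dst.
 * spans[r] describes frame row (row0 + r). Returns the bytes written. */
static size_t load_roi_rows(const uint8_t *frame, size_t stride,
                            const row_span *spans, int row0, int nrows,
                            uint8_t *dst, size_t dst_cap)
{
    xfer_desc d;
    size_t used = 0;
    int r;

    for (r = 0; r < nrows; r++) {
        const row_span *s = &spans[r];
        size_t len;

        assert(s->col_end >= s->col_start);
        len = (size_t)(s->col_end - s->col_start);
        assert(used + len <= dst_cap);

        /* Only these three fields change from one subrow to the next. */
        d.src   = frame + (size_t)(row0 + r) * stride + (size_t)s->col_start;
        d.dst   = dst + used;
        d.count = len;
        xfer_submit(&d);

        used += len;
    }
    return used;
}
```

Packing the subrows back to back suits a processing stage that does not care about the orientation of the data, since it can then walk the buffer linearly.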

What you have to think about is planning your application's schedule. There can be situations where the DSP finishes processing the ping buffer before the pong buffer has been loaded by the DMA. Then the DSP has to wait for the DMA to complete. You can set up an interrupt triggered by the DMA controller to synchronize the DSP. The DSP then sets up loading of the ping buffer and starts processing the pong buffer. Again it finishes before the data gets loaded and waits for the transfer-complete interrupt.

However, it may happen that loading completes before the DSP has finished processing. Then you have to synchronize the other way. I'd urge you to consider both possibilities and not just rely on some timing measurements.
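A small host-side sketch of that schedule, under loud assumptions: `start_load()` is a stand-in that copies immediately and then sets a flag, whereas on hardware the flag would be set from the transfer-complete interrupt. All names and the tiny `CHUNK` size are made up. The point is the ordering. The CPU waits on the flag before touching a buffer, which covers the case where the DSP is faster than the DMA. A buffer is reloaded only after its previous contents have been processed, which covers the case where the DMA is faster than the DSP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 4  /* bytes per ping or pong buffer; tiny for illustration */

static volatile int load_done[2];  /* 1 when buffer i holds fresh data */

/* Stand-in for starting an asynchronous DMA load. Here the copy happens
 * at once; on hardware the flag would be set by the completion interrupt. */
static void start_load(uint8_t bufs[2][CHUNK], int buf,
                       const uint8_t *src, size_t len)
{
    assert(len <= CHUNK);
    load_done[buf] = 0;
    memcpy(bufs[buf], src, len);
    load_done[buf] = 1;
}

/* Example processing step: sum the bytes of one buffer. */
static unsigned process(const uint8_t *p, size_t len)
{
    unsigned sum = 0;
    size_t i;
    for (i = 0; i < len; i++) sum += p[i];
    return sum;
}

static unsigned run_ping_pong(const uint8_t *src, size_t total)
{
    uint8_t bufs[2][CHUNK];
    unsigned sum = 0;
    size_t off = 0;
    size_t cur_len = total < CHUNK ? total : CHUNK;
    int cur = 0;

    if (cur_len > 0) start_load(bufs, cur, src, cur_len);

    while (cur_len > 0) {
        size_t next_off = off + cur_len;
        size_t next_len = total - next_off;
        if (next_len > CHUNK) next_len = CHUNK;

        /* DSP faster than DMA: wait until the current buffer is loaded. */
        while (!load_done[cur]) { /* spin, or sleep until the interrupt */ }

        /* The other buffer was fully processed in the previous iteration,
         * so it is safe to overwrite it now. */
        if (next_len > 0) start_load(bufs, cur ^ 1, src + next_off, next_len);

        sum += process(bufs[cur], cur_len);

        off = next_off;
        cur_len = next_len;
        cur ^= 1;
    }
    return sum;
}
```

Because the stand-in copies synchronously, the wait loop never spins here; with a real controller that is where the DSP would block.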

As to the last question, it depends on how large a subframe you need to process. If the whole subframe fits into L2 memory, then you can fill it row by row with separate transfers.
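A sketch of filling a rectangular buffer row by row, assuming the bounding rectangle of the region of interest fits in internal memory. As before, memcpy() stands in for one DMA transfer per row, and `row_span`, `load_roi_as_rect` and the `fill` value are made up for illustration. Only the margins outside each row's span are written with the fill value, so no pixel is written twice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One ROI row: first column and one-past-last column (hypothetical). */
typedef struct {
    int col_start;
    int col_end;
} row_span;

/* Lay out ROI rows [row0, row0 + nrows) in a rectangle of width rect_w
 * whose column 0 corresponds to frame column col0. Each row's pixels keep
 * their horizontal position; positions outside the span get `fill`. */
static void load_roi_as_rect(const uint8_t *frame, size_t stride,
                             const row_span *spans, int row0, int nrows,
                             int col0, uint8_t *rect, size_t rect_w,
                             uint8_t fill)
{
    int r;

    for (r = 0; r < nrows; r++) {
        const row_span *s = &spans[r];
        uint8_t *out = rect + (size_t)r * rect_w;
        size_t left, len, right;

        assert(s->col_start >= col0 && s->col_end >= s->col_start);
        left = (size_t)(s->col_start - col0);
        len  = (size_t)(s->col_end - s->col_start);
        assert(left + len <= rect_w);
        right = rect_w - left - len;

        memset(out, fill, left);                     /* left margin  */
        memcpy(out + left,                           /* one "transfer" */
               frame + (size_t)(row0 + r) * stride + (size_t)s->col_start,
               len);
        memset(out + left + len, fill, right);       /* right margin */
    }
}
```

With this layout, code that indexes the data as a rectangle reads a known fill value outside the region of interest instead of leftovers from an earlier frame.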

Please do not take my suggestions as direct recipes, but rather as directions for further thinking.
