This thread has been locked.


Need a fully pipelined implementation of a findMax function which finds the maximum value, row index, and column index of a two-dimensional array

hello folks

 

void findMax( const float *data, int rows, int cols, int *rmax, int *cmax, float *val )
{
    float max = -9999.0f;

    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
        {
            if (data[i*cols + j] > max)
            {
                max = data[i*cols + j];
                *rmax = i;
                *cmax = j;
            }
        }

    *val = max;
}

 

 

Please, could somebody give me an efficient assembly implementation for the C6713 DSP? I have implemented it, but it takes a lot of time. Could someone unroll this loop and schedule the instructions properly so that all the registers are exploited and the calculation is done in minimum time?

It's quite urgent, so please do reply.

Thanks.

  • Why don't you look through the TI provided libraries for a function that efficiently searches through a 1D or 2D vector for a maximum value or similar.

    Usually compiler tool sets implement intrinsic high performance functions like memcmp, memcpy, strncmp or similar in such a way that efficient possibly inline assembler language implementations are used.  If you do not find a maximum value locating routine, look at the source code of how such similar strncmp or memcmp or such functions are implemented in assembler and modify slightly that assembly code to perform your function.

    In a 1D data array which you can index with column * ColumnVectorSize + row the data must all be contiguous in memory so there is probably a simple single instruction or few instruction sequence that can iterate through a contiguous memory vector finding a maximum or comparing against a previously found maximum value.  Read the instruction set reference and find the efficient ways to iterate through memory and perform an operation like a compare against the retrieved values from memory.

    For a large array it is often true in DSP/CPUs that the read access bandwidth for large consecutive fetches from the memory will be slower than the processor's internal instruction execution cycles of a tight loop / repetitive search instruction , so you may not need totally optimized code to achieve peak performance -- you'll never get the algorithm to work faster than the processor can consecutively read every data element from memory.
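    As a minimal illustration of the contiguous scan described above (the function name here is just illustrative, not a TI library routine): treating the rows x cols array as one flat vector keeps every access sequential, which is exactly what the memory read bandwidth favours.

```c
#include <stddef.h>

/* Scan a rows x cols array stored contiguously (row-major) and return
 * the flat index of the largest element. One sequential pass over
 * memory: the loop body is a single compare and conditional update. */
size_t find_max_index(const float *data, size_t rows, size_t cols)
{
    size_t n = rows * cols;
    size_t best = 0;

    for (size_t i = 1; i < n; i++)
        if (data[i] > data[best])
            best = i;

    return best;
}
```

    The row and column indices can then be recovered from the flat index with one division and one modulo, outside the hot loop.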

     

  • You could probably start with this function and modify it for your needs:

    http://read.pudn.com/downloads35/sourcecode/embed/110214/c67xfiles/vecmaxf.asm__.htm

     

    If I were doing it, I'd take the code you already have in C, optimize it with a few manual unrolls and the like, then compile it to assembler output and inspect which areas of the generated assembly are being implemented efficiently, and try to improve the less efficient areas by changing the C code.

    When you have the C code generating fairly efficient assembly output you can always manually modify the assembler code's inner loop to make it more efficient if there are substantial gains to be had.
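    One way to coax better assembly out of the C6000 compiler without hand-writing it is to give it loop information via pragmas. A hedged sketch: MUST_ITERATE and UNROLL are TI C6000 compiler pragmas; other compilers simply ignore them, so the function remains portable C.

```c
#include <stddef.h>

/* Illustrative inner-loop sketch. The pragmas promise the TI C6000
 * compiler that the trip count is at least 4 and a multiple of 4,
 * and request a 4x unroll, which helps it software-pipeline the
 * compare/update. Compilers that don't know these pragmas ignore them. */
float max_of_block(const float *data, size_t n /* assumed multiple of 4 */)
{
    float max = data[0];

    #pragma MUST_ITERATE(4, , 4)
    #pragma UNROLL(4)
    for (size_t i = 1; i < n; i++)
        if (data[i] > max)
            max = data[i];

    return max;
}
```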

     

  • Thanks, Mr. C. Hayes.

    There is another problem!

    Both of my loops execute an odd number of times, and most of the time rows = cols, i.e. we have a square array.

    Can you suggest some method with which I can optimize my function?

    many regards

  • The opportunities for optimizing this sort of thing depend on the actual size ranges of the data vector, the consecutive read speed of the memory from which the data is being fetched, and the DSP's tight loop execution speed from L1/L2 cache.

    If you were reading the data from memory like on-chip SRAM that can, for a totally hypothetical instance, be read repetitively in a single cycle per access, then the data read speed might be high enough that the efficiency of the instruction loop could be a bottleneck for performance. One would typically assume that the tight loop of the DSP instruction code will be small enough that it executes from L1/L2 cache and can run sustained at the maximum floating point instruction rate of the DSP chip.

    On the other hand, if the data vector is stored in off-chip SDRAM, or something similar that involves wait states and delays due to memory cycle timings, the data read throughput will be much slower than the instruction execution throughput, and even a very non-optimized code loop might still run as fast as possible due to the relatively large delays necessary just to read the data from start to finish.

    I would suggest setting up a test where you call your maximum value locating function 10,000 times in a loop over the same data vector, and precisely time the execution speed of those 10,000 executions.  Then you can divide by 10,000 * the number of elements in the data vector and get an execution speed number per element.

    Then repeat the same 10,000-execution loop test calling the existing vecmaxf function (linked above) on the same exact data array, and see how long that takes to execute per element. If your existing C code is not significantly slower than calling a function like vecmaxf, which you believe is better optimized, then you're probably already running as fast as your memory system and DSP will allow, and further code optimizations would not help.
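    The timing test above can be sketched with the standard C clock() function; on the C6713 itself you would more likely use a hardware timer or the cycle counter, so treat this as a host-side illustration (find_max_value here is a stand-in for whichever routine you are measuring):

```c
#include <time.h>

/* Stand-in for the routine under test: returns the max of n floats. */
static float find_max_value(const float *data, int n)
{
    float max = data[0];
    for (int i = 1; i < n; i++)
        if (data[i] > max)
            max = data[i];
    return max;
}

/* Time `reps` calls over the same vector and return seconds per element. */
double seconds_per_element(const float *data, int n, int reps)
{
    volatile float sink = 0.0f;  /* keep the calls from being optimized away */
    clock_t start = clock();

    for (int r = 0; r < reps; r++)
        sink += find_max_value(data, n);

    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC / ((double)reps * n);
}
```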

    If you know the size of your data array at compile time, you could write a function that deals with its odd / square size in a hard-coded manner. If you have to code it so that the data vector is treated as having a variable / arbitrary length, then you can include generic code to deal with odd sizes or the lack of a good divisor factor, something like this:

    unsigned FindIndexOfMaximumValue( register const float *pdata, const unsigned column_size, const unsigned row_size )
    {
        const unsigned QUANTUM_SIZE = 16;
        register const unsigned data_size = column_size * row_size;
        register unsigned max_location = 0;
        register float max_value = -99999999.0f;
        register unsigned data_address = 0;
        unsigned remainder = data_size % QUANTUM_SIZE;  /* must not be const: decremented below */

        /* Handle the leftover elements first, so the main loop below sees
           an exact multiple of QUANTUM_SIZE elements. */
        while( remainder-- ) {
            register const float datum = *pdata++;
            if( datum > max_value ) { max_value = datum; max_location = data_address; }
            ++data_address;
        }

        /* Now we know that a multiple of QUANTUM_SIZE elements remain to
           be processed. We can reliably unroll the following loop
           QUANTUM_SIZE times or otherwise optimize it for dealing with a
           multiple of that number of elements. */
        while( data_address < data_size ) {
            register const float datum = *pdata++;
            if( datum > max_value ) { max_value = datum; max_location = data_address; }
            ++data_address;
        }

        /* Column(0..N) = max_location / column_size */
        /* Row(0..N)    = max_location % column_size */
        return( max_location );
    }
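    Assuming the column * column_size + row layout mentioned earlier, recovering the 2D position from the returned flat index takes one division and one modulo, as the comments above suggest (the helper name here is purely illustrative):

```c
/* Convert a flat index (as returned by a maximum-locating scan) back
 * into (column, row) for the column*column_size + row storage layout. */
void flat_to_2d(unsigned flat, unsigned column_size,
                unsigned *column, unsigned *row)
{
    *column = flat / column_size;
    *row    = flat % column_size;
}
```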