Hi,
I have many similar operations in an algorithm that processes blocks of data repeatedly. Because there is a for loop inside the block loop, the whole block loop is not a candidate for software pipelining. (The inner loop processes 64 shorts, and I feel that unrolling it would make the code size too large.) The operations on the data are structurally similar, so I have written them as a C macro. To reduce data memory size, each state is stored in one byte (it acts like a small structure). I would like to read 2 words at a time to save external memory access time. I have read the "TMS320C6000 Programmer’s Guide" (spru198k.pdf). It has many examples using LDW, LDDW, _amem4, _amem8, etc., but I find it difficult to see how to apply these techniques to my algorithm.
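To make the read side concrete, here is a minimal portable sketch of what I mean by reading 2 words (eight one-byte states) at a time. The names load8_states and state_at are hypothetical, and the memcpy is only a stand-in: on the C6000 I assume the aligned 64-bit load would be written with an intrinsic such as _amem8_const, which I have not verified in my build. The byte order assumes a little-endian layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fetch eight one-byte states with one 64-bit read.
 * memcpy is a portable stand-in for an aligned double-word load
 * (assumption: on C6000 this would map to an LDDW-style access). */
static inline uint64_t load8_states(const uint8_t *src)
{
    uint64_t v;
    memcpy(&v, src, sizeof v);
    return v;
}

/* Extract state i (0..7) from the packed value.
 * Assumes little-endian layout, so state 0 is the lowest byte. */
static inline unsigned state_at(uint64_t packed, unsigned i)
{
    assert(i < 8u);
    return (unsigned)((packed >> (8u * i)) & 0xFFu);
}
```

With this, one call to load8_states would replace eight separate byte reads, and the macros would work on state_at(p, 0) through state_at(p, 7) held in registers.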
A simplified outline of the algorithm structure is given below.
Pseudocode:
#define Example(aa, bb, cc) cc={some complex comparison and manipulation etc. (aa, bb)}
..............
main_func:
Example(aa0, bb0, cc0)
Example(aa1, bb1, cc1)
Example(aa2, bb2, cc2)
Example(aa3, bb3, cc3)
Example(aa4, bb4, cc4)
Example(aa5, bb5, cc5)
Example(aa6, bb6, cc6)
Example(aa7, bb7, cc7)
..
cc0 through cc7 are declared as int for efficiency, but each value fits within a byte. I would like to pack them into two 32-bit words and store those in a single memory access. My question is how to implement this in C. The examples in spru198k are all adds and multiplies, which are directly supported by DSP instructions. If I define an int array, that results in a new memory allocation. The intrinsics I found operate on shorts, not bytes. Simply put, I want to eliminate extra external memory accesses, if possible, by calling the 8 macros and writing the results out in one go.
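For the write side, this is a minimal portable sketch of the kind of thing I am after, not a tested C6000 implementation. pack4 and store8_states are hypothetical names. The shifts and the memcpy are stand-ins: I assume that on the C6000 the packing could be done with byte-packing intrinsics such as _packl4 and _pack2, and the store with _itoll plus _amem8, but I have not confirmed how the compiler schedules them.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack four byte-range values into one 32-bit word (c0 in the low byte). */
static inline uint32_t pack4(int c0, int c1, int c2, int c3)
{
    return  ((uint32_t)c0 & 0xFFu)
         | (((uint32_t)c1 & 0xFFu) << 8)
         | (((uint32_t)c2 & 0xFFu) << 16)
         | (((uint32_t)c3 & 0xFFu) << 24);
}

/* Hypothetical helper: write eight packed states with one 64-bit store.
 * dst is assumed to be 8-byte aligned, as an aligned double-word store
 * would require; memcpy is the portable stand-in for that store. */
static inline void store8_states(void *dst, uint32_t lo, uint32_t hi)
{
    uint64_t v = ((uint64_t)hi << 32) | lo;
    assert(((uintptr_t)dst & 7u) == 0);
    memcpy(dst, &v, sizeof v);
}
```

The intended use would be to run the eight macros with cc0 through cc7 kept as local ints, then call store8_states(out, pack4(cc0, cc1, cc2, cc3), pack4(cc4, cc5, cc6, cc7)) once per block, so no intermediate int array is needed.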
Regards,