Hi,
I have an application in which I need to run 4 convolutions at the same time, but they are related: I have two inputs (I1, I2) and two sets of coefficients (H1, H2), and two outputs (O1 = I1*H1 + I2*H1, O2 = I1*H2 + I2*H2), where * denotes convolution.
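To make the structure concrete, here is a minimal plain-C reference sketch of the two output equations exactly as written above. It is only a slow, unoptimized model for checking results, not DSP code. The names conv_acc and two_in_two_out are hypothetical, and the 64-bit accumulators are an assumption to keep the reference free of overflow.

```c
#include <stdint.h>
#include <stddef.h>

/* Accumulate y[n] += sum_k h[k] * x[n-k], treating x[n] as 0 for n < 0.
 * Hypothetical helper; y is accumulated into, not overwritten. */
static void conv_acc(const int16_t *x, size_t n_x,
                     const int16_t *h, size_t n_h,
                     int64_t *y)
{
    for (size_t n = 0; n < n_x; n++) {
        for (size_t k = 0; k < n_h && k <= n; k++) {
            y[n] += (int32_t)x[n - k] * h[k];
        }
    }
}

/* O1 = I1*H1 + I2*H1 and O2 = I1*H2 + I2*H2 (* is convolution).
 * The caller must zero o1 and o2 before the call. */
void two_in_two_out(const int16_t *i1, const int16_t *i2, size_t n,
                    const int16_t *h1, const int16_t *h2, size_t taps,
                    int64_t *o1, int64_t *o2)
{
    conv_acc(i1, n, h1, taps, o1);
    conv_acc(i2, n, h1, taps, o1);
    conv_acc(i1, n, h2, taps, o2);
    conv_acc(i2, n, h2, taps, o2);
}
```

Note that if the equations hold exactly as written, both terms of each output share one coefficient set, so by linearity O1 equals (I1+I2)*H1 and O2 equals (I1+I2)*H2. That would need two convolutions rather than four, at the cost of one extra bit of headroom for the summed input.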
I have a C6748 LCDK to get started with this application. Although it supports floating point, I think integer math would be a better option, as I have 16-bit inputs and 16-bit outputs.
The DSPLIB for the C6748 only includes floating-point functions, and the DSPLIB with integer functions does not seem to be optimized for this processor. So it seems I have to write my own convolution function. I want to use a delay line and calculate one output sample for each input sample. Given that the number of coefficients can be more than 40,000, it needs to be quite optimized.
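As a starting point, the delay-line approach described above could look like the following minimal sketch in portable C. Everything here is an assumption for illustration: the names fir_state_t, fir_step and sat16, the Q15 coefficient format, the small NUM_TAPS, and the doubled delay line (each sample is stored twice so the inner loop reads a contiguous window without wrap-around checks). The 64-bit accumulator is deliberate: a 16x16 product needs up to 31 bits, and summing 40,000 of them can add up to 16 more bits, so a 32-bit accumulator can overflow in the worst case.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_TAPS 8  /* hypothetical small value; the post mentions > 40,000 */

typedef struct {
    int16_t line[2 * NUM_TAPS]; /* doubled delay line, zero-initialized */
    size_t  pos;                /* index of the newest sample, starts at 0 */
} fir_state_t;

/* Saturate a wide value to the 16-bit output range. */
static int16_t sat16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Push one input sample and return one output sample.
 * Coefficients are assumed to be Q15 (32768 represents 1.0). */
int16_t fir_step(fir_state_t *s, const int16_t h[NUM_TAPS], int16_t x)
{
    /* Move the write position backwards so newer samples sit at lower indices. */
    if (s->pos == 0) {
        s->pos = NUM_TAPS;
    }
    s->pos--;

    /* Store the sample in both halves of the doubled buffer. */
    s->line[s->pos] = x;
    s->line[s->pos + NUM_TAPS] = x;

    /* d[0] is the newest sample, d[k] is the sample from k steps ago. */
    const int16_t *d = &s->line[s->pos];
    int64_t acc = 0;
    for (size_t k = 0; k < NUM_TAPS; k++) {
        acc += (int32_t)d[k] * h[k];
    }

    /* Scale Q15 back to 16 bits; >> on a negative value is assumed to be
     * an arithmetic shift, as on common compilers. */
    return sat16(acc >> 15);
}
```

The inner loop is a plain dot product over two contiguous 16-bit arrays, which is the shape that packed 16-bit multiply instructions are meant for. Whether a compiler pipelines it well enough at 40,000+ taps would still need to be measured.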
I've been studying some documentation and I have some questions:
The floating point functions in DSPLIB seem to try to do 4 multiplies each cycle, but the core can only do 2. Why is the loop then unrolled for 4 multiplies?
The documentation mentions that the core can normally do 4 16-bit multiplies each cycle (2 units x 2 16-bit multiplies each), but it seems that the DDOTP4 instruction matches the task at hand perfectly, if I order my data correctly. That would allow me to do 8 multiply/add operations each cycle. Am I correct?
Is it possible to write such functions in C, or would I need assembly code to get it really optimized? Would it be possible to replace all FP data types in the DSPLIB SP function with shorts and still benefit from the existing optimizations there?
Thanks in advance.
Kind regards,
Remco Poelstra