Hi all,
I'm working with the C5515 eZdsp kit and run quite "heavy" algorithms on a lot of input data. Even after turning on -O3 and setting optimize for speed = 5, I still don't meet my target requirements. I think I need your advice on optimizing data acquisition, transmission and the algorithm itself.
First of all, here's what I do (roughly):
- Input data arrives 8 bits in parallel on GPIO at about 3 MBytes/s; currently the GPIO receive function is executed with a period of T_gpio = 10 ms
- Data is transmitted to the host over USB; the host requests approx. 10 kBytes every 40 ms
- The data to process is of video type (64x64x16 bit matrices). My algorithm takes approximately T_algo = 8 ms currently.
My target requirements are:
- Read 64x64x16 bit at an interval of approx. T_gpio = 1 ms (the absolute limit is T_gpio = 3 ms)
- Target-Host transfer is okay already
- The algorithm should run in approximately T_algo < 1 ms (the absolute limit is T_algo = 4 ms)
If I don't manage to get T_algo < 1 ms (which is very likely), I should still read GPIO at T_gpio = 1 ms (or 2 or 3 ms) and then run a filter on the input that was read.
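As a quick sanity check on these numbers, here is a small C sketch of the arithmetic. It assumes that one frame is 64x64 samples of 16 bits each, that one full frame is read per T_gpio period, and that 1 MByte = 10^6 bytes. The function names are made up for illustration, and the code is meant to run on a PC, not on the target.

```c
#include <stdint.h>

/* Size of one frame in bytes (assumes bits_per_sample is a multiple of 8). */
uint32_t frame_bytes(uint32_t rows, uint32_t cols, uint32_t bits_per_sample)
{
    return rows * cols * (bits_per_sample / 8u);
}

/* Time in microseconds needed to move `bytes` at `rate_bytes_per_s`,
 * rounded down. The 64-bit intermediate avoids overflow. */
uint32_t transfer_time_us(uint32_t bytes, uint32_t rate_bytes_per_s)
{
    return (uint32_t)(((uint64_t)bytes * 1000000u) / rate_bytes_per_s);
}

/* Rate in bytes/s needed to move `bytes` within `period_us` microseconds,
 * rounded down. */
uint32_t required_rate_bytes_per_s(uint32_t bytes, uint32_t period_us)
{
    return (uint32_t)(((uint64_t)bytes * 1000000u) / period_us);
}
```

Under these assumptions one frame is 8192 bytes. At about 3 MBytes/s that takes roughly 2.7 ms to arrive, and reading one frame every 1 ms would need roughly 8.2 MBytes/s.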
My questions:
- GPIO is handled entirely by the CPU. Is there a better way to do this kind of data acquisition? I saw that SPI is handled by the CPU as well... If not, what's the best way to run data acquisition and the input filter independently from the rest of the algorithm? I'm currently using timer HWIs to execute the GPIO receive code, and that works. But when I put the filter algorithm inside the HWI, USB requests are blocked, which slows down my Target-Host transfer. How would you "organize" those HWIs?
- The algorithm is entirely coded in C. I'm not using DSPLIB. I might use DSPLIB, but if I can manage to optimize my code by hand, I'd rather leave it out. So here are some very specific optimization questions:
- If I use #pragma MUST_ITERATE .., (how) can I be sure that the code is pipelined or executed in parallel?
- I have several "element-wise" matrix operations like subtract matrix A from B etc. What is better (see the following exemplary pseudo-code)?
    for (i = 0; i < N; i++) {
        C[i] = A[i] - B[i];
    }
    for (i = 0; i < N; i++) {
        D[i] = k * C[i];
    }
    /* ... and many other operations to follow ... */
OR
    for (i = 0; i < N; i++) {
        C[i] = A[i] - B[i];
        D[i] = k * C[i];
        /* ... and many other operations to follow ... */
    }
- Regarding memory operations: I'm currently using memcpy for copying and memset for setting to zero. I imagine I could get much better performance if I used DMA for memory copies. What about setting memory to zero? What's the best-performing method? Doing it manually in a loop?
- And then there's #pragma DATA_ALIGN ...: what impact does it have on the C5515? I've seen that on the OMAP-L137, for example, DATA_ALIGN (x, 128) is preferable for caching. But why is it DATA_ALIGN (x, 8) on the C5515?
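To make the loop question concrete, here is a minimal C sketch of the two variants from the pseudo-code above. It is only an illustration under my own assumptions: 16-bit elements, a fixed length of 4096 (64x64), no saturation or scaling of the product, stdint.h being available in the toolchain, and `__TMS320C55X__` being the macro that identifies the C55x compiler. The function names are hypothetical. The pragma guard just lets the same file build on a PC for testing.

```c
#include <stdint.h>

#define FRAME_LEN 4096 /* 64 x 64 elements per frame (assumed) */

/* Variant 1: one loop, and one pass over memory, per operation. */
void diff_scale_separate(const int16_t *A, const int16_t *B,
                         int16_t *C, int16_t *D, int16_t k)
{
    int i;
#ifdef __TMS320C55X__
#pragma MUST_ITERATE(4096, 4096)
#endif
    for (i = 0; i < FRAME_LEN; i++) {
        C[i] = (int16_t)(A[i] - B[i]);
    }
#ifdef __TMS320C55X__
#pragma MUST_ITERATE(4096, 4096)
#endif
    for (i = 0; i < FRAME_LEN; i++) {
        D[i] = (int16_t)(k * C[i]);
    }
}

/* Variant 2: all operations fused into a single loop, so the
 * intermediate value stays in a local instead of being re-read from C. */
void diff_scale_fused(const int16_t *A, const int16_t *B,
                      int16_t *C, int16_t *D, int16_t k)
{
    int i;
#ifdef __TMS320C55X__
#pragma MUST_ITERATE(4096, 4096)
#endif
    for (i = 0; i < FRAME_LEN; i++) {
        int16_t c = (int16_t)(A[i] - B[i]);
        C[i] = c;
        D[i] = (int16_t)(k * c);
    }
}
```

Both functions produce the same C and D. What I can't tell from the C source is which one the compiler schedules better, which is exactly my question.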
I'd very much appreciate your advice and answers to (some of) my questions. Sorry for the huge list, but I'm quite new to DSP coding. Even though I've looked at the C55x compiler user's guide and the assembly language tools guide, many points about the implementation are still not clear to me. One thing seems likely: I will have to go down to assembly for large parts of my code.
Thank you very much.
Best regards,
Andreas