Part Number: TDA4AH-Q1
Other Parts Discussed in Thread: MATHLIB
- Do you have benchmarks for the Streaming Engine (SE), ideally for a non-sequential use case? My use case is ICNT0=64 and DIM1=60: read 64 bytes, then advance the address by 60 bytes. I am seeing this take about 3x longer than sequential reads from L2 memory. From L2 I can get ~2 64-byte reads per clock using both streaming engines. With non-sequential addressing (addr+60 instead of addr+64) it was ~2-3 clocks per read. If the address change was smaller, less than ±10, it was still 1 clock per read, which is good. I also noticed that the SE in various transpose modes was reading from L2 at ~2-3 clocks per read.
- Because of the SE's non-sequential performance, I found I can get ~2x better FIR filter performance by keeping the SE sequential. What is the best way to merge 2 vectors? Right now I am using VSEL and VPERM. __shift_right_merge can only do limited shifts, and I need 4, 8, 12, 16, ..., 56, 60 byte shifts (i*4). Right now my ii=19, but if there were 16 predicate registers it would probably reach the theoretical best case of ii=16, since my loop is unrolled 16 times. If the SE were faster, this loop would be simpler, have an ii=1, and not need to be unrolled, so I am hoping I set up the SE wrong.
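
To make the two access patterns concrete, here is a minimal portable-C sketch of what I mean. It is a scalar reference model only, not C7x code: it uses no SE setup and no intrinsics, and it says nothing about cycle counts. All function and macro names are hypothetical. It models the strided read (64 bytes every 60 bytes) and the workaround (sequential 64-byte reads plus a byte-shifted merge of two neighbouring vectors), and checks that both produce the same data. Since 16 * 60 = 15 * 64, the shift pattern repeats every 16 outputs, which is why the loop is unrolled 16 times.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 64  /* bytes per read (ICNT0 = 64) */
#define STRIDE    60  /* bytes between reads (DIM1 = 60) */
#define UNROLL    16  /* 16 * 60 == 15 * 64, so the shifts repeat every 16 outputs */

/* Byte shift needed for output i when the data is fetched sequentially. */
int shift_for_output(int i)
{
    return (i * STRIDE) % VEC_BYTES;  /* 0, 60, 56, ..., 8, 4 for i = 0..15 */
}

/* Model of the strided read: 64 bytes starting at src + 60*i. */
static void strided_window(const uint8_t *src, int i, uint8_t out[VEC_BYTES])
{
    memcpy(out, src + (size_t)i * STRIDE, VEC_BYTES);
}

/* Model of the merge: bytes [shift, shift + 64) of the concatenation lo:hi. */
static void merge_shift(const uint8_t *lo, const uint8_t *hi,
                        int shift, uint8_t out[VEC_BYTES])
{
    assert(shift >= 0 && shift < VEC_BYTES);
    for (int k = 0; k < VEC_BYTES; k++) {
        int pos = k + shift;
        out[k] = (pos < VEC_BYTES) ? lo[pos] : hi[pos - VEC_BYTES];
    }
}

/* Returns how many of the 16 outputs differ between the two approaches. */
int count_mismatches(void)
{
    uint8_t src[UNROLL * VEC_BYTES];
    for (size_t n = 0; n < sizeof src; n++)
        src[n] = (uint8_t)(n * 7u + 3u);  /* arbitrary test pattern */

    int bad = 0;
    for (int i = 0; i < UNROLL; i++) {
        int j = (i * STRIDE) / VEC_BYTES;  /* index of the sequential vector */
        uint8_t want[VEC_BYTES], got[VEC_BYTES];

        strided_window(src, i, want);
        merge_shift(src + (size_t)j * VEC_BYTES,
                    src + (size_t)(j + 1) * VEC_BYTES,
                    shift_for_output(i), got);
        if (memcmp(want, got, VEC_BYTES) != 0)
            bad++;
    }
    return bad;
}
```

In this model, 16 outputs consume 15 sequential 64-byte vectors plus the first few bytes of a 16th, and every nonzero shift is a multiple of 4 bytes between 4 and 60. On the C7x the `merge_shift` step is what I currently build from VSEL and VPERM.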
My use case is convolution with nxn kernels or nx1 FIR filters, processing tens of gigapixels per second. The example above was a 5x5 kernel.