
TDA4AH-Q1: C7X streaming engine non sequential performance

Part Number: TDA4AH-Q1
Other Parts Discussed in Thread: MATHLIB


  1. Do you have benchmarks for the SE, ideally for a non-sequential use case? My use case is ICNT0=64 and DIM1=60: read 64 bytes, then increment the address by 60 bytes. I am seeing this take 3x longer than sequential reads from L2 memory. From L2 I can get ~2 64-byte reads per clock using both streaming engines, but it drops to ~2-3 clocks per read when non-sequential (addr+60 instead of +64). If the address change is smaller, say within +-10 bytes, it is still 1 clock per read, which is good. I also noticed the SE in various transpose modes was reading from L2 at ~2-3 clocks per read.
  2. Because of the SE's non-sequential performance, I found I can get ~2x better FIR filter performance by keeping the SE sequential. What is the best way to merge two vectors? Right now I am using VSEL and VPERM. __shift_right_merge can only do limited shifts; I need byte shifts of 4, 8, 12, 16, ..., 56, 60 (i*4). Right now my ii=19, but if there were 16 predicate registers it would probably reach the theoretical best case of ii=16, since my loop is unrolled 16x. If the SE were faster this loop would be simpler, have ii=1, and not need to be unrolled, so I am hoping I set the SE up wrong.

My use case is convolution with NxN kernels or Nx1 FIR filters processing tens of gigapixels per second. The example above was a 5x5.

  • Hi Ryan,

    Focusing on your first point first, as that seems to be the root of the issue - 

    Do you have benchmarks for the SE, ideally for a non-sequential use case? My use case is ICNT0=64 and DIM1=60: read 64 bytes, then increment the address by 60 bytes. I am seeing this take 3x longer than sequential reads from L2 memory. From L2 I can get ~2 64-byte reads per clock using both streaming engines, but it drops to ~2-3 clocks per read when non-sequential (addr+60 instead of +64). If the address change is smaller, say within +-10 bytes, it is still 1 clock per read, which is good.

    In terms of benchmarks for the SE, we don't have explicit numbers that profile streaming engine cycles. We do have C7x libraries in the SDK since 9.2 (MATHLIB, DSPLIB, VXLIB) that use the streaming engine in various modes and with different ICNTs and DIMs; DSPLIB and VXLIB in particular have good examples of this.

    I would not expect the behavior you are describing - would you mind sharing the full SE parameters you are using in this case so we can investigate further?

    I also noticed SE in various transpose modes was reading from L2 in ~2-3 clocks per read.

    If the SE is configured in transpose mode, there is some cycle overhead compared to a linear read mode, so the behavior here is what I would expect.

    Best,

    Asha

  • Here is one option for settings.

    __SE_TEMPLATE_v1 s_se_params = __gen_SE_TEMPLATE_v1();
    s_se_params.DIMFMT   = __SE_DIMFMT_2D;
    s_se_params.ELETYPE  = __SE_ELETYPE_8BIT;
    s_se_params.VECLEN   = __SE_VECLEN_64ELEMS;
    s_se_params.ELEDUP   = __SE_ELEDUP_OFF;
    s_se_params.ICNT0    = 64;   // perf really drops off when this is 64 from most memories (DDR, L2-cached, MSMC 0x70.., MSMC 0x680..); only seems OK from L2 SRAM 0x64800000
    s_se_params.ICNT1    = 4096;
    s_se_params.DIM1     = 64;
    s_se_params.LEZR_CNT = 0;

    Looks like the main issue is just that ICNT0 needs to be >=512 or 1024 for most memories. It does seem OK when reading from L2 SRAM at 0x6480_0000, but not when the memory is cached in L2 (like reading 32K from DDR).

    When reading from MSMC SRAM at 0x68000000, which is uncached, I get 1.69 clocks per SE read at ICNT0=128 but 10x slower, 19.5 clocks per read, when ICNT0=64. DIM1==ICNT0 for these tests. DIM1 not being equal to ICNT0 (like 60 vs 64) also has a perf penalty. If DIM1 is small, like <10, there is no perf penalty and I get the expected 1.3 clocks per read.

    For a DDR address that is L2-cached I see 1.3 clocks for ICNT0=128 and 3.0 clocks for ICNT0=64. The same test using L2 SRAM at 0x64800000 always gives the expected 1.3.

  • Hi Ryan,

    The streaming engine is designed to read optimally from L2 memory (as you have noted). Streaming engine reads from MSMC/L3 and DDR will start introducing various memory overheads. For the most optimal performance, it's recommended to map streaming engine reads to C7x L2SRAM.

    Best,

    Asha