PROCESSOR-SDK-J784S4: Streaming-in scattered vector

Thierry BERNIER

Hi TI,

I programmed the SE to read scattered cfloat data. By that I mean each get_adv() jumps "far" in memory in the inner loop of that SE (I know this is probably bad for the cache).

I am reading this way 8 times to build a cfloat8. So it's something like :

cfloat d0 = ..get_adv();
...
cfloat d7 = ..get_adv();
cfloat8 v = cfloat8(d0,...,d7);

Is there an intrinsic to do this in one instruction, something like :

cfloat8 v = ...get_adv(); // get'ing & adv'ing 8 times

All d0/7/v are local variables of my kernel loop, and I see in assembly they are all in registers so there's no memory write to the stack there (good).

[ I am new to all of these ]

Thanks.

over 1 year ago

0 Asha Bhandarkar over 1 year ago

TI__Genius 10170 points

Hi Thierry,

Based on what you've posted it's my understanding that you are trying to read data that isn't contiguous in memory with the streaming engine (hence the scattered). Looking at the code snippet, it seems that your current method is to read each cfloat pair individually, and then concatenate them?

Your idea of doing this in one get_adv() is correct, and it is much more efficient in terms of cycle counts. However, you would need to adjust your streaming engine parameters to achieve this.

Can you provide more information about how your data is stored in memory? For example, are all of these cfloat pairs scattered with a fixed offset between? A graphic here could be useful. Then we can determine what optimization you can do. If the data is scattered randomly, then it will be difficult to achieve better performance.

Could you also give what your current streaming engine parameters are?

Best,

Asha

0 Thierry BERNIER over 1 year ago in reply to Asha Bhandarkar

Prodigy 80 points

Thanks Asha,

The cfloats, as vector elements, are scattered like this

cfloat memory[e*N+k];

e=[0,7] is the inner-loop of vector elements, ans k=[0..P[ the outter (P < N) of vectors list.

I want to load these elements in a loop over k on a per-vector basis :

for k = 0:P
    cfloat8 v = cfloat8(memory[0*N+k], memory[1*N+k], ..., memory[7*N+k]);
    ... (use v),
end loop
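For reference, the loop above can be sketched in plain host-side C++ as follows. This is only an illustration of the indexing, not C7x code: `cfloat` and `cfloat8` are stand-ins built from `std::complex<float>` and `std::array`, and `gather_vector` is a hypothetical helper name.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat  = std::complex<float>;    // stand-in for the C7x cfloat type
using cfloat8 = std::array<cfloat, 8>;  // stand-in for the C7x cfloat8 type

// Build the k-th vector: element e is taken from memory[e*N + k].
cfloat8 gather_vector(const std::vector<cfloat>& memory,
                      std::size_t N, std::size_t k) {
    cfloat8 v{};
    // The last element read is memory[7*N + k]; it must exist.
    assert((v.size() - 1) * N + k < memory.size());
    for (std::size_t e = 0; e < v.size(); ++e) {
        v[e] = memory[e * N + k];
    }
    return v;
}
```

Calling `gather_vector(memory, N, k)` for k = 0 to P-1 reproduces the per-vector loop, one strided read per element.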

Hope this snippet is enough with no drawing.

FYI, experimenting with the "v.s[e]" API to set the vector v is not faster (a bit slower) than the cast constructor "v = vectortype(e0,e1,...)", on J784S4.

Unrolling the k-loop by hand does not show significant gain.

0 Asha Bhandarkar over 1 year ago in reply to Thierry BERNIER

TI__Genius 10170 points

Hi Thierry,

Yes, this additional information helps.

Thierry BERNIER said:
ans k=[0..P[ the outter (P < N) of vectors list.

Would you mind explaining this part more so I can confirm my understanding?

Best,

Asha

0 Thierry BERNIER over 1 year ago in reply to Asha Bhandarkar

Prodigy 80 points

Sorry, it's a typo, read it : "and k=[0..P[ (..)"

k is the vector number, from k=0 to k=P-1. Each vector is made of 8 scalar elements numbered from e=0 to e=7.

So I have a loop over k from k=0 to k=P-1, inside which I want to feed the k-th vector of 8 scalars numbered by e. The scalars are in memory[e*N+k].

N and P are constant values, and P is less than N

0 Asha Bhandarkar over 1 year ago in reply to Thierry BERNIER

TI__Genius 10170 points

Hi Thierry,

Thanks for the extra information. Let me take this and better understand your data access pattern and look into SE parameters in this case. I will provide an update on this by 7/31.

Best,

Asha

0 Asha Bhandarkar over 1 year ago in reply to Asha Bhandarkar

TI__Genius 10170 points

Hi Thierry,

Sorry for the delay in response. From my understanding, this is a quick idea of what you are trying to achieve in terms of a data access pattern, correct? I've just picked values for N and P to demonstrate.

Note: I'm using single digits in the 8-byte blocks to represent the pair of 4-byte floats that makes up each element of the cfloat8 type.

In this case, blue is when k=0 (first vector), green is when k=1 (second vector), and orange is when k=2 (third vector)

For this, since you are doing a "fixed" (rather than random) scatter, I believe you can accomplish this with SE. Assuming that the chart above is correct, I am still working through the best way that you could set the streaming engine parameters to do so.

I'll provide an update on 8/5 regarding this progress.

Best,

Asha

0 Asha Bhandarkar over 1 year ago in reply to Asha Bhandarkar

TI__Genius 10170 points

Hi Thierry,

Can you look at the following code snippet as something you could implement to achieve a data access pattern similar to the one I diagrammed above?

   // Array of floats for example data
   float cA[64] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
                  17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
                  33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48,
                  49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64};

   __SE_ELETYPE SE_ELETYPE;
   __SE_VECLEN SE_VECLEN;
   SE_VECLEN = c7x::se_veclen<c7x::cfloat_vec>::value;
   SE_ELETYPE = c7x::se_eletype<c7x::cfloat_vec>::value;

   uint32_t N = 4; // User-set,
   uint32_t P = 3; // User-set
   uint32_t elemCount = (c7x::element_count_of<c7x::cfloat_vec>::value); // number of elements in a vector
   printf("Element Count: %u\n", elemCount); // elemCount is unsigned

   // Configure SE
   __SE_TEMPLATE_v1 se0_param = __gen_SE_TEMPLATE_v1 ();
   se0_param.TRANSPOSE = __SE_TRANSPOSE_64BIT; 
   se0_param.ELETYPE = SE_ELETYPE;
   se0_param.VECLEN  = SE_VECLEN;
   se0_param.DIMFMT  = __SE_DIMFMT_2D;
   // With 2D Transpose mode, your ICNT0 = columns where ICNT1 = rows 
   se0_param.ICNT0 = P; 
   se0_param.ICNT1 = elemCount; 
   se0_param.DIM1 = N; // "skip" factor

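   __SE0_OPEN (&cA, se0_param);

   // Each get_adv() returns one full vector and advances the stream
   for(uint32_t i = 0; i < P; i++) {
      c7x::cfloat_vec cB = c7x::strm_eng<0, c7x::cfloat_vec>::get_adv ();
      cB.print();
   }

   // Close the stream once all P vectors have been read
   __SE0_CLOSE ();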

You should be able to use Streaming Engine's transpose to achieve the "scatter" pattern you are looking for. In this case, you would be setting your data up as 2-dimensional, so you need to set the ICNT0, ICNT1, and DIM1 parameters.

In linear mode, you can think of ICNT0 as the width of your 2D data and ICNT1 as its height. In transpose mode this is flipped.

DIM1 = N (with N being your fixed scatter offset)

ICNT1 = element count, 8 in the case of cfloat8 (with this being the number of elements you will be able to access with each adv())

ICNT0 = P, your looping factor
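To make the roles of these three parameters concrete, here is a minimal host-side sketch of the addressing they are meant to produce. It is an assumption-based model of the description above, not the hardware behaviour or a TI API: `StreamModel` and `vector_offsets` are hypothetical names, and offsets are counted in cfloat elements.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of the 2D transpose setup described above.
struct StreamModel {
    uint32_t icnt0;  // P: number of vectors to fetch
    uint32_t icnt1;  // elements per vector (8 for cfloat8)
    uint32_t dim1;   // N: stride between consecutive elements of a vector
};

// Offsets (in cfloat elements) assumed to make up the k-th vector.
std::vector<uint32_t> vector_offsets(const StreamModel& s, uint32_t k) {
    assert(k < s.icnt0);
    std::vector<uint32_t> offsets;
    offsets.reserve(s.icnt1);
    for (uint32_t e = 0; e < s.icnt1; ++e) {
        offsets.push_back(e * s.dim1 + k);  // same as memory[e*N + k]
    }
    return offsets;
}
```

With N = 4 and P = 3 as in the snippet, this model gives offsets 0, 4, 8, ..., 28 for the first vector. In the example array cA, the cfloat at offset j holds the float pair (2j+1, 2j+2), so under this model the first vector would be built from the pairs (1, 2), (9, 10), (17, 18), and so on.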

Note that in this case, since SE Transpose does not support the DECDIM feature on DIM1, you will need to account for padding in your data. This means that for your last vector k, if your memory array only fills 7 elements, for example, the last element will not automatically be padded with a 0 by the streaming engine.

With the code snippet, you should be able to run this with host emulation and see how the data is accessed when each vector is printed to the terminal. Please let me know if this is what you are trying to achieve.

Best,

Asha
