TDA4VM: C7x training

Clément Fau

Part Number: TDA4VM
Other Parts Discussed in Thread: PROCESSOR-SDK-J721E

Hello,

I downloaded version 0.5 of the C7x training from the PROCESSOR-SDK-J721E v6.1 page: PROCESSOR-SDK-RTOS-J721E_06.01.01.12 | TI.com

Is a new version of this training available or planned? If planned, can you give an approximate date of availability?

In this new version, will the advanced examples contain their documentation?

Is it possible to get the documentation of these examples now, even if it is not finished yet?

The FFT example (c7x_fft1d_16bit) interests me a lot, but it has no documentation or even comments.

Thanks a lot,
Clément

over 4 years ago

0 Clément Fau over 4 years ago

Prodigy 135 points

On the same subject (and I am taking the opportunity to bump this post): I have 16 data elements arranged linearly and I would like to create 4 vectors with this shape:

I have the feeling that the streaming engine does not allow group duplication other than filling the vector with ONE group, while here I have 2 groups for one vector (0 1 and 4 5 for vector 1).

The duplication of an element does not work either (I do not want 0 0 1 1 but 0 1 0 1).

So I tried to do this by hand using several dimensions:

params0.TRANSPOSE = __SE_TRANSPOSE_128BIT;

params0.ICNT0 = 2;
params0.ICNT1 = 2; params0.DIM1 = 0;
params0.ICNT2 = 2; params0.DIM2 = 4;
params0.ICNT3 = 2; params0.DIM3 = 1;
params0.ICNT4 = 2; params0.DIM4 = 8;

This is the transposed version; I tried linear mode first, but the problem is the same: I get half-empty vectors because of the boundary awareness and ICNT1 = 2.

So my question: Is it possible to build those vectors with the SE? If not, what is the best way to build them?

Thank you very much for your time,

Clément FAUCONNET

0 Burton Copeland over 4 years ago in reply to Clément Fau

TI__Prodigy 100 points

Hi Clément,

I'm not connected to documentation/training at all so I cannot comment on your questions there, but I can provide info about SE/C7x programming and how to get the pattern you are looking for.

You're on the right track with transposed mode! Unfortunately there is not a mechanism to create the pattern you're after in linear mode so utilizing transposed mode is a requirement if this is going to be completely contained within SE.

To create this pattern what you need to do is:

- Set transposed granularity (TRANSPOSE) to 2x your element size. It looks like you are doing that with 128-bit granularity on what's probably 64-bit elements.
- Set ELTYPE to be complex, so element 0' would have sub-elements (0,1), 1' would have (2,3), 2' would have (4,5), etc.
- Set ELDUP (element duplication) to 2x.

If you run this with ICNT1,2,3,4,5 = 1, then you would see a vector containing just 0101. To get 01014545 you set ICNT1=2 and DIM1=4 so that the second "row" starts at element 4.

More parameters:

ICNT0 = 2 (The efficiency of transposed mode is tied to the ratio of granules within ICNT0 compared to the number of rows. I moved this here from your ICNT3)
ICNT1 = 2; DIM1 = 4
ICNT2 = 2; DIM2 = 8
ICNT3,4,5 = 1

There are limitations with this approach, though. SE hardware restrictions exist such that 16-bit transposed granularity requires 2x element type promotion so 1B elements will not work without post-processing. Also, since this utilizes complex element types it only works for patterns that repeat consecutive pairs of elements. If you wanted 0123_0123_89AB_89AB that wouldn't be possible, for example.

If you find yourself needing to create more generalized patterns then you can also look into C7x's VPERM instruction. Here's a brief overview of how you would use that for this scenario:

Read elements 0-7 (assuming 64-bit elements) as one vector from SE.

Pass that as the second input into VPERM (you can use the __permute(uchar64,uchar64) intrinsic). The first input to VPERM is an array where each byte indicates where the corresponding output byte comes from in the input vector and some formatting information. In this case your array would be:

Then just loop that. Set SE's addressing parameters to the following and you can reuse the same permutation map for each vector:

6 <= ICNT0 <= VECLEN (You need at least 6 elements to get 0,1,4, and 5. I include VECLEN as the upper bound, because you want ICNT0 to be completely contained within one vector)
ICNT1 = 2; DIM1 = 2
ICNT2 = 2; DIM2 = 8
ICNT3,4,5 = 1

This has some advantages over using SE's transposed mode:

Can be used to create more complex patterns and avoids SE's hardware restrictions regarding small granules
Depending on your final code and the scheduling that the compiler is able to achieve, this may run faster, because SE is more efficient and has less overhead in linear mode than in transposed mode.
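
The linear-mode parameters and permute step described above can be modeled on a host in plain C. This is only a sketch under stated assumptions: it does not use the C7x intrinsics, it permutes whole 64-bit elements rather than bytes, the function names are hypothetical, and it assumes the three remaining vectors follow the same pattern as 01014545.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8 /* 64-bit elements in one 512-bit vector */

/* Host-side model of one linear-mode vector: ICNT0 elements starting at
 * base + i1*DIM1 + i2*DIM2, zero padded up to the vector length. */
static void se_linear_vector(const uint64_t *base, size_t icnt0,
                             size_t i1, size_t dim1, size_t i2, size_t dim2,
                             uint64_t out[LANES])
{
    assert(icnt0 <= LANES);
    const uint64_t *row = base + i1 * dim1 + i2 * dim2;
    for (size_t lane = 0; lane < LANES; ++lane)
        out[lane] = (lane < icnt0) ? row[lane] : 0;
}

/* Element-level stand-in for the permute: out[lane] = in[map[lane]]. */
static void permute_lanes(const uint8_t map[LANES], const uint64_t in[LANES],
                          uint64_t out[LANES])
{
    for (size_t lane = 0; lane < LANES; ++lane) {
        assert(map[lane] < LANES);
        out[lane] = in[map[lane]];
    }
}

/* Builds 01014545, 23236767, 8989CDCD, ABABEFEF from 16 input elements,
 * reusing one permutation map for every vector. */
static void build_pair_vectors(const uint64_t data[16], uint64_t out[4][LANES])
{
    static const uint8_t map[LANES] = {0, 1, 0, 1, 4, 5, 4, 5};
    size_t v = 0;
    for (size_t i2 = 0; i2 < 2; ++i2) {         /* ICNT2 = 2, DIM2 = 8 */
        for (size_t i1 = 0; i1 < 2; ++i1) {     /* ICNT1 = 2, DIM1 = 2 */
            uint64_t raw[LANES];
            se_linear_vector(data, 6, i1, 2, i2, 8, raw); /* ICNT0 = 6 */
            permute_lanes(map, raw, out[v++]);
        }
    }
}
```

With ICNT0 = 6 the last vector reads elements 10 to 15, so the model never reads past the 16 input elements.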

Best regards,

Burton

0 Clément Fau over 4 years ago in reply to Burton Copeland

Prodigy 135 points

Burton,

First, thank you very much for your time, your answer helps me a lot!

1) Your solution was smart but I didn't specify: each element of my input vector is a 64b complex. Sorry for my lack of precision but I didn't think this information was important for you and wanted to make the question simpler.

On the other hand, I didn't know VPERM very well, and I do think it can be a solution. I had assumed I would certainly lose performance by using it, so thank you.

2) You talk about the problems of the SE in transpose mode, and indeed I encountered a problem recently that I had to work around which I wanted to tell you about because I saw nothing about it in the documentation:

Here is an SE configuration:

__SE_TEMPLATE_v1 params1 = __gen_SE_TEMPLATE_v1();
params1.ELETYPE = __SE_ELETYPE_32BIT_CMPLX_SWAP;
params1.DIMFMT = __SE_DIMFMT_3D;
params1.TRANSPOSE = __SE_TRANSPOSE_64BIT;
params1.VECLEN = __SE_VECLEN_8ELEMS;

params1.ICNT0 = 1;
params1.ICNT1 = ptByButterfly; params1.DIM1 = butterfly;
params1.ICNT2 = butterfly; params1.DIM2 = 0;

This streaming engine crashes if ptByButterfly is strictly greater than 16.

So I had to write it like this:

params1.ICNT0 = 1;
params1.ICNT1 = 16; params1.DIM1 = butterfly;
params1.ICNT2 = ptByButterfly/16; params1.DIM2 = 16*butterfly;
params1.ICNT3 = butterfly; params1.DIM3 = 0;

I found this strange but maybe I missed something in the documentation.
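
As a sanity check that the split visits the same row offsets in the same order, here is a small host-side sketch in plain C. It models only the nested-loop offsets (innermost dimension varying fastest), not how transposed mode forms vectors. The function names are hypothetical, and it assumes ptByButterfly is a multiple of 16.

```c
#include <assert.h>
#include <stddef.h>

/* Row offsets of the original setup: ICNT1 = pt, DIM1 = butterfly. */
static void offsets_single_dim(size_t pt, size_t butterfly, size_t *out)
{
    for (size_t i1 = 0; i1 < pt; ++i1)
        out[i1] = i1 * butterfly;
}

/* Row offsets of the workaround:
 * ICNT1 = 16, DIM1 = butterfly; ICNT2 = pt/16, DIM2 = 16*butterfly. */
static void offsets_split(size_t pt, size_t butterfly, size_t *out)
{
    assert(pt % 16 == 0); /* the split only covers multiples of 16 */
    size_t n = 0;
    for (size_t i2 = 0; i2 < pt / 16; ++i2)
        for (size_t i1 = 0; i1 < 16; ++i1)
            out[n++] = i1 * butterfly + i2 * 16 * butterfly;
}
```

Both functions produce k * butterfly for k = 0 .. pt - 1, because i1 + 16 * i2 enumerates every k exactly once in increasing order.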

3) Finally, I have one last question:

I am currently working on a classical FFT, and my algorithm starts with a bit-reversed addressing step like the one the VSTBITRDW instruction performs. The instruction necessarily operates on a limited size (one vector), so it won't work if the input is larger than a vector, since it needs all the elements in order to rearrange them. Would you have a way to do this in an optimized way for any size?

Something like this, but for a length greater than 16, such as 512, and still working with __SE_ELETYPE_32BIT_CMPLX.

Thank you very much for your help, it really helps me to push the performance of the C71x core to the maximum.

Clément

+1 Burton Copeland over 4 years ago in reply to Clément Fau

TI__Prodigy 100 points

Clément,

1) Yeah, that gets you into the case of wanting 0123012389AB89AB and SE doesn't have support for generating that pattern in a single vector, unfortunately.

VPERM is a .C unit exclusive instruction so it can negatively impact your performance if you have multiple operations that need to execute on the .C unit in your loop. Here are a few more alternatives:

You can use both SEs (if you aren't using both already...) to generate parts of your vector via group duplication on one of them (01010000 and 45454545), use an AND mask on the second one (01010000 and 00004545) and finally OR them together to get your desired vector. This is more instructions, but is less limited in what units it can be scheduled on and the compiler should be able to schedule it into a single cycle loop by putting both operations in parallel. So candidate assembly to put the result in VB0 could be:
- VANDW SE0++, VB1, VB2 <-- VB1 is the AND mask
- VORW SE1++, VB2, VB0
You can try to do the same with one SE, sort of a manual software loop unroll (VB1 and VB2 are masks, destination is still VB0).
- SE0 read ordering:
  1. 01012323
  2. 45456767
- Pseudocode:
  - first_vec = SE0++;
  - sec_vec = SE0++;
  - first_lower = (first_vec & lower_mask);
  - first_upper = (first_vec & upper_mask) >> [32-bytes];
  - sec_lower = (sec_vec & lower_mask) << [32-bytes];
  - sec_upper = (sec_vec & upper_mask);
  - first_output = (first_lower | sec_lower);
  - sec_output = (first_upper | sec_upper);
Program the MMA to do it if you aren't using it currently. That's definitely a more involved task, though.
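
The single-SE pseudocode above can be checked with a host-side model in plain C. This is a sketch, not C7x code: the vec8 type and helper names are hypothetical, lanes stand for 64-bit elements, and the 32-byte shifts are modeled as moving four lanes toward or away from lane 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8
#define HALF (LANES / 2) /* 4 lanes of 64 bits = 32 bytes */

typedef struct { uint64_t lane[LANES]; } vec8;

static vec8 vec_and(vec8 a, vec8 b)
{
    for (size_t i = 0; i < LANES; ++i) a.lane[i] &= b.lane[i];
    return a;
}

static vec8 vec_or(vec8 a, vec8 b)
{
    for (size_t i = 0; i < LANES; ++i) a.lane[i] |= b.lane[i];
    return a;
}

/* Models ">> 32 bytes": the upper half moves to the lower half, zero filled. */
static vec8 shift_down_half(vec8 a)
{
    vec8 r = {{0}};
    for (size_t i = 0; i < HALF; ++i) r.lane[i] = a.lane[i + HALF];
    return r;
}

/* Models "<< 32 bytes": the lower half moves to the upper half, zero filled. */
static vec8 shift_up_half(vec8 a)
{
    vec8 r = {{0}};
    for (size_t i = 0; i < HALF; ++i) r.lane[i + HALF] = a.lane[i];
    return r;
}

/* Turns 01012323 and 45456767 into 01014545 and 23236767. */
static void recombine_halves(vec8 first_vec, vec8 sec_vec,
                             vec8 *first_output, vec8 *sec_output)
{
    vec8 lower_mask = {{0}}, upper_mask = {{0}};
    for (size_t i = 0; i < HALF; ++i) {
        lower_mask.lane[i] = UINT64_MAX;
        upper_mask.lane[i + HALF] = UINT64_MAX;
    }
    assert(first_output != NULL && sec_output != NULL);
    vec8 first_lower = vec_and(first_vec, lower_mask);
    vec8 first_upper = shift_down_half(vec_and(first_vec, upper_mask));
    vec8 sec_lower = shift_up_half(vec_and(sec_vec, lower_mask));
    vec8 sec_upper = vec_and(sec_vec, upper_mask);
    *first_output = vec_or(first_lower, sec_lower);
    *sec_output = vec_or(first_upper, sec_upper);
}
```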

2) To transpose an array one has to read in all of the rows and then write out all of the columns. To do that all rows have to be received from memory so you have to buffer the ones that have been received, meaning that the maximum number of rows that can be operated on is limited by the amount of storage in the IP doing the transposition. Unfortunately SE only has internal storage available to perform 16 row transposition. The reason for that is that transposed mode is optimized for working on large arrays where you can break the array into smaller "tiles" sort of like what you've done. To get optimal performance, though, you should be making your ICNT0 as large as possible. Here's an example 24x24 array where the columns are each 64-bits wide:

Since the columns are 64 bits wide you can only fit 8 in a single vector, so let's say that you want to work with 8 rows. When SE produces the first column (8 rows) it will fetch the light green region into its local storage which takes at least 8 cycles. If you work your way all the way down that first column SE will have to discard the light green region to make room for the next tile that you are requesting data from. When you finally get to the second column at the top, SE will have to re-fetch the light green region again. The result is that you'll end up with SE performing 8 fetches for every single vector you read from it, meaning >700% overhead. Performance will be terrible. With linear mode you can basically tell SE to do whatever pattern you want without a lot of performance consideration, but you need to be very careful when working with transposed mode.

To optimize for performance it's recommended to consume the full tile before moving to another one so that SE doesn't have to re-fetch it. Unfortunately that does mean that you have to rework your loops and algorithm to deal with a different access pattern. Alternatively you could try moving the operation of transposition into the MMA and use a transposition matrix to get the job done. The width of each tile is the vector size (64B). Also, if your data is non-aligned then I highly suggest limiting ICNT1 to no more than 8 so that SE can double buffer. If you are aligned then performance should be comparable between ICNT1 = 8 and 16.
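
To make the re-fetch cost concrete, here is a toy host-side model in plain C of the 24x24 example. It is only a sketch: it assumes a buffer that holds a single 8-row by 8-column tile and charges 8 row fetches each time a different tile is needed. The type and function names are hypothetical and the real SE buffering is more involved.

```c
#include <assert.h>
#include <stddef.h>

#define TILE_ROWS 8 /* rows gathered into one vector of 64-bit columns */
#define TILE_COLS 8 /* 64-byte tile width / 8-byte columns */

/* One-tile buffer: tracks which tile is resident and counts the work. */
typedef struct {
    long tile_row, tile_col; /* resident tile, -1 when empty */
    size_t fetches;          /* row fetches issued to memory */
    size_t vectors;          /* vectors delivered to the CPU */
} tile_model;

/* Deliver one column of a row block; reload the tile if it is not resident. */
static void read_vector(tile_model *m, size_t row_block, size_t col)
{
    long tr = (long)row_block, tc = (long)(col / TILE_COLS);
    if (tr != m->tile_row || tc != m->tile_col) {
        m->fetches += TILE_ROWS;
        m->tile_row = tr;
        m->tile_col = tc;
    }
    m->vectors++;
}

/* Walk each column from top to bottom before moving to the next column. */
static tile_model walk_column_major(size_t rows, size_t cols)
{
    tile_model m = {-1, -1, 0, 0};
    assert(rows % TILE_ROWS == 0 && cols % TILE_COLS == 0);
    for (size_t col = 0; col < cols; ++col)
        for (size_t rb = 0; rb < rows / TILE_ROWS; ++rb)
            read_vector(&m, rb, col);
    return m;
}

/* Consume every column of a tile before moving to the next tile. */
static tile_model walk_tile_by_tile(size_t rows, size_t cols)
{
    tile_model m = {-1, -1, 0, 0};
    assert(rows % TILE_ROWS == 0 && cols % TILE_COLS == 0);
    for (size_t rb = 0; rb < rows / TILE_ROWS; ++rb)
        for (size_t col = 0; col < cols; ++col)
            read_vector(&m, rb, col);
    return m;
}
```

Under these assumptions the column-major walk of a 24x24 array issues 8 row fetches per vector, while the tile-by-tile walk issues 1 per vector, which is where the 700% figure comes from.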

3) Here's some brainstorming for your reversed address step problem assuming that you are going to do the algorithm and immediately store the result:

The final operation is a VSTBITRDW.

Length = 8: need one vector: 01234567

Length = 16: goal is to create two vectors:

02468ACE
13579BDF

Length = 32: goal is to create four vectors:

0 4 8 C 10 14 18 1C
1 5 9 D 11 15 19 1D
2 6 A E 12 16 1A 1E
3 7 B F 13 17 1B 1F

etc...

To create those N vectors you can read in N vectors from SE and then use permute/swap operations or AND/OR masking to create the final vectors before calling VSTBITRDW to write the results back into memory. I don't know if that is optimal, but it's the idea that came to me.
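
The decomposition can be checked on a host with plain C. The sketch below assumes 8 elements per vector and models the final store as placing lane j at position rev3(j) within the vector; that is a stand-in for VSTBITRDW, not its documented behaviour, and the function names are hypothetical. One detail it makes explicit: for lengths above 16 the strided vectors have to be stored in bit-reversed block order (for length 32, the vectors starting at 0, 2, 1, 3).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8
#define LANE_BITS 3

/* Reverse the lowest `bits` bits of x. */
static size_t bitrev(size_t x, unsigned bits)
{
    size_t r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

/* log2 of a power of two. */
static unsigned log2_exact(size_t n)
{
    unsigned bits = 0;
    assert(n != 0 && (n & (n - 1)) == 0);
    while ((n >> bits) != 1) ++bits;
    return bits;
}

/* Stand-in for the bit-reversed vector store: lane j lands at rev3(j). */
static void store_bitrev_lanes(const uint64_t v[LANES], uint64_t *dst)
{
    for (size_t j = 0; j < LANES; ++j)
        dst[bitrev(j, LANE_BITS)] = v[j];
}

/* Full bit-reversal reorder of n elements (n a power of two >= 8), built
 * from stride-(n/8) vectors followed by the 8-lane bit-reversed store. */
static void bitrev_reorder_blocked(const uint64_t *in, uint64_t *out, size_t n)
{
    assert(n >= LANES);
    size_t m = n / LANES;                        /* number of vectors */
    unsigned m_bits = log2_exact(n) - LANE_BITS;
    for (size_t block = 0; block < m; ++block) {
        size_t start = bitrev(block, m_bits);    /* which strided vector */
        uint64_t v[LANES];
        for (size_t j = 0; j < LANES; ++j)
            v[j] = in[start + j * m];            /* e.g. 0 4 8 C ... for n = 32 */
        store_bitrev_lanes(v, out + block * LANES);
    }
}
```

In a real kernel the strided gather would come from SE addressing or the permute/mask steps mentioned above; here it is a plain indexed load so the index arithmetic can be verified against a direct bit reversal.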

Regards,

Burton

0 Clément Fau over 4 years ago in reply to Burton Copeland

Prodigy 135 points

Burton,

Your answers are very clear.

I will potentially come back to you if I have any questions or remarks on the subject.

Thank you very much for your support, it helps us a lot in our work.

Clément
