DSPF_sp_fftSPxSP usage example and data packing

Victor Kazmirenko

Hello!

I need to implement a complex single-precision floating-point FFT on the C6670, and I am considering DSPLIB v3.1 for it. DSPLIB provides the DSPF_sp_fftSPxSP() function, but from the documentation that ships with DSPLIB it's hard for me to work out a usage example. I would appreciate a reference.

Another point is complex data packing. In another thread on e2e (http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/p/273221/957418.aspx#957418) I saw a clear statement that, for the complex-data intrinsics, the real part goes to the odd half and the imaginary part to the even half of the register or register pair. I need some more involved complex data manipulations, so I use the __float2_t container. This container places the imaginary part at the even address and the real part at the odd address.

I have experience with the integer FFT of DSPLIB on C64. That used a different packing: the real part was at the even address and the imaginary part at the odd address.

So my question is: what is the right order of input data for DSPF_sp_fftSPxSP()? Is it consistent with the intrinsics' input?

Thanks in advance.

over 11 years ago

0 Victor Kazmirenko over 11 years ago

Guru 13202 points

Well, a very similar DSPF_sp_fftSPxSP() function is described in SPRU657C and SPRA947A for C67x. The former clearly states:

Real values are stored in even word positions and imaginary values in odd positions.

So the DSPLIB input data packing is inconsistent with intrinsics like _complex_mpysp(). It looks like if I use the intrinsics myself I had better keep {Im, Re}, but when it comes to DSPLIB, one has to use {Re, Im}. Is that correct? Is there any plan to provide a DSPLIB that is consistent with the intrinsics' data packing?

0 Asheesh Bhardwaj over 11 years ago in reply to Victor Kazmirenko

TI__Expert 4680 points

The DSPF_sp_fftSPxSP function is the complex single-precision floating-point FFT implementation available in DSPLIB. It uses the _complex_mpysp intrinsic. An assembly implementation of this function is also available, which uses the CMPYSP and DADDSP instructions. The twiddle factors and the complex input are given to the function such that the real and imaginary results are generated appropriately. The _complex_mpysp intrinsic maps to CMPYSP followed by DADDSP, both of which are described in the user guide. Please refer to the description of these instructions.

http://www.ti.com/lit/ug/sprugh7/sprugh7.pdf

The CMPYSP instruction performs a complex multiply of two single-precision floating-point numbers, each held in a register pair, giving a 128-bit output. The details from the document are below.

The product of the lower word of src1 and the upper word of src2 is placed into dst_0.

The product of the lower word of src1 and the lower word of src2 is negated and placed into dst_1.

The product of the upper word of src1 and the lower word of src2 is placed into dst_2.

The product of the upper word of src1 and the upper word of src2 is placed into dst_3.

A test bench for the SP complex floating-point FFT function is provided with DSPLIB. The outputs of the C, optimized C and assembly implementations match in the test bench.
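For readers following along, the four products listed above can be modeled on a host PC in plain C. This is only a sketch: pair_t, quad_t and the model_* functions are hypothetical names, not part of DSPLIB or the compiler, and the DADDSP step is assumed to add the dst_3:dst_2 pair to the dst_1:dst_0 pair word by word.

```c
/* Hypothetical host-side model of a 64-bit register pair holding two floats:
   hi = upper (odd) register, lo = lower (even) register. */
typedef struct { float hi; float lo; } pair_t;

/* Hypothetical model of the 128-bit CMPYSP result: dst[0] is dst_0, ..., dst[3] is dst_3. */
typedef struct { float dst[4]; } quad_t;

/* The four products exactly as listed in the post above. */
static quad_t model_cmpysp(pair_t src1, pair_t src2)
{
    quad_t q;
    q.dst[0] = src1.lo * src2.hi;      /* lower(src1) * upper(src2)           */
    q.dst[1] = -(src1.lo * src2.lo);   /* lower(src1) * lower(src2), negated  */
    q.dst[2] = src1.hi * src2.lo;      /* upper(src1) * lower(src2)           */
    q.dst[3] = src1.hi * src2.hi;      /* upper(src1) * upper(src2)           */
    return q;
}

/* Assumed DADDSP step: (dst_3:dst_2) + (dst_1:dst_0), added word by word. */
static pair_t model_daddsp(quad_t q)
{
    pair_t r;
    r.hi = q.dst[3] + q.dst[1];        /* hi*hi - lo*lo */
    r.lo = q.dst[2] + q.dst[0];        /* hi*lo + lo*hi */
    return r;
}

/* Model of the CMPYSP + DADDSP sequence as a whole. */
static pair_t model_complex_mpysp(pair_t a, pair_t b)
{
    return model_daddsp(model_cmpysp(a, b));
}
```

With the real part in hi and the imaginary part in lo, the model gives hi = re1*re2 - im1*im2 and lo = re1*im2 + im1*re2, i.e. an ordinary complex product with the real part in the upper word. For example, a = {hi 2, lo 3} and b = {hi 4, lo 5} yield {hi -7, lo 22}.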

Regards

Asheesh

0 Victor Kazmirenko over 11 years ago in reply to Asheesh Bhardwaj

Guru 13202 points

Hello Asheesh,

Thank you for the feedback. However, my question was about a different matter.

The test bench and my experiments show that the DSPLIB implementation expects complex data in {real: even, imaginary: odd} order. At the same time, the complex intrinsics expect their input in {Im, Re} order. I don't claim DSPLIB works incorrectly; it works fine. What I am saying is that if I have to use the FFT together with other complex computations, I have to reorder the Re/Im packing between them.

That is the point of inconvenience. Perhaps I am missing something in the complex data design; please advise.

I've noticed there is an integer version, DSP_fft16x16_imre, for just such a case. May I suggest providing a similar routine for floating-point data as well?

0 Asheesh Bhardwaj over 11 years ago in reply to Victor Kazmirenko

TI__Expert 4680 points

The C6x complex intrinsics take the real value in the high 32 bits of the 64-bit register pair and the imaginary value in the low 32 bits. You can check this using the _ftof2, _lof2 and _hif2 intrinsics.

The document mentioned earlier in this thread also describes how the data is arranged in the 128-bit quad register for the CMPYSP instruction. Please read the instruction descriptions in that document. All the instructions are consistent, which allows you to write all your code in the same data format. If you follow them in your code, you will not see any difference in data layout between different pieces of DSP code.

Here is the excerpt from the document describing what dst_0, dst_1, dst_2 and dst_3 mean in my previous post.

dst_0 or src_0: 32-bit value in the least-significant position in the 128-bit quad register
dst_1 or src_1: 32-bit value in the next to least-significant 32-bit word position in the 128-bit quad register
dst_2 or src_2: 32-bit value in the next to most-significant 32-bit word position in the 128-bit quad register
dst_3 or src_3: 32-bit value in the most-significant position in the 128-bit quad register

There are other forum posts you may want to refer to for more on complex multiply operations.

http://e2e.ti.com/support/development_tools/compiler/f/343/p/210329/744038.aspx#744038

http://e2e.ti.com/support/development_tools/compiler/f/343/t/192618.aspx

Regards,

Asheesh

0 Victor Kazmirenko over 11 years ago in reply to Asheesh Bhardwaj

Guru 13202 points

Hello Asheesh,

Please forgive me for continuing this discussion, but I still don't have a clear picture.

What you said about the ordering for the intrinsics is not in dispute. Let me put it briefly:

1) complex intrinsics use im/re packing

2) DSPLIB uses re/im packing

If I have an im/re packed vector, how do I pass it to DSPLIB? Please see the following code skeleton:

__float2_t x[N], c[N], y[N];


for (i=0; i<N; i++)
{
    x[i] = _fdmv_f2( x_re[i], x_im[i]);
    c[i] = _fdmv_f2( c_re[i], c_im[i]);
}

for (i=0; i<N; i++)
{
    y[i] = _complex_conjugate_mpysp( c[i], x[i] );
    // Ok here, intrinsic consumes im/re packed dword.
}

DSPF_sp_fftSPxSP( N, y, // problem here, DSPLIB uses re/im packing

The complex multiplication in the loop is fine, because x, c and y all use im/re packing, but the FFT does not: it expects re/im packed data. So if I want to run the FFT over y, I have to swap its real and imaginary parts.

If I am wrong, please show me how to fix the snippet above.

Thanks in advance.
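The swap described in this post can be sketched in plain C as follows. This is a minimal illustration only: swap_imre_to_reim is a hypothetical helper, not a DSPLIB routine, and it assumes the buffer holds n complex samples as consecutive floats in memory order im0, re0, im1, re1, ...

```c
#include <stddef.h>

/* Hypothetical helper (not part of DSPLIB): rewrites n complex samples stored
   as im0, re0, im1, re1, ... in place as re0, im0, re1, im1, ... */
static void swap_imre_to_reim(float *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float tmp      = buf[2 * i];       /* imaginary part, even position */
        buf[2 * i]     = buf[2 * i + 1];   /* real part moves to even       */
        buf[2 * i + 1] = tmp;              /* imaginary part moves to odd   */
    }
}
```

Applied to the buffer {3, 2, 5, 4}, which holds 2+3i and 4+5i in im/re order, the helper leaves {2, 3, 4, 5}. On the DSP this is an extra pass over the data, which is exactly the overhead the post is asking how to avoid.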

0 Asheesh Bhardwaj over 11 years ago in reply to Victor Kazmirenko

TI__Expert 4680 points

No swap between real and imaginary data happens inside the DSP library function; otherwise the result of the _complex_mpysp intrinsic would be wrong. The _complex_mpysp intrinsic is used inside the FFT functions without any change of packing format.

Use the intrinsic the way it is used in DSPLIB.

Read the values from memory into a register pair:

x_01 = _amem8_f2(&x[0]);

then pass the register pair to the _complex_mpysp intrinsic:

xl2_2o_xl2_3o = _complex_mpysp(co31_si31, xt2_1_yt2_1);

and store the result to memory:

_amem8_f2(&x2[l2+2]) = xl2_2o_xl2_3o;

Use the CCS simulator to view the results in registers and memory.

Regards

Asheesh

0 Victor Kazmirenko over 11 years ago in reply to Asheesh Bhardwaj

Guru 13202 points

I am terribly sorry to bring this topic up again, but the proposed answer is not an answer at all. I have a feeling that my question was not understood.

Please answer two questions clearly:

What is the memory packing of complex SP floating-point data for DSPLIB?
What is the memory packing of complex SP floating-point data for the intrinsics?

I have my own answers, but they could be wrong, so please correct me.

For DSPLIB, there is no C66x-specific user manual, and the help provided within CCS is far from adequate. The best I found was SPRU657C, TMS320C67x DSP Library Programmer's Reference Guide. On page 4-24 it clearly states:

Real values are stored in even word positions and imaginary values in odd positions.

So, for DSPLIB the memory packing is re0-im0-re1-im1...

Now consider the second question. Another forum thread, http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/p/273221/957418.aspx#957418, says that

Internally the device's complex instructions always have the real part in the upper half and the imaginary part in the lower half, of a double size unit, be it fixed or single precision floating point.

So if I make a __float2_t* pointer to a sequence of floats, the memory packing should be im0-re0-im1-re1.

Please tell me where I am wrong.

Thanks in advance.

0 Asheesh Bhardwaj over 11 years ago in reply to Victor Kazmirenko

TI__Expert 4680 points

It is consistently Real and Imaginary format, not Imaginary and Real. If you look at the registers where the values are stored, it will be clearer. The values go into a register pair like A3 (upper half) - Real, A4 (lower half) - Imaginary.

The link you referred to also says real in the upper half and imaginary in the lower half, which is the same as the description of the CMPYSP instruction in the document referenced above.

If the FFT implementation is too complex to follow, look at the DSPF_sp_fir_cplx implementation. It also uses the _complex_mpysp intrinsic.

Regards,

Asheesh

0 Victor Kazmirenko over 11 years ago in reply to Asheesh Bhardwaj

Guru 13202 points

Hello Asheesh,

Thank you for your patience. I feel that my knowledge and the documentation are not sufficient to resolve my doubt.

As you suggested, I have created a simple test for the simulator. I found the document SPRABG7, Optimizing Loops on the C66x DSP Application Report (November 2010), where page 14 has a usage example that builds a register pair: _fod(real, imag). However, the _fod() intrinsic is not found by my compiler (7.4.6), so I hope _fdmv_f2() is the right intrinsic to use. Please clarify _fod() vs _fdmv_f2().

I created a float array to use as storage for a __float2_t. See the following snippet:

    float ma[2];
    __float2_t *pa;

    pa = (__float2_t *) ma;
    *pa = _fdmv_f2( 2, 3 );

I checked beforehand that 2.0 is represented in a 32-bit register as 0x4000_0000, 3.0 as 0x4040_0000, 4.0 as 0x4080_0000 and 5.0 as 0x40a0_0000. When I execute *pa = _fdmv_f2( 2, 3 ); I expect 2 to be the real part and 3 the imaginary part. After the statement executes, the memory location changes to 40400000 40000000. As you can see, 3.0 went to the lower address and 2.0 to the higher address. In other words, the imaginary part (3.0) is stored first, followed by the real part (2.0). So the ordering is ImRe.

The only other possibility I see is that _fdmv_f2() expects the imaginary part as the first argument and the real part as the second, i.e. _fdmv_f2(im, re). Could you please clarify the usage of _fdmv_f2()?
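The observation above can be reproduced on a host PC with a small model. This is a sketch under stated assumptions: model_pack and model_store_le are hypothetical helpers, the first argument is assumed to land in the upper 32 bits of the pair (as the memory dump suggests for _fdmv_f2), floats are assumed to be IEEE-754 single precision, and the little-endian store is modeled explicitly so the result does not depend on the host's own endianness.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 64-bit register pair: `hi` goes to the upper
   32 bits, `lo` to the lower 32 bits (assumed _fdmv_f2 argument order). */
static uint64_t model_pack(float hi, float lo)
{
    uint32_t h, l;
    memcpy(&h, &hi, sizeof h);   /* reinterpret the float bits, no conversion */
    memcpy(&l, &lo, sizeof l);
    return ((uint64_t)h << 32) | l;
}

/* Little-endian 64-bit store, modeled word by word:
   the low word lands at the lower address (mem[0]). */
static void model_store_le(uint64_t v, uint32_t mem[2])
{
    mem[0] = (uint32_t)(v & 0xFFFFFFFFu);
    mem[1] = (uint32_t)(v >> 32);
}
```

Storing model_pack(2.0f, 3.0f) this way gives mem[0] = 0x40400000 (3.0) and mem[1] = 0x40000000 (2.0), which matches the memory dump described above: the value held in the upper half of the pair ends up at the higher address in little-endian mode.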

Thanks.

0 Victor Kazmirenko over 11 years ago in reply to Victor Kazmirenko

Guru 13202 points

Hello again!

I have improved my test and now attach the project and a screenshot. I ran the following snippet:

    __float2_t  a, b, y;
    float ar, ai;

    a = _fdmv_f2( 2, 3 );
    b = _fdmv_f2( 4, 5 );
    y = _complex_mpysp( a, b );

So a=2+3i, b=4+5i, and their product is y=a*b=-7+22i. As one can see in the attached screenshot, 22 went to a lower address than -7. So this experiment confirms what I claimed when I started the thread: the complex intrinsics place data in ImRe order.

And the initial question remains: the intrinsics use ImRe ordering, while the FFT in DSPLIB is written for ReIm ordering. How do I interface them correctly?

5734.hello_GenericC66xxDevice.zip

0 Asheesh Bhardwaj over 11 years ago in reply to Victor Kazmirenko

TI__Expert 4680 points

Basically, the ordering is correct for data in Real and Imaginary format. With this order, the complex multiply can be used directly in big-endian mode. You are using little-endian mode, where the complex multiply can still be used; you just need to negate the imaginary part of the input or output, together with a swap. You can refer to the intrinsic implementation of the FFT in DSPLIB, which has both little-endian and big-endian code. All the library functions that use complex arithmetic in little-endian mode are tuned so that the data ordering is correct. Just look at the complex FIR example, which rearranges the data after negating the imaginary part so that the intrinsic can be used.

Little Endian

_amem8_f2(&r[i]) = _ftof2(_lof2(sum1), -_hif2(sum1));

Big Endian

_amem8_f2(&r[i]) = sum1;
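To see why the little-endian line works, here is a host-side sketch in plain C. All names are hypothetical models rather than TI intrinsics, and it assumes that _ftof2 takes (upper, lower) arguments and that the complex multiply treats the upper word as the real part. Data sits in memory as re, im; a little-endian 64-bit load therefore puts re in the lower word and im in the upper word.

```c
/* Hypothetical model of a register pair: hi = upper word, lo = lower word. */
typedef struct { float hi; float lo; } pair_t;

/* Little-endian 64-bit load/store: the lower address maps to the lower word. */
static pair_t model_load_le(const float *m)
{
    pair_t p = { m[1], m[0] };
    return p;
}

static void model_store_le(float *m, pair_t p)
{
    m[0] = p.lo;
    m[1] = p.hi;
}

/* Complex multiply as the hardware is described: real in hi, imaginary in lo. */
static pair_t model_complex_mpysp(pair_t a, pair_t b)
{
    pair_t r;
    r.hi = a.hi * b.hi - a.lo * b.lo;
    r.lo = a.hi * b.lo + a.lo * b.hi;
    return r;
}

/* Model of _ftof2(_lof2(s), -_hif2(s)): swap the halves, negate the old upper word. */
static pair_t model_fixup_le(pair_t s)
{
    pair_t r = { s.lo, -s.hi };
    return r;
}

/* Multiply two complex numbers stored as {re, im} in little-endian memory. */
static void model_cmul_reim_le(const float *a, const float *b, float *out)
{
    pair_t prod = model_complex_mpysp(model_load_le(a), model_load_le(b));
    model_store_le(out, model_fixup_le(prod));
}
```

For a = {2, 3} and b = {4, 5} in re/im memory order, the raw multiply produces {hi 7, lo 22}: the imaginary part is right but sits in the lower word, and the upper word holds the negated real part. The fixup turns that into {hi 22, lo -7}, which stores as {-7, 22}, i.e. -7+22i in re/im order.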

Regards,

Asheesh

0 Victor Kazmirenko over 11 years ago in reply to Asheesh Bhardwaj

Guru 13202 points

Hello Asheesh,

Finally, we have confirmed what was stated at the very beginning of the thread: when the DSP runs in LE mode, a given complex data packing suits either DSPLIB or the intrinsics, but not both. I know I could fix that with extra instructions, but then why would I use intrinsics? Just for information, here is the assembly of the complex multiply followed by the swap with negation:

24            y = _complex_mpysp( a, b );
00808508:   033C43E6            LDDW.D2T2     *+B15[2],B7:B6
0080850c:   023C63E6            LDDW.D2T2     *+B15[3],B5:B4
00808510:   00006000            NOP           4
00808514:   1210CF02            CMPYSP.M2       B7:B6,B5:B4,B7:B6:B5:B4
00808518:   00004000            NOP           3
0080851c:   1210C79A            DADDSP.L2       B7:B6,B5:B4,B5:B4
00808520:   00002000            NOP           2
00808524:   023C83C6            STDW.D2T2     B5:B4,*+B15[4]
25            p = _ftof2(_lof2(y), -_hif2(y));
00808528:   033C83E6            LDDW.D2T2     *+B15[4],B7:B6
0080852c:   05A6                MVK.L1        0,A3
0080852e:   F9A2                SET.S1        A3,31,31,A3
00808530:   2C6E                NOP           2
00808532:   E347                MV.L2         B6,B7
00808534:   030CB2E2 ||         XOR.S2X       B5,A3,B6
00808538:   033CA3C6            STDW.D2T2     B7:B6,*+B15[5]
0080853c:   E3000200            .fphead       n, l, W, BU, nobr, nosat, 0011000b

I see 14 cycles in the complex multiply itself and at least 8 in the swap with negation. Don't you feel that's too expensive?

More importantly for me, is there a plan to implement an ImRe version of DSPLIB or, at least, of the SP FFT, similar to DSP_fft16x16_imre?

Thanks.

0 Asheesh Bhardwaj over 11 years ago in reply to Victor Kazmirenko

TI__Expert 4680 points

Please refer to the pipelining and optimization guide for the DSP. The operation will always be in a loop, and the cycles are hidden in the pipeline. The negation and swap need not be done every time. Also, you can always combine the operation with other operations. Refer to the FFT and FIR functions in DSPLIB for different optimization techniques.

Regards,

Asheesh

0 Victor Kazmirenko over 11 years ago in reply to Asheesh Bhardwaj

Guru 13202 points

Hello Asheesh,

I agree that pipelining may reduce the penalty of the extra instructions. On the other hand, nothing comes for free: a larger loop would require more registers, and so on. I will not argue about that any more.

I have a proposal to close this thread.

Could you please file an enhancement request for DSPLIB to support both ReIm and ImRe ordering?

Thanks.

BTW, I've just discovered that FFTC uses ImRe ordering.

0 Asheesh Bhardwaj over 11 years ago in reply to Victor Kazmirenko

TI__Expert 4680 points

You can continue to unroll the loop until you reach the optimization limit, depending on the loop requirements and the functional units the loop needs. Once you are limited by register pressure, you can split the loop. Again, refer to the optimization guide.

Real/Imaginary is the natural order followed by all the DSP libraries and is widely used.

Regards,

Asheesh
