C66 version of DSP_ifft16x16_imre known issues?

Christopher Peters

Other Parts Discussed in Thread: AM5728

I am wondering if there are any known issues with this function in the DSPLIB V340 library.

I am upgrading from a 6482-based system running DSP/BIOS to an AM5728 DSP core based system running SYS/BIOS.

I have existing unit test functions that all pass on the 6482, but I have come across a test failing on the AM5728 that I can't quite figure out.

The 6482 is using DSPLIB V100. Yeah, pretty old, I know.

I have narrowed it down to the function DSP_ifft16x16_imre from DSPLIB V340 C66 version.

I know that the twiddle factors have changed over the various versions so I have been tracking that also.

I have dumped the inputs and outputs and I see that my twiddle factors are doing what is expected and my input vector is constant, but my output changes for the C66 V340 version.

Here is a detailed trace of what I have tested.

C6482 running DSPLIB V100 is my baseline. I took the source code from V100 and compiled it on the AM5728 DSP and it worked.

On the AM5728 I then did this progression using the files from the DSPLIB V340 source code.

ifft     twiddle  output match? Unit Test Pass?
C64p/cn  C64p     Yes           Yes
C64p/i   C64p     Yes           Yes
C66/cn   C66      Yes           Yes
C66/i    C66      No            No

cn is the c natural version, i is the processor intrinsic version.

For each test case above I copied the source code to my project and rebuilt the code, changing the function names so I was 100% sure I was using the correct code and not some library code by default.

As you can see, the only difference that causes it to fail is the version I want to use, C66 Intrinsic.

More details on my config:

Compiler v8.1.0
DSPLIB C66x 3.4.0.0
XDCtools 3.32.0.06_core
am57xx PDK 1.0.2

over 7 years ago

0 Christopher Peters over 7 years ago

Genius 3370 points

I should have mentioned I am doing a 2048 point IFFT. I have checked that all my vectors are on Int64 boundaries. Comparing the C66 version of DSP_ifft16x16_imre.c with DSP_ifft16x16_imre_cn.c shows quite a bit of difference in the overall approach to the FFT, mainly the radix 8 usage in the intrinsic version.

I've now noticed that in the "working" versions, C64p _cn and _i and C66 _cn version, the radix is 2 or 4. Determined like this:
radix = _norm(npoints) & 1 ? 2 : 4;

However, in the non-working C66 intrinsic version it is this
radix = _norm(npoints) & 1 ? 8 : 4;

I have to wonder if that change and the 2048 point IFFT combine to be a problem.
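To check which branch a 2048-point transform actually takes, here is a minimal host-side sketch. `norm32` is a hypothetical portable stand-in for the TI `_norm` intrinsic (count of redundant sign bits) and only handles positive inputs; `radix_c64p` and `radix_c66i` are illustrative names wrapping the two expressions quoted above, not DSPLIB functions.

```c
#include <stdint.h>

/* Hypothetical stand-in for the TI _norm intrinsic: number of
 * redundant sign bits in a 32-bit value. This sketch only handles
 * x > 0, which is all an FFT size needs. */
int norm32(int32_t x)
{
    int n = 0;
    uint32_t u = (uint32_t)x;

    if (x <= 0) {
        return -1; /* not handled in this sketch */
    }
    /* Bit 31 is the sign bit; shift until bit 30 is set. */
    while ((u & 0x40000000u) == 0u) {
        u <<= 1;
        n++;
    }
    return n;
}

/* Radix selection as quoted from the C64p and C66 natural C versions. */
int radix_c64p(int32_t npoints)
{
    return (norm32(npoints) & 1) ? 2 : 4;
}

/* Radix selection as quoted from the C66 intrinsic version. */
int radix_c66i(int32_t npoints)
{
    return (norm32(npoints) & 1) ? 8 : 4;
}
```

For npoints = 2048 (2^11), `norm32` returns 19, which is odd, so the first expression gives 2 and the second gives 8. For a power of 4 such as 1024, both give 4. So a 2048-point IFFT does land on the branch where the two implementations differ.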

0 ToddMullanix over 7 years ago in reply to Christopher Peters

TI__Guru* 96960 points

I've moved your thread to the device forum.

Todd

0 ran35366 over 7 years ago in reply to ToddMullanix

TI__Genius 12805 points

Christopher, this is a very interesting question.

First I compared the ifft function in release dsplib_c66X_3_4_0_0 for AM572X to the same function in the same release for C667X, and they are identical. So if you have an issue, we have it across platforms. Thus I do not think that we have an issue in the optimized code itself, but I may be wrong.

I need your help. First, look at the output and check whether there is a pattern in the error. That is, when you compare the results of the ifft to the expected results, do you see a pattern, for example the imaginary part and the real part being switched?

Next try to call FFT and ifft in cascade, that is, the ifft processes the fft results. Do it for the natural code and the optimized code, and report the results.

Then generate a single sine wave, do the FFT and then the ifft (again, in natural C and in the optimized code), and report the results.

Waiting to hear back from you

Regards

Ran

0 Christopher Peters over 7 years ago in reply to ran35366

Genius 3370 points

The result of a different issue, https://e2e.ti.com/support/development_tools/compiler/f/343/t/527232, led me down a different path with this issue.

Looking more deeply at the test failure, we were finding the peak of a correlation at a slightly different spot. Plotting the correlation and comparing the C64p and C66 implementations showed only very small differences in the magnitude but enough at the peak to skew the result a few bins. So, I dug even deeper.

One thing that is causing a difference is that the calculation of the twiddle factors seems to change from DSPLIB V100 to C64p to C66. Most notable is the addition of the line

d = floor(0.5 + d); // Explicit rounding to integer //

to the d2s() function in the gen_twiddle functions for ifft and fft. I removed that line and re-ran my tests. My test passed, but I could see the magnitude response was not an exact match.

Digging deeper into the values of the twiddle factors, I expected to see maybe the same values but in a different order. Well, the DSP V100 twiddle factors seem to exactly match (order and value) with the C64p twiddle factors. However, the C66 twiddle factors were further optimized and only 1/3 of the table is generated, so comparing the absolute values is very difficult. I did confirm that every value that is in the C66 table has an exact match in the C64p table. The C66 table has no negative values, so I am assuming that is handled in the ifft code somehow.

So, while I can modify the C66 code and get my test to pass, that is certainly not desired. I feel like the new C66 iFFT function is not a bit-exact match to the C64p or DSPLIB V100 versions. The difference is very slight though. Here is an example correlation showing the working (C64p) result in green and the working but not quite right (C66) result in red.

At this point, I am not sure what more I can do. It just seems like the C66 implementation is not an exact bit match to the C64p implementation.

0 Christopher Peters over 7 years ago in reply to Christopher Peters

Genius 3370 points

Here is another plot. It shows in green the "working" C64p DSP_ifft16x16_imre implementation. In blue, almost exactly covered by green, is the C66 natural C result, and in red is the C66 intrinsic result. I think this shows that the difference is not caused by the twiddle factors, since they are the same for the red and blue results. It does show that the C66 intrinsic and CN versions do not produce the same results.

0 ran35366 over 7 years ago in reply to Christopher Peters

TI__Genius 12805 points

Thank you very much. I think that we understand what you see, and we can close the thread.

Ran

0 Christopher Peters over 7 years ago in reply to ran35366

Genius 3370 points

I am confused by this reply. Are you agreeing with me that there does seem to be a problem with this function? Do you have a work around that would allow me to use the C66 library? Will a bug report be filed and this fixed in a later release?

0 ran35366 over 7 years ago in reply to Christopher Peters

TI__Genius 12805 points

What I am trying to say is that the C66 results may indeed not be bit-exact with the C64p algorithm. The functional units of the C66 are different and there is no 40-bit accumulator. To get more than 32 bits, one has to use a pair of registers.

I wonder which one is more accurate. Take your input, convert it to double precision floating point, run the IFFT (you can use MatLab, Visual C, any other compiler with double precision support, or even the C66), and compare the results with the results of the C64p and C66. Calculate the maximum error and some norm (like mean square) of the error in both cases, treating the double precision result as the golden standard.

As you know, fixed point calculation of a large FFT suffers from truncation errors. You can also try single precision floating point and compare it to the golden standard. Convert your 16-bit input to floating point, calculate the IFFT, and compare the results and the performance.
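A minimal sketch of that comparison in C, assuming the double precision reference has already been computed elsewhere. The names `err_stats` and `compare_to_reference` and the `scale` parameter are hypothetical; `scale` is there only to undo whatever scaling the fixed point routine applies to its output.

```c
#include <stddef.h>

/* Error summary against a double precision reference. */
typedef struct {
    double max_abs;  /* largest absolute error */
    double mse;      /* mean squared error */
} err_stats;

/* Compare n fixed point output samples with a double precision
 * reference. Each fixed point sample is multiplied by 'scale'
 * before the comparison. */
err_stats compare_to_reference(const short *fixed, const double *ref,
                               size_t n, double scale)
{
    err_stats s = { 0.0, 0.0 };
    double sum_sq = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        double e = (double)fixed[i] * scale - ref[i];
        double a = (e < 0.0) ? -e : e;
        if (a > s.max_abs) {
            s.max_abs = a;
        }
        sum_sq += e * e;
    }
    if (n > 0) {
        s.mse = sum_sq / (double)n;
    }
    return s;
}
```

Running this once on the C64p output and once on the C66 output, against the same reference, gives two pairs of numbers that can be compared directly.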

Please report back

Best Regards

Ran

0 Christopher Peters over 7 years ago in reply to ran35366

Genius 3370 points

Thank you for the detailed explanation. I understand your point. It is interesting that the C64p intrinsic implementation works just fine on the C66, but that might be at the expense of speed. A degradation in quality for an increase in speed might be a good trade. For now, I am going to use the C64p implementation until we get things up and running, then try the C66 to see if our overall performance remains the same.

I did compare the ifft result using Matlab as a baseline. As expected the Matlab result is smoother than the C64p result which is smoother than the C66 result. No surprises there.
