FFT to AM3359?

Bogi Magnussen

Other Parts Discussed in Thread: AM3359, AM3358

Do any of the TI dsp/fft libraries work for AM3359?

We are testing using the beaglebone and trying to compile the FFTW source code into a library using Code Composer Studio, but it is not optimal.

Thanks

over 12 years ago

0 Joe G. over 12 years ago

TI__Mastermind 25886 points

We "borrowed" the C674x DSP-based DSPF_sp_biquad (Biquad Filter) from http://processors.wiki.ti.com/index.php/C674x_DSPLIB and have built and run it on AM335x in both Starterware and Linux.

Figured "why not" since they have test vectors provided and could compare with DSP benchmarks.

We even optimized for cycles somewhat in floating point on Linux using the Neon processing unit:

http://processors.wiki.ti.com/index.php/Cortex-A8_Architecture#Cortex-A8_Pipeline_Diagram

http://processors.wiki.ti.com/index.php/Cortex-A8#Neon_and_VFP_both_support_floating_point.2C_which_should_I_use.3F

0 Bogi Magnussen over 12 years ago in reply to Joe G.

Prodigy 250 points

Hi Joe

Thanks for your reply, got it working pretty quickly.

However, it only works with DSPF_sp_fftSPxSP_cn. It seems that I can't use DSPF_sp_fftSPxSP from the library (maybe I am not linking correctly, or maybe I can't use the library file?).

A 256 point complex FFT with DSPF_sp_fftSPxSP_cn takes about 27 ms on the beaglebone, pretty slow I think.

0 Joe G. over 12 years ago in reply to Bogi Magnussen

TI__Mastermind 25886 points

DSPF_sp_fftSPxSP.asm is C674x optimized assembly code that will not run on Beaglebone.

But did you try compiling with the Neon flags in the Wiki?

On the biquad we started with those flags and saw much improvement, and got pretty close to the theoretical best case by using Neon intrinsics.
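For readers who have not used Neon intrinsics, here is a minimal sketch of the style. It is not the biquad code referred to above: scaled_accumulate is a hypothetical helper that computes y[i] += a * x[i] four floats at a time, with a scalar fallback so that it also builds where Neon is unavailable. With GCC the Neon path is typically enabled by options such as -mcpu=cortex-a8 -mfpu=neon plus a matching -mfloat-abi setting. The exact flags listed on the wiki are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: y[i] += a * x[i] for i in [0, n). */
static void scaled_accumulate(float *y, const float *x, float a, size_t n)
{
    assert(y != NULL && x != NULL);
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Neon path: process four single-precision floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t vy = vld1q_f32(y + i);
        float32x4_t vx = vld1q_f32(x + i);
        vy = vmlaq_n_f32(vy, vx, a);   /* vy = vy + vx * a */
        vst1q_f32(y + i, vy);
    }
#endif

    /* Scalar tail, and the whole loop when Neon is not available. */
    for (; i < n; i++)
        y[i] += a * x[i];
}
```

The scalar tail handles lengths that are not a multiple of four, so callers do not need to pad their buffers.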

Also are you running in CCSv5 or Linux?

We found the Linux numbers were much better "out of the box" since the O/S optimized memory/cache settings.

It can be done as a CCSv5 project from scratch, but that takes some work (and we did not try; it was easier to debug/optimize the Linux build from CCSv5).

0 Bogi Magnussen over 12 years ago in reply to Joe G.

Prodigy 250 points

Thanks for the hint. I did set the neon flag in the compiler, but that didn't help. This is because the Neon module is disabled on reset, so I just need to figure out how to enable it and then I will report my findings.

0 Joe G. over 12 years ago in reply to Bogi Magnussen

TI__Mastermind 25886 points

You can enable Neon in CCSv5 from an "Advanced ARM options" dialog box, or something like that.

Easiest way to get to it is to find the "GEL Files View" and click through the list (I usually right click on the Cortex-A8 in the Debug Window once you are connected).

But your beaglebone benchmark above sounds really slow. My guess is that the ARM caches are not enabled sufficiently.

Can you build/run this in Linux as a quick test?

Or I do see a CacheEnable() function in Starterware. Have you tried it?

Just don't have a comparison to tell you how well it works.

0 Jeff L over 12 years ago in reply to Joe G.

TI__Expert 5960 points

Hi Bogi,

Some time ago, I was using Starterware. In order to get good performance you have to enable the MMU. L2 cache will not work if the MMU is not enabled. There were some Starterware projects which enabled the MMU and L2 cache, but I don't recall which ones. I just picked one that enabled flat memory mapping for the MMU. Finally, I wrote a small function to enable NEON. At the time, I could not find any way to do this other than CCSv5, which does have a method to enable NEON, as mentioned by Joe G.

Here is the NEON function I used:

enable_neon:
    MRC  p15, #0, r1, c1, c0, #2   ; r1 = Access Control Register
    ORR  r1, r1, #(0xf << 20)      ; enable full access for p10,11
    MCR  p15, #0, r1, c1, c0, #2   ; Access Control Register = r1
    MOV  r1, #0
    MCR  p15, #0, r1, c7, c5, #4   ; flush prefetch buffer because of FMXR below
                                   ; and CP 10 & 11 were only just enabled
    ; Enable VFP itself
    MOV  r0, #0x40000000
    FMXR FPEXC, r0                 ; FPEXC = r0
    bx   lr
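The two constants in the routine map onto documented register fields: bits 20 to 23 of the Coprocessor Access Control Register grant full access to cp10 and cp11, and bit 30 of FPEXC is the enable bit. The C sketch below only spells out that arithmetic so the magic numbers can be checked. The macro and function names are made up for illustration, and it does not touch any hardware register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the values match the assembly routine above. */
#define CPACR_CP10_CP11_FULL (0xFu << 20)  /* cp10 and cp11: full access */
#define FPEXC_EN             (1u << 30)    /* VFP/NEON enable bit        */

static_assert(CPACR_CP10_CP11_FULL == 0x00F00000u, "CPACR mask mismatch");
static_assert(FPEXC_EN == 0x40000000u, "FPEXC enable bit mismatch");

/* Value the routine writes back to the Access Control Register. */
static uint32_t cpacr_with_neon_access(uint32_t cpacr)
{
    return cpacr | CPACR_CP10_CP11_FULL;
}

/* Nonzero when both register values indicate NEON/VFP is usable. */
static int neon_is_enabled(uint32_t cpacr, uint32_t fpexc)
{
    return (cpacr & CPACR_CP10_CP11_FULL) == CPACR_CP10_CP11_FULL
        && (fpexc & FPEXC_EN) != 0;
}
```

From C, the assembly routine itself would presumably be declared as void enable_neon(void) and called once at startup before any floating-point code runs. That prototype is an assumption, not something stated in the post.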

Please post anything you find with FFTW, we are also interested in how well it performs.

0 Mark Gregory over 9 years ago in reply to Jeff L

Prodigy 60 points

Hi Jeff,

I have been running FFTW on the beaglebone green running Linux with neon enabled. Currently cross compiling in Visual Studio using VisualGDB plugin. Here are some of the results I have acquired for various FFT lengths:

/////////////////////////////////////////////////////////////////////////////////
//The following FFT lengths have exhaustive generated wisdom for AM3358 Cortex A8 (beaglebone green)
//Average times based on 5000 consecutive FFTs, rounded up to a hundredth of a ms.
// 20000 ~ 1.14ms Average
// 32768 ~ 1.97ms Average
// 100000 ~ 10.24ms Average
// 131072 ~ 12.43ms Average
//1000000 ~128.67ms Average
//1048576 ~127.65ms Average
/////////////////////////////////////////////////////////////////////////////////

BTW these FFTs are performed in real-to-complex format. Complex-to-complex takes a bit longer, but not by much. Also, FFTW was compiled for float because NEON on the AM3358 (Cortex-A8) supports only single precision, not double.

For comparison, FFTW on my Surface Pro 3 took ~1 ms for a 100000-point FFT, so the beaglebone was about 10x slower.

Regards,

Mark
