TMS320C6748: options for sampling at 20 MHz

Robert Wolfe

Part Number: TMS320C6748
Other Parts Discussed in Thread: OMAP-L138

Hello,

I need to sample an RF signal at 20 MHz, and am wondering if a TMS320C6748 (via OMAP-L138) is going to be adequate by itself, or whether external logic (an FPGA, etc.) is required. What would be the options for sampling at this speed? Could the McASP be connected to an ADC to get 16-bit resolution at 20 MHz? (From what I have seen so far, maybe not.) If not, is the uPP the only other option? And there, I haven't seen an option for 16 bits at that speed (only lower resolutions: 12, maybe 14 bits).

Also, could the samples be streamed to the DDR2 memory in real-time, for processing?

Please advise,

Robert

over 6 years ago

0 Yordan Kovachev over 6 years ago

TI__Guru**** 161600 points

Hi Robert,

The uPP interface is an option to directly connect the ADC to the OMAP-L138/TMS320C6748 devices. The downside is that TI does not provide a driver for this interface. You should do the software yourself using the TRM & Datasheet.

Yes, it should be possible to place the samples to DDR directly.

Best Regards,
Yordan

0 Brad Griffis over 6 years ago

TI__Guru*** 125430 points

Robert_Wolfe said:
I need to sample an RF signal at 20 MHz,

Is that a continuously running 20 MHz, or does it, for example, turn on for 1 second to save a bunch of data to memory, which then gets processed? If that's an always-running 20 MHz signal, then I think you'll need an FPGA to at least do some kind of initial processing. Without one, you only get about 22 CPU cycles per sample to do your processing (e.g. at the max speed of 456 MHz), and that doesn't even account for task switching overhead and other tasks that need to run in the system. On the other hand, if you're just bursting a bunch of data into DDR and then post-processing it (e.g. more of a soft real time) then I think the C6748 would be a suitable option.
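
The per-sample budget mentioned above is simply the CPU clock divided by the sample rate. A minimal Python sketch of that arithmetic (the function name is only illustrative, and the result ignores all system overhead):

```python
def cycles_per_sample(cpu_hz: float, sample_hz: float) -> float:
    """CPU cycles available for each incoming sample, before any overhead."""
    if sample_hz <= 0:
        raise ValueError("sample_hz must be positive")
    return cpu_hz / sample_hz


# C6748 at its 456 MHz maximum, with a continuous 20 MHz sample stream.
budget = cycles_per_sample(456e6, 20e6)
print(f"{budget:.1f} CPU cycles per sample")
```

Anything that has to run on every sample (as opposed to once per block) must fit inside that figure, which is why per-sample processing on the DSP looks tight here.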

0 Robert Wolfe over 6 years ago in reply to Brad Griffis

Mastermind 7860 points

Thanks Brad. It is continuously running at 20 MHz. Would this equation change if the part were a C66 instead, i.e. could it handle the continuous 20 MHz stream to DDR plus post-processing (FFT, etc.)?

Robert

0 Robert Wolfe over 6 years ago in reply to Yordan Kovachev

Mastermind 7860 points

Thanks Yordan. Correct me if I'm wrong, but with the uPP you're limited to 16 bits, right? And even at 16 bits, can it not handle 20 MHz at that resolution?

Robert

0 Brad Griffis over 6 years ago in reply to Robert Wolfe

TI__Guru*** 125430 points

Robert,

I'm not sure what you mean about being "limited to 16 bits". You indicated previously that you were sampling 12-bit or 14-bit data, so I don't see any limitation there. The uPP can operate up to 75 MHz.

What processing needs to be done? Even if you were using a 1.2 GHz c66 device, that still only gives you 60 CPU cycles/sample to deal with the data. You could potentially look at a multicore c66 device to give you a performance multiplier. That's assuming you could "pipeline" your data, e.g. core 0 does some processing and then hands off to core 1 which does the next phase, etc.

Brad

0 Robert Wolfe over 6 years ago in reply to Brad Griffis

Mastermind 7860 points

Brad Griffis said:
Robert,

I'm not sure what you mean about being "limited to 16 bits". You indicated previously that you were sampling 12-bit or 14-bit data, so I don't see any limitation there. The uPP can operate up to 75 MHz.

I had seen some literature that suggested the resolution is inversely proportional to the speed, i.e. higher speeds via the uPP require lower resolution. But are you saying that the uPP can do 75 MHz, regardless of whether it is 14 or 16 bit?

Brad Griffis said:

What processing needs to be done?

Essentially, stream 20 MHz continuously to DDR, run FFT operations, then do a bit of post-processing on that data.

Brad Griffis said:

Even if you were using a 1.2 GHz c66 device, that still only gives you 60 CPU cycles/sample to deal with the data. You could potentially look at a multicore c66 device to give you a performance multiplier. That's assuming you could "pipeline" your data, e.g. core 0 does some processing and then hands off to core 1 which does the next phase, etc.

Ok, will look at multicore too.

A bit different question - is there a reason why we might want to handle this as DSP + FPGA, instead of just FPGA only? What is the advantage of having a DSP in the equation?

Thanks,

Robert

0 Brad Griffis over 6 years ago in reply to Robert Wolfe

TI__Guru*** 125430 points

Robert_Wolfe said:
I had seen some literature that suggested the resolution is inversely proportional to the speed, i.e. higher speeds via the uPP require lower resolution. But are you saying that the uPP can do 75 MHz, regardless of whether it is 14 or 16 bit?

The uPP speed is constrained only by SDR vs DDR mode. The 8- or 16-bit configuration option is independent.

Robert_Wolfe said:
A bit different question - is there a reason why we might want to handle this as DSP + FPGA, instead of just FPGA only? What is the advantage of having a DSP in the equation?

For many applications a DSP is generally more power efficient than FPGAs. Many people prefer C code to VHDL for programming. You also have flexibility for things like running a networking stack.

0 Robert Wolfe over 6 years ago in reply to Brad Griffis

Mastermind 7860 points

Thanks for the assist. Will close this one for now.

Robert

0 Robert Wolfe over 6 years ago in reply to Brad Griffis

Mastermind 7860 points

Hello,

As a quick follow-up, the hard specs are: 20 MHz in 16-bit samples, DMA to ping-pong buffers, then two sequential 1K FFTs. So it is only the stream to buffer happening at 20 MHz, via DMA ... hopefully that'd be minimal overhead. The FFTs would be done at 24 kHz, leaving some background time for other processing.

The overriding question is whether the OMAP-L138 could handle this without a hiccup. I guess that would mean the C6748, since I don't think the ARM portion would be up to doing FFT-type operations in time.

The real test would be to code it up as prototype, but any intuition or guess whether this is feasible would be appreciated.

Best,
Robert

0 Brad Griffis over 6 years ago in reply to Robert Wolfe

TI__Guru*** 125430 points

Robert_Wolfe said:
As a quick follow-up, the hard specs are: 20 MHz in 16-bit samples, DMA to ping-pong buffers, then two sequential 1K FFTs. So it is only the stream to buffer happening at 20 MHz, via DMA ... hopefully that'd be minimal overhead. The FFTs would be done at 24 kHz, leaving some background time for other processing.

This might be feasible. Here are some quick calculations...

So if you have a 24 kHz main processing loop running on a 456 MHz DSP, that gives you 19,000 CPU cycles per loop iteration to do the processing. That 19,000 cycles is the total number of cycles between successive runs of the loop, so the usable figure would actually be a little less once you account for other stuff like context switching, interprocessor communication, etc.
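
A minimal Python sketch of that budget calculation follows. The `overhead_fraction` parameter is a hypothetical knob for the context-switching and communication overhead mentioned above; its value is not known from this thread and would have to be measured.

```python
def cycles_per_iteration(cpu_hz: float, loop_hz: float,
                         overhead_fraction: float = 0.0) -> float:
    """Cycles left for processing in one pass of a fixed-rate loop.

    overhead_fraction is an assumed share of the budget lost to context
    switches, interprocessor communication, and similar (0.0 = none).
    """
    if not 0.0 <= overhead_fraction < 1.0:
        raise ValueError("overhead_fraction must be in [0, 1)")
    return cpu_hz / loop_hz * (1.0 - overhead_fraction)


# 456 MHz DSP, 24 kHz processing loop, no overhead assumed.
print(cycles_per_iteration(456e6, 24e3))
# Same loop with an illustrative (made-up) 10% overhead allowance.
print(cycles_per_iteration(456e6, 24e3, overhead_fraction=0.10))
```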

Is the 1k FFT the only thing you need to do? Would it be a fixed point FFT directly on the data, or do you need to convert to floating point first? For some fixed point FFT benchmarks I'm looking at the DSPLIB test report:

http://software-dl.ti.com/sdoemb/sdoemb_public_sw/dsplib/latest/exports/DSPLIB_C64Px_TestReport.html

DSP_fft16x16: Passed, 527 cycles (N=128), 1009 cycles (N=256)

So this is saying that a 128 point FFT took 527 cycles while a 256 point FFT took 1009 cycles. That's run on a simulator where everything is running with single cycle access (e.g. what you would expect to observe if you put all your code in L1P and all your data in L1D). So there are two things we would need to estimate:

What would the corresponding number be for 1024 point FFT?
What would the performance be if the data isn't kept in L1D?

In general FFT performance is proportional to N*log2(N), and also any function is going to have some fixed overhead to enter/exit the function. So I expect the performance of this FFT algorithm can be approximated as being A*N*log2(N) + B. So given the data points above, I would venture to guess A=1/2 and B=79. I used the first data point to determine those numbers. It doesn't perfectly fit the second data point, but it's within 100 cycles... So applying that to a 1024 point FFT, I would estimate you're in the neighborhood of 5200 cycles.
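
The estimate above can be reproduced with a few lines of Python. The model form and the values A=1/2 and B=79 are the guesses stated in this post, and the two measured points are the DSPLIB report figures quoted earlier; the function and variable names are just illustrative.

```python
import math


def fft_cycles(n: int, a: float = 0.5, b: float = 79.0) -> float:
    """Rough cycle estimate for an n-point FFT: a * n * log2(n) + b."""
    return a * n * math.log2(n) + b


# Cycle counts quoted from the DSPLIB test report for DSP_fft16x16.
measured = {128: 527, 256: 1009}

for n, cycles in measured.items():
    est = fft_cycles(n)
    print(f"N={n}: measured {cycles}, model {est:.0f}, error {est - cycles:+.0f}")

print(f"N=1024: model {fft_cycles(1024):.0f} cycles")
```

With only two data points the fit is very loose, so the 1024-point figure should be treated as an order-of-magnitude guide until it is benchmarked.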

If this is the only thing running, you could conceivably locate this data directly in L1D to achieve this performance. Otherwise, I recommend putting it in L2 SRAM. It would see only very small degradation given that your data buffers can fit entirely in the L1D cache (i.e. for a 1k FFT you won't be thrashing the cache).

So if you were to go down this route, I would envision:

UPP writes data directly to DDR.
EDMA transfers a block of data to DSP L2 SRAM for processing. Better yet, if you can tolerate a little extra latency in the calculations you can have ping-pong buffers in L2 SRAM such that the EDMA is transferring one buffer while the DSP simultaneously processes the other buffer.
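
The ping-pong arrangement in the second step can be illustrated with a small host-side Python simulation. This is not DSP code and uses no TI API: `run_ping_pong` and its arguments are hypothetical names, and the "transfer" is just a list handoff. On the device, the EDMA fill and the DSP processing would overlap in time, whereas here they simply alternate.

```python
def run_ping_pong(blocks, process):
    """Simulate two alternating buffers: one is filled while the other is processed."""
    buffers = [None, None]
    results = []
    source = iter(blocks)

    fill = 0
    buffers[fill] = next(source, None)       # prime the first buffer
    while buffers[fill] is not None:
        ready = fill                         # buffer that was just filled
        fill ^= 1                            # switch to the other buffer
        buffers[fill] = next(source, None)   # "EDMA" fills it (concurrent on real hardware)
        results.append(process(buffers[ready]))  # "DSP" processes the ready buffer
    return results


# Three small blocks stand in for blocks of samples; sum() stands in for the FFT.
print(run_ping_pong([[1, 2], [3, 4], [5, 6]], sum))
```

The point of the scheme is that processing of one block only has to finish before the other buffer is full, at the cost of one block of extra latency.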

So it seems feasible as long as there's not a whole lot more processing in addition to this FFT. You would need to do some more in-depth benchmarking to really know for sure. The goal of this paper exercise was to see if it even looked possible, e.g. if my quick calculations had estimated that it would take 20,000 cycles to do the FFT then I would have said to not bother with any further benchmarking. Given my estimates though, it could merit further investigation into whether the OMAP-L138 is a good option for your system.

0 Robert Wolfe over 6 years ago in reply to Brad Griffis

Mastermind 7860 points

Brad Griffis said:

Is the 1k FFT the only thing you need to do? Would it be a fixed point FFT directly on the data, or do you need to convert to floating point first?

It's actually two 1k FFTs, sequentially ... one on the result of the other.

Preference would be floating point, since everything is already set up for that, but whatever it takes!

Brad Griffis said:

So it seems feasible as long as there's not a whole lot more processing in addition to this FFT. You would need to do some more in-depth benchmarking to really know for sure. The goal of this paper exercise was to see if it even looked possible, e.g. if my quick calculations had estimated that it would take 20,000 cycles to do the FFT then I would have said to not bother with any further benchmarking. Given my estimates though, it could merit further investigation into whether the OMAP-L138 is a good option for your system.

Awesome, exactly the kind of up-front analysis and initial idea I was looking for. I agree the real test would be benchmarking. But would my note about the two sequential FFTs being required change your view of things?

Thanks,

Robert

0 Brad Griffis over 6 years ago in reply to Robert Wolfe

TI__Guru*** 125430 points

The floating point FFTs take approximately twice as many cycles. Two fixed point 1k FFTs still seem to be within the realm of possibility, but two floating point 1k FFTs are out of the question.
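
As a rough check of that conclusion using only figures from earlier in the thread (about 5,200 cycles per fixed point 1k FFT, roughly double for floating point, and a 19,000-cycle budget per 24 kHz iteration), here is a short Python sketch. The names are illustrative, and real numbers would need benchmarking.

```python
BUDGET_CYCLES = 19_000     # 456 MHz / 24 kHz, before any system overhead
FIXED_1K_CYCLES = 5_200    # estimated cycles for one fixed point 1k FFT
FLOAT_FACTOR = 2.0         # floating point is "approximately twice" the cycles


def fits_budget(num_ffts: int, cycles_each: float,
                budget: float = BUDGET_CYCLES):
    """Return (total cycles, whether that total fits within the budget)."""
    total = num_ffts * cycles_each
    return total, total <= budget


print("two fixed point FFTs:   ", fits_budget(2, FIXED_1K_CYCLES))
print("two floating point FFTs:", fits_budget(2, FIXED_1K_CYCLES * FLOAT_FACTOR))
```

The fixed point case leaves headroom for overhead and post-processing, while the floating point case exceeds the budget even before overhead is counted.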

0 Robert Wolfe over 6 years ago in reply to Brad Griffis

Mastermind 7860 points

Ok, noted, thanks!
