Speeding up EMIF reads

Other Parts Discussed in Thread: OMAP-L138, THS10082

I have a new project that uses the C6455 to do some signal processing on an incoming signal. The signal is being sampled at 8 MHz and I have to run a small DFT-like algorithm on the data stream. The algorithm only involves a dozen or so multiplies, so a 1.2 GHz core should easily keep up, right?

The prototype HW is a 6455DSK board from Spectrum Digital. We plan to design and build a custom daughter card which will contain analog conditioning, a high speed ADC and some glue logic so the ADC appears on the processor's external bus. We're planning on using CE5 (address 0xD0000000). The daughter card isn't ready, so I've done a first pass at coding the algorithm working from an internal array. By toggling a GPIO line I can time my loop. It's running every 30 ns - great. I have lots of headroom.

Now the problem: I eventually need to read the ADC once each time through the loop. I put a dummy read of address 0xD0000000 into my code and the loop time jumped to 130 ns!! I expected a penalty for touching external memory but 100 ns seems high and it's a killer in this project. The bizarre thing is that I put CE5 on a scope and it's active for only 20 ns. Why the other 80? I moved my GPIO writes around a bit and it appears that the core stalls 40 ns before CE5 goes active and stays stalled for 40 ns after CE5 becomes inactive. What's up?

I've set up the EMIF so that CE5 has no wait states. I have ECLKOUT at 150 MHz using SYSCLK4. Any idea what I might be doing wrong? I have to get around the extra 80 ns stall.

The EMIF registers are shown below (each row is a starting address followed by two consecutive 32-bit words). While I'm asking, the EMIF manual only describes a few of the registers in the memory region of 0x70000000 to 0x700000D0. What does the data at the other addresses mean? Is one of them my problem?

0x70000000    0x00320311    0x40000000
0x70000008    0x00010620    0x00000753
0x70000010    0xFFFFFFFB    0x01FFFFFF
0x70000018    0x00000000    0x00000000
0x70000020    0x000000FE    0x00000000
0x70000028    0x40070B07    0x00040F14
0x70000030    0x00000000    0x00000000
0x70000038    0x00000000    0x00000000
0x70000040    0x00000000    0x00000000
0x70000048    0x00010000    0x00000000
0x70000050    0x00000000    0x00000000
0x70000058    0x00000000    0x00000000
0x70000060    0x00000091    0x00000000
0x70000068    0x00000000    0x00000000
0x70000070    0x00000000    0x00000000
0x70000078    0x00000000    0x00000000
0x70000080    0x00240120    0x00240120
0x70000088    0x00000002    0x00000002
0x70000090    0x00000000    0x00000000
0x70000098    0x00000000    0x00000000
0x700000A0    0x40000080    0x00000000
0x700000A8    0x00000000    0x00000000
0x700000B0    0x00000000    0x00000000
0x700000B8    0x00000000    0x00000000
0x700000C0    0x00000001    0x00000000
0x700000C8    0x00000000    0x00000000
0x700000D0    0x00000000    0x00000000

Thank you for any help,

Fred Hansen

  • Fred,

    The interconnect (switched central resource / SCR) is "tuned" for throughput rather than latency in order to allow huge amounts of data from SRIO, etc. to efficiently be piped through.  The downside of this tuning is that it is heavily pipelined and hence there is some significant latency through the interconnect.  For bursts of data (e.g. cache lines, etc.) this delay is not noticeable or appreciable, particularly on writes where the CPU can "fire and forget", i.e. just stick the data into a write buffer and let it drain.  The case you mention of reading a single data element from the external memory bus is pretty much the worst case there is, sorry to say.

    I recommend hooking up a FPGA or some kind of FIFO in between the data converter and the EMIF such that you can burst in multiple words of data per read from the EMIF.  That will tremendously improve your throughput and spread that latency across many sample periods.

    One other part you may wish to consider is the OMAP-L138 which features a new peripheral called the Universal Parallel Port (UPP) Interface which is specifically made to interface to a parallel data converter.  It has the FIFO built into the peripheral so that you can gluelessly interface to the codec.

    Brad
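
The amortization argument in the reply above can be put into rough numbers. The sketch below is only a back-of-the-envelope model, not a timing guarantee: the 30 ns loop time and the roughly 100 ns single-read penalty are the figures measured earlier in the thread, while the cost of each additional word in a burst (`per_word_ns`) is an assumption. Using the 20 ns that CE5 was observed to be active as that per-word cost is a guess, not a datasheet value. The function names are hypothetical.

```c
#include <assert.h>

/* Figures taken from the measurements quoted in the thread. */
#define LOOP_NS            30.0   /* loop time with no external read       */
#define READ_OVERHEAD_NS  100.0   /* extra time for one isolated read      */
#define SAMPLE_PERIOD_NS  125.0   /* 1 / 8 MHz sample rate                 */

/* Modelled average time per sample when one transaction fetches
 * `burst_len` samples. Assumed model: the first word of a transaction
 * costs READ_OVERHEAD_NS, every further word costs `per_word_ns`, and
 * each sample still needs LOOP_NS of processing. */
double ns_per_sample(unsigned burst_len, double per_word_ns)
{
    assert(burst_len > 0u);
    double transaction_ns =
        READ_OVERHEAD_NS + (double)(burst_len - 1u) * per_word_ns;
    return LOOP_NS + transaction_ns / (double)burst_len;
}

/* Returns 1 if the modelled cost fits inside one sample period. */
int fits_sample_budget(unsigned burst_len, double per_word_ns)
{
    return ns_per_sample(burst_len, per_word_ns) <= SAMPLE_PERIOD_NS;
}
```

With a burst length of 1 the model reproduces the measured 130 ns, which is over the 125 ns sample period. With a burst of 16 words and the assumed 20 ns per extra word it drops to 55 ns per sample. The real per-word cost depends on the EMIF timing settings and has to be measured.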

  • Brad,

    Thank you for the information. Man, this seems like a long latency. I was able to get the dead time down to 90 ns by cutting the turnaround time to 0 in AWSS. I thought about this when I picked the part. I looked at the EMIF a bit. I did not see anything in the specs to suggest a single read would take this long. They really should have a timing diagram for a single word read.

    I'm afraid an FPGA is not in the cards for the prototype. This is a short timeline project with a small staff (2 EEs). It started ~3 weeks ago and we're supposed to have something to show in 6 to 8 weeks. The daughter card is already drawn up and about to go to layout. Neither one of us is an FPGA guy anyway.

    I have looked at using the EDMA to allow me to run code "under" the external read. I've gotten that to work for the most part. I'm having a bit of trouble telling when the DMA has finished. There doesn't seem to be a complete bit anywhere. I know I can have the DMA fire off an interrupt. But, with a 120 ns cycle time requirement I wasn't planning on using an interrupt. I'm thinking I'll basically run my algorithm in a tight while loop. What is the easiest way for the code to tell if the DMA has finished?

    The other option we've talked about is to use a serial ADC. That would be a very clean solution. The buffer in the serial port itself would then serve as my FIFO of sorts. I'm sure the port will have a transfer done or data ready bit that I can use to tell a new piece of data is available. We need to run at 8 MHz. Does TI have a serial ADC that will sample at, say, 10 MHz and tie directly into one of the C6455 serial ports? If so, I can add it to the schematic along with the parallel ADC I have now without delaying the program much. Then we can decide which ADC to use after the board has come back from fab.

    Thanks again for the help, Fred

  • Yes, it is a significant latency for just a single read.  There's a fairly long chain from the CPU issuing a read instruction until the data actually gets read.  Within the CPU "megamodule" there is the CPU which issues the read to the L1D cache controller.  It then checks to see if it already has that address cached.  Once it realizes that data is a "miss" then it forwards the request to the L2 cache controller which also checks for the data.  It misses in L2 and then the L2 cache controller forwards the request to the external memory controller (EMC) at the boundary of the megamodule.  The EMC is what bolts on to the Switched Central Resource (SCR) I mentioned earlier.  So finally the EMC issues the read request to the SCR which sends the request over to the External Memory Interface (EMIF) which finally does the read out on the bus.  Then the data that is read needs to traverse back through all of those elements for the CPU to get the data.  So going through all of that for just 1 piece of data is a lot of overhead!

    Using the EDMA instead of the CPU should be helpful.  Do you have very rigid latency requirements?  If not, perhaps you could use the EDMA to buffer a block of data and then process it with the CPU.  That would be my recommendation, i.e. buffer a block of data using the EDMA and generate an interrupt once the block has been captured.  You can then ping pong back and forth, i.e. EDMA is filling one buffer while you process another.

    If for some reason you do not want an interrupt then simply do not set the corresponding EDMA.IER bit and no interrupt will be sent to the core.  In that scenario you would poll the EDMA.IPR bit for completion.

    I'm a "digital guy" not an "analog guy" but did a little quick searching for you.  One thing I stumbled across is a device called the THS10082 which is a 10-bit 8MSPS dual A/D.  The thing that is interesting about this device is that it's a parallel A/D with an integrated 16-deep FIFO.  It could be just the thing you need in terms of being able to burst multiple elements and significantly improve your efficiency.  I still recommend doing it with the EDMA though.  I didn't see any 10MSPS converters with a serial interface.

    Brad
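
The two suggestions in the reply above (ping-pong buffering, and polling the pending bit instead of taking an interrupt) can be combined. The sketch below is a host-side simulation under stated assumptions, not C6455 driver code: `edma_regs_t`, `sim_edma_fill`, `process_block`, and the completion codes 0 and 1 are all hypothetical stand-ins. It also assumes that a pending bit is cleared by writing the matching bit to a separate clear register. On real hardware the two fields would be volatile pointers to the actual EDMA registers, and the fill would run concurrently with the processing instead of synchronously as it does here.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_LEN 16u   /* one block per 16-deep FIFO drain (assumption) */

/* Stand-ins for the interrupt-pending and interrupt-clear registers. */
typedef struct {
    volatile uint32_t ipr;   /* bit set when the tagged transfer is done */
    volatile uint32_t icr;   /* last mask written to clear pending bits  */
} edma_regs_t;

/* Non-blocking completion check. Returns 1 and clears the pending bit
 * if the transfer tagged `tcc` has finished, otherwise returns 0. */
int edma_poll_done(edma_regs_t *r, unsigned tcc)
{
    assert(tcc < 32u);
    uint32_t mask = (uint32_t)1u << tcc;
    if ((r->ipr & mask) == 0u)
        return 0;
    r->icr = mask;     /* assumed: hardware clears IPR on this write ...  */
    r->ipr &= ~mask;   /* ... which the simulation has to do by hand      */
    return 1;
}

/* Simulated DMA: fills one buffer with a ramp and raises its pending bit. */
void sim_edma_fill(edma_regs_t *r, int16_t *buf, unsigned tcc, int16_t first)
{
    for (unsigned i = 0; i < BLOCK_LEN; ++i)
        buf[i] = (int16_t)(first + (int16_t)i);
    r->ipr |= (uint32_t)1u << tcc;
}

/* Placeholder for the real algorithm: sums the samples in one block. */
int32_t process_block(const int16_t *buf)
{
    int32_t sum = 0;
    for (unsigned i = 0; i < BLOCK_LEN; ++i)
        sum += buf[i];
    return sum;
}

/* Processes `n_blocks` blocks, alternating between two buffers.
 * The buffer index doubles as the completion code (0 = ping, 1 = pong). */
int32_t run_ping_pong(unsigned n_blocks)
{
    static int16_t buf[2][BLOCK_LEN];
    edma_regs_t regs = { 0u, 0u };
    int32_t total = 0;
    int16_t next = 0;
    unsigned fill = 0u;

    sim_edma_fill(&regs, buf[fill], fill, next);   /* prime the first buffer */
    next = (int16_t)(next + (int16_t)BLOCK_LEN);

    for (unsigned done = 0; done < n_blocks; ++done) {
        unsigned ready = fill;
        while (!edma_poll_done(&regs, ready)) {
            /* tight poll, no interrupt enabled */
        }
        fill ^= 1u;                                /* swap ping and pong */
        sim_edma_fill(&regs, buf[fill], fill, next);
        next = (int16_t)(next + (int16_t)BLOCK_LEN);
        total += process_block(buf[ready]);        /* work on the full one */
    }
    return total;
}
```

In this simulation `run_ping_pong(2)` sums the ramp values 0 through 31. The point of the structure is that the CPU only ever touches a buffer whose pending bit it has already seen and cleared, so no interrupt service routine is needed inside the tight loop.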

     

  • Brad,

    That part is perfect. We just put it on the board as an extra option. We left the original as it was. That's a quicker schematic change and will give us the flexibility to go either way when the time comes. The board goes to layout tomorrow - only one day late.

    I just verified that I can set the EDMA to burst read the FIFO. Now I'm off to get the rest of my code working so I can be ready when the daughter card comes in.

    Thanks for your help. This was a schedule saver!!

    Fred