Question: Migrating from TMS320C6414T to Keystone

Other Parts Discussed in Thread: TMS320C6414T

Hi,

Currently, we plan to migrate from the old TMS320C6414T to a KeyStone DSP. Our current design uses the following interfaces:

HPI: 32-bit @ 25 MHz

EMIFA: 32-bit @ 100 MHz, synchronous mode

McBSP: all 3 streams run at 100 MHz for real-time data.

The new device we plan to use is the TMS320C665x (e.g., the TMS320C6657). Our questions are:

1. For our EMIFA (32-bit @ 100 MHz synchronous): the EMIF of the C6657 cannot support our current system because its performance is too low, so the best interface for communicating with our FPGA may be PCIe. Does the C6657, configured as a x2 link, provide 10 Gbps?

2. Our three McBSPs run at 100 MHz for real-time data processing, but the new C6657 has only two McBSPs. It looks like the best option is to use RapidIO to replace them. Our real-time data consists of 16-bit streams, so if we transfer only 16 bits at a time over RapidIO, performance will be lower than in the current design; if we buffer the data in the FPGA and transfer it in larger blocks, we introduce latency and lower the core performance. What is the best trade-off in this case? Is there any info/document on data rate vs. payload size? I plan to use a 2x port to replace the McBSPs and a 1x port to replace the HPI interface (downloading code and communicating between the DSP and the main processor).

3. The new DSP may need an OS and drivers to support all of these features, as well as more buffer memory for its operation. Is there any estimate of how much memory is normally required to store and run the OS and drivers on the new DSP?

Thanks in advance for your help.

shiquan

  • Hello,

    I was in a similar situation some time ago. May I suggest looking over a couple of threads to get familiar with the common difficulties:
    https://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/124251,
    e2e.ti.com/.../181052,
    https://e2e.ti.com/support/dsp/tms320c6000_high_performance_dsps/f/112/p/262001/917519#917519,

    You did not mention what your real-time data source/sink is. If it is an ADC/DAC, then I'm afraid there is no easy way to connect it to the DSP over either serial interface. Glue logic and buffering would be required, and an FPGA is just right for that.

    I hesitate to judge the numbers, but I'd suggest being very careful with throughput expectations; they depend on multiple factors. For PCIe, besides lane count and link speed, one should consider the transaction layer payload size. Top figures are quoted for the maximum payload size, which is 4KB in PCIe. Guess what? The DSP supports ingress data with 256B of payload and egress with just 128B. The next thing not to forget is that making such bulk transfers requires some sort of DMA engine; regular memory accesses result in single-DWORD transactions. In my setup, on a x1 Gen1 link in this mode, we see 40MBps on writes and about 2MBps on reads. You may use the DSP's DMA engine to issue multi-DWORD writes to the FPGA, but the FPGA still has to be able to handle them. In the opposite direction, to upload data from the FPGA to the DSP, the DSP's DMA engine is of little help; instead, the FPGA needs some kind of bus-mastering capability and a DMA engine of its own. In practice, transfers in both directions usually end up being handled by the FPGA's DMA engine.
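
    To make the PIO-versus-DMA point concrete, here is a minimal sketch of the two write paths on the DSP side. The outbound-window address and the edma_submit_copy() helper are placeholders for illustration, not real register values or driver API names:

        #include <stdint.h>

        /* Placeholder base address of the PCIe outbound data window that is
         * mapped onto the FPGA's BAR; on the DSP this region is configured
         * through the PCIe outbound address translation registers. */
        #define PCIE_DATA_WINDOW ((volatile uint32_t *)0x60000000)

        /* Variant 1: plain CPU stores (PIO). Every 32-bit write below leaves
         * the DSP as its own memory-write TLP carrying a single DWORD of
         * payload, so header/framing overhead dominates the link. */
        static void pio_write(const uint32_t *src, uint32_t ndwords)
        {
            uint32_t i;
            for (i = 0; i < ndwords; i++)
                PCIE_DATA_WINDOW[i] = src[i];
        }

        /* Variant 2: hand the buffer to the EDMA controller. One transfer of
         * 128B lets the PCIe module emit TLPs at its maximum egress payload,
         * so far fewer headers are sent per byte of data. edma_submit_copy()
         * stands for whatever EDMA driver call you actually use. */
        extern void edma_submit_copy(void *dst, const void *src, uint32_t nbytes);

        static void dma_write(const uint32_t *src, uint32_t ndwords)
        {
            edma_submit_copy((void *)PCIE_DATA_WINDOW, src, ndwords * 4u);
            /* ...wait for the EDMA completion event before reusing src. */
        }
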
    Although some reference designs exist, it's unlikely any would match your needs without adaptation, and that is another headache. Be prepared that TI's LLD functions for PCIe may not work with your FPGA (https://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/416971). When selecting an FPGA, consider its capabilities as well: our Spartan-6 can handle just a x1 lane in Gen1 mode; newer FPGAs may have better features.

    Summarizing, I'd suggest dropping the word "migration" from your vocabulary. What lies ahead is a painful and difficult development of a new platform.

  • Welcome to the TI E2E forum. I hope you will find many good answers here and in the TI.com documents and in the TI Wiki Pages (for processor issues). Be sure to search those for helpful information and to browse for the questions others may have asked on similar topics (e2e.ti.com). Please read all the links below my signature.

    We will get back to you on the above query shortly. Thank you for your patience.

  • Hi Shiquan,

    Much of the information you need on the throughput of the various interfaces can be found in the Throughput Performance Guide for C66x KeyStone Devices (SPRABK5A1). The C6657 has the same architecture as the C6678 referred to in that document, so the throughput numbers are valid for the C6657 as well.

    Regards,

    Bill

  • Rrlagic,

    Thank you for your reply.

    Are the 40MB/s write and 2MB/s read figures the maximum speeds for your current PCIe setup? Currently, our DSP reads from and writes to a Spartan-6 FPGA, and the throughput on our current EMIFA is about 250MB/s. So I plan to set up read (256B) and write (128B) buffers inside the DSP's internal memory. With a payload size of 128B and the help of DMA, I hope the PCIe throughput will be around 65-70% of the data rate for writes and around 50% for reads. Since we would configure the DSP PCIe as x2 Gen2, that should be enough for our application. As you mention, we would need to upgrade the FPGA to a 7-series device to have more than one PCIe block.
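
    Here is the back-of-envelope estimate behind those percentages. I am assuming roughly 24 bytes of TLP header/framing overhead per packet and ignoring ACK/flow-control traffic and read-completion latency, so treat it as an upper bound only:

        #include <stdio.h>

        int main(void)
        {
            /* PCIe Gen2 x2: 2 lanes * 5 GT/s with 8b/10b encoding leaves
             * 8 Gbit/s of usable bandwidth, i.e. ~1000 MB/s per direction. */
            double raw_MBps = 2.0 * 5.0e9 * (8.0 / 10.0) / 8.0 / 1.0e6;

            double payload  = 128.0;  /* DSP egress payload limit, bytes  */
            double overhead = 24.0;   /* assumed header + framing, bytes  */
            double eff      = payload / (payload + overhead);

            printf("raw link   : %.0f MB/s\n", raw_MBps);         /* 1000 */
            printf("efficiency : %.0f %%\n",  100.0 * eff);       /* ~84  */
            printf("upper bound: %.0f MB/s\n", raw_MBps * eff);   /* ~842 */
            return 0;
        }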

    Do you use external DDR? If so, what size? The TMS320C665x internal memory is 1MB of shared RAM plus 2MB of L2 (C6657). I am not sure whether that leaves enough room to store and run TI-RTOS and the drivers in internal memory alone. Otherwise, we would need to add another DDR3 dedicated to this DSP. We already have DDR3 for the main processor, and adding another one will cost valuable PCB space, unless I eliminate one PCIe bridge and move that function into the new FPGA.

    Thanks,

    shiquan

  • Bill,

    Thanks, I will take a look at it. Right now, my only concern is the RapidIO interface: I have no experience with RapidIO, and this interface is critical for us. If RapidIO performance behaves similarly to PCIe, I will probably keep the payload size as small as 16B in order to reduce the latency of our A/D data.
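
    To see what such a small payload does to link efficiency, I made a rough estimate. The 16 bytes of per-packet overhead below is only my guess; the real figure depends on the SRIO packet type and device-ID size:

        #include <stdio.h>

        /* Assumed per-packet overhead (header + CRC + framing) in bytes. */
        #define OVERHEAD_BYTES 16.0

        static double efficiency(double payload_bytes)
        {
            return payload_bytes / (payload_bytes + OVERHEAD_BYTES);
        }

        int main(void)
        {
            double sizes[3] = { 16.0, 64.0, 256.0 };
            int i;
            for (i = 0; i < 3; i++)
                printf("payload %3.0f B -> efficiency ~%2.0f %%\n",
                       sizes[i], 100.0 * efficiency(sizes[i]));
            /* 16B  -> ~50%: lowest latency, but half the link is headers
             * 64B  -> ~80%
             * 256B -> ~94%: best throughput, but data waits in the FPGA  */
            return 0;
        }

    So a 16B payload would trade away roughly half of the link rate just to keep the latency down.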

    shiquan
  • Hi Shiquan,
    Most of our customers use external DDR3 for their applications. The C6657 supports either a x16 or a x32 DDR3 data bus connection with 8GB of addressable space. Designs that need higher throughput to memory use the x32 bus, but if you can operate with the x16 data bus width, a single 4Gb x16 memory device would give you 512MB of memory.
    Tables 28-30 in the throughput document will give you an idea of the capabilities of the SRIO interface.
    Regards,
    Bill
  • Hello,

    Let me give you a brief idea of how PCIe differs from a parallel bus. When you work with EMIF, it provides address and data signals together with CE, WE and RE strobes. Then, in your FPGA design, you just connect those lines to memory. Of course, there may be some kind of address decoder, a few tristates and so on, but basically it is just a matter of connecting the address bus, the data bus and the enable strobes. With that, the unit of exchange depends on the data bus width. We used a 64-bit bus, and I can confirm throughput of about 240MBps in either direction.

    With PCIe the situation is different. I am familiar with the Xilinx solution, so let me explain using that. The IP core provides either an AXI streaming interface or a transaction interface (TRN). While the former is the modern standard, the latter is easier to understand, so going with the TRN interface, the core mainly interacts with the user application in the FPGA through the trn_td and trn_rd buses, for transmit and receive data respectively. Besides some control signals, that's all. You have to understand that in PCIe the addressing information is signalled inside the transaction layer packet (TLP), namely in the TLP header. The header consists of 3 or 4 DWORDs, followed by the payload. What is important is that the IP core outputs the received packet DWORD by DWORD, sequentially. It is then the user application's responsibility to capture the received fields and decode them properly. The simple user design provided with the IP core can handle read and write requests, but with a limitation: although the maximum payload in PCIe is 4KB, and the maximum payload supported by Spartan-6 is 512B, that design can handle just one, yes, just one DWORD of payload. So at the TRN level there is an overhead three times the size of the payload, and we have not even considered the lower layers of the protocol yet.
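
    To show what the user logic actually has to pick apart, here is a simplified sketch of a 3-DWORD memory-write TLP header as it arrives DWORD by DWORD on trn_rd. The grouping of fields is indicative only; see the PCIe base specification for the exact bit positions:

        #include <stdint.h>

        /* Simplified view of a 3-DWORD memory-write TLP header (32-bit
         * addressing). The core delivers these DWORDs sequentially; the
         * user application must extract and decode the fields itself. */
        typedef struct {
            uint32_t dw0; /* Fmt/Type (memory write), TC, TD/EP, Attr,
                             Length in DWORDs (1..1024)                   */
            uint32_t dw1; /* Requester ID, Tag, Last/First DWORD byte
                             enables                                      */
            uint32_t dw2; /* Address[31:2] of the write target            */
            /* dw3...   : payload DWORDs follow; the stock example design
                          accepts only a single payload DWORD here.       */
        } tlp_mem_wr_hdr_3dw;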

    Of course you can allocate exchange buffers in DSP memory, and the EDMA hardware is capable of sending downstream write requests with a multi-DWORD payload. But that alone does not solve your problem: you have to have some kind of hardware in the FPGA that can handle these multi-DWORD packets. Again, the example design provided does not do that.

    With reads the situation is even worse. You have to know that a read is at least a two-step operation: first the initiator sends a read request, and then the remote party returns the read data in a completion packet. So the latency of the operation increases even further, and that explains the low read performance. Luckily, one read request can ask for a large piece of data, but again, inside your FPGA there has to be hardware capable of generating such multi-DWORD completions. Even if you use the DSP's EDMA, you cannot accomplish the read without matching hardware on the other side; without it there is no way to get impressive throughput numbers. Check with your FPGA designer whether he can implement bus-mastering DMA on the FPGA side.

    As for read performance, I have roughly explained why it is so low with the word-by-word reads of a PIO design. With a DMA engine inside the FPGA, one can get upstream throughput comparable to downstream.

    And yes, we have external DDR3 in our design. To be sure, having it just as mapped memory is simple, but when it comes to getting better performance, there will be caching pain again. I'd suggest using enough DRAM chips to populate the full EMIF width.

    If you ever decide to go with a 7-series FPGA, remember that Vivado is the recommended and supported design environment for them. Check with your FPGA engineer whether he is aware of that.

    I think that having x2 lanes and Gen2 speed will increase performance by about 4 times.
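
    The raw-link arithmetic behind that rough factor of four (all protocol overhead ignored on both sides) is simply:

        #include <stdio.h>

        int main(void)
        {
            /* 8b/10b encoding leaves 80% of the signalling rate for data. */
            double gen1_x1 = 1 * 2.5e9 * 0.8 / 8.0;  /* bytes per second */
            double gen2_x2 = 2 * 5.0e9 * 0.8 / 8.0;

            printf("Gen1 x1: %.0f MB/s raw\n", gen1_x1 / 1.0e6);  /*  250 */
            printf("Gen2 x2: %.0f MB/s raw\n", gen2_x2 / 1.0e6);  /* 1000 */
            printf("ratio  : %.1fx\n", gen2_x2 / gen1_x1);        /*  4.0 */
            return 0;
        }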

  • Hi again,
    If you ever decide to stick with SRIO, please talk to your FPGA designer first. The price of an SRIO IP core runs to four digits, and the design deliverables are not that easy to comprehend compared to PCIe.
  • Hi rrlagic,

    Thanks for your help. I really appreciate it. As you suggested, I am starting to do more research on the FPGA side.

    Best regards,

    shiquan