DM64x PCI bus master throughput

Other Parts Discussed in Thread: XIO2000A

Hi

 

Does anyone have PCI bus master throughput numbers for the DM64x DSPs? There is a section in the errata for these devices that suggests that the slave throughput is higher than the bus master throughput, but no numbers are actually given.

 

-steve

  • Unfortunately we do not have true benchmarks for this situation, but if you are using the PCI on the DM648, the optimal solution is to let another device master the PCI bus; the difference will be significant if you are transferring large streams of data such as images (or anything longer than 64 bytes). Essentially the errata means that every 64 bytes you have to go through the overhead of restarting the transaction, so how much bandwidth you get depends on your system conditions. A rough ballpark figure would be ~30MB/s, keeping in mind that your situation could be entirely different, much better or much worse; we do not have a performance benchmark for the DM648 mastering the PCI bus.

  • Bernie is correct. We don't have good PCI benchmarks for a number of reasons, but on DM648 the PCI Master will likely end up around 1/3 the throughput of the PCI Slave depending on other system traffic. For DM648 specifically we highly suggest using the PCI only as a slave, not as a master.

  • Thanks TI for answering my questions (or at least trying to)

     

    But shame on the DSP team for not specifying, during the design phase of this chip, what the PCI throughput should be! What's the point of putting a PCI interface on such a device if it can't fully utilize the bus?

     

    >>For DM648 specifically we highly suggest using the PCI only as a slave, not as a master.

    Can you recommend any recent DSPs that have better (and tested) throughput on the PCI bus? I assume the DM647 is the same; how about the DM643x family?

     

    thanks

    steve turner

    AudioScience

  • Unfortunately, all the latest DaVinci processors that have PCI share this same erratum; the DM643x and DM646x series both have the same limitation as the DM647/8 series. You would have to go back as far as the DM642 to get a PCI peripheral that did not have this limitation (it is capable of up to 64K bytes in a master burst).

  • Hi,

    I'd like to open this thread again... A few days back I was trying to set up a system with an EVM648 and an XIO2000A PCI-to-PCIe bridge, and I also measured the PCI throughput. With the XIO I can run the EVM648 PCI at 66MHz; without it, only at 33MHz.

    The rate is horribly low. I get about 20MB/s at 33MHz and about 40MB/s at 66MHz.

    Well - actually our whole project is built on the assumption that we could transfer roughly what was possible on our DM642 prototype. Going back to the DM642 is not an option because we need 6 video ports, DDR RAM, etc.

    Another thing is that the DSP grabs data from these video ports and 'forwards' it over PCI as the data arrives. So there is not one big block to transfer, but data (which the DSP knows about) that keeps growing. It's hard for me to imagine how I should read this from the PC side. It would require another channel of communication so that the PC thread knows how far the data has already been grabbed. It's a time-critical application and we need to utilize the time during which the data arrives.
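
    One scheme I can imagine (just a sketch; all names are made up and nothing here comes from the DM648 PSP) is a small descriptor in DSP memory that the PC reads through the PCI slave window: the DSP advances a write offset as each chunk arrives from the video ports, and the PC thread polls that offset to know how far it may safely read:

    #include <tistdtypes.h>   /* Uint32 etc.; use whatever types header the PSP samples use */

    /* Hypothetical descriptor placed in DSP memory that the host can see
       through the DM648 PCI slave window; all names are illustrative. */
    typedef struct
    {
        volatile Uint32 writeOffset;  /* bytes valid so far, advanced by the DSP */
        volatile Uint32 frameBytes;   /* final frame size, 0 while still unknown */
        Uint32          dataOffset;   /* where the payload starts in the window  */
    } PciStreamHeader;

    /* DSP side: after each chunk from the video ports has landed in the
       buffer, publish how much of it is now valid. The PC thread polls
       writeOffset (a plain read through the slave window) and consumes up
       to that point while the rest of the frame is still arriving. */
    static void publishProgress(PciStreamHeader *hdr, Uint32 bytesValid)
    {
        hdr->writeOffset = bytesValid;
    }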

    I'd like to discuss the options that are there for me now:

    Will there be a fix for this in the future? Or is that something I cannot even dream about? I think this 'feature' of the DM648 is quite severe. Don't other customers complain about this limitation? Would it be that hard to fix (although if it's a silicon issue it might be expensive...)?

    Is there another way to start a PCI transfer 'under DSP control' that avoids this low bandwidth?

    How can I utilize the full bandwidth? How do I have to start the PCI transfer on the PC side? When I access the DSP registers to set up the transfer, won't that amount to the same thing as the DSP starting the actual PCI transfer? How is a 'slave' transfer started?
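
    As far as I understand it, a 'slave' transfer just means the host (or the bridge) masters the PCI bus while the DM648 only responds: the PC maps one of the DSP's BARs into its address space and reads or writes it directly, so the DSP never starts the data transaction itself. A very rough user-space illustration on Linux - the real DVSDK driver does this in kernel space, and the device path and window size below are placeholders, not real values:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t winSize = 4u * 1024 * 1024;   /* assumed BAR size */
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        void *bar = mmap(NULL, winSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (bar == MAP_FAILED) { perror("mmap"); return 1; }

        memset(bar, 0, 4096);   /* the host chipset, not the DSP, drives this write */

        munmap(bar, winSize);
        close(fd);
        return 0;
    }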

    best regards,

    Thomas

  • Hi Steve,

    How did you solve this problem?

    bye,

    Thomas

  • Hi again...

    For copying data over PCI and measuring the throughput from the DM648 I used the 'writeData' example in psp_pci_bios_sample_main.c. Maybe this is the cause of the low throughput and there is a more intelligent way to do it.

    The core of the writeData() function basically loops over all 32-bit values and copies them to the mapped PCI window:

    for(i = 0; i< writeCount; i+=4)
    {
        *((Uint32 *)tmpDstAddr) = *((Uint32 *)srcAddr);
        tmpDstAddr+=4;
        srcAddr+=4;
    }

    Wouldn't it be faster to move the complete block via DMA to the mapped window (something like the sketch below)?
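
    What I have in mind is something like the following - just a sketch, using the DM642-era CSL DAT module purely for illustration (on the DM648 the same idea would go through the EDMA3 LLD that ships with the PSP, so the actual calls will differ), and pciWriteDma is a made-up name:

    #include <csl.h>
    #include <csl_dat.h>

    /* 32 KB chunks keep the byte count within the 16-bit limit of DAT_copy(). */
    #define DMA_CHUNK 0x8000u

    static void pciWriteDma(void *src, void *pciWin, Uint32 byteCount)
    {
        Uint8 *s = (Uint8 *)src;
        Uint8 *d = (Uint8 *)pciWin;

        /* Assumes the source buffer was already written back from cache,
           byteCount is a multiple of 4, and the destination range stays
           inside the currently mapped PCI window. */
        DAT_open(DAT_CHAANY, DAT_PRI_LOW, 0);

        while (byteCount > 0)
        {
            Uint16 chunk = (byteCount > DMA_CHUNK) ? DMA_CHUNK : (Uint16)byteCount;
            Uint32 id    = DAT_copy(s, d, chunk);  /* submit one linear EDMA transfer */
            DAT_wait(id);                          /* block until this chunk is done  */
            s += chunk;
            d += chunk;
            byteCount -= chunk;
        }

        DAT_close();
    }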

    Apart from that... the sample is incomplete. At first glance I asked myself what happens if the size is not a multiple of 4: it will most likely overwrite the following 1-3 bytes of memory that were never allocated. And what happens if the source block crosses an 8MB boundary? From my point of view, a remap of the PCI window would be required in that case.

    Why is it so complicated to do things on the DM648 compared to the DM642? One basically has to rewrite the CSL functionality one needs. On the DM642 you had PCI_xfrConfigArgs(), PCI_xfrStart(), and PCI_xfrTest(), and you were done.

    On DM648 the (incomplete) sample has about 500 lines of code. The sample code is nice, but basically you have to rewrite it, thinking about every little problem that could happen. Is moving data between a DSP and a PC such an uncommon task that you let every customer reinvent the wheel?

    Bernie, you posted that the PCI interface bug is in all the newer DaVincis. Is the PCI interface so rarely used on the DaVincis in other apps that you can keep such a bug?

    best regards,

    Thomas

  • As feedback, some news on the DM648 PCI throughput issue...

    I was right in my assumption that the writeData sample in psp_pci_bios_sample_main.c is far too slow, incomplete, and even buggy. The sample doesn't use bursts at all and copies 4 bytes at a time. As posted before, this stops the transfer after every 4 bytes... Theoretically (8 clocks of overhead per transaction) this gives about 14MB/s at 33MHz and 28MB/s at 66MHz - roughly the rates I have seen.

    When the data is copied to the PCI address window using an EDMA transfer, the rate increases. Due to the 64-byte-per-burst bug of the DM648, theoretically 88MB/s is possible at 33MHz and 176MB/s at 66MHz (see the rough calculation below). With the EVM648 connected to a x1 PCIe port through an XIO2000A bridge EVM board, the transfer rate I have measured is about 150-160MB/s at 66MHz.
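
    For reference, these theoretical figures follow directly from the per-burst overhead; a small back-of-the-envelope calculation (assuming a 32-bit bus, 8 overhead cycles per write transaction and 12 per read, which also covers the read numbers further down in this thread):

    #include <stdio.h>

    /* Rough theoretical PCI throughput: bytes per transaction divided by the
       time for (data cycles + overhead cycles) at the given bus clock. On a
       32-bit bus a 64-byte burst takes 16 data cycles; the overhead figures
       are the ones quoted in this thread, not measured values. */
    static double pci_mbps(double clock_mhz, int bytes_per_xfer, int overhead_cycles)
    {
        int data_cycles = bytes_per_xfer / 4;                 /* 32-bit bus */
        return bytes_per_xfer * clock_mhz / (data_cycles + overhead_cycles);
    }

    int main(void)
    {
        printf("4-byte writes,  33 MHz: %5.1f MB/s\n", pci_mbps(33.0,  4,  8)); /* ~14.7 */
        printf("4-byte writes,  66 MHz: %5.1f MB/s\n", pci_mbps(66.0,  4,  8)); /* ~29.3 */
        printf("64-byte writes, 33 MHz: %5.1f MB/s\n", pci_mbps(33.0, 64,  8)); /* ~88   */
        printf("64-byte writes, 66 MHz: %5.1f MB/s\n", pci_mbps(66.0, 64,  8)); /* ~176  */
        printf("64-byte reads,  33 MHz: %5.1f MB/s\n", pci_mbps(33.0, 64, 12)); /* ~75   */
        printf("64-byte reads,  66 MHz: %5.1f MB/s\n", pci_mbps(66.0, 64, 12)); /* ~151  */
        return 0;
    }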

    The transfer rate is below what we expected, but not by as much as I thought at first. A DM642 reaches more than 100MB/s at 33MHz, so the DM648 slows us down by about 25%.

    I really don't understand why such figures are neither in the datasheet nor in the errata. And it's also quite a surprise that you don't even get such figures from TI support. No one even questioned my observation of 20MB/s...

    When I look at the DM648 datasheet, I expect a full PCI interface, and we based the calculations for our project on that. It's a lot of work to verify these values because of all the workarounds that are required while the actual hardware is not yet available... And you have to take care of a lot of little things that can go wrong (cache handling etc.). Normally you trust the datasheet when you decide which processor to use - don't you?

    What else do I have to question now? The video port bandwidth? Or the memory bandwidth? I hope there will not be more surprises like this in the future...

    bye,

    Thomas

  • Thomas

    Thanks for publishing your findings - your results are interesting and it's good to know that you can achieve higher throughput than TI had quoted.

     

    Do you have numbers for a PCI burst read using EDMA as well?

     

    NOTE TO TI - someone needs to write an app note on PCI throughput for all DSPs with a PCI interface.

    -Steve Turner

  • Hi Steve,

    I have not yet tested this since there will only be writes in my application, but I'll try. I think there are 12 PCI cycles of latency for reads instead of 8 for writes. So theoretically, for 64-byte bursts you need 24 cycles per burst for writes and 28 for reads, resulting in a theoretical maximum of 88/176 MB/s at 33/66MHz for writes and 75/150 MB/s for reads - if I understood it correctly.

    bye,

    Thomas

  • Hi Steve,

    I just did a quick test... Replacing the copy loop of the readData sample:

        for(i = 0; i< readCount; i+=4)
        {
            *((Uint32 *)dstAddr) = *((Uint32 *)tmpSrcAddr);
            tmpSrcAddr+=4;
            dstAddr+=4;
        }

    with an EDMA copy, I now see about 25-30MB/s. I don't know why it is so low in my example. What data rates have you seen?

    bye,

    Thomas

  • Hello:

    I saw this thread some time ago and I'm happy that there's interest in it again. It's not strange that you are still concerned about this issue. PCI is a major communication channel, which makes the DM648 an interesting chip for a lot of applications. Having PCI not working as expected is thus a worrying problem.

    I've checked the TI examples for PCI, and I've found some problems in data communication. When looping the write/read part of the code several times, we occasionally end up with an isolated word that is not transferred correctly. We weren't sure whether it was a hardware problem in our design, a PCI limitation, or a host problem, but seeing this thread I'm not that confident the example is correct. Has anyone seen similar issues?

    It has been difficult to perform this testing since the example code, specifically the Linux driver example, is old and buggy. This seems to be partially solved in DVSDK 1.11, but some more documentation wouldn't hurt.

    We haven't checked the PCI throughput since we saw this thread and moved to a slave-only configuration, which is compatible with our design. At the current stage of the project we are not yet ready to conduct throughput tests for the slave case either.

    Regards,

  • Hi, Thomas

    I read your reply and see that you got PCI data transfers working via EDMA. I ran into a problem recently: the result of the DSP's EDMA write transfer is not correct. I think I could get some help from you - could you please post your PCI examples?

    Best regards.

    Cesc