TMS320C6474 - DDR2 write interface performance

Other Parts Discussed in Thread: TMS320C6474

Hi,

In the attached file I present results and measurements I have carried out on a TMS320C6474.

The purpose was to get a realistic picture of the core 0 to DDR2 write-interface performance. The results I obtained are far from optimal, so I would appreciate some help from TI in answering the questions I put into this document. (It contains two chapters, core access and DMA access, with the corresponding results.)

Please also don't hesitate to correct any mistakes I may have made.

1778.write_datarate_v3.pdf

Thanks for your support,

With best regards,

Bruno(THALES COMPANY)

  • Bruno,

    There's an Application Note on the throughput performance of the C6472 that covers most of what you've discussed: http://www.ti.com/lit/an/spraay0a/spraay0a.pdf

    I took a cursory look at the document you wrote and noted some errors, such as the bus widths you assume when calculating the theoretical performance of the internal buses.  These are detailed in the Throughput Performance App Note.  Please have a look at it.

    Regarding the read-modify-write routines with cache turned off, and the comments about poor DDR performance: I suspect you'll see this on any device with a DDR2 interface that isn't caching the data.  The performance you show later, with cache on and larger transfer sizes, comes from burst transfers.  If you read-modify-write with cache off, you have to do a small access to DDR, get that data in, modify it, and write it all the way back out before you can read the next piece of data.  With cache turned on, the device bursts in a cache line's worth of data; the modifications are done and written to the local cache, and the data resides there until the line is evicted or a write-back is performed.  That is a much more efficient use of the interface, and it is the reason we use large cache line sizes.
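    As a rough illustration (this loop and its names are my own sketch, not code from the attached document), here is the access pattern in question. With caching disabled for the DDR2 region, every iteration pays one small read plus one small write to DDR2; with caching enabled, the first miss bursts in a whole line and subsequent accesses hit locally:

```c
#include <stdint.h>
#include <stddef.h>

/* Read-modify-write over a DDR2-resident buffer (illustrative sketch).
 * Cache off: each iteration issues one small DDR2 read, stalls for the
 * data, then issues one small write -- no bursting at all.
 * Cache on: the first miss bursts a full cache line into local cache;
 * the modify and write then hit locally until the dirty line is
 * evicted or explicitly written back. */
void rmw_region(volatile uint32_t *buf, size_t nwords, uint32_t mask)
{
    size_t i;

    for (i = 0; i < nwords; i++)
        buf[i] |= mask;   /* read, modify, write back */
}
```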

    Best Regards,

    Chad

  • Hi Chad,

    Thank you for looking at my document.

    1 - Although this report is very interesting, I don't understand why you are referring to the "spraay0a" application report, since it relates to the C6472 running at 500 MHz while I am speaking of a C6474 running at 1 GHz. The CPU frequency is quite different, and so is the SCR A/B architecture (bus width and organisation).

    I did read "spraaw5", which is dedicated to the C6474. Section 4.2 gives some measured EDMA throughput figures, especially for TC0 performing a transfer from L2 to the 32-bit DDR2 and reaching 2444.64 MB/s. It is also written that (given the SCR internal paths and organisation) the maximum theoretical throughput for such transfers from GEM0_L2 to the 32-bit DDR2 is 2667 MB/s, whatever TC is used. All this seems to confirm my assertion (or conversely).
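    For what it's worth, the 2667 MB/s figure can be reproduced directly from the interface parameters; here is a quick sanity-check helper (my own, assuming DDR2-667 timing, i.e. about 666.67 MT/s on the 32-bit data bus):

```c
/* Peak DDR2 throughput = bus width in bytes x data rate in MT/s.
 * For a 32-bit bus at DDR2-667: (32/8) * 666.67 ~= 2667 MB/s. */
double ddr2_peak_mb_s(double bus_width_bits, double data_rate_mts)
{
    return (bus_width_bits / 8.0) * data_rate_mts;
}
```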

    This is why I expect to measure up to 2.4 GB/s instead of 2 to 2.13 GB/s.

    Please tell me if I am still wrong.


    2 - In "spraay0" there are some words on "DMA transfer overhead" measurements. Do you know if the corresponding results for the C6474 are published anywhere?


    3 - I understand the benefit of a large cache line size, and also the differences between activating the cache or not. But my document shows precisely that in my particular case, where I want to run a long 512 MB write-only transfer to DDR2, activating the cache makes almost no difference to the duration. The cache does not help, because the write (remember that I only execute the loop: *pt_ddr2++ = 0;) is actually converted into a read-modify-write by the cache hardware.
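    To make the scenario concrete, the benchmark's inner loop amounts to the following (function name and signature are mine, for illustration). As noted above, with a line-allocating cache every missing line must first be read from DDR2 before the stores can complete, which is why even this pure write stream generates read traffic:

```c
#include <stdint.h>
#include <stddef.h>

/* Write-only fill of a DDR2-resident buffer (illustrative sketch of
 * the loop '*pt_ddr2++ = 0;').  When the cache allocates on the miss,
 * each cache-line miss first READS the whole line from DDR2 before
 * the zeros are merged in; the dirty line is later written back,
 * turning the write-only stream into read-plus-write DDR2 traffic. */
void fill_region(volatile uint32_t *dst, size_t nwords)
{
    size_t i;

    for (i = 0; i < nwords; i++)
        dst[i] = 0;
}
```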

    If there is some trick to overcome this , let me know.


    With best regards,

    Bruno
     

  • Bruno,

    Yes, you're right, I grabbed the wrong one.  I must have been thinking about the C6472 I had just been posting about, and was using the wrong numbers.

    1.) Back to 'why I expect to measure up to 2.4 GB/s instead of 2 to 2.13 GB/s': you're using TC0, which puts you at the edge of the theoretical performance limits of both the DDR2 interface and the EDMA, i.e. 2.66 GB/s.  There's always some amount of overhead, and it's difficult to ever reach the true theoretical limit; you'll see this in any system.  That said, you should be using TC3-TC5, so that at least you're only hitting limitations on the DDR side of things and getting closer to the theoretical limit.  The differences are shown in the throughput application note.

    2.) I haven't seen this DMA transfer overhead data published for the C6474 either.

    3.) CPU writes to a DDR region that isn't resident in cache are going to take a longer time.  If the data were resident in cache, it would have been written to the cache instead of all the way out to DDR.  It did look like you had this case, with the data cached and write-backs occurring, but it wasn't clear whether you used a cache write-back command or were effectively evicting the lines manually.  I'd assume it was a cache write-back command.

    In cases where you're processing and modifying data in a larger memory block that needs to be retained in external memory, it's recommended to use the EDMA to transfer the data.
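    To sketch what that EDMA approach involves: the PaRAM-set field layout below follows the EDMA3 controller's documented format, but the helper function, the OPT bit position, and the field packing are my own illustrative assumptions, not tested driver code:

```c
#include <stdint.h>

/* One EDMA3 PaRAM set (field order per the EDMA3 controller spec). */
typedef struct {
    uint32_t opt;           /* transfer options (TCC, SYNCDIM, ...)  */
    uint32_t src;           /* source address                        */
    uint32_t acnt_bcnt;     /* ACNT in low 16 bits, BCNT in high 16  */
    uint32_t dst;           /* destination address                   */
    uint32_t bidx;          /* SRCBIDX low 16, DSTBIDX high 16       */
    uint32_t link_bcntrld;  /* link address / BCNT reload            */
    uint32_t cidx;          /* SRCCIDX low 16, DSTCIDX high 16       */
    uint32_t ccnt;          /* CCNT                                  */
} edma3_param_t;

/* Fill a PaRAM set for one AB-synchronized, contiguous copy of
 * bcnt bursts of acnt bytes each (single array, CCNT = 1). */
void edma3_setup_copy(edma3_param_t *p, uint32_t src, uint32_t dst,
                      uint16_t acnt, uint16_t bcnt)
{
    p->opt          = (1u << 2);   /* SYNCDIM = AB-sync (assumed bit) */
    p->src          = src;
    p->dst          = dst;
    p->acnt_bcnt    = (uint32_t)acnt | ((uint32_t)bcnt << 16);
    p->bidx         = (uint32_t)acnt | ((uint32_t)acnt << 16); /* contiguous */
    p->link_bcntrld = 0xFFFFu;     /* NULL link terminates the transfer */
    p->cidx         = 0;
    p->ccnt         = 1;
}
```

    Triggering the channel, and steering it to TC3-TC5 as suggested earlier, would then be done through the channel controller registers or the TI CSL, which I have left out here.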

    One other thing to note: this is a 7-year-old architecture.  You may want to take a look at our KeyStone architecture devices (the C66x devices).

    Best Regards,
    Chad