In the attached file I present the results of measurements I carried out on a TMS320C6474.
The goal was to get a realistic picture of the core 0 to DDR2 write interface performance. The results I obtained are far from optimal, so I would appreciate some help from TI in answering the questions raised in the document. (It contains two chapters, core access and DMA access, with the corresponding results.)
Please also don't hesitate to correct any mistakes I may have made.
Thanks for your support,
With best regards,
There's an Application Note on the Throughput Performance of the C6472 that covers most of what you've discussed. http://www.ti.com/lit/an/spraay0a/spraay0a.pdf
I took a cursory look at the document you wrote and noted some errors, such as the bus widths you assumed when calculating the theoretical performance of the internal buses. These are detailed in the Throughput Performance application note; please have a look at it.
Regarding the read-modify-write routines with the cache turned off, and the comments about poor DDR performance: I suspect you'll see this on any device with a DDR2 interface that isn't caching the data. The performance you show later, with the cache on and larger transfer sizes, comes from burst transfers. If you read-modify-write with the cache off, you have to issue a small access to DDR, get that data in, modify it, and write it all the way back out before you can read the next piece of data. With the cache turned on, a full cache line's worth of data is burst in, the modifications are done and written to the local cache, and the data resides there until it is evicted or a write-back is performed. That is a much more efficient use of the interface, and it is the reason we use large cache line sizes.
Hi Chad,

Thank you for looking at my document.

1 - Although this report is very interesting, I don't understand why you are referring to the "spraay0a.pdf" application report, since it relates to the C6472 running at 500 MHz while I am speaking of a C6474 running at 1 GHz. The CPU frequency is quite different, and so is the SCR A/B architecture (bus width and organisation). I did read "spraaw5", which is dedicated to the C6474. One can find measured EDMA throughput figures in §4.2, in particular for TC0 performing transfers from L2 to the 32-bit DDR2 and reaching 2444.64 MB/s. It is also written that (given the internal SCR paths and organisation) the maximum theoretical throughput for such transfers from GEM0_L2 to the 32-bit DDR2 is 2667 MB/s (whatever TC is used). All this seems to confirm my assertion (or conversely). This is why I am expecting to measure up to 2.4 GB/s instead of 2 to 2.13 GB/s. Please tell me if I am still wrong.

2 - In "spraay0a" there are some words on "DMA transfer overhead" measurements. Do you know if the corresponding results for the C6474 are published somewhere?

3 - I understand the benefit of a large cache line size, and also the differences between activating the cache or not. But my document shows precisely that in my particular case, where I want to run a long 512 MB write-only transfer to DDR2, activating the cache or not makes almost no difference to the duration. The cache does not help, because the write (remember that I only want to execute the loop: *pt_ddr2++ = 0;) is actually converted into a read-modify-write by the cache hardware. If there is some trick to overcome this, let me know.

With best regards,
Bruno
Yes, you're right, I grabbed the wrong one. Must have been thinking about 6472 that I had just been posting about and was using the wrong numbers.
1.) Back to "why I am expecting to measure up to 2.4 GB/s instead of 2 to 2.13 GB/s": using TC0, you're at the edge of the theoretical performance limits of both the DDR and the EDMA, which is 2.66 GB/s. There's always some amount of overhead, and it's difficult to ever reach the true theoretical limit; you'll see this in any system. That said, you should be using TC3-TC5, so that you're only hitting limitations on the DDR side of things and getting closer to the theoretical limits. The differences are shown in the throughput application note.
2.) I don't see the DMA transfer overhead results published for the C6474 either.
3.) CPU writes to DDR space that isn't resident in the cache are going to take longer. If the data were resident in the cache, it would have been written to the cache instead of all the way out to DDR. It did look like you had this case with the data cached and write-backs occurring, but it wasn't clear whether you used a cache write-back command or were effectively evicting the lines manually. I'd assume it was a cache write-back command.
In cases where you're processing and modifying data in a larger memory block that needs to be retained in external memory, it's recommended to use the EDMA to transfer the data.
One other thing to note: this is a seven-year-old architecture. You may want to take a look at our KeyStone architecture devices (the C66x devices).