Could you (TI engineers, or anyone who has a similar project) please provide the project behind 'EDMA Throughput Between Different CorePac L2s to DDR3'?
I would like to measure the performance data myself instead of relying on the numbers in the SPRABK5 report. Thanks a lot.
The data looks like this. Do you have a similar demo? (This picture is from the Throughput Performance Guide for C66x KeyStone Devices.)
Nicole He,
The tests used for these calculations were not written for use outside of the TI device verification department. I am not in that group, but have worked with them and I can tell you that because of the way these tests are written and connected to automated testing, it would be time consuming to pull those exact tests into a usable project for you to run.
However, I have been working on a project that allows me to verify the frequency that my DDR is running, and this project may be a good starting point for you or it may be good enough to let you generate similar numbers. This early version of the DSPMemSpeed.zip project is attached. You can Import it directly into CCSv5 from the zip file without un-zipping it first.
This project does require that you know how fast your DSP is running. Without test equipment at hand, I have used the project on the TI Wiki page "What is my DSP clock speed". With a clock that can time a 10-second duration, you can verify the DSP clock speed with that project.
I would like to make a similar Wiki page for "What is my DDR clock speed" and that is why I wrote this EDMA3-based project. I have used it on four different devices so far: the DM8168, DM8148, C6657, and C6678. I have had trouble with the C6657 version, and I have not validated it recently on the DM8168 and DM8148. But I modified it specifically for the C6678 for both you and me to use, so please let me know any comments you have that I can use to improve this fairly simple test.
The main.c has a set of #defines to decide the device that is being tested. This may not be the cleanest way to do this, but I have based this on register-level CSL that might not be available for every device. You can choose from the list to define which device you are using:
#define SELECT_DM8168 1
#define SELECT_DM8148 2
#define SELECT_C6657 3
#define SELECT_C6678 4
#define SELECT_DEVICE SELECT_C6678 // SELECT_DM8148 SELECT_DM8168
For the C6678, since there are three Channel Controllers, this test requires you to choose one of them. I think I left it using CC1. This is defined in soc.h. You can search through there and find the #if/elif where some of the EDMA parameters are configured. The soc.h file is where you would make updates to add another device to use with the test.
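As a rough illustration of the #if/#elif selection described above, here is a minimal sketch. It is not the actual contents of soc.h; DEVICE_NAME and selected_device_name() are hypothetical names used only to show how the SELECT_DEVICE constant could steer the preprocessor.

```c
/* Hypothetical sketch of the device-selection pattern, not the real soc.h. */
#define SELECT_DM8168 1
#define SELECT_DM8148 2
#define SELECT_C6657  3
#define SELECT_C6678  4

/* In the project this line sits in main.c, ahead of #include "soc.h". */
#define SELECT_DEVICE SELECT_C6678

/* Only the branch matching SELECT_DEVICE is compiled; an editor that
   evaluates these conditions shows the other branches greyed out. */
#if SELECT_DEVICE == SELECT_DM8168
  #define DEVICE_NAME "DM8168"
#elif SELECT_DEVICE == SELECT_DM8148
  #define DEVICE_NAME "DM8148"
#elif SELECT_DEVICE == SELECT_C6657
  #define DEVICE_NAME "C6657"
#elif SELECT_DEVICE == SELECT_C6678
  #define DEVICE_NAME "C6678"
#else
  #error "SELECT_DEVICE must be one of the SELECT_* constants"
#endif

/* Returns the name of the device this build was compiled for. */
const char *selected_device_name(void)
{
    return DEVICE_NAME;
}
```

The #error branch is a design choice in this sketch: a mistyped selection stops the build instead of silently compiling for no device.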
For the C6678, since the project attached here is configured for it, you can load it and run it as it is. There are settings in main.c if you are not running at 1.0 GHz, for example. The test needs the right number for that.
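To show why the DSP clock setting matters, here is a minimal sketch of how a cycle count could be turned into a transfer rate. This is an assumption about the general approach, not the code in main.c: DSP_CLOCK_HZ, DDR_BUS_BYTES, and apparent_mtps() are illustrative names, and the 64-bit DDR3 bus width is assumed.

```c
#include <stdint.h>

/* Assumed DSP core clock in Hz; must match the real device clock. */
#define DSP_CLOCK_HZ  1000000000.0
/* Assumed DDR3 data bus width: 64 bits, so 8 bytes per transfer. */
#define DDR_BUS_BYTES 8.0

/* Convert bytes moved and elapsed CPU cycles into an apparent DDR
   transfer rate in MT/s. Returns 0 if no cycles have elapsed. */
double apparent_mtps(uint64_t bytes_moved, uint64_t cpu_cycles)
{
    if (cpu_cycles == 0u) {
        return 0.0;
    }
    double seconds   = (double)cpu_cycles / DSP_CLOCK_HZ;
    double transfers = (double)bytes_moved / DDR_BUS_BYTES;
    return transfers / seconds / 1.0e6;
}
```

Because elapsed time is derived from the cycle count, a wrong DSP_CLOCK_HZ scales the reported rate by the same ratio as the clock error.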
The test configures several DMA channels to copy data from a srcBuff to a dstBuff. You will get different results depending on the size of the test and the location of the srcBuff. You can select src_buffer1 or src_buffer2 to choose between CorePac0's L2 or the MSMC RAM, respectively. The intention is that the size of the test transfers will not exceed the size of the available DDR3 banks. This way there will be no page-thrashing overhead but only pure data writes to the DDR3.
My first test was using CorePac0's L2 which had to go through the CPU/3 TeraNet3_A and my apparent memory speed was only 666 MT/s even though it should have been 1333 MT/s. Then I changed the srcBuff to src_buffer2 to use the MSMC RAM to copy to DDR3, and the apparent memory speed went up to 1311 MT/s which is an impressive 98% utilization.
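One possible reading of the 666 MT/s figure is that the CPU/3 switch, not the DDR3, limits the L2 path. The sketch below works through that arithmetic under assumptions that are not stated in this thread: a 1.0 GHz DSP clock, a 128-bit TeraNet running at CPU/3, and a 64-bit DDR3 bus. The function names are illustrative.

```c
/* Bytes per second a bus can carry: clock rate times width in bytes. */
double bus_bytes_per_sec(double clock_hz, unsigned width_bits)
{
    return clock_hz * (double)(width_bits / 8u);
}

/* Express a byte rate as an equivalent transfer rate, in MT/s, on a
   DDR bus of the given width. */
double as_ddr_mtps(double bytes_per_sec, unsigned ddr_width_bits)
{
    return bytes_per_sec / (double)(ddr_width_bits / 8u) / 1.0e6;
}
```

With those assumptions, as_ddr_mtps(bus_bytes_per_sec(1.0e9 / 3.0, 128), 64) comes to roughly 666.7, which lines up with the apparent speed seen when the source buffer is in CorePac0's L2.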
Please let me know if you have any problems or questions with this project. And please let me know about any comments that should be added to the files to make them easier to use. If you would like a clearer description of how the test works, please ask. Or if you have a suggestion for how to describe it, please post that, too.
Regards, RandyP
Hi, Randy.
Thanks so much for the detailed reply and the related project. It really helps me a lot.
Here are some questions that came up when I used the project.
1. I seldom use #elif, though I understand that it lets this project be applied to different kinds of boards. But I still do not see where you set the configuration that makes the project choose the C6678 as the target board. (When I open CCS, some of the device-specific parts are shown in grey, and I do not yet understand how that happens.)
2. I loaded it on each core separately, ran the cores one at a time, and got the following results:
It looks like each core performs equally well.
But when I loaded the project on all 8 cores and ran them together, I got the following result:
It seems that some cores did not perform normally.
I assume the project is designed to test only one core at a time.
But then how were the results in the table above obtained? (It shows that when 4 cores are moving data together, the performance is affected because of collisions on the bus.)
So I would like to ask how to test collisions on the bus. Do you have a related test demo? I have heard that when a collision occurs on the bus, the performance is affected and sometimes the data is transferred in error.
Looking forward to more discussion.
Regards,
NicoleH
NicoleH,
[Your large font is much easier to read. I am lazy and stay with the default. What setting do you use?]
1. I use #if 0 a lot to temporarily remove code from a test without commenting every line, or #if Select == N / #elif Select == N for multiple uses. I formalize it with labels when offering it to others, such as you, so it is easier to follow. The lines of code shown in my previous post (highlighted with the green bar) show the #define constants used to make the selection for the C6678; if you change the line 6 of that excerpt to #define SELECT_DEVICE SELECT_DM8148, then the test will be compiled for the DM8148, for example. Try changing it that way and see how the highlighting changes. The CCSv5 editor and IDE is so smart that it immediately knows that you have changed that setting in main.c even before compiling; it does not know that setting has changed in soc.h until you re-compile, then the soc.h grey highlights will change.
I put the #define SELECT_DEVICE in main.c prior to the #include "soc.h" so that you do not have to edit anything in soc.h, only in main.c, for the purpose of configuring the test or changing the device. In soc.h there are some #if 0 / #elif 1 for selecting the EDMA3 instance for the tests. Later, I will change this to use a #define symbol to make it cleaner.
2. I am very glad that this worked for you. It is interesting in a positive way that you get the exact same results each time on each core. On another device I have been testing, I get different TransferCnt's each time and slightly different EMIF clock reports (yours on the C6678 only varied by 1 out of 10,000,000!). In a later version of this test, I will change the "EMIF Clock NNN MHz" to "DDR Transfer rate NNN MT/s" to be more clear on what the number means.
This test used the EDMA3 CC1 to run the tests. The DSP core triggers the transfers to occur, but has no direct involvement with the data movement. The test is done this way because the EDMA3 is very efficient at doing bursting transfers on the DDR3, and that is the way to get the best efficiency from the DDR3 bus / device.
The EDMA3 CC1 is a single resource that has 64 channels. Those channels can be partitioned among the 8 cores, but any one channel can only be used by one core at a time. When you try to run the test from any core one-at-a-time, it will operate the same, as you have shown. But when you try to run the same code on more than one core at the same time, the same channels are being triggered by more than one core and this will lead to errors in the Channel Controller.
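As a minimal sketch of what partitioning the channels could look like, assume an even split of the 64 channels across the 8 cores. The macro and function names are hypothetical, and this is not how DSPMemSpeed assigns channels today; a multicore version would also need the core number, which on C66x could be read from the DNUM register.

```c
#define EDMA_CC1_CHANNELS 64u  /* channel count stated in the post */
#define NUM_CORES         8u   /* C6678 CorePacs */
#define CHANNELS_PER_CORE (EDMA_CC1_CHANNELS / NUM_CORES)

/* First channel in the block reserved for the given core. */
unsigned first_channel_for_core(unsigned core_id)
{
    return core_id * CHANNELS_PER_CORE;
}

/* Non-zero if the channel falls inside the given core's block. */
int channel_owned_by_core(unsigned channel, unsigned core_id)
{
    unsigned first = first_channel_for_core(core_id);
    return channel >= first && channel < first + CHANNELS_PER_CORE;
}
```

With this split, core 0 would trigger only channels 0 to 7 and core 7 only channels 56 to 63, so no channel is triggered by two cores.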
As you determined, this test was written to be a single-core test. It was written only to test the memory performance of the device. The test would have to be re-written to use the EDMA3 in a different way if it needs to run on multiple cores. But this test is not an example of a practical use of the EDMA3, just an artificial memory speed test.
It amazes me that the memory performance is this high. 1311 / 1333 = 98% utilization. This test only does writes to the DDR3, so it is not practical from that point-of-view, either. But it meets my intent of measuring the DDR clock speed.
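For anyone repeating the comparison with their own numbers, the utilization figure is just the measured rate over the rated rate. A minimal sketch, with utilization_percent() as an illustrative name:

```c
/* Measured transfer rate as a percentage of the rated transfer rate.
   Returns 0 for a non-positive rated value. */
double utilization_percent(double measured_mtps, double rated_mtps)
{
    if (rated_mtps <= 0.0) {
        return 0.0;
    }
    return 100.0 * measured_mtps / rated_mtps;
}
```

Calling it with 1311 and 1333 gives a little over 98%, and with 666 and 1333 it gives just under 50%, matching the two cases described in this thread.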
Nicole He wrote: "but how did the above table get the result (it shows that when 4 cores are moving data together the performance is affected because of the collision on the bus)."
I do not agree with the numbers in this table, but I also do not have the exact tests that were run so I cannot explain what might be wrong with the numbers. But the simple test that you have run tells us both that the numbers in the table are pretty close to correct. It is not a big difference (in my opinion) that the 2nd, 3rd, and 4th columns vary slightly from each other and also vary slightly from the results of the DSPMemSpeed test. DSPMemSpeed reports over 98% utilization, and this table reports over 99% for columns 2, 3, and 4.
What is your reason for assuming the "performance is affected because of the collision on the bus"? The text in the Application Note says there is "no contention between TCs" in this scenario. They are stating that the only contention is with the several TCs using the same DDR3 EMIF bus and devices.
Nicole He wrote: "So I just wanna ask how to test the collision on the bus, do you have related test demo."
The KeyStone architecture is designed to maximize the total throughput of the device. It uses multiple internal buses and the many TeraNet switches that make up the total TeraNet traffic control design. This robust and high-speed design reduces collisions by getting more data pushed through from source to destination and by doing that more quickly than previous architectures. Anytime two or more bus masters try to access the same endpoint, whether it is a C66x CorePac or an EDMA3 Transfer Controller or a Multi-core Navigator transfer or an SRIO DirectIO operation, there will be a collision. But the bus and switch architectures handle this collision by forwarding the memory read and write commands in an orderly process so every requested transfer will take place as quickly as possible and always accurately.
The DSPMemSpeed test deals with collision on the DDR3 interface, since all TCs are writing to the same DDR3 interface. It also deals with collision on the MSMCRAM interface because all TCs are reading from the same MSMCRAM module. You can change the location of srcBuff by changing the DATA_SECTION macro to src_section1 (in a later version I will make these names more clear, like src_sectionL2) and see that the performance drops significantly. This is because of the collision of four TCs at a single CorePac's EMC port, and the speed of reading from a single CorePac's L2 through a slower speed TeraNet switch.
Nicole He wrote: "Like I heard, when the collision occurred on the bus the performance is affected and sometimes the data is transferred in error."
Please explain this very serious statement. There should never be any data transferred in error. By definition, collisions affect performance, but the KeyStone architecture minimizes that effect.
I cannot clarify or explain something you heard without knowing what you heard. But this is a serious statement that we need to follow-through with you to avoid any problems that may exist.
Hi, I ran the DSPMemSpeed.zip project and it worked. The result was 1301 MHz when data was transferred between DDR and MSMC, but only 666 MHz when data was transferred between L2 and DDR or MSMC, where it should be around 1300 MHz and 2000 MHz.
Can you help me out?
Coco,
Which processor are you using? Since you are jumping onto an existing thread, I do not want to make assumptions, but it would be C6678; we have other multicore DSPs with different characteristics.
What you are measuring is the correct result for what you are running. If you are getting the full speed entitlement that you expect using one method, then you have proven out the measurement methodology. The later results may not be what you expected, but they must be correct for what you have done.
DSPMemSpeed was intended to exercise the memory as much as possible to determine the speed at which it is running. Like you, I have used it for benchmarking. When you get different-than-expected results from an experiment, the normal scientific method is to examine the methods, the code, and the logical paths inside the device to help understand why the results were different.
When you look at the logical paths inside the device, what do you find different between the two different tests? What can you do differently to find other ways to exercise the DDR more?