
Looking for ways to increase simulation speed for TMS320F2812

Other Parts Discussed in Thread: TMS320F2812

Hi,

We are developing a small piece of software that will run as part of a larger program on a TMS320F2812 to implement a specific function, and we will never ourselves need to install this software on physical hardware. While we have a licensed copy of Code Composer Studio 5.5, we chose to use the size-limited version of CCS 3.3 (http://e2e.ti.com/support/development_tools/code_composer_studio/f/81/t/41417.aspx) instead because it includes the cycle-accurate simulator, which simplifies our development, since this project is purely software on our end. Our only concerns are the correctness of the software and meeting timing requirements.

In deployment, data will be obtained through the normal acquisition channels such as the A/D, and access to this data has been abstracted to a few function calls. For testing purposes, we have been given a few hundred MB of data in several CSV files. Alternate data acquisition functions were written in a separate library, so that during development and testing we can link against this library to read from the files, and for deployment the library is omitted so that the "real" functions are used.
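
Just to make the setup concrete, the split looks roughly like the sketch below; the names and signatures here are made up for illustration and are not our actual interface.

    /* data_acq.h - hypothetical names; the real interface is similar in spirit */
    int  acq_open(const char *source);    /* prepare the data source                */
    int  acq_read_sample(float *sample);  /* 0 on success, nonzero on error or EOF  */
    void acq_close(void);

    /* data_acq_file.c - test-only implementation linked in for simulation runs.
       It reads samples from a CSV file instead of going through the A/D driver. */
    #include <stdio.h>
    #include "data_acq.h"

    static FILE *fp = NULL;

    int acq_open(const char *source)
    {
        fp = fopen(source, "r");
        return (fp == NULL);
    }

    int acq_read_sample(float *sample)
    {
        /* one value per call; this fscanf() is the part that turns out to be slow */
        return (fscanf(fp, "%f,", sample) == 1) ? 0 : 1;
    }

    void acq_close(void)
    {
        if (fp != NULL)
            fclose(fp);
    }

The deployment build simply links a different implementation of the same header against the "real" acquisition routines, so the application code itself never changes.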

Everything is working well, but reading data from files is extremely slow. Probably over 99% of the time is being devoted to I/O. A Windows equivalent of this software completes in about 5 seconds what takes over a day with the nearly identical Code Composer version. It would seem that something is artificially slowing the program running in Code Composer. I am guessing that the program is being slowed to match the 150 MHz clock speed of the 2812.

Since we are only interested in the results that are produced and the cycle-count data collected by the profiler, and not real-time accuracy, is there a way to speed up the execution?

Thanks for any information or suggestions!


Jim

  • Jim Monte said:
    Everything is working well, but reading data from files is extremely slow. Probably over 99% of the time is being devoted to I/O. A Windows equivalent of this software completes in about 5 seconds what takes over a day with the nearly identical Code Composer version.

    Unfortunately, C I/O routines are inherently slow. They are all funneled through very low-level functions that eventually reach breakpoints, at which point the debugger performs the I/O request. Here is a wiki page that gives some background on how C I/O operations work: http://processors.wiki.ti.com/index.php/Tips_for_using_printf

    There may be some things you could try to speed up the process. One option may be to use the Memory Load/Save feature to load/save the contents of your data file rather than using C I/O routines like fopen/fread/fwrite. You could also try customizing it further using a GEL function (if you are familiar with GEL) as described in this forum thread (http://e2e.ti.com/support/development_tools/code_composer_studio/f/81/t/132199.aspx).
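
    As a very rough sketch of the GEL route (all addresses, lengths, and the file name below are placeholders, and the exact GEL_MemoryLoad() arguments should be checked against the GEL reference for your CCS version), something like the following could be hooked to a breakpoint action:

        /* GEL sketch: pull a block of data from a host file straight into target
           memory whenever this function is invoked (e.g. from a breakpoint action).
           Placeholder arguments: start address 0x8000 on the data page (page 1 on
           C28x), 0x400 words, read from the named host file. */
        hotmenu LoadInputBlock()
        {
            GEL_MemoryLoad(0x8000, 1, 0x400, "C:\\data\\input_block.dat");
        }

    Since it is declared hotmenu, the same function can also be run manually from the GEL menu to sanity-check the file before wiring it to a breakpoint.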

  • Thanks for the suggestions. I was hoping to minimize changes between platforms if possible, but may end up having to consider those options. Besides inefficient implementations of the C I/O routines, it seems that the simulator is being slowed to work at the rate of the real chip. Is there a way to bypass this restriction? The only items of interest are verifying correctness and the results from the profiler.

    Jim

  • Hi,

    I haven't seen a way to speed up the simulators; as you've noticed, it takes much more time to simulate a program than to run it on the device.

    One possibility is to create a file with the values you want to use as a C array and include it in your project. This way you wouldn't have to read the file at runtime because the array would already be part of the executable. But I think this is not a real option for you; you would probably run out of memory, right?
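
    For instance (just a sketch with made-up values), a small script on the PC could turn the CSV into a generated header along these lines:

        /* input_data.h - generated offline from the CSV; the values here are made up.
           The whole table is linked into the executable, so no file I/O is needed at
           run time, but of course it has to fit in the target's memory. */
        #define INPUT_DATA_LEN 4
        const float input_data[INPUT_DATA_LEN] = {
            1.25f, -0.50f, 3.75f, 0.00f
        };

    The program would then consume values directly from input_data[] instead of going through a file handle.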

    J

  • Each simulation requires 10s to 100s of MB of data, so building it into the executable is not possible. We have never used a real 2812 chip to compare with the simulator. I had assumed that the simulator software was intentionally slowing itself to match the speed of the real device since it is about 17 000X slower than the same program in Windows. Is the simulator a lot slower than a real device? How much of an improvement in speed would we see if the simulator is replaced by one of the starter kits that has real hardware?

    Jim

  • Yes, simulation tends to be slower than real hardware - especially device cycle accurate simulators. Functional and CPU simulators tend to be faster. At one point we looked into compiled simulation, but only for a select few simulators. In any case, jumping to real HW will make a substantial difference.

  • Thanks for all of the information. If I understand correctly, there are two independent issues reducing the speed of the simulations:

    • Performing I/O using the C stdlib implementation has a lot of overhead
    • The cycle-accurate simulators require large amounts of CPU processing, making them slower than a real device, even on a modern PC

    Some alternatives were mentioned for inputting data. The option of compiling the data into the program would not work due to the amount being greater than the total memory of the chip. The other option was

    "... to use Memory Load/Save feature to load/save the contents of your data file rather then using C I/O routines like fopen/fread/fwrite. You could also try customizing it further using a GEL function (if you are familiar with GEL) as described in this forum thread (http://e2e.ti.com/support/development_tools/code_composer_studio/f/81/t/132199.aspx)."

    The link mentions using GEL_MemoryLoad() at a breakpoint. I have read through the documentation for that function and have also seen a tutorial where data is read from a file when a breakpoint is reached. There is also a function, GEL_AddInputFile(), that looks like it could be used to supply data. Which of these would be the best to use from the standpoint of performance, and how much would the improvement be compared to repeatedly calling fscanf()?

    The use of real hardware would require a purchase that I would have to justify. How much of an improvement is likely compared to using the simulator when running on a fairly new computer? Specifically the processor is an AMD FX-6100 with a clock speed of 3.3 GHz. There are six cores, although that probably does not matter since I only see one being used heavily during the simulation.

    Any comments or other ideas would be appreciated.

    Jim

  • Hi,

    The time it takes to transfer the data from the PC to the hardware will depend on the connection that you use. If you get an XDS100 emulator like the one used in this TI kit (http://www.ti.com/lit/ml/sprufr5f/sprufr5f.pdf), it will take a long time (many minutes) to transfer the amount of data you mentioned. How long does it take to perform your simulation?

    J

  • Jim Monte said:
    The link mentions using GEL_MemoryLoad() at a breakpoint. I have read through the documentation for that function and have also seen a tutorial where data is read from a file when a breakpoint is reached. There is also a function, GEL_AddInputFile(), that looks like it could be used to supply data. Which of these would be the best to use from the standpoint of performance, and how much would the improvement be compared to repeatedly calling fscanf()?

    For pure data transfer performance, GEL_MemoryLoad() is the best. GEL_AddInputData() is used with probe points, which periodically inject data into a memory location on the target by reading a *.dat file on the host. As for how much faster it is than fscanf(), I can't say... I am not aware of any benchmark numbers comparing the two. But it is likely several orders of magnitude faster.
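
    If you do go the *.dat route, the file is in the CCS data file format. From memory, the header line is a magic number (1651), a format code (1 = hex, 2 = integer, 3 = long, 4 = float), the start address, the page, and the length, followed by one value per line, something like the snippet below. Please compare against the sine.dat that ships with the tutorials rather than taking this as exact.

        1651 1 8000 1 4
        0x0000
        0x1FFF
        0x3FFF
        0x5FFF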

    Jim Monte said:
    The use of real hardware would require a purchase that I would have to justify. How much of an improvement is likely compared to using the simulator when running on a fairly new computer? Specifically the processor is an AMD FX-6100 with a clock speed of 3.3 GHz. There are six cores, although that probably does not matter since I only see one being used heavily during the simulation.

    Right, old CCS did not take advantage of multiple cores, so having a multi-core PC doesn't really help much. There was also some limit in the simulator with regard to memory usage, so having a lot of RAM did not help beyond a certain point either.

    Performance aside, I would strongly recommend moving to real hardware simply because simulators are no longer supported. The latest version of v6 does not come with any simulators at all.

  • Johannes said:

    How long does it take to perform your simulation?


    The bare minimum for a reasonable check is 12 hrs. About 10X that amount of data would be desirable.

    Jim

  • Ki-Soo Lee said:

    Yes, simulation tends to be slower than real hardware - especially device cycle accurate simulators. Functional and CPU simulators tend to be faster. At one point we looked into compiled simulation, but only for a select few simulators. In any case, jumping to real HW will make a substantial difference.

    I found a data sheet for the TMS320F28x simulator at

    http://www.ti.com/lit/ml/sprs466/sprs466.pdf

    It is described as a product preview and dated March 2008. The speed of the simulator for the TMS320F2812 is given as 24 kCPS when running on a 3.2 GHz Pentium 4. I would expect the speed on my computer to be similar, so it would appear that the simulator is slower than real hardware by a factor of 150M/24k = 6250. I did not see a data sheet on the actual product (not a preview). Does anyone have a link to such information? Assuming the performance did not change dramatically during development, my simulations should run about 6000X faster on real hardware without even addressing the I/O issues. It seems almost too good to be true. Is there something wrong with my reasoning?

    Jim

  • Ki-Soo Lee said:

    Performance aside, I would strongly recommend moving to real hardware simply because simulators are no longer supported. The latest version of v6 does not come with any simulators at all.

    It is a bit off the topic of simulation speed, but yes, I had seen the posts regarding the end of support for simulators. I would expect that they would be more difficult to build as chips become more complex, but it is still an unfortunate decision in my opinion. All else being equal, it would certainly make me less likely to choose TI over other alternatives. For this particular project, CCS 3.3 offers another significant advantage: it is the version used for the rest of the development (the majority), and using the same version will avoid any unforeseen compatibility issues.


    Jim

  • Ki-Soo Lee said:

    For pure data transfer performance, GEL_MemoryLoad() is the best. GEL_AddInputData() is used with probe points, which periodically inject data into a memory location on the target by reading a *.dat file on the host. As for how much faster it is than fscanf(), I can't say... I am not aware of any benchmark numbers comparing the two. But it is likely several orders of magnitude faster.

    I briefly tried GEL_MemoryLoad(), but when I tested its operation, it kept reading the beginning of the file rather than subsequent parts each time it was called. So to use this approach, I would literally have to make hundreds of thousands of data files and write a GEL function to select the appropriate sequence of files.

    I did not quite get GEL_AddInputData() to work. It kept giving a message box saying "Not all symbols resolve to a valid location" when I attached it to a breakpoint early in the program, with the breakpoint's action set to execute a GEL command calling GEL_AddInputData(). I am still not sure why this method did not work, since everything in the breakpoint it created looked OK. While it did not quite work, I could see that it was creating a breakpoint just like one that could be made manually, including the same progress bar GUI.

    The final approach, manually adding a breakpoint with an action of reading from a file, worked well. I used the standard sinewave tutorial to compare this method with equivalent fscanf() input by modifying the data input function (dataIO()) to return a value on error or EOF and to either read from the breakpoint or use fscanf() depending on how the program was compiled. I could not find a way for the breakpoint data input to return an equivalent of EOF to the program, so I added a sentinel value of 0xFFFF to the data for the sine wave. Using fscanf(), I was able to read 40 blocks of 100 data points in 181 seconds using the time() function for timing. The equivalent program using the breakpoint for input could read 440 blocks in 10 seconds. So for this test, the breakpoint was 198 times faster, a significant improvement indeed!
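
    In rough outline, the modified dataIO() looked something like the sketch below. It is simplified, and the buffer name, size, compile-time switch, and sentinel handling are shown only for illustration.

        #include <stdio.h>

        #define BUFSIZE      100
        #define EOD_SENTINEL 0xFFFFu   /* end-of-data marker appended to the data */

        int   inp_buffer[BUFSIZE];     /* refilled by the breakpoint's file-read action */
        FILE *datafile;                /* opened in main() when built for fscanf() input */

        /* Returns 0 while data is available, nonzero on error or end of data. */
        int dataIO(void)
        {
        #ifdef USE_FSCANF
            int i;
            for (i = 0; i < BUFSIZE; i++)
            {
                if (fscanf(datafile, "%d", &inp_buffer[i]) != 1)
                    return 1;                    /* read error or EOF */
            }
            return 0;
        #else
            /* A breakpoint on this function with a "read data from file" action
               refills inp_buffer before execution continues, so the code only
               needs to check for the sentinel that marks the end of the data. */
            return (inp_buffer[0] == (int)EOD_SENTINEL);
        #endif
        }

    (This of course relies on 0xFFFF never appearing as a legitimate sample in the test data.)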

    Jim

  • Jim Monte said:

    The final approach, manually adding a breakpoint with an action of reading from a file, worked well. I used the standard sinewave tutorial to compare this method with equivalent fscanf() input by modifying the data input function (dataIO()) to return a value on error or EOF and to either read from the breakpoint or use fscanf() depending on how the program was compiled. I could not find a way for the breakpoint data input to return an equivalent of EOF to the program, so I added a sentinel value of 0xFFFF to the data for the sine wave. Using fscanf(), I was able to read 40 blocks of 100 data points in 181 seconds using the time() function for timing. The equivalent program using the breakpoint for input could read 440 blocks in 10 seconds. So for this test, the breakpoint was 198 times faster, a significant improvement indeed!

    I forgot to mention, but I also removed the processing function from the program in the tutorial, so these times are entirely for data input.


    Jim

  • Jim Monte said:

    The final approach, manually adding a breakpoint with an action of reading from a file, worked well. I used the standard sinewave tutorial to compare this method with equivalent fscanf() input by modifying the data input function (dataIO()) to return a value on error or EOF and to either read from the breakpoint or use fscanf() depending on how the program was compiled. I could not find a way for the breakpoint data input to return an equivalent of EOF to the program, so I added a sentinel value of 0xFFFF to the data for the sine wave. Using fscanf(), I was able to read 40 blocks of 100 data points in 181 seconds using the time() function for timing. The equivalent program using the breakpoint for input could read 440 blocks in 10 seconds. So for this test, the breakpoint was 198 times faster, a significant improvement indeed!

    I forgot to mention, but I also removed the processing function from the program in the tutorial, so these times are entirely for data input.


    Jim


    I noticed today that watch data is evaluated at the breakpoint even when the action is reading from a file. I only had about three scalar items in the watch window, but when I deleted them and closed the window (as a precaution), the 440 blocks were read in only 5 seconds. Now the input of data through a breakpoint is about 400 times faster.

    Jim