I want to benchmark the F28069 Piccolo with this simple floating-point [5x5] matrix multiplication. This code is copied from the TI benchmark sample and modified for [5x5] matrix size.
The code apparently runs for 100 iterations, but freezes at 1000 iterations and over. I would like to perform 100,000 loops.
#include <stdio.h>
#include <math.h>
void main(void) {
    long j; /* int is 16 bits on the C28x, so the counter must be long to reach 100000 */
    int m, n, p;
    float m3[5][5] = { {0.0 , 0.0 , 0.0 , 0.0 , 0.0},{0.0 , 0.0 , 0.0 , 0.0 , 0.0},{0.0 , 0.0 , 0.0 , 0.0 , 0.0},{0.0 , 0.0 , 0.0 , 0.0 , 0.0},{0.0 , 0.0 , 0.0 , 0.0 , 0.0} };
    const float m1[5][5] = { {0.0001, 0.001, 0.01, 0.1, 1},{0.001, 0.01, 0.1, 1, 10},{0.01, 0.1, 1, 10, 100},{0.1, 1.0, 10, 100, 1000},{1, 10, 100, 1000, 10000} };
    const float m2[5][5] = { {0.0001, 0.001, 0.01, 0.1, 1},{0.001, 0.01, 0.1, 1, 10},{0.01, 0.1, 1, 10, 100},{0.1, 1.0, 10, 100, 1000},{1, 10, 100, 1000, 10000} };
    printf("Benchmark Program \n");
    printf("Starting \n");
    /* Repeat the [5x5] multiply m3 = m1 * m2 */
    for(j = 0; j < 100000; j++) {
        for(m = 0; m < 5; m++) {
            for(p = 0; p < 5; p++) {
                m3[m][p] = 0;
                for(n = 0; n < 5; n++) {
                    m3[m][p] += m1[m][n] * m2[n][p];
                }
            }
        }
    }
    printf("Ending \n");
}
Screenshot attached. Any comments?
Thanks again. What about this "make" problem? I get the compiler error in some circumstances. In other circumstances, I'll get this Make error:
After two weeks of this, I'm either going to find some support or abandon the effort. My other candidates (STM32F4/Keil and LCP78xx/CodeRed) have been running real-time state-space system simulations for a week. I can't even get the Piccolo to blink an LED now.
Once you set up the project to work with the compiler version you have, you will not get the managed make error. Also, it looks like you didn't put this example in the right place, which will prevent it from finding all the support files it needs (that's why all the files have "!" icons). This project needs to reside in c:\ti\controlsuite\device_support\f2806x\version\f2806x_examples\.
I understand your frustration. I've been working with CCS for years now, so it's all second nature to me. The reason CCS is so complex is that it has to support so many different architectures. Everything from the smallest MSP430 to the biggest C6000 multicore DSP is programmed through CCS, so everything has to be very extensible, which adds a lot of complexity.
Trey
Trey German
C2000 Applications
OK, I made a totally fresh installation of ControlSUITE and CCS. I went through the registry and hard drive and cleaned out all of the TI garbage from previous installations and rebooted a couple times. I installed ControlSuite and CCS, opened CCS and followed instructions to the letter. See screenshots of green "checkmarks" in the TI Resource Explorer. It imports the example project, builds the project, and sets up the debugger (Steps 1-3). I left the properties dialog on the screen, set to the correct hardware and USB emulator, and also the compiler version. It throws 27 errors and does not produce an output file to proceed to Step 4. Notice that I have not performed any independent actions here, I'm allowing the wizard to guide the process of loading a demonstration example.
Stephen,
Sorry to hear you're still having trouble. The resource explorer project import is designed to make things very easy for users, but I guess that hasn't been true for you. Would you mind posting the errors? I suspect an issue with linked-in resources in the project.
The way our controlSUITE software is architected right now, if the right versions of everything aren't installed the projects break. We are fully aware of this issue and the frustration it causes new users and we are developing a new controlSUITE structure which will fix many of these problems. Later this year we will be releasing this new controlSUITE architecture and problems like this won't happen any more.
Yahoo. With a little (more) perseverance, I got the program running again. The compiler was adjusted to v6.0.2. Interestingly, it won't compile with FLASH, but will run from RAM. The run took 47 seconds (same as before). I thought the FPU might not be in use, so I checked the runtime support library under Properties; it was set to rts2800_ml.lib. I changed this to rts2800_fpu32.lib, and now it won't compile anymore. Screenshots and error log attached for the rts2800_fpu32.lib build.
[code]
<Linking>
"../F2806x_RAM_BlinkingLED.CMD", line 49: error: BEGIN memory range has already been specified
"../F2806x_RAM_BlinkingLED.CMD", line 49: error: BEGIN memory range overlaps existing memory range BEGIN
"../F2806x_RAM_BlinkingLED.CMD", line 51: error: RAMM0 memory range has already been specified
"../F2806x_RAM_BlinkingLED.CMD", line 51: error: RAMM0 memory range overlaps existing memory range RAMM0
"../F2806x_RAM_BlinkingLED.CMD", line 52: error: progRAM memory range overlaps existing memory range RAML0_L3
"../F2806x_RAM_BlinkingLED.CMD", line 54: error: FPUTABLES memory range has already been specified
"../F2806x_RAM_BlinkingLED.CMD", line 54: error: FPUTABLES memory range overlaps existing memory range FPUTABLES
"../F2806x_RAM_BlinkingLED.CMD", line 55: error: IQTABLES memory range has already been specified
"../F2806x_RAM_BlinkingLED.CMD", line 55: error: IQTABLES memory range overlaps existing memory range IQTABLES
"../F2806x_RAM_BlinkingLED.CMD", line 56: error: IQTABLES2 memory range has already been specified
"../F2806x_RAM_BlinkingLED.CMD", line 56: error: IQTABLES2 memory range overlaps existing memory range IQTABLES2
"../F2806x_RAM_BlinkingLED.CMD", line 57: error: IQTABLES3 memory range has already been specified
"../F2806x_RAM_BlinkingLED.CMD", line 57: error: IQTABLES3 memory range overlaps existing memory range IQTABLES3
"../F2806x_RAM_BlinkingLED.CMD", line 59: error: BOOTROM memory range has already been specified
"../F2806x_RAM_BlinkingLED.CMD", line 59: error: BOOTROM memory range overlaps existing memory range BOOTROM
"../F2806x_RAM_BlinkingLED.CMD", line 61: error: RESET memory range has already been specified
"../F2806x_RAM_BlinkingLED.CMD", line 61: error: RESET memory range overlaps existing memory range RESET
"../F2806x_RAM_BlinkingLED.CMD", line 66: error: RAMM1 memory range has already been specified
"../F2806x_RAM_BlinkingLED.CMD", line 66: error: RAMM1 memory range overlaps existing memory range RAMM1
error #10010: errors encountered during linking; "BlinkingLED.out" not built
>> Compilation failure
gmake: *** [BlinkingLED.out] Error 1
gmake: Target `all' not remade because of errors.
**** Build Finished ****
[/code]
CCSv5 does some weird things that should make running projects easier, but in some cases it breaks things. What happened here is that when you imported the CCSv4 project into CCSv5, it automatically added a linker command file for the 06x device you are using, but your project already had a linker command file. The two files define the same ranges in memory, which is why the linker is complaining about overlap. To fix this you can either remove the F2806x_RAM_BlinkingLED.cmd file or, in the build properties, remove the F2806x_ram_lnk.cmd file. The fact that you changed to the FPU run time support library didn't have anything to do with the above errors.
Also, I believe switching to the FPU run time support library ought to solve the speed issue.
Regards,
That was the final bit. Thanks for your continued responsiveness.
2.103 seconds for 5.94 MFLOPS. Does that sound consistent with design capability?
I'm a little nervous about the TI CCS, but at least we now understand the hardware capabilities.
The core is capable of much more than 5.94 MFLOPS. If you hand-coded assembly, you could theoretically get up to 160 MFLOPS, as we have a parallel add-multiply instruction that executes in a single cycle. That being said, MFLOPS is more of a marketing number because it really depends on how the code is written: assembly, C, loops unrolled, optimizations, etc. Your question has spurred some internal discussion between the floating point experts and I expect they will reply to this post soon.
Regards,
Trey
I suspect the compiler is not doing as well as it could. Here are a few things to try out.
Understand that as the compiler generates more optimized code, there is a tradeoff with debug capability. When you start out, you may want the most debug capability available. In this case the compiler options will likely be limited to -g and -mt. You would then increase optimization from there.
There are some more details of these tips on this wiki page:
http://processors.wiki.ti.com/index.php/C28x_Code_Generation_Tips_and_Tricks#Optimization
Regards
Lori
Stephen Moore
STM32F4 (168MHz and FPU) 8.6 seconds
NXP mbed LPC1768 (96MHz) 16.2 seconds
LPCXpresso LPC1769 (120MHz) 19.4 seconds
Piccolo F28069 (80MHz and FPU) 43.7 seconds
After an unsuccessful attempt to run the CoreMark benchmark on C2000 (CoreMark doesn't like the lack of an 8-bit data type on C2000), I tried to run the code from the first post. Here are my results:
1. Code in Flash - default waitstates
-O0 - 85.149 seconds
-O2 - 51.782 seconds
-O4 - 51.782 seconds
2. Code in Flash - minimum waitstates
-O0 - 11.594 seconds
-O2 - 5.603 seconds
-O4 - 5.602 seconds
3. Code in SRAM
-O0 - 11.241 seconds
-O2 - 5.414 seconds
-O4 - 5.414 seconds
I don't have an STM32F4 to retest the code, but is it possible that a Piccolo at 80MHz is executing floating-point code much faster than a 168MHz STM?
It is also interesting how not properly initialized flash gives you very crappy performance 0:-)
Just a reply to agree with you about the flash.
I unwittingly left out the example flash initialisation routines when creating my software and spent a good day scratching my head wondering why it was taking something like 15 clock cycles to do a single assembler instruction.
Once I put the flash wait-state setup code back in, performance was back to 1 instruction per clock cycle and all was great :-)
Almost not worth putting code in SRAM, the flash is so quick when set up properly.
That being said MFLOPs is more of a marketing number because it really depends on how the code is written: assembly, c, loops unrolled, optimizations, etc.
I'm using FLOPS as my benchmark number, based on code that was derived from TI benchmarking application note.
The performance is highly dependent on the compiler settings. The compiler sensitivity is extremely high, yielding greater than an order of magnitude differences in performance.
In general, the F28069 is running FPU faster than the STM32, although the STM could be running slow because of similar optimization issues.
If you hand coded assembly you could actually theoretically get up to 160 MFLOPs as we have a parallel add multiply instruction that is single cycle.
The F28069/CCS system is very sensitive and tricky. I'm concerned that what could be profitable software development time will be spent figuring out the sensitivities of the TI system. We could spend forever tweaking settings instead of writing revenue-generating code. As you mention, hand-coding the most math-intensive routines (matrix multiplication, dot products, or matrix inversions) may be the best way to go. Hand-coded assembly would basically keep the FPU-heavy routines from being CPU throughput hogs and alleviate our timing worries.