Help running simple executable on F28069 Piccolo controlSTICK

Stephen Moore

I want to benchmark the F28069 Piccolo with this simple floating-point [5x5] matrix multiplication. This code is copied from the TI benchmark sample and modified for [5x5] matrix size.

The code apparently runs for 100 iterations, but freezes at 1000 iterations and over. I would like to perform 100,000 loops.

#include <stdio.h>
#include <math.h>

void main(void) {
    int j, m, n, p;
    float m3[5][5] = { {0.0 , 0.0 , 0.0 , 0.0 , 0.0},{0.0 , 0.0 , 0.0 , 0.0 , 0.0},{0.0 , 0.0 , 0.0 , 0.0 , 0.0},{0.0 , 0.0 , 0.0 , 0.0 , 0.0},{0.0 , 0.0 , 0.0 , 0.0 , 0.0} };
    const float m1[5][5] = { {0.0001, 0.001, 0.01, 0.1, 1},{0.001, 0.01, 0.1, 1, 10},{0.01, 0.1, 1, 10, 100},{0.1, 1.0, 10, 100, 1000},{1, 10, 100, 1000, 10000} };
    const float m2[5][5] = { {0.0001, 0.001, 0.01, 0.1, 1},{0.001, 0.01, 0.1, 1, 10},{0.01, 0.1, 1, 10, 100},{0.1, 1.0, 10, 100, 1000},{1, 10, 100, 1000, 10000} };

    printf("Benchmark Program \n");
    printf("Starting \n");

    for(j = 0; j < 100000; j++) {
        for(m = 0; m < 5; m++) {
            for(p = 0; p < 5; p++) {
                m3[m][p] = 0;
                for(n = 0; n < 5; n++) {
                    m3[m][p] += m1[m][n] * m2[n][p];
                }
            }
        }
    }

    printf("Ending \n");
}

Screenshot attached. Any comments?

over 12 years ago

0 Trey German289 over 12 years ago

TI__Mastermind 22025 points

Hi Stephen,

I'm curious what happens when the code crashes?

I have two guesses:

1) the watchdog is enabled and isn't being kicked which is causing your program to reset after a while

2) you are overflowing the stack

Try bumping the stack up in size and disabling the watchdog manually in your code, and I bet the issue goes away.

Regards,

Trey

0 Stephen Moore over 12 years ago in reply to Trey German289

Prodigy 250 points

The benchmark is running now. On the Piccolo F28069, it runs the 100,000 iterations of the [5x5] matrix multiplication in 57.3 seconds, resulting in 1.75kops (matrix multiplications per second). This seems slow for an MCU with integrated FPU.

Does running the benchmark from Code Composer Studio 5 in debugger slow the code execution?

0 Trey German289 over 12 years ago in reply to Stephen Moore

TI__Mastermind 22025 points

Stephen,

Curious what you did to get it working...

If you are running the exact code you posted, the issue is that you are not setting up the PLL. Try running the InitSysCtrl() function, which should set up the PLL to run the CPU at 80MHz. Also, if you are running this code out of flash, you will want to set up the flash wait states to improve performance. Finally, execution performance can also be increased by running the code from RAM. Take a look in controlSUITE for examples of how to do all of the above.

Regards,

Trey

0 Stephen Moore over 12 years ago in reply to Trey German289

Prodigy 250 points

Trey German said:
Curious what you did to get it working...

I loaded the example program BlinkLED, killed the interrupt that blinked the LED, and put my matrix routine in the code body. I also added "fpu.h" and <math.h> include files.

I combed through the BlinkLED includes. Does it already have the InitSysCtrl() function call?

0 Trey German289 over 12 years ago in reply to Stephen Moore

TI__Mastermind 22025 points

Stephen,

If you left the call to DeviceInit() in there, it is in fact setting up the PLL for 80MHz. The examples in that folder are developed by a different team and have some different naming conventions for the functions. This example also sets up the flash for the correct number of wait states, assuming you have left the call to FlashInit() in.

Trey

0 Stephen Moore over 12 years ago in reply to Trey German289

Prodigy 250 points

I confirmed the MCU is operating at the correct frequency. Moving the code from FLASH to RAM and a few other optimizations resulted in 43.7 seconds, still a bit too slow for my requirements.

Some of the biggest problems were coming from the printf statement. Once those were removed, the compile and run process went much more smoothly. I used the LED to signal when the benchmark was running and resetting.

0 Stephen Moore over 12 years ago in reply to Stephen Moore

Prodigy 250 points

I am concerned the configuration was incorrect for the benchmark. Some comparisons running the identical 100,000 iterations of [5x5] single-precision floating-point matrix multiplication routine:

STM32F4 (168MHz and FPU)         8.6 seconds
NXP mbed LPC1768 (96MHz)        16.2 seconds
LPCXpresso LPC1769 (120MHz)     19.4 seconds
Piccolo F28069 (80MHz and FPU) 43.7 seconds

Any comments?

0 Trey German289 over 12 years ago in reply to Stephen Moore

TI__Mastermind 22025 points

Stephen,

The build configuration you are using appears to be correct, but I overlooked one small issue with your benchmark code. On our architecture an int is 16 bits. Since this is a signed int the problem is even worse...you can only count up to 32k or so, so your outermost for loop should never complete. I'm surprised that your code ever did complete, because technically it should have sat in that for loop forever. After changing the count variables to longs, I was able to achieve a time of around 14 seconds on the matrix multiply.

Regards,

Trey
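To make the 16-bit int point concrete, here is a minimal host-side sketch. It uses int16_t from <stdint.h> as a stand-in for the C28x int; the function names are hypothetical and only illustrate the arithmetic. On the real device the counter would most likely wrap to a negative value rather than stop, which is why the loop is described above as never completing.

```c
#include <stdint.h>

/* Returns 1 if "for (j = 0; j < limit; j++)" can terminate when j is a
 * signed 16-bit integer, i.e. if j can ever hold a value >= limit. */
int loop_terminates_with_int16(long limit)
{
    return limit <= (long)INT16_MAX;
}

/* Counts how many increments a signed 16-bit counter can take before it
 * reaches either the limit or INT16_MAX (32767), whichever comes first.
 * The loop stops at INT16_MAX so the sketch never overflows the counter. */
long iterations_before_overflow(long limit)
{
    int16_t j = 0;
    long count = 0;

    while ((long)j < limit && j < INT16_MAX) {
        j++;
        count++;
    }
    return count;
}
```

With a limit of 100000 the counter runs out of range long before the loop condition can become false, whereas a long counter (32 bits on this architecture) reaches 100000 without trouble.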

0 Stephen Moore over 12 years ago in reply to Trey German289

Prodigy 250 points

Thank you for the reply.

Indeed, when I made the other changes, I also spotted the int and changed to long. As my program stands, it runs at about 40 seconds. I suspect the difference is between running debug vs. release code. Are there settings in CCS5 that can specify which build? You obviously have an architectural difference that is resulting in 14 seconds.

Once you run your program, will it execute when disconnected from the computer and powered with an external power supply?

0 Trey German289 over 12 years ago in reply to Stephen Moore

TI__Mastermind 22025 points

There really aren't any differences between debug and release code. Those build configurations are created by default when a project is created and they have the same build properties initially. I don't think optimizations are the cause, because I was able to hit 14 seconds with optimization level 0 (no opts). My advice would be to make sure the FPU is enabled in the build properties and that the PLL is set up as we believe it is. I would also make sure that my code gen tools were up to date. Finally, I linked this program to RAM, so it is executing with 0 wait states, which also improves performance. I can't run this project standalone because it is linked to RAM, BUT you can use a #pragma to define a code section for this benchmark and copy it from flash to RAM at runtime, which will allow you to run standalone.

If you're still having trouble I'd be happy to post my project for you.

Trey
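The #pragma approach mentioned above could look roughly like the sketch below. It is not the project from this thread. It assumes the linker command file defines a section named "ramfuncs" that loads from flash and runs from RAM, and that it exports the RamfuncsLoadStart, RamfuncsLoadSize and RamfuncsRunStart symbols, which is the convention in TI's flash examples; your own .cmd file may use different names. The functions matmul5() and copy_ramfuncs() are hypothetical. The #ifdef on the C28x compiler's predefined macro makes the copy a no-op when the file is built off-target.

```c
#include <stddef.h>
#include <string.h>

#ifdef __TMS320C28XX__
/* Assumption: the linker command file places a section named "ramfuncs"
 * in flash (load) and RAM (run) and exports these three symbols. */
extern unsigned int RamfuncsLoadStart;
extern unsigned int RamfuncsLoadSize;
extern unsigned int RamfuncsRunStart;

/* Ask the TI compiler to place matmul5() in the "ramfuncs" section. */
#pragma CODE_SECTION(matmul5, "ramfuncs")
#endif

/* The 5x5 product from the benchmark, factored into a function. */
void matmul5(float out[5][5], const float a[5][5], const float b[5][5])
{
    int m, n, p;

    for (m = 0; m < 5; m++) {
        for (p = 0; p < 5; p++) {
            float acc = 0.0f;
            for (n = 0; n < 5; n++) {
                acc += a[m][n] * b[n][p];
            }
            out[m][p] = acc;
        }
    }
}

/* Call once at startup, before the first call to matmul5().
 * On the target this copies the section image from flash to RAM. */
void copy_ramfuncs(void)
{
#ifdef __TMS320C28XX__
    memcpy(&RamfuncsRunStart, &RamfuncsLoadStart, (size_t)&RamfuncsLoadSize);
#endif
}
```

Only the hot routine is moved to RAM here, so the rest of the program can stay in flash and the board can still boot standalone.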

0 Stephen Moore over 12 years ago in reply to Trey German289

Prodigy 250 points

43 seconds

#include "PeripheralHeaderIncludes.h"
#include "fpu.h"
                                                                         
void DeviceInit(void);

void main(void) {
    long j;
    int m, n, p;
    float m3[5][5] = { {0.0 , 0.0 , 0.0 , 0.0 , 0.0},{0.0 , 0.0 , 0.0 , 0.0 , 0.0},{0.0 , 0.0 , 0.0 , 0.0 , 0.0},{0.0 , 0.0 , 0.0 , 0.0 , 0.0},{0.0 , 0.0 , 0.0 , 0.0 , 0.0} };
    const float m1[5][5] = { {0.0001, 0.001, 0.01, 0.1, 1},{0.001, 0.01, 0.1, 1, 10},{0.01, 0.1, 1, 10, 100},{0.1, 1.0, 10, 100, 1000},{1, 10, 100, 1000, 10000} };
    const float m2[5][5] = { {0.0001, 0.001, 0.01, 0.1, 1},{0.001, 0.01, 0.1, 1, 10},{0.01, 0.1, 1, 10, 100},{0.1, 1.0, 10, 100, 1000},{1, 10, 100, 1000, 10000} };

    DeviceInit();    //Device setup from the BlinkLED example, called after the declarations
    for(;;) {
        GpioDataRegs.GPBTOGGLE.bit.GPIO34 = 1;    //Toggle GPIO34 (LD2)
        for(j = 0; j < 100000; j++) {
            for(m = 0; m < 5; m++) {
                for(p = 0; p < 5; p++) {
                    m3[m][p] = 0;
                    for(n = 0; n < 5; n++) {
                        m3[m][p] += m1[m][n] * m2[n][p];
                    }
                }
            }
        }
    }
}

0 Trey German289 over 12 years ago in reply to Stephen Moore

TI__Mastermind 22025 points

I don't buy it :-P

Ok, I've attached the project I'm using. It's either the PLL or something in the build configuration, but I'm starting to suspect the PLL. 43 seconds is wayyyyy too slow. Give this project a shot. If you copy the timed_led_blink folder into your f2806x_examples directory, the project should import and build without issue.

5001.timed_led_blink.zip

Trey

0 Stephen Moore over 12 years ago in reply to Trey German289

Prodigy 250 points

I thank you for your help, however, the TI CCS software has proven - by far - to be the most difficult IDE of those we have evaluated. We have spent more time on CCS than CodeRed, Freescale, and Keil IDE combined, and have failed to have a single competitive run. I'm past my deadline to select the production MCU, and without any success with the Piccolo, there's no reason to consider it.

The example project you posted - is it using CCS4 or CCS5?

Unfortunately, I can't get any of the example programs in TI ControlSUITE to successfully run using CCS4 because it wants C2000 Compiler v5.2.2. CCS5 gives similar incompatibilities attributed to C2000 v5.2.2. It was only by piecing together fragments of files that I finally got the benchmark to run, although not competitively (>40 seconds).

I googled for the C2000 compiler v5.2.2 with no results. Either the ControlSUITE is incompatible with CCS, or the CCS installation is missing critical files for success.

As last-ditch measures:

Is there a configuration of the F28069 ControlStick for the Keil IDE?
Which version of CCS were you using?
Which version of CCS is compatible with the ControlSUITE examples?
How is the C2000 compiler v5.2.2 installed?

Again, thanks for your efforts.

0 Stephen Moore over 12 years ago in reply to Stephen Moore

Prodigy 250 points

I made a fresh TI setup_CCS_4.2.4.00033 installation and imported the example project directory from C:/ti/controlSUITE/development kits/F28069/

Please see attached screenshot of fresh installation and fresh import. None of the example projects loaded properly.

0 Trey German289 over 12 years ago in reply to Stephen Moore

TI__Mastermind 22025 points

When you import projects and don't have the correct compiler installed for that example, you will get errors like in the above post. It's annoying and CCS shouldn't generate errors just because you have a different version of the compiler, but I'm afraid that's how it works for now.

You can however change the compiler version a project is using. After you import and get the error, the project should still appear in your project navigator. Right click the project and go to Source, Upgrade Code-Generation Tools Version. Change the version in the dialog box to one of the other versions listed (in your case there should only be one other choice...the compiler that you actually have installed). After clicking ok, you ought to be able to compile and run the project.

Sorry you've had so much trouble with CCS.

Trey

0 Stephen Moore over 12 years ago in reply to Trey German289

Prodigy 250 points

Thanks again. What about this "make" problem? I get the compiler error in some circumstances. In other circumstances, I'll get this Make error:

After two weeks of this, I'm either going to find some support or abandon the effort. My other candidates (STM32F4/Keil and LCP78xx/CodeRed) have been running real-time state-space system simulations for a week. I can't even get the Piccolo to blink an LED now.

0 Trey German289 over 12 years ago in reply to Stephen Moore

TI__Mastermind 22025 points

Once you set up the project to work with the compiler version you have, you will not get the managed make error. Also, it looks like you didn't put this example in the right place, which will prevent it from finding all the support files it needs (that's why all the files have "!" icons). This project needs to reside in c:\ti\controlsuite\device_support\f2806x\version\f2806x_examples\.

I understand your frustration. I've been working with CCS for years now, so it's all second nature to me. The reason CCS is so complex is because it has to support so many different architectures. Everything from the smallest MSP430 to the biggest C6000 multicore DSP is programmed through CCS, so everything has to be very extensible, which adds a lot of complexity.

Trey

0 Stephen Moore over 12 years ago in reply to Trey German289

Prodigy 250 points

OK, I made a totally fresh installation of ControlSUITE and CCS. I went through the registry and hard drive and cleaned out all of the TI garbage from previous installations and rebooted a couple times. I installed ControlSuite and CCS, opened CCS and followed instructions to the letter. See screenshots of green "checkmarks" in the TI Resource Explorer. It imports the example project, builds the project, and sets up the debugger (Steps 1-3). I left the properties dialog on the screen, set to the correct hardware and USB emulator, and also the compiler version. It throws 27 errors and does not produce an output file to proceed to Step 4. Notice that I have not performed any independent actions here, I'm allowing the wizard to guide the process of loading a demonstration example.

0 Trey German289 over 12 years ago in reply to Stephen Moore

TI__Mastermind 22025 points

Stephen,

Sorry to hear you're still having trouble. The resource explorer project import is designed to make things very easy for users, but I guess this hasn't been true for you. Would you mind posting the errors? I suspect an issue with linked-in resources in the project.

The way our controlSUITE software is architected right now, if the right versions of everything aren't installed the projects break. We are fully aware of this issue and the frustration it causes new users and we are developing a new controlSUITE structure which will fix many of these problems. Later this year we will be releasing this new controlSUITE architecture and problems like this won't happen any more.

Trey

0 Stephen Moore over 12 years ago in reply to Trey German289

Prodigy 250 points

Yahoo. With a little (more) perseverance, I got the program running again. The compiler was adjusted to v6.0.2. Interestingly, it won't compile with FLASH, but will run from RAM. The run took 47 seconds (same as before). Thinking the FPU might not be in use, I looked at the runtime support library under Properties, which was set to rts2800_ml.lib. I changed this to rts2800_fpu32.lib, and now it won't compile anymore. Screenshots and error log attached for the rts2800_fpu32.lib build.

<Linking>
"../F2806x_RAM_BlinkingLED.CMD", line 49: error: BEGIN memory range has already
   been specified
"../F2806x_RAM_BlinkingLED.CMD", line 49: error: BEGIN memory range overlaps
   existing memory range BEGIN
"../F2806x_RAM_BlinkingLED.CMD", line 51: error: RAMM0 memory range has already
   been specified
"../F2806x_RAM_BlinkingLED.CMD", line 51: error: RAMM0 memory range overlaps
   existing memory range RAMM0
"../F2806x_RAM_BlinkingLED.CMD", line 52: error: progRAM memory range overlaps
   existing memory range RAML0_L3
>> Compilation failure
"../F2806x_RAM_BlinkingLED.CMD", line 54: error: FPUTABLES memory range has
   already been specified
"../F2806x_RAM_BlinkingLED.CMD", line 54: error: FPUTABLES memory range
   overlaps existing memory range FPUTABLES
"../F2806x_RAM_BlinkingLED.CMD", line 55: error: IQTABLES memory range has
   already been specified
"../F2806x_RAM_BlinkingLED.CMD", line 55: error: IQTABLES memory range overlaps
   existing memory range IQTABLES
"../F2806x_RAM_BlinkingLED.CMD", line 56: error: IQTABLES2 memory range has
   already been specified
"../F2806x_RAM_BlinkingLED.CMD", line 56: error: IQTABLES2 memory range
   overlaps existing memory range IQTABLES2
"../F2806x_RAM_BlinkingLED.CMD", line 57: error: IQTABLES3 memory range has
   already been specified
"../F2806x_RAM_BlinkingLED.CMD", line 57: error: IQTABLES3 memory range
   overlaps existing memory range IQTABLES3
"../F2806x_RAM_BlinkingLED.CMD", line 59: error: BOOTROM memory range has
   already been specified
"../F2806x_RAM_BlinkingLED.CMD", line 59: error: BOOTROM memory range overlaps
   existing memory range BOOTROM
"../F2806x_RAM_BlinkingLED.CMD", line 61: error: RESET memory range has already
   been specified
"../F2806x_RAM_BlinkingLED.CMD", line 61: error: RESET memory range overlaps
   existing memory range RESET
"../F2806x_RAM_BlinkingLED.CMD", line 66: error: RAMM1 memory range has already
   been specified
"../F2806x_RAM_BlinkingLED.CMD", line 66: error: RAMM1 memory range overlaps
   existing memory range RAMM1
error #10010: errors encountered during linking; "BlinkingLED.out" not built
gmake: *** [BlinkingLED.out] Error 1
gmake: Target `all' not remade because of errors.

**** Build Finished ****

0 Trey German289 over 12 years ago in reply to Stephen Moore

TI__Mastermind 22025 points

CCSv5 does some weird things that should make running projects easier, but in some cases it breaks things. What happened here is that when you imported the CCSv4 project into CCSv5, it automatically added a linker command file for the 06x device you are using, but your project already had a linker command file. The two files defined the same ranges in memory, which is why it is complaining about overlap. To fix this you can either remove the F2806x_RAM_BlinkingLED.cmd file or, in the build properties, remove the F2806x_ram_lnk.cmd file. The fact that you changed to use the FPU run time support library didn't have anything to do with the above errors.

Also, I believe switching to the FPU run time support library ought to solve the speed issue.

Regards,

Trey

0 Stephen Moore over 12 years ago in reply to Trey German289

Prodigy 250 points

That was the final bit. Thanks for your continued responsiveness.

2.103 seconds for 5.94 MFLOPS. Does that sound consistent with design capability?

I'm a little nervous about the TI CCS, but at least we now understand the hardware capabilities.
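For readers checking the 5.94 figure: it appears to count each multiply-accumulate in the inner loop as one operation (5 x 5 x 5 = 125 per matrix product). That is an inference from the arithmetic, not something stated in the thread. The sketch below reproduces the calculation; the function names are hypothetical, and the 80 MHz clock used for the cycles estimate is the value quoted earlier in the thread.

```c
/* One 5x5 matrix product as written in the benchmark: 5*5 outputs,
 * each accumulated from 5 multiply-add steps. */
#define MACS_PER_MATMUL (5L * 5L * 5L)

/* Multiply-accumulates per second for a timed run. */
double macs_per_second(long iterations, double seconds)
{
    return (double)iterations * (double)MACS_PER_MATMUL / seconds;
}

/* Average CPU cycles spent per multiply-accumulate, loop overhead included. */
double cycles_per_mac(long iterations, double seconds, double cpu_hz)
{
    return seconds * cpu_hz / ((double)iterations * (double)MACS_PER_MATMUL);
}
```

Calling macs_per_second(100000, 2.103) gives roughly 5.94 million per second, matching the post. Counting the multiply and the add separately would double that figure. At 80 MHz the same run works out to roughly 13 cycles per multiply-accumulate, which includes loop and addressing overhead.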

0 Trey German289 over 12 years ago in reply to Stephen Moore

TI__Mastermind 22025 points

Stephen,

The core is capable of much more than 5.94 MFLOPS. If you hand coded assembly you could actually theoretically get up to 160 MFLOPs as we have a parallel add multiply instruction that is single cycle. That being said MFLOPs is more of a marketing number because it really depends on how the code is written: assembly, c, loops unrolled, optimizations, etc. Your question has spurred some internal discussion between the floating point experts and I expect they will reply to this post soon.

Regards,
Trey

0 Lori Heustess over 12 years ago in reply to Trey German289

TI__Guru* 89465 points

Stephen,

I suspect the compiler is not doing as well as it could. Here are a few things to try out.

Understand that as the compiler generates more optimal code, there is a tradeoff with debug capability. When you start out, you may want the most debug capability available. In this case the compiler options will likely be limited to -g and -mt. You would then increase optimization from there.

Start with -g -mt (symbolic debug + unified memory). These can both be found on the basic options tab of the project options (in CCS 5).
Next you can add -mn (optimize with debug). This is on the runtime model tab. This will re-enable some optimizations that -g disabled but still allow you to debug fairly well.
The next step would be to turn on some optimization level. -o2 is often a good balance. This can be found on the basic options tab.
Next you would try perhaps -o3 or -o4 optimization. These may or may not help improve the benchmark.
Finally you can remove -g. This can severely limit debug capability, so it is often done only on a particular file with code you need highly optimized.

There are some more details of these tips on this wiki page:

http://processors.wiki.ti.com/index.php/C28x_Code_Generation_Tips_and_Tricks#Optimization

Regards

Lori

0 John Connor over 12 years ago in reply to Stephen Moore

Prodigy 165 points

Stephen Moore said:

STM32F4 (168MHz and FPU)         8.6 seconds
NXP mbed LPC1768 (96MHz)        16.2 seconds
LPCXpresso LPC1769 (120MHz)     19.4 seconds
Piccolo F28069 (80MHz and FPU) 43.7 seconds

After an unsuccessful attempt to run the CoreMark benchmark on C2000 (CoreMark doesn't like the lack of an 8-bit data type on C2000), I tried to run the code from the first post. Here are my results:

1. Code in Flash - default waitstates
-O0 - 85.149 seconds
-O2 - 51.782 seconds
-O4 - 51.782 seconds

2. Code in Flash - minimum waitstates
-O0 - 11.594 seconds
-O2 - 5.603 seconds
-O4 - 5.602 seconds

3. Code in SRAM
-O0 - 11.241 seconds
-O2 - 5.414 seconds
-O4 - 5.414 seconds

I don't have STM32F4 to retest the code, but is it possible that piccolo on 80MHz is executing floating point code much faster than 168MHz STM?

It is also interesting how flash that is not properly initialized gives you very crappy performance 0:-)
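To put the timings above in proportion, a small sketch that computes the ratios between configurations. The arrays simply restate the numbers from the post (in the order -O0, -O2, -O4); the array names and the speedup() helper are hypothetical.

```c
/* Run times in seconds for 100,000 iterations, as reported above.
 * Index 0 = -O0, 1 = -O2, 2 = -O4. */
const double flash_default_ws[3] = { 85.149, 51.782, 51.782 };
const double flash_min_ws[3]     = { 11.594,  5.603,  5.602 };
const double sram[3]             = { 11.241,  5.414,  5.414 };

/* Ratio of two run times; a value above 1 means the second run is faster. */
double speedup(double slow_seconds, double fast_seconds)
{
    return slow_seconds / fast_seconds;
}
```

For example, speedup(flash_default_ws[0], flash_min_ws[0]) is a little over 7, while speedup(flash_min_ws[1], sram[1]) is only a few percent above 1, which matches the observation that correctly configured flash is nearly as fast as SRAM for this code.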

0 John Bennett over 12 years ago in reply to John Connor

Prodigy 110 points

Just a reply to agree with you about the flash.

I unwittingly left out the example flash initialisation routines when creating my software and spent a good day scratching my head wondering why it was taking something like 15 clock cycles to do a single assembler instruction.

Once I put the flash wait-state setup code back in, performance was back to 1 instruction per clock cycle and all was great :-)

Almost not worth putting code in SRAM, the flash is so quick when set up properly.

0 Stephen Moore over 12 years ago in reply to John Bennett

Prodigy 250 points

Trey German said:
That being said MFLOPs is more of a marketing number because it really depends on how the code is written: assembly, c, loops unrolled, optimizations, etc.

I'm using FLOPS as my benchmark number, based on code that was derived from TI benchmarking application note.

The performance is highly dependent on the compiler settings. The compiler sensitivity is extremely high, yielding greater than an order of magnitude differences in performance.

John Connor said:
I don't have STM32F4 to retest the code, but is it possible that piccolo on 80MHz is executing floating point code much faster than 168MHz STM?

In general, the F28069 is running FPU faster than the STM32, although the STM could be running slow because of similar optimization issues.

Trey German said:
If you hand coded assembly you could actually theoretically get up to 160 MFLOPs as we have a parallel add multiply instruction that is single cycle.

The F28069/CCS system is very sensitive and tricky. I'm concerned that what could be profitable software development time will be spent figuring out the sensitivities of the TI system. We could spend forever tweaking settings instead of writing revenue-generating code. As you mention, hand-coding the most math-intensive routines (matrix multiplication, dot products, or matrix inversions) may be the best way to go. Hand-coded assembly would basically remove the FPU-heavy routines from being CPU throughput hogs, and alleviate our timing worries.
