Help running simple executable on F28069 Piccolo controlSTICK

I want to benchmark the F28069 Piccolo with this simple floating-point [5x5] matrix multiplication. This code is copied from the TI benchmark sample and modified for [5x5] matrix size.

The code apparently runs for 100 iterations, but freezes at 1000 iterations and over. I would like to perform 100,000 loops.

#include <stdio.h>
#include <math.h>

void main(void) {

    int j, m, n, p;

    float m3[5][5] = { {0.0, 0.0, 0.0, 0.0, 0.0},
                       {0.0, 0.0, 0.0, 0.0, 0.0},
                       {0.0, 0.0, 0.0, 0.0, 0.0},
                       {0.0, 0.0, 0.0, 0.0, 0.0},
                       {0.0, 0.0, 0.0, 0.0, 0.0} };

    const float m1[5][5] = { {0.0001, 0.001, 0.01, 0.1, 1},
                             {0.001, 0.01, 0.1, 1, 10},
                             {0.01, 0.1, 1, 10, 100},
                             {0.1, 1.0, 10, 100, 1000},
                             {1, 10, 100, 1000, 10000} };

    const float m2[5][5] = { {0.0001, 0.001, 0.01, 0.1, 1},
                             {0.001, 0.01, 0.1, 1, 10},
                             {0.01, 0.1, 1, 10, 100},
                             {0.1, 1.0, 10, 100, 1000},
                             {1, 10, 100, 1000, 10000} };

    printf("Benchmark Program \n");

    printf("Starting \n");

    for(j = 0; j < 100000; j++) {
        for(m = 0; m < 5; m++) {
            for(p = 0; p < 5; p++) {
                m3[m][p] = 0;
                for(n = 0; n < 5; n++) {
                    m3[m][p] += m1[m][n] * m2[n][p];
                }
            }
        }
    }

    printf("Ending \n");
}

Screenshot attached. Any comments?

  • Hi Stephen,

    I'm curious what happens when the code crashes?

    I have two guesses:

    1) the watchdog is enabled and isn't being kicked which is causing your program to reset after a while

    2) you are overflowing the stack

    Try bumping up the stack size and disabling the watchdog manually in your code, and I bet the issue goes away.

    Regards,

    Trey

  • In reply to Trey German:

    The benchmark is running now. On the Piccolo F28069, it runs the 100,000 iterations of the [5x5] matrix multiplication in 57.3 seconds, which works out to about 1,745 matrix multiplications per second (1.75 kops). This seems slow for an MCU with an integrated FPU.

    Does running the benchmark from Code Composer Studio 5 in the debugger slow code execution?
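
As a quick check of the throughput figure in this post, the rate follows directly from the iteration count and the elapsed time. The sketch below is only an illustration: the function names are hypothetical, and the count of 250 floating-point operations per multiply assumes the triple loop as posted (125 multiplies and 125 adds for one 5x5 product).

```c
/* Hypothetical helpers for illustration; not part of the posted benchmark. */

/* Assumption: one 5x5 product costs 5*5*5 multiplies plus 5*5*5 adds. */
#define FLOPS_PER_MATMUL_5X5 250.0

/* Matrix multiplications completed per second. */
double matmuls_per_second(long iterations, double seconds) {
    return (double)iterations / seconds;
}

/* Floating-point operations per second for the same run. */
double flops_per_second(long iterations, double seconds) {
    return matmuls_per_second(iterations, seconds) * FLOPS_PER_MATMUL_5X5;
}
```

With the numbers reported here, matmuls_per_second(100000L, 57.3) comes to roughly 1,745 multiplications per second, which under the stated assumption is about 0.44 MFLOP/s.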

  • In reply to Stephen Moore:

    Stephen,

    Curious what you did to get it working...

    If you are running the exact code you posted, the issue is that you are not setting up the PLL.  Try running the InitSysCtrl() function, which should set up the PLL to run the CPU at 80MHz.  Also, if you are running this code out of flash, you will want to set up the flash wait states to improve performance.  Finally, execution performance can also be increased by running the code from RAM.  Take a look in controlSUITE for examples of how to do all of the above.

    Regards,

    Trey

  • In reply to Trey German:

    Trey German wrote: "Curious what you did to get it working..."


    I loaded the example program BlinkLED, killed the interrupt that blinked the LED, and put my matrix routine in the code body. I also added "fpu.h" and <math.h> include files.

    I combed through the BlinkLED includes. Does it already call the InitSysCtrl() function?

  • In reply to Stephen Moore:

    Stephen,

    If you left the call to DeviceInit() in there, it is in fact setting up the PLL for 80MHz.  The examples in that folder are developed by a different team and have some different naming conventions for the functions.  This example also sets up the flash for the correct number of wait states, assuming you have left the call to FlashInit() in.

    Trey

  • In reply to Trey German:

    I confirmed the MCU is operating at the correct frequency. Moving the code from FLASH to RAM and a few other optimizations resulted in 43.7 seconds, still a bit too slow for my requirements.

    Some of the biggest problems were coming from the printf statement. Once those were removed, the compile and run process went much more smoothly. I used the LED to signal when the benchmark was running and resetting.

  • In reply to Stephen Moore:

    I am concerned the configuration was incorrect for the benchmark. Some comparisons running the identical 100,000 iterations of [5x5] single-precision floating-point matrix multiplication routine:

    STM32F4 (168MHz and FPU)         8.6 seconds
    NXP mbed LPC1768 (96MHz)        16.2 seconds
    LPCXpresso LPC1769 (120MHz)     19.4 seconds
    Piccolo F28069 (80MHz and FPU)  43.7 seconds

    Any comments?
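
Because the four boards run at different clock rates, it can help to normalize the times above to CPU cycles per matrix multiplication. The following is a minimal sketch with a hypothetical function name; it assumes each part really runs at the clock frequency listed in the table.

```c
/* Hypothetical helper for illustration only.
 * Converts a wall-clock benchmark time into CPU cycles per matrix multiply,
 * assuming the CPU clock stays at cpu_hz for the whole run. */
double cycles_per_matmul(double seconds, double cpu_hz, long iterations) {
    return seconds * cpu_hz / (double)iterations;
}
```

For 100,000 iterations, the figures in the table work out to about 14,448 cycles per multiply on the STM32F4, 15,552 on the LPC1768, 23,280 on the LPC1769, and 34,960 on the F28069.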

  • In reply to Stephen Moore:

    Stephen,

    The build configuration you are using appears to be correct, but I overlooked one small issue with your benchmark code.  On our architecture an int is 16 bits.  Since this is a signed int, the problem is even worse: you can only count up to 32,767, so your outermost for loop should never reach 100,000.  I'm surprised that your code ever did complete, because technically it should have sat in that for loop forever.  After changing the count variables to longs, I was able to achieve a time of around 14 seconds on the matrix multiply.
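
The loop-counter issue described in this reply can be illustrated on any host compiler by using the fixed-width limit INT16_MAX to stand in for the 16-bit int. This is a sketch with hypothetical function names, not code from the thread. A long is at least 32 bits in standard C, so it can hold 100,000.

```c
#include <stdint.h>

/* Returns 1 if a 16-bit signed counter can count up to `limit`, else 0.
 * A counter capped at INT16_MAX (32,767) can never satisfy j >= limit
 * when limit is larger, so `for (j = 0; j < limit; j++)` cannot exit
 * normally. Incrementing past the maximum is undefined behavior for a
 * signed int. */
int reachable_with_int16(long limit) {
    return limit <= (long)INT16_MAX;
}

/* Same loop shape as the benchmark, but with a long counter. */
long count_with_long(long limit) {
    long j;
    long count = 0;
    for (j = 0; j < limit; j++) {
        count++;
    }
    return count;
}
```

Here reachable_with_int16(100000L) is 0, while count_with_long(100000L) finishes and returns 100000.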

    Regards,

    Trey

  • In reply to Trey German:

    Thank you for the reply.

    Indeed, when I made the other changes, I also spotted the int and changed to long. As my program stands, it runs at about 40 seconds. I suspect the difference is between running debug vs. release code. Are there settings in CCS5 that can specify which build? You obviously have an architectural difference that is resulting in 14 seconds.

    Once you run your program, will it execute when disconnected from the computer and powered with an external power supply?

  • In reply to Stephen Moore:

    There really aren't any differences between debug and release code.  Those build configurations are created by default when a project is created, and they have the same build properties initially.  I don't think optimizations are the cause, because I was able to hit 14 seconds with optimization level 0 (no opts).  My advice would be to make sure the FPU is enabled in the build properties and that the PLL is set up as we believe it is.  I would also make sure that my code gen tools were up to date.  Finally, I linked this program to RAM, so it is executing with 0 wait states, which also improves performance.  I can't run this project standalone because it is linked to RAM, but you can use a #pragma to define a code section for this benchmark and copy it from flash to RAM at runtime, which will allow you to run standalone.
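
The #pragma approach mentioned in this reply might look roughly like the sketch below. It is not Trey's project. The section name "ramfuncs" and the linker symbols RamfuncsLoadStart, RamfuncsLoadSize, and RamfuncsRunStart are assumptions based on the naming used in TI's example linker command files, so check them against your own .cmd file. The function names matmul5_bench and copy_bench_to_ram are hypothetical. The target-specific parts are guarded so that the benchmark itself also builds with a host compiler.

```c
#include <string.h>

#ifdef __TMS320C28XX__
/* Assumed linker-defined symbols; names must match your linker command file. */
extern unsigned int RamfuncsLoadStart, RamfuncsLoadSize, RamfuncsRunStart;
/* Place the benchmark in a section that loads from flash and runs from RAM. */
#pragma CODE_SECTION(matmul5_bench, "ramfuncs")
#endif
void matmul5_bench(float m3[5][5], long iterations);

/* Call once at startup, before matmul5_bench(). Does nothing on a host build. */
void copy_bench_to_ram(void) {
#ifdef __TMS320C28XX__
    memcpy(&RamfuncsRunStart, &RamfuncsLoadStart, (size_t)&RamfuncsLoadSize);
#endif
}

/* 5x5 matrix multiply repeated `iterations` times, using a long counter. */
void matmul5_bench(float m3[5][5], long iterations) {
    /* m1 and m2 in the original post are identical, so one table serves both. */
    static const float a[5][5] = {
        {0.0001f, 0.001f, 0.01f, 0.1f, 1.0f},
        {0.001f, 0.01f, 0.1f, 1.0f, 10.0f},
        {0.01f, 0.1f, 1.0f, 10.0f, 100.0f},
        {0.1f, 1.0f, 10.0f, 100.0f, 1000.0f},
        {1.0f, 10.0f, 100.0f, 1000.0f, 10000.0f}
    };
    long j;
    int m, n, p;

    for (j = 0; j < iterations; j++) {
        for (m = 0; m < 5; m++) {
            for (p = 0; p < 5; p++) {
                m3[m][p] = 0.0f;
                for (n = 0; n < 5; n++) {
                    m3[m][p] += a[m][n] * a[n][p];
                }
            }
        }
    }
}
```

On the target, main() would call copy_bench_to_ram() after the device and flash initialization and then call matmul5_bench(result, 100000L), toggling an LED before and after to time the run.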

    If you're still having trouble I'd be happy to post my project for you.

    Trey