Help running simple executable on F28069 Piccolo controlSTICK

I want to benchmark the F28069 Piccolo with this simple floating-point 5x5 matrix multiplication. The code is copied from the TI benchmark sample and modified for a 5x5 matrix size.

The code runs fine for 100 iterations, but appears to freeze at 1,000 iterations and above. I would like to perform 100,000 loops.

#include <stdio.h>
#include <math.h>

void main(void) {
    int j, m, n, p;

    // Result matrix, cleared before each product
    float m3[5][5] = { {0.0, 0.0, 0.0, 0.0, 0.0},
                       {0.0, 0.0, 0.0, 0.0, 0.0},
                       {0.0, 0.0, 0.0, 0.0, 0.0},
                       {0.0, 0.0, 0.0, 0.0, 0.0},
                       {0.0, 0.0, 0.0, 0.0, 0.0} };

    const float m1[5][5] = { {0.0001, 0.001, 0.01, 0.1, 1},
                             {0.001, 0.01, 0.1, 1, 10},
                             {0.01, 0.1, 1, 10, 100},
                             {0.1, 1.0, 10, 100, 1000},
                             {1, 10, 100, 1000, 10000} };

    const float m2[5][5] = { {0.0001, 0.001, 0.01, 0.1, 1},
                             {0.001, 0.01, 0.1, 1, 10},
                             {0.01, 0.1, 1, 10, 100},
                             {0.1, 1.0, 10, 100, 1000},
                             {1, 10, 100, 1000, 10000} };

    printf("Benchmark Program \n");
    printf("Starting \n");

    // 100,000 iterations of a 5x5 floating-point matrix multiply
    for (j = 0; j < 100000; j++) {
        for (m = 0; m < 5; m++) {
            for (p = 0; p < 5; p++) {
                m3[m][p] = 0;
                for (n = 0; n < 5; n++) {
                    m3[m][p] += m1[m][n] * m2[n][p];
                }
            }
        }
    }

    printf("Ending \n");
}

Screenshot attached. Any comments?

  • In reply to Stephen Moore:

    CCSv5 does some weird things that are meant to make running projects easier, but in some cases they break things.  What happened here is that when you imported the CCSv4 project into CCSv5, it automatically added a linker command file for the 06x device you are using, but your project already had a linker command file.  The two files define the same memory ranges, which is why the linker is complaining about overlap.  To fix this you can either remove the F2806x_RAM_BlinkingLED.cmd file or, in the build properties, remove the F2806x_ram_lnk.cmd file.  The fact that you changed to the FPU run-time support library didn't have anything to do with the above errors.

    Also, I believe switching to the FPU run-time support library ought to solve the speed issue.

    Regards,

    Trey

  • In reply to Trey German:

    That was the final bit. Thanks for your continued responsiveness.

    2.103 seconds for 5.94 MFLOPS (the arithmetic behind that figure is sketched below). Does that sound consistent with the design capability?

    I'm a little nervous about the TI CCS, but at least we now understand the hardware capabilities.
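
    For reference, one way to reproduce that figure, assuming each multiply-accumulate in the inner loop is counted as a single floating-point operation: a 5x5 product takes 5 x 5 x 5 = 125 operations, so 100,000 iterations is 12.5 million operations, and 12.5e6 / 2.103 s is roughly 5.94 MFLOPS. Counting the multiply and the add separately would double that to roughly 11.9 MFLOPS.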

  • In reply to Stephen Moore:

    Stephen,

    The core is capable of much more than 5.94 MFLOPS.  If you hand-coded assembly you could theoretically get up to 160 MFLOPS, as we have a parallel multiply-and-add instruction that is single cycle.  That being said, MFLOPS is more of a marketing number because it really depends on how the code is written: assembly, C, loops unrolled, optimizations, etc. (a C-level sketch of the unrolling idea follows this post).  Your question has spurred some internal discussion among the floating-point experts, and I expect they will reply to this post soon.

    Regards,
    Trey
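
    To make the "loops unrolled" point concrete, here is a minimal C sketch (not from the original post, and not TI library code) of the benchmark's inner product with the n-loop unrolled by hand, so the compiler has back-to-back FPU multiplies and adds to schedule; the function name is illustrative and the 5x5 layout matches the code in the first post:

    /* Fully unrolled 5x5 multiply: c = a * b.  Sketch only. */
    void matmul5x5_unrolled(float c[5][5], const float a[5][5], const float b[5][5])
    {
        int m, p;
        for (m = 0; m < 5; m++) {
            for (p = 0; p < 5; p++) {
                /* One output element as an unrolled dot product. */
                c[m][p] = a[m][0] * b[0][p]
                        + a[m][1] * b[1][p]
                        + a[m][2] * b[2][p]
                        + a[m][3] * b[3][p]
                        + a[m][4] * b[4][p];
            }
        }
    }

    Whether this actually reaches the parallel multiply-and-add instruction still depends on the optimization level, so treat it as a starting point rather than a guaranteed 160 MFLOPS recipe.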

  • In reply to Trey German:

    Stephen,

    I suspect the compiler is not doing as well as it could.  Here are a few things to try out.

    Understand that as the compiler generates more optimal code, there is a tradeoff with debug capability.  When you start out you may want the most debug capability available; in this case the compiler options will likely be limited to -g and -mt.  You would then increase optimization from there.

    1. Start with -g -mt (symbolic debug + unified memory).  Both can be found on the basic options tab of the project options (in CCS 5).
    2. Next you can add -mn (optimize with debug).  This is on the runtime model tab.  It re-enables some optimizations that -g disabled but still allows you to debug fairly well.
    3. The next step would be to turn on some optimization level.  -o2 is often a good balance.  This can be found on the basic options tab.
    4. Next you would try perhaps -o3 or -o4 optimization.  These may or may not help the benchmark.
    5. Finally you can remove -g.  This can severely limit debug capability, so it is often done only on a particular file containing code you need highly optimized.

    There are some more details of these tips on this wiki page:

    http://processors.wiki.ti.com/index.php/C28x_Code_Generation_Tips_and_Tricks#Optimization

    Regards

    Lori

  • In reply to Stephen Moore:

    Stephen Moore wrote:

    STM32F4 (168MHz and FPU)         8.6 seconds
    NXP mbed LPC1768 (96MHz)        16.2 seconds
    LPCXpresso LPC1769 (120MHz)     19.4 seconds
    Piccolo F28069 (80MHz and FPU)  43.7 seconds

    After an unsuccessful attempt to run the CoreMark benchmark on C2000 (CoreMark doesn't like the lack of an 8-bit data type on C2000), I tried to run the code from the first post. Here are my results:

    1. Code in Flash - default waitstates
    -O0 - 85.149 seconds
    -O2 - 51.782 seconds
    -O4 - 51.782 seconds

    2. Code in Flash - minimum waitstates
    -O0 - 11.594 seconds
    -O2 - 5.603 seconds
    -O4 - 5.602 seconds

    3. Code in SRAM
    -O0 - 11.241 seconds
    -O2 - 5.414 seconds
    -O4 - 5.414 seconds

    I don't have an STM32F4 on hand to retest the code, but is it possible that the Piccolo at 80 MHz is executing floating-point code much faster than the 168 MHz STM?

    It is also interesting how improperly initialized flash gives you very poor performance 0:-) (One way to time these runs on-chip is sketched below.)
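
    For what it's worth, here is a sketch of how the elapsed time could be measured on-chip with CPU Timer 0; this assumes the standard F2806x header files from controlSUITE (CpuTimer0Regs, Uint32) and an 80 MHz SYSCLKOUT, so it is only illustrative:

    #include "DSP28x_Project.h"              // F2806x device headers (controlSUITE)

    // Configure CPU Timer 0 as a free-running cycle counter.
    void StartCycleCounter(void)
    {
        CpuTimer0Regs.TCR.bit.TSS = 1;        // stop the timer
        CpuTimer0Regs.PRD.all = 0xFFFFFFFF;   // maximum 32-bit period
        CpuTimer0Regs.TPR.all = 0;            // prescale = SYSCLKOUT
        CpuTimer0Regs.TPRH.all = 0;
        CpuTimer0Regs.TCR.bit.TRB = 1;        // reload counter from PRD
        CpuTimer0Regs.TCR.bit.TSS = 0;        // start counting (down)
    }

    // Around the benchmark loop:
    //   Uint32 start  = CpuTimer0Regs.TIM.all;
    //   ... 100,000-iteration matrix-multiply loop ...
    //   Uint32 cycles = start - CpuTimer0Regs.TIM.all;   // timer counts down
    //   float seconds = (float)cycles / 80.0e6;          // at 80 MHz SYSCLKOUT

    One caveat: at 80 MHz a 32-bit counter wraps after roughly 53 seconds, so the longer flash runs above would need the wrap-around accounted for.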

  • In reply to John Connor:

    Just a reply to agree with you about the flash.

    I unwittingly left out the example flash initialisation routines when creating my software and spent a good day scratching my head wondering why it was taking something like 15 clock cycles to do a single assembler instruction.

    Once I put the flash wait-state setup code back in (something like the sketch after this post), performance was back to 1 instruction per clock cycle and all was great :-)

    Almost not worth putting code in SRAM, the flash is so quick when set up properly.
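
    For anyone hitting the same thing, here is a minimal sketch of the kind of flash wait-state setup meant above, modeled on the InitFlash() routine in TI's F2806x example code. The register names assume the standard F2806x header files, the function name is illustrative, and the wait-state values are the ones the TI examples use at this clock rate, so check the device datasheet for your own SYSCLKOUT:

    #include "DSP28x_Project.h"               // F2806x device headers (controlSUITE)

    // Must run from RAM: flash wait states cannot be changed while executing from flash.
    #pragma CODE_SECTION(InitFlashWaitStates, "ramfuncs");
    void InitFlashWaitStates(void)
    {
        EALLOW;                                   // enable writes to protected registers
        FlashRegs.FOPT.bit.ENPIPE = 1;            // enable the flash pipeline
        FlashRegs.FBANKWAIT.bit.PAGEWAIT = 3;     // paged-access wait states
        FlashRegs.FBANKWAIT.bit.RANDWAIT = 3;     // random-access wait states
        FlashRegs.FOTPWAIT.bit.OTPWAIT = 5;       // OTP wait states
        EDIS;
        __asm(" RPT #7 || NOP");                  // let the pipeline flush
    }

    The "ramfuncs" section also has to be copied from flash to RAM at startup (the TI examples do this with memcpy and linker-defined symbols), which is the part that is easy to leave out.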

  • In reply to John Bennett:

    That being said, MFLOPS is more of a marketing number because it really depends on how the code is written: assembly, C, loops unrolled, optimizations, etc.

    I'm using FLOPS as my benchmark number, based on code derived from the TI benchmarking application note.

    Performance is highly dependent on the compiler settings; the differences between settings are greater than an order of magnitude.

    I don't have an STM32F4 on hand to retest the code, but is it possible that the Piccolo at 80 MHz is executing floating-point code much faster than the 168 MHz STM?

    In general, the F28069 appears to be running the floating-point code faster than the STM32, although the STM could also be running slowly because of similar optimization issues.

    If you hand-coded assembly you could theoretically get up to 160 MFLOPS, as we have a parallel multiply-and-add instruction that is single cycle.

    The F28069/CCS system is very sensitive and tricky. I'm concerned that what could be profitable software development time will instead be spent figuring out the sensitivities of the TI toolchain; we could spend forever tweaking settings instead of writing revenue-generating code. As you mention, hand-coding the most math-intensive routines (matrix multiplication, dot products, or matrix inversions) may be the best way to go. Hand-coded assembly would essentially stop the FPU-heavy routines from being CPU throughput hogs and alleviate our timing worries.