
Compiler/TMS320F280049: Slower code than 2808 compiler

Part Number: TMS320F280049

Tool/software: TI C/C++ Compiler

Hi C2000 Team,

My customer is looking into using the F280049 and has noticed this compiler discrepancy. They are using the F2808 in their previous design.

I’ve been doing some testing with the F2808 and the F280049 in an attempt to quantify the performance increase we could expect from a migration from the ’08 to the ‘0049.  Before I can get into the “meat” of these tests, though, I need to do some baseline testing to make sure everything is “set” correctly.  So I created some simple test code to execute on both, and I get good results.  Here’s my test code (matrix multiplication):

 

for( i = 0; i < N; i += 1 )
{
   for( j = 0; j < N; j += 1 )
   {
      for( k = 0; k < N; k += 1 )
      {
         C[i][j] += A[i][k] * B[k][j];
      }
   }
}

 

My Environment:

CCS v 6.2.0.00050

Compiler version TI v18.1.1.LTS

Test code executes from M0 RAM

Core = 100 MHz

Matrices A, B, and C are all 64x64

 

This snippet of code was purposely written “textbook” style, without any regard to optimization (this is part of the test).  Furthermore, the algorithm itself is a good representation of how we currently use the F2808, so it should give me a good idea of how the F280049 would perform for the same application.  I measure the execution speed by toggling a GPIO pin after each pass and watching it on a scope (the old-fashioned way).
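
A minimal sketch of the measurement loop (assuming the C2000Ware-style bit-field device headers and a GPIO pin already configured as an output; the header name and pin number here are illustrative, not my actual setup):

#include "F28x_Project.h"   /* device headers; the exact name differs per device support package */

#define N 64
int A[N][N], B[N][N], C[N][N];

void run_benchmark(void)
{
    int i, j, k;

    for(;;)
    {
        /* One full 64x64 matrix multiply per pass */
        for( i = 0; i < N; i += 1 )
            for( j = 0; j < N; j += 1 )
                for( k = 0; k < N; k += 1 )
                    C[i][j] += A[i][k] * B[k][j];

        /* Toggle the scope pin after each pass; the time between
           consecutive edges equals one pass of the calculation. */
        GpioDataRegs.GPATOGGLE.bit.GPIO0 = 1;
    }
}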

 

I’m glad to say that the assembly code generated by the compiler for each processor is identical, as expected (thankfully), and both execute in exactly the same time when running their cores at 100 MHz (82.1 ms execution time per pass).  This is a good baseline for my performance comparison investigation, and I am happy to see this.

 

However, when I set the compiler optimization level to “2” (speed = 2 = default), the F2808 gains an advantage in terms of execution speed.  Upon further investigation, I found that this advantage is a result of a difference in the compiler optimizations, which is puzzling.  The F280049 fully supports the entire F2808 instruction set, so I would expect (at the very least) identical assembly code generated – but that is not the case.  The F2808 optimizer “wins” by creating a tighter “inner loop” for this test code snippet (see assembly listings below).  This results in the F280049 taking 11% longer to perform the calculation.  Should I expect my code to run 11% slower on the F280049 if I decide to migrate?  Please tell me “no” and that a new compiler revision is in the works.

 

Below is a side-by-side assembly listing (generated by the compiler) for the F280049 and F2808 – I’ve highlighted the differences that result in a better optimization for the F2808 (not sure why there is ANY difference).

 

I literally copied the F2808 asm code and placed it in the F280049 “main()” procedure and it worked fine.  Execution time is now identical between the two processors.  Still not sure why the F280049 compiler chose a sub-par solution.

 

Please let me know if you have any questions; I appreciate your feedback.

Regards,

~John

  • John,

    The compiler is the same for both devices. It doesn't know the F280049 from the F2808; it only knows the build options that are passed to it. Are the test cases two different projects? (I suspect so.) There must be some subtle difference in the build options. Hmmm, try disabling the FPU in the build options for the F280049 and see if that fixes the problem. Even though the FPU has nothing to do with this code, maybe it is affecting the optimization.
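
    For illustration only, the kind of build-option difference I mean (hypothetical option sets, not the actual projects) would be something like:

        # F2808 project (fixed-point only); options are illustrative
        cl2000 -v28 --opt_level=2 matmul.c

        # F280049 project (FPU32 code generation enabled)
        cl2000 -v28 --float_support=fpu32 --opt_level=2 matmul.c

    If removing --float_support=fpu32 from the F280049 build makes the two assembly listings match, that confirms the FPU option is what changes the optimizer's behavior.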

    Regards,
    David
  • Hi David,

    Thank you for the quick response and feedback. The customer has let me know that they will test this. However, they wanted to reiterate that this doesn't solve the issue; it just answers the why. From the customer:

    That’s a good point – I suspected that the compiler was considering using the RPTB instruction (new to the 280049), which is only an option when the FPU is enabled (from what I’ve been told).  But it eventually decided not to.  This extra decision branching resulted in a slightly different compilation (still not sure why – in a perfect universe, the end result would still be the same).

    I STILL must stress that the resulting compilation was not optimized!  The 2808 compilation ran faster on the 280049.  What I can surmise from TI's response (and my test results) is that I can expect identical code to run slower on the 280049 when I have the FPU enabled.  And why would I use the 280049 if not for the FPU?

    Regards,

    ~John

  • Hi John,

    If the FPU being enabled in the compiler options is causing sub-optimal fixed-pt. code to be generated, then that is a compiler problem and something the compiler team would need to look at. I don't know if that is the case here of course. It is just speculation.

    - David
  • I would like to focus on why the F280049 build gets an extra MOV instruction in the innermost loop.  Please submit a test case as described in the article How to Submit a Compiler Test Case.  I expect I will be able to use that to produce the same assembly code as you show for the F280049 build.  Then I will submit a performance bug against the compiler.  That will result in either the problem being fixed, or an explanation given as to why the MOV cannot be avoided.

    Thanks and regards,

    -George

    I'd appreciate it if we could get the requested test case.

    Thanks and regards,

    -George

  • If the customer uses the "UNROLL" #pragma, they don't have to modify the code:

     

        for( i = 0; i < N; i += 1 )
        {
            for( j = 0; j < N; j += 1 )
            {
                #pragma UNROLL(2)
                for( k = 0; k < N; k += 1 )
                {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }

     

     

    The code is a matrix multiply.  As written (typical generic code), the compiler performs a memory access on every operation (read and write). This takes more cycles but also consumes more power.

    It is more efficient to write such code as follows (a DSP engineer's view):

     

        for( i = 0; i < N; i += 1 )
        {
            for( j = 0; j < N; j += 1 )
            {
                sum = 0;
                #pragma UNROLL(4)
                for( k = 0; k < N; k += 1 )
                {
                    sum += (long) A[i][k] * (long) B[k][j];
                }
                C[i][j] = (int) sum;
            }
        }

     

    Note: an unroll factor of 4 gives the best result.

     

    The above code allows the compiler to use the MAC operation. But you need to make sure that the A and B arrays are in separate RAM blocks, using the DATA_SECTION #pragma:

     

    #pragma DATA_SECTION(A, "section name in linker")
    #pragma DATA_SECTION(B, "section name in linker")

    int A[N][N];
    int B[N][N];
    int C[N][N];

     

    The above modified code will run faster than the original.
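
    For reference, the "separate RAM blocks" part is done in the linker command file. A minimal sketch, using hypothetical section names and two of the F28004x LSx RAM blocks (use the memory names, and PAGE assignments if present, from the device's actual .cmd file, and make the section names match the DATA_SECTION pragmas above):

    /* Hypothetical section and memory names, for illustration only */
    SECTIONS
    {
        Adata : > RAMLS4
        Bdata : > RAMLS5
    }

    Keeping A and B in different physical blocks matters because the MAC instruction reads one operand over the data-read bus and the *XAR7 operand over the program-read bus, so separate blocks avoid the memory conflicts mentioned below.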

     

    Further, if the code is written as above, the compiler could optimize this to an RPT single operation, and then the cycle times would get MUCH better.

     

    The compiler basically generates something like this (the MAC operation takes 2 cycles if there are no memory conflicts):

    RPTB    .....
    ...
    MAC     P,*XAR0++,*XAR7
    ADDB    XAR7,#64
    MAC     P,*XAR0++,*XAR7
    ADDB    XAR7,#64
    ....

     

    But it could generate this as an RPT single:

     

    MOVB    XAR0,#64
    MOVB    ACC,#0
    MOV     T,*XAR7++
    MPY     P,*XAR6          ; ARP pointer points to XAR6
    ADDB    XAR6,#64
    RPT     #62
    ||MAC   P,*0++,*XAR7++
    ADD     ACC,@P

     

    This could be a compiler optimization that recognizes a sum of products.

     

    Cheers,

    Alex T.