Multi for loop performance

Hi.

I am using an EVM C6747 with CCS v3.3. The following code takes 2050 ms to complete. I have already set the build options to optimization level 3 (build details below). Any suggestions or ideas for improving the performance significantly? I would like it to finish within tens of milliseconds, as this algorithm affects my whole system. Can you benchmark it on your systems and compare with mine?

TQ.

/*********************CODE START************************/

unsigned short SenMap[32][32][128][128];
float WBM[128][128];

for (tx = 0; tx < 32; tx++)
{
    for (rx = 0; rx < 32; rx++)
    {
        for (x = 0; x < 128; x++)
        {
            for (y = 0; y < 128; y++)
            {
                WBM[x][y] = WBM[x][y] + (SenMap[tx][rx][x][y] * 255);
            }
        }
    }
}

/*********************CODE END***************************/

-------------------------  Uart_example.pjt - Release  -------------------------
"C:\CCStudio_v3.3\C6000\cgtools\bin\cl6x" -k -pm -op2 -o3 -fr"../obj/release" -i"../../../../../../../" -mo -ml3 -mf5 -mv6700 -mv6740 -@"../build/Release.lkf"
[Pllc_example.c]
[uart_example.c]
"C:\Program Files\Texas Instruments\pspdrivers_01_20_00\packages\ti\pspiom\cslr\evmOMAPL137\examples\uart\src\uart_example.c", line 52: warning: variable "large" was declared but never referenced
<Optimizing>
<Generating>
<Assembling>

[Linking...] "C:\CCStudio_v3.3\C6000\cgtools\bin\cl6x" -@"Release.lkf"
<Linking>

Build Complete,
  0 Errors, 1 Warnings, 0 Remarks.


  • The precise way in which the WBM and SenMap are defined makes a big difference.  Are they local, arguments passed in, or what?  With that in mind, please submit a source file that can be built.  Attach it to your next post.  From that, we can likely offer some specific advice.

    In the meantime, please try the advice in this Wiki article.  Also consider memory effects.  Is it possible you are losing many cycles just to reading and writing memory?

    Thanks and regards,

    -George 
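
To put the memory question in numbers, here is a minimal sketch that computes the footprint of the two arrays from the posted declarations. The helper names are hypothetical and not part of the original project. With a 2-byte unsigned short and a 4-byte float, SenMap comes to 32 MiB and WBM to 64 KiB, so SenMap almost certainly has to live in external memory, and every element of it is read exactly once. At 2050 ms for 32 x 32 x 128 x 128 = 16,777,216 inner iterations, the loop averages roughly 122 ns per element.

```c
#include <stddef.h>

/* Dimensions copied from the posted declarations. */
enum { N_TX = 32, N_RX = 32, N_X = 128, N_Y = 128 };

/* Bytes occupied by unsigned short SenMap[32][32][128][128]. */
static size_t senmap_bytes(void)
{
    return (size_t)N_TX * N_RX * N_X * N_Y * sizeof(unsigned short);
}

/* Bytes occupied by float WBM[128][128]. */
static size_t wbm_bytes(void)
{
    return (size_t)N_X * N_Y * sizeof(float);
}

/* Number of times the innermost statement executes. */
static size_t inner_iterations(void)
{
    return (size_t)N_TX * N_RX * N_X * N_Y;
}
```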

  • Hi again,

    Thanks for the suggestion. I have attached the whole project: 0066.uart.zip

    Additional note: I used a 32x PLL setting. The attached project should include all the PLL settings. What do you think? Can you get the loops to run faster?

  • Thanks for the test case.  The usual advice in cases like this (use the restrict keyword, MUST_ITERATE pragma, etc.) does not apply here.  I have nothing like that to offer.

    I do observe that, if the type of SenMap is changed from unsigned short to float, it runs about 2X faster.  I don't know if that is a change you can make.

    While I pointed out memory latency as a potential issue, I don't know much about how to improve it.  In particular, I don't know anything about PLL settings.  I suggest you turn to the C67x Forum for that.

    Thanks and regards,

    -George
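
The observation that a float SenMap runs about 2X faster suggests the per-element integer-to-float conversion may be part of the cost. One way to explore that without changing the type of SenMap is to sum the raw samples in a 32-bit integer accumulator and apply the multiply by 255 and the float conversion once per pixel at the end. The sketch below is an assumption-laden illustration, not a tested C6747 solution: the function name, the flat-pointer layout, and the caller-supplied scratch buffer `acc` are all hypothetical, and the result can differ from the original in the last bits because rounding happens once instead of on every addition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Computes wbm[p] += 255 * (sum over m of senmap[m * n_pix + p]).
 * senmap holds n_maps consecutive maps of n_pix samples each.
 * acc is scratch space for n_pix 32-bit sums, supplied by the caller.
 */
void accumulate_maps(float *wbm, const unsigned short *senmap,
                     uint32_t *acc, size_t n_maps, size_t n_pix)
{
    size_t m, p;

    /* 65535 * 65536 < 2^32, so the 32-bit sums cannot overflow. */
    assert(n_maps <= 65536u);

    for (p = 0; p < n_pix; p++)
        acc[p] = 0;

    /* Integer-only inner loop: one load, one add, one store per sample. */
    for (m = 0; m < n_maps; m++) {
        const unsigned short *map = senmap + m * n_pix;
        for (p = 0; p < n_pix; p++)
            acc[p] += map[p];
    }

    /* Convert to float and scale once per pixel instead of once per sample. */
    for (p = 0; p < n_pix; p++)
        wbm[p] += 255.0f * (float)acc[p];
}
```

For the posted arrays the call would be `accumulate_maps(&WBM[0][0], &SenMap[0][0][0][0], acc, 32 * 32, 128 * 128)` with `acc` declared as `uint32_t acc[128 * 128]`. Whether this is actually faster on the C6747 would have to be measured, since the loop still reads all 32 MiB of SenMap.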

  • I don't have time to try this, but perhaps some use of the restrict keyword would help.

    Another thought is that this loop is going to be very limited by data loads and stores. Is there any mathematical or logical pattern in the SenMap array that could be used instead of treating it as a look-up table? I mean, if any of the values can be computed in the loop instead of being loaded from the SenMap array, the processor could use fewer D unit instructions and more L, S, and M unit instructions, and that could be optimized better. It may seem counter-intuitive that computing some values would be faster. We were all taught that look-up tables are very fast at the cost of wasting memory, but that is not always the case with this processor when using software-pipelined loops.
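
For readers unfamiliar with the restrict keyword and the MUST_ITERATE pragma mentioned in this thread, the following is a minimal sketch of how they are typically applied to a loop like this one. The function name and the flat-pointer interface are hypothetical. The pragma is specific to the TI compiler, so it is guarded with the `_TMS320C6X` macro that the C6000 compiler predefines. Note that an earlier reply reports these hints did not help with the posted test case, where the trip counts are already compile-time constants, so treat this as an illustration of the technique rather than a fix.

```c
#include <stddef.h>

/*
 * Adds one 255-scaled map into wbm.
 * restrict promises the compiler that wbm and map never overlap,
 * so loads from map can be scheduled ahead of stores to wbm.
 * The caller must pass n_pix >= 4 and a multiple of 4 to honour the pragma.
 */
void accumulate_one_map(float *restrict wbm,
                        const unsigned short *restrict map,
                        int n_pix)
{
    int p;

#ifdef _TMS320C6X
    /* Trip count is at least 4 and a multiple of 4 (caller's promise). */
    #pragma MUST_ITERATE(4, , 4);
#endif
    for (p = 0; p < n_pix; p++)
        wbm[p] += (float)(map[p] * 255);
}
```

With the posted arrays, each of the 32 x 32 maps would be passed as `accumulate_one_map(&WBM[0][0], &SenMap[tx][rx][0][0], 128 * 128)`.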