Loop Carried Dependency Bound question

Jonathan Wieman

Hello,

I am starting to optimize some C source code for speed gains and have been attempting to understand the information provided in the optimization feedback in the assembly code.

I am getting a loop carried dependency bound value much higher than I expected. The pointer parameter is restrict-qualified; however, my loop carried dependency bound is 4, while I would expect it to be zero.

The function in question takes a pointer to a floating-point array and a size, and returns the floating-point sum of the array's elements.

float array_sum ( const float * restrict In,    /* Input array */
                  unsigned               Size ) /* Size        */
{
    int   i = 0;                                /* Loop counter */
    float x = 0.0;                              /* Running sum of array */

    _nassert( (int) In % 8 == 0);               /* Input pointer is 64-bit aligned */
    #pragma MUST_ITERATE (2, )                  /* Loop must execute at least twice */
    for ( i = 0; i < Size; i++ )
    {
        x += In[i];
    }
    return x;
}

;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : ../foo.c
;*      Loop source line                 : 300
;*      Loop opening brace source line   : 301
;*      Loop closing brace source line   : 303
;*      Known Minimum Trip Count         : 2
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 4
;*      Unpartitioned Resource Bound     : 1
;*      Partitioned Resource Bound(*)    : 1
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     0        0
;*      .D units                     1*       0
;*      .M units                     0        0
;*      .X cross paths               0        0
;*      .T address paths             1*       0
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           1        0     (.L or .S unit)
;*      Addition ops (.LSD)          0        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             1*       0
;*      Bound(.L .S .D .LS .LSD)     1*       0
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 4 Schedule found with 2 iterations in parallel
;*
;*      Register Usage Table:
;*          +-----------------------------------------------------------------+
;*          |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;*          |00000000001111111111222222222233|00000000001111111111222222222233|
;*          |01234567890123456789012345678901|01234567890123456789012345678901|
;*          |--------------------------------+--------------------------------|
;*       0: |    **                          |                                |
;*       1: |   ***                          |                                |
;*       2: |    **                          |                                |
;*       3: |    **                          |                                |
;*          +-----------------------------------------------------------------+
;*
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages       : 0
;*      Collapsed prolog stages       : 0
;*      Minimum required memory pad   : 0 bytes
;*
;*      Minimum safe trip count       : 1
;*      Min. prof. trip count (est.) : 3
;*
;*      Mem bank conflicts/iter(est.) : { min 0.000, est 0.000, max 0.000 }
;*      Mem bank perf. penalty (est.) : 0.0%
;*
;*
;*      Total cycles (est.)         : 4 + trip_cnt * 4
;*----------------------------------------------------------------------------*

Any suggestions?

over 13 years ago

0 RandyP over 13 years ago

TI__Guru* 84110 points

Jonathan,

Welcome to the TI E2E forum. I hope you will find many good answers here and in the TI.com documents and in the TI Wiki Pages. Be sure to search those for helpful information and to browse for the questions others may have asked on similar topics.

There are a lot of forums within E2E. You posted this question to the C67x Single Core DSP forum, which is for device questions. This belongs in the TI C/C++ Compiler forum where the compiler experts live. Since you are new, we will have someone move this there for you.

They will want to know your compiler version number and your compiler optimization settings, and I think it would be helpful to show the assembly that was generated for this short loop. They will also need the specific device you are using and the compiler -mv option.

The Compiler User's Guide is the place to learn more about the different optimization terms and how the optimizations work.

Regards,
RandyP

0 Archaeologist over 13 years ago

TI__Guru* 84285 points

The quick answer is that you've got a recurrence on the ADDSP instruction, which consumes its own output (thus the recurrence), and takes 4 cycles (thus the recurrence is 4 cycles).

The longer answer is that the compiler can't do much about it, because it is not entitled to reorder floating-point instructions, and it doesn't know much about exactly how many times the loop would execute. You might be able to get a better schedule with --fp_reassoc=on and some loop unrolling.

[Edit: output, not input --Archaeologist]
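As a rough illustration of the loop unrolling suggested above, the sketch below keeps four independent partial sums, so each add depends only on the add made four elements earlier instead of on the immediately preceding one. The function name array_sum_4acc and the choice of four accumulators (matching the 4-cycle ADDSP latency mentioned above) are assumptions made for this example, not something taken from the original code. It is written in portable C, so the _nassert and MUST_ITERATE hints are left out. Whether the compiler actually produces a tighter schedule for it has not been verified here. Note that it adds the elements in a different order than the original loop, so the result can differ in the last bits. That is the same kind of reordering the compiler is not entitled to do on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of array_sum: four independent partial sums, so no
   add has to wait for the add issued just before it. */
float array_sum_4acc(const float *restrict in, unsigned size)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    unsigned i;

    assert(in != NULL);

    /* Main body: four elements per pass, one per accumulator.
       i never exceeds size, so size - i cannot wrap around. */
    for (i = 0; size - i >= 4; i += 4) {
        s0 += in[i];
        s1 += in[i + 1];
        s2 += in[i + 2];
        s3 += in[i + 3];
    }

    /* Remainder: up to three leftover elements. */
    for (; i < size; i++) {
        s0 += in[i];
    }

    /* Combine the partial sums; this ordering differs from the original loop. */
    return (s0 + s1) + (s2 + s3);
}
```

If the array length is known to be a multiple of four, the remainder loop could be dropped and that fact passed to the compiler instead, which gives it more information about the trip count.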
