This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

High Loop Carried Dependency Bound - why?



Hi,

 I have some code doing basically mergesort on two input arrays (actually its raw data not represented by a C data structure, thats why I rely on pointers):


int x;
for(x=0; x < 4; x++) {
   short* restrict leftMin = &min0_0[x];
   short* restrict leftMinIdx = &minIdx0_0[x];
   short* restrict rightMin = &min0_1[x]; 
   short* restrict rightMinIdx = &minIdx0_1[x];
   short* restrict validMin = &validMin0[x]; 
   short* restrict validMinIdx = &validMin0Idx[x];

   int i; 
   #pragma UNROLL(4)
   for (i = 0; i < 4; i++) {
      if (*leftMin < *rightMin) {
           *validMin = *leftMin;
           *validMinIdx = *leftMinIdx;
            leftMin += yStep;
            leftMinIdx += yStep;
       } else {
           *validMin = *rightMin;
           *validMinIdx = *rightMinIdx;
            rightMin += yStep;
            rightMinIdx += yStep;
      }

      validMin += valStep;
      validMinIdx += valStep;
   }
 }



Compiling this code generates a loop (outer one) which takes 25 cycles, instead of the expected <20, the culprit seems to be the high Loop Carried Dependency Bound:

;* Loop found in file : ../main.c
;* Loop source line : 49
;* Loop opening brace source line : 49
;* Loop closing brace source line : 78
;* Known Minimum Trip Count : 4
;* Known Maximum Trip Count : 4
;* Known Max Trip Count Factor : 4
;* Loop Carried Dependency Bound(^) : 23
;* Unpartitioned Resource Bound : 16
;* Partitioned Resource Bound(*) : 16
;* Resource Partition:
;* A-side B-side
;* .L units 2 2
;* .S units 1 0
;* .D units 16* 16*
;* .M units 0 0
;* .X cross paths 10 4
;* .T address paths 16* 16*
;* Long read paths 0 0
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 22 16 (.L or .S or .D unit)
;* Bound(.L .S .LS) 2 1
;* Bound(.L .S .D .LS .LSD) 14 12
;*
;* Searching for software pipeline schedule at ...
;* ii = 23 Did not find schedule
;* ii = 24 Did not find schedule
;* ii = 25 Schedule found with 2 iterations in parallel

 

 

However, what I don't understand is - how the loop can be dependency bound? It does not pass any data to the next iteration actually, and pointers are all annotated with the restrict keyword.

I would be grateful for a pointer whats going wrong here.

Thx

 

 

 

 

 

 

 

  • First, there is a recurrence on several arrays.  Consider validMin.  On every iteration of the outer loop, you reset validMin to &validMin0[x], but also increment validMin in the inner loop.  The pattern of accesses for validMin0 is 0 1 2 3, 1 2 3 4, 2 3 4 5, etc.  Perhaps you meant to initialize validMin to &validMin0[x*4] in the outer loop?

    Secondly, restrict only takes effect for the lifetime of the variable.  As you have it, it says nothing about the relationship between these variables and the same variables on the next iteration of the outer loop.  Consider moving the declaration of the variables outside the outer for loop as follows:

    short* restrict leftMin;
    short* restrict leftMinIdx;
    short* restrict rightMin; 
    short* restrict rightMinIdx;
    short* restrict validMin; 
    short* restrict validMinIdx;
    
    int x;
    for(x=0; x < 4; x++) {
        leftMin = &min0_0[x];
        leftMinIdx = &minIdx0_0[x];
        rightMin = &min0_1[x]; 
        rightMinIdx = &minIdx0_1[x];
        validMin = &validMin0[x]; 
        validMinIdx = &validMin0Idx[x];
    
  • Hi Archaelogist,

    Actually the code is iterating in two dimensions - one dimension in the outer loop ("x") and another one in the inner loop ("y" only 0-3).

    Thanks to your suggestions of moving the definition of the pointers out of the loop, the "loop carried dependency bound" went down to 20 - and the compiler was able to find a schedule within 21 cycles. ts a descent improvement of 20%, thanks a lot.
    I decided against writing this function in linear assembly as its memory bound anyway.

    What I don't understand is, why is there any  "loop carried dependency bound" at all - the inner loop is unrolled, so according to the output the optimizer is pipelining the outer loop. However, all a loop iteration does, is reading and writing some non-dependant memory locations, there is not a single variable passed to the next iteration, as all pointers are reset at the beginning.
    It would be really great for me to understand the optimizer in this case, so I would able to understand whats going on behind the scenes and write better code next time.

    Thx

     

     

  • The loop carried dependency bound comes from the if statement.  There is a path through the if statement such that the pointer leftMin gets incremented, and another path such that it does not get incremented.  The if control expression reads the value of leftMin.  Thus, the next iteration of the loop can't start until we know whether or not leftMin got incremented in the previous iteration.  The same is true of rightMin.  No amount of unrolling will help.

  • Heinrich Schwarz said:
    Actually the code is iterating in two dimensions - one dimension in the outer loop ("x") and another one in the inner loop ("y" only 0-3).

    Are yStep and valStep guaranteed to be multiples of 4?

  • The loop carried dependency bound comes from the if statement. There is a path through the if statement such that the pointer leftMin gets incremented, and another path such that it does not get incremented. The if control expression reads the value of leftMin. Thus, the next iteration of the loop can't start until we know whether or not leftMin got incremented in the previous iteration. The same is true of rightMin. No amount of unrolling will help.


    Ah of course, the if causes a "dependency chain" which basically spans from the first to the last iteration. Thanks for the clear explanation.


    Are yStep and valStep guaranteed to be multiples of 4?


    Could be arranged...

    Thx