Why can't the Loop Carried Dependency Bound be reduced?

My bilinear interpolation algorithm is written in C++. The main code is as follows:

//*************************************************

The compiler feedback is as follows:

The assembly is as follows:

Note that the input parameters px and py are already declared with the restrict keyword. Line 490 appears to be the main cause; adding the restrict keyword to the variables p0, p1, p2, p3 did not help. The main issue seems to be that the multiplications in lines 490 and 491 cannot be executed in parallel. I also tried _dotpu4, without success.

So, how can I reduce the Loop Carried Dependency Bound, based on the assembly feedback?

  • It's a bit difficult to see at first, but my guess is that pDstTemp is a pointer to unsigned char? It looks like the compiler cannot rule out that pDstTemp aliases gudL2SrcData_Addr, so it has to load gudL2SrcData_Addr after writing to *pDstTemp. You can either use the restrict keyword on pDstTemp (declare it as "unsigned char * restrict pDstTemp") or use a local variable defined outside the inner loop to store pDstTemp, i.e.

    for(j = 0; j < l32Height; j++)
    {
      unsigned char * l2SrcData_Addr = (unsigned char*)gudL2SrcData_Addr;
      /* code that follows uses l2SrcData_Addr instead of gudL2SrcData_Addr */
      [...]
    

  • I should add that, for line 490, the pointers p0, p1, p2, p3 were originally not declared with the restrict keyword. When I did add it, the Loop Carried Dependency Bound increased rather than decreased. I also tried computing one_sub_dx*one_sub_dy in advance, and that helped. So the main problem seems to be that I cannot compute the values of p0, p1, p2, p3 in advance. How can I reduce the Loop Carried Dependency Bound?

    Can anyone help?

    Best Regards!
    lai yi
  • First, thanks for your advice, it helped a lot! pDstTemp is indeed a pointer to unsigned char. But I still can't see why the compiler cannot rule out that pDstTemp aliases gudL2SrcData_Addr. Could you explain it based on the .asm file? Also, I think the Partitioned Resource Bound could be optimized as well?

    Best Regards!
    lai yi
  • Your problem likely has little to do with the multiplications on lines 490 and 491. The problem is the assignment in line 493, which writes to *pDstTemp. The value that is written ultimately depends (through a series of computations) on p0 and thus gudL2SrcData_Addr (line 483). So in iteration i the following steps are performed:
    (i, 1) load gudL2SrcData_Addr
    (i, 2) compute p0, p1, ... (depends on (i, 1))
    (i, 3) write to *pDstTemp (depends on (i, 2))

    The problem is that the compiler has to assume that (i+1, 1) depends on (i, 3), which creates the loop-carried dependency bound you are seeing. You need to tell the compiler that writing to *pDstTemp in (i, 3) does not affect (i+1, 1). You can do that by adding restrict to pDstTemp or by making a local copy of gudL2SrcData_Addr, as I wrote above. I think I'd actually prefer the latter in this case.

    EDIT: our replies were overlapping, so I didn't see your latest reply when I was answering. To see why the compiler has to assume that there is a dependency, think about what happens when you pass "(unsigned char*)&gudL2SrcData_Addr" as pDstTemp.
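
    A minimal sketch of that situation in C. The names g_src, fill_global_reload and fill_local_copy are hypothetical stand-ins, not the original code. The first function reads a global pointer inside the loop, so a store through an unsigned char pointer might overwrite the pointer itself (this is what would happen if (unsigned char*)&g_src were passed as dst). The second copies the pointer into a local whose address is never taken.

```c
#include <stddef.h>

/* Hypothetical stand-in for gudL2SrcData_Addr: a global holding the source address. */
const unsigned char *g_src;

/* The global pointer is read inside the loop. Because dst has character type,
 * the compiler must assume that dst[i] = ... may overwrite the bytes of g_src
 * itself, so each iteration's load of g_src depends on the previous store. */
void fill_global_reload(unsigned char *dst, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = (unsigned char)(g_src[i] + 1);
}

/* Same computation, but the pointer is copied to a local once. The local's
 * address is never taken, so no store through dst can modify it. */
void fill_local_copy(unsigned char *dst, size_t n)
{
    const unsigned char *src = g_src;
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = (unsigned char)(src[i] + 1);
}
```

    The local copy only removes the dependency on the pointer variable. Whether dst overlaps the data that src points to is a separate question, and that is the one restrict answers.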

  • In my project, adding restrict to pDstTemp helps, while the latter approach (the local copy) does not. So I agree with your reply that the restrict keyword is what lets the compiler rule out the dependency. Following this advice, I rewrote the code below, adding restrict to the input and output variables of a loop. Why does it not work at all?

    The rewritten code section is as follows:

    (i,1) load pu8TmpImg[l32Index2 - 1]

    (i,2) compute l32Sum (depends on (i,1))

    (i,3) compute l32SquareSum (depends on (i,1))

    (i,4) write to pl32TmpIntegImg[l32Index2] (depends on pl32TmpIntegPrev[l32Index2] and (i,2))

    Here int32_t * restrict pl32TmpIntegPrev = (int32_t *)(pl32TmpIntegImg - l32ImgWidth), which means the value of pl32TmpIntegPrev[l32Index2] depends on pl32TmpIntegImg. Does this mean the dependency cannot be ruled out even though restrict was added?

    The compiler reports a Loop Carried Dependency Bound for line 366, as follows:

    Following your advice, I must tell the compiler that writing to pl32TmpIntegImg[l32Index2] in (i,4) does not affect (i+1,1), so pl32TmpIntegImg is defined as follows:

    int32_t * restrict pl32TmpIntegImg = pl32IntegImg; where pl32IntegImg is the input parameter declared as int32_t * restrict pl32IntegImg

    But why is the Loop Carried Dependency Bound still so high? What is wrong with my reasoning?

    Note: for (i,1) load pu8TmpImg[l32Index2 - 1], pu8TmpImg is derived from one of the input parameters: uint8_t * restrict pu8TmpImg = pu8ImgData;

  • Can you show us the whole function? It's hard to tell without seeing the variable declarations. The following code

    #include <stdint.h>
    
    void function(uint8_t* restrict pu8TmpImg, int32_t * restrict pl32TmpIntegImg, int32_t* pl32TmpIntegPrev, double * restrict pd64TmpSquarImg, double *pd64TmpSquarPrev, int32_t roi_width)
    {
    	int32_t l32Sum = 0;
    	int32_t l32SquareSum = 0;
    	int32_t l32Index2;
    	uint8_t gray_value;
    
    
    	for(l32Index2=0; l32Index2<roi_width; l32Index2++)
    	{
    		gray_value = pu8TmpImg[l32Index2-1];
    		l32Sum += gray_value;
    		l32SquareSum += (gray_value * gray_value);
    
    		pl32TmpIntegImg[l32Index2] = pl32TmpIntegPrev[l32Index2] + l32Sum;
    		pd64TmpSquarImg[l32Index2] = pd64TmpSquarPrev[l32Index2] + l32SquareSum;
    	}
    }
    

    resulted in

    ;*----------------------------------------------------------------------------*
    ;*   SOFTWARE PIPELINE INFORMATION
    ;*
    ;*      Loop found in file               : loop.c
    ;*      Loop source line                 : 11
    ;*      Loop opening brace source line   : 12
    ;*      Loop closing brace source line   : 19
    ;*      Known Minimum Trip Count         : 1                    
    ;*      Known Max Trip Count Factor      : 1
    ;*      Loop Carried Dependency Bound(^) : 0
    ;*      Unpartitioned Resource Bound     : 3
    ;*      Partitioned Resource Bound(*)    : 3
    ;*      Resource Partition:
    ;*                                A-side   B-side
    ;*      .L units                     2        0     
    ;*      .S units                     0        0     
    ;*      .D units                     3*       2     
    ;*      .M units                     1        0     
    ;*      .X cross paths               0        1     
    ;*      .T address paths             3*       2     
    ;*      Long read paths              0        0     
    ;*      Long write paths             0        0     
    ;*      Logical  ops (.LS)           0        1     (.L or .S unit)
    ;*      Addition ops (.LSD)          3        0     (.L or .S or .D unit)
    ;*      Bound(.L .S .LS)             1        1     
    ;*      Bound(.L .S .D .LS .LSD)     3*       1     
    ;*
    ;*      Searching for software pipeline schedule at ...
    ;*         ii = 3  Schedule found with 6 iterations in parallel
    ;*      Done
    ;*
    ;*      Loop will be splooped
    ;*      Collapsed epilog stages       : 0
    ;*      Collapsed prolog stages       : 0
    ;*      Minimum required memory pad   : 0 bytes
    ;*
    ;*      Minimum safe trip count       : 1
    ;*----------------------------------------------------------------------------*

  • You will have to be careful though, if pl32TmpIntegPrev is actually based on pl32TmpIntegImg (because it is something like pl32TmpIntegImg - 1). In that case, the restrict qualifiers will be wrong.
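
    One way to keep the qualifiers truthful, sketched here under assumptions: restrict is applied only to the parameters of a per-row helper. The name integral_row and this buffer layout are hypothetical, not the original code. Within a single call the previous row is only read and the current row is only written, and the two do not overlap as long as width <= stride, so the promise made by restrict holds for each call.

```c
#include <stdint.h>

/* Hypothetical helper: computes one row of an integral image.
 * Assumption: src, prev and cur do not overlap for the width elements
 * touched by this call, which is what restrict promises to the compiler. */
static void integral_row(const uint8_t * restrict src,
                         const int32_t * restrict prev,
                         int32_t * restrict cur,
                         int32_t width)
{
    int32_t sum = 0;   /* running sum of the current row, kept in a local */
    int32_t x;
    for (x = 0; x < width; x++) {
        sum += src[x];
        cur[x] = prev[x] + sum;
    }
}

/* The caller derives both row pointers from the same buffer, without restrict.
 * integ must hold (height + 1) * stride elements; row 0 is a zero border row. */
void integral_image(const uint8_t *img, int32_t *integ,
                    int32_t width, int32_t height, int32_t stride)
{
    int32_t x, y;
    for (x = 0; x < width; x++)
        integ[x] = 0;
    for (y = 0; y < height; y++)
        integral_row(img + y * stride,
                     integ + y * stride,
                     integ + (y + 1) * stride,
                     width);
}
```

    Whether this removes the dependency bound with a given compiler version would still have to be checked in the software pipeline feedback.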
  • As you guessed, pl32TmpIntegPrev is actually based on pl32TmpIntegImg, because pl32TmpIntegPrev is defined as follows:
    pl32TmpIntegPrev = pl32TmpIntegImg - l32ImgWidth;

    The whole function is as follows:

    Whether I add restrict or not, it doesn't work.

    So, how should I proceed?

    Thanks for your assistance!

  • Since you show screen shots, and not the text of the source code, I cannot copy-n-paste into a source file and then build.  Please preprocess the file which contains the function CalcIntegraImg, and attach that to your next post.  Also indicate what version of the compiler you use, and show all the build options exactly as the compiler sees them.  That will allow me to build this example, and then explain what changes will avoid the loop carried dependency.

    In the meantime, please see this wiki article, especially the section on eliminating loop carried dependencies.

    Thanks and regards,

    -George

  • Hi, George.

    I will post the source code with the insert-code tool. The compiler version is V7.4.2 and the target processor is 6740. The output format of this project is set to ELF. The only build options I use for this source file are -o3, -mw, -mt and -k.

    void AdaDection::CalcIntegralImg(int32_t * restrict pl32IntegImg, int64_t * restrict pd64SquareImg, uint8_t * restrict pu8ImgData, int32_t l32ImgWidth, int32_t l32ImgHeight, CvRect tRect)
    {
    	int32_t l32Index1, l32Index2;
    
    	uint8_t * restrict pu8TmpImg;
    	int32_t * restrict pl32TmpIntegImg;
    	int64_t * restrict pd64TmpSquarImg;
    
    	//calculate integral for each pyramid layer
    	int32_t l32left = MAX(tRect.x, 0);
    	int32_t l32top = MAX(tRect.y, 0);
    	int32_t l32right = MIN(tRect.x + tRect.width, l32ImgWidth);
    	int32_t l32bottom = MIN(tRect.y + tRect.height, l32ImgHeight);
    
    	//the first line of ROI
    	pl32TmpIntegImg = pl32IntegImg + (l32top - 1) * l32ImgWidth;
    	pd64TmpSquarImg = pd64SquareImg + (l32top - 1) * l32ImgWidth;
    
    	//the first column of ROI
    	pl32TmpIntegImg = pl32IntegImg + l32left;
    	pd64TmpSquarImg = pd64SquareImg + l32left;
    	#pragma MUST_ITERATE(1,,)
    	for (l32Index1 = 0; l32Index1 < l32ImgHeight; l32Index1++)
    	{
    		pl32TmpIntegImg[0] = 0;
    		pd64TmpSquarImg[0] = 0;
    
    		pl32TmpIntegImg += l32ImgWidth;
    		pd64TmpSquarImg += l32ImgWidth;
    	}
    
    	int32_t roi_height = tRect.height;
    	int32_t roi_width = tRect.width;
    	int32_t * restrict pl32TmpIntegPrev;
    	int64_t * restrict pd64TmpSquarPrev;
    	int32_t l32Sum, l32SquareSum;
    	uint8_t gray_value;
    	uint32_t nIntegPrev;
    	uint64_t nSquarePrev;
    
    	pu8TmpImg = pu8ImgData + (l32top - 1) * l32ImgWidth + (l32left - 1);
    	pl32TmpIntegImg = pl32IntegImg + l32top * l32ImgWidth + l32left;
    	pl32TmpIntegPrev = pl32TmpIntegImg - l32ImgWidth;
    	pd64TmpSquarImg = pd64SquareImg + l32top * l32ImgWidth + l32left;
    	pd64TmpSquarPrev = pd64TmpSquarImg - l32ImgWidth;
    
    #pragma MUST_ITERATE(1,,)
    	for (l32Index1 = 0; l32Index1 < roi_height; l32Index1++)
    	{
    		l32Sum = 0;
    		l32SquareSum = 0;
    #pragma MUST_ITERATE(1,,)
    		for (l32Index2 = 0; l32Index2 < roi_width; l32Index2++)  //line 300
    		{
    			gray_value = pu8TmpImg[l32Index2];
    			nIntegPrev = pl32TmpIntegPrev[l32Index2];
    			nSquarePrev = pd64TmpSquarPrev[l32Index2];
    			
    			l32Sum += gray_value;
    			l32SquareSum += (gray_value * gray_value);
    
    			pl32TmpIntegImg[l32Index2] = nIntegPrev + l32Sum;
    			pd64TmpSquarImg[l32Index2] = nSquarePrev + l32SquareSum;
    		}
    
    		pu8TmpImg += l32ImgWidth;
    		pl32TmpIntegImg += l32ImgWidth;
    		pl32TmpIntegPrev += l32ImgWidth;
    		pd64TmpSquarImg += l32ImgWidth;
    		pd64TmpSquarPrev += l32ImgWidth;
    	}
    
    }

    From the .asm file, you will see the following:

    .compiler_opts --abi=eabi --c64p_l1d_workaround=off --endian=little --hll_source=on --long_precision_bits=32 --mem_model:code=near --mem_model:const=data --mem_model:data=far_aggregates --object_format=elf --silicon_version=6740 --symdebug:dwarf --symdebug:dwarf_version=3 
    
    ;******************************************************************************
    ;* GLOBAL FILE PARAMETERS                                                     *
    ;*                                                                            *
    ;*   Architecture      : TMS320C674x                                          *
    ;*   Optimization      : Enabled at level 3                                   *
    ;*   Optimizing for    : Speed                                                *
    ;*                       Based on options: -o3, no -ms                        *
    ;*   Endian            : Little                                               *
    ;*   Interrupt Thrshld : Disabled                                             *
    ;*   Data Access Model : Far Aggregate Data                                   *
    ;*   Pipelining        : Enabled                                              *
    ;*   Speculate Loads   : Enabled with threshold = 0                           *
    ;*   Memory Aliases    : Presume not aliases (optimistic)                     *
    ;*   Debug Info        : DWARF Debug                                          *
    ;*                                                                            *
    ;******************************************************************************
    
    ;*      Loop source line                 : 300
    ;*      Loop opening brace source line   : 301
    ;*      Loop closing brace source line   : 311
    ;*      Known Minimum Trip Count         : 1                    
    ;*      Known Max Trip Count Factor      : 1
    ;*      Loop Carried Dependency Bound(^) : 9
    ;*      Unpartitioned Resource Bound     : 3
    ;*      Partitioned Resource Bound(*)    : 3
    ;*      Resource Partition:
    ;*                                A-side   B-side
    ;*      .L units                     1        0     
    ;*      .S units                     1        0     
    ;*      .D units                     3*       2     
    ;*      .M units                     1        0     
    ;*      .X cross paths               0        1     
    ;*      .T address paths             3*       2     
    ;*      Long read paths              0        0     
    ;*      Long write paths             0        0     
    ;*      Logical  ops (.LS)           0        0     (.L or .S unit)
    ;*      Addition ops (.LSD)          4        2     (.L or .S or .D unit)
    ;*      Bound(.L .S .LS)             1        0     
    ;*      Bound(.L .S .D .LS .LSD)     3*       2     
    ;*
    ;*      Searching for software pipeline schedule at ...
    ;*         ii = 9  Schedule found with 2 iterations in parallel

  • Your source code does not build as is.  For instance, I don't have the definition of the class AdaDection, so I commented that out.  I made a few other similar guesses.  And these guesses somehow cause the problem to go away.  Please send the preprocessed file, and show all the build options exactly as the compiler sees them.  That removes all the guesswork.

    lai yi said:
    the compiler version is V7.4.2

    This is quite old, from 2012.  I recommend you update to the latest 7.4.x version of the compiler, which is presently 7.4.19.  Functionally, it is exactly the same.  But there are four years worth of bug fixes.  One of those fixes might solve your problem.  Please see this wiki article for details on how to upgrade the compiler.

    Thanks and regards,

    -George

  • Hi, George

    Sorry that my source code could not be built. Actually, you don't need the definition of the class AdaDection. I have preprocessed it again as shown below. The build options I chose for this source file, which I showed before, are -mv6740, --abi=eabi, -O3, -g, -mt, -k. Is there any build option I haven't shown you?

    void CalcIntegralImg(int32_t * restrict pl32IntegImg, int64_t * restrict pd64SquareImg, uint8_t * restrict pu8ImgData, int32_t l32ImgWidth, int32_t l32ImgHeight, CvRect tRect)
    {
    	int32_t l32Index1, l32Index2;
    
    	uint8_t * restrict pu8TmpImg;
    	int32_t * restrict pl32TmpIntegImg;
    	int64_t * restrict pd64TmpSquarImg;
    
    	//calculate integral for each pyramid layer
    	int32_t l32left = MAX(tRect.x, 0);
    	int32_t l32top = MAX(tRect.y, 0);
    
    	//the first column of ROI
    	pl32TmpIntegImg = pl32IntegImg + l32left;
    	pd64TmpSquarImg = pd64SquareImg + l32left;
    	#pragma MUST_ITERATE(1,,)
    	for (l32Index1 = 0; l32Index1 < l32ImgHeight; l32Index1++)
    	{
    		pl32TmpIntegImg[0] = 0;
    		pd64TmpSquarImg[0] = 0;
    
    		pl32TmpIntegImg += l32ImgWidth;
    		pd64TmpSquarImg += l32ImgWidth;
    	}
    
    	int32_t roi_height = tRect.height;
    	int32_t roi_width = tRect.width;
    	int32_t * restrict pl32TmpIntegPrev;
    	int64_t * restrict pd64TmpSquarPrev;
    	int32_t l32Sum, l32SquareSum;
    	uint8_t gray_value;
    	uint32_t nIntegPrev;
    	uint64_t nSquarePrev;
    
    	pu8TmpImg = pu8ImgData + (l32top - 1) * l32ImgWidth + (l32left - 1);
    	pl32TmpIntegImg = pl32IntegImg + l32top * l32ImgWidth + l32left;
    	pl32TmpIntegPrev = pl32TmpIntegImg - l32ImgWidth;
    	pd64TmpSquarImg = pd64SquareImg + l32top * l32ImgWidth + l32left;
    	pd64TmpSquarPrev = pd64TmpSquarImg - l32ImgWidth;
    
    #pragma MUST_ITERATE(1,,)
    	for (l32Index1 = 0; l32Index1 < roi_height; l32Index1++)
    	{
    		l32Sum = 0;
    		l32SquareSum = 0;
    #pragma MUST_ITERATE(1,,)
    		for (l32Index2 = 0; l32Index2 < roi_width; l32Index2++)//line 372
    		{
    			gray_value = pu8TmpImg[l32Index2];
    			nIntegPrev = pl32TmpIntegPrev[l32Index2];
    			nSquarePrev = pd64TmpSquarPrev[l32Index2];
    
    			l32Sum += gray_value;
    			l32SquareSum += (gray_value * gray_value);
    
    			pl32TmpIntegImg[l32Index2] = nIntegPrev + l32Sum;
    			pd64TmpSquarImg[l32Index2] = nSquarePrev + l32SquareSum;
    		}
    
    		pu8TmpImg += l32ImgWidth;
    		pl32TmpIntegImg += l32ImgWidth;
    		pl32TmpIntegPrev += l32ImgWidth;
    		pd64TmpSquarImg += l32ImgWidth;
    		pd64TmpSquarPrev += l32ImgWidth;
    	}
    
    }
    

    From the .asm feedback, you will see the following:

    ;*      Loop source line                 : 372
    ;*      Loop opening brace source line   : 373
    ;*      Loop closing brace source line   : 383
    ;*      Known Minimum Trip Count         : 1                    
    ;*      Known Max Trip Count Factor      : 1
    ;*      Loop Carried Dependency Bound(^) : 9
    ;*      Unpartitioned Resource Bound     : 3
    ;*      Partitioned Resource Bound(*)    : 3
    ;*      Resource Partition:
    ;*                                A-side   B-side
    ;*      .L units                     1        0     
    ;*      .S units                     1        0     
    ;*      .D units                     3*       2     
    ;*      .M units                     1        0     
    ;*      .X cross paths               0        1     
    ;*      .T address paths             3*       2     
    ;*      Long read paths              0        0     
    ;*      Long write paths             0        0     
    ;*      Logical  ops (.LS)           0        0     (.L or .S unit)
    ;*      Addition ops (.LSD)          4        2     (.L or .S or .D unit)
    ;*      Bound(.L .S .LS)             1        0     
    ;*      Bound(.L .S .D .LS .LSD)     3*       2     
    ;*
    ;*      Searching for software pipeline schedule at ...
    ;*         ii = 9  Schedule found with 2 iterations in parallel
    ;*      Done

    I have not yet tried your recommendation to upgrade the compiler, because the link you posted cannot be opened on my side. For now I only want to optimize the loop with the current compiler version. I would like to upgrade after I have optimized this function, and then compare the results between V7.4.2 and the latest version.

    So how can I reduce the Loop Carried Dependency Bound shown above?

    Thanks and Regards.

    lai yi

  • Your code still won't compile: #include <stdint.h> is missing, as are the definitions of MAX and CvRect. The following file

    #include <stdint.h>
    
    #define MAX(x,y) ((x) < (y) ? (y) : (x))
    
    typedef struct CvRect_t
    {
       int32_t x;
       int32_t y;
       int32_t width;
       int32_t height;
    } CvRect;
    
    void CalcIntegralImg(int32_t * restrict pl32IntegImg, int64_t * restrict pd64SquareImg, uint8_t * restrict pu8ImgData, int32_t l32ImgWidth, int32_t l32ImgHeight, CvRect tRect)
    {
    	int32_t l32Index1, l32Index2;
    
    	uint8_t * restrict pu8TmpImg;
    	int32_t * restrict pl32TmpIntegImg;
    	int64_t * restrict pd64TmpSquarImg;
    
    	//calculate integral for each pyramid layer
    	int32_t l32left = MAX(tRect.x, 0);
    	int32_t l32top = MAX(tRect.y, 0);
    
    	//the first column of ROI
    	pl32TmpIntegImg = pl32IntegImg + l32left;
    	pd64TmpSquarImg = pd64SquareImg + l32left;
    	#pragma MUST_ITERATE(1,,)
    	for (l32Index1 = 0; l32Index1 < l32ImgHeight; l32Index1++)
    	{
    		pl32TmpIntegImg[0] = 0;
    		pd64TmpSquarImg[0] = 0;
    
    		pl32TmpIntegImg += l32ImgWidth;
    		pd64TmpSquarImg += l32ImgWidth;
    	}
    
    	int32_t roi_height = tRect.height;
    	int32_t roi_width = tRect.width;
    	int32_t * restrict pl32TmpIntegPrev;
    	int64_t * restrict pd64TmpSquarPrev;
    	int32_t l32Sum, l32SquareSum;
    	uint8_t gray_value;
    	uint32_t nIntegPrev;
    	uint64_t nSquarePrev;
    
    	pu8TmpImg = pu8ImgData + (l32top - 1) * l32ImgWidth + (l32left - 1);
    	pl32TmpIntegImg = pl32IntegImg + l32top * l32ImgWidth + l32left;
    	pl32TmpIntegPrev = pl32TmpIntegImg - l32ImgWidth;
    	pd64TmpSquarImg = pd64SquareImg + l32top * l32ImgWidth + l32left;
    	pd64TmpSquarPrev = pd64TmpSquarImg - l32ImgWidth;
    
    #pragma MUST_ITERATE(1,,)
    	for (l32Index1 = 0; l32Index1 < roi_height; l32Index1++)
    	{
    		l32Sum = 0;
    		l32SquareSum = 0;
    #pragma MUST_ITERATE(1,,)
    		for (l32Index2 = 0; l32Index2 < roi_width; l32Index2++)//line 372
    		{
    			gray_value = pu8TmpImg[l32Index2];
    			nIntegPrev = pl32TmpIntegPrev[l32Index2];
    			nSquarePrev = pd64TmpSquarPrev[l32Index2];
    
    			l32Sum += gray_value;
    			l32SquareSum += (gray_value * gray_value);
    
    			pl32TmpIntegImg[l32Index2] = nIntegPrev + l32Sum;
    			pd64TmpSquarImg[l32Index2] = nSquarePrev + l32SquareSum;
    		}
    
    		pu8TmpImg += l32ImgWidth;
    		pl32TmpIntegImg += l32ImgWidth;
    		pl32TmpIntegPrev += l32ImgWidth;
    		pd64TmpSquarImg += l32ImgWidth;
    		pd64TmpSquarPrev += l32ImgWidth;
    	}
    
    }

    is probably the closest to what you posted. However, CGT 7.4.2 compiles this just fine. There are two pipelined loops in my output, the first has a dependency bound of 1, the second has a dependency bound of 0. There must be something you do differently, but without knowing what that is we will not be able to help you, I'm afraid.


    Another word of warning: Do not blindly add restrict to your declarations. The rules for when and where restrict may be applied (safely) are a bit tricky, and it's easy to get it wrong.

  • You won't believe how strange this is. I copied your posted code into my file, and that file really does compile with a dependency bound of 9. When I copied it into another .cpp file, the dependency bound came out as 0, like yours. Comparing the two .cpp files, the only difference I found was that the file with a bound of 9 had its text encoding set to UTF-8, while the other used the default GBK. But after I changed UTF-8 to GBK, the source still compiled with a dependency bound of 9. So I think the text file encoding is not the cause, and I will keep looking for the difference.
    In any case, I appreciate your warning.

    Thanks and Regards.
    laiyi
  • Hi, Markus,

    Regarding your warning about using restrict in declarations, I have another question.
    The earlier loop carried dependency bound problems mostly occurred in situations where two different pointer variables are involved in the computation,
    for example in the code above,

    int32_t * restrict pl32IntegImg,
    uint8_t * restrict pu8ImgData

    where pl32IntegImg is computed from pu8ImgData.

    But what if the loop carried dependency bound bottleneck comes from a single variable itself?
    For example, (*pL2AllIntMap)[(65)*(65)] is defined as a pointer to an array,
    and the following code is compiled: pL2AllIntMap[0][target] = pL2AllIntMap[0][left] + pL2AllIntMap[0][top] - pL2AllIntMap[0][left_top];

    How would you reduce the loop carried dependency bound in that case?
    Thanks and Regards.
    laiyi
  • For this case ...

    lai yi said:
    pL2AllIntMap[0][target] = pL2AllIntMap[0][left] + pL2AllIntMap[0][top] - pL2AllIntMap[0][left_top];

    The compiler can see that the addresses used are all based on pL2AllIntMap, and therefore it could be the case that the same memory location is written in loop iteration N and read in loop iteration N+1.  The safe thing to do is just live with this loop carried dependence.

    If, however, it is vitally important this loop run faster, you could consider creating your own pointer variable(s), and after carefully verifying it is safe, marking it restrict.  Just for the sake of a simple and very unlikely example ... Pretend target is always odd and left, top, and left_top are always even.  Then you could do something like ...

    int * restrict ptr_target = &pL2AllIntMap[0][target];
    for ( /* loop details here */ )
    {
        // use *ptr_target instead of pL2AllIntMap[0][target]
        ptr_target += 2;
    }

    In practice though, it is unlikely the behavior of your code is so simple that such analysis is practical.  In that case, you have to live with the loop carried dependence.  Sometimes there is nothing you can do about it.

    Thanks and regards,

    -George

  • Thanks a lot, George.

    In my code, the target index can't always be odd, and left is defined as left = target - 1, so I think left, top, and left_top can't always be even.

    Even if I could make target odd and left, top and left_top even, I still don't understand the ptr_target += 2 step. To my way of thinking, ptr_target += 2 looks like a form of loop unrolling.

    Does loop unrolling relate to the Partitioned Resource Bound rather than the loop carried dependence bound?

    Speaking of the Partitioned Resource Bound, I ran into a problem in my code that I find hard to understand:

    uint16_t *offsetAdd=arrIndixJu; //arrIndixJu is a constant one-dimensional array; offsetAdd is a pointer initialized to its first element

    I use the offsetAdd pointer as an incrementing index to load values from the array
    pl32IntegImg like this:

    int32_t al32Tmp[3];
    al32Tmp[0] = pl32IntegImg[offset +(*offsetAdd++)];
    al32Tmp[0]+= pl32IntegImg[offset +(*offsetAdd++)];
    al32Tmp[0]-= pl32IntegImg[offset +(*offsetAdd++)];
    al32Tmp[0]-= pl32IntegImg[offset +(*offsetAdd++)];

    The .asm file shows a Partitioned Resource Bound of 23, with the .D units as the bottleneck.

    So I changed the code like this:
    al32Tmp[0] = pl32IntegImg[offset + (*offsetAdd++)] +
    	 pl32IntegImg[offset + (*offsetAdd++)] -
    	pl32IntegImg[offset + (*offsetAdd++)] -
    	pl32IntegImg[offset +(*offsetAdd++)];

    The .asm file then shows a Partitioned Resource Bound of 4, so it seems I optimized it successfully. In fact, the changed code is wrong: the offsetAdd pointer seems to be mis-optimized, and I do not get the right value from the changed code.

    Why did the compiler optimize the offsetAdd pointer to a wrong address (my guess), and what explains the difference in Partitioned Resource Bound between these two ways of writing the code?

  • lai yi said:
    Does loop unrolling relate to the Partitioned Resource Bound rather than the loop carried dependence bound?

    Unrolling does not change the loop-carried dependences in the loop.  Usually, just using unrolling only helps to hide the latencies by making the loop body longer.

    Your latest example shows arrays indexed by a value you are loading from another array, which is usually a big problem for a loop because it usually creates long loop-carried dependences.  However, I can't figure out how the latest code maps to the original example in this thread, so I wonder if this is from a different example?

    It is illegal to have an auto-increment of a variable multiple times between sequence points, as you have in the example above.  Your code increments offsetAdd four separate times before the semicolon.  You should instead write this as:

    al32Tmp[0] = pl32IntegImg[offset + offsetAdd[0]] +
    	 pl32IntegImg[offset + offsetAdd[1]] -
    	pl32IntegImg[offset + offsetAdd[2]] -
    	pl32IntegImg[offset + offsetAdd[3]];
    offsetAdd += 4;
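
    As a self-contained sketch of the same idea (the function name box_sum and its parameter list are assumptions made for illustration): each offset is read through a distinct constant index, and the table pointer is advanced exactly once, in a statement of its own.

```c
#include <stdint.h>

/* Hypothetical helper: combines four integral-image entries whose positions
 * are given by four consecutive entries of an offset table.
 * *offs is advanced by 4 in a separate statement, so the pointer is modified
 * only once and the evaluation order of the four loads no longer matters. */
int32_t box_sum(const int32_t *integ, int32_t offset, const uint16_t **offs)
{
    const uint16_t *o = *offs;
    int32_t sum = integ[offset + o[0]]
                + integ[offset + o[1]]
                - integ[offset + o[2]]
                - integ[offset + o[3]];
    *offs = o + 4;   /* single, sequenced update of the table pointer */
    return sum;
}
```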
    
  • lai yi said:
    In my code, the target index can't always be odd, and left is defined as left = target - 1, so I think left, top, and left_top can't always be even.

    It is very unlikely to be practical for you to change your code such that target is odd, and the other array index variables are even.  I give that example as a way to illustrate what unusual methods you have to use to break a loop carried dependence of that kind.  These methods are so unusual, it is nearly certain that you cannot use them.  That being the case, you have to live with the loop carried dependence.

    lai yi said:
    Does loop unrolling relate to the Partitioned Resource Bound rather than the loop carried dependence bound?

    The partitioned resource bound and the loop carried dependence are independent concepts.  They are related only in that they are both characteristics of a software pipelined loop.  Otherwise, they are quite different.  A description of both these terms is given in the section titled Understanding Feedback of the C6000 Programmer's Guide.  Please note this manual is quite old (1998) and out-of-date.  Much of what it contains is no longer accurate.  But the Understanding Feedback section is still accurate enough to be useful.
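
    As a rough illustration of how the two numbers combine, using the feedback quoted earlier in this thread: the search for an iteration interval (ii) starts at the larger of the two bounds. The helper name min_ii is hypothetical, and the result is only a lower bound; the schedule actually found can have a larger ii.

```c
/* Lower bound on the iteration interval of a software-pipelined loop:
 * the larger of the loop carried dependency bound and the partitioned
 * resource bound. The compiler may still settle on a larger ii. */
int min_ii(int loop_carried_dependency_bound, int partitioned_resource_bound)
{
    return loop_carried_dependency_bound > partitioned_resource_bound
         ? loop_carried_dependency_bound
         : partitioned_resource_bound;
}
```

    With the values from the feedback above, min_ii(9, 3) is 9 and min_ii(0, 3) is 3, which matches the ii = 9 and ii = 3 schedules that were reported.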

    Thanks and regards,

    -George

  • First, sorry for pressing the wrong button on your post.

    Archaeologist said:

    However, I can't figure out how the latest code maps to the original example in this thread, so I wonder if this is from a different example?

    It is indeed from a different example.

    Archaeologist said:
    It is illegal to have an auto-increment of a variable multiple times between sequence points

    Auto-incrementing the offsetAdd pointer follows the syntax of C. Why is it illegal?

    Archaeologist said:

    You should instead write this as:

    al32Tmp[0] = pl32IntegImg[offset + offsetAdd[0]] +
    	 pl32IntegImg[offset + offsetAdd[1]] -
    	pl32IntegImg[offset + offsetAdd[2]] -
    	pl32IntegImg[offset + offsetAdd[3]];
    offsetAdd += 4;

    Following your advice, the .asm file still shows a Partitioned Resource Bound(*) of 23, which is a large bottleneck for my optimization. Do you have any better suggestion?

    Best Regards!

    laiyi

  • lai yi said:
    Archaeologist
    It is illegal to have an auto-increment of a variable multiple times between sequence points

    Auto-incrementing the offsetAdd pointer follows the syntax of C. Why is it illegal?


    It is syntactically correct, yet its behavior is undefined: C does not define the result when the same object is modified more than once between two sequence points.

  • lai yi said:
    Following your advice, the .asm file still shows a Partitioned Resource Bound(*) of 23, which is a large bottleneck for my optimization. Do you have any better suggestion?

    I'd need to see a complete, compilable loop that demonstrates the problem to offer any advice.

    However, it may very well be that there is no way to improve the bound. As I noted, if your loop has an array index computed as a load from another array, usually you get long loop-carried dependences.