My bilinear interpolation algorithm is written in C++. The main code is as follows:
//*************************************************
The compiler output is as follows:
The relevant assembly is as follows:
Note that the input parameters px and py are already declared with the restrict keyword. Line 490 appears to be the main cause of the dependency. Adding the restrict keyword to the variables p0, p1, p2 and p3 did not help. The main problem is that the pointwise multiplications on lines 490 and 491 cannot be executed in parallel. I also tried the _dotpu4 intrinsic, without success.
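For illustration only, here is a minimal sketch of a bilinear interpolation loop in which no iteration depends on the previous one. It is not the original routine: the function name `bilinear_q8`, the Q8 fixed-point coordinate format, the 8-bit image layout and the `__restrict` spelling are all assumptions (use whichever restrict spelling your compiler accepts). Every pointer is restrict-qualified, the four neighbours are loaded into locals, and the only memory write per iteration is the final store, so the compiler does not have to assume that a store to `dst` feeds a later load.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch, not the original code. Assumptions:
//  - src is an 8-bit image with row stride `stride`
//  - px/py hold Q8 fixed-point coordinates (integer part << 8 | fraction)
//  - every sample has a valid right and lower neighbour
void bilinear_q8(const std::uint8_t* __restrict src, int stride,
                 const std::int32_t* __restrict px,
                 const std::int32_t* __restrict py,
                 std::uint8_t* __restrict dst, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        // Split each coordinate into integer position and 8-bit fraction.
        const std::int32_t x = px[i] >> 8;
        const std::int32_t y = py[i] >> 8;
        const std::uint32_t fx = static_cast<std::uint32_t>(px[i] & 0xFF);
        const std::uint32_t fy = static_cast<std::uint32_t>(py[i] & 0xFF);

        // Load the four neighbours into locals before any arithmetic.
        const std::uint8_t* row0 = src + y * stride + x;
        const std::uint8_t* row1 = row0 + stride;
        const std::uint32_t p0 = row0[0], p1 = row0[1];
        const std::uint32_t p2 = row1[0], p3 = row1[1];

        // The four weights always sum to 256 * 256 = 65536.
        const std::uint32_t w0 = (256 - fx) * (256 - fy);
        const std::uint32_t w1 = fx * (256 - fy);
        const std::uint32_t w2 = (256 - fx) * fy;
        const std::uint32_t w3 = fx * fy;

        // Four independent multiplies, one sum, one rounded store.
        const std::uint32_t acc = p0 * w0 + p1 * w1 + p2 * w2 + p3 * w3;
        dst[i] = static_cast<std::uint8_t>((acc + (1u << 15)) >> 16);
    }
}
```

Whether a loop of this shape lowers the reported dependency bound depends on the compiler and the target, so treat it as a starting point for comparison against the assembly profile rather than a fix.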
So, how can I reduce the Loop Carried Dependency Bound based on the assembly profile?