Hi All,
I would like to ask some questions on compiling code for VLIW devices.
Is it guaranteed that the logical/execution order the programmer intends in C will still be preserved in
- the binary instruction code
- in real execution in CPU
This is essentially a question about how interdependency is preserved, and I would sincerely like to hear any comments on it. Consider an extremely simple example:
a=a+1
b=a
The intended order is that variable a should be incremented first, and then its value assigned to variable b. However, many DSPs have more than one ALU. As a concrete example, the DM6437 chip has two groups of functional units:
L1, S1, M1, D1
L2, S2, M2, D2
Put together, we have eight units. How can we avoid either of the wrong execution orders:
1. b=a and a=a+1 executing at the same time
2. b=a executing first, then a=a+1
A straightforward yet inefficient safeguard is to put many NOPs after (or even before) a=a+1, ensuring that when a=a+1 is fetched it is fetched along with NOPs filling the remaining slots of the fetch packet, so that only one of the eight units picks up and executes the a=a+1 instruction while the other seven do nothing because of the NOPs.
Obviously, this extremely inefficient safety measure reduces a VLIW CPU to essentially a single-instruction-per-cycle machine, at least for this example.
At the other extreme, many DSP algorithms, much like GPU workloads, are inherently parallel. Consider computations like:
y=c1*x1+c2*x2+ ... +cn*xn
One might use carefully written assembly code to assign the individual multiplications to different units and perform the additions afterwards. In fact, it is also clear that the addition itself, which in its logical and visual structure is a tree, can be done in parallel in the same divide-and-conquer manner as merge sort.
There is then the natural, unavoidable question: who will be responsible for this delicate work?
We have the compiler, which is after all a human construction that implements fixed algorithmic logic (for any particular compiler version) to perform certain types of optimization. Suppose the compiler encounters an inexperienced DSP software designer who knows little about the hardware architecture and simply writes, as he would on a Windows platform, without much thought:

int i;
for (i = 1; i < n; i++)
    y += c[i] * x[i];
This will then be the source code provided to the compiler. Will the compiler be "smart" enough to discover the inherent parallelism within this for loop and generate optimized VLIW instructions accordingly, making full use of the multiple units?
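For what it's worth, the programmer can also meet the compiler halfway with hints. A sketch, assuming TI's C6000 toolchain (other compilers typically just warn about the unknown pragma and ignore it):

```c
/* 'restrict' promises that c, x and y never alias, and TI's
   MUST_ITERATE pragma promises a trip count of at least 4 and a
   multiple of 4; both remove uncertainties that can otherwise
   block the compiler from software-pipelining the loop across
   the functional units. The pragma name and syntax assume the
   TI C6000 compiler. */
void weighted_sum(const int * restrict c, const int * restrict x,
                  int * restrict y, int n)
{
    int i;
    #pragma MUST_ITERATE(4, , 4)
    for (i = 0; i < n; i++)
        *y += c[i] * x[i];
}
```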
Obviously, a very high requirement is then put on the construction of the compiler. But even if the compiler is built with state-of-the-art optimization techniques, what happens if the parallelism, relatively apparent in the previous example, is obscured by some additional "inserted" code? Like this:

int i;
int b;
for (i = 1; i < n; i++)
{
    if (i == 1)
        b = b + 1;
    else if (i == 2)
        b = b + 2;
    else if (i == 3)
        b = b + 3;
    y += c[i] * x[i];
}
Will the conditional statements make it more difficult for the compiler to "see" the parallelism of the multiplication/addition that follows?
And will the compiler really be smart enough to detect the independence of the conditional statements from the multiplication/addition, and, in its intermediate steps within the compilation process, split the above lines into two blocks:
for (i=1;i<n;i++)
{
if (i==1)
b=b+1;
else if (i==2)
b=b+2;
else if (i==3)
b=b+3;
}
for (i=1;i<n;i++)
{
y+=c[i]*x[i];
}
After doing this, it could then analyze the second block again and generate parallelized code for it.
This is merely some wishful thinking on my part. In reality, how much can a compiler, especially TI's own, achieve for VLIW chip targets?
The other approach is to seek optimal performance from carefully optimized library code instead. These library functions would make certain assumptions about the input, for example, that the length of the vectors be a power of 2, and of course there would be nothing like the additional conditional logic within the foregoing for loop. Under such assumptions, TI engineers could write highly optimized library routines that ensure both safety (preserving the logical order) and maximum efficiency.
I haven't used any TI library such as IQmath yet, but is this the purpose for which they are provided?
Please allow me to summarize the key questions:
1. Is the dependency order expressed in C guaranteed to be preserved in the generated VLIW binary and in actual execution?
2. Can the compiler discover the parallelism in a simple for loop and spread the work across the multiple functional units?
3. Do interleaved conditionals obscure that parallelism, and can the compiler split such a loop into independent blocks on its own?
4. Are TI's optimized libraries the intended way to obtain both correctness and maximum efficiency?
Regards,
Zheng