Hi All,
I would like to ask some questions on compiling code for VLIW devices.
Is it guaranteed that the logical/execution order the programmer intends in C will still be preserved in
- the binary instruction code
- in real execution in CPU
This is essentially a question about how interdependency is preserved, and I would sincerely like to hear any comments on it. Consider an extremely simple example:
a=a+1
b=a
The intended order is that variable a should be incremented first, and then its value assigned to variable b. However, many DSPs have more than one ALU. As a concrete example, the DM6437 chip has two groups of functional units:
L1, S1, M1, D1
L2, S2, M2, D2
Put together, we have eight units. How can we avoid either of the wrong execution orders:
1. b=a and a=a+1 executing at the same time
2. b=a executing first, then a=a+1
A straightforward yet inefficient safeguard is to put many NOPs after (or even before) a=a+1, ensuring that when a=a+1 is fetched it is fetched along with NOPs filling the remaining slots of the fetch packet, so that only one of the eight units picks up and executes the a=a+1 instruction while the other seven do nothing because of the NOPs.
Obviously, this extremely inefficient safety measure reduces a VLIW CPU to essentially a single-instruction-per-cycle machine, at least for this example.
At the other extreme, many DSP algorithms, much like GPU workloads, are inherently parallel. Consider computations like:
y=c1*x1+c2*x2+ ... +cn*xn
One might use carefully written assembly code to assign the individual multiplications to different units and perform the additions afterwards. In fact, it is also clear that the addition itself, which in its logical and visual structure is a tree, can be done in parallel in the same divide-and-conquer manner as merge sort.
There is then the natural, unavoidable question: who will be responsible for this delicate work?
We have the compiler, which is after all a human construction that implements fixed algorithmic logic (for any particular compiler version) to perform certain types of optimization. Suppose the compiler encounters an inexperienced DSP software designer who knows little about the hardware architecture and simply writes, as he would on a Windows platform, without much thought:

int i;
for (i = 1; i < n; i++)
    y += c[i] * x[i];
This will then be the source code provided to the compiler. Will the compiler be "smart" enough to discover the inherent parallelism within this for loop and generate optimized VLIW instructions accordingly, making full use of the multiple units?
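For what it's worth, the programmer can also meet the compiler halfway with hints. A sketch, assuming TI's C6000 toolchain (other compilers typically just warn about the unknown pragma and ignore it):

```c
/* 'restrict' promises that c, x and y never alias, and TI's
   MUST_ITERATE pragma promises a trip count of at least 4 and a
   multiple of 4; both remove uncertainties that can otherwise
   block the compiler from software-pipelining the loop across
   the functional units. The pragma name and syntax assume the
   TI C6000 compiler. */
void weighted_sum(const int * restrict c, const int * restrict x,
                  int * restrict y, int n)
{
    int i;
    #pragma MUST_ITERATE(4, , 4)
    for (i = 0; i < n; i++)
        *y += c[i] * x[i];
}
```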
Obviously, a very high requirement is then put on the construction of the compiler. But even if the compiler is built with state-of-the-art optimization techniques, what happens if the parallelism, relatively apparent in the previous example, is obscured by some additional "inserted" code? Like this:

int i;
int b;
for (i = 1; i < n; i++)
{
    if (i == 1)
        b = b + 1;
    else if (i == 2)
        b = b + 2;
    else if (i == 3)
        b = b + 3;
    y += c[i] * x[i];
}
Will the conditional statements make it more difficult for the compiler to "see" the parallelism of the multiplication/addition that follows?
And will the compiler really be smart enough to detect the independence of the conditional statements from the multiplication/addition, and, in its intermediate steps within the compilation process, split the above lines into two blocks:
for (i=1;i<n;i++)
{
if (i==1)
b=b+1;
else if (i==2)
b=b+2;
else if (i==3)
b=b+3;
}
for (i=1;i<n;i++)
{
y+=c[i]*x[i];
}
After doing this, it could then analyze the second block again and generate parallelized code for it.
This is merely some wishful thinking on my part. In reality, how much can a compiler, especially TI's own, achieve for VLIW chip targets?
The other approach is to seek optimal performance from carefully optimized library code instead. These library functions would make certain assumptions about the input, for example, that the length of the vectors be a power of 2, and of course there would be nothing like the additional conditional logic within the foregoing for loop. Under such assumptions, TI engineers could write highly optimized library routines that ensure both safety (preserving the logical order) and maximum efficiency.
I haven't used any TI library such as IQmath yet, but is this the purpose for which they are provided?
Please allow me to summarize the key questions:
1. Is the dependency order expressed in C guaranteed to be preserved in the generated VLIW binary and in actual execution?
2. Can the compiler discover the parallelism in a simple for loop and spread the work across the multiple functional units?
3. Do interleaved conditionals obscure that parallelism, and can the compiler split such a loop into independent blocks on its own?
4. Are TI's optimized libraries the intended way to obtain both correctness and maximum efficiency?
Regards,
Zheng