I've encountered what appears to be a scheduling problem leading to a missed optimization. According to the C674x CPU reference guide (sprufe8b), section 2.4, "two units on a side may read the same cross path source simultaneously". Section 3.8.3 confirms this: "Up to two units (.S, .L, .D, or .M unit) per data path, per execute packet, can read a source operand from its opposite register file via the cross paths (1X and 2X) provided that each unit is reading the same operand."
However, when presented with a clear opportunity to share a cross path this way, neither the compiler nor the linear assembly optimizer seems inclined to take it (tested with v7.4.8 and v8.0.0b4). For example, taking just the first two instructions of the example given at the top of page 79 of the reference guide:
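To make the quoted rule concrete, here is a minimal sketch in C of a checker for a single execute packet. The `Insn` struct and `cross_path_ok` function are hypothetical names invented for illustration, not part of any TI tool, and the model covers only the cross-path rule quoted above (no unit assignment, no other resource limits).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one instruction in an execute packet. */
typedef struct {
    char side;         /* 'A' or 'B': side of the executing unit */
    const char *xsrc;  /* register read via the cross path, or NULL */
} Insn;

/* Returns true if the packet obeys the quoted rule: per side, at most
 * two units use the cross path, and they all read the same register. */
bool cross_path_ok(const Insn *pkt, size_t n)
{
    const char *seen[2] = { NULL, NULL };  /* cross-path operand per side */
    int count[2] = { 0, 0 };               /* cross-path readers per side */

    for (size_t i = 0; i < n; i++) {
        if (pkt[i].xsrc == NULL)
            continue;                      /* no cross-path read here */
        assert(pkt[i].side == 'A' || pkt[i].side == 'B');
        int s = (pkt[i].side == 'B');      /* 0 = 1X users, 1 = 2X users */
        if (seen[s] != NULL && strcmp(seen[s], pkt[i].xsrc) != 0)
            return false;                  /* two different operands on one cross path */
        seen[s] = pkt[i].xsrc;
        if (++count[s] > 2)
            return false;                  /* more than two readers on one side */
    }
    return true;
}
```

Under this model, a packet with two A-side instructions that both read B1 over 1X passes, while the same packet reading B1 and B2 does not.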
    ADD     A0, B1, A1
    SUB     A2, B1, A2
The assembly optimizer schedules them sequentially on .L1X rather than in parallel. The same holds for the other combinations I've tried that share a common cross-path source register. An example in plain C:
    extern void foo( int a4, int b4, int a6, int b6, int a8 );

    void bar( int a4, int b4, int a6, int b6, int a8, int *b8 )
    {
        int x = *b8;
        foo( a4 + x, b4 + x, a6 + x, b6 + x, a8 + x );
    }
Here I'm using the memory load to force x to be available relatively late and ensure the compiler has good reason to parallelize the adds (rather than having plenty of time during the delay slots of the call). The result for both compilers is:
            LDW     .D2T2   *B8,B5
            NOP             1
            CALLRET .S1     foo
            NOP             2
            ADD     .L1X    A8,B5,A8
            ADD     .L1X    A4,B5,A4
            ADD     .L2     B4,B5,B4
    ||      ADD     .S2     B6,B5,B6
    ||      ADD     .L1X    A6,B5,A6
The problem is again evident: the three A-side adds all read B5 over the 1X cross path, yet each is issued on .L1X in its own cycle, and no two of them are ever paired. Also, given that the variable x is needed more heavily on the A-side than on the B-side, allocating it in B5 was perhaps not the best decision. Of course, while that's maybe easy to see in this particular case, I can imagine that register allocation algorithms for the c6x architecture are even trickier than they already are for "normal" CPUs.
A question related to this: when two units read the same cross-path, does this count as one or two reads against the source operand on the other side, with regard to the "max 4 reads per register" restriction? Intuitively I'd expect it to count as one, especially since the existence of the "cross path stall" suggests the cross path is registered, but the documentation doesn't seem to make any clear statement on this point (and an attempt to infer it from the compiler's behaviour led to the discovery of the problem stated above).
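To show where the two interpretations would actually differ, here is a small sketch in C that counts reads of one register in a single execute packet both ways. The `Read` struct and `count_reads` function are hypothetical, made up for this illustration, and the sketch does not claim to know which interpretation the hardware uses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one source-operand read in an execute packet. */
typedef struct {
    const char *src;  /* register being read */
    bool via_cross;   /* true if the read goes through a cross path */
} Read;

/* Counts reads of `reg` in one execute packet.
 * merge_cross == false: every reading unit counts separately.
 * merge_cross == true:  all cross-path reads of `reg` count as a single
 *                       read (they necessarily share one cross path). */
int count_reads(const Read *r, size_t n, const char *reg, bool merge_cross)
{
    int direct = 0, cross = 0;

    for (size_t i = 0; i < n; i++) {
        if (strcmp(r[i].src, reg) != 0)
            continue;                /* reads some other register */
        if (r[i].via_cross)
            cross++;
        else
            direct++;
    }
    if (merge_cross && cross > 1)
        cross = 1;                   /* shared cross path counted once */
    return direct + cross;
}
```

For instance, a packet in which .L2, .S2 and .D2 read B5 directly while .L1X and .S1X read it over 1X gives 5 under the first interpretation and 4 under the second, so a packet of that shape is one where the "max 4 reads per register" restriction would decide the matter.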