Failure to schedule cross-path instructions in parallel

I've encountered what appears to be a scheduling problem leading to a missed optimization.  According to the C674x CPU reference guide (sprufe8b) section 2.4, "two units on a side may read the same cross path source simultaneously".  Section 3.8.3 confirms this: "Up to two units (.S, .L, .D, or .M unit) per data path, per execute packet, can read a source operand from its opposite register file via the cross paths (1X and 2X) provided that each unit is reading the same operand."

However, when presented with a clear opportunity to do so, neither the compiler nor the linear assembly optimizer seems inclined to take it (tested with v7.4.8 and v8.0.0b4).  For example, taking just the first two instructions of the example given at the top of page 79 of the reference guide:

        ADD      A0, B1,  A1
        SUB      A2, B1,  A2

The assembly optimizer schedules them sequentially on L1X rather than in parallel.  The same holds for other combinations I've tried with a common cross-path source register.  An example in plain C:

extern void foo( int a4, int b4, int a6, int b6, int a8 );

void bar( int a4, int b4, int a6, int b6, int a8, int *b8 ) {
	int x = *b8;
	foo( a4 + x, b4 + x, a6 + x, b6 + x, a8 + x );
}

Here I'm using the memory load to force x to be available relatively late and ensure the compiler has good reason to parallelize the adds (rather than having plenty of time during the delay slots of the call).  The result for both compilers is:

           LDW     .D2T2   *B8,B5
           NOP             1
           CALLRET .S1     foo
           NOP             2
           ADD     .L1X    A8,B5,A8
           ADD     .L1X    A4,B5,A4

           ADD     .L2     B4,B5,B4
||         ADD     .S2     B6,B5,B6
||         ADD     .L1X    A6,B5,A6

The problem is again evident.  Also, given that the variable x is needed more heavily on the A-side than on the B-side, allocating it in B5 was perhaps not the best decision; but while that's easy to see in this particular case, I can imagine that register allocation for the c6x architecture is even trickier than it already is for "normal" CPUs.

 

A question related to this: when two units read the same cross-path, does this count as one or two reads against the source operand on the other side, with regard to the "max 4 reads per register" restriction?  Intuitively I'd expect it to count as one, especially since the existence of the "cross path stall" suggests the cross path is registered, but the documentation doesn't seem to make any clear statement on this point (and an attempt to infer it from the compiler's behaviour led to the discovery of the problem stated above).

  • Matthijs van Duin said:
    I've encountered what appears to be a scheduling problem leading to missed optimization.

    Thank you for letting us know about this performance issue.  I filed SDSCM00051071 in the SDOWP system to have this investigated.  Feel free to follow it with the SDOWP link below in my signature.  

    Matthijs van Duin said:
    A question related to this: when two units read the same cross-path, does this count as one or two reads against the source operand on the other side, with regard to the "max 4 reads per register" restriction?

    I'll make sure some relevant experts see this question.  However, you might get a faster response if you start a new thread in the C6600 multicore forum.

    Thanks and regards,

    -George

  • Matthijs,

    I can confirm that the compiler does not take advantage of the device's capability for more than one instruction on the same side to read the same cross path in the same execute packet.

    Regards,

    -Todd

  • George Mock said:
    A question related to this: when two units read the same cross-path, does this count as one or two reads against the source operand on the other side, with regard to the "max 4 reads per register" restriction?

    I'll make sure some relevant experts see this question.

    Appreciated!

    George Mock said:
    However, you might get a faster response if you start a new thread in the C6600 multicore forum.

    While I can imagine that's the place where DSP-core experts are most likely to hang out, it doesn't seem appropriate to me to ask a question about the C64x+/C674x core [1] there.  Then again, the problem is that there doesn't seem to be a genuinely appropriate forum for questions like these: the C674x forum is about the device (i.e. the omap-L13x) and not the DSP core.  I wouldn't have much hope of getting a useful answer about something as obscure as DSP data path resource constraints there (let alone in the forum of the device I'm actually using).  It's mainly compiler developers who have to deal with them, and programmers with tendencies towards obsessive optimization.  (I definitely fall in the latter category, but I'm also occasionally exploring the option of doing dynamic code generation, which is how I stumble over questions like these.)

    [1] I think I can safely lump those together architecturally.

  • Matthijs van Duin said:
    when two units read the same cross-path, does this count as one or two reads against the source operand on the other side, with regard to the "max 4 reads per register" restriction?

    I consulted with the relevant experts.  The answer is no.

    Thanks and regards,

    -George

  • George Mock said:

    when two units read the same cross-path, does this count as one or two reads against the source operand on the other side, with regard to the "max 4 reads per register" restriction?

    I consulted with the relevant experts.  The answer is no.


    "No"? As in, the cross-path reads do not count toward the "max 4 reads per register" restriction at all, only same-side reads do?

    In that case I have an optimizer regression to report between 7.4.8 and 8.0.0b4.  Where the former puts

            ADD     B4, A8,  B4
            ADD     A4, A8,  A4
            ADD     A6, A8,  A6
            ADD     A8, A8,  A8

    inside a single execute packet, the v8 compiler splits it into two.  I had assumed the v7 behaviour was a bug (five reads of A8 in a single cycle, one of them via the cross path) that had already been fixed in v8, hence no need to report it.  If, however, my interpretation of the expert's answer is correct and the actual limitation on the number of reads of one register is "four from the same side + two from the opposite side", then the v7 compiler was justified in scheduling all four instructions in parallel and v8 splits them unnecessarily.

    Addendum: if one of the source operands of the last instruction is changed (e.g. to A7), then v8 does schedule all four instructions in parallel just like the v7 compiler, which shows that the split is indeed due to the number of reads of A8.

  • Matthijs van Duin said:
    the cross-path reads do not count toward the "max 4 reads per register" restriction at all

    That is correct.

    Matthijs van Duin said:
    In that case I have an optimizer regression to report between 7.4.8 and 8.0.0b4

    I am unable to get 7.4.8 to put all 4 of those instructions in parallel.  Please send me a test case.  Be sure to include the exact build options you use.

    Thanks and regards,

    -George

  • George Mock said:
    I am unable to get 7.4.8 to put all 4 of those instructions in parallel.  Please send me a test case.  Be sure to include the exact build options you use.

    I used:

    cl6x -mv6740 --abi=eabi  -O3 --symdebug:none -n 1881.test.sa

    and got 0312.test.asm as output

  • Note that while I select --abi=eabi out of habit and add --symdebug:none when generating assembly listings to avoid cluttering the output, neither is of importance here.  Using -mv6740 with -o1 or higher is sufficient.  Oddly, if I use -mv64p instead, it does split into two packets.  I had the impression that the c64x+ and c674x were identical when ignoring floating-point functionality; am I wrong?

  • Thank you for submitting the test case.  I can reproduce your results.  All four ADD instructions are in parallel with 7.4.8, but not with 8.0.0b4.  I filed SDSCM00051092 in the SDOWP system to have this investigated.  Feel free to follow it with the SDOWP link below in my signature.

    Matthijs van Duin said:
    I had the impression that the c64x+ and c674x were identical when ignoring floating-point functionality

    That's reasonable.  I don't know why -mv6740 makes a difference in this case.

    Thanks and regards,

    -George

  • Just an informational update about compiling linear assembly source with the v8.0 compiler ...

    Note that the v7.4.8 compiler does parallelize all four ADDs in the .sa file (provided in SDSCM00051092) if --symdebug:none is selected, but the v8.0 compiler does not. This is because there has been a slight shift in functionality in the v8.0 compiler. If the optimization level is -o1 or higher, then the compiler will attempt to parallelize straight-line linear assembly source (independent of whether --symdebug:none is selected or not).

    Conceptually, whether or not the compiler attempts to parallelize linear assembly source code is now controlled by the level of optimization (not by turning off debug). By default, the v8.0 compiler will assume an optimization level of '-o0' (no optimization). To get the compiler to attempt parallelization of linear assembly source, you must specify an optimization level of -o1 or higher. Note that debug information will be included in the output .asm file and will not interfere with the parallelization.
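    Concretely (command lines assumed from the options quoted earlier in this thread, with "test.sa" standing in for the file attached to SDSCM00051092), the v8.0 behaviour comes down to:

```
# v8.0: parallelization of straight-line linear assembly follows the
# optimization level, not the debug setting
cl6x -mv6740 --abi=eabi -o1 test.sa    # attempts to parallelize
cl6x -mv6740 --abi=eabi     test.sa    # default is -o0: no parallelization
# --symdebug:none is no longer needed for this; debug info in the .asm
# output does not interfere with parallelization
```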

    One additional note ... for the test case provided in SDSCM00051092, even with -o1 turned on, the v8.0 compiler does not parallelize quite as efficiently as the v7.4.8 compiler. The v8.0 compiler puts 3 of the 4 ADDs in parallel (whereas v7.4.8 was able to put all 4 ADDs in parallel).  I am therefore leaving SDSCM00051092 open as a performance issue for the v8.0 compiler vs. the v7.4.8 compiler.

    Todd Snider

    C6000 Code Generation Tools

    Texas Instruments Incorporated

  • Todd,

    I'm not sure what you're saying here.  The behaviour I've observed is that the 7.4.8 compiler schedules all four ADDs in parallel when using -mv6740 and optimization enabled:

               ADD     .L1     A8,A8,A8          ; |5|
    ||         ADD     .L2X    B4,A8,B4          ; |2|
    ||         ADD     .S1     A4,A8,A4          ; |3|
    ||         ADD     .D1     A6,A8,A6          ; |4|

    It does so regardless of optimization level (-o0 suffices) and regardless of whether I use --symdebug:none or not.  (It also continues to do so with --symdebug:dwarf --optimize_with_debug.)  As I mentioned, I just added --symdebug:none to reduce clutter when reading assembly output.

    The 7.4.8 compiler with -mv6400+ produces two execute packets:

               ADD     .L1     A6,A8,A6          ; |4|
               ADD     .L1     A8,A8,A8          ; |5|
    ||         ADD     .L2X    B4,A8,B4          ; |2|
    ||         ADD     .S1     A4,A8,A4          ; |3|

    The 8.0.0 compiler produces exactly this same sequence, but for both -mv6400+ and -mv6740.  In none of the cases does the optimization level appear to have any influence (provided it isn't off), nor the symdebug level (provided --optimize_with_debug is included if full symdebug is enabled).

    Edit: to clarify, since I'm targeting a c674x I initially noticed this as a compiler regression.  Only later did I notice that 7.4.8 produces the same behaviour when compiling for c64x+ (to my surprise, since I did not expect those targets to be treated differently for integer code), showing the issue is in fact older but has now also spread to -mv6740.

  • Hi Matthijs,

    In the SDSCM00051092 report, the .sa file provided is compiled without optimization and with --symdebug:none and generates different results on v8.0 vs. v7.4.8. My previous note on this thread was in reference to that particular situation noting that if optimization is turned on, the v8.0 compiler will attempt to parallelize the 4 ADDs.

    I did verify that there is a difference in compiler behavior between v7.4.8 and v8.0.0 when the .sa file is compiled with -mv6740 on both compilers (v7.4.8 successfully parallelizes all 4 ADDs whereas v8.0 only gets 3 out of 4). This is a remaining performance issue (degradation when moving from v7.4.8 to v8.0) that is still to be addressed in v8.0.x.

    To this we can add your observation that even v7.4.8 doesn't get the 4 ADDs in parallel when -mv64+ is used even though it should. I will update SDSCM00051092 accordingly.

    Thanks and Regards,

    Todd

  • Ah, I see now that SDSCM00051092 links to this thread in general rather than more specifically to the post that led to its creation (which did include build options with optimization enabled).  Given that this thread led to two tickets being created, perhaps it should be a bit more specific.  It is possible the two issues are related, though they are not obviously so (the former concerns the compiler's unwillingness to parallelize two reads from the cross path, the latter its unwillingness to parallelize a cross-path read with four same-side reads).

    In any case, I see now what you mean: 7.4.8 produces optimized output even with -ooff, while 8.0.0 doesn't.  It had never occurred to me to try without optimization: it would seem a bit strange to report a failure to optimize properly when the optimizer is off ;-)  This change is definitely not what I was referring to in my regression report; I'm actually surprised by 7.4.8's behaviour here.  But some of my earlier posts didn't include explicit build options, so apologies if that caused any confusion.