
why is the C-compiler version 4.3.1 producing such bad code?



OK, I'm really hoping someone can see why the TI MSP430 C compiler is producing such terrible code, even with full optimizations and speed tradeoffs turned all the way up.

The function is this:

inline void msp430_softuart_tx(uint8_t txc) {
    softuart_tx_process_buffer[softuart_tx_process_buffer_write_ptr++] = txc;
    softuart_tx_process_buffer_write_ptr = softuart_tx_process_buffer_write_ptr % MSP430_SOFTUART_TX_PROCESS_BUFFER_SIZE;
}

and MSP430_SOFTUART_TX_PROCESS_BUFFER_SIZE is 128. When this is called with msp430_softuart_tx(0x48), I get the bizarre:

MOV.B   &softuart_tx_process_buffer_write_ptr,R15
MOV.B   #0x0048,0x0200(R15)
MOV.W   #1,R15
ADD.B   &softuart_tx_process_buffer_write_ptr,R15
AND.B   #0x007f,R15
MOV.B   R15,&softuart_tx_process_buffer_write_ptr

???? As far as I can see, the MOV.W #1,R15 and ADD.B lines are entirely pointless. It could've just done:

MOV.B   &softuart_tx_process_buffer_write_ptr,R15
MOV.B   #0x0048,0x0200(R15)
INC.B   R15
AND.B   #0x007F, R15
MOV.B   R15,&softuart_tx_process_buffer_write_ptr

which is both 2 words shorter and 3 cycles faster.
Is there something I'm missing here? I would love to just write the code in assembly straight off, but I've never been able to find a way to make the TI C-compiler inline an assembly function (removing the call/ret and optimizing the parameter load).

The code above just seems so obviously bad that I'm hoping I'm missing something...
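
(For context, the only mechanism I know of in the TI compiler is the bare asm() statement, which just pastes a literal line into the output and can't reference C variables or registers at all - roughly:

asm(" NOP");   /* emitted verbatim; no way to bind C variables to operands */

so it doesn't help for something like this.)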

  • Is softuart_tx_process_buffer_write_ptr declared as volatile?

  • Nope. They're just

    uint8_t softuart_tx_process_buffer[MSP430_SOFTUART_TX_PROCESS_BUFFER_SIZE];
    uint8_t softuart_tx_process_buffer_write_ptr;
    uint8_t softuart_tx_process_buffer_read_ptr;

  • Patrick Allison1 said:
    and MSP430_SOFTUART_TX_PROCESS_BUFFER_SIZE is 128. When this is called with msp430_softuart_tx(0x48), I get the bizarre:

    Using the TI MSP430 compiler v4.3.5 with Optimization Level "4 (Whole Program Optimizations)" and Speed vs. size trade-offs "5 speed", I can reproduce the problem: the following in-lined assembler is generated:

    	.dwpsn	file "../main.c",line 10,column 1,is_stmt,isa 0
            MOV.B     &softuart_tx_process_buffer_write_ptr+0,r15 ; [] |10| 
            MOV.B     #72,softuart_tx_process_buffer+0(r15) ; [] |10| 
            MOV.W     #1,r15                ; [] |10| 
            ADD.B     &softuart_tx_process_buffer_write_ptr+0,r15 ; [] |10| 
            AND.B     #127,r15              ; [] |10| 
            MOV.B     r15,&softuart_tx_process_buffer_write_ptr+0 ; [] |10| 
    

    Whereas if I use the MSP430 GNU v4.9.1 compiler, the following shorter in-lined assembler is generated:

    	.loc 1 10 0
    	MOV.B	&softuart_tx_process_buffer_write_ptr, R12
    	MOV.B	#72, softuart_tx_process_buffer(R12)
    	ADD.B	#1, R12
    	.loc 1 11 0
    	AND.B	#127, R12
    	MOV.B	R12, &softuart_tx_process_buffer_write_ptr
    

    I don't know the internals of the TI MSP430 compiler to understand why it missed the optimization, but you will probably get a faster response by posting on the TI C/C++ Compiler forum.

  • Yeah, I'm in the process of trying out GCC, but sadly a good portion of the project is in assembly and the GCC assembly syntax isn't identical, so it's not trivial.

    I asked here first to make sure I wasn't missing something stupid with the MSP430, but given that gcc produces the same code I thought it should, I'm guessing the answer is 'no'.

    I'll ask over there as well.

  • You should note that MOV #1, R15 or ADD.B #1, R15 is a single-word instruction that uses the constant generator, so it takes only one word and one clock cycle.
    Actually, INC x is emulated as ADD #1, x, and the #1 is really supplied by the constant generator (R3).
    Apparently the compiler has fused the post-increment with the clipping, eliminating one write, but it didn’t detect that the read is also redundant.
    Probably the post-increment, which would normally be a simple ADD #1, &softuart_tx_process_buffer_write_ptr, was fused with the following clipping, but the optimizer then didn’t look backwards at the earlier load of the pointer, because its earlier analysis had concluded that the load and the (original) post-increment couldn’t be turned into smaller or faster code.
    Probably the compiler doesn’t re-run its optimization passes when a later optimization would allow further improvement of code that couldn’t be optimized on the first pass.

    It’s easy to forget how difficult it is to put complex analysis into compiler code. That’s a field where the human brain still excels, and it’s the reason why good compilers are still so expensive. It’s amazing how well today’s compilers already optimize, all the same.

  • The second load is the costly mistake: this is an inlined function, used in loops, so 3 extra cycles per iteration costs a lot.

    I'd be happy to accept the compiler's limitations here and just do it myself in assembly, but I can't, because it's an inline function, and there's no way to inline assembly that interacts with C variables with the C compiler, and as far as I can tell there's no way to inline a pure assembly function.

    That's a big, big limitation of the TI C compiler that the GCC compiler doesn't share (since its asm() inline is much more powerful). Sadly the GCC compiler isn't nearly as good at optimizing, so there's really just no good solution.
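
    Just to illustrate what I mean by "more powerful": with GCC, something like the following rough sketch (untested; the constraints and the hard-coded 0x7F mask are written from memory) binds the operands straight to the C variables, so the function can stay inline with no call/ret overhead:

    #include <stdint.h>

    #define MSP430_SOFTUART_TX_PROCESS_BUFFER_SIZE 128

    uint8_t softuart_tx_process_buffer[MSP430_SOFTUART_TX_PROCESS_BUFFER_SIZE];
    uint8_t softuart_tx_process_buffer_write_ptr;

    static inline void msp430_softuart_tx(uint8_t txc)
    {
        uint8_t idx = softuart_tx_process_buffer_write_ptr;
        __asm__ volatile (
            "MOV.B  %1, softuart_tx_process_buffer(%0)\n\t"  /* buffer[idx] = txc */
            "INC.B  %0\n\t"                                  /* idx++             */
            "AND.B  #0x7F, %0"                               /* idx &= 127        */
            : "+r" (idx)      /* index: read and written, kept in a register */
            : "r" (txc)       /* byte to store, in any register              */
            : "memory");      /* tells GCC the asm writes to the buffer      */
        softuart_tx_process_buffer_write_ptr = idx;
    }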

  • Jens-Michael Gross said:

    Probably, the post-increment, which normally would be a simple ADD #1,  &softuart_tx_process_buffer_write_ptr, has been optimized to fuse with the following clipping, but then, the optimizer didn’t look backwards to the previous loading of the pointer, as the previous analysis has revealed that the load and the (original) post increment couldn’t be optimized for smaller or faster code.

    Wait, actually, is that right? I don't think it can be this. This code:

    ADD #1, &softuart_tx_process_buffer_write_ptr
    AND #0x007F, &softuart_tx_process_buffer_write_ptr

    does not optimize to

    MOV #1, R15
    ADD &softuart_tx_process_buffer_write_ptr, R15
    AND #0x007F, R15
    MOV R15, &softuart_tx_process_buffer_write_ptr

    does it? The first code is 5 words long, and 9 cycles to execute (4 cycles, 2 words for the first, 5 cycles, 3 words for the second). The second code is 7 words long, and 10 cycles to execute. I think my math is right there.

    If so, that's not an optimization. (And I might actually just do inline assembly to force the first code anyway, it's not optimal, but it is at least just as small).

    The only way it should've even bothered to optimize the original add + mask was if it knew that softuart_tx_process_buffer_write_ptr was in a register already, but then why bother loading it?

    I think it's more along the line that somehow the compiler magically thinks that softuart_tx_process_buffer_write_ptr is volatile, because *then* that operation makes sense, since it can change between the assignment and the increment (and the increment can't be done in-place in memory, since it's volatile). But it's definitely not volatile.

    It's too bad that the optimizer ignores the 'register' keyword, otherwise I could try using an explicit register temporary in the code and see if that does it. But it does ignore it, and *not* optimizing disables inlining. Catch-22!
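
    The one pure-C workaround I can think of (no guarantee this compiler takes the hint) is to do all of the index arithmetic on a local copy, so only one load and one store of the global remain for it to reason about; the & (SIZE - 1) form is equivalent to the unsigned % as long as the size stays a power of two:

    inline void msp430_softuart_tx(uint8_t txc) {
        uint8_t idx = softuart_tx_process_buffer_write_ptr;   /* single load  */
        softuart_tx_process_buffer[idx] = txc;
        idx = (uint8_t)((idx + 1) & (MSP430_SOFTUART_TX_PROCESS_BUFFER_SIZE - 1));
        softuart_tx_process_buffer_write_ptr = idx;           /* single store */
    }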

  • Agreed, the second is not an optimization of the first, but it’s not certain that the first sequence is ever generated that way. The post-increment is part of an assignment that requires the unincremented index.
    Honestly, I don’t know how the compiler parses the code, translates it to assembly and then optimizes it. It would surely be interesting to see the individual steps analyzed.
    You could look at the generated code that is created with no optimization.

    After all, I’ve always found room for improvement on all assembly code I analyzed. However, in most cases, the gain is not worth the effort of hand-optimizing assembly code.

    The compiler surely doesn’t assume the variable is volatile; otherwise the number of read and write instructions would match the source code, which it doesn’t.

    I agree that some (many) things are still less than optimal. However, compilers have greatly improved over the last few years. And there must be some reason left for releasing (and selling) new versions :)

    Regarding inlining, I don’t understand why not optimizing should disable inlining. It should of course disable automatic inlining, but if a function is explicitly marked for inlining, why should the compiler ignore that? Unfortunately, it does.
    GCC has an __always_inline__ attribute that forces inlining (if possible at all) even with optimization disabled (see the sketch below).
    Also, it shouldn’t be a problem to honor the register keyword even if optimization is enabled. Well, only the compiler coder knows...
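
    In GCC that looks roughly like this (a sketch from memory, applied to the function from this thread):

    /* GCC only: forces inlining even with optimization disabled (-O0);
       the TI compiler has no equivalent. */
    static inline __attribute__((always_inline))
    void msp430_softuart_tx(uint8_t txc)
    {
        softuart_tx_process_buffer[softuart_tx_process_buffer_write_ptr++] = txc;
        softuart_tx_process_buffer_write_ptr %= MSP430_SOFTUART_TX_PROCESS_BUFFER_SIZE;
    }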

  • God, I wish the effort wasn't worth it. I've almost always had to hand-optimize interrupt service routines, for instance, and the gain is usually massive - like, half of the execution time. That doesn't bother me that much, though, because many of the tricks are stuff that doesn't map well into the C programming model, like reusing the carry bit.

    Regarding inlining and register keywords, I agree, but that's what it does. The register keyword situation is really frustrating, because with simple firmware you *almost always* end up with a huge number of totally unused registers because of the push/pop cost.

    The last firmware I was just testing had something like r4-r9 totally unused, if memory serves. You can grab r4/r5 using the global register flag... but you can't actually *use* r4/r5 inside C code without using deprecated intrinsics, since again, there's no way to get assembly to interact with C variables (and explicit reg vars aren't supported).

    It's just really frustrating. GCC for MSP430 produces larger and slower code, on average, than the TI C compiler, but the TI C compiler doesn't support enough extensions to allow you to fully optimize the C code that it generates.
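
    (For what it's worth, GCC also has a global register variable extension that pins a variable to a specific register - something like the line below, if I remember the syntax right. R4 here is just an example choice, and the register has to be kept reserved consistently across every translation unit and the assembly parts.)

    /* GCC extension, not TI: keep the write pointer permanently in R4.
       The choice of R4 is only an example; it must be reserved everywhere. */
    register uint8_t softuart_tx_process_buffer_write_ptr asm("R4");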

  • I agree about the poor interaction of C and assembly.
    GCC offers great support for inline assembly. Depending on the project, this may make up for the increased C code size.