My Linear Assembly routine uses most of the registers -- leaving four registers unused. My routine does not need to preserve registers from any call, and has no need whatever to push/pop things on-and-off the stack -- the stack is completely unused in this routine. However, the optimizer is repeatedly thrashing things on-and-off the stack, and doesn't remotely approach the optimal code.
How can I tell the optimizer to stop thrashing the stack? There are plenty of registers for this routine.
[Added Note: The .map listing shows each of my variable names being mapped to many registers -- moved from one register to another by being pushed on and off the stack -- which is all unnecessary.]
There is not enough information here to know why the compiler is using the stack. I can think of three scenarios.
One, you are using the .cproc directive and the compiler is using some registers which the register convention requires be preserved. These registers are described as "preserved by child" in the Register Conventions section of the compiler manual http://www.ti.com/lit/pdf/spru187 . When these registers are used, the compiler preserves them by saving them to the stack at the beginning of the function, and restoring them from the stack at the end of the function. If your routine is called from C, then it must adhere to these conventions. If it is not called from C, then use .proc instead of .cproc.
Two, generating correct code for this procedure requires more machine registers than are available. In such a case the compiler uses locations on the stack as another place to keep the results of intermediate computations.
Three, generating correct code for this procedure does not require more machine registers than are available, but due to some bug, the compiler is using locations on the stack anyway.
You contend that the third scenario is occurring. It's possible. But the first two scenarios are much more likely. Please investigate to determine whether the first two scenarios explain all the stack usage. If not, then please submit a test case here for further evaluation.
Thanks and regards,
1) I'm already using .proc (instead of .cproc), so that's not the problem (or the solution).
2) I don't see how my Linear Assembly code could possibly require intermediate computations -- since it is already in assembly code, where there exist no intermediate computations. The assembler should merely optimize my instructions. So, I don't see this as the source of the problem.
3) I'm not suggesting the TI optimizing compiler/assembler is goofing up. (I'm too new to suggest such a thing.) Rather, I was hoping there exists an assembler directive (or something) that will give the assembler a hint. I was hoping you-all would slap me around for not seeing the fine print in footnote xii, of section IX, of the User's Guide to Footsie -- and then I would be suitably embarrassed. ... ;-)
I will take your suggestion and try to gen-up a simple test case. However, when I try to simplify the code (to help identify the causes of the problem), then the optimizing assembler starts eliminating whole sections of my code, apparently under the notion that my code isn't doing anything. It literally optimizes it down to nothing. I end up with three or four lines of code that do not represent my original. So, there is some mystery there for me. I want to shout at the assembler: "Stop using the stack! No, don't eliminate my code! My code is correct, just optimize it!" Etcetera. ... But I don't know how to speak assembler to a computer. At this rate, generating a suitably simple test case could be a project. ....
UPDATE: I tried it again just now. That is, I simplified by eliminating the extraneous code and including only the loop counter and the key LOOP of computation -- the same loop the optimizer previously said is a "Disqualified loop: Did not find schedule". ... This time the optimizer eliminated my many lines of computation, and produced a few lines of do-nothing it feels is sufficient. I somehow have to tell it, "No, don't eliminate these computations! Optimize them - Don't eliminate them."
I tried optimizing another (different) piece of Linear Assembly code. This time it was optimized down to nothing -- absolutely nothing -- no CPU instructions at all.
This particular test routine began with the data already within the registers, then finds the maximum and places it in register A4. But that somehow gets "optimized" down to nothing. Apparently, the optimizer is making some assumptions I don't know about. What are those assumptions?
So I'm still stuck between two extremes: In some cases the optimizer thrashes the stack in an orgy of non-optimality, while in other cases it optimizes the code down to nothing in a triviality of non-productivity. What am I doing wrong?
Do you have all the "live-out" registers (presumably just A4) listed as operands to the ".endproc" directive?
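To illustrate the question, here is a minimal linear assembly sketch, not your actual routine. The label _max2, the input registers A4 and B4, and the symbolic names x, y, and p are all hypothetical. The point is only the operand on ".endproc": without it, the optimizer has no indication that any result is used after the procedure, so it is free to treat every computation as dead code.

```asm
_max2:  .proc   A4, B4          ; live-in: values already in A4 and B4
        .reg    x, y, p         ; symbolic registers (hypothetical names)
        MV      A4, x           ; copy the inputs into symbolic registers
        MV      B4, y
        CMPGT   y, x, p         ; p = 1 if y > x
  [p]   MV      y, x            ; x now holds the larger value
        MV      x, A4           ; place the result in A4
        .endproc A4             ; live-out: A4 is still needed after this point
```

If ".endproc" is written with no operands, nothing in this sketch is marked as live at the end of the procedure.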
Thanks for the correction -- it helped!
After correcting the ".endproc" directive, the optimizer puts out code now. This time, instead of thrashing the stack, it thrashes the registers. That is, it temporarily moves the data from register to register (like musical chairs), when there is no reason for it. It's not very optimal. My original code didn't do that. Indeed, the possibilities for parallelism are obvious and easy to do by hand.
I could attach this test example, if you like, but it would take us away from the more serious problem at hand. That is, in my project code (not the test example discussed in this post) the optimizer is thrashing things needlessly on and off the stack. I'm trying to stop the optimizer from doing that.
I want to address this:
Walter Snafu wrote: 2) I don't see how my Linear Assembly code could possibly require intermediate computations -- since it is already in assembly code, where there exist no intermediate computations. The assembler should merely optimize my instructions. So, I don't see this as the source of the problem.
I think you wrote that because you misunderstand what the linear assembler does.
The compiler manual http://www.ti.com/lit/pdf/spru187 , in the section titled "About the Assembly Optimizer", gives this description:
The assembly optimizer performs several tasks including the following:
• Optionally, partitions instructions and/or registers
• Schedules instructions to maximize performance using the instruction-level parallelism of the C6000
• Ensures that the instructions conform to the C6000 latency requirements
• Optionally, allocates registers for your source code
I presume your input code does not have the instructions partitioned, and uses symbolic registers instead of actual machine registers. Thus, in your case, the linear assembler is performing both of those optional steps.
When the linear assembler maps the symbolic registers on to actual machine registers it can introduce moves between registers and moves to/from the stack. This is normal and expected. Very often these moves are in parallel with other instructions. This means they are free in terms of cycles, but not code size. Usually, such moves are necessary for correct operation. To know whether this is the case for your code requires that we have an example we can build and examine its output.
Your misunderstanding is partly my fault. I introduced the term "intermediate computations". By that I meant the moves described in the previous paragraph. By trying to be brief, I confused you. Sorry about that.
I developed an example I can send. How do I attach it to a post? (The website posting system rejected my "Insert media" command.)
My example here has no loops or branches, and is highly symmetrical -- easily partitioning into parallelizeable A-side and B-side instructions, with few cross-paths.
I can easily (by hand) optimize the code to 34 execution packets (34 instruction cycles).
The TI optimizer is thrashing the registers to-and-fro, and producing a result with 53 instruction cycles -- or 56% slower.
I'm new to this TI optimizer, so probably I'm doing something wrong (like not giving enough clues to the optimizer?). Why is the optimizer missing such an obvious improvement?
Walter Snafu wrote: I developed an example I can send. How do I attach it to a post?
In the window that comes up when you create a forum post, look up and to the left. Not all the way up. You'll see
Compose | Options | Preview
Click on Options. Then under "File Attachment" click on "Add/Update". It is pretty straightforward from there.
I attached two files. One is the .sa file, with my extended comments in it -- this file is for the Linear Assembly Optimizer. The other file is .asm, which contains my hand-optimized version of the same code, for comparison with the TI Optimizer.
Hopefully, I can attach these two files.
[UPDATE: The system seems to be rejecting my attachment files. After I submit the files, the system seems to be claiming they have the wrong filename extensions -- when my files have the extensions: .sa and .asm. What's wrong?]
Walter Snafu wrote: [UPDATE: The system seems to be rejecting my attachment files. After I submit the files, the system seems to be claiming they have the wrong filename extensions -- when my files have the extensions: .sa and .asm. What's wrong?]
I don't know. Try putting them in a .zip and attaching that. I'm pretty sure that will work. Sorry for the trouble!
Hopefully my attempt to attach the files will work this time. Try the zip file approach now.
What are your command-line options, and what version of the compiler are you using?
Command line options, I believe are:
-mv6748 --symdebug:none -O3 --include_path="C:/Program Files/Texas Instruments/ccsv4/tools/compiler/c6000/include" --diag_warning=225 --issue_remarks --consultant --debug_software_pipeline --speculate_loads=1024 --optimizer_interlist --single_inline --remove_hooks_when_inlining --opt_for_speed=3 --gen_opt_info=2 -k --asm_listing
I'm using CCSv4.1.3.00038. I presume that tells what compiler version I'm using. I believe I'm using code generation tools 6.1.12.
Okay, there are a couple of issues here. I don't have a complete answer for you; this is just a preliminary analysis. Anything deeper would have to be done by the support team.
If there are any register values from the surrounding code, you're supposed to list them as arguments to the ".proc" directive to tell the compiler that it needs to care about them. Furthermore, there is no indication in the linear assembly file which specific registers are the 32 you care about as inputs, so the compiler has no idea how to initialize the virtual registers sA00, etc., or which machine register to map them to.
Unfortunately, the straightforward approach of just globally replacing (with .asg) each of the virtual registers sA00, etc., with the machine register it needs to be trips on a problem with the instruction scheduler that is trying to prevent a register allocation bug. That's a complicated issue, so I'll just mention it exists and say you should list all 32 machine registers as arguments to the ".proc" directive and then immediately issue 32 moves into the virtual registers sA00, etc. The compiler will use the right machine registers for these 32 virtual registers and eliminate the leading MVs, so they won't add to the length of the schedule.
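A rough sketch of that pattern, cut down to four registers instead of 32 to keep it short. The label _kernel, the machine registers A4, A5, B4, B5, and the names sA01, sB00, sB01 are assumptions for illustration; only sA00 is taken from your file, and your real register assignment will differ.

```asm
_kernel: .proc   A4, A5, B4, B5          ; live-in machine registers (hypothetical choice)
         .reg    sA00, sA01, sB00, sB01  ; virtual registers
         MV      A4, sA00                ; leading moves tie each virtual
         MV      A5, sA01                ; register to the machine register
         MV      B4, sB00                ; that holds its input value
         MV      B5, sB01
         ; ... the computation on sA00, sA01, sB00, sB01 goes here ...
         MV      sA00, A4                ; copy the result out
         .endproc A4                     ; A4 is live-out
```

The full version would list all 32 inputs on ".proc" and have 32 leading MV instructions in the same style.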
Two of the four ZERO instructions need to precede the first two CMPGT in both the linear assembly and the sample hand-coded assembly file; in the hand-coded assembly file, the ZERO instructions are shown in parallel with these first two CMPGT instructions, but the CMPGT read two of those registers (tA0 and tB0), which are not initialized at that point.
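A small hand-coded sketch of the ordering problem, with assumed register assignments: A5 stands in for tA0, and A4 and A1 are hypothetical. Which CMPGT operand actually reads tA0 in your file may differ. Instructions in the same execute packet read their source registers before any result from that packet is written, so the parallel form compares against whatever A5 held beforehand.

```asm
; Hazard: ZERO and CMPGT issue in the same execute packet,
; so CMPGT reads the old, uninitialized contents of A5.
        ZERO    .S1     A5
||      CMPGT   .L1     A4, A5, A1

; Fix: issue ZERO one execute packet earlier.
        ZERO    .S1     A5
        CMPGT   .L1     A4, A5, A1
```

In the linear assembly file the equivalent fix is simply to place the ZERO instructions on earlier lines than the CMPGT instructions that read those registers.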
Using 6.1.12, I get 54 execute packets, but using 7.0.1 I get 36 for the serial assembly, so apparently the optimization improved between 6.1.x and 7.0.x. It's unclear to me exactly which optimization improved. Note that I get 35 execute packets for the sample hand-coded assembly file.