Dear TI Employee,
The processing time of my project does not meet my requirement.
Document SPRS691C, page 18, says the device offers "a very high level of parallelism that can be exploited by DSP programmers through the use of TI's optimized C/C++ compiler".
So I set my optimization level to o2.
Is there anything else that can help me make my project run faster?
Or is there anything I missed?
Does the .out file control the eight functional units, or should I control them myself?
By the way, I am writing C code.
In my experience the compiler does a great job of optimizing the code, but occasionally it needs some help, and you have to give it some hints.
You can set the optimization level to 3 (Basic Options) and Optimize for speed to 5 (in Optimization options). Also, if you are compiling in debug mode, then set "optimize fully in presence of debug directive" (in Runtime mode options).
There are a lot of other options you can try; read:
I think there are 2 ways to consider:
1. In many situations the compiler's optimization is not enough. You can write assembly code for the computation-heavy parts of your code, or take advantage of DMA to speed up data transfers if there are many data accesses. There is a Time Stamp Counter inside the CorePac; you can use it to measure the cycle consumption of each module of your application and locate the biggest cycle consumers. Then optimize these modules according to their nature (computationally intensive, data-access intensive, or a combination of the two).
2. There are 8 C66x cores in the C6678, so as a rough estimate the performance could be up to 8 times that of one core if all cores do the same kind of work. Have a look at your project and consider whether it is suitable or convenient to deploy on multiple cores. There are different ways to do that. One is to deploy the same program on every core, possibly with different input data. The other is to divide the whole job into several steps like a pipeline, such as A->B->C->D..., so that A runs on Core0, B on Core1 and so on. TI provides the MAD utility to deploy multicore applications effectively; you can learn about it at http://processors.wiki.ti.com/index.php/MAD_Utils_User_Guide.
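As a rough illustration of the first multicore approach (same program on every core, different input data), here is a minimal sketch of how each core could work out its own share of the input. The names work_range and core_slice are made up for this example. On a C66x device the core number would typically be read from the DNUM register; that is an assumption here, so the sketch takes it as a plain argument to stay portable.

```c
#include <stddef.h>

/* Half-open range [begin, end) of items handled by one core. */
typedef struct {
    size_t begin;
    size_t end;
} work_range;

/*
 * Split `total` items as evenly as possible across `num_cores` cores.
 * `core_id` is assumed to be in 0..num_cores-1. How it is obtained is
 * platform specific (hypothetically DNUM on a C66x), so it is passed in.
 */
static work_range core_slice(size_t total, unsigned core_id, unsigned num_cores)
{
    size_t base  = total / num_cores;   /* items every core gets */
    size_t extra = total % num_cores;   /* the first `extra` cores get one more */
    work_range r;

    r.begin = core_id * base + (core_id < extra ? core_id : extra);
    r.end   = r.begin + base + (core_id < extra ? 1 : 0);
    return r;
}
```

Each core would then loop from begin to end over its own part of the input buffer. Whether this gets anywhere near an 8x speedup depends on memory bandwidth and on how the data is shared, so it is worth measuring.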
Just for reference.
Please press the "Verify Answer" button if you think the post is helpful to your question. Thanks.
Thanks Albert and Allen, good posts.
I'd like to add that you should first consider the techniques in the C optimization guides (unrolling loops, using pragmas, etc.) to help free up the compiler to do its job optimally before writing hand assembly, and even then I'd suggest trying linear assembly first.
The compiler can give you very optimal results if you free it up to do so as described in the guides and this can take much less time than hand optimizing routines. After you have a system up and running, then you can profile and go back and optimize routines that are consuming lots of cycles or are called extremely frequently.
Please click the Verify Answer button on this post if it answers your question.
Thanks Albert, Allen and Chad.
I will study the document.
I have a question about optimization level o3.
If I set o3, the project fails, while it works fine at level o2.
Is there any reason?
It shouldn't fail at any optimization level. Can you clarify how it's failing? That would probably be more of a discussion for the C/C++ Compiler forum section, though. If you provide details on how it's failing there, they may be able to better assist.
Matt, I am not sure what you mean by saying it fails at optimization level 3. Sometimes it is Eclipse's fault as well; it crashes, and that has happened to me many times. Can you give the particular error message you are getting, or describe anything special you have done that makes it fail?
One more thing: Alberto and Allen, can you suggest something if you have used linear assembly to access cores and so on? Any example projects or something like that? I am also working on that.
I have one more suggestion. The compiler guide recommends reading the generated assembly. Don't neglect this; sometimes it really helps. Many times, when reading the assembly output, I saw strange instructions. Then I realized that, for example, a mixture of signed and unsigned types required an extra instruction, or that the wrong types were being used to perform multiplies. Sometimes I saw that a loop was overloaded with shift operations, so I could use an expand intrinsic to move some of the load to the multiplier unit. Also read the compiler advice files. That is where the compiler can tell you what it needs.
Another aspect, which I believe must come first, is proper data placement, which allows SIMD instructions to be used. Don't forget to mark pointers to input data with the const qualifier when that data doesn't actually change in your loop. Finally, as was suggested above, you may think about using multiple cores.
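To illustrate the points about consistent types and const-qualified inputs, here is a small sketch of a 16-bit dot product. The function name dot_q15 is invented for this example, and whether the compiler actually emits SIMD multiplies for it depends on the options and data alignment, so check the generated assembly.

```c
#include <stdint.h>

/*
 * Dot product of two 16-bit vectors.
 * - Both inputs are `const`: the loop only reads them.
 * - Both operands are signed 16-bit, so each product is a plain
 *   16x16 -> 32-bit signed multiply with no signed/unsigned conversions.
 */
int32_t dot_q15(const int16_t *a, const int16_t *b, int n)
{
    int32_t sum = 0;
    int i;

    for (i = 0; i < n; i++) {
        sum += (int32_t)a[i] * b[i];
    }
    return sum;
}
```

If one of the arrays were declared as uint16_t instead, the compiler would have to handle the mixed signedness, which is the kind of extra work that shows up in the assembly listing.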
What do you mean by using linear assembly to access cores and so on?
Hello Allen, from your post I thought you might have accessed the registers of each core through linear assembly. I am actually trying to do that so I can put my data in them. So I was wondering whether you could help me with it, or suggest some linear assembly examples in which you accessed registers or memory sections separately.
Linear assembly is a good way to improve the performance of computation-intensive code, and it is easier to write than direct assembly code. I suggest you go through chapter 4.3 of the C6000 optimizing compiler user's guide. There is also a complete example of linear assembly code in that section.
In linear assembly, you don't need to assign registers explicitly or work out which registers are free when you need some for a calculation. For the same reason, you can't use linear assembly to target a specific register, because register assignment is handled by the assembly optimizer. So if you want to access a dedicated register such as A6 or B8, you need to write normal assembly code (with the file extension .asm rather than the .sa of linear assembly).
If I set o3, the project fails, while it works fine at level o2.
Is there any reason?
It is hard to give hints about this without more information about what you mean by "fails". I suggest you begin by isolating the performance-critical code, optimizing only that, and looking at what happens.
In my experience, one common problem with optimization is the coherence of variables shared between multiple threads of execution (OS threads and also interrupt handlers). Shared control variables should always be declared "volatile".
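A minimal sketch of that volatile advice, with made-up names (data_ready, sample_isr, poll_sample) and with the interrupt simulated by an ordinary function call, since how a handler is registered depends on the platform:

```c
#include <stdbool.h>

/* Written by the interrupt handler, read by the main loop. */
static volatile bool data_ready = false;
static volatile int  sample     = 0;

/* Hypothetical interrupt handler; its registration is platform specific. */
void sample_isr(int new_sample)
{
    sample     = new_sample;
    data_ready = true;
}

/* Polls the flag; returns 1 and stores the sample if one was pending. */
int poll_sample(int *out)
{
    if (!data_ready) {
        return 0;
    }
    *out       = sample;
    data_ready = false;
    return 1;
}
```

Without volatile, an optimizing compiler is allowed to keep data_ready in a register inside a polling loop and never see the handler's update, which could be one reason code behaves differently between optimization levels. Note that volatile does not make accesses atomic; it only forces the reads and writes to actually happen.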
Hello Allen, thanks for the reply. Yes, I have read section 4.3. My question is: how would I know that my data is actually being accessed through registers? The linker.cmd file shows in detail where each section is placed. You also mention normal assembly; as far as I know, it is very difficult to write assembly that uses specific registers. Please share some experience or examples, if you have any, of using assembly together with C code, or suggest something else if you have another idea. For your information, I am working on performance and trying to see the difference between the various parts of memory.
Thanks and Regards, Arun
All data is accessed by the DSP through registers, so before doing any computation on data in memory, you must load it into registers.
Could you explain exactly what you are working on? Do you just want to test the performance difference when allocating your data section to different memory segments such as DDR3 or MSMCRAM?
Hello Allen, on a small scale, yes: I want to see the difference after allocating data to different memory sections, from registers to L1, then L2, then MSMC and DDR3, simply for matrix-to-matrix multiplication.
On a larger scale, I am working on DSP architecture for my research in energy-efficient computing. I can't discuss more details because that would be out of scope here. When you mention allocation to registers, do you mean individual allocation to each register, or something else? Or do I declare the data as register and it will automatically go there?
Thanks and regards, Arun
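For the memory-placement experiment described above, one possible starting point is the TI compiler's DATA_SECTION pragma, sketched below. The section names ".ddr_data" and ".msmc_data" are hypothetical; they only mean something once the linker command file maps them to real memory ranges. The pragmas are guarded so that other compilers simply skip them.

```c
#define N 4

/*
 * DATA_SECTION is a TI compiler pragma. The section names below are
 * hypothetical and must be mapped to memory in the linker command file.
 */
#ifdef __TI_COMPILER_VERSION__
#pragma DATA_SECTION(mat_a, ".ddr_data")
#pragma DATA_SECTION(mat_b, ".msmc_data")
#pragma DATA_SECTION(mat_c, ".msmc_data")
#endif
float mat_a[N][N];
float mat_b[N][N];
float mat_c[N][N];

/* Plain N x N matrix multiply: c = a * b. */
void matmul(float a[N][N], float b[N][N], float c[N][N])
{
    int i, j, k;

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            float acc = 0.0f;
            for (k = 0; k < N; k++) {
                acc += a[i][k] * b[k][j];
            }
            c[i][j] = acc;
        }
    }
}
```

Moving an array to a different section and timing matmul(mat_a, mat_b, mat_c) each time would show the effect of placement. Registers cannot be targeted this way; the compiler decides what lives in registers, and cache settings will also influence the numbers.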
Before going down the linear assembly road, consider using:
a) Intrinsics (see the compiler guide for how to use them)
b) #pragma MUST_ITERATE
c) Manual unrolling
d) The restrict keyword
and look at the .asm file produced (you have to specify in the compiler options that you want to keep it) to see how the compiler optimizes your code and, above all, where it fails to unroll your loops, for example.
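A small sketch combining items b), c) and d) from the list above. The function name saxpy4 and the assumptions it makes (n is a multiple of 4 and at least 4, buffers never overlap) are purely for illustration. MUST_ITERATE is a TI compiler pragma, so it is guarded for other compilers.

```c
/*
 * Scaled vector add: out[i] = a[i] + k * b[i].
 * Assumptions (for illustration only): n is a multiple of 4 and at
 * least 4, and the three buffers never overlap (hence `restrict`).
 */
void saxpy4(float *restrict out, const float *restrict a,
            const float *restrict b, float k, int n)
{
    int i;

#ifdef __TI_COMPILER_VERSION__
    /* The unrolled loop below runs n/4 times, so at least once. */
    #pragma MUST_ITERATE(1)
#endif
    for (i = 0; i < n; i += 4) {
        /* Manually unrolled by 4: four independent operations per pass. */
        out[i]     = a[i]     + k * b[i];
        out[i + 1] = a[i + 1] + k * b[i + 1];
        out[i + 2] = a[i + 2] + k * b[i + 2];
        out[i + 3] = a[i + 3] + k * b[i + 3];
    }
}
```

Often it is enough to give the compiler the trip-count information and restrict, and let it unroll by itself; the .asm output shows which variant pipelines better.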
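Use a macro to get the number of cycles. We used this (TSCL is declared in c6x.h; t_start, t_stop and t_overhead are assumed to be unsigned int variables):
/* Any write to TSCL starts the counter; then take the start timestamp. */
#define Start_profile() TSCL = 0; \
                        t_start = TSCL
/* Take the stop timestamp and print the elapsed cycle count. */
#define Stop_profile()  t_stop = TSCL; \
                        t_overhead = t_stop - t_start; \
                        printf("cycles = %u\n", t_overhead)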