    How to speedup my project running on C6678

    Posted by Matt Wu
    on May 11 2012 01:02 AM

    Dear TI Employee,


      My project's processing time doesn't meet my requirement.

      Page 18 of SPRS691C describes "a very high level of parallelism that can be exploited by DSP programmers through the use of TI's optimized C/C++ compiler".

      So I have set my optimization level to o2.

      Is there anything else that can help make my project run faster?

      Or is there anything I missed?

      Does the compiled .out file already schedule the eight functional units, or do I have to control them myself?

      By the way, I am writing C code.

    Regards,

    Matt

       

    • Alberto Chessa
      Posted by Alberto Chessa
      on May 11 2012 03:56 AM

      Hi,

      In my experience the compiler generally does a great job of optimizing the code, but now and then it needs some help, and you have to give it some hints.

      You can set the optimization level to 3 (Basic Options) and Optimize for speed to 5 (in Optimization options). Also, if you are compiling in debug, then set "optimize fully in presence of debug directive" (in Runtime mode options).

      There are a lot of other options you can try. Read:

      • SPRABF2 Introduction to TMS320C6000 DSP Optimization
      • SPRA666 Hand-Tuning Loops and Control Code on the TMS320C6000
      • SPRU198K TMS320C6000 Programmer’s Guide

       

    • Allen Lee
      Posted by Allen Lee
      on May 11 2012 04:37 AM

      Hi Matt,

      I think there are two approaches to consider:
      1. In many situations the compiler's optimization is not enough. You may need to write assembly code for the computation-heavy parts of your code, or use DMA to speed up data transfers if there is a lot of data movement. There is a Time Stamp Counter inside the CorePac; you can use it to measure the cycles consumed by each module of your application and find the biggest cycle consumers. Then optimize those modules according to their nature (computationally intensive, data-access intensive, or a combination of the two).
      2. There are 8 C66x cores in the C6678, so in the ideal case performance could be up to 8 times that of one core if all cores do the same kind of work. Look at your project and consider whether it is suitable or convenient to deploy on multiple cores. There are different ways to do that. One is to deploy the same program on every core, possibly with different input data. The other is to divide the whole job into steps like a pipeline, such as A->B->C->D..., so that A runs on Core0, B on Core1, and so on. TI provides the MAD utilities for deploying multicore applications; you can read about them at http://processors.wiki.ti.com/index.php/MAD_Utils_User_Guide.
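A minimal sketch of the first multicore approach above (same program on every core, different input data), assuming the work is an array of independent items. `work_slice` and `slice_for_core` are hypothetical names, not part of any TI library. On the C66x the core number would typically be read from the `DNUM` control register declared in `c6x.h`; here it is passed in as a plain argument so the sketch also compiles on a host.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CORES 8  /* the C6678 has eight C66x cores */

/* Half-open range [begin, end) of items assigned to one core. */
typedef struct {
    size_t begin;
    size_t end;
} work_slice;

/* Split `total` items as evenly as possible across `num_cores` cores.
 * The first (total % num_cores) cores each get one extra item. */
work_slice slice_for_core(size_t core_id, size_t num_cores, size_t total)
{
    work_slice s;
    size_t base, extra;

    assert(num_cores > 0 && core_id < num_cores);
    base  = total / num_cores;
    extra = total % num_cores;
    s.begin = core_id * base + (core_id < extra ? core_id : extra);
    s.end   = s.begin + base + (core_id < extra ? 1 : 0);
    return s;
}
```

Each core would call `slice_for_core(core_id, NUM_CORES, total)` and process only its own range. How the input buffers are shared between cores and kept coherent is outside the scope of this sketch.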

      Just for reference.

      Allen

    • Chad Courtney
      Posted by Chad Courtney
      on May 11 2012 09:09 AM

      Thanks Albert and Allen,  good posts.

      Before writing hand assembly, you should consider the techniques in the C optimization guides (unrolling loops, using pragmas, etc.) to free up the compiler to do its job well. If that is still not enough, I'd suggest trying linear assembly first.

      The compiler can give you very good results if you free it up to do so as described in the guides, and this can take much less time than hand-optimizing routines. After you have a system up and running, you can profile and go back and optimize the routines that consume lots of cycles or are called extremely frequently.

      Best Regards,

      Chad

    • Matt Wu
      Posted by Matt Wu
      on May 11 2012 13:05 PM

        Thanks Albert, Allen and Chad.

        I will study the documents.

        I have a question about optimization level o3.

        If I set o3, the project fails, while it works fine at level o2.

        Is there any reason for that?

        Thanks!

       

       

        

    • Chad Courtney
      Posted by Chad Courtney
      on May 11 2012 14:22 PM

      Matt,

      It shouldn't fail at any optimization level.  Can you clarify how it's failing?  That would probably be more of a discussion for the C/C++ Compiler forum section, though.  If you provide details on how it's failing there, they may be able to better assist.

      Best Regards,
      Chad

    • Arun
      Posted by Arun
      on May 12 2012 02:34 AM

      Matt,

      I am not sure what you mean by saying it fails at optimization level 3. Sometimes it is Eclipse's fault; it crashes, and that has happened to me many times as well. Can you give the particular error message you are getting, or describe anything special you have done that makes it fail?

      One more thing: Alberto and Allen, can you suggest something if you have used linear assembly to access cores and so on? Any example projects or something like that? I am also working on that.

      Thanks.

    • rrlagic
      Posted by rrlagic
      on May 13 2012 21:42 PM

      I have one more suggestion. The compiler guide recommends reading the generated assembly. Don't neglect that; sometimes it really helps. Many times, reading the assembly output, I saw strange instructions. Then I realized that, for example, a mixture of signed and unsigned types required an extra instruction, or the wrong types were used to perform multiplies. Sometimes I saw that a loop was overloaded with shift operations, so I could use an expand intrinsic to move some of the load to the multiplier unit. Also read the compiler advice files. That is what the compiler can tell you.

      Another aspect, which I believe should come first, is proper data placement, which allows SIMD instructions to be used. Don't forget to mark pointers to input data with the const qualifier when that data doesn't actually change in your loop. Finally, as was suggested above, you may think about using multiple cores.
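As a small illustration of the type and const advice above, here is a hedged sketch of a 16-bit dot product. The function name `dot_i16` is hypothetical. Both inputs are const-qualified and share the same signed 16-bit type, and the accumulator is a signed 32-bit value, so there is no signed/unsigned mixing inside the loop. Whether a given compiler actually maps this onto 16x16 multiplies or SIMD instructions has to be checked in the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of two signed 16-bit vectors.
 * const tells the compiler (and the reader) that the inputs are read-only. */
int32_t dot_i16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < n; i++) {
        /* Widen one operand explicitly so the product is a 32-bit signed value. */
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}
```

For example, `dot_i16` applied to {1, 2, 3, 4} and {5, 6, 7, 8} accumulates 5 + 12 + 21 + 32.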

    • Allen Lee
      Posted by Allen Lee
      on May 14 2012 00:48 AM

      Hi Arun,

      What do you mean by using linear assembly to access cores and all?

      Allen

    • Arun
      Posted by Arun
      on May 14 2012 00:52 AM

      Hello Allen,

      From your post, I was thinking you might have accessed the registers of each core, or something like that, using linear assembly. I am trying to do that so I can put my data in them. So I was wondering if you could help me with it, or suggest some linear assembly examples where you accessed registers or memory sections separately.

      Thanks. 

    • Allen Lee
      Posted by Allen Lee
      on May 14 2012 01:23 AM

      Hi Arun,

      Linear assembly is a good way to improve the performance of computation-intensive code, and it is easier to write than direct assembly code. I suggest you go through chapter 4.3 of the C6000 optimizing compiler UG. There is also a complete example of linear assembly code in that section.

      In linear assembly, you don't need to assign registers explicitly or work out which registers are available when you want to use some of them for a calculation. For the same reason, you can't use linear assembly to access a specific register, because the assignment is handled by the assembly optimizer. So if you want to access a dedicated register such as A6 or B8, you need to write normal assembly code (with the file extension .asm rather than the .sa of linear assembly).

      Allen

    • Alberto Chessa
      Posted by Alberto Chessa
      on May 14 2012 01:28 AM

      Matt Wu

        If I set o3, the project fails while it works fine at level o2 .

        Is there any reason ?

      It is hard to give hints about this without more information on what you mean by "fails". I suggest you begin by isolating the performance-critical code, optimizing only that, and seeing what happens.

      In my experience, one common problem with optimization is the coherence of variables shared between multiple threads of execution (OS threads and also interrupt handlers). Shared control variables should always be declared "volatile".
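A minimal sketch of the volatile point above, assuming a flag that an interrupt handler sets and a main loop polls. The names `data_ready`, `on_data_interrupt` and `wait_for_data` are hypothetical. Without `volatile`, an optimizing compiler is allowed to read the flag once and keep it in a register, so the loop might never see the handler's write. Note that `volatile` only constrains the compiler; it does not by itself make accesses atomic or handle cache coherence between cores.

```c
#include <assert.h>

/* Written by the interrupt handler, read by the main loop. */
static volatile int data_ready = 0;

/* Hypothetical interrupt service routine: signals that new data has arrived. */
void on_data_interrupt(void)
{
    data_ready = 1;
}

/* Polls the flag up to max_spins times.
 * Returns 1 if data arrived (and clears the flag), 0 on timeout. */
int wait_for_data(long max_spins)
{
    long i;

    assert(max_spins >= 0);
    for (i = 0; i < max_spins; i++) {
        if (data_ready) {   /* re-read from memory on every iteration */
            data_ready = 0;
            return 1;
        }
    }
    return 0;
}
```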

    • Arun
      Posted by Arun
      on May 14 2012 01:31 AM

      Hello Allen,

      Thanks for the reply. Yes, I have read section 4.3. But how would I know that my data is actually being accessed through registers? I mean, we can see from the linker.cmd file where every section is placed. You also mention normal assembly; as far as I know, it is very difficult to write assembly that works with registers directly. Please share some experience or examples, if you have any, of calling assembly from C code, or suggest something else if you have another idea. For your information, I am working on performance and trying to see the difference between each part of memory.

      Thanks and Regards,
      Arun 

    • Allen Lee
      Posted by Allen Lee
      on May 14 2012 02:20 AM

      Hi Arun,

      The DSP operates on data through registers, so before doing any computation on data in memory, you must load it into registers.

      Could you explain exactly what you are working on? Do you just want to test the performance difference when allocating your data section to different memory segments such as DDR3 or MSMCRAM?

      Allen

       

    • Arun
      Posted by Arun
      on May 14 2012 02:25 AM

      Hello Allen,

      On a simple scale, yes. I want to see the difference after allocating data to different memories, from registers to L1, then L2, then MSMC and DDR3, simply for matrix-to-matrix multiplication.

      On a larger scale, I am working on DSP architecture for my research on energy-efficient computing. I can't discuss more details, as that would be out of scope here.
      As you said, data goes through registers. Do you mean allocating to each register individually, or something else? Or do I declare the data as register and it goes there automatically?
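For the memory-placement comparison described above, one possible setup is sketched here. It assumes the TI compiler's `DATA_SECTION` pragma is used to place the matrices in a named section; `.msmc_data` is a hypothetical section name that would have to be mapped to the desired memory (L2, MSMC SRAM or DDR3) in the linker command file. The pragma is guarded by `__TI_COMPILER_VERSION__` so the same file also builds on a host compiler. `MAT_N`, the array names and `matmul` are illustrative only.

```c
#include <assert.h>

#define MAT_N 4  /* hypothetical matrix dimension */

#ifdef __TI_COMPILER_VERSION__
/* ".msmc_data" is a made-up section name; map it in the linker command file. */
#pragma DATA_SECTION(mat_a, ".msmc_data")
#pragma DATA_SECTION(mat_b, ".msmc_data")
#pragma DATA_SECTION(mat_c, ".msmc_data")
#endif
float mat_a[MAT_N * MAT_N];
float mat_b[MAT_N * MAT_N];
float mat_c[MAT_N * MAT_N];

/* Plain row-major matrix multiply: c = a * b, all n-by-n. */
void matmul(const float *a, const float *b, float *c, int n)
{
    int i, j, k;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            float sum = 0.0f;
            for (k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

Rebuilding with the section mapped to a different memory and timing `matmul(mat_a, mat_b, mat_c, MAT_N)` each time would give the comparison. Registers are not a placement target in this sense: the compiler loads operands into them as needed.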

      Thanks and regards,
      Arun 

    • Clement Mesnier
      Posted by Clement Mesnier
      on May 14 2012 03:59 AM

      Before going down the linear assembly road, consider using:

      a) Intrinsics (see the compiler guide for how to use them)

      b) #pragma MUST_ITERATE

      c) Manual unrolling

      d) The restrict keyword

      Also look at the .asm file produced (you have to request it in the compiler options) to see how the compiler optimizes your code and, above all, where it fails, for example where it does not unroll your loops.
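Items b) and d) from the list above can be sketched together as follows. This is an illustration under stated assumptions, not a tuned kernel: `vec_add` is a hypothetical function, and the promise that the trip count is at least 8 and a multiple of 8 is an assumption chosen for the example. The pragma is TI-specific, so it is guarded by `__TI_COMPILER_VERSION__` to keep the file buildable elsewhere.

```c
#include <assert.h>

/* out[i] = a[i] + b[i].
 * restrict promises the compiler that the three buffers do not overlap,
 * which lets it reorder and overlap loads and stores across iterations. */
void vec_add(const short *restrict a, const short *restrict b,
             short *restrict out, int n)
{
    int i;

    /* The pragma below is a promise, so check it in debug builds. */
    assert(n >= 8 && n % 8 == 0);

#ifdef __TI_COMPILER_VERSION__
    /* Trip count is at least 8 and a multiple of 8 (max left blank). */
    #pragma MUST_ITERATE(8, , 8)
#endif
    for (i = 0; i < n; i++) {
        out[i] = (short)(a[i] + b[i]);
    }
}
```

If the caller ever passes overlapping buffers or a length that breaks the stated trip count, the behavior is undefined or the generated code may be wrong, so these hints belong only where the guarantees really hold.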

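      Use a macro to get the number of cycles. We used this:

      #include <stdio.h>
      #include <c6x.h>  /* declares the TSCL time-stamp counter register */

      unsigned int t_start, t_stop, t_elapsed;

      /* Any write to TSCL starts the free-running counter. */
      #define Start_profile() do { TSCL = 0; \
      t_start = TSCL; } while (0)

      #define Stop_profile() do { t_stop = TSCL; \
      t_elapsed = t_stop - t_start; \
      printf("cycles = %u\n", t_elapsed); } while (0)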

      CM
