    IDMA bandwidth


    This question is answered
    Posted by Yishay Hayardeni
    on Aug 04 2011 11:27 AM

    Hi,

    I'm using a C6678 device and have tested the IDMA transfer rate for L2 to L1D transfers (using IDMA1). As I understand it, the internal bus between the L2 and L1D memories is 256 bits wide and runs on the EMC clock, which is half the DSP clock. That gives a theoretical bandwidth of 16 GB/sec for a 1 GHz device.
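The arithmetic behind that 16 GB/sec figure can be checked in a few lines of C. This is only a sketch of the assumption stated in the question (a 256-bit path clocked at half of a 1 GHz DSP clock). The helper names `peak_bytes_per_sec` and `assumed_idma_peak` are made up for illustration, and the result is the assumed bus peak, not a figure confirmed for IDMA1.

```c
#include <assert.h>
#include <stdint.h>

/* Peak bandwidth of a bus in bytes per second, given its width in bits
 * and its clock in Hz. Hypothetical helper, not part of any TI library. */
uint64_t peak_bytes_per_sec(uint32_t bus_width_bits, uint64_t bus_clock_hz)
{
    assert(bus_width_bits % 8u == 0u);   /* whole bytes per bus beat */
    return (uint64_t)(bus_width_bits / 8u) * bus_clock_hz;
}

/* Assumptions taken from the question: 256-bit bus, EMC clock = DSP clock / 2. */
uint64_t assumed_idma_peak(uint64_t dsp_clock_hz)
{
    return peak_bytes_per_sec(256u, dsp_clock_hz / 2u);
}
```

For a 1 GHz device, `assumed_idma_peak(1000000000ULL)` returns 16000000000, i.e. 32 bytes per EMC cycle or 16 bytes per DSP cycle.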

    I ran several tests that transfer data from L2 to L1D using block sizes from 128 bytes to 2 KB, and I kept submitting transfers so that there was always one active and one pending. Timing the transfers, I measured a rate of only 3 GB/sec. All transfers were made at the highest configurable priority (priority 0). As far as I can tell, no other memory transactions were taking place (no active master peripherals, little cache traffic, etc.).

    Does this rate make sense? Do you have other figures?

    Thanks,

    Yishay

    All Replies
    • Chad Courtney
      Posted by Chad Courtney
      on Aug 15 2011 09:32 AM

      Yishay,

      I'm assuming you've configured L1D as part RAM and part cache for this (same for L2), and that you are checking the contents at the end to verify they landed correctly.

      A couple questions.

      1.) How are you capturing the timestamps?

      2.) Which timer are you using for the timestamps?

      3.) How are you calculating the throughput?

      4.) Can you provide some raw numbers from what you've observed?

      Best Regards,

      Chad

    • Yishay Hayardeni
      Posted by Yishay Hayardeni
      on Aug 17 2011 03:44 AM

      1,2. I'm saving the CNTLO register of Timer0. I found out that the system initializes this timer to use a clock 6 times slower than the CPU rate.

      3. I'm dividing the total cycle count by the total byte count.

      4. My sample transfers nine 1 KB blocks using the following code:

       

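      TIMER_TIC;                              /* timestamp before the first submit */
      for (size = 1024, count = 0; count < BLOCK_COUNT; count++)
      {
          hIdma->IDMA1_SOURCE = (uint32_t) L2Buff + size * count;
          hIdma->IDMA1_DEST = (uint32_t) L1DBuff + size * count;
          hIdma->IDMA1_COUNT = size;          /* writing COUNT submits the transfer */

          /* wait until the pending slot is free (PEND, bit 1) */
          while (hIdma->IDMA1_STAT & 0x00000002)
              ;
      }
      /* wait until no transfer is active or pending */
      while (hIdma->IDMA1_STAT)
          ;
      TIMER_TIC;                              /* timestamp after the last transfer completes */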

       

      The TIMER_TIC is defined as:

      #define TIMER_TIC do { \
          TimerVal[timerValIndex++] = BaseAddress[0].regs->CNTLO; \
          if (timerValIndex == TIME_VAL_LEN) timerValIndex = 0; \
      } while (0)

      I measured 2946 cycles for 9K bytes which results in 0.319661 cycles/byte
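To compare this measurement with the theoretical figure, it helps to turn it around into bytes per cycle and GB/sec. The sketch below does that conversion in plain C. The helper names are invented for this example, and the 1 GHz clock is the value assumed in the original question. It also shows how raw Timer0 counts would be scaled if they were read straight from a counter running at 1/6 of the CPU clock.

```c
#include <assert.h>
#include <stdint.h>

/* Scale raw timer ticks to CPU cycles when the timer runs at CPU clock / divider. */
uint64_t timer_ticks_to_cpu_cycles(uint64_t ticks, uint32_t divider)
{
    return ticks * divider;
}

/* Throughput in bytes per CPU cycle. */
double bytes_per_cycle(uint64_t bytes, uint64_t cycles)
{
    assert(cycles > 0u);
    return (double)bytes / (double)cycles;
}

/* Throughput in GB/sec (1 GB = 1e9 bytes) for a given CPU clock in Hz. */
double gbytes_per_sec(uint64_t bytes, uint64_t cycles, double cpu_clock_hz)
{
    return bytes_per_cycle(bytes, cycles) * cpu_clock_hz / 1e9;
}
```

With the figures above, `bytes_per_cycle(9 * 1024, 2946)` comes out at about 3.13 bytes/cycle, or roughly 3.1 GB/sec at 1 GHz.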

       

       

    • Chad Courtney
      Posted by Chad Courtney
      on Aug 17 2011 10:33 AM

      What are your compiler options? Are you using 'debug'? If so, remove that option, since it gives you the least efficient code.

      Where is the TimerVal[] data block?

      What exactly are the cache/SRAM settings?

      Where are L1DBuff and L2Buff located?

      I'd suggest using the TSC (the core timer, available through the CSL) for your timestamps; I'm not sure how many cycles your current timestamp code is spending. Also, grab the first timestamp just before writing to the IDMA1 count register, take another after exiting the for loop, and then another after the while (hIdma->IDMA1_STAT); loop.

      Basically we need to figure out what's consuming the time, because the IDMA has been shown to achieve full theoretical performance on internal testing.

      Here's a quick example of the TSC timer usage - it's a direct register read.

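              #include <ti/csl/csl_tsc.h>   /* CSL TSC API; the header path may differ in your CSL install */

              CSL_Uint64        counterVal;

              ...

              CSL_tscStart();               /* start the free-running time stamp counter */
              counterVal = CSL_tscRead();   /* read the 64-bit counter value */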

    • Yishay Hayardeni
      Posted by Yishay Hayardeni
      on Aug 18 2011 05:18 AM

      Chad,

      I'm using a Release build with -o3 optimization.

      The TimerVal[] data block is located in L1D.

      My cache settings are: L2 cache 128K, L1D cache 16K, L1P cache 16K.

      L1DBuff is in L1D and L2Buff is in L2. I also ran the whole routine from L1P so there would be no cache issues.

      Using the TSC did not change much. I measured with both the TSC and TIMER0 and got the same results. With the CSL, calling CSL_tscRead actually makes things a bit worse, since the code branches to the routine instead of inlining it.

      The timing for the transfer (using the TSC) is as follows. The whole transfer took 2895 cycles. The setup (writing the source and destination registers) took 50 cycles, some of which may be the TSC function call. The completion (from the end of the loop, after the last transfer has been issued as pending, to the end of the transfer) took 298 cycles.
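The three-phase breakdown reported here can be reproduced from four timestamps. The following is a minimal sketch, assuming 64-bit TSC readings taken at the start, just before the first IDMA1 count write, after the submit loop, and after the final status poll. The struct and function names are hypothetical and not part of the CSL.

```c
#include <assert.h>
#include <stdint.h>

/* Cycle counts for each phase of the timed transfer. */
typedef struct {
    uint64_t setup;  /* start -> first submit            */
    uint64_t loop;   /* first submit -> submit loop exit  */
    uint64_t drain;  /* loop exit -> all transfers done   */
    uint64_t total;  /* start -> all transfers done       */
} idma_phases;

/* Split a timed run into phases from four monotonically increasing timestamps. */
idma_phases split_phases(uint64_t t_start, uint64_t t_submit,
                         uint64_t t_loop_exit, uint64_t t_done)
{
    idma_phases p;
    assert(t_start <= t_submit && t_submit <= t_loop_exit && t_loop_exit <= t_done);
    p.setup = t_submit - t_start;
    p.loop  = t_loop_exit - t_submit;
    p.drain = t_done - t_loop_exit;
    p.total = t_done - t_start;
    return p;
}
```

Plugging in the numbers from this post (2895 total, 50 setup, 298 completion) leaves 2547 cycles inside the submit loop, which suggests that most of the time goes to waiting for the pending slot to free up rather than to setup or the final drain.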

      Yishay

       

       

       

    • Chad Courtney
      Posted by Chad Courtney
      on Aug 18 2011 09:34 AM

      Can you zip up and post your test code so I can take a look at it?

      Best Regards,

      Chad

    • Yishay Hayardeni
      Posted by Yishay Hayardeni
      on Aug 24 2011 17:17 PM

      Chad,

      Attached is the source code. 

      The memory sections mentioned in the code are configured as follows in the RTSC cfg file:

      Program.sectMap[".L2_test"] = {loadSegment: "L2SRAM", loadAlign:8}; /* L2 Edma test*/
      Program.sectMap[".L1D_test"] = {loadSegment: "L1DSRAM", loadAlign:8}; /* L1D Edma test */
      Program.sectMap[".L1P_test"] = {loadSegment: "L1PSRAM", loadAlign:8}; /* L1P test */

      L2SRAM, L1DSRAM and L1PSRAM are configured automatically when changing the cache sizes in the platform configuration. I used 128k for L2 Cache, and 16k for L1D and L1P caches.
      2627.TestIdma.zip
      Please tell me if you need more information.
      Thanks,
      Yishay
    • Ivan Krechetov
      Posted by Ivan Krechetov
      on Apr 05 2012 14:37 PM

      Yishay,

      I have run into the same problem on the C6678:

      about 1230 cycles to transfer 4 KB from L2 to L1 with IDMA1.

      Have you found a solution to the problem?

      Ivan

    • Steven Ji
      Posted by Steven Ji
      on Apr 20 2012 17:30 PM
      Verified Answer
      Verified by RandyP

      Yishay and Ivan,

      We have identified an issue with IDMA1 transfer performance on silicon revision 1.0 of the C66x. It has already been fixed in silicon revision 2.0.

      The L2 to L1D transfer rate is about 3.41 bytes/cycle in revision 1.0 and improves to about 7.67 bytes/cycle in revision 2.0, which essentially meets the theoretical 8 bytes/cycle.

      The L2 to L2 transfer rate is about 3.75 bytes/cycle in revision 1.0 and improves to about 7.76 bytes/cycle in revision 2.0, against a theoretical 8 bytes/cycle.

      One more example is the L2 fill, which runs at about 7.49 bytes/cycle in revision 1.0 and improves to about 15.44 bytes/cycle in revision 2.0, against a theoretical 16 bytes/cycle.
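To put the quoted rates side by side, the sketch below expresses each one as a percentage of its theoretical peak and computes the revision 2.0 over revision 1.0 speedup. The numbers are the ones given in this reply; the type and function names are invented for the example.

```c
#include <assert.h>

/* One row per transfer type, all rates in bytes per cycle. */
typedef struct {
    const char *path;
    double rev1_bpc;   /* measured on silicon revision 1.0 */
    double rev2_bpc;   /* measured on silicon revision 2.0 */
    double peak_bpc;   /* theoretical peak */
} idma_figures;

const idma_figures kFigures[3] = {
    { "L2 -> L1D", 3.41,  7.67,  8.0 },
    { "L2 -> L2",  3.75,  7.76,  8.0 },
    { "L2 fill",   7.49, 15.44, 16.0 },
};

/* Measured rate as a percentage of the theoretical peak. */
double efficiency_pct(double measured_bpc, double peak_bpc)
{
    assert(peak_bpc > 0.0);
    return 100.0 * measured_bpc / peak_bpc;
}

/* How much faster revision 2.0 is than revision 1.0 for one row. */
double speedup(const idma_figures *f)
{
    assert(f->rev1_bpc > 0.0);
    return f->rev2_bpc / f->rev1_bpc;
}
```

By these figures, revision 1.0 reaches roughly 43% to 47% of peak in all three cases, while revision 2.0 reaches roughly 96% to 97%, a speedup of a little over 2x.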

      A usage note for this issue will be added to the next release of the errata document. Please plan to use revision 2.0 of the C66x devices if IDMA1 transfer performance is critical in your design. Thanks.

      Sincerely,

      Steven

    • Ivan Krechetov
      Posted by Ivan Krechetov
      on Apr 21 2012 05:21 AM

      Steven, thanks!

    • Clemens Eisserer
      Posted by Clemens Eisserer
      on Jul 13 2012 12:59 PM

      Is there any way to determine the silicon revision of a C6678 chip soldered onto an EVM6678 board?

      I've just bought an EVM6678 development board (board revision 3.0), which I will use for my diploma thesis, and if that silicon revision has bandwidth issues I don't intend to use IDMA.

      Thank you in advance, Clemens

    • Steven Ji
      Posted by Steven Ji
      on Jul 13 2012 13:16 PM
      Verified Answer
      Verified by RandyP

      Clemens,

      Please take a look at the JTAGID register described in the C6678 data manual and errata documents.

      The VARIANT field (bits 31-28) shows the silicon revision of the device you are using.

      As mentioned in the errata document, VARIANT=0 means silicon revision 1.0 and VARIANT=1 means silicon revision 2.0.
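Decoding the field is a shift and a mask. The following is a minimal sketch that takes an already-read JTAGID value as input. The register address is not given in this thread, so reading the register is left to the caller (see the data manual), and the function names are hypothetical.

```c
#include <stdint.h>

#define JTAGID_VARIANT_SHIFT 28u    /* VARIANT occupies bits 31-28 */
#define JTAGID_VARIANT_MASK  0xFu

/* Extract the 4-bit VARIANT field from a JTAGID register value. */
uint32_t jtagid_variant(uint32_t jtagid)
{
    return (jtagid >> JTAGID_VARIANT_SHIFT) & JTAGID_VARIANT_MASK;
}

/* Map VARIANT to the silicon revision named in this thread.
 * Only the two values mentioned above are decoded. */
const char *silicon_revision(uint32_t jtagid)
{
    switch (jtagid_variant(jtagid)) {
    case 0u: return "1.0";
    case 1u: return "2.0";
    default: return "unknown";
    }
}
```

Any value whose top nibble is 0 decodes as revision 1.0 and any value whose top nibble is 1 decodes as revision 2.0; the lower 28 bits are ignored here.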

      Hope it helps.

      Sincerely,

      Steven

    • RandyP
      Posted by RandyP
      on Jul 13 2012 16:34 PM

      Clemens,

      Even if you have silicon 1.0, which is what is on my rev 1.0 EVM, the IDMA performance is very fast. It can be a significant help for some applications by relieving the DSP from having to move data between L1 and L2.

      IDMA is an underutilized feature of the C64x+, C674x, and C66x DSP cores. As far as I know, none of my customers use it, but instead just rely on the DSP's cache and EDMA to do their data movement.

      But there are certain applications that can fit data in multiple buffers in L1D SRAM, and can save larger buffers in L2 SRAM. Those can get the performance lift from running an IDMA1 to copy the old results from L1D to L2 and then the next input from L2 to L1D, all while the DSP is executing on the current input in L1D and writing the current output to L1D. In many cases, the output part can be placed in L2 without any performance loss, but that is very algorithm dependent.
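The buffer rotation described above can be sketched on a host machine. This is an illustration only: a blocking `memcpy` stands in for the IDMA1 transfers, so nothing actually overlaps here, and the block size, the function names and the in-place doubling kernel are all assumptions made for the example. On the DSP the two copies would be submitted to IDMA1 and would run while the kernel executes, with a status poll before the buffers are reused.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_WORDS 256u   /* hypothetical block size: 256 x 4 bytes = 1 KB */

/* Stand-in for an IDMA1 transfer (blocking here, asynchronous on the DSP). */
void idma1_copy_standin(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Hypothetical kernel: doubles every sample in place in the "L1D" buffer. */
void kernel_double(int32_t *buf, size_t n)
{
    size_t k;
    for (k = 0; k < n; k++)
        buf[k] *= 2;
}

/* Process nblocks blocks from l2_in to l2_out using two "L1D" buffers. */
void process_stream(const int32_t *l2_in, int32_t *l2_out, size_t nblocks)
{
    static int32_t l1d[2][BLOCK_WORDS];   /* ping and pong working buffers */
    const size_t bytes = BLOCK_WORDS * sizeof(int32_t);
    size_t i;

    if (nblocks == 0u)
        return;

    /* Prime the pipeline: fetch block 0 into buffer 0. */
    idma1_copy_standin(l1d[0], l2_in, bytes);

    for (i = 0; i < nblocks; i++) {
        const size_t cur = i & 1u;    /* buffer the CPU works on */
        const size_t nxt = cur ^ 1u;  /* buffer the DMA works on */

        /* Copy the previous block's results from L1D back to L2 ... */
        if (i > 0u)
            idma1_copy_standin(l2_out + (i - 1u) * BLOCK_WORDS, l1d[nxt], bytes);
        /* ... then fetch the next input block from L2 into the same buffer. */
        if (i + 1u < nblocks)
            idma1_copy_standin(l1d[nxt], l2_in + (i + 1u) * BLOCK_WORDS, bytes);

        /* The CPU processes the current block in L1D. */
        kernel_double(l1d[cur], BLOCK_WORDS);
    }

    /* Write back the results of the last block. */
    idma1_copy_standin(l2_out + (nblocks - 1u) * BLOCK_WORDS,
                       l1d[(nblocks - 1u) & 1u], bytes);
}
```

The order inside the loop follows the description above: old results out first, then the next input in, both targeting the buffer the CPU is not currently using.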

      Basically, I am suggesting that you not give up on IDMA1 just because you "only" get 3GB/s. That is still pretty good.

      Regards,
      RandyP

    • Clemens Eisserer
      Posted by Clemens Eisserer
      on Jul 18 2012 01:46 AM

      Hi Randy,

      I verified that my EVM6678 has revision 2.0 silicon, so I'll give IDMA a try, hoping to get 5-10% more throughput for memory-bound algorithms by hiding L2 SRAM latencies.

      Thanks, Clemens
