    Time calculation

    This question is answered
    Posted by Arun
    on Apr 23 2012 18:54 PM
    Intellectual 840 points

    Hello,

    I am trying to measure the performance of a simple matrix-matrix multiplication. I read TSCL and TSCH to count clock cycles, and from that I work out how long the nested loop takes. My code is as follows:

    A = (double*)malloc(dimension*dimension*sizeof(double));
    B = (double*)malloc(dimension*dimension*sizeof(double));
    C = (double*)malloc(dimension*dimension*sizeof(double));

    for(i = 0; i < dimension; i++)
    {
        for(j = 0; j < dimension; j++)
        {
            A[dimension*i+j] = (i+j);
            B[dimension*i+j] = (i-j);
            C[dimension*i+j] = 0.0;
        }
    }

    TSCL = 0;
    TSCH = 0;
    t_start_l = TSCL;
    t_start_h = TSCH;

    for(i = 0; i < dimension; i++)
    {
        for(j = 0; j < dimension; j++)
        {
            tmp = 0.0;
            for(k = 0; k < dimension; k++)
            {
                tmp += A[dimension*i+k] * B[dimension*k+j];
                C[dimension*i+j] = tmp;
            }
        }
    }

    t_stop_l = TSCL;
    t_stop_h = TSCH;
    t_overhead_l = t_stop_l - t_start_l;
    t_overhead_h = t_stop_h - t_start_h;

    I then take the number of clock cycles to be Delta = t_overhead_l - t_overhead_h. Below are some values I get with no optimization and no special build settings.
    [C66xx_0] Enter the size of dimension : 10
    [C66xx_0] Time Taken during Matrix multiplication is: , t_overhead_h = 0 t_overhead_l=59958
    [C66xx_1] Enter the size of dimension : 100
    [C66xx_1] Time Taken during Matrix multiplication is: , t_overhead_h = 0 t_overhead_l=64627173
    [C66xx_2] Enter the size of dimension : 500
    [C66xx_2] Time Taken during Matrix multiplication is: , t_overhead_h = 2 t_overhead_l=-547958635
    [C66xx_3] Enter the size of dimension : 1000
    [C66xx_3] Time Taken during Matrix multiplication is: , t_overhead_h = 15 t_overhead_l=-124803967
    [C66xx_4] Enter the size of dimension : 1024
    [C66xx_4] Time Taken during Matrix multiplication is: , t_overhead_h = 16 t_overhead_l=320899090
    [C66xx_5] Enter the size of dimension : 1500

    My questions are:
    1. Why are the values negative for dimension sizes 500 and 1000, but not for 1024?
    2. Why does the system hang at a dimension size of 1500? It gives me no error or message for half an hour.

    Apart from this, is there any other way to measure the time? If I enable the clock from CCS, it gives me values that differ from the ones above. Are those CPU values for the whole program if I select CPU execution cycles?
    
    
    Thanks and Regards,
    Arun 

     

    All Replies
    • Arun
      Posted by Arun
      on Apr 23 2012 22:54 PM
      Intellectual 840 points

      Hello,

      Sorry for posting two questions in a row. I have also tried a blocked version of the matrix multiplication, and I have some doubts about it as well. My blocked code is as follows:

       

      void do_mult(int block_i, int block_j, int block_k, double *A, double *B,double *C)
      {
          int i, j, k;
          double tmp;

          for (i=block_i; i < block_i+block; i++)
          {
              for (j=block_j; j < block_j+block; j++)
              {
                  tmp = 0.0;
                  for (k=block_k; k < block_k+block; k++)
                  {
                      tmp += A[dimension*i+k] * B[dimension*k+j];
                      C[dimension*i+j] += tmp;
                  }
              }
          }
      }

      A = (double*)malloc(dimension*dimension*sizeof(double));
      B = (double*)malloc(dimension*dimension*sizeof(double));
      C = (double*)malloc(dimension*dimension*sizeof(double));

      for(i = 0; i < dimension; i++)
      {
          for(j = 0; j < dimension; j++)
          {
             A[dimension*i+j] = (i+j);
             B[dimension*i+j] = (i-j);
             C[dimension*i+j] = 0.0;
          }
      }

      //begin();
      TSCL = 0;
      TSCH = 0;
      t_start_l = TSCL;
      t_start_h = TSCH;

      for(i = 0; i < nr_blocks; i++)
      {
         block_i = i * block;
          for(j = 0; j < nr_blocks; j++)
         {
            block_j = j * block;
            for(k = 0; k < nr_blocks; k++)
            {
               block_k = k * block;
               do_mult(block_i, block_j, block_k, A, B, C);
            }
        }
      }
      //end(&s, &ns);
      t_stop_l = TSCL;
      t_stop_h = TSCH;
      t_overhead_l = t_stop_l - t_start_l;
      t_overhead_h = t_stop_h - t_start_h;

      printf("Number of Clock Cycle Taken during Matrix multiplication is: %d\t\n",t_overhead_l-t_overhead_h);

      If I run this code, I get the same cycle count whatever the dimension size, and the system again hangs for dimension sizes above 1024.

      The output is as follows:

      [C66xx_0] Enter the number of dimension : 10
      [C66xx_0] Number of Clock Cycle Taken during Matrix multiplication is: 25
      [C66xx_1] Enter the number of dimension : 50
      [C66xx_1] Number of Clock Cycle Taken during Matrix multiplication is: 25
      [C66xx_2] Enter the number of dimension : 100
      [C66xx_2] Number of Clock Cycle Taken during Matrix multiplication is: 25
      [C66xx_3] Enter the number of dimension : 500
      [C66xx_3] Number of Clock Cycle Taken during Matrix multiplication is: 25
      [C66xx_4] Enter the number of dimension : 1000
      [C66xx_4] Number of Clock Cycle Taken during Matrix multiplication is: 25
      [C66xx_5] Enter the number of dimension : 1024
      [C66xx_5] Number of Clock Cycle Taken during Matrix multiplication is: 25
      [C66xx_6] Enter the number of dimension : 1200

      Where am I going wrong?

      Thanks and Regards,
      Arun 

    • one and zero
      Posted by one and zero
      on Apr 24 2012 02:48 AM
      Expert 6825 points

      Hi Arun,

      TSCH and TSCL together represent one 64-bit value; each register is 32 bits wide.

      So your calculation t_overhead_l - t_overhead_h is wrong.

      Kind regards,

      one and zero

       

      Please click the Verify Answer button on this post if it answers your question.

      You can also follow me on Twitter: http://twitter.com/oneandzeroTI

      Do you want to read interesting multicore articles? Check out our Multicore Mix

       

    • one and zero
      Posted by one and zero
      on Apr 24 2012 04:17 AM
      Verified Answer
      Verified by Arun
      Expert 6825 points

      Hi Arun,

      Here's an example:

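      #include <stdio.h>
      #include <c6x.h>   /* declares the TSCL, TSCH and DNUM control registers */

      int main(void) {
          unsigned int stampl1, stampl2, stamph1, stamph2;
          long long time1, time2;

          TSCL = 0;            /* any write to TSCL starts the counter */
          stampl1 = TSCL;      /* read TSCL first: this latches the upper half into TSCH */
          stamph1 = TSCH;
          printf("Hi: %u \n", DNUM);
          stampl2 = TSCL;
          stamph2 = TSCH;

          /* combine the two 32-bit halves into one 64-bit time stamp */
          time1 = ((long long)stamph1 << 32) + (long long)stampl1;
          time2 = ((long long)stamph2 << 32) + (long long)stampl2;

          printf("printf took: %lld cycles \n", time2 - time1);
          return 0;
      }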

      Alternatively, you could use the CSL:

          CSL_Uint64        counterVal;

          ...

          CSL_tscStart();

          ...

          counterVal = CSL_tscRead();

      Kind regards,

      one and zero

       

    • Wenzhong Liu
      Posted by Wenzhong Liu
      on Apr 24 2012 09:05 AM
      Intellectual 965 points

      Arun,

      What do you mean by "system hang"? Did you see an error message from CCS saying that the CPU pipeline is stalled? Or did your code simply run wild and never complete? In that case you should still be able to issue a HALT command from CCS and check whether your code is still performing the calculation.

      If it is the second case, can you attach the linker command file (.cmd file) so that I can take a look? It would be even better if you could attach the entire project.

       

      Regards!

      Wen

       

    • Arun
      Posted by Arun
      on Apr 24 2012 11:46 AM
      Intellectual 840 points

      Hello One and Zero,

      Thanks for your reply. Yes, I know that TSCH and TSCL together represent a 64-bit value and that each register is 32 bits wide; I just was not thinking in that direction. Thanks for pointing it out.
      I have tried it the way you described, and now I get reasonable output with no negative values. It makes sense.

      Thanks and Regards,
      Arun 

       

    • Arun
      Posted by Arun
      on Apr 24 2012 12:11 PM
      Intellectual 840 points

      Hello Wenzhong Liu,

      Thanks for replying, and my apologies for not being clear. By "hangs" I mean that when I run my code on one core, CCS shows it as running, but I get no output on the console and no error message. It just seems to keep running. Below is a screenshot of how my system looks:




    • Wenzhong Liu
      Posted by Wenzhong Liu
      on Apr 24 2012 14:06 PM
      Intellectual 965 points

      Arun,

      The problem you have seems to be a software issue. If you can send me your code (the entire project, especially the .cmd file), I'd like to take a look and debug it on my EVM board.

      My guess is that your test runs out of heap memory and one of your malloc() calls fails because there is not enough memory. So check the size of the heap.

      Another question: during the iterations, do you call free() to release the memory?

       

      Regards!

      Wen

    • Arun
      Posted by Arun
      on Apr 24 2012 15:54 PM
      Intellectual 840 points

      Hello Wen,

      Please see the attachment for my whole project. For your information, it is just a simple matrix-matrix multiplication with no optimization at all. I am calling

      free(A);
      free(B);
      free(C);
      but not inside the iterations. Anyway, have a look at it. Right now I am basically stuck on the blocked version of the same multiplication: it always gives me the same number of clock cycles for any dimension. Let me get my hands dirty with it first; if I get nowhere, I will trouble you again.

      3683.MAT_MUL_ARUN.zip


      Thanks and Regards,
      Arun
    • Arun
      Posted by Arun
      on Apr 24 2012 16:40 PM
      Intellectual 840 points

      Hello Wen,

      My apologies for posting twice in a row. I have tried the same thing with the blocked version of my matrix-matrix multiplication. I cannot figure out why it takes the same number of clock cycles for every dimension, and it also shows the same behaviour beyond a dimension size of 1024. I thought it might give you another angle on where I am going wrong.

      8625.Mul_Arun_Blocking.zip

      Thanks and Regards,
      Arun

    • Arun
      Posted by Arun
      on Apr 25 2012 01:06 AM
      Intellectual 840 points

      Hello Wen,

      I have changed the memory placement through the RTSC platform and rerun the same simple matrix-matrix multiplication code. I can now run beyond a dimension size of 1024, but I still cannot run 1500. It should run at that size as well when I put everything in MSMCSRAM.

      Thanks and Regards,
      Arun 

    • Wenzhong Liu
      Posted by Wenzhong Liu
      on Apr 25 2012 11:30 AM
      Intellectual 965 points

      Arun,

      I created a standalone project (not using RTSC) based on the .c file you sent. I did not change anything in your .c file, but I added another file, c6678.cmd, with a big heap (-heap 0x4000000). Here are the test results I got:

      [C66xx_0] Enter the size of dimension :  150    // Note: used the Release version

      [C66xx_0] Matrix Multiplication took: 477836025 cycles

      [C66xx_0] Enter the size of dimension :  150  // Note: used the Debug version

      [C66xx_0] Matrix Multiplication took: 1234223435 cycles

      [C66xx_0] Enter the size of dimension :  1024    // Note: used the Release version

      [C66xx_0] Matrix Multiplication took: 202770756487 cycles

      [C66xx_0] Enter the size of dimension :  1500   // Note: used the Debug version

      [C66xx_0] Matrix Multiplication took: 1260036672163 cycles

       

      When I used a small heap, I also saw my test run wild with dimension = 1500.

       

      Back to your code: I looked at what you sent, and I see the following potential issues:

      1. Your code uses RTSC, which might also use the TSCL/TSCH registers and cause a conflict when reading TSCL/TSCH.

      2. Your code uses RTSC with the memory map left at its default. From what I read, its run-time heap is in L2SRAM with a size of 4096.

       

      By the way, from the test results you can see that the dimension = 1500 test takes a very long time to complete. Since matrix-matrix multiplication is a very typical DSP algorithm, DSPLib already covers it with much better performance. In your real application, you should call the DSPLib function directly.

      Regards!

      Wen

    • Arun
      Posted by Arun
      on Apr 25 2012 12:11 PM
      Intellectual 840 points

      Hello Wen,

      Thanks for your reply, and apologies for my limited knowledge. I have some doubts about a few points:

      1. How did you add the new .cmd file? Isn't it generated automatically? Where did you change the heap size? And does this mean that if we increase the heap size, it does not matter whether our code and data are in L2SRAM, MSMCSRAM or even DDR3, and everything should run? Am I right?

      2. You wrote that with a small heap you also had problems with matrix size 1500, but you got the results above with the standalone version (no RTSC project). Does that mean the issue is with RTSC-based projects, or with the heap?

      3. Will it make any difference if I debug with optimization level 3 (fully optimized), or if I disable intrinsics?

      4. What kind of conflict between RTSC-based projects and TSCL/TSCH are you referring to?

      Please elaborate.

      Thanks and Regards,
      Arun 

    • Wenzhong Liu
      Posted by Wenzhong Liu
      on Apr 25 2012 12:48 PM
      Intellectual 965 points

      7128.MAT_MUL_WEN.zip

      Arun,

      I am sorry, I forgot to attach the project I created.

      To answer your questions:

      1. See the project I attached. The file c6678.cmd in the project defines the memory map as well as the heap size used to build the .out (you can play with it to see how the test behaves).

      2. See item 1. You can try a smaller heap by changing the -heap line in the .cmd file.

      3. I haven't tried different optimization levels (I only tried the default settings for the Debug and Release builds).

      4. TSCH works this way: whenever TSCL is read, the current upper 32 bits of the 64-bit counter are latched into TSCH. So if both RTSC and your test code read TSCL/TSCH in one application, the read of TSCH might be unreliable, since it could hold a value latched when the other reader read TSCL.

       

      Regards!

      Wen

       

    • Arun
      Posted by Arun
      on Apr 25 2012 12:58 PM
      Intellectual 840 points

      Hello Wen,

      Thanks for replying. Given the conflict you describe between RTSC and TSCL/TSCH, what other way is there to measure clock cycles, elapsed time, or performance in general? Can you suggest something?

      Also, can you explain the DSP library a little more? Do I need to include it in my project? I haven't used it before, so I am not sure how it works.
      In the meantime, I am looking at the project you attached and will get back to you with my doubts and queries.

      Thanks and regards,
      Arun 

    • Arun
      Posted by Arun
      on Apr 25 2012 13:16 PM
      Intellectual 840 points

      Wen,

      I have looked through the project you sent and could not find a target configuration file (.ccxml). Did you create one with your project? I imported your project for my 6678 board, and when I debug it, it takes a very long time even on just one core.

      Thanks! 

