Time calculation

Hello, 

I am trying to measure the performance of a simple matrix-matrix multiplication. I am using TSCL and TSCH to count clock cycles, and from that I calculate how much time the nested loop takes. My code is as follows:

A = (double*)malloc(dimension*dimension*sizeof(double));
B = (double*)malloc(dimension*dimension*sizeof(double));
C = (double*)malloc(dimension*dimension*sizeof(double));

for(i = 0; i < dimension; i++)
{
    for(j = 0; j < dimension; j++)
    {
        A[dimension*i+j] = (i+j);
        B[dimension*i+j] = (i-j);
        C[dimension*i+j] = 0.0;
    }
}

TSCL = 0;
TSCH = 0;
t_start_l = TSCL;
t_start_h = TSCH;

for(i = 0; i < dimension; i++)
{
    for(j = 0; j < dimension; j++)
    {
        tmp = 0.0;
        for(k = 0; k < dimension; k++)
        {
            tmp += A[dimension*i+k] * B[dimension*k+j];
            C[dimension*i+j] = tmp;
        }
    }
}

t_stop_l = TSCL;
t_stop_h = TSCH;
t_overhead_l = t_stop_l - t_start_l;
t_overhead_h = t_stop_h - t_start_h;

Now, the number of clock cycles is Delta = t_overhead_l - t_overhead_h. Below are some values I am getting with no optimization and no special build properties.
[C66xx_0] Enter the size of dimension : 10
[C66xx_0] Time Taken during Matrix multiplication is: , t_overhead_h = 0 t_overhead_l=59958
[C66xx_1] Enter the size of dimension : 100
[C66xx_1] Time Taken during Matrix multiplication is: , t_overhead_h = 0 t_overhead_l=64627173
[C66xx_2] Enter the size of dimension : 500
[C66xx_2] Time Taken during Matrix multiplication is: , t_overhead_h = 2 t_overhead_l=-547958635
[C66xx_3] Enter the size of dimension : 1000
[C66xx_3] Time Taken during Matrix multiplication is: , t_overhead_h = 15 t_overhead_l=-124803967
[C66xx_4] Enter the size of dimension : 1024
[C66xx_4] Time Taken during Matrix multiplication is: , t_overhead_h = 16 t_overhead_l=320899090
[C66xx_5] Enter the size of dimension : 1500

Now, my questions are as follows:
1. Why are the values negative for dimension sizes 500 and 1000, but not for 1024?
2. Why does the system hang at dimension size 1500? It gives me no error or message for half an hour.

Apart from this, is there any other way to measure the time? If I enable the clock from CCS, it gives me values different from the ones I am getting here. Are those CPU values for the whole program if I select CPU execution cycles?

Thanks and Regards,
Arun 

 

  • Hello,

    Sorry for posting two questions in a row. I have also tried the matrix multiplication in blocked form, and I have some doubts there as well. My blocked code is as follows:

     

    void do_mult(int block_i, int block_j, int block_k, double *A, double *B,double *C)
    {
        int i, j, k;
        double tmp;

        for (i=block_i; i < block_i+block; i++)
        {
            for (j=block_j; j < block_j+block; j++)
            {
                tmp = 0.0;
                for (k=block_k; k < block_k+block; k++)
                {
                    tmp += A[dimension*i+k] * B[dimension*k+j];
                    C[dimension*i+j] += tmp;
                }
            }
        }
    }

    A = (double*)malloc(dimension*dimension*sizeof(double));
    B = (double*)malloc(dimension*dimension*sizeof(double));
    C = (double*)malloc(dimension*dimension*sizeof(double));

    for(i = 0; i < dimension; i++)
    {
        for(j = 0; j < dimension; j++)
        {
           A[dimension*i+j] = (i+j);
           B[dimension*i+j] = (i-j);
           C[dimension*i+j] = 0.0;
        }
    }

    //begin();
    TSCL = 0;
    TSCH = 0;
    t_start_l = TSCL;
    t_start_h = TSCH;

    for(i = 0; i < nr_blocks; i++)
    {
        block_i = i * block;
        for(j = 0; j < nr_blocks; j++)
        {
            block_j = j * block;
            for(k = 0; k < nr_blocks; k++)
            {
                block_k = k * block;
                do_mult(block_i, block_j, block_k, A, B, C);
            }
        }
    }
    //end(&s, &ns);
    t_stop_l = TSCL;
    t_stop_h = TSCH;
    t_overhead_l = t_stop_l - t_start_l;
    t_overhead_h = t_stop_h - t_start_h;

    printf("Number of Clock Cycle Taken during Matrix multiplication is: %d\t\n",t_overhead_l-t_overhead_h);

    Now, when I run this code I get the same cycle count whatever the dimension size, and the system again hangs at dimension sizes above 1024.

    As follows:

    [C66xx_0] Enter the number of dimension : 10
    [C66xx_0] Number of Clock Cycle Taken during Matrix multiplication is: 25
    [C66xx_1] Enter the number of dimension : 50
    [C66xx_1] Number of Clock Cycle Taken during Matrix multiplication is: 25
    [C66xx_2] Enter the number of dimension : 100
    [C66xx_2] Number of Clock Cycle Taken during Matrix multiplication is: 25
    [C66xx_3] Enter the number of dimension : 500
    [C66xx_3] Number of Clock Cycle Taken during Matrix multiplication is: 25
    [C66xx_4] Enter the number of dimension : 1000
    [C66xx_4] Number of Clock Cycle Taken during Matrix multiplication is: 25
    [C66xx_5] Enter the number of dimension : 1024
    [C66xx_5] Number of Clock Cycle Taken during Matrix multiplication is: 25
    [C66xx_6] Enter the number of dimension : 1200

    Where am I going wrong?

    Thanks and Regards,
    Arun 
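
    One thing to check in do_mult above, separately from the timing: the line C[dimension*i+j] += tmp sits inside the k loop, so every running partial sum is added to C rather than one sum per (i, j) pair. Below is a minimal sketch of the blocked kernel with that accumulation moved after the k loop. It assumes C starts zeroed and that block divides dimension evenly. The names do_mult_block and blocked_matmul are made up for illustration, and dimension and block are passed as parameters only to keep the sketch self-contained.

    ```c
    #include <stddef.h>

    /* Multiply one block x block tile: C[i][j] += sum over k of A[i][k] * B[k][j].
     * The partial sum is added to C once per (i, j), after the k loop finishes. */
    void do_mult_block(size_t dimension, size_t block,
                       size_t block_i, size_t block_j, size_t block_k,
                       const double *A, const double *B, double *C)
    {
        size_t i, j, k;
        for (i = block_i; i < block_i + block; i++) {
            for (j = block_j; j < block_j + block; j++) {
                double tmp = 0.0;
                for (k = block_k; k < block_k + block; k++)
                    tmp += A[dimension * i + k] * B[dimension * k + j];
                C[dimension * i + j] += tmp;
            }
        }
    }

    /* Blocked C = A * B; assumes C is zeroed and block divides dimension. */
    void blocked_matmul(size_t dimension, size_t block,
                        const double *A, const double *B, double *C)
    {
        size_t nr_blocks = dimension / block;
        size_t i, j, k;
        for (i = 0; i < nr_blocks; i++)
            for (j = 0; j < nr_blocks; j++)
                for (k = 0; k < nr_blocks; k++)
                    do_mult_block(dimension, block, i * block, j * block,
                                  k * block, A, B, C);
    }
    ```

    A count that stays at 25 cycles for every size is too small to cover even a 10 x 10 multiplication, so it may also be worth printing nr_blocks and block (their setup is not shown in the post) to confirm that the timed loops actually run.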

  • Hi Arun,

    TSCH and TSCL together represent one 64-bit value; each register is 32 bits wide.

    So your calculation t_overhead_l - t_overhead_h is wrong.

    Kind regards,

    one and zero

  • Hi Arun,

    here's an example:

    #include <stdio.h>
    #include <c6x.h>

    int main(void) {
        unsigned int stampl1, stampl2, stamph1, stamph2;
        unsigned long long time1, time2;

        TSCL = 0;           /* any write to TSCL starts the counter */
        stampl1 = TSCL;     /* read TSCL first: this latches TSCH */
        stamph1 = TSCH;
        printf("Hi: %u \n", DNUM);
        stampl2 = TSCL;
        stamph2 = TSCH;

        /* combine the two 32-bit halves into one 64-bit count */
        time1 = ((unsigned long long)stamph1 << 32) + stampl1;
        time2 = ((unsigned long long)stamph2 << 32) + stampl2;

        printf("printf took: %llu cycles \n", time2 - time1);
        return 0;
    }

    Or, alternatively, you could use the CSL:

    CSL_Uint64        counterVal;

    ...

    CSL_tscStart();

    ...

    counterVal = CSL_tscRead();
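
    As a worked check on the readings posted earlier in the thread, the sketch below joins a high and a low word the same way as the TSCL/TSCH example above, in plain host C. The helper names tsc_combine and prints_negative are made up for illustration, and the arithmetic assumes the low word at the stop reading was not smaller than at the start reading (otherwise one borrow of 2^32 has to be subtracted).

    ```c
    #include <stdint.h>

    /* Hypothetical helper: join two 32-bit halves into one 64-bit count. */
    uint64_t tsc_combine(uint32_t hi, uint32_t lo)
    {
        return ((uint64_t)hi << 32) | lo;
    }

    /* Nonzero when a 32-bit word has bit 31 set, i.e. when printing it
     * through a signed int with %d shows a negative number. */
    int prints_negative(uint32_t lo)
    {
        return (lo & 0x80000000u) != 0;
    }
    ```

    With the posted values, tsc_combine(2, (uint32_t)-547958635) gives 12336943253 cycles for dimension 500, and tsc_combine(16, 320899090) gives 69040375826 cycles for dimension 1024. The low words for 500 and 1000 printed as negative only because they had bit 31 set; the low word for 1024 happened to be below 2^31.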

    Kind regards,

    one and zero

  • Arun,

    What do you mean by "system hang"? Did you see an error message from CCS saying that the CPU pipeline is stalled? Or did your code simply run wild and never complete? (In that case you should still be able to issue a Halt command from CCS and check whether your code is still performing the calculation.)

    If it is the second case, can you attach the linker command file (.cmd file) so that I can take a look? It would be even better if you could attach the entire project.

     

    Regards!

    Wen

     

  • Hello One and Zero,

    Thanks for your reply. Yes, I know that TSCH and TSCL together represent a 64-bit value and that each register is 32 bits; I just was not thinking in that direction. Thanks for pointing it out.
    I have tried it the way you suggested, and now I am getting reasonable output with no negative values. It makes sense.

    Thanks and Regards,
    Arun 

     

  • Hello Wenzhongliu,

    Thanks for replying. My apologies for not being clear. By "hangs" I mean that when I run my code on one core, it shows as running, but I get no output on the console and no error message. It just seems to keep running. Below is a screenshot of how my system looks:




  • Arun,

    The problem you have seems to be a software issue. If you can send me your code (the entire project, especially the .cmd file), I'd like to take a look and debug it on my EVM board.

    My guess is that your test runs out of heap memory, and one of your malloc() calls fails because there is not enough memory. So check the size of the heap.

    Another question: during the iterations, do you call free() to release memory?
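
    The malloc() check suggested above can be sketched as follows. This is a minimal host-side illustration, not code from the attached project; the helper names alloc_matrix and alloc_matrices are made up. The point is only that each returned pointer is tested before the matrices are touched.

    ```c
    #include <stdint.h>
    #include <stdlib.h>

    /* Allocate one dimension x dimension matrix of doubles.
     * Returns NULL on overflow or when the heap cannot satisfy the request. */
    double *alloc_matrix(size_t dimension)
    {
        if (dimension == 0)
            return NULL;
        /* Guard dimension * dimension * sizeof(double) against size_t overflow. */
        if (dimension > SIZE_MAX / dimension / sizeof(double))
            return NULL;
        return malloc(dimension * dimension * sizeof(double));
    }

    /* Allocate A, B and C together; on any failure free what was obtained
     * and return -1 so the caller can report it instead of running on NULL. */
    int alloc_matrices(size_t dimension, double **A, double **B, double **C)
    {
        *A = alloc_matrix(dimension);
        *B = alloc_matrix(dimension);
        *C = alloc_matrix(dimension);
        if (*A == NULL || *B == NULL || *C == NULL) {
            free(*A);
            free(*B);
            free(*C);
            *A = *B = *C = NULL;
            return -1;
        }
        return 0;
    }
    ```

    A caller would print an error and stop when alloc_matrices() returns -1, which turns a silent hang into a visible message on the console.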

     

    Regards!

    Wen

  • Hello Wen,

    Please see the attachment for my whole project. For your information, I am just testing a simple matrix-matrix multiplication, with no optimizations at all. And I am using

    free(A);
    free(B);
    free(C);
    but not inside the iterations. Anyway, have a look at it. Right now I am stuck on the blocked version of the same matrix multiplication: it always gives me the same number of clock cycles for any dimension. Let me get my hands dirty on it; if I don't get anywhere, I will trouble you guys again.

    3683.MAT_MUL_ARUN.zip


    Thanks and Regards,
    Arun
  • Hello Wen,

    My apologies for posting twice in a row. I have tried the same thing with the blocked version of my matrix multiplication. I cannot figure out why it takes the same number of clock cycles for any dimension, and it also shows the same behavior above dimension size 1024. I thought it might give you another view of where I am going wrong.

    8625.Mul_Arun_Blocking.zip

    Thanks and Regards,
    Arun

  • Hello Wen,

    I have changed the memory through the RTSC platform and run the same simple matrix multiplication code. I can now run dimension sizes above 1024, but I am still not able to run 1500. It should run at that dimension size as well when I put everything in MSMCSRAM.

    Thanks and Regards,
    Arun 

  • Arun,

    I created a standalone (non-RTSC) project based on the .c file you sent. I did not change anything in your .c file, but I added another file, c6678.cmd, with a big heap (-heap 0x4000000). Here are the test results I got:

    [C66xx_0] Enter the size of dimension :  150    // Note: used the Release version

    [C66xx_0] Matrix Multiplication took: 477836025 cycles

    [C66xx_0] Enter the size of dimension :  150  // Note: used the Debug version

    [C66xx_0] Matrix Multiplication took: 1234223435 cycles

    [C66xx_0] Enter the size of dimension :  1024    // Note: used the Release version

    [C66xx_0] Matrix Multiplication took: 202770756487 cycles

    [C66xx_0] Enter the size of dimension :  1500   // Note: used the Debug version

    [C66xx_0] Matrix Multiplication took: 1260036672163 cycles

     

    When I used a small heap, I also saw my test run wild with dimension = 1500.

     

    Back to your code: I looked at the code you sent, and I see the following potential issues:

    1. Your code uses RTSC, which might use the TSCL/TSCH registers too and cause a conflict when reading TSCL/TSCH.

    2. Your code uses RTSC with the memory map left at its default configuration. From what I read, its run-time heap is in L2SRAM space with size 4096.

     

    By the way, from the test results you can see that the dimension = 1500 test takes a very long time to complete. Since matrix-matrix multiplication is a very typical DSP algorithm, DSPLib already covers it with much better performance. In your real application, you should call the DSPLib function directly.
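
    To put the cycle counts above in wall-clock terms, here is a rough conversion sketch. The 1 GHz core clock used in the note below is an assumption (the thread does not state the clock), the function names are made up for illustration, and the operation count of 2*n^3 assumes one multiply and one add per innermost-loop iteration of the naive algorithm.

    ```c
    #include <stdint.h>

    /* Seconds represented by a cycle count at the given core clock in Hz. */
    double cycles_to_seconds(uint64_t cycles, double clock_hz)
    {
        return (double)cycles / clock_hz;
    }

    /* Floating-point operations in a naive n x n matrix multiply:
     * one multiply and one add per innermost-loop iteration. */
    double matmul_flops(uint64_t n)
    {
        return 2.0 * (double)n * (double)n * (double)n;
    }

    /* Achieved rate in GFLOP/s for an n x n multiply that took `cycles`. */
    double matmul_gflops(uint64_t n, uint64_t cycles, double clock_hz)
    {
        return matmul_flops(n) / cycles_to_seconds(cycles, clock_hz) / 1e9;
    }
    ```

    Under the 1 GHz assumption, the 1024 Release run (202770756487 cycles) corresponds to about 203 seconds and roughly 0.011 GFLOP/s, and the 1500 Debug run (1260036672163 cycles) to about 1260 seconds, or 21 minutes.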

    Regards!

    Wen

  • Hello Wen,

    Thanks for your reply. My apologies for my limited knowledge. I have doubts on a few points here.

    1. How did you add the new .cmd file? I mean, isn't it generated automatically? Where did you change the heap size? And does this mean that if we increase the heap size, it doesn't matter where our code and data are, whether in L2SRAM, MSMCSRAM or even DDR3? Everything should run, am I right?

    2. You wrote that with a small heap you also saw problems from matrix size 1500 onwards, but you got the results above on the standalone (non-RTSC) project. Does that mean the issue is with RTSC-based projects, or with the heap?

    3. Will it make any difference if I build with optimization level 3 (fully optimized), or if I disable intrinsics?

    4. What kind of conflict are you referring to between RTSC-based projects and TSCL/TSCH?

    Please Elaborate.

    Thanks and Regards,
    Arun 

  • 7128.MAT_MUL_WEN.zip

    Arun,

    I am sorry, I forgot to attach the project I created.

    To answer your questions:

    1. See the attached project. The file c6678.cmd in the project defines the memory map to use, as well as the heap size, for building the .out (you can play with it to see how the test behaves).

    2. See item 1. You can try a smaller heap size by changing the -heap line in the .cmd file.

    3. I haven't tried different optimization levels (I only tried the default settings for the Debug and Release builds).

    4. TSCH works this way: whenever TSCL is read, the current upper 32 bits of the 64-bit counter are latched into TSCH. So if both RTSC and your test code read TSCL/TSCH in one application, the TSCH read might be unreliable, since TSCH could have latched its value because someone else read the TSCL register.
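
    The latch behavior in item 4 can be illustrated with a toy model in plain host C. This is only a simulation of the behavior described above, not an access to the real registers; the tsc_model type and the model_* function names are invented for the sketch.

    ```c
    #include <stdint.h>

    /* Toy host-side model of the counter described above; not the real
     * registers. Reading the low word latches the high word as a side effect. */
    typedef struct {
        uint64_t counter;     /* free-running 64-bit count */
        uint32_t latched_hi;  /* snapshot taken at the last low-word read */
    } tsc_model;

    /* Model of a TSCL read: return the low word and latch the high word. */
    uint32_t model_read_tscl(tsc_model *t)
    {
        t->latched_hi = (uint32_t)(t->counter >> 32);
        return (uint32_t)t->counter;
    }

    /* Model of a TSCH read: return whatever the last TSCL read latched. */
    uint32_t model_read_tsch(const tsc_model *t)
    {
        return t->latched_hi;
    }

    /* Read both halves back to back, low word first, and join them. */
    uint64_t model_read_64(tsc_model *t)
    {
        uint32_t lo = model_read_tscl(t);
        uint32_t hi = model_read_tsch(t);
        return ((uint64_t)hi << 32) | lo;
    }
    ```

    In this model, if a second reader calls model_read_tscl() between your low-word and high-word reads, and the low word has wrapped in between, your high word belongs to the later reading and the combined value is off by 2^32. That is the kind of inconsistency a shared TSCL/TSCH could produce.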

     

    Regards!

    Wen

     

  • Hello Wen,

    Thanks for replying. Given the conflict you describe between RTSC and TSCL/TSCH, what other way is there to measure the number of clock cycles, the time taken, or anything else performance related? Can you suggest something?

    Also, can you explain the DSP library a little more? Do I need to include it in my project? I haven't done this before, so I am not sure how it works.
    In the meantime, I am looking at the project you attached and will get back to you with my doubts and queries.

    Thanks and regards,
    Arun 

  • Wen, 

    I have checked the project you sent and was not able to find a target configuration file (.ccxml). Did you create one with your project? I have just imported your project for my 6678 board, and when I debug it, it takes a very long time, even on just one core.

    Thanks! 

  • Hello Wen,

    Here I am after going through the details of the project you attached. I have some questions, as follows:

    1. First, about the c6678.cmd file: did you write this file yourself? As far as I know, we get something like this named linker.cmd, which is generated automatically after a successful build of a project.

    2. What about the target configuration file? Did you not use any file with the .ccxml extension? I was not able to find one in the project.

    3. If there is a conflict between RTSC and TSCL/TSCH that I don't know about, does it mean RTSC is not a good platform for testing compute-intensive programs like matrix-matrix multiplication? Or is there another way to solve that and measure performance figures such as clock cycles, GFLOPS and time taken?

    4. I have also seen that in the c6678.cmd file you placed different sections in different memory regions, such as L2SRAM1, DDR3_mem1 and DDR3_mem0. Is that necessary to run the project? And do we really need to care about the heap size? I have tried putting everything in MSMCSRAM, but it was taking too much time for dimension size 1500.

    5. I have just installed DSPLIBRARY 3_1_0, and I can see source files for different versions of matrix-matrix multiplication. Do I need to use that code for testing performance? Can't I just write my own code and test it? I think that would help me understand better how the architecture behaves in different circumstances.

    6. After simply importing your project, when I build and debug it, loading it onto one core takes too much time. Have I done something wrong, or is this usual?

    I know some of these may be simple questions for you or the others, but I am learning and trying to understand the behavior of the 6678, so please bear with my limited knowledge.

    Thanks and Regards,
    Arun 

  • Arun,

    Answering your questions one by one:

    1. Yes, I created the .cmd file myself, based on the memory layout of the system (EVM6678L). The file can have any name, and you can choose different memory for different code sections as long as your code fits. The project you are using is RTSC/BIOS based; that information is defined somewhere in the .cfg file (?), and linker.cmd is generated from it.

    2. For the target configuration file, you should be able to use your own (the C6678l.ccxml from your project). Remember, to run my .out file, you need to load the GEL file (..\ccsv5\ccs_base\emulation\boards\evmc6678l\gel\evm6678l.gel) to the DSP core before you connect CCS to the DSP. This GEL file initializes the system (clock, DDR3, ...) during connection.

    3. Assuming RTSC does use the TSCL/TSCH registers (I am not sure yet), you can keep the scheduler out of the way by disabling interrupts before your code reads TSCL/TSCH and re-enabling them after the read. But for your specific case, I'd suggest using the standalone project, since then all the factors are under your own control.

    4. Yes, you can play with the memory map by assigning different code sections to different memory blocks. You do need to care about the heap size. For example, if you check the return value of each malloc() call in your .c file, you will see that some calls fail because there is not enough memory in the heap (1024*1024*8 = 8 Mbytes per matrix).

    5. I haven't used DSPLib before; I just know that it exists, that its functions are optimized, and that they should be much better than what most users can write themselves.

    6. Two possible reasons: you haven't run the GEL file to set up the DDR3 memory yet, or your emulator is running too slowly (are you using an XDS100?).
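
    The heap arithmetic in item 4 can be written out as a small sketch. It assumes an 8-byte double and ignores whatever bookkeeping the heap adds per allocation; the function names are made up for illustration.

    ```c
    #include <stddef.h>

    /* Bytes needed for one dimension x dimension matrix of doubles. */
    size_t matrix_bytes(size_t dimension)
    {
        return dimension * dimension * sizeof(double);
    }

    /* Bytes needed for the three matrices A, B and C of the test program,
     * ignoring any per-allocation bookkeeping the heap itself adds. */
    size_t total_heap_bytes(size_t dimension)
    {
        return 3 * matrix_bytes(dimension);
    }

    /* Nonzero when all three matrices fit in a heap of heap_bytes. */
    int fits_in_heap(size_t dimension, size_t heap_bytes)
    {
        return total_heap_bytes(dimension) <= heap_bytes;
    }
    ```

    For dimension 1024 this gives 8388608 bytes per matrix and 25165824 bytes for all three. For dimension 1500 the total is 54000000 bytes, which fits in the 0x4000000 (67108864 byte) heap from the standalone project. A 4096-byte heap, by contrast, cannot hold even the dimension 100 case (240000 bytes).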

    Regards!

    Wen 

     

     

     

  • Hello Wen,

    Thanks for your detailed explanations. Can I use the .cmd file you have created with my project? I will make the appropriate changes to it. I am trying to run one of the DSPLib source codes, which should give me a better idea of where things went wrong.
    Yes, I didn't run the GEL file, and I am using an XDS100v1 USB emulator.

    I will get back to you whenever I have doubts. For now, I am marking your reply above as the verified answer.

    Thanks and Regards,
    Arun 

  • Hello Wen,

    I have one question for you. Can you help me keep all my data in registers only? We can select L1, L2, MSMC and DDR3 through CCS. I want to see the performance with all of the data kept at register level.

    Can you suggest something?

    Thanks and regards,

    Arun