Matrix Multiplication Performance

Hello,

Can anyone help me place code and data at the register level, and from there in L1, L2, and MSMC? I am working on the C6678 and trying to compare the overall performance of simple and complex matrix-to-matrix multiplication. I am not sure how to manually keep my code and data only in registers, or in L1 or L2 SRAM. I have been trying with RTSC projects, editing the configuration to change the placement there, but I am not sure whether I am doing it right. Also, do I need to add a link.cmd file to my project, or is the automatically generated one good enough?

Maybe I have muddled what I actually want. In one line, my problem is this: I want to see the performance of matrix-to-matrix multiplication (simple and complex, single and double precision) when the data is kept at different levels, starting from registers, then L1, then L2, MSMC, and then DDR3.

Thanks. 

  • Arun,
    take a look at this page that describes memory configuration in RTSC projects. The part that seems most relevant to your question is at the very end, where you can see how to allocate various sections to various memory segments using the CFG script. Once you have gone through that, let me know whether the docs answer your questions or you need further explanation.
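For the data side of that allocation, here is a minimal sketch in C, offered under assumptions rather than as the documented procedure: statically allocated matrices are tagged with the TI compiler's `DATA_SECTION` pragma so that they land in a named section. The section name `.matrix_data`, the dimension `N`, and the function `matmul_static` are hypothetical names chosen for illustration. The pragma is guarded so that the file also builds with a non-TI compiler.

```c
#define N 4  /* hypothetical matrix dimension for this sketch */

/* DATA_SECTION is a TI compiler pragma; the guard lets other compilers skip it.
 * ".matrix_data" is a hypothetical section name, not one the tools predefine. */
#ifdef __TI_COMPILER_VERSION__
#pragma DATA_SECTION(matA, ".matrix_data")
#pragma DATA_SECTION(matB, ".matrix_data")
#pragma DATA_SECTION(matC, ".matrix_data")
#endif
double matA[N * N];
double matB[N * N];
double matC[N * N];

/* Plain row-major product matC = matA * matB on the statically placed data. */
void matmul_static(void)
{
    int i, j, k;

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            double tmp = 0.0;
            for (k = 0; k < N; k++) {
                tmp += matA[N * i + k] * matB[N * k + j];
            }
            matC[N * i + j] = tmp;
        }
    }
}
```

The named section still has to be mapped to a memory segment (an L2, MSMC, or DDR3 segment, whatever names the platform defines), either in the CFG script through `Program.sectMap` or in a linker command file. Rebuilding with a different mapping then moves the same data between memory levels without touching the C code. Buffers obtained with `malloc` live in the heap instead, so they follow wherever the heap is placed.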

  • Hello Sasha,

    I have tried those steps. I don't have a clear idea of what is actually going on, but I want to ask one thing. According to the website, we can change L1, L2, and the others, and put code, data, and stack memory wherever we want. But what if I only want to put my data in registers? I only want to perform a 2x2 matrix multiplication, and I want to put everything (code, data, and stack) in registers.

    Thanks. 

  • When you say you want everything in registers, I'm not sure what you mean.  Only data can be in registers.  And even then, only for a short time.  What does it mean for code or stack to be in registers?  Feel free to show a code example of some sort, if it helps you to explain it.

    Thanks and regards,

    -George

  • Hello Georgem,

    Thanks for your reply. My apologies for not being able to explain my problem clearly, and forgive my limited knowledge. Let me try once more to explain it another way.

    As an example, I have written a simple matrix-to-matrix multiplication program and created an RTSC project for it. I set everything up but did not configure RTSC; I mean, I did not set any particular location for data, code, or stacks. When I build my code, a linker.cmd file is generated automatically, in which I can see where my data and the other sections are, and how much of each memory is used and unused. Am I right?
    Now suppose I want to configure that RTSC file and manually place the code and data, overriding the defaults. I can do that on the page shown in the screenshot I have attached, where I can select the code memory, data memory, and stack memory. But there we only get options for L1, L2, MSMC, and others. We know the number of registers is small and only a limited amount of calculation fits in them. That's why I want to compute only a 2x2 matrix multiplication with my data in registers, so I can see how it behaves and what the performance is. Then I will try moving to L1, L2, MSMC, and DDR3 and compare. This is the starting point; after that I will do the same for SGEMM and DGEMM, compare the numbers I get against the values TI has published, and try to improve them by applying optimizations and other techniques with OpenMP, such as blocking, non-blocking, etc.

    I hope this gives you some insight into what I actually want to do. I have written my simple matrix-to-matrix multiplication code, and also written it in various modes with OpenMP. If you want to see those, I can share them as well, but they are just simple code that you can find anywhere.

    Thanks.

  • From C code, there is no method by which you can directly influence whether a given variable is ever held in a register.  By building with optimization (details on that here), you greatly improve your chances that the function local variables accessed the most are placed in registers.  That's about the best you can do from C.

    Another method to consider ...  Rewrite the function in linear assembly.  Details on that in the chapter titled Using the Assembly Optimizer in the C6000 compiler book.

    Thanks and regards,

    -George
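To make George's point concrete, here is a minimal sketch of a 2x2 product written so that an optimizing compiler has a good chance of keeping every operand in registers. It is an illustration, not a guarantee: nothing in C forces register allocation, and the function name `matmul2x2` and the row-major layout are assumptions made for this example.

```c
/* Multiply two 2x2 matrices stored row-major: c = a * b.
 * Each operand is loaded once into a function-local scalar. With optimization
 * enabled the compiler is free to keep these locals in registers, but the C
 * language does not guarantee that it will. */
void matmul2x2(const double a[4], const double b[4], double c[4])
{
    const double a00 = a[0], a01 = a[1], a10 = a[2], a11 = a[3];
    const double b00 = b[0], b01 = b[1], b10 = b[2], b11 = b[3];

    /* Fully unrolled: 8 multiplies and 4 adds, no loop counters or indexing. */
    c[0] = a00 * b00 + a01 * b10;
    c[1] = a00 * b01 + a01 * b11;
    c[2] = a10 * b00 + a11 * b10;
    c[3] = a10 * b01 + a11 * b11;
}
```

Even in this form the inputs are still read from memory once and the results written back once. The memory level that holds `a`, `b`, and `c` therefore still affects the measurement; only the intermediate arithmetic can stay in registers.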

  • Hello George!

    Thanks for your reply! Let me move on with my question. Below is the simple matrix-to-matrix multiplication code I have written:

    /*objective:matrix multiplication in dynamic way*/

    #include <stdlib.h>
    #include <stdio.h>
    #include <time.h>
    #include <c6x.h>

    /* change dimension size as needed */
    const int dimension;
    //int begin(void);
    //int end(int *secs, int *nsecs);
    unsigned int t_start_l,t_start_h;
    unsigned int t_stop_l,t_stop_h;
    unsigned int t_overhead_l,t_overhead_h;

    int main(int argc, char *argv[])
    {
        int i, j, k;
        double tmp;
        double *A, *B, *C;
        double time1, timedif1;

        printf("Enter the number of dimension :\t");
        scanf("%d",&dimension);

        A = (double*)malloc(dimension*dimension*sizeof(double));
        B = (double*)malloc(dimension*dimension*sizeof(double));
        C = (double*)malloc(dimension*dimension*sizeof(double));

        for(i = 0; i < dimension; i++)
            for(j = 0; j < dimension; j++)
            {
                A[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
                B[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
                C[dimension*i+j] = 0.0;
            }

        //begin();

        TSCL = 0;
        TSCH = 0;
        t_start_l = TSCL;
        t_start_h = TSCH;

        time1 = (double) clock(); /* get initial time */
        time1 = time1 / CLOCKS_PER_SEC;

        for(i = 0; i < dimension; i++)
            for(j = 0; j < dimension; j++) {
                tmp = 0.0;
                for(k = 0; k < dimension; k++)
                    tmp += A[dimension*i+k] *
                           B[dimension*k+j];
                C[dimension*i+j] = tmp;
            }
        //end(&s, &ns);

        t_stop_l = TSCL;
        t_stop_h = TSCH;
        t_overhead_l = t_stop_l - t_start_l;
        t_overhead_h = t_stop_h - t_start_h;

        timedif1 = ( ((double) clock()) / CLOCKS_PER_SEC) - time1;

        printf("\n\n%d %d \n", (double) time1, (double)timedif1 );
        printf("The elapsed time in multiplication of matrix is %f seconds\n", timedif1);

        printf("Time Taken during Matrix multiplication is: , t_overhead_h = %2d\tt_overhead_l=%2d\t\n",t_overhead_h,t_overhead_l);

        /* spot checking -- not very accurate */
        //printf("%f\n", C[17]);

        free(A);
        free(B);
        free(C);

        return 0;
    }
    
    
    This is just a simple matrix multiplication code, with no optimization and no special properties. When I debug this code under default conditions and run it for various dimensions, as in the snapshot below, it gives me some weird values. To check, I have used two methods to get some idea of what is actually going on. With time.h, it always gives me a time value of 0.00, and with TSCL and TSCH it gives me some random, weird values. For example, for a 500x500 matrix, TSCL is negative. How can that be possible?
    At a matrix dimension of 1200, it more or less hangs; nothing happens for quite a long time.

    George, can you suggest where I am going wrong in this case? Also, how can I now change those memory configurations? Can I edit the auto-generated linker file? For this program, I want to know how many operations occur and how much time is consumed by the function doing the matrix multiplication. Is there any way to use something like PAPI on the 6678 DSP? I hope you have some idea of what I am actually trying to do. Please tell me if my approach is wrong, so I can change to an appropriate one.

     

    Thanks and Regards,
    Arun 

  • The function clock() returns a clock_t (unsigned int) value.  I don't understand why you convert it to double.  The clock function only works if you are running CCS.  See this wiki article for more on that.  And dividing by CLOCKS_PER_SEC doesn't look right either, but I'm less sure about that.

    Thanks and regards,

    -George
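On the negative TSCL value reported in the question: the posted program stores the counter halves in `unsigned int` variables but prints them with `%d`, so any value of 2^31 or more is displayed as a negative number. It also subtracts the low and high halves separately, which ignores the borrow when the low half wraps around. The sketch below shows one way to avoid both issues by working on a single 64-bit value. The function names are hypothetical, and plain integers stand in for the register reads so that the arithmetic can be checked on any machine.

```c
#include <stdint.h>
#include <stdio.h>

/* Combine the two 32-bit halves of a timestamp into one 64-bit value. */
uint64_t join_halves(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Elapsed cycles between two timestamps, each given as (high, low) halves.
 * Subtracting the joined 64-bit values handles the borrow between halves. */
uint64_t elapsed_cycles(uint32_t start_hi, uint32_t start_lo,
                        uint32_t stop_hi, uint32_t stop_lo)
{
    return join_halves(stop_hi, stop_lo) - join_halves(start_hi, start_lo);
}

/* Print with an unsigned 64-bit conversion so large counts never show as negative. */
void print_cycles(uint64_t cycles)
{
    printf("elapsed: %llu cycles\n", (unsigned long long)cycles);
}
```

On the target, the four arguments would be the values read from TSCL and TSCH before and after the loop.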

  • Hello George,

    Thanks for your reply. I have removed the double conversions, but I am still not getting correct timing values. I have tried the Clock feature in CCS, and I have seen that we can get various values from it, such as CPU cycles, L1 and L2 hits, misses, etc. I have two questions here. First, do I need to enable it again each time I run on an individual core? Second, is there a way to see all those different values (CPU clock cycles, L1 and L2 hits, misses, and others) together, so we don't have to repeat the run for each individual value? In short, is there any article that explains how to use the Clock feature in CCS? And could you also share some thoughts on how to measure the time consumed by the main loop, so that we can calculate FLOPS from it?

    Thanks.

  • Uhm, slow down a bit.  Your questions keep growing in number and complexity.  And you are going well outside my expertise.  Let's focus on just one thing.

    This lab uses TSCL, TSCH to measure cycle count.  What is wrong with that method?

    Thanks and regards,

    -George

  • Hello George,

    I am very sorry for the number of questions. It was Friday, and I knew the next response would come on Monday; that is why I asked so many questions, so that I could work on them in the meantime. Usually, when I get a reply, I try the steps and do some background study, and then ask about whatever I am still not sure of. I will be more careful about this from now on. My apologies again.

    Anyway, let's come back to the topic. As far as I know, I can use the Time Stamp Counter to measure the execution time of a section of code. It is a free-running 64-bit counter on the C66x that is incremented on every CPU cycle. I can access the counter through TSCL and TSCH: TSCL returns the 32 LSBs of the Time Stamp Counter, and TSCH returns the 32 MSBs. The order is important: to get a consistent 64-bit value, I have to read TSCL before TSCH. I can do that in C as below:

    #include <stdint.h> // uint64_t
    #include <c6x.h>    // _itoll, TSCH, TSCL

    uint64_t start_time, end_time;

    start_time = _itoll(TSCH, TSCL);
    /* your code section to profile */
    end_time = _itoll(TSCH, TSCL);
    printf("Your code section took: %lld cycles\n", end_time - start_time);

    Once I have the number of cycles for that section of my code (mainly from the start to the end of the matrix-matrix multiplication loops), I can calculate the time with this formula:
    Execution time (T) = CPI * instruction count * clock period = CPI * instruction count / frequency
    If I am correct so far, then, as in my previous replies, I am getting negative values from TSCL and TSCH, and I don't understand why. Also, when I select a matrix dimension above 1024, the program seems to run, but nothing shows up on the console and it appears to hang partway. I don't understand that either, because it should give some sort of error message if the data does not fit in a particular memory section, instead of hanging.
    For your information, all of this is without any optimization level or any other special properties; only that way can I later compare how much difference those properties make. I have also tried the blocking and non-blocking versions of my code, with the same result.
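Since CPI multiplied by the instruction count is simply the measured cycle count, the formula above reduces to T = cycles / frequency. The sketch below turns a cycle count into seconds and then into GFLOPS for a real-valued n x n product. It counts one multiply and one add per inner-loop iteration, so 2n^3 operations in total. The function names are hypothetical, and the clock frequency is a parameter because the actual core clock of a given board is an assumption the caller must supply.

```c
#include <stdint.h>

/* Execution time in seconds: T = cycles / frequency. */
double cycles_to_seconds(uint64_t cycles, double clock_hz)
{
    return (double)cycles / clock_hz;
}

/* Floating-point operations in a naive real n x n matrix product:
 * n^3 multiplies plus n^3 adds. */
double matmul_flops(uint32_t n)
{
    return 2.0 * (double)n * (double)n * (double)n;
}

/* Achieved GFLOPS for an n x n product that took the given number of cycles. */
double matmul_gflops(uint32_t n, uint64_t cycles, double clock_hz)
{
    return matmul_flops(n) / cycles_to_seconds(cycles, clock_hz) / 1e9;
}
```

As a worked example, a 500x500 product is 2 * 500^3 = 250,000,000 operations. If it took 250,000,000 cycles at an assumed 1 GHz clock, that is 0.25 s, or 1 GFLOPS.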
    
    
    I will ask my other questions later. Please accept my apologies for my limited knowledge and the series of questions.
    
    
    Thanks and Regards,
    Arun
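One possible contributor to the apparent hang at large dimensions, offered as a hypothesis rather than a diagnosis: the posted program never checks the result of `malloc`. Three 1200x1200 matrices of `double` need 3 * 1200 * 1200 * 8 = 34,560,000 bytes. If the configured heap is smaller than that, `malloc` returns NULL and the loops then write through a null pointer. An unoptimized 1200^3 loop is also simply slow, so both effects may be in play. The sketch below, with hypothetical helper names, computes the required size and fails cleanly when an allocation does not succeed.

```c
#include <stddef.h>
#include <stdlib.h>

/* Bytes needed for three n x n matrices of double (A, B and C). */
size_t matrices_bytes(size_t n)
{
    return 3u * n * n * sizeof(double);
}

/* Allocate A, B and C. On any failure, free whatever was obtained,
 * set all three pointers to NULL and return -1; return 0 on success. */
int alloc_matrices(size_t n, double **A, double **B, double **C)
{
    *A = (double *)malloc(n * n * sizeof(double));
    *B = (double *)malloc(n * n * sizeof(double));
    *C = (double *)malloc(n * n * sizeof(double));

    if (*A == NULL || *B == NULL || *C == NULL) {
        free(*A);  /* free(NULL) is a no-op, so this is safe for partial failures */
        free(*B);
        free(*C);
        *A = *B = *C = NULL;
        return -1;
    }
    return 0;
}
```

Calling `alloc_matrices` before the timing section and printing an error when it returns -1 would distinguish an allocation failure from a loop that is merely taking a long time.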