Hello, I am trying to measure the performance of a simple matrix-matrix multiplication. I read TSCL and TSCH to count clock cycles, and from that I work out how long the nested loop takes. My code is as follows:
A = (double*)malloc(dimension*dimension*sizeof(double));
B = (double*)malloc(dimension*dimension*sizeof(double));
C = (double*)malloc(dimension*dimension*sizeof(double));

for(i = 0; i < dimension; i++) {
    for(j = 0; j < dimension; j++) {
        A[dimension*i+j] = (i+j);
        B[dimension*i+j] = (i-j);
        C[dimension*i+j] = 0.0;
    }
}

TSCL = 0;
TSCH = 0;
t_start_l = TSCL;
t_start_h = TSCH;

for(i = 0; i < dimension; i++) {
    for(j = 0; j < dimension; j++) {
        tmp = 0.0;
        for(k = 0; k < dimension; k++) {
            tmp += A[dimension*i+k] * B[dimension*k+j];
            C[dimension*i+j] = tmp;
        }
    }
}

t_stop_l = TSCL;
t_stop_h = TSCH;
t_overhead_l = t_stop_l - t_start_l;
t_overhead_h = t_stop_h - t_start_h;
I then take the number of clock cycles to be Delta = t_overhead_l - t_overhead_h. Below are some values I get with no optimization and no special build settings.
[C66xx_0] Enter the size of dimension : 10
[C66xx_0] Time Taken during Matrix multiplication is: , t_overhead_h = 0 t_overhead_l=59958
[C66xx_1] Enter the size of dimension : 100
[C66xx_1] Time Taken during Matrix multiplication is: , t_overhead_h = 0 t_overhead_l=64627173
[C66xx_2] Enter the size of dimension : 500
[C66xx_2] Time Taken during Matrix multiplication is: , t_overhead_h = 2 t_overhead_l=-547958635
[C66xx_3] Enter the size of dimension : 1000
[C66xx_3] Time Taken during Matrix multiplication is: , t_overhead_h = 15 t_overhead_l=-124803967
[C66xx_4] Enter the size of dimension : 1024
[C66xx_4] Time Taken during Matrix multiplication is: , t_overhead_h = 16 t_overhead_l=320899090
[C66xx_5] Enter the size of dimension : 1500
My questions are:
1. Why are the values negative for dimension sizes 500 and 1000, but not for 1024?
2. Why does the system hang at a dimension size of 1500? It gives no error or message for half an hour.
Apart from this, is there any other way to measure the time? If I enable the clock from CCS, it gives different values from the ones above. Are those CPU values for the whole program if I select CPU execution cycles?
Thanks and regards,
Arun
Hello,
Sorry to post two questions back to back. I have also tried a blocked version of the matrix multiplication, and I have some doubts about it as well. My blocked code is as follows:
void do_mult(int block_i, int block_j, int block_k, double *A, double *B, double *C)
{
    int i, j, k;
    double tmp;

    for (i = block_i; i < block_i+block; i++) {
        for (j = block_j; j < block_j+block; j++) {
            tmp = 0.0;
            for (k = block_k; k < block_k+block; k++) {
                tmp += A[dimension*i+k] * B[dimension*k+j];
                C[dimension*i+j] += tmp;
            }
        }
    }
}

A = (double*)malloc(dimension*dimension*sizeof(double));
B = (double*)malloc(dimension*dimension*sizeof(double));
C = (double*)malloc(dimension*dimension*sizeof(double));

for(i = 0; i < dimension; i++) {
    for(j = 0; j < dimension; j++) {
        A[dimension*i+j] = (i+j);
        B[dimension*i+j] = (i-j);
        C[dimension*i+j] = 0.0;
    }
}

//begin();
TSCL = 0;
TSCH = 0;
t_start_l = TSCL;
t_start_h = TSCH;

for(i = 0; i < nr_blocks; i++) {
    block_i = i * block;
    for(j = 0; j < nr_blocks; j++) {
        block_j = j * block;
        for(k = 0; k < nr_blocks; k++) {
            block_k = k * block;
            do_mult(block_i, block_j, block_k, A, B, C);
        }
    }
}

//end(&s, &ns);
t_stop_l = TSCL;
t_stop_h = TSCH;
t_overhead_l = t_stop_l - t_start_l;
t_overhead_h = t_stop_h - t_start_h;

printf("Number of Clock Cycle Taken during Matrix multiplication is: %d\t\n",t_overhead_l-t_overhead_h);
When I run this code I get the same cycle count whatever the dimension size, and the system again hangs for dimension sizes above 1024, as follows:
[C66xx_0] Enter the number of dimension : 10
[C66xx_0] Number of Clock Cycle Taken during Matrix multiplication is: 25
[C66xx_1] Enter the number of dimension : 50
[C66xx_1] Number of Clock Cycle Taken during Matrix multiplication is: 25
[C66xx_2] Enter the number of dimension : 100
[C66xx_2] Number of Clock Cycle Taken during Matrix multiplication is: 25
[C66xx_3] Enter the number of dimension : 500
[C66xx_3] Number of Clock Cycle Taken during Matrix multiplication is: 25
[C66xx_4] Enter the number of dimension : 1000
[C66xx_4] Number of Clock Cycle Taken during Matrix multiplication is: 25
[C66xx_5] Enter the number of dimension : 1024
[C66xx_5] Number of Clock Cycle Taken during Matrix multiplication is: 25
[C66xx_6] Enter the number of dimension : 1200
Where am I going wrong?
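For comparison, here is a minimal sketch of a blocked multiply in portable C. The function names `block_mult` and `blocked_matmul`, and passing the dimension and block size as parameters, are choices made for this illustration and are not taken from the posted project. The sketch assumes the block size divides the dimension evenly and that C starts out zeroed. Unlike the posted `do_mult`, it adds `tmp` to C once after the inner k loop finishes. Adding inside the loop would add every running partial sum to C. This concerns the numerical result only and is not offered as an explanation for the constant cycle count.

```c
#include <stddef.h>

/* Multiply one block: C[i][j] += sum over k in the block of A[i][k]*B[k][j].
 * n is the matrix dimension, bs the block size (assumed to divide n). */
static void block_mult(size_t n, size_t bs, size_t bi, size_t bj, size_t bk,
                       const double *A, const double *B, double *C)
{
    size_t i, j, k;
    for (i = bi; i < bi + bs; i++) {
        for (j = bj; j < bj + bs; j++) {
            double tmp = 0.0;
            for (k = bk; k < bk + bs; k++) {
                tmp += A[n * i + k] * B[n * k + j];
            }
            C[n * i + j] += tmp;   /* add this block's partial sum once */
        }
    }
}

/* Blocked n x n multiply; C must be zero-initialised by the caller. */
static void blocked_matmul(size_t n, size_t bs,
                           const double *A, const double *B, double *C)
{
    size_t bi, bj, bk;
    for (bi = 0; bi < n; bi += bs)
        for (bj = 0; bj < n; bj += bs)
            for (bk = 0; bk < n; bk += bs)
                block_mult(n, bs, bi, bj, bk, A, B, C);
}
```

Calling `blocked_matmul(4, 2, A, B, C)` on small test matrices and comparing against the plain triple loop is a quick way to check the result before timing anything.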
Hi Arun,
TSCH and TSCL together represent one 64-bit value; each register is 32 bits wide.
So your calculation t_overhead_l-t_overhead_h is wrong.
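To make the arithmetic concrete, here is a small sketch in portable C (no DSP headers) of how the two 32-bit halves combine into one 64-bit count before subtracting. The helper names `tsc_combine` and `tsc_elapsed` and the sample values in the note below are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: join the upper (TSCH) and lower (TSCL) 32-bit
 * words of a timestamp into a single 64-bit cycle count. */
static uint64_t tsc_combine(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | (uint64_t)low;
}

/* Elapsed cycles between two timestamps, each given as (high, low). */
static uint64_t tsc_elapsed(uint32_t start_h, uint32_t start_l,
                            uint32_t stop_h, uint32_t stop_l)
{
    return tsc_combine(stop_h, stop_l) - tsc_combine(start_h, start_l);
}
```

For example, with a start stamp of (0, 1000) and a stop stamp of (2, 3000000000), the elapsed count is 11589933592 cycles. The difference of the low words alone is 2999999000, which is larger than the biggest signed 32-bit value, so printing it with %d typically shows a negative number.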
Kind regards,
one and zero
Here's an example:
#include <stdio.h>
#include <c6x.h>

void main(void)
{
    unsigned int stampl1, stampl2, stamph1, stamph2;
    long long time1, time2;

    TSCL = 0;              /* any write to TSCL starts the counter */

    stampl1 = TSCL;        /* read TSCL first: this latches TSCH */
    stamph1 = TSCH;
    printf("Hi: %d \n", DNUM);
    stampl2 = TSCL;
    stamph2 = TSCH;

    /* build the 64-bit stamps before subtracting */
    time1 = ((long long)stamph1 << 32) + (long long)stampl1;
    time2 = ((long long)stamph2 << 32) + (long long)stampl2;
    printf("printf took: %lld cycles \n", time2 - time1);
}
Or alternatively you could use the CSL:
CSL_Uint64 counterVal;
...
CSL_tscStart();
...
counterVal = CSL_tscRead();
Arun,
What do you mean by "system hang"? Did you see an error message from CCS saying the CPU pipeline is stalled? Or did you simply see your code run wild and never complete? (In that case you should still be able to issue a HALT command from CCS and check whether your code is still performing the calculation.)
If it is the second case, can you attach the linker command file (.cmd file) so that I can take a look? It is even better if you can attach the entire project.
Regards!
Wen
Hello One and Zero,
Thanks for your reply. Yes, I know that TSCH and TSCL together represent a 64-bit value and that each register is 32 bits, but I was not thinking in that direction. Thanks for pointing it out. I have tried it the way you described, and now I get reasonable output with no negative values. It makes sense.
Hello Wenzhongliu,
Thanks for replying, and my apologies for not being clear. By "hangs" I mean that when I run my code on one core, it shows as running, but I get no output on the console and no error message. It just seems to keep running. Below is a screenshot of how my system looks:
The problem you have looks like a software issue. If you can send me your code (the entire project, especially the .cmd file), I'd like to take a look and debug it on my EVM board.
My guess is that your test runs out of heap memory and one of your malloc() calls fails because there is not enough memory. So check the size of the heap.
Another question: during the iterations, do you call free() to release memory?
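One way to test that guess is to check every malloc() result before using it. The sketch below is an illustration in portable C; the helper names `matrices_bytes` and `alloc_matrices` are hypothetical and do not come from the posted project.

```c
#include <stdlib.h>

/* Bytes needed for the three dimension x dimension double matrices. */
static size_t matrices_bytes(size_t dimension)
{
    return 3u * dimension * dimension * sizeof(double);
}

/* Allocate A, B and C; returns 0 on success, -1 if any malloc() fails.
 * On failure nothing stays allocated and the pointers are set to NULL. */
static int alloc_matrices(size_t dimension, double **A, double **B, double **C)
{
    size_t bytes = dimension * dimension * sizeof(double);

    *A = (double *)malloc(bytes);
    *B = (double *)malloc(bytes);
    *C = (double *)malloc(bytes);
    if (*A == NULL || *B == NULL || *C == NULL) {
        free(*A);              /* free(NULL) is a no-op */
        free(*B);
        free(*C);
        *A = *B = *C = NULL;
        return -1;
    }
    return 0;
}
```

With 8-byte doubles, the three matrices need 25165824 bytes at dimension 1024 and 54000000 bytes (about 51.5 MiB) at dimension 1500, so the heap has to be at least that large for the allocations to succeed.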
Hello Wen,
Please see the attachment for my whole project. For your information, I am just testing a simple matrix-matrix multiplication, with no optimizations at all. I am using
free(A);free(B);free(C);
but not inside the iterations. Anyway, have a look at it. Right now I am basically stuck on the blocked version of the same matrix-matrix multiplication: it always gives me the same number of clock cycles for any dimension. Let me get my hands dirty on it, and if I get nowhere I will trouble you again.
3683.MAT_MUL_ARUN.zip
Thanks and regards,
Arun
Hello Wen,
My apologies for posting twice in a row. I have tried the same thing with the blocked version of my matrix-matrix multiplication. I cannot figure out why it takes the same number of clock cycles for any dimension, and it also shows the same behaviour above a dimension size of 1024. I thought it might help you see where I am going wrong.
8625.Mul_Arun_Blocking.zip
Hello Wen,
I have changed the memory placement through the RTSC platform and rerun the same simple matrix-matrix multiplication code. It now runs beyond a dimension size of 1024, but not for 1500. I expected it to run at that size as well once I put everything in MSMCSRAM.
I created a standalone (non-RTSC) project based on the .c file you sent, without changing anything in your .c file, but added another file, c6678.cmd, with a big heap (-heap 0x4000000). Here are the test results I got:
[C66xx_0] Enter the size of dimension : 150 // Note: used the Release version
[C66xx_0] Matrix Multiplication took: 477836025 cycles
[C66xx_0] Enter the size of dimension : 150 // Note: used the Debug version
[C66xx_0] Matrix Multiplication took: 1234223435 cycles
[C66xx_0] Enter the size of dimension : 1024 // Note: used the Release version
[C66xx_0] Matrix Multiplication took: 202770756487 cycles
[C66xx_0] Enter the size of dimension : 1500 // Note: used the Debug version
[C66xx_0] Matrix Multiplication took: 1260036672163 cycles
When I used a small heap, I also saw my test run wild with dimension=1500.
Back to your code: I looked at the code you sent, and I see the following potential issues:
1. Your code uses RTSC, which might use the TSCL/TSCH registers too and cause a conflict when reading TSCL/TSCH.
2. Since your code uses RTSC and the memory map configuration is left at its default, the run-time heap, from what I read, is in L2SRAM with a size of 4096.
By the way, from the test results you can see that the dimension=1500 test takes a very long time to complete. Since matrix-by-matrix multiplication is a very typical DSP algorithm, DSPLib already covers it with much better performance. In your real application, you should call the DSPLib function directly.
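To put the cycle counts above in wall-clock terms, a cycle count can be divided by the core clock frequency. The sketch below is only an illustration: the helper name `cycles_to_seconds` is hypothetical, and the 1 GHz clock used in the note is an assumption that should be replaced with the actual clock of your board.

```c
#include <stdint.h>

/* Convert a cycle count to seconds for a given core clock in Hz.
 * The clock frequency is supplied by the caller; nothing is assumed here. */
static double cycles_to_seconds(uint64_t cycles, double clock_hz)
{
    return (double)cycles / clock_hz;
}
```

At an assumed 1 GHz, 202770756487 cycles is roughly 203 seconds and 1260036672163 cycles is roughly 1260 seconds, or about 21 minutes. A run that looks hung for a long time may therefore simply still be computing.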
Hello Wen,
Thanks for your reply, and my apologies for my limited knowledge. I have some doubts on a few points.
1. How did you add a new .cmd file? Is it not generated automatically? Where did you change the heap size? And does this mean that if we increase the heap size, it doesn't matter whether our code and data are in L2SRAM, MSMCSRAM or even DDR3, and everything should run? Am I right?
2. You wrote that with a small heap you also had problems from matrix size 1500 onwards, but you got the results above with the standalone (non-RTSC) project. Does that mean the issue is with RTSC-based projects or with the heap?
3. Will it make any difference if I build with optimization level 3 (fully optimized), or if I disable intrinsics?
4. What kind of conflict between RTSC-based projects and TSCL/TSCH are you referring to? Please elaborate.
7128.MAT_MUL_WEN.zip
I am sorry for forgetting to attach the project I created.
To answer your questions:
1. See the project I attached. The file c6678.cmd in the project defines the memory map as well as the heap size used for building the .out (you can play with it to see how the test behaves).
2. See item 1. You can try a smaller heap size by changing the -heap line in the .cmd file.
3. I haven't tried different optimization levels (I only tried the default settings for the Debug and Release builds).
4. TSCH works this way: whenever TSCL is read, the current upper 32 bits of the 64-bit counter are latched into TSCH. So if both RTSC and your test code read TSCL/TSCH in one application, the read of TSCH might be unreliable, since it could hold a value latched when the other reader read TSCL.
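The latching behaviour in item 4 can be illustrated with a toy model in portable C. This is a simulation written for this explanation, not the real registers: the type `tsc_model` and the functions `model_read_tscl`, `model_read_tsch` and `model_tick` are all hypothetical names.

```c
#include <stdint.h>

/* Toy model of the counter (not the real hardware registers). */
typedef struct {
    uint64_t counter;   /* free-running 64-bit cycle count        */
    uint32_t tsch;      /* snapshot of the upper 32 bits (latch)  */
} tsc_model;

/* Reading the low word also latches the current high word. */
static uint32_t model_read_tscl(tsc_model *t)
{
    t->tsch = (uint32_t)(t->counter >> 32);
    return (uint32_t)t->counter;
}

/* Reading the high word returns whatever was latched last. */
static uint32_t model_read_tsch(const tsc_model *t)
{
    return t->tsch;
}

/* Advance the model by a number of cycles. */
static void model_tick(tsc_model *t, uint64_t cycles)
{
    t->counter += cycles;
}
```

Suppose the counter is at 0x00000001FFFFFFF0 and reader A reads the low word, which latches a high word of 1. If the counter then advances by 0x20 cycles and a second reader reads the low word before A reads the high word, the latch now holds 2. Reader A would pair its old low word with the new high word and end up with a stamp that is off by 2^32 cycles.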
Hello Wen,
Thanks for replying. Given the conflict you describe between RTSC and TSCL/TSCH, what other way is there to measure clock cycles, elapsed time, or performance in general? Can you suggest something?
Also, can you explain the DSP library a little more? Do I need to include it in my project? I haven't done that before, so I am not sure how it works. In the meantime, I am looking at the project you attached and will get back to you with my doubts and queries.
Thanks and regards,
Arun
Wen, I have checked the project you sent and could not find a target configuration file (.ccxml). Did you create one with your project? I have just imported your project for my 6678 board, and when I debug it, it takes a very long time on just one core.
Thanks!