Hello, I am asking my question again because I have not yet received a satisfying answer. I am using the C6678 and have run matrix-matrix multiplication on it in both single and double precision. I have tried placing the data in L1, L2, MSMC, and DDR3 using CCS and RTSC. Now I want to measure performance and count cycles with the data placed as close to the core as possible, that is, in the registers and nowhere else. I am not talking about large matrices; I am only interested in the smallest sizes, say 2x2 or even 3x3. Do I need to write linear assembly for that? I am not sure. I already have C code that I wrote for matrix-matrix multiplication. Can anybody please help me with this? I have been looking for a solution but have not found a satisfying one. Please point me to some links (but not RTSC links) or documents if you do not have a direct solution. I want to put my data in registers and perform the matrix multiplication there.
I hope I will get some solutions this time.
Thanks and regards, Arun
There's the register keyword in C. The register storage-class specifier asks the compiler to keep the variable being declared in a CPU register (if possible), to optimize access. For example:
register int i;
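Applied to the 2x2 case from the question, a minimal sketch might look like the following. The function name mat2x2_mul_sp and the row-major layout are assumptions made for illustration, and register remains only a hint: the compiler may ignore it, or may keep the locals in registers even without it.

```c
/* Multiply two 2x2 row-major single-precision matrices: c = a * b.
 * mat2x2_mul_sp is a hypothetical name used only for this sketch. */
void mat2x2_mul_sp(const float *a, const float *b, float *c)
{
    /* Load every operand once into a local scalar. "register" is a hint;
     * the compiler decides whether a register is actually used. */
    register float a00 = a[0], a01 = a[1], a10 = a[2], a11 = a[3];
    register float b00 = b[0], b01 = b[1], b10 = b[2], b11 = b[3];

    /* All arithmetic works on the locals; memory is written only here. */
    c[0] = a00 * b00 + a01 * b10;
    c[1] = a00 * b01 + a01 * b11;
    c[2] = a10 * b00 + a11 * b10;
    c[3] = a10 * b01 + a11 * b11;
}
```

The inputs still have to be loaded from memory once and the results stored once; only the intermediate work can stay in registers.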
one and zero
Hello One and Zero!
I was thinking nobody would reply to this question. People told me it is not possible, or that data can only be kept in registers for a short time, but I do not see why we cannot keep data there for as long as we want. Anyway,
many thanks for your reply. Actually, I am not sure how to do that. Do you know of any examples where this has been done, so I can look at them and understand? The C6678 DSP has two sides, A and B, with 32 registers each. When we place data in L1, L2, and the other memories through CCS, the auto-generated linker file shows which part was loaded where. Will I be able to see the same thing for registers?
As you know, the C66x includes 64 32-bit registers. The best way to use them directly is linear assembly or assembly.
The register keyword does not let you control which registers are used, nor does it guarantee that the compiler actually uses a register. It only tells the compiler that you think the variable should be kept in a register, and the compiler will try to allocate one. So it is a recommendation to the compiler, not a command.
The usage is straightforward: just put the register keyword in front of your variable declaration. See also:
In case you want to manually control the register allocation you have to go to assembly.
one and zero
Hello One and Zero, thanks for your reply. Yes, I was aware that the registers are 32 bits wide and that there are 64 of them across sides A and B. I also had doubts about relying on the register keyword in C. But I am now interested in controlling the registers and in making sure that every calculation happens in registers only. I read somewhere that I need to use linear assembly for that.
Anyway, can you provide some example assembly code? I have no prior experience with it. I know the general approach, but I have never written any assembly.
Thanks and regards, Arun
I'd recommend staying with C, since our compiler does an excellent job of optimization. You can also do a lot at the C level to optimize your code so that it fits the C6000 architecture best. Please have a look at the TMS320C6000 Programmer's Guide.
In case you want to educate yourself in linear assembly, Chapter 5 of the Programmer's Guide will be helpful; it also shows code examples.
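As an illustration of what C-level optimization can mean here, the sketch below adds restrict qualifiers, a local accumulator, and the TI MUST_ITERATE pragma to a plain matrix multiply. The function name matmul_sp and the row-major layout are assumptions for this example; the pragma is wrapped in the TI-predefined _TMS320C6X macro so the same file also builds with a host compiler.

```c
/* c = a * b for n x n row-major single-precision matrices.
 * matmul_sp is a hypothetical name used only for this sketch.
 * restrict promises the compiler that a, b and c never overlap, which
 * gives it more freedom to schedule loads and stores in the inner loop. */
void matmul_sp(const float *restrict a, const float *restrict b,
               float *restrict c, int n)
{
    int i, j, k;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            float acc = 0.0f;  /* local accumulator, a register candidate */
#ifdef _TMS320C6X
#pragma MUST_ITERATE(1)        /* TI-specific: loop runs at least once */
#endif
            for (k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;  /* single store per output element */
        }
    }
}
```

Whether the inner loop actually gets software pipelined depends on the compiler version and options, so the compiler's own feedback is the thing to check.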
... forgot to mention the very useful application report Hand-Tuning Loops and Control Code on the TMS320C6000.
It is a bit old, discusses older compiler versions, and only covers up to the C64x+ architecture, but the fundamentals and principles still apply today, including on C66x.
Hello One and Zero, thank you very much for all these links and the information. I would also like to stick with C. The problem is that I want to access and work with the registers directly, and as far as I have read and understood, that means moving to linear assembly. I have never written any linear assembly, so I am hesitant, but I do not see any other option.
Yes, I am working on optimization, and I am also working through the new paper TI has published on SGEMM and DGEMM on the C6678, trying to optimize the SGEMM kernel described there. Let's see how far I can get. I will keep you guys busy with my questions.
Thank you very much for all your support and help! I appreciate it.
I'm sure you're interested in the paper Unleashing DSPs for General-Purpose HPC, which describes how to implement GEMM on a C6678 in C using intrinsics.
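The paper's kernel is not reproduced here. As a generic illustration of what a GEMM inner kernel does with registers, the plain-C sketch below keeps a 2x2 block of the result in four local accumulators while it walks along the k dimension. The name gemm_ukernel_2x2, the row-major layout, and the leading-dimension parameters are assumptions for this example, and no intrinsics are used.

```c
/* Hypothetical micro-kernel: C(2x2) += A(2xk) * B(kx2), all row-major.
 * lda, ldb and ldc are the row strides (leading dimensions) in elements. */
void gemm_ukernel_2x2(int k,
                      const float *a, int lda,
                      const float *b, int ldb,
                      float *c, int ldc)
{
    /* The four partial sums live in locals for the whole loop, so the
     * compiler can hold them in registers instead of touching C memory. */
    float c00 = 0.0f, c01 = 0.0f, c10 = 0.0f, c11 = 0.0f;
    int p;

    for (p = 0; p < k; p++) {
        float a0 = a[0 * lda + p];   /* column p of A, rows 0 and 1 */
        float a1 = a[1 * lda + p];
        float b0 = b[p * ldb + 0];   /* row p of B, columns 0 and 1 */
        float b1 = b[p * ldb + 1];

        c00 += a0 * b0;  c01 += a0 * b1;
        c10 += a1 * b0;  c11 += a1 * b1;
    }

    /* C is read and written once per element, after the loop. */
    c[0 * ldc + 0] += c00;  c[0 * ldc + 1] += c01;
    c[1 * ldc + 0] += c10;  c[1 * ldc + 1] += c11;
}
```

Each loaded value of A and B is used twice here; larger blocks raise that reuse until the register file runs out, which is the trade-off an inner kernel is tuned around.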
I hope that helps and gives you some more ideas ...
Hello One and Zero, yes, that is the paper I was talking about. I have already read it and am working through it. Since you mentioned it, let me ask a couple of questions about it.
1. I have one major doubt about the kernel. Why do we need the kernel code? Can we not write our own simple matrix-matrix multiplication code, optimize and parallelize it, and then change the memory placement according to the chunks we create and the order in which we want to multiply them? The kernel code is already quite hard to understand.
2. We all know the C6678 has an onboard emulator, which is very slow. So I think an external emulator is needed to achieve the results reported in the paper. Based on what I understood from the paper, I used the same kernel code and tried to optimize it, but unfortunately I got very poor results, about 1% or 2% of what they report. I know I have not understood it properly, but still.
3. Another thing is that the paper does not say anything about registers, which is what I am interested in this time. I want to start at the very first level, then move on to the next memory level and see the difference.
1. I'm not quite sure what your question is. Of course you could write your own kernel.
2. The benchmarking result is not dependent on the emulator you're using.
3. If you want to look at a real linear (or serial) assembly implementation, you can look into DSPLIB; there is an FFT implemented that way (DSPF_sp_fftSPxSP.sa).
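On point 2, cycle counts on the C66x can be read from the core's own time-stamp counter (TSCL, declared in TI's c6x.h), which is why the emulator does not enter into the measurement. A minimal sketch follows, assuming hypothetical helper names read_ticks, time_kernel, and sample_work; on a host compiler it falls back to clock(), so the numbers there are clock ticks, not DSP cycles.

```c
#include <time.h>
#ifdef _TMS320C6X
#include <c6x.h>   /* TI header declaring the TSCL control register */
#endif

/* Inputs and output of the sample workload; volatile so the compiler
 * cannot fold the whole computation away at build time. */
volatile float sample_a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
volatile float sample_b[4] = {5.0f, 6.0f, 7.0f, 8.0f};
volatile float sample_out[4];

/* Hypothetical workload: one 2x2 single-precision multiply. */
void sample_work(void)
{
    float a00 = sample_a[0], a01 = sample_a[1];
    float a10 = sample_a[2], a11 = sample_a[3];
    float b00 = sample_b[0], b01 = sample_b[1];
    float b10 = sample_b[2], b11 = sample_b[3];

    sample_out[0] = a00 * b00 + a01 * b10;
    sample_out[1] = a00 * b01 + a01 * b11;
    sample_out[2] = a10 * b00 + a11 * b10;
    sample_out[3] = a10 * b01 + a11 * b11;
}

/* Read the counter: TSCL on the DSP, clock() as a host stand-in. */
unsigned int read_ticks(void)
{
#ifdef _TMS320C6X
    return TSCL;
#else
    return (unsigned int)clock();
#endif
}

/* Time one call of fn and subtract the cost of the two counter reads. */
unsigned int time_kernel(void (*fn)(void))
{
    unsigned int t0, t1, overhead, elapsed;

#ifdef _TMS320C6X
    TSCL = 0;                 /* any write starts the free-running counter */
#endif
    t0 = read_ticks();
    t1 = read_ticks();
    overhead = t1 - t0;       /* cost of back-to-back reads */

    t0 = read_ticks();
    fn();
    t1 = read_ticks();
    elapsed = t1 - t0;

    return (elapsed > overhead) ? (elapsed - overhead) : 0u;
}
```

Calling time_kernel(sample_work) several times and discarding the first result helps separate cold-cache effects from the steady-state count.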
Hello One and Zero, thanks for your reply.
1. What I mean by kernel is this: can't I write my own code in C, parallelize it, optimize it, and then change or configure the memory placement accordingly? Do I really need something like a kernel?
2. I installed DSPLIB for Linux, went to the installation folder, and looked into packages and then src, where there are some code examples. In the folder you mentioned there is nothing with a .sa extension; I did not find any linear assembly in it. All the files are C code. Can you attach the folder for me? I would appreciate your help.
I am trying hard to understand linear assembly for the C6678.