Data in register

Hello,

I am asking my question again because I have not yet received a satisfying answer. I am using the C6678 and have run matrix-matrix multiplication on it in both single and double precision. I have tried keeping the data in L1, L2, MSMC and DDR3, using CCS and RTSC. Now I want to measure performance, in cycle counts, with my data placed as close to the core as possible, meaning in the registers and nowhere else. I am not talking about large matrices; I am only interested in the smallest sizes, say 2 by 2 or even 3 by 3.

Do I need to write linear assembly for that? I am not sure. I have my own C code for matrix-matrix multiplication. Can anybody please help me with this? I have been looking for a solution but have not found a satisfying one. If you don't have a solution, please point me to some links (please no RTSC links) or documents. I want to put my data in registers and compute the matrix multiplication there.

Hope I will get some solutions this time.

Thanks and Regards,
Arun 

  • Hi Arun,

    There's the register keyword in C. The register storage-class specifier asks the compiler to keep the variable being declared in a CPU register (if possible) to optimize access. For example:

    register int i;

    Kind regards,

    one and zero

  • Hello One and Zero!

    I was thinking nobody would reply to this question. People told me it is not possible, or that we can only keep data there for a short time, but I don't see why we couldn't keep data in registers for as long as we want. Anyway,

    many thanks for your reply. Actually, I am not quite sure how to do that. Do you know of any examples where this has been done, so I can have a look and understand? The C6678 DSP has two sides, A and B, with 32 registers on each side. With CCS, for L1, L2 and the other memories, we can see in the auto-generated linker file which part was loaded where. Will I be able to see the same thing for registers as well?

    Thanks and Regards,
    Arun 

  • Hi Arun,

    As you know, the C66x includes 64 32-bit registers. The best way to use them is with linear assembly or assembly.

    BR,

    HR

  • Hi Arun,

    The register keyword does not let you control which registers are used, nor does it guarantee that the compiler actually uses a register. It only tells the compiler that you think the variable should be kept in a register, and the compiler will try to allocate one. So it is a recommendation to the compiler, not a command.

    The usage is straightforward: just put the register keyword in front of your variable declaration. See also:

    http://www.geeksforgeeks.org/archives/4346

    In case you want to manually control the register allocation you have to go to assembly.
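
    To illustrate how far plain C can go for a tiny matrix, here is a minimal sketch of a 2x2 single-precision multiply. The function name mat2_mul and the row-major layout are assumptions made for this example. Every operand is copied into a scalar local whose address is never taken, which leaves the compiler free to keep them all in registers; whether it actually does so is still the compiler's decision and is best checked in the generated assembly.

    ```c
    #include <assert.h>

    /* Hypothetical helper: c = a * b for 2x2 row-major float matrices. */
    void mat2_mul(const float a[4], const float b[4], float c[4])
    {
        /* Scalar locals whose addresses are never taken: the compiler may
           hold all eight operands in registers for the whole computation. */
        float a00, a01, a10, a11, b00, b01, b10, b11;

        assert(a && b && c);

        a00 = a[0]; a01 = a[1]; a10 = a[2]; a11 = a[3];
        b00 = b[0]; b01 = b[1]; b10 = b[2]; b11 = b[3];

        c[0] = a00 * b00 + a01 * b10;
        c[1] = a00 * b01 + a01 * b11;
        c[2] = a10 * b00 + a11 * b10;
        c[3] = a10 * b01 + a11 * b11;
    }
    ```

    Because all loads happen before any store, the result is also correct when c is the same array as a or b.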

    Kind regards,

    one and zero

  • Hello One and Zero,

    Thanks for your reply. Yes, I was aware that the registers are 32 bits wide and that there are 64 of them across sides A and B. I also had doubts about relying on the register keyword in C code. But I am not interested in controlling which registers are used; I only want to make sure that every calculation happens in registers. I read somewhere that I need to use linear assembly for that.

    Anyway, can you provide some example assembly code? I have no prior experience with it. I know the idea, but I have never written any assembly.

    Thanks and regards,
    Arun 

  • Hi Arun,

    I'd recommend staying with C, since our compiler does an excellent job of optimization. You can also do a lot at the C level to optimize your code so that it fits the C6000 architecture best. Please have a look at the TMS320C6000 Programmer’s Guide.

    In case you want to educate yourself in linear assembly, Chapter 5 of the Programmer's Guide will be helpful; it also shows code examples.
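
    As a small illustration of C-level tuning, the sketch below uses two common hints: restrict-qualified pointers and the TI MUST_ITERATE pragma. The function name vec_mul and the bounds on n are assumptions made for this example; the pragma is guarded so the same file also builds with a host compiler.

    ```c
    #include <assert.h>

    /* Hypothetical helper: out[i] = x[i] * y[i] for i in [0, n).
       Assumption: n is a multiple of 4 between 4 and 1024. */
    void vec_mul(float *restrict out, const float *restrict x,
                 const float *restrict y, int n)
    {
        int i;

        assert(n >= 4 && n <= 1024 && n % 4 == 0);

    #ifdef _TMS320C6X
        /* TI-specific hint: trip count is 4..1024 and a multiple of 4. */
        #pragma MUST_ITERATE(4, 1024, 4)
    #endif
        for (i = 0; i < n; i++)
            out[i] = x[i] * y[i];
    }
    ```

    The restrict qualifiers promise that out does not overlap x or y, so loads for later iterations may be scheduled ahead of earlier stores. The trip-count hint gives the compiler the information it needs to unroll without emitting extra checks.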

    Kind regards,

    one and zero

  • ... forgot to mention the very useful application report about Hand-Tuning Loops and Control Code on the TMS320C6000.

    It is already a bit old, covers older compiler versions and only goes up to the C64x+ architecture, but the fundamentals and principles still apply today, including for the C66x.

    Kind regards,

    one and zero

  • Hello One and Zero,

    Thank you very much for all these links and the knowledge. I would also prefer to stick with C. The problem is that I want to access and work with the registers, and as far as I have understood and read, I need to move to linear assembly for that. I have never written any linear assembly before, so I am a bit hesitant, but I do not see any other option.

    Yes, I am working on optimization. I am also working through the new paper TI has published on SGEMM and DGEMM for the C6678, and I am trying to optimize the SGEMM kernel from it. Let's see how far I can go. I will keep you guys busy with my questions.

    Thank you very much for all your support and help! I appreciate it.

  • Hi Arun,

    I'm sure you're interested in the paper Unleashing DSPs for General-Purpose HPC, which describes how to implement GEMM on a C6678 in C using intrinsics.
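
    To give a feel for what a GEMM inner kernel does, here is a minimal register-blocked sketch in plain C. It is not the paper's code and uses no intrinsics; the name gemm_2x2_kernel, the 2x2 block size and the row-major layouts are assumptions made for this example. The idea is that the four results live in scalar accumulators for the whole inner loop and are written back to memory only once.

    ```c
    #include <assert.h>

    /* Hypothetical micro-kernel: C(2x2) += A(2xk) * B(kx2).
       a is row-major 2 x k, b is row-major k x 2, c is row-major 2 x 2. */
    void gemm_2x2_kernel(int k, const float *a, const float *b, float *c)
    {
        /* Four accumulators kept in scalar locals across the k loop. */
        float c00 = 0.0f, c01 = 0.0f, c10 = 0.0f, c11 = 0.0f;
        int p;

        assert(k > 0 && a && b && c);

        for (p = 0; p < k; p++) {
            const float a0 = a[p];          /* A(0,p) */
            const float a1 = a[k + p];      /* A(1,p) */
            const float b0 = b[2 * p];      /* B(p,0) */
            const float b1 = b[2 * p + 1];  /* B(p,1) */

            c00 += a0 * b0;
            c01 += a0 * b1;
            c10 += a1 * b0;
            c11 += a1 * b1;
        }

        /* Single write-back of the finished 2x2 block. */
        c[0] += c00;
        c[1] += c01;
        c[2] += c10;
        c[3] += c11;
    }
    ```

    A full GEMM would call such a kernel once per 2x2 block of the result, with the surrounding loops deciding which panels of A and B sit in which memory level.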

    I hope that helps and gives you some more ideas ...

    Kind regards,

    one and zero

  • Hello One and Zero,

    Yes, that is the paper I was talking about. I have already read it and am working on it. Since you have mentioned it, let me ask a couple of questions about it.

    1. I have one major doubt about the kernel. Why do we need the kernel code? Can't we write our own simple matrix-matrix multiplication code, optimize and parallelize it, and then change the memory locations based on the chunks we create and the order in which we want to multiply them? The kernel code is already quite hard to understand.

    2. We all know there is an onboard emulator on the C6678, and it is very slow. So I think we need an external emulator to achieve the results mentioned in this paper. With whatever understanding I got from the paper, I tried using the same kernel code and optimizing it, but unfortunately I got very poor results, around 1% or 2% of what they got. I know I haven't understood it properly, but still.

    3. Another thing is that they say nothing about registers, which is what I am most interested in this time. I want to start from the very first level, then move on to the next memory level and see the difference.

    Thanks and Regards,
    Arun 

  • Hi Arun,

    1. I'm not quite sure what your question is. Of course you could write your own kernel.

    2. The benchmarking result is not dependent on the emulator you're using.

    3. If you want to look at a real linear (or serial) assembly implementation, you can look into DSPLIB; there's an FFT implemented that way (DSPF_sp_fftSPxSP.sa).
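
    On point 2, one common way to take a cycle count on the core itself is to read the C66x time-stamp counter (TSCL) before and after the code under test. The sketch below assumes the TI compiler and its c6x.h header on the target and falls back to the standard clock() function elsewhere, in which case the result is in clock ticks rather than CPU cycles. The names time_calls and count_call are hypothetical.

    ```c
    #ifdef _TMS320C6X
    #include <c6x.h>                  /* declares the TSCL control register */
    typedef unsigned int tick_t;
    static void   tick_init(void) { TSCL = 0; }  /* any write starts the counter */
    static tick_t tick_now(void)  { return TSCL; }
    #else
    #include <time.h>                 /* host fallback: clock ticks, not cycles */
    typedef clock_t tick_t;
    static void   tick_init(void) { }
    static tick_t tick_now(void)  { return clock(); }
    #endif

    /* Hypothetical harness: ticks spent in `reps` calls of fn(arg),
       including the loop and call overhead. */
    unsigned long time_calls(void (*fn)(void *), void *arg, int reps)
    {
        tick_t start, stop;
        int i;

        tick_init();
        start = tick_now();
        for (i = 0; i < reps; i++)
            fn(arg);
        stop = tick_now();

        return (unsigned long)(stop - start);
    }

    /* Hypothetical stand-in workload: increments the int that arg points to. */
    void count_call(void *arg)
    {
        ++*(int *)arg;
    }
    ```

    For a kernel as small as a 2x2 multiply, the call and loop overhead is a large share of the measured time, so it helps to time an empty workload as well and subtract it.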

    Kind regards,

    one and zero

  • Hello One and Zero,

    Thanks for your reply.

    1. What I mean by kernel is: can't I write my own code in C, parallelize it, optimize it, and then change or configure the memory accordingly? Do I really need a kernel-like thing?

    2. I installed DSPLIB for Linux, went to the installation folder, and looked into packages and then src, where there are some code examples. In the folder you mentioned there is nothing with a .sa extension; there is code, but I didn't find any linear assembly in it. It is all C code. Can you attach one folder for me? I would appreciate your help.

    I am trying hard to understand Linear Assembly for C6678.

    Thanks and regards,
    Arun