Hello,
I am searching for examples of multiprocessing on the C674x. I have enabled the OpenMP 3.0 feature in the project settings, but I am not sure whether any examples are available.
Samuel
Hi Samuel,
The OpenMP feature is not supported on the C674x platforms. As far as I know, it is supported only on the Keystone platforms. You might want to check on the compiler forums to see if there is a sample that they can share with you.
Regards,
Rahul
Hi Rahul,
Thanks for the reply. I think it's called parallel processing? Anyway, this is the problem I have: I am trying to do a summation of 100 values in 3us.
for(index = 0; index < 100; index++){
    comp = buffer[index] + ext_counter;
    if(comp > 0x3FFF)
    {
        sumvalue += comp;
    }
    else
    {
        sumvalue -= comp;
    }
}
It is only able to do about 25 iterations in under 3us. Is there another way to do this? The project is already optimized, with memory running in L1 and L2 cache where applicable, etc.
Samuel
Samuel,
Which C674x DSP are you using? What clock speed is the DSP running and what speed is the external memory running (external speed may be irrelevant, but it is always a good question to ask)?
You may be looking for what is called "software pipelining". This is a part of optimization on any C6000 DSP.
If you go to the Wiki and search for "C6000 Optimization" (no quotes), you will find links to some workshop material and some helpful articles about optimization techniques. Since you do not show any use of those techniques, my assumption is that the workshop and the articles will be very useful to you.
Some of the critical things you need to look at in addition to those materials are:
You have not supplied enough information for anyone to give you direct advice or predictions. My expectation is that you can do the 100 passes through this loop in under 3us, but there could be constraints or tradeoffs that make that difficult. Or it could end up being much faster, also depending on the details.
My guess is that either you are not using cache correctly, or you have not turned on the compiler's optimization by using the Release build configuration.
The C674x DSP core naturally implements parallel processing since it has 8 processing units. There is nothing you need to do to use this, except to apply the knowledge that you gain from the workshop and articles and to consider the items I listed above. Once you have gone through the workshop material and read the articles on the Wiki, we can continue by discussing your new results.
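As a rough illustration of the kind of loop the optimizer has a good chance of software-pipelining, here is a sketch of the summation from the first post. The function name signed_sum, the int32_t types, and the restrict qualifier are assumptions, since the real datatypes have not been given. The if/else is written as a conditional select, which the compiler may be able to map to predicated instructions, and the body contains no function calls.

```c
#include <stdint.h>

#define NUM_SAMPLES 100      /* loop count from the original post */
#define THRESHOLD   0x3FFF   /* comparison value from the original post */

/* Hypothetical helper: adds comp when it exceeds THRESHOLD, subtracts it
 * otherwise. Types are assumed; the thread does not state them. */
int32_t signed_sum(const int32_t *restrict buffer, int32_t ext_counter)
{
    int32_t sumvalue = 0;
    int index;

    for (index = 0; index < NUM_SAMPLES; index++) {
        int32_t comp = buffer[index] + ext_counter;
        /* Conditional select instead of an if/else block. */
        sumvalue += (comp > THRESHOLD) ? comp : -comp;
    }
    return sumvalue;
}
```

Whether this actually pipelines depends on the build options and on where buffer is located, so the compiler's own feedback is the thing to check.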
Regards,
RandyP
Excellent reply. I am going through the optimization workshop now. I am using the L1 and L2 setup as per my evmomap L-138 PD LOGIC board example code, with the following code inserted:
setup_DDR2_cache1();
enable_L1();
enable_L2();
I know it works because I have tried the same loop with and without it, and it gives more than a 10x performance increase. The memory map in the l138.cmd file places all the sections in external mDDR RAM.
I am using an IO toggle, which is also used as the convert start for the external ADC, to measure the time it takes to run the loop code block. It cannot get past 25 loops in 3us. The processor is running at 300MHz. The disassembly has about 50 lines for each loop when I multiply the summation with a float. All local variables are declared with the keyword "register".
sum -= buffer[index] * ft_value;
Seeing that the processor runs at 3125 MIPS and 3us allows about 100 instructions....so it might be impossible to do summation of 100 numbers in 3us? Any thoughts on this?
Samuel
Samuel,
The optimization study will help you improve the performance. Please keep us posted on your progress with that.
Samuel Raj1 said: All local variables are declared with the keyword "register".
I never use the register keyword. The compiler is smarter about allocating variables than I am.
Samuel Raj1 said: the disassembly has about 50 lines for each loop when I multiply the summation with a float.
sum -= buffer[index] * ft_value;
Where did this come from? There was no float multiplication mentioned in the first code example. For that matter, you still have not said what datatypes the variables are. Are you changing your algorithm by adding new requirements, but still want it to run 4 times faster than it does? I am very confused.
Why would this multiplication be inside the loop? If you are multiplying every summed value by a constant, then you should be able to simply do the summation and then do a single multiplication at the end, outside of the loop.
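To make that concrete, here is a minimal sketch comparing the two forms. The function names and the unsigned int element type are assumptions for illustration, and it assumes ft_value really is the same constant for every element and that the integer total does not overflow.

```c
#include <stddef.h>

/* Multiplication inside the loop: one float multiply per element. */
float sum_scaled_in_loop(const unsigned int *buffer, size_t count, float ft_value)
{
    float sum = 0.0f;
    size_t i;

    for (i = 0; i < count; i++) {
        sum -= (float)buffer[i] * ft_value;
    }
    return sum;
}

/* Multiplication hoisted: integer adds in the loop, then a single
 * conversion and a single float multiply after the loop. */
float sum_scaled_hoisted(const unsigned int *buffer, size_t count, float ft_value)
{
    unsigned int total = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        total += buffer[i];
    }
    return -((float)total * ft_value);
}
```

The two results can differ in the last bits because floating-point rounding happens at different points, so this only applies if that difference is acceptable.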
Samuel Raj1 said: The processor is running at 300MHz....
the processor runs at 3125 MIPS and 3us allows about 100 instructions
I do not understand this calculation. I understand the processor is running at 300MHz, but I do not know where you got the number 3125 MIPS and I do not know how you got to 100 instructions in 3us.
At a clock rate of 300MHz, you can run 300 instructions in 1us and 900 instructions in 3us.
Regards,
RandyP
Hi RandyP,
Thanks for the reply. Yes, the example I gave prior to my last reply did not have a float. The float is in a table, so the code will look like this:
loop(index){
    array_pos = table[index]
    if(....)
        sum_y += buffer[array_pos] * float_table[index];
    else
        sum_y -= buffer[array_pos] * float_table[index];
}
Yes, I think my calculation of the number of instructions in 3us is wrong. Thanks for the correction. According to the datasheet...
http://www.ti.com/lit/ds/sprs586d/sprs586d.pdf
3648MIPS is stated for the processor running at 375MHz, so 1us allows 3648 instructions to run. Therefore 3us will allow 10944 instructions. Hope I am spot on this time unless there is something I might have misunderstood from the datasheet.
So even if one iteration is 50 instructions, 100 iterations will be just 5000 instructions in total. So it is possible to finish this piece of code in less than 3us. But that's not happening.
sum_y is unsigned int
buffer is unsigned int
index is unsigned int
array_pos is unsigned long
Regards,
Samuel
Samuel,
Continue your studying of the optimization training and literature. This is what will help you with this project and any in the future.
It is not practical for me to give you advice on code that changes substantially with every repetition. The use of unsigned variables for math seems impractical.
What I can tell is that you originally wanted to perform 100 reads and 100 adds in 3us. And now you want to do 300 reads and 100 floating point multiplies and 100 fixed-to-float conversions and 100 float-to-fixed conversions, all in 3us.
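To show where those conversions come from, here is a sketch of the posted loop next to a variant that keeps the running sum in a float. The function names are hypothetical, the variable types follow the ones listed in the post, and the if/else is left out because its condition was not given.

```c
#include <stddef.h>

/* As posted: sum_y is unsigned int, so every iteration converts the
 * buffer value and sum_y to float, then converts the result back to
 * unsigned int (dropping any fractional part each time). */
unsigned int accumulate_unsigned(const unsigned int *buffer,
                                 const unsigned long *table,
                                 const float *float_table,
                                 size_t count)
{
    unsigned int sum_y = 0;
    size_t index;

    for (index = 0; index < count; index++) {
        unsigned long array_pos = table[index];
        sum_y += buffer[array_pos] * float_table[index];
    }
    return sum_y;
}

/* Variant with a float accumulator: only the buffer value is converted
 * inside the loop, and nothing is converted back until the caller
 * decides to. */
float accumulate_float(const unsigned int *buffer,
                       const unsigned long *table,
                       const float *float_table,
                       size_t count)
{
    float sum_y = 0.0f;
    size_t index;

    for (index = 0; index < count; index++) {
        sum_y += (float)buffer[table[index]] * float_table[index];
    }
    return sum_y;
}
```

The two versions do not return the same value in general, since the first one truncates on every pass, so which one is correct depends on what the algorithm is supposed to compute.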
You will understand better how to reach your goal when you have completed the studying. Do the labs and follow the material in the workshop, and you will learn a lot about the architecture of the C674x DSP core and how to use it to full advantage.
Samuel Raj1 said: 3648MIPS is stated for the processor running at 375MHz, so 1us allows 3648 instructions to run. Therefore 3us will allow 10944 instructions. Hope I am spot on this time unless there is something I might have misunderstood from the datasheet.
The maximum internal clock speed for the C6748 is 456MHz. There are 8 processing units that can execute instructions simultaneously. 456 * 8 = 3648 MIPS, which is a theoretical limit for the number of instructions that can be executed. But this is not a useful number for figuring out how fast your algorithm will run. It is more of a marketing number that sets the "up to" limit for comparison purposes.
This is not for 375MHz operation, although it is not clear from the datasheet's Features page what the 3648 number means.
At 375MHz, you have 3 * 375 = 1125 "execution cycles" in 3us. Each execution cycle may include 1 instruction execution or 2 or up to 8 instruction executions. Your job using the optimization techniques is to give the compiler's optimizer as much information as possible so it can use as many parallel instructions as possible.
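The arithmetic can be written down as a small sketch. The helper names are made up for illustration, and the per-iteration figure is a budget of cycles, not a prediction of how the loop will actually schedule.

```c
/* Execution cycles available in a time window: MHz * us = cycles. */
unsigned long cycles_in_window(unsigned long clock_mhz, unsigned long window_us)
{
    return clock_mhz * window_us;
}

/* Theoretical upper bound on instructions, assuming all 8 processing
 * units execute an instruction on every cycle. */
unsigned long peak_instructions(unsigned long cycles)
{
    return cycles * 8UL;
}

/* Whole cycles available per loop iteration for a given budget. */
unsigned long cycles_per_iteration(unsigned long cycles, unsigned long iterations)
{
    return cycles / iterations;
}
```

With these, cycles_in_window(300, 3) gives 900 cycles, or 9 cycles per iteration for 100 iterations, and cycles_in_window(375, 3) gives 1125 cycles, or 11 whole cycles per iteration.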
To get the best possible performance for this loop, you will need to use at least part of L1D as SRAM and put all of table[], buffer[], and float_table[] in that L1D SRAM.
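One way this placement might be expressed with the TI compiler's DATA_SECTION pragma is sketched below. The section name ".l1d_data", the array sizes, and the helper l1d_bytes_needed are assumptions for illustration. The linker command file would also have to map that section to a memory range covering L1D RAM, and the L1D cache size would have to be reduced so that part of L1D is addressable as SRAM; both steps are device-specific and are not shown.

```c
#include <stddef.h>

#define NUM_POINTS 100   /* assumed size; the thread does not give it */

/* The pragmas are only seen by the TI compiler; other compilers skip them. */
#ifdef __TI_COMPILER_VERSION__
#pragma DATA_SECTION(table, ".l1d_data")
#pragma DATA_SECTION(buffer, ".l1d_data")
#pragma DATA_SECTION(float_table, ".l1d_data")
#endif
unsigned long table[NUM_POINTS];
unsigned int  buffer[NUM_POINTS];
float         float_table[NUM_POINTS];

/* Bytes that must fit in the part of L1D configured as SRAM. */
size_t l1d_bytes_needed(void)
{
    return sizeof(table) + sizeof(buffer) + sizeof(float_table);
}
```

Checking the generated map file is the simplest way to confirm that the three arrays really ended up at L1D addresses.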
If this is an urgent project, you may find it profitable to contract with one of the TI Design House companies who have a great deal of experience with optimizing C6000 algorithms. Go to TI.com to start the search for an appropriate company, if that is a direction you want to go.
Regards,
RandyP
Hi RandyP
I have tried implementing the code without the if statements in the loop. I was able to go up to 100 summations in less than 3 us. However I also have to remove float multiplication in order to achieve this. With the float multiplication it will just go back to 20 counts even if there was no conditional statements in the loop.
sum_number and data_input are unsigned int. f_multiplier is a float array with 100 elements.
for(n = 0; n < 100; n++)
    sum_number += (data_input - 0x2000) * f_multiplier[n];
This is how the code looks now. It would appear that I am very close to a solution. Any advice on how I can improve/optimize the float multiplication?
Samuel
Samuel,
May I again point out that your code snippet has changed substantially? Since you do not show any use of the optimization techniques and pragmas, I can tell that you are still working through that information. You will have your best chance for success after you have 1) decided what your algorithm needs to be, and 2) learned the optimization material and applied it to your code.
Regards,
RandyP
Hi RandyP,
This will be the final code. I would also like to ask about
Samuel,
That is an excellent application note, so I could not explain any of it better than what you can read there. The text in that App Note explains what the lower bound and upper bound are, and gives an example of how to use the factor. The same is true for the UNROLL pragma.
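For orientation only, here is a sketch of how the two pragmas might be attached to the loop you posted. MUST_ITERATE takes the minimum trip count, the maximum trip count, and a factor that the trip count is always a multiple of; UNROLL takes the requested unroll factor. The function name weighted_sum, the float accumulator, the signed cast, and the unroll factor of 4 are assumptions, not recommendations from the App Note.

```c
#define NUM_TAPS 100   /* loop count from the posted code */

/* Hypothetical helper based on the posted loop. The (data_input - 0x2000)
 * term does not change inside the loop, so it is computed once. The cast
 * to int is an assumption so that inputs below 0x2000 give a negative
 * value instead of wrapping around. */
float weighted_sum(unsigned int data_input, const float *f_multiplier)
{
    float centered = (float)((int)data_input - 0x2000);
    float sum = 0.0f;
    int n;

    /* Only the TI compiler sees these; the loop always runs exactly
     * 100 times, and 100 is a multiple of 4. */
#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(100, 100, 4)
#pragma UNROLL(4)
#endif
    for (n = 0; n < NUM_TAPS; n++) {
        sum += centered * f_multiplier[n];
    }
    return sum;
}
```

Comparing the compiler's loop feedback with and without the pragmas is the quickest way to see what they change.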
Have you tried these to see what they do? What have you tried?
Have you looked at the compiler's advice? Look at the compiler switches available for optimization and turn on the ones that help with advice or consultant, if they are available for your version of the compiler.
What compiler switches are you using?
The Compiler User Guide also has explanations for these, so the extra wording may help you. And the workshop material not only has the syntax and some descriptions, but the labs and solutions show you how to use it and show you exactly what they do.
Regards,
RandyP