Any example for C674x on Multiprocessing? Using PD Logic EVM?

Samuel Raj1

Hello,

I am searching for multiprocessing examples for the C674x. I have enabled the OpenMP 3.0 feature in the project settings, but I am not sure whether any examples are available.

Samuel

over 12 years ago

0 Rahul Prabhu over 12 years ago

TI__Guru** 116830 points

Hi Samuel,

The OpenMP feature is not supported on the C674x platforms. It is supported only on the Keystone platforms as far as I know. You might want to check on the compiler forums to see if there is a sample that they can share with you.

Regards,

Rahul

0 Samuel Raj1 over 12 years ago

Intellectual 280 points

Hi Rahul,

Thanks for the reply. I think it's called parallel processing? Anyway, this is the problem I have: I am trying to do a summation of 100 values within 3us.

for(index = 0; index < 100; index++){
    comp = buffer[index] + ext_counter;
    if(comp > 0x3FFF)
    {
        sumvalue += comp;
    }
    else
    {
        sumvalue -= comp;
    }
}

It is only able to do about 25 iterations in under 3us. Is there another way to do this? The project is already optimized, with memory running in L1 and L2 cache where applicable, etc.

Samuel

0 RandyP over 12 years ago in reply to Samuel Raj1

TI__Guru* 84110 points

Samuel,

Which C674x DSP are you using? What clock speed is the DSP running and what speed is the external memory running (external speed may be irrelevant, but it is always a good question to ask)?

You may be looking for what is called "software pipelining". This is a part of optimization on any C6000 DSP.

If you go to the Wiki and search for "C6000 Optimization" (no quotes), you will find links to some workshop material and some helpful articles about optimization techniques. Since you do not show any use of those techniques, my assumption is that the workshop and the articles will be very useful to you.

Some of the critical things you need to look at in addition to those materials are:

Consider splitting L1D into 1/2 SRAM and 1/2 cache so you can have the data reside directly in L1D before reading it. This will save the time to load the cache.
If you are using cache and not SRAM, then make sure you have enabled caching for your data buffers; this uses the MAR registers.
Examine the datatypes and sizes to make sure your accesses are efficient.
Look at the assembly output of the compiler to see if it appears to be efficient. You can look in the CPU & Instruction Set Reference Guide to understand what each instruction does, but just concentrate on this loop and not the rest of the code and directives that are included in the asm file.
The 'if' statement in the loop may make the most advanced optimizations difficult. I do not understand the purpose of that comparison, but you have a good reason for it or it would not be there. Examine the need and try to find a way to change it or move it out of the loop.
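
As a rough illustration of the last item, the sketch below keeps the arithmetic of the posted loop but folds the decision into a single select expression, which is usually easier for an optimizer to software-pipeline than an if/else with two bodies. The `int32_t` types, the function names, and the fixed count `N` are assumptions, since the thread does not give the real declarations, and the compiler may already perform this transformation on its own.

```c
#include <stdint.h>

#define N 100  /* loop count taken from the thread */

/* Reference form: the if/else from the posted loop. */
int32_t sum_with_branch(const int32_t *buffer, int32_t ext_counter)
{
    int32_t sumvalue = 0;
    for (int i = 0; i < N; i++) {
        int32_t comp = buffer[i] + ext_counter;
        if (comp > 0x3FFF)
            sumvalue += comp;
        else
            sumvalue -= comp;
    }
    return sumvalue;
}

/* Same result, with the decision expressed as a select on the addend. */
int32_t sum_select(const int32_t *restrict buffer, int32_t ext_counter)
{
    int32_t sumvalue = 0;
    for (int i = 0; i < N; i++) {
        int32_t comp = buffer[i] + ext_counter;
        sumvalue += (comp > 0x3FFF) ? comp : -comp;
    }
    return sumvalue;
}
```

Comparing the assembly output of the two forms shows whether the rewrite changes anything for a given compiler version and option set.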

You have not supplied enough information for anyone to give you direct advice or predictions. My expectation is that you can do the 100 passes through this loop in under 3us, but there could be constraints or tradeoffs that make that difficult. Or it could end up being much faster, also depending on the details.

My guess is that either you are not using cache correctly, or you have not turned on the compiler's optimization by using the Release build configuration.

The C674x DSP core naturally implements parallel processing since it has 8 processing units. There is nothing you need to do to use this, except to apply the knowledge that you gain from the workshop and articles, and to consider the items I listed above. Once you have gone through the workshop material and read the articles on the Wiki, we can continue by discussing your new results.

Regards,
RandyP

0 Samuel Raj1 over 12 years ago in reply to RandyP

Intellectual 280 points

Excellent reply. I am going through the optimization workshop now. I am using the L1 and L2 setup from the example code for my OMAP-L138 EVM (PD Logic) board, with the following code inserted

setup_DDR2_cache1();
enable_L1();
enable_L2();

I know it works because I have tried the same loop with and without it, and it gives more than a 10x performance increase. The memory map in the l138.cmd file places all the sections in external mDDR RAM.

I am using an I/O toggle, which is also used as the convert-start signal for the external ADC, to measure the time it takes to run the loop code block. It cannot get past 25 iterations in 3us. The processor is running at 300MHz. The disassembly shows about 50 lines for each loop iteration when I multiply the summation by a float. All local variables are declared with the keyword "register".

sum -= buffer[index] * ft_value;

Seeing that the processor runs at 3125 MIPS and 3us allows about 100 instructions....so it might be impossible to do summation of 100 numbers in 3us? Any thoughts on this?

Samuel

0 RandyP over 12 years ago in reply to Samuel Raj1

TI__Guru* 84110 points

Samuel,

The optimization study will help you improve the performance. Please keep us posted on your progress with that.

Samuel Raj1 said:
All local variables are declared with the keyword "register".

I never use the register keyword. The compiler is smarter about allocating variables than I am.

Samuel Raj1 said:

The disassembly shows about 50 lines for each loop iteration when I multiply the summation by a float.

sum -= buffer[index] * ft_value;

Where did this come from? There was no float multiplication mentioned in the first code example. For that matter, you still have not said what datatypes the variables are. Are you changing your algorithm by adding new requirements, but still want it to run 4 times faster than it does? I am very confused.

Why would this multiplication be inside the loop? If you are multiplying every summed value by a constant, then you should be able to simply do the summation and then do a single multiplication at the end, outside of the loop.
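
A minimal sketch of that idea, assuming the multiplier really is one constant for the whole loop (the later posts suggest it may not be). The function names and the `unsigned int` buffer type are illustrative, and the two forms can differ slightly in floating-point rounding because the multiplications happen at different points.

```c
#define N 100  /* loop count taken from the thread */

/* Multiply inside the loop: one int-to-float conversion and one
   float multiply per element. */
float sum_scaled_inside(const unsigned int *buffer, float ft_value)
{
    float sum = 0.0f;
    for (int i = 0; i < N; i++)
        sum -= (float)buffer[i] * ft_value;
    return sum;
}

/* Integer summation in the loop, then a single conversion and a
   single multiply after it. */
float sum_scaled_outside(const unsigned int *buffer, float ft_value)
{
    unsigned int total = 0;
    for (int i = 0; i < N; i++)
        total += buffer[i];
    return -((float)total * ft_value);
}
```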

Samuel Raj1 said:

The processor is running at 300MHz....

the processor runs at 3125 MIPS and 3us allows about 100 instructions

I do not understand this calculation. I understand the processor is running at 300MHz, but I do not know where you got the number 3125 MIPS and I do not know how you got to 100 instructions in 3us.

At a clock rate of 300MHz, you can run 300 instructions in 1us and 900 instructions in 3us.

Regards,
RandyP

0 Samuel Raj1 over 12 years ago in reply to RandyP

Intellectual 280 points

Hi RandyP,

Thanks for the reply. Yes, the example I gave prior to my last reply didn't have a float. The float is in a table, so the code will look like this.

loop(index){
    array_pos = table[index];
    if(....)
        sum_y += buffer[array_pos] * float_table[index];
    else
        sum_y -= buffer[array_pos] * float_table[index];
}

Yes, I think my calculation of the number of instructions for 3us is wrong. Thanks for the correction. According to the datasheet...

http://www.ti.com/lit/ds/sprs586d/sprs586d.pdf

3648 MIPS is stated for the processor running at 375MHz, so 1us allows 3648 instructions to run. Therefore 3us will allow 10944 instructions. Hope I am spot on this time, unless there is something I have misunderstood from the datasheet.

So even if one iteration is 50 instructions, 100 iterations will be just 5000 instructions in total. So it should be possible to complete this piece of code in less than 3us. But that's not happening.

sum_y is unsigned int

buffer is unsigned int

index is unsigned int

array_pos is unsigned long

Regards,

Samuel

0 RandyP over 12 years ago in reply to Samuel Raj1

TI__Guru* 84110 points

Samuel,

Continue your studying of the optimization training and literature. This is what will help you with this project and any in the future.

It is not practical for me to give you advice on code that changes substantially with every repetition. The use of unsigned variables for math seems impractical.

What I can tell is that you originally wanted to perform 100 reads and 100 adds in 3us. And now you want to do 300 reads and 100 floating point multiplies and 100 fixed-to-float conversions and 100 float-to-fixed conversions, all in 3us.
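
To make the conversion count concrete, here is a hedged sketch of the table-driven loop in two forms. With an integer accumulator, every pass converts the accumulator to float and back; with a float accumulator, the float-to-fixed step disappears from the loop. Signed `int` data, the function names, and the omission of the unspecified `if` condition are all assumptions. The two forms are not numerically identical, because the integer form truncates after every pass.

```c
#define N 100  /* loop count taken from the thread */

/* Integer accumulator: each pass does int -> float on the sample and on
   the accumulator, a float multiply-add, then float -> int (truncating). */
int sum_int_acc(const int *buffer, const unsigned int *table,
                const float *float_table)
{
    int sum_y = 0;
    for (int i = 0; i < N; i++)
        sum_y += buffer[table[i]] * float_table[i];
    return sum_y;
}

/* Float accumulator: one int -> float per sample, and no conversion
   back to integer inside the loop. */
float sum_float_acc(const int *buffer, const unsigned int *table,
                    const float *float_table)
{
    float sum_y = 0.0f;
    for (int i = 0; i < N; i++)
        sum_y += (float)buffer[table[i]] * float_table[i];
    return sum_y;
}
```

With samples 1, 2, 3 and a coefficient of 0.5 for each, the float form gives 3.0 while the integer form gives 2, so the choice of accumulator type is part of the algorithm definition, not only a speed question.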

You will understand better how to reach your goal when you have completed the studying. Do the labs and follow the material in the workshop, and you will learn a lot about the architecture of the C674x DSP core and how to use it to full advantage.

Samuel Raj1 said:
3648 MIPS is stated for the processor running at 375MHz, so 1us allows 3648 instructions to run. Therefore 3us will allow 10944 instructions. Hope I am spot on this time, unless there is something I have misunderstood from the datasheet.

The maximum internal clock speed for the C6748 is 456MHz. There are 8 processing units that can execute instructions simultaneously. 456 * 8 = 3648 MIPS, which is a theoretical limit for the number of instructions that can be executed. But this is not a useful number for figuring how fast your algorithm will run, this is more of a marketing number that sets the "up to" limit for comparison purposes.

This is not for 375MHz operation, although it is not clear from the datasheet's Features page what the 3648 number means.

At 375MHz, you can have 3 * 375 = 1125 "execution cycles" in 3us. Each execution cycle may include 1 instruction execution or 2 or up to 8 instruction executions. Your job using the optimization techniques is to give the compiler's optimizer as much information as possible so it can use as many parallel instructions as possible.
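
The arithmetic in this reply can be written out as a small helper. The function names are made up for illustration; the numbers are the ones quoted in the thread (300MHz and 375MHz clocks, a 3us window, 100 passes, 8 functional units).

```c
/* CPU cycles available in a time window: MHz times microseconds. */
unsigned long cycles_in_window(unsigned long clock_mhz, unsigned long window_us)
{
    return clock_mhz * window_us;
}

/* Whole cycles available for each loop pass within that window. */
unsigned long cycles_per_pass(unsigned long clock_mhz, unsigned long window_us,
                              unsigned long passes)
{
    return cycles_in_window(clock_mhz, window_us) / passes;
}

/* Theoretical peak instruction rate: 8 functional units issuing every cycle. */
unsigned long peak_mips(unsigned long clock_mhz)
{
    return clock_mhz * 8UL;
}
```

For example, `cycles_in_window(375, 3)` is 1125 and `peak_mips(456)` is 3648, matching the figures above, while `cycles_per_pass(300, 3, 100)` is 9: at the 300MHz clock reported earlier, each of the 100 passes has about 9 cycles, which is why parallel issue and software pipelining matter.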

To get the best possible performance for this loop, you will need to use at least part of L1D as SRAM and put all of table[], buffer[], and float_table[] in that L1D SRAM.
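
One way to request that placement with the TI compiler is the `DATA_SECTION` pragma, sketched below. The section name `.l1d_data` is hypothetical, and this only takes effect if part of L1D is configured as SRAM and the linker command file maps that section to the L1D SRAM range; neither step is shown here. The array types are guesses, and the `__TI_COMPILER_VERSION__` guard simply keeps other compilers from seeing a pragma they do not know.

```c
#define N 100  /* array length taken from the thread */

#ifdef __TI_COMPILER_VERSION__
/* Hypothetical section name; the linker command file must place it
   in the part of L1D that is configured as SRAM. */
#pragma DATA_SECTION(table, ".l1d_data")
#pragma DATA_SECTION(buffer, ".l1d_data")
#pragma DATA_SECTION(float_table, ".l1d_data")
#endif

unsigned int table[N];        /* index table        */
unsigned int buffer[N];       /* input samples      */
float        float_table[N];  /* float coefficients */
```

The linker map file is the place to confirm that the three arrays actually landed in L1D rather than in external memory.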

If this is an urgent project, you may find it profitable to contract with one of the TI Design House companies who have a great deal of experience with optimizing C6000 algorithms. Go to TI.com to start the search for an appropriate company, if that is a direction you want to go.

Regards,
RandyP

0 Samuel Raj1 over 12 years ago in reply to RandyP

Intellectual 280 points

Hi RandyP

I have tried implementing the code without the if statements in the loop, and I was able to get up to 100 summations in less than 3us. However, I also had to remove the float multiplication in order to achieve this. With the float multiplication it goes back to about 20 iterations, even with no conditional statements in the loop.

sum_number and data_input are unsigned int. f_multiplier is a float array with 100 elements.

for(n = 0; n < 100; n++)
    sum_number += (data_input - 0x2000) * f_multiplier[n];

This is how the code looks now. It looks like I am very close to a solution. Any advice on how I can improve/optimize the float multiplication?

Samuel

0 RandyP over 12 years ago in reply to Samuel Raj1

TI__Guru* 84110 points

Samuel,

May I again point out that your code snippet has changed substantially? Since you do not show any use of the optimization techniques and pragmas, I can tell that you are still working through that information. You will have your best chance for success after you have 1) decided what your algorithm needs to be, and 2) learned the optimization material and applied it to your code.

Regards,
RandyP

0 Samuel Raj1 over 12 years ago in reply to RandyP

Intellectual 280 points

Hi RandyP,

This will be the final code. I would also like to ask about

#pragma MUST_ITERATE(lower_bound,upper_bound,factor)

I am still not clear on how to use it; in my code block it's just 100 loop cycles. So what are the lower bound and upper bound, and how does factor help? Also this

#pragma UNROLL(factor)

The documentation http://www.ti.com/lit/an/sprabf2/sprabf2.pdf

seems a little vague on the usage. 

Regards,

Samuel

0 RandyP over 12 years ago in reply to Samuel Raj1

TI__Guru* 84110 points

Samuel,

That is an excellent application note, so I could not explain any of it better than what you can read there. The text in that App Note explains what the lower bound and upper bound are, and gives an example of how to use the factor. The same is true for the UNROLL pragma.

Have you tried these to see what they do? What have you tried?

Have you looked at the compiler's advice? Look at the compiler switches available for optimization and turn on the ones that help with advice or consultant, if they are available for your version of the compiler.

What compiler switches are you using?

The Compiler User Guide also has explanations for these, so the extra wording may help you. And the workshop material not only has the syntax and some descriptions, but the labs and solutions show you how to use it and show you exactly what they do.
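
For orientation only, here is a small sketch of where the two pragmas go. It is not taken from the App Note or the workshop, and the function name and the `count` parameter are invented. `MUST_ITERATE(4, 100, 4)` is a promise to the compiler that the loop runs at least 4 times, at most 100 times, and always a multiple of 4 times; `UNROLL(4)` asks for the body to be replicated 4 times, which that promise makes safe. For a loop that always runs exactly 100 times, both bounds would be 100, and if the bound is a literal constant the compiler can already see the count.

```c
#define N 100  /* largest count used in the thread */

/* Hypothetical helper: sums sample * coeff[n] over count entries.
   The caller must keep count within the range promised below. */
float weighted_sum(const float *restrict coeff, float sample, int count)
{
    float sum = 0.0f;
    int n;
#ifdef __TI_COMPILER_VERSION__
    /* min 4 passes, max 100 passes, pass count is a multiple of 4 */
    #pragma MUST_ITERATE(4, 100, 4)
    /* request 4 copies of the loop body per iteration */
    #pragma UNROLL(4)
#endif
    for (n = 0; n < count; n++)
        sum += sample * coeff[n];
    return sum;
}
```

Breaking the promise, for example by calling with `count` equal to 6, is a bug in the caller, so the bounds should describe what the code can guarantee rather than what is hoped for.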

Regards,
RandyP
