Hi
I am trying to optimize a simple sum-of-products loop using #pragma directives.
A snippet of the code is shown below.
#include <stdio.h>

unsigned char a[64];
unsigned char b[64];
int sum = 0;

main()
{
    int i;

    for(i = 0; i < 64; i++)
    {
        a[i] = b[i] = i;
    }

    //#pragma UNROLL(2)
    //#pragma MUST_ITERATE( , , 2)
    for(i = 0; i < 64; i = i++)
        sum = sum + (a[i] * b[i]);
}
In the code above, I tried using both the UNROLL and MUST_ITERATE pragmas to unroll the loop two times. But in both cases, the loop does not appear to be unrolled, and the execution time is not reduced.
Timing remained the same with and without the pragma directives.
I would appreciate your guidance on using #pragma directives.
Welcome to the TI E2E forum. I hope you will find many good answers here and in the TI.com documents and in the TI Wiki Pages. Be sure to search those for helpful information and to browse for the questions others may have asked on similar topics.
The E2E forum is a very large one, and it can be difficult to figure out where the right experts are. In your case with a compiler question, this should be posted in the TI C/C++ Compiler Forum instead of this device-based C64x Single Core DSP Forum. This thread will be moved there this time for your convenience.
It will help us to help you if you will tell us which device you are using, which version of Code Generation Tools you are using, which version of CCS, and what the compiler switch settings are.
For our DSPLIB, there are C and Assembly versions of many of the library functions. The FIR filter is the classic DSP algorithm and you can find examples there, plus a library function that performs the operation you are trying to write.
In the TI Wiki Pages, you will find some articles and workshop material on optimization methods. You can search for "c6000 optimization" (no quotes) to find a list of several of these. Other keywords may help you find additional material.
The fact that you are accumulating 8-bit multiplications into a 32-bit result may impact the efficiency of the architecture. If you are just trying to play with the tools for gaining experience, then you may want to try with native 32-bit operations or use 16-bit data and 16-bit accumulation.
For 8-bit data, the optimal solution would probably be using 4 parallel operations, but it may depend on the specific device you are using.
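As a rough illustration of what "4 parallel operations" could look like at the C source level, here is a sketch that keeps four independent partial sums per iteration. This is only one possible reading of the suggestion: the function names are made up for this example, and whether the compiler maps the loop onto packed 8-bit instructions depends on the device and options, which are not known here.

```c
/* Hypothetical helper: sum of products with four independent
   partial sums, so the four multiplies in one iteration do not
   depend on each other. Names are illustrative only. */
int sum_of_products_x4(const unsigned char *a, const unsigned char *b, int n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;

    /* Main loop: four elements per iteration */
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Remainder loop for lengths that are not a multiple of 4 */
    for (; i < n; i++)
        s0 += a[i] * b[i];

    return s0 + s1 + s2 + s3;
}

/* Fills two 64-element arrays as in the original post (a[i] = b[i] = i)
   and returns the sum of products. */
int demo_sum_x4(void)
{
    unsigned char a[64], b[64];
    int i;

    for (i = 0; i < 64; i++)
        a[i] = b[i] = (unsigned char)i;

    return sum_of_products_x4(a, b, 64);
}
```

With the data from the original post, demo_sum_x4() computes the sum of i*i for i = 0..63, the same value the simple loop would produce.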
Please let us know what you find from the TI Wiki Pages and what questions you may have from those new insights.
Regards,
RandyP
The construct "i=i++" is illegal because it modifies i twice without an intervening sequence point, which is undefined behavior in C. Just use "i++".
When I uncomment those pragmas and compile this test case with -o2 (optimization level 2), the compiler does unroll and software pipeline the loop. As Randy suggests, we need to see the complete command line options as well as the version of the compiler (which is not the same as the version of CCS).
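For reference, a minimal sketch of the loop with "i++" and both pragmas enabled might look like the following. The function names are made up for this example, the MUST_ITERATE arguments (minimum 64, maximum 64, multiple of 2) are an assumption based on the fixed trip count in the posted code, and the pragmas are guarded so that other compilers simply skip them.

```c
/* Hypothetical wrapper around the loop from the original post.
   The pragmas are TI-compiler hints; they are guarded by the TI
   predefined macro so other compilers do not see them. */
int sum_of_products(const unsigned char *a, const unsigned char *b)
{
    int sum = 0;
    int i;

#ifdef __TI_COMPILER_VERSION__
    /* Assumed values: the loop runs exactly 64 times, a multiple of 2 */
    #pragma MUST_ITERATE(64, 64, 2)
    #pragma UNROLL(2)
#endif
    for (i = 0; i < 64; i++)      /* i++ instead of i = i++ */
        sum += a[i] * b[i];

    return sum;
}

/* Initializes the arrays as in the original post and runs the loop. */
int demo_sum(void)
{
    unsigned char a[64], b[64];
    int i;

    for (i = 0; i < 64; i++)
        a[i] = b[i] = (unsigned char)i;

    return sum_of_products(a, b);
}
```

As noted above, the pragmas only have a visible effect when the optimizer is enabled (for example -o2), so the command-line options still matter.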
RandyP said: "you are accumulating 8-bit multiplications"
Actually, according to the rules of the C language, the 8-bit inputs are widened to 32-bit "int" before the multiplication, so each multiplication is actually (at C level) a 32x32 into 32-bit operation. The compiler will actually use the 16x16->32 multiplication instruction, but that's still considered a 32-bit multiplication as far as C is concerned.
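A small sketch of that promotion rule, assuming a platform where "int" is wider than 16 bits (the function names are made up for this example): the product of two unsigned char values is computed as an int, so it does not wrap at 8 bits unless it is explicitly converted back.

```c
/* Both operands are promoted to int before the multiply,
   so the full product is kept. */
int product_as_int(unsigned char x, unsigned char y)
{
    return x * y;
}

/* Converting the result back to unsigned char keeps only
   the low 8 bits of the int product. */
unsigned char product_as_uchar(unsigned char x, unsigned char y)
{
    return (unsigned char)(x * y);
}
```

For example, product_as_int(255, 255) yields 65025, while product_as_uchar(255, 255) yields 1, which is 65025 modulo 256.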