Hi.
The TMS320C64x+ IQmath Library User's Guide states that the IQNcos function takes 54 cycles when stand-alone and only 4 cycles when pipelined (Table 3-7, page 18).
Bullet 2 of the Notes section on page 19 also states:
- The pipelined loop cycles mentioned are measured with only a single function being called in a loop for 1024 iterations. This figure also includes cycles for loading the input and storing output data. A combination of functions may not yield scalable performance. If a large number of functions are used in a loop, the loop may not schedule.
So I have been trying to verify this with a simple loop, following the software pipelining rules and using all the relevant compiler options described in the TMS320C6000 Programmer's Guide to help the compiler pipeline the loop. Here is the portion of my code that contains the loop:
#pragma MUST_ITERATE (1024, 1024, 2);
for (i = 0; i < 1024; i++)
{
    test[i] = _IQcos(x);
}
with the following declared:
int i;
_iq test[1024];
Global Q is left at its default of 24, and the basic compiler options are:
Target Version: C64x+
Optimize for Size (-ms): No
Optimize for Speed (-mf): 5
Opt Level: Function (-o2)
Program Level Opt: -pm
When I profile the loop by counting clock cycles with the clock function in CCS 3.3, the loop takes about 57 000 clock cycles to complete. That works out to about 55 clock cycles per iteration (57 000 / 1024), which is very close to the stand-alone figure of 54 mentioned above.
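For reference, a minimal sketch of this kind of clock()-based measurement is shown here. It is not the exact test code. `fake_iq24_cos` is a hypothetical stand-in for `_IQcos` so that the sketch builds without the IQmath library, `int32_t` stands in for `_iq`, and the names `test_buf`, `cycles_per_iteration` and `run_measurement` are illustrative only. It also assumes that clock() returns CPU cycles, which should hold on the target when the CCS profile clock is enabled; on a host PC clock() returns ticks of CLOCKS_PER_SEC instead.

```c
#include <stdint.h>
#include <time.h>

#define N        1024          /* iteration count used in the loop above */
#define ONE_Q24  (1L << 24)    /* 1.0 in Q24 format */

/* Output buffer; global so the stores cannot be optimized away. */
int32_t test_buf[N];

/* Hypothetical stand-in for _IQcos (assumption, NOT the library routine):
 * truncated Taylor series cos(x) ~ 1 - x^2/2 + x^4/24 in Q24.
 * Only reasonable for small |x|; it exists to give the loop a body. */
int32_t fake_iq24_cos(int32_t x)
{
    int64_t x2 = ((int64_t)x * x) >> 24;   /* x^2 in Q24 (never negative) */
    int64_t x4 = (x2 * x2) >> 24;          /* x^4 in Q24 */
    return (int32_t)(ONE_Q24 - x2 / 2 + x4 / 24);
}

/* Average cost of one iteration after removing the cost of the
 * two clock() calls themselves. */
double cycles_per_iteration(clock_t total, clock_t overhead, int iterations)
{
    return (double)(total - overhead) / (double)iterations;
}

/* Times N calls of the stand-in function and returns the average
 * number of clock() ticks per iteration. */
double run_measurement(int32_t x)
{
    clock_t t0, t1, overhead;
    int i;

    /* Cost of back-to-back clock() calls with nothing in between. */
    t0 = clock();
    t1 = clock();
    overhead = t1 - t0;

    t0 = clock();
    for (i = 0; i < N; i++)
    {
        test_buf[i] = fake_iq24_cos(x);
    }
    t1 = clock();

    return cycles_per_iteration(t1 - t0, overhead, N);
}
```

With the numbers from my run, `cycles_per_iteration(57000, 0, 1024)` gives a value between 55 and 56, compared with the 4 cycles per iteration quoted for the pipelined case.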
The compiler is clearly not pipelining the loop. Can anyone please tell me what I am doing wrong?
Regards.
Estian.