Hi there,
I am using the C67x DSP FastRTS library for the single-precision sine and cosine functions. We are trying to speed up our sine/cosine calls, so we would like to use the "FastRTS (Inlining) Pipelining w/128 Calls" configuration.
According to the FastRTS library benchmark (c67xfastRTS_Benchmarking.pdf), FastRTS (Inlining) Pipelining w/128 Calls is significantly faster than the plain FastRTS call.
For example, for the sine function, FastRTS needs 69 cycles, while FastRTS (Inlining) Pipelining w/128 Calls needs only 17 cycles.
However, when I implemented it on the DSP and measured the processing time, I found that FastRTS (Inlining) Pipelining w/128 Calls took much longer than plain FastRTS. The function calls in my DSP code are below.
(1). test_a = sinsp(value_a); // processing time is about 70 cycles;
(2). test_b = sinsp_i(value_b); // processing time is about 130 cycles;
Could you please tell me what is preventing FastRTS (Inlining) Pipelining w/128 Calls from performing as the benchmark states? How should I invoke the function in the FastRTS (Inlining) Pipelining w/128 Calls configuration so that it takes about 17 cycles per call?
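For reference, here is how I read the benchmark's name. This is only my guess, not something the PDF confirms: the 17-cycle figure might be a per-call average over a loop of 128 inlined calls that the compiler can software-pipeline, rather than the cost of one isolated call like (2) above. A minimal sketch of that loop shape is below. The names sine_block, N_CALLS and sin_inline_standin are hypothetical and mine; the stand-in is just a placeholder polynomial so the sketch compiles off-target, and on the DSP it would be replaced by sinsp_i.

```c
#include <stddef.h>

#define N_CALLS 128  /* assumed to match the "128 Calls" in the benchmark name */

/* Hypothetical stand-in so this compiles on a PC. It is a placeholder
 * polynomial, NOT the FastRTS algorithm. On the DSP, replace the call in
 * the loop below with sinsp_i(in[i]). */
static inline float sin_inline_standin(float x)
{
    float x2 = x * x;
    return x - (x * x2) / 6.0f + (x * x2 * x2) / 120.0f;
}

/* Apply the inlined function to a whole block in one tight loop.
 * The loop body contains no out-of-line function calls, and restrict
 * promises that in and out do not overlap, which are the conditions I
 * would expect the compiler to need before it can overlap iterations. */
void sine_block(const float *restrict in, float *restrict out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[i] = sin_inline_standin(in[i]);
    }
}
```

If this reading is right, I would measure the cycles around sine_block(in, out, N_CALLS) and divide by 128, instead of timing a single sinsp_i call. Please correct me if the benchmark was set up differently.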
Thank you.