Questions on C67x DSP FastRTS library

Other Parts Discussed in Thread: MATHLIB

Hi,

  I am using the C67x DSP FastRTS library for the single-precision sine and cosine functions. We are trying to speed up the execution of these functions, so we would like to use the "FastRTS (Inlining) Pipelining w/128 Calls" configuration.

  According to the FastRTS benchmark document (c67xfastRTS_Benchmarking.pdf), FastRTS (Inlining) Pipelining w/128 Calls increases the processing speed significantly.

  For example, for the sine function, FastRTS needs 69 cycles, while FastRTS (Inlining) Pipelining w/128 Calls needs only 17 cycles.

  However, when I implemented it on the DSP and measured the processing time, I found that FastRTS (Inlining) Pipelining w/128 Calls took much longer than FastRTS. The function calls in my DSP code are below.

 (1). test_a = sinsp(value_a);     // processing time is about 70 cycles;

 (2). test_b = sinsp_i(value_b);     // processing time is about 130 cycles;

Could you please tell me what is preventing FastRTS (Inlining) Pipelining w/128 Calls from performing as the benchmark states? How should I invoke it so that its processing time is 17 cycles?

Thank you.

  • Eugene,

    There are test examples for the sinsp_i function that come with some releases of FastRTS. Have you tried one of those?

    Which DSP are you using? Are you running on the EVM or simulator?

    Which version of CCS and the Code Generation Tools are you using?

    It is not obvious from your post whether you are pipelining the sinsp_i call 128 times in a tight loop. If you use it differently, you will get different results.

    Regards,
    RandyP

  • Hi, Randyp,

       Thank you for your reply. 

       I could not find the test examples in the FastRTS directory. Could you please send them to me, or paste the example for sinsp_i here?

       I am using a C6747 DSP on a custom PCB. The CCS version is 4.2.4. The code generation tools are C6000, version 7.0.3.

      I make the sinsp_i calls inside a while loop, shown below. What does "tight loop" mean? What is the correct way to pipeline the sinsp_i calls? Thank you so much.

    while (1)
    {
        if (enable)
        {
            sin_1 = sinsp_i(value_1);
            sin_1 *= parameter_1;
            cos_1 = cossp_i(value_1);
            cos_1 *= parameter_1;

            sin_2 = sinsp_i(value_2);
            sin_2 *= parameter_2;
            cos_2 = cossp_i(value_2);
            cos_2 *= parameter_2;

            sin_3 = sinsp_i(value_3);
            sin_3 *= parameter_3;
            cos_3 = cossp_i(value_3);
            cos_3 *= parameter_3;

            sin_4 = sinsp_i(value_4);
            sin_4 *= parameter_4;
            cos_4 = cossp_i(value_4);
            cos_4 *= parameter_4;

            //// and a lot of other codes below
            ..................
            ..................
        }
        else
        {
            //// a lot of other codes below
            ..................
            ..................
        }
    }

      

      

  • Eugene,

    The reference files I have are from the MathLib for C66x, but this code for a vector function may be generic:

      for (i = 0; i < BUF_SIZE; i++) {
        output[i] = sinsp_i(input[i]);
      }

    That is a tight loop. Nothing is running but this function call. This might get you to the benchmark numbers, but I do not have the benchmark tests and measurements.

    The point is that the result is simply stored to memory and not acted upon. If the code has to wait for the result of one sinsp_i call before starting to execute the next sinsp_i call, then they will not be pipelined.

    Regards,
    RandyP

  • Hi, Randyp,

      Thank you for your reply.

      I have tried the method you suggested, using a tight for loop to calculate sinsp_i. I tried loops of 4 and of 16 iterations, but the results are the same: in both cases sinsp_i takes about 50% longer than sinsp. It seems the pipelining is not taking effect.

      Do we need to configure something in CCS to enable pipelining, to tell the DSP that we are going to use it here? Otherwise, how does the compiler know whether to pipeline or not?

      By the way, I am running the DSP without BIOS.

      Thank you so much.

     

     

  • Eugene,

    Please search the TI Wiki Pages for "c6000 optimization" (no quotes) and find some useful articles on code optimization techniques. There is a C6000 Optimization Workshop that should be in that list. This workshop is an archive of a class we have taught for many years. The student guide and labs (with solutions) can be very helpful for learning optimization techniques.

    Regards,
    RandyP

  • Hi, Randyp,

      Thank you for your help.

    Best regards,