C6747 Floating Point Performance is the pits!

Hello,

We are using the C6747 to run an audio algorithm and are running into a bit of a problem.  The algorithm is written using floating-point math operations.  The C6747 should be capable of easily handling these operations.  Once the data is in floating-point format, it can be run through the algorithm.

For some reason the DSP is taking a HUGE performance hit when we do any floating-point operations.  The conversion process alone (from the integers delivered by DMA to normalized floats, and back to integers for the output DMA) is taking about 50% of the audio loop capacity when it should be more like 0.5%.

There must be something wrong with the project settings or memory layout because the performance should not be this bad.  We are just beginning to review all the E2E posts and other resources out there, but due to an extremely tight timeline we are hoping to get some quick guidance and pointers.  Please let us know if you have any recommendations or thoughts.

Thanks.

  • Hi 

    I think you may not be using the FPU. If the conversion is done by software floating-point routines, performance is very low.

    Doug

  • Hi DJH,

    Thanks for your post.

    Basically, the internal architecture of a floating-point DSP is more complicated than that of a fixed-point device. All the registers and data buses must be 32 bits wide instead of only 16. Floating-point DSPs typically use a minimum of 32 bits to store each value. This results in many more bit patterns than for fixed point: 2^32 = 4,294,967,296 to be exact. A key feature of floating-point notation is that the represented numbers are not uniformly spaced. The gap between any two adjacent numbers is about ten million times smaller than the value of the numbers themselves. This is important because it places large gaps between large numbers, but small gaps between small numbers.

    Please refer to the E2E threads below:

    http://e2e.ti.com/support/dsp/tms320c6000_high_performance_dsps/f/112/t/242795.aspx

    http://e2e.ti.com/support/dsp/tms320c6000_high_performance_dsps/f/112/t/137472.aspx

    Alternatively, you can use the profiling capabilities of CCS to help you optimize your algorithm. You can enable profiling for your application once you are connected to the target. Once connected, go to Tools->Profile->Profile Setup and select "Profile all Functions". After you execute your program, you can view the results by selecting Tools->Profile->View Function Profile.

    Also, please refer to the wiki pages below:

    http://processors.wiki.ti.com/index.php/Tuning_Audio_Latency_on_C6747

    http://processors.wiki.ti.com/index.php/-mv_option_to_use_with_the_C674x

    Thanks & regards,

    Sivaraj K


  • Hi,

    I am also facing the same problem.

    The processor uses more cycles when performing floating-point operations. Even the DSPF_q15toFl() function takes more cycles.

    Is it better to convert all my floating point operations to fixed?

    Sivaraj,

    You mentioned profiling. Is it possible on hardware? We are not able to profile all functions on hardware. Could you please help?

  • Hi Selvi PT,

    Yes, it is possible to profile functions, and profiling is done on the hardware itself.

    Please refer to the C6000 DSP optimization document linked below. It covers five key optimization techniques for meeting a performance target, and Appendix B describes profiling support.

    http://www.ti.com/lit/an/sprabf2/sprabf2.pdf

    Thanks & regards,

    Sivaraj K

  • Hi guys,

    I am a colleague of DJH trying to resolve the issue.  Thanks for your answers so far.  The conversion he is speaking of is to cast a 24-bit audio sample (represented as a 32-bit int) to a float value and then normalize it to a value between -1.0 and 1.0.  The inverse operation is done for the output samples.  This is a fairly common thing to do in a floating point processor, however as DJH says, this takes an unreasonable amount of our processing loop.  Because it is so severe, I don't believe this is just an optimization issue, but something that I am doing wrong fundamentally.  A few questions:

    Doug, do you believe I may not be using the FPU?  Is there something special I have to do to direct the compiler to use it?  I would assume it would use the FPU when it sees the float data type.

    Sivaraj, in the fixed version of our algorithm we were using 32-bit ints, not 16-bit.  I have been looking through some optimization guides, but as I said I think this is a more fundamental problem than optimizing the algorithm.  Any other ideas?

    Selvi, I haven't tried the DSPF_q15toFl() call yet.  I will try that too to see if I get the same result.  I would have to reduce my samples to 16-bit though.

    Any other ideas would be appreciated.  Thanks!
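
    For readers following along, here is a minimal sketch of the conversion described in this post. It assumes sign-extended 24-bit samples held in 32-bit ints and a full-scale value of 2^23. The function names and the clamping on the way out are illustrative assumptions, not the code from our project.

```cpp
#include <stdint.h>

// Full-scale magnitude of a signed 24-bit sample (2^23). Assumed sample format.
static const float kFullScale24 = 8388608.0f;

// Convert sign-extended 24-bit samples (stored in 32-bit ints) to floats
// in the range [-1.0, 1.0).
void IntToFloat(const int32_t *in, float *out, int count)
{
    const float scale = 1.0f / kFullScale24;   // multiply rather than divide per sample
    for (int n = 0; n < count; n++)
    {
        out[n] = static_cast<float>(in[n]) * scale;
    }
}

// Convert floats back to 24-bit samples, clamping to the representable range.
void FloatToInt(const float *in, int32_t *out, int count)
{
    for (int n = 0; n < count; n++)
    {
        float v = in[n] * kFullScale24;
        if (v > 8388607.0f)  v = 8388607.0f;
        if (v < -8388608.0f) v = -8388608.0f;
        out[n] = static_cast<int32_t>(v);      // truncates toward zero
    }
}
```

    With hardware floating point enabled, the int-to-float cast in a loop like this would be expected to compile to a conversion instruction rather than a run-time support call.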

  • Hi Trent and DJH,

    Have you set the correct target (C674x) in build options in CCS?

    You should easily be able to see in the generated assembly code if the FPU is used or not. When the FPU is used you should see the INTSP instruction for the int-to-float conversion, without the FPU you'll likely see some function call instead.

    A code snippet from the routine where the conversions take place along with the generated assembly code would be useful in order to help.

    Best regards,

     Tobias

  • Can you please provide your compiler options here so that we can verify that you are using optimal settings to build the code for the C674x processor? How did you determine that the algorithm is running slowly on the floating-point operations? If possible, please provide a code snippet so that we can examine this issue.

    I am assuming you are using internal memory for these operations or, if you are using DDR, that you have enabled cache.

    Regards,

    Rahul

  • Hey guys,

    It looks like this issue was partially resolved after the suggestion by tsz.  I took a closer look at the project and it seems the selected target got messed up somehow and we weren't getting the -mv6740 option.  I had a few other issues with my project (it was converted from an older version of CCS) so I just re-created the project from scratch.  After doing this, my CPU utilization for the int-to-float conversion and back went from like 50% to about 9%.  9% still seems high but I was glad to make some progress.  I verified that the INTSP instruction is being used for the conversion as tsz suggested.

    My method of calculating DSP utilization is to assert an LED on our board upon entering the DMA interrupt routine, then clear it at the end of the interrupt routine after all the audio processing.  I use a logic analyzer to measure the duty cycle, which gives me an idea of how much room I have.

    So now I am able to enable more parts of my algorithm but as I add pieces I am still concerned how fast I am running out of processing time.  

    Rahul requested a code snippet so I will give a small example of a biquad filter.  The C++ code for processing through the biquad filter looks like this:

    void CBiquad::Process(float *input, float *output)
    {
        float v;

        for (int n = 0; n < _blockSize; n++)
        {
            v = input[n] - _a[1]*_z[0] - _a[2]*_z[1];
            output[n] = _b[0]*v + _b[1]*_z[0] + _b[2]*_z[1];
            _z[1] = _z[0];
            _z[0] = v;
        }
    }


    Here is the generated assembly code:

    $C$DW$L$_Process__7CBiquadFPfT1$4$B:
               LDNDW   .D2T2   *B7,B1:B0         ; |47| <0,2>  ^ 
               NOP             1
               LDNDW   .D1T1   *A4,A7:A6         ; |48| <0,4> 
               NOP             1
       [!A1]   LDW     .D2T2   *B4++,B0          ; |47| <0,6> 
               MPYSP   .M2     B0,B2,B8          ; |47| <0,7>  ^ 
               MPYSP   .M2     B1,B3,B8          ; |47| <0,8> 
               MPYSP   .M1X    B1,A7,A6          ; |48| <0,9> 
    ||         MPYSP   .M2X    B0,A6,B9          ; |48| <0,9> 
               NOP             1
               SUBSP   .L2     B0,B8,B16         ; |47| <0,11>  ^ 
               NOP             2
               LDW     .D1T1   *+A5(28),A3       ; |48| <0,14> 
               SUBSP   .L2     B16,B8,B0         ; |47| <0,15>  ^ 
               NOP             3
               MPYSP   .M2X    B0,A3,B8          ; |48| <0,19>  ^ 
               NOP             3
               ADDSP   .L2     B9,B8,B8          ; |48| <0,23>  ^ 
               NOP             3
               ADDSP   .L2X    A6,B8,B1          ; |48| <0,27>  ^ 
               NOP             3
       [!A1]   STW     .D2T2   B1,*B5++          ; |48| <0,31>  ^ 
       [ A0]   BDEC    .S1     $C$L7,A0          ; |45| <0,32> 
    ||         LDW     .D1T2   *+A5(20),B0       ; |49| <0,32>  ^ 
       [!A1]   STW     .D1T2   B0,*+A5(20)       ; |50| <0,33> 
               NOP             2
               LDNDW   .D2T2   *B6,B3:B2         ; |47| <1,0> 
       [ A1]   SUB     .L1     A1,1,A1           ; <0,37> 
    || [!A1]   STW     .D1T2   B0,*+A5(24)       ; |49| <0,37>  ^ 
    $C$DW$L$_Process__7CBiquadFPfT1$4$E:
    ;** --------------------------------------------------------------------------*
    $C$L8:    ; PIPED LOOP EPILOG
    ;** --------------------------------------------------------------------------*
               RINT                              ; interrupts on
    ;** --------------------------------------------------------------------------*
    $C$L9:    
    .dwpsn file "C:/Users/Trent/tm/software/dsp/v2/src/Biquad.cpp",line 52,column 1,is_stmt
               LDW     .D2T2   *++SP(8),B3       ; |52| 
               NOP             4
    .dwcfi cfa_offset, 0
    .dwcfi restore_reg, 19
    .dwcfi cfa_offset, 0
    $C$DW$91 .dwtag  DW_TAG_TI_branch
    .dwattr $C$DW$91, DW_AT_low_pc(0x00)
    .dwattr $C$DW$91, DW_AT_TI_return
               RETNOP  .S2     B3,5              ; |52| 
               ; BRANCH OCCURS {B3}              ; |52| 

    Here is the summary of compiler flags set in the compiler options:

    -mv6740 --abi=coffabi -O3 -ms0 -g --include_path="C:/ti/ccsv5/tools/compiler/c6000_7.4.4/include" --include_path="../../../src" --define=c6747 --display_error_number --diag_warning=225 --diag_wrap=off --optimizer_interlist

    My sample rate is 48kHz and my block size is 128, which gives me 2.667ms processing time for each block.  Each call to this Process() function takes .112ms which is 4.219% of my processing time.  That means that after like 20 biquads all my processing time is gone!

    Rahul, you asked how I know this is running slow.  I guess I don't have a good answer for that other than I expected the DSP to be able to do a lot more than the equivalent of 20 biquads.  I have experience with the SHARC ADSP-21489 from Analog Devices.  It also claims to be a 2.7 GFLOP processor like the C6747, and I am able to do hundreds of biquads on that DSP with room to spare.  So there must be something I am doing wrong!

    If you could take a look at this code and compiler options I would greatly appreciate it.

    Everything here is being done in internal memory.

    Thanks,

    Trent

  • Hey guys, still looking for help on this.  

    I was looking through the "TMS320C6000 Programmer’s Guide" (SPRU198K, pg. 2-2) and I found this statement:

    When using floating-point instructions on a floating-point device such as
    the C6700, use the −mv6700 compiler switch so the code generated will
    use the device’s floating-point hardware instead of performing the task
    with fixed point hardware. For example, the run-time support floating-point
    multiply will be used instead of the MPYSP instruction.

    This jumped out at me because, as you can see from my last post, my biquad Process() call has lots of MPYSP instructions in the assembly.  However, I do have the -mv6740 compiler option enabled.  Does the fact that I am using the MPYSP instruction definitely mean it is using fixed-point hardware for floating-point operations?

  • Trent,

    The bullet point that you quote above is very poorly written. We have really good engineers, but not all of us did well in writing classes. That last highlighted sentence mixes terms being compared in the clauses of the previous sentence. It should be reversed to be more clear: "For example, when using the -mv6700 compiler switch the MPYSP instruction will be used, and when using the -mv6000 compiler switch then the run-time support floating-point multiply will be used."

    The MPYSP is a floating-point hardware instruction. It is only available on the C67x, C674x, and C66x devices.

    The -mv6740 is what you should be using. Thanks to tsz for pointing you to it.

    Your assembly listing left out some of the code from the top of the biquad function, like the branch target for the BDEC, so it is hard to tell what else is above it that should have been better or that is not very good.

    The improvements from the C67x to the C674x brought in the C64x+ fixed-point architecture's program-flow improvements, especially the SPLOOP/SPKERNEL low-overhead looping. Your code does not use this, so there are things you need to do to improve the code, the compiler switches, or the information that is available to the compiler.

    Since you have been on the forum a while, you are experienced with the architecture, so you should be able to find and skim through the optimization documents that we have plus the online training material (go to the TI Wiki Pages and search for "c6000 optimization" (no quotes) to find some Wiki articles and the C6000 Optimization Workshop).

    In a short list: try removing the -g switch (I prefer to leave it in, but see if it improves things), use the restrict keyword, use pragmas at the top of your loop for loop count information, and use _nassert to tell the compiler any alignment information that could help.
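
    As a rough illustration of these hints applied to the biquad loop posted earlier, here is a hedged sketch. The BiquadState struct is a hypothetical stand-in for the CBiquad members, the RESTRICT macro is a portability shim so the code also builds on a host compiler, and the MUST_ITERATE values are assumptions about the block size. The posted assembly appears to reload and store the filter state through a pointer on every iteration, so the sketch also copies the state and coefficients into locals. Whether any of this helps in a given build has to be confirmed from the compiler feedback.

```cpp
// Portability shim (assumption): TI compiler keyword vs. common host spelling.
#if defined(__TI_COMPILER_VERSION__)
  #define RESTRICT restrict
#else
  #define RESTRICT __restrict
#endif

// Hypothetical stand-in for the CBiquad member data.
struct BiquadState
{
    float b[3];   // feed-forward coefficients
    float a[3];   // feedback coefficients (a[0] unused)
    float z[2];   // delay line
};

void BiquadProcess(BiquadState *s, const float *RESTRICT input,
                   float *RESTRICT output, int blockSize)
{
    // Local copies let the compiler keep these values in registers
    // instead of reading and writing memory inside the loop.
    const float a1 = s->a[1], a2 = s->a[2];
    const float b0 = s->b[0], b1 = s->b[1], b2 = s->b[2];
    float z0 = s->z[0], z1 = s->z[1];

#if defined(__TI_COMPILER_VERSION__)
    // Assumed loop count bounds: at least 8, at most 1024, multiple of 8.
    #pragma MUST_ITERATE(8, 1024, 8)
#endif
    for (int n = 0; n < blockSize; n++)
    {
        float v = input[n] - a1 * z0 - a2 * z1;
        output[n] = b0 * v + b1 * z0 + b2 * z1;
        z1 = z0;
        z0 = v;
    }

    // Write the delay line back once, after the loop.
    s->z[0] = z0;
    s->z[1] = z1;
}
```

    Note that the recursion through z0 and z1 is a loop-carried dependency, so a single biquad can only be pipelined so far regardless of the hints.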

    I am not sure you shared any real numbers on your bad performance, or how it has improved from these changes that you have already made. I remember reading relative numbers, but that does not paint a very clear picture of what we are dealing with here.

    If you are doing all of this processing in your DMA ISR, it sounds like you may not have much else going on in the application. Otherwise, we generally recommend not doing that but using other SYS/BIOS features to allow more process-controlled execution of whatever is the highest priority thread that needs to run. I suspect you have that worked out in your case, so this is just for other readers with more generalized applications.

    Regards,
    RandyP

  • In moving from CCS3.3 to CCS5.5, here is the list of compile/link options that I am using. Do you know if this is "complete" for full optimization? Are there options that should be deleted? Are there other options that should be included? (I tried to transcribe the options from CCS3.3 to CCS5.5 as best I could, but I may have missed something.)

    -mv6700 --abi=coffabi -O3 --optimize_with_debug=on --rtti --define=c6713 --quiet --display_error_number --gen_func_subsections=on --mem_model:data=far --use_const_for_alias_analysis --single_inline --std_lib_func_defined -k --output_all_syms -z --stack_size=0x2000 -m"7450010.map" --heap_size=0x5400 -i"C:/1000NX/Application/MigrationCCS5/DSP_BIOS/ae1000nxBios/7450010/lib" -i"C:/TI/bios_5_42_01_09/packages/ti/bios/lib" -i"C:/ccs5/ccsv5/tools/compiler/c6000_7.4.4/lib" -i"C:/ccs5/bios_5_42_01_09/packages/ti/rtdx/lib/c6000" -i"C:/ccs5/bios_5_42_01_09/packages/ti/bios/lib" --reread_libs --verbose_diagnostics --warn_sections --display_error_number --issue_remarks --xml_link_info="7450010_linkInfo.xml" --absolute_exe --rom_model --trampolines=on

    I realize stack and heap size are probably dependent on the particular project...

    One other question - in the past, there was a way to have the compiler "fill" the stack (at c_int00?) with a value - that value was "c0ffee" - is this not available anymore? If it is not, we will need to write some code to do this at bootload time - presently we do that with the heap.

     

  • RandyP,

    Thanks for clearing up that paragraph, I suspected that it might have just been worded incorrectly.
    Thanks for your suggestions.  I will try your suggestions and also start going through the optimization documentation more thoroughly.
    You're right, I haven't provided many good metrics yet.  I will try to get some better numbers.  The one measurement I provided was this:
    My sample rate is 48kHz and my block size is 128, which gives me 2.667ms processing time for each block.  Each call to this Process() function takes .112ms which is 4.219% of my processing time.

    I have tried adjusting a few different optimization options thus far but the usage didn't experience any meaningful changes.

    RandyP, just as a gut feel, do you agree that that seems like a lot for just one biquad filter?  Or does it seem reasonable?  

    I'll post more after I try some things.

    Thanks!
    Trent

  • From the assembly, it looks like the data pointers are not aligned and therefore the loops are not taking full advantage of the C674x CPU architecture.

    Randy pointed out good wiki resources. Additionally there is a general appnote on code tuning for the C6000 (generic across CPU generations).

    For the compiler to do an optimal job on performance loops, it's important for it to know:

    - pointers are independent

    - data is 64-bit aligned

    - loops can be unrolled (e.g. loop count info)

    Take a look at www.ti.com/lit/pdf/SPRA666 and see if you can add restrict and _nassert() info for the compiler. You can see in the compiler feedback what the loop utilization and limiting factors are to help with further tuning.
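
    A minimal sketch of passing the alignment information to the compiler, shown on a simple gain loop rather than the biquad. The macro names, the ApplyGain() function, and the loop count bounds are assumptions for illustration. On a host compiler the hints compile away, and IsAligned8() can be used to check the buffers at run time.

```cpp
#include <stdint.h>

// Portability shims (assumptions): the hints are only active for the TI compiler.
#if defined(__TI_COMPILER_VERSION__)
  #define RESTRICT restrict
  #define ASSUME_ALIGNED8(p) _nassert(((int)(p) & 7) == 0)
#else
  #define RESTRICT __restrict
  #define ASSUME_ALIGNED8(p) ((void)0)
#endif

// Returns true when p sits on an 8-byte (64-bit) boundary.
bool IsAligned8(const void *p)
{
    return (reinterpret_cast<uintptr_t>(p) & 7u) == 0;
}

// Scales a block of samples. The caller promises that both buffers are
// 64-bit aligned and do not overlap.
void ApplyGain(const float *RESTRICT in, float *RESTRICT out,
               float gain, int count)
{
    ASSUME_ALIGNED8(in);
    ASSUME_ALIGNED8(out);

#if defined(__TI_COMPILER_VERSION__)
    // Assumed loop count bounds: at least 2, at most 1024, multiple of 2.
    #pragma MUST_ITERATE(2, 1024, 2)
#endif
    for (int n = 0; n < count; n++)
    {
        out[n] = in[n] * gain;
    }
}
```

    The promise made through _nassert is not checked by the compiler, so the buffers really do have to be placed on 8-byte boundaries for the hint to be safe.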

    Best regards,

    Dave

  • @Todd Anderson,

    Your compiler switches seem fine, but to get the best answers on compiler-specific questions I recommend you re-post this on the Compiler Forum.

    Regards,
    RandyP

  • @Trent,

    Trent Rolf said:
    just as a gut feel, do you agree that that seems like a lot for just one biquad filter?  Or does it seem reasonable?

    It is pretty hard to have an accurate guess without knowing what _blocksize is or how fast the processor is running or where everything is located relative to internal/external/cache for program and data.

    One pass through the assembly code shown above would take about 40 CPU cycles if there were no stalls. The most common stalls come from memory accesses, and it is common to have caching disabled or data in the wrong location.

    Our online training material is the best to find a more comprehensive list of things to do to get your code performing well. It would be good for us to be able to address questions you have from that material, and for you to see anything that might apply to your implementation.

    Your benchmarking method may be including a lot of overhead. Have you tried replacing the biquad with a simple assignment out=in and see what the time measurements are?

    Regards,
    RandyP

  • Randy:

    Thanks - just wanted to know if something glaring jumped out at you. I have been working on cutting down the processing time during the "main" loop, and just needed to know if my settings made sense. Inspecting the resulting assembly code did show that the compiler was making use of both multiplier-accumulators on the C6713. My best processing find was to do inlining on filter iterations, and to pull the constructors out of "loop."

     

  • Hey guys,

    I was able to get my algorithm to fit by using the DSPLIB heavily.  For example the DSPLIB biquad used about 0.8% of my processing loop instead of the 4.219% for my compiled version.  I wouldn't expect the difference to be that dramatic, but oh well.  Also compiling in the C674x-MATHLIB for floating point operations helped my performance.

    Thanks,

    Trent

  • Trent,

    Good news to hear. If you have resolved your issue, please mark as Answered the post(s) that will be best for future readers to benefit from your work. Most likely this will at least include your post above telling your result.

    Regards,
    RandyP

  • I'm going to need DJH to mark the solution since he was the one that started the thread.