OMAP-L138 execution times

Hi,

I ran the following code on my OMAP-L138 LCDK board, but it takes about 40 us to finish. Is this normal?

void Matrice(void)
{
    uint16_t i, k;
    int32_t accumulator;

    for (i = 0; i < 14; i++)
    {
        accumulator = 0;

        for (k = 0; k < 21; k++)
        {
            if (MatriceInterne[i][k] == 1)
            {
                accumulator += matrice_data_input[k];
            }
        }
        matrice_data_output[i] = accumulator;
    }
}

My code is running from external memory (DDR2 at 132 MHz).

matrice_data_output[][] is in L2 memory.

thanks,

Ron

  • Hi Rong Wang2,
    You should get better performance if you move your code to Shared RAM.
    You can also increase the CPU frequency from the default 300 MHz to 456 MHz.

  • Ron,

    By the way, what is the method you used to measure it?
  • Hi Shankari,

    I configured timer0 as free-running. I reset the timer just before calling this function and read it right after the call returns to get the execution time. I am wondering if I did anything wrong, since even with the code in external memory it should not take that long to finish this function.

    thanks,

    Ron
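
To put the measurement in perspective, here is a rough sketch that converts the reported time into CPU cycles per inner-loop iteration. It assumes the 300 MHz default clock mentioned earlier in the thread and a 32-bit free-running up-counter; the helper names (elapsed_ticks, us_to_cycles, cycles_per_iteration) are hypothetical and not part of any TI library.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed ticks of a free-running 32-bit up-counter.
 * Unsigned subtraction stays correct across a single wraparound,
 * so the counter does not have to be reset before the call. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert microseconds to CPU cycles for a clock given in MHz
 * (1 MHz is one cycle per microsecond). */
static uint32_t us_to_cycles(uint32_t us, uint32_t cpu_mhz)
{
    return us * cpu_mhz;
}

/* Whole cycles spent per loop iteration (integer division, rounds down). */
static uint32_t cycles_per_iteration(uint32_t cycles, uint32_t iterations)
{
    assert(iterations != 0u);
    return cycles / iterations;
}
```

With these helpers, 40 us at 300 MHz is us_to_cycles(40, 300) = 12000 cycles. The loop nest runs 14 * 21 = 294 inner iterations, so cycles_per_iteration(12000, 294) gives about 40 cycles per iteration, far more than a single compare and add would normally need.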

  • Hi Shankari,

    I moved my code to Shared RAM, but it did not help much; it only reduced the time by 6 us. This function does not involve heavy computation, so I don't understand why it takes that long.

    thanks,

    Ron
  • Does anyone have any idea?

    thanks in advance!

    Ron 

  • Hi
    If you are running this on the DSP and have program/data in DDR or Shared RAM, please also make sure that the corresponding MAR bits are set so that these memories are cacheable.
    You will find the details on the MAR bits in the datasheet.

    Regards
    Mukul
  • Hi Mukul,

    Thanks for your suggestion. I tried placing program and data in L2 memory, but it did not help much (only about 6 microseconds faster). For now I have to keep my project in Debug mode in order to use the emulator. In Debug mode, does the compiler turn off pipeline optimization?

    Thanks,

    Ron
  • Hi
    Please see some additional collateral on compiler optimization on the following wiki:
    processors.wiki.ti.com/.../Optimization_Techniques_for_the_TI_C6000_Compiler

    The first link has an application note that has some additional things you can try.

    Regards
    Mukul
  • Hi Mukul,

    I have tried a few optimizations, such as -O3, -mt and #pragma MUST_ITERATE(lower_bound, upper_bound, factor), but they did not help.

    I just found an interesting thing. If I comment out the addition (accumulator += matrice_data_input[k];), the execution time drops by 60% (execution time = 25 us). However, if I keep this line but make the condition impossible to meet (for example if(MatriceInterne[i][k]>1)), so that the addition is never executed, the execution time stays the same (execution time = 75 us). It seems that as long as this line (accumulator += matrice_data_input[k];) is present, the execution time of the function is the same whether it executes or not. What is the problem?

    void Matrice(void)
    {
        uint16_t i, k;
        int32_t accumulator;

        for (i = 0; i < 14; i++)
        {
            accumulator = 0;

            for (k = 0; k < 21; k++)
            {
                if (MatriceInterne[i][k] == 1)
                {
                    accumulator += matrice_data_input[k];
                }
            }
            matrice_data_output[i] = accumulator;
        }
    }

    Thanks,

    Ron
  • Hi Ron

    I will forward this thread to other colleagues to see if they can provide more pointers on the specific loop in question etc. 

    After discussing this with another colleague, here are some possible suggestions/suspects:

    Try enabling compiler feedback so that it is written to the listing file. See SPRU187 revision U, the C6000 compiler guide. The compiler option is --advice:performance

    It is possible that the loop is not being software pipelined because of pointer aliasing. It does not matter if the condition is never met: the scheduler would not pipeline the code, so it will be slower. This is because the pointers could be pointing to the same location, since you did not use the restrict keyword to make it clear that they don't.

    You will find some guidance on the restrict keyword in the application note on the wiki, SPRABF2 Section 3.2, as well as in the compiler documentation.

    I believe you are following most of the other suggestions that were provided on the thread.

    One thing you can try is to increase the speed of the processor from 300 to 456 MHz to see if that lets you reach your performance targets. This assumes that you have a part that supports the higher speed and that it is something you can use in your end product (power/performance trade-offs).

    Hope this helps some.

    Regards

    Mukul 
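
As an illustration of the restrict suggestion, here is a minimal sketch of the same computation with the arrays passed in as restrict-qualified pointers instead of being accessed as globals. The function name matrice_restrict, the uint8_t element type of the mask, and the int32_t type of the input are assumptions, since the thread does not show the declarations. The sketch also uses int loop counters and replaces the branch with a multiply by the 0/1 comparison result, which is one common way to make a loop body easier to schedule. Whether these changes alone let the TI compiler pipeline this loop is not confirmed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROWS 14
#define COLS 21

/* Hypothetical variant of Matrice(). The restrict qualifiers promise the
 * compiler that mask, in and out never refer to overlapping memory. */
void matrice_restrict(const uint8_t (* restrict mask)[COLS],
                      const int32_t * restrict in,
                      int32_t * restrict out)
{
    int i, k;

    assert(mask != NULL && in != NULL && out != NULL);

    for (i = 0; i < ROWS; i++)
    {
        int32_t acc = 0;

        for (k = 0; k < COLS; k++)
        {
            /* (mask[i][k] == 1) evaluates to 0 or 1, so the multiply
             * selects in[k] or 0 without a branch in the loop body. */
            acc += (int32_t)(mask[i][k] == 1) * in[k];
        }
        out[i] = acc;
    }
}
```

A call would look like matrice_restrict(MatriceInterne, matrice_data_input, matrice_data_output), provided the global arrays have the element types assumed above.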

  • Hi Mukul,

    Thanks for your help. I will give it a try.

    I understand that increasing the processor speed would improve performance. I just want to find out why this function runs much slower on the OMAP-L138 than on my old C5410A DSP.

    Another thing I found is that if I move MatriceInterne[i][k] from Shared RAM to L2, where my code is located, the execution time drops from 75 us to 35 us. It seems that if the data buffer is not in the same memory as the code, the execution time increases significantly.

    An execution time of 35 microseconds is still too much for this function.

    Regards,

    Ron 

  • Hi Ron
    Thanks. I am curious what performance you are getting on the 5410 and what performance you are trying to reach.
    Is the 5410 code the same C code, or is it written in assembly?

    When running from Shared RAM, please ensure that you are running with caching enabled.
    The MAR register is MAR128 at 0x0184 8200, and it needs to be set to 1.

    I would also like you to make sure that you are truly running at 300 MHz. I am assuming you are using the GEL file to set this up.

    Regards
    Mukul
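
The MAR128 address quoted above follows a simple pattern, sketched below. This assumes MAR0 sits at 0x01848000 and that each MAR register covers one 16 MB block of the address map, which is consistent with MAR128 at 0x01848200 for Shared RAM at 0x80000000; please verify both against the datasheet. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAR_BASE 0x01848000u  /* assumed address of MAR0 */

/* Index of the MAR register covering addr (one register per 16 MB block). */
static uint32_t mar_index(uint32_t addr)
{
    return addr >> 24;
}

/* Address of the MAR register covering addr (registers are 4 bytes apart). */
static uint32_t mar_address(uint32_t addr)
{
    return MAR_BASE + 4u * mar_index(addr);
}

/* Set bit 0 (the cacheability enable bit) of a MAR register.
 * On the target, mar would be (volatile uint32_t *)mar_address(addr). */
static void mar_enable(volatile uint32_t *mar)
{
    assert(mar != NULL);
    *mar |= 1u;
}
```

For example, mar_index(0x80000000) is 128 and mar_address(0x80000000) is 0x01848200, matching the register named above. By the same rule, an address of 0xC0000000 would map to MAR192 at 0x01848300.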
  • Hi Mukul,

    I have verified that the DSP is running at 300 MHz. On the old DSP (C5410A) this function was written in assembly and takes only 7 microseconds. On the OMAP-L138 it takes 35 microseconds so far. Since all of my functions need to finish within 125 microseconds, 35 microseconds for this function alone is too much. I would like to reduce it to less than 10 microseconds.

    I tried options such as -O3 and -mk, but they seem to have no effect. I also tried to get compiler feedback by adding -k and -mw to the makefile, but I did not see an asm file being generated.

    The following is my makefile.

    ################################################################################
    # Automatically-generated file. Do not edit!
    ################################################################################

    SHELL = cmd.exe

    CG_TOOL_ROOT := C:/ti_CCS6/ccsv6/tools/compiler/c6000_7.4.16

    GEN_OPTS__FLAG :=
    GEN_CMDS__FLAG :=

    ORDERED_OBJS += \
    "./src/Baudot.obj" \
    "./src/DPRAM.obj" \
    "./src/DSP_init.obj" \
    "./src/Main.obj" \
    "./src/MemoirePartageeUSB.obj" \
    "./src/TTY.obj" \
    "./src/Vocoding.obj" \
    "./src/gpio.obj" \
    "./src/gpio_switches_leds.obj" \
    "./src/matrice.obj" \
    "./src/psc.obj" \
    "./src/vectors_intr.obj" \
    "../lib/vocal6x_npp.lib" \
    "../src/linker_dsp.cmd" \
    $(GEN_CMDS__FLAG) \
    -l"C:/bsl/lib/evmomapl138_bsl.lib" \
    -lrts6740.lib \

    -include ../makefile.init

    RM := DEL /F
    RMDIR := RMDIR /S/Q

    # All of the sources participating in the build are defined here
    -include sources.mk
    -include lib/subdir_vars.mk
    -include src/subdir_vars.mk
    -include lib/subdir_rules.mk
    -include src/subdir_rules.mk
    -include objects.mk

    ifneq ($(MAKECMDGOALS),clean)
    ifneq ($(strip $(S_DEPS)),)
    -include $(S_DEPS)
    endif
    ifneq ($(strip $(S_UPPER_DEPS)),)
    -include $(S_UPPER_DEPS)
    endif
    ifneq ($(strip $(S62_DEPS)),)
    -include $(S62_DEPS)
    endif
    ifneq ($(strip $(C64_DEPS)),)
    -include $(C64_DEPS)
    endif
    ifneq ($(strip $(ASM_DEPS)),)
    -include $(ASM_DEPS)
    endif
    ifneq ($(strip $(CC_DEPS)),)
    -include $(CC_DEPS)
    endif
    ifneq ($(strip $(SV7A_DEPS)),)
    -include $(SV7A_DEPS)
    endif
    ifneq ($(strip $(S55_DEPS)),)
    -include $(S55_DEPS)
    endif
    ifneq ($(strip $(C67_DEPS)),)
    -include $(C67_DEPS)
    endif
    ifneq ($(strip $(CLA_DEPS)),)
    -include $(CLA_DEPS)
    endif
    ifneq ($(strip $(C??_DEPS)),)
    -include $(C??_DEPS)
    endif
    ifneq ($(strip $(CPP_DEPS)),)
    -include $(CPP_DEPS)
    endif
    ifneq ($(strip $(S??_DEPS)),)
    -include $(S??_DEPS)
    endif
    ifneq ($(strip $(C_DEPS)),)
    -include $(C_DEPS)
    endif
    ifneq ($(strip $(C62_DEPS)),)
    -include $(C62_DEPS)
    endif
    ifneq ($(strip $(CXX_DEPS)),)
    -include $(CXX_DEPS)
    endif
    ifneq ($(strip $(C++_DEPS)),)
    -include $(C++_DEPS)
    endif
    ifneq ($(strip $(ASM_UPPER_DEPS)),)
    -include $(ASM_UPPER_DEPS)
    endif
    ifneq ($(strip $(K_DEPS)),)
    -include $(K_DEPS)
    endif
    ifneq ($(strip $(C43_DEPS)),)
    -include $(C43_DEPS)
    endif
    ifneq ($(strip $(INO_DEPS)),)
    -include $(INO_DEPS)
    endif
    ifneq ($(strip $(S67_DEPS)),)
    -include $(S67_DEPS)
    endif
    ifneq ($(strip $(SA_DEPS)),)
    -include $(SA_DEPS)
    endif
    ifneq ($(strip $(S43_DEPS)),)
    -include $(S43_DEPS)
    endif
    ifneq ($(strip $(OPT_DEPS)),)
    -include $(OPT_DEPS)
    endif
    ifneq ($(strip $(PDE_DEPS)),)
    -include $(PDE_DEPS)
    endif
    ifneq ($(strip $(S64_DEPS)),)
    -include $(S64_DEPS)
    endif
    ifneq ($(strip $(C_UPPER_DEPS)),)
    -include $(C_UPPER_DEPS)
    endif
    ifneq ($(strip $(C55_DEPS)),)
    -include $(C55_DEPS)
    endif
    endif

    -include ../makefile.defs

    # Add inputs and outputs from these tool invocations to the build variables
    EXE_OUTPUTS += \
    eHDT_DSP.out \

    EXE_OUTPUTS__QUOTED += \
    "eHDT_DSP.out" \

    BIN_OUTPUTS += \
    eHDT_DSP.hex \

    BIN_OUTPUTS__QUOTED += \
    "eHDT_DSP.hex" \


    # All Target
    all: eHDT_DSP.out

    # Tool invocations
    eHDT_DSP.out: $(OBJS) $(CMD_SRCS) $(LIB_SRCS) $(GEN_CMDS)
    @echo 'Building target: $@'
    @echo 'Invoking: C6000 Linker'
    "C:/ti_CCS6/ccsv6/tools/compiler/c6000_7.4.16/bin/cl6x" -mv6740 -k -mw --abi=coffabi -g --define=c6748 --diag_warning=225 --profile:breakpt -z -m"eHDT_DSP.map" --stack_size=0x800 --heap_size=0x800 -i"C:/ti_CCS6/ccsv6/tools/compiler/c6000_7.4.16/lib" -i"C:/ti_CCS6/ccsv6/tools/compiler/c6000_7.4.16/include" --reread_libs --warn_sections --xml_link_info="eHDT_DSP_linkInfo.xml" --rom_model -o "eHDT_DSP.out" $(ORDERED_OBJS)
    @echo 'Finished building target: $@'
    @echo ' '

    eHDT_DSP.hex: $(EXE_OUTPUTS)
    @echo 'Invoking: C6000 Hex Utility'
    "C:/ti_CCS6/ccsv6/tools/compiler/c6000_7.4.16/bin/hex6x" -o "eHDT_DSP.hex" $(EXE_OUTPUTS__QUOTED)
    @echo 'Finished building: $@'
    @echo ' '

    # Other Targets
    clean:
    -$(RM) $(EXE_OUTPUTS__QUOTED)$(BIN_OUTPUTS__QUOTED)
    -$(RM) "src\Baudot.d" "src\DPRAM.d" "src\DSP_init.d" "src\Main.d" "src\MemoirePartageeUSB.d" "src\TTY.d" "src\Vocoding.d" "src\gpio.d" "src\gpio_switches_leds.d" "src\matrice.d" "src\psc.d"
    -$(RM) "src\Baudot.obj" "src\DPRAM.obj" "src\DSP_init.obj" "src\Main.obj" "src\MemoirePartageeUSB.obj" "src\TTY.obj" "src\Vocoding.obj" "src\gpio.obj" "src\gpio_switches_leds.obj" "src\matrice.obj" "src\psc.obj" "src\vectors_intr.obj"
    -$(RM) "src\vectors_intr.d"
    -@echo 'Finished clean'
    -@echo ' '

    .PHONY: all clean dependents
    .SECONDARY:

    -include ../makefile.targets

    I also tried to add these options from Project -> Properties -> Compiler -> Edit Flags, but after I click OK, -k and -mw are not showing in the Summary of flags set:

    -mv6740 --abi=coffabi -g --include_path="C:/ti_CCS6/ccsv6/tools/compiler/c6000_7.4.16/include" --include_path="../src" --include_path="../include" --define=c6748 --diag_warning=225 --profile:breakpt --debug_software_pipeline -k

    Did I miss something?

    thanks,

    Ron

  • Ron

    What are the values of matrice_data_output[][]? Are they all 0 and 1, or other values?

    Ran
  • Hi Ran,

    The values of matrice_data_output[][] can span the full range of a signed 32-bit int.

    The problem was solved by switching the build mode from Debug to Release with option -O3. Now the execution time of this function is only 2 microseconds.

    Thanks all for your help!

    Ron

  • Ron

    Thanks for the update. Now the performance is more in line with what I would have expected when moving from 54x to c674x.

    Regards

    Mukul 

  • While this issue is closed, the compiler team also pointed me to this useful wiki, which I am archiving on this discussion in case it is helpful to other users:

    processors.wiki.ti.com/.../C6000_Compiler:_Recommended_Compiler_Options

    Regards
    Mukul