
RTOS/AM3358: Performance issue when porting from TI-RTOS PSDK4.1 (GCC4.9.3) to PSDK5.1 (GCC6.3.1)

Part Number: AM3358
Other Parts Discussed in Thread: SYSBIOS, TEST2

Tool/software: TI-RTOS

Hi, I am porting some code from TI-RTOS PSDK4.1 to PSDK5.1.

I noticed that some functions take longer and also start later in time. In other words, things seem to be slower. I originally thought this could be related to PDK changes, as the function I am profiling is inside the PRU-ICSS EMAC LLD. However, because I also see the function starting later, I wonder whether this is more related to the newer GCC toolchain.

As a test, I tried to compile my code with PDK 1.0.9 (or later) and GCC4.9.3, in order to rule out (or confirm) a GCC or PDK version issue. However, I hit many errors. It is really difficult to build with the tools from PSDK4.2 (or later) and GCC4.9.3 because so much has changed.

For TI employees who want to see more details, there is a JIRA.

 My questions:

- Is there any known issue between these two GCC toolchains?

- I found a similar E2E thread in which the user pointed out that the slowdown could be due to heap size. I am not sure. In a simple EMAC loopback program the difference is hard to notice, but in my (much bigger) program it is easy to see. Any suggestions?

- Is there any way we can rule out the GCC version? Just FYI, my current error when building with GCC4.9 and PSDK4.2 tools is "undefined reference to `clock_gettime'"

- Attaching a couple of scope snapshots to illustrate the issue.

Thanks in advance for your help

Paula

  • I am not familiar with the Processor SDK, or any components inside it, such as the PDK.  But I can offer some general comments.

    Be sure you build with optimization.  If performance is a concern, this is a must.  If you use the ARM GCC compiler that comes with CCS, then the compiler manual is something similar to ...

    C:\ti\ccsv8\tools\compiler\gcc-arm-none-eabi-7-2017-q4-major-win32\share\doc\gcc-arm-none-eabi\pdf\gcc\gcc.pdf

    Search for the sub-chapter titled Options That Control Optimization.  You'll discover that the default is to create executables that are easy to debug, not to run fast.

    Version 4.9.3 is fairly old.  As a point of comparison, the ARM GCC which comes with CCSv8.0 is version 7.2.1.  I realize you probably need to use the compiler version the SDK is tested with.  But it probably makes sense for all programmers, including the SDK development team, to use a more recent compiler.

    Thanks and regards,

    -George

  • Hi George, thanks for your reply.

    After checking the suggested gcc.pdf, I built my code with -O3 (which, as I understand it, enables most, if not all, optimizations). The results are the same: my code is slower with the latest PSDK and GCC6.3.

    For your information AM335x TI RTOS PSDK 4.1 uses below tools versions:

    Compiler: GNUv4.9.3

    XDC 3.32.01.22_core

    SYSBIOS 6.46.05.55

    AM335x PDK 1.0.8

    My code with these tools versions runs OK

    AM335x TI RTOS PSDK 5.1 uses below tools versions:

    Compiler: GNUv6.3.1 (Linaro)

    XDC 3.50.7.20_core

    SYSBIOS 6.73.0.12

    AM335x PDK 1.0.12

    My code with these tools versions runs slower.

    I am not sure how to determine which component(s) make it slower. If it helps, I can share my code. Just let me know.

    thank you,

    Paula

  • Hi George, a small update. I was able to build/run my code with TI RTOS PSDK 5.1 tools versions, but with latest compiler GNU v7.2.1 (Linaro). Results are similar to when building the code with compiler GNUv6.3.1 (Linaro). Slower.

    thank you,

    Paula

  • Hi George, another update. I did an experiment with a simpler program, a packet loopback. I built the same code with the PSDK4.1 and PSDK5.0 tools versions, and I don't see any performance difference. For the sake of the experiment, I profiled the same functions on the scope (on the same board as well).

    So I think this rules out a GCC version issue (or an issue with any other Processor SDK tool version). But could it be related to heap size (as mentioned in this E2E)?

    FYI, the simple loopback program and my code both exercise PRU-ICSS EMAC + TTS from the AM335x PDK. I used the same packet size for both, to keep the packet queuing conditions as similar as possible.

    The main difference is that my code has a more complex stack on top for processing the packets (SercosIII softmaster industrial protocol); for the simple loopback code I used dummy packets.

    Any thoughts or pointers would be highly appreciated.

    thank you,

    Paula

  • Hi George, I made some progress. There actually is a performance difference between the PSDK4.1 (GCC4.9) and PSDK5.0 (GCC6.3) tools if "-ffunction-sections -fdata-sections" are used.

    When creating my small loopback experiment, I forgot to add those flags. My original application project uses them.

    Let me share some snapshots to illustrate my experiments

    Test1: loopback demo, built with PSDK4.1 tools, without the "-ffunction-sections -fdata-sections" flags

     

    Test2: loopback demo, built with PSDK4.1 tools, with the "-ffunction-sections -fdata-sections" flags

    Test2 behaves similarly (similar time for queuing packets, green trace) to my application when built with PSDK4.1 tools

     

    Test3: loopback demo, built with PSDK5.1 tools, with the "-ffunction-sections -fdata-sections" flags

    No improvements observed with 

    Just FYI, GNU compiler flags used

    Test2:

    -mcpu=cortex-a8 -march=armv7-a -mtune=cortex-a8 -marm -mfloat-abi=hard -mfpu=neon -D${COM_TI_UIA_SYMBOLS} -D${EDMA3_LLD_SYMBOLS} -D${TI_PDK_SYMBOLS} -D${BIOS_SYMBOLS} -Dam3359 -DSOC_AM335x -DICEV2_AM335X -Dicev2AM335x -I"${COM_TI_UIA_INCLUDE_PATH}" -I"${EDMA3_LLD_INCLUDE_PATH}" -I"${TI_PDK_INCLUDE_PATH}" -I"${BIOS_INCLUDE_PATH}" -I"${PROJECT_ROOT}" -I"${PDK_INSTALL_PATH}/ti/drv/icss_emac/src" -I"${PDK_INSTALL_PATH}/ti/drv/icss_emac" -I"${PDK_INSTALL_PATH}" -I"${PDK_INSTALL_PATH}/ti/starterware" -I"${PDK_INSTALL_PATH}/ti/starterware/include" -I"${PDK_INSTALL_PATH}/ti/starterware/include/hw" -I"${PDK_INSTALL_PATH}/ti/starterware/soc/am335x" -I"${PDK_INSTALL_PATH}/ti/starterware/board" -I"${PDK_INSTALL_PATH}/ti/starterware/board/am335x" -I"${PDK_INSTALL_PATH}/ti/starterware/include/am335x" -I"${PDK_INSTALL_PATH}/ti/starterware/device" -I"${PDK_INSTALL_PATH}/ti/starterware/include/utils" -I"${PDK_INSTALL_PATH}/ti/starterware/soc" -I"${EDMA3LLD_BIOS6_INSTALLDIR}/packages" -I"${CG_TOOL_INCLUDE_PATH}" -ffunction-sections -fdata-sections -gstrict-dwarf -Wall 

    Test3:

    -mcpu=cortex-a8 -march=armv7-a -mtune=cortex-a8 -marm -mfloat-abi=hard -mfpu=neon -D${COM_TI_UIA_SYMBOLS} -D${EDMA3_LLD_SYMBOLS} -D${TI_PDK_SYMBOLS} -D${BIOS_SYMBOLS} -Dam3359 -DSOC_AM335x -DICEV2_AM335X -Dicev2AM335x -I"${COM_TI_UIA_INCLUDE_PATH}" -I"${EDMA3_LLD_INCLUDE_PATH}" -I"${TI_PDK_INCLUDE_PATH}" -I"${BIOS_INCLUDE_PATH}" -I"${PROJECT_ROOT}" -I"${PDK_INSTALL_PATH}/ti/drv/icss_emac/src" -I"${PDK_INSTALL_PATH}/ti/drv/icss_emac" -I"${PDK_INSTALL_PATH}" -I"${PDK_INSTALL_PATH}/ti/starterware" -I"${PDK_INSTALL_PATH}/ti/starterware/include" -I"${PDK_INSTALL_PATH}/ti/starterware/include/hw" -I"${PDK_INSTALL_PATH}/ti/starterware/soc/am335x" -I"${PDK_INSTALL_PATH}/ti/starterware/board" -I"${PDK_INSTALL_PATH}/ti/starterware/board/am335x" -I"${PDK_INSTALL_PATH}/ti/starterware/include/am335x" -I"${PDK_INSTALL_PATH}/ti/starterware/device" -I"${PDK_INSTALL_PATH}/ti/starterware/include/utils" -I"${PDK_INSTALL_PATH}/ti/starterware/soc" -I"${EDMA3LLD_BIOS6_INSTALLDIR}/packages" -I"${CG_TOOL_INCLUDE_PATH}" -I"${CG_TOOL_ROOT}/arm-none-eabi/include/newlib-nano" -ffunction-sections -fdata-sections -gstrict-dwarf -Wall 

    The GCC6.3 gcc.pdf says:

    -ffunction-sections
    -fdata-sections
    Place each function or data item into its own section in the output file if the
    target supports arbitrary sections. The name of the function or the name of
    the data item determines the section’s name in the output file.
    Use these options on systems where the linker can perform optimizations to
    improve locality of reference in the instruction space
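
To make the quoted passage concrete, here is a minimal sketch of what the two flags change. The file name `demo.c`, the function and table names, and the command lines in the comment are hypothetical and not taken from the PDK or the projects in this thread. The section names shown are the ones GCC typically generates on ELF targets. Note that these flags by themselves only change section placement; any size or locality benefit depends on what the linker then does with the sections (for example with `--gc-sections`).

```c
#include <assert.h>

/* Hypothetical example file (demo.c); none of these names come from the PDK.
 *
 * Compile so that each function and data item gets its own input section:
 *   arm-none-eabi-gcc -O2 -ffunction-sections -fdata-sections -c demo.c
 * Link so that the linker may discard sections that nothing refers to,
 * and list what it removed:
 *   arm-none-eabi-gcc demo.o <other objects> -Wl,--gc-sections -Wl,--print-gc-sections
 */

/* With -fdata-sections this table typically lands in .rodata.gain_table
 * instead of the shared .rodata section. */
const int gain_table[4] = { 1, 2, 4, 8 };

/* With -ffunction-sections this typically lands in .text.scale_sample. */
int scale_sample(int sample, unsigned index)
{
    assert(index < 4u);
    return sample * gain_table[index];
}

/* Typically placed in .text.unused_helper. If nothing in the program
 * references it, --gc-sections allows the linker to drop it. */
int unused_helper(int x)
{
    return x + 1;
}
```

Inspecting the generated map file (the `-Wl,-Map` output) is one way to check which per-function sections survived the link and where they were placed.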


    Questions:

    - Is it possible that the current linker cannot perform these optimizations? Or is there any known issue in GCC6.3 w.r.t. -ffunction-sections and/or -fdata-sections?

    thanks for your help,

    Paula

  • The compiler options you show do not use any optimizations.  Please add -O3, or something similar.  I realize that, in an earlier experiment, adding -O3 caused no change in the performance difference between 4.9.x and 6.3.x.  Add it anyway.  It is difficult to accuse the compiler of a performance problem while, at the same time, no optimization switches are used.

    Paula Carrillo said:
    - Is it possible that the current linker cannot perform these optimizations? Or is there any known issue in GCC6.3 w.r.t. -ffunction-sections and/or -fdata-sections?

    I sent a request to the Linaro compiler team with a similar question.  Because of the holidays, I am not sure when we will hear back.

    Thanks and regards,

    -George

  • Hi George, thanks for involving the Linaro compiler team. Meanwhile, let me give you an update. I recompiled the code with GCC6.3 and -O3, and I also checked that the --gc-sections flag was selected in CCS. Same results: no performance improvement.

    Just FYI, Linker flags used:

    -Wl,-Map,"${ProjName}.map" -nostartfiles -Wl,--gc-sections -L"${COM_TI_UIA_LIBRARY_PATH}" -L"${EDMA3_LLD_LIBRARY_PATH}" -L"${TI_PDK_LIBRARY_PATH}" -L"${BIOS_LIBRARY_PATH}" -L"${BIOS_CG_ROOT}/packages/gnu/targets/arm/libs/install-native/arm-none-eabi/lib/hard" -L"${BIOS_INSTALL_PATH}/packages/gnu/targets/arm/libs/install-native/arm-none-eabi/lib/hard" -Wl,--defsym,STACKSIZE=0x1C000 -Wl,--defsym,HEAPSIZE=0x400 -static ${PDK_INSTALL_PATH}/ti/drv/icss_emac/firmware/icss_dualemac/bin/am335x/a8host/REV1/icss_dualemac_PRU0.bin ${PDK_INSTALL_PATH}/ti/drv/icss_emac/firmware/icss_dualemac/bin/am335x/a8host/REV1/icss_dualemac_PRU1.bin --specs=nano.specs 

    Thank you,

    Paula

  • Hi George and Sumit, let me give you the example code and instructions, so you can do some code inspection and/or reproduce the issue. I really appreciate your help.

    TEST 2
    1) Download and install TI-RTOS PSDK 4.1 in default directory (C:\TI)
    2) Go to C:\TI\pdk_am335x_1_0_8\packages\MyExampleProjects\ and unzip "loopback_Softmaster_test2.zip"


    TEST 3
    1) Download and install TI-RTOS PSDK 5.0 in default directory (C:\TI)
    2) Go to "C:\TI\pdk_am335x_1_0_11\packages\MyExampleProjects" and unzip "loopback_Softmaster_test3.zip"

    Common steps:

    3) Open CCS, I have CCSv8.
    4) Clean/build
    5) For running the project you would need an ICEv2 board and a loopback cable

    6) The function to profile is "ICSS_EmacTxPacket", which is called from "test_common_utils.c" and defined inside "icss_emacDrv.c". I call this function 4 consecutive times, simulating Sercos packets in CP2.
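
As a rough software-only complement to the scope measurement in step 6, the following sketch times a number of consecutive calls to a function. It is an illustration under stated assumptions: `tx_fn`, `fake_tx`, and `time_consecutive_calls` are hypothetical names, `fake_tx` is only a stand-in and does not call `ICSS_EmacTxPacket`, and the standard `clock()` used here is coarse. On the target, a hardware timestamp or a GPIO toggle observed on the scope would likely give better resolution.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical signature for the function under test. */
typedef int (*tx_fn)(void *ctx);

/* Stand-in for the real transmit call: it only counts invocations. */
int fake_tx(void *ctx)
{
    int *counter = (int *)ctx;
    (*counter)++;
    return 0;
}

/* Call fn(ctx) 'calls' times back to back and return the elapsed
 * processor time in seconds, as reported by clock(). */
double time_consecutive_calls(tx_fn fn, void *ctx, int calls)
{
    clock_t start;
    clock_t end;
    int i;

    assert(fn != NULL && calls > 0);

    start = clock();
    for (i = 0; i < calls; i++) {
        (void)fn(ctx);
    }
    end = clock();

    return (double)(end - start) / (double)CLOCKS_PER_SEC;
}
```

Calling `time_consecutive_calls(fake_tx, &count, 4)` mirrors the four consecutive transmit calls described in step 6. Comparing the returned time between two toolchain builds only makes sense when both runs use the same packet size and board.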

    loopback_Softmaster_test2.zip

    loopback_Softmaster_test3.zip

    thank you,

    Paula

  • All, after some additional debugging and some tips from Linaro's team, we found out that the issue was related to the memory functions (i.e., memcpy/memset) in "newlib-nano".

    "newlib-nano" is optimized for size, and therefore performance can be reduced in some areas. In this "newlib-nano" build, the memory functions copy byte-wise instead of word-wise.

    As a test, I replaced memcpy() inside ICSS_EmacTxPacketEnqueue() with a word-wise version found online, and rebuilt the PDK and the application. Performance is now similar to TEST2 (PSDK4.1, GCC4.9).
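
For readers who want to see what a word-wise copy looks like, below is a minimal sketch. It is not the version used in the test above (that code was not posted) and it is not newlib's implementation. `copy_wordwise` and `word_alias_t` are hypothetical names, and the `may_alias` attribute is a GCC extension assumed to be available because the thread uses GCC. The sketch copies 32 bits at a time when both pointers are 4-byte aligned and falls back to a byte loop otherwise and for the tail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit type that is allowed to alias other types (GCC extension). */
typedef uint32_t __attribute__((may_alias)) word_alias_t;

/* Hypothetical helper: copies n bytes from src to dst, like memcpy().
 * The regions must not overlap. */
void *copy_wordwise(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    assert(dst != NULL && src != NULL);

    /* Use 32-bit transfers only when both pointers are word aligned. */
    if ((((uintptr_t)d | (uintptr_t)s) & (sizeof(uint32_t) - 1u)) == 0u) {
        word_alias_t *dw = (word_alias_t *)(void *)d;
        const word_alias_t *sw = (const word_alias_t *)(const void *)s;

        while (n >= sizeof(uint32_t)) {
            *dw++ = *sw++;
            n -= sizeof(uint32_t);
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* Copy the remaining bytes (or everything, if unaligned). */
    while (n > 0u) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

A byte-wise loop such as the tail loop above issues one load and one store per byte, while the aligned path moves four bytes per load/store pair, which is the likely source of the difference observed on packet-sized buffers. Linking against a full (non-nano) C library, where the build allows it, may be an alternative to replacing the call by hand.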

    Thank you,

    Paula