
Compiler/RM57L843: Understanding .tramp section in map file and placing RTS function calls in TI ramfunc section.

Part Number: RM57L843

Tool/software: TI C/C++ Compiler

Champs,

I have a couple of questions about some code that we are trying to optimize. I have attached the map file for the R5 binary built from the application.

drv_gpio_led_blink_app_tirtos_mcu1_0_release.xer5f.map

In the .TI.ramfunc section, I have observed trampoline references (.tramp) to the TI ARM RTS libraries. Does this indicate that the code calls an RTS function that is not placed in that section? The critical code is currently placed in TCM memory, and the RTS functions are in OCRAM memory. Is there a way that I can prevent the branch, or place those RTS library code sections in .TI.ramfunc to avoid the penalty of going to SRAM via cache?

Regards,

Rahul

  • From your map file I can tell there is a function named various_type_calc_proc1, located in the memory range named MCU0_R5F_TCMA, which, at 4 different sites, calls the RTS function __aeabi_lmul, located in the memory range named OCMRAM.  This RTS routine performs 64-bit multiply. (I presume you are not interested in how I worked that out.  If you want, I can cover how that is done.)

    Rahul Prabhu said:
    Is there a way that I can prevent the branch

    No

    Rahul Prabhu said:
    or place those RTS library code section in .TI.ramfunc to avoid penalty

    Yes.  I'll explain how to do that.

    But first, consider whether it is worth it.  While there are 4 call sites, that says nothing about how often this RTS function is called.  It may be called a lot.  It may be called very few times.  And if it is called very few times, then this optimization is unlikely to be worth the trouble.

    I understand why you think that the RTS function needs to be in the same section.  That isn't completely wrong, but it is imprecise.  It is precise to say the difference in memory addresses between a call site and the destination of the call cannot exceed that which is supported by the function call instruction(s).  (I'm sorry I don't know the exact number.  I think it is about a megabyte, or 2 ** 20 bytes.)  This means the RTS function doesn't have to be in the same section, it just needs to be closer in memory.  There is probably an entry in the linker command file similar to ...

        .TI.ramfunc > MCU0_R5F_TCMA

    The idea is to create another output section for the desired RTS function, and allocate it to the same memory range.  Something similar to ...

        .rts.ramfunc : 
        {
           rtsv7R4_A_le_v3D16_eabi.lib<ll_mul32*.obj>(.text)
        }
          > MCU0_R5F_TCMA
    

    Line 1 names the output section.  Line 3 says to get the .text (code) section from the object file ll_mul32.obj in the RTS library rtsv7R4_A_le_v3D16_eabi.lib.  This line unfortunately hard codes some project specific details.  But that is unavoidable.  The precise library name must be used.  The syntax ll_mul32*.obj says to use any object file name that begins with ll_mul32 and ends with .obj.  This accounts for the fact that in 16.9.x.LTS releases, the file with the 64-bit multiply routine is named ll_mul32.obj.  In 18.1.x.LTS releases, this same file is named ll_mul32.asm.obj.  Line 5 allocates the output section to the same memory range.
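
    For context, here is a minimal sketch of how the two entries might sit side by side inside the SECTIONS directive of a linker command file.  The memory range name MCU0_R5F_TCMA comes from the map file discussed above.  The surrounding layout and the existing .TI.ramfunc entry are assumptions, and the library name must match whatever RTS library your build actually links.

        SECTIONS
        {
            /* Existing entry (assumed): application code marked as ramfunc */
            .TI.ramfunc > MCU0_R5F_TCMA

            /* New entry: 64-bit multiply routine taken from the RTS library */
            .rts.ramfunc :
            {
               rtsv7R4_A_le_v3D16_eabi.lib<ll_mul32*.obj>(.text)
            }
              > MCU0_R5F_TCMA
        }

    If the placement takes effect, the relinked map file should show the ll_mul32 object file under .rts.ramfunc in MCU0_R5F_TCMA, and the trampoline entries for __aeabi_lmul should no longer be needed.  Checking the map file after the change is the reliable way to confirm this.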

    For further details on the syntax for allocating functions from the RTS library, please search the ARM assembly tools manual for the section titled Specifying Library or Archive Members as Input to Output Sections.  For further information about trampolines, search the same manual for the section titled Generate Far Call Trampolines.

    Thanks and regards,

    -George

  • Thanks George. This is very useful and indeed answers my question.

    The reason to place the RTS function in TCM memory is that the function is called about 100 times across different iterations, and the code is currently in OCRAM memory, which is accessed through the cache, as opposed to TCM, which is directly connected to the core.

    I made this modification and see the benchmark drop by about 3 usec (previously 18 usec), so avoiding the branch to OCRAM does seem to make an impact. Thanks for the solution and the detailed explanation here.

    Regards,
    Rahul
  • George,

    One follow-up question: we also see the following in the map file:

    00004ac0      00000008 <whole-program> (.tramp.sqrt.1)

    It is unclear whether this links to an RTS library. Is there some way to eliminate this trampoline call, or is it something generated as part of link-time optimization that we can leave in the binary?

    Regards,

    Rahul

  • I don't see this line in the map file you attached to your first post.  So, I can only give you general information about it.

    This is the trampoline, created by the linker, to the function sqrt.  If you search for the address 00004ac0, you will eventually find the entry in the FAR CALL TRAMPOLINES table, which gives more information about it, including how many call sites branch to this trampoline and the addresses of those call sites.

    Rahul Prabhu said:
    Is there some way to eliminate this trampoline call

    Yes, but it may not be practical.  One method is to move things around so the call sites and the destination of the call are close in memory.  The other is to change the code to not call that function.  

    Thanks and regards,

    -George

  • Understood, thanks. Moving the code closer in memory is not an option at present, so I will leave the trampoline in for the time being. Eliminating the other trampolines to RTS functions from the code executing in the R5F's TCM memory saved us the 5-10% performance penalty that we would otherwise observe when the call goes to OCRAM through the cache, so this is indeed useful information for users.


    Regards,
    Rahul