MCU-PLUS-SDK-AM263X: TCM Enable

Part Number: MCU-PLUS-SDK-AM263X
Other Parts Discussed in Thread: AM2634

Hi,

I'm currently investigating placing code in Tightly Coupled Memory (TCM) on the AM2634 platform to improve the performance of critical code sections.

Having placed test code in TCM and compared its execution with OCRAM, I found the number of CPU cycles to be exactly the same. I've confirmed that the code has been placed in the correct memory regions, both from the map file and by inspecting the memory region in the debugger.
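For a cycle comparison like this, one option on the R5F is to read the ARMv7-R PMU cycle counter directly. This is a minimal sketch, assuming privileged execution and the CP15 encodings PMCR (c9, c12, 0), PMCNTENSET (c9, c12, 1) and PMCCNTR (c9, c13, 0) from the ARMv7-R architecture; all function names are hypothetical, and off-target a stand-in counter replaces the CP15 reads so the measurement logic still compiles.

```c
#include <stdint.h>
#include <assert.h>

#if defined(__ARM_ARCH_7R__)
/* Enable the PMU and its cycle counter (privileged mode assumed). */
static inline void pmu_cycle_counter_start(void)
{
    uint32_t pmcr;
    __asm volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));
    pmcr |= (1u << 0) | (1u << 2);        /* E: enable counters, C: reset cycle counter */
    __asm volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(pmcr));
    __asm volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u << 31)); /* PMCNTENSET: cycle counter */
    __asm volatile("isb");
}

/* Read PMCCNTR, the 32-bit cycle counter. */
static inline uint32_t pmu_read_cycles(void)
{
    uint32_t v;
    __asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v));
    return v;
}
#else
/* Host stand-in: advances by a fixed 100 "cycles" per read. Not real timing. */
static uint32_t fake_cycles;
static inline void pmu_cycle_counter_start(void) { fake_cycles = 0u; }
static inline uint32_t pmu_read_cycles(void) { return fake_cycles += 100u; }
#endif

/* Unsigned subtraction stays correct across one 32-bit counter wrap. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Hypothetical code under test; the volatile sink keeps the loop from being optimized away. */
static volatile uint32_t sink;
static void code_under_test(void)
{
    for (uint32_t i = 0u; i < 1000u; i++) {
        sink += i;
    }
}

/* Cycles spent in fn(), including the small fixed cost of the two counter reads. */
static uint32_t measure_cycles(void (*fn)(void))
{
    uint32_t t0 = pmu_read_cycles();
    fn();
    uint32_t t1 = pmu_read_cycles();
    return cycles_elapsed(t0, t1);
}
```

Running the same `measure_cycles` call on a TCM-placed copy and an OCRAM-placed copy of the routine gives directly comparable numbers.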

Checking section 7.1.3.2.2 Tightly-Coupled Memories (TCMs) of the TRM, I found a passage stating that the TCM needs to be enabled, which refers to an ENABLE bit.

Searching through the reference manual and register addendum, I can find no reference to this ENABLE bit or any other method of confirming whether the TCM is enabled.

Is it possible to get more information on how the TCM is enabled, and how to confirm that it is being utilised by the R5F core at runtime?
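One way to check this at runtime, assuming the ENABLE bit in question is the one in the Cortex-R5 TCM Region Registers: per our reading of the Cortex-R5 TRM, the ATCM register is read with `mrc p15, 0, Rd, c9, c1, 1` and the BTCM register with `mrc p15, 0, Rd, c9, c1, 0`, with the base address in bits [31:12], a size encoding in bits [6:2] and Enable in bit 0. Please verify that layout against the TRM; the reads need privileged mode and are compiled only for an ARMv7-R target, while the decoder itself is plain C. All names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Decoded TCM Region Register (assumed layout: [31:12] base,
   [6:2] size encoding, [0] enable). */
typedef struct {
    uint32_t base;     /* region base address */
    uint32_t size_kb;  /* size in KB; 0 if the field is 0 or a reserved value */
    bool     enabled;  /* bit 0: TCM enabled */
} TcmRegion;

static TcmRegion tcm_region_decode(uint32_t reg)
{
    TcmRegion r;
    uint32_t enc = (reg >> 2) & 0x1Fu;

    r.base = reg & 0xFFFFF000u;
    /* 0b00011 = 4 KB; each increment doubles the size, up to 0b01110 = 8 MB. */
    r.size_kb = (enc >= 3u && enc <= 14u) ? (1u << (enc - 1u)) : 0u;
    r.enabled = (reg & 1u) != 0u;
    return r;
}

#if defined(__ARM_ARCH_7R__)
/* Privileged-mode reads of the ATCM and BTCM Region Registers. */
static inline uint32_t read_atcm_region_reg(void)
{
    uint32_t v;
    __asm volatile("mrc p15, 0, %0, c9, c1, 1" : "=r"(v));
    return v;
}

static inline uint32_t read_btcm_region_reg(void)
{
    uint32_t v;
    __asm volatile("mrc p15, 0, %0, c9, c1, 0" : "=r"(v));
    return v;
}
#endif
```

For example, a hypothetical BTCM register value of 0x00080019 would decode as base 0x80000, 32 KB, enabled.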

Thanks,

Carwyn

  • Hi Carwyn,

    When you connect your AM2634 to the debugger and open the memory browser in CCS, are you able to view the memory regions? For example, if I run the UART Echo example from the SDK, this is the TCM view in the memory browser:

    TCMA start address: 0x40, length = 0x7FC0.

    TCMB start address: 0x80000, length = 0x8000.
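    To confirm placement from code as well as from the map file, a function's address can be compared against these ranges. A minimal sketch using the UART Echo ranges above (your project's ranges may differ, so take them from its linker.cmd or the SysConfig Memory Configurator); the macro and function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* TCM ranges as seen in the CCS memory browser for the UART Echo example. */
#define TCMA_START  0x00000040u
#define TCMA_LENGTH 0x00007FC0u
#define TCMB_START  0x00080000u
#define TCMB_LENGTH 0x00008000u

static bool in_range(uintptr_t addr, uintptr_t start, uintptr_t length)
{
    /* addr - start < length avoids overflow at the top of the range. */
    return addr >= start && (addr - start) < length;
}

/* Thumb function pointers have bit 0 set; clearing it gives the code address. */
static bool addr_in_tcm(uintptr_t addr)
{
    addr &= ~(uintptr_t)1u;
    return in_range(addr, TCMA_START, TCMA_LENGTH) ||
           in_range(addr, TCMB_START, TCMB_LENGTH);
}
```

On target, `addr_in_tcm((uintptr_t)&critical_function)`, where `critical_function` stands for one of your TCM-placed functions, should return true.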

    The TCM does not need to be enabled by the application. You can see the TCM configuration in your application's example.syscfg under Memory Configurator -> Memory region -> Regions.

    My best guess is that your OCRAM region is cached, so your reads and writes are served by the cache rather than by the RAM itself. TCM is single-cycle, zero-wait-state memory, so a read or write should theoretically take one CPU cycle, while an access to the OCRAM itself takes considerably longer.

    In the MPU configuration in example.syscfg, try marking the OCRAM region as non-cached, then rebuild the example and benchmark again.

    Regards,

    Shaunak

  • Hi Shaunak,

    Yes, I can see the memory regions in CCS for the TCM and can confirm that all functions being tested have been placed there.

    I have set the OCRAM to non-cached in the MPU as you suggested and can confirm that it does take significantly longer than when set to cached. This would explain how memory from OCRAM was able to run at the same rate as memory in the TCM.

    Could you advise on how the MPU should be configured for the TCM regions? 

    Thanks,

    Carwyn

  • Hi Carwyn,

    I would say there is no generally recommended configuration; it is use-case specific. These are the settings I have used in past profiling activities:

    TCMA: (MPU settings screenshot not preserved)

    TCMB: (MPU settings screenshot not preserved)
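    Whatever attributes are chosen, each R5F MPU region is also defined by a base address and a size. As a sketch, assuming the ARMv7-R PMSA Region Size and Enable Register layout (bit 0 enable, bits [5:1] N with region size 2^(N+1) bytes, minimum 32 bytes) and the rule that a region base must be aligned to its size, the size field for the 32 KB TCM regions can be computed as below. The function names are hypothetical; in the SDK these values are generated from example.syscfg.

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Build a region size/enable value for a power-of-two size >= 32 bytes.
   Returns 0 (disabled, no size) if the size cannot be encoded. */
static uint32_t mpu_drsr_encode(uint32_t size_bytes, bool enable)
{
    uint32_t log2 = 0u;

    if (size_bytes < 32u || (size_bytes & (size_bytes - 1u)) != 0u) {
        return 0u;
    }
    while ((1u << log2) != size_bytes) {
        log2++;
    }
    /* Region size = 2^(N+1) bytes, so N = log2(size) - 1, stored in bits [5:1]. */
    return ((log2 - 1u) << 1) | (enable ? 1u : 0u);
}

/* An MPU region base must be aligned to the region size. */
static bool mpu_base_aligned(uint32_t base, uint32_t size_bytes)
{
    return (base & (size_bytes - 1u)) == 0u;
}
```

The alignment rule is why an MPU region for TCMA would start at 0x0 and cover the full 32 KB, rather than starting at the 0x40 offset used for code placement.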

    Regards,
    Shaunak

  • Hi Shaunak,

    I've tested the TCM configured as both cached and non-cached and found no difference in the number of CPU cycles; however, it does take slightly fewer cycles than cached OCRAM.

    Are you able to provide more information on what changing the MPU region attributes is actually doing? Is there a register field which is being changed?

    Thanks,

    Carwyn

  • Hi Carwyn,

    Marking the TCM as cached or non-cached will not make any difference, because TCM is zero-wait-state, single-cycle-access memory. What memory access latency are you seeing with TCM, and is it an issue for your use case?

    Changing the region attributes in example.syscfg does the following:

    1. Sets the region attributes: cacheable, shareable, bufferable, and the permission to execute code from the region.

    2. These settings are written to ti_dpl_config.c, which is auto-generated from SysConfig. When your application starts, the MPU initialization is called and programs the MPU region attributes based on this configuration (see MpuP_armv7r_asm.S).

    3. You can read about which registers are modified in the ARM R5F documentation: developer.arm.com/.../Region-attributes
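    To make the register side concrete: on ARMv7-R the attributes end up in the Region Access Control Register, with XN in bit 12, AP in bits [10:8], TEX in bits [5:3], S in bit 2, C in bit 1 and B in bit 0. The sketch below shows how the cached and non-cached settings discussed here map onto those bits, assuming the ARMv7-R encodings TEX=0b001, C=1, B=1 (normal, write-back write-allocate) and TEX=0b001, C=0, B=0 (normal, non-cacheable). The struct and function names are hypothetical and are not the SDK's MpuP types.

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Hypothetical attribute set mirroring the SysConfig MPU options. */
typedef struct {
    bool    execute_never;  /* XN, bit 12 */
    uint8_t access_perm;    /* AP, bits [10:8]; 0x3 = full access */
    uint8_t tex;            /* TEX, bits [5:3] */
    bool    shareable;      /* S, bit 2 */
    bool    cacheable;      /* C, bit 1 */
    bool    bufferable;     /* B, bit 0 */
} MpuAttrs;

static uint32_t mpu_dracr_encode(const MpuAttrs *a)
{
    return ((a->execute_never ? 1u : 0u) << 12) |
           ((uint32_t)(a->access_perm & 0x7u) << 8) |
           ((uint32_t)(a->tex & 0x7u) << 3) |
           ((a->shareable ? 1u : 0u) << 2) |
           ((a->cacheable ? 1u : 0u) << 1) |
           (a->bufferable ? 1u : 0u);
}

/* Normal memory, write-back write-allocate, executable. */
static const MpuAttrs kCachedCode = {
    .execute_never = false, .access_perm = 0x3u, .tex = 0x1u,
    .shareable = false, .cacheable = true, .bufferable = true
};

/* Normal memory, non-cacheable, executable. */
static const MpuAttrs kUncachedCode = {
    .execute_never = false, .access_perm = 0x3u, .tex = 0x1u,
    .shareable = false, .cacheable = false, .bufferable = false
};
```

For a TCM region the C and B bits make no measurable difference to latency, but XN and AP still govern whether code may execute from it and who may access it.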

    Regards,
    Shaunak

  • Hi Shaunak,

    Thanks for all the information, it was very helpful and clarifies how the TCM is being configured.

    From running a simple counter for testing, the CPU cycles recorded were as follows:

    TCM - 22220
    OCRAM cache - 22329
    OCRAM non-cache - 156730

    From this test the conclusion is that TCM is marginally better than cached OCRAM but significantly better than non-cached OCRAM.
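    Putting the three measurements side by side makes the gap concrete (same counter loop, cycle counts as reported above):

```latex
\[
\frac{22329 - 22220}{22220} = \frac{109}{22220} \approx 0.49\%,
\qquad
\frac{156730}{22220} \approx 7.05
\]
```

    So cached OCRAM was about 0.5% slower than TCM in this test, while non-cached OCRAM took roughly seven times as many cycles.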

    Thanks,

    Carwyn

  • Hi Carwyn,

    TCM and cache have almost the same access latency. What might be happening here is one of the following:

    Scenario-1: TCM vs OCRAM cached (all cache hits)

    In this case the cycle counts will be almost the same, because the CPU fetches from the cache rather than from the RAM itself.

    Scenario-2: TCM vs OCRAM cached (some cache misses)

    This is what I think is happening in the data you shared. On a cache hit, the access time is the same for cache and TCM; on a cache miss, the CPU has to fetch the data from RAM, which adds cycles and produces the difference you see.

    Scenario-3: TCM vs OCRAM non-cached

    Since no cache is involved, all data is fetched from OCRAM, so this scenario shows the raw OCRAM latency without the cache or any other optimization.
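    The three scenarios fit the usual average-access-time model, where h is the cache hit rate, t_hit the cache (or TCM) access time and t_penalty the extra cycles spent fetching from OCRAM on a miss. This is a simplification that ignores write buffering and prefetching:

```latex
\[
t_{\text{avg}} = t_{\text{hit}} + (1 - h)\, t_{\text{penalty}}
\]
```

    Scenario-1 is h = 1, giving t_avg = t_hit, the same as TCM. Scenario-2 is h slightly below 1, which adds a small (1 - h) t_penalty term. Scenario-3 bypasses the cache, so every access pays the full OCRAM latency. TCM behaves like h = 1 by construction.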

    Regards,
    Shaunak

  • Hi Shaunak,

    So, to clarify: TCM offers a performance advantage over cached memory because it avoids cache misses, and the only disadvantage is the need to assign each function to TCM in the source code and linker file. Is this correct?
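    In source code, this placement is usually done with a section attribute plus a matching entry in the linker command file. A minimal sketch, assuming a compiler that accepts the GCC-style `section` attribute (clang-based compilers such as tiarmclang do) and a hypothetical output section named `.tcma_code`; the linker command file would also need that section mapped into the TCMA memory region (check the MEMORY block of your generated linker.cmd for the region's name).

```c
#include <stdint.h>
#include <assert.h>

/* Place a function in a named section on target builds only.
   ".tcma_code" is a hypothetical name; it must match a section that the
   linker command file maps into TCMA. */
#if defined(__ARM_ARCH_7R__)
#define TCMA_CODE __attribute__((section(".tcma_code")))
#else
#define TCMA_CODE /* host build: default .text placement */
#endif

/* Example critical routine: sum of a buffer. */
TCMA_CODE uint32_t critical_sum(const uint32_t *buf, uint32_t n)
{
    uint32_t sum = 0u;
    for (uint32_t i = 0u; i < n; i++) {
        sum += buf[i];
    }
    return sum;
}
```

    Wrapping the attribute in a macro keeps the set of TCM-resident functions easy to audit, which matters because TCMA holds only 0x7FC0 bytes after the vectors.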

    Thanks,

    Carwyn

  • Hi Carwyn,

    The CPU cycles needed to access the cache and the TCM are the same. When data is not found in the cache, the CPU fetches it from OCRAM. TCM eliminates this: because you place data and functions in TCM explicitly, there is no possibility of a "miss".

    I recommend going through the ARM R5F documentation: https://developer.arm.com/documentation/den0042/a/Tightly-Coupled-Memory/Performance-of-TCM-compared-to-cache

    Regards,
    Shaunak