
SK-AM62: DDR Access from M4F using IPC RPMsg

Part Number: SK-AM62

Hi Team,

In the related thread, it says that access to DDR from the M4F core takes significantly longer. My customer is considering using IPC RPMsg, and the documentation says that the shared memory can be either DDR or internal memory. Does this mean that loading data from DDR over IPC RPMsg will also take significantly longer? Is there a workaround for this?

Also, when the documentation refers to shared internal memory, is it referring to the 64KB OCRAM?

Best regards,

Mari Tsunoda

  • Hello Mari,

    Does this mean that loading data from DDR over IPC RPMsg will also take significantly longer?

    Yes, the latency to access DDR is significantly higher than that of the OCRAM.

    Also, when the documentation refers to shared internal memory, is it referring to the 64KB OCRAM?

    Yes, the internal shared memory refers to the OCRAM.

    Is there a workaround for this?

    Can you please tell us which cores you will be using for IPC?

    Regards,

    Tushar

  • Hi Tushar,

    Thanks for your reply.

    My customer is currently considering utilizing pins in the MCU domain and WKUP domain, but is concerned about the DDR access latency. Is my understanding correct, as indicated below? Can you also help answer the questions I have highlighted?

    A53 (has cache) <=> M4F (no cache)

    • Shared memory: DDR only
    • Large DDR access latency when the M4F accesses DDR, due to the lack of a cache
    • What is the workaround for the access latency issue for these cores? Is the recommendation simply to develop the application so that DDR access from the M4F is avoided?

    A53 (has cache) <=> R5F (has cache)

    • Shared memory: DDR only
    • No large DDR access latency, since both cores have caches

    R5F (has cache) <=> M4F (no cache)

    • Shared memory: both DDR and OCRAM are OK
    • If OCRAM is used, the access latency is OK
    • Therefore, OCRAM is the recommendation for IPC

    Another question I have: can the shared memory selected differ between core pairs for better performance? (i.e., can DDR be used between the A53 and M4F while OCRAM is used between the R5F and M4F?)

    Also, I see that VRING is used. Do you have more detailed documentation on this? It is mentioned quite frequently in the MCU+ SDK, but with no clear explanation or diagram of how it works. It is also mentioned in the IPC section of the PSDK documentation, again with no clear explanation of what it is. In the PRU-SS section, however, I do see it explained. Does that explanation apply to regular RPMsg? Are there any differences I should be aware of?

    3.6.3.2. RPMsg — Processor SDK AM62x Documentation

    Best regards,

    Mari

  • Hi Mari,

    Generally, while using IPC it is recommended to use non-cached memory regions: a memory update performed by one core may be reflected only in its own cache, so when the other core accesses the shared memory it will read a 'stale' value.
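
    If a cached region must be shared, the usual alternative is explicit cache maintenance: the writer writes back its cache after updating, and the reader invalidates its cache before reading. Below is a minimal sketch using the MCU+ SDK CacheP DPL API; the buffer name, size, and placement are illustrative assumptions, not values from this thread (and on the cache-less M4F these calls are effectively no-ops).

    #include <stdint.h>
    #include <kernel/dpl/CacheP.h>

    /* Hypothetical shared buffer, assumed to be placed in a cached
     * shared-memory region by the linker command file */
    uint8_t gSharedBuf[256];

    void writerSide(void)
    {
        gSharedBuf[0] = 0xA5; /* update the shared data */
        /* write the dirty cache lines back to memory so the
         * other core sees the update */
        CacheP_wb(gSharedBuf, sizeof(gSharedBuf), CacheP_TYPE_ALL);
    }

    void readerSide(void)
    {
        /* invalidate local cache lines so the read fetches
         * fresh data from memory, not a stale copy */
        CacheP_inv(gSharedBuf, sizeof(gSharedBuf), CacheP_TYPE_ALL);
        uint8_t value = gSharedBuf[0];
        (void)value;
    }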

    Can the shared memory selected differ between core pairs for better performance?

    Yes, you can define two separate shared-memory regions and give each core pair access to its own region. If your use case does not require the A53-M4F IPC to share the same memory as the other pair, you can create separate regions for the A53-M4F pair and the M4F-R5F pair, as sketched below.
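
    As a rough illustration of that split, the VRING buffers for each pair could be placed in different memories. The section names, sizes, and alignment below are illustrative assumptions; in a real project the placement is controlled by the linker command file and the SysConfig-generated IPC configuration.

    #include <stdint.h>

    /* Hypothetical VRING sizes; the real sizes depend on the number
     * and size of the message buffers configured for each pair */
    #define VRING_SIZE_A53_M4F   (0x8000u)
    #define VRING_SIZE_R5F_M4F   (0x8000u)

    /* A53 <-> M4F VRINGs placed in a section mapped to DDR */
    uint8_t gVringMemA53M4F[VRING_SIZE_A53_M4F]
        __attribute__((section(".ipc_vring_ddr"), aligned(128)));

    /* R5F <-> M4F VRINGs placed in a section mapped to OCRAM */
    uint8_t gVringMemR5FM4F[VRING_SIZE_R5F_M4F]
        __attribute__((section(".ipc_vring_ocram"), aligned(128)));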

    I see that VRING is used. Do you have more detailed documentation on this?

    The concept of VRING memory is the same as described in the PRU documentation, although the types of data stored in the VRING may differ between the PRU and the Arm cores. If you are interested, I can look into the VRING structure used in the PRU case.

    On a very basic level, a VRING is a shared-memory segment between a pair of CPUs which holds the messages passed between the two. A message travels from the sender to the receiver and back again through this segment.

    The local structure used to maintain the state of a given VRING is shown below (from ipc_rpmsg_priv.h):

    typedef struct
    {
        uint16_t lastUsedIdx;            /* last read index into used Q */
        uint16_t lastAvailIdx;           /* last read index into avail Q */
        uint16_t vringNumBuf;            /* number of buffers in the VRING */
        struct vring_desc  *desc;        /* pointer to buffer descriptors in VRING shared memory */
        struct vring_avail *avail;       /* pointer to avail Q in VRING shared memory */
        struct vring_used  *used;        /* pointer to used Q in VRING shared memory */
        uint8_t            *bufBaseAddr; /* pointer to message buffer 0 in VRING shared memory */
    } RPMessage_Vring;

    I do not think there is documentation explaining the VRING in great detail. I will look into it and share anything I find with you.
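
    At the application level you do not touch these VRING structures directly; the RPMessage API manages them for you. Below is a minimal sketch of a send/receive pair using the MCU+ SDK RPMessage API. The endpoint numbers, the remote core ID, and the message contents are illustrative assumptions, and the IPC/VRING initialization is assumed to have been done already by the SysConfig-generated startup code.

    #include <string.h>
    #include <drivers/ipc_rpmsg.h>

    #define LOCAL_END_PT    (14u)   /* illustrative endpoint numbers */
    #define REMOTE_END_PT   (13u)

    RPMessage_Object gRecvMsgObj;

    void ipcExample(void)
    {
        /* create a local endpoint to receive messages on */
        RPMessage_CreateParams createParams;
        RPMessage_CreateParams_init(&createParams);
        createParams.localEndPt = LOCAL_END_PT;
        RPMessage_construct(&gRecvMsgObj, &createParams);

        /* send a message: the payload is copied into a free buffer
         * from the avail Q of the TX VRING and the remote core is
         * notified ("kicked") */
        char sendMsg[] = "hello";
        RPMessage_send(sendMsg, strlen(sendMsg) + 1,
                       CSL_CORE_ID_A53SS0_0, /* remote core, assumed name */
                       REMOTE_END_PT, LOCAL_END_PT,
                       SystemP_WAIT_FOREVER);

        /* block until a message lands in the used Q of the RX VRING */
        char     recvMsg[64];
        uint16_t recvLen = sizeof(recvMsg);
        uint16_t remoteCoreId, remoteEndPt;
        RPMessage_recv(&gRecvMsgObj, recvMsg, &recvLen,
                       &remoteCoreId, &remoteEndPt, SystemP_WAIT_FOREVER);
    }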

    Regards,

    Nitika

  • Hi Nitika,

    Thanks for the reply.

    Can you elaborate on this for me? I think I had a general misunderstanding about IPC, as I thought the lack of a cache would be the limiting factor for DDR access time.

    Generally, while using IPC it is recommended to use non-cached memory regions: a memory update performed by one core may be reflected only in its own cache, so when the other core accesses the shared memory it will read a 'stale' value.

    So is it normal to expect large DDR access latency?

    Best regards,

    Mari Tsunoda

  • Hi Mari,

    Yes, the access latency associated with M4F DDR access in the case of IPC is expected.

    The shared memory used for IPC communication is marked as non-cached for the reason I stated above.
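
    For reference, on the Cortex-R5F/M4F side a region is typically marked non-cached through an MPU region entry. Below is a sketch using the MCU+ SDK MpuP_armv7 DPL API; the region number, base address, and size are illustrative assumptions, and in practice these entries are generated by SysConfig rather than written by hand.

    #include <kernel/dpl/MpuP_armv7.h>

    void mapIpcRegionNonCached(void)
    {
        MpuP_RegionAttrs attrs;

        MpuP_RegionAttrs_init(&attrs);
        attrs.isEnable     = 1;
        attrs.isCacheable  = 0; /* non-cached: both cores always see memory */
        attrs.isBufferable = 0;
        attrs.isSharable   = 1;
        attrs.accessPerm   = MpuP_AP_ALL_RW;

        /* region 4 covering a hypothetical 1 MB IPC shared-memory window */
        MpuP_setRegion(4, (void *)0x9CB00000u, MpuP_RegionSize_1M, &attrs);
    }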

    If you want to know more about IPC, you can go through the below guides:

    1. MCU+SDK IPC guide: https://software-dl.ti.com/mcu-plus-sdk/esd/AM64X/09_02_00_50/exports/docs/api_guide_am64x/IPC_GUIDE.html

    2. AM64x academy IPC guide: https://dev.ti.com/tirex/explore/node?node=A__ASn.0Gvx.CK7j7a0EWKc.w__AM64-ACADEMY__WI1KRXP__LATEST

    3. Processor SDK IPC documentation: software-dl.ti.com/.../Foundational_Components_IPC64x.html

    Regards,

    Nitika

  • Hi Nitika,

    Thanks. 

    Is there any significant difference between DDR code execution on the M4F and DDR code execution on the A53 or R5F cores, or is the access penalty the same? I understand that there is a significant delay (100x, as mentioned in the original thread) compared with internal SRAM code execution. What about comparing DDR code execution across the different cores? If the latency is the same, what is the workaround, since the DDR execution latency would be a bottleneck?

    Also, regarding the 100x latency mentioned: can we provide concrete benchmark values for how long an access takes, rather than just a relative comparison? Any additional information would be helpful for my customer.

    Also, can the M4F access instruction code and data allocated inside the A53 memory, and also inside the PRUSS internal memory? The M4F internal memory may not be enough, so they want to use the aforementioned internal memories. Do we have any performance information on these as well? Is there any other internal memory they could use as an extension of the M4F internal RAM?

    Also, is there any memory that can be used as an M4F cache? My understanding is that there isn't.

    Best regards,

    Mari Tsunoda

  • Hi Nitika,

    Can I get an update on this?

    Best regards,

    Mari Tsunoda

  • Hi Mari,

    I have sent your query to the author of the original FAQ; they are more familiar with the domain of your questions.

    Allow them some time to get back to you.

    Regards,

    Nitika

  • Hi Nitika,

    Thanks for the response.

    Is it possible to get a response by tomorrow? We have a meeting with the customer then. Sorry for the rush.

    Best regards,

    Mari Tsunoda

  • Hi Mari,

    Can you please tell us the exact size of the application?

    Are you executing code from DDR, or do you just want to store the data section in DDR?

    What are the sizes of the .text and .data sections in the application code?

    Also, can you please share the map file of the project?

    Regards,

    Tushar

  • Hi Tushar,

    Thanks for your reply.

    They are just starting to consider IPC, so the exact application size is not known yet.

    Are you executing code from DDR, or do you just want to store the data section in DDR?

    How different is the latency in the two cases? I assume that executing code from DDR is very slow.

    Best regards,

    Mari Tsunoda

  • Hello Mari,

    They are just starting to consider IPC, so the exact application size is not known yet.

    If the size of the application or its .text section is less than 255KB, I would suggest keeping the .text section in the M4F's internal RAM and the data section in DDR.
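
    As a rough sketch of that split, hot code can be pinned to the internal RAM and large buffers to DDR via dedicated linker sections; the section and symbol names below are illustrative assumptions and must match SECTIONS entries in the project's linker command file.

    #include <stdint.h>

    /* hot code placed in a section mapped to the M4F's internal RAM */
    __attribute__((section(".text.iram")))
    void processSamples(int32_t *buf, uint32_t count)
    {
        for (uint32_t i = 0u; i < count; i++)
        {
            buf[i] = buf[i] * 2; /* placeholder work */
        }
    }

    /* large data kept in a section mapped to DDR, where the
     * per-access latency penalty is paid */
    __attribute__((section(".bss.ddr")))
    int32_t gSampleBuf[64 * 1024];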

    How different is latency for both? I assume that executing code from DDR is very slow. 

    Yes, executing code and accessing data from DDR is ~4.25x slower than accessing data from DDR while executing code from the M4F's RAM.

    Note: The above method will only improve performance in terms of code execution; the latency of the IPC communication will remain the same.

    Regards,

    Tushar