
The most efficient IPC mechanism on C6678

Hi,

I'm trying to find the most efficient way to exchange "data" between several cores on the C6678 DSP.  I've attached a figure of the basic application topology.

Background information:

We have two independent real-time digital signal processing (RT-DSP) paths for channel 1 and channel 2. The total workload of each data-plane path is partitioned between two cores (core pipelining). In addition, co-processing cores provide different signal parameters to the RT-DSP tasks running on the data plane.

Requirements:

What I mean "with the most efficient way to exchange data between cores" is in terms of minimal latency. Further, with "minimal latency" is not meant a few microseconds but rather a few core clock cycles. The size of the exchanged data packages between the cores is configurable and ranges from one single data samples (single precision floating-point) --> 4Byte up to several tenth of data samples --> max. 256Byte. The data flow between all cores is absolutely regular and invariant --> continuous data streaming application.

The hard timing constraint (for completing the entire data-plane path (RT-DSP#1 and RT-DSP#2) for a single data sample) is 200 ns and must not be exceeded! This translates into only 250 instruction cycles at the 1.25 GHz core clock. The total complexity of the signal processing performed on the data plane is below 400 instructions. Partitioning the total workload between two cores results in a workload of fewer than 200 instructions each for RT-DSP#1 and RT-DSP#2. That's why every single instruction cycle counts...

Solution Possibilities:

What I can say for sure is that the application will not use any OS (e.g. SYS/BIOS). I would guess that OS-based IPC mechanisms would take several hundred core clock cycles.

So far I have gained a rough overview of the C6678 architecture and now have some idea of the different mechanisms available for handling IPC.

To name just a few:

  1. EDMA3 (-->for signaling events after data exchange completion via HW-IRQ)
  2. Multicore Navigator incl. PKTDMA and QM (-->for signaling events and data exchange)
  3. Semaphore2 hardware (-->only for signaling events. data exchange is done via shared memory)
  4. Notification/Signaling via SW-Interrupts (-->only for signaling events. data exchange is done via shared memory)
  5. Shared memory (MSMC-RAM) (-->for signaling events and data exchange)
  6. Direct access to local L2 memory of destination core-pack (-->for signaling events and data exchange)
  7. etc.

Possibly you could complete the above list with additional techniques and give me some advice on what to look at first in more detail and why (ideally the most promising mechanism with regard to the requirements mentioned above). I would like to avoid writing a benchmark application for every available IPC mechanism.
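
For the plain shared-memory variant (option 5 above), the kind of bare-metal handshake I have in mind looks roughly like the sketch below. The MSMC address is only a placeholder (the structure would really be placed via the linker command file), and cacheability (MAR) and write ordering (MFENCE) are deliberately left out:

    /* Option 5 sketch: single-producer/single-consumer mailbox in shared MSMC
       SRAM with a plain flag for signaling (bare metal, no OS). */

    #include <stdint.h>

    #define MSMC_MAILBOX_ADDR  0x0C000000u   /* start of MSMC SRAM (placeholder) */
    #define MAX_SAMPLES        64u           /* up to 256 bytes of float samples */

    typedef struct {
        volatile uint32_t valid;             /* 0 = empty, 1 = data ready */
        volatile uint32_t count;             /* number of valid samples   */
        float             samples[MAX_SAMPLES];
    } Mailbox;

    #define MBOX ((Mailbox *)MSMC_MAILBOX_ADDR)

    /* Producer core */
    static inline void msmc_send(const float *data, uint32_t n)
    {
        uint32_t i;
        while (MBOX->valid)                  /* wait until the consumer drained it */
            ;
        for (i = 0; i < n; i++)
            MBOX->samples[i] = data[i];
        MBOX->count = n;
        MBOX->valid = 1u;                    /* flag is written last */
    }

    /* Consumer core */
    static inline uint32_t msmc_receive(float *out)
    {
        uint32_t i, n;
        while (!MBOX->valid)                 /* busy-wait: a few cycles per poll */
            ;
        n = MBOX->count;
        for (i = 0; i < n; i++)
            out[i] = MBOX->samples[i];
        MBOX->valid = 0u;                    /* release the mailbox */
        return n;
    }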

Kind regards,

Viktor  

    

  • Hi Viktor,

    Since you are not going to use any RTOS, you can't use the SW IPC package, as it is based on SYS/BIOS. How do you get the samples? Can you move the samples directly into the private L2 of the core that handles the data?

    Thanks,

    HR

  • Hi HR,

    by "SW IPC", do you mean these packages?

    • ti.ipc: Contains common interface files (e.g. MessageQ, ListMP, HeapMemMP, Notify, etc.)
    • ti.sdo.utils: Contains helper modules (e.g. MultiProc, List and NameServer)
    • ti.sdo.ipc: Contains multicore modules (e.g. MessageQ, Notify, ListMP, etc.)

    I think I have read in one of the countless user guides :-) that this mechanism is only available with SYS/BIOS.

    Since this is only an evaluation application, the input samples are synthetically generated/simulated on one CorePac, which then feeds the first CorePacs inside the data plane.

    In the final application it is intended to receive the required data samples (ADC samples) via HyperLink, but that task will be addressed in the next evaluation step. It is likely that the DSP core currently responsible for generating the sample data will then handle receiving the data via HyperLink and dispatching the samples to the appropriate data-plane path (CH1 or CH2).

    I think I should be able to move the data into the L2 of the core. So, would you suggest handling the entire communication/notification process via direct accesses to the other core's private L2?

    regards,

    viktor 

     

  • Hi Viktor,

    Yes, you will not be able to use it, as you are not using SYS/BIOS.

    You can use HyperLink to push the data directly into the desired CPU's L2. Since this is not a large amount of data (4-256 bytes), it should be kept as close as possible to the core; assuming L1 is configured as cache, the closest place is L2.

    Thanks,

    HR

  • Hi HR,

    OK, HyperLink sounds interesting.

    You are right, L1D is planned to be a cache. L1P will be partly cache, with the remainder holding time-critical code statically in L1P SRAM.

    Just to make sure I understand you correctly:

    I would use the L2 of each CorePac to hold message buffers (ping-pong) and a notify flag. Each CorePac in the signal processing chain (the total path length is 4 DSP cores: 2x for the signal processing pipeline, 1x CorePac that generates the sample data and 1x CorePac that performs signal analysis on the calculated results; see also the attached figure) receives its input samples directly into its private L2 ping-pong buffer, gets a notification that new data is valid and starts the signal processing thereon. During the ongoing signal processing, CorePac_#1 writes the newly calculated results directly into the private L2 ping-pong buffer of the adjacent CorePac_#2. The same applies to CorePac_#3.
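
    A rough sketch of that scheme, assuming core n's local L2 at 0x0080xxxx is visible to other masters at 0x10800000 + n*0x01000000 (as on the C6678); the buffer address is a placeholder, and MAR/cache settings as well as MFENCE ordering are open points not shown here:

        /* Ping-pong buffers plus notify flag in the consumer's private L2. */

        #include <stdint.h>

        /* Core n's local L2 address mapped into the global address space. */
        #define L2_GLOBAL(core, local) \
            (0x10000000u + ((uint32_t)(core) << 24) + (uint32_t)(local))

        #define MBOX_LOCAL   0x00810000u     /* local L2 address of the ping-pong pair (placeholder) */
        #define MAX_SAMPLES  64u

        typedef struct {
            volatile uint32_t valid;         /* notify flag: 1 = new data */
            float             samples[MAX_SAMPLES];
        } PingPong;

        /* CorePac_#1: push a result block into CorePac_#2's L2 (pp = 0 or 1). */
        static inline void push_to_next_core(uint32_t next_core, uint32_t pp,
                                             const float *res, uint32_t n)
        {
            PingPong *dst = (PingPong *)L2_GLOBAL(next_core, MBOX_LOCAL) + pp;
            uint32_t i;
            for (i = 0; i < n; i++)
                dst->samples[i] = res[i];
            dst->valid = 1u;                 /* data first, flag last */
        }

        /* CorePac_#2: poll the flag in its own (local) L2 copy. */
        static inline void wait_for_block(uint32_t pp)
        {
            PingPong *src = (PingPong *)MBOX_LOCAL + pp;
            while (src->valid == 0u)
                ;                            /* new block has arrived in local L2 */
        }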

    Thanks,

    Viktor 

     

  • Hi Viktor,

    Yes, this should work. For the core-to-core signaling (no data) you can use the HW IPC registers.
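
    Something along these lines; the register addresses are taken from the C6678 data manual (please verify them), and routing the IPC event to a CPU interrupt on the receiving core is not shown:

        /* Core-to-core interrupt via the chip-level IPC registers (IPCGRx/IPCARx). */

        #include <stdint.h>

        #define IPCGR_BASE  0x02620240u      /* IPC Generation Register, core 0 */
        #define IPCAR_BASE  0x02620280u      /* IPC Acknowledge Register, core 0 */

        /* Raise an IPC interrupt on 'core' and set source flag 'src' (0..27). */
        static inline void ipc_notify(uint32_t core, uint32_t src)
        {
            volatile uint32_t *ipcgr = (volatile uint32_t *)(IPCGR_BASE + 4u * core);
            *ipcgr = (1u << (src + 4u)) | 1u;    /* SRCSx bit plus IPCG bit */
        }

        /* In the receiving core's ISR: read and acknowledge the source flags. */
        static inline uint32_t ipc_ack(uint32_t core)
        {
            volatile uint32_t *ipcar = (volatile uint32_t *)(IPCAR_BASE + 4u * core);
            uint32_t flags = *ipcar >> 4;        /* pending SRCSx flags */
            *ipcar = flags << 4;                 /* write 1 to clear    */
            return flags;
        }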

    Thanks,

    HR

  • Hi HR,

    OK, I think it's worth trying it that way then.

    Any further experiences with other types of inter-processor communication techniques (EDMA3, MultiCore Navigator, etc.)?  

    Especially the EDMA3. It seems promising to me for relieving the DSP cores of copying data from one L2 to another. Furthermore, EDMA3 would kill two birds with one stone: the data movement is handled, and the corresponding notification/IRQ is generated after completion. I think it should also be possible with EDMA3 as long as the global address representation of the respective L2s is used, shouldn't it? But what ultimately counts is latency!
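
    Roughly what I have in mind is sketched below. The register offsets follow the KeyStone EDMA3 user guide; the CC base address, the channel/PaRAM number and the channel-to-PaRAM mapping are assumptions that would have to be set up properly at init time:

        /* Manually triggered EDMA3 transfer from local L2 into the next
           CorePac's L2 using global L2 addresses. Assumes DMA channel 'ch' is
           mapped to PaRAM 'ch' and queue/TC setup was done at init time. The
           completion interrupt (TCC = ch) can be routed to the receiving
           CorePac through the chip interrupt controller. */

        #include <stdint.h>

        #define EDMA_CC1_BASE  0x02720000u   /* TPCC1, CPU/3 TeraNet domain */
        #define PARAM_OFFSET   0x4000u       /* PaRAM entries inside the CC */
        #define ESR_OFFSET     0x1010u       /* Event Set Register (manual trigger) */

        typedef struct {                     /* one PaRAM entry, 8 x 32 bit */
            uint32_t opt, src, a_b_cnt, dst;
            uint32_t src_dst_bidx, link_bcntrld, src_dst_cidx, ccnt;
        } EdmaParam;

        static inline void edma_copy_block(uint32_t ch, uint32_t src,
                                           uint32_t dst, uint32_t bytes)
        {
            volatile EdmaParam *p =
                (volatile EdmaParam *)(EDMA_CC1_BASE + PARAM_OFFSET) + ch;
            volatile uint32_t *esr =
                (volatile uint32_t *)(EDMA_CC1_BASE + ESR_OFFSET);

            p->opt          = (ch << 12) |   /* TCC = ch                      */
                              (1u << 20) |   /* TCINTEN: completion interrupt */
                              (1u << 3)  |   /* STATIC                        */
                              (1u << 2);     /* AB-synchronized               */
            p->src          = src;           /* e.g. 0x10800000 + offset      */
            p->dst          = dst;           /* e.g. 0x11800000 + offset      */
            p->a_b_cnt      = (1u << 16) | bytes;   /* BCNT = 1, ACNT = bytes */
            p->src_dst_bidx = 0;
            p->link_bcntrld = 0xFFFFu;       /* NULL link                     */
            p->src_dst_cidx = 0;
            p->ccnt         = 1;

            *esr = (1u << ch);               /* trigger; the CPU is done here */
        }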

    Then there is the Multicore Navigator! Hmm, I'm not sure whether that is oversized for this purpose.

    Well, I think that should be enough for me for this year. :-)

    Thanks so far HR and happy new year!

    regards,

    viktor

     

  • Hi Viktor,

    What data do you want to pass between the cores? Is it the 4-256 bytes? Yes, the EDMA is a very good option as it works in parallel to the CPU. The Navigator can also work; the initialization may be long, but after that it will work fine.

    Good luck and Happy New Year!

    HR

  • Hi HR,

    actually there will be two different communication paths.

    • The first type is the unidirectional continuous data path (hard real-time) on the data plane, starting from the signal-generation CorePac, running through both pipelined CorePacs and ending in the signal-analysis CorePac. These data interfaces will carry 4-256 bytes.
    • The second type is a bidirectional data path (firm real-time) between the data plane and the co-processing CorePacs. Each of the two pipeline CorePacs has its own co-processor. These data interfaces will typically carry 64-512 bytes in each direction.

    Long initialization during application start-up wouldn't be a problem at all, as long as the latencies introduced at run-time are short. By latency I mean how many instruction cycles are needed to trigger the data transfer (with EDMA or Navigator) before the initiating CorePac can continue processing, and how many instruction cycles pass until the consuming CorePac receives the first data or signaling.
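
    If I have to measure this myself, I would probably just bracket the trigger code with the free-running time-stamp counter of the C66x core (TSCL from c6x.h); trigger_transfer() in the sketch below is only a placeholder:

        /* Cycle-count measurement of the trigger overhead on the initiating CorePac. */

        #include <stdio.h>
        #include <stdint.h>
        #include <c6x.h>                     /* TSCL time-stamp counter register */

        static void trigger_transfer(void)   /* placeholder: ESR or IPCGR write, ... */
        {
        }

        int main(void)
        {
            uint32_t t0, t1;

            TSCL = 0;                        /* any write enables the counter */

            t0 = TSCL;
            trigger_transfer();              /* code under test */
            t1 = TSCL;

            printf("trigger overhead: %u cycles\n", (unsigned)(t1 - t0));
            return 0;
        }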

    Aren't there any benchmarks available from TI that cover this topic and demonstrate the performance of each hardware module? Things like data throughput, latencies, overhead, etc. These are important criteria for selecting an appropriate mechanism for building efficient applications.

    Regards,

    Viktor

         

     

     

  • Hi Viktor,

    Have you checked sprabk5a.pdf - Throughput Performance Guide for C66x KeyStone Devices?

    BR,

    HR

  • Hi HR, 

    Yes, I've already found that document (sprabk5a.pdf) and I'm currently going through it to find further information. Maybe it helps to clarify some things.

    Also, interesting information is given in the data manual (Section 4.2) and in this thread: "How do the data pass through the internal ports and buses in C6670?"

    If I got this information right, then I think it would be wiser NOT to write the data directly from CorePac_0's L2 into CorePac_1's L2 with a CorePac acting as the master (CPU writes via its MDMA port). This would result in quite a long communication path, namely: "CorePac0 MDMA -> MSMC CorePac0 Slave Port -> MSMC System Master Port -> TeraNet CPU/2 -> Bridges -> TeraNet CPU/3 -> CorePac1 SDMA -> CorePac1 L2". In my opinion, this would introduce additional latency for moving the data from one TeraNet domain (CPU/2) to another (CPU/3) through the bridges. See figure:

    I think it would be better to stay within the CPU/3 TeraNet domain and use EDMA CC1 or EDMA CC2 (but not EDMA CC0, since it sits in the CPU/2 TeraNet domain and has to go through the bridges to reach a CorePac's SDMA port) to handle the data exchange between the CorePacs' L2 RAM, because this results in a shorter communication path: "CorePac0 L2 -> CorePac0 SDMA -> TeraNet CPU/3 -> CorePac1 SDMA -> CorePac1 L2". However, in the case of simultaneous data transfers there will be conflicts when accessing a CorePac's SDMA interface, since the SDMA IF exists only once per CorePac! Here, additional latency is introduced under contention, but not from passing the data through the bridges as in the case above. See figure:

    The question is which approach results in more latency?

    I've gone roughly through the document (sprabk5a.pdf), but for the EDMA3 only throughput measurements in MB/s are given. No latency figures, as are provided for the DDR3 RAM. That's a pity!

    Well, maybe a TI employee can go into more detail and provide some additional information on this topic.

    Thanks HR and kind regards,

    Viktor