Compiler/TMS320C6678: Inter-Processor Communication and Latency

Part Number: TMS320C6678
Other Parts Discussed in Thread: 4460, 66AK2G12

Tool/software: TI C/C++ Compiler

Hi there,

I'm interested in implementing a parallel master-slave model in which core 0 works as the master and all other cores execute a job once the master notifies them. I went through several recommended approaches and measured the latency for sending a request to all slaves and receiving the acknowledgement from all of them (without a job). The latencies in cycles for the following three approaches are:

Slaves          1      2      3      4      5      6      7
Notify       4469   5918   6835   7754   9421  13138  14634
MessageQ    13610  20000  28100  36600  45000  57800  63000
Navigator    1556   2155   2590   3487   4460   5435   6380

Note that:

1) I went through the optimization guide for Notify and MessageQ

2) I haven't yet changed the Navigator setup to use monolithic descriptors

The jobs to be divided among the slaves are either 8,000 or 40,000 cycles long before parallelization (this is dictated by the application), so the notification times above are far too long. I am looking for something that takes around 100 cycles.
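To illustrate why these latencies matter for such short jobs, here is a rough sketch in C. It assumes a deliberately simple cost model that is not part of the measurements: the job splits evenly across the slaves, the master does none of the work, and the total time is one slave's share plus one request/acknowledge round trip. The function names are hypothetical; the latency values are the Navigator figures from the measurements above.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SLAVES 7u

/* Round-trip Navigator latencies in cycles for 1..7 slaves (measured above). */
static const uint32_t navigator_latency[MAX_SLAVES] = {
    1556u, 2155u, 2590u, 3487u, 4460u, 5435u, 6380u
};

/* Estimated total cycles when a job of job_cycles is split evenly across
 * `slaves` cores and one round trip of `latency` cycles is added.
 * Assumption: perfect load balance, master does not take a share. */
static uint32_t parallel_cycles(uint32_t job_cycles, uint32_t slaves,
                                uint32_t latency)
{
    assert(slaves > 0u);
    uint32_t share = (job_cycles + slaves - 1u) / slaves; /* ceiling division */
    return share + latency;
}

/* Number of slaves (1..7) that minimizes the estimate for a latency table. */
static uint32_t best_slave_count(uint32_t job_cycles,
                                 const uint32_t latency[MAX_SLAVES])
{
    uint32_t best_n = 1u;
    uint32_t best = parallel_cycles(job_cycles, 1u, latency[0]);
    for (uint32_t n = 2u; n <= MAX_SLAVES; ++n) {
        uint32_t t = parallel_cycles(job_cycles, n, latency[n - 1u]);
        if (t < best) {
            best = t;
            best_n = n;
        }
    }
    return best_n;
}
```

Under this model, the 8,000-cycle job with the Navigator latencies is estimated at 7523 cycles on seven slaves and does best on three slaves, so most of the potential speed-up is consumed by the notification itself.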

- Are the timings around what you would expect?

- Is it possible to get this job done with a latency of around 100 cycles?

- Do you expect using the interrupt controller in combination with e.g. semaphores to be faster?

- Would you know of an example which configures the interrupt controller to manually trigger interrupts on other cores?

Thank you very much for your answer.

(

My related posts:

https://e2e.ti.com/support/processors/f/791/t/815698
http://e2e.ti.com/support/processors/f/791/p/813105/3010666#3010666

)

  • Hi Idris,

    I'm checking internally to see if we've collected IPC benchmarking data in the past that we can compare to. 

    Are the latencies you posted round-trip or one-way numbers?

    Have you tried using messageQ transport ti.sdo.ipc.transports.TransportShmNotify? I had done some benchmarking on AM57xx in the past and this transport gave the best performance. 

    This wiki on configuring interrupts may be of some help: 

    http://processors.wiki.ti.com/index.php/Configuring_Interrupts_on_Keystone_Devices

    If your related threads have to do with this issue, it is best if we close those out and continue the discussion here to make it easier for us to track. 

     

  • Hi Sahin,

    Yes, please feel free to close the other threads, which I opened one by one while working through the different approaches. Thank you very much for the link.

    These latencies are for a round-trip: core 0 notifies all other 7 cores, all 7 other cores send a message back to acknowledge the notification.

    Here is some additional information:

    - Notify uses:
    Notify.SetupProxy = xdc.module('ti.sdo.ipc.family.c647x.NotifyCircSetup');

    - MessageQ uses:
    MessageQ.SetupTransportProxy = xdc.module('ti.sdo.ipc.transports.TransportShmSetup');
    Notify.SetupProxy = xdc.module('ti.sdo.ipc.family.c647x.NotifyCircSetup');

    - Both approaches use the BIOS.LibType_Custom

    I'm a bit puzzled about the fact that it takes so long to notify other cores on an octa-core. It would therefore be great if you could approximately confirm the above latencies or hopefully tell me that I'm doing something wrong and slowing down the whole process.

    Thanks again for your help.

  • Hi Idris,

    Unfortunately we don't have any IPC benchmark numbers for C6678 that we can compare to. However, I had done some IPC benchmarking in the past on 66AK2G12, which is an ARM + DSP (C66x) device, and I got around 20 microseconds round-trip for MessageQ without optimization. I don't expect it to be drastically different for DSP <-> DSP.

    Some additional things you can try:

    • Place the SharedRegion in MSMCRAM instead of DDR.
    • Try disabling the SharedRegion cache by setting "cacheEnable: false". This should save some cycles from cache operations. However, if the entire message is being touched, this will degrade performance.
    • Make sure cache is enabled for code/data sections and try placing them in L2/MSMC.

    Regards,
    Sahin

  • Also, you may want to look into writing to the IPCGR registers directly instead of using the IPC module. 

    Please see section 3.3.13 of the C6678 datasheet and section 2.4 of Chip Interrupt Controller (CIC) for KeyStone Devices User's Guide.

    The C6678 has eight IPCGRx registers (IPCGR0 through IPCGR7). These registers can be used to generate interrupts to other cores. A write of 1 to IPCG field of IPCGRx register will generate an interrupt pulse to coreX (0 <= X <= 7). This method would provide the best performance. 
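
    A minimal sketch of the direct register write described above, in C. The helper names are hypothetical, and the bit positions (IPCG in bit 0, source bits starting at bit 4) are assumptions to be checked against the datasheet. The register base is passed in as a pointer, so the same code can be tried against a plain array on a host before it is pointed at the real IPCGR0 address. Any register unlock step the device may require is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define IPCGR_COUNT      8u   /* IPCGR0..IPCGR7, one register per core */
#define IPCGR_IPCG_BIT   0x1u /* assumed: IPCG field is bit 0 */
#define IPCGR_SRCS_SHIFT 4u   /* assumed: source ID bits start at bit 4 */
#define IPCGR_SRCS_COUNT 28u  /* assumed: number of source ID bits */

/* Raise an IPC interrupt on `core`, tagging it with source bit `src`.
 * `ipcgr` points at IPCGR0 (or at a test array of 8 words on a host). */
static void ipc_raise(volatile uint32_t *ipcgr, uint32_t core, uint32_t src)
{
    assert(core < IPCGR_COUNT);
    assert(src < IPCGR_SRCS_COUNT);
    ipcgr[core] = (1u << (IPCGR_SRCS_SHIFT + src)) | IPCGR_IPCG_BIT;
}

/* Master on core 0: raise the interrupt on cores 1..7. */
static void ipc_raise_all_slaves(volatile uint32_t *ipcgr, uint32_t src)
{
    for (uint32_t core = 1u; core < IPCGR_COUNT; ++core) {
        ipc_raise(ipcgr, core, src);
    }
}
```

    On the receiving core, the interrupt handler would typically check and clear the matching source bit through the corresponding acknowledgement register; that part is omitted here.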

    Please see this thread for some additional guidance on this: https://e2e.ti.com/support/legacy_forums/embedded/tirtos/f/355/t/167364

  • Hi Sahin,

    Thank you very much for your answers and your ideas. I will certainly have a look at the method with the IPC registers. Regarding the MessageQ latency: 20 µs would correspond to 20k cycles, which is a stunning performance without optimisation compared to my implementation, which used all available optimisation techniques. I will verify whether I correctly set up the things you mentioned.

    In the meantime I dug a bit deeper into the problem and followed a very simple but effective approach that, I believe, some of your libraries use as well:

    I created a non-cacheable section in memory (using a virtual address range remapped to the shared memory via MPAX) and placed request and acknowledgement flags in it. The slaves then wait in a blocking read on their request flags. Because the flags are placed in a non-cacheable section, there is no need to invalidate or write back the cache. With this approach I was able to bring the latency for seven slaves down from about 60k cycles (MessageQ) to 450 cycles:

    Slaves                                              1      2      3      4      5      6      7
    Notify                                           4469   5918   6835   7754   9421  13138  14634
    MessageQ                                        13610  20000  28100  36600  45000  57800  63000
    Navigator                                        1556   2155   2590   3487   4460   5435   6380
    Flags in SHRAM with cache                         528    790   1187   1581   1980   2379   2762
    Flags in DDR3 non-cacheable section               305    405    540    680    850   1000   1150
    Flags in virtual SHRAM non-cacheable section      110    150    190    240    275    320    450
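
    The flag handshake described above can be sketched roughly as follows in C. This is an illustration under assumptions, not the exact code that was measured: the struct layout and function names are hypothetical, the placement of the struct in the non-cacheable MPAX window is not shown, and it is assumed that writes to that region become visible to the other cores in program order.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SLAVES 7u

/* One request and one acknowledgement flag per slave. On the DSP this
 * object would live in the non-cacheable shared region (placement not
 * shown); volatile keeps the compiler from caching the reads. */
typedef struct {
    volatile uint32_t request[NUM_SLAVES];
    volatile uint32_t ack[NUM_SLAVES];
} ipc_flags_t;

/* Master: clear the ack flags and raise the request flags of n slaves. */
static void master_post(ipc_flags_t *f, uint32_t n)
{
    assert(n <= NUM_SLAVES);
    for (uint32_t i = 0u; i < n; ++i) {
        f->ack[i] = 0u;
        f->request[i] = 1u;
    }
}

/* Master: spin until the first n slaves have acknowledged. */
static void master_wait(const ipc_flags_t *f, uint32_t n)
{
    assert(n <= NUM_SLAVES);
    for (uint32_t i = 0u; i < n; ++i) {
        while (f->ack[i] == 0u) {
            /* busy wait */
        }
    }
}

/* Slave: non-blocking check. Returns 1 if a request was pending and has
 * been acknowledged, 0 otherwise. The job itself would run before the ack. */
static int slave_poll(ipc_flags_t *f, uint32_t id)
{
    assert(id < NUM_SLAVES);
    if (f->request[id] == 0u) {
        return 0;
    }
    f->request[id] = 0u;
    f->ack[id] = 1u;
    return 1;
}

/* Slave: the blocking read, i.e. an empty loop around the poll. */
static void slave_wait_and_ack(ipc_flags_t *f, uint32_t id)
{
    while (!slave_poll(f, id)) {
        /* busy wait */
    }
}
```

    On the device, core 0 would call master_post followed by master_wait, while each slave core sits in slave_wait_and_ack with its own index.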

    Before I dig into your IPC register approach it would be nice to know whether you expect the interrupt to be faster than the blocking read (which is just an empty while loop)? I'm asking because a colleague of mine told me that interrupts usually come with some non-negligible overhead. Is that correct? It can be assumed that the slaves are in some idle mode before the interrupt comes in.

    Thank you very much for your answer.

  • Hi Idris,

    I'm sorry for the delayed response. 

    Your approach looks solid. As long as your application allows for it, I agree that spin waiting on a flag would be faster than using interrupts.

  • Thanks for confirming!