Compiler/TMS320C6678: Inter-Processor Communication and Latency

Idris Kempf

Part Number: TMS320C6678
Other Parts Discussed in Thread: 4460, 66AK2G12

Tool/software: TI C/C++ Compiler

Hi there,

I'm interested in implementing a parallel master-slave model in which core 0 acts as the master and all other cores execute a job once the master has notified them. I went through several recommended approaches and measured the latency for sending a request to all slaves and receiving the acknowledgement from all slaves (without running a job). The latencies, in cycles, for the following three approaches are:

	1	2	3	4	5	6	7
Notify	4469	5918	6835	7754	9421	13138	14634
MessageQ	13610	20000	28100	36600	45000	57800	63000
Navigator	1556	2155	2590	3487	4460	5435	6380

Note that:

1) I followed the optimization guide for Notify and MessageQ

2) I haven't amended the navigator to use monolithic descriptors

The jobs to be divided among the slaves take either 8000 or 40,000 cycles (before parallelization; this is dictated by the application), so the notification times above are far too long. I am looking for something that takes around 100 cycles.
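To make the "far too long" argument concrete, here is a minimal sketch of the arithmetic. It assumes a simplified model that is not stated in the post: the job splits evenly across the slaves and the round-trip notification latency is added once on top. The function names are hypothetical.

```c
#include <stdint.h>

/* Estimated cycles for one parallel job under the simplified model:
 * each slave processes an equal share (rounded up), and the measured
 * round-trip synchronization latency is paid once per job. */
static uint32_t parallel_cycles(uint32_t job_cycles, uint32_t slaves,
                                uint32_t sync_cycles)
{
    uint32_t share = (job_cycles + slaves - 1u) / slaves; /* ceiling division */
    return share + sync_cycles;
}

/* Returns 1 if the parallel version beats running the job on one core. */
static int worth_parallelizing(uint32_t job_cycles, uint32_t slaves,
                               uint32_t sync_cycles)
{
    return parallel_cycles(job_cycles, slaves, sync_cycles) < job_cycles;
}
```

Under this model, the 8000-cycle job on 7 slaves with the Notify round trip of 14634 cycles comes to 15777 cycles, which is slower than running it serially. With a 100-cycle round trip the same job would come to 1243 cycles.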

- Are the timings around what you would expect?

- Is it possible to get this job done with a latency of around 100 cycles?

- Do you expect using the interrupt controller in combination with e.g. semaphores to be faster?

- Would you know of an example which configures the interrupt controller to manually trigger interrupts on other cores?

Thank you very much for your answer.

My related posts:

https://e2e.ti.com/support/processors/f/791/t/815698
http://e2e.ti.com/support/processors/f/791/p/813105/3010666#3010666

over 6 years ago

0 Sahin Okur over 6 years ago

TI__Mastermind 27355 points

Hi Idris,

I'm checking internally to see if we've collected IPC benchmarking data in the past that we can compare to.

Are the latencies you posted round-trip or one-way numbers?

Have you tried using messageQ transport ti.sdo.ipc.transports.TransportShmNotify? I had done some benchmarking on AM57xx in the past and this transport gave the best performance.

This wiki on configuring interrupts may be of some help:

http://processors.wiki.ti.com/index.php/Configuring_Interrupts_on_Keystone_Devices

If your related threads have to do with this issue, it is best if we close those out and continue the discussion here to make it easier for us to track.

0 Idris Kempf over 6 years ago in reply to Sahin Okur

Intellectual 685 points

Hi Sahin,

Yes, please feel free to close the other threads, which I opened one by one while working through the different approaches. Thank you very much for the link.

These latencies are for a round-trip: core 0 notifies all other 7 cores, all 7 other cores send a message back to acknowledge the notification.

Here is some additional information:

- Notify uses:
Notify.SetupProxy = xdc.module('ti.sdo.ipc.family.c647x.NotifyCircSetup');

- MessageQ uses:
MessageQ.SetupTransportProxy = xdc.module('ti.sdo.ipc.transports.TransportShmSetup');
Notify.SetupProxy = xdc.module('ti.sdo.ipc.family.c647x.NotifyCircSetup');

- Both approaches use the BIOS.LibType_Custom

I'm a bit puzzled about the fact that it takes so long to notify other cores on an octa-core. It would therefore be great if you could approximately confirm the above latencies or hopefully tell me that I'm doing something wrong and slowing down the whole process.

Thanks again for your help.

0 Sahin Okur over 6 years ago in reply to Idris Kempf

TI__Mastermind 27355 points

Hi Idris,

Unfortunately, we don't have any IPC benchmark numbers for C6678 that we can compare to. However, I did some IPC benchmarking in the past on 66AK2G12, which is an ARM + DSP (C66x) device, and I got around 20 microseconds round-trip for MessageQ without optimization. I don't expect it to be drastically different for DSP <-> DSP.

Some additional things you can try:

- Place the SharedRegion in MSMCRAM instead of DDR.
- Try disabling the SharedRegion cache by setting "cacheEnable: false". This should save some cycles from cache operations. However, if the entire message is being touched, this will degrade performance.
- Make sure cache is enabled for code/data sections and try placing them in L2/MSMC.

Regards,
Sahin

0 Sahin Okur over 6 years ago in reply to Sahin Okur

TI__Mastermind 27355 points

Also, you may want to look into writing to the IPCGR registers directly instead of using the IPC module.

Please see section 3.3.13 of the C6678 datasheet and section 2.4 of Chip Interrupt Controller (CIC) for KeyStone Devices User's Guide.

The C6678 has eight IPCGRx registers (IPCGR0 through IPCGR7). These registers can be used to generate interrupts to other cores. Writing 1 to the IPCG field of the IPCGRx register generates an interrupt pulse to core X (0 <= X <= 7). This method would provide the best performance.

Please see this thread for some additional guidance on this: https://e2e.ti.com/support/legacy_forums/embedded/tirtos/f/355/t/167364
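A minimal sketch of the direct register write described above. The register base address and the position of the source-identification bits are assumptions here and should be checked against section 3.3.13 of the C6678 datasheet. The helper name `ipc_raise` is hypothetical, and the register block is passed in as a pointer so the routine can also be exercised against a plain array on a host.

```c
#include <stdint.h>

/* ASSUMPTION: address of IPCGR0 on the C6678; verify in the datasheet. */
#define IPCGR_BASE_ADDR  0x02620240u
#define IPC_NUM_CORES    8u
#define IPCG_BIT         0x1u   /* writing 1 pulses the interrupt */
/* ASSUMPTION: source-identification bits start at bit 4 (28 of them). */
#define IPC_SRC_SHIFT    4u
#define IPC_NUM_SOURCES  28u

/* Raise an inter-core interrupt on `core`, tagging it with `src_id` so the
 * receiver can tell who signalled. `ipcgr` points at IPCGR0.
 * Returns 0 on success, -1 if an argument is out of range. */
static int ipc_raise(volatile uint32_t *ipcgr, unsigned core, unsigned src_id)
{
    if (core >= IPC_NUM_CORES || src_id >= IPC_NUM_SOURCES) {
        return -1;
    }
    ipcgr[core] = (1u << (IPC_SRC_SHIFT + src_id)) | IPCG_BIT;
    return 0;
}
```

On the device, the call might look like `ipc_raise((volatile uint32_t *)IPCGR_BASE_ADDR, 3, 0)` to interrupt core 3. The sketch does not cover routing the IPC event to a CPU interrupt on the receiving core, acknowledging the source bits, or any register unlocking the device may require.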

0 Idris Kempf over 6 years ago in reply to Sahin Okur

Intellectual 685 points

Hi Sahin,

Thank you very much for your answers and your ideas. I will certainly have a look at the method with the IPC registers. Regarding the MessageQ latency: 20 µs would correspond to 20k cycles, which is a stunning performance without optimisation compared to my implementation, which used all available optimisation techniques. I will verify whether I set up the things you mentioned correctly.
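The conversion between time and cycles used above can be sketched as follows. The 1000 MHz core clock is an assumption implied by equating 20 µs with 20k cycles; it is not stated explicitly in the thread. The helper names are hypothetical.

```c
#include <stdint.h>

/* Convert a duration in microseconds to CPU cycles for a given core clock.
 * One microsecond at N MHz is exactly N cycles. */
static uint64_t us_to_cycles(uint64_t microseconds, uint64_t cpu_mhz)
{
    return microseconds * cpu_mhz;
}

/* Convert a cycle count back to nanoseconds (integer division truncates).
 * cpu_mhz must be non-zero. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t cpu_mhz)
{
    return (cycles * 1000u) / cpu_mhz;
}
```

At 1000 MHz, `us_to_cycles(20, 1000)` gives 20000 cycles; at a different clock, the same 20 µs maps to a proportionally different cycle count.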

In the meantime, I dug a bit deeper into the problem and followed a very simple but effective approach, which I believe some of your libraries use as well:

I created a non-cacheable section in memory (using a virtual address range remapped to the shared memory via MPAX) and placed request and acknowledgement flags in it. The slaves then wait in a blocking read on their request flags. Because the flags are placed in a non-cacheable section, there is no need to invalidate or write back the cache. With this approach I was able to bring the latency down from 60k cycles (MessageQ) to 450 cycles:

	1	2	3	4	5	6	7
Notify	4469	5918	6835	7754	9421	13138	14634
MessageQ	13610	20000	28100	36600	45000	57800	63000
Navigator	1556	2155	2590	3487	4460	5435	6380
Flags in SHRAM w Cache	528	790	1187	1581	1980	2379	2762
Flags in DDR3 non-cacheable section	305	405	540	680	850	1000	1150
Flags in virtual SHRAM non-cacheable section	110	150	190	240	275	320	450
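The flag handshake described above can be sketched roughly as follows. This is an illustration under assumptions, not the code that was measured: the struct layout, the function names, and the use of a sequence number instead of a boolean flag are all hypothetical, and the placement of the struct in the non-cacheable MPAX-mapped region (for example via a linker section) is not shown.

```c
#include <stdint.h>

#define NUM_SLAVES 7u

/* One request and one acknowledgement word per slave. On the target this
 * struct is assumed to live in the non-cacheable shared-memory alias, so
 * plain volatile accesses suffice and no cache maintenance is needed. */
typedef struct {
    volatile uint32_t request[NUM_SLAVES];
    volatile uint32_t ack[NUM_SLAVES];
} sync_flags_t;

/* Master: publish a new sequence number to every slave. */
static void master_post(sync_flags_t *f, uint32_t seq)
{
    for (uint32_t i = 0; i < NUM_SLAVES; i++) {
        f->request[i] = seq;
    }
}

/* Master: non-blocking check whether every slave has acknowledged seq. */
static int master_all_acked(const sync_flags_t *f, uint32_t seq)
{
    for (uint32_t i = 0; i < NUM_SLAVES; i++) {
        if (f->ack[i] != seq) {
            return 0;
        }
    }
    return 1;
}

/* Master: spin until all acknowledgements have arrived. */
static void master_wait_all(const sync_flags_t *f, uint32_t seq)
{
    while (!master_all_acked(f, seq)) {
        /* empty busy-wait */
    }
}

/* Slave: spin on its own request word until it differs from the last
 * sequence number it handled, then return the new one. */
static uint32_t slave_wait(const sync_flags_t *f, uint32_t id, uint32_t last_seq)
{
    uint32_t seq;
    while ((seq = f->request[id]) == last_seq) {
        /* empty busy-wait */
    }
    return seq;
}

/* Slave: acknowledge the request (after running the job, if any). */
static void slave_ack(sync_flags_t *f, uint32_t id, uint32_t seq)
{
    f->ack[id] = seq;
}
```

Using a sequence number means neither side has to reset a flag between rounds. If this were ported to host threads or to cacheable memory, `volatile` alone would not be enough and proper atomics or cache operations would be needed.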

Before I dig into your IPC register approach it would be nice to know whether you expect the interrupt to be faster than the blocking read (which is just an empty while loop)? I'm asking because a colleague of mine told me that interrupts usually come with some non-negligible overhead. Is that correct? It can be assumed that the slaves are in some idle mode before the interrupt comes in.

Thank you very much for your answer.

0 Sahin Okur over 6 years ago in reply to Idris Kempf

TI__Mastermind 27355 points

Hi Idris,

I'm sorry for the delayed response.

Your approach looks solid. As long as your application allows for it, I agree that spin waiting on a flag would be faster than using interrupts.

0 Idris Kempf over 6 years ago in reply to Sahin Okur

Intellectual 685 points

Thanks for confirming!
