Notify benchmark on C6678: performance issue

Other Parts Discussed in Thread: SYSBIOS

Hello all,

We are working on a benchmark of Notify performance for inter-processor communication (with SYS/BIOS on the C6678 DSP).
We started from the example in the MCSDK and adapted it for parallel work.

Core 0 is the master: it sends a Notify event to all slaves (cores 1 to 7) so that they work in parallel, and when they are done they each send an event back to the master.
The work is simply adding values from two separate tables into a third one. All data is in L3.

The project is built with -O3 optimization and no debug. Cycles are measured using TSCL.
The application works as expected in all cases. The problem is the performance.
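
For reference, the measurement is done roughly like this (simplified sketch, not our exact code; it assumes the TSCL register definition from the TI C6000 compiler's c6x.h):

#include <c6x.h>

unsigned int benchmark(void)
{
    unsigned int start, stop;

    TSCL = 0;            /* any write starts the free-running time stamp counter */
    start = TSCL;

    /* ... code under test: the Notify send / work / receive sequence ... */

    stop = TSCL;
    return stop - start; /* elapsed cycles (low 32 bits of the counter) */
}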

Here are the results:

Total number of cores used | Notify cost
2                          | 9288 cycles
4                          | 119423 cycles
8                          | 352984 cycles

For 2 cores this is good, but for 4 and 8 cores it is extremely high.

We are not saying that Notify performance is terrible; maybe the way we built the test does not scale to 4 or 8 cores and there is a better way to do it.
We hope someone can help us identify what is wrong with the test.

Here is the pseudo-code:

nb_cores = 8;

/* Callback registered with Notify_registerEvent(); it runs when a Notify
   event arrives and releases the task waiting on the semaphore */
Void cbFxn(UInt16 procId, UInt16 lineId, UInt32 eventId, UArg arg, UInt32 payload)
{
    Semaphore_post(semHandle);
}

Void tsk0_func()
{
    if (MultiProc_self() == 0) {
        /* Master code */

        /* Send a Notify event to every slave core */
        for (p = 1; p < nb_cores; p++) {
            status = Notify_sendEvent(p, INTERRUPT_LINE, EVENTID, NULL, TRUE);
            if (status < 0) {
                System_abort("sendEvent failed\n");
            }
        }

        /* some work */

        /* Wait to be released by cbFxn posting the semaphore, once per slave */
        for (p = 1; p < nb_cores; p++) {
            Semaphore_pend(semHandle, BIOS_WAIT_FOREVER);
        }
    }
    else {
        /* Slave code */

        /* Wait forever on the semaphore; it is posted in the callback */
        Semaphore_pend(semHandle, BIOS_WAIT_FOREVER);

        /* some work */

        status = Notify_sendEvent(0, INTERRUPT_LINE, EVENTID, NULL, TRUE);
        if (status < 0) {
            System_abort("sendEvent failed slave\n");
        }
    }

    BIOS_exit(0);
}

Int main()
{
    status = Ipc_start();
    if (status < 0) {
        System_abort("Ipc_start failed\n");
    }

    if (MultiProc_self() == 0) {
        /* Master: will receive an event from each of cores 1 to 7 */
        for (core_sending_event = 1; core_sending_event < nb_cores; core_sending_event++) {
            status = Notify_registerEvent(core_sending_event, INTERRUPT_LINE, EVENTID,
                                          (Notify_FnNotifyCbck)cbFxn, NULL);
            if (status < 0) {
                System_abort("Notify_registerEvent failed\n");
            }
        }
    }
    else {
        /* Each slave will receive an event from the master (core 0) */
        status = Notify_registerEvent(0, INTERRUPT_LINE, EVENTID,
                                      (Notify_FnNotifyCbck)cbFxn, NULL);
        if (status < 0) {
            System_abort("Notify_registerEvent failed\n");
        }
    }

    BIOS_start();

    return (0);
}

Thank you

  • Clement:

    I'm looking into this now and will try to get back to you tonight or tomorrow.

    -John

  • Thank you for your time, John. I look forward to hearing your thoughts on this.

    CM

  • Clement:

    Firstly, we do have an optimized Notify driver which should improve performance.  Try adding the following to your .cfg file.

    /* more optimized Notify driver */

    var Notify = xdc.module('ti.sdo.ipc.Notify');

    Notify.SetupProxy = xdc.module('ti.sdo.ipc.family.c647x.NotifyCircSetup');

     

    Additionally, I think performance would be improved by using Events instead of semaphores. I will investigate that this afternoon. Could you tell me specifically which code within the MCSDK you are basing yours on?
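
    Roughly, the Event-based version could look like the sketch below (untested on my part; evtHandle and waitForAllSlaves are placeholder names, and the Event object would be created with Event_create() before use):

    #include <xdc/std.h>
    #include <ti/sysbios/BIOS.h>
    #include <ti/sysbios/knl/Event.h>
    #include <ti/ipc/Notify.h>

    Event_Handle evtHandle;    /* created once on the master with Event_create() */

    /* Notify callback on the master: post one event bit per sending core
       (core p maps to bit p-1, so cores 1..7 map to Event_Id_00..Event_Id_06) */
    Void cbFxn(UInt16 procId, UInt16 lineId, UInt32 eventId, UArg arg, UInt32 payload)
    {
        Event_post(evtHandle, 1 << (procId - 1));
    }

    /* Master side: a single pend blocks until every slave has reported back */
    Void waitForAllSlaves(UInt nb_cores)
    {
        UInt allSlaves = (1 << (nb_cores - 1)) - 1;   /* 0x7F for 7 slaves */
        Event_pend(evtHandle, allSlaves, Event_Id_NONE, BIOS_WAIT_FOREVER);
    }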

     

    John

  • Hello John,

    I work with Clement on this.

    We'll look at the optimized Notify driver to see if it improves performance in our case.

    We started from the example that is available when you create a new project:

    Sysbios -> IPC and I/O examples -> C6678 -> Notify single processor

    Thank you

    Nadim and Clément

    EDIT: our setup:

    IPC 1.24.3.32
    Sysbios 6.33.6.50
    CCS 5.1.02012061800

  • John,

    So far, the first results show better performance with the .cfg modification you suggested.
    We still need to benchmark it a little further before drawing definitive conclusions.

    We are curious about the line "Notify.SetupProxy = xdc.module('ti.sdo.ipc.family.c647x.NotifyCircSetup');".
    What does it do exactly?

    Thanks,

    Clement

  • The change in the configuration file causes Notify to use a circular buffer in shared memory to store notifications.

    I do believe better performance can be achieved by assigning a different semaphore to each slave, or by using Events instead. I was attempting to use these last week but ran into an error that I need to sort out. I will look into this and let you know if I find anything.
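
    For the per-slave semaphore idea, the master side would look something like this (again an untested sketch; semSlave[] is an illustrative array of binary semaphores created with Semaphore_create()):

    #include <xdc/std.h>
    #include <ti/sysbios/BIOS.h>
    #include <ti/sysbios/knl/Semaphore.h>

    #define NB_CORES 8

    Semaphore_Handle semSlave[NB_CORES - 1];   /* one semaphore per slave core */

    /* Notify callback on the master: release the semaphore belonging to the sender */
    Void cbFxn(UInt16 procId, UInt16 lineId, UInt32 eventId, UArg arg, UInt32 payload)
    {
        Semaphore_post(semSlave[procId - 1]);
    }

    /* Master: wait for each slave on its own dedicated semaphore */
    Void waitForAllSlaves(Void)
    {
        UInt16 p;
        for (p = 1; p < NB_CORES; p++) {
            Semaphore_pend(semSlave[p - 1], BIOS_WAIT_FOREVER);
        }
    }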

    -John

  • John,

    We'll look into using different semaphores if we have time for that.

    For the circular buffer tip, we found relevant information here: http://processors.wiki.ti.com/index.php/SysLink_Notify_Drivers_and_Transports and in the inter-processor communication user guide.

    Two new questions:

    a) In the .cfg file there is a shared region that takes half of the MSMC RAM (leaving only 2 MB for our data):

    /* Shared Memory base address and length */
    var SHAREDMEM           = 0x0C000000;
    var SHAREDMEMSIZE       = 0x00200000;

    var SharedRegion = xdc.useModule('ti.sdo.ipc.SharedRegion');
    SharedRegion.setEntryMeta(0,
        { base:        SHAREDMEM,
          len:         SHAREDMEMSIZE,
          ownerProcId: 0,
          isValid:     true,
          name:        "DDR2_RAM", });

    It's not clear to us how this region is used and what its optimal minimum size is.
    We noticed that if we reduce it too much (or remove it completely), Ipc_start() fails.

    b) In a use case with 2 cores (or 4 cores, but not all 8), would using Ipc_attach() to link only the relevant cores lead to better performance?

    Thank you,
    Clement

  • Clement,

    a) SharedRegion is for IPC communication between the cores. The size is really application-dependent, but at some point there is a minimum needed to support all the IPC drivers. The more cores, the bigger the size you need. Also, if you are sending many or large messages, then you need more memory here. There is no "optimal" size; it just depends on how you are using IPC.

    b) Using Ipc_attach() as opposed to syncing all cores with Ipc_start() isn't going to get you better performance (it might be insignificantly better). It will get you a smaller footprint, since some of the IPC drivers wouldn't be created for some of the connections.
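
    If you do want to try it, the flow looks roughly like this (sketch only; it assumes Ipc.procSync is set to Ipc.ProcSync_PAIR in the .cfg so that Ipc_start() no longer synchronizes with every core, and remoteProcId is a core you actually talk to):

    #include <xdc/std.h>
    #include <xdc/runtime/System.h>
    #include <ti/ipc/Ipc.h>

    Void attachTo(UInt16 remoteProcId)
    {
        Int status;

        /* Still sets up SharedRegion 0, but does not attach to the other cores */
        status = Ipc_start();
        if (status < 0) {
            System_abort("Ipc_start failed\n");
        }

        /* Attach only to the core we communicate with; retry until the remote
           side has reached its own Ipc_attach() call */
        do {
            status = Ipc_attach(remoteProcId);
        } while (status < 0);
    }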

    Judah

  • Judah,

    Thank you for the clarifications about IPC; I'm not surprised by your answers. I have a few more questions regarding SharedRegion.

    a) Is it possible to put the SharedRegion in DDR3? Would the performance drop a lot?
    b) Is there a way to know how much of the SharedRegion is used in a given application (to fine-tune its size)? We looked at the memory map but found no clues about it.

    Regarding Notify performance, we used global variables instead of semaphores and it worked well, with much better performance.
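
    Roughly, what we did looks like this (simplified sketch, not our exact code; the flags live in the master core's local memory, so only the master's callback and task touch them):

    #include <xdc/std.h>

    #define NB_CORES 8

    volatile UInt32 slaveDone[NB_CORES];   /* slaveDone[p] is set when core p reports back */

    /* Notify callback on the master: just flag the sending core as done */
    Void cbFxn(UInt16 procId, UInt16 lineId, UInt32 eventId, UArg arg, UInt32 payload)
    {
        slaveDone[procId] = 1;
    }

    /* Master: busy-wait until every slave has set its flag */
    Void waitForAllSlaves(Void)
    {
        UInt16 p;
        for (p = 1; p < NB_CORES; p++) {
            while (slaveDone[p] == 0) {
                /* spin: avoids semaphore/scheduler overhead at the cost of polling */
            }
            slaveDone[p] = 0;   /* reset for the next run */
        }
    }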

    Thanks,
    Clement

  • Clement

    a)  Yes, you can put SharedRegion in DDR3.  I think if you have the cache enabled for this region, the performance should not drop too much.

    b) Yes, but not statically. If you know about the ROV tools, you can look at the usage of the heap that is created in SharedRegion 0 (I assume you have only one SharedRegion). Basically, a portion of SharedRegion 0 is used for the IPC handshake between processors, and the rest is turned into a HeapMemMP. If you look at that HeapMemMP using ROV, you should get an idea of how big or small your SharedRegion 0 is.
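
    If you prefer to check it at run time instead of through ROV, something like this should also work (sketch; it assumes you only have SharedRegion 0 and that its heap has already been created by Ipc_start()):

    #include <xdc/std.h>
    #include <xdc/runtime/IHeap.h>
    #include <xdc/runtime/Memory.h>
    #include <xdc/runtime/System.h>
    #include <ti/ipc/SharedRegion.h>

    Void printSharedRegionUsage(Void)
    {
        Memory_Stats stats;
        IHeap_Handle heap = (IHeap_Handle)SharedRegion_getHeap(0);  /* heap built from SharedRegion 0 */

        Memory_getStats(heap, &stats);
        System_printf("SR0 heap: total = %d, free = %d, used = %d bytes\n",
                      stats.totalSize, stats.totalFreeSize,
                      stats.totalSize - stats.totalFreeSize);
    }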

    Judah

  • Judah,

    Thank you for your answers; we'll look into that if we have time. And yes, we do have only one SharedRegion.

    So far our latest benchmarks (the gain depends on the data size) show:

    cores | cycle gain
    2     | from 1.84 to 1.96
    4     | from 3.17 to 3.79
    8     | from 3.70 to 5.71

    (data in L3, shared region in L3, Notify with the circular-buffer driver)

    Clement

  • Judah,

    Quick follow-up:

    a) We moved the SharedRegion to DDR3 and saw only a really tiny drop in performance, which is good news for us.

    b) About the ROV tools and how much of the SharedRegion is used:

    Here is what we got with the SharedRegion in DDR3.
    If we're not mistaken, what is used is: totalSize - totalFreeSize,
    so here: used = 0xff380 - 0xf2f80 = 0xC400 = 50176 bytes = 49 KB.

    That seems very low, so we aren't sure we are looking in the right place.
    Can you confirm that our methodology is correct?

    Thank you,
    Clément

  • Clement,

    That's basically correct. Notice that the buf starts at 0x80000c80, which means 0xc80 (~3 KB) was used by IPC for its handshake, so you should count this as part of the total required for the SharedRegion 0 size.

    Judah

  • Judah,

    Thanks for the clarification.
    Now we are looking into using System Analyzer to get an execution graph of our application, to better understand how things are executed.
    So far we couldn't get the execution graph working (we didn't try very hard, though).

    Do you have any tips about that?

    Thanks,
    Clément