IPC Notify - Performance difference between Loopback mode and regular mode.

Other Parts Discussed in Thread: SYSBIOS

Hello everyone,

We did some benchmarks on Notify for its two modes: loopback (CoreA notifies itself) and normal (CoreA notifies CoreB).

We measure the time taken by a Notify_sendEvent call.

We noticed an average difference of about 500 cycles between a Notify_sendEvent to the same core and a Notify_sendEvent to another core.

My question is: what are the internal differences between the two modes that explain this performance difference?

Does Notify_sendEvent in loopback mode bypass the IPCGR registers?

Thank you
Clément

  • Clement,

    Loopback bypasses the IPCGR. If I recall correctly, loopback mode does not generate an interrupt.
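
    Roughly, the difference looks something like this (just an illustrative sketch of the two paths, not the actual driver source; the names here are made up):

        /* Sketch only: why a loopback send is cheaper than a remote send.
         * Loopback can invoke the registered callback directly on this core;
         * a remote send has to place the event in shared memory and then write
         * the remote core's IPCGR register to raise an IPC interrupt. */
        typedef void (*NotifyCbFxn)(unsigned short procId, unsigned short lineId,
                                    unsigned int eventId, void *arg, unsigned int payload);

        extern NotifyCbFxn localCallback[32];        /* hypothetical callback table */
        extern volatile unsigned int *IPCGR_REMOTE;  /* hypothetical IPCGR address  */

        static void sketch_sendEvent(int dstProc, int selfProc,
                                     unsigned int eventId, unsigned int payload)
        {
            if (dstProc == selfProc) {
                /* loopback path: no shared-memory slot, no IPCGR write, no interrupt */
                localCallback[eventId](selfProc, 0, eventId, 0, payload);
            } else {
                /* remote path: enqueue the event in the shared circular buffer
                 * (not shown here), then raise the interrupt on the other core */
                *IPCGR_REMOTE = 1;
            }
        }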

    Judah

  • Judah,

    Ok, fair enough.

    We have a test case where CoreA continuously sends Notify events to CoreB. We benchmark the time taken to send each event.

    We use the DriverCirc, -O3 optimization, the SYS/BIOS custom library, and numMsgs = 32 for the DriverCirc, with waitClear = TRUE when sending the events.

    With asserts disabled we get a stall (sending an event takes more time) every 32 events, which is what we expected given that numMsgs for the circular buffer in the shared region is 32: it means we have filled the circular buffer and have to wait a bit.
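
    (By "stall" I mean the sender having to wait until a slot frees up. My mental model of it, only as an illustration and not the actual driver code, is a classic circular-buffer full check:)

        /* Illustration only, not the NotifyDriverCirc source: the buffer has
         * numMsgs slots, and the sender has to wait whenever its write position
         * would catch up with the receiver's read position. */
        #define NUM_MSGS 32

        static unsigned int writeIndex = 0;   /* advanced by the sending core   */
        static unsigned int readIndex  = 0;   /* advanced by the receiving core */

        static int buffer_is_full(void)
        {
            return ((writeIndex + 1) % NUM_MSGS) == readIndex;
        }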

    With asserts enabled we get the same behavior every 24 events. We don't understand why it isn't every 32 events.

    Any idea why?

    Regards,
    Clement

  • Clement,

    Sorry, I have no idea why it would be every 24 events.  I would also expect it to be 32 events.  Could you try changing the buffer to 64 to see what happens?

    Judah

  • Judah,

    I did more tests and here's what I found:

    Setup: SYS/BIOS custom library, IPC NotifyDriverCirc; asserts didn't matter, and waitClear = TRUE or FALSE didn't matter either.

    The application: 2 cores. One core sends events to the other core in a loop. Cycles are measured at the end of the loop (once 200 events have been sent).

    Results:

    numMsgs = 32 (default) -> the 32nd event sent takes more time, and then every 24 events there is a stall (about 250 extra cycles).

    numMsgs = 64 -> the 64th event sent takes more time, and then every 48 events there is a stall (about 250 extra cycles).

    I don't understand the 24/48-event period instead of 32/64.

    Could you try to reproduce it on your side ?

    Regards,
    Clément

  • Clement,

    The core that is receiving the messages, what is it doing? Is it waiting until it receives 32 events before it starts processing any of them, or is it trying to process events as fast as possible? If it's trying to process them as fast as possible, could it be that it takes 24 events before it starts processing the first one?

    Also, is there anything else in your system, like a timer or something that goes off periodically, such that it happens on every 24th event?

    Judah

  • Clement,

    Okay, I did a quick experiment with MessageQ, and on my end, I see something slightly different from what you are seeing.

    I see the delays on event 32 then event 56.  This is with the Notify number of messages being 32.

    I can explain 32 then 56, but I'm not quite sure how to explain 24 then 48; they may be related or similar.

    The first time around the circular buffer, it should be able to go 32 times before it needs any delay. The delay is due to a cache invalidate operation: we only invalidate the cache line when we think there are no more slots available.

    Now, assuming you do the cache invalidate on the 33rd send, the other core may not be done with all 32 outstanding events. When you read the line in on the 33rd send, some events could still be outstanding, so the sender thinks those slots are not yet free. So the next time the sender needs to invalidate the cache will be before the next 32 events, and it seems like 24 is where it's hitting.

    Now, if the receiver is really slow at processing the events, this could potentially make the sender stall even more, and it would need to do cache invalidates more often. So I could see it needing to do an invalidate every 16 events or even every 8 events.
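
    In rough pseudo-C, the idea is something like this (just a sketch of the mechanism, not the actual NotifyDriverCirc source):

        /* Sketch: the sender works from a cached (possibly stale) copy of the
         * receiver's readIndex and only pays for a cache invalidate when its
         * local view says the buffer is full. */
        static unsigned int cachedReadIndex = 0;   /* sender's stale view */

        static void sketch_send(volatile unsigned int *sharedReadIndex,
                                unsigned int *writeIndex, unsigned int numMsgs)
        {
            while (((*writeIndex + 1) % numMsgs) == cachedReadIndex) {
                /* Looks full: invalidate the cache line and re-read the real
                 * readIndex from shared memory (this is the ~250-cycle hit).
                 * If the receiver has only freed 24 of the 32 slots at this
                 * point, the next hit comes after 24 more sends, not 32. */
                /* Cache_inv((void *)sharedReadIndex, sizeof(*sharedReadIndex), Cache_Type_ALL, TRUE); */
                cachedReadIndex = *sharedReadIndex;
            }

            /* ...write the event into slot *writeIndex... */
            *writeIndex = (*writeIndex + 1) % numMsgs;
        }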

    Judah

  • Judah,

    First, I can dismiss the clock/timer hypothesis. The project is minimal and contains only what is needed; the BIOS Clock module is even disabled.

    Then, here's the code for the master:

        for (index = 0; index < nbSendEvent; index++)
        {
            start  = TSCL;
            status = Notify_sendEvent(1, INTERRUPT_LINE, EVENTID, 0, FALSE);
            stop   = TSCL;

            if (status < 0)
            {
                System_abort("sendEvent failed\n");
            }
            nbcycles[index] = stop - start;
        }

        /* print results */
        for (index = 0; index < nbSendEvent; index++)
        {
            System_printf("cycles %d = %d\n", index, nbcycles[index]);
        }

    and for the slave:

        while (1) { }

    and the callback function:

        void SlaveCbFxn(UInt16 procId, UInt16 lineId, UInt32 eventId, UArg arg, UInt32 payload)
        {
            count++;
        }

    As you can see, the slave doesn't process the events and the callback function simply increments a counter.

    I like the idea of "could it be that it takes 24 events before it starts processing the first one?" I hadn't thought about it that way. I'll try to get evidence of how the events are actually processed.
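
    One way I plan to check it (a sketch, assuming TSCL has already been started on the slave core) is to timestamp each callback invocation; the gaps between successive timestamps should show whether the callbacks come in bursts or roughly one per send:

        #include <xdc/std.h>
        #include <c6x.h>                 /* TSCL: time-stamp counter low register */

        #define MAX_EVENTS 200

        volatile UInt32 rxStamp[MAX_EVENTS];
        volatile UInt32 count = 0;

        Void SlaveCbFxn(UInt16 procId, UInt16 lineId, UInt32 eventId,
                        UArg arg, UInt32 payload)
        {
            if (count < MAX_EVENTS) {
                rxStamp[count] = TSCL;   /* slave-local cycle count at processing time */
            }
            count++;
        }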

    Then you say "I see the delays on event 32 then event 56.  This is with the Notify number of messages being 32."

    To be clear, that's what I see too: delays at 32, 56, 80, 104, ... That's why I say 32 the first time and then every 24 events.

    In the case of numMsgs = 64, I see delays at 64, 112, 160, 208, ... first at 64 and then every 48 events.

    I'll have to read your comment about cache invalidation again; I haven't fully grasped it yet.

    Thanks for your work and follow-up on this.

    Clement

  • We tried a little experiment where the core receiving the events spends a lot of time in the callback function.

    We see, as you predicted, a delay every 8 events (except the first time, which is at the 32nd event).

    The delay seems directly related to the time spent in the callback function: we spent about 30000 cycles in the callback function, and the delay between events 40 and 41 was around 240000 cycles, which is 8 * 30000.

    When we doubled the time spent in the callback function to 60000 cycles, we saw a delay twice as large (around 480000 cycles).

    Clement

  • Clement,

    Yes, now it's coming back to me. On the receive side, we only write back the cache every N/4 messages, so for N = 32, every 8 messages. That is why you're seeing this factor of about 8 on the send side. We don't want to take a cache hit on every message, so we chose N/4.
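
    The receive side is roughly the mirror image of the send-side sketch above (again just an illustration, not the driver source):

        /* Sketch: the receiver advances its readIndex on every event but only
         * writes the cache line back to shared memory every N/4 events, so the
         * sender sees freed slots in chunks of N/4 (8 for N = 32). */
        #define NUM_MSGS 32

        static unsigned int readIndex = 0;

        static void sketch_process_one(volatile unsigned int *sharedReadIndex)
        {
            readIndex = (readIndex + 1) % NUM_MSGS;

            if ((readIndex % (NUM_MSGS / 4)) == 0) {
                *sharedReadIndex = readIndex;
                /* Cache_wb((void *)sharedReadIndex, sizeof(*sharedReadIndex), Cache_Type_ALL, FALSE); */
            }
        }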

    Judah

  • Judah,

    Ok good to know.

    I just found this in the source code of NotifyDriverCirc :

            /*
             *  Write back the getReadIndex once every N / 4 messages.
             *  No need to invalidate since only one processor ever
             *  writes this. No need to wait for operation to complete
             *  since remote core updates its readIndex at least once
             *  every N messages and the remote core will not use a slot
             *  until it sees that the event has been processed with this
             *  cache wb.
             */

    It confirms what you are saying.

    I'm going to study the source code a bit more to get a better understanding of these cache operations.

    I think we can call this thread closed.

    Clément