
MCU-PLUS-SDK-AM243X: IPC sends packets while the other side has not finished reading the vring

Part Number: MCU-PLUS-SDK-AM243X


Hello,

We are using the MCU PLUS SDK 09.00, compiled with LTS 3.2.0.

We have an issue with the IPC that took us weeks to catch, and it is really hard to reproduce, let alone to create a minimal example for, so we need some advice on good practices to keep debugging this issue.

After a lot of debugging we traced the issue to the RPMessage/IpcNotify drivers of the SDK. We noticed that one core writes a packet via RPMessage into the vring while the other core has not yet read the packet that is still in that buffer.

We are using RPMessage for core-to-core communication.

If one core sends a packet, the other core gets an interrupt and the callback is called. Inside this callback - which is still in interrupt context - we write a value back into the vring buffer, into the user space of the packet, so we know this packet was already read. On the sending side, for every send we check whether this value is present. If the value is not present, that means the packet was not read.

It happens that the value was not written, which means the packet was not read, yet the other core tries to write at this location, and we cannot identify why this happens. It only happens from time to time, sometimes quite early, sometimes a bit later.

I also tried to verify this in another project with heavy-load communication via IPC, but unfortunately I cannot recreate the issue there, so we cannot even provide a minimal example.

Do you have any idea how we can proceed from here? We are also open to a debug session.

Best regards

Felix

  • Hello Felix,

    I am looking at your query and you can expect a reply in one or two days.

    Regards,

    S.Anil.

  • Hello Felix,

    Sorry for the delayed reply as I was busy with other escalations.

    If one core sends a packet, the other core gets an interrupt and the callback is called. Inside this callback - which is still in interrupt context - we write a value back into the vring buffer, into the user space of the packet, so we know this packet was already read. On the sending side, for every send we check whether this value is present. If the value is not present, that means the packet was not read.

    Your understanding is correct. This is how IPC works. 

    It happens that the value was not written, which means the packet was not read, yet the other core tries to write at this location, and we cannot identify why this happens. It only happens from time to time, sometimes quite early, sometimes a bit later.

    Actually, every core has unique buffers for TX and RX. So it should not be possible that writing and reading happen on the same assigned buffer at the same time.

    I need some more details to debug the issue.

    1. Did you modify the IPC driver? If you modified the driver, please share your code.

    2. Between which cores is IPC used in your application?

    3. When data is received at the remote core, please check the IPC shared buffer addresses and verify whether your shared data is present there or not.

    Regards,

     Anil.

  • Hey Anil,

    thanks for your reply. We did modify the IPC driver.

    1.

    int32_t RPMessage_send( void*    data,
                            uint16_t dataLen,
                            uint16_t remoteCoreId,
                            uint16_t remoteEndPt,
                            uint16_t localEndPt,
                            uint32_t timeout
                          )
    {
        int32_t status = SystemP_FAILURE;
    
        if(remoteCoreId < CSL_CORE_ID_MAX && gIpcRpmsgCtrl.isCoreEnable[remoteCoreId]
            && data != NULL && dataLen != 0
            )
        {
            uint16_t vringBufId;
    
            status = RPMessage_vringGetEmptyTxBuf(remoteCoreId, &vringBufId, timeout);
            if(status == SystemP_SUCCESS)
            {
                uint8_t *vringBufAddr = RPMessage_vringGetTxBufAddr(remoteCoreId, vringBufId);
                uint16_t vringBufLen = RPMessage_vringGetTxBufLen(remoteCoreId, vringBufId);
                RPMessage_Header *header = (RPMessage_Header *)vringBufAddr;
    
                if(dataLen > (vringBufLen - sizeof(RPMessage_Header)) )
                {
                    dataLen = vringBufLen - sizeof(RPMessage_Header);
    
                    DebugP_logWarn("[IPC RPMSG] Message send to remote core %d @ %d end point truncated due to lack of space in vring buffer !!!\r\n",
                        remoteCoreId, remoteEndPt);
                }
    
                header->srcEndPt = localEndPt;
                header->dstEndPt = remoteEndPt;
                header->srcCoreId = gIpcRpmsgCtrl.selfCoreId;
                header->flags = 0;
                header->dataLen = dataLen;
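                /* Our modification: before reusing this TX slot we read back the
                 * 32-bit word right behind our own 32-bit header (i.e. at offset
                 * sizeof(RPMessage_Header) + 4 in the buffer). The receiving core
                 * overwrites that word with 0xABCDEF12 in its callback once it has
                 * consumed the packet, so here it must be either 0 (presumably a
                 * slot that was never used yet) or the magic pattern - anything
                 * else means the previous packet in this slot was not read yet. */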
    
                uint32_t dummy = 0;
                memcpy(&dummy, &vringBufAddr[sizeof(RPMessage_Header)+4], sizeof(uint32_t));
    
                DebugP_assert((dummy == 0) || (dummy == 0xABCDEF12));
    
    
                memcpy( &vringBufAddr[sizeof(RPMessage_Header)], data, dataLen);
    
                status = RPMessage_vringPutFullTxBuf(remoteCoreId, vringBufId, dataLen + sizeof(RPMessage_Header), timeout);
    
                if(status != SystemP_SUCCESS)
                {
                    DebugP_logWarn("[IPC RPMSG] Message send to remote core %d @ %d end point failed due to lack of space in Notify Queue !!!\r\n",
                    remoteCoreId, remoteEndPt);
                }
            }
            else
            {
                DebugP_logWarn("[IPC RPMSG] Message send to remote core %d @ %d end point failed due to lack of space in vring !!!\r\n",
                    remoteCoreId, remoteEndPt);
            }
        }
        else
        {
            DebugP_logError("[IPC RPMSG] Message send to remote core %d @ %d end point failed due to invalid parameters !!!\r\n",
                remoteCoreId, remoteEndPt
                );
        }
        return status;
    }

    So every time we receive a packet we work on it, and after working on it we write back a pattern (0xABCDEF12) in our IPC callback, i.e. still in the interrupt context of the IPC interrupt. The next time we send, right before we write the data in ipc_rpmsg, we check whether this pattern is present. Both cores do this the same way.
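
    To make that a bit more concrete, here is a stripped-down sketch of the receive side (the function name and signature are only placeholders, not our real callback; the point is just that the marker is written back into the same slot, at the word behind our own 32-bit header):

    #include <stdint.h>
    #include <string.h>

    #define PKT_CONSUMED_MARKER (0xABCDEF12u)

    /* Placeholder for our RPMessage receive callback: it runs in the interrupt
     * context of the IPC interrupt and gets a pointer to the payload that
     * follows the RPMessage_Header in the vring slot. */
    void ipcRecvCallbackSketch(uint8_t *payload, uint16_t payloadLen)
    {
        (void)payloadLen;

        /* ... evaluate our own 32-bit header and the data behind it ... */

        /* Mark the slot as consumed; the sender checks exactly this word in
         * the modified RPMessage_send() before reusing the buffer. */
        uint32_t marker = PKT_CONSUMED_MARKER;
        memcpy(&payload[4], &marker, sizeof(marker));
    }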

    2. It is between mcu0_0 and mcu1_0. (The thing is, we load another firmware onto the second cluster right after our application has started; this is needed since we cannot load it onto mcu0_1 after the cluster has already started.)

    3. If I get you right, we checked that, and the shared addresses match for each coreObj on every core, so one core's RX is the other core's TX and vice versa.

    Our current situation is this one:


    Some explanation to the screenshot:

    1. We enabled global breakpoints, and I set a HW breakpoint at the point where the assert is thrown (i.e. inside DebugP_assert()), so the other core also stops.

    2. The configuration is: 12 buffers, 128 bytes each, endpoint number 2. We have our own 32-bit header following the RPMessage header.

    3. On the right you can see the memory of the vring; the buffer starts at 0x701D8140.

    4. The check we did is that we copied the data from the buffer into our dummy variable, which holds 12 (0x0C). Since this is not the pattern, the assert is thrown.

    5. As you can also see in the memory view, the pattern is present! It is located in the packet at address 0x701D3C0.

    My conclusion is that this seems to be a timing issue: between copying the value from the buffer and throwing the assert, the other core found the time to handle this packet and wrote back the pattern in the meantime. The global breakpoint did work, but it seems it was "too late".

    This also matches the issues we had in the beginning, where some received values were wrong, which could have happened because one core wrote into the buffer while the other one was still reading it.

    In my understanding the handling - reading the vring, moving the data into a queue packet and so on - is safe because of the index handling, but somehow we get a problem here. Is it possible that it has something to do with interrupts being locked for too long by another entity?

    Regards,

    Felix

    Edit:

    We now have an idea what could go wrong: we noticed that the idx values are never decreased or checked against anything, they are just incremented and then a modulo is applied. This means a value of 12 buffers is not usable, because a really odd slot value occurs when the index reaches the end of the uint16_t range (65535) right before overflowing back to 0. We think that this messes up the idx handling. We are investigating now with 8 and 16 buffers.

  • Hey Felix,

    Thanks for sharing the above details.

    1. We enabled global breakpoints, and I set a HW breakpoint at the point where the assert is thrown (i.e. inside DebugP_assert()), so the other core also stops.

    2. The configuration is: 12 buffers, 128 bytes each, endpoint number 2. We have our own 32-bit header following the RPMessage header.

    3. On the right you can see the memory of the vring; the buffer starts at 0x701D8140.

    4. The check we did is that we copied the data from the buffer into our dummy variable, which holds 12 (0x0C). Since this is not the pattern, the assert is thrown.

    5. As you can also see in the memory view, the pattern is present! It is located in the packet at address 0x701D3C0.

    I understand all the above points. You send a packet from the main core (R5F0_0) to the remote core (R5F1_0); the remote core reads the data and then sends the magic flag (0xABCDEF12) back to the main core. On the main core side, every send is based on this flag: if the flag does not hold the magic value (0xABCDEF12) you hit the debug assert, otherwise you start a new transmission.

    So, my question is: how frequently are you sending?

    I assume that you are first calling RPMessage_send, followed by calling the RPMessage_recv API, like below.
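
    Roughly like this minimal sketch (the RPMessage object, endpoint numbers and core ID are only placeholders here and the object construction is omitted; please check the exact argument lists against ipc_rpmsg.h in your SDK version):

    #include <drivers/ipc_rpmsg.h>

    /* Placeholders: gAckMsgObject is assumed to be created elsewhere with
     * RPMessage_construct() on LOCAL_END_PT; the endpoint numbers are arbitrary. */
    extern RPMessage_Object gAckMsgObject;
    #define REMOTE_END_PT   (14u)
    #define LOCAL_END_PT    (13u)

    void ipcSendRecvSketch(void)
    {
        uint8_t  txBuf[64] = { 0u };
        uint8_t  rxBuf[64];
        uint16_t rxLen = (uint16_t)sizeof(rxBuf);
        uint16_t rxCoreId, rxEndPt;

        /* 1. Send the request to the remote R5F core ... */
        RPMessage_send(txBuf, (uint16_t)sizeof(txBuf),
                       CSL_CORE_ID_R5FSS1_0, REMOTE_END_PT, LOCAL_END_PT,
                       SystemP_WAIT_FOREVER);

        /* 2. ... then block until the remote core answers on the local end point. */
        RPMessage_recv(&gAckMsgObject, rxBuf, &rxLen,
                       &rxCoreId, &rxEndPt, SystemP_WAIT_FOREVER);
    }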

    We now have an idea what could go wrong: we noticed that the idx values are never decreased or checked against anything, they are just incremented and then a modulo is applied. This means a value of 12 buffers is not usable, because a really odd slot value occurs when the index reaches the end of the uint16_t range (65535) right before overflowing back to 0. We think that this messes up the idx handling. We are investigating now with 8 and 16 buffers.

    Can you please elaborate more? I really don't understand this point.

    If you remove your application code and keep only the IPC code, that is also fine for me.

    If you have any problem sharing your code on the public forum, please share it via private message or contact your FAE to get the code to me without your application.

    Regards,

    S.Anil.

  • Hey Anil,

    So we were sending every 1 ms, with some additional calls in between, so probably a packet was sent roughly every 500 µs.

    The RPMessage_send is done from the task context of FreeRTOS tasks, and the receive is done in the callback that can be registered with RPMessage, i.e. in interrupt context.

    For the last point, I think that's exactly our issue. We have now used 8, 16 and 32 for the number of buffers and we do not see the issue anymore. Only a number of buffers that divides 65536 without (!) a remainder is suitable.

    What I mean is that internally RPMessage uses an index for the VRING. This index is used to calculate the current packet to work on via a modulo, e.g. lastAvailIdx in RPMessage_vringGetEmptyTxBuf. The same is true for the other indexes of the vringObj. There is only an increment, never a decrement, so by design the value overflows and starts again at 0. If you now have e.g. 12 for vringNumBuf, there are edge cases right before the overflow where the modulo does not wrap back to 0 where it normally should, and thus a packet could be missed. That is not the case for 2, 4, 8, 16, 32 and so on (all 2^n cases). Somehow this seems to be the reason it mixes something up when working on the packets.
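
    Just to illustrate the arithmetic, here is a small standalone snippet (it has nothing to do with the SDK code itself) that prints which slot a free-running uint16_t index selects around the overflow, once for 12 and once for 16 buffers:

    #include <stdint.h>
    #include <stdio.h>

    /* Shows why a buffer count that does not divide 65536 breaks the
     * "free-running uint16_t index + modulo" scheme when the index wraps. */
    int main(void)
    {
        uint16_t idx = 65533u;   /* free-running index just before the wrap */

        for (int i = 0; i < 6; i++, idx++)
        {
            printf("idx=%5u  slot(12 buffers)=%2u  slot(16 buffers)=%2u\n",
                   (unsigned)idx, (unsigned)(idx % 12u), (unsigned)(idx % 16u));
        }
        /* 12 buffers: slots 1, 2, 3, 0, 1, 2 -> at the wrap the sequence jumps
         * from slot 3 back to slot 0, slots 4..11 are skipped and a slot comes
         * around again after only 4 increments instead of 12.
         * 16 buffers: slots 13, 14, 15, 0, 1, 2 -> the wrap is invisible,
         * because 65536 is an exact multiple of 16. */
        return 0;
    }

    With 12 buffers, 65536 = 12 * 5461 + 4, so every overflow of the 16-bit index shifts the slot sequence by 4 positions; with 8, 16 or 32 buffers the remainder is 0 and the overflow has no effect on the slot sequence.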

    So our current test has now been running for 4 hours, while previously the problem occurred after about 80 s or even earlier and was always reproducible.

    I think that's never a problem for people using the SysConfig tool, since as far as I remember it does not offer any options other than 8, 16 and so on. It is only a problem for us because in another thread about the IPC a colleague of yours said the IPC does not have the requirement to be 2^n: https://e2e.ti.com/support/microcontrollers/arm-based-microcontrollers-group/arm-based-microcontrollers/f/arm-based-microcontrollers-forum/1319529/mcu-plus-sdk-am243x-possible-packet-loss-in-ipc/5037866?tisearch=e2e-sitesearch&keymatch=%25252525252525252520user%2525252525252525253A453845#5037866

    But I think that is in fact a requirement: a number of buffers that is not 2^n can cause such issues.

    I will let our test run until tomorrow, and if no problem occurs I think we can consider this resolved.

    Best regards,

    Felix

  • So I want to give an update: we were able to let it run overnight without any problems. I also modified another project to use 12 buffers, and the error occurred there as well.
    So yes, the number of buffers not being 2^n is a problem.

    Best regards

    Felix

  • Hello Felix,

    I understand your problem. As of now, you can go ahead with the 2^n buffer allocation; give me some time and I will post my analysis report.

    Regards,

    S.Anil.