LP-AM243: xTaskNotifyFromISR and xTaskNotifyWait causing task to crash

Tron

Part Number: LP-AM243

I'm having a problem with xTaskNotifyFromISR and xTaskNotifyWait.

I have had a collection of them working for a long time and I'm not sure what's changed. Recently one stopped working, and as of today, I can't get any of them to work.

I have an IPC interrupt function as follows:

void ipcCommandHandler(uint32_t remoteCoreId, uint16_t localClientId, uint32_t msgValue, void *args)
{

    uint32_t i = (uint32_t)args;
    int32_t status;

    gIpcCallCount++;

    // Send the IPC Notify message value to the corresponding task using Task Notify
    BaseType_t xHigherPriorityTaskWoken = pdFALSE;
    xTaskNotifyFromISR(systemcfg[i].task.handle, (uint32_t)msgValue, eSetValueWithOverwrite, &xHigherPriorityTaskWoken);
    portYIELD_FROM_ISR(xHigherPriorityTaskWoken);

    /* Echo the message back as an ack. Server is waiting. */
    status = IpcNotify_sendMsg(remoteCoreId, localClientId, (uint32_t)msgValue, 1);
    DebugP_assert(status == SystemP_SUCCESS);

    gIpcCallCount++;

    return;
}

And in my tasks, I have the following:

uint32_t message;
int32_t status;
DebugP_log("[%s] Waiting to receive a command.\r\n", systemcfg[sysIdx].name);
status = xTaskNotifyWait(0, 0, &message, portMAX_DELAY);
DebugP_log("[%s] Notify status: %d.\r\n", systemcfg[sysIdx].name, status);
DebugP_assert(status == pdTRUE);

The problem is when I trigger the IPC Notify interrupt, I can use the gIpcCallCount variable inside the ISR to confirm it's starting to execute, but then two things happen:

- gIpcCallCount only gets a value of 1, meaning the second increment after the xTaskNotify isn't reached
- The task being notified seems to crash: ROV shows that before the event the task is blocked, but after it's not there at all

Any idea why this might be happening? This has me stumped.

10 months ago

0 Tron 10 months ago

Prodigy 170 points

Also, if I'm fast to pause execution, I can see we land inside HwiP_data_abort_handler_c when this happens, before the execution moves onto app idle with the task no longer showing in the ROV.
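
Landing in HwiP_data_abort_handler_c on the notify path is what an out-of-range index or an unset task handle can look like, so one low-cost check is to validate both before calling xTaskNotifyFromISR. The sketch below is only an illustration that compiles on a host: `SystemEntry`, `gSystems`, `NUM_SYSTEMS` and `lookup_task_handle` are hypothetical stand-ins, since the real systemcfg layout isn't shown here.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_SYSTEMS 4u  /* hypothetical: stands in for gNumSystems */

/* Hypothetical stand-in for one systemcfg[] entry; the real struct is not shown. */
typedef struct {
    void *taskHandle;   /* would be a TaskHandle_t in the real project */
} SystemEntry;

static SystemEntry gSystems[NUM_SYSTEMS];

/* Returns the task handle for the index packed into the ISR's args pointer,
 * or NULL if the index is out of range or the handle was never set. */
static void *lookup_task_handle(void *args)
{
    uintptr_t i = (uintptr_t)args;  /* uintptr_t round-trips a pointer value */
    if (i >= NUM_SYSTEMS) {
        return NULL;
    }
    return gSystems[i].taskHandle;
}
```

In the handler, the notify call would then run only when the lookup returns non-NULL, with an error counter incremented otherwise so the failure is visible from the debugger.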

0 Tron 10 months ago in reply to Tron

Prodigy 170 points

The more I try and analyse this, the more I think it could be caused by IPC Notify instead and has nothing to do with Task Notify.

I'm not sure why it's any different now, but I can't get gIpcCallCount to increment at all anymore, and if I remove the xTaskNotify code from the interrupt function the entire core crashes at the interrupt, landing at 0x7016d1** (or a location close to this one, it changes each time) with "no symbols are defined". I feel like there's a hint in the HwiP_data_abort_handler_c when the xTaskNotify lines are in there, given that afaik there is no hardware interrupt for Task Notify and it can only be IPC.

If I pause execution on the client core after IPC initialisation, then trigger an IPC Notify message from the server core and begin to step through the client core, eventually it just crashes unexpectedly and without warning: a call to vApplicationIdleHook() takes me to a call to vApplicationLoadHook(), and then when I try to step into line 391 of TaskP_freertos.c (uint64_t curUpdateTime = ClockP_getTimeUsec();) the execution crashes and I can only see the "no symbols defined" error. There's no attempt to execute anything related to the IPC interrupt, but it's definitely caused by the IPC Notify message posted by the server core.

I'm at a total loss on what else I can do to debug and gain any extra insight. Why might IPC Notify cause a crash when calling/executing the interrupt function, particularly after it's been initialized without error?

0 Ashwin Raj 10 months ago in reply to Tron

TI__Intellectual 2235 points

Hi Tron,

Below is the code after removing xTaskNotify(). Can you check exactly where it crashes?

void ipcCommandHandler(uint32_t remoteCoreId, uint16_t localClientId, uint32_t msgValue, void *args)
{

    uint32_t i = (uint32_t)args;
    int32_t status;

    gIpcCallCount++;

    /* Echo the message back as an ack. Server is waiting. */
    status = IpcNotify_sendMsg(remoteCoreId, localClientId, (uint32_t)msgValue, 1);
    DebugP_assert(status == SystemP_SUCCESS);

    gIpcCallCount++;

    return;
}

Also, are any further IPC messages sent by the server core before this ISR is exited?

Regards,

Ashwin

0 Tron 10 months ago in reply to Ashwin Raj

Prodigy 170 points

Hi Ashwin. Thanks for checking in.

Funny timing - I created a new thread here seconds ago as my diagnosis has developed further and it's definitely IPC: e2e.ti.com/.../lp-am243-ipc-notify-is-causing-hard-faults

I was about to delete this one since it's evidently not task notify.

0 Tron 10 months ago in reply to Ashwin Raj

Prodigy 170 points

I'm not sure if I was doing something different yesterday, but today I created an empty interrupt handler for testing each line specifically and I see the following:

- Empty: there is no crash
- With gIpcCallCount++: there is no crash, but the value doesn't increment so I can't be sure the handler is being executed
- With the IpcNotify_sendMsg lines: it does hard fault, but the gIpcCallCount variable doesn't increment
- With the xTaskNotify lines and no ipcNotify command: it doesn't hard fault but execution does land at HwiP_data_abort_handler_c, and the gIpcCallCount variable doesn't increment

Even stranger is that I have two AM243X launchpads with me at the moment. On one launchpad the crash only happens on two of the R4F cores. On the other it happens on any core, including the R5F core. That's running the exact same code on both.

I'd have thought maybe our boards are old and beginning to become defective (and so I've ordered new ones) but I'm not convinced, as I can still run the IPC Notify example without any issues.

0 Tron 10 months ago in reply to Tron

Prodigy 170 points

I wonder if it's not Launchpad related, but local workstation related? Is there a chance there could be a configuration difference within CCS or FreeRTOS locally causing some cores to work and not others? Sadly I'm not in the office to confirm whether the differences I've noted are due to the different launchpad, or because I'm on a different workstation.

0 Ashwin Raj 10 months ago in reply to Tron

TI__Intellectual 2235 points

Hi Tron,

I don't think any CCS configuration other than build mode (Release vs Debug) can cause this. Will it be possible to share a sample application where I can recreate the issue? I can try to debug from my end.

Regards,

Ashwin

0 Tron 10 months ago in reply to Ashwin Raj

Prodigy 170 points

Hi Ashwin. I started to strip down the project but the nature of the issue changes, and not in a helpful way.

For example, there's a core structure we store in USER_SHM_MEM containing a tree of settings - many of which aren't used beyond initialisation because we're still writing the code but they're defined in anticipation anyway. When I start to prune the initialisation code of the unused variables of the structure then eventually I reach a point where the M4 core stops hard faulting on the IPC interrupt, and sometimes R5F0_0 also. But it's all seemingly nondeterministic. Nothing uses the global variables I've removed the init code for, so the execution shouldn't be any different.

That implies to me something strange is happening with memory. I've tried:

- multiplying the stack size of the tasks (including main) to ensure they're not overflowing
- increasing the stack and heap sizes in the linker files
- increasing the IRQ stack size in the R5F linker files
- confirming that the shared structure isn't greater than the 64kB we've allocated to USER_SHM_MEM
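
The size check in the last item can be turned into a build-time failure rather than a manual inspection. This is a minimal sketch assuming a C11 compiler; `SharedConfig`, `USER_SHM_MEM_SIZE` and `shm_headroom` are hypothetical names standing in for the real settings tree, and the 64 kB figure is the USER_SHM_MEM allocation mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

#define USER_SHM_MEM_SIZE (64u * 1024u)  /* 64 kB region, per the list above */

/* Hypothetical stand-in for the shared settings tree. */
typedef struct {
    uint32_t version;
    uint8_t  settings[32u * 1024u];
} SharedConfig;

/* The build fails if the structure outgrows the region. */
_Static_assert(sizeof(SharedConfig) <= USER_SHM_MEM_SIZE,
               "SharedConfig does not fit in USER_SHM_MEM");

/* Bytes left in the region; useful to log once at boot. */
static size_t shm_headroom(void)
{
    return USER_SHM_MEM_SIZE - sizeof(SharedConfig);
}
```

Because each core builds its own image, the assertion would need to sit in a header that every core's project includes, so that each build checks its own view of the structure.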

I'm not sure what else I can try, though, and I can't give you a complete version of our project without an NDA. I'm at a loss on what to do.

0 Tron 10 months ago in reply to Tron

Prodigy 170 points

Hi Ashwin. I've managed to create a minimal project that retains the issue. I've sent you a friend request to send you a zip file. Thanks again for your help.

0 Tron 10 months ago in reply to Tron

Prodigy 170 points

I think I've found the problem and I think it's a bug that TI will want to look into. We were passing the handler function to IpcNotify_registerClient via a function-pointer variable, and if we instead hard-code the function name, the problem disappears.

Here's the code that we were using to register the IPC channels:

void init_ipc_channels() {
    int32_t status;

    DebugP_log("[IPC] Initialising IPC channels... \r\n");

    /*For each channel of this core, construct the semaphores and register the handler functions */

    uint32_t selfCoreID = IpcNotify_getSelfCoreId();
    DebugP_log("    |  This core ID: %d \r\n", selfCoreID);

    for (uint32_t i = 0; i < gNumSystems; i++) {

        //DebugP_log("    |  Checking %d - ServerID: %d, ClientID: %d \r\n", i, systemcfg[i].ipc.serverID, systemcfg[i].ipc.clientID);

        if (systemcfg[i].ipc.serverID == selfCoreID) {
            /* Register a channel to receive the ACK back, and wait for a response. */
            DebugP_log("    |  Registering core %d with a semaphore to wait for ack... ", systemcfg[i].ipc.serverID);
            SemaphoreP_constructBinary((SemaphoreP_Object *)&systemcfg[i].ipc.doneSem, 0);
            DebugP_log("DONE!\r\n");
            DebugP_log("    | Registering core %d as server on channel %d... ", systemcfg[i].ipc.serverID, systemcfg[i].ipc.channelID);
            status = IpcNotify_registerClient(systemcfg[i].ipc.channelID, ipc_server_msg_handler, (void*)i);
            DebugP_assert(status == SystemP_SUCCESS);
            DebugP_log("DONE!\r\n");
        } else if (systemcfg[i].ipc.clientID == selfCoreID) {
            /* Client needs to prepare to receive messages from server
             * No semaphore needed. When we receive, we send ACK back, and action the data. */
            DebugP_log("    |  Registering core %d as client on channel %d... ", systemcfg[i].ipc.clientID, systemcfg[i].ipc.channelID);
            status = IpcNotify_registerClient(systemcfg[i].ipc.channelID, systemcfg[i].ipc.clientFunction, (void*)i);
            DebugP_assert(status == SystemP_SUCCESS);
            DebugP_log("DONE!\r\n");
        }
    }

    DebugP_log("    | IPC channels initialized.\r\n");

}

And here's the shape of clientFunction in the struct:

void (*clientFunction)(uint32_t, uint16_t, uint32_t, void *);

Looking through our git history this code has been in use for almost 12 months and for the most part was never problematic. It also intermittently runs fine for the M4 core and R50_0, but with the hardcoded change it's certainly working - there's no hard fault on the IPC interrupt on any core.

Frustratingly, I had a similar issue once trying to store a function pointer in a variable to be used by xTaskCreate. I had to create an absurd switch statement for every permutation of the xTaskCreate function call with hard coded function pointers, which we begrudgingly still use today.

If you can let me know if there's a way to achieve this without hard coding function pointers, please do, because it would save us a lot of noisy code!
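
One possibility worth ruling out, which nothing in this thread confirms: if systemcfg lives in USER_SHM_MEM and every core writes clientFunction during its own init, the stored address is only meaningful to the core whose image it came from, because each core runs a separately linked binary. A sketch of one alternative is to keep a small handler ID in shared memory and resolve it through a table that is local to each core. All names below (`HandlerId`, `kLocalHandlers`, `resolve_handler`, `command_handler`) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Same signature as the clientFunction pointer shown above. */
typedef void (*IpcHandler)(uint32_t, uint16_t, uint32_t, void *);

/* Hypothetical IDs: plain integers mean the same thing on every core. */
typedef enum {
    HANDLER_NONE = 0,
    HANDLER_COMMAND,
    HANDLER_COUNT
} HandlerId;

static uint32_t gLastMsg;  /* demo state so the handler has a visible effect */

static void command_handler(uint32_t remoteCoreId, uint16_t localClientId,
                            uint32_t msgValue, void *args)
{
    (void)remoteCoreId; (void)localClientId; (void)args;
    gLastMsg = msgValue;
}

/* Per-core table, linked into this core's own image (not shared memory). */
static const IpcHandler kLocalHandlers[HANDLER_COUNT] = {
    [HANDLER_NONE]    = NULL,
    [HANDLER_COMMAND] = command_handler,
};

/* Map an ID read from shared memory to this core's handler address. */
static IpcHandler resolve_handler(uint32_t id)
{
    return (id < (uint32_t)HANDLER_COUNT) ? kLocalHandlers[id] : NULL;
}
```

The registration loop would then pass the result of `resolve_handler(...)` to IpcNotify_registerClient after checking it for NULL, with the ID read from a new (hypothetical) field in the shared struct in place of the raw pointer.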

0 Ashwin Raj 10 months ago in reply to Tron

TI__Intellectual 2235 points

Hi Tron,

I don't understand why passing a variable that holds the function pointer would cause this issue. Can you please share that zip file with me? I'll try to debug on my end.

Also, how do you synchronize the sender and receiver cores to make sure the sender only sends messages after the receiver has registered the client ID? If you are not doing this already, you can use the IPC sync APIs for synchronization.

Regards,

Ashwin

0 Tron 10 months ago in reply to Ashwin Raj

Prodigy 170 points

Thanks Ashwin. Yes, I'll send you a ZIP file shortly.

You'll see soon, but the process at boot for each core is:

- Init the shared memory global variables using files linked to each core app from the system project folder
- Wait for IPC_syncAll to ensure each core has completed initialising the same shared memory space
- The synchronized memory contains a list of settings, including which cores are clients/servers to each other, and functions to cycle through all of these to establish the IPC channels. Each core logs the activity to the CIO so we can confirm that each core is paired correctly before any messages are sent.
- In the example, I've made one core periodically post IPC notify messages over different channels to different cores. Usually, this core processes commands received over UART first, but I've removed that for demonstration purposes.

You'll see that after initialization, with R51_1 as the server to all other cores, only the M4 core responds to the message. The other cores all hard fault when interrupted, unless you change the function pointer as explained above.

0 Ashwin Raj 10 months ago in reply to Tron

TI__Intellectual 2235 points

Hi Tron,

Which is the CCS Version you are using?

I'm getting a metadata error when importing the Project into workspace.

Regards,

Ashwin

0 Tron 10 months ago in reply to Ashwin Raj

Prodigy 170 points

That's strange. We're using 12.6.0.00008. What's the error exactly?

0 Tron 9 months ago in reply to Tron

Prodigy 170 points

Hi Ashwin. I thought I'd circle back on this, as the issue has emerged again, but this time the IpcNotify_SyncAll() function is hard faulting the M4F core specifically.

If I run a Core Trace, the fault doesn't happen. This is similar to earlier when the IPC notify message was causing cores to hard fault, and if I stepped through it there would be no hard fault (at least, not until the next time execution calls IpcNotify_SyncAll() without the core trace running).

I haven't changed any code related to this core, or even IPC. I suspect there's an issue during the build, as sometimes we can solve unexpected hard faults by simply adding one or more no-op lines of code in different places (usually something as simple as "i;").

0 Tron 9 months ago in reply to Tron

Prodigy 170 points

I managed to get a dump. Sometimes it works, sometimes it doesn't. Does this mean anything to anyone? Why would vTaskDelay, within IpcNotify_SyncAll, cause a switch to hypervisor mode? This is a vanilla FreeRTOS with MCU+ project, no hypervisors as far as I'm aware.

0 Tron 8 months ago in reply to Ashwin Raj

Prodigy 170 points

Hi. If someone can shed some light on this it would be greatly appreciated. Every time I reshuffle code, the behaviour changes (literally, shuffling unrelated code makes a difference).

Currently, I can't reliably communicate with core R50_1. It will accept ~20 messages spaced about 100ms apart, before it crashes. There's nothing in the console, but if I pause the core I can see it's been locked in a DebugP assert infinite loop, caused by the IPC notify ISR. No IPC code has changed, as mentioned earlier.

Probably related:
- this core also has a DM timer, which has also stopped working and causes the task it's started within to be blocked forever. That was also affecting and causing R51_0 to lock up, until I commented it out.
- I have a different core with only two tasks, no shared dependencies. When I remove too many DebugP log messages or reduce the vTaskDelay time of the respective task loops too low, over time at least one of the tasks becomes blocked indefinitely for an unexplained reason.
- If I switch to release mode, things get more weird. I have a switch statement that checks whether a character is either C or X. When it's C, the switch statement selects the default option instead, printing to console that the character is 'C' and doesn't match 'C' or 'X'.

These glitches are pretty consistent across different Launchpads and PCs, but there is some randomness when building the project as sometimes it does work fine.

What could be causing this? It's rather frustrating contorting my code around the mood of the processor, especially with no-ops littered everywhere.
