
MCU-PLUS-SDK-AM243X: Broken GPIO Interrupts, IPC Notify SyncAll with multi-core FreeRTOS project on MCU+ SDK 08.06.00

Part Number: MCU-PLUS-SDK-AM243X
Other Parts Discussed in Thread: SYSCONFIG

I have a multi-core FreeRTOS project, with R5_0-0 initialising GPIO Bank 0 with a HW interrupt. I also have a collection of variables stored in shared memory, which are successfully initialised across all cores at startup.

The GPIO bank interrupt function identifies which pin was triggered, and then records the state to a variable in shared memory.

I've got a relatively comprehensive multi-core application extensively sharing memory, communicating via IPC to synchronise operations, etc. and it mostly works fine. However, I have two problems that I suspect are related and I'm hoping someone can help shed some light on:

1. If I use IpcNotify_syncAll() after initialising shared memory across the cores, two of my cores/apps crash with a HWI assert when the first task tries to init. I don't know why, as the IPC sync part seems to work fine, and I'm not aware of any other HWIs on those cores. I've been working around this by not syncing the cores at all; if I load each core one-by-one with a few seconds in between, each will successfully initialise its tasks and operate as expected. But, of course, that's not deployable.

2. There's one exception to everything working with that workaround: the GPIO bank interrupt works fine when I load R5_0-0 before loading any of the other cores, or on its own. I trigger the switch, the interrupt fires, and the state of the input is stored to a variable in the shared memory space (not really shared yet, as it's the only core running). As soon as I load any other core, the 'new' core successfully initialises its shared memory (and overwrites the data for R5_0-0, as is to be expected, with the R5_0-0 console showing this), but then the GPIO bank interrupt stops working. It's almost like the GPIO pin configuration set by R5_0-0 is reset or altered by the other core, but I can't imagine why that would be, given SysConfig doesn't show any conflicts.

I had the same problem with 05.00.24 also.

To combat (1) I'm thinking of implementing an inelegant solution: either make each core wait a given number of seconds before creating its tasks, so that only one core is doing that at a time, or use a chain of IPC Notify messages so each core waits for the previous core to create its tasks before creating its own (a rough sketch of the chaining idea follows below). But this seems wasteful, and I'm no closer to understanding why creating tasks on separate cores in parallel causes such a problem.
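
For illustration, the chaining idea would look something like the sketch below. This is only a sketch: CORE_BOOT_INDEX is a hypothetical per-core compile-time define, and it hands off via a flag in the non-cached shared-memory region I already have rather than via actual IPC Notify messages.

#include <stdint.h>
#include <kernel/dpl/ClockP.h>

/* Hypothetical per-core define: 0 on R5F0-0, 1 on R5F0-1, and so on. */
#define CORE_BOOT_INDEX  (0U)

/* Boot-stage counter in the shared, non-cached section. R5F0-0 would set this
 * to 0 as part of its shared-memory init, before the cores synchronise. */
volatile uint32_t gBootStage __attribute__((section(".bss.user_shared_mem")));

/* Called after the sync point: block until the previous core in the chain
 * has finished creating its tasks. */
static void waitForBootTurn(void)
{
    while (gBootStage < CORE_BOOT_INDEX)
    {
        ClockP_usleep(1000U);
    }
}

/* Called once this core has created all of its tasks; releases the next core. */
static void signalBootDone(void)
{
    gBootStage = CORE_BOOT_INDEX + 1U;
}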

For (2) I'm out of ideas. I've been battling this for a few weeks, and I suspect it has to be something related to (1), since I can't think of any other aspect to help debug.

If anyone has any ideas, I'd love to hear them. Thanks in advance.

  • if I load each core one-by-one with a few seconds in between, each will successfully initialise its tasks

    Hi Tron,

    A couple of queries here:

    1. Are you seeing this issue only when you run a freertos application? 

    2. Can you start by unhalting the local core first and then unhalt the remote cores last, and see if the issue is still reproducible? The reason I ask is that I've seen this issue some time back, where the problem was with the interrupt registration. If the remote cores are unhalted first, they all send an IPC syncAll message; if your IPC Notify init is done after the interrupt registration, your local core receives an immediate interrupt, but as the notify params are not initialised you will get data aborts.

    but then the GPIO bank interrupt stops working

    This seems a little weird to me. You should at least get the interrupts correctly and the ISRs should be hit; saving the variable data might not happen due to the mailbox initialisations. Can you confirm the below?

    1. What is the Interrupt priority of the GPIO you registered?

    2. Have you enabled interrupt nesting?

  • Hi Kowshik,

    1. Are you seeing this issue only when you run a freertos application? 

    Yes, if there are no tasks to initialise on the remote cores, then the core doesn't fail a HWI assert and the application/core idles. That said, one core in my current application is an exception: it doesn't seem to be affected by any of this and always runs perfectly. I haven't uncovered any reason for this core to be different from the others.

    My workaround for this has been to disable the IpcNotify_syncAll() call so that I can unhalt the first core, wait for it to initialise its tasks, and then unhalt the remaining cores. For some reason this means they won't experience the HWI assert crash and their tasks initialise correctly. But, as mentioned, the GPIO interrupt on R5_0-0 stops working once the remote cores have started, so this is not a long-term solution.

    Alternatively, I can leave IpcNotify_syncAll() in and disable registering the GPIO bank interrupt instead, which stops the remote cores from crashing. This also isn't a long-term solution, because I need the input interrupts as well as the tasks.

    Because of this I don't think it's IPC Notify that is causing the problem, but something related to the GPIO bank interrupt registration on one core causing other cores to experience issues.

    Could it be related to the Sciclient configuration being done when registering the GPIO bank interrupt on one core? Perhaps I need to do something on the other cores to keep them in sync as well? Here's what my Sciclient_gpioIrqSet() function looks like, in case it's insightful:

    static void Sciclient_gpioIrqSet(void)
    {
        int32_t                             retVal;
        struct tisci_msg_rm_irq_set_req     rmIrqReq;
        struct tisci_msg_rm_irq_set_resp    rmIrqResp;

        /* Route the GPIO bank interrupt through the main GPIOMUX interrupt
         * router to R5FSS0 core 0. Only the destination fields are marked
         * valid because this is a direct route (no interrupt aggregator). */
        rmIrqReq.valid_params           = 0U;
        rmIrqReq.valid_params          |= TISCI_MSG_VALUE_RM_DST_ID_VALID;
        rmIrqReq.valid_params          |= TISCI_MSG_VALUE_RM_DST_HOST_IRQ_VALID;
        rmIrqReq.global_event           = 0U;
        rmIrqReq.src_id                 = TISCI_DEV_GPIO1;
        rmIrqReq.src_index              = TISCI_BANK_SRC_IDX_BASE_GPIO1 + GPIO_GET_BANK_INDEX(0);
        rmIrqReq.dst_id                 = TISCI_DEV_R5FSS0_CORE0;
        rmIrqReq.dst_host_irq           = CSLR_R5FSS0_CORE0_INTR_MAIN_GPIOMUX_INTROUTER0_OUTP_7;
        rmIrqReq.ia_id                  = 0U;
        rmIrqReq.vint                   = 0U;
        rmIrqReq.vint_status_bit_index  = 0U;
        rmIrqReq.secondary_host         = TISCI_MSG_VALUE_RM_UNUSED_SECONDARY_HOST;

        retVal = Sciclient_rmIrqSet(&rmIrqReq, &rmIrqResp, SystemP_WAIT_FOREVER);
        if(0 != retVal)
        {
            DebugP_log("[Error] Sciclient event config failed!!!\r\n");
            DebugP_assert(FALSE);
        }
        return;
    }
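
    For completeness, the SDK's GPIO input interrupt example pairs this with a release when the interrupt is torn down. If it's relevant, mine would look roughly like the below (a sketch only, reusing the same routing parameters):

    static void Sciclient_gpioIrqRelease(void)
    {
        int32_t                             retVal;
        struct tisci_msg_rm_irq_release_req rmIrqReq;

        /* Release the same direct route that Sciclient_gpioIrqSet() claimed */
        rmIrqReq.valid_params           = 0U;
        rmIrqReq.valid_params          |= TISCI_MSG_VALUE_RM_DST_ID_VALID;
        rmIrqReq.valid_params          |= TISCI_MSG_VALUE_RM_DST_HOST_IRQ_VALID;
        rmIrqReq.global_event           = 0U;
        rmIrqReq.src_id                 = TISCI_DEV_GPIO1;
        rmIrqReq.src_index              = TISCI_BANK_SRC_IDX_BASE_GPIO1 + GPIO_GET_BANK_INDEX(0);
        rmIrqReq.dst_id                 = TISCI_DEV_R5FSS0_CORE0;
        rmIrqReq.dst_host_irq           = CSLR_R5FSS0_CORE0_INTR_MAIN_GPIOMUX_INTROUTER0_OUTP_7;
        rmIrqReq.ia_id                  = 0U;
        rmIrqReq.vint                   = 0U;
        rmIrqReq.vint_status_bit_index  = 0U;
        rmIrqReq.secondary_host         = TISCI_MSG_VALUE_RM_UNUSED_SECONDARY_HOST;

        retVal = Sciclient_rmIrqRelease(&rmIrqReq, SystemP_WAIT_FOREVER);
        if(0 != retVal)
        {
            DebugP_log("[Error] Sciclient event release failed!!!\r\n");
            DebugP_assert(FALSE);
        }
    }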

    2. Have you enabled interrupt nesting?

    I haven't altered the default, which appears to be enabled.

    What is the Interrupt priority of the GPIO you registered?

    Here's the code I'm using on core R5_0-0 to register the interrupt. I'm not setting the priority, just copying the format from the GPIO Input Interrupt example and API examples (a possible explicit-priority variant is sketched after the snippet).

    void initSwitchInputs(void)
    {
        int32_t         retVal;
        HwiP_Params     switchInputHWIParams;

        Board_gpioInit();

        /* Register pin interrupt */
        HwiP_Params_init(&switchInputHWIParams);
        switchInputHWIParams.intNum   = Board_getGpioButtonIntrNum();
        switchInputHWIParams.callback = &switchInputInterruptFunction;
        switchInputHWIParams.args     = (void *) INPUT_0_PIN;
        retVal = HwiP_construct(&gGpioHwiObject, &switchInputHWIParams);
        DebugP_assert(retVal == SystemP_SUCCESS);
    }
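
    For reference, I believe HwiP_Params also has a priority field that could be set explicitly instead of relying on the default filled in by HwiP_Params_init(). A sketch (the value 4U is just an illustration; lower values are higher priority on the R5F VIM):

    /* Register pin interrupt with an explicit priority (sketch only) */
    HwiP_Params_init(&switchInputHWIParams);
    switchInputHWIParams.intNum   = Board_getGpioButtonIntrNum();
    switchInputHWIParams.callback = &switchInputInterruptFunction;
    switchInputHWIParams.args     = (void *) INPUT_0_PIN;
    switchInputHWIParams.priority = 4U;   /* illustrative; 0 = highest priority */
    retVal = HwiP_construct(&gGpioHwiObject, &switchInputHWIParams);
    DebugP_assert(retVal == SystemP_SUCCESS);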

  • I've made some progress. I found one bug in my code: I was assigning an incorrect pinNum value to a GPIO output on the remote cores. That fixes one issue, where the bank interrupt stopped working once all the cores were loaded.

    However, I still have the issue where I need to manually pause between unhalting each core. If I unhalt them too close together, or use IpcNotify_syncAll(), then I get HWI abort crashes when the first task of that core attempts to initialise.

    Interestingly, if I use IpcNotify_syncAll() but put a ClockP_sleep(10); at the start of the main task of the usually problematic core, then that core does initialise all its tasks just fine after the sync, but then the usually fine core experiences a HWI abort crash.

    If I use ClockP_sleep(10); to try to automate my manual process of incrementally unhalting cores only once each has finished initialising its tasks, it still crashes and doesn't behave like my manual unhalting procedure. I'm currently unsure how to make this a workable solution, even an inelegant one.

  • Hi Tron,

    Glad that one issue is solved. Regarding the IPC issue, can you clarify the following?

    1. Are all the cores running FreeRTOS?

    2. Where are you placing the IPC Notify SyncAll function in all the cores?

    3. Where are you placing the ClockP_sleep(10) call in all the cores? It would be great if I could see the code snippets of your program.

    Thanks,
    G Kowshik

  • Thanks for your help, Kowshik.

    1. Are all the cores running FreeRTOS?

    Yes, all the R5 cores are running FreeRTOS. The M4 core is unused.

    2. Where are you placing the IPC Notify SyncAll function in all the cores?

    This is the first task that main() starts, and it's pretty much standard across each core. The functions under the // Init shared memory comment set up a collection of structs in shared memory; they don't do anything interesting or complex - I leave the real work, like creating tasks, until after the sync call.

    I've created a configIPC_SYNC flag in a linked header file accessible to each core's application, so I can easily turn the IPC Notify sync on and off across all cores.

    void system_main(void *args)
    {
    
        //ClockP_sleep(10);
    
        /* Open drivers for any peripherals in use */
        Drivers_open();
        Board_driversOpen();
    
        // Init shared memory
        init_peripherals();
        init_sensor_state_map();
        init_policy_map();
        init_error_map();
        init_machine_state();
    
        /* Wait for all cores to synchronise shared memory space */
        if (configIPC_SYNC == 1) {
            DebugP_log("[IPC] Waiting for all cores to init...   ");
            IpcNotify_syncAll(SystemP_WAIT_FOREVER);
            DebugP_log("Done!\r\n");
        }
    
        init_action_tasks();
    
        DebugP_log("\r\n***** Core Initialised! *****\r\n\r\n");
    
        //Board_driversClose();
        /* Drivers_close(); */
    }

    3. Where are you placing the ClockP_sleep(10) call in all the cores? It would be great if I could see the code snippets of your program.

    You can see where I had the sleep function earlier today, commented out at the top of this task.

  • but then the usually fine core experiences a HWI abort crash.

    Hi Tron,

    Once the previously fine core has hit the data abort or prefetch abort, pause the core and go to "LR-4" in the Disassembly view; this reveals which instruction caused the abort. From there, please check what is happening in that function.

    Also, I am a little concerned about this:

    set up a collection of structs in shared memory

    What is the shared memory you're referring to? Is it the IPC mailbox memory? Please let me know.

  • Hi Kowshik.

    I'm not sure it's giving me anything useful when I look at the LR register. It just refers to HwiP_data_abort_handler_c(), not how we got there. Here's a screenshot in case you can see something I can't.

    What is the shared memory you're referring to? Is it the IPC mailbox memory? Please let me know.

    .bss.user_shared_mem as per this. I'm as close to certain as I can get that this isn't the source of the problem, though.

  • look at the LR register

    I guess I asked you to check LR-4, not LR.

    Also, if this method doesn't reveal much, then the only option I have in mind is to use the core trace tool (Arm's native feature), which captures every memory access done by the CPU. This will be very helpful for debugging.

    Please follow this tutorial to dump the data. 

    Basically, the core trace should stop when you experience the data abort, etc.

    https://youtu.be/PXMvAnzA7Vs

  • Hi Kowshik.

    Thanks for your help. I'm not sure how to check LR-4, though.

    In any case, I've narrowed the problem down. It seems to be an issue with the task scheduler and the way I'm using xTaskCreate. For the first argument, of type TaskFunction_t, I'm trying to pass a variable of type TaskFunction_t that I'm reading from shared memory.

    If I explicitly type the function name in place of the variable, or use a normal variable in local memory, then there is no HwiP_prefetch_abort_handler crash. If I do use the shared memory variable but only load one core and allow the tasks to initialise before I try to load a second core, then there is also no problem. The problem only happens if I try to use a TaskFunction_t variable within xTaskCreate while booting multiple cores concurrently.

    This seems to imply some sort of race condition, but nothing ever writes to this variable from any core after initialisation, and I can use other variables from shared memory for the other arguments of xTaskCreate just fine - it's only the TaskFunction_t that is a problem.

    If anyone can shed any light on this it would be greatly appreciated. Here are some snippets of the code:

    /* Per-task bookkeeping kept in the shared structs */
    typedef struct {
        TaskFunction_t functionPointer;
        TaskHandle_t handle;
    } ACTUATOR_TASK;

    /* Peripheral config array placed in the shared, non-cached section */
    volatile peripheral_t   peripheralcfg[MAX_PERIPHERALS] __attribute__((aligned(32), section(".bss.user_shared_mem")));

    /* During shared-memory init: store the task function pointer */
    peripheralcfg[i].actuator.task.functionPointer = vOutputTaskFunction;

    /* Later, when creating tasks: read the pointer back out of shared memory */
    TaskFunction_t taskName = (TaskFunction_t)peripheralcfg[i].actuator.task.functionPointer;

    xTaskCreate(
        (TaskFunction_t)taskName, //vOutputTaskFunction,
        (const char*)peripheralcfg[i].name,
        ACTION_TASK_STACK_SIZE,
        (void*)i,
        priority,
        (TaskHandle_t *)&peripheralcfg[i].actuator.task.handle
    );
    configASSERT(peripheralcfg[i].actuator.task.handle != NULL);
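
    As a sanity check I could also log and assert the pointer just before xTaskCreate on each core (a hypothetical snippet, not in my code yet), to see whether another core has overwritten it with a value that doesn't match this core's own address for vOutputTaskFunction:

    /* Hypothetical check before creating the task: the pointer read back from
     * shared memory should equal this core's address of vOutputTaskFunction
     * (each core's separately linked image may place the function elsewhere). */
    TaskFunction_t taskFn = (TaskFunction_t)peripheralcfg[i].actuator.task.functionPointer;
    DebugP_log("task fn = 0x%08x, &vOutputTaskFunction = 0x%08x\r\n",
               (uint32_t)taskFn, (uint32_t)&vOutputTaskFunction);
    DebugP_assert(taskFn == &vOutputTaskFunction);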

  • Hi Tron,

    Is the shared memory marked as Non-cached in MPU for all cores?

    Also, can you try to check the value of the taskName variable in CCS for each core?

    Regards,

    Ashwin

  • Thanks for your input Ashwin.

    Is the shared memory marked as Non-cached in MPU for all cores?

    Yes.

    Also, can you try to check the value of the taskName variable in CCS for each core?

    I have, and it's exactly as expected. The task even appears in ROV, but it crashes the core when it tries to initialise - I now assume because the task isn't correctly pointing to the intended function code.

  • Hi Tron,

    Can you check the highlighted variable when all cores are running concurrently? This might help to understand if other cores are somehow corrupting the function pointer variable.

    Regards,

    Ashwin