PROCESSOR-SDK-AM64X: MCU+ SDK 08.05 bug due to nested interrupts on R5f

Part Number: PROCESSOR-SDK-AM64X
Other Parts Discussed in Thread: SYSCONFIG

Dear TI team,

The MCU+ SDK 08.05 re-enabled support for nested interrupts. The release notes for 08.05 list MCUSDK-1016 as fixed:

MCUSDK-1016 | Semaphore does not function as expected when "post" call is present in multiple ISRs at different priorities | DPL | 7.3.0 onwards | AM64x, AM243x | Fixed

Since SDK 08.00, nested interrupts have been disabled as a workaround for MCUSDK-1016:

MCUSDK-1016 | Semaphore does not function as expected when "post" call is present in multiple ISRs at different priorities | DPL | 7.3.0 onwards | AM64x, AM243x | Interrupt nesting should be disabled. SDK disables interrupt nesting by default.

Unfortunately it seems that there are still issues with nested interrupts with MCU+ SDK 08.05. Our EtherCAT master (running on AM64x R5f w/ MCU+ SDK) experienced timeouts because the cyclic task didn't run at the expected intervals, and our EtherCAT slave application inexplicably dropped back to SAFEOP.

We created some minimal test cases to debug the issue. Two ISRs that posted two separate semaphores to unblock two tasks didn't cause any problems, even if one ISR preempted the other. However, we found that an interrupt that preempted the FreeRTOS tick interrupt caused severe problems. The test program that shows this consists of:

  • 1 task executing a while loop with only a vTaskDelay(1)
  • 1 task executing a while loop that blocks on a semaphore and then checks how long it's been (via ClockP_getTimeUsec()) since the last run
    • if it's been < 80% or > 120% of the expected period, a warning is output
  • 1 TimerP timer configured via SysConfig for cyclic operation with a 997µs cycle (to make it drift versus the 1000µs tick period) that posts the semaphore from its ISR
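
The period check in the second task can be sketched as follows. This is a minimal, host-compilable sketch and not the code from the test project: the function name `period_out_of_range` and its parameters are hypothetical, and in the real task the timestamps would come from ClockP_getTimeUsec() after the semaphore pend returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of the MCU+ SDK): returns true when the
 * time between two wake-ups is below 80% or above 120% of the expected
 * period. All values are in microseconds. */
bool period_out_of_range(uint64_t last_us, uint64_t now_us, uint64_t expected_us)
{
    assert(expected_us > 0 && now_us >= last_us);

    uint64_t delta_us = now_us - last_us;

    /* Integer arithmetic only: compare delta*100 against 80% and 120%
     * of the expected period to avoid floating point. */
    return (delta_us * 100u < expected_us * 80u) ||
           (delta_us * 100u > expected_us * 120u);
}
```

With an expected period of 997µs, a wake-up that arrives about one cycle late (a delta of roughly 1994µs) falls outside the window and would produce the warning.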

With this test program I've seen two different failures:

  • Often FreeRTOS would crash because it tried to schedule a task with task handle NULL, eventually causing an undefined instruction exception
  • Sometimes the task blocking on the semaphore would wake up ~1 cycle (990-1005µs) too late, despite the processor being otherwise idle

We've implemented a trace feature within the MCU+ SDK that allows us to trace execution of HWIs, task switches and so on, and found that the issues always occurred when the tick interrupt got preempted by our TimerP interrupt.

TI's R5f port's vPortTimerTickHandler calls into xTaskIncrementTick without disabling interrupts. Within xTaskIncrementTick, a lot of code that is obviously not intended to be preempted executes without entering a critical section. FreeRTOS seems underdocumented, and I couldn't find any explicit specification of which functions must only be called with interrupts disabled, but e.g. the Cortex-M3/M4 ports, including the one for the M4 in MCU+ SDK, disable interrupts around xTaskIncrementTick.

After adding portDISABLE_INTERRUPTS() / portENABLE_INTERRUPTS() calls in vPortTimerTickHandler (mcu_plus_sdk_am64x_08_05_00_24\source\kernel\freertos\portable\TI_ARM_CLANG\ARM_CR5F\port.c) we haven't seen any issues in the test program or in our EtherCAT master application. Tests with our EtherCAT slave application are pending, as are further tests with more complex master applications.
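
The shape of that change can be sketched as below. This is a host-compilable model and not the contents of TI's port.c: the two port macros, xTaskIncrementTick and the flags are simplified stand-ins defined here so the sketch compiles on its own, and the real handler body may contain additional checks. Only the placement of the disable/enable pair around the kernel call reflects the change described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Host-side stand-ins (assumptions, NOT the real FreeRTOS/TI definitions). */
bool irq_masked = false;         /* models the IRQ mask state of the core     */
bool tick_ran_unmasked = false;  /* set if the tick code ran while unmasked   */
bool yield_required = false;     /* models the port's "yield required" flag   */

#define portDISABLE_INTERRUPTS() (irq_masked = true)
#define portENABLE_INTERRUPTS()  (irq_masked = false)

/* Stand-in for the kernel function: records whether it could have been
 * preempted and pretends that a context switch is required. */
int xTaskIncrementTick(void)
{
    if (!irq_masked) {
        tick_ran_unmasked = true;
    }
    return 1;
}

/* Tick handler with the workaround applied: the kernel tick processing
 * runs with interrupts masked, so a nested ISR cannot run in the middle. */
void vPortTimerTickHandler(void)
{
    portDISABLE_INTERRUPTS();
    if (xTaskIncrementTick() != 0) {
        yield_required = true;
    }
    portENABLE_INTERRUPTS();
}
```

The trade-off is that any higher-priority interrupt is delayed for as long as xTaskIncrementTick runs, which adds to worst-case interrupt latency.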

  • Is this a known issue and is there a fix from TI available?
  • Are there any known (stability) issues with MCU+ SDK 08.05?
  • Could someone from TI with knowledge about how the R5f port is intended to work verify our findings and whether our proposed fix is a) necessary and b) sufficient?

Best Regards,

Dominic

  • Some feedback on this would be highly welcome.

    The way I understand it every MCU+ SDK 08.05 based R5f application risks running into this issue, and potential consequences are crashes and tasks getting stuck.

    Regards,

    Dominic

  • Dominic,

    Apologies for the delay. I am checking the issue with the dev team; please expect a response within this week.

    Regards

    Anshu

  • Hello Anshu,

    any news here?

    Regards,

    Dominic

  • Dominic,

    The dev team agrees on the issue, and we are reviewing whether the proposed change would introduce any other latencies.

    Would you be able to share a test application so that we can easily reproduce the issue?

    Regards

    Anshu

  • Hello Anshu,

    > the dev team agrees on the issue and we are reviewing if there are any other latencies that would come up because of the change proposed.

    It would be great if you could let me know which conclusion your dev team arrives at.

    > would you be able to share any test application for us to easily reproduce the issue.

    nested_interrupts_r5fss0-0.zip

    I've attached a CCS project for MCU+ SDK 08.05 that exhibits the problem:

    • AM64x EVM SR2.0 HS-FS, booted with SDK 08.05 NULL bootloader from SD card
    • Load & launch the application via CCS
    • If it doesn't crash within ~20 seconds, I reset and launch the application again. The error seems much more likely in the first few seconds, probably related to caches. The time window in which the nested interrupts can cause problems is pretty narrow. I've seen crashes/stuck tasks in about 2-3 out of 10 attempts.
    • If the application doesn't crash early on it sometimes crashes after a few hours.
    • The release version most often fails in a way that the timer0_task isn't being scheduled anymore. When checking via ROV it appears to be gone.
    • The debug version most often fails with an illegal instruction.
    • In a few rare cases the check for the time difference within timer0_task triggers.

    Best Regards,

    Dominic

  • The MCU+ SDK has been tested with interrupt nesting enabled using an LwIP networking use case that has a number of tasks and interrupts, and we have observed no functional issues in our long-duration testing. There are no known issues with interrupt nesting. You have to use the FreeRTOSConfig.h that is part of the MCU+ SDK. In case you are modifying FreeRTOSConfig.h and using your own version, let me know so that I can review your changes.

    There is no official documentation from FreeRTOS that explicitly states xTaskIncrementTick() has to be invoked with IRQs disabled. Looking at the different FreeRTOS ports, I see that some ports disable only the interrupts that invoke xTaskIncrementTick(). You should not hit this condition, as it means the OS tick is delayed by more than 1 ms.

    Disabling IRQs before calling xTaskIncrementTick() will not cause any adverse impact, and if this works reliably for you, you can continue to use this change. We have to root-cause the issue on our side using the test case you have provided before we can confirm whether invoking xTaskIncrementTick() without a critical section is indeed the issue.

    We will work on recreating the issue and debugging it. Depending on how quickly we are able to recreate the issue, we will have an update by Wednesday, March 01, 2023.

    Regards

    Badri

     

  • Hello Badri,

    have you already been able to recreate the issue?

    Best Regards,

    Dominic

  • Dominic,

    Apologies for the late response. The team is still working on this issue.

    Regards

    Anshu

  • Hello Anshu,

    sorry, I accidentally clicked the resolved button, and it seems there's no way to revoke that.

    Is it taking longer because

    • they haven't been able to recreate the issue
    • or because it is taking longer to verify the root cause and validate our proposed fix
    • or because they found that our proposed fix isn't sufficient?

    Regards,

    Dominic

  • Dominic,

    Last I checked, the dev team mentioned that they have a fix after recreating the issue. Let me talk to the dev team so that I can provide more pointed answers. This is an important issue to resolve, and the team has worked on it.

    I also wanted to thank you for the detailed questions you ask. I have noticed your other posts as well. This is very helpful.

    Regards

    Anshu

  • Hello Anshu,

    the one thing I'm really interested in would be if they found any other (related) issues. If that's not the case then I really don't mind them taking whatever time it takes.

    Regards,

    Dominic

  • Dominic,

    I have checked with the developer, and I understand the motivation behind the ask. I will post a reply early next week.

    regards

    Anshu

  • Dominic, the explanation given for the issue is as below:

    If the OS tick scheduler is preempted at the point where it is ready to move a task from the suspended to the ready state, and the preempting ISR invokes a FreeRTOS API that makes another task ready, the scheduler data structure gets corrupted.

    Hope this helps.

    Regards

    Anshu
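
The kind of corruption described in that explanation can be illustrated with a simplified model. The sketch below is not FreeRTOS's list implementation: the `Node` type, the two-step `ready_push` and the hook that plays the nested ISR are all hypothetical and only show how a preempted, half-finished list update can lose a task that the preempting ISR made ready.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified "ready list": a singly linked stack of tasks. */
typedef struct Node {
    int task_id;
    struct Node *next;
} Node;

typedef void (*PreemptHook)(void);

Node *ready_head = NULL;
Node tick_task = { 1, NULL };  /* task made ready by the tick handler */
Node isr_task  = { 2, NULL };  /* task made ready by the nested ISR   */

/* Non-atomic insert: the hook models a nested interrupt that fires
 * between reading the shared head and publishing the new head. */
void ready_push(Node *n, PreemptHook hook)
{
    n->next = ready_head;      /* step 1: read shared state          */
    if (hook != NULL) {
        hook();                /* a nested ISR may run here          */
    }
    ready_head = n;            /* step 2: publish, using stale data  */
}

/* The nested ISR makes another task ready. */
void nested_isr(void)
{
    ready_push(&isr_task, NULL);
}

int ready_count(void)
{
    int count = 0;
    for (Node *n = ready_head; n != NULL; n = n->next) {
        count++;
    }
    return count;
}

/* Runs the tick-side insert with or without preemption in the middle and
 * returns how many tasks end up on the ready list. */
int run_scenario(bool preempt_mid_update)
{
    ready_head = NULL;
    tick_task.next = NULL;
    isr_task.next = NULL;

    ready_push(&tick_task, preempt_mid_update ? nested_isr : NULL);
    if (!preempt_mid_update) {
        nested_isr();          /* ISR runs only after the update is done */
    }
    return ready_count();
}
```

In this model, `run_scenario(false)` leaves both tasks on the list, while `run_scenario(true)` leaves only one: the task inserted by the nested ISR is overwritten by the stale head and never becomes reachable again.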