RTOS/TM4C1294NCPDT: FPU registers corrupted when using in SWI

Part Number: TM4C1294NCPDT
Other Parts Discussed in Thread: SYSBIOS

Tool/software: TI-RTOS

Hello all,

I'm using the FPU in a SWI context (priority 15). When comparing two floats that are always expected to be equal, the comparison occasionally fails (about once in several million comparisons).

Placing a hw breakpoint shows that S0, and sometimes also S1, holds an incorrect value.

I've attached a screenshot where the CPU is halted after the comparison fails.

There, S0 and S1 are loaded from the addresses contained in R0 and R1. Both locations contain the correct value, 30.152075, that is 0x41F13773. However, S0 now reads 0, which is why the breakpoint is hit.
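
For readers who want to check the correspondence between the decimal value and the raw register contents, here is a minimal sketch in C. The helper names `float_to_bits` and `bits_to_float` are hypothetical and not part of TI-RTOS; the sketch assumes `float` is an IEEE 754 single-precision value, as it is on the Cortex-M4F.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the 32-bit word that a debugger would
 * show in an S register holding the float f. */
uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* memcpy avoids aliasing problems */
    return bits;
}

/* Hypothetical helper: interpret a raw 32-bit word as a float. */
float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

With these helpers, `float_to_bits(30.152075f)` gives 0x41F13773, and an all-zero register corresponds to 0.0f.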

Tracing the flow with a homemade RAM logger, I see that the problem occurs only if this SWI function is interrupted by a HWI, which I would not expect to use the floating point unit.

I'm using tirtos_tivac_2_14_00_10.

All I can think of is that the FPU regs are not saved at context switch and something else is using them.

Is there any way to tell the OS to save them when preempting SWIs?

BR.

Lorenzo.

  • Hi Lorenzo,

    The dispatcher pushes d0-d7 as well as the fpscr on Hwi entry and pops them on exit. Is the Hwi managed by the kernel, or is it a zero-latency interrupt? If it is the latter, the ISR code is responsible for preserving the registers.

    Todd
  • Hi Todd,

    thank you for your quick response.

    The Hwi is managed by the kernel. Actually there are 2 interrupting HWIs (INT_CAN0_TM4C129 and INT_CAN1_TM4C129); their priorities are defined below.

    #define HWI_PRIORITY_CAN0INT (3 << 5)
    #define HWI_PRIORITY_CAN1INT (4 << 5)

    There are 2 other HWIs, UART and ADC, constructed at runtime through Hwi_construct(...); they have (7 << 5) and (1 << 5) as priority. So no HWI is a zero-latency interrupt, as Hwi_disablePriority is equal to 32.
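
The reasoning above can be sketched in C. This is only an illustration under stated assumptions: the TM4C129x implements the top 3 bits of each 8-bit interrupt priority field (hence the `<< 5`), and SYS/BIOS treats an interrupt as zero-latency when its priority value is numerically lower than Hwi_disablePriority. The helpers `hwi_priority` and `is_zero_latency` are hypothetical names, not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: only bits 7:5 of the priority byte are implemented. */
#define PRIORITY_SHIFT 5u

/* Hypothetical helper: raw priority byte for a level in 0..7. */
uint8_t hwi_priority(unsigned level)
{
    return (uint8_t)((level & 0x7u) << PRIORITY_SHIFT);
}

/* Hypothetical helper: an interrupt whose priority value is numerically
 * below the disable threshold is never masked by the kernel, which is
 * what makes it "zero-latency". */
bool is_zero_latency(uint8_t priority, uint8_t disable_priority)
{
    return priority < disable_priority;
}
```

With a threshold of 32, the priorities (1 << 5), (3 << 5), (4 << 5) and (7 << 5) are all kernel-managed; only priority 0 would be zero-latency.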

    There is no ISR registered outside TI-RTOS.

    When the breakpoint is hit, and also at any other moment, the FPCCR register shows the ASPEN bit clear. Should it be set, or is that OK?

    I've checked the disassembly of ti_sysbios_family_arm_m3_Hwi_dispatch__I, and the conditional instructions under __TI_VFP_SUPPORT__ are indeed present.
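
For reference, the bits in question can be decoded with a short sketch like the one below. The bit positions (ASPEN = bit 31, LSPEN = bit 30) and the register address come from the ARMv7-M architecture; the function names are hypothetical. A clear ASPEN bit means the hardware does not stack the FP context automatically on exception entry, which would be consistent with a dispatcher that saves d0-d7 and fpscr in software.

```c
#include <stdbool.h>
#include <stdint.h>

#define FPCCR_ADDR   0xE000EF34u   /* Floating-Point Context Control Register */
#define FPCCR_ASPEN  (1u << 31)    /* automatic FP state preservation enable */
#define FPCCR_LSPEN  (1u << 30)    /* lazy FP state preservation enable */

/* Hypothetical helpers operating on a value already read from FPCCR.
 * On the target the value would come from
 *     *(volatile uint32_t *)FPCCR_ADDR
 * or from the debugger's register view. */
bool fpccr_aspen(uint32_t fpccr)
{
    return (fpccr & FPCCR_ASPEN) != 0u;
}

bool fpccr_lspen(uint32_t fpccr)
{
    return (fpccr & FPCCR_LSPEN) != 0u;
}
```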

    I'll try to recap in order to be as clear as possible.

    • The ADC is configured to have conversions triggered by a hw timer.
    • A hwi, associated with INT_ADC1SS1, copies the converted values and posts the swi, which performs a lot of calculations.
    • If a hwi (one of the CAN hwis in my case) interrupts the swi, the swi may resume with some fpu regs corrupted.
    • In any case, the CAN hwi has no code using float.

    One last addendum: I put a trace of S0 and S1 in the CAN HWIs by calling the following 2 functions at function entry and exit:

    .global __get_S0
    __get_S0:
    vmov r0,s0
    bx lr


    .global __get_S1
    __get_S1:
    vmov r0,s1
    bx lr

    Both of them always reported the correct value (0x41F13773 = 30.152075).

    BR

    Lorenzo.

  • Hi Lorenzo,

    First, can you confirm which compiler you are using? From the snapshot it looks like IAR, but please confirm.

    Also, can you bump the priority up from (1<<5) to (2<<5)? I know it should be fine since the Hwi_disablePriority is 32, but it's an easy test:)

    Based on your description, we think your CAN ISR is corrupting the stack. What are the local variables in the ISR? Maybe a buffer is being overwritten and corrupting the D0-D7 registers that are saved on the stack? An easy test is to add a char buffer[32] as the first local variable. Initialize all the elements to some known value (e.g. 0xa5) first thing in the ISR. At the end of the ISR, confirm the buffer values are still that value. Note: you may need to bump up the system stack in case you are close to the top of it. You can check ROV->Hwi->Module to see the peak to see if you are close.
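
A minimal sketch of the guard-buffer test described above, written as plain C so it can be tried off-target. All names (`guard_fill`, `guard_intact`, `example_isr_body`, `corruption`) are hypothetical, and the sketch uses eight 32-bit words rather than a char array. Note that the compiler decides where locals are placed, so declaring the buffer first does not guarantee its position on the stack.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GUARD_WORDS   8u
#define GUARD_PATTERN 0xA5A5A5A5u

/* Set by the check below; a hw breakpoint can be placed on the write. */
volatile bool corruption = false;

/* Fill the guard area with the known pattern. */
void guard_fill(volatile uint32_t *guard, size_t words)
{
    for (size_t i = 0; i < words; ++i) {
        guard[i] = GUARD_PATTERN;
    }
}

/* Return true if every guard word still holds the pattern. */
bool guard_intact(const volatile uint32_t *guard, size_t words)
{
    for (size_t i = 0; i < words; ++i) {
        if (guard[i] != GUARD_PATTERN) {
            return false;
        }
    }
    return true;
}

/* Hypothetical ISR skeleton: fill first thing, check last thing. */
void example_isr_body(void (*work)(void))
{
    volatile uint32_t guard[GUARD_WORDS];

    guard_fill(guard, GUARD_WORDS);
    if (work != NULL) {
        work();                      /* the real interrupt handling */
    }
    if (!guard_intact(guard, GUARD_WORDS)) {
        corruption = true;
    }
}
```

This only detects writes that land in the guard area while the ISR body is running; corruption that happens after the function returns is not caught by the final check.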

    Todd

  • Hi Todd,

    the compiler I'm using is TI ARM v5.2.2; the snapshot is taken from Trace32 Lauterbach which is my debug system.

    I've moved priority from 32 to 64.

    I've added also the 32 bytes guard buffer as first local variable.

    The local variables were:

        CanDrvChannelId            canDrvChannelId = (CanDrvChannelId) arg;
        CanDrvChannelStructPtr canDrvChannelPtr;
        UInt32                              canDrvChannelBase;
        CanMsgObjId                   canMsgObjId;
        UInt32                              canMsgObjInts;

    where:

    /* CAN driver channel identifiers */
    typedef enum {
        CAN_DRV_CHANNEL_PROCESS = 0,
        CAN_DRV_CHANNEL_SERVICE,
    
        CAN_DRV_CHANNEL_ID_MAX
    } CanDrvChannelId, *CanDrvChannelIdPtr;

    CanMsgObjId is another enum ranging from 0 to 31 and
    CanDrvChannelStructPtr is a pointer to a structure containing the base address of the relevant CAN peripheral and some counters that the HWI updates to provide statistics to the rest of the application.

    Of course, I've checked I'm not running out of stack.

    I also added an automatic check on the guard buffer at the end of the HWI. This way I expect to catch the problem with a hw breakpoint where corruption is asserted.

        for (i=0; i<=7;++i) {
            if (buffer[i] != 0xA5A5A5A5) {
                corruption = TRUE;
            }
        }

    The new FW is just up and running. Since a failure typically happens in 4-24 hrs, I'll let you know something more tomorrow.

    Lorenzo.

  • Lorenzo,

    Thanks for the update. Let's hope this reveals something.

    Todd
  • Hi Todd,

    the problem occurred again, please find the attached presentation.

    In my check at the end of the Hwi, the unsigned long buffer[8] was found intact. However, when the problem occurs I see those values as partially corrupted: only 3 words instead of 8. Looking at what is close to the remaining A5 words, I think that the missing 5 are the ones with higher addresses (0x200014D8--0x200014EB). I've recognized TIMER1_BASE in that area.

    At 0x20001530 I see LR pointing to my interrupted SWI (please find the disassembly on the right). That is fully compatible with a corruption of S0-S7 in the stacked FPU context: at this point S1 has yet to be written, while S0 is about to be restored from the corrupted stacked copy.

    I also marked in yellow what in my opinion is the stacked FPU context.

    The current SP is in light blue.

    The only positive side is that the problem occurs systematically within some hours, so I can collect all the data needed once the CPU is halted at the breakpoint.

    Just a remark: I use nested interrupts.

    regards.

    Lorenzo.

    FPU stacking.zip

  • Lorenzo,

    “By the way, when problem occurs I see those values as partially corrupted:”

    I interpret this statement to mean that something within the body of the Hwi function is corrupting the stack.

    Am I understanding this correctly?

    Alan

  • Hello,

    I meant that after control returns to the SWI, the 0xA5A5A5A5 words are corrupted. But I don't think the corruption happens inside the body of the HWI, since that function ends with:

    for (i=0; i<=7;++i) {
        if (buffer[i] != 0xA5A5A5A5) {
            corruption = TRUE;
        }
    }

    and I set a breakpoint on the corruption assignment statement. The problem occurred without stopping there.

    So my guess is that part of the memory is corrupted later, in the dispatcher or somewhere before the SWI is restored. Could the RTOS behave that way if it is not configured correctly?

    Lorenzo.

  • Lorenzo,

    You say you have nested Hwis. Are all of the Hwi functions instrumented with the corruption buffer check? Perhaps the condition only appears when Hwi nesting occurs.

    Alan
  • Lorenzo Verniani said:

    I'm using tirtos_tivac_2_14_00_10.

    All I can think of is that the FPU regs are not saved at context switch and something else is using them.

    You mention that zero-latency interrupts are NOT being used, so this may not help, but does applying the code change listed in the thread "TIVA SYS/BIOS FPU context switch corruption with zero latency interrupts" prevent the problem?

    SYSBIOS-208 was raised for the bug in the referenced thread, which was fixed in SYS/BIOS 6.46.00.23. tirtos_tivac_2_14_00_10 uses SYS/BIOS 6.42.01.20 and so has that bug.

  • Alan DeMars said:
    Lorenzo,

    You say you have nested Hwis. Are all of the Hwi functions instrumented with the corruption buffer check? Perhaps the condition only appears when Hwi nesting occurs.

    Alan

    Alan, in my last test, the guard buffer was inserted only in the CAN Hwis. I'll include it in the ADC and UART Hwis as well.
    Unfortunately I cannot run any test now since my office is closed until 6th January. However, this issue is escalating in my team, so as soon as I get back to work I'll post the results.
    I agree with you that nested irqs should be taken into consideration. How could I investigate that? In your opinion, might disabling interrupt nesting in the SYSBIOS configuration be an effective test, or would it not help to rule anything out?
    Thanks for now.
    Lorenzo.
  • Chester,

    I'll try to apply the change to the Hwi dispatcher and let you know. As I posted earlier, the next test will run on 7th January.
    Regards.

  • Bumping up the thread so it does not get locked.
  • Hello,

    I've inserted the guard buffer in the UART and ADC Hwis as well.

    More importantly, I changed the Hwi dispatcher section that handles FPU register stacking, as suggested by Alan in the post linked by Chester.

    from 

        .if __TI_VFP_SUPPORT__
            vstmdb  {d0-d7}, r1!    ; push vfp scratch regs on appropriate stack
            vmrs    r2, fpscr       ; push fpscr too
            str     r2, [r1, #-8]!  ; (keep even align)
    
            tst     lr, #4          ; context on PSP?
            ite     NE
            msrne   psp, r1         ; update appropriate SP
            moveq   sp, r1
        .endif

    to

        .if __TI_VFP_SUPPORT__
            sub     r2, r1, #72     ; back up by 9*8 bytes
            tst     lr, #4          ; context on PSP?
            ite     NE
            msrne   psp, r2         ; update appropriate SP before pushing
            moveq   sp, r2
            vstmdb  {d0-d7}, r1!    ; push vfp scratch regs on appropriate stack
            vmrs    r2, fpscr       ; push fpscr too
            str     r2, [r1, #-8]!  ; (keep even align)
        .endif

    After more than 28 hrs of testing, the problem has not yet occurred. Typically 6-10 hrs were enough to see it.

    By the way, I still can't figure out what happens: I'm not using zero-latency interrupts. I just have a swi (using the fpu) that is preempted by the 2 CAN hwis.

    I'll continue the test and keep you updated.

    BR

    Lorenzo.

  • Hi Lorenzo,

    Glad to hear it's still running. So nothing has been written into the guard regions, correct?

    Todd
  • Hi Lorenzo,

    I just talked with Alan. We missed which version of SYS/BIOS you were using. Yes, the bug that Chester pointed to is in your version. There is a small race window when a zero-latency or higher-priority interrupt preempts a running interrupt. The fix in the thread is the recommended solution. We have fixed this in newer versions of SYS/BIOS (6.46), but unfortunately there are no TI-RTOS for TivaC versions that have the fix (and while potentially feasible, we don't recommend mixing SYS/BIOS versions with TI-RTOS for TivaC).
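
The race window can be illustrated with a small host-side model in C. This is a simplified sketch, not the dispatcher itself: the stack is an array of words, the preempting interrupt is modeled as writing only a basic 8-word exception frame just below the current SP, and every name (`Stack`, `preempt`, `fp_save_survives`) is hypothetical. In the original order the FP context is stored below an SP that has not yet been lowered, so a preempting interrupt stacks its frame on top of it; in the fixed order the 72 bytes (d0-d7 plus an 8-byte slot for fpscr) are reserved first.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define STACK_WORDS   64u
#define FP_SAVE_WORDS 18u          /* d0-d7 (16 words) + fpscr slot (2 words) = 72 bytes */
#define FRAME_WORDS   8u           /* r0-r3, r12, lr, pc, xpsr */
#define FP_PATTERN    0x41F13773u

typedef struct {
    uint32_t mem[STACK_WORDS];
    uint32_t sp;                   /* index of lowest in-use word; grows down */
} Stack;

/* A preempting exception stacks its frame just below the current SP.
 * The frame is popped on return, but the memory stays overwritten. */
static void preempt(Stack *s)
{
    for (uint32_t i = 1; i <= FRAME_WORDS; ++i) {
        s->mem[s->sp - i] = 0xDEADBEEFu;
    }
}

/* Returns true if the saved FP words survive a preemption that hits
 * after the stores but before the original code updates SP. */
bool fp_save_survives(bool update_sp_first)
{
    Stack s;
    memset(s.mem, 0, sizeof s.mem);
    s.sp = STACK_WORDS;                    /* empty stack */
    uint32_t r1 = s.sp;                    /* working copy of SP, like r1 */

    if (update_sp_first) {
        s.sp = r1 - FP_SAVE_WORDS;         /* fixed order: reserve space first */
    }
    for (uint32_t i = 0; i < FP_SAVE_WORDS; ++i) {
        s.mem[--r1] = FP_PATTERN;          /* store FP context through r1 */
    }
    preempt(&s);                           /* higher-priority interrupt hits here */
    s.sp = r1;                             /* original order: SP updated only now */

    for (uint32_t i = s.sp; i < s.sp + FP_SAVE_WORDS; ++i) {
        if (s.mem[i] != FP_PATTERN) {
            return false;
        }
    }
    return true;
}
```

In this model `fp_save_survives(false)` reports corruption while `fp_save_survives(true)` does not. On real hardware the preempting handler uses more stack than the bare frame, so more of the saved context can be overwritten than the model shows.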

    Given the need to release your product, the best solution is to use the fix you have.

    Todd
  • Hi Todd,

    it's still working perfectly!

    I'm going to keep testing till Monday.

    Then I'll mark the post as resolved.

    Thanks for now.

    P. S. Is there any planned date for releasing a new TI-RTOS for Tiva? ☺️

    Lorenzo.

  • Lorenzo Verniani said:
    it's still working perfectly!

    Great!

    Lorenzo Verniani said:
    Then I'll mark the post as resolved.

    Thanks

    Lorenzo Verniani said:
    P. S. Is there any planned date for releasing a new TI-RTOS for Tiva? ☺️

    Unfortunately no. We don't have any scheduled patch release for TI-RTOS for Tiva. You'll need to maintain the bug fix in the version you have.

    Todd