CMSIS throwing NANs when interrupted

Hi all,

I've started using the CMSIS library (see also: http://e2e.ti.com/support/microcontrollers/tiva_arm/f/908/p/271533/952855.aspx), and everything seems to be working. We are using the arm_rms_f32 function for doing an RMS on an AC voltage and current, and the arm_fir_f32 in an interrupt handler to do a moving-window averaging filter.

We noticed something strange, though: every so often the arm_rms_f32 function throws QNAN, even though the values in the input array look fine. If we put in a breakpoint when a NAN occurs and have it re-calculate the value, the rms function always comes up with the right answer.

During our debugging, I wrapped the function call in IntMasterDisable/IntMasterEnable, and with that in place it worked fine and never threw a NAN. I don't like the disable-interrupts solution; it seems like there's some bigger problem at work here. Since this has to do with the FPU and DSP logic, I tried a solution that the estimable cb1_mobile threw out for another problem we were having and I now call the functions FPUEnable() and FPULazyStackingEnable() at the start of the code, before I enable Sys/BIOS. This had no effect, unfortunately - it looks like the IntMasterDisable is the only way to get it to work.

Anyone know if my problem is that there are FPU instructions being used in both regular code and the interrupt handler, that my FPU-using function (arm_rms_f32) is being interrupted, or something else? Is there any solution besides holding off all interrupts?

Thanks for the help.

-Ben

  • Benjamin Fitzpatrick said:
    every so often the arm_rms_f32 function throws QNAN

    How delightful - SW's version of past, always dreaded, hardware "intermittent!"

    Truth in advertising - have only scanned CMSIS - rely instead upon (one hopes) logic, past related experience - and assorted witchcraft...

    a) Can you improve upon, "every so often?"  Substantial clue emerges should some pattern be recognized - when massaged/coaxed or teased out. 

    b) Might the last value - or past several values - have caused a computational irregularity - thus triggering QNAN?

    c) Might elevating the priority of your, "arm_fir_f32" interrupt reduce and/or eliminate this issue? 

    d) Might the rate of your call to, "arm_rms_f32()" influence the QNAN occurrence? 

    e) Might a "safer" time window exist for your call to, "arm_rms_f32()?"  (i.e. when most "likely" troublesome suspects have just completed)

    f) Might some collision or conflict be occurring during the pack and/or unpack of your (assumed) data buffer?

    g) Use of IntMasterDisable - bit like 105mm Howitzer - when small Colt may suffice!  By selective disabling of say "half" of your potential interrupts - might you be able to narrow the, "offender?"

    h) While not "satisfying" - perhaps upon the detection of, "QNAN" - you could generate a, "re-calculation" - while discarding the QNAN result.

    We were taught that missile may fly when paperwork pile exceeds missile's height.  Perhaps something here warrants inclusion - that pile...

    btw - thanks your kind mention - always satisfying when non-trivial solution can be, "coaxed out..."
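Suggestion (h) can be sketched roughly as below. This is a minimal host-side illustration, not CMSIS code: `rms_fn`, `rms_with_retry` and `stub_rms_nan_first` are hypothetical names, the stub only imitates the intermittent fault, and the retry limit is an arbitrary assumption.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical signature; a thin wrapper around arm_rms_f32 could be made to fit it. */
typedef float (*rms_fn)(const float *src, size_t n);

/* Host-side stand-in for the intermittent fault:
 * NaN on the first call, then a fixed valid value. */
static int stub_calls = 0;
static float stub_rms_nan_first(const float *src, size_t n)
{
    (void)src;
    (void)n;
    stub_calls++;
    return (stub_calls == 1) ? NAN : 1.5f;
}

/* Call fn until it returns something other than NaN, or max_tries runs out.
 * Returns the number of attempts used, or 0 if every attempt produced NaN
 * (in which case *out is left untouched). */
static int rms_with_retry(rms_fn fn, const float *src, size_t n,
                          int max_tries, float *out)
{
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        float r = fn(src, n);
        if (!isnan(r)) {
            *out = r;
            return attempt;
        }
    }
    return 0;
}
```

Logging the returned attempt count would also put a number on "every so often", which is what item (a) asks for.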

     

  • Kindly note - have added to/rearranged above suggestion list.  (09:45 CST - 23 Jun '13.)

  • @cb1-

    One more point to add to your previous list :

    i) What is the amount of stack provided to this application? Calling a floating point routine in an interrupt means some floating point registers are saved on the stack, and the function called adds one more nested level, also floating point. The maximum saved on the stack is 32 registers of 32 bits each. Of course not all of them will be moved onto the stack, but this must be taken into account.

    Petrei
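The stack budget behind point (i) can be estimated with a little arithmetic. The sketch below assumes the standard Cortex-M4F exception frame layout (8 core words, plus 18 words for S0-S15, FPSCR and a reserved word when floating point context is stacked) and ignores each handler's own locals; `fp_stack_bytes` is a hypothetical helper, not a TivaWare or SYS/BIOS function.

```c
#include <stdint.h>

enum {
    BASIC_FRAME_WORDS = 8,   /* R0-R3, R12, LR, PC, xPSR */
    FP_HW_FRAME_WORDS = 18,  /* S0-S15, FPSCR, one reserved word */
    FP_CALLEE_WORDS   = 16,  /* S16-S31, pushed by software only if the handler uses them */
    ALIGN_PAD_WORDS   = 1    /* possible padding word for 8-byte stack alignment */
};

/* Worst-case bytes of stack consumed by exception entry alone,
 * for a given number of nested floating point interrupt levels. */
static uint32_t fp_stack_bytes(uint32_t nesting_levels, int handlers_use_s16_s31)
{
    uint32_t words = BASIC_FRAME_WORDS + FP_HW_FRAME_WORDS + ALIGN_PAD_WORDS;
    if (handlers_use_s16_s31) {
        words += FP_CALLEE_WORDS;
    }
    return nesting_levels * words * 4u;
}
```

With three nested levels that all touch S16-S31 this comes to 516 bytes before any handler locals, so it is worth comparing against the configured stack size.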

  • @Petrei-

    Indeed.  And your suggestion far faster - easier to implement - and test - than this reporter's.  (laundry list)

    That said/acknowledged - review & some/better compliance w/key issues listed - may lead to improved, more robust program execution...  (and - critically - points the way toward solution - which expanded stack does not (should it fail...))

  • Thank you for your advice, both of you.

    To clarify for cb1, we are pretty sure that the 'every so often' occurrence of getting a NaN is that it happens when we get interrupted in the middle of the routine. I have not conclusively proved this, but it fits with the IntMasterDisable making the problem go away.

    I tried the strategy of disabling specific interrupts, but it was not successful. I disabled the interrupts doing floating point math first (this reduced the frequency, but not much), then added in the timer interrupts, then added in the generic processor interrupts (0-15). After that the problem happened very infrequently, but it did still happen. My suspicion is that we have additional interrupts running that I am not aware of, and that any interrupt, whether or not it does floating point math, can cause a NaN from the RMS function. I'm not sure how to get a list of all the interrupts that are enabled.

    I have also tried increasing the stack size, but this didn't seem to have an effect - we still failed after several minutes of operation.

    I tried that again with IntMasterDisable, and was able to run long-term without incident. No question that it is the Howitzer of solutions, but the Colts don't seem to be working right now...

  • Benjamin Fitzpatrick said:
    suspicion is that we have additional interrupts running...of which I'm unaware

    Would not each/every enabled interrupt be listed w/in, "Start-Up" file?  And - have a servicing interrupt - w/in your code?

    Perhaps by creating a code test for "NaN" (w/in your existing code block) - and inserting a break-point therein - cause may better reveal...

  • I'm pretty sure that in CCS5 or at least for the LM4/TM4C line, they have removed the 'start-up' file. For instance, the interrupts 0-15 which seem to be processor interrupts we found in documentation - they do not appear in the app configuration or any other file that we've been able to find. 

    As far as needing code to service the interrupts, all of the interrupts we have code for were disabled, that was the first thing I tried. Yet we still see behavior with all of those interrupts disabled that we do not see when IntMasterDisable is called.

    I did create a test for NaN and put a breakpoint in it - this is how I can tell that IntMasterDisable is working, but that disabling the individual interrupts isn't. However, since the breakpoint seems to be after the NaN is generated, I can't tell what (if any) interrupt is causing it to return.

    I'm not certain if I can instrument the code for the CMSIS library without breaking some of the optimizations they did (I've looked at it and it's pretty arcane), but I suppose that for figuring out where the NaN originates, this is probably the best option. I'll definitely give it a try as soon as I can, but it probably won't be today. Thanks so much for helping cb1, I can't believe TI hasn't hired you to run this place yet...

  • Benjamin Fitzpatrick said:
    I'm pretty sure that in CCS5 or at least for the LM4/TM4C line, they have removed the 'start-up' file. For instance, the interrupts 0-15 which seem to be processor interrupts we found in documentation - they do not appear in the app configuration or any other file that we've been able to find.

    Where are you looking in the app configuration?

    In a Cortex M4F there is a difference between Vector number (offset into the vector table) and interrupt number (bit in NVIC interrupt vector registers)

    Vector numbers 0 to 15, which are Processor Exceptions, don't have entries in the NVIC interrupt registers

    Vector number 16 is interrupt number 0 (bit in NVIC interrupt vector register)

    Vector number 17 is interrupt number 1

    And so on...
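The mapping above, and one way to list which interrupts are enabled, can be sketched as follows. The helper names are hypothetical. On the target, the snapshot words would be read from the NVIC set-enable registers starting at 0xE000E100 (NVIC_EN0 onwards in the TI headers); here they are passed in as a plain array so the sketch runs on a host.

```c
#include <stddef.h>
#include <stdint.h>

#define FIRST_IRQ_VECTOR 16u  /* vectors 0-15 are processor exceptions */

/* NVIC interrupt number for a vector number, or -1 for a processor exception. */
static int vector_to_irq(uint32_t vector)
{
    return (vector < FIRST_IRQ_VECTOR) ? -1 : (int)(vector - FIRST_IRQ_VECTOR);
}

/* Vector number for an NVIC interrupt number. */
static uint32_t irq_to_vector(uint32_t irq)
{
    return irq + FIRST_IRQ_VECTOR;
}

/* Walk a snapshot of the NVIC set-enable words (32 interrupts per word).
 * Stores up to cap enabled interrupt numbers in out and returns how many
 * interrupts are enabled in total. */
static size_t list_enabled_irqs(const uint32_t *iser, size_t n_words,
                                uint16_t *out, size_t cap)
{
    size_t count = 0;
    for (size_t w = 0; w < n_words; w++) {
        for (uint32_t bit = 0; bit < 32u; bit++) {
            if (iser[w] & (UINT32_C(1) << bit)) {
                if (count < cap) {
                    out[count] = (uint16_t)(w * 32u + bit);
                }
                count++;
            }
        }
    }
    return count;
}
```

Note that this only covers interrupt numbers 0 and up. Processor exceptions such as SysTick have no bits in these registers and are enabled through their own control registers.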

  • Benjamin Fitzpatrick said:
    I tried a solution that the estimable cb1_mobile threw out for another problem we were having and I now call the functions FPUEnable() and FPULazyStackingEnable() at the start of the code, before I enable Sys/BIOS.

    I had a look in the debugger on a LM4F120H5QR running a SYS/BIOS project in CCS 5.4:

    1) At processor reset the ASPEN (Automatic State Preservation Enable) and LSPEN (Lazy State Preservation Enable) bits are set in the Floating-Point Context Control (FPCC) register. This is in line with the documented reset values given in the LM4F120H5QR datasheet.

    2) The ASPEN and LSPEN bits are still set when main is called. FPULazyStackingEnable sets the ASPEN and LSPEN bits, so as they are already set at processor reset adding a call to FPULazyStackingEnable won't change the behaviour.

    3) When one of my SYS/BIOS tasks stopped at a breakpoint, i.e. after BIOS_start had been called, the ASPEN and LSPEN bits were shown as cleared. This means automatic floating point state preservation has been disabled, which may explain the problems with CMSIS throwing NANs when interrupted.

    I looked at the SYS/BIOS properties exposed in the CCS GUI and couldn't see any obvious settings which control floating point state preservation, so I will have to read the manuals / study the source code....
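For anyone wanting to repeat the check from code rather than the debugger, a minimal sketch follows. It assumes the register address 0xE000EF34 and the Cortex-M4 bit positions (ASPEN is bit 31, LSPEN is bit 30); `fpcc_decode` and the struct are hypothetical names. The decode takes the raw value as a parameter so it can be exercised on a host.

```c
#include <stdbool.h>
#include <stdint.h>

#define FPCC_ADDR   0xE000EF34u         /* Floating-Point Context Control register */
#define FPCC_ASPEN  (UINT32_C(1) << 31) /* Automatic State Preservation Enable */
#define FPCC_LSPEN  (UINT32_C(1) << 30) /* Lazy State Preservation Enable */

typedef struct {
    bool aspen;
    bool lspen;
} fpcc_flags;

/* Decode a raw FPCC value. On the target the value would come from
 * *(volatile uint32_t *)FPCC_ADDR, read in privileged mode. */
static fpcc_flags fpcc_decode(uint32_t raw)
{
    fpcc_flags f;
    f.aspen = (raw & FPCC_ASPEN) != 0u;
    f.lspen = (raw & FPCC_LSPEN) != 0u;
    return f;
}
```

Calling this once in main and again from a task after BIOS_start should reproduce the set-then-cleared observation in steps 2 and 3.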

  • Chester Gillon said:
    When one of my SYS/BIOS tasks stopped at a breakpoint, i.e. after BIOS_start had been called, the ASPEN and LSPEN bits were shown as cleared. This means automatic floating point state preservation has been disabled, which may explain the problems with CMSIS throwing NANs when interrupted.

    Setting a hardware watchpoint on writes to NVIC_NVIC_FPCC (0xE000EF34) showed that the _ti_sysbios_family_arm_m3_Hwi_initStacks__E function in the bios_6_35_01_29/packages/ti/sysbios/family/arm/m3/Hwi_asm_switch.sv7M source file is clearing the ASPEN and LSPEN bits. The source file comments don't say why the bits are cleared.

    I will add some task floating-point operations to the SYS/BIOS task to try and check for floating point corruption.

  • Again - great effort/detail/awareness by Monsieur Gillon. 

    That said - poster reports, "every so often the arm_rms_f32 function throws QNAN!"

    That "every so often" restriction - at least to this reporter - appears to confound (or be outside) the 2 critical bit discovery you have unveiled...

    Poster holds steady as regards, "IntMasterDisable" preventing this unwanted QNAN.  Perhaps the critical overlap of that fact - w/in your new findings - may prove fruitful...

    Unfortunate that "officialdom" has thus far avoided this "unfruited" plain...

  • They are supposed to release CMSIS for Cortex-M4 in this thread:

    http://e2e.ti.com/support/microcontrollers/tiva_arm/f/908/p/261081/960090.aspx

    Regards

  • PAk SY said:
    They are supposed to release CMSIS for Cortex-M4 in this thread:

    http://e2e.ti.com/support/microcontrollers/tiva_arm/f/908/p/261081/960090.aspx

    I am not sure what is "missing" from the CMSIS support for Cortex-M4. e.g. using CCS 5.4 I have just built the CMSIS library for Cortex-M4 using:

    a) The CMSIS source code v3.01 downloaded from ARM. In this version the ARM provided core_cm3.h, core_cm4.h, core_cm4_simd.h, core_cmFunc.h and core_cmInst.h files already have #if statements to detect use of the TI CCS compiler by checking for the predefined symbol __TMS470__

    b) The TI provided cmsis_ccs.h installed from sw01291

    c) In the CCS project properties, added the predefined symbol ARM_MATH_CM4 to build the CMSIS library for a Cortex-M4.

    The only issue during compilation was that two compiler specific intrinsics required by some of the ARM supplied source files were missing from the TI supplied cmsis_ccs.h. See my post in the Problem compiling the CMSIS library in CCS thread for the suggested additions to cmsis_ccs.h

    Edit: I also ran the CMSIS arm_fft_bin_example and arm_fir_example examples on a LM4F120H5QR and the tests passed - the examples compare the results calculated by the CMSIS functions against an expected result.

  • Benjamin Fitzpatrick said:
    Anyone know if my problem is that there are FPU instructions being used in both regular code and the interrupt handler, that my FPU-using function (arm_rms_f32) is being interrupted, or something else?

    Which version of SYS/BIOS are you using?

    I noticed the following in Bios_6_35_01_29_release_notes.html :

    Defects Fixed in SYS/BIOS 6.35.01.29 GA (Fixes since SYS/BIOS 6.35.00.20):
    ID Headline
    SDOCM00099886 Interrupt dispatcher in M4F targets does not restore FPSCR properly

  • Chester Gillon said:
    I noticed the following in Bios_6_35_01_29_release_notes.html :

    Defects Fixed in SYS/BIOS 6.35.01.29 GA (Fixes since SYS/BIOS 6.35.00.20):
    ID Headline
    SDOCM00099886 Interrupt dispatcher in M4F targets does not restore FPSCR properly

    Comparing bios_6_34_02_18\packages\ti\sysbios\family\arm\m3\Hwi_asm.sv7M (from a CCS 5.3 installation) and bios_6_35_01_29\packages\ti\sysbios\family\arm\m3\Hwi_asm.sv7M (from a CCS 5.4 installation) shows a difference in:

    - How _ti_sysbios_family_arm_m3_Hwi_dispatch__I pops fpscr

    - How _ti_sysbios_family_arm_m3_Hwi_pendSV__I pops fpscr

    Using CCS 5.4.0.00091 and SYS/BIOS 6.35.01.29, I created a project for a LM4F120H5QR which:

    a) Had a "low priority" task which initially called the arm_fir_example_f32 and arm_fft_bin_example_f32 CMSIS examples a number of times.

    b) The "low priority" task then changed to continuously calling arm_fir_example_f32, while a "higher priority" task called arm_fft_bin_example_f32 when activated by a semaphore given from a SYS/BIOS counter function.

    The idea was to call CMSIS examples, which check results from CMSIS functions against expected results, from two SYS/BIOS tasks to see if any errors were detected. Note that the test DIDN'T call the TivaWare FPUEnable or FPULazyStackingEnable functions, so that all the floating point support was handled by SYS/BIOS.

    When compiled using the SYS/BIOS 6.35.01.29 from CCS 5.4 this test ran for 20 minutes without detecting any errors.

    The Hwi_asm.sv7M used was then changed to the version from SYS/BIOS 6.34.02.18, i.e. without the fix for SDOCM00099886. This then caused arm_fir_example_f32 to detect an error on the 2nd iteration. Therefore, the fix made for SDOCM00099886 in SYS/BIOS 6.35.01.29  does appear to fix an error where floating point operations can be affected by interrupts.

  • Poster repeatedly reports, "every so often the arm_rms_f32 function throws QNAN!"

    Diagnosis - minus proper regard of symptoms/case history - concerns... 

  • cb1_mobile said:
    Poster repeatedly reports, "every so often the arm_rms_f32 function throws QNAN!"

    Diagnosis - minus proper regard of symptoms/case history - concerns...

    Is that concern about my attempted diagnosis?

    I agree that since I don't have the actual source code any diagnosis isn't necessarily accurate. However, my line of thinking was:

    1) The reported problem was that if interrupts are enabled then every so often the arm_rms_f32 function throws QNAN.

    2) In the original system I don't know how often the arm_rms_f32 function is called, or how often interrupts occur. The relative rate of the calls to arm_rms_f32 vs. the rate of interrupts may lead to a failure "every so often".

    3) Lacking details for 2, a SYS/BIOS test was created which had two tasks performing CMSIS calculations and checking the results, with regular clock interrupts enabled. The theory was that if there was a SYS/BIOS problem with the floating point context not being saved correctly across task switches or interrupts then the test would highlight the problem.

    With the latest SYS/BIOS 6.35.01.29 no test failures were observed.

    4) The release notes for SYS/BIOS 6.35.01.29 report that this version has corrected the defect "SDOCM00099886 Interrupt dispatcher in M4F targets does not restore FPSCR properly". If the Floating-Point Status and Control Register (FPSCR) is not restored properly after an interrupt then an interrupted floating point operation could return incorrect results.

    When the interrupt dispatcher in the previous SYS/BIOS 6.34.02.18 was used the CMSIS test then detected failures.

    Therefore, my suggestion is for the poster to check which version of SYS/BIOS is in use:

    a) If a version prior to SYS/BIOS 6.35.01.29 is in use, then try upgrading to SYS/BIOS 6.35.01.29 to see if the fix in the interrupt dispatcher makes the "every so often" failure go away without having to disable interrupts around the arm_rms_f32 function.

    b) If SYS/BIOS 6.35.01.29 is in use then the problem is something else, and needs further investigation. 
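The version check in (a) and (b) amounts to comparing dotted version strings field by field. A small sketch, assuming the four-field numbering used in this thread and taking 6.35.01.29 as the first release with the fix, as the quoted release notes state; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    unsigned v[4];
} bios_version;

/* Parse "major.minor.patch.build"; returns false if four fields are not found. */
static bool bios_version_parse(const char *s, bios_version *out)
{
    return sscanf(s, "%u.%u.%u.%u",
                  &out->v[0], &out->v[1], &out->v[2], &out->v[3]) == 4;
}

/* Returns -1, 0 or 1 when a is older than, equal to, or newer than b. */
static int bios_version_cmp(const bios_version *a, const bios_version *b)
{
    for (int i = 0; i < 4; i++) {
        if (a->v[i] != b->v[i]) {
            return (a->v[i] < b->v[i]) ? -1 : 1;
        }
    }
    return 0;
}

/* True if the installed version is at least 6.35.01.29. */
static bool has_fpscr_fix(const char *installed)
{
    const bios_version fixed = { { 6u, 35u, 1u, 29u } };
    bios_version have;
    if (!bios_version_parse(installed, &have)) {
        return false;
    }
    return bios_version_cmp(&have, &fixed) >= 0;
}
```

The sketch expects the dotted form; directory names such as bios_6_35_01_29 would need the underscores converted first.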

  • As past stated here - your work is detailed and very often tightly focused.  And - I've memorialized & appreciated those facts.

    It is the "every so often" (loose end) which does concern.  Unfortunate that op has gone silent - especially in light of your constructive efforts.

    Having some past, professional success wrt remote, tech diagnosis - I reacted to and challenged poster's, "every so often."  And he responded w/out adding requested specificity.  That said - have been at this long enough to suspect that, "every so often" usually ranges between 1 event out of 20, or 50 or even far higher.  (again - that's past experience speaking)

    Data you gathered suggests (to me) that the errant QNAN results would occur at a rate beyond, "every so often." 

    As I stated in opening/answering post - intermittent problems - be they HW or SW - are often the most difficult.  (especially when presented w/imprecision!)

    I am in high agreement that upgrade to newest SYS/BIOS appears to make sense - but remain reluctant to, "bet the farm" on its successful outcome...

    And - myself/others would have found your analysis more compelling if some "interweave was attempted" - better linking the "QNAN randomness" to your findings.  There was none - and this silence/avoidance stood in contrast to the otherwise great detail & effort w/in your report - motivating my comment...

  • Hello everyone,

    Apologies for 'going silent' the last week - I was on vacation, and while your analyses are amazing and thorough, the beach was more attractive at the time ;)

    Now that I'm back on the case, I can report that I am using TI-RTOS 1.1.0.25, which seems to include Sys/BIOS 6.34.04.22 (this is what's in the 'products' folder inside the TI-RTOS folder). This seems to be significantly out of date with your versions, but after checking for an update in CCS I don't see one. I'll prod our TI reps and find out what the story is there.

    As for the 'every so often' part of the problem, I'll try and shed some light, though unfortunately as it's intermittent I can't offer you too many hard numbers. Our application runs a 'main loop' which performs non-safety-critical tasks, and this loop runs as often as it can (it's just a while(true) loop). As interrupts, we process values from the ADCs and perform safety-critical calculations. Our RMS processing function, the one that's throwing QNAN, is part of our main loop but operates on a 1 millisecond delay timer, so it won't process RMS values any more often than once per millisecond. In practice, it's probably fairly close to every millisecond because our processor isn't very busy.

    We have four interrupts, two of which perform floating point operations. One implements our timer code, which we use to implement long-duration timers: a 32-bit timer isn't much good if you want to wait for hours or days, so once a millisecond we 'tick' several timers. Another interrupt just notifies us when a hardware GPIO goes high; this feeds into our state machine but does nothing else, though it can happen at any time. Once per millisecond we also process ADC values and do FIR filter calculations and floating point math on them, and then once every 800 microseconds we process other ADC values and do FIR calculations and floating point math on them. None of these interrupts are synchronized to each other - that is, it's possible for all four of them to fire at the same time just out of happenstance. It's also possible none of them will ever fire at the same time. In addition, I think there are BIOS interrupts which are hidden from me, possibly to implement the timers that fire the interrupts, or possibly ADCs.

    When I have all the interrupts enabled, the QNAN seems to crop up in the code very quickly, I would say that it's rare for the code to run for more than a second or two between throwing a QNAN. Since all but the hardware GPIO interrupts are on a timer, they are happening fairly rapidly, but I can't tell you whether or not they are happening *during* the RMS processing function. As I disable interrupts, including interrupts 0-15 which I don't have code for and in theory don't 'use', the frequency of QNAN becomes less and less, to the point when I have disabled everything I can think of it takes more than a minute for a QNAN to occur. Or, if I disable the master interrupt, QNAN never occurs.

    It is important to note that when I 'disable' interrupts I am only doing that for the duration of the RMS processing function and then immediately re-enabling them. Several folks I have discussed this with here thought I meant I was disabling them completely for the application, but that would be ludicrous since it would mean many important pieces of code would never run, and it would ruin the troubleshooting.
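The "disable only for the duration of the call" pattern can be sketched like this. The two `irq_*_stub` functions are host-side stand-ins, written on the assumption that the real calls would be TivaWare's IntMasterDisable() (which reports whether interrupts were already disabled) and IntMasterEnable(); `call_with_irqs_masked` and `probe_sum_squares` are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Host-side stand-ins for the master interrupt mask. */
static bool g_irq_masked = false;

static bool irq_disable_stub(void)   /* returns true if already masked */
{
    bool was_masked = g_irq_masked;
    g_irq_masked = true;
    return was_masked;
}

static void irq_enable_stub(void)
{
    g_irq_masked = false;
}

typedef float (*guarded_fn)(const float *src, size_t n);

/* Run fn with interrupts masked, then restore the previous mask state
 * so that a caller which had already masked interrupts stays masked. */
static float call_with_irqs_masked(guarded_fn fn, const float *src, size_t n)
{
    bool was_masked = irq_disable_stub();
    float result = fn(src, n);
    if (!was_masked) {
        irq_enable_stub();
    }
    return result;
}

/* Probe routine for the demo: records the mask state it ran under. */
static bool g_seen_masked = false;
static float probe_sum_squares(const float *src, size_t n)
{
    float sum = 0.0f;
    g_seen_masked = g_irq_masked;
    for (size_t i = 0; i < n; i++) {
        sum += src[i] * src[i];
    }
    return sum;
}
```

Restoring the previous state rather than unconditionally re-enabling keeps the wrapper safe if it is ever called from code that already holds interrupts off.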

    Thank you all for your amazing insights and the time you have put into this problem! I never expected so much to be done when I was out of the office or I would have posted something about my vacation. If there's anything else I can tell you about my interrupts occurring or their schedules, or if you have any ideas of how to see whether I am being interrupted or not during the RMS loop, I will gladly provide if I can. I have to say that Chester's solution of FPULazyStacking not being enabled seems to make sense, but I'd like to root-cause it as well.

    Oh, and I did another test right before I left based on something cb1 said - I tried instrumenting the CMSIS library itself to see if I could discover where the QNAN was actually being generated. The code is an unrolled loop, which looks like this:

    for each 'block':

        in = *pointer++;

        sum += in * in;

       // above code x4 because the loop is 4x unrolled

    With the above code, I instrumented both 'in' and 'sum' to see what the values were when the QNAN was generated. It occurred right in the middle of a block, which was not the first nor last block. The 'in' value was 0 (which should not be the case, there were no 0.0 values in the input buffer I handed the function), and the 'sum' value generated from multiplying 0*0 was QNAN. So, it seems something horrible happened instead of the right thing.

  • Welcome back.  Appreciated your instrumenting the code - attempting to coax/tease out some further clue...

    That illegal "in" value of 0 - does this not suggest a faulty pointer?  (however - would expect this to more normally occur @ initial read - not w/in the middle)  Is it possible that someway/somehow - your input buffer was "unstable" during that data read?  Or that one/multiple of your, "Interrupts temporarily disabled" - "poisoned" your pointer?

    Do the reads on each side (i.e. prior to and post) the QNAN event fetch the buffer content properly - or does the process "screech to a halt" - upon QNAN?

    Far earlier - asked if the QNAN (upon detection) could be discarded - and your processing allowed to continue.  (if the pointer had been "spoiled" - seems doubtful that your operation could have "auto-recovered" - but we do not know...)

    I've missed - or the "size" of each "block" - was not presented.  (or presented days ago...kindly provide)

    Might the QNAN's "favoring" - "right in the middle of a block" - signal some overflow - or some regularity - which points toward a solution?

    Suggest that you create a, "secure, fixed buffer" - which disallows any/all updates - and re-run your test.  (eliminates any, "buffer in transition" issues)

    And of course - if newer/better RTOS release exists - do try...

    Update: little experience w/"this" CMSIS or vendor restricted RTOS.  That said - might CMSIS and/or RTOS have "expected" - or been targeted toward - MCUs w/larger SRAM?  As current LX4F and rebrand thus far constrain SRAM to 32KB - might this 32KB be impacting?  (Certain, past LM3S devices far exceeded - your "mid-block" QNAN occurrence triggers this line of thought.)

  • Benjamin Fitzpatrick said:
    I have to say that Chester's solution of FPULazyStacking not being enabled seems to make sense, but I'd like to root-cause it as well.

    I am not sure FPULazyStacking being disabled by SYS/BIOS is actually causing a problem, since SYS/BIOS seems to stack floating point registers itself. Now that I know your application performs floating point operations from interrupts, as well as tasks, I will try and enhance my test program to perform floating point from interrupts as well.

    How are the interrupts configured in your SYS/BIOS:
    - HWI or SWI?
    - What is the interrupt priority?

    [I found that on a Cortex M4, if a HWI interrupt is defined with a priority of less than disablePriority then the interrupt is a "Zero Latency" interrupt handler which is severely restricted in which SYS/BIOS functions can be called from the handler]

  • Haven't had a chance to try cb1's solutions yet, but it looks like I can answer Chester's questions fairly easily....

    • All interrupts are HWI. The SWI module isn't enabled in our app.cfg.
    • The interrupt priority for all interrupts appears to be the default, which seems to be -1. 
    I am unfamiliar with 'disablePriority', but I would also indicate that we don't call any Sys/BIOS functions in our interrupt handlers. Everything we do should be Stellarisware or CMSIS. I wonder if these are also affected if we are in a "Zero Latency" situation, or whether stacking does not work properly there?
    I can certainly change these interrupt priorities and see if it has any effect, I will try this out when I also try cb1's suggestions.
  • Follow on to yesterday's (10:24 CST) suggestions:

    Upon the detection of, "in == 0," (which triggers QNAN) cannot you capture the pointer?  Would not this prove most beneficial?

    Beyond one single pointer capture - might you capture those pointers bit prior to "middle of block" - which you report as "usual occurrence" of this issue?

    Another method - seed a safe buffer (i.e. buffer nowhere else "known/accessed") w/ a regular pattern of valid data.   Review then of your converted data should reveal pointer's "correctness" - up to the QNAN.

    Believe the facts/data you've harvested/presented "justify" bit more probing... 

  • Benjamin Fitzpatrick said:
    but it looks like I can answer Chester's questions fairly easily....
    • All interrupts are HWI. The SWI module isn't enabled in our app.cfg.
    • The interrupt priority for all interrupts appears to be the default, which seems to be -1. 
    I am unfamiliar with 'disablePriority', but I would also indicate that we don't call any Sys/BIOS functions in our interrupt handlers.

    The mention of 'disablePriority' was copied from the SYS/BIOS help without context - 'disablePriority' and interrupt priorities are documented in the CCS help contents for SYS/BIOS -> API reference -> all packages -> ti.sysbios.family.arm.m3 -> HWI. 'disablePriority' defaults to 32 (shown in SYS/BIOS configuration under SYS/BIOS -> Target Specific Support -> M3 (Ducati) Hwi - All Options in the Basic tab). See also SYS/BIOS M3 Hardware Interrupt (Hwi) Handling

    Benjamin Fitzpatrick said:
    I wonder if these are also affected if we are in a "Zero Latency" situation, or whether stacking does not work properly there?
    I can certainly change these interrupt priorities and see if it has any effect

    Setting the default interrupt priority of -1 means a value of 255 which is the lowest priority. This means your interrupt handlers won't be "Zero Latency" and will be able to use the SYS/BIOS APIs documented as callable from a HWI thread. Therefore, I doubt changing the interrupt priorities will correct the problem.

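The priority rule as described in this thread can be written down in a few lines. This is only a model of that description (a configured -1 becomes 255, and a Hwi is "Zero Latency" when its priority is numerically below disablePriority), not SYS/BIOS source; the names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define HWI_DEFAULT_DISABLE_PRIORITY 32u  /* default quoted in this thread */

/* A configured priority of -1 selects the default, 255 (lowest priority). */
static uint8_t hwi_effective_priority(int configured)
{
    return (configured == -1) ? (uint8_t)255u : (uint8_t)configured;
}

/* "Zero Latency" when the priority value is below disablePriority. */
static bool hwi_is_zero_latency(int configured, uint8_t disable_priority)
{
    return hwi_effective_priority(configured) < disable_priority;
}
```

With every Hwi left at -1 and disablePriority at 32, none of the handlers falls into the "Zero Latency" class.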
  • Benjamin Fitzpatrick said:
    I am unfamiliar with 'disablePriority', but I would also indicate that we don't call any Sys/BIOS functions in our interrupt handlers. Everything we do should be Stellarisware or CMSIS. I wonder if these are also affected if we are in a "Zero Latency" situation, or whether stacking does not work properly there?

    I did look at the SYS/BIOS documentation but couldn't find a statement on whether user code is allowed to perform floating point in a Hwi thread. However, in SYSBIOS and VFP/NEON registers in task context I did find this comment from a TI employee:
    We have regression tests that simultaneously perform floating point operations in Hwi, Swi, and Task threads while sweeping the Hwi through all of the Swi and Task code so I'm surprised we didn't catch this problem.

    Perhaps we need to do a specific kind of floating point operation to trigger the problem?
    I.e. calling floating point from a Hwi is supposed to work. Note that this other SYSBIOS and VFP/NEON registers in task context thread was a problem on a Cortex-A8 where the FPSCR (floating-point status and control register) wasn't restored after a Hwi which caused "chaotic bugs in different parts of code, but all of them are concerned with floating-point". This problem which was found in the Cortex-A8 SYS/BIOS appears to be similar to the problem which was fixed in SYS/BIOS 6.35.01.29 for Cortex M3/M4.

  • I have some updates on this issue. I've tried the various suggestions for removing it, except for upgrading TI-RTOS which I will discuss momentarily. In short, none were successful. Here's some data:

    cb1 suggested we look for a 'poisoned pointer'. I recorded three values: The original pointer value, the location of the first NAN, and the location of the last NAN. The system ran for some time before generating a NAN, indicating that there were many cycles where no NAN was seen. After the system halted on a NAN, I checked the stored values (I am using a macro to store the values in the library, putting breakpoints on them is difficult, so I just stored the values off into global integers), and the start pointer, end NAN pointer, and first NAN pointer look reasonable. We're definitely  not walking off the end of my array (which is 24 doubles long, to answer the question of how big this is).

    cb1 also suggested we look at multiple reads and writes, which is why I'm capturing the location of the first and last NANs, to see if they were the same: they were not. This implies that the issue is happening multiple times during one calculation, even though it never happens at all for quite some time. This behavior was consistent, I ran multiple runs and every time it failed it was on more than one NAN in a function. Sadly my efforts to make a count of how many times NAN occurred in the function were unsuccessful.
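The first/last/count instrumentation described above could look something like the sketch below. It is a plain (not unrolled) host-side loop, not the CMSIS source, and `nan_trace` and `sum_squares_traced` are hypothetical names. It checks the running sum after each sample.

```c
#include <math.h>
#include <stddef.h>

typedef struct {
    int first_nan;       /* index where the running sum first became NaN, -1 if never */
    int last_nan;        /* last index at which the running sum was NaN, -1 if never */
    unsigned nan_count;  /* number of iterations that ended with a NaN sum */
} nan_trace;

/* Sum of squares with per-iteration NaN bookkeeping on the running sum. */
static float sum_squares_traced(const float *src, size_t n, nan_trace *t)
{
    float sum = 0.0f;
    t->first_nan = -1;
    t->last_nan = -1;
    t->nan_count = 0;
    for (size_t i = 0; i < n; i++) {
        float in = src[i];
        sum += in * in;
        if (isnan(sum)) {
            if (t->first_nan < 0) {
                t->first_nan = (int)i;
            }
            t->last_nan = (int)i;
            t->nan_count++;
        }
    }
    return sum;
}
```

One thing this makes visible: once the running sum is NaN it stays NaN for every later sample, so if the check is made on `sum`, a first and last location that differ may come from a single corruption event rather than several. Checking `in` separately would help tell the two cases apart.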

    I also tried the suggestion of copying all of the data to a fixed buffer that never changes. This had no effect.

    Another suggestion was that we auto-recover - that is, detect the NAN and re-do the calculation, which would likely work. Fortunately we don't have to do this because disabling the master interrupt around this function only *shouldn't* cause us any problems.

    The only other suggestion, made by both cb1 and Chester was to upgrade to the new BIOS/RTOS. I tried this and the upgrade requires CCS 5.4, which I am going to install now - but I won't have a chance to install this, install the new BIOS, and test before I have to head out for the week. More updates on Monday!

    Thanks yet again for all your assistance.

    -Ben

  • Benjamin Fitzpatrick said:

    Oh, and I did another test right before I left based on something cb1 said - I tried instrumenting the CMSIS library itself to see if I could discover where the QNAN was actually being generated. The code is an unrolled loop, which looks like this:

    for each 'block':

        in = *pointer++;

        sum += in * in;

       // above code x4 because the loop is 4x unrolled

    With the above code, I instrumented both 'in' and 'sum' to see what the values were when the QNAN was generated. It occurred right in the middle of a block, which was not the first nor last block. The 'in' value was 0 (which should not be the case, there were no 0.0 values in the input buffer I handed the function), and the 'sum' value generated from multiplying 0*0 was QNAN. So, it seems something horrible happened instead of the right thing.

    Above - extract from your 08 July post

    I (hope w/some justification) challenged the sanctity of "pointer" (in the above code) - your response today indicates that your tracked pointer values appear w/in reason.  Now you past reported - the harvested "in" value was 0 - both you/I found that unusual.  (3rd participant always silent - in that regard)   If the pointer is correct (or seemingly so) does that not divert suspicion to any/all other manipulations of "in?"  Suspect now that other operations are "in play" - and have yet to arrive - this theater - for analysis...

    Further - does not above code cause sum to retain previous value (sum) - and not become 0 - even if present in is 0?

    One final swag - you report 1mS as call-period for this ailing function.  Temporarily change that to 10mS - see how/if incidence of QNAN is impacted...

  • cb1- said:
    I (hope w/some justification) challenged the sanctity of "pointer" (in the above code) - your response today indicates that your tracked pointer values appear w/in reason.  Now you past reported - the harvested "in" value was 0 - both you/I found that unusual.  (3rd participant always silent - in that regard)

    My understanding of the problem is that when interrupts are disabled for the duration of the CMSIS library call, the program no longer fails. That points to something in interrupt handling corrupting the state of the CMSIS library call that it interrupts. This is why I was suggesting that the SYS/BIOS bug fix to the restoring of the Floating Point Status register should be investigated first, to see if the problem no longer occurs (without having to disable interrupts).

    Also, if the CMSIS library function failed to produce a correct result which wasn't a QNAN, would that be detected as a failure? I.e. have we fully determined the effect of the failure mechanism?

    When the CMSIS library function returns a QNAN result it would be useful to instrument SYS/BIOS to determine which interrupt handler(s) interrupted the CMSIS library function. If only one interrupt handler causes the QNAN result, that would point at something that interrupt handler is doing. If more than one interrupt handler causes the QNAN result, that points at code common to handling all interrupts. However, I am not yet sure if SYS/BIOS can be instrumented in this way.
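
    Without relying on any SYS/BIOS hooks, one possible way to do this in application code is sketched below. All names here (isr_note_entry, rms_call_begin, rms_call_end, isr_blamed_count, MAX_ISR_IDS) are hypothetical, and the sketch assumes each handler of interest can be edited to call isr_note_entry with its own small ID number.

    ```c
    #include <stdint.h>

    #define MAX_ISR_IDS 8u  /* hypothetical number of instrumented handlers */

    /* Nonzero while the task is inside the library call being watched. */
    static volatile uint32_t g_in_rms_call;
    /* Interruptions seen during the current call, per handler ID. */
    static volatile uint32_t g_isr_hits[MAX_ISR_IDS];
    /* Interruptions accumulated over calls that returned NaN, per handler ID. */
    static volatile uint32_t g_isr_blamed[MAX_ISR_IDS];

    /* Call at the top of each instrumented interrupt handler. */
    void isr_note_entry(uint32_t isr_id)
    {
        if (g_in_rms_call != 0u && isr_id < MAX_ISR_IDS) {
            g_isr_hits[isr_id]++;
        }
    }

    /* Call in the task just before the library function. */
    void rms_call_begin(void)
    {
        uint32_t i;

        for (i = 0u; i < MAX_ISR_IDS; i++) {
            g_isr_hits[i] = 0u;
        }
        g_in_rms_call = 1u;
    }

    /* Call in the task just after the library function returns. */
    void rms_call_end(int result_was_nan)
    {
        uint32_t i;

        g_in_rms_call = 0u;
        if (result_was_nan) {
            for (i = 0u; i < MAX_ISR_IDS; i++) {
                g_isr_blamed[i] += g_isr_hits[i];
            }
        }
    }

    /* Number of times handler isr_id interrupted a call that returned NaN. */
    uint32_t isr_blamed_count(uint32_t isr_id)
    {
        return (isr_id < MAX_ISR_IDS) ? g_isr_blamed[isr_id] : 0u;
    }
    ```

    If only one ID ever accumulates a nonzero blamed count, that handler is the suspect. If every ID does, the common interrupt entry/exit code is the more likely cause.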

  • Myself/others believed that poster's report, " "in" value was 0 - and should not have been!" provided a substantial (if not the greatest) clue - and continues to escape (beyond my, singular) comment...

  • I have had another go at creating a SYS/BIOS program using CMSIS which is closer to the original poster's, in that:

    a) arm_rms_f32 is called from a SYS/BIOS task

    b) arm_fir_f32 is called from a SYS/BIOS (timer) interrupt

    The test consists of pseudo-random input data, and "expected" results which are calculated in main before interrupts are enabled and SYS/BIOS is started.

    For arm_rms_f32 the output is tested for an incorrect result, which is classified as either a NAN or not a NAN.

    The same test has been created for both:
    - CCS 5.3 with SYS/BIOS 6.34.02.18
    - CCS 5.4 with SYS/BIOS 6.35.01.29

    Both use the same TI ARM compiler 5.0.5 and CMSIS v3.01 library.

    Running the CCS 5.3 - SYS/BIOS 6.34.02.18 version for one minute on a Stellaris Launchpad produced the following results:

    num_rms_nan_failures 70
    num_rms_non_nan_failures 589444
    num_rms_iterations  676666
    num_fir_failures  40555
    num_fir_iterations  48471

    Running the CCS 5.4 - SYS/BIOS 6.35.01.29 version for one minute produced the following results:

    num_rms_nan_failures 0
    num_rms_non_nan_failures 0
    num_rms_iterations 665219
    num_fir_failures 0
    num_fir_iterations 48576

    My conclusions are:

    1) With SYS/BIOS 6.34.02.18 floating point CMSIS library functions called from task context and a (timer) interrupt context can sometimes return incorrect results.

    2) With SYS/BIOS 6.34.02.18 an incorrect NAN result from arm_rms_f32 occurs far less often than an incorrect non-NAN result.

    3) In going to SYS/BIOS 6.35.01.29 the floating point CMSIS functions were no longer seen to return incorrect results.

    I have attached both CCS test projects for reference:
    3808.CCS_5.3-SYSBIOS-6.34.02.18-sysbois_cmsis_rms.zip
    7762.CCS-5.4-SYSBIOS-6.35.01.29-sysbios_cmsis_rms.zip

  • While somewhat promising - a one minute test does not provide high confidence.  (we have experienced PWM Gen triggered failures of LX4F ADC post 2 Billion such PWM Gen successful triggers...  PWM Gen @ 20 kHz - thus run succeeded for 27.7 hours (continuous) - prior to ADC ceasing (apparently) to respond to trigger.  Operation was noted/tracked via SW counters placed atop PWM interrupt & ADC interrupt.  PWM counter continued - ADC counter froze (along w/every ADC channel's content.)  Data presented to highlight requirement for "adequate" test design & duration.)

    Do not believe we have necessary insight into poster's code/handling and buffer mechanics to really produce meaningful, "apples to apples" comparison. 

    And - as always - would surely benefit from the attempt to execute poster's "in" and "sum" operations - which he proclaimed "failed" - and see if new test succeeds...

    Appears that a very limited probe/test of this SYS/BIOS - CMSIS combination is in play (this thread - certainly not your/my nor poster's job/responsibility.)  Such focused/confined probe/tests cannot be expected to result in great confidence across the full range of this combination's capabilities...

  • cb1- said:
    Do not believe we have necessary insight into poster's code/handling and buffer mechanics to really produce meaningful, "apples to apples" comparison. 

    I don't know the details of the original poster's "buffer mechanics". However, the SYS/BIOS test programs I posted were designed to be as simple as possible, to try to answer the following question from the original post:
    Benjamin Fitzpatrick said:
    Anyone know if my problem is that there are FPU instructions being used in both regular code and the interrupt handler, that my FPU-using function (arm_rms_f32) is being interrupted, or something else? Is there any solution besides holding off all interrupts?
    The test program approach was:
    a) Allocate fixed size global arrays for inputs, initialised with pseudo-random data and never changed during the test.

    b) In main, while interrupts are disabled and before SYS/BIOS is started, call arm_rms_f32 and arm_fir_init_f32/arm_fir_f32 on the pseudo-random input data and save the "expected" results in global variables.

    c) In a SYS/BIOS task repeatedly call arm_rms_f32 on the same input data, performing a binary compare of the "actual" result in the task against the "expected" result stored at initialisation. The task also checks for NAN results by using isnan() from math.h.

    d) In a SYS/BIOS interrupt call arm_fir_init_f32/arm_fir_f32 on the same input data, performing a binary compare on the "actual" result in the interrupt against the "expected" result stored at initialisation.

    e) For simplicity, counts of total test iterations and any failures are stored in global variables which can be inspected in the debugger.
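
    The shape of steps a) to e) can be sketched in portable C as below. This is not the attached test project: mean_square is a hypothetical stand-in for the CMSIS call so that the sketch builds on a host, the pseudo-random generator is an arbitrary choice, and test_init, test_check and test_iteration are illustrative names. Only the counter names follow the ones reported above.

    ```c
    #include <math.h>    /* isnan */
    #include <stdint.h>
    #include <string.h>  /* memcmp */

    #define NUM_SAMPLES 24u

    static float g_input[NUM_SAMPLES];  /* a) fixed input, never changed */
    static float g_expected;            /* b) result computed before the test */

    /* e) counters intended to be inspected in the debugger */
    volatile uint32_t num_rms_iterations;
    volatile uint32_t num_rms_nan_failures;
    volatile uint32_t num_rms_non_nan_failures;

    /* Stand-in for the function under test (mean of squares, no sqrt). */
    static float mean_square(const float *src, uint32_t n)
    {
        float sum = 0.0f;
        uint32_t i;

        for (i = 0u; i < n; i++) {
            sum += src[i] * src[i];
        }
        return sum / (float)n;
    }

    /* a) + b): fill the input and save the expected result. On the target
     * this would run in main before interrupts are enabled. */
    void test_init(void)
    {
        uint32_t seed = 12345u;
        uint32_t i;

        for (i = 0u; i < NUM_SAMPLES; i++) {
            seed = seed * 1664525u + 1013904223u;  /* simple LCG */
            g_input[i] = (float)(seed >> 8) / 16777216.0f * 100.0f + 1.0f;
        }
        g_expected = mean_square(g_input, NUM_SAMPLES);
    }

    /* c): classify one "actual" result using isnan() and a binary compare. */
    void test_check(float actual)
    {
        num_rms_iterations++;
        if (isnan(actual)) {
            num_rms_nan_failures++;
        } else if (memcmp(&actual, &g_expected, sizeof actual) != 0) {
            num_rms_non_nan_failures++;
        }
    }

    /* One pass of the task loop: recompute and check. */
    void test_iteration(void)
    {
        test_check(mean_square(g_input, NUM_SAMPLES));
    }
    ```

    The binary compare is deliberate: the same inputs through the same code should give a bit-identical result, so any difference at all, even in the last bit, counts as a failure.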

    For a comparison, I ran the same test using CCS 5.3 - SYS/BIOS 6.34.02.18 and CCS 5.4 - SYS/BIOS 6.35.01.29, each for an arbitrary period of one minute:

    1) The CCS 5.4 - SYS/BIOS 6.35.01.29 version performed 665219 RMS test iterations with no failures detected, and 48576 FIR test iterations with no failures detected.

    2) The CCS 5.3 - SYS/BIOS 6.34.02.18 version:
    a) Performed 676666 RMS test iterations, with 589514 iterations (87%) having failures. Of the failures, 70 were due to the output being NAN. Inspecting one NAN failure in the debugger showed a 1.#QNAN value (hex 0x7FC00000). Inspecting a non-NAN failure in the debugger showed the "actual" output was 18.66426 (hex 0x41955067) rather than the "expected" 18.6644 (hex 0x419550B0).

    b) Performed 48471 FIR test iterations, with 40555 iterations (83%) having failures. Inspecting a sample failure showed index [7] in the "actual" output had 97.8345 (hex 0x42C3AB43) rather than the "expected" 97.83452 (hex 0x42C3AB46).

    The test was designed to keep the test application, TI ARM compiler version and CMSIS library version the same, and only change the SYS/BIOS version. The fact that changing the SYS/BIOS version caused the failure rate to drop from > 80% to zero, even though it was only over one minute, does seem to hint that SYS/BIOS 6.35.01.29 has prevented some intermittent floating-point problems.

    cb1- said:
    And - as always - would surely benefit from the attempt to execute poster's "in" and "sum" operations - which he proclaimed "failed" - and see if new test succeeds...

    The poster's "in" and "sum" operations are in the source code for the CMSIS arm_rms_f32 function, which was part of my test and is considered a "pass" with SYS/BIOS 6.35.01.29.

    cb1- said:
    While somewhat promising - a one minute test does not provide high confidence.

    Guess I need to find what has changed between SYS/BIOS 6.34.02.18 and 6.35.01.29 to see if that explains why with SYS/BIOS 6.34.02.18:
    - Occasionally arm_rms_f32 returns a NAN (which was detected by the original poster)
    - The majority of the failures in arm_rms_f32 and arm_fir_init_f32/arm_fir_f32 returned a result which differed in a few least significant bits.

  • Chester Gillon said:
    Guess I need to find what has changed between SYS/BIOS 6.34.02.18 and 6.35.01.29 to see if that explains why with SYS/BIOS 6.34.02.18:
    - Occasionally arm_rms_f32 returns a NAN (which was detected by the original poster)
    - The majority of the failures in arm_rms_f32 and arm_fir_init_f32/arm_fir_f32 returned a result which differed in a few least significant bits.

    There was a bug in the Hwi dispatcher for the Cortex-M4F in 6.33.00->6.35.00.   This bug was fixed in 6.35.01.

    SDOCM00099886 -- Interrupt dispatcher in M4F targets does not restore FPSCR properly

    The symptoms you are seeing (intermittent floating point problems when interrupts are enabled) can be root caused to this bug.   You need to update to 6.35.01 or later.

    Thanks,
    -Karl-

  • Chester Gillon said:
    Guess I need to find what has changed between SYS/BIOS 6.34.02.18 and 6.35.01.29 to see if that explains why with SYS/BIOS 6.34.02.18:
    - Occasionally arm_rms_f32 returns a NAN (which was detected by the original poster)
    - The majority of the failures in arm_rms_f32 and arm_fir_init_f32/arm_fir_f32 returned a result which differed in a few least significant bits.

    When running in SYS/BIOS 6.34.02.18 the QNAN output from arm_rms_f32 is generated by the sqrtf() function in the TI ARM compiler run time library:
              sqrtf:
    00004928:   EEB50AC0 FCMPEZS         S0, S0 <-- This actually compares S0 against zero - see Incorrect disassembly for ARM floating point compare instruction in CCS 5.3
    0000492c:   B508     PUSH            {R3, LR}
    0000492e:   EEF1FA10 FMXR            PC, FPSCR
    00004932:   D206     BCS             $C$L1 <-- Should jump when input >= 0.0
    00004934:   2001     MOV             R0, #0x1 <-- If execution reaches here, arm_rms_f32 result is a QNAN
    00004936:   F7FFFE2F BL              _Feraise
    0000493a:   4803     LDR             R0, $C$CON1
    0000493c:   ED900A00 FLDS            S0, [R0, #0] <-- Return a QNAN result
    00004940:   BD08     POP             {R3, PC}
              $C$L1:
    00004942:   EEB10AC0 FSQRTS          S0, S0 <-- Return sqrt(input)
    00004946:   BD08     POP             {R3, PC}
    Previous investigation showed that SYS/BIOS 6.34.02.18 had a bug where the FPSCR was not restored correctly after an interrupt, and that this bug was fixed in 6.35.01.29. The theory for why SYS/BIOS 6.34.02.18 can cause arm_rms_f32 to occasionally return a QNAN is that there is a timing window between the FCMPEZS instruction, which sets the condition flags in the FPSCR from a floating point comparison, and the FMXR instruction, which copies the condition flags from the FPSCR to the APSR. If the FPSCR gets corrupted by an interrupt inside that window, sqrtf thinks it has been called with an invalid argument and so returns a QNAN.

    The FPSCR contains a "Rounding Mode" field which controls how floating point instructions round results. If the FPSCR gets corrupted by an interrupt, that could change the rounding mode, which may explain why both arm_rms_f32 and arm_fir_init_f32/arm_fir_f32 sometimes return results that differ in the least significant few bits.
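
    To make the suspected timing window concrete, here is a small host-side model of the sqrtf entry sequence. It is only a sketch of the theory, not the run time library code: the FPSCR is a plain variable, the "interrupt" is a hypothetical callback (interrupt_hook_t) that may overwrite it between the compare and the flag read, and the overwritten value used in corrupting_interrupt is just an example of a stale register value with C clear.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    #define FPSCR_N         (1u << 31)
    #define FPSCR_Z         (1u << 30)
    #define FPSCR_C         (1u << 29)
    #define FPSCR_V         (1u << 28)
    #define FPSCR_NZCV_MASK 0xF0000000u

    /* Condition flags that a floating point compare of x against 0.0
     * writes to the FPSCR. */
    uint32_t compare_with_zero_flags(float x)
    {
        if (x != x)    { return FPSCR_C | FPSCR_V; }  /* unordered (NaN) */
        if (x == 0.0f) { return FPSCR_Z | FPSCR_C; }  /* equal */
        if (x < 0.0f)  { return FPSCR_N; }            /* less than */
        return FPSCR_C;                               /* greater than */
    }

    /* Hypothetical model of an interrupt arriving inside the window. */
    typedef void (*interrupt_hook_t)(uint32_t *fpscr);

    /* Example hook: a dispatcher that leaves a stale value in the FPSCR. */
    void corrupting_interrupt(uint32_t *fpscr)
    {
        *fpscr = 0x43800014u;  /* N=0 Z=1 C=0 V=0 */
    }

    /* Returns true if the modelled sqrtf would take its QNAN path. */
    bool sqrtf_model_returns_qnan(float x, uint32_t *fpscr, interrupt_hook_t hook)
    {
        /* FCMPEZS S0: compare against zero, flags go to the FPSCR. */
        *fpscr = (*fpscr & ~FPSCR_NZCV_MASK) | compare_with_zero_flags(x);

        /* Window: an interrupt here can change the FPSCR if it is not
         * restored properly on exit. */
        if (hook) {
            hook(fpscr);
        }

        /* FMXR copies the flags out, BCS branches to FSQRTS if C is set.
         * Falling through (C clear) is the QNAN path. */
        return (*fpscr & FPSCR_C) == 0u;
    }
    ```

    With no hook, a positive input always takes the FSQRTS path. With the corrupting hook, the same input falls through to the QNAN path, which matches the symptom of a NAN from perfectly good input data.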

  • Karl Wechsler said:
    There was a bug in the Hwi dispatcher for the Cortex-M4F in 6.33.00->6.35.00.   This bug was fixed in 6.35.01.

    SDOCM00099886 -- Interrupt dispatcher in M4F targets does not restore FPSCR properly

    The symptoms you are seeing (intermittent floating point problems when interrupts are enabled) can be root caused to this bug.   You need to update to 6.35.01 or later.

    Thanks for that, that was the conclusion I had just come to.

    [I don't know if you have read all of this thread, but someone else doubted that failing to restore the FPSCR properly would actually explain the symptoms - so as a debugging exercise I set up a SYS/BIOS test program to investigate]

  • This reporter stands as aforementioned, "dummkopf" - found it hard to accept that such a SW problem would prove intermittent.  Is not CMSIS designed - and reported to be most robust?

    Indeed Chester analyzed and persisted in digging out specific detail - deserves applause.

    Earlier arrival by the cavalry would have lessened work-load - suggests that far better "alert" to known issues -most indicated...  (i.e. "known" issue sat/festered here 5 days shy of one month - unduly exercising we outsiders "attempting to help!")

    At minimum - Chester should be awarded, Verify Answer status...

  • Hi all,

    Thank you so much for the extreme amount of help you have all provided. Chester and cb1, I truly appreciate your attempts to tease additional information out of this problem, especially in light of TI's quite delayed response. I have been waiting to post this until I could verify that the latest TI-RTOS (1.10) actually fixes this problem.

    Sadly, it has taken quite some time as the update to TI-RTOS requires CCS 5.4, and the install process leaves something to be desired. After 2 days unable to build and posting furiously elsewhere on this forum, I am now able to tell you...

    This problem is RESOLVED! The update to TI-RTOS does indeed fix it. I have run for quite some time with no interrupts disabled and no NANs present. Given the data from Chester regarding the nature of the problem, I believe that's enough data to confirm a fix.

    Thank you kindly,

    Ben

  • Just to complete the debugging exercise for the failing test program in SYS/BIOS 6.34.02.18, I hacked together a GEL script to dump the floating point registers since CCS doesn't display the floating point registers for a Stellaris LM4F120H5QR.

    Chester Gillon said:
    The FPSCR contains a "Rounding Mode" field which contains how floating point instructions round results. If the FPSCR gets corrupted by an interrupt that could change the rounding mode and may explain why the results of both arm_rms_f32 and arm_fir_init_f32/arm_fir_f32 sometimes return differences in the least significant few bits

    In the main function FPSCR (shown as FPSR by the GEL script) = 0x20000010, meaning RMode = 0b00 = Round to Nearest (RN) mode. After arm_rms_f32 and arm_fir_init_f32/arm_fir_f32 had reported a result which was different in a few lsbs, FPSCR = 0x2780009B, meaning RMode (bits 23:22) = 0b10, which the ARM encoding defines as Round towards Minus Infinity (RM) mode. Therefore, the corruption of the rounding mode in the FPSCR explains why the results sometimes differed from the expected results in a few lsbs.

    Chester Gillon said:
    The theory for why SYS/BIOS 6.34.02.18 can cause arm_rms_f32 to occasionally return a QNAN is that there is a timing window between the FCMPEZS instruction, which sets the condition flags in the FPSCR from a floating point comparison, and the FMXR instruction, which copies the condition flags from the FPSCR to the APSR. If the FPSCR gets corrupted by an interrupt inside that window, sqrtf thinks it has been called with an invalid argument and so returns a QNAN.

    A breakpoint was set in the sqrtf function when it decided to return a QNAN:
              sqrtf:
    00004928:   EEB50AC0 FCMPEZS         S0, S0
    0000492c:   B508     PUSH            {R3, LR}
    0000492e:   EEF1FA10 FMXR            PC, FPSCR
    00004932:   D206     BCS             $C$L1
    00004934:   2001     MOV             R0, #0x1
    00004936:   F7FFFE2F BL              _Feraise <-- breakpoint set here
    0000493a:   4803     LDR             R0, $C$CON1
    0000493c:   ED900A00 FLDS            S0, [R0, #0]
    00004940:   BD08     POP             {R3, PC}
              $C$L1:
    00004942:   EEB10AC0 FSQRTS          S0, S0
    00004946:   BD08     POP             {R3, PC}
    The floating point register values when the breakpoint was hit:
    CORTEX_M4_0: GEL Output: FPSR = 0x43800014
    CORTEX_M4_0: GEL Output: S0 = 0x43AE2D14 348.3522
    S0 is the input to sqrtf, and given that the expected result of the sqrtf is 18.6644, S0 holds the correct value (allowing for the corruption to the rounding mode). The condition code flags in the captured FPSCR value are N=0, Z=1, C=0, V=0. Because S0 is > 0.0, the floating point comparison should have set the condition flags in the FPSCR to N=0, Z=0, C=1, V=0. The corruption of the condition flags in the FPSCR explains why sqrtf incorrectly thought that the input argument was invalid, and so returned a QNAN.
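
    For anyone wanting to repeat the decoding of the register dumps, a small helper along these lines does the bit extraction. It is a sketch with hypothetical names (fpscr_fields_t, fpscr_decode, fpscr_rmode_name) and covers only the fields discussed in this thread: the N, Z, C, V condition flags in bits 31..28 and the rounding mode in bits 23..22.

    ```c
    #include <stdint.h>

    /* Subset of FPSCR fields relevant to this investigation. */
    typedef struct {
        unsigned n, z, c, v;  /* condition flags, bits 31, 30, 29, 28 */
        unsigned rmode;       /* rounding mode, bits 23..22 */
    } fpscr_fields_t;

    /* Extract the condition flags and rounding mode from a raw value. */
    fpscr_fields_t fpscr_decode(uint32_t fpscr)
    {
        fpscr_fields_t f;

        f.n = (fpscr >> 31) & 1u;
        f.z = (fpscr >> 30) & 1u;
        f.c = (fpscr >> 29) & 1u;
        f.v = (fpscr >> 28) & 1u;
        f.rmode = (fpscr >> 22) & 3u;
        return f;
    }

    /* Name of a rounding mode value, following the ARM RMode encoding. */
    const char *fpscr_rmode_name(unsigned rmode)
    {
        static const char *const names[4] = {
            "RN (round to nearest)",
            "RP (round towards plus infinity)",
            "RM (round towards minus infinity)",
            "RZ (round towards zero)"
        };
        return names[rmode & 3u];
    }
    ```

    Feeding in the three values seen above gives RMode 0 with C=1 for 0x20000010, RMode 2 with C=1 for 0x2780009B, and RMode 2 with Z=1, C=0 for the 0x43800014 captured at the breakpoint.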