Massive RAM Overwrite -- Need Help

Victor Wheeler61

Hello!

I appear to have a spurious (unintended, unserviced) interrupt bringing my AM3358 to its knees, and if nested interrupts are turned on, it overflows the HWI/SWI stack with endlessly nested unserviced interrupts. I need help tracing down what it is and where it is coming from, and possibly how to turn it off (since I don't THINK I turned it on!).

What I'm working with: Win7-64-bit,

Dev Env: CCS 6.1.0, CCS 6.1.2 and CCS 6.1.3 (I started with 6.1.3 and then reverted back to see if it would remedy the problem -- it hasn't.)

Platform: Custom board with MYIR brand MCC-AM335X-Y board with AM3358, 250MB RAM and other electronics that seem to be working perfectly.

Packages: SYS/BIOS 6.45.1.29, UIA 2.0.5.50, AM335x PDK 1.0.3

History:

In order to fill in some of the gaps in the AM335X TRM regarding its touchscreen controller, I did a lot of silicon testing with a small application under RTOS with 2 tasks: Idle (doing no custom steps yet), and UiTask() which I am developing into a user-interface thread with an 800x480 TFT LCD panel. To do this, I basically logged into an 8MB array of bytes in RAM, and then after a certain number of ADC events, I halted touchscreen/ADC activity and tabulated the results. This was very successful and I never ran into any problems other than determining how the touchscreen controller was behaving in certain aspects that weren't fully spelled out in the TRM (the results of this are published in another Forum posting). I captured the interrupt with two HWIs set up to capture Interrupt #16 (TSC_ADC_GEN), and another to capture Interrupt #115 (TSC_ADC_PEN), and both of the (separate) ISRs were well tested, did their thing and exited quickly. (Only loops were to read the FIFOs [2 or 3 deep, depending on the FIFO].) The tests included testing with only #16 enabled, only #115 enabled, and both enabled at the same time (which interestingly produced a cleaner set of interrupts with the PEN UP interrupt being the last of the PEN interrupts, but ONLY when both were enabled -- the other configuration produced a dozen or so spurious PEN-ASYNC and PEN-SYNC interrupts after the PEN-UP interrupt).

Roll forward to adding a great deal of IP in terms of a windowing system (I left the 8MB array there in the .bss section) and adding 14 full-size screen buffers (an array of uint32_t arrays, about 20 MB in size). I kept running into the ROV tool complaining about stack overflow, which didn't seem right because I have no recursive logic and my call-stack depth is maybe 10 at most (and this windowing system has been well tested elsewhere).

I tried increasing stack size to no avail, since this apparently was not the right target. Stacks increased: 1) SYS/BIOS > Runtime > System (Hwi and Swi) stack size, increased to 32768, (Heap size 65536, though I am not using any heap right now -- all RAM is defined statically at the moment.). 2) UiTask() stack size 8192 (should be far more than enough -- I figured I could cut it back when I saw how much unused stack space there was).

I have the HWI module set up to [x] Initialize stack, and [x] Check for stack overflow, and the Dispatcher set up to enable interrupt nesting. This was true during the testing above, though I know for sure my 8MB "log" array was not being overwritten -- more on this below.

Increasing the stack size only seemed to add more delay until the program hit an exception following which the ROV tool showed a large number of errors under ROV > BIOS > Scan for errors. And the first one in the list was, of course, stack overflow. (I believe all that REALLY means is that something has overwritten the initialized (unused) stack values, which COULD be a real stack overflow or a runaway pointer.)
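To make the parenthetical above concrete, here is a minimal sketch of how a fill-pattern ("watermark") stack check typically works. The fill byte 0xBE and all function names are assumptions for illustration, not taken from the SYS/BIOS sources. The point is that the check only sees that the pattern is gone; it cannot tell a genuine overflow from a stray write.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed fill value; check the kernel sources for the real one. */
#define STACK_FILL_BYTE 0xBEu

/* Fill the whole stack buffer with the pattern (done once at startup). */
void stack_init_watermark(uint8_t *stack, size_t size)
{
    memset(stack, STACK_FILL_BYTE, size);
}

/* Count untouched bytes starting at the low end of a downward-growing stack. */
size_t stack_unused_bytes(const uint8_t *stack, size_t size)
{
    size_t n = 0;
    while (n < size && stack[n] == STACK_FILL_BYTE) {
        n++;
    }
    return n;
}

/* "Overflow" is reported when the lowest byte no longer holds the pattern.
 * A runaway pointer that writes there produces the same report. */
int stack_looks_overflowed(const uint8_t *stack, size_t size)
{
    return size == 0 || stack[0] != STACK_FILL_BYTE;
}
```

A check like this explains why a larger stack only delays the error report: the pattern at the bottom survives longer, but whatever is writing downward eventually reaches it.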

I placed a breakpoint in the Interrupt #115 ISR, but observed all this happened and the breakpoint was not being hit!

I started cutting back the program, to less and less code actually being executed, and now I have it back to even SIMPLER than the test environment that I was using to test the touchscreen controller: Idle task and UiTask() that ONLY does the set-up steps for the touchscreen controller and turns it on (with interrupts). And then it goes into this endless loop just for testing purposes (to eliminate other causes). In fact, the following is the TOTALITY of the application in the UiTask() (again Idle task is empty):

	UART_printf("UiTask:  launched.\n");

	TAM_Initialize();  // Initializes touchscreen controller and turns on its interrupts.
        // This is virtually unchanged from successful testing, except PEN-SYNC interrupt is no longer enabled in favor of PEN-ASYNC.
	for (;;) {
		UART_printf("AFTER TOUCH init endless loop...\n");
		Task_sleep(1000);
	}

Just to get things going, I have large delays between touchscreen steps, pretty much the same as when I was testing. Writing into the 8MB array was removed, so that 8MB array is just sitting there.

Revelation 1: I went into RAM to confirm the 8MB array was all 0's, and to my surprise, it wasn't! Filled with some other binary value (all 8 MB!). The behavior is similar to an erring recursive function where the stack grows and overwrites everything -- OR -- something like a run-away pointer in an endless loop, which as you can see (above) is not coming from my application.

Revelation 2: Just before this array is a "count" variable that I was using as an index into the 8MB array during TSC testing. It too was being overwritten. So I set a HARDWARE WATCHPOINT to detect when this value was being written to, and re-tested. Sure enough, the routine that initializes the BSS to 0's wrote a zero to it, and finally, the next time it was overwritten was at the ENTRY POINT TO Hwi.c::Hwi_dispatchIRQC() function! And by time this HW WATCHPOINT stopped the execution in the debugger, all 8 MB of the array had already been overwritten. (Apparently the HWI/SWI stack grows downward.)

Revelation 3: If I SIMPLY UNCHECKED the ENABLE AT STARTUP option for the HWI #115 (the one I'm working with currently), then all of the above problems go away! No stack overflow. No array or RAM being overwritten. And the for (;;) {} loop spelled out above executes forever without any problems. When paused, ROV > BIOS > Scan for errors shows no errors, and the message "AFTER TOUCH init endless loop..." rolls out forever at 1-second intervals.

Revelation 4: On a hunch, I turned back on [x] Enable at Startup for the HWI #115 and UNCHECKED [ ] Enable interrupt nesting. Voila! Suddenly, the breakpoint inside the #115 ISR is now being hit! Yeay! Again, no stack overflow. No array or RAM being overwritten. And the for (;;) {} loop spelled out above executes forever without any problems. When paused, ROV > BIOS > Scan for errors shows no errors.

So it would appear that there is some spurious interrupt that is bringing the AM3358 to its knees (apparently by endlessly nesting an unserviced interrupt).

I see in the Hwi.c::Hwi_dispatchIRQC() function that just before the ISR is called, there is a Log_write5(), then interrupt nesting is enabled, then the ISR is called, and after the ISR there is a Log_write1() call before the rest of the function executes. I am not at all sure where the data from the Log_write5() is going. Is it possible to place a breakpoint in this dispatcher function and look at variables or registers to determine where the trouble is coming from (the dispatcher is optimized), or to track down the contents of the Log_write5() to determine where the spurious interrupt is coming from?

I would really like to get to the bottom of this, because obviously it has cost me a great deal of time up to this point, and in the end, I WOULD like to nest interrupts. Technically, I MAY not need to, but the above indicates I have some interrupt that is firing that I am not aware of, and THAT needs to get handled!

Reverting back into the stack-overflowing configuration, after arming the touchscreen interrupt, the 'irp' argument to the dispatcher function is arriving with the value 0x80011b84 (once = address of TAM_Initialize() function in MAP), and repeatedly 0x8001D9B4 (with the breakpoint in my touchscreen ISR NOT BEING HIT). The closest thing to that value in the MAP is the address of the ti_sysbios_family_arm_a8_intcps_Hwi_dispatchIRQC function at 0x8001d838. So I'm not sure what this means. Return address? I'm hoping it is a clue as to what the spurious (or unintentional, unserviced) interrupt is.

Help!

Kind regards,
Vic

over 9 years ago

0 Victor Wheeler61 over 9 years ago

Expert 2215 points

P.S. Looking at the disassembly: 0x8001d9b4 is the address of the Hwi_disable() call inside the dispatcher.

0 Chester Gillon over 9 years ago

Guru 92251 points

Victor Wheeler61 said:
So it would appear that there is some spurious interrupt that is bringing the AM3358 to its knees (apparently by endlessly nesting an unserviced interrupt).

I haven't attempted to use nested interrupts on an AM335x, but my guess is that if the interrupt handler doesn't clear the interrupt source then with nested interrupts enabled you would encounter the problem of continuous nested interrupts which would lead to a stack overflow.

Can you check that the interrupt handler is clearing the interrupt source.

Also, what rate do you expect the interrupts to be generated at?

0 Victor Wheeler61 over 9 years ago in reply to Chester Gillon

Hi, Chester!

I am assuming that the interrupt handler (SYS/BIOS 6.45.1.29) is indeed clearing the interrupt source, since with nesting turned off, the application has plenty of Task time -- isn't stuck in the ISR. But! It might not be being cleared until after the dispatcher returns!

Expected frequency of interrupts from the TSC_ADC Subsystem: not more than 30-50/sec at peak, and maybe only 10/sec when the screen isn't being touched (handling 3 other ADC channels that are low priority).

My concern is: WHAT IS THE INTERRUPT THAT IS NESTING ITSELF? Do you know how I can find that out?

Interestingly, this code is at the top of the HWI dispatcher:

    /* ignore spurious ints */
    if (Hwi_intc.SIR_IRQ & 0x80000000) {
        Hwi_module->spuriousInts++;
        Hwi_module->lastSpuriousInt = Hwi_intc.SIR_IRQ & 0x7f;
        Hwi_intc.CONTROL = NEW_IRQ_AGR;
        return;
    }

However, since the address being passed in the 'irp' argument is way later in the function than that (where Hwi_disable() is called), it is clear that SOMETHING is causing nesting to occur before this function exits. My attention still rests on finding out what that interrupt is and remedying it (turning it off, or mitigating the problem somehow). Do you know how to tell what interrupt number is pending by looking at the AM335x's registers?

Kind regards,
Vic

0 Chester Gillon over 9 years ago in reply to Victor Wheeler61

Victor Wheeler61 said:
Do you know how to tell what interrupt number is pending by looking at the AM335x's registers?

From a quick look at the AM335x TRM the following registers indicate which IRQ interrupt(s) are pending after masking:

- INTC_PENDING_IRQ0
- INTC_PENDING_IRQ1
- INTC_PENDING_IRQ2
- INTC_PENDING_IRQ3

There is also the ActiveIRQ field in the INTC_SIR_IRQ Register which contains the currently active IRQ interrupt number.
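As a rough illustration of reading those registers, the sketch below decodes snapshots of the four INTC_PENDING_IRQn words and of INTC_SIR_IRQ into interrupt numbers. The base address and offsets are assumptions recalled from the AM335x register map and should be verified against the TRM; the function names are hypothetical. The functions take plain values, so they can be fed numbers copied from a debugger memory view.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed AM335x INTC addresses; verify against the TRM register map. */
#define INTC_BASE            0x48200000u
#define INTC_SIR_IRQ         (INTC_BASE + 0x40u)
#define INTC_PENDING_IRQ(n)  (INTC_BASE + 0x98u + 0x20u * (n))

#define SIR_ACTIVEIRQ_MASK   0x0000007Fu  /* bits 6:0  */
#define SIR_SPURIOUS_MASK    0xFFFFFF80u  /* bits 31:7 */

/* Return the active IRQ number, or -1 if the spurious flag bits are set. */
int sir_active_irq(uint32_t sir_irq)
{
    if (sir_irq & SIR_SPURIOUS_MASK) {
        return -1;
    }
    return (int)(sir_irq & SIR_ACTIVEIRQ_MASK);
}

/* Expand four PENDING_IRQn snapshots into IRQ numbers (word n covers
 * IRQs 32n..32n+31). Writes at most out_len entries, returns the count. */
size_t pending_irq_list(const uint32_t pending[4], uint8_t *out, size_t out_len)
{
    size_t count = 0;
    assert(pending != NULL && out != NULL);
    for (unsigned word = 0; word < 4; word++) {
        for (unsigned bit = 0; bit < 32; bit++) {
            if (((pending[word] >> bit) & 1u) && count < out_len) {
                out[count++] = (uint8_t)(word * 32u + bit);
            }
        }
    }
    return count;
}
```

For the two touchscreen interrupts in this thread, #16 would appear as bit 16 of word 0 and #115 as bit 19 of word 3.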

0 Chester Gillon over 9 years ago in reply to Victor Wheeler61

Victor Wheeler61 said:
I am assuming that the interrupt handler (SYS/BIOS 6.45.1.29) is indeed clearing the interrupt source, since with nesting turned off, the application has plenty of Task time -- isn't stuck in the ISR. But! It might not be being cleared until after the dispatcher returns!

OK, that suggests a possible timing issue.

The AM335x TRM contains the following description for preemptive interrupts:

6.2.3 INTC Preemptive Processing Sequence

Preemptive interrupts, also called nested interrupts, can reduce the latencies for higher priority interrupts. A preemptive ISR can be suspended by a higher priority interrupt. Thus, the higher priority interrupt can be served immediately. Nested interrupts must be used carefully to avoid using corrupted data. Programmers must save corruptible registers and enable IRQ or FIQ at ARM side. IRQ and FIQ processing sequences are quite similar, the differences for the FIQ sequence are shown after a '/' character in the code below.

To enable IRQ/FIQ preemption by higher priority IRQs/FIQs, programers can follow this procedure to write the ISR.

At the beginning of an IRQ/FIQ ISR:
1. Save the ARM critical context registers.
2. Save the INTC_THRESHOLD PRIORITYTHRESHOLD field before modifying it.
3. Read the active interrupt priority in the INTC_IRQ_PRIORITY IRQPRIORITY/INTC_FIQ_PRIORITY FIQPRIORITY field and write it to the PRIORITYTHRESHOLD(1) field.
4. Read the active interrupt number in the INTC_SIR_IRQ[6:0] ACTIVEIRQ/INTC_SIR_FIQ[6:0] ACTIVEFIQ field to identify the interrupt source.
5. Write 1 to the appropriate INTC_CONTROL NEWIRQAGR and (2) NEWFIQAGR bit while an interrupt is still processing to allow only higher priority interrupts to preempt.
6. Because the writes are posted on an Interconnect bus, to be sure that the preceding writes are done before enabling IRQs/FIQs, a Data Synchronization Barrier is used. This operation ensure that the IRQ line is de-asserted before IRQ/FIQ enabling.
7. Enable IRQ/FIQ at ARM side.
8. Jump to the relevant subroutine handler.

Looking at the Hwi_dispatchIRQC function bios_6_45_02_31\packages\ti\sysbios\family\arm\a8\intcps\Hwi.c I can't see a Data Synchronization Barrier (step 6 from the TRM section quoted above) prior to the following line in Hwi_dispatchIRQC which enables nested interrupts:

    if (Hwi_dispatcherAutoNestingSupport) {
        Hwi_enable();
    }

Wondering if the lack of a Data Synchronization Barrier will cause a timing problem leading to the observed behavior. Will try and investigate.
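For reference, steps 2 to 6 of the TRM sequence quoted above could be sketched as follows. This is only an illustration under stated assumptions: intc_view_t is a hypothetical, simplified view of four INTC registers (not the real register layout), the function names are made up, and the barrier uses GCC-style inline assembly. Steps 1, 7 and 8 (context save, enabling IRQs in the CPSR, calling the handler) are left to the caller. Whether the barrier actually matters for the failure in this thread is an open question at this point.

```c
#include <stdint.h>

/* Hypothetical simplified view of the INTC registers used here;
 * the real registers are not laid out contiguously like this. */
typedef struct {
    volatile uint32_t sir_irq;
    volatile uint32_t control;
    volatile uint32_t irq_priority;
    volatile uint32_t threshold;
} intc_view_t;

typedef struct {
    uint32_t saved_threshold;
    uint32_t irq_num;
} nest_ctx_t;

#define INTC_CONTROL_NEWIRQAGR 0x1u

/* Step 6: make sure posted writes have completed before IRQs are enabled. */
void data_sync_barrier(void)
{
#if defined(__GNUC__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
    __asm__ volatile ("dsb sy" ::: "memory");
#elif defined(__GNUC__)
    __asm__ volatile ("" ::: "memory");  /* host build: compiler barrier only */
#endif
}

/* Steps 2 to 6. The caller enables IRQs afterwards (step 7). */
nest_ctx_t nested_irq_prologue(intc_view_t *intc)
{
    nest_ctx_t ctx;
    ctx.saved_threshold = intc->threshold;          /* step 2 */
    intc->threshold = intc->irq_priority & 0x7Fu;   /* step 3 */
    ctx.irq_num = intc->sir_irq & 0x7Fu;            /* step 4 */
    intc->control = INTC_CONTROL_NEWIRQAGR;         /* step 5 */
    data_sync_barrier();                            /* step 6 */
    return ctx;
}

/* After the handler returns and IRQs are disabled again. */
void nested_irq_epilogue(intc_view_t *intc, const nest_ctx_t *ctx)
{
    intc->threshold = ctx->saved_threshold;
}
```

Passing the register view as a pointer keeps the sequence testable against a fake structure on a host machine before it is pointed at real hardware.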

0 Chester Gillon over 9 years ago

Victor Wheeler61 said:
Packages: SYS/BIOS 6.45.1.29, UIA 2.0.5.50, AM335x PDK 1.0.3

Am I correct in assuming that the 1641.01_MCC_DEBUG_2.zip in your other thread https://e2e.ti.com/support/embedded/tirtos/f/355/t/537005 is the example which shows the problem with the nested interrupts?

Having installed AM335x PDK 1.0.3 in ti-processor-sdk-rtos-am335x-evm-03.00.00.04 on a Linux host I was getting linker errors on the 1641.01_MCC_DEBUG_2.zip project due to the TSCADCSetStepDelay and TSCADCTsSetChargeStepDelay functions being undefined. The pdk_am335x_1_0_3/packages/ti/starterware/include/tsc_adc_ss.h has function prototypes for those functions, but the pdk_am335x_1_0_3/packages/ti/starterware/dal/tsc_adc_ss.c is missing the functions.

Did you have to modify the AM335x PDK 1.0.3 to add the TSCADCSetStepDelay and TSCADCTsSetChargeStepDelay functions?

0 Chester Gillon over 9 years ago in reply to Chester Gillon

Chester Gillon said:
Did you have to modify the AM335x PDK 1.0.3 to add the TSCADCSetStepDelay and TSCADCTsSetChargeStepDelay functions?

I found the thread Bug in PDK 3.0 Touch-Screen Controller Driver TSCADCStepFifoConfig() function which describes the problem. Applying the corrections detailed in that thread allowed me to compile the 01_MCC_DEBUG_2_for_sasha example project.

Using CCS 6.1.3 to run the 01_MCC_DEBUG_2_for_sasha example project on a TMDSSK3358 I can repeat the problem that the program crashes due to the interrupt handler constantly handling nested interrupts.

Chester Gillon said:
Wondering if the lack of a Data Synchronization Barrier will cause a timing problem leading to the observed behavior. Will try and investigate.

Tried adding a Data Synchronization Barrier to the Hwi_dispatchIRQC() function prior to the Hwi_enable() call but didn't change the failure symptoms. For a Data Synchronization Barrier tried:

a) A "Drain write buffer" as suggested by the AM335x TRM

b) A DSB instruction

Victor Wheeler61 said:
Do you know how to tell what interrupt number is pending by looking at the AM335x's registers?

I tried to use the Hardware Trace Analyzer to capture the sequence of interrupt numbers handled by the Hwi_dispatchIRQC() function, but wasn't able to achieve that - see CCS 6.1.3 PC trace for a Cortex-A8 doesn't record any data values, when select to trace Data

Therefore, added some software tracing to the Hwi_dispatchIRQC() function:

/*
 *  ======== Hwi_dispatchIRQC ========
 *  Configurable IRQ interrupt dispatcher.
 */
#define MAX_NEST_DEPTH 50
Void Hwi_dispatchIRQC(Hwi_Irp irp)
{
    /*
     * Enough room is reserved above the isr stack to handle
     * as many as 16 32-bit stack resident local variables.
     * This space is reserved for the Swi scheduler.
     *
     * If the swi scheduler requires more than this, you must
     * handle this in Hwi_Module_startup().
     */

    Hwi_Object *hwi;
    BIOS_ThreadType prevThreadType;
    UInt intNum;
    Int swiKey;
    UInt32 oldThreshold;
    static UInt32 nest_depth = 0;
    static UInt32 intc_SIR_IRQ_history[MAX_NEST_DEPTH];
#ifndef ti_sysbios_hal_Hwi_DISABLE_ALL_HOOKS
    Int i;
#endif

    /* ignore spurious ints */
    if (Hwi_intc.SIR_IRQ & 0x80000000) {
        Hwi_module->spuriousInts++;
        Hwi_module->lastSpuriousInt = Hwi_intc.SIR_IRQ & 0x7f;
        Hwi_intc.CONTROL = NEW_IRQ_AGR;
        return;
    }

    /* save irp for ROV call stack view */
    Hwi_module->irp = irp;

    if (Hwi_dispatcherSwiSupport) {
        swiKey = SWI_DISABLE();
    }

    /* set thread type to Hwi */
    prevThreadType = BIOS_setThreadType(BIOS_ThreadType_Hwi);

    /* Process only this pending interrupt */
    intNum = Hwi_intc.SIR_IRQ;        /* get current int num */

    intc_SIR_IRQ_history[nest_depth] = intNum;
    nest_depth++;
    if (nest_depth == MAX_NEST_DEPTH)
    {
        for (;;)
        {   /* Set breakpoint here to trap when the maximum nesting depth is reached */
        }
    }

    /* remember previous priority threshold */
    oldThreshold = Hwi_intc.THRESHOLD;

    /* set the threshold to this interrupt's priority */
    Hwi_intc.THRESHOLD = Hwi_intc.IRQ_PRIORITY & (Hwi_NUM_PRIORITIES - 1);

    /* clear this interrupt, force a re-sort, and allow new ones in */
    Hwi_intc.CONTROL = NEW_IRQ_AGR;

    hwi = Hwi_module->dispatchTable[intNum];

    hwi->irp = Hwi_module->irp;

#ifndef ti_sysbios_hal_Hwi_DISABLE_ALL_HOOKS
    /* call the begin hooks */
    for (i = 0; i < Hwi_hooks.length; i++) {
        if (Hwi_hooks.elem[i].beginFxn != NULL) {
            Hwi_hooks.elem[i].beginFxn((IHwi_Handle)hwi);
        }
    }
#endif

    Log_write5(Hwi_LM_begin, (IArg)hwi, (IArg)hwi->fxn,
               (IArg)prevThreadType, (IArg)intNum, hwi->irp);

    if (Hwi_dispatcherAutoNestingSupport) {
        /* Attempts to ensure preceding writes to the INTC registers are done before re-enabling IRQs.
         * Commented out as didn't change the failure symptoms. */
        //asm volatile (" DSB;");
        /* Flush write buffer */
        //asm volatile (" MOV R0, #0;"
        //              "MCR P15, #0, R0, C7, C10, #4;");
        Hwi_enable();
    }

    /* call user's ISR func */
    (hwi->fxn)(hwi->arg);

    Hwi_disable();
    nest_depth--;

    Log_write1(Hwi_LD_end, (IArg)hwi);

    /* restore previous threshold priority */
    Hwi_intc.THRESHOLD = oldThreshold;

#ifndef ti_sysbios_hal_Hwi_DISABLE_ALL_HOOKS
    /* call the end hooks */
    for (i = 0; i < Hwi_hooks.length; i++) {
        if (Hwi_hooks.elem[i].endFxn != NULL) {
            Hwi_hooks.elem[i].endFxn((IHwi_Handle)hwi);
        }
    }
#endif

    /* Run Swi scheduler */
    if (Hwi_dispatcherSwiSupport) {
        SWI_RESTORE(swiKey);
    }

    /* restore thread type */
    BIOS_setThreadType(prevThreadType);
}

The intc_SIR_IRQ_history array stores which interrupt numbers have been nested, and there is a for loop which halts execution once nesting has run-away. This showed that all 50 entries in the intc_SIR_IRQ_history array were interrupt number 115.

0 Victor Wheeler61 over 9 years ago in reply to Chester Gillon

Hi, Chester!

Sorry for the delay getting back to you. After you sent the 2nd-to-last message, I studied the first 2/3 of the Interrupts section of the TRM. Very orienting!

To answer your question, any MCC_DEBUG_2 project I posted here has been gutted of intellectual property, because it is publicly downloadable and I can't have our competition seeing what we are doing. However, given that Sasha noted stack overflows when he ran the version I gave him for demonstrating a different problem (when he ran it past main()), I would say that is indeed very likely to be a demonstration of the problem.

Very astute observation about the data barrier! I was just about to scour the TI-RTOS design for its interrupt handling and you beat me to it!

Re AM335x PDK 1.0.3, TSCADCSetStepDelay and TSCADCTsSetChargeStepDelay functions, yes again! (The link problem you ran into is caused by a bug in the released Starterware code for the TSC_ADC DAL code, and in the process of testing, I found another bug in the code and corrected THAT as well. Plus I brought these bugs to the attention of the Starterware group through their Forum when I found them and corrected them on my system.) In case you need to modify yours, here are the mods:

//<pdk>\packages\ti\starterware\dal\tsc_adc_ss.c
//  line 380:
        HW_WR_FIELD32((baseAddr + ADC0_FIFOTHR(fifoSel - 1)),    // orig
        HW_WR_FIELD32((baseAddr + ADC0_FIFOTHR(fifoSel)),        // new
//  line 387:
        HW_WR_FIELD32((baseAddr + ADC0_DMAREQ(fifoSel - 1)),     // orig
        HW_WR_FIELD32((baseAddr + ADC0_DMAREQ(fifoSel)),         // new

//<pdk>\packages\ti\starterware\include\tsc_adc_ss.h
//  insert these 2 lines near top (I inserted mine before line 104):
#define TSCADCTsSetChargeStepDelay           TSCADCTsChargeStepDelay /* vw: this is the function name in the .C file. */
#define TSCADCSetStepDelay                   TSCADCStepDelayConfig   /* vw: this is the function name in the .C file. */

Then I re-built my DAL library (among other things) like this:

> cd <pdk>\packages\

> pdksetupenv.bat

> cd ti\starterware

Then mimic the <pdk>\packages\ti\starterware\build\release_am335x.sh by

> gmake PROFILE=debug PLATFORM=am335x-evm -s KW_BUILD=no

and

> gmake PROFILE=release PLATFORM=am335x-evm -s KW_BUILD=no

My next task in this matter is

1) re-create the problem (turn nested interrupts back on and pause it in the middle of overwriting everything in RAM by setting a HARDWARE WATCHPOINT on the variable I mentioned above).

2) Check out the Cortex-A8's INTC module registers to find out if it is the SAME interrupt that is nesting itself (which could point to data barrier missing!).

Further comment:

It happens that through the APP.CFG user interface I set MaskingOption_ALL for both interrupt #16 and #115 and was having this problem DESPITE that, so one of the things I am going to be scouring for is: is that actually implemented? And if so, how early in the sequence is it implemented? In fact, now that I am looking at the code, I am NOT seeing that implemented! Instead I am seeing that what gets written into the THRESHOLD register is ALWAYS coming from the IRQ_PRIORITY register, whereas it should be something more like this pseudocode:

SavedThreshold = THRESHOLD;

If This_HWI->MASK_ALL then
    THRESHOLD = 0xFF;   // disable nesting for this HWI
    Call User ISR;
else if This_HWI->MASK_NONE then
    THRESHOLD = 0; // I don't know why but this is in the APP.CFG user interface -- I think I wouldn't implement this.
    // I think I would only offer "regular" or "nested", because I can't think of a situation where this would be useful.
    etc.
else
    // Nesting supported -- Mask current priority or lower.
    THRESHOLD = IRQ_PRIORITY;
    CONTROL = NEW_IRQ_AGR;
    data barrier; <<<<<<<<<< currently missing
    Hwi_enable(); // Translates to _enable_IRQ(); = CPSR[7] = 0 (CPU IRQs enabled);
    // Higher prio interrupts now armed.
    Call User ISR;
    Hwi_disable(); // CPSR[7] = 1 (CPU disable IRQs); Nesting ended for clean-up.
endif

THRESHOLD = SavedThreshold;

Instead what I'm seeing makes me think this whole routine may need a logic re-work. The MASKING option for each interrupt is not implemented (!) in the ...family\arm\a8\intcps\Hwi.c code (though it may be implemented for others :-( ), but the single checkbox [x] Enable Interrupt Nesting IS IMPLEMENTED! By way of the choice of population of the THRESHOLD register and

    if (Hwi_dispatcherAutoNestingSupport) {
        Hwi_enable();
    }

just before call to the user's ISR!

I am in agreement with you:

The data barrier is missing (dangerous and could cause what I am experiencing for nested interrupts).

I have 3 questions, the first 2 about the TRM section 6.2.3 (Preemptive Processing Sequence) you quoted above:

1. Am I understanding correctly that the sequence:

    THRESHOLD = IRQ_PRIORITY;
    CONTROL = NEW_IRQ_AGR;
    DATA BARRIER;

is going to turn off the IRQ line from the INTC to the Cortex-A8? Or should it be:

    THRESHOLD = IRQ_PRIORITY;
    DATA BARRIER;   // Forces INTC IRQ output to be turned off
    CONTROL = NEW_IRQ_AGR;   // Forces an INTC re-sort, possibly re-activating the INTC IRQ output?

2. In that same TRM section, list item #7 states:

"Enable IRQ/FIQ at ARM side."

Is that in reference to bits 6 and 7 of the CPSR register?

3. I'm trying to trace down this whole IRQ sequence. It would appear that something forces the CPU's PC to 0x00000018 (IRQ branch), so some BRANCH instruction is probably positioned at that address, and that gives control to ?

It looks like something winds up giving control to the

ti_sysbios_family_arm_a8_intcps_Hwi_dispatchIRQ__I

function in <bios>\packages\ti\sysbios\family\arm\a8\intcps\Hwi_asm_gnu.sv7A (an assembly file),
which in turn does the TI-RTOS housekeeping around calling the

ti_sysbios_family_arm_a8_intcps_Hwi_dispatchIRQC__I

function which by some #defines winds us up in the Hwi_dispatchIRQC() function in Hwi.c.

Do I have this right so far? Do you know what the gap is between instruction address 0x00000018 (hardware IRQ branch) and the ti_sysbios_family_arm_a8_intcps_Hwi_dispatchIRQ__I assembly function?

Continuing to investigate (re-create the problem and look at what IRQ number is active when the stack-overflow is happening, and read the last 1/3 of the Interrupts section of the TRM) ....

Will report here when I have found more.

Kind regards,
Vic

0 Victor Wheeler61 over 9 years ago in reply to Chester Gillon

Okay, Chester!

We're on a HOT trail here!

Now that I have studied 2/3 of the TRM's Interrupts section, I recognized some things that should not be. Here is the data trail thus far:

1. First off, I re-created the problem and sure enough, while the HWI stack is overflowing, the INTC_SIR_IRQ = 115 (touchscreen -- the one I just enabled), and INTC_IRQ_PRIORITY is 0x00! Really??? I don't remember setting any 0-priority (highest priority) interrupts, so I checked a little further.

2. In the APP.CFG user interface, I originally (days ago) wanted a LOW interrupt priority and so I selected interrupt priority 64 for the #115 interrupt. I see now from some notes I took that I (incorrectly) "remembered" the lowest interrupt priority value from the TRM as 0x7F (127), so I thought I was hitting "right in the middle" with 64. However, the TRUTH is that the TRM clearly says the lowest valid interrupt priority value is 0x3F (63)! Big difference. And I know now that I changed this priority AFTER I had done all my successful testing on the Touchscreen Controller, not having a clue that I had changed it to an out-of-bounds value!
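One plausible explanation for a configured priority of 64 reading back as 0 is simple field truncation. The sketch below assumes the priority lives in a 6-bit field at bits 7:2 of the per-interrupt INTC_ILRn register and that the configured value is shifted into place with the excess bits dropped; both the field position and the function name are assumptions for illustration, not confirmed from the SYS/BIOS sources.

```c
#include <stdint.h>

#define INTC_ILR_PRIORITY_SHIFT 2u
#define INTC_ILR_PRIORITY_MASK  0xFCu  /* bits 7:2, assumed 6-bit priority field */

/* Priority the hardware would report if the requested value were shifted
 * into the ILR field and the bits that do not fit were discarded. */
uint32_t ilr_effective_priority(uint32_t requested)
{
    uint32_t ilr = (requested << INTC_ILR_PRIORITY_SHIFT) & INTC_ILR_PRIORITY_MASK;
    return ilr >> INTC_ILR_PRIORITY_SHIFT;
}
```

Under that assumption any multiple of 64 collapses to priority 0, the highest priority, which matches the INTC_IRQ_PRIORITY value of 0x00 observed in item 1.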

On a hunch, I changed this to priority 31 and put a breakpoint in my interrupt #115 ISR and re-tried my test.

TWO THINGS CHANGED:

A. The stack overflow is no longer happening, and

B. INTC_SIR_IRQ is (as expected) 115, but the INTC_IRQ_PRIORITY is now 31! Not 0!

PARTIAL OBSERVATIONS:

- I THINK you were right that there was a problem. I believe an interrupt SHOULD be able to be set as priority 0 and should not cause a stack overflow!

- I am HOPING that missing DATA BARRIER before IRQs are re-enabled in the Cortex-A8 is relevant. Otherwise, why would an interrupt #115 be nesting itself even after the THRESHOLD register was set to INTC_IRQ_PRIORITY which was.... 0?

(*late realization*) The Threshold register was being set like this in HWI.C IRQ dispatcher:

THRESHOLD = INTC_IRQ_PRIORITY;

and so it was being set to 0! In other words, no THRESHOLD at all, and the #115 interrupt hadn't been serviced, so ANY as-yet-unserviced interrupt would be nesting itself (overflowing the HWI stack) under those circumstances.....

-----------------------------------
CAUSE OF PROBLEM
-----------------------------------

If the THRESHOLD is being set to 0, no nesting attempts are going to work right.

To quote the TRM from section 6.1.1.2.2 Priority Masking: "...priority 0 can never be masked by this threshold; a priority threshold of 0 is [otherwise] treated the same way as priority 1. ... 0x7F is the lowest priority."

So any priority other than 0 (e.g. 1) would have caused THRESHOLD = INTC_IRQ_PRIORITY; to "mask" the current-priority interrupt, allowing only higher-priority interrupts through. However, the (unintended) value of 0, with its UNMASKABLE characteristic, is the cause of this problem.
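The mechanism can be illustrated with a small host-side model. It is a deliberately simplified sketch, not the INTC's actual logic: it assumes an interrupt passes when its priority value is numerically below the threshold, that priority 0 always passes, and that a threshold of 0xFF disables masking. It also assumes a level-sensitive source that stays asserted because the user ISR, which would clear it, has not run yet. The function names are hypothetical.

```c
#include <stdint.h>

#define THRESHOLD_DISABLED 0xFFu  /* assumed "no masking" value */

/* Simplified model of the priority-threshold rule quoted from the TRM. */
int irq_passes_threshold(uint32_t priority, uint32_t threshold)
{
    if (threshold == THRESHOLD_DISABLED) {
        return 1;
    }
    if (priority == 0u) {
        return 1;  /* priority 0 can never be masked */
    }
    return priority < threshold;
}

/* Count how many times the dispatcher is entered before the user ISR
 * gets a chance to run, capped at max_depth. */
unsigned simulate_nest_depth(uint32_t priority, unsigned max_depth)
{
    uint32_t threshold = THRESHOLD_DISABLED;
    unsigned depth = 0;

    /* The source is still asserted on every pass through this loop. */
    while (depth < max_depth && irq_passes_threshold(priority, threshold)) {
        depth++;
        threshold = priority;  /* THRESHOLD = IRQ_PRIORITY, then IRQs re-enabled */
    }
    return depth;
}
```

In this model a priority-1 or priority-31 interrupt enters the dispatcher once and is then masked by its own threshold, while a priority-0 interrupt re-enters until the cap is reached, which is the runaway nesting seen on the HWI stack.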

To confirm: I just re-tested the #115 interrupt with interrupt priority 1, and it does not cause a stack overflow. And interrupt priority 0 is NOT COMPATIBLE WITH THE HWI.C IRQ dispatcher code's implementation of INTERRUPT NESTING! In fact, I might go as far to say that any PRIORITY 0 interrupt is not compatible with interrupt nesting....

Either A) never use these together, or B) code the IRQ dispatcher such that it does not attempt to nest when INTC_IRQ_PRIORITY == 0.

As a programmer and designer, I would probably opt for (A), not (B). Here's why: I abhor too much "defensive programming", as it adds up and loads the CPU unnecessarily. Rather, a well-disciplined system of assertions (preconditions, postconditions, check assertions) that validate such things during testing and can be turned off for release is by far my preferred method (both for reliability and efficiency), and it results in extremely efficient code. My opinion: defensive code ESPECIALLY should not be used in an area like an IRQ dispatcher. Rather, this code should be as streamlined and as quick as possible. And (B) above WOULD be defensive coding. I would instead "head it off at the pass" by 1) in the APP.CFG user interface, not allowing priority 0 to be used with nesting turned on at the same time, and 2) having whatever compiles that APP.CFG code also generate an ERROR (not a warning) if these two are found together.

Just to check, in the APP.CFG user interface (XDC tools I believe) I put the value 40000 in for the Interrupt priority, and saved it, and there were no complaints! Hmmmm.... So I guess I made an assumption that I was protected there and 64 was a valid number! NOT! (For the reader, 63 is the LOWEST valid interrupt priority for the AM335x. Now I know that.)

Under TASK PRIORITY if I put a value out of range and attempt to save APP.CFG, I am greeted with a nice red "X" next to the field and hovering my mouse over the "X" gives me a nice informative message about what the valid range is and where that setting is controlled.

=-=-=-=

Okay, now that I know how to avoid this problem:

A. Never use interrupt priority 0 with interrupt nesting turned on. It's not compatible with the HWI.C IRQ dispatcher code, and the resulting HWI stack overflow is guaranteed to wipe out everything below it in RAM until the CPU has an exception.

B. Never put an interrupt priority over 63 into the APP.CFG user interface for HWI interrupt priority. If you're flying without knowing how AM335x interrupts work and some of the limitations (namely, limits of priority and how it works for interrupt nesting), you're on your own.

=-=-=-=

And now I can also make some recommendations for the XDC tools and SYS/BIOS software engineers:

1) I suggest a numeric value check on the HWI interrupt priorities (and SWI priorities if such a check is not already in place). Certainly disallowing HWI priorities out of the range of 0-63 would be part of this.

2) I think DISALLOWING priority 0 and INTERRUPT NESTING at the same time would be a good idea, as they are not compatible. I would suggest this be both part of the user interface, and part of whatever compiles APP.CFG during code generation (to generate an ERROR, not a warning).

3) I think it would be GREAT if you would implement the per-HWI selectable interrupt masking in the actual HWI IRQ dispatcher code. Currently it is not implemented (at least for the family\arm\a8\intcps\Hwi.c code). Otherwise, don't present the mask dropdown list at all, as it is A) deceptive, and B) a source of design bugs that are difficult to trace, because a system designer may assume the selected masking is implemented, tested, and working when it is not.

4) It is POSSIBLE that the missing data synchronization barrier described in TRM section 6.2.3, list item #6, is going to be important. I'm going to study and understand the rest of the TRM Interrupts section (including the assembly language) to see what more I can determine about this, but this might be worth a very close evaluation if such an evaluation has not already been done.
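
Recommendations 1 and 2 amount to a small configuration-time check. The sketch below is written in C purely for illustration; the real check would live in the configuration tooling, and every name here (validate_hwi_config, HwiCfgStatus, INTC_PRIORITY_MAX) is hypothetical. It also assumes no special "use the default" sentinel value for the priority; a real check would have to allow for one if the tool uses it.

```c
#include <stdbool.h>

/* 6-bit priority field: valid values are 0..63. */
#define INTC_PRIORITY_MAX 63

/* Hypothetical result codes for a configuration-time check. */
typedef enum {
    HWI_CFG_OK = 0,
    HWI_CFG_ERR_PRIORITY_RANGE,    /* recommendation 1: outside 0..63 */
    HWI_CFG_ERR_ZERO_WITH_NESTING  /* recommendation 2: priority 0 + nesting */
} HwiCfgStatus;

/*
 * Hypothetical validator: returns an error code instead of silently
 * accepting a bad combination. Negative values are rejected because
 * this sketch assumes there is no "use default" sentinel.
 */
static HwiCfgStatus validate_hwi_config(long priority, bool auto_nesting)
{
    if (priority < 0 || priority > INTC_PRIORITY_MAX) {
        return HWI_CFG_ERR_PRIORITY_RANGE;
    }
    if (auto_nesting && priority == 0) {
        return HWI_CFG_ERR_ZERO_WITH_NESTING;
    }
    return HWI_CFG_OK;
}
```

With a check like this, both the 64 I originally used and the 40000 I tried later would be reported as range errors rather than being accepted without complaint.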

=-=-=-=

Finally, thank you very much, Chester, for your help! It was instrumental in guiding me as to where to look.

Kind regards,
Vic

0 Victor Wheeler61 over 9 years ago in reply to Victor Wheeler61

Expert 2215 points

To clarify as to your question: "Can you check that the interrupt handler is clearing the interrupt source."

If you were referring to the USER ISR, yes, definitely it was clearing the source (i.e. causing the TSC_ADC_SS peripheral to cease asserting the IRQ signal coming into the INTC).

However, the stack overflow was being caused BEFORE that ISR was being called. I did finally ferret out the source of the problem, and posted it just a few minutes ago.

0 Victor Wheeler61 over 9 years ago in reply to Chester Gillon

Expert 2215 points

Hi, Chester! Wow, I started writing the 5:52pm posting in the morning, and then went out to brunch, and then didn't see your posting with your elegant method of stopping the stack overflow after it had run away. As you'll see, I uncovered the same finding (IRQ 115) with a somewhat less elegant method of stopping the stack overflow after a bit. In the last posting however, I think I have found the cause and a multi-point list of things that will make the pitfall I fell into a little less wide for future system designers/implementers! It would be great to have your eyes confirm my findings.

0 Chester Gillon over 9 years ago in reply to Victor Wheeler61

Guru 92251 points

Victor Wheeler61 said:
Just to check, in the APP.CFG user interface (XDC tools I believe) I put the value 40000 in for the Interrupt priority, and saved it, and there were no complaints! Hmmmm.... So I guess I made an assumption that I was protected there and 64 was a valid number! NOT! (For the reader, 63 is the LOWEST valid interrupt priority for the AM335x. Now I know that.)

The AM335x TRM shows that the INTC_ILRx Registers have only 6 bits for the "Priority" field.

The Hwi priority fields in the app.cfg file end up getting passed to the Hwi_setPriority() function in packages/ti/sysbios/family/arm/a8/intcps/Hwi.c, which stores the requested priority in the corresponding INTC_ILRx register. Since the "Priority" field in the INTC is only 6 bits wide, an attempt to set a priority of 64 results in priority zero being set in the INTC, which matches your observed behavior.
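
The arithmetic can be sketched in C as follows. This models the truncation as a plain 6-bit mask, which is an assumption made for illustration; it does not reproduce the actual register write in Hwi_setPriority(), and the function and macro names are hypothetical.

```c
#include <stdint.h>

/* Width of the "Priority" field in INTC_ILRx, per the TRM. */
#define INTC_PRIORITY_BITS 6u
#define INTC_PRIORITY_MASK ((1u << INTC_PRIORITY_BITS) - 1u)  /* 0x3F */

/*
 * Hypothetical model: the priority the INTC ends up with when only
 * the low 6 bits of the requested value fit in the field.
 */
static uint32_t effective_intc_priority(uint32_t requested)
{
    return requested & INTC_PRIORITY_MASK;
}
```

Under this model a requested priority of 64 (binary 1000000) keeps only its six low zero bits and becomes 0, while 63 and 31 pass through unchanged. The value 40000 is an exact multiple of 64, so it also becomes 0.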

The original app.cfg had interrupts 16 and 115 both set to priority 64:

hwi1Params.instance.name = "ghTscAdcHwiHandle";
hwi1Params.priority = 64;
hwi1Params.maskSetting = xdc.module("ti.sysbios.interfaces.IHwi").MaskingOption_ALL;
hwi1Params.enableInt = true;
Program.global.ghTscAdcHwiHandle = Hwi.create(16, "&TAM_InterruptHandler16", hwi1Params);

hwi2Params.instance.name = "ghTsc115AdcHwi";
hwi2Params.priority = 64;
hwi2Params.maskSetting = xdc.module("ti.sysbios.interfaces.IHwi").MaskingOption_ALL;
hwi2Params.enableInt = true;
Program.global.ghTsc115AdcHwi = Hwi.create(115, "&TAM_InterruptHandler115", hwi2Params);

With dispatcherAutoNestingSupport enabled, I changed the priorities and found:

a) With interrupt 16 and 115 both set to priority 64 (i.e. actually zero priority) the program got stuck continuously handling nested interrupt 115.

b) With interrupt 16 set to priority 64 and interrupt 115 set to priority 31 then the program got stuck continuously handling nested interrupt 16.

c) With interrupt 16 and 115 both set to priority 31 the nested interrupt problem no longer occurred.

i.e. I can reproduce the nesting problem with an effective priority of zero on either of the interrupts.

Agree that the SYS/BIOS configuration tool doesn't trap out-of-range priority values.

0 Victor Wheeler61 over 9 years ago in reply to Chester Gillon

Expert 2215 points

Yep! This one is solved. Thank you again for your help, Chester!
