How to debug a stack overflow?

Lee Holeva


In sysbios6, how should I go about debugging the occurrence of a random stack overflow? I'm seeing the stack getting wiped out randomly. For example, while executing the code, the stack pointer starts out at 0x98109760 and changes to 0x98106DD0 during normal execution. When the random error occurs, I stop the DSP and notice that the SP is back at 0x98109760 and the DSP is in the idle loop. Sometimes I get an explicit stack overflow error, sometimes I do not. This is doubly difficult to figure out because the stack corruption is not deterministic: when processing the same sequence of images multiple times, the error happens randomly at any image in the sequence. Is there a way to instrument my code using the System Analyzer to leave a trace of the events that led up to the stack corruption?

Lee

over 13 years ago

0 MarkGrosen over 13 years ago

TI__Expert 4125 points

Lee,

Have you tried the suggestions in Section 3.5.4 of the SYS/BIOS User's Guide, "Testing for Stack Overflow", pasted below:

3.5.4 Testing for Stack Overflow
When a task uses more memory than its stack has been allocated, it can write
into an area of memory used by another task or data. This results in
unpredictable and potentially fatal consequences. Therefore, a means of
checking for stack overflow is useful.
By default, the Task module checks to see whether a Task stack has
overflowed at each Task switch. To improve Task switching latency, you can
disable this feature by setting the Task.checkStackFlag property to false.
The function Task_stat() can be used to watch stack size. The structure
returned by Task_stat() contains both the size of its stack and the maximum
number of MAUs ever used on its stack, so this code segment could be used
to warn of a nearly full stack:
    Task_Stat statbuf;                 /* declare buffer */
    Task_stat(Task_self(), &statbuf);  /* call func to get status */
    if (statbuf.used > (statbuf.stackSize * 9 / 10)) {
        Log_printf(&trace, "Over 90% of task's stack is in use.\n");
    }
See the Task_stat() information in the "ti.sysbios.knl" package
documentation in the online documentation.
You can use the Runtime Object Viewer (ROV) to examine run-time Task
stack usage. For information, see Section 6.5.3.
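
The 90% check in the excerpt can be factored into a small helper that runs on a host PC as well as on the target. The following is only a sketch: `StackStat` and `stack_usage_exceeds` are hypothetical names, and the struct is a stand-in that mirrors just the `stackSize` and `used` fields the excerpt reads from Task_Stat. On a real target you would copy those two values out of the structure filled in by Task_stat().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two Task_Stat fields used in the excerpt. */
typedef struct {
    size_t stackSize; /* total stack size, in MAUs */
    size_t used;      /* peak number of MAUs ever used on the stack */
} StackStat;

/* Returns true when peak usage is strictly above `percent` of the stack size. */
bool stack_usage_exceeds(const StackStat *st, unsigned percent)
{
    assert(st != NULL && percent <= 100);

    /* floor(stackSize * percent / 100), computed in two parts so the
     * multiplication cannot overflow for very large stack sizes. */
    size_t threshold = (st->stackSize / 100) * percent
                     + ((st->stackSize % 100) * percent) / 100;

    return st->used > threshold;
}
```

Calling this for every task at a convenient point (for example from a low-priority housekeeping task) gives an early warning before the Task switch check reports an actual overflow.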

0 Lee Holeva over 13 years ago in reply to MarkGrosen

Genius 4615 points

MarkGrosen said:

Have you tried the suggestions in Section 3.5.4 of the SYS/BIOS User's Guide, "Testing for Stack Overflow", pasted below:

No, I haven't tried that, but I just stepped through the code with a sequence of 30 images using CCS5.1 and the stack never overflowed. I put a breakpoint at the end of the main sequence of computation and gathered benchmark statistics using System Analyzer, and everything worked fine. I'm thinking that somewhere there is a timing issue.

Lee

0 MarkGrosen over 13 years ago in reply to Lee Holeva

TI__Expert 4125 points

Ok, but even in the case where things appear to work, it would be useful to open the ROV (Runtime Object Viewer) and look how much stack space is being used by each task in the system. If any of them are close to the edge, that would indicate a place to start looking.

Mark

0 Lee Holeva over 13 years ago in reply to MarkGrosen

Genius 4615 points

MarkGrosen said:

Ok, but even in the case where things appear to work, it would be useful to open the ROV (Runtime Object Viewer) and look how much stack space is being used by each task in the system. If any of them are close to the edge, that would indicate a place to start looking.

I've been looking at the peak stack usage in ROV and when the code runs normally I'm seeing no more than 1756 bytes of stack, well under the limit. It appears that by adding the Log_write1s for benchmarking the stack overflow problem has gone away. What would explain this?

I have placed this pair of instructions in the code for benchmarking:

Log_write1(UIABenchmark, start, (xdc_IArg)"name");

Log_write1(UIABenchmark, stop, (xdc_IArg)"name");

With the Log_write1 calls commented out, the code goes off into the weeds and I see the stack getting corrupted.

Lee

0 Steven Connell over 13 years ago in reply to Lee Holeva

TI__Mastermind 45025 points

Lee,

If the problem went away when you added the Log statements, this does not mean that the problem has been fixed; it just means you are getting lucky. Bringing in the Log module has changed the layout of your program in memory. The stack overflow is still happening. In the previous case, it was overwriting "important" data, which in turn caused the program to crash. Now it is likely still overwriting data when it overflows, but "unimportant" data: the data is still being corrupted, only this corruption does not cause a crash.

I think another thing to try would be to analyze each of your Task stacks at the point of the crash. To do this, try opening a memory window (view->memory) and type in the address of your stack.

Actually, maybe you should do this at the beginning of your program first, so that you can see how the Task stack memory is initialized to contain all "0xBEBEBEBE". Then, once the program crashes, you can again look at the stack and see where "0xBEBEBEBE" has been replaced by actual data. If at the top of the stack you still see some "0xBEBEBEBE" then this means that the stack has *not* overflowed.

But if you see that all of the "0xBEBEBEBE" values are gone, then the stack was either filled up to the exact maximum or it overflowed past the top.
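
The manual inspection described above can also be expressed in code. This is a host-side sketch of the fill-pattern ("watermark") technique, not the kernel's implementation: `stack_fill` and `stack_peak_used` are hypothetical helpers, and the sketch assumes the stack grows from the highest address of the buffer down toward its base.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill byte from the thread: each untouched word reads as 0xBEBEBEBE. */
#define STACK_FILL 0xBEu

/* Paint the whole buffer with the fill pattern, as done before the task runs. */
void stack_fill(uint8_t *base, size_t size)
{
    assert(base != NULL);
    for (size_t i = 0; i < size; i++) {
        base[i] = STACK_FILL;
    }
}

/* Assumption: the stack grows downward from base + size toward base.
 * Count untouched fill bytes starting at the lowest address; everything
 * above the first overwritten byte is treated as having been used. */
size_t stack_peak_used(const uint8_t *base, size_t size)
{
    assert(base != NULL);
    size_t untouched = 0;
    while (untouched < size && base[untouched] == STACK_FILL) {
        untouched++;
    }
    return size - untouched;
}
```

A result equal to the full size corresponds to the case where all of the fill values are gone: the stack was either filled exactly or overflowed. The scan can under-report if the task happens to store the byte 0xBE at the deepest point it reached, so treat the number as an estimate.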

Hope this helps,

Steve

0 Norman Wong over 13 years ago in reply to Lee Holeva

Guru 26430 points

Another possibility is that some function is using uninitialized variables on the stack and corrupting its own stack or some other task's stack. The Log_write1() call might be initializing stack memory to "safe" values. Most compilers will warn about uninitialized variable usage in C code, though I'm not sure about assembler code.

0 Lee Holeva over 13 years ago in reply to Steven Connell

Genius 4615 points

Steven Connell said:

Actually, maybe you should do this at the beginning of your program first, so that you can see how the Task stack memory is initialized to contain all "0xBEBEBEBE". Then, once the program crashes, you can again look at the stack and see where "0xBEBEBEBE" has been replaced by actual data. If at the top of the stack you still see some "0xBEBEBEBE" then this means that the stack has *not* overflowed.

But if you see that all of the "0xBEBEBEBE" values are gone, then the stack was either filled up to the exact maximum or it overflowed past the top.

Now I'm not sure that the problem I'm having is actually a stack overflow. I put in a breakpoint midway through the code, and when running correctly the stack pointer is at 0x98106E70 and contains the following values:

0x00000001 98159D73

BEBEBEBE BEBEBEBE

The hiccup occurred, I stopped the DSP, and I see this:

Close-up of the error:

This is C674x DSP code, why is CCS asking for a Linux file?

The memory contents at 0x98106E70 look OK, but the stack pointer is now pointing to 0x98109758. The main thread is blocked on a DMA input semaphore; DMA from DDR memory to L1D has stopped working.

Depending upon where I put the breakpoint, CCS will sometimes hang when the hiccup occurs, forcing me to kill the CCS process.

Update:

I'm having serious issues with CCS5.1 hanging. I'm getting to a breakpoint, ROV begins to update, but CCS never completes.

Lee

0 Steven Connell over 13 years ago in reply to Lee Holeva

TI__Mastermind 45025 points

Lee,

Lee Holeva said:

Now I'm not sure that the problem I'm having is actually a stack overflow. I put in a breakpoint midway through the code, and when running correctly the stack pointer is at 0x98106E70 and contains the following values:

0x00000001 98159D73

BEBEBEBE BEBEBEBE

This looks normal to me. The stack actually grows "downward" in the memory window. So the first line above (0x00000001 98159D73) shows that data has been written to the stack (previous to that data write, this memory location would have contained BEBEBEBE BEBEBEBE since the entire stack would be initialized to that).

The second line (BEBEBEBE BEBEBEBE) means that the stack has not "grown to that address" yet - no data has been written there yet.

Lee Holeva said:
This is C674x DSP code, why is CCS asking for a Linux file?

Please ignore this. The message merely says that CCS cannot open the original source file for the function that the PC is currently in. The path contains Linux directories because that is where the file existed on our build servers when the XDCtools product was built. This path is internal and should be ignored.

For the future when this happens, you could click the locate file button and then browse to the location of that file on your local computer (in this case it would be in your XDCtools installation directory).

Steve

0 Michelle Vickrey over 9 years ago in reply to MarkGrosen

Prodigy 10 points

I have a further question about how and where to implement the Task_stat() call. I'm using 4 tasks which includes idle. I'm also using IAR and the ROV support available there. I suspect one of my tasks is overflowing. I'm just not sure where to put the Task_stat() calls. Should I place calls in every task or just the suspect one?

Thanks

0 Steven Connell over 9 years ago in reply to Michelle Vickrey

TI__Mastermind 45025 points

Can you please open up a new thread for this?

Thanks,

Steve
