How to detect stack corruption vs stack overflow?

dewilso4

I'm having a problem with random behavior within my C55x DSP/BIOS application. I've tracked it down to what I believe is a stack corruption problem, but 1.) I'm not sure how to track where the corruption occurs and 2.) it may not be a corruption problem at all. I've found several TI and other resources online regarding detection of stack overflows, but I'm at a loss when it comes to detection of stack corruption. Let's say for the time being that I'm positive no stack overflows are occurring (watermark verified in each task's stack/sysstack as well as the application's stack/sysstack).

The behavior I'm seeing is that, while in one of my SWI handlers, I enter a function which appears to run to completion, but upon exiting from the function, I don't return to my SWI handler. I've verified this by having a status variable which gets incremented prior to entering the function and decremented upon returning from the called function and noting that the decrement doesn't occur. The problem is further complicated by the fact that it seems to be very closely related to timing (i.e. of the various HWIs and different priority SWIs), as usually the application runs fine.

Are there any resources explaining the debugging of stack corruption issues (particularly using CCS tools to do so)? Anybody out there want to give me some personal insight into debugging these types of problems in other projects? Your help is appreciated!

over 13 years ago

0 Cong Van Nguyen over 13 years ago

Intellectual 970 points

Hi,

C55x DSPs do not have any mechanism to detect stack overflow. However, you can do it manually by filling the stack with a known pattern (e.g. 0xBEEF), running the program, and reading back the stack to see how much of it the program used.
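The watermark technique described above can be sketched roughly as follows. This is a minimal, host-testable illustration and not DSP/BIOS code: the function names `stack_fill` and `stack_unused_words` are hypothetical, and it assumes 16-bit stack words and a stack that grows toward lower addresses, so the untouched words sit at the low end of the region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WATERMARK 0xBEEFu  /* fill pattern mentioned in the post */

/* Hypothetical helper: fill the whole stack region before the thread runs. */
void stack_fill(uint16_t *stack, size_t words)
{
    assert(stack != NULL);
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_WATERMARK;
    }
}

/*
 * Hypothetical helper: count words that still hold the watermark,
 * scanning upward from the lowest address. Assumes the stack grows
 * downward, so the first non-watermark word marks the deepest use.
 */
size_t stack_unused_words(const uint16_t *stack, size_t words)
{
    assert(stack != NULL);
    size_t unused = 0;
    while (unused < words && stack[unused] == STACK_WATERMARK) {
        unused++;
    }
    return unused;
}
```

If `stack_unused_words` ever returns 0, the stack has been used all the way to its limit and has probably overflowed. Note that this only measures how deep the stack has gone; it says nothing about a bad write that lands inside the region already in use.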

0 dewilso4 over 13 years ago in reply to Cong Van Nguyen

Expert 1215 points

Thanks, Cong. I appreciate your reply. I'm fairly certain I'm not introducing stack overflow problems, as I've performed the analysis you just suggested. I have all my tasks' stacks as well as the application's stacks filled with a watermark (i.e. 0xBEEF), and even after a crash, I can examine the stack and note that it is still mostly filled with watermark values.

The problem seems to be more of a stack corruption problem, i.e. somewhere in that function I must be writing beyond the bounds of some local object, overflowing into the stack, and corrupting my return address. I just don't see anywhere in my code where that's possible (no memcpy/memset, no stepping through an array, etc.).

I've begun using the entry/exit hook functions, which allows me to have a function called each time I enter and leave a function. This is giving me a little more insight, as now I can see my whole stack trace, even if the system's stack trace becomes corrupted. We'll see if this leads anywhere...
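For anyone following along, the kind of shadow call trace I mean looks roughly like the sketch below. It is a simplified illustration, not my actual code: the names `trace_entry`, `trace_exit`, `trace_top`, `trace_current_depth`, and `trace_mismatch_count` are made up for this example, and the way the compiler is told to call the hooks is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TRACE_DEPTH 16u  /* assumed maximum nesting depth to record */

static const char *trace_names[TRACE_DEPTH];
static unsigned trace_depth;       /* current call nesting depth */
static unsigned trace_mismatches;  /* exits that did not match the last entry */

/* Hypothetical entry hook: record the function being entered. */
void trace_entry(const char *name)
{
    assert(name != NULL);
    if (trace_depth < TRACE_DEPTH) {
        trace_names[trace_depth] = name;
    }
    trace_depth++;
}

/* Hypothetical exit hook: check that we are leaving the function we last entered. */
void trace_exit(const char *name)
{
    assert(name != NULL);
    if (trace_depth == 0) {
        trace_mismatches++;  /* exit without a matching entry */
        return;
    }
    trace_depth--;
    if (trace_depth < TRACE_DEPTH &&
        strcmp(trace_names[trace_depth], name) != 0) {
        trace_mismatches++;
    }
}

/* Name of the innermost recorded function, or NULL if none is recorded. */
const char *trace_top(void)
{
    if (trace_depth == 0 || trace_depth > TRACE_DEPTH) {
        return NULL;
    }
    return trace_names[trace_depth - 1];
}

unsigned trace_current_depth(void) { return trace_depth; }
unsigned trace_mismatch_count(void) { return trace_mismatches; }
```

After a crash, `trace_top()` and the contents of `trace_names` show which function was entered last without a matching exit, even if the real stack can no longer be unwound. A real version would need to mask interrupts around the updates or keep a separate buffer per thread, since this sketch uses one shared buffer with no protection.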

0 Cong Van Nguyen over 13 years ago in reply to dewilso4

Intellectual 970 points

Hi Derek,

Sorry that I didn't realize that you had checked for stack overflow already.

You can check the pushing and popping of registers in your ASM routines, if you have any. Wrongly ordered or misaligned pushes and pops usually cause a catastrophic crash like that.

You seem to have a corrupted program flow. In general, to deal with such a problem, I would try to halt the DSP at the point where the bug occurs and use the tracer to trace the program flow backwards, then add some debugging code so that a breakpoint can be placed BEFORE the bug occurs.

Cheers,

Cong-Van

0 dewilso4 over 13 years ago in reply to Cong Van Nguyen

Expert 1215 points

Hi, Cong-Van,

Thanks again for your advice. You make some good points. In my application, there's no hand-written assembly, so I don't think it has to do with the order of pushing/popping registers. Also, the problem seems to be very time dependent. In other words, my application will run fine for minutes to hours but then all of a sudden, the DSP goes into the weeds. I very much suspect it to be related to a non-reentrant function somewhere which is getting interrupted.

My debugging code has narrowed the problem down to a few functions, which makes it much more tractable. However, in reviewing those functions' code, I've not found any indication that they are not reentrant. It is also very difficult to find a place to breakpoint before the bug occurs, as I have not found anything (values of variables, order of function calls, etc.) that is the same from crash to crash. However, I agree that would be the best way to proceed.

0 Danny Corey over 13 years ago

Prodigy 150 points

Derek,

The problems you have encountered may be a result of a problem found with the C55xx linker. Please refer to the eZdsp5502 FAQ on the Spectrum Digital support website at http://support.spectrumdigital.com/boards/ezdsp5502/revc/files/ezdsp5502_faq.htm. The FAQ describes the issue found with the C55xx linker.

Regards,

Danny Corey

0 dewilso4 over 13 years ago in reply to Danny Corey

Expert 1215 points

Wow, thanks a million, Danny; you're a lifesaver! I've been trying to track this down for 3+ weeks, and changing the CGT from 4.3.8 to 4.4 seems to have fixed the problem! While relieved that the problem was not in my code, what a frustrating experience this has been :-)

0 Jordon over 13 years ago in reply to dewilso4

Intellectual 635 points

Hello,

I am wondering if this bug is also present in 4.3.1. What is the defect number, and is there a way to look into the defect database? The release notes sometimes do not list which earlier versions of the tools a bug is present in.

Also, how would one confirm that this is indeed happening, via the linker map or some other analysis?

Thanks.

0 dewilso4 over 13 years ago in reply to Jordon

Expert 1215 points

It seems to be most easily noticed in the disassembly view when running through JTAG. If a suspect function is not aligned, the linker will attempt to align it to a 4-byte boundary. Unfortunately, instead of filling the alignment gap with NOP instructions, it leaves whatever random bytes happen to be in memory. Thus, if you look at the end of a function in disassembly, after the last line of the function, but before the return, you'll find random opcodes. It turns out the problem wasn't fully fixed in the 4.4.0 release but is supposedly fixed in the 4.4.1 CGT release.
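If you would rather check a dumped code image than eyeball the disassembly, something along these lines might help. It is only a sketch under several assumptions: the helper name `count_bad_padding` is made up, the image is assumed to be a plain byte array indexed by byte offset, and the expected fill byte is passed in as a parameter because the correct NOP encoding should be taken from the C55x instruction set documentation, not from this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: count the bytes between func_end (first offset
 * past the function's code) and the next `align`-byte boundary that do
 * not equal the expected fill byte. Returns 0 if the gap is empty or
 * cleanly filled.
 */
size_t count_bad_padding(const uint8_t *image, size_t image_len,
                         size_t func_end, size_t align, uint8_t fill)
{
    assert(image != NULL);
    assert(align != 0);

    /* Round func_end up to the next multiple of align. */
    size_t next = (func_end + align - 1) / align * align;
    if (next > image_len) {
        next = image_len;  /* do not read past the end of the image */
    }

    size_t bad = 0;
    for (size_t i = func_end; i < next; i++) {
        if (image[i] != fill) {
            bad++;
        }
    }
    return bad;
}
```

A nonzero result for a function that ends off a 4-byte boundary would be consistent with the gap holding leftover bytes, but it is only a hint; the disassembly is still the place to confirm what actually executes.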
