Program Restarting

Abu Alam61097

Other Parts Discussed in Thread: MSP430F5438

Hi,

I am developing a wireless network using a set of eZ430 boards. The code has grown quite a bit.

For some reason the program periodically restarts (at the top of main) after a certain point in the code. I have disabled the watchdog timer using this: WDTCTL = WDTPW + WDTHOLD;

It's kind of hard to debug and find the point where the program restarts, since the timers continuously trigger interrupts. Is there a particular strategy I should follow to debug this and stop the program from restarting? What are some causes of such behavior?

thx

over 14 years ago

0 Piotr Romaniuk over 14 years ago

Expert 2840 points

Hi,

Can you write something more about 'certain point of the code'?
Is there something specific?

There are a lot of possibilities; with such limited information it is mostly guessing.
Maybe you have some issue with interrupts, stack corruption or an NMI.

It is strange that the software restarts; I'd rather expect it to hang if something goes wrong.

A general strategy is:
1) add changes in small parts and keep a set of versions (it is good practice to use a version control system, e.g. git, svn, etc.)
2) test consecutive versions for the presence of the error. When the error appears, compare two versions: the one that works and the erroneous one. The error is caused (or exposed) by the changes made between those two versions
3) try to find out which conditions trigger the error
4) make the error repeatable
5) try to narrow down the code where you are looking for the cause, and find the exact point where the error appears
6) add debugging code that tells you more about the execution sequence and its conditions
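
As an illustration of step 6, here is a minimal sketch of a breadcrumb log that remembers the last few checkpoints the code passed. The names trace_log, trace_mark and trace_recent and the depth of 16 are hypothetical, and the sketch assumes it is called from one context only. For the history to survive a restart, the buffer would have to sit in RAM that the startup code does not clear, which is toolchain specific and not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical breadcrumb log; names and sizes are illustrative only. */
#define TRACE_DEPTH 16u   /* power of two so the index can be masked */

typedef struct {
    uint8_t  id[TRACE_DEPTH];  /* most recent checkpoint IDs */
    uint32_t count;            /* total number of checkpoints recorded */
} trace_log_t;

/* Assumption: on a real target this object is placed in RAM that is
   not cleared at startup, otherwise the history is lost on restart. */
trace_log_t trace_log;

/* Record that execution passed the given checkpoint. */
void trace_mark(uint8_t checkpoint_id)
{
    trace_log.id[trace_log.count & (TRACE_DEPTH - 1u)] = checkpoint_id;
    trace_log.count++;
}

/* Return the n-th most recent checkpoint (0 = latest). */
uint8_t trace_recent(uint32_t n)
{
    assert(n < TRACE_DEPTH && n < trace_log.count);
    return trace_log.id[(trace_log.count - 1u - n) & (TRACE_DEPTH - 1u)];
}
```

Calling trace_mark() with a distinct ID at the start of each major function, and inspecting the buffer with the debugger at the top of main, would show which checkpoint was reached last before the restart.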

Regards,
Piotr Romaniuk, Ph.D.
ELESOFTROM

0 Jens-Michael Gross over 14 years ago

Guru 227245 points

Typical reasons for program restarts are:

- unhandled interrupts and

- stack overflows.

Unhandled interrupts occur if you enable an interrupt but didn't write an ISR for it. This causes the processor to jump to 0xfffe instead of an ISR. After that point, almost anything can happen. A typical case is when someone enables CCIE on CCR0 and CCR1 but only writes one ISR for them (Timer0_A0 is required for CCR0 and Timer0_A1 is required for CCRx and TAIFG).
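
A minimal sketch of the CCR0/CCR1 case with both ISRs present. It assumes a device whose header names the vectors TIMER0_A0_VECTOR and TIMER0_A1_VECTOR and the vector register TA0IV (other parts use slightly different names), and a compiler that defines __MSP430__ and accepts the #pragma vector / __interrupt syntax. The on_* functions and the counters are hypothetical names, used so the logic also builds on a PC.

```c
#include <assert.h>   /* only needed for host-side checks */

/* Event counters (hypothetical names). */
volatile unsigned ccr0_events, ccrx_events, stray_events;

/* Plain C handlers: one per vector the application enables. */
void on_ccr0(void)  { ccr0_events++; }   /* CCR0 has its own vector */
void on_ccrx(void)  { ccrx_events++; }   /* CCR1..CCRn and TAIFG share one */
void on_stray(void) { stray_events++; }  /* for vectors that should never fire */

#if defined(__MSP430__)   /* assumption: set by the target compiler */
#include <msp430.h>

#pragma vector = TIMER0_A0_VECTOR
__interrupt void timer0_a0_isr(void)
{
    on_ccr0();
}

#pragma vector = TIMER0_A1_VECTOR
__interrupt void timer0_a1_isr(void)
{
    (void)TA0IV;   /* reading the vector register clears the pending flag */
    on_ccrx();
}
#endif
```

Attaching on_stray() in the same way to every vector the application does not expect, and putting a breakpoint in it, would show whether a stray interrupt is the trigger.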

Stack overflows can happen all the time. If the stack overflows into an area where data is stored, it does not only corrupt that data: if the data is then altered inside an ISR, the write may change the return address of the ISR, so that it returns not to the interrupted code but to anywhere. Besides that, using nested interrupts (enabling GIE inside an ISR) can cause immediate stack overflows as well as intermittent ones. But even without nested interrupts, stack overflows happen often.

Keep in mind that the 'stack size' setting in the IDE does not in any way limit the processor's use of the stack. If the stack grows (due to local variables, function calls or ISRs), it grows. No IDE setting will limit it. This setting only sets a warning/error threshold for when your static/global variables do not leave enough space for the defined amount. How much of it (or more) is actually used at runtime cannot be limited.
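
Since nothing limits stack growth at runtime, one common way to at least measure it is to fill the stack region with a known pattern and later count how much of the pattern is still intact. The sketch below is a generic version of that idea, not something taken from the IDE. The function names and the 0xA5 pattern are arbitrary, and obtaining the real start address and size of the stack region is linker specific and left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5u   /* arbitrary pattern, unlikely to appear by accident */

/* Fill a memory region with the pattern. On target this must only
   cover the part of the stack that is not currently in use. */
void stack_paint(volatile uint8_t *lowest, size_t size)
{
    assert(lowest != NULL);
    for (size_t i = 0; i < size; i++) {
        lowest[i] = STACK_FILL;
    }
}

/* Count the bytes that still hold the pattern, starting at the lowest
   address. The MSP430 stack grows downward, so this is the headroom
   that was never touched. A result of 0 suggests an overflow. */
size_t stack_headroom(const volatile uint8_t *lowest, size_t size)
{
    size_t untouched = 0;
    assert(lowest != NULL);
    while (untouched < size && lowest[untouched] == STACK_FILL) {
        untouched++;
    }
    return untouched;
}
```

Painting once early in main and checking the headroom periodically (or from the debugger) gives a rough high-water mark for the actual stack use.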

0 Piotr Romaniuk over 14 years ago in reply to Jens-Michael Gross

Expert 2840 points

Hi Jens-Michael,

To the list of reasons I can add:
- uninitialized pointers,
- writing outside the bounds of a local array. Writing to such a location can alter the return address of the function (which is also on the stack).
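
For the array-bounds case, a minimal sketch of a checked write that refuses to store past the end of a buffer. buf_put and RX_BUF_LEN are hypothetical names; the point is only that every index coming from outside (a packet length, a counter) is compared against the buffer size before it is used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RX_BUF_LEN 8u   /* hypothetical buffer size */

/* Store one byte at index idx; refuse instead of writing past the end. */
bool buf_put(uint8_t *buf, size_t len, size_t idx, uint8_t value)
{
    if (buf == NULL || idx >= len) {
        return false;   /* caller can log the event or trap here */
    }
    buf[idx] = value;
    return true;
}
```

A rejected write is also a useful place for a breakpoint, since it points directly at the code that computed the bad index.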

Jens-Michael Gross said:

Typical reasons for program restarts are:
- unhandled interrupts and [...]

This is a very interesting problem. I think that a program restart because of an unhandled interrupt (i.e. a missing interrupt vector) is very unlikely. In the case of the msp430 architecture, the most probable result is a lock-up at the address after the end of the flash.

Stack overflow or corruption is a more probable cause of serious problems, including a restart. Nevertheless, I think a clean restart is difficult to produce this way.
The software may only look like it has been restarted because execution re-enters main or another 'deep function' on the stack trace (i.e. one called early).

There is a way to guard the stack state: in CCSv4 there are Entry/Exit Hook functions (Properties|C/C++ Build|MSP430 Compiler|Entry/Exit Hook options). They could be used to test whether the stack is still intact, but this requires adding some functionality; it does not work out of the box.
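
A minimal sketch of what such a hook body could look like, assuming the hook takes no parameters and that its name (stack_check_hook here, a hypothetical name) is entered in the Entry/Exit Hook options. The guard word is also an assumption: on target it would have to be placed at the lowest address of the stack region, which is linker specific and not shown.

```c
#include <assert.h>   /* only needed for host-side checks */
#include <stdint.h>

#define GUARD_VALUE 0xC0DEu   /* arbitrary marker value */

/* Hypothetical guard word located just below the stack. */
volatile uint16_t stack_guard = GUARD_VALUE;

/* Number of checks that found the guard word overwritten. */
volatile unsigned stack_guard_hits;

/* Candidate hook body, run on every function entry or exit. */
void stack_check_hook(void)
{
    if (stack_guard != GUARD_VALUE) {
        stack_guard_hits++;   /* on target: breakpoint or halt here */
    }
}
```

Because the check runs at every function boundary, the first hit narrows the overflow down to the call that was active at that moment, at the cost of some execution time.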

Regards,
Piotr Romaniuk, Ph.D.
ELESOFTROM

0 Jens-Michael Gross over 14 years ago in reply to Piotr Romaniuk

Guru 227245 points

Piotr Romaniuk said:
I think that a program restart because of an unhandled interrupt (i.e. a missing interrupt vector) is very unlikely. In the case of the msp430 architecture, the most probable result is a lock-up at the address after the end of the flash.

This 'lock' only happens if you're still in the address space assigned to the flash controller. On devices with <64k flash, you'll roll over to 0, which is module space and will likely cause an access violation or something similar.

Piotr Romaniuk said:
Stack overflow or corruption is more probable reason of serious problem

And there are so many ways to get one :)

0 Piotr Romaniuk over 14 years ago in reply to Jens-Michael Gross

Expert 2840 points

Jens-Michael Gross said:
This 'lock' only happens if you're still in the address space assigned to the flash controller. On devices with <64k flash, you'll roll over to 0, which is module space and will likely cause an access violation or something similar.

If there is no memory, msp430 behaves as if 0x3FFF (JMP $) had been fetched.
I checked only a few MCUs, but on those the space from 0x0000 was peripheral space not assigned to any module or register, hence the above behavior.

So the question is whether the particular microcontroller has anything at 0x0000.
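
To see why 0x3FFF reads as JMP $, here is a small decoder for the MSP430 jump format (top three bits 001, a 3-bit condition, a signed 10-bit word offset). The function names are hypothetical; the encoding follows the instruction set description in the family user's guide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the 16-bit word is a jump-format instruction (bits 15..13 = 001). */
bool is_jump(uint16_t word)
{
    return (word & 0xE000u) == 0x2000u;
}

/* Condition field, bits 12..10; the value 7 is the unconditional JMP. */
unsigned jump_condition(uint16_t word)
{
    return (word >> 10) & 0x7u;
}

/* Target of a jump word fetched at address pc:
   target = pc + 2 + 2 * offset, with offset the signed 10-bit field. */
uint16_t jump_target(uint16_t pc, uint16_t word)
{
    int16_t offset = (int16_t)(word & 0x03FFu);
    assert(is_jump(word));
    if (offset & 0x0200) {
        offset -= 0x0400;   /* sign-extend the 10-bit field */
    }
    return (uint16_t)(pc + 2 + 2 * offset);
}
```

For 0x3FFF the condition is 7 and the offset is -1, so the target equals the address of the instruction itself: the CPU spins in place.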

Jens-Michael Gross said:

Stack overflow or corruption is more probable reason of serious problem
And there are so many ways to get one :)

Indeed, there are many ways to do this. I think that we cannot imagine all possible scenarios :)

Regards,
Piotr Romaniuk, Ph.D.
ELESOFTROM

0 Jens-Michael Gross over 14 years ago in reply to Piotr Romaniuk

Guru 227245 points

Piotr Romaniuk said:
If there is no memory, msp430 behaves like 0x3FFF (JMP $) has been fetched.

Sure? This is true for the memory range assigned to the flash controller. But outside any memory, who will provide this value? It could as well be 0xffff (since nobody provides data).

Anyway:

Piotr Romaniuk said:
So the question is, if the particular microcontroller has anything at 0x0000.

AFAIK, the lowest 16 bytes are always reserved for the SFRs. The 1x series has no safety feature outside the flash memory area, so if a 1x processor rolls over to 0x0000, it continues executing what it finds there.

On the 5x series, any access to non-existing flash memory (above the flash end) causes a vacant memory interrupt (NMI), which will result in either another jump to 0xfffe or nothing if you're already 'inside' the NMI.

An instruction fetch from 0x0100 to 0x0ff7 will generate a PUC. Below 0x0100, the result isn't exactly defined in the datasheets. (However, at 0x00 there are at least the SFRIE1, SFRIFG1 and SFRRPCR registers, which aren't vacant memory in any case.)

The documentation is a bit confusing about vacant memory: the table says that 'vacant' memory 0x45c00-0xfffff triggers an NMI on read/write/fetch.
The section about vacant memory space says that any access will trigger an NMI only when enabled. Then it says that in case of a fetch, 0x3fff is fetched, which is taken as a jump, but it goes on to say that fetches from vacant memory space will result in a PUC (rendering the notice about 0x3fff being a JMP $ pretty much useless, as a PUC would never allow it to be executed at all).
It's really not that clear what will happen.

0 Piotr Romaniuk over 14 years ago in reply to Jens-Michael Gross

Expert 2840 points

Jens-Michael Gross said:

If there is no memory, msp430 behaves like 0x3FFF (JMP $) has been fetched.
Sure? This is true for the memory range assigned to the flash controller. But outside any memory, who will provide this value? It could as well be 0xffff (since nobody provides data).

Let me quote again part of the msp430 documentation (I am aware that this may not hold for all msp430 families):

http://e2e.ti.com/support/microcontrollers/msp43016-bit_ultra-low_power_mcus/f/166/p/100486/354685.aspx#354685

I checked it only for 0x0000 and 0x0004 on an msp430f5438 in the debugger. I did not observe a restart. I started the main function and then set PC=0x0000. It spins at this address. Also, in the above-mentioned thread, the person who started it had 0x3FFF executed at 0x0004.
I think that there is some address decoder: when the address is outside the allowed range, an NMI is generated and 0x3FFF is fetched.

I am not sure, but I expect that the NMI source must be unmasked to generate an interrupt of this type. If the programmer did not do this, it would not generate an NMI.

Regards,
Piotr Romaniuk, Ph.D.
ELESOFTROM

0 Jens-Michael Gross over 14 years ago in reply to Piotr Romaniuk

Guru 227245 points

That's the same passage I was referring to. And again I say that this is somewhat irritating. What good is it to state that 0x3fff is fetched when a PUC is also generated on a fetch (which nullifies any fetched value, resetting the device)? This whole part of the documentation is more confusing than enlightening or exhaustive.

Some devices do not have an SFR at 0x0004. The range to 0x0010 is reserved for SFRs, but only part of it (and to a different extent) is actually populated on different MSPs/families.

Piotr Romaniuk said:
I am not sure, but I expect that the NMI source must be unmasked to generate an interrupt of this type.

I'd say you're right, if the NMI source is maskable. There are user NMIs and system NMIs, and for system NMIs I wouldn't take it for granted (unless the docs say so).

Piotr Romaniuk said:
I think that there is some address decoder

There sure is. But if no valid destination is decoded, who will answer? The address decoder does not put anything on the data bus. It just emits chip select signals to the addressed components. So it's easy to decode all unused spaces to a reset/NMI request signal. On the typical address decoders used in older systems, a simple PAL chip, it didn't take any additional complexity. And even if you just create the gates on silicon, it doesn't take many more gates than the minimized/optimized design without this 'emergency' signal. Putting something on the data bus, however, would require additional circuitry connected to the data bus, etc.
Well, without having the core blueprints, nobody can say what's true. And those who have them have not documented it in every detail in the user's guide.

0 Piotr Romaniuk over 14 years ago in reply to Jens-Michael Gross

Expert 2840 points

Jens-Michael Gross said:
And again I say that this is somewhat irritating. What good is it to state that 0x3fff is fetched when a PUC is also generated on a fetch (which nullifies any fetched value, resetting the device)?

I agree, if a PUC happens the valuable '0x3FFF' feature is lost.

Jens-Michael Gross said:
Some devices do not have an SFR at 0x0004. The range to 0x0010 is reserved for SFRs, but only part of it (and to a different extent) is actually populated on different MSPs/families.

Yes. I pointed out that for the msp430x5xxx family it may be as I described. I have only an msp430f5438, so I can only check it on this chip. Hence, my observation is specific to one chip, not confirmed for the family (but possible). It is certainly not general.

Jens-Michael Gross said:

I am not sure, but I expect that the NMI source must be unmasked to generate an interrupt of this type.
I'd say you're right, if the NMI source is maskable. There are user NMIs and system NMIs, and for system NMIs I wouldn't take it for granted (unless the docs say so).

It is not maskable, but in the documentation (http://focus.ti.com/lit/ug/slau208h/slau208h.pdf) there is a note that the programmer can choose which NMI sources are allowed. Unfortunately the schematic is incomplete (Figure 1-3) and further information must be searched for "somewhere between the lines" of the text. Anyway, access to 'Vacant Memory Space' "generates a system (non)maskable interrupt (SNMI) when enabled (VMAIE=1)[...]".
I now see the sentence that I missed: "Fetch from vacant peripheral space result in a PUC". But I can only add that when I redirected execution to 0x0000 (under the debugger) it ran an endless loop there; I did not notice a restart. The same happened, on another chip, in the mentioned thread, where the CPU tried to execute 0x3FFF at 0x0004.

As for the decoder, it was only my guess. If someone designed this '0x3FFF' mechanism carefully, considering how it works, its consequences and its overall function, there could be a part of the chip that generates 0x3FFF when the decoder cannot find a device for a specific address.

Jens-Michael Gross said:
Well, without having the core blueprints, nobody can say what's true. And those who have them have not documented it in every detail in the user's guide.

I agree, but we can test it as a black box (I hope it is not considered hacking the msp ;) and draw some conclusions. There is a risk that such undocumented behavior changes when the next revision of the chip is released or when moving to another family.

Regards,
Piotr Romaniuk, Ph.D.
ELESOFTROM

0 Jens-Michael Gross over 14 years ago in reply to Piotr Romaniuk

Guru 227245 points

Piotr Romaniuk said:
generates a system (non)maskable interrupt (SNMI) when enabled (VMAIE=1)

Well, technically, it is a maskable interrupt. Just that you mask it by clearing VMAIE instead of GIE :)

Piotr Romaniuk said:
when I redirected execution to 0x0000 (under the debugger) it ran an endless loop there; I did not notice a restart.

Well, 0x0000 isn't vacant memory. It is special function register memory. So it seems that the SFR provides 0x3fff on a fetch while it provides the 'real' value on a read.

It's interesting that the MSP differentiates between read and fetch. Others don't.

Piotr Romaniuk said:
The same, on other chip, was in mentioned thread, where CPU tried to execute 0x3FFF at 0x0004.

Yep. Still not vacant memory there. The 0x0004, however, is most likely because the last instruction before it had two word parameters, so 0x0000-0x0003 were skipped/read as parameters. I wonder what the 'opcode' at 0xfffe was :)

0 Piotr Romaniuk over 14 years ago in reply to Jens-Michael Gross

Expert 2840 points

Jens-Michael Gross said:

The same, on other chip, was in mentioned thread, where CPU tried to execute 0x3FFF at 0x0004.
Yep. Still not vacant memory there. The 0x0004, however, is most likely because the last instruction before it had two word parameters, so 0x0000-0x0003 were skipped/read as parameters. I wonder what the 'opcode' at 0xfffe was :)

I asked Bob about that, but my guess is the following, because:
1) the chip's memory starts at 0x8000,
2) the IAR startup code most likely begins at the start of the FLASH,
3) and 0x8000 means:

sub.w pc, pc

It is a jump to 0x0000. :)
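
The decoding of 0x8000 can be checked with a small field extractor for the MSP430 double-operand format (opcode in bits 15..12, source register in bits 11..8, Ad in bit 7, B/W in bit 6, As in bits 5..4, destination register in bits 3..0). The struct and function names are hypothetical; opcode 0x8 is SUB and register 0 is the PC.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a double-operand (format I) MSP430 instruction word. */
typedef struct {
    unsigned opcode;   /* bits 15..12, 0x4 = MOV ... 0x8 = SUB ... 0xF = AND */
    unsigned src;      /* bits 11..8, source register (0 = PC) */
    unsigned ad;       /* bit 7, destination addressing mode */
    unsigned byte_op;  /* bit 6, 0 = word operation, 1 = byte operation */
    unsigned as_mode;  /* bits 5..4, source addressing mode */
    unsigned dst;      /* bits 3..0, destination register (0 = PC) */
} fmt1_t;

/* Split an instruction word into its format I fields. */
fmt1_t decode_fmt1(uint16_t word)
{
    fmt1_t f;
    assert((word >> 12) >= 0x4u);   /* format I opcodes are 0x4..0xF */
    f.opcode  = (word >> 12) & 0xFu;
    f.src     = (word >> 8) & 0xFu;
    f.ad      = (word >> 7) & 0x1u;
    f.byte_op = (word >> 6) & 0x1u;
    f.as_mode = (word >> 4) & 0x3u;
    f.dst     = word & 0xFu;
    return f;
}
```

For 0x8000 every field except the opcode is zero: a word-sized SUB in register mode with PC as both source and destination, which matches the sub.w pc, pc above.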

Regards,
Piotr Romaniuk, Ph.D.
ELESOFTROM

PS
The term 'vacant memory' is not well defined; from the documentation it is not clear whether it means (i) empty space between modules (address spaces) or (ii) an undefined location within a module. Maybe the two are not distinguished, or (i) does not exist.
