MSP430FR5989: PUC reset caused by MPUSEGIIFG or ACCTEIFG

Javier Navarro

Prodigy 10 points

Part Number: MSP430FR5989


Hi,

We are using an MSP430FR5989 in one of our projects. We configure zero wait states for FRAM access (NWAITS = 0).

FRCTL0_H = (UCHAR)(FRCTLPW >> 8); /* Unlock the FRAM controller registers. */
FRCTL0_L = NWAITS_0; /* FRAM wait states: 0 */
/* Note: each !FLAG term evaluates to 0, so the expression below reduces to GCCTL0 = FRPWR. */
GCCTL0 = !UBDRSTEN /* Disable PUC on uncorrectable bit error detection flag. */
| !UBDIE /* Disable NMI for the uncorrectable bit error detection flag (UBDIFG). */
| !CBDIE /* Disable NMI for the correctable bit error detection flag (CBDIFG). */
| !FRLPMPWR /* Disable FRLPMPWR while keeping FRPWR set */
| FRPWR; /* Enable ACTIVE mode. */
GCCTL1 = 0x00; /* All flags cleared */
FRCTL0_H = 0x00; /* Lock the FRAM controller registers again. */

We run MCLK at 8 MHz, using the following configuration:

CSCTL1 = DCORSEL | DCOFSEL_3; /* 8 MHz */

Most devices work without any problems, but a few have been reset by a PUC. The problem has occurred more than once (2 or 3 times) on each affected device, and always on the same few devices. When we read the System Reset Interrupt Vector (SYSRSTIV) to find the reset cause, it is always MPUSEGIIFG, except for one occasion when it was ACCTEIFG (access time error).
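A minimal sketch of how the two reset causes mentioned above could be classified from a SYSRSTIV reading. The value 0x0028 for MPUSEGIIFG is the one quoted later in this thread; the value 0x0030 for ACCTEIFG is an assumption that should be checked against the device header or datasheet. The names `classify_reset`, `reset_cause_t`, and the `RST_*` constants are hypothetical and only serve the illustration.

```c
#include <stdint.h>

/* SYSRSTIV values of interest.
 * 0x0028 (MPUSEGIIFG) is quoted in this thread.
 * 0x0030 (ACCTEIFG) is an ASSUMPTION; verify it against the device header. */
#define RST_MPUSEGIIFG 0x0028u
#define RST_ACCTEIFG   0x0030u

/* Hypothetical classification used only for this example. */
typedef enum {
    RESET_OTHER = 0,
    RESET_MPU_INFO_VIOLATION,  /* MPU violation on the information memory segment */
    RESET_FRAM_ACCESS_TIME     /* FRAM access time error */
} reset_cause_t;

/* Map one value read from SYSRSTIV to a coarse reset cause. */
reset_cause_t classify_reset(uint16_t sysrstiv_value)
{
    switch (sysrstiv_value) {
    case RST_MPUSEGIIFG:
        return RESET_MPU_INFO_VIOLATION;
    case RST_ACCTEIFG:
        return RESET_FRAM_ACCESS_TIME;
    default:
        return RESET_OTHER;
    }
}
```

On the target, the argument would come from reading SYSRSTIV at startup. Because each read returns only the highest-priority pending cause, the register is usually read in a loop until it returns 0, so that a second pending cause is not missed.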

According to the Recommended Operating Conditions in the microcontroller datasheet, NWAITS = 0 is allowed for an MCLK of 8 MHz.

Additionally, we reviewed the microcontroller errata document and implemented the recommended workarounds for issues related to frequency deviations and PMM configuration, such as those for CS12 and PMM29.

So, my questions are the following:

Is it possible that, on some devices, the factory calibration is off in such a way that the microcontroller ends up running slightly above 8 MHz, so that NWAITS = 0 produces the PUC caused by ACCTEIFG?

Or is it possible that, under some conditions, the device runs at a higher frequency for a short period of time, which also produces a PUC caused by ACCTEIFG?

Finally, is there any possibility that these conditions could cause a memory protection violation (MPUSEGIIFG), which is precisely the most common cause?

Thank you in advance for your help.

over 1 year ago

0 Eason Zhou over 1 year ago

TI__Mastermind 39775 points

It is possible that, on some devices, the factory calibration is off in such a way that the microcontroller ends up running slightly above 8 MHz. However, TI reserves some margin for that and uses ATE (automatic test equipment) to cover these test items.

My guess is that the problem happens when the clock starts again after exiting LPM3, a case that the PMM29 workaround does not cover. Can you run some tests:

1. Change to NWAITS = 1 and see whether this solves the problem

2. Change LPM3 to, for example, LPM0 and see whether this solves the problem

0 Javier Navarro over 1 year ago in reply to Eason Zhou

Prodigy 10 points

Hi Eason, thank you for your response.

Because of power consumption restrictions, we cannot use LPM0. On the other hand, changing NWAITS is a high-impact change, and we prefer not to make it unless we are sure that it is the cause of both issues.

As I mentioned in my first post, the problem happens on only a few devices, but on those devices it has happened more than once. Mostly, the PUC is caused by MPUSEGIIFG (a write or execute violation in the info segment). We neither write to nor execute from that segment, so we think it could be caused by some misbehavior. We want to know whether you have seen any MPU violations that turned out to be a side effect of another issue. Since one device had a PUC caused by ACCTEIFG, we want to know whether a problem related to NWAITS and the access time violation could, for any reason, trigger an MPU violation.

0 David Schultz over 1 year ago

Guru 24885 points

I don't know how the hardware determines when an access time violation has happened. Details of that would determine how much margin there is on the 8MHz limit. But looking at the data sheet I don't get a warm fuzzy feeling.

Note 7 on the Recommended Operating Conditions states that clock frequencies equal to or less than the MAX value specified are permitted. But look at the nominal 8 MHz DCO setting: it has a +/-3.5% tolerance. That +3.5% would seem to be a problem.

You could try using the 7MHz DCO setting and see if that helps.
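The margin argument above can be written out as simple arithmetic. The sketch below uses the 8 MHz limit for NWAITS = 0 and the +/-3.5% DCO tolerance quoted in this thread. Applying the same tolerance to the 7 MHz setting is an assumption, and the function names are hypothetical.

```c
#include <stdbool.h>

/* Worst-case (fastest) frequency for a nominal setting with a +/- tolerance in percent. */
double worst_case_hz(double nominal_hz, double tol_percent)
{
    return nominal_hz * (1.0 + tol_percent / 100.0);
}

/* True if the worst-case frequency is above the zero-wait-state limit. */
bool exceeds_limit(double nominal_hz, double tol_percent, double limit_hz)
{
    return worst_case_hz(nominal_hz, tol_percent) > limit_hz;
}
```

With these numbers, a nominal 8 MHz setting can reach 8.28 MHz, which is above the 8 MHz limit, while a nominal 7 MHz setting tops out at 7.245 MHz and stays below it. Whether the hardware actually flags an access time error at 8.28 MHz depends on margin that is not documented in this thread.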

0 Eason Zhou over 1 year ago in reply to Javier Navarro

TI__Mastermind 39775 points

Can you check whether these bits are cleared in your application?

In my experience, this may happen when:

1. Protection is enabled for information memory

2. The MCU is in an abnormal state and the PC accesses the information memory by accident.

Can you try what David suggested?

0 Javier Navarro over 1 year ago in reply to David Schultz

Prodigy 10 points

Hi, thanks for your answer. For now, I don't want to change the DCO frequency because that could have a high impact on the normal operation of our device. This is something that we have to evaluate first. However, the main failure we are seeing on a few devices is the memory protection unit violation on the information area. We have this area protected against writes and execution, but we never write to or execute from it, so we don't know why the violation happens. My question is whether there could be some relation between a wrong NWAITS value and the MPU violations.

0 David Schultz over 1 year ago in reply to Javier Navarro

Guru 24885 points

The reason to try the lower frequency is to test the hypothesis that there is an FRAM access time violation related to it. Not to always use that lower frequency.

0 Eason Zhou over 1 year ago in reply to Javier Navarro

TI__Mastermind 39775 points

Hi Javier,

The purpose of making these changes is to find the root cause. After that, you can evaluate which solution to adopt.

0 Javier Navarro over 1 year ago in reply to Eason Zhou

Prodigy 10 points

I cannot perform those tests, neither changing the DCO to 7 MHz nor changing NWAITS to 1. The problem is that the failing devices are not accessible to us, and for safety reasons we cannot deploy those changes because they could have a big impact on the normal operation of the devices.

We can make those changes on the test devices we have here in the lab, but so far we have not found a lab device that shows the failure. Implementing those changes on devices that have never failed won't tell us whether either of them addresses the root cause of the problem.

We are running tests to see whether we can reproduce the failure reliably in the lab, so that we can then apply the changes and check whether either of them fixes the problem. In the meantime, we cannot test those solutions.

However, as I mentioned before in the forum, either of those changes could fix the FRAM access time error, but I am not sure they would help with the memory protection violation, which is the most frequent problem we are seeing. My original question was meant to find out whether anybody has seen an MPU violation that could be related to access time errors.

0 David Schultz over 1 year ago in reply to Javier Navarro

Guru 24885 points

Getting a system reset from an MPU access violation requires that you request that behaviour; the power-up default is an interrupt. If you aren't using the information memory, I can't see any reason to force a reset, as that doesn't seem to be a critical error. Fielding an interrupt and logging the error, perhaps, but not a reset.

0 Javier Navarro over 1 year ago in reply to David Schultz

Prodigy 10 points

Hi David. For safety reasons, we cannot allow code to continue executing after an error of that kind happens. We have to enable the PUC and, once the device is reset, read the reset cause and put the device in a special safe mode.

The point is to understand why the MPU violation is happening. Because it has happened on only a few devices from the same MSP430FR5989 batch, and not on devices from other batches, we suspect a misbehavior of the microcontroller.

While we are on the subject, I want to ask something related to the PMM32 erratum of this microcontroller. The errata document says something like:

Device may enter lockup state or execute unintentional code during transition from AM to LPM2/3/4

Specifically, regarding condition 2, the document says something like:

Condition 2:
The following events happen at the same time:
1) The device transitions from AM to LPM2/3/4 (e.g. during ISR exits or Status Register modifications),
AND
2) An interrupt is requested (e.g. GPIO interrupt),
AND
3) Neither MODCLK nor SMCLK are running (e.g. requested by a peripheral),
AND
4) SMCLK is configured with a different frequency than MCLK.

In our device, points 1, 2 and 4 can happen. We use LPM3 (we need that mode and cannot use another), we have some interrupts enabled, and SMCLK is configured with a different frequency than MCLK. But we are not sure what point 3 means. If points 1, 2 and 4 happen at the same time and SMCLK is being used by a peripheral at that moment (e.g. the ADC), are we in the errata condition or not?

It looks like the condition requires that SMCLK is not running when the other 3 points occur (the most common situation). But we are not sure whether it means that or the opposite. Please help us clarify this.

0 Eason Zhou over 1 year ago in reply to Javier Navarro

TI__Mastermind 39775 points

The problem happens at the moment the MCU enters LPM2/3/4, within a window on the order of microseconds.

For 2), it means that an interrupt request is generated just as the MCU enters LPM. It doesn't mean you can't use any interrupts.

For 3), it means these clocks are not used by any peripheral. If the ADC is sourced by SMCLK and an ADC interrupt is generated as the MCU enters LPM, you will not hit this problem, because SMCLK is in use by a peripheral. If the interrupt comes from a GPIO and no other peripheral is active as the MCU enters LPM, you will meet 3).
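The interpretation above can be summarized as a boolean predicate. This is only a sketch of condition 2 as it is paraphrased in this thread, not the erratum's authoritative wording; the struct and function names are hypothetical.

```c
#include <stdbool.h>

/* Snapshot of the system at the instant the LPM entry is attempted (hypothetical type). */
typedef struct {
    bool entering_lpm234;         /* 1) transition from AM to LPM2/3/4 */
    bool irq_requested;           /* 2) an interrupt is requested at the same time */
    bool smclk_or_modclk_running; /* 3) SMCLK or MODCLK kept running, e.g. requested by a peripheral */
    bool smclk_differs_from_mclk; /* 4) SMCLK frequency differs from MCLK */
} pmm32_state_t;

/* True when all four parts of condition 2 coincide, as explained in the reply above. */
bool pmm32_condition2_met(const pmm32_state_t *s)
{
    return s->entering_lpm234
        && s->irq_requested
        && !s->smclk_or_modclk_running
        && s->smclk_differs_from_mclk;
}
```

Under this reading, a GPIO interrupt arriving during LPM3 entry with no peripheral requesting SMCLK satisfies the condition, whereas an ADC that keeps SMCLK running during the same transition does not.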

0 Javier Navarro over 1 year ago in reply to Eason Zhou

Prodigy 10 points

Hi Eason, thank you for your answer. Yes, we could be in that condition, because no peripheral uses SMCLK while in LPM. However, I don't think this is the cause of the issue we are seeing, because it happens on only a few devices, and apparently only in a specific microcontroller batch.

0 Eason Zhou over 1 year ago in reply to Javier Navarro

TI__Mastermind 39775 points

Yes, I understand your concern. I have talked with your assigned sales team. It may be a process problem, but that is only a possibility. Please refer to the comment below:

For this case, I would suggest involving CQE to track this issue. We will have a conclusion and a solution after we finish the bench test and the ATE test and know the details of this problem. Currently, my only suggestions are:
1. First, obtain an NG (failing) device that can reproduce this failure.
2. Second, create the smallest software example that can reproduce this failure.
3. Third, send the NG devices and the software code to us.

0 Javier Navarro over 1 year ago in reply to Eason Zhou

Prodigy 10 points

Hi Eason, thank you for your answer.

Yes, we are trying to get a "failed" device to proceed with some tests and follow the steps you sent.

Regards

0 Javier Navarro over 1 year ago in reply to Javier Navarro

Prodigy 10 points

To add to this thread, we have noted an apparent correlation between the unexplained occurrences of the MPUV events (with MPUSEGIIFG = 0x0028) and the batch of the M430FR5989SRGCREP microcontrollers. For M430FR5989SRGCREP microcontrollers used in the same hardware and running the same code, we have identified the following statistical differences in the occurrence of failures:

Number of units that fail over total number of units distributed, depending on the batch:

Units with microcontrollers of Lot Trace Code 36I A0KR and 36I A0KJ: 7 units failing over 74 units

Units with microcontrollers of other Lot Trace Codes: 0 units failing over 71 units

Number of failures over accumulated period of use, depending on the batch:

Units with microcontrollers of Lot Trace Code 36I A0KR and 36I A0KJ: 25 failures (happening in a total of 7 units) over an accumulated period of use of 13 years

Units with microcontrollers of other Lot Trace Codes: 0 failures over an accumulated period of use of 59 years

This correlation does not necessarily imply causality, but we must consider all elements.
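As a rough check on how unusual the split above is, the sketch below computes two figures from the numbers in this post: the failure rate per accumulated unit-year, and the probability that all failing units would land in the suspect lots if failures were independent of the lot (a hypergeometric calculation). The independence assumption and the function names are mine, not part of the original analysis.

```c
/* Probability that all `failures` failing units fall inside a group of
 * `group` units when drawn without replacement from `total` units.
 * Requires failures <= group <= total. */
double prob_all_failures_in_group(unsigned group, unsigned total, unsigned failures)
{
    double p = 1.0;
    for (unsigned i = 0; i < failures; ++i) {
        p *= (double)(group - i) / (double)(total - i);
    }
    return p;
}

/* Failures per accumulated unit-year of use. */
double failures_per_unit_year(unsigned failures, double unit_years)
{
    return (double)failures / unit_years;
}
```

With 7 failing units, 74 units in the suspect lots and 145 units in total, `prob_all_failures_in_group(74, 145, 7)` comes out at roughly 0.008, and `failures_per_unit_year(25, 13.0)` at roughly 1.9 for the suspect lots versus 0 for the others. This supports treating the lot as a strong lead, but it does not by itself establish causality.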

We do not have physical access to any of the units already distributed (including the 7 units that have failed so far), but we have means to monitor them remotely, and that is how we can count occurrences of the MPUV events and record their characteristics. Information on the Lot Trace Codes was obtained through traceability records. We have not been able to reproduce any of these failures in our facilities.

0 Eason Zhou over 1 year ago in reply to Javier Navarro

TI__Mastermind 39775 points

Hi Javier Navarro,

Thank you for sharing the details. I still suggest that you involve our CQE. They can help track the lot difference with the factory, and they also have more experience in handling this kind of condition.

Javier Navarro said:
25 failures (happening in a total of 7 units) over an accumulated period of use of 13 years

Judging from your information, I think it may be hard to reproduce this problem, especially if it is related to the environment, such as the power supply or the temperature.
