TMS320F28P650DK: ECC test issue

Barbara Meglioli

Good afternoon,

I'm using the TMS320F28P650DK9 microcontroller and I'm trying to perform the ECC error test following the Diagnostic Library example.

The test usually works, but I've seen a few cases in which it failed.

As the issue is very sporadic, I'm not able to reproduce it on my test bench (it was found on a couple of our customer's samples).

- It seems that once the test has failed, it keeps failing until I remove the power supply (a SW reset does not seem to be enough).

Can you confirm this?

- Is it possible that an unstable power supply during the test can lead to the failure?

- Is there any known issue with the ECC test procedure?

- Can you suggest a possible improvement or verification that we can introduce to increase the robustness of the test?

Many thanks,

Barbara

5 months ago

0 Whitney Dewey 5 months ago

TI__Guru 54445 points

There is a known issue with the example where it uses an address for a Flash location that is not programmed, so it generates an incorrect error type (an unexpected correctable error). I don't think this would cause the behavior you're seeing though.

I think one thing that you need to be careful about when implementing this test is to make sure not to use the Flash while the test mode is active. That includes the ISR caused by the error and any functions it may call, as well as other interrupts that may happen to occur during that time (it is likely safer to disable them temporarily).

If neither of those suggestions helps, see if you can get any more details. In what way is it failing? Is the application crashing? Or is the test function just returning a fail value? What is the nature of the failure (no NMI generated, interrupt/error status is incorrect, etc.)?

Whitney

0 Barbara Meglioli 5 months ago in reply to Whitney Dewey

Prodigy 50 points

Hello,

Thank you for the answer.

Is it a problem if ISRs located in RAM are active?

In addition, can you give feedback on these two points:

Is it possible that the failure is cleared only by a power cycle and not by a SW reset?

Is it possible that an unstable power supply during the test can lead to the failure?

Unfortunately we don't have additional information, as we are not able to reproduce the issue on our test bench.

But the customer is pressing for feedback.

Thank you

0 Whitney Dewey 5 months ago in reply to Barbara Meglioli

TI__Guru 54445 points

Barbara Meglioli said:
Is it a problem if ISRs located in RAM are active?

If the ISR is located in RAM and you can confirm it doesn't call any functions that are located in Flash, it should be okay.

Barbara Meglioli said:
Is it possible that the failure is cleared only by a power cycle and not by a SW reset?

It's hard to guess why this may be the case without more info about the nature of the failure. I've gone through the descriptions of the related registers and they all appear to be reset by SYSRSn (debugger reset), not POR.

I can't think of any reason why an unstable power supply would affect this test in particular.

How do they know it's the ECC error test causing the issue? What does the application do in event of a failure of the ECC test?

Whitney

0 Barbara Meglioli 4 months ago in reply to Whitney Dewey

Prodigy 50 points

Hello Whitney,

Thank you for your reply.

I now have a unit that shows the issue.

The ECC test error is not related to Flash, as I assumed before, but to the ECC test on RAM.

In particular, the failure occurs when I execute the diagnostic library function runCorrectableECCTest(),

in the part commented as:

//
// Walk through a M0 RAM location until every ECC bit for the upper 16 bits
// (bits 14:8) has had an error injected into it.
//

The test that fails is:

if(errorAddr != (((uint32_t)&m0Data) + 1UL))
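The `+ 1UL` in this comparison follows from the C28x being 16-bit word addressed: a 32-bit variable occupies two consecutive addresses, with the upper 16 bits at the higher one. The host-side sketch below only illustrates that arithmetic and the single-bit masks walked through by the test. The helper names are hypothetical, the 14:8 range comes from the comment quoted above, and the 6:0 range for the lower half is an assumption made by symmetry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: word address at which an injected error should be
 * reported for a 32-bit variable located at word address baseAddr.
 * C28x addresses count 16-bit words, with the low word stored first. */
static uint32_t expectedErrorAddr(uint32_t baseAddr, bool upperHalf)
{
    return upperHalf ? (baseAddr + 1UL) : baseAddr;
}

/* Hypothetical helper: single-bit mask for ECC bit index `bit`.
 * Bits 14:8 cover the upper 16 data bits (per the quoted comment);
 * bits 6:0 for the lower 16 data bits is an assumption. */
static uint32_t eccInjectionMask(unsigned bit)
{
    assert((bit <= 14U) && (bit != 7U));
    return 1UL << bit;
}

/* True if the mask flips an ECC bit that belongs to the upper data half. */
static bool maskTargetsUpperHalf(uint32_t mask)
{
    return (mask & 0x7F00UL) != 0UL;
}
```

With these helpers, a mask built from bit 8 through 14 pairs with `expectedErrorAddr(base, true)`, which is the `&m0Data + 1` value the failing check expects.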

I've been using this test for a long time without issues (the issue is present on only a few samples; the great majority work correctly).

Even the defective sample works well with a previous firmware (that is, a firmware with different functionality but the same ECC test code).

I ran the following experiment: I moved the test from M0 RAM to M1 RAM (M1 RAM was not used, so it is now dedicated to the test). In this case the sample works well.

Can you explain what the issue could be? Is there a limitation on the M0 addresses that I can use for the test? Would it be better to reserve a fixed memory area in the linker for the test? (Currently I'm only mapping the variable to the M0 memory area, not to a fixed address.)

Can you check whether there's a known issue in runCorrectableECCTest() or some recommendation that I could have missed?

Many thanks,

Barbara

0 Whitney Dewey 4 months ago in reply to Barbara Meglioli

TI__Guru 54445 points

What is M0 usually being used for? Stack? This check puts the whole RAM block being tested into a test mode (all of M0, not just the specific address where the error is injected) so that we can create a mismatch between the data and ECC. While the RAM is in test mode, it shouldn't be accessed for things like stack, variables, or code execution until the test mode is disabled again.

Whitney

0 Barbara Meglioli 4 months ago in reply to Whitney Dewey

Prodigy 50 points

Hello,

RAMM0 is also used for:

- hwbiststack (that test is executed before the ECC test, so I don't think it's responsible)

- .data -> here I can't exclude that one of these variables is accessed during test execution, because interrupts are enabled during this phase (as in the TI example sdl_ex_ram_ecc_parity_test.c). Could this be a problem, considering that I'm using corrErrorISR?

Many thanks,

Barbara

0 Barbara Meglioli 4 months ago in reply to Barbara Meglioli

Prodigy 50 points

Can you confirm that using DINT/EINT in this way can solve my issue?

DINT;
testRAMLogic(MEMCFG_SECT_M0, MEMCFG_TEST_WRITE_ECC, (uint32_t)&m0Data, (RAM_ECC_SINGLE_BIT << Idx));
EINT;
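One caveat with a bare DINT/EINT pair is that EINT re-enables interrupts unconditionally, even if the caller had them disabled on entry. A common alternative is to save the previous state and restore it afterwards. The sketch below is a host-side model of that pattern, not target code: `g_intm` stands in for the CPU's global interrupt mask, and `irqDisableSave`, `irqRestore`, and `runWithIrqsMasked` are hypothetical names. On the target, the driverlib global interrupt disable function may already report the prior state; check its documentation before relying on that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Host-side stand-in for the global interrupt mask (true = masked).
 * On the target this is CPU state, not a variable. */
static bool g_intm = false;

/* Hypothetical: mask interrupts and report whether they were already masked. */
static bool irqDisableSave(void)
{
    bool wasMasked = g_intm;
    g_intm = true;              /* stands in for DINT */
    return wasMasked;
}

/* Hypothetical: re-enable only if interrupts were enabled on entry. */
static void irqRestore(bool wasMasked)
{
    if (!wasMasked)
    {
        g_intm = false;         /* stands in for EINT */
    }
}

/* Runs injectStep (a placeholder for the testRAMLogic call) with
 * interrupts masked, then restores the caller's interrupt state. */
static void runWithIrqsMasked(void (*injectStep)(void))
{
    bool wasMasked;

    assert(injectStep != NULL);
    wasMasked = irqDisableSave();
    injectStep();
    irqRestore(wasMasked);
}

/* Demo step: records whether interrupts were masked while it ran. */
static bool g_sawMaskedDuringStep = false;
static void recordMaskState(void)
{
    g_sawMaskedDuringStep = g_intm;
}
```

The restore step matters if the test is ever called from a context that already runs with interrupts disabled, for example during early start-up.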

0 Whitney Dewey 4 months ago in reply to Barbara Meglioli

TI__Guru 54445 points

That could potentially fix it if the issue is indeed caused by an interrupt during that function. More specifically, it's these 3 lines that need to execute without interruption and without access to the memory in test mode.

The corrErrorISR won't execute until the M0 RAM is back in functional mode and the injected error has been read back, so that should be okay.

0 Whitney Dewey 4 months ago in reply to Whitney Dewey

TI__Guru 54445 points

When you get an incorrect address for the ECC error, can you compare those addresses to the map file to see if they match a particular variable's location? Are they all within the area reserved for .data? That might help you figure out if there's an interrupt causing the issue.
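The map-file comparison suggested above can be sketched as a small lookup table. This is only an illustration: the symbol names, addresses, and sizes below are hypothetical and would have to be copied by hand from the project's own map file, and `symbolForAddr` is not part of any TI library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry copied by hand from the linker map file (word addresses). */
typedef struct
{
    const char *name;
    uint32_t    start;   /* first word address of the symbol */
    uint32_t    words;   /* size in 16-bit words */
} MapSymbol;

/* Hypothetical symbols and addresses, for illustration only. */
static const MapSymbol g_m0Symbols[] = {
    { "m0Data",      0x0200UL, 2U },
    { "adcFiltered", 0x0210UL, 6U },
    { "commState",   0x0220UL, 1U },
};

/* Returns the table entry whose range contains addr, or NULL if none does. */
static const MapSymbol *symbolForAddr(uint32_t addr)
{
    size_t i;

    for (i = 0; i < sizeof(g_m0Symbols) / sizeof(g_m0Symbols[0]); i++)
    {
        const MapSymbol *s = &g_m0Symbols[i];

        assert(s->words > 0U);
        if ((addr >= s->start) && (addr < (s->start + s->words)))
        {
            return s;
        }
    }
    return NULL;
}
```

Logging the name returned for each unexpected error address makes it easier to see which ISR owns the variables involved.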

Whitney

0 Barbara Meglioli 4 months ago in reply to Whitney Dewey

Prodigy 50 points

Yes, the returned addresses are within .data

0 Whitney Dewey 4 months ago in reply to Barbara Meglioli

TI__Guru 54445 points

Were you able to figure out what specific variables might be connected to those addresses? I'm wondering if identifying the variables will provide a hint as to whether an interrupt is causing the issue, and which one.

Whitney

0 Barbara Meglioli 4 months ago in reply to Whitney Dewey

Prodigy 50 points

Hello, I was able to find two interrupts that could fire during the test procedure and access M0 memory (the other interrupts don't use M0 memory).

One of these interrupts performs only a read; the other performs a read and four writes. The variables used for the read and the writes are different.

Can you confirm whether the returned address should be the one involved in the read, in the write, or both?

0 Whitney Dewey 4 months ago in reply to Barbara Meglioli

TI__Guru 54445 points

Hi Barbara,

I did some experiments with the RAM test modes today and want to share a few things I observed that may help in your debug.

As I mentioned on our call, error detection occurs when reading the data from memory. However, I did find that if you perform the read while the test mode is active, it does not perform detection and does not seem to generate an error whether one is injected or not. It's not until after the RAM is back in "functional" mode that a read detects an injected error. So basically the reads in your ISR during the test on their own probably aren't enough to cause the incorrect addresses.

Regardless, there are still other issues with having interrupts enabled during the test. The variables you're writing to in your ISR, if written while the RAM is in test mode, will be written with mismatched ECC and data (assuming their value is changing). That plants an error that you will eventually detect when those variables are read later, after the RAM is returned to functional mode. This could be the cause of the error you're seeing, although I don't know if the timeline necessarily makes sense for the particular behavior you're seeing, since you say the read and write variables in your ISRs are different.

Another issue is with the variables you're reading. During the ECC test mode in particular, if you read a variable in that RAM, a read will get the ECC value rather than the data value. It probably won't cause an error to be detected, but the variable will contain invalid data which could be an issue depending on what you're using that variable for.
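The two observations above (no detection on reads while test mode is active, and a planted error only when a test-mode write changes the stored value) can be illustrated with a toy host-side model. This is a simplified sketch under stated assumptions, not the device's logic: `toyCheck` is a plain XOR of the two bytes standing in for real ECC, test-mode reads simply return the data rather than the ECC value, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_WORDS 8U
#define NO_ERROR    0xFFFFFFFFUL

typedef struct
{
    uint16_t data[MODEL_WORDS];
    uint8_t  check[MODEL_WORDS];  /* toy check code, NOT the device's ECC */
    bool     testMode;            /* true: writes do not update the check */
    uint32_t errorAddr;           /* last address that failed the check */
} RamModel;

/* Toy check code: XOR of the two bytes. Stands in for real ECC. */
static uint8_t toyCheck(uint16_t v)
{
    return (uint8_t)((v ^ (v >> 8)) & 0xFFU);
}

static void ramInit(RamModel *r)
{
    uint32_t i;

    for (i = 0; i < MODEL_WORDS; i++)
    {
        r->data[i]  = 0U;
        r->check[i] = toyCheck(0U);
    }
    r->testMode  = false;
    r->errorAddr = NO_ERROR;
}

/* In test mode the data changes but the check bits are left stale. */
static void ramWrite(RamModel *r, uint32_t addr, uint16_t v)
{
    assert(addr < MODEL_WORDS);
    r->data[addr] = v;
    if (!r->testMode)
    {
        r->check[addr] = toyCheck(v);
    }
}

/* Detection happens only on reads in functional mode. */
static uint16_t ramRead(RamModel *r, uint32_t addr)
{
    assert(addr < MODEL_WORDS);
    if (!r->testMode && (r->check[addr] != toyCheck(r->data[addr])))
    {
        r->errorAddr = addr;
    }
    return r->data[addr];
}
```

In this model, an ISR that rewrites an unchanged value during test mode leaves no trace, while one that stores a new value causes an error to be reported at that variable's address on its next functional-mode read, which would explain a reported address that varies with the data.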

Whitney

0 Barbara Meglioli 4 months ago in reply to Whitney Dewey

Prodigy 50 points

Hello Whitney,

I've performed additional tests.

First of all, I was able to reproduce the issue on my bench using the faulty sample.

In this way I was able to read the complete RAM address of the fault. What was not clear to me before was why the returned address changed from one failure to the next, while I expected it to always be the last address written within the ISR. But from your last answer I understand that the ECC logic reports a failure only if the write within the ISR changes the content of the variable.

And this explains what I'm seeing: 3 of the variables that I'm writing contain an analog signal, so their values may or may not change, depending on the HW.

So, I think everything is clear now.

Thank you so much
