AM6546: DDR4 multibit ECC fault

Johnny Mostraum

Part Number: AM6546
Other Parts Discussed in Thread: AM6548, TMDX654IDKEVM

Hi,

We are investigating an issue where we have recently started seeing multibit ECC faults on some of the controller units we produce with the Sitara AM6548 and Nanya DDR4 (NT5AD512M8D3-HRI).

The DDR registers have been configured with “AM65x_DRA80xM_EMIF_Tool_2.03.xlsm” using the following parameters.

Data Rate: 1600 MT/s
ECC Enabled: Yes
CA Parity Enabled: Yes (Parity Latency = 4tCK)
Data Width: 32 bits
CL = 12
CWL = 9
Refresh 3.9us (>85Deg)

Going through the DDR configuration and status registers we have some questions and observations.

We have two warning flags set in DDRPHY_DX0GSR3 to DDRPHY_DX4GSR3: HVWRN (Host VREF Training Warning) and DVWRN (DRAM VREF Training Warning). Both fields read 0x1. What could be causing these warnings?

We noticed that in “SPRUID7E AM65x Technical Reference Manual rev E”, register DDRPHY_DX8SL2PLLCR0, field CPPC, lists 6’b000110 for a PLL reference clock range of 280MHz to 332MHz. We assume this is an error in the manual and that it should be 6’b001110?

For registers DDRPHY_PGCR6 and DDRPHY_PGSR1 “AM65x_DRA80xM_EMIF_Tool_2.03.xlsm” sets the INHVT (VT Calculation Inhibit) and VTSTOP (VT Stop) flags. Are there any situations where it would be beneficial to disable these flags?

There are multiple DDRPHY registers whose descriptions mention that larger values give more conservative command-to-command timing (list below). Today we are using the datasheet values from AM65x_DRA80xM_EMIF_Tool_2.03.xlsm. If we make adjustments in these registers, could this have “side effects”, meaning we would also have to compensate other timing parameters?

DDRPHY_DTPR0, fields TRRD, TRAS, and TRP.
DDRPHY_DTPR1, fields TWLMRD, and TFAW.
DDRPHY_DTPR2, fields TCKE, and TXS.
DDRPHY_DTPR3, fields TRFC, and TWLO.
DDRPHY_DTPR5, fields TODTUP, TRCD, TWTR.

When configuring CA parity, we had issues getting it to work with the updated register values calculated by “AM65x_DRA80xM_EMIF_Tool_2.03.xlsm”. To get it to work, we had to add the parity latency to the DFI_TPHY_WRLAT and DFI_T_RDDATA_EN fields of the DDRCTL_DFITMG0 register. Should the tool have updated these register fields as well?

Regards

Johnny Mostraum

5 months ago

JJD 5 months ago

TI__Guru* 90235 points

Hi Johnny, sorry for the late reply.

1. I don't think the VREF warnings are significant. The spec says:

This flag is asserted in the following situation: The WARNING flag is asserted when VREF value has reached limit (MIN or MAX) during sweep but no data failure detected.

This just tells me that you have a lot of VREF margin, since the sweep hit one of the boundaries without a failure. So I don't think that's a problem.

2. Yes, this appears to be a typo; it should be 6’b001110.

3. The VT calculation inhibit needs to be set anytime you are writing to the delay lines, for example during initialization, that is why the configuration sets it this way. I believe you could clear the bit after init (not sure if our driver does this) to enable the VT compensation logic.

4. Without looking at the details, I would say maybe. If you want to relax any of these values, I would try to do it through the spreadsheet, so that any dependencies get calculated correctly.

5. There looks to be a bug in the tool, whereby the parity latency (PL) is not properly included in the RL and WL parameters. Ideally, when you update PL in the DDR Timing tab, it should update the fields you mentioned. I will have to check it out and update the tool.
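
The dependency in point 5 can be illustrated with a minimal Python sketch of the latency arithmetic. It assumes the JEDEC DDR4 definitions RL = AL + CL + PL and WL = AL + CWL + PL, and an additive latency AL of 0 (the thread does not state AL). The function name is illustrative only. The translation from RL/WL into the actual DFI_TPHY_WRLAT and DFI_T_RDDATA_EN values involves controller-specific offsets that are not modeled here and should be taken from the TRM or the tool.

```python
def ddr4_latencies(cl, cwl, pl=0, al=0):
    """Return (read_latency, write_latency) in tCK.

    Assumes the JEDEC DDR4 definitions:
      RL = AL + CL + PL
      WL = AL + CWL + PL
    AL defaults to 0, which is an assumption (not given in the thread).
    """
    rl = al + cl + pl
    wl = al + cwl + pl
    return rl, wl


# CL and CWL from the original post, first without and then with PL = 4 tCK.
print(ddr4_latencies(12, 9))        # (12, 9)
print(ddr4_latencies(12, 9, pl=4))  # (16, 13)
```

With CA parity enabled, both latencies grow by PL, so any register field derived from RL or WL needs to move by the same amount.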

When you say you are seeing this lately, can you give me more context? Are these new board builds? different manufacturer? different device variants?

If you are thinking there is something marginal in the timing that is causing ECC errors, you can relax the parameters as necessary. Have you tried with CA parity disabled, to see if you still get ECC errors (this will help eliminate timing errors due to PL)?

Regards,

James

Johnny Mostraum 5 months ago in reply to JJD

Prodigy 80 points

Hi James,
There are minor changes to the hardware board, but nothing that should affect the memory. The AM65 Sitara and the memory chips are from the same production batch. In parallel, we also plan to replace first the memory chips and then the Sitara on failing modules, and to analyze the PCB.

We have tried with CA parity disabled, but we still get ECC errors.

When reading data from memory that has ECC errors, almost all bits are set to 1 except for one bit (example: 0xFFFFFFFF). Is it possible for you to say something about the ECC error based on this?

We have logged some DDR PHY registers on different modules (attached Excel file). The module in yellow gets ECC errors very frequently (after about 1 minute). Can you see anything in the registers that might be wrong?

Regards

Johnny

DdrPhyRegisters.xlsx

JJD 5 months ago in reply to Johnny Mostraum

The 0xFFFFFFFF almost makes it seem like something more catastrophic is happening. When the error happens, do all address locations show all bits set to 1 except for one bit? Does this error occur after some time of successful operation (how much time?), or do these ECC errors happen immediately?

Is it possible to test these bad boards without ECC enabled? This may give you a better feel for what the error is. If the problem is a few bits in error at some address locations, that is different than if the total memory gets corrupted. With ECC disabled, this would be easier to see and assess. If all addresses are affected, there might be some event that is corrupting either the controller or the memory contents.
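
As a rough illustration of that assessment, the Python sketch below compares expected and read-back words and reports how many addresses miscompare and which bit positions flipped. The function names and the sample data are made up for illustration; on a real board the word lists would come from a memory test run with ECC disabled.

```python
from collections import Counter


def flipped_bits(expected, actual, width=32):
    """Return the bit positions (0 = LSB) where actual differs from expected."""
    mask = (1 << width) - 1
    diff = (expected ^ actual) & mask
    return [bit for bit in range(width) if (diff >> bit) & 1]


def summarize(expected_words, actual_words, width=32):
    """Return (number of miscomparing words, {bit position: flip count})."""
    bad_words = 0
    per_bit = Counter()
    for exp, act in zip(expected_words, actual_words):
        bits = flipped_bits(exp, act, width)
        if bits:
            bad_words += 1
            per_bit.update(bits)
    return bad_words, dict(per_bit)


# Hypothetical data: one word with a single flipped bit.
expected = [0x12345678, 0xDEADBEEF]
actual = [0x12345678, 0xDEADBEEE]
print(summarize(expected, actual))  # (1, {0: 1})
```

A handful of bad words concentrated on a few bit positions would point toward marginal signals on specific data lines, whereas miscompares spread over most addresses and bits would point toward a broader corruption event.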

I looked over the registers, and I didn't see anything out of the ordinary. The training values will be slightly different from run to run, and I didn't see anything that would indicate bad training or saturated values.

Regards,

James

Johnny Mostraum 5 months ago in reply to JJD

One thing we have discovered is that increasing the write load increases the frequency of ECC errors. Are there any write parameters we can change that might resolve the ECC errors?

We have tested the bad board without ECC enabled, but we have not been able to detect any errors at this time.

Regards

Johnny

JJD 5 months ago in reply to Johnny Mostraum

Maybe you are inducing more noise because of poor signal integrity. Since you can increase the issue with more writes, check the drive strength on the addr/data and ODT on the memory. Are the values you are using coming from board simulations? Or did you just use the default values in the spreadsheet?

So does that also mean that you don't see an increase of ECC errors when increasing read load?

What type of error are you seeing? Still what you described above?

Johnny Mostraum said:
We have tested the bad board without ECC enabled, but we have not been able to detect any errors at this time.

This is interesting. What's the DDR topology on your board? Can you send a schematic? Is the ECC memory a separate die?

Regards,

James

Johnny Mostraum 5 months ago in reply to JJD

Hi James,

We have not done any board simulations; our values for addr/data drive strength and ODT on the memory are our best guesses (see attached image).

Since our memory layout is as similar to the evaluation board (TMDX654IDKEVM) as possible, we have performed a test with the default TI recommended impedance parameters (see attached image).

With the default configuration, we have tested for 3 days on 3 different modules without any ECC errors (single or double error), so it looks very promising.
Is this the best addr/data and ODT on memory for us?

There is a note in the spreadsheet:
** NOTE: Users should check for recommendations provided by their DDR manufacturer.

We have checked the datasheet for the DDR memory for recommendations but have not found any.
Do you know where we can find these recommendations?

Regards Johnny

JJD 5 months ago in reply to Johnny Mostraum

Hi Johnny, without simulation data, and since the layout is similar to the EVM, the default values in the tool would definitely be recommended. I think vendors typically provide recommendations in the form of app notes or by talking with them directly; they aren't usually in the datasheet.

If you can perform your tests across temperature, then I think that should give you some good confidence that the configuration is robust.

Regards,

James

Johnny Mostraum 5 months ago in reply to JJD

We have now tested on 3 different modules for 3 days at room temperature and 1 day at high temperature (ambient 65 degrees Celsius and the AM65 chip around 100 degrees Celsius) without any ECC errors (single or double errors), so it still looks very promising.

When choosing the termination impedance, we simply look at the termination resistors (39.2 ohms) and choose the one closest to this. Can you explain a little about the parameters for the termination impedance?

ODT / Rtt_Nom

Output Driver Impedance

ODT / Rtt

Output Driver Impedance: Addr/Ctrl/Clk

Output Driver Impedance: Data/Strobe
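
The "closest value" selection described above can be sketched in Python as follows. The candidate list is an assumption: it uses the DDR4 RTT_NOM options RZQ/1 to RZQ/7 with RZQ = 240 ohms, and the options the spreadsheet actually offers for each parameter may differ. The function name is illustrative only.

```python
RZQ_OHMS = 240.0  # DDR4 reference resistor value


def nearest_rzq_divider(target_ohms, dividers=range(1, 8)):
    """Return the divider n for which RZQ/n is closest to target_ohms."""
    return min(dividers, key=lambda n: abs(RZQ_OHMS / n - target_ohms))


n = nearest_rzq_divider(39.2)
print(f"RZQ/{n} = {RZQ_OHMS / n:.1f} ohms")  # RZQ/6 = 40.0 ohms
```

This only reproduces the nearest-value heuristic; it says nothing about signal integrity on a given board.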

Then we have a question about DDRPHY_PGCR1 bits 8-7, ALERTMODE. We initialize this field to 0b00 since we are using CA parity, and CA parity works as expected. However, the TRM states "After PHY initialization, this field should be programmed as 2'b00 if parity is supported." Is it important that these bits are set to something else (e.g. 0b01) during initialization?

Regards Johnny

JJD 5 months ago in reply to Johnny Mostraum

Hi Johnny,

ODT/RTT_Nom is the ODT setting in the memory (used during writes)

Output Driver Impedance is the driver impedance setting in the memory (used during reads)

ODT/Rtt is the ODT setting in the processor (used during reads)

Output Driver Impedance: Addr/Ctrl/Clk is the driver impedance setting in the processor for all Addr/Ctrl/Clk signals (used for all DDR commands)

Output Driver Impedance: Data/Strobe is the driver impedance setting in the processor for all DQ/DQS/DM signals (used during writes)
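
Since the errors in this thread scaled with write load, it can help to see which of these settings are in play during writes. The Python sketch below simply encodes the list above as a lookup table. The class and function names are illustrative, and treating Addr/Ctrl/Clk as active for both reads and writes is an interpretation of "all DDR commands".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpedanceSetting:
    name: str             # parameter name as shown in the spreadsheet
    device: str           # "memory" or "processor"
    role: str             # "odt" or "driver"
    active_on: frozenset  # subset of {"read", "write"}


SETTINGS = [
    ImpedanceSetting("ODT / Rtt_Nom", "memory", "odt", frozenset({"write"})),
    ImpedanceSetting("Output Driver Impedance", "memory", "driver",
                     frozenset({"read"})),
    ImpedanceSetting("ODT / Rtt", "processor", "odt", frozenset({"read"})),
    ImpedanceSetting("Output Driver Impedance: Addr/Ctrl/Clk", "processor",
                     "driver", frozenset({"read", "write"})),
    ImpedanceSetting("Output Driver Impedance: Data/Strobe", "processor",
                     "driver", frozenset({"write"})),
]


def settings_for(operation):
    """Return the names of the settings that matter for 'read' or 'write'."""
    return [s.name for s in SETTINGS if operation in s.active_on]


print(settings_for("write"))
```

For writes this lists the memory-side ODT plus the two processor-side driver impedances, which are the values to review first when errors correlate with write traffic.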

For ALERTMODE, I think the note is just saying that the Alert signal should be enabled with the 0b00 setting; otherwise the CA parity error signal from the memory will never be seen by the controller. It doesn't imply that it should be some other setting during init.

Regards,

James
