Problem with DDR3-RAM Initialization

Peter Hoepfner

Other Parts Discussed in Thread: TMS320C6672

Hi all:

We encounter problems when initializing DDR3 RAM at higher temperatures. The device was in a 40 degrees Celsius (104 degrees F) environment.

e.g. after initializing RAM with all 0 we read back

0000: 00000000 00001000 00000000 00000000 00000000 00000000 00000000 00000000

0020: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

0040: 00000000 00001000 00000000 00000000 00000000 00000000 00000000 00000000

0060: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

0080: 00000000 00001000 00000000 00000000 00000000 00000000 00000000 00000000

... and so on.

Writing and/or reading does not change the behavior, the result is stable. Only restarting the DDR3 init procedure will bring the RAM into a working state.

Once the RAM is working correctly after initialization it will continue to work correctly regardless of the temperature. So it is an initialization only problem.

We follow the steps described in sprabl2.pdf with the dummy delay loops replaced with TSC clock loops. We entered the characteristic values from our board into the spreadsheets and use the values computed there. We use sprz335e errata Advisory 9 workaround 1 (Partial Automatic Leveling).

We are using a TMX320C6672ACYP #20-######## with four Micron MT41J128M16HA-15E devices. The CPU is running at 1000 MHz and the RAM is operating at 666.666 MHz (DDR3-1333). The SDRAM bus interface is 64 bits wide with one chip select.
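A minimal sketch of what a counter-based delay loop might look like, in place of a dummy loop. The names `cycle_counter_fn`, `us_to_cycles`, `delay_cycles` and `fake_counter` are hypothetical and not from sprabl2 or the poster's code. On the target the counter hook would presumably read the core's time stamp counter; here it is injected as a function pointer so the logic can be checked on a host. At a 1000 MHz core clock, one microsecond corresponds to 1000 cycles.

```c
#include <stdint.h>

/* Hypothetical hook: returns the current value of a free-running 32-bit
 * cycle counter. On target this would read the time stamp counter. */
typedef uint32_t (*cycle_counter_fn)(void);

/* Convert microseconds to CPU cycles for a core clock given in MHz.
 * No overflow check: the product must fit in 32 bits. */
uint32_t us_to_cycles(uint32_t us, uint32_t cpu_mhz)
{
    return us * cpu_mhz;
}

/* Busy-wait until at least `cycles` counter ticks have elapsed.
 * The unsigned subtraction keeps the comparison correct even if the
 * counter wraps around once during the wait. */
void delay_cycles(cycle_counter_fn read_counter, uint32_t cycles)
{
    uint32_t start = read_counter();
    while ((uint32_t)(read_counter() - start) < cycles) {
        /* spin */
    }
}

/* Host-side stand-in for the hardware counter: advances by 7 per read. */
uint32_t fake_counter_value;
uint32_t fake_counter(void)
{
    fake_counter_value += 7u;
    return fake_counter_value;
}
```

On target, a call such as `delay_cycles(read_tsc, us_to_cycles(10, 1000))` would wait roughly 10 us at 1000 MHz, where `read_tsc` is whatever routine returns the counter value.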

Our questions:

1) Is there any recommended procedure to avoid this kind of RAM error ?

2) Where can we see the result of the leveling ? Is it just the success bit in the STATUS register ? Or can we see the values derived ?

3) Is there a way to find out whether the DDR3 PLL has locked other than just waiting ?

We have tried to activate the incremental leveling but there are open questions.

4) The documentation in sprugv8c 4.20-4.22 seems to contradict sprabl2 Examples 20 and 22:

- Incremental leveling should be started by setting RDWR_LVL_EN=1 and RDWRLVLFULL_START=0 and setting any of RDLVLINC_INT, RDLVLGATEINC_INT and/or WRLVLINC_INT to a nonzero value.

- The time for the incremental leveling is

64 * (RDWRLVLINC_PRE+1) * (RDLVLINC_INT+RDLVLGATEINC_INT+WRLVLINC_INT) * 7.8 us

Do we need to manually start a delay while no DDR3 RAM access is allowed ?

Or can we simply read the DDR3_STATUS register and this access will be delayed by memory controller ?

- If we leave RDWR_LVL_EN=1 will every SmartReflex event delay execution for 10 ms while RAM is being leveled ?
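As a worked example of the incremental leveling time formula quoted above, the following sketch evaluates it in integer nanoseconds (7.8 us = 7800 ns). The function name is hypothetical, and the field values used in the usage note are arbitrary illustrations, not recommended settings.

```c
#include <stdint.h>

/* Evaluate 64 * (RDWRLVLINC_PRE + 1)
 *             * (RDLVLINC_INT + RDLVLGATEINC_INT + WRLVLINC_INT) * 7.8 us
 * and return the result in nanoseconds. */
uint64_t incremental_leveling_time_ns(uint32_t rdwrlvlinc_pre,
                                      uint32_t rdlvlinc_int,
                                      uint32_t rdlvlgateinc_int,
                                      uint32_t wrlvlinc_int)
{
    const uint64_t interval_ns = 7800u; /* 7.8 us expressed in ns */
    uint64_t int_sum = (uint64_t)rdlvlinc_int + rdlvlgateinc_int + wrlvlinc_int;

    return 64u * ((uint64_t)rdwrlvlinc_pre + 1u) * int_sum * interval_ns;
}
```

For instance, with RDWRLVLINC_PRE=0, RDLVLINC_INT=1, RDLVLGATEINC_INT=1 and WRLVLINC_INT=0, the formula gives 64 * 1 * 2 * 7.8 us = 998.4 us, which the function returns as 998400 ns.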

5) Is our understanding correct that RDWR_LVL_RMP_WIN and RDWR_LVL_RMP_CTRL are used only with SmartReflex ?

Please help us,

- Peter

over 12 years ago

0 Tom Johnson 16214 over 12 years ago

TI__Mastermind 46460 points

Peter,

I am not sure why the initialization is failing at higher temperatures. Based on your statement I assume the initialization works robustly at lower temperatures. What ambient temperatures have you tested? Can you provide the temperature of the DSP and the DRAM in both the functional and failing cases?

DDR3 DRAM has a high temperature mode where the refresh rate has to double. Are you using the higher refresh rate?

The DDR3 controller has a status register at 0x2100_0004. Have you read this register? What value do you see for both the passing and failing cases? The leveling pointers are not readable as memory mapped values.

Incremental leveling is not needed for robust operation across the specified operating temperature range. The Partial Automatic Leveling is sufficient. I recommend that you resolve this issue without Incremental Leveling. Later you can add this feature to improve margin for long-term operation. If you are using the Read Incremental Leveling to converge the read pointer with full Automatic Leveling, you must delay accesses to the DDR3 memory during the period of convergence. The controller will not hold off these accesses and they cannot be guaranteed to be valid.

We originally added the Ramp timers for incremental leveling to support Smart Reflex Class 3 which would have periodically changed the CVDD voltage. Since we were able to achieve our power targets with Smart Reflex Class 0 which only adjusts CVDD at start-up, the Ramp timers are not required.

Incremental write leveling is not supported. Please see the Errata document. The app note sprabl2 is being updated to reflect this in the examples.

I reviewed sprugv8c 4.20-4.22 and sprabl2 Examples 20 and 22. Both indicate that RDWR_LVL_EN (bit 31 of RDWR_LVL_RMP_CTRL) and RDWRLVLFULL_START (bit 31 of RDWR_LVL_CTRL) are active high and must be set to enable their respective functions. Please provide more information on this issue.

Tom

0 Peter Hoepfner over 12 years ago in reply to Tom Johnson 16214

Intellectual 265 points

Hi Tom,

thank you very much for your answer.

We've run additional tests at 50 degrees Celsius (122 degrees Fahrenheit). We've left it in the oven for a long while without powering it up to make sure both the DSP and the RAM were at exactly this temperature. Then we turned on the device for only short periods.

In this case the DDR3 RAM is not working. At lower temperatures, i.e. 25 degrees Celsius (77 degrees Fahrenheit), the device starts reliably.

The DDR3 memory controller STATUS register shows 0x40000004 always.

We have not tried the extended temperature range settings because we are below the 85 degrees Celsius level. Anyway, we've tried higher refresh rates which did not solve the problem.

With your explanations on incremental read leveling we've decided to stay away from this and continue with partial automatic leveling only.

I hope I provided all of the required information to give us hints on solving our problem.

Sincerely Peter

0 Tom Johnson 16214 over 12 years ago in reply to Peter Hoepfner

TI__Mastermind 46460 points

Peter,

I am not comfortable with your thermal assumptions. The DSP and SDRAM heat up very quickly after power is applied. The rate of heating varies greatly depending on the heatsink used and the amount of airflow. This test needs to be conducted with a thermo-couple attached to the DSP and to the SDRAM. Please provide the case temperature of these devices during the test.

Tom

0 Peter Hoepfner over 12 years ago in reply to Tom Johnson 16214

Intellectual 265 points

Hi Tom,

I am very sorry for not considering the heat-up. It was my understanding that the DSP and RAM do not heat up very much during the first second of system startup. We are using EMIF boot and the RAM init is among the very first things we do. There is less than a second between power-on and the critical DDR3 RAM init.

I'm afraid our thermocouples sample too slowly and will not produce reliable temperature values over such a short period.

The C6672 has a heatsink with a fan on it but the fan is still spinning up when the RAM init procedure runs.

Sincerely Peter

0 Tom Johnson 16214 over 12 years ago in reply to Peter Hoepfner

TI__Mastermind 46460 points

Peter,

Your system needs to operate in steady-state operating conditions. Restricting your tests to events that must occur while the fan is still powering up is making your task too hard. Instead of debugging this as a "hot start" test that is very transient, I recommend operating at various steady state operating temperatures. Then you can say: it passed when operating the DSP and SDRAM when their case temperatures were xx and yy and started failing when the case temperatures were ww and zz. This also solves your thermo-couple sample rate problem.

Tom

0 Tom Johnson 16214 over 12 years ago in reply to Tom Johnson 16214

TI__Mastermind 46460 points

Peter,

Do you see this identical behavior on multiple boards?

Tom

0 Peter Hoepfner over 12 years ago in reply to Tom Johnson 16214

Intellectual 265 points

Hi Tom,

we have this problem on all of our boards.

However, the failure rates differ.

- All of the boards fail at 50 degrees Celsius at least one out of 20 system startups.

- Only a few boards show failures at 40 degrees already.

So the behaviour is not identical but very similar. I hope this information helps you point us in the right direction.

Sincerely Peter

0 Tom Johnson 16214 over 12 years ago in reply to Peter Hoepfner

TI__Mastermind 46460 points

Peter,

Please provide the test results requested with steady state case temperatures.

Tom

0 Peter Hoepfner over 12 years ago in reply to Tom Johnson 16214

Intellectual 265 points

Hi Tom,

I have the steady temperature values of DSP and RAM for you here. We've measured these values outside the oven in normal environment.

DSP 36.2 degrees Celsius (97.16 F)

RAM1: 35.5C (95.9 F), RAM2: 35.1C (95.18 F), RAM3: 34.6 C (94.28 F), RAM4: 34.8 C (94.64 F)

In the meantime we've made several more attempts and we discovered that in about 1 of 50 attempts the DDR3 memory controller status register reads 0x40000024 which means we have a Read Data Eye Training timeout. If we encounter this situation the RAM is not working at all.
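The two STATUS values reported in this thread, 0x40000004 and 0x40000024, differ only in bit 5. A small sketch of how the register value could be checked in software follows. The mask and function names are hypothetical, and the assignment of bit 5 to the read data eye training timeout is taken from the observation above rather than from the register documentation, so it should be confirmed against the controller user guide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: bit 5 flags a read data eye training timeout. This is
 * inferred from the two values seen in this thread, not from the manual. */
#define STATUS_RD_EYE_TIMEOUT_MASK (1u << 5)

/* True if the assumed timeout flag is set in a STATUS register value. */
bool status_has_read_eye_timeout(uint32_t status)
{
    return (status & STATUS_RD_EYE_TIMEOUT_MASK) != 0u;
}

/* Bits that differ between a known-good STATUS value and an observed one. */
uint32_t status_changed_bits(uint32_t reference, uint32_t observed)
{
    return reference ^ observed;
}
```

Logging `status_changed_bits(0x40000004, observed)` after each initialization attempt would show at a glance whether anything other than the expected bits changed.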

This is strange because we always set bit 9 of DDR3_CONFIG_REG_23 to disable Read Data Eye training as described in Workaround #1 of the errata.

Are we doing something wrong here ?

Can you make suggestions how we should initialize the bits 0:7 of DDR3_CONFIG_REG_23 ?

Sincerely Peter

0 Tom Johnson 16214 over 12 years ago in reply to Peter Hoepfner

TI__Mastermind 46460 points

Peter,

You list the case temperatures when operated at nominal room temperature. Are you seeing failures at nominal room temperature?

Setting bit 9 of DDR3_CONFIG_REG_23 is the proper means to enable Partial Automatic Leveling which uses a fixed read pointer. When you set this bit, you must perform a read-modify-write where you only overwrite this one bit. The other bits including [7:0] must not be overwritten.
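The read-modify-write described above can be sketched as follows. The function and macro names are hypothetical; `reg` stands for a pointer to DDR3_CONFIG_REG_23, whose address is not given in this thread and is therefore left to the caller.

```c
#include <stdint.h>

#define CONFIG_REG_23_BIT9 (1u << 9)

/* Set bit 9 and leave every other bit, including [7:0], unchanged. */
void set_bit9_preserving_others(volatile uint32_t *reg)
{
    uint32_t value = *reg;        /* read the current contents   */
    value |= CONFIG_REG_23_BIT9;  /* modify only the one bit     */
    *reg = value;                 /* write the full word back    */
}
```

The point of the pattern is that the register is never written with a constant: whatever is already in the other bits is read first and written back unchanged.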

I have never heard of read data eye leveling time-out being reported. I am curious to know how you get this to happen.

Tom

0 Peter Hoepfner over 12 years ago in reply to Tom Johnson 16214

Intellectual 265 points

Hi Tom,

yes, we encounter the RAM problem at normal temperature as well but not so often. Even with the eval board and the values used there we have the problem. After initializing the RAM we run tests and a certain byte lane is failing. On our boards it's mostly byte lane 1, on the eval board is byte lane 0. We fill a 2 MByte portion of the RAM with different patterns and try to read the patterns back. Frequently we encounter up to 20 differences within this block.
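A minimal sketch of how a pattern test like the one described above might attribute mismatches to byte lanes on a 64-bit bus. The function names are hypothetical, and the sketch assumes that byte lane 0 carries the least significant byte of each 64-bit word (little-endian view).

```c
#include <stddef.h>
#include <stdint.h>

/* Compare one 64-bit word. Bit n of the result is set if byte lane n
 * differs, assuming lane 0 is the least significant byte. */
uint8_t failing_lanes(uint64_t expected, uint64_t actual)
{
    uint64_t diff = expected ^ actual;
    uint8_t mask = 0;

    for (unsigned lane = 0; lane < 8u; ++lane) {
        if ((diff >> (8u * lane)) & 0xFFu) {
            mask |= (uint8_t)(1u << lane);
        }
    }
    return mask;
}

/* Accumulate the failing-lane mask over a whole test buffer. */
uint8_t failing_lanes_in_buffer(const uint64_t *expected,
                                const uint64_t *actual,
                                size_t count)
{
    uint8_t mask = 0;

    for (size_t i = 0; i < count; ++i) {
        mask |= failing_lanes(expected[i], actual[i]);
    }
    return mask;
}
```

Applied to the dump in the first post, where the 32-bit word at offset 4 reads 0x00001000 instead of 0, the 64-bit word is 0x0000100000000000 and the resulting mask is 0x20, i.e. byte lane 5 under the stated assumption.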

If we use the original GEL file for the eval board (EVM667x.gel) we can see that there is a memory test built into it and if this test fails the RAM init is just started over again. This is exactly what we see on the EVM667x and on our board as well. Is this a known problem ? Are you suggesting to run the RAM and PLL init in a loop until it succeeds ?

The Read Data Eye Training Timeout in the DDR3 controller STATUS register occurs on one board only. We are still investigating this and I will keep you updated when we find out more about this.

Sincerely Peter

0 Tom Johnson 16214 over 12 years ago in reply to Peter Hoepfner

TI__Mastermind 46460 points

Peter,

The need for the retry loop is documented in the Known Issues document for the EVM. This was added to resolve an occasional failure in some TMX chips. We do not expect to see that failure in the latest TMS devices. Specifically, this was a leveling failure that resulted in a single byte lane not working correctly, which matches your latest data. However, this is completely different from the temperature-related failure you originally reported, which showed complete memory interface failure. Has this original problem been resolved?

Tom

0 Peter Hoepfner over 12 years ago in reply to Tom Johnson 16214

Intellectual 265 points

Hi Tom,

we are running TMX chips currently but will soon switch to TMS chips. Is there a possibility to check whether our chips are affected ?

We did see the problem originally with higher temperatures but the more we test and fix our software the more we are concerned about these startup issues.

One more question: The spreadsheet document says to put 0 length for unused lanes. Does this apply to the ECC lanes if no ECC is used ?

Sincerely Peter

0 Tom Johnson 16214 over 12 years ago in reply to Peter Hoepfner

TI__Mastermind 46460 points

Peter,

You did not answer the key question: is the observed single-lane leveling failure the same problem or a new problem? Did the original high-temp problem resulting in complete memory access failure get resolved?

Are you currently testing boards with C6672 devices or are they using TMX C6678 devices?

Yes, entering zero for unused byte lanes is correct. This includes the ECC byte lane.

Tom

0 Peter Hoepfner over 12 years ago in reply to Tom Johnson 16214

Intellectual 265 points

Hi Tom,

We are using a TMX320C6672ACYP #20-######## with four Micron MT41J128M16HA-15E devices. The CPU is running at 1000 MHz and the RAM is operating at 666.666 MHz (DDR3-1333). The SDRAM bus interface is 64 bits wide with one chip select.

The failures at high temperature were from a single byte lane i.e. lane 5 in the dump from my original post. But it is possible that the higher temperature was not the root cause. Because we first discovered and investigated this problem when we put real booting devices in the oven which did not start from JTAG we were possibly on a wrong track. However, I'm still sure that the problem is more visible at higher temperatures.

Sincerely Peter

0 Tom Johnson 16214 over 12 years ago in reply to Peter Hoepfner

TI__Mastermind 46460 points

Peter,

Thanks for the explanation. I misunderstood. I did not realize the original problem was byte-lane failure. I thought you were having complete DRAM access failure at high temperatures. Do you see the memory work robustly if you perform the retry loop seen in the EVM software? Does this also resolve the problem at higher temperatures?

Tom

0 Peter Hoepfner over 12 years ago in reply to Tom Johnson 16214

Intellectual 265 points

Hi Tom,

As far as we can tell the problem is gone with boot retries. So I suppose it was really a problem with our TMX processors.

We've added counters to our software to see the number of retries. At normal temperature we hardly ever see a retry. At 40 degrees Celsius (104 degrees Fahrenheit) we see, for the device with the most problems, up to 17 retries before we have a successful boot. But at least the device boots now.
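A minimal sketch of a retry loop with a retry counter, in the spirit of the approach described above. It is not the EVM GEL code or the poster's software: the function-pointer types and all names are hypothetical, and the `sim_*` routines are host-side stand-ins for the real DDR3 init and memory test.

```c
#include <stdbool.h>

typedef void (*ddr3_init_fn)(void);   /* runs the full DDR3 init sequence */
typedef bool (*ddr3_test_fn)(void);   /* returns true if the RAM test passes */

/* Run init, then the memory test, repeating until the test passes.
 * Returns the number of retries needed (0 = first attempt passed),
 * or -1 if the test still fails after max_attempts attempts. */
int ddr3_init_with_retry(ddr3_init_fn init, ddr3_test_fn test_passes,
                         int max_attempts)
{
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        init();
        if (test_passes()) {
            return attempt;
        }
    }
    return -1;
}

/* Host-side stand-ins: the simulated test passes once init has been
 * called sim_passes_on_call times. */
int sim_init_calls;
int sim_passes_on_call = 3;

void sim_init(void)
{
    ++sim_init_calls;
}

bool sim_test(void)
{
    return sim_init_calls >= sim_passes_on_call;
}
```

Bounding the number of attempts keeps a board with a hard fault from looping forever, and the returned count can be logged to track how often retries occur at each temperature.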

Thank you for help, our problem is solved with the retries. As soon as we receive the devices with the TMS processors I will run the tests again and post an update here.

Sincerely Peter

0 Peter Hoepfner over 12 years ago in reply to Peter Hoepfner

Intellectual 265 points

We now have our boards with the final TMS processors and the reboot is not required anymore. DDR3 RAM initialization works just great.

Thank you all again for helping us in this matter.

However, we encountered a new problem for which I have started a new thread under

http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/214381.aspx

Please feel free to help us again :-)

Sincerely Peter

0 Peter Hoepfner over 12 years ago in reply to Peter Hoepfner

Intellectual 265 points

Hi Tom,

I'm sorry to reopen this issue. We have now received more samples and are encountering the very same problem (RAM test fails on a specific byte lane) even with our latest TMS320C6672 processors, on about 5 % of our new devices. The error is that byte lanes 1-7 work without problems but byte lane 0 shows write errors: if we issue multiple reads we always read back the same value, but after a write we read back a value different from the one written.

Are you aware of symptoms like this ?

We appreciate any suggestions that help us track down and solve this problem.

Sincerely Peter

0 Tom Johnson 16214 over 12 years ago in reply to Peter Hoepfner

TI__Mastermind 46460 points

Peter,

Yes, we are working this issue internally as well. We have discovered that the WRLVL_INIT_RATIO values calculated by the PHY_CALC spreadsheet need to be adjusted. I should have a new version of it available on the web in a few weeks. For now, add 0x40 to the WRLVL_INIT_RATIO values for all of the byte lanes in use. Please let me know if this resolves your issue.
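The interim adjustment described above could be applied in software along these lines. This is a sketch only: the function name, the array layout and the `lanes_in_use` bitmask are hypothetical, and the code does not check the result against the width of the register field.

```c
#include <stdint.h>

#define WRLVL_INIT_RATIO_OFFSET 0x40u

/* Add 0x40 to the spreadsheet-derived WRLVL_INIT_RATIO value of every
 * byte lane that is in use. Bit n of lanes_in_use marks lane n as used;
 * lane_count must not exceed 32. Unused lanes are left untouched. */
void adjust_wrlvl_init_ratios(uint32_t *ratios, unsigned lane_count,
                              uint32_t lanes_in_use)
{
    for (unsigned lane = 0; lane < lane_count; ++lane) {
        if (lanes_in_use & (1u << lane)) {
            ratios[lane] += WRLVL_INIT_RATIO_OFFSET;
        }
    }
}
```

For a 64-bit interface without ECC, `lanes_in_use` would be 0xFF under this layout, leaving a ninth (ECC) entry at the zero value entered for unused lanes.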

Tom

0 Peter Hoepfner over 12 years ago in reply to Tom Johnson 16214

Intellectual 265 points

Hi Tom,

thank you very much, we entered the new values and the RAM is working without errors on the boards. We are now running tests with the other boards.

What are the chances that the new values will break something on other boards ?

Also, we have one board that shows RAM problems when Column Address 9 is high. In this case Data bits 0:15 are invalid.

Do you have any idea what could be the reason for this behaviour ?

Thank you so much

- Peter

0 Tom Johnson 16214 over 12 years ago in reply to Peter Hoepfner

TI__Mastermind 46460 points

Peter,

I would expect that to be either a part issue or an assembly issue.

Tom

0 Peter Hoepfner over 12 years ago in reply to Tom Johnson 16214

Intellectual 265 points

Tom,

yes, this is what we think, too. Just to be sure logical bit 0 is RAM bit 0 and so on ? Or is there some kind of mapping ?

- Peter

0 Tom Johnson 16214 over 12 years ago in reply to Peter Hoepfner

TI__Mastermind 46460 points

Peter,

I do not understand your question. If a column address bit is not being latched and everything else is good with the address lines, I assume a board / part issue with the address connections. What mapping are you referencing? The data bit numbering is explicit in both the C6671 Data Manual and in the DRAM Data Sheets.

Tom

0 Tom Johnson 16214 over 12 years ago in reply to Tom Johnson 16214

TI__Mastermind 46460 points

Peter,

The new KeyStone DDR3 Initialization Application Note and associated spreadsheets have been released. I was told they would be on the external web site today but I do not see them yet. They should appear in the next 24 hours. Let me know if you have any difficulty using these new tools and results.

Tom

0 Peter Hoepfner over 12 years ago in reply to Tom Johnson 16214

Intellectual 265 points

Tom,

thank you, both parts are online now. I will let you know about our results with the updated procedure and values.

Peter
