
c66x DDR3 mem intermittent corruption

Other Parts Discussed in Thread: SYSBIOS

TI Experts-

We are fighting an intermittent DDR3 memory corruption problem.  Using an array of structs with this format:

typedef struct _CHANINFO_CORE {

  /* ... many elements ... */

  HOST_TERM_INFO* term;
  struct _CHANINFO_CORE* link;

  /* ... many elements ... */

} CHANINFO_CORE;


typedef struct {

  uint32_t a : 8;
  uint32_t b : 8;
  uint32_t c : 16;

  uint32_t bitrate;

  struct ip_addr remote_ip;
  struct ip_addr local_ip;

  uint32_t remote_port : 16;
  uint32_t local_port : 16;

} HOST_TERM_INFO;


Code such as the following, which evaluates a pointer-to-a-pointer chain:

 test = ChanInfo_Core[n+DNUM*NCORECHAN].link->term->a;  /* NCORECHAN = 1024 */

 p = (uint8_t*)&ChanInfo_Core[n+DNUM*NCORECHAN].link->term;

where the struct location and the pointed-to locations are both in DDR3 memory, intermittently (it may take hours of run-time) gives a wrong value for test, or the upper two bytes of p as zero.  In previous hardware design situations, I've seen similar "2 byte" errors when DDR3 settings are not 100% correct, or when the memory can't quite support the full clock rate under all conditions.

We haven't yet tried separating the term and link elements by some number of bytes (maybe greater than the DDR3 burst length, or the cache line size ?), or removing optimization.  But at this point I'd like to ask: given this struct layout, where one struct has a pointer to another of the same type, which in turn points to another struct, is there any known compiler or silicon issue where the result of a DDR3 read should not immediately be used as the address for another DDR3 read ?

For 64-bit DDR3 width, do pointers within a struct need to be aligned to 8 bytes, and does each struct in the array need a size divisible by 8 to preserve pointer alignment for all entries in the array ?
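As a compile-time sketch of this alignment question (the struct contents below are placeholders mirroring the field names in the post, not the real layout):

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder mirror of the structs above; sizes are illustrative only */
typedef struct { uint32_t dummy[4]; } HOST_TERM_INFO;

typedef struct _CHANINFO_CORE {
    uint32_t payload[36];              /* stand-in for "many elements" */
    HOST_TERM_INFO *term;
    struct _CHANINFO_CORE *link;
} CHANINFO_CORE;

/* The compiler only requires natural pointer alignment (4 bytes on a
 * 32-bit C66x target); these checks make the layout assumptions explicit
 * so a changed field can't silently break the array stride. */
_Static_assert(offsetof(CHANINFO_CORE, term) % sizeof(void *) == 0,
               "term pointer is misaligned within the struct");
_Static_assert(sizeof(CHANINFO_CORE) % sizeof(void *) == 0,
               "array stride breaks pointer alignment for later entries");
```

Note that _Static_assert is C11; older CGT releases would need a manual compile-time assert macro instead.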

Thanks.


-Jeff
Signalogic

Notes on what we're using:

  -c6678 @ 1 GHz, 64-bit wide DDR3 (2 GB) @ 1333 MHz

  -CGT 7.4.2 and SYSBIOS 6.34.04.22

  -O2 opt level

  -size of CHANINFO_CORE is 152 bytes

  • Hi Jeff,

    I've notified the design team. Their feedback will be posted here.

    Best Regards,
    Yordan
  • Yordan-

    Can you provide an update ?  We still see a problem with this; one thing we've done is de-reference pointers inside Hwi disable/restore, for example:

    key = Hwi_disable();
    pnLinkTerm = ChanInfo_Core[n+DNUM*NCORECHAN].link->term;
    Hwi_restore(key);

    test = pnLinkTerm->a;

    and this seems to help -- suggesting a problem with CGT 7.4.2 when it generates "pointer to a pointer within a struct" dereference code at the -O2 level that can be pre-empted mid-sequence.  I can't prove it yet, but there is something to this; otherwise we would not see the improvement.
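    A minimal sketch of a related double-read guard that could help separate a code-generation issue from a true DDR3 read error (this is only an idea, not a TI-recommended fix; the helper name is ours):

```c
#include <stdint.h>

/* Read a pointer slot twice through a volatile lvalue and flag a mismatch.
 * If mismatches ever fire, the value returned by a DDR3 read is unstable,
 * pointing at memory rather than at compiler code generation. */
static void *read_ptr_checked(void *volatile *slot, int *mismatch)
{
    void *first  = *slot;    /* first read of the DDR3 location */
    void *second = *slot;    /* immediate re-read of the same location */
    *mismatch = (first != second);
    return second;
}
```

    Usage would be, for example, pnLinkTerm = read_ptr_checked((void *volatile *)&ChanInfo_Core[n].link->term, &bad);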

    -Jeff

  • Jeff, before we continue, I want to ask two questions:

    1. Do you see a pattern of the error, that is, the error always happens on a page boundary?

    2. Is your DDR dual-rank or single-rank memory?

    Thanks

    Ran

  • Ran-

    Thanks for your reply.  Sorry for the delay in getting back; we have to run the customer's system for several multi-hour tests to cause a failure that triggers one of our debug checks that can detect the problem and log information.  Many times we don't even get an exception; one core (out of 160) just stops.

    The board is an ATCA blade with 20 C6678 devices; the DDR3 mem on this board is single rank, with a 2 kB page size.  The errors continue to appear at random locations with the upper 2 bytes corrupted (often the upper 2 bytes are zero).  We have seen cases where just one 32-bit value is affected and other 32-bit values around the corrupted value look fine, and we're focusing on those.
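    A sketch of the kind of guard check that flags exactly this signature (the function name and the idea of comparing against a predictable marker value are ours):

```c
#include <stdint.h>

/* Returns 1 when a 32-bit word read back with its upper two bytes zeroed
 * while the lower two bytes survived -- the corruption signature we see.
 * 'expected' must be a value the code can predict, e.g. a marker the core
 * wrote earlier. */
static int upper_bytes_zeroed(uint32_t expected, uint32_t observed)
{
    return observed != expected &&
           (observed & 0xFFFF0000u) == 0u &&
           (expected & 0xFFFF0000u) != 0u &&
           (observed & 0x0000FFFFu) == (expected & 0x0000FFFFu);
}
```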

    One thing that is significant is that one chip on the board is more likely to experience the problem, although eventually others can show it.  We're not allowed by the customer to disable that chip, but we can change settings such as clock rate and timing.  At this point, the objective is simply to determine whether there is a hardware related component to the problem.

    We've tested boards with 32-bit and 64-bit DDR3 width and see the same behavior.  We reduced compiler level to -O1 and see the same behavior.

    One question is, what DDR3 settings might we experiment with ?  We can try reducing DDR3 clock rate, what else ?

    Thanks.

    -Jeff

  • Jeff

    My experience has been only with corrupt data on page boundaries, and those were mostly timing issues.  There are several registers that control the timing of the DDR, but I do not recall random errors like the ones you describe.

    Before I forward this to the hardware team, here are some ideas.  If you have already done what I suggest, tell me and I will get the hardware guys to look into it.

    1. Look at the Errata  http://www.ti.com/lit/er/sprz334h/sprz334h.pdf  and especially at User Note 14 about the speed of the DDR.  You may already be running the DDR at the right speed.  There are several advisories dealing with leveling of the memory (9, 26).  This may be why you see the errors as complete two bytes (and not random bits).  Personally I would guess it is a leveling issue, but I am not a hardware guy.

    2. In the DDR User Guide  www.ti.com/.../sprugv8e.pdf  chapter 3.2 describes the control registers for the DDR.  Read it and see if it helps.

    Look at the above and tell me whether it helps.

    Ran

  • Ran-


    Ok thanks again for your reply.

    Although I have done this before for c66x hardware designs, in this case we have to rely on the board manufacturer to take these steps.  Between them and the customer is our software, and we're obligated to come up with hard evidence that there are intermittent DDR3 mem access errors.  Our objective at this point is collecting this evidence; for example, if we reduce the DDR3 clock and this decreases the error rate, that's useful.  Likewise for any other DDR3 timing parameters we can adjust.

    If you could ask your hardware guys to give us a couple of things to try, that would be great.  I can be contacted privately (jbrower at signalogic dot com) to explain who is the end customer (and their customer, a major carrier) and who is the board manufacturer.  Thanks

    -Jeff

  • Ran-

    An update on this.  We managed to temporarily remove some application code and make room in MSMC SRAM for a few critical struct arrays that were in DDR3 mem.

    This clears up the "upper 2 byte" errors we were seeing.  We did not change any of the struct related application code or error checks (we had dozens of these in place) -- just moved the arrays -- so this is additional evidence for a DDR3 mem problem on the board.

    We haven't tried half-rate DDR3 clock yet; we are coordinating with the customer to get this in place.

    There are 32-bit wide / 2 chip and 64-bit wide / 4 chip versions of this board, so each DDR3 chip represents 2 bytes.  Based on the error signature -- the upper 2 bytes of a 32-bit location are intermittently (rarely) corrupted -- it looks to me like, regardless of whether it's a 64-bit or 32-bit wide board, there is one DDR3 chip that struggles from time to time.  Also it's worth noting that out of 20 c6678 devices on the board, there are four (4) that are more likely to show the problem.  In the last year the mfg began using newer 25 nm technology devices.  Can you ask your hardware experts if, besides the DDR3 clock, there are other parameters we might adjust ?

    Thanks.

    -Jeff

  • Jeff,

    The problem being described does sound like a DDR3 signal integrity issue or similar marginality.  I have a few questions to help diagnose the issue:

    1. You state that there are 20 subsystems on each board where each subsystem contains a C6678 DSP and its associated memory.  You then indicate that the failure occurs most often on a single subsystem but that it may occur on any of 4 subsystems.
      1. Do any of the other 16 subsystems ever fail?
      2. Can you change this behavior by operating the board at higher or lower temperatures?
      3. Is the PCB track routing for each subsystem identical, very similar or completely unique between the C6678 DSP and the SDRAMs?
      4. Are the SDRAMs on the same side of the PCB as the C6678?
      5. Are all SDRAMs within a single subsystem on the same side of the PCB?
      6. Have you tested multiple boards?
      7. Do they all fail at the same PCB locations?
      8. During execution, do all 20 subsystems run the exact same code?
      9. Are they synchronized?

    2. Have you attempted testing with a DDR3 stress test program or only application code?  A DDR3 stress test program must write large blocks to the SDRAMs and then read them back to internal memory using EDMA.  Writes and reads from the DSP cores will be too sparse.  Ran should be able to help provide example stress test code.
      1. Stress code should be run on all subsystems simultaneously.
      2. EDMA block writes and reads must be in multiples of 64 bytes and be aligned with 64 byte addressing.
      3. When a failure is detected, the program should be able to log the expected contents of the entire 64-byte burst written and the result when this block was read.
      4. We need to define which bit(s) are susceptible to corruption.
    3. The DDR3 initialization sequence must follow the steps defined in the KeyStone I DDR3 Initialization Application Report (SPRABL2E) available at: www.ti.com/lit/pdf/sprabl2.  Please provide populated PHY_CALC and REG_CALC spreadsheets for this initialization and verify that the initialization sequence is in compliance with SPRABL2.  (I see that the web link to the spreadsheets is not functional so they are attached to this post.)
    4. Assuming that the problem is corrupted bits due to signal integrity marginality on the PCB, changes to the DDR3 clock rate and/or changes to the termination settings will make the problem better or worse.  Once you have a known set of configuration spreadsheets, changes will be very straightforward.  (This is required since many register values change when the clock rate changes.)
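    A host-side sketch of the compare-and-log loop from the stress-test requirements in item 2; on the C6678 the block moves must be done with EDMA, so the memcpy below is only a placeholder, and all names are illustrative:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BURST 64u   /* 64-byte bursts, 64-byte aligned, per the requirements */

/* Stand-ins for a DDR3 test region and an internal (L2) read-back buffer */
static _Alignas(64) uint8_t ddr_block[4 * BURST];
static _Alignas(64) uint8_t l2_block[4 * BURST];

static unsigned stress_pass(uint32_t seed)
{
    unsigned errors = 0;

    /* Fill the DDR3 region with a seed-dependent pattern */
    for (size_t i = 0; i < sizeof ddr_block; i++)
        ddr_block[i] = (uint8_t)(seed + i * 2654435761u);

    /* Read back to internal memory -- real code would use an EDMA transfer */
    memcpy(l2_block, ddr_block, sizeof ddr_block);

    /* Compare burst by burst; on mismatch, log the whole 64-byte burst,
     * expected vs read, so susceptible bits can be identified */
    for (size_t off = 0; off < sizeof ddr_block; off += BURST)
        if (memcmp(ddr_block + off, l2_block + off, BURST) != 0) {
            errors++;
            for (size_t i = 0; i < BURST; i++)
                fprintf(stderr, "%02x/%02x ",
                        ddr_block[off + i], l2_block[off + i]);
            fputc('\n', stderr);
        }
    return errors;
}
```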

    Tom

    1134.DDR3 Initialization sprabl2.zip

  • Tom-

    Thanks for your detailed reply.  Below are some initial answers.  Before continuing, an update -- running DDR3 at 800 MHz (0.6 rate) seems to be working, at least for a series of 24 hr tests run over the last few days.  That's a small breakthrough for us, although we need to run 7 days straight to be certain.  If that holds up, then the focus will shift to why we can't run at 1333 MHz.


    >> 1. Do any of the other 16 subsystems ever fail?

    Yes, but it's far less likely.  After fighting with this thing for more than 2 months, we've seen other c66x subsystems fail maybe a half-dozen times.

    >> 2. Can you change this behavior by operating the board at higher or lower temperatures?

    The customer tried to affect the problem by removing adjacent blades and inserting large fans in the chassis.  This may have shown a slight reduction in error rate, but it was very hard to be sure without weeks of testing.  We didn't continue on that path since any effect appeared to be marginal.

    >> 3. Is the PCB track routing for each subsystem identical, very similar or completely unique between the C6678 DSP and the SDRAMs?

    Yes.

    >> 4. Are the SDRAMs on the same side of the PCB as the C6678?

    No.

    >> 5. Are all SDRAMs within a single subsystem on the same side of the PCB?

    For the 32-bit version of the board, one chip is on the component side and one on the solder side; for the 64-bit version it's 2 and 2.

    >> 6. Have you tested multiple boards?

    Yes, and all have the issue.  There is one much older board that does not seem to show the issue in our lab, but has not been tested in customer systems.

    >> 7. Do they all fail at the same PCB locations?

    Yes that tends to be the case.

    >> 8. During execution, do all 20 subsystems run the exact same code?

    Yes.

    >> 9. Are they synchronized?

    Yes, and I believe this to be important.  In the last few weeks we added dozens of defensive guards and checks designed to "catch errors early" before they could morph into something that mimics a software problem (e.g. HeapMem failure, malformed PA packet, opcode exception, etc).  In this process we noticed that adding some simple "markers" to arrays of structs, spaced apart by large powers of 2, was useful.  For example we might have an array of structs:

      ChanInfo_Core[n + DNUM*NCHAN].marker

    where NCHAN is 1024 and the size of the struct is, say, 312 bytes.  When cores happened to run the exact same code at the exact same time, with one core writing zeros to its marker and another core writing all 1's, we could achieve a higher error rate.  It was almost like a RowHammer situation, where rapidly accessing alternate rows of internal DDR3 memory can make the problem worse, although I'm sure that in our case it's just a basic DDR3 timing issue and nothing that exotic.
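    The marker-hammering pattern described above can be sketched as follows; the 312-byte struct layout is a placeholder, and in the real system each core would pass its own DNUM and pattern:

```c
#include <stdint.h>
#include <stddef.h>

#define NCHAN 1024u   /* structs per core, a large power of 2, per the post */

/* Placeholder 312-byte struct with a trailing marker field */
typedef struct { uint8_t pad[308]; uint32_t marker; } CHAN;

/* Each core repeatedly writes its pattern (e.g. all zeros on one core,
 * all ones on another) to markers spaced sizeof(CHAN) bytes apart,
 * producing a regular, synchronized rhythm of DDR3 row accesses. */
static void hammer_markers(CHAN *chan, unsigned dnum, uint32_t pattern,
                           unsigned iters)
{
    for (unsigned it = 0; it < iters; it++)
        for (unsigned n = 0; n < NCHAN; n++)
            chan[n + dnum * NCHAN].marker = pattern;
}
```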

    >> Have you attempted testing with a DDR3 stress test program or only application code?

    Unfortunately the board manufacturer's stress test program is "not sufficiently stressful" to show the problem.  Although this is now clear to everyone, 2 months ago the customer was not inclined to believe it was possible, and we ended up in a situation where, as the software vendor, we had to rely on our application code, as that was the only thing allowed to run in order to comply with the customer's Capacity Test (CT) setup.  This situation should give some idea of why we decided to integrate memory diagnostics into our run-time application, such that these diagnostics co-exist with functional application code.


    >> A DDR3 stress test program must ...

    Thanks for this list of stress test program requirements.  We will be discussing these in detail with the board manufacturer, and encouraging them to substantially improve their stress test program.

    >> Please provide populated PHY_CALC and REG_CALC spreadsheets for this initialization ...

    We will make sure the board mfg does this.

    >> Assuming that the problem is corrupted bits due to signal integrity marginality on the
    >> PCB, changes to the DDR3 clock rate and/or changes to the termination settings will
    >> make the problem better or worse.

    As noted above, reducing the DDR3 clock rate may prove to be a temporary work-around.  Thanks for your advice on termination settings -- after the board mfg improves their stress test program and can independently reproduce the problem, then they can vary these settings and determine if there is a more optimal configuration.

    -Jeff

  • Jeff,

    Thanks for the update.  2-sided SDRAM layouts are very difficult to do.  I have seen some customers have success with this when they use great care in the layouts.  However, I have seen many 2-sided SDRAM layouts that had problems.  Also, it is very difficult to meet the length matching requirements, trace impedance requirements, and track spacing requirements in a 2-sided SDRAM layout.  PDN also becomes much more challenging in a 2-sided SDRAM layout.

    Tom

  • Tom-

    An update: the board mfg is working on updated track length (termination) settings.

     I have a couple of brief follow-up questions:

    1) Since we see errors only on the upper 2 bytes of DDR3 mem access, should only termination settings specific to that chip be modified ?

    2) The Micron datasheet specs 800, 1066, and 1333 MHz clock rates.  Is there any physical limit to running at half rate (667 MHz) or slower, for test purposes ?

    Thanks.

    -Jeff

  • Jeff,

    1.  The termination controls are not adjustable to each byte lane.  Settings are for the entire interface.

    2.  We do not recommend dropping the rate to 666MT/s.  This lower rate is not well supported.

    You mentioned success with running at 800MT/s.  Did you recalculate all of the PHY_CALC and REG_CALC values for this lower rate or simply lower the clock rate?  Changing the clock rate but not adjusting all of the PHY_CALC and REG_CALC values for this lower rate will not necessarily improve performance.

    Have you tried changing the SDRAM_DRIVE setting in the REG_CALC worksheet?

    The DDRSLRATE0/1 pins on the C6678 can also be adjusted to change the drive strength from the SOC.  This may also make a difference to the signal integrity.

    Tom

  • Tom-

    Thanks for your reply.

    > 1.  The termination controls are not adjustable to each byte lane.  Settings are for the entire interface.

    Ok.

    > 2.  We do not recommend dropping the rate to 666MT/s.  This lower rate is not well supported.

    Ok.

    > You mentioned success with running at 800MT/s.  Did you recalculate all of the PHY_CALC and REG_CALC values
    > for this lower rate or simply lower the clock rate?  Changing the clock rate but not adjusting all of the PHY_CALC and
    > REG_CALC values for this lower rate will not necessarily improve performance.

    No we just reduced the PLL multiplier.  The board mfg is about to provide an init.out file that adjusts the PHY_CALC and REG_CALC values for 800 MHz.

    > Have you tried changing the SDRAM_DRIVE setting in the REG_CALC worksheet?

    Not yet.  We have asked the board mfg to do this.

    > The DDRSLRATE0/1 pins on the C6678 can also be adjusted to change the drive strength from the SOC.
    >  This may also make a difference to the signal integrity.

    Ok, got it.  Thanks again.

    -Jeff