AM5K2E04: RAM read errors

Muhand Alkhouri

Part Number: AM5K2E04

Hello,

We observe RAM read errors on some of our products. The nature of the read errors is explained in the following attached files:

KS2_DDR_Debug_Spreadsheet_v1_01_errata_applied_error_occurs.xlsx (Debug results after applying KeyStoneII.BTS_errata_advisory.35 in case of error)

KS2_DDR_Debug_Spreadsheet_v1_01_errata_applied_no_error.xlsx (Debug results after applying KeyStoneII.BTS_errata_advisory.35 in case of no error)

KS2_DDR_Debug_Spreadsheet_v1_01_without_errata_error_occurs.xlsx (Debug results before applying KeyStoneII.BTS_errata_advisory.35 in case of error)

KS2_DDR_Debug_Spreadsheet_v1_01_without_errata_no_error.xlsx (Debug results before applying KeyStoneII.BTS_errata_advisory.35 in case of no error)

We noticed that KeyStoneII.BTS_errata_advisory.35 had not been applied in our products, but unfortunately applying it does not solve our problem.

We hope that you can help us solve this problem.

best regards

Muhand Alkhouri

over 2 years ago

0 Shankari G over 2 years ago

TI__Mastermind 19805 points

Dear Customer,

Good day!

You wrote: "We observe RAM read errors on some of our products. The nature of the read errors is explained in the following attached files:"

Would you please let us know how you test the RAM reads? Are you using a GEL file run through CCS?

If yes, would you please provide details about the GEL file used?

Regards

Shankari G

0 Muhand Alkhouri over 2 years ago in reply to Shankari G

Prodigy 10 points

Hello,

Thank you for the answer. Please see the attachment for information about the GEL file used.

keystone2-ddr3-debug-tools.zip

best regards

0 Rajarajan Uthayakumar over 2 years ago in reply to Muhand Alkhouri

TI__Expert 4314 points

Hi Muhand,

The issue is due to the GEL file. While loading the GEL file, the EVM does not connect and goes into a non-responsive state. I am debugging the GEL file and will provide my suggestions ASAP.

Thanks,

Rajarajan

0 Rajarajan Uthayakumar over 2 years ago in reply to Rajarajan Uthayakumar

TI__Expert 4314 points

Hi Muhand,

I am able to run the project on my EVM using the attached GEL file:

2577.evmk2e.gel

The GEL file that you shared did not allow the board to connect to CCS (using the OnTargetConnect() function in the GEL file), since it wrongly points to the C6657 (KeyStone I) core. Please refer to the image below.

I have also attached the document on "GEL file initialization":

spraa74a.pdf

Thanks,

Rajarajan

0 Gerald Ruescher over 2 years ago in reply to Rajarajan Uthayakumar

Intellectual 300 points

Hello Shankari, Rajarajan,

My name is Gerald Ruescher. I'm a colleague of Muhand, working in the software department (Muhand is the hardware engineer in charge). Let me explain what we did:

Our product has been in production and shipping since 2019. We do not use an EVM and, more importantly, we do *not* use CCS / Code Composer Studio for a number of reasons.

Our product uses UBoot to boot up (and later Linux as its OS). To debug our RAM problems I did the following:

1. I took the GEL file keystone2-ddr3-debug-tools\Keystone2_DDR_Debug_v1_4.gel. This is part of the TI DDR debug notes. Code in a GEL file is very similar to C code, so I took the routines in that GEL file, re-formatted them into C code (which is very easy) and integrated everything into our UBoot bootloader. As a result, our UBoot is now able to produce the debug information from the GEL file.

2. I took the RAM test implemented in keystone2-ddr3-debug-tools\DDR3_EDMA_TEST and also copied it into our UBoot. Now our UBoot is able to run a RAM test.

With these two things done, we ran RAM tests from within UBoot on our product. For those boards where the test failed, we collected the DDR debug information, also from inside UBoot. All the information is collected in the Excel sheets provided by Muhand
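To illustrate the kind of check such a RAM test performs, here is a minimal sketch of a write/read-back pattern test. It is not the TI DDR3_EDMA_TEST code; the function names `test_pattern` and `ram_pattern_test` and the mixing constant are assumptions made for this example, and the real test uses EDMA transfers rather than a plain CPU loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pattern generator: derives a 64-bit test word from the word
 * index so that neighbouring words differ in many bit positions. */
uint64_t test_pattern(size_t i, uint64_t seed)
{
	uint64_t x = seed ^ ((uint64_t)i * 0x9E3779B97F4A7C15ULL);

	return x ^ (x >> 29);
}

/* Writes the pattern to the whole buffer first, then reads everything back
 * and returns the number of 64-bit words that did not match. */
size_t ram_pattern_test(volatile uint64_t *buf, size_t words, uint64_t seed)
{
	size_t i, errors = 0;

	for (i = 0; i < words; i++)
		buf[i] = test_pattern(i, seed);

	for (i = 0; i < words; i++)
		if (buf[i] != test_pattern(i, seed))
			errors++;

	return errors;
}
```

On a bootloader, `buf` would point at the DDR region under test; running the test with several different seeds exercises each bit with both 0 and 1.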

If you look at the Excel files, you'll see that we already have some pretty detailed error information. We can see that we have read errors on selected data bits, and we see that in the error case exactly one of the debug registers shows a significantly different value compared to the non-error case.
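The per-bit error information can be gathered by comparing each expected 64-bit word with the word actually read. The following is a minimal sketch of such a classification, not the code used to produce the spreadsheets. The struct and function names are hypothetical, and the mapping "lane 0 = bits 7:0, lane 7 = bits 63:56" is an assumption about the numbering scheme.

```c
#include <stdint.h>

/* Error counters per 8-bit lane of a 64-bit word (hypothetical layout). */
struct lane_errors {
	unsigned one_to_zero[8];	/* expected 1, read 0 */
	unsigned zero_to_one[8];	/* expected 0, read 1 */
};

/* XORs expected and actual data and records, for every differing bit,
 * which byte lane it belongs to and in which direction it flipped. */
void classify_bit_errors(uint64_t expected, uint64_t actual,
			 struct lane_errors *st)
{
	uint64_t diff = expected ^ actual;
	int bit;

	for (bit = 0; bit < 64; bit++) {
		uint64_t mask = 1ULL << bit;

		if (!(diff & mask))
			continue;
		if (expected & mask)
			st->one_to_zero[bit / 8]++;
		else
			st->zero_to_one[bit / 8]++;
	}
}
```

Accumulating these counters over a full test run shows at a glance whether the failures are confined to one byte lane and whether they are mostly 1 => 0 flips.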

Here's the summary from our report once more:

• The errors occur always in the same 8-bit block of a 64-bit word.
Simply speaking: 56 bits are stable, 8 bits are unstable
• The errors occur in the data bits 0:7 (or is it 56:63? We're not sure about the numbering scheme)
• We mostly see single-bit flips 1 => 0, i.e. a bit which is supposed to be 1 is read as 0
• The errors occur over the entire address range
• The errors occur both with burst reads over DMA and single reads by the CPU core
• Activating ECC hides the errors, i.e. the single bit flips are corrected by the ECC controller
• It takes a number of reboots to reproduce the error situation (see below)
• Important: After a reboot the board is either:
    o Completely stable (i.e. the errors do not occur at all) or
    o Unstable (i.e. the errors occur again and again)
• After running the RAM debug routine Report_Leveling_Values_DDDR3A() we see different values for DX7LCDLR2:
    o When the board is stable we see:
        DX7LCDLR2: 0x53
        [7:0] (Rank 0 RL Delay): 83
        [15:8] (Rank 1 RL Delay): 0
    o When the board is unstable we see:
        DX7LCDLR2: 0x1a
        [7:0] (Rank 0 RL Delay): 26
        [15:8] (Rank 1 RL Delay): 0
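For reference, the two fields shown in the report can be extracted from the raw register value as in the sketch below. The field positions are taken from the report above ([7:0] = Rank 0 RL Delay, [15:8] = Rank 1 RL Delay); the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Decoded view of the two DX7LCDLR2 fields listed in the report. */
struct lcdlr2_fields {
	uint8_t rank0_rl_delay;	/* bits [7:0] */
	uint8_t rank1_rl_delay;	/* bits [15:8] */
};

/* Splits a raw DX7LCDLR2 value into its two read-leveling delay fields. */
struct lcdlr2_fields decode_lcdlr2(uint32_t reg)
{
	struct lcdlr2_fields f;

	f.rank0_rl_delay = (uint8_t)(reg & 0xFF);
	f.rank1_rl_delay = (uint8_t)((reg >> 8) & 0xFF);
	return f;
}
```

With this, the raw value 0x53 decodes to a Rank 0 delay of 83 and 0x1a to a Rank 0 delay of 26, matching the decimal values printed by the debug routine.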

We were hoping to get a hint as to what the values in DX7LCDLR2 mean and how they could relate to the data bit errors.

Cheers,

Gerald

0 Shankari G over 2 years ago in reply to Gerald Ruescher

TI__Mastermind 19805 points

Gerald,

Thanks for your detailed and clear information.

You mentioned that these RAM read errors occur only on some of your products.

By any chance, have you swapped the DDR3 DRAM part (MT41K256M16TW-107AAT:P) of a working product with the DDR3 DRAM part of a non-working product and tested again?

In any case, let me also include the hardware team from our side to have a closer look at this issue.

Thanks for your patience.

Regards

Shankari G

0 Gerald Ruescher over 2 years ago in reply to Shankari G

Intellectual 300 points

Hardware-wise, all product specimens are identical in the sense that they are built the same way. Luckily, the vast majority of specimens behave correctly. Only a few display the errors and these do not display the errors all the time.

For example, I have one board right next to me which works nicely as long as it is cool (CPU temp < 20 degC). Once the board has reached operating temperature (around 40 degC), it sometimes becomes unstable after a reboot. I do the following:

* Reboot board, check DX7LCDLR2.
* If DX7LCDLR2 is in "normal" range, board works as intended
* If DX7LCDLR2 is not in "normal" range, board will sooner or later display RAM errors

I'm not a DDR expert by any means, but to me it looks like the DDR controller is doing some kind of link or PHY training on startup. It then adjusts internal parameters depending on the result of that training. Sometimes, on some boards, the training seems to produce parameters which lead to an unstable DDR connection.

Important: I've implemented a workaround where I check DX7LCDLR2 on startup. If its value is outside the expected range, I just re-initialize the DDR PHY and check again. This "fixes" the problem in the sense that the board is stable afterwards, but we feel that this is not a real solution. Sometimes it takes 4, 5 or more attempts to get a working config.

Here's the code snippet from UBoot (modified source file board/ti/ddr3_k2e.c):

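/*
 * Returns 1 if the Rank 0 RL delay in DX7LCDLR2 is in the range seen on
 * stable boards (around 83), 0 if the PHY should be retrained.
 */
int ds_ddr3_phy_init_ok(void)
{
	/* Raw value of the DX7LCDLR2 debug register */
	u32 regDX7LCDLR2 = *(volatile u32 *)(0x023293A8);

	/* Bits [7:0] hold the Rank 0 RL delay; 75 is our empirical threshold */
	if ((regDX7LCDLR2 & 0xFF) >= 75)
		return 1;

	printf("WARNING:\n");
	printf("    DX7LCDLR2:     ;0x%08x \n", regDX7LCDLR2);
	printf("        [7:0] (Rank 0 RL Delay):        ;%u \n", (regDX7LCDLR2 & 0x000000FF));
	printf("Retraining PHY\n");
	return 0;
}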


u32 ddr3_init(void)
{
	struct ddr3_spd_cb    spd_cb;
	u32                   retries;
	u32                   run_time;
	struct pll_init_data *pll_settings;

	if (ddr3_get_dimm_params_from_spd(&spd_cb)) {
		printf("Sorry, I don't know how to configure DDR3A.\n"
		       "Bye :(\n");
		for (;;)
			;
	}

	printf("Detected SO-DIMM [%s]\n", spd_cb.dimm_name);
	printf("DDR3 speed %d\n", spd_cb.ddrspdclock);

	if (spd_cb.ddrspdclock == 1600)
		pll_settings = &ddr3_400;
	else
		pll_settings = &ddr3_333;

	run_time = DsRtosAbstractionTime_getTicks32();
	retries = 0;
	while (1) {
		init_pll(pll_settings);

		/* Reset DDR3 PHY after PLL enabled */
		ddr3_reset_ddrphy();

		spd_cb.phy_cfg.zq0cr1 |= 0x10000;
		spd_cb.phy_cfg.zq1cr1 |= 0x10000;
		spd_cb.phy_cfg.zq2cr1 |= 0x10000;

		ddr3_init_ddrphy(KS2_DDR3A_DDRPHYC, &spd_cb.phy_cfg);

		if (ds_ddr3_phy_init_ok())
			break;

		if (++retries >= 255)
			break;
	}
	run_time = DsRtosAbstractionTime_getTicks32() - run_time;

	printf("DRAM: PLL settings: multiplier=%u, divider=%u, output_divider=%u\n",
	       pll_settings->pll_m, pll_settings->pll_d, pll_settings->pll_od);
	if (retries > 0) {
		printf("#################################################################\n");
		printf("DRAM: PHY init done after %u retries (%u ms)\n", retries, run_time / 233333);
		printf("#################################################################\n");
	}
	ddr3_init_ddremif(KS2_DDR3A_EMIF_CTRL_BASE, &spd_cb.emif_cfg);

	printf("DRAM: %d GiB\n", spd_cb.ddr_size_gbyte);

	return (u32)spd_cb.ddr_size_gbyte;
}
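
Note that the loop above leaves after 255 retries without reporting that no acceptable value was found. The sketch below shows one possible way to make that case explicit. It is a host-side illustration, not part of the UBoot sources: `train_until_ok`, the callback types and the `sim_*` stand-ins for the real PHY are all hypothetical names.

```c
#include <stdint.h>

typedef void (*phy_train_fn)(void);
typedef uint32_t (*reg_read_fn)(void);

/*
 * Trains the PHY up to max_attempts times. Returns the number of
 * retrainings that were needed (0 = first attempt was fine), or -1 if no
 * attempt produced a Rank 0 RL delay of at least min_rl_delay.
 */
int train_until_ok(phy_train_fn train, reg_read_fn read_lcdlr2,
		   uint32_t min_rl_delay, int max_attempts)
{
	int attempt;

	for (attempt = 0; attempt < max_attempts; attempt++) {
		train();
		if ((read_lcdlr2() & 0xFF) >= min_rl_delay)
			return attempt;
	}
	return -1;
}

/* Stand-ins for the real hardware, for illustration only: the simulated
 * register reads 0x1a after the first two trainings and 0x53 afterwards. */
static int sim_trainings;

void sim_train(void)
{
	sim_trainings++;
}

uint32_t sim_read_lcdlr2(void)
{
	return sim_trainings >= 3 ? 0x53 : 0x1a;
}
```

A caller can then treat a return value of -1 as a hard failure (for example, by refusing to boot) instead of continuing with a configuration that is known to be suspicious.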

0 Shankari G over 2 years ago in reply to Gerald Ruescher

TI__Mastermind 19805 points

Gerald,

I will let you know once I hear back from the hardware team.

Thanks for your patience.

Regards

Shankari

0 Shankari G over 2 years ago in reply to Shankari G

TI__Mastermind 19805 points

Gerald,

I have not heard from the internal team so far.

I will keep you posted.

Regards

Shankari G

0 Karthik Ramanan over 2 years ago in reply to Shankari G

TI__Guru** 113800 points

Unlocking the thread based on Gerald's new post: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1067048/am5k2e04-am5k2e04. Let's continue the discussion here.

0 kcastille over 2 years ago in reply to Karthik Ramanan

TI__Mastermind 46522 points

Gerald and Muhand,

Can you summarize:

1) What is the variation of the DX7LCDLR2 value within the "normal" range?

2) How many systems have you produced/shipped in production? How many systems show this problem?

Thanks,

Kyle

0 Gerald Ruescher over 2 years ago in reply to kcastille

Intellectual 300 points

Stable operation:
DX7LCDLR2: Around 83

Unstable operation:
DX7LCDLR2: Around 26
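
To put numbers on "around", the register could be logged on every boot and the readings summarized per group. The following is a minimal sketch under that assumption; `summarize_rl_delays`, `struct delay_range` and the idea of splitting at a threshold (such as the 75 used in our workaround) are illustrative choices, not something measured so far.

```c
#include <stddef.h>
#include <stdint.h>

/* Count and observed min/max of the Rank 0 RL delay within one group. */
struct delay_range {
	unsigned count;
	uint8_t min;
	uint8_t max;
};

/*
 * Splits logged DX7LCDLR2 values at the given threshold and records, for
 * each group, how many readings fell into it and their min/max delay.
 */
void summarize_rl_delays(const uint32_t *regs, size_t n, uint8_t threshold,
			 struct delay_range *at_or_above,
			 struct delay_range *below)
{
	size_t i;

	at_or_above->count = below->count = 0;
	at_or_above->min = at_or_above->max = 0;
	below->min = below->max = 0;

	for (i = 0; i < n; i++) {
		uint8_t d = (uint8_t)(regs[i] & 0xFF);
		struct delay_range *r = (d >= threshold) ? at_or_above : below;

		if (r->count == 0 || d < r->min)
			r->min = d;
		if (r->count == 0 || d > r->max)
			r->max = d;
		r->count++;
	}
}
```

The resulting min/max per group would answer the question about the variation within the "normal" range more precisely than a single typical value.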

The product is relatively new and we've produced around 1,000 systems. Since the error occurs only sporadically, we have no precise statistics yet. In the last few months, I've seen about one to two dozen specimens which show the faulty behaviour from time to time.

0 Gerald Ruescher over 2 years ago in reply to Gerald Ruescher

Intellectual 300 points

Any news?

Frankly, I'd really like to see this issue resolved. I understand that you cannot diagnose possible hardware problems on our side, but it would at least be helpful to get an idea of what the values of that debug register mean. Having boards which show unreliable RAM behavior is a major quality risk.

0 kcastille over 2 years ago in reply to Gerald Ruescher

TI__Mastermind 46522 points

Gerald, Can you try to directly use the stable value instead of implementing the training sequence?

Thanks,

Kyle

0 Gerald Ruescher over 2 years ago in reply to kcastille

Intellectual 300 points

We already have a workaround in place: I check DX7LCDLR2 on startup. If its value is outside the expected range, I just re-train the DDR PHY and check again. This usually "fixes" the problem after 2-5 re-trainings.

Is setting DX7LCDLR2 manually really an option? And if so, what would be the preferred way:

1. Do not use training at all and instead, set ALL parameters by hand?

2. Use training but set only DX7LCDLR2 manually?

3. Our approach: Re-train until a "correct" value of DX7LCDLR2 shows up
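
For option 2, the value to write could be composed as in the sketch below, which replaces only the Rank 0 RL delay field and keeps all other bits of the register. This is just the bit manipulation; whether the PHY accepts a manual write to DX7LCDLR2 after training, and whether further steps are needed for it to take effect, is exactly the open question. The function name is hypothetical.

```c
#include <stdint.h>

/* Returns reg with bits [7:0] (Rank 0 RL delay) replaced by delay and all
 * other bits left unchanged. */
uint32_t lcdlr2_set_rank0_rl_delay(uint32_t reg, uint8_t delay)
{
	return (reg & ~(uint32_t)0xFF) | delay;
}
```

For example, applying a delay of 83 to a register that reads 0x1a yields 0x53, the value we see on stable boards.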

0 kcastille over 2 years ago in reply to Gerald Ruescher

TI__Mastermind 46522 points

Gerald,

Your workaround (#3) seems reasonable.

Regards,

Kyle
