TMS320C6678: DDR3 Read access problem

Part Number: TMS320C6678

Hi,

My customer, who is using the C6678, is experiencing DDR3 read access failures during pre-shipment testing.
The test program includes patterns such as a "walking test" and a "marching test".
A single test run performs hundreds of millions of write/read accesses and takes about 20 hours to complete.
The problem occurs mostly in the "walking test". When it occurs,
a "1" is written but a "0" is read back; when the same address is then read repeatedly, "1" is read correctly,
so we think it is a memory read issue.
The problem occurs once at room temperature, and more than 300 times
when the temperature is raised to about 50°C.

Two C6678 devices (DSP#1, DSP#2) are mounted on the customer's board, and each DSP is connected to four 16-bit DDR3 devices (64 bits in total).
The issue occurred only on a single bit of byte lane #6 or #7 of DSP#1.
This board has been mass-produced for several years, and the DDR3 layout follows
TI's design guide. Since this problem has not occurred until now, we believe there is no problem with the layout design.
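
To make the failure signature above concrete, here is a minimal host-side sketch of how a test log entry could be decoded. It is only an illustration: the function names are hypothetical, and the mapping "byte lane n carries data bits 8n to 8n+7" is an assumption that should be checked against the board schematic.

```python
def failing_bits(expected, actual, width=64):
    """Return the bit positions (0 = LSB) where the two words differ."""
    diff = (expected ^ actual) & ((1 << width) - 1)
    return [bit for bit in range(width) if (diff >> bit) & 1]


def byte_lane(bit):
    """Map a data bit to a byte lane, ASSUMING lane n carries bits 8n..8n+7."""
    return bit // 8


def classify(expected, first_read, rereads):
    """Label a mismatch by whether later reads of the same address recover."""
    if first_read == expected:
        return "pass"
    if rereads and all(value == expected for value in rereads):
        return "transient read error"  # stored data is good, one read was bad
    return "persistent error"          # bad write or stuck data


# Walking-ones word with bit 53 set, read back once with that bit cleared.
written = 1 << 53
bits = failing_bits(written, 0)
print(bits, [byte_lane(b) for b in bits])
print(classify(written, 0, [written, written]))
```

The re-read classification is only meaningful if the repeated reads actually reach the DDR3 device rather than being served from cache.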

For now, they are using partial leveling of DDR3.
Because of the issue above, the customer is considering trying full leveling.
With that in mind, they have the following questions.

Q1. The customer wants to know the details of each leveling register parameter
defined in the spreadsheet "DDR3 PHY Calc v11", such as DATAx_WRLVL_INIT_RATIO, DATAx_GTLVL_INIT_RATIO,
RD_DQS_SLAVE_RATIO, WR_DQS_SLAVE_RATIO, WR_DATA_SLAVE_RATIO, and FIFO_WE_SLAVE_RATIO.
They want to know what kind of process uses these parameters.

Q2. The customer is trying to adjust the above parameters after the spreadsheet has calculated them.
They want to know which parameters to adjust for a problem like this.

Q3.Is there a register that allows me to check how the initial value of each register calculated
in the spreadsheet has changed after leveling?

Q4. We would like to know the configuration flow for full leveling.
Is the configuration flow for full leveling the same as for partial leveling?
In full leveling, is the only difference that the fixed value (0x200) recommended for partial leveling
is not written to DDR3_CONFIG_REG23?

Q5. To use full leveling, I guess incremental leveling is needed after full leveling because
of Workaround 3 of errata Advisory 9.
Page 16 of KeyStone I DDR3 Initialization (sprabl2e) says the following:
********************************************************************************
Example 25. Incremental Leveling After Full Automatic Leveling
RDWR_LVL_RMP_WIN = 0x00000502;
RDWR_LVL_RMP_CTRL = 0x80030300;
RDWR_LVL_CTRL = 0x7F090900;
********************************************************************************
Does this setting implement Workaround 3 of Advisory 9?

best regards,
g.f.

  • g.f.,

    Thanks for your detailed explanation.

    Let me look into this and get back to you as soon as possible.

    Regards

    Shankari G

  • g.f,

    During the pre-shipment testing, how many units showed this issue, and out of how many units tested?

    Yes, their understanding of the leveling adjustments is right.

    The following is a collection of the DDR3 user and application guides for the C6678, for their future reference. Ignore this if they already have them all.

    ==================================================================================================

    1. Design requirements - https://www.ti.com/lit/an/sprabi1d/sprabi1d.pdf

    2. KeyStone I Interface Bring-Up - https://www.ti.com/lit/an/spracl8/spracl8.pdf

    3. KeyStone I DDR3 Initialization - https://www.ti.com/lit/an/sprabl2e/sprabl2e.pdf

    4. Keystone Architecture DDR3 Memory Controller -  https://www.ti.com/lit/ug/sprugv8e/sprugv8e.pdf 

    ====

    Let us see the answers to the questions one by one.

    > Q1. The customer wants to know the details of each leveling register parameter
    > defined in the spreadsheet "DDR3 PHY Calc v11", such as DATAx_WRLVL_INIT_RATIO, DATAx_GTLVL_INIT_RATIO,
    > RD_DQS_SLAVE_RATIO, WR_DQS_SLAVE_RATIO, WR_DATA_SLAVE_RATIO, and FIFO_WE_SLAVE_RATIO.
    > They want to know what kind of process uses these parameters.

    The details of each parameter are given in the KeyStone Architecture DDR3 Memory Controller user guide: https://www.ti.com/lit/ug/sprugv8e/sprugv8e.pdf

    For example, DATAx_WRLVL_INIT_RATIO from DDR3 PHY Calc v11 (address 40Ch) is listed on page 53 of that guide, in Table 4-2 "DDR3 PHY Leveling Registers"; from there, follow the link to Section 4.33.

    > Q2. The customer is trying to adjust the above parameters after the spreadsheet has calculated them.
    > They want to know which parameters to adjust for a problem like this.

    Leveling should be tried, as it compensates for the skew on both reads and writes.

    > Q3. Is there a register that allows me to check how the initial value of each register calculated
    > in the spreadsheet has changed after leveling?

    There is an "Explore register" option in CCS, which can be used to dump all the registers.

    The registers can be dumped at regular intervals and the dumps compared to identify any change in values.

    However, there is no dedicated register that tracks the change in values.

    > Q4. We would like to know about configuration flow to use full leveling.
    > Is configuration flow of full leveling same as partial leveling?
    > In full leveling, is it just that there is no setting to enter the fixed value(0x200) to DDR3_CONFIG_REG23
    > that was recommended in the partial leveling?

    Here you go... 

    Page no : 36 in https://www.ti.com/lit/ug/sprugv8e/sprugv8e.pdf

    2.13.4 Programming Full Leveling

    ----------------------------------------------

    Leveling (both full and incremental) is executed separately for each byte lane (clock-DQS pair). Therefore,
    the leveling process converges separately for each byte lane. To ensure that leveling converges correctly,
    the DDR3 controller must be given an initial set of values to use during leveling. The leveling process uses
    these initial values to arrive at a set of converged values for each byte lane. These initial values should be
    plugged into a set of memory-mapped registers in the Boot configuration section.


    The user should note that the DATAx registers in steps 2 and 3 below map to specific byte lanes as
    follows (note the difference between C665x and other device variants). The mapping is consistent for write
    leveling and gate leveling initial values, i.e. DATAn_PHY_WRLVL_INIT_RATIO and
    DATAn_PHY_GATELVL_INIT_RATIO map to the same byte lane. DATAm_PHY_WRLVL_INIT_RATIO and
    DATAm_PHY_GATELVL_INIT_RATIO map to the same byte lane.


    Table 2-16. DATAx register to byte lane mapping - Page no : 36 in https://www.ti.com/lit/ug/sprugv8e/sprugv8e.pdf


    The steps to program full leveling are as follows:
    ==================================================

    1. Unlock the Boot configuration module by writing 0x83E70B13 to the KICK0 and 0x95A4F1E0 to the
    KICK1 registers.
    2. Program the write leveling initial values into the DATA0_PHY_WRLVL_INIT_RATIO to
    DATA8_PHY_WRLVL_INIT_RATIO fields of the DDR3_CONFIG_2 to DDR3_CONFIG_10 registers
    respectively. (See the tables in sections Section 4.33 to Section 4.41).
    NOTE: The values to enter into the registers depend on the board topology and the DDR3 clock
    frequency in use. The DDR3 clock frequency (half the data rate) and trace lengths for each
    byte lane (CK-DQS pair) should be plugged in the appropriate fields in the accompanying
    PHY calculation spreadsheet which generates the values to be programmed into the boot
    config registers mentioned above.
    3. Program the gate leveling initial values into the DATA0_PHY_GATELVL_RATIO to
    DATA8_PHY_GATELVL_RATIO fields of the DDR3_CONFIG_14 to DDR3_CONFIG_22 registers
    respectively. (See the tables in sections Section 4.43 to Section 4.51.)
    NOTE: The values to enter into the registers depend on the board topology and the DDR3 clock
    frequency in use. The DDR3 clock frequency (half the data rate) and trace lengths for each
    byte lane (CK-DQS pair) should be plugged in the appropriate fields in the accompanying
    PHY calculation spreadsheet which generates the values to be programmed into the boot
    config registers mentioned above.
    4. Program CMD_PHY_DLL_LOCK_DIFF field in DDR3_CONFIG_0 register to 0xF.
    (See Section 4.31.)
    5. Enable global leveling (Set RDWRLVL_EN = 1 in RDWR_LVL_RMP_CTRL).
    6. Trigger full leveling (Set RDWRLVLFULL_START = 1 in RDWR_LVL_CTRL).
    7. Read back any of the DDR3 controller registers.
    This ensures full leveling is complete because this step is executed only after full leveling completes.
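
    The seven steps above can be sketched as an ordered list of register operations. This is only a host-side illustration of the ordering, not TI's GEL or driver code. The names in the plan are the ones used in the steps; the bit positions (bit 31 for RDWRLVL_EN and RDWRLVLFULL_START, bits [3:0] for CMD_PHY_DLL_LOCK_DIFF), the whole-register writes of the init ratios, and the placeholder ratio values are assumptions that must be checked against the register tables in the user guide.

```python
KICK0_KEY = 0x83E70B13
KICK1_KEY = 0x95A4F1E0
BIT31 = 1 << 31  # ASSUMPTION: RDWRLVL_EN and RDWRLVLFULL_START both sit at bit 31


def full_leveling_plan(wrlvl_init, gatelvl_init):
    """Return ordered (operation, register, value) tuples for full leveling.

    wrlvl_init / gatelvl_init: nine spreadsheet values each, for DATA0..DATA8.
    """
    if len(wrlvl_init) != 9 or len(gatelvl_init) != 9:
        raise ValueError("expected nine init ratios (DATA0..DATA8)")
    # Step 1: unlock the Boot configuration module.
    plan = [("write", "KICK0", KICK0_KEY), ("write", "KICK1", KICK1_KEY)]
    # Step 2: write leveling init ratios go to DDR3_CONFIG_2 .. DDR3_CONFIG_10.
    plan += [("write", f"DDR3_CONFIG_{2 + i}", v) for i, v in enumerate(wrlvl_init)]
    # Step 3: gate leveling init ratios go to DDR3_CONFIG_14 .. DDR3_CONFIG_22.
    plan += [("write", f"DDR3_CONFIG_{14 + i}", v) for i, v in enumerate(gatelvl_init)]
    # Step 4: CMD_PHY_DLL_LOCK_DIFF = 0xF (ASSUMPTION: field is bits [3:0]).
    plan.append(("set_bits", "DDR3_CONFIG_0", 0xF))
    # Step 5: enable global leveling.
    plan.append(("set_bits", "RDWR_LVL_RMP_CTRL", BIT31))
    # Step 6: trigger full leveling.
    plan.append(("set_bits", "RDWR_LVL_CTRL", BIT31))
    # Step 7: read back any controller register (this one is an arbitrary choice).
    plan.append(("read", "RDWR_LVL_CTRL", None))
    return plan


# Placeholder ratios for illustration only; real values come from the spreadsheet.
for step in full_leveling_plan([0x20] * 9, [0x80] * 9):
    print(step)
```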

    > Q5. To use full leveling, I guess incremental leveling are needed after full leveling because
    > of errata advisory 9 workaround 3.
    > In Keystone I DDR3 Initialization(sprabl2e) page.16, it said as follow:
    > ********************************************************************************
    > Example 25. Incremental Leveling After Full Automatic Leveling
    > RDWR_LVL_RMP_WIN = 0x00000502;
    > RDWR_LVL_RMP_CTRL = 0x80030300;
    > RDWR_LVL_CTRL = 0x7F090900;
    > ********************************************************************************
    > Does this setting mean workaround 3 of advisory 9 ?

    The answer to this question will be confirmed by Kyle or his team. I have internally sent an email to them. They will get back to you.

    Regards

    Shankari G

  • Hi Shankari,

    Thank you for the reply.

    I understand that the spreadsheet parameters correspond to the DDR3 PHY leveling registers.
    But I still do not understand what kind of process uses these parameters.
    Section 1.2, "Automatic Leveling Initialization", on page 2 of the "KeyStone I DDR3 Initialization" document
    says the following:
    ******************************************************************************************************
    The DDR3 PHY Calc spreadsheet is provided to help users calculate the initial values and to translate
    them into the proper units. The inputs are the routed clock and data strobe lengths. The result values are
    the initial values in units of DLL taps, of which there are 256 per clock period. In addition, because the
    initial leveling algorithm adapts only in the positive direction, the initial values are offset 128 DLL steps in
    the negative direction.
    ******************************************************************************************************

    From the above information, I thought that each byte lane of the C6678 DDR3 controller has a DLL to adjust the delay,
    that the values calculated by the spreadsheet are used as the initial values of the DLL,
    and that after automatic leveling starts, the DLL adjusts the delay starting from the initial value.
    Is that correct?

    best regards,
    g.f.

  • g.f.,

    ******************************************************************************************************
    The DDR3 PHY Calc spreadsheet is provided to help users calculate the initial values and to translate
    them into the proper units. The inputs are the routed clock and data strobe lengths. The result values are
    the initial values in units of DLL taps, of which there are 256 per clock period. In addition, because the
    initial leveling algorithm adapts only in the positive direction, the initial values are offset 128 DLL steps in
    the negative direction.
    ******************************************************************************************************

    When I open the spreadsheet "DDR3 PHY Calc V11.xlsx", the yellow cells are the ones the user can edit.

    Register values such as DATA8_WRLVL_INIT_RATIO are not editable. They are calculated automatically from the values the user enters in the yellow cells, following the input instructions on sheet 2, "Instructions to determine the initial values for levelling".

    1. When the full cycle ratio is 256, the spreadsheet automatically calculates the DATA8_WRLVL_INIT_RATIO value as 128.

    2. It is best to enter the values according to the 9 instructions given on the instruction sheet of the spreadsheet.

    > From the above information, I thought that each byte lane of the C6678 DDR3 controller has a DLL to adjust the delay,
    > that the values calculated by the spreadsheet are used as the initial values of the DLL,
    > and that after automatic leveling starts, the DLL adjusts the delay based on the initial value.
    > Is it correct?

    Yes, the values calculated by the spreadsheet are used as the initial values of the DLL.

    The value derived for DATA8_WRLVL_INIT_RATIO is 128 (and is not directly editable by the user), which matches the explanation in Section 1.2, "Automatic Leveling Initialization", on page 2 of the "KeyStone I DDR3 Initialization" document.

    Moreover, clicking a cell displays the formula behind it. For example, the delay of DQS_ECC is calculated as (D24+C24/1.1)*INCH_DEL,

    where D24 is the stripline length (inches)

    and C24 is the microstrip length.
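
    As a rough illustration of the two facts quoted above (the cell formula and the 256 DLL taps per clock period), the conversion from trace lengths to DLL taps could be sketched as follows. The function names, the INCH_DEL value of 180 ps/inch and the 800 MHz clock are placeholders chosen for the example, not values taken from the spreadsheet. How the spreadsheet then combines the clock and DQS delays and applies the 128-tap offset is not shown in this thread, so that step is deliberately left out.

```python
TAPS_PER_CLOCK = 256  # from the application note: 256 DLL taps per clock period


def trace_delay_ps(microstrip_in, stripline_in, inch_del_ps):
    """Cell formula quoted above: (stripline + microstrip / 1.1) * INCH_DEL."""
    return (stripline_in + microstrip_in / 1.1) * inch_del_ps


def delay_to_taps(delay_ps, ddr_clock_mhz):
    """Express a delay in DLL taps of one DDR3 clock period."""
    period_ps = 1e6 / ddr_clock_mhz
    return delay_ps / period_ps * TAPS_PER_CLOCK


# Hypothetical inputs: 1.1 in microstrip, 1.0 in stripline, 180 ps/in, 800 MHz.
delay = trace_delay_ps(microstrip_in=1.1, stripline_in=1.0, inch_del_ps=180.0)
print(f"delay = {delay:.1f} ps, {delay_to_taps(delay, 800.0):.1f} taps")
```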

    Regards

    Shankari G

  • Hi Shankari,

    Thank you for the reply and sorry for the delay. I was taking winter vacation.

    I understood.

    By the way, I have not received the answer to Q5 yet.
    Is there any response from the team yet?
    >> Q5.To use full leveling, I guess incremental leveling are needed after full leveling because
    >> of errata advisory 9 workaround 3.
    >> In Keystone I DDR3 Initialization(sprabl2e) page.16, it said as follow:
    >> ********************************************************************************
    >> Example 25. Incremental Leveling After Full Automatic Leveling
    >> RDWR_LVL_RMP_WIN = 0x00000502;
    >> RDWR_LVL_RMP_CTRL = 0x80030300;
    >> RDWR_LVL_CTRL = 0x7F090900;
    >> ********************************************************************************
    >> Does this setting mean workaround 3 of advisory 9 ?

    >The answer to this question will be confirmed by Kyle or his team. I have internally sent an email to them. They will get back to you.

    Our customer also has one more question:
    Has there ever been a case in the past where only a few of the thousands of manufactured boards
    had DDR3 access errors like this due to leveling?

    best regards,
    g.f.

  • g.f,

    For Q5, the proposed solutions seem to be combined. I have already notified Kyle by email. Let me ping him again.

    > And our customer have one more following question:
    > Has there ever been a case in the past where only a few of the thousands of manufactured boards
    > had access errors like this to DDR3 due to leveling?

    This question will also be answered by Kyle or Mukul.

    Please give me a day or two.

    Regards

    Shankari G

  • Hi Shankari,

    Thank you for the reply.
    Okay, I will wait for the answer.

    best regards,
    g.f.

  • Hi Shankari,

    I'm sorry for posting again.
    I have a question about the answer to Q2.
    You said that leveling should be tried. Does that mean that adjusting the parameters
    calculated by the spreadsheet is not recommended?

    best regards,
    g.f.

  • g.f.,

    Yes. You can try adjusting the parameters and observe whether the result improves or worsens in your case.

    Adjusting the parameters will help in arriving at conclusive values, based on whether the number of failures increases or decreases.

    Regards

    Shankari G 

  • Hi Shankari,

    Thank you for the reply.

    As I told you before, the failure this time seems to be in memory reads.
    In this case, which parameter should the customer specifically try to adjust?

    The failure occurs on byte lane 6 or byte lane 7.
    So, should they adjust only the gate training init ratio (DATA0_PHY_GATELVL_INIT_RATIO, DATA1_PHY_GATELVL_INIT_RATIO),
    or both the write leveling init ratio (DATA0_PHY_WRLVL_INIT_RATIO, DATA1_PHY_WRLVL_INIT_RATIO) and
    the gate training init ratio (DATA0_PHY_GATELVL_INIT_RATIO, DATA1_PHY_GATELVL_INIT_RATIO)?
    And if any other parameters should be adjusted, could you tell us which ones?

    By the way, are there any updates from the team yet?

    best regards,
    g.f.

  • g.f.,

    > The failure occur at the byte lane 6 or byte lane 7.
    > So, should day only adjust the gate training init ratio(DATA0_PHY_GATELVL_INIT_RATIO, DATA1_PHY_GATELVL_INIT_RATIO)
    > or both write leveling init ratio(DATA0_PHY_WRLVL_INIT_RATIO, DATA1_PHY_WRLVL_INIT_RATIO and
    > gate training init ratio(DATA0_PHY_GATELVL_INIT_RATIO, DATA1_PHY_GATELVL_INIT_RATIO)?
    > And if there are any other parameter should be adjust, could you tell us which parameter should be adjust?

    This question will also be answered by the hardware team.

    Regards

    Shankari G

  • Hi Shankari,

    I have additional questions about Full Automatic Leveling.

    Q6.
    Full Automatic Leveling on the C6678 is affected by an erratum (Advisory 9), and
    I understand that I need to implement Workaround 3 of Advisory 9, which uses incremental leveling.
    I previously asked whether Example 25, "Incremental Leveling After Full Automatic Leveling", in
    "KeyStone I DDR3 Initialization" is Workaround 3 or not. That application note says that
    "Incremental leveling of at least the read eye sample point must be executed at least 64 times after full automatic leveling
    to converge it to an initial optimum value." But I do not quite understand how to execute it at least 64 times after full leveling.
    Does it mean that after executing Example 25, we need to add a delay to wait for
    the incremental leveling to execute 64 times? If yes, could you provide sample code for this?

    Q7.
    I understand that the DDR3 PHY calculation spreadsheet provides the DATAn_WRLVL_INIT_RATIO and DATAn_GTLVL_INIT_RATIO values
    automatically. What algorithm does the spreadsheet use to calculate the DATAn_WRLVL_INIT_RATIO and
    DATAn_GTLVL_INIT_RATIO values from the microstrip (cell C) and stripline (cell D) lengths and the delay (cell E)?
    The spreadsheet contains no explanation, so could you please explain it with calculation formulas, diagrams, etc.?

    Q8.
    Also, what are the minimum and maximum values that can be set for DATAn_WRLVL_INIT_RATIO and DATAn_GTLVL_INIT_RATIO?
    I am asking because our customer is trying to solve the read access failure by adjusting the value of the appropriate ratio register.

    best regards,
    g.f.

  • g.f.,

    Sure, let me pass these valid questions on to the team so that they can answer in detail.

    Regards

    Shankari G

  • Hello g.f,

    You mention .... "The frequency of occurrence of the problem is once at room temperature, and more than 300 times
    when the temperature is raised to about 50°C."  ... 

    You also mention the fail is only on DSP1.

    Q: How many systems are you building/testing in this latest batch?  How many systems are passing vs failing?

    Thanks,

    Kyle

  • Hi Kyle

    Thank you for the reply and sorry for the delay.

    About 200 units have been manufactured so far, and each board has two C6678 devices.
    This problem has occurred in 4 DSPs.

    >You also mention the fail is only on DSP1.
    Sorry, my explanation was not sufficient.
    If the issue occurs with the C6678 on the DSP1 side and the C6678 devices are then swapped between DSP1 and DSP2,
    the issue occurs on the DSP2 side.

    The customer is waiting for answers to questions 5-8. When can we expect them?

    best regards,
    g.f.

  • Hi GF,

    When you say "DSP1" and "DSP2", you are referring to 2x separate C6678 devices, correct?

    The observation that the issue moves from DSP1 to DSP2 after swapping the devices seems to point towards one of the following:

    1. An issue with the DDR memory component on the board
    2. An issue with the manufacturing of the board (poor electrical connection of a trace, etc.)
    3. An issue with the board design (layout / routing) between the DSP1 location and associated DDR memory

    Can the customer also try swapping the DDR memories between DSP1 and DSP2 to see if the issue follows the DDR memory? 

    Thanks,
    Kevin

  • Hi Kevin,

    Thank you for the reply.

    >When you say "DSP1" and "DSP2", you are referring to 2x separate C6678 devices, correct?
    Yes, you are correct.

    Actually, memory swapping has already been done,
    and it seems that the problem occurs only with a specific DSP regardless of memory swapping.
    However, the DSP on which the issue occurs has already been analyzed with the ATE test, and it passed.

    So our customer wants to try full leveling and manually adjust the parameters from the calc spreadsheet
    to check whether the issue is resolved.
    That is why they are asking Q5-Q8.

    best regards,
    g.f.

  • Hi GF,

    > If the issue occured at C6678 of DSP1 side and after that swapping the C6678 between DSP1 and DSP2,
    > then the issue will occur at DSP2 side.
    > it seem that the problem occurs only with a specific DSP regardless of memory swapping.

    Ok, I may have had the wrong understanding previously. So when the customer swapped DSP1 and DSP2, the issue followed DSP1 (which is now located where DSP2 was previously on the board). Is that correct?

    Regards,
    Kevin

  • Hi GF,

    For Q7, why is this information needed? If the customer believes that the spreadsheet is incorrect, can the customer not just manually adjust DATA*_WRLVL_INIT_RATIO and DATA*_GTLVL_INIT_RATIO?

    Regards,
    Kevin

  • Hi Kevin,

    Thank you for the reply.

    >Ok, I may have had the wrong understanding previously.
    >So when the customer swapped DSP1 and DSP2, the issue followed DSP1
    >(which is now located where DSP2 was previously on the board). Is that correct?

    Yes, that is correct.

    >For Q7, why is this information needed? If the customer believes that the spreadsheet is incorrect,
    >can the customer not just manually adjust DATA*_WRLVL_INIT_RATIO and DATA*_GTLVL_INIT_RATIO?

    The customer is not suspicious of the spreadsheet, but wants to try adjusting the register to see if it improves.
    But they have no idea how to adjust the values, so they think it would be helpful to see how the spreadsheet derives the register values.

    best regards,
    g.f.

  • Hi GF,

    In your original post, you mentioned: 

    > when "1" is written, "0" is read, and when the same address is read continuously, "1" is read correctly,
    > so we think it is a memory read issue.
    > The issue occured only at 1bit of byte lane #6 or #7 of DSP#1.

    A couple of comments. 

    1) The INIT_RATIOS should be used as a starting point for the write leveling and gate training algorithms respectively. Small variations in these settings likely have no impact on the final result, and may only add or reduce the time it takes for training to complete. Larger variations could result in a cycle slip relative to the READ / WRITE commands. As the final result is what could impact setup / hold timings and cause a '1' to appear as a '0', I am not sure that adjusting the starting values will resolve the issue described.  

    2) Based on the description, the failure occurs during a READ and occurs on a single bit. Thus, write leveling parameters are likely not connected to the issue and adjusting them should be low priority. Gate training (related to *_GTLVL_INIT_RATIO) is generally used to align the write enable signal in the center of the DQS read pre-amble. Thus, generally marginalities in the gate training would impact the entire byte lane and not a single bit.

    3) Based on #1 and #2 above, I would suggest adjusting the *_RD_DQS_SLAVE_RATIO values (DDR3_CONFIG_52 through DDR3_CONFIG_59) as higher priority. Can you read out these registers and let us know what the value is?

    Additionally, I would like to know whether the customer is using cache when running the test that shows the issue. If the customer is continuously reading the same address location and cache is enabled, is it possible that subsequent reads are not actually fetching new data from the DDR memory?

    Thanks,
    Kevin

  • Also, it would be good to know how voltage and frequency impact the issue / failure rate.

    Can the customer try reducing DDR frequency, as well as try increasing CVDD and CVDD1?

  • Hi Kevin,

    Thank you for the comments and I'm sorry for the delay.

    I would like to explain your comments to my customer,
    but I have the following question about your comment #3.

    Aren't the DDR3_CONFIG_52 through DDR3_CONFIG_60 registers for the Expanded Fixed Ratio Register DSPs?
    The C6678 is a Combined Fixed Ratio Registers DSP, so I thought these registers do not apply to the C6678. Is that right?

    >Additionally, I would like to know whether the customer is using cache when running
    >the test which illustrates the issue. If the customer is continuously reading the same address location
    >and cache is enabled, then is it possible that sub-sequent reads are not actually fetching new data from the DDR memory?

    >Also, it would be good to know how voltage and frequency impact the issue / failure rate.
    >Can the customer try reducing DDR frequency, as well as try increasing CVDD and CVDD1?

    I will ask my customer, so please wait for a while.

    By the way, have there been previous cases where problems like this one were caused by leveling?
    Also, could you please give us an answer to Q6?

    >Q6.
    >Full Automatic Leveling on the C6678 is affected by an erratum (Advisory 9), and
    >I understand that I need to implement Workaround 3 of Advisory 9, which uses incremental leveling.
    >I previously asked whether Example 25, "Incremental Leveling After Full Automatic Leveling", in
    >"KeyStone I DDR3 Initialization" is Workaround 3 or not. That application note says that
    >"Incremental leveling of at least the read eye sample point must be executed at least 64 times after full automatic leveling
    >to converge it to an initial optimum value." But I do not quite understand how to execute it at least 64 times after full leveling.
    >Does it mean that after executing Example 25, we need to add a delay to wait for
    >the incremental leveling to execute 64 times? If yes, could you provide sample code for this?

    best regards,
    g.f.

  • Hi g.f.,

    > Isn't DDR3_CONFIG_52 through DDR3_CONFIG_60 registers for Expanded Fixed Ratio Register DSPs?
    > C6678 is Combined Fixed Ratio Registers DSPs, so that I thought above registers is not for C6678. Isn't it?

    That may be true - I'll have to check to see who could confirm. The DDR3 memory should be sending the data edge aligned with the strobe, and the controller / PHY is then responsible for centering the strobe's rising / falling edges in the middle of the data eye to latch a '0' or '1'. From my experience with TI's controllers / PHYs on other (similar) processors, the read DQS slave ratio controls this timing relationship between data and strobe. I do see from section 6.3.1.6 (Routing Rules - Data Lines) of the DDR3 Design Requirements application note that the skew between data group nets for a given byte lane is expected to be +/- 10 mils. Therefore, it is possible that the controller / PHY of C6678 applies a hard offset delay that is not configurable through a memory mapped register. Unfortunately, I do not know this detail off hand and will have to check internally. If that is the case, then it seems unlikely that there is any "training" parameter that would resolve this issue. Maybe IO settings (drive strength / termination) could be tweaked.

    However, it is still not clear whether the issue is occurring on the DDR interface (between the DDR memory and SOC) or occurring internally of the SOC.

    https://www.ti.com/lit/pdf/sprabi1 

    Regards,
    Kevin

  • Hi g.f,

    After reading the errata you pointed to (Advisory 9), maybe the only control of the data-to-strobe timing is in CONFIG_23, and maybe it is applied to all data lanes. In that case, can you please have the customer try adjusting the lower 8 bits ([7:0]) of CONFIG_23 from the default value of 0x34 to slightly larger and smaller values, to see whether the failure occurs more or less frequently? Note that I only assume 0x34 is the default value because that is what the errata claims. I would have to check with our design team to confirm; however, the easiest solution would be to read this directly from silicon (which the customer could do).

    Regarding Workaround #3, yes, it sounds like you would just have to implement a delay (such as a timer) before accessing DDR. My understanding is that the incremental leveling prescaler and interval fields would be used to determine the delay required. However, I think we should first understand whether adjusting the read DQS slave ratio has any impact on the failure at all. If there is no impact, then I do not think it makes sense to try to apply Workaround #3.
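
    To illustrate how the prescaler and interval fields could translate into a wait time, here is a rough host-side sketch applied to the Example 25 values. It rests on several assumptions that need to be confirmed against the DDR3 controller user guide: the field layout (bit 31 enable/start, bits [30:24] prescaler, bits [23:16] read eye interval), the prescaler and ramp window being programmed as "value minus one", all counts being in refresh periods, and a refresh period of 7.8 us. The function names are hypothetical.

```python
T_REFI_US = 7.8  # ASSUMPTION: refresh period; the real value is set in the controller


def decode_lvl_ctrl(value):
    """Split RDWR_LVL_CTRL or RDWR_LVL_RMP_CTRL into fields (assumed layout)."""
    return {
        "enable_or_start": (value >> 31) & 0x1,
        "prescaler": (value >> 24) & 0x7F,
        "read_eye_interval": (value >> 16) & 0xFF,
        "gate_interval": (value >> 8) & 0xFF,
        "write_interval": value & 0xFF,
    }


def read_eye_interval_refreshes(value):
    """Refresh periods between read eye leveling events; None if disabled."""
    fields = decode_lvl_ctrl(value)
    if fields["read_eye_interval"] == 0:
        return None
    # ASSUMPTION: the prescaler field holds (prescale - 1).
    return (fields["prescaler"] + 1) * fields["read_eye_interval"]


def ramp_window_refreshes(rmp_win):
    """Length of the ramp window in refresh periods (assumed minus-one coding)."""
    return rmp_win + 1


ramp = read_eye_interval_refreshes(0x80030300)    # RDWR_LVL_RMP_CTRL
steady = read_eye_interval_refreshes(0x7F090900)  # RDWR_LVL_CTRL
window = ramp_window_refreshes(0x00000502)        # RDWR_LVL_RMP_WIN

print("read eye events inside ramp window:", window // ramp)
print("ramp window length (ms):", window * T_REFI_US / 1000)
print("64 events at steady rate (ms):", 64 * steady * T_REFI_US / 1000)
```

    Under these assumptions, the ramp window alone would cover well over 64 read eye leveling events in roughly 10 ms, which is the kind of figure a timer-based delay could be sized against.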

    Regards,
    Kevin

  • Hi Kevin,

    Thank you for the comments, and I'm sorry for the delay again.
    I will tell my customer to try the settings as you advised.

    >Unfortunately, I do not know this detail off hand and will have to check internally.
    >If that is the case, then it seems unlikely that there is any "training" parameter that would resolve this issue.
    Could you please check internally and let us know.

    And one more thing to ask:
    Are there any past cases of problems like this one caused by leveling?

    best regards,
    g.f.

  • Hi g.f.,

    > Are there any past cases of problems like this one caused by leveling?
    > The issue occured only at 1bit of byte lane #6 or #7 of DSP#1.
    > when "1" is written, "0" is read, and when the same address is read continuously, "1" is read correctly,
    > This board has been mass-produced for several years, and the DDR3 peripheral layout has been done according to
    > TI's design guide, and since this problem has not occurred so far, we believe there is no problem with the layout design.
    •  As a general statement, data corruption can occur during a read or write if delays are not set appropriately.

    However, per the information provided by you and the customer on this thread, the error appears on a single bit and appears to occur while reading data. Based on this information, write delays should not be affecting the failure, and the error description does not seem consistent with what I would expect if the gate delay were not set appropriately.

    • A single bit read failure could occur if the read DQS slave delay is not optimal.

    However, using the default read DQS slave delay value (as described in Workaround #1 of Advisory #9 in SPRZ334H) is an appropriate / supported configuration. I am not aware of any information that would contradict this Workaround. This is also consistent with your statement that the board has been mass-produced for several years without issue.

    Regards,
    Kevin

  • Hi Kevin,

    Thank you for your comment.
    I have shared the details with our customer.

    They are now increasing and decreasing the CONFIG_23 bit[7:0] value that you mentioned the other day to see
    whether the failure occurs more or less frequently.
    They would like to ask whether you have any guidance on the adjustment range for the CONFIG_23 register.
    If there is none in particular, they will try about 10 points (e.g., -32, -16, -8, -4, +4, +8, +16, +32, etc.) within a range of ±0x20.
    If they find a value that seems to improve the results, they intend to explore the surrounding values in detail.
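The coarse-then-fine search described above can be sketched as follows. This is only an illustration: it assumes the offsets are applied around the default value 0x34 quoted in this thread and that the field is a plain 8-bit value (bit[7:0]); the helper names are hypothetical and nothing here writes to hardware.

```python
DEFAULT_RATIO = 0x34  # default read DQS slave ratio quoted in this thread
COARSE_OFFSETS = (-32, -16, -8, -4, 4, 8, 16, 32)  # offsets proposed in the post


def coarse_points(center=DEFAULT_RATIO, offsets=COARSE_OFFSETS, lo=0x00, hi=0xFF):
    """Return the sorted candidate ratios around `center`.

    Values outside the assumed 8-bit field range [lo, hi] are dropped.
    """
    points = {center}
    for off in offsets:
        value = center + off
        if lo <= value <= hi:
            points.add(value)
    return sorted(points)


def fine_points(best, span=4, lo=0x00, hi=0xFF):
    """Return every ratio within +/- `span` of a promising coarse value."""
    return [v for v in range(best - span, best + span + 1) if lo <= v <= hi]


if __name__ == "__main__":
    print([hex(v) for v in coarse_points()])
    print([hex(v) for v in fine_points(0x1C)])
```

Each value in the list would then be written to the slave ratio field and the memory test rerun; the span of the fine pass is arbitrary here and can be widened if the coarse pass shows a broad passing region.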

    best regards,
    g.f.

  • Hi g.f.,

    Based on experience with TI processors containing similar DDR controller / PHY, I would expect an appropriate range to be within 0x10 to 0x50. It may be possible that the customer finds a different passing range though. Please let us know whether changing this parameter has any impact. 

    In addition, there were a few other open questions: whether or not cache is enabled, and whether changing the DDR frequency or increasing CVDD and CVDD1 has any impact. Please let us know if there is any update on those items.

    Thanks,
    Kevin

  • Hi Kevin,

    Thank you for your comments.

    I will share the information as soon as we get the results from the customer.
    So, please wait for a little while.

    best regards,
    g.f.

  • Hi Kevin,

    I'm sorry for the delay.
    The customer has completed the test of adjusting the slave ratio in DDR3_CONFIG_23 (bit[7:0]),
    so I will share the test results along with the answers to your previous questions.

    1.Adjusting Slave Ratio
    They used partial leveling and wrote a simple test program (it runs for about 2 hours).
    They tested four boards (boards #1 to #4, each with two C6678s).
    With the slave ratio set to "0x1C", both the problematic and the normal devices passed the test.
    Please take a look at the attached file (C6678_DDR3_Test result and waveform.pdf) for the results.

    C6678_DDR3_Test result and waveform.pdf

    2.Is cache enabled or disabled?
    The C6678 cache remains disabled.

    3.Do frequency and voltage affect this phenomenon?
    They changed the DDR3 clock frequency from 1333 MHz to 1066 MHz on three boards.
    On two boards, the faulty DSPs did not result in an error after the frequency change, and on one board, the error continued to occur.
    Please take a look at the attached file(C6678_DDR3_Test result and waveform.pdf) for the result.
    They have not verified the voltage change due to concerns about damage to the board by changing the voltage.
    However, they said that if verification of the voltage change is mandatory, they will verify.

    The customer has a few questions, as follows:

    Q1.
    The customer captured the DQS/data waveform at the time the error occurred and confirmed
    that the waveform was normal. This leads them to believe that the problem is on the C6678 side.
    Do you have any comments on this?

    The waveform acquisition method and results are as follows (Waveform data attached)

    Waveform measurement method :
    They used a DSP test program that drives a C6678 GPIO high when a memory test error occurs.
    They ran the memory test and triggered on the rising edge of the GPIO to capture the DQS/data waveform at the time the error occurred.

    The results are as follows:
    The waveform was normal even during the error.
    They believe the data was not retained correctly inside the DSP.
    The error occurred in the memory test for the walking pattern.
    *************************************************
    Address: 0x8100_E6D8
    Result value : 0xFFF7_FFFF_FFFF_FFFF
    Expected value : 0xFFFF_FFFF_FFFF_FFFF
    At this time, the burst data in Bit51 is 0,0,0,0,1,0,0,0,0,0,
    and the data is output from the memory as expected.
    *************************************************
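As a cross-check of the values reported above, the failing bit and its byte lane can be derived by XOR-ing the expected and read-back words. This is a minimal sketch; it assumes byte lane n simply carries data bits 8n to 8n+7 (the actual board-level lane mapping is not stated in this thread), and the function names are hypothetical.

```python
def failing_bits(expected, actual, width=64):
    """Return the bit positions at which `actual` differs from `expected`."""
    diff = (expected ^ actual) & ((1 << width) - 1)
    return [bit for bit in range(width) if (diff >> bit) & 1]


def byte_lane(bit):
    """Map a data bit to its byte lane, assuming lane n holds bits 8n..8n+7."""
    return bit // 8


expected = 0xFFFF_FFFF_FFFF_FFFF  # expected value from the report
actual = 0xFFF7_FFFF_FFFF_FFFF    # result value from the report

for bit in failing_bits(expected, actual):
    print(f"bit {bit} -> byte lane {byte_lane(bit)}")
# prints "bit 51 -> byte lane 6"
```

Under that mapping assumption, the single differing bit is bit 51 in byte lane 6, which agrees with the Bit51 burst data quoted above.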

    Q2.
    The customer has confirmed that changing the "read DQS slave ratio" from
    "0x34" (default value) to "0x1C" eliminates the error.
    However, they want an explanation of how changing the slave ratio affects the error.
    Could you give us that explanation?

    The customer is also requesting evidence that boards currently
    operating with "0x34" (default value) will continue to operate without problems
    after the value is changed to "0x1C". Could you support this request?

    Q3.
    So far, the custom board has been used with the default value of 0x34 set for
    DDR3_CONFIG_23 (slave ratio), based on TI's documentation (errata sheet).
    However, since the error occurred, the customer adjusted the slave ratio of the failing device from 0x34 (default value) to 0x1C
    and confirmed that the error no longer occurs at 0x1C.
    TI recommends using the default value for the slave ratio, but doesn't TI screen each device before shipment to make sure it works with the default value? In other words, shouldn't devices like this one, which do not work unless set to a different slave ratio, be rejected by the screening process?

    best regards,
    g.f.

  • Hi g.f.,

    Thanks for the update and detailed data. 

    >They have not verified the voltage change due to concerns about damage to the board by changing the voltage.
    >However, they said that if verification of the voltage change is mandatory, they will verify.

    I do think it would be beneficial to check the impact due to voltage. Do the DSPs share the same voltage supply?

    >The customer obtained the DQS/data waveform at the time the error occurred and confirmed
    >that the waveform was normal.This leads them to believe that the problem is on the C6678 side.
    >Are there any comments for this?

    Such a small passing window for the slave ratio (example: DSP2 of Board#3 only passes at 0x1C) is not expected. Given that it was previously stated that the failure follows the DSP, I agree that the problem seems to be on the DSP side. However, if there is a voltage drop (or a dip due to supply noise) on the board, some devices could be more susceptible to failure than others.

    Regards,
    Kevin

  • Hi Kevin,

    Thank you for the reply.

    Since the customer thinks this issue is on the DSP side,
    and you also agree that the problem seems to be on the DSP side, should I contact a TI certified quality engineer?

    best regards,
    g.f.

  • Hi Kevin,

    Our customer has checked the CVDD and CVDD1 power supplies of the C6678.
    They shared the measurement results for each power rail, and I have attached the file to this post,
    so please take a look.

    C6678_Result of CVDD_CVDD1_power rail.pdf

    They checked the power supplies of DSP1 and DSP2 on one custom board during normal operation and during memory access.

    The results show no ripple on either power supply, so they believe that
    both power supplies are normal.

    best regards,
    g.f.

  • Hi Kevin,

    I'm sorry for the delay.
    Our customer has completed the test of increasing the CVDD/CVDD1 voltage, and
    I received the results from them. I have attached the file to this post, so please take a look.

    C6678_DDR3__increasing_CVDD_CVDD1_Test_Result.pdf

    As it turns out, increasing the voltage increased the range of Slave Ratio that passed the test.

    best regards,
    g.f.

  • Hi Kevin,

    I'm sorry to have kept you waiting so long.

    The customer tested increasing the CVDD and CVDD1 voltages separately.
    Increasing the CVDD voltage by 5% affected the slave ratio range,
    but increasing CVDD1 did not affect the slave ratio range.

    The customer gave us the details of the test, so I have attached them to this post.
    Please check the file.

    C6678_DDR3__increasing_CVDD_CVDD1_Test_Result_20230410.pdf

    best regards,
    g.f.

  • Hi g.f.,

    Thanks for the updates! This is useful information.

    I am discussing internally, and will reply back once I have more information / suggestions.

    Regards,
    Kevin

  • Hi Kevin,

    Thank you for supporting us.

    After receiving your feedback through the local TI FAE, we shared the details with the customer last week.
    The customer is frustrated that this issue still has not been resolved, but they are going to check CVDD (voltage drop/noise) as you suggested.
    I will share the results as soon as I get them from the customer.

    best regards,
    g.f.

  • Hi Kevin,

    I got the result from the customer.
    I attached the file to this forum, so please take a look.

    C6678_Result of CVDD_power rail_20230428.pdf

    best regards,
    g.f.

  • Hi g.f., 

    We don't see any glaring issues in their CVDD measurements, but measuring supply noise can be tricky and requires very careful technique.

    I am not aware of any similar issues seen at other customers, so the problem may be a combination of unit to unit variation in the SoC performance, and potentially customer PCB impact.

    Can they implement some sort of workaround - e.g., can they detect that the leveling result is out of the normal range and boost the voltage?

    Can you also summarize the situation at the customer?  Are they in production?  How many systems are built and passing vs failing?

    Thanks,

    Kyle

  • Hi Kyle,

    Thank you for the reply, and sorry for the delay.
    I was checking with the customer on the latest situation.

    >Are they in production?
    Yes, it is in production.

    >How many systems are built and passing vs failing?
    405 boards (two C6678 DSPs per board) have been built, and the issue occurred on 16 boards.
    This means 16 C6678 DSPs have DDR3 memory access problems,
    including 7 genuine and 9 market products.


    >Can they implement some sort of workaround - e.g., can they detect that the leveling result is out of the normal range and boost the voltage?
    Before asking the customer about the workaround, I have the following questions:
    1.How can the normal leveling range of each device be determined?
    From the results of the tests the customer has conducted so far, the leveling range seems to
    differ from DSP to DSP, so I expect the customer will ask what the normal range is.

    2.How can the leveling result be detected? Could you give us some advice on how to read it?

    best regards,
    g.f.

  • Hello g.f.,

    Can you share the schematic and layout so that we can review their design?  We are not seeing similar issues with other C667x customers and are concerned that there may be some issue with their design.

    Also ... re: "detect that leveling result is normal".  We don't have data on the exact ranges that should be expected.  Maybe they can experiment on a larger number of systems using the slave ratio changes that they mention above?
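One way to summarize such an experiment is to reduce each device's pass/fail results to its passing window(s) over the tested slave ratios. The sketch below is only an illustration: the function name is hypothetical, and the sample results are made up for the example rather than taken from the customer's data.

```python
def passing_windows(results):
    """Group passing slave ratios into windows.

    `results` maps a tested slave ratio to True (pass) or False (fail).
    A window is a run of passing points with no failing tested point
    between them, returned as (first_ratio, last_ratio).
    """
    windows = []
    start = prev = None
    for ratio in sorted(results):
        if results[ratio]:
            if start is None:
                start = ratio
            prev = ratio
        elif start is not None:
            windows.append((start, prev))
            start = None
    if start is not None:
        windows.append((start, prev))
    return windows


if __name__ == "__main__":
    # Hypothetical results for one device, for illustration only.
    sample = {0x14: False, 0x1C: True, 0x24: True, 0x2C: False, 0x34: False}
    for first, last in passing_windows(sample):
        print(f"pass window: {first:#04x}..{last:#04x}")
```

Comparing the windows across many devices would show how much the range varies from unit to unit; more than one window for a single device would indicate a gap in its passing range.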

    Regards,
    Kyle

  • Hi Kyle, Kevin,

    As we discussed in the call with the customer, they have some questions about the Slave Ratio.
    Please answer the questions below.
    Q1) How exactly is the Slave Ratio parameter used in a DDR access?
    Q2) The default value of the Slave Ratio is 0x34. Are all devices expected to pass with the default value?
    Q3) As you can see in the table shared earlier, there is a "Slave Ratio Window" in which the memory tests passed.
    Are all Slave Ratio values outside of the window expected to fail?
    Or is it possible for the window to have gaps in it?


    Thanks and regards,
    Koichiro Tashiro

  • Hi Tashiro-san,

    1. During a READ access, the DDR3 memory will send the DQ signal edge aligned to the DQS signal. It is the controller's (C6678) responsibility to delay the strobe (DQS) such that the DQ is centered with respect to the DQS rising / falling edges. The DQ needs to be centered to the strobe (DQS) because the data (DQ) is latched on the DQS rising / falling edges, and therefore the data (DQ) should not be transitioning. The slave ratio is used to determine the fraction of a clock cycle that the strobe should be delayed.

    2. According to work-around #1 of Advisory #9 in the C6678 errata (SPRZ334H), the default value (0x34) is expected to work. From my experience with similar controllers, a value of 0x40 corresponds to ~ 1/4 of a clock cycle. In theory, a value of 0x40 would be a rough approximation needed to center the DQ to the DQS assuming that the DQ and DQS signals are skew matched on the PCB within a byte lane. This is because each DQ should be no more than 1/2 a clock cycle in width, and therefore a delay of 1/4 a clock cycle would be needed to center the DQ to the DQS rising / falling edge. However, on physical systems using similar controllers, I have observed the "best" value to generally be slightly less than 0x40. Therefore, a value of 0x34 seems perfectly reasonable. 

    3. It is not expected for there to be gaps in the passing window, though there could be some marginalities at the window's min/max edges. As I stated several posts ago, I would expect an appropriate range to be within 0x10 to 0x50. It seems like this matches reasonably close to what the customer observes with the "OK" devices.
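To put rough numbers on the explanation above, the slave ratio can be converted to an approximate strobe delay. This is a back-of-the-envelope sketch only: it assumes the ratio scales linearly so that 0x40 is 1/4 of a clock cycle (0x100 per full cycle), as suggested by the "~ 1/4" figure above, and it assumes the "1333 MHz" quoted earlier in the thread is the data rate in MT/s, so the DDR clock is half of that. The function names are hypothetical.

```python
RATIO_PER_CYCLE = 0x100  # assumption: 0x40 ~ 1/4 clock cycle, scaled linearly


def ratio_to_fraction(ratio):
    """Fraction of a DDR clock cycle represented by a slave ratio value."""
    return ratio / RATIO_PER_CYCLE


def clock_period_ps(data_rate_mtps):
    """DDR clock period in ps; the clock runs at half the data rate."""
    clock_mhz = data_rate_mtps / 2.0
    return 1e6 / clock_mhz


def delay_ps(ratio, data_rate_mtps):
    """Approximate DQS delay in ps for a slave ratio at a given data rate."""
    return ratio_to_fraction(ratio) * clock_period_ps(data_rate_mtps)


if __name__ == "__main__":
    for ratio in (0x1C, 0x34, 0x40):
        frac = ratio_to_fraction(ratio)
        print(f"{ratio:#04x}: {frac:.3f} cycle, ~{delay_ps(ratio, 1333):.0f} ps at 1333 MT/s")
```

Under these assumptions, 0x34 is roughly 0.2 of a cycle and 0x1C roughly 0.11 of a cycle, so the two settings discussed in this thread differ by a little under a tenth of a clock period.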

    Regards,
    Kevin 

  • Hi Kevin,

    Thank you for attending the recent meeting with our client.
    I have shared the slave ratio details you provided to me with the customer.

    I have also received the DDR3 test results (CVDD increased by 10%) from the customer,
    which was the customer's action item, and I have attached them to this post.
    Could you please review the results?

    C6678_DDR3_Test result(CVDD +10%).pdf

    The customer also has the following question about the results.
    Q1.
    Why does increasing the DSP core voltage reduce the test failures?
    What are the possible factors?

    best regards,
    g.f.

  • Why does increasing the DSP core voltage improve the test failing?

    Hi g.f.,

    Because the CVDD power supply is a source for the DDR controller / PHY. Please see the table note for CVDD in Table 7-1 of the datasheet, which states "Includes core voltage for DDR3 module". If the voltage is too low, then the interface may not work properly.

    Regards,
    Kevin

  • Hi Kevin,

    Thank you for the reply.

    Do you mean that the customer originally supplies the power supply according to the device specifications,
    but there are variations in the required voltage from device to device, 
    so that a voltage that is fine for one device may be low for another device?

    best regards,
    g.f.

  • Hi g.f.,

    There will always be some variation device to device, but it is not expected for there to be failures if the customer supplies the power supply according to the device specification. 

    Regards,
    Kevin

  • Hi Kevin,

    Thank you for the reply and sorry for the delay.
    I understood and I will share with the customer.

    best regards,
    g.f.

  • Hi Kevin,

    Following your reply the other day, we received the following additional inquiry from the customer.
    --------------------------------------------------------------------------------------------
    Regarding CVDD, we are using the recommended value, but do you mean that it is too low?
    Also, you say that the interface does not operate properly if the value is too low,
    but we have seen that only the read of a specific memory (MEM4) is abnormal.
    If it does not work correctly, it is likely that other memory and writes will also have problems.
    I was wondering if it is possible to identify the cause of the problem.
    --------------------------------------------------------------------------------------------

    best regards,
    g.f.