DDR3 isolated issue with DM8148

Hi

In our custom PCBA using the DM8148, we have used 4 x 256MB DDR3 chips, 2 each for the DDR0 and DDR1 interfaces. Out of the 18 prototype PCBAs built, we are facing an isolated issue on the DDR interface of one PCBA. In fact, the DDR interface on this PCBA was working fine initially and then suddenly became non-functional without any external cause that we know of (no short while probing, no excess voltage from the power supply, etc.).

Below are the symptoms and some of the steps we carried out:

1. U-Boot does not load successfully.

2. Using JTAG, we found that a write to a particular location also affects some other random locations, i.e. data is written to a few other locations as well (and the data written there differs from the expected value).

3. Probing the DDR CLK shows a 400 MHz clock on both interfaces, which is our expected clock frequency.

4. Removing all 4 DDR chips and replacing with new ones did not solve the problem
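
Symptom 2 (a write disturbing other locations) can be checked systematically with a "write one marker, scan the rest" test. The following is only a minimal sketch: `find_aliases`, `FakeMemory`, and the stuck-address-bit fault model are hypothetical stand-ins for whatever JTAG memory read/write access the debugger provides, and they are not part of any TI tool.

```python
def find_aliases(write_word, read_word, base, n_words, word_bytes=4):
    """Return {written_addr: [other addresses that changed]} for a small region.

    write_word(addr, value) and read_word(addr) are caller-supplied accessors
    (for example, thin wrappers around a debugger's memory window).
    """
    addrs = [base + i * word_bytes for i in range(n_words)]
    aliases = {}
    for target in addrs:
        # Clear the whole region, then write a single marker word.
        for a in addrs:
            write_word(a, 0x00000000)
        write_word(target, 0xA5A5A5A5)
        # Any other location that is no longer zero was disturbed by the write.
        disturbed = [a for a in addrs if a != target and read_word(a) != 0]
        if disturbed:
            aliases[target] = disturbed
    return aliases


class FakeMemory:
    """Hypothetical memory model; stuck_low_mask simulates stuck-low address bits."""

    def __init__(self, stuck_low_mask=0):
        self.cells = {}
        self.stuck_low_mask = stuck_low_mask

    def write(self, addr, value):
        self.cells[addr & ~self.stuck_low_mask] = value

    def read(self, addr):
        return self.cells.get(addr & ~self.stuck_low_mask, 0)


if __name__ == "__main__":
    faulty = FakeMemory(stuck_low_mask=0x10)
    report = find_aliases(faulty.write, faulty.read, 0x80000000, 8)
    for written, others in report.items():
        print(hex(written), "->", [hex(a) for a in others])
```

The test is quadratic in the region size, so it is meant for a small window (a few hundred words), not the full DDR space. An empty result on a region suggests the address/control path is behaving, at least for the address bits that region exercises.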

Since the other 17 boards are perfectly OK with respect to the DDR interface, we are a bit puzzled about what actually went wrong and how to troubleshoot it. It would be helpful if someone could suggest possible debugging steps or point to where the issue might lie.

Regards,

KS

  • KS,

    For debugging DDR3 on a custom DM814x board, I can point you to the resources below:

    DM814x datasheet, 8.13 DDR2/DDR3 Memory Controller

    http://processors.wiki.ti.com/index.php/DM814x_Hardware_Design_Guide

    http://processors.wiki.ti.com/index.php/AM387x_/_C6A814x_Schematic_Review_Checklist

    http://processors.wiki.ti.com/index.php/TI814x-DDR3-Init-U-Boot

    http://processors.wiki.ti.com/index.php/DDR_Routing_Checklist

    http://processors.wiki.ti.com/index.php/Common_DDR_Issues

    The Mistral DM8148 EVM reference schematics

    The Mistral EVM DDR3 software tests (BB_021_DDR3_TEST.out)

    Regards,
    Pavel

  • Hi Pavel

    Thanks for those links. I had gone through most of them previously, except for the one titled 'Common DDR Issues'.

    We followed the steps mentioned in the 'Common DDR Issues' link and tried to access the DDR through the JTAG interface. We noticed that when we fill a chunk of memory, for example 2K locations, with FFFFFFFF or 00000000, we see xxFFFFFF or xx000000 respectively at certain locations within the 2K space. Here, xx is some random number, and the affected locations do not all have the same value for xx (the first two digits).

    This was the same for both DDR0 & DDR1 interfaces.
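
    The xxFFFFFF / xx000000 observation above can be quantified by counting mismatches per byte lane. This is a minimal sketch only; `lane_errors` is a hypothetical helper, and the sample values 0x3C and 0x91 are made up for illustration, not readings from the board.

```python
def lane_errors(expected, observed, lanes=4):
    """Count, per byte lane, how many 32-bit words differ in that lane.

    Lane 0 is D7..D0 and lane 3 is D31..D24.
    """
    counts = [0] * lanes
    for exp, obs in zip(expected, observed):
        diff = exp ^ obs  # set bits mark the data bits that read back wrong
        for lane in range(lanes):
            if (diff >> (8 * lane)) & 0xFF:
                counts[lane] += 1
    return counts


if __name__ == "__main__":
    expected = [0xFFFFFFFF] * 4
    # Illustrative read-back data: only the top byte is corrupted.
    observed = [0xFFFFFFFF, 0x3CFFFFFF, 0xFFFFFFFF, 0x91FFFFFF]
    print(lane_errors(expected, observed))  # [0, 0, 0, 2]
```

    If only the last entry is ever non-zero, the errors are confined to D31..D24 and the signals associated with that byte lane.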

    Here are my doubts:

    1. Does this confirm that the suspect is the most significant byte lane of each DDR interface, i.e. bits D31..D24 and their respective strobes? Even then, I am not sure why this is happening, given that the other boards are working fine.

    2. Does this also confirm that there is no issue with the address/control lines? We see that data is written only to the targeted chunk of memory, and no data is written elsewhere, although the non-targeted space holds random data that changes on every refresh of the memory view.

    What else can I check to narrow things down further? Any suggestions?

    Thanks

    Regards,

    KS

  • KS,

    My suggestion is to compare your failing custom board against your working custom boards and against the DM8148 EVM.

    Regards,
    Pavel

  • Hi Padmanabhan,


    The most common test you can perform is to check whether the proto board operates at a lower DDR clock frequency (below 400MHz). If that specific proto board works perfectly at lower DDR frequencies, then you can be fairly sure that there is no issue with the assembly or with any other signal on that board.

    In that case, to make it run at a 400MHz DDR clock, the read/write leveling parameters would need to be tuned further, i.e. some timing parameters relaxed.

    Regards,

    M

  • Hi Marut

    Thanks for your reply.

    We tried running at 100MHz, but the DDR still does not operate properly.

    One thing we noticed while filling a chunk of memory with known data on the DDR1 (EMIF1) interface: every 4th location had random wrong data, and every 8th location had repetitive wrong data.

    Since the write/read operations are carried out only on the expected memory space, we concluded that the address/control lines are working normally.
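
    The "every 4th / every 8th location" pattern can be made explicit by binning the failing word indices modulo the burst length. This is a sketch under assumptions: DDR3 uses a burst length of 8, and each 32-bit word is taken to map to one beat of the burst, which depends on the bus width and address mapping actually in use. `failures_by_burst_position` is a hypothetical helper.

```python
from collections import Counter


def failures_by_burst_position(expected, observed, burst_len=8):
    """Histogram of mismatching word indices, binned by index modulo burst_len."""
    hist = Counter()
    for i, (exp, obs) in enumerate(zip(expected, observed)):
        if exp != obs:
            hist[i % burst_len] += 1
    return dict(sorted(hist.items()))


if __name__ == "__main__":
    expected = [0x55AA55AA] * 32
    observed = list(expected)
    # Illustrative corruption of every 4th word (indices 3, 7, 11, ...).
    for i in range(3, 32, 4):
        observed[i] ^= 0xFF000000
    print(failures_by_burst_position(expected, observed))  # {3: 4, 7: 4}
```

    Errors that cluster on fixed positions within a burst, rather than on fixed addresses, would point more towards data strobe or timing behaviour than towards the address lines.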

    Does this point to anything else?

    Regards,

    KS Padmanabhan

  • Hi Padmanabhan,

    Based on your observation, I would strongly recommend checking the routing of the data signals (data, data strobe, data mask), specifically for the DDR devices on which you are observing the problem.

    Please check the following for those DDR devices:

    1. Length matching

    2. Value of the series termination on the data signals, if used

    3. Leveling (timing) parameters for that DDR (several iterations may be required to calculate the exact parameters)

    Further, it would be better if you could share the DDR sections of the schematic and layout for review, so that we can be sure there is no simple mistake.

    Regards,

    M

  • Hi Padmanabhan,

    I went through the whole thread again and see that the issue affects only one board out of the 18 protos, so it is unlikely to be a design issue. It is more likely a board-specific assembly issue.

    Regards,

    M

  • Marut

    Yes. I was about to point out that :)

    Regarding an assembly issue: this same board was working initially and was our principal board for bring-up.

    Could there be a partial failure of the DM8148 itself, i.e. perhaps the DDR memory controller interface is not working properly? We have already replaced all four DDR chips, but the issue persists.

    Regards,

    KS

  • Hi Padmanabhan,

    Well, I can see the possibility of a dry solder joint on the processor side. You can observe the waveform of the data lines on which you are seeing errors. If the waveform is not clean, that would point to a dry solder joint. Capture the waveform at the receiving end only, i.e. measure at the DDR end while writing.

    In case of a dry solder joint, you can apply some pressure on the processor with your finger (while staying ESD safe) during execution of the test. If this improves the results by reducing how often the error repeats, then you can try reheating the processor with a hot air gun.

    Also measure the resistance of the traces (between the DDR and the processor) for the data signals that show errors. You may get some clue :)

    Regards,

    M

  • Hi Padmanabhan,

    I am facing a very similar issue in my 2nd proto batch. 16 boards are working, while 4 or 5 boards have a DDR3 issue. Could you please share your findings so that we can diagnose this issue?

    Note: I did not face this issue on any board of my 1st proto batch (10 boards).


    Regards,

    Kartik

  • Hi Kartik

    We tried troubleshooting through JTAG, but that didn't give us any conclusive results. We tried writing and reading patterns and could only conclude that the problem lies in the most significant byte of both external memory interfaces.

    Since all the other boards were OK, we gave up on further analysis of this PCBA.

    Do you have any findings? Any common patterns across the failed boards? Are you able to get the 'CCCCCCC' message without loading U-Boot?

    Regards,

    Padmanabhan

  • Hi Padmanabhan,

    Yes, we are also facing a similar problem with the upper byte of both DDR3 controllers.

    I am still debugging this problem. I never saw it on our alpha proto units.

    Please note that we are using CCYE2 (silicon 3.0 and DDR3 at 533MHz). I tried 400MHz, but the problem remains the same.

    On one of the boards, I scratched the via of the DDR0_D30 pin near the processor and connected an oscilloscope probe to it, and the board started working again. But the same trick did not work on the other boards.

    Can you or somebody else help us understand this behavior? Does this look like a PCB fabrication problem, or what else could it be?


    Thanks,

    Kartik Gandhi

  • Padmanabhan,

    The behavior you describe of the device working ok initially but later not working could be indicative of a cold solder joint (i.e. a manufacturing issue).  With a cold solder joint the device might work ok initially, but the joint can open over time leading to failures.

    Another possibility might be marginal timing for the specific data lane in question.  Have you performed software leveling?  The software leveling process tunes the sampling window for each DQS line.  It could be that your timing is marginal, so you're not seeing issues on most boards, but small process variations bring out the issue on some devices (like the one in question).  If you have not yet done the software leveling, that might be all you need to fix the timing issue.  Along these lines, have you tested any of the other boards at high or low temperature?  If your timings are borderline then you might see issues even on other boards at high/low temperature.
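
    As an illustration of what "tuning the sampling window" means, the sketch below sweeps a delay setting, records pass/fail for each value, and picks the middle of the widest passing run. It is a generic window-centering sketch, not TI's software leveling procedure; `best_delay` and the pass/fail list are hypothetical.

```python
def best_delay(results):
    """Return the delay index at the center of the longest passing run.

    results[i] is True if the memory test passed with delay setting i.
    Returns None if no setting passed.
    """
    best_start, best_len = None, 0
    start = None
    # A trailing False closes a passing run that extends to the last setting.
    for i, ok in enumerate(list(results) + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2


if __name__ == "__main__":
    sweep = [False, False, True, True, True, True, True, False, True, False]
    print(best_delay(sweep))  # 4
```

    A board whose passing window is narrow, or whose chosen value sits near the edge of the window, is the kind of board that could fail with temperature or part-to-part variation.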

    FYI, the earlier test of operating at 100 MHz is not legal for DDR3 as there is a DLL inside DDR3 which has both max and min frequencies.  Off-hand I believe 303 MHz is the min frequency allowed though you should check your specific DDR3 data sheet to be certain.  So if for example you could slow down your DDR3 clock speed to 303 MHz then I think it would be more likely to work.  Note that you'll also need to reduce your refresh period since now the clock cycles will represent more time.
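
    The refresh adjustment is simple arithmetic: the refresh interval is fixed in time, so the count expressed in clock cycles shrinks as the clock slows. A minimal sketch, assuming the usual DDR3 average refresh interval of 7.8 us (normal temperature range; check your DDR3 data sheet) and a hypothetical helper name `refresh_cycles`; how the result is programmed into the controller is not shown.

```python
T_REFI_NS = 7800  # assumed average refresh interval in ns (7.8 us)


def refresh_cycles(clock_hz, t_refi_ns=T_REFI_NS):
    """Number of whole DDR clock cycles that fit into one refresh interval."""
    # Integer arithmetic avoids floating-point rounding at exact boundaries.
    return clock_hz * t_refi_ns // 1_000_000_000


if __name__ == "__main__":
    for mhz in (400, 303):
        print(mhz, "MHz ->", refresh_cycles(mhz * 1_000_000), "cycles")
    # 400 MHz -> 3120 cycles
    # 303 MHz -> 2363 cycles
```

    Rounding down keeps the programmed interval at or below the required one, so the memory is never refreshed too rarely.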

    Brad

  • Brad

    Thank you for the response.

    Regarding a cold solder joint: yes, this could be a process issue. We haven't checked the board under X-ray yet, and we are not really confident that an X-ray would capture it even if one is present. We can give it a shot though. Also, based on one of the responses above, I did try using a hot-air blower, but there was no change in the status.

    SW leveling has already been implemented. During board bring-up (this is also the first lab prototype of the PCBA), we were able to get it to work at 400MHz even without SW leveling initially. Later on, fine tuning was done through SW leveling as well. On one of the other boards, we have characterized the DDR interface using a LeCroy scope and its software add-on for DDR3 testing. In those tests, we seem to be passing most tests with sufficient timing as well as AC/DC voltage margins.

    The good PCBAs haven't been stressed at either high or low temperature extremes yet. We should be doing that soon, before turning in the next version of the PCBA.

    Regards,

    Padmanabhan

  • Hi,

    We are also facing the same issue on ONE of our 40 boards.

    Observation: When the board is powered on, 'CCC' is printed. After inserting the microSD card, the 'CCC' output stops, but the board does not boot from the microSD card.

    The following debugging steps have been taken, with no results:

    1. Our primary boot configuration was NAND first and then MMC, but to rule out the NAND flash, the configuration has been changed to MMC.

    2. Checked power, clock and reset; everything seems to be fine.

    3. Tried connecting through JTAG; we were able to connect to the target, but couldn't write to the NAND flash.

    4. The DM8148 is providing a 480MHz clock to the DDR PHYs.

    5. We checked for any possible shorts on the board, except under the BGA components, and everything seems to be fine.

    Please let us know how to debug this issue, and how to tell whether the problem is with the DDR chips or the DM8148 IC.

    Regards

    Madhura

  • Hello Madhura,

    In our case, we found that the issue was in the assembly process. The baking process for the DDR3 chips (4 per board) was not followed according to their MSL level. The assembly team had used the same packet (which had been opened during the alpha batch) and then baked the parts again for only 12 hours instead of the minimum 24 hours recommended as per JEDEC standards.

    Kindly check with your assembly team/vendor and get a report on the baking procedure followed for that particular batch, as well as the X-ray (BGA) report for the processor and the DDR3 chips.

    Regards,
    Kartik Gandhi
  • Thanks, Kartik, for the prompt reply.

    Sorry, I forgot to mention that the board was working initially and then stopped suddenly.

    Regards
    Madhura
  • This could be due to a cold solder joint.  A cold solder joint may not be initially obvious at time zero, but as time goes on the joint can open and cause failures. This often is the case in situations like this one where the production ATE test passes.

  • I agree with Brad on this; the behavior of the board you have described here can be due to a cold solder joint. I have seen multiple boards that run fine for a week or so and then suddenly stop working.

    Regards,
    Kartik Gandhi