C6455 cache errors

We are seeing a strange problem on a few of our custom C6455-based boards. This occurs on 4 boards out of about 150 boards so far.

In short, there seem to be memory errors when reading and writing to the DDR2 with the L1D/L2D cache enabled (usually a stuck bit). Disabling the L2D cache (or the L1D cache on some boards) gets rid of the memory problems. Please note this is a simple CPU-driven read and write to external memory (no cache coherency issues). The problem goes away after reset is applied again without a power cycle.

We have checked the power up and reset sequence, which complies with the datasheet requirement (3.3V first, then 1.2V and 1.8V together).
Delaying the reset deassertion after power is stable does not seem to make a difference. 

Any suggestions? Has anyone seen this kind of behavior before?

  • If you are lucky, the problem is the timing registers for the DDR2 EMIF or DDR2 configuration. Second chance on good luck would be simple changes to termination resistors on the board. Third chance would be that the voltages to the DDR2 devices are slightly out-of-spec and can be adjusted by simple component replacements on your board.

    My suspicion is that you either have some bad DDR2 devices or timing problems in the DDR2 board layout implementation. But if the failures are voltage- or temperature-sensitive, you have something you might be able to fix without replacing the memory devices.

    When cache is disabled, CPU reads and writes always go to the DDR one-at-a-time. When cache is enabled, reads will always be bursts of a cache line; writes will depend on how the cache line is filled, or not filled, so writes could be single or could be bursts.

    One interesting test would be to have cache enabled and run the test that finds bad bits. Then leave the cache enabled and clear all the MAR bits for the DDR space, and then repeat the test.
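
    As a rough illustration of the MAR experiment, here is a minimal C sketch. It assumes the C64x+ convention that each MAR register covers a 16 MB region with the cacheability (PC) bit in bit 0, that MAR0 sits at 0x01848000, and that the C6455 DDR2 data space is the 256 MB window starting at 0xE0000000; please check these against the device documentation. The names `mar_index` and `mar_set_cacheable` are made up for this example, and the register array is passed in as a pointer so the logic can also be exercised on a host.

```c
#include <stdint.h>
#include <stddef.h>

#define MAR_REGION_SHIFT 24u          /* each MAR covers 16 MB = 2^24 bytes (assumed) */
#define MAR_PC_BIT       0x1u         /* "permit copies" (cacheable) bit, bit 0 (assumed) */
#define DDR2_BASE        0xE0000000u  /* assumed start of the C6455 DDR2 data space */
#define DDR2_SIZE        0x10000000u  /* assumed 256 MB window */

/* Index of the MAR register that governs a given byte address. */
static unsigned mar_index(uint32_t addr)
{
    return (unsigned)(addr >> MAR_REGION_SHIFT);
}

/* Set or clear the PC bit in every MAR covering [base, base + size).
 * 'mar' points at MAR0; on the target this would be
 * (volatile uint32_t *)0x01848000 (assumed address). */
static void mar_set_cacheable(volatile uint32_t *mar, uint32_t base,
                              uint32_t size, int cacheable)
{
    unsigned first = mar_index(base);
    unsigned last  = mar_index(base + (size - 1u));
    unsigned i;

    for (i = first; i <= last; i++) {
        if (cacheable) {
            mar[i] |= MAR_PC_BIT;
        } else {
            mar[i] &= ~MAR_PC_BIT;
        }
    }
}
```

    With these assumptions the DDR2 window maps to MAR224 through MAR239. On the target, any lines already held in cache would presumably need a writeback-invalidate before the second test pass, so that stale cached data does not mask the result.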

    I guess I need to ask, did you closely follow the DDR2 layout rules for the C6455?

  • Yes, we have followed the DDR2 layout rules.

    This problem occurs only on a few boards and goes away if reset is asserted again without power cycle, which suggests that something is not reset / initialized correctly.

    I tried the test with cache enabled and the DDR2 space MAR bits cleared, and the errors didn't occur. Doesn't clearing the MAR bits essentially disable the cache? How different is that from explicitly disabling the cache?

     

     

  • The known variables that affect the failures are

    • specific boards
    • power cycling
    • reset
    • cache

    Using the MAR bits is an easy way to change the test without touching any of the cache or EMIF configuration registers, making it a little less intrusive on the system during a sequence of tests using different configurations. If your test is set up in a loop, with cache enabled and MARs set, you get one set of results for the first cycle of the test. Then you can either manually or through the test program clear the MARs and rerun the test, getting another set of results. If those differ, you can set the MAR bits back to 1 and run the test again to see if the first results are repeated. Be careful that the GEL file does not do a reset between runs.

    It is not clear to me how reproducible the errors are, how hard-stuck the error bits are, whether the errors change with voltage or temperature or speed, whether the errors are caused by bad writes or bad reads.

    Another thing to try would be EDMA transfers, which would also do bursting operations whether cache is enabled or not. The idea is to narrow down the known variables. Using the simple MAR bits is just one way to help isolate the cause.
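
    The set / clear / set sequence described above could be driven by a small harness like the following sketch. Everything here is hypothetical: `run_aba`, `failure_tracks_cacheability`, and the two hook types are names invented for illustration, and the `sim_` functions only simulate a board whose errors appear when the region is cacheable so that the flow can be tried on a host.

```c
#include <stdint.h>

typedef void     (*set_cacheable_fn)(int cacheable); /* hypothetical hook: sets or clears the MAR bits */
typedef uint32_t (*mem_test_fn)(void);               /* hypothetical hook: returns an error count */

typedef struct {
    uint32_t errors_cached_first;  /* pass 1: MAR bits set */
    uint32_t errors_uncached;      /* pass 2: MAR bits cleared */
    uint32_t errors_cached_again;  /* pass 3: MAR bits set again */
} aba_result_t;

/* Run the same memory test three times, changing only cacheability. */
static aba_result_t run_aba(set_cacheable_fn set_cacheable, mem_test_fn run_test)
{
    aba_result_t r;

    set_cacheable(1);
    r.errors_cached_first = run_test();
    set_cacheable(0);
    r.errors_uncached = run_test();
    set_cacheable(1);
    r.errors_cached_again = run_test();
    return r;
}

/* True when errors show up in both cacheable passes and not in between. */
static int failure_tracks_cacheability(const aba_result_t *r)
{
    return r->errors_cached_first != 0u &&
           r->errors_uncached == 0u &&
           r->errors_cached_again != 0u;
}

/* Host-side stand-ins that simulate a failing board (not real hardware access). */
static int sim_cacheable;
static void sim_set_cacheable(int cacheable) { sim_cacheable = cacheable; }
static uint32_t sim_run_test(void) { return sim_cacheable ? 2u : 0u; }
```

    Calling `run_aba(sim_set_cacheable, sim_run_test)` exercises the flow with the simulated hooks; on the target the hooks would wrap the real MAR update and memory test, with no reset issued between the three passes.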

  • Thanks for your replies.

    We don't think these errors are related to DDR2.

    I ran a memory test on L1 RAM and L2 RAM (the test program was loaded in DDR2, L1P was enabled, data caches disabled).

    We got memory errors (stuck bits) at memory locations in L1D and L2 RAM.

    Has anyone ever come across internal memory errors on the C64x+ DSPs?

     

  • It would be good to take DDR completely out of the picture. Your test program is probably less than 32KB, and if not, find a way to shrink it to less than 32KB. This way it will fit completely into L1P and everything can be contained within the C6455.

    Disable all cache, including L1P, and load the test program into L1P and run it from there.

    From what you have said above, you will find stuck bit failures in L1D and L2. Is this true on all 4 failing boards, that you have stuck bit failures in both L1D and L2 on all 4 boards?

    What patterns, if any, show up in the 4 sets of failure signatures?
    Do the same bits fail every time?
    Do they continue to fail until reset is applied?
    If you apply reset after power cycling, do the failures go away?
    Are the failures temperature or voltage or CPU frequency dependent?

    After you have exhausted all electrical and environmental tests, have you tried swapping the DSPs between a working board and a failing board to confirm that the failures follow the DSP and not the board, or go away when moved to a new board? I do not know if this is a practical thing for you to do or not, but some customers do this to try to prove whether the problem is with the DSP or the board/components.
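
    To make the failure signatures easy to compare between boards and between runs, a pattern test could record which bits fail and in which direction. The following is only a sketch: `pattern_test`, `record_mismatch`, and `memtest_report_t` are invented names, the four patterns are an arbitrary choice, and the stuck-bit masks are accumulated over the whole region rather than per location.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    size_t   failures;         /* number of mismatching reads */
    size_t   first_bad_index;  /* word index of the first mismatch */
    uint32_t stuck_at_0;       /* bits that read 0 where 1 was written */
    uint32_t stuck_at_1;       /* bits that read 1 where 0 was written */
} memtest_report_t;

/* Fold one expected/actual pair into the report. */
static void record_mismatch(memtest_report_t *rep, size_t index,
                            uint32_t expected, uint32_t got)
{
    uint32_t diff = expected ^ got;

    if (diff == 0u) {
        return;
    }
    if (rep->failures == 0u) {
        rep->first_bad_index = index;
    }
    rep->failures++;
    rep->stuck_at_0 |= diff & expected;   /* wrote 1, read 0 */
    rep->stuck_at_1 |= diff & ~expected;  /* wrote 0, read 1 */
}

/* Write each pattern to the whole region, then read it back and compare. */
static memtest_report_t pattern_test(volatile uint32_t *mem, size_t words)
{
    static const uint32_t patterns[] = {
        0x00000000u, 0xFFFFFFFFu, 0xAAAAAAAAu, 0x55555555u
    };
    memtest_report_t rep = {0, 0, 0, 0};
    size_t p, i;

    for (p = 0; p < sizeof patterns / sizeof patterns[0]; p++) {
        for (i = 0; i < words; i++) {
            mem[i] = patterns[p];
        }
        for (i = 0; i < words; i++) {
            record_mismatch(&rep, i, patterns[p], mem[i]);
        }
    }
    return rep;
}
```

    If the same `first_bad_index` and the same mask come back on every run after a power cycle, that would answer the "same bits every time" question directly. On the DSP this would need to run with the data caches disabled for the region under test, so that the reads really come from the RAM being checked.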

  • RP said:
    We have checked the power up and reset sequence, which complies with the datasheet requirement (3.3V first, then 1.2V and 1.8V together).
    Delaying the reset deassertion after power is stable does not seem to make a difference. 

    Any chance you can provide a scope shot of your power sequence and reset timing? The fact that the problem goes away with a reset leads me to think that there is not an issue with any of the components individually, but with the device being in a funky state (likely caused by power/reset timing).

  • Randy,

    See my answers inline

    RandyP said:

    What patterns, if any, show up in the 4 sets of failure signatures?

    No pattern; the error is in L1D on two of the boards and in L2 on the other two.

    RandyP said:

    Do the same bits fail every time?

    Yes, for each faulty DSP the same bit at the same location fails, but the bit and location differ among the 4 DSPs.

    RandyP said:

    Do they continue to fail until reset is applied?

    Yes. JTAG reset doesn't help; POR reset (without a power cycle) gets it out of this state.

    RandyP said:

    If you apply reset after power cycling, do the failures go away?

    A power cycle puts it back in this error state; POR reset (without a power cycle) gets it out of this state.

    RandyP said:

    Are the failures temperature or voltage or CPU frequency dependent?

    I don't think it's temperature related, and it's not frequency related (it's a 1 GHz device and the problem shows up at 750 MHz too).

    Still investigating if it's a voltage issue.

    My question really would be: what could ever cause the DSP to be in this state? Improper voltage? Clock? Reserved pins? Reset?

  • Tim,

    Working on getting a scope shot

  • RP said:
    My question really would be , what could ever cause the DSP to be in this state ,  improper voltage ? clock ? reserved pins ? reset ?

    For me, the worst part about troubleshooting a problem like this is it seems to be potentially anything including these suggestions. As a rule the best I can suggest is to verify all RSVD pins and make sure all instructions in the datasheet are followed, as well as the timings for the reset. Jitter on the clock can cause funky behavior, but I would guess this would not cause bit-mask issues. Improper voltage is definitely a suspect (missing Vcc pins, slightly-out-of-spec voltage rails, etc.).

  • Tim,

    Here are the power and reset timings.

    This is the reset timing; reset is deasserted around 700 ms after the voltage is stable (reset is the green signal).

    The only anomaly I can see is that the reset signal is around 280 mV while the power is coming up.

     

     

    This is the power-up sequence: 3.3 V first, then 1.8 V and 1.2 V 2.7 ms later.

  • According to the data sheet:

    "After the DVDD33 supply is stable, the remaining power supplies can be powered up at the same time as CVDD as long as their supply voltage never exceeds the CVDD voltage during powerup."

    It looks like your 1.8V rail is higher than the 1.2V rail for pretty much the entire period.  That might be the issue.
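
    If the scope captures can be exported as sampled voltages, this kind of rule can be checked mechanically. The sketch below is hypothetical: `count_sequencing_violations` is an invented name, the two arrays are assumed to be sampled at the same instants, and the caller decides how many samples count as "during powerup" and how much measurement tolerance to allow.

```c
#include <stddef.h>

/* Count the samples in [0, window) where a secondary rail reads higher
 * than CVDD by more than tol_v volts. Both arrays hold voltages taken
 * at the same sample instants (an assumption of this sketch). */
static size_t count_sequencing_violations(const double *cvdd_v,
                                          const double *rail_v,
                                          size_t window, double tol_v)
{
    size_t bad = 0;
    size_t i;

    for (i = 0; i < window; i++) {
        if (rail_v[i] > cvdd_v[i] + tol_v) {
            bad++;
        }
    }
    return bad;
}
```

    A non-zero count only says the rail led CVDD at some sample inside the chosen window; whether that matters is still a question for the data sheet wording and the supply designer.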

    So after you clear up this issue with a reset, does the issue come back again or does that "permanently" clear it up (i.e. until a later power-up)?

  •  

    Brad Griffis said:

    So after you clear up this issue with a reset, does the issue come back again or does that "permanently" clear it up (i.e. until a later power-up)?

    It fixes the issue until a later power-up.

    I shall look into the 1.8V rail being higher than the 1.2V rail. Thanks for your reply.

     

  • Brad,

    We fixed the 1.8 V rail being higher than the 1.2V rail but still no joy.

  • So if you just power up your board and immediately give it a reset before running any tests/code do you avoid this condition altogether?  What does your reset timing look like?  Perhaps there's an issue there. 

    I often like to go through the data sheet searching for the word "must" because a lot of important info tends to follow in general.  Make sure you don't have any reserved pins terminated incorrectly, etc.

     

  • We have checked the reset timing and reserved pins and they all seem correct. We have had boards manufactured with this design work fine for over two years now but just encountered 4/150 boards with this problem.

    Just wanted to know if anyone had ever encountered this problem with internal memory errors.

    Thanks to everyone for their suggestions.

  • RP said:
    Just wanted to know if anyone had ever encountered this problem with internal memory errors.

    If you are asking if your description matches a known, recurring problem with the C6455, it does not.

    We do know that for your specific case, a POR clears up the problem and maybe power cycling leads to the problem recurring. There are still some questions we have asked, if you want to continue to debug this problem.

    You can search the E2E forum for other references to internal memory errors, but my search did not find anything that would directly help you any better than what is already in this thread.

    Please let us know if you want further assistance.

     

  • As Randy said, this is an issue that has not been observed before and there is nothing immediately that I can think of that points to the cause. A power up cycle followed by the POR causes the stuck bit, but a power on reset without the power cycling gets rid of the issue.

    - Can you give us a scope trace of the reset de-assertion during power up cycle? In short, we want to see the reset deassertion timing referenced against the power up sequence

    - Can you also provide a trace of the reset that gets rid of the issue?

    - It looks like a POR reset gets rid of the issue. Does warm reset also get rid of the problem?

    - Could you also provide us with the part number and lot trace code on the DSP? Lot trace code will look like #xx-#######.

    where,

    # is an alphanumeric character.

    x is a numeric character only.
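
    For anyone collecting these codes from many boards, the format above can be checked with a few lines of C. This is just a sketch: `is_lot_trace_code` is an invented name, and skipping a leading `$` is an assumption based on how the codes appear when posted later in this thread.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 if s matches "#xx-#######" (# alphanumeric, x numeric), else 0. */
static int is_lot_trace_code(const char *s)
{
    size_t i;

    if (s[0] == '$') {
        s++;  /* assumption: tolerate a leading '$' marking symbol */
    }
    if (strlen(s) != 11) {
        return 0;  /* 1 + 2 + 1 + 7 characters */
    }
    if (!isalnum((unsigned char)s[0])) {
        return 0;
    }
    if (!isdigit((unsigned char)s[1]) || !isdigit((unsigned char)s[2])) {
        return 0;
    }
    if (s[3] != '-') {
        return 0;
    }
    for (i = 4; i < 11; i++) {
        if (!isalnum((unsigned char)s[i])) {
            return 0;
        }
    }
    return 1;
}
```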

     

    -Aditya

  • RP,

    Can you tell us if this issue was resolved or are you still facing it and need debug support? Do let us know.

  • Aditya,

    No, this issue was not resolved. I'm working to get the scope trace of the reset that fixes the issue.

    Thanks

  • RP,

    We want to see the power-up reset scope trace.  That's the one that seems to have a problem.  We would like to see the reset signal relative to the power rails.

    Brad

  • Brad Griffis said:

    RP,

    We want to see the power-up reset scope trace.  That's the one that seems to have a problem.  We would like to see the reset signal relative to the power rails.

    Brad

    I think I've already posted that earlier here (http://e2e.ti.com/support/dsp/tms320c6000_high_performance_dsps/f/112/p/39318/138276.aspx#138276)

  • What speeds are CLKIN1 and CLKIN2?  Are they stable?  After your voltages are stable you must make sure that nPOR stays asserted for at least 256 CLKIN2 cycles.

    Is there an easy way for you to lengthen the delay until reset is released? 

  • Brad Griffis said:

    What speeds are CLKIN1 and CLKIN2?  Are they stable?  After your voltages are stable you must make sure that nPOR stays asserted for at least 256 CLKIN2 cycles.

    Is there an easy way for you to lengthen the delay until reset is released? 

    CLKIN1 is 50 MHz. PLLM is (20-1) and the DSP frequency is 1000 MHz.

    CLKIN2 is 26.6 MHz.

    As seen in the scope trace here, the reset is deasserted 700 ms after the voltages are stable, which is way over the required 256 CLKIN2 cycles.
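
    The arithmetic behind that margin can be written out as a small C sketch. The function names are invented for this example, and the CPU clock helper assumes the PLL multiplies CLKIN1 by (PLLM + 1) with no further division, which matches the "(20-1)" setting mentioned above.

```c
/* Duration in seconds of a number of clock cycles at a given frequency. */
static double cycles_to_seconds(double cycles, double freq_hz)
{
    return cycles / freq_hz;
}

/* CPU clock when the PLL multiplies CLKIN1 by (PLLM field + 1);
 * any dividers are assumed to be 1 in this sketch. */
static double cpu_clock_hz(double clkin1_hz, unsigned pllm_field)
{
    return clkin1_hz * (double)(pllm_field + 1u);
}

/* How many times longer the observed reset hold is than the required minimum. */
static double reset_margin(double hold_s, double min_cycles, double clkin2_hz)
{
    return hold_s / cycles_to_seconds(min_cycles, clkin2_hz);
}
```

    With the numbers from this thread, 256 cycles at about 26.67 MHz is roughly 9.6 µs, so a 700 ms hold is on the order of 70,000 times the minimum, and 50 MHz multiplied by 20 gives 1000 MHz.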

  • I've not seen CLKIN1 and CLKIN2 anywhere.  Can you please check to be certain they are stable long before reset is released?  A couple screenshots would be great.

  • Brad Griffis said:

    I've not seen CLKIN1 and CLKIN2 anywhere.  Can you please check to be certain they are stable long before reset is released?  A couple screenshots would be great.

    Here are some more scope captures:

    Voltages: 3.3V comes up first; 1.2V and 1.8V come up 2 ms after 3.3V is stable.

     

     

    Reset  is de-asserted 700ms after the voltages are stable:

     

    Clocks:

    CLKIN1 is 50MHz

     

    CLKIN2 is 26.67 MHz:

     

    The Clocks start when 3.3 V  comes up:

     

    Both the clocks are stable when RST goes high ( 700ms after voltages are stable) :

     

    Let me know if you need more information.

  • RP,

    - Can you provide a trace of the PORz  AND RESETz during the power up sequence along with power rails? (Same as the 2nd trace from your previous post but with RESETz)

    - Can you also provide two sets of dumps of DDR2 registers: One after power up initialization and the second one after the reset that fixes the issue

    - Does a warm reset instead of POR fix the issue as well?

    - Following up on an earlier suggestion by Randy, is it possible for you to pull out the DSPs from the problematic board and swap them with the DSPs on the working boards? It would be good to know if the issue tails the DSP or the boards.

     

    Also, do let us know the lot trace codes on the devices.

  • Aditya said:

    - Can you provide a trace of the PORz  AND RESETz during the power up sequence along with power rails? (Same as the 2nd trace from your previous post but with RESETz)

    The RST signal is POR reset. RESET is not used and is always HIGH.

     

    Aditya said:

    - Can you also provide two sets of dumps of DDR2 registers: One after power up initialization and the second one after the reset that fixes the issue

    Do you mind sharing what info you are looking for in the DDR2 registers that could be related to the internal memory errors?

    Aditya said:

    - Does a warm reset instead of POR fix the issue as well?

    Warm Reset Pin RESET is not used and is tied HIGH. A System Reset (JTAG Reset) does NOT fix the problem.

    Aditya said:

    - Following up on an earlier suggestion by Randy, is it possible for you to pull out the DSPs from the problematic board and swap them with the DSPs on the working boards? It would be good to know if the issue tails the DSP or the boards.

    No. At this time, it's not possible to swap the DSP on the boards.

    Aditya said:

    Also, do let us know the lot trace codes on the devices.

    $N21-95A8L8W

    $N21-95A8L8W

    $N21-95A8L8W

    $N21-99A0F4W

  • RP,

    I think that's very interesting that System Reset doesn't resolve the issue.  Just to clarify, in CCS you are actually going to Debug -> Advanced Resets -> System Reset, right?  (Most people just click Debug -> CPU Reset which is different.)

    Could you try asserting the (warm) /RESET to see if that will successfully clear the error?

    Brad

  • RP said:

    The RST signal is POR reset. RESET is not used and is always HIGH.

    Please try asserting RESETz low as well during power up sequence. (See Figure 7-8 Power-up timing of C6455 datasheet). RESETz release timings are given in prior tables.

     

    RP said:

    Do you mind sharing what info are you looking for in the DDR2 registers could be related to the internal memory errors?

    You are loading internal memory test into DDR2. A PORz reset resets the DDR2 and is also fixing the issue until a later power up which brings up the issue again. Ideally we would expect DDR2 controller register reset values to be the same both after power-up and the subsequent PORz.

     

    RP said:

    Warm Reset Pin RESET is not used and is tied HIGH. A System Reset (JTAG Reset) does NOT fix the problem.

    System Reset does not reset DDR2 controller registers and also does not take the problem away. Can you see if WARM reset instead of PORz fixes the issue?

     

  • RP,

     

    Any luck? Still awaiting the results on warm reset as well as the register dumps.

  • The Warm RST pin (RESET) is tied HIGH on our boards, so it's not possible to experiment with asserting RESET at power-up.

    (At least, not at this point, since the end customer does not want us to modify the board.)

    I would like to mention that I'm not bringing up this board. This design has been used in production for over 2 years without any issues.

    For the sake of discussion:

    What if the RESET assertion-deassertion fixes the problem? Does that tell us anything about what caused the problem? Or,

    if the DDR2 register contents are different in the two states, would we then know what was wrong?

  • As you have been explaining from the beginning, these 4 boards are failing after a long production run of 2 years without seeing this problem. You have some of the best at TI who have been working with you to narrow down to where the problem really exists - the four C6455s or the four boards.

    Each of the tests that they have been asking for, from power supply scope shots to memory failure patterns to reset assertions, all of these are intended to add another clue to what the problem might be. In some cases, a positive or negative result will help to validate or eliminate possible failure mechanisms.

    Each of the DSPs is tested before it leaves the factory to ensure that you will not have problems like these. The memory failures you have encountered are very easy to catch in our memory tests, so it is very hard to expect four devices to have these memory failures when they were shipped from the TI factory. To be honest, it is my belief that the DSPs were in fact working well when they were shipped from the TI factory. And from some of your testing, it seems very likely to me that the DSPs are still working well. This is based on the fact that these parts may fail immediately after powering on, but the failures go away when the POR is re-asserted without power cycling (in your second posting you said the problem "goes away if reset is asserted again without power cycle").

    You were also correct in that second posting when you said that this evidence suggests "that something is not reset / initialized correctly". Once everything is completely stabilized and you assert a second reset, the DSP works without failures. Something is incompatible in the power cycling/reset/clocking start-up but we have not yet found anything to positively say is the problem. These questions we have asked are still trying to find that cause.

    At some point you may have to decide whether to throw out these 4 boards or return the DSPs for Failure Analysis or apply double resets to each board.

    How do you want to proceed?

  • RandyP said:

    At some point you may have to decide whether to throw out these 4 boards or return the DSPs for Failure Analysis or apply double resets to each board.

    How do you want to proceed?

    We will continue debugging on our end to make sure we have not violated any of the timing, voltages or strapping options. In the meantime if anyone comes across anything similar or has any information as to what potentially could cause internal memory errors please update this thread.

    One more piece of info :
    The C6455 on the board is a PCI slave device and is configured to boot via PCI. The PCI host on the board is a DM648 device.
    I tried an experiment where, instead of booting the C6455 over PCI, we ran a test program on the DM648 and accessed the internal memory of the C6455 (over PCI) and sure enough it had errors in the same location as before. So this test basically takes the DDR out of the picture and also our init code.
    Don't know if this helps but it was a test to eliminate any bug in our chip init code.


    Aditya:
    Here is the DDR2 register dump.

    The register dumps were the same in both states:
    MIDR    : 0x0031030F
    DMCSTAT : 0x40000004
    SDCFG   : 0x00530822
    SDRFC   : 0x00000820
    SDTIM1  : 0x26DB5389
    SDTIM2  : 0x0096C722
    BPRIO   : 0x000000D0
    DMCCTL  : 0x00000005
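
    To compare two such dumps mechanically rather than by eye, a helper along these lines could be used. It is a sketch only: `reg_entry_t` and `diff_dumps` are invented names, and the table simply repeats the values listed above.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    const char *name;
    uint32_t    value;
} reg_entry_t;

/* Values as posted in this thread (identical in both states). */
static const reg_entry_t ddr2_dump[] = {
    {"MIDR",    0x0031030Fu},
    {"DMCSTAT", 0x40000004u},
    {"SDCFG",   0x00530822u},
    {"SDRFC",   0x00000820u},
    {"SDTIM1",  0x26DB5389u},
    {"SDTIM2",  0x0096C722u},
    {"BPRIO",   0x000000D0u},
    {"DMCCTL",  0x00000005u},
};
#define DDR2_DUMP_COUNT (sizeof ddr2_dump / sizeof ddr2_dump[0])

/* Print every register whose value differs between two dumps that list
 * the same registers in the same order; returns the number of differences. */
static size_t diff_dumps(const reg_entry_t *a, const reg_entry_t *b, size_t n)
{
    size_t diffs = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        uint32_t changed = a[i].value ^ b[i].value;
        if (changed != 0u) {
            printf("%-8s 0x%08lX -> 0x%08lX (changed bits 0x%08lX)\n",
                   a[i].name, (unsigned long)a[i].value,
                   (unsigned long)b[i].value, (unsigned long)changed);
            diffs++;
        }
    }
    return diffs;
}
```

    Filling a second table from the other state and calling `diff_dumps` on the pair would report any register that changed, along with the bits that differ.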

    The two screen shots:
    DDR2 Register State1 (With Internal Memory Error)

     

    DDR2 Register State2 ( No Internal Memory Error)