C6455 Opcode exception / memcpy hang in DSP/BIOS - marginal design; debug advice sought

Hi, I'm looking for some help or suggestions in debugging what looks like a marginal design (done by a predecessor).

The situation is that we have two prototype boards each with two C6455 DSPs. Both DSPs on both of these boards work fine.

We then got 2 pre-production boards. Of the four DSPs in total, only one works. The others all fail in a similar way.

Failure is due to either an opcode fetch exception (in 2 of the 3 failing chips) or memcpy getting stuck in an infinite loop (in the other DSP), in the part of DSP/BIOS that does memory section initialisation. Code is executing from IRAM, but some of the sections are in DDR2. So it seems that something that accesses memory is unhappy (yet it is totally consistent – i.e. it always fails the same way).

BUT – if I single step – or, weirdly, carefully chill the chip (only) with freezer spray, then the DSP works. Always.

I tried changing the PLL (we normally run at 1.2G) but this made no difference.

I read the part of the user manual that says the memory should be initialised before DSP/BIOS – this is done in the GEL file for the moment (the bootloader does this in the full system but I can’t get far enough to program that in yet!)

I am suspicious - but not wholly convinced - that the failure occurs when the DDR2 is accessed (due, I believe, to the original designer wanting heaps in some sections).

As far as I can tell, DDR2 is set up OK. But I am not the original designer so have not thoroughly checked every parameter - but it works on the prototypes.

I am also reasonably confident that the hardware itself is well made (and probably x-rayed), so I don't think this is a solder problem or noise pickup in a track etc. The reason for this assertion is that the fault is totally consistent.

Because I have no source for DSP/BIOS and am relying only on interpreting what I see in the disassembly window, I am finding it extremely difficult to work out where the DSP departs from normal code execution. Can anyone suggest what to try next? In the case where I get an exception, the NRP is 0x6000,0000, which does not make much sense.

DSP/BIOS is 5_41_11_38

Thanks in advance

  • Gavroche,

    You have listed a number of things you question in the design. All of them should be examined in greater detail. Some of the easy questions that I can add:

    Did you change the DDR device brand or revision or part number?
    Have you checked the power supplies and signals for the DDR interface?
    Since you call the design marginal, what is marginal about it?
    Have you done side-by-side comparisons of voltage levels and signal integrity between the working board and the failing ones?

    Do both of the "opcode exception" failures occur at 0x6000,0000? Put a hardware breakpoint at 0x6000,0000 to stop the DSP before the NMI is executed, then look at the Core Registers to see where it might have come from. B3 usually holds the return address for function calls, and some other register might hold that 0x6000,0000 instead if it is not a C function target address.

    I do not understand how memcpy could have an infinite loop. By design it counts down from some count value and only writes that many things. Maybe you can do a screen capture of the Disassembly Window and the relevant Core Registers to help us understand what is happening. If you reproduce that same functionality elsewhere in memory, would it still fail? If it is in a failure spin loop, look at the code above it to see what it might have been doing when it dropped into the spin; but that is not something I have ever noticed in our memcpy routines.

    What is different, if anything, on the C6455 device markings?

    The freeze spray test points to a timing problem. This would make me suspect the DDR interface timing, so you do need to go through every parameter and validate them against the DDR device datasheet.

    Instead of lowering the DSP clock speed, try lowering the DDR clock speed.

    Run a memory test on the board. Use memcpy to copy blocks to get the best speed out of the wider instructions instead of a simple non-optimized DSP for-loop.

    Try the freeze spray on the DDR. Does it help or make things worse, or make things different in any way?

    Regards,
    RandyP
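
A block-copy memory test along the lines suggested above might look like the following. This is only a minimal sketch: the function names, the pattern generator, and the idea of keeping the scratch block in IRAM are assumptions for illustration, not code from TI or from the poster's project. On the target, the region would point at DDR2 and caching would need to be accounted for so that the accesses really reach the external memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill a scratch block with a position-dependent pattern (hypothetical helper). */
static void fill_pattern(uint32_t *buf, size_t words, uint32_t seed)
{
    size_t i;
    for (i = 0; i < words; i++)
        buf[i] = seed ^ (uint32_t)(i * 0x9E3779B9u);
}

/*
 * Write the whole region block by block with memcpy, then regenerate each
 * block's pattern and compare word by word.
 * Returns the number of mismatching words (0 means the test passed).
 */
size_t memcpy_block_test(uint32_t *region, size_t region_words,
                         uint32_t *scratch, size_t block_words,
                         uint32_t seed)
{
    size_t errors = 0;
    size_t off, i;

    /* Pass 1: fill every block, so later writes can disturb earlier ones. */
    for (off = 0; off + block_words <= region_words; off += block_words) {
        fill_pattern(scratch, block_words, seed + (uint32_t)off);
        memcpy(region + off, scratch, block_words * sizeof(uint32_t));
    }

    /* Pass 2: verify every block against a regenerated pattern. */
    for (off = 0; off + block_words <= region_words; off += block_words) {
        fill_pattern(scratch, block_words, seed + (uint32_t)off);
        for (i = 0; i < block_words; i++)
            if (region[off + i] != scratch[i])
                errors++;
    }
    return errors;
}
```

Writing the entire region before verifying any of it means a marginal write to one block has a chance to corrupt another before it is checked; running the test repeatedly with different seeds, and with freeze spray applied or not, would show whether the error count moves.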

  • Thanks. 

    I agree this looks like a DDR2 issue - running an intensive memcpy test also induces a failure on the "good" board, and freezer spray changes the nature of the failure. In the new boards, the DDR2 chips are Rev C, and are die-shrunk from the Rev A used on the original prototypes.

    I am now going through the timing parameters from ground up.

    But before looking at timing, I am confused as to why both my board and the EVM on which its DDR2 design is based set the IBANK bitfield in SDCFG to 4 banks (=2); see the GEL file at http://c6000.spectrumdigital.com/dsk6455/v2/files/DSK6455.gel on which ours is based.

     DDR_SDCFG    = 0x00D38822;   // xxx22 = 4 banks, 1k page

    The devices on the EVM are MT47H64M16 which have 8 banks. We use MT47H128M16 instead (which are the same but 2Gb).  
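
To make the bank setting easier to see, the register value can be decoded in a few lines of C. This is a sketch under stated assumptions: it takes IBANK to be bits 6:4 and PAGESIZE to be bits 2:0 of SDCFG, with IBANK = n meaning 2^n banks and PAGESIZE = n meaning 256 << n words per page. That matches the GEL comment (the two low nibbles of `0x...22` giving 4 banks and a 1k page) but should be checked against SPRU970. The macro and function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed field positions; verify against the SDCFG description in SPRU970. */
#define SDCFG_IBANK_SHIFT    4u
#define SDCFG_IBANK_MASK     (0x7u << SDCFG_IBANK_SHIFT)
#define SDCFG_PAGESIZE_MASK  0x7u

/* Number of internal banks selected by the IBANK field (2^IBANK). */
unsigned sdcfg_banks(uint32_t sdcfg)
{
    return 1u << ((sdcfg & SDCFG_IBANK_MASK) >> SDCFG_IBANK_SHIFT);
}

/* Page size in words selected by the PAGESIZE field (256 << PAGESIZE). */
unsigned sdcfg_page_words(uint32_t sdcfg)
{
    return 256u << (sdcfg & SDCFG_PAGESIZE_MASK);
}

/* Return a copy of sdcfg with only the IBANK field replaced. */
uint32_t sdcfg_with_ibank(uint32_t sdcfg, uint32_t ibank)
{
    return (sdcfg & ~SDCFG_IBANK_MASK) |
           ((ibank << SDCFG_IBANK_SHIFT) & SDCFG_IBANK_MASK);
}
```

Under these assumptions, `sdcfg_banks(0x00D38822)` gives 4 and `sdcfg_with_ibank(0x00D38822, 3)` gives `0x00D38832`, which would be the 8-bank variant of the same configuration.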

    If I set the value to 8 banks, as I think it should be, my crash is a Fetch Packet exception, probably in memcpy in BIOS. It's hard to say because single stepping changes things, and more often than not the crash means I lose contact with the DSP.

    A value of 4 banks means I can at least fail my memory test in my own code:)

    Thanks

     

  • I think I'm going vaguely insane - I am now convinced I should have 8 banks selected  - based on SPRU970G Fig 11.

    I think I should be using the only combination in that table that gives 27 bits of (word)address when using a 10 bit page on a 32 bit wide configuration.

    i.e. 14 row bits + 3 bank bits + 10 col bits = 27 bits => 128M words => 512M bytes == 2Gbit x 2 chips.
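
The address-bit arithmetic can be checked with a small helper. This is an illustrative sketch (the function name is hypothetical): 27 word-address bits give 2^27 = 128M words, and with a 32-bit (4-byte) bus that is 512 MB, i.e. 4 Gbit, or two 2 Gbit devices.

```c
#include <stdint.h>

/*
 * Total addressable bytes for a given row/bank/column split.
 * bus_bytes is the data bus width in bytes (4 for a 32-bit interface).
 */
uint64_t ddr_capacity_bytes(unsigned row_bits, unsigned bank_bits,
                            unsigned col_bits, unsigned bus_bytes)
{
    uint64_t words = (uint64_t)1 << (row_bits + bank_bits + col_bits);
    return words * bus_bytes;
}
```

With 14 row, 3 bank and 10 column bits on a 4-byte bus this returns 512 MB; dropping to 2 bank bits (the 4-bank setting) returns 256 MB, so only half of the fitted memory would be addressable.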

    Although it now falls over in BIOS - I think it may be due to calling a global object's constructor located in DDR2 which I had previously overlooked (not my design so have to investigate that further)  -  do you think I should definitely be using 8 banks?

    Thanks

  • Gavroche,

    I trust your reading of the datasheets so I trust you should be using 8 banks. My recommendation is to use the values as you determine them and then test it. Check whether you get any mirrored locations, where writing a value at one address like 0x80000000 has that same value show up at 0x90000000 or 0xA0000000 or 0xC0000000, for example.

    On the line from Figure 11, I cannot comment without seeing and analyzing the Micron datasheet's Addressing table. You have looked at that and confirmed the number of column address bits, I assume. If so, then you have the right values and can move on to the timing registers.

    Regards,
    RandyP
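
The mirrored-location check described above can be sketched as a power-of-two address walk. This is an assumption-laden illustration rather than a TI routine: the function name is hypothetical, the test overwrites the locations it touches, and on the target `base` would be the start of the DDR2 region in your memory map with caching taken into account.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Write a unique tag at every power-of-two word offset, then write a marker
 * at offset 0 last. If some offset is a mirror of offset 0, the marker will
 * have overwritten its tag.
 * Returns the smallest aliasing word offset found, or 0 if none is detected.
 */
size_t find_alias_offset(volatile uint32_t *base, size_t words)
{
    size_t off;

    for (off = 1; off < words; off <<= 1)
        base[off] = (uint32_t)off;      /* tag is the offset itself */

    base[0] = 0xA5A5A5A5u;              /* not a power of two, so never a valid tag */

    for (off = 1; off < words; off <<= 1)
        if (base[off] != (uint32_t)off)
            return off;                 /* tag was clobbered: mirrored location */

    return 0;
}
```

A non-zero result at an offset smaller than the expected memory size would suggest that an address bit (row, bank or column) is not being decoded as configured.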

    Yes, I was correct. The existing setting lost half the memory, but no-one noticed as the top half isn't used. It seems it's wrong in the EVM GEL file and got propagated through.

    The main cause of the problem has turned out to be that some, but not all, of the series termination resistors were incorrectly fitted by the assembly house and were 220R not 22R.  Amazed it worked as well as it did really.

     

    Thanks for your help!