
Need to simulate Hardware Data Breakpoint (also known as Watchpoint) on C6424



Hello,

I have a product based on C6424. It seems to have a serious software issue: it consistently crashes after 5 days and a few minutes of run time (both in the lab and at the customer site). Running under CCS, we have set and hit the breakpoint at _EXC_dispatch. We have determined that an area of DDR2 memory has been corrupted. This area contains a table of pointers to other functions (i.e. a jump table). Someone has corrupted the table, causing us to branch to an invalid instruction. Now we need to know who corrupted the memory.

Hardware Data Breakpoint (also known as Watchpoint) would be perfect for this. But unfortunately C6424 is a mid-GEM device, and doesn't have watchpoints. The software technique to simulate a watchpoint is to frequently scan the memory area and halt when the memory changes. You look at which task or ISR ran most recently and hope that it was the responsible party. We have done this before to find other bugs and know how to do it. But for a problem that takes 5 days to manifest, it is very hard to get enough test runs to track down the problem using software techniques.
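
For readers unfamiliar with the software technique described above, here is a minimal sketch in C. It is not our actual code: the names `sw_watch_t`, `sw_watch_arm` and `sw_watch_check` are hypothetical, and the FNV-1a hash is just one convenient way to detect a change without keeping a full copy of the region. The idea is to capture a hash while the region is known good, re-check it from the task switch hook or a periodic ISR, and remember which context ID was passed in at the first failing check.

```c
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; this is an illustration, not product code. */
typedef struct {
    const volatile uint8_t *base; /* start of the watched region            */
    size_t   len;                 /* length of the region in bytes          */
    uint32_t expected;            /* hash captured while region was good    */
    int      tripped;             /* nonzero once a mismatch has been seen  */
    int      last_good_ctx;       /* context ID at the last passing check   */
    int      suspect_ctx;         /* context ID at the first failing check  */
} sw_watch_t;

/* 32-bit FNV-1a hash over the region. */
static uint32_t sw_watch_hash(const volatile uint8_t *p, size_t len)
{
    uint32_t h = 2166136261u;     /* FNV-1a offset basis */
    size_t i;
    for (i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;           /* FNV-1a prime */
    }
    return h;
}

/* Capture the reference hash. Call this once the region holds valid data. */
void sw_watch_arm(sw_watch_t *w, const void *base, size_t len)
{
    w->base = (const volatile uint8_t *)base;
    w->len = len;
    w->expected = sw_watch_hash(w->base, len);
    w->tripped = 0;
    w->last_good_ctx = -1;
    w->suspect_ctx = -1;
}

/*
 * Re-check the region. ctx_id identifies the task or ISR that ran most
 * recently (how you obtain it is RTOS-specific and assumed here).
 * Returns 1 if the region has changed, 0 otherwise. The first suspect is
 * latched so later calls cannot overwrite it.
 */
int sw_watch_check(sw_watch_t *w, int ctx_id)
{
    if (w->tripped) {
        return 1;
    }
    if (sw_watch_hash(w->base, w->len) != w->expected) {
        w->tripped = 1;
        w->suspect_ctx = ctx_id;
        return 1;                 /* caller would halt or hit a breakpoint here */
    }
    w->last_good_ctx = ctx_id;
    return 0;
}
```

The cost of each check grows with the size of the region, so in practice one would watch only a small part of the corrupted area. The suspect is only as precise as the interval between checks, which is the limitation described above.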

So we need some "out of the box" thinking about some way to hardware simulate a watchpoint on the C6424. Is there anything that can be done using JTAG? Is there anything that can be done with page tables? (Unfortunately C6424 doesn't have a memory protection unit.) We really need to get creative here.

Is there a footprint-compatible part to C6424 that could be changed out on the PCB that would have full-GEM?

Regards,

Jeff Siegel

 

 

  • Jeff,

    As far as I know, we don't have a lot of options here. I think the idea that you had about finding a similar full-GEM device is a good start. I'm not familiar with the footprints and peripherals available on our various families, but hopefully there is one that is similar.

    To clarify my understanding, you are saying that this happens at essentially exactly the same time?  i.e. always 5 days plus a few minutes?  never 4 days? Never 6 days?  Never 5 days and 3 hours?

    I'm just brainstorming here, so I don't know if this will work or not.  But if we put the device in real-time mode, we can read memory from the debugger without interrupting the processor.  You could write a CCS script that, maybe after 5 days, repeatedly reads a small piece of data that you know is going to get corrupted to see if it's still valid.  And if it is not, then halt.  The problem with this is that there's some delay between the time that the data gets corrupted and when the debugger actually halts.  Even if this is 100 ms, that's an eternity on a device running at 1 GHz, and many instructions will have been executed in the meantime.

    Some other thoughts, assuming that the issue is as predictable as I mentioned above.

    If they're running at 1 GHz, is there enough headroom that they can slow the CPU down a little (like to 900 MHz) and see if the issue a) still occurs at the same time, b) occurs at a different time, or c) doesn't occur at all. This might give us some insight as to whether it's based on a specific external interrupt that's occurring, or if it's just happening after we've executed a certain piece of code a certain number of times, or whether it's related to the speed of the CPU.

    I'm not sure what their application is, but is there any way to save the entire state of the DSP after 5 days?  i.e. Halt at 5 days, export all of the memory and register data to a file.  This might allow us to quickly reload the state  and reproduce the problem after a few minutes so that we can more quickly try different possibilities.  If we have to wait 5 days for each run, this could take years. 

     

    Those are just a few ideas.

    Regards,

    Dan

     

  • All the devices in the family have the same mid-GEM, so I don't think you can find a simple replacement here.

    Could you please give us more information on what this chunk of data is? Our experts might be able to come up with some workarounds.

  • Dan writes:

    >To clarify my understanding, you are saying that this happens at essentially exactly the same time?
    >i.e. always 5 days plus a few minutes?  never 4 days? Never 6 days?  Never 5 days and 3 hours?
    Always 5 days and a few minutes.

    >But if we put the device in real-time mode, we can read memory from the debugger without interrupting the processor.
    >You could write a CCS script that, maybe after 5 days, repeatedly reads a small piece of data that you know is going to get corrupted to see if it's still valid.  And if it is not, then halt.
    We are already doing this using software in the DSP. We monitor the memory area in the task switch hook and also in the FPGA ISR (which gets called periodically by a timer in the FPGA).

    >The problem with this is that there's some delay between the time that the data gets corrupted and when the debugger actually halts.
    Very true. I suspect the technique we are already using would be faster than doing it in CCS.

    >but is there any way to save the entire state of the DSP after 5 days?  i.e. Halt at 5 days, export all of the memory and register data to a file.
    This is a very interesting idea. I assume GEL files would be involved to dump and load the data. Unfortunately, we have little experience with GEL files and functions. We'd need some help from TI to get it to work.

    Paul writes:
    >All the devices in the family have the same mid-GEM, so I don't think you can find a simple replacement here.
    Unfortunate.

    >Could you please give us more information on what this chunk of data is? Our experts might be able to come up with some workarounds.
    As mentioned before, there is an interrupt handler that handles the FPGA interrupt. There are many causes of FPGA interrupt, so the ISR reads more details from the FPGA and dispatches to a sub-handler. The sub-handlers are contained in a function pointer table. Someone is scribbling on a large area of memory (64KB+) and wrecking the table, so the ISR is jumping to an illegal address. The table is located in DDR2 memory. We need to catch who is causing the memory corruption.

    Regards,
    Jeff

  • Jeff Siegel said:
    So we need some "out of the box" thinking about some way to hardware simulate a watchpoint on the C6424.

    Well, not sure if it will help, but it is certainly "out of the box". I had a problem like this once: a structure that was not initialized properly produced a broken pointer, and it would break my software every 40 min. What I did to figure it out was to keep placing the PC (program counter) on a RET instruction, and I ended up going back to the function that had the bug. Not sure if it is possible, depending on how broken your program is, but I thought it was worth mentioning.

  • Are you able to move this "table" to internal memory or somewhere else in the DDR and see if the corruption still occurs?

  • Hi,

    We were able to find our bug. By putting a breakpoint at _EXC_dispatch, we identified the last running task, and that gave us a clue as to what the issue was. We were then able to change the testing methodology to make the unit fail within a few minutes. At that point, it was easy to do repeated test runs, setting breakpoints at various places, until we found the piece of code causing the memory corruption.

    Regards,

    Jeff Siegel
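
As a footnote for anyone who lands here with a similar symptom, the sketch below shows one way an ISR could guard its dispatch through a function pointer table, so that a corrupted entry is reported instead of being branched to. This is not the code from the product discussed in this thread, and it would not by itself identify the culprit. All names (`NUM_CAUSES`, `live_table`, `shadow_table`, `dispatch_register`, `dispatch_guarded`, `example_handler`) are hypothetical, and it assumes the shadow copy can be placed in a different memory region from the live table, such as internal RAM.

```c
#include <stddef.h>

#define NUM_CAUSES 4              /* hypothetical number of interrupt causes */

typedef void (*sub_handler_t)(void);

/* Live table (the one that got corrupted in a case like this thread's)
 * and a shadow copy that is assumed to live in a different memory region. */
static sub_handler_t live_table[NUM_CAUSES];
static sub_handler_t shadow_table[NUM_CAUSES];

/* Hypothetical sub-handler, present only so the sketch can be exercised. */
static int example_hits;
static void example_handler(void)
{
    example_hits++;
}

/* Register a sub-handler in both tables. Out-of-range causes are ignored. */
void dispatch_register(unsigned cause, sub_handler_t fn)
{
    if (cause < NUM_CAUSES) {
        live_table[cause] = fn;
        shadow_table[cause] = fn;
    }
}

/*
 * Dispatch to the sub-handler for 'cause'.
 * Returns  0 if the handler was called,
 *         -1 if 'cause' is out of range,
 *         -2 if the entry is empty or no longer matches the shadow copy.
 * On -2 the caller can halt with the system still in a sane state,
 * rather than taking an exception at an illegal address.
 */
int dispatch_guarded(unsigned cause)
{
    sub_handler_t fn;

    if (cause >= NUM_CAUSES) {
        return -1;
    }
    fn = live_table[cause];
    if (fn == NULL || fn != shadow_table[cause]) {
        return -2;
    }
    fn();
    return 0;
}
```

For example, after `dispatch_register(1, example_handler)`, a call to `dispatch_guarded(1)` runs the handler and returns 0. If `live_table[1]` is later overwritten, the same call returns -2 without branching anywhere.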