RAM corruption in Piccolo device



Hi,

I am using an F28069 with CCS4, DSP/BIOS ver 5.41.10.36, and code generation tools ver 5.2.10. In field testing we observed unexpected system behavior caused by global variables in SARAM changing value. These problems normally occurred after 6-8 hours of continuous operation. CPU temperature = 70 degC. The system has passed all EMI/EMC compliance testing, so we do not suspect a hardware issue. Suspecting a stack problem, we increased the stack size to 0x600 and relocated it to L5 DPSARAM. We were unable to recreate the problem in house despite long worst-case functional tests. How can we reproduce this situation so that we can validate a fix? What are the possible causes of corrupt global variables in SARAM?

8611.F28069_example_BIOS5_flash.txt

Regards,

Naresh

  • Naresh,

    In all likelihood you've got something amiss in your code someplace.  Perhaps a rogue pointer.  The on-chip analysis units can be useful in debugging this problem.  They allow you to set up a watchpoint.  This is similar to a breakpoint except you are watching for reads or writes to an address (as opposed to a code fetch).  For data corruption problems, you can set up a watchpoint for writes to one of the global vars that is getting corrupted (or watch an address region, since a don't-care mask can be applied to the address bits to ignore some of the LSBs).

    Watchpoints are normally configured using CCS in the breakpoint window.  The problem with debugging here though is that your code probably normally writes to these global vars, in which case the watchpoint will constantly be triggering and halting code execution.  In this case, you can have your code itself configure and then enable/disable the watchpoint.  There is information on doing this in appnote SPRA820 on the TI web.  Don't get fooled by the appnote title.  The appnote discusses setting up watchpoints using your own code.  The idea here would be to disable the watchpoint around normal code access to the corrupted var, and then re-enable it immediately afterwards.  That leaves the watchpoint active during the rest of your code.  This may catch the problem.  If it is an interrupt related issue though (i.e., ISR is corrupting the var), you may also need to disable interrupts around the code that normally accesses the global var.
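The disable/re-enable idea above can be sketched as follows. This is only an illustration of the access pattern, not the SPRA820 code: `wp_arm`, `wp_disarm`, `irq_disable`, `irq_enable`, and `g_motor_state` are hypothetical names, and the stubs just set flags where real code would write the analysis block registers and mask interrupts as the appnote describes.

```c
#include <stdint.h>

/* Hypothetical stand-ins. On the real device these would write the
 * analysis block registers per SPRA820 and mask/unmask interrupts. */
static volatile int wp_armed = 0;
static volatile int irq_masked = 0;

static void wp_arm(void)      { wp_armed = 1; }
static void wp_disarm(void)   { wp_armed = 0; }
static void irq_disable(void) { irq_masked = 1; }
static void irq_enable(void)  { irq_masked = 0; }

/* Hypothetical global that is being corrupted in the field. */
volatile uint16_t g_motor_state;

/* The only sanctioned way to write the variable. The watchpoint is
 * off just for the duration of the legitimate write, so any other
 * write anywhere in the program will trip it. */
void set_motor_state(uint16_t value)
{
    irq_disable();          /* keep ISRs out of the unguarded window  */
    wp_disarm();            /* legitimate write must not trigger      */
    g_motor_state = value;
    wp_arm();               /* watch again for everything else        */
    irq_enable();           /* sketch: assumes interrupts were on     */
}
```

Routing every intended write through one accessor like this keeps the unguarded window small and makes it easy to confirm that no other code path writes the variable directly.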

    It will take some work to debug this problem.  I've suggested above an avenue of attack that may work.  Good luck.

    Regards,

    David

    P.S.  I'd first make sure you have the watchpoint configured properly.  Test it.  Enable the watchpoint and run over a spot in your code that you know writes to the var.  Make sure the watchpoint interrupt triggers.  Also, make sure you carefully read the bit descriptions for the analysis block registers given in SPRA820.  Some of the settings are not intuitive.  Follow what the appnote indicates carefully.

  • Naresh,

    Sadly, the most common cause of memory corruption is a software error.  Look at the location of the variables in the linker memory map.  Are the ones that were corrupted near each other?  Are they near a large array, which might have accidentally been indexed outside its bounds?  Does the system use DMA?

    Bill
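One way to test the "large array indexed outside its bounds" theory is to wrap a suspect array in guard words and check them periodically (for example from a low-priority task). The sketch below is an assumption-laden illustration: the struct, the pattern value, and the names `guarded_buf`, `g_samples`, and `guards_intact` are all hypothetical and do not come from the original code.

```c
#include <stdint.h>

#define GUARD_PATTERN 0xA5C3u   /* arbitrary, unlikely data value */
#define BUF_LEN       16u

/* Keeping the guards and the array in one struct guarantees that
 * they sit directly below and above the array in memory. */
struct guarded_buf {
    uint16_t guard_lo;
    uint16_t data[BUF_LEN];
    uint16_t guard_hi;
};

static struct guarded_buf g_samples = { GUARD_PATTERN, {0}, GUARD_PATTERN };

/* Returns 1 while both guard words still hold the pattern,
 * 0 once something has written past either end of the array. */
int guards_intact(const struct guarded_buf *b)
{
    return b->guard_lo == GUARD_PATTERN && b->guard_hi == GUARD_PATTERN;
}
```

If `guards_intact(&g_samples)` ever returns 0, the array is being overrun, and logging when it first fails narrows down which code ran just before.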

  • David,

    We analyzed the code for array spillover and rogue pointers, but couldn't find anything. We then overloaded the firmware with all asynchronous ISRs coming in at a high frequency, so much so that the menu functions and display (which have the lowest priority) were reduced to a crawl. All possible external input events were triggered simultaneously and in combination, but there was no smoking gun. Watchpoints were also set as per SPRA820. Again nothing. This testing has been going on since the first post. During these tests the CPU load is much higher than under actual field conditions.

    At the site we have also replaced the board with a new one. The results are the same.

    Any ideas?

    We have conducted EFT and surge tests several times on this board.

    Although tested, can conducted RF cause this situation? This is a motor drive application in a typical industrial environment.

    regards

    Naresh

  • Naresh,

    >> Although tested, can conducted RF cause this situation? This is a motor drive application in a typical industrial environment.

    I suppose anything is possible.  I myself haven't encountered that.

    About all I can suggest is you start removing portions of code functionality until the problem goes away.  Then you know the last thing you removed is probably the culprit (or at least related to the root cause).

    Good luck,

    David

  • David,

    I just wanted to inform you that the problem is solved. We had some unbounded arrays in our code.

    Thank you for your support.

    regards

    Naresh
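The thread does not say how the unbounded arrays were fixed. A minimal sketch of one common approach, assuming every array write can go through a helper that knows the array length, is shown below; `checked_store` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Stores value at buf[idx] only when idx is inside the array.
 * Returns 0 on success and -1 when the index is out of range, so
 * the caller can log or count the fault instead of silently
 * overwriting a neighbouring global. */
int checked_store(uint16_t *buf, size_t len, size_t idx, uint16_t value)
{
    if (buf == NULL || idx >= len) {
        return -1;
    }
    buf[idx] = value;
    return 0;
}
```

Counting the rejected writes in the field build would also have pointed at the offending index source long before the corruption showed up as system misbehavior.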