Switch Table Corruption

Other Parts Discussed in Thread: TMS320F28335, CONTROLSUITE

I promise, this one is weird.  Maybe some linker guru will understand what is happening.  

It started to occur all of a sudden between two code versions with very minor differences, on a project that has been going well for about 6 months now.  Unfortunately, I don't have the "before" and "after" source code, only much older versions.  We can't rely much on those to track down the fatal difference...

Device: TMS320F28335
JTAG emulator: xds100v2
CCS v5.3 (v5.4 behaves the same)
Compiler v6.1.0 (v6.1.3 behaves the same)

I have what looks like a RAM corruption issue in the switch table.
Believe it or not, adding/removing NOP instructions somewhere near the top of main() "cures" the problem.  When the problem strikes back later after writing or removing some more code elsewhere, I remove/add NOPs and "cure" the problem again. 


Now, here is the debugging sequence that reveals the problem:
  I click on "Reset CPU".
  I check content of RAM at 0xc250.  According to the map file, it should be a switch table.  It definitely looks like that.
  I put a watchpoint on writes to 0xc250 and a HW breakpoint at the top of the main() routine, then press "Resume".
  CPU stops at the breakpoint.
  Memory Browser shows 0xC250 in red because it has been modified since the previous halt.  That address is the only one nearby that has been modified.  Everything else is black.

First observation: something between reset and main() destroyed the first pointer of the switch table

Second observation: the watchpoint did not work

Editing the value through Memory Browser to put back the right pointer allows the code to run properly afterward.  No crash whatsoever and nothing writes to that address until a reset occurs.

That is no stack or buffer overflow.  At least not in my own code...

So I turned toward the map file.  Below are 3 snippets.

This one crashes (0 NOP)
0000c247   _atof
0000c24b   _SetDBGIER
0000c24e   _InitEQep
0000c24f   _InitEQepGpio
0000c250   ___etext__
0000c250   _switch_runstart
0000c250   etext
0000d000   _scibDataReady

This crashes too (1 NOP)
0000c249   _atof
0000c24d   _SetDBGIER
0000c250   _InitEQep
0000c251   _InitEQepGpio
0000c252   ___etext__
0000c252   _switch_runstart
0000c252   etext

This one works (2 NOP)
0000c246   _DSP28x_usDelay
0000c24a   _atof
0000c24e   _SetDBGIER
0000c251   _InitEQep
0000c252   _InitEQepGpio
0000c253   ___etext__
0000c253   etext
0000c254   _switch_runstart
0000d000   _scibDataReady


Since the map files for both crashing binaries had etext located at the same address as _switch_runstart, and the working binary had different addresses for each symbol, I thought I had found the problem.

To prove it, I relocated the .switch section elsewhere, far away (RAML6).  For sure, there is no symbol allocated over the switch table.  But the problem is the same!!!  The first pointer (case 0) is corrupted again!!!  This time it contains a different value, so it does not jump into the stack anymore.  It just jumps somewhere else valid, but with the context being totally wrong it soon crashes.

I tried adding some NOPs again with the switch table located in L6, but it did not do any good.

I am getting out of breath now.  

Does anybody have any idea of what is going on here???

  • Fred,

    .switch is an initialized section.  Why are you linking it to RAM?  It needs to be linked to flash.

    - David

  • Hi David,

    .switch needs to be linked to flash?  I don't understand that requirement and it does not match with SPRU514E Table 7-1.  .switch can be located in RAM or ROM.

    I have put the switch section in RAM for faster execution.  There are wait cycles associated with flash access (especially random access).  I can't allow that.  Even all constants are moved to RAM upon startup (section codestart provided in CCS libraries).  

    Still, I will give a try at leaving the switch table in flash and see how it works.  However, that is just going around a bug and not an acceptable solution.  It does not explain the spurious write I observe.

    I should be able to put the switch section in RAM without problem.  In fact I have been able to do so for the last 6 months.  

    I can provide you my .cmd linker file or project files if you feel there might be something wrong in there.  However I would not post it on the forum so provide an e-mail address please.

    Thanks for your support.

    Frederic

  • Frederic,

    SPRU514E is the compiler user's guide, and doesn't really consider system issues with a real embedded system.  All Table 7-1 is doing is considering whether the section is modified at runtime.  If it is, it lists the section for RAM link only.  If it is not modified, it lists ROM or RAM.  The page column is also a don't care on C28x devices as they've been designed.  All memory is connected to both the program and data buses.  It basically doesn't matter what linker page a particular memory is on, nor which page any section is linked to.  Notice in Table 7-1 that it says the .text section can be linked to ROM or RAM.  Suppose you linked it to RAM.  What would be in the RAM after power up?  Nothing!  So, you'd have no code to run.

    The basic rule in an embedded system is that all initialized sections must be linked to non-volatile memory (e.g., flash), and uninitialized sections must be linked to volatile memory (RAM).  The test to determine if a section is initialized or uninitialized is to ask yourself if the contents of the section need to be preset after a power up.  If the answer is yes (e.g., code in .text section) the section is initialized.  If the answer is no (e.g., the stack in .stack section) the section is uninitialized.

    Fred P said:

    Still, I will give a try at leaving the switch table in flash and see how it works.  However, that is just going around a bug and not an acceptable solution.  It does not explain the spurious write I observe.

     
    The requirement to link the .switch section to non-volatile memory is not a bug.  It is a fundamental requirement for any embedded system.  If you need to speed up execution, you can link the .switch section to load to flash but run from RAM.  Your code will then need to copy the section to RAM at runtime.  The compiler will not generate code to do this.  See appnote SPRA958 for more information.
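    As an illustration of the "load to flash, run from RAM" copy described above, here is a minimal host-side sketch in C. It is not the code from SPRA958: the function name `copy_section` and the two arrays are hypothetical stand-ins for the linker-generated load start, run start and size symbols, and 16-bit words are used because C28x memory is word-addressed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy exactly `size_words` 16-bit words from the load image (flash)
 * to the run location (RAM). Hypothetical helper, for illustration only. */
void copy_section(const uint16_t *load_start, uint16_t *run_start,
                  size_t size_words)
{
    assert(load_start != NULL && run_start != NULL);
    for (size_t i = 0; i < size_words; ++i) {
        run_start[i] = load_start[i];
    }
}

/* Host-side stand-ins: on the target these would be the section's
 * load address in flash and its run address in RAM. */
const uint16_t flash_image[4] = { 0x1234u, 0x0000u, 0x5678u, 0x0000u };
uint16_t ram_copy[4];
```

    Calling `copy_section(flash_image, ram_copy, 4)` before any code reads the table leaves `ram_copy` identical to the flash image. The important detail is that the loop runs exactly `size_words` times.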
     
     
    Fred P said:

    I should be able to put the switch section in RAM without problem.  In fact I have been able to do so for the last 6 months. 

     
    I suspect you've been successful linking the .switch section to RAM for the past 6 months because you're still running with the emulator connected.  CCS is loading the .switch section for you when you flash the device.  The emulator is a crutch that can mask linking mistakes.  If all you're doing is some investigatory work and you don't plan to ever make your system run standalone without the emulator, then you can leave initialized sections linked to RAM.  But the minute you get rid of the emulator and power-cycle the device, your code will no longer work.
     
    Regards,
    David
  • David,


    Our code works in the field, in a box, bolted to an electric vehicle. No emulator whatsoever in the way. Considering it is our own design of a 7kW grid-tied inverter, I would bet it does not work by chance only.
    We (including myself) have also developed our own bootloader for flashing new code through the serial port, again without any emulator to make us believe our code works.  That is no big deal, but it should convince you that we are not some wandering programmers discovering what a microcontroller is. So I already knew everything you are explaining about code location vs runtime.
    It is unlikely that I would have been able to gather that much debug information if I were new to embedded development, isn't it?

    So I guess I was not clear enough when I wrote " Even all constants are moved to RAM upon startup (section codestart provided in CCS libraries). "
    I thought it was obvious that we were using TI files "DSP28xxx_CodeStartBranch.asm" and "DSP28xxx_SectionCopy_nonBIOS.asm".  My mistake.

    ALL non-volatile sections are linked to flash but run from RAM (load=flash, run=ram in .cmd file).

    I am not saying that there is a bug in TI tools. I just ran into a weird situation and need help, not a 101 class.  If there is a line in SPRA958L that can explain what is wrong in what I am doing, I'll be glad if you point it out to me.

    I can't see why the switch table could not be moved to RAM at startup.  The switch table is merely a bunch of pointers.  As long as the code reads the pointer at the right place and the pointer actually points at the right place, it should not matter where all of this is located.  Unless there is some problem in jumping from RAM to RAM?  Anyway, that's not a branch issue, it is a memory corruption issue, as I explained from the start.

    That being said, are you interested in digging into my problem?

  • Did the test with the switch table in flash (load=FLASHB, run=FLASHB).

    I get symbols _switch_loadstart and _switch_runstart located one after the other in flashB.

    _switch_runstart content is all 0xFFFF.  What's that??

    _switch_loadstart content is the nice and clean switch table without corruption.

    Guess what?  At runtime, the code uses _switch_runstart content as the switch table.

    I was expecting both run and load to be at the same address. I must be missing something. I believe this kind of wrong symbol usage is something easy to fix.  What do you think?

    That test led me to check whether the content of _switch_loadstart was already corrupted in flash when I link to flash and run from RAM (after copying everything, of course...).  After verification, it is not.  It is all clean in flash. It must be loaded properly in RAM then.  Unless there is trouble reading that specific flash cell?  Extremely unlikely...  I have moved everything into FLASHC to make sure.  Behaves the same....  The first entry of the switch table is corrupted in RAM but clean in flash.
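    The flash-versus-RAM comparison described above can also be automated. The sketch below is a host-side illustration in C, not code from the project: `first_mismatch` is a hypothetical helper that would be given the section's load image, its run copy and its size in 16-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare the load image (flash) with the run copy (RAM), word by word.
 * Returns the index of the first differing word, or `size_words` if the
 * two images are identical. Hypothetical helper, for illustration only. */
size_t first_mismatch(const uint16_t *load_img, const uint16_t *run_img,
                      size_t size_words)
{
    assert(load_img != NULL && run_img != NULL);
    for (size_t i = 0; i < size_words; ++i) {
        if (load_img[i] != run_img[i]) {
            return i;
        }
    }
    return size_words;
}
```

    Run at the top of main(), a check like this would return 0 for the symptom reported here (first entry corrupted in RAM, clean in flash).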

    Help will be greatly appreciated.

  • I can't go any further by myself.  I need some ideas.

    As nobody is asking for my project files while I offer them, I understand that nobody is working on this at all.

    We can't let the device out of here to our customer until that problem is better understood or solved.

    This is a real show stopper.

    Thank you for your understanding.

  • Hello???

    I'd be very grateful if someone could help me out of this nightmare.

    Thanks

    Frederic

  • Hi Fred,

    I am back from vacation and saw your post just now. Let's try a new start.

    In your first message you mentioned that adding NOP instructions sometimes helps to overcome the issue.  This rings a bell. Adding a NOP extends the .text section (whose end is marked by etext) by 1 word. All following sections that are linked after it into the same physical memory are then also shifted by one word. Now my guess: all pointer accesses must be aligned to an EVEN address. This is valid for all 32-bit accesses (pointers, MOVL, DMAC etc.) of a C2000 core. So by chance you forced the .switch section to such an even address. Could this make sense?  Did you use an "align(2)" command for the .switch section? What do your linker command files look like?
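    The map snippets from the first post can be checked against this idea with a little arithmetic. The sketch below is a host-side illustration in C that assumes .switch is placed at the next even address after etext. That 2-word alignment is an assumption, not something the map files confirm, and `align_up` and `gap_words` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Round a word address up to the next multiple of `align`
 * (`align` must be a power of two). */
uint32_t align_up(uint32_t addr, uint32_t align)
{
    assert(align != 0u && (align & (align - 1u)) == 0u);
    return (addr + align - 1u) & ~(align - 1u);
}

/* Number of unused words between the end of .text (etext) and a
 * section placed at the next `align`-aligned address. */
uint32_t gap_words(uint32_t etext, uint32_t align)
{
    return align_up(etext, align) - etext;
}
```

    With the posted addresses, `gap_words(0xC250, 2)` and `gap_words(0xC252, 2)` are both 0 (the two crashing builds, where .switch starts exactly at etext), while `gap_words(0xC253, 2)` is 1 (the working build, where .switch starts at 0xC254). Note that in all three builds the table itself sits on an even address; what changes is whether a spare word separates it from the end of .text.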

    To help you further I would need some more insight into your project files. In that case, please give me your email address so that we can discuss your project outside of E2E.

  • Hi Fred and Frank.

    It would be good if you could post the discussion you had about this issue.

    We are experiencing the same issue in our F28335 project, though we couldn't debug it to the level Fred did.

    At times, adding a piece of code makes the unit fail to come up, and we had assumed that stack corruption was happening.

    After adding NOPs in the code as Fred did, we were able to get our unit up and running.

    Hence your debug analysis will help us to fix the issue permanently. 

    Your support is greatly appreciated.

    Thanks

    Mohan

  • Hi Mohan,

    Frank suggested there is a coding error in the TI file "DSP28xxx_SectionCopy_nonBIOS.asm" we are using.  Here is what he wrote about it:

    Also, your file DSP28xxx_CodeStartBranch.asm is different from the ones in ControlSuite, because it calls the section copy function before it enters _c_int00. I suspect a mistake in DSP28xxx_SectionCopy_nonBIOS.asm, which could be related to your initial issue. The core copy loops are each executed by 2 instructions:

    RPT AL                          ; Copy Section From Load Address to
    ||  PWRITE  *XAR7, *XAR6++      ; Run Address

    Register AL always holds the size of the section, for example _text_size. However, the RPT instruction expects the number of repetitions minus 1 to be loaded in AL!!! So my interpretation is that the PWRITE instruction is executed one time too often!

    This would explain the overwriting of 1 entry.
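    To make the off-by-one concrete, here is a small host-side model in C. It is only a sketch: `rpt_copy` imitates the repeat semantics described above (the repeated instruction runs AL + 1 times), and the buffers, the values and the helper `word_after_copy` are hypothetical stand-ins for a 3-word section followed directly by another section's first word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of "RPT AL || PWRITE": the repeated copy runs al + 1 times. */
void rpt_copy(uint16_t *dst, const uint16_t *src, uint16_t al)
{
    assert(dst != NULL && src != NULL);
    for (uint32_t i = 0; i <= al; ++i) {
        dst[i] = src[i];
    }
}

/* Copy a 3-word section and return the RAM word that follows it.
 * subtract_one == 0: AL = size      (as in the file discussed here)
 * subtract_one != 0: AL = size - 1  (what RPT expects)            */
uint16_t word_after_copy(int subtract_one)
{
    /* 3 section words in flash, followed by an unrelated word. */
    const uint16_t flash[4] = { 0x1111u, 0x2222u, 0x3333u, 0xFFFFu };
    /* RAM: room for the section, then the neighbour's first word. */
    uint16_t ram[4] = { 0u, 0u, 0u, 0xC0DEu };
    const uint16_t section_size = 3u;

    rpt_copy(ram, flash,
             subtract_one ? (uint16_t)(section_size - 1u) : section_size);
    return ram[3];
}
```

    In this model, loading AL with the raw size overwrites the neighbouring word (`word_after_copy(0)` returns 0xFFFF), while loading size minus 1 leaves it intact (`word_after_copy(1)` returns 0xC0DE). This matches the symptom that only the first entry of the section following the copied one is damaged.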


    I have had no time to investigate that since.  I have a workaround until I can spend some more time on it.

    The workaround is to relocate the .switch section elsewhere in RAM.  You can do that in the .cmd file.

    Maybe something else gets corrupted but it is not the switch table anymore and the code can run for more than a few microseconds....

    I will repost here when I have a conclusion.  Feel free to do so also.

    Good luck!

  • Thank you Fred for the inputs.

    I will be sure to post an update if I find anything more on this issue.

  • Hi Mohan,

    As mentioned by Fred, I found a coding error in the file DSP28xxx_SectionCopy_nonBIOS.asm, which is no longer part of the official Texas Instruments support files, nor part of ControlSuite. Fred used this file, and its copy loops are 1 word too long, so a copied section can overlap the next one by 1 word.

    If you do not use this file, your problem must be somewhere else.

     

  • Hi,

    After a certain amount of debugging, we came to the conclusion that the issue is with the boot.asm file.

    What we observed is that the relocation of this file in memory is what causes the corruption and, in our case, the hang.

    So we partitioned the flash and gave boot.asm its own separate section of memory, so that changes in other files no longer move or realign the boot.asm contents in flash.

    This helped us resolve the issue.

    We did not dig any further into the boot.asm file, as we thought that would be a huge effort for us.