RM57L843: How to get debug information? CCS & RM57 & LwIP

Part Number: RM57L843
Other Parts Discussed in Thread: HALCOGEN

Hi there, 

I am using CCS 6.1.1 and the HDK RM57. I am developing an application that reads sensor data over serial and sends it over TCP. For the second step I am using the LwIP library.

The application works, but after a random period of time it "crashes". This only happens when sending data over TCP (if I send debug data over another serial port and TCP is not sending, the program works without crashing; at least that appears to be the case, since it hasn't crashed in this setup).

I am trying to figure out why but I don't get any message from CCS after the crash and the program just "hangs" at file "HL_sys_intvecs.asm" at either one of these lines:

prefetchEntry
        b   prefetchEntry
dataEntry
        b   dataEntry

I am also trying to solve the issue with the LwIP mailing list, but I need to send them some debug info which I don't know how to get from CCS. I don't know how to reproduce the error, so I can't place a breakpoint somewhere and start stepping through the code.

If it is an illegal memory access or something similar, isn't CCS supposed to be able to tell me that? There is no change at all in the CCS Debug GUI, no error message from CCS and no trace. Nothing.

Maybe you have an idea as to which registers I could check? Maybe I need to modify the debugger settings?

Any help would be appreciated.

Regards,

Julio

  • The PC hanging at prefetchEntry or dataEntry means there was a memory violation. At prefetchEntry, it was caused by an instruction fetch; at dataEntry, it was caused by a data access. Check the CP15 registers for more information. They are explained in the ARM Cortex R5 TRM section 4.3.20. You can see the contents of these registers in the CCS register tab.

  • Hi Bob,

    Thanks for the help. I really appreciate it.

    After almost 2 hours waiting for the program to crash, the problem appears to be an invalid data fetch. CP15_DATA_FAULT_STATUS = 0x000008000, which means that a Write (SW = 1) caused the abort with a "background" source (status = 0x00). What does background mean in this context?

    From the CP15_DATA_FAULT_ADDRESS (0x08080000) I know that the program tried to write outside the RAM (from HL_sys_link.cmd: RAM (RW) : origin=0x08001500 length=0x0007EB00 -> 0x08001500 + 0x0007EB00 = 0x08080000).

    How do I find the statement (line and/or file) trying to write to that address?

    Regards,
    Julio
  • Hi Bob,

    I was able to include some debug code from the LwIP in a file that I think has to do with the crash. When the crash happened, it stopped at the prefetchEntry line mentioned before but the CP15_INSTRUCTION_FAULT_STATUS was 0x00000000. However, the CP15_INSTRUCTION_FAULT_ADDRESS was 0x3F606890. Does that make sense to you? Would that be a memory violation while trying to read (RW = 0) an instruction from that address with a "background" status (status 0x00)?

    If this makes sense, how can I proceed with the debugging (finding that instruction)?

    Thanks in advance.

    Regards,
    Julio
  • Since your instruction fault status is 0, I don't think there is any significance to the instruction fault address. The data fault address of 0x08080000 makes sense. This is the first location beyond the top of RAM. The background source simply means that none of the other MPU regions apply to this memory location, so the background MPU setting (0), applies. Typically the background setting is no read or write.
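
    The status decoding discussed above can be sketched in a few lines of C. This is only an illustration: the bit positions (status in bits 3:0 plus bit 10, RW in bit 11) are an assumption based on the usual ARMv7-R fault status layout and should be checked against the TRM, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: status = {bit 10, bits 3:0}, RW = bit 11 (1 = write). */
typedef struct {
    uint8_t status;   /* 5-bit fault status; 0x00 is read here as "background" */
    bool    is_write; /* true when the aborted access was a write */
} dfsr_fields_t;

dfsr_fields_t decode_dfsr(uint32_t dfsr)
{
    dfsr_fields_t f;
    uint32_t low  = dfsr & 0xFu;          /* status bits 3:0 */
    uint32_t high = (dfsr >> 10) & 0x1u;  /* status bit 4 lives in bit 10 */

    f.status   = (uint8_t)((high << 4) | low);
    f.is_write = ((dfsr >> 11) & 0x1u) != 0u;
    return f;
}

/* Example: a value with only bit 11 set decodes as a write, status 0x00. */
void example_decode(void)
{
    dfsr_fields_t f = decode_dfsr(0x00000800u);
    assert(f.is_write);
    assert(f.status == 0x00u);
}
```

    Under these assumptions, a value with only bit 11 set corresponds to the "write, background" case described in this thread.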

    Here are two approaches to determining why the code tried to write past the top of RAM. The first is just a code analysis. Are there any RAM initialization or check routines that might index past the top of RAM? What is the top of RAM used for, stack or heap? (If using HALCoGen created linker files, the stack is at the lower address of RAM and grows downward. A stack overflow would create the same fault, but at address 0x07FFFFFC.) Look at the .map file generated by the linker to see what was placed at the top of the memory.
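
    As a rough sketch of this address reasoning, the helper below classifies a fault address against the STACKS and RAM regions. The origins and lengths are the HL_sys_link.cmd values posted in this thread; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Region bounds as posted from HL_sys_link.cmd in this thread. */
#define STACKS_ORIGIN 0x08000000u
#define STACKS_LENGTH 0x00001500u
#define RAM_ORIGIN    0x08001500u
#define RAM_LENGTH    0x0007EB00u

/* Hypothetical classification, for illustration only. */
typedef enum {
    FAULT_BELOW_STACKS, /* below the first STACKS address */
    FAULT_IN_STACKS,
    FAULT_IN_RAM,
    FAULT_ABOVE_RAM     /* at or beyond the first address after RAM */
} fault_region_t;

fault_region_t classify_fault_address(uint32_t addr)
{
    const uint32_t stacks_end = STACKS_ORIGIN + STACKS_LENGTH;
    const uint32_t ram_end    = RAM_ORIGIN + RAM_LENGTH; /* 0x08080000 */

    if (addr < STACKS_ORIGIN) { return FAULT_BELOW_STACKS; }
    if (addr < stacks_end)    { return FAULT_IN_STACKS; }
    if (addr < ram_end)       { return FAULT_IN_RAM; }
    return FAULT_ABOVE_RAM;
}

/* Example: the two fault addresses mentioned in this thread. */
void example_classify(void)
{
    assert(classify_fault_address(0x08080000u) == FAULT_ABOVE_RAM);
    assert(classify_fault_address(0x07FFFFFCu) == FAULT_BELOW_STACKS);
}
```

    The reported data fault address lands in FAULT_ABOVE_RAM, while a stack overflow at 0x07FFFFFC would land in FAULT_BELOW_STACKS.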

    The second approach is to look at the address in the USER LR (link register). The CPU will be in ABORT mode, so the current LR is of no use, but R14_USER contains the address of an instruction several instructions beyond the one that caused the illegal memory write. (Assuming the CPU was in USER or SYSTEM mode at the time.) Look at the disassembly and work back through the code to find a STR or STM that could cause the write. You may have to back into a subroutine. If you are lucky, you may even see the 0x08080000 address still in the index register.
  • Hi there Bob,

    Thanks again for your help.

    Approach 1

    /*----------------------------------------------------------------------------*/
    /* Memory Map                                                                 */
    
    MEMORY
    {
    /* USER CODE BEGIN (2) */
    /* USER CODE END */
        VECTORS (X)  : origin=0x00000000 length=0x00000020
        FLASH0  (RX) : origin=0x00000020 length=0x001FFFE0
        FLASH1  (RX) : origin=0x00200000 length=0x00200000
        STACKS  (RW) : origin=0x08000000 length=0x00001500
        RAM     (RW) : origin=0x08001500 length=0x0007EB00
    
    /* USER CODE BEGIN (3) */
    /* USER CODE END */
    }

    As you say (using HALCoGen), the stack is at the lower address of RAM, which would then mean that I have a heap overflow error?

    I am not using any malloc, alloc or similar function to allocate heap memory in my code. I am just using some pointers that point to arrays defined in the (global) stack. For example, as double buffers:

    typedef struct Buffer
    {
        uint8 data[SIZE];
    } Buffer; /* typedef name and semicolon added so the declarations below compile */
    
    static Buffer buffer_container_[2];
    static Buffer* current_buffer_;
    static Buffer* next_buffer_;
    
    void some_init_function()
    {
        current_buffer_ = &buffer_container_[0];
        next_buffer_ = &buffer_container_[1];
    }

    However, the LwIP library needs to dynamically allocate memory with its own malloc function. I would think that they take care of freeing memory correctly. But maybe there is a bug.

    You asked me to look at the .map file to see what was placed at the top of the memory. I found the file but I don't understand what is placed where. Where should I look for that?

     

    Approach 2

    The R14_USER contains the address of an instruction several instructions beyond? Wouldn't it be better if it was an instruction before the error? As I understand the disassembly does not show the instructions in the order of execution. It just shows the assembly code of each c file put all together in one file. Right? I find it hard to know what came before the instruction at R14_USER because of asynchronous calls from the tcp server and receive serial interface. Or am I not seeing it right?

    You also mentioned that I could see the faulty address (0x08080000) in the index register. Where do I find the index register? I could not find it.

    Does the fact that it takes a really long time for the error to occur tell you anything? I can only think of a dynamically allocated block of very few bytes that is never freed and eventually causes the heap overflow. Could there be other reasons?

    Thanks again in advance for your help.

    Julio

  • It might be an overflow in heap size. Let me walk you through an example of a .map file and then you can check to see if you can increase your heap size and if that helps to solve the problem.

    Toward the top of the map file is a summary of how much of each memory type is used:

    MEMORY CONFIGURATION
    
             name            origin    length      used     unused   attr    fill
    ----------------------  --------  ---------  --------  --------  ----  --------
      VECTORS               00000000   00000020  00000020  00000000     X
      FLASH0                00000020   001fffe0  0005a228  001a5db8  R  X
      FLASH1                00200000   00200000  00000000  00200000  R  X
      STACKS                08000000   00001500  00000000  00001500  RW  
      RAM                   08001500   0007eb00  0001326a  0006b896  RW  
    

    In my case I have 0x6b896 bytes of RAM unused. If I search the map file for the word "heap" I find it here:

    .bss       0    08001500    000130c6     UNINITIALIZED
                      08001500    00007814     lwiplib.obj (.bss:ram_heap)
                      08008d14    00007594     lwiplib.obj (.bss:memp_memory)
                      080102a8    00003b24     HL_emac.obj (.bss:pbuf_array)
                      08013dcc    00000400     httpd.obj (.bss:httpd_req_buf)
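
    The listing above can be checked mechanically to see what sits directly after each object. The sketch below copies the four .bss entries from this example map into a table and looks up the entry that starts where another one ends; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const char *name;
    uint32_t    origin;
    uint32_t    length;
} map_entry_t;

/* Values copied from the example .map listing above. */
static const map_entry_t bss_entries[] = {
    { "ram_heap",      0x08001500u, 0x00007814u },
    { "memp_memory",   0x08008d14u, 0x00007594u },
    { "pbuf_array",    0x080102a8u, 0x00003b24u },
    { "httpd_req_buf", 0x08013dccu, 0x00000400u },
};
#define BSS_ENTRY_COUNT (sizeof bss_entries / sizeof bss_entries[0])

/* Return the entry that begins exactly where entry `index` ends, or NULL. */
const map_entry_t *entry_following(size_t index)
{
    uint32_t end;
    size_t i;

    if (index >= BSS_ENTRY_COUNT) {
        return NULL;
    }
    end = bss_entries[index].origin + bss_entries[index].length;
    for (i = 0; i < BSS_ENTRY_COUNT; i++) {
        if (bss_entries[i].origin == end) {
            return &bss_entries[i];
        }
    }
    return NULL;
}

/* Example: find what would be hit first by a write past the end of ram_heap. */
void example_neighbor(void)
{
    const map_entry_t *next = entry_following(0);
    assert(next != NULL && strcmp(next->name, "memp_memory") == 0);
}
```

    In this example map, memp_memory starts exactly where ram_heap ends, so a write past the end of the heap would land in the memory pools first.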
    

    In my case, the heap is not near the end of the RAM area. However, if the heap overflowed, it would corrupt RAM used by the memory management routines. The results are unpredictable, but could cause the code to fail to deallocate heap memory eventually resulting in a memory abort. Since I have lots of RAM available, I can easily increase the heap size to see if that has an impact. LWIP creates its own heap instead of using the one in the TI library. The size is set by line 68 of the file lwipopts.h:

    #define MEM_SIZE                        (30 * 1024) /* 30K */
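
    If the heap or a memory pool is suspected, lwIP's own statistics and pool checks may help confirm it. The fragment below is only a sketch of lwipopts.h settings. Apart from MEM_SIZE, these options are not mentioned in this thread, so check that the opt.h of your lwIP version provides them; the 50 KB value is just an example.

```c
/* lwipopts.h fragment (assumed options; verify against opt.h in your port) */
#define MEM_SIZE               (50 * 1024) /* larger lwIP heap, example value */
#define LWIP_STATS             1           /* enable lwIP statistics collection */
#define MEM_STATS              1           /* track heap usage and allocation errors */
#define MEMP_STATS             1           /* track memory pool usage */
#define MEMP_OVERFLOW_CHECK    1           /* check pool elements for overflow when freed */
```

    With statistics enabled, the heap and pool counters can be inspected in the debugger after a crash to see whether usage was approaching the configured limits.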

    Concerning the second approach, yes it would be nice if we could capture the exact address of the instruction that caused the abort, but to achieve high performance the Cortex R5 does not wait for a write to complete before it continues fetching instructions. Therefore by the time the CPU is told that the write was to an invalid location, the PC has already been incremented past that instruction. Often you can work backwards through the disassembly and discern which instruction, or at least which function caused the abort.

  • Hi,

    Thanks again. You are my only hope xD

    Yes, I forgot to tell you that I had realized I have a lot of unused RAM (0x00058b7b), and that's why I thought "it makes sense that it takes so long for the error to occur". I have MEM_SIZE set to 50k. My memory allocation view shows that 29% of RAM is being used.

    I've got new information:

    I am using the internal serial interface (the one attached to the USB Probe cable for debugging) to send debug info. I increased the amount of debug data and the error comes a lot faster (5 to 10 min). What does that mean? As a reminder: I am using sci3 to receive data from a sensor & tcp to (re)send it to a PC. And the LR_ABT register always points to a receive function from the sci3. I understand that this function does not need to be the one generating the error, but it does mean that the error comes always at the same point/time/instruction. The LR_USER also always points to the same function!

    Another thing I noticed is that the addresses held in the current and next buffer pointers (from my previous post) start out correct, but when the error occurs they are wrong (0x089B089B, 0xC308C308, 0xD608CC08). Those buffers are the receive buffers of the sci3 interface, the same interface pointed to by LR_ABT. But the CP15_DATA_FAULT_ADDRESS is always 0x08080000!!

    I have no idea what can corrupt the pointers in this way.
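
    One way to narrow down when the pointers go bad is a cheap sanity check that can be called from several places (for example the receive handler and the main loop), with a breakpoint on the failure path. The sketch below reuses the double-buffer declarations from my previous post; SIZE, the uint8 typedef and the checking functions are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIZE 64u        /* hypothetical; the real SIZE is not given in this thread */
typedef uint8_t uint8;  /* stand-in for the HALCoGen uint8 type */

typedef struct Buffer {
    uint8 data[SIZE];
} Buffer;

static Buffer  buffer_container_[2];
static Buffer *current_buffer_ = &buffer_container_[0];
static Buffer *next_buffer_    = &buffer_container_[1];

/* A pointer is valid only if it refers to one of the two static buffers. */
bool buffer_pointer_is_valid(const Buffer *p)
{
    return (p == &buffer_container_[0]) || (p == &buffer_container_[1]);
}

/* Both pointers must be valid and must not alias the same buffer. */
bool buffer_pointers_are_sane(void)
{
    return buffer_pointer_is_valid(current_buffer_) &&
           buffer_pointer_is_valid(next_buffer_) &&
           (current_buffer_ != next_buffer_);
}

/* Example: the freshly initialized pointers pass the check. */
void example_sanity(void)
{
    assert(buffer_pointers_are_sane());
}
```

    The first call site where the check fails gives an upper bound on when the corruption happened, which is more useful than the state seen after the abort.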

    Best regards.

    Julio

  • This suggests that the problem is associated with the SCI3 receive buffer. If you receive data from the sensor faster than you can send it, do the receive buffer contents just continue to grow? The extra diagnostic messages added to the CPU load, which would make the problem occur sooner. Could the receive buffer overflow and corrupt the pointers?
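
    A minimal sketch of a receive path that cannot run past its buffer is shown below, assuming a simple linear buffer. All names and the capacity are hypothetical and do not come from the code in this thread; the point is only that the index is checked before every store and overruns are counted instead of written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RX_CAPACITY 8u /* hypothetical capacity, kept small for illustration */

typedef struct {
    uint8_t  data[RX_CAPACITY];
    size_t   count;   /* number of bytes currently stored */
    uint32_t dropped; /* bytes rejected because the buffer was full */
} rx_buffer_t;

/* Store one received byte; refuse (and count) it when the buffer is full. */
bool rx_store(rx_buffer_t *rb, uint8_t byte)
{
    if (rb->count >= RX_CAPACITY) {
        rb->dropped++;
        return false;
    }
    rb->data[rb->count++] = byte;
    return true;
}

/* Example: the byte after the last free slot is rejected, not written. */
void example_rx(void)
{
    rx_buffer_t rb = { {0}, 0, 0 };
    size_t i;

    for (i = 0; i < RX_CAPACITY; i++) {
        assert(rx_store(&rb, (uint8_t)i));
    }
    assert(!rx_store(&rb, 0xFFu));
    assert(rb.dropped == 1u);
}
```

    A non-zero dropped counter after a run would indicate that the sender outpaced the consumer, without any neighbouring variables being overwritten.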
  • Hi there Bob,

    I did 2 tests to rule one interface out.

    1. TCP OFF, receiving on SCI 3 (sensor), and sending on SCI 1 (debug). I increased the amount of debug data and the sending frequency over SCI 1. The program did not crash.
    2. TCP ON, receiving on SCI 3 (sensor), and sending on SCI 1 OFF. The program crashed.

    I also checked the times: I receive sensor data every 67 ms (which is how it is supposed to be) and it takes the TCP interface about 2 ms to send it.

    Something happens, and as a consequence the buffers of SCI 3 get corrupted, but I think that the trigger of the problem doesn't come from SCI 3 (other things could also be corrupted, but I haven't noticed). Maybe "that something" does have to do with LwIP. Maybe I am not configuring it correctly or I have an error in my code. But I still haven't seen it.

    Could I send you a simplified version of my code?

    Best regards,

    Julio

  • Can the simplified code run without your hardware?
  • Hi Bob,

    It's hard to make a simplified version of this because it's hardware dependent. 

    I've made some more tests:

    • I left the program running with the TCP server for a whole day with a single static buffer which was initialized once and never changed. The serial interface to the sensor was not active. The debug serial interface was active but only sending data once in a while. The program did NOT crash.
    • I left the program running with the TCP server for 4 hours with a double static buffer which was being updated every 60 ms with dummy values. The serial interface to the sensor was not active. The debug serial interface was active but only sending data once in a while. The program did NOT crash. (I did this to test whether my copy function was somehow faulty.)
    • I tried running the TCP server and the sensor serial interface at the same time but without copying the serial buffer to the TCP buffer. The TCP server was sending, in one test, the single static buffer that is never changed, and in the other test, the double static buffer (being updated every 60 ms with dummy values). In both tests, the program crashed.

    Additionally, I noticed the following. The TCP client on the PC is a GUI that displays my sensor data. When connected, the "image" of the sensor data "jumps" once every ~2 seconds. The data is wrong. I thought this could be a copy error from the serial buffer to the TCP buffer. But before the board sends the data over TCP, it processes it and checks whether the data is wrong or corrupted. If it is, it sends several error signals (LEDs and serial debug data). When the image in the GUI jumps, I also get the error signals from the board (which means the data is actually wrong). When the server is not connected to the GUI, I do NOT get those error signals from the board.

    This looks as if the TCP server somehow affects the interrupt routine of the serial interface (or the same memory area?) and somehow corrupts its buffers. I saw in my port (which I got from a Texas Instruments tutorial on LwIP and the board I am using) that the functions below are called all the time because of SYS_LIGHTWEIGHT_PROT = 1. I thought that maybe the serial interface doesn't like it when its interrupt is being enabled and disabled that fast.

    sys_prot_t
    sys_arch_protect(void)
    {
      sys_prot_t status;
      status = (IntMasterStatusGet() & 0xFF);
    
      IntMasterIRQDisable();
      return status;
    }
    void
    sys_arch_unprotect(sys_prot_t lev)
    {
      /* Only turn interrupts back on if they were originally on when the matching
         sys_arch_protect() call was made. */
      if((lev & 0x80) == 0) {
        IntMasterIRQEnable();
      }
    }
    
    void IntMasterIRQEnable(void)
    {
        _enable_IRQ();
        return;
    }
    
    void IntMasterIRQDisable(void)
    {
        _disable_IRQ();
        return;
    }
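
    To see what the saved status is for, the sketch below simulates the same protect/unprotect logic on a host with a plain variable instead of the real status register. It assumes that IntMasterStatusGet() returns a value whose bit 0x80 is the IRQ disable bit, as the (lev & 0x80) test suggests; all sim_ names and the sys_prot_t typedef are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t sys_prot_t; /* stand-in; the real type comes from the port */

#define IRQ_DISABLE_BIT 0x80u /* bit tested by (lev & 0x80) in the port code */

static uint32_t sim_status = 0u; /* simulated status register: IRQs enabled */

static void sim_irq_disable(void) { sim_status |= IRQ_DISABLE_BIT; }
static void sim_irq_enable(void)  { sim_status &= ~IRQ_DISABLE_BIT; }

bool sim_irqs_enabled(void)
{
    return (sim_status & IRQ_DISABLE_BIT) == 0u;
}

/* Same logic as sys_arch_protect(): save the status, then disable IRQs. */
sys_prot_t sim_protect(void)
{
    sys_prot_t status = sim_status & 0xFFu;
    sim_irq_disable();
    return status;
}

/* Same logic as sys_arch_unprotect(): re-enable only if IRQs were on before. */
void sim_unprotect(sys_prot_t lev)
{
    if ((lev & IRQ_DISABLE_BIT) == 0u) {
        sim_irq_enable();
    }
}

/* Example: a nested pair must not re-enable IRQs inside the outer section. */
void example_nesting(void)
{
    sys_prot_t outer = sim_protect();
    sys_prot_t inner = sim_protect();

    sim_unprotect(inner);
    assert(!sim_irqs_enabled()); /* still inside the outer critical section */
    sim_unprotect(outer);
    assert(sim_irqs_enabled());  /* restored by the outermost unprotect */
}
```

    In this simulation, interrupts only come back on at the outermost unprotect, which is the behaviour the comment in sys_arch_unprotect() describes.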

    I then changed that define to SYS_LIGHTWEIGHT_PROT = 0. These functions were no longer called, but the program still crashes.

    I don't know how to investigate this any deeper. What else can I check? Maybe my port is not right? Does anyone have a port to the HDK RM57 that I can compare against?

    Best regards,

    Julio