This thread has been locked.

Exosite LWIP TCP stack causing MPU reset while EMAC0 transmits buffer contents to host.

Other Parts Discussed in Thread: EK-TM4C1294XL, LM3S8971

Module (exosite_hal_lwip.c) conditionally resets the TCP connection after first asserting a TCP disconnect.

We later found memory leaks in LWIP where additional named aliases in (tcp_pcb) and (exosite.c) cause an MPU reset or consume all available (alloc_pool) memory.

This occurs if additional TCP/UDP ports are added and/or the TCP/UDP list (tcp_pcb) periodically rebinds in a loop after (tcp_close). Binding and rebinding additional named aliases (tcp_pcb *pcb) onto the TCP stack before calling (tcp_connect) may cause (alloc_pool) to raise the LWIP debug ASSERT error (mem_free: mem->used).

However, (mem_free: mem->used=0;) cannot keep up with the rate of memory consumption, and eventually an MPU reset occurs.

Other conditions: TM4C peripherals running background loops: PWM0 generators 0-2, ADC0 SS0 & SS1 with ANx inputs 0-6, Timers 0A & 1A, Systick.

The next post down has an Exosite server TCP connection trace showing the MPU reset point.

exoHAL_ExositeEnetEvents(): << Event DHCP (Break) >>

  • There are also several other TM4C1294 peripherals running in the background which might lead to an MPU reset during a TCP connection reset event.
     

    Other peripherals:

    1. PWM0 gens 0-2 using two SW interrupts derived from gen 0&1.
    2. ADC0 channels 0-6 and a NVIC interrupt vector.
    3. Timers 0 & 1, Sub timers -1A & 1B.
    4. Systick interrupt driving status LED blink & EMAC timers.

    4-23-2015:

    Most of the random MPU resets had to do with Timer0A timing at 2x the normal period of the default Exosite LWIP Systick timer, with two TCP port clients (80 & 23) in the same EMAC0 interrupt context.

    The Exosite IOT code did not like running at the 2x higher speed that the Telnet client requires in order to produce real-time data output on the GUI widgets.

    When the Telnet client is left running for many hours, (tcp.c) intermittently tries to post (tcp_write) data into a listen PCB; LWIP debug sometimes throws an Assert error and other times the MPU resets.

    Telnet client at 2x Systick rate of Timer0A: two timers independently service the LWIP internal timers. Timer0A serves the low-speed Exosite client; Systick serves the high-speed Telnet client.

  • There seem to be issues in this part of the Exosite IOT code, just after the (return true), that are resetting the MPU. Judging by the Doxygen statements and (IntMasterDisable), a similar MPU reset condition was possibly noticed during SW development.

    Likely the Tivaware C+ embedded instructions are going wacko whenever the TCP stack is being heavily tasked. In this case UDP port 23 is also bound to the TCP stack but should be in blocking mode, closed to all UDP traffic, when the reset occurs. Ethernet (ringbuf.c) is in the process of sending the (tStat) to Exosite.

  • Hello BP101,

    When you mention MPU reset, do you mean the entire device is getting reset and the application starts all over again?

    Regards
    Amit
  • Suspect the instruction pointer is restarting at (main): all 3 status LEDs go dark and the EMAC0 link to the Cisco switch disconnects, then relinks.

    It appears the stack pointer might even be getting corrupted.
  • Hello BP101,

    Did you check the Reset Cause Register?

    Regards
    Amit
  • How would that be done when EMAC0 trace is a real time event during code execution of buffer transmit cycle and not using the ICDI?
  • Is it possible that production silicon revision 3 would not have this issue seen on the EK-TM4C1294XL revision 1 silicon? This MPU reset sure seems like some kind of errata condition. Stack top is set at 4096 bytes in the application.

  • Hello BP101,

    You have mentioned the MPU is reset. From a device perspective that would mean there is a reset source causing the device to be "physically reset", e.g. watchdog, POR, BOR, etc.
    Now if that is not what you meant, then please do explain.

    Regards
    Amit
  • Internal software MPU reset like what the software Boot Loader asserts after we load an update binary image from the TFTP server.
  • Hello BP101

    Even that is logged

    Regards
    Amit
  • First attempts to flash this 75kb program in debug would not execute code during run. Ended up blowing out boot loader flash locations somehow. Had to set the ICDI to 16 MHz or the flash write would halt less than 50% complete.

    Actually saw a few IOT trace words on the UART terminal emulator - that's encouraging. Perhaps we can get to that reset register with any luck. Only on rare occasions have we witnessed the ICDI run any large program at all with interrupts flying.

  • Hello BP101,

    At some point I lost track of the setup. Might you consider explaining what is happening on the setup, as it seems to involve the Flash boot loader?

    Regards
    Amit
  • We have noticed the client often hangs in a write request for no apparent reason. Later discovered that heap corruption may occur from a large quantity of packet drops on the PHY receiver. That behavior was witnessed while servicing LWIP timer x(10) or 16.6us, (Systick=SYSCLK/200) 1.6us 600kHz. The Heap and PBUF pool may at times run empty within seconds after an MPU reset. The Systick timer was running too slow with this divisor.

    4.26.2015: We had to separate the two client modules and speed up both clients (Systick=120 MHz/300) 2.5us 400kHz - LWIP timer (12.5us). Though much more stable, often one or the other client will halt the other or cause an MPU reset, even after assigning each client a TCP priority. There appears to be a problem with clock shifting the LWIP timers even when servicing them independently from (lwiplib.c). It seems impossible to have two independent TCP stacks without using a multitasking OS.
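
The tick figures quoted above depend on how the divisor is applied. As a rough check, assuming the divisor is used the way the TivaWare examples commonly do it, SysTickPeriodSet(SYSCLK / N), the reload value is SYSCLK/N clock cycles and the interrupt rate is N Hz. The sketch below is host-side arithmetic only and touches no hardware; the helper names are hypothetical and the 120 MHz clock is the figure quoted in the thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: 120 MHz system clock, the figure quoted in the thread. */
#define SYSCLK_HZ 120000000u

/* Reload value, in clock cycles, produced by the expression SYSCLK / divisor. */
static uint32_t systick_reload(uint32_t sysclk_hz, uint32_t divisor)
{
    return sysclk_hz / divisor;
}

/* Tick period in whole microseconds: reload cycles / cycles per microsecond. */
static uint32_t systick_period_us(uint32_t sysclk_hz, uint32_t divisor)
{
    uint32_t cycles_per_us = sysclk_hz / 1000000u;
    return systick_reload(sysclk_hz, divisor) / cycles_per_us;
}

/* Print the reload and period for the two divisors discussed in the thread. */
static void systick_print_table(void)
{
    static const uint32_t divisors[] = { 200u, 300u };
    for (size_t i = 0; i < sizeof divisors / sizeof divisors[0]; i++) {
        printf("SYSCLK/%lu: reload = %lu cycles, period = %lu us\n",
               (unsigned long)divisors[i],
               (unsigned long)systick_reload(SYSCLK_HZ, divisors[i]),
               (unsigned long)systick_period_us(SYSCLK_HZ, divisors[i]));
    }
}
```

Under that assumption SYSCLK/200 gives a 600,000-cycle reload, i.e. a 5 ms tick, and SYSCLK/300 gives a 400,000-cycle reload. Whether the 600kHz and 400kHz figures in the post refer to these reload counts cannot be confirmed from the thread.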

  • That's a good point Amit.

    Only on very rare occasions do we ever see the boot loader launch the application during an ICDI debug Run. The ICDI debug simulation is set to jump to symbol (main) after program load (start address 0x0000.4000) or reset. It is a bit quirky with TM4C, whereas Stellaris LM3S8971 JTAG simulations are far less troublesome.

    After the TM4C loads we have to click reset then run a few times before the simulator loads the AET resource (boot.asm) and jumps to symbol (main), if it does at all. Sometimes single stepping locks up if we try to manually move to a line break point location, and we discovered the TM4C differs from the LM3S8971 in that it is not necessary to move to the line prior to stepping in or over lines. Might attempt to discard the boot loader for ICDI if unable to get the APP to run, yet that alters the environment that may be leading to the undesired MPU resets.
  • Was able to get the application to run in ICDI but could not pause it (Unable to Halt Target message). The reset reason register is memory mapped and had a few bits flagged as 1 after several perceived resets had occurred.

    Being memory mapped, should it show the real-time counts, or will it only update when the ICDI is paused?
  • That ICDI application run was without the boot loader, and the perceived MPU resets still occurred.

    Even after some tweaking of the ICDI settings, the TM4C1294XL still had to be in a halted state in order to get the application to flash 100% and RUN. The ICDI has issues halting the DAP no matter the ICDI settings (reset load/connection, etc.); all settings generate << ERROR Halting the Target >> whenever the target is executing code. Same error message when trying to (Pause) the ICDI target in CCS5.4 with all updates.

    The Error Halt target message appears to come from the onboard TM4C123 ICDI not being able to control the TM4C1294 DAP whenever the MPU instruction pointer is moving. LM Flash Programmer is having similar issues but not nearly as much.

  • LWIP Debug anyone:

    Be sure to rename the (#ifdef DEBUG) in (lwipopts.h) to (#ifdef DEBUGL), then add a (#define DEBUGL) somewhere above it to enable LWIP debug printouts to the UART. That will avoid compiler conflicts with every (#ifdef DEBUG) statement in other modules. Next, bump up the memory space in (myapp_ccs.cmd) so LWIP debug will have space to load in flash.

    Be sure to remove the comment marks (//) in front of any LWIP debug points you want to include in the UART printouts. We are getting some interesting debug reports showing the Ethernet TCP in action where the MPU or EMAC0 presumably is SW resetting and starting the application once again.
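
The renaming described above might look roughly like the following (lwipopts.h) excerpt. This is a sketch, not the actual TivaWare file: DEBUGL is the poster's guard name, the macros are standard lwIP 1.4.1 options from opt.h and debug.h, and the choice of which modules to switch on is an assumption made for illustration.

```c
/* lwipopts.h excerpt (sketch). DEBUGL is the renamed guard, so the
 * application-wide DEBUG define no longer switches lwIP debugging on. */
#define DEBUGL

#ifdef DEBUGL
#define LWIP_DEBUG                              /* master switch for LWIP_DEBUGF output */
#define LWIP_DBG_MIN_LEVEL  LWIP_DBG_LEVEL_ALL  /* report every severity level */
#define LWIP_DBG_TYPES_ON   LWIP_DBG_ON         /* allow all debug message types */

/* Assumption: only the memory and TCP modules are of interest here;
 * modules left undefined default to off in opt.h. */
#define MEM_DEBUG           LWIP_DBG_ON
#define MEMP_DEBUG          LWIP_DBG_ON
#define PBUF_DEBUG          LWIP_DBG_ON
#define TCP_DEBUG           LWIP_DBG_ON
#endif /* DEBUGL */
```

Where the output ends up depends on how the port defines LWIP_PLATFORM_DIAG; the thread indicates it goes to the UART.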

    The next post's LWIP debug trace shows where resetting EMAC0 loses the connection link:

  • Hello BP101

    It will only show the bit status and not a count of resets. What needs to be done is to clear the bits at the beginning, then check them when unexpected resets occur. E.g. if POR has already occurred for the first time, then the code should halt in a while(1) when it occurs again.
    It is crucial that we isolate the cause to either a device reset or the EMAC not being able to handle the traffic (the latter is less plausible).

    Regards
    Amit
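
A minimal sketch of the check Amit describes, written so the decoding logic can be tried on a host. The bit values are an assumption taken from the TM4C1294 RESC register layout (TivaWare names them SYSCTL_CAUSE_*); the two helper functions are hypothetical, and the TivaWare calls SysCtlResetCauseGet() and SysCtlResetCauseClear() appear only in the comment because they need the target.

```c
#include <stdint.h>

/* Assumption: cause bits as laid out in the TM4C1294 RESC register; TivaWare
 * exposes the same values as SYSCTL_CAUSE_* in driverlib/sysctl.h. */
#define CAUSE_EXT   0x00000001u  /* external reset pin */
#define CAUSE_POR   0x00000002u  /* power-on reset */
#define CAUSE_BOR   0x00000004u  /* brown-out reset */
#define CAUSE_WDOG0 0x00000008u  /* watchdog timer 0 */
#define CAUSE_SW    0x00000010u  /* software reset request */
#define CAUSE_WDOG1 0x00000020u  /* watchdog timer 1 */

/* Hypothetical helper: name of the first cause bit found, checked in the
 * order watchdog, brown-out, software, external, power-on. */
static const char *reset_cause_name(uint32_t resc)
{
    if (resc & CAUSE_WDOG0) return "WDOG0";
    if (resc & CAUSE_WDOG1) return "WDOG1";
    if (resc & CAUSE_BOR)   return "BOR";
    if (resc & CAUSE_SW)    return "SW";
    if (resc & CAUSE_EXT)   return "EXT";
    if (resc & CAUSE_POR)   return "POR";
    return "none";
}

/* Hypothetical helper: nonzero when a cause outside expected_mask is set. */
static int reset_is_unexpected(uint32_t resc, uint32_t expected_mask)
{
    return (resc & ~expected_mask) != 0u;
}

/* On target this would be driven by the TivaWare calls (not compiled here):
 *   uint32_t cause = SysCtlResetCauseGet();
 *   SysCtlResetCauseClear(cause);          // start clean for the next reset
 *   if (reset_is_unexpected(cause, CAUSE_POR | CAUSE_EXT)) {
 *       UARTprintf("reset cause: %s\n", reset_cause_name(cause));
 *       while (1) { }                      // halt so the state can be inspected
 *   }
 */
```

Clearing the register at startup is what makes the next reading meaningful, since the bits only record that a cause occurred, not how many times.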
  • Looks as if LWIP debug is revealing problems in (malloc), possibly crashing EMAC0. Resets occur even after increasing (MEMP_NUM_PBUF = 32). The MPU typically resets after 118 bytes are received from Exosite.

    6.17.2015: The random MPU resets resulted from the default watchdog0 settings carried over from the merged application. That application set (WatchdogMPUResetEnable), and the watchdog was firing because it was not being periodically punched, even though the migrated Stellaris Ethernet EMAC client was being used in a similar way. Best to disable the watchdog MPU reset until the EMAC and LWIP TCP stack are fully debugged.

    *** (memp_malloc: Out o --- f memory?)

    memp_malloc: out oWelcome to the Connected LaunchPad
     
    
    exoHAL_EnetEvents(): Flag Recieved (1) Bytes RBufWrite = 118:
    
    Current MAC: 00xxxxxxxxxx00
    Exosite Obtaining IP...
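
The watchdog behaviour behind the 6.17.2015 note above can be illustrated with a small host-side model. It assumes the usual TM4C watchdog sequence: the first time-out raises an interrupt and reloads the counter, and a second time-out with that interrupt still pending resets the device if reset is enabled. All names below are hypothetical and nothing here touches the real peripheral.

```c
#include <stdint.h>

/* Host-side model only; field and function names are hypothetical. */
typedef struct {
    uint32_t load;          /* reload value in ticks */
    uint32_t count;         /* current down-counter */
    int      int_pending;   /* first time-out seen and not yet cleared */
    int      reset_enabled; /* models the reset-enable control bit */
    int      reset_fired;   /* set once the modelled reset occurs */
} wdt_model_t;

static void wdt_init(wdt_model_t *w, uint32_t load, int reset_enabled)
{
    w->load = load;
    w->count = load;
    w->int_pending = 0;
    w->reset_enabled = reset_enabled;
    w->reset_fired = 0;
}

/* Advance the model by one tick. */
static void wdt_tick(wdt_model_t *w)
{
    if (w->reset_fired) return;
    if (w->count > 1u) { w->count--; return; }
    w->count = w->load;                       /* time-out: reload the counter */
    if (w->int_pending && w->reset_enabled) {
        w->reset_fired = 1;                   /* second time-out, never punched */
    } else {
        w->int_pending = 1;                   /* first time-out: interrupt only */
    }
}

/* "Punch" the watchdog: clear the pending interrupt and reload the counter. */
static void wdt_feed(wdt_model_t *w)
{
    w->int_pending = 0;
    w->count = w->load;
}

/* Run for `ticks`, punching every `feed_every` ticks (0 = never).
 * Returns 1 if the modelled reset fired. */
static int wdt_run(uint32_t load, int reset_enabled,
                   uint32_t ticks, uint32_t feed_every)
{
    wdt_model_t w;
    wdt_init(&w, load, reset_enabled);
    for (uint32_t t = 1; t <= ticks; t++) {
        wdt_tick(&w);
        if (feed_every != 0u && (t % feed_every) == 0u) wdt_feed(&w);
    }
    return w.reset_fired;
}
```

In this model an application that never punches the watchdog is reset after two full periods, while punching at any interval shorter than two periods, or leaving reset disabled, keeps it running.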
  • Hello BP101

    And the memory leak patch has been applied?

    Regards
    Amit
  • No patch yet, and it might just help here. Seems (memp_malloc) is definitely running out of pool memory.

    *** Possibly because DNS starts to use UDP port 53 for name resolves versus IP 53?

    4/6/2015:

    With LWIP1.4.1 (lwipopts.h) set to (LWIP_DNS==1), the connection to Exosite now goes several cycles and synchronizes the (tStat), then runs out of memory & resets the MPU. Perhaps the memory pools are not configured correctly in LWIP - the settings appear out of place given the way (memp.c) seemingly handles the Alloc memory pool.

  • Hello BP101,

    Yes, that is memory patch post.

    Regards
    Amit
  • Hi Amit - We must have discovered a new bug: whenever a UDP (pcb) is configured, the (malloc) pool runs out of memory very quickly, even with the patch applied.

    We see DNS now query name resolves on UDP 53; does it default to TCP 53 without a UDP (pcb) configured?

    To acquire an IP address, DHCP is using (bootp) broadcast calls on UDP local port 68.

    Below is a (pbuf) pool LWIP debug trace; the (malloc) pool can be seen to run out of memory shortly after (pbuf_free: deallocating 2000c2d4). The reported 118 bytes grows to 145 bytes after bumping up (MEMP_TCP_PCB / MEMP_UDP_PCB) to 512, yet it still runs out of (malloc) pool in a few refresh cycles.

    exoHAL_EnetEvents(): Flag Recieved (1) Bytes RBufWrite = 145:
    
    memp_malloc: out o
  • Hello BP101

    I have assigned the thread to our debug queue, but it will take time on account of the ongoing TivaWare testing.

    Regards
    Amit
  • Hi Amit, Thanks for escalating the issue:

    Ordered a couple of new launch pads today with the TM4C1294NCPTI3 MPU. Perhaps that will help if there was anything wrong in the EMAC0 DMA transfers. Trying to get feedback on CPU utilization of IOT with PWM motor control; may opt for running parallel processors and splitting tasks.

    We could possibly pipe data variables from the PWM motor launch pad over the utility USB into another IOT launch pad for the time being.

    BTW: Upon increasing the [lwipopts.h] (PBUF_POOL_BUFSIZE) above 1024 bytes, the CCS compiler (v5.4.11) errors with not enough (.bss) space. Believe the Tiva patch file regulates the receive buffer space in pBuf_alloc. We have the (TCP/UDP_Num_Pcb) each at 16. Also, (PBUF_POOL_BUFSIZE) set above 1024 bytes gives the compile error " (TCP_WND) is smaller than  (PBUF_POOL_BUFSIZE) ", as it does if they are equal in size.
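
The (.bss) overflow is consistent with how lwIP sizes the pbuf pool: the whole pool is reserved statically, so its footprint scales with PBUF_POOL_SIZE times PBUF_POOL_BUFSIZE. A rough estimate can be sketched as below. The helper names are hypothetical, and the 4-byte alignment and 16-byte (struct pbuf) header are assumptions for a 32-bit lwIP 1.4.1 build; the thread does not state PBUF_POOL_SIZE, so the value 16 used in the note afterwards is only an example.

```c
#include <stdint.h>

#define ALIGNMENT 4u   /* assumption: MEM_ALIGNMENT is 4 on this target */
#define PBUF_HDR  16u  /* assumption: sizeof(struct pbuf) on a 32-bit build */

/* Round n up to the next multiple of ALIGNMENT. */
static uint32_t align_up(uint32_t n)
{
    return (n + ALIGNMENT - 1u) & ~(ALIGNMENT - 1u);
}

/* Approximate static footprint of the pbuf pool in bytes:
 * each pool entry holds an aligned header plus an aligned payload buffer. */
static uint32_t pbuf_pool_bytes(uint32_t pool_size, uint32_t bufsize)
{
    return pool_size * (align_up(PBUF_HDR) + align_up(bufsize));
}
```

With an example pool of 16 buffers, a 1024-byte buffer size comes to 16,640 bytes and a 1518-byte buffer size to 24,576 bytes, before counting the heap and the other memp pools. Comparing such a figure against the free SRAM reported in the linker map indicates how much headroom is left.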

  • Hi Amit,

    After applying the memory leak patch, the Generic IOT (code unmodified) hangs after a few hours of runtime. Saw the same hang twice in the spot shown.

    BTW: The message implies an (Abort) disconnect - it actually is disconnecting.

    4.23.2015: Below, Exosite hangs after a rogue (http: 409 or 400) status is perceived. Note: perceived, as it may have come from an upstream source other than the Exosite server.

     

    << ExoHAL_SocketSend(): Request Bytes Sent = >> :85
    << ExoHAL_SocketSend(): Request Data Sent = >> :usrsw1=0&usrsw2=0&jtemp=45&ontime=8&ledd1=0&ledd2=0&gamestate=0x0&emailaddr=&usage=23
    exoHAL_EnetEvents(): Flag Recieved (1) Bytes RBufWrite = 188:
    
    exoHAL_EnetEvents(): Flag Recieved (1) Bytes RBufWrite = 0:
    
    ASSERT FAIL at line 625 of C:/Software/Tivaware/TivaWare_C_Series-2.1.0.12573/third_party/lwip-1.4.1/src/core/pbuf.c: p != NULL
    exoHAL_SocketClose(A): << Abort TCP Disconnect >> 
    
    exoHAL_SocketClose(R): << Finished Reset Connection >> 
    
    << requests.c; SyncWithExosite(): No data -- (Exosite_Write) Exosite Didn't Respond >>
    Initial sync failed. CIK may be invalid.
    Connecting to Exosite to provision a new CIK... 
    

     

  • Possibly (mem.c) requires attention: (Pbuf) memory recovery code changes in and around the (lwipopts.h) settings. We now have a 32kb Alloc memory pool and even that eventually runs out of free buffer space.

    ASSERT FAIL at line 339 of C:/Software/Tivaware/TivaWare_C_Series-2.1.0.12573/third_party/lwip-1.4.1/src/core/mem.c: mem_free: mem->used
    

     

  • The memory leak offenders were twofold:

     

  • The LWIP .bss memory pool HEAP corrupts under (pbuf) load. In the LWIP debug STATS, with either a 32k or 64k Heap, (Maximum =) jumps suddenly to over 4 million bytes, and similarly for the Heap pool size. Then we have crash-and-burn MPU resets.

    Mem_Free works far better and lasts longer before the MPU resets than the LWIP allocator memory free.

  • Hello BP101,

    We did not spend enough time on the issue last week, but as you have shown on multiple occasions, forum is critical for us to get the feedback on the soft aspect of the application. I will forward the thread to the team.

    Regards
    Amit
  • Hi Amit, Agreed, and figured some TI experts on LWIP might chime in. Some of this issue of the (Pbuf_Pool_Bufsize) growing beyond the defined HEAP boundary was partly my fault, but not all. Things are now much quieter and both UDP/TCP clients work concurrently after adding some constraints around the UDP client receive function.

     

  • Good place to start an investigation:

    e2e.ti.com/.../1469839
  • Hi Amit,
    Perhaps having the (HOST_TMR_INTERVAL) at 200ms may be far too slow for the LWIP memory allocator, or even the faster (mem_alloc), to keep the heap from going into overflow. After replanting the LWIP debug stats options in (lwipopts.h), debug started reporting memory PCB as high as 65k. After setting the timer interval to 5ms, LWIP memory management appears to keep up with removing orphaned PCBs from the Heap, but the Exosite client may lock up and stop while the Telnet client still runs.

    Have been closely watching the debug reports, but never did they report PCBs being that high with the debug stats (#ifndef) check in (opt.h). Not sure how we ended up at a 200ms timer. Seem to recall the Telnet client GUI widget scopes were moving extremely slowly with SYSTICK/100, and after changing to SYSTICK/200 the widget scopes were moving way too fast.

    e2e.ti.com/.../412465

    The Telnet client is running near real time TCP updates on port 23 into the GUI widgets.