DP83816 Ethernet - Transmit Corrupts Memory

Other Parts Discussed in Thread: DP83816

I have been investigating a problem in our embedded system product that results in random memory locations being changed.  The problem happens after many hours or days of operation and depends on the amount of communication.  After a series of tests to narrow down the problem, we found that it is related to the transmit DMA on the DP83816 chip.

Here are the things we did to narrow down the problem.

  1. Eliminated interrupts and polled the chip for service.
  2. Used CPU debug registers to trap on CPU writes to the altered memory location.
    We found that the CPU was not writing to the memory location (that implies DMA changed it).
  3. Verified contents of transmit and receive DMA descriptors before and after DMA transfers.
    The DMA descriptors are correct including the "link" pointers.
  4. Verified the altered memory location before and after performing the transmit operation.
    We found that the memory is always changed during the transmit DMA.
  5. Blocked interrupts during the entire transmit DMA.
    We had the same results and memory was still altered by doing transmit DMA.
  6. Added about 100 microseconds of delay between each transmit frame.
    We found that this avoided the problem and memory was not altered.
  7. We reviewed the transmit software for the chip carefully and could find no problems.
    The software performs the same steps whether or not the 100 microsecond delay is there, yet that seems to avoid the problem.
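
The descriptor check in step 3 can be sketched roughly as follows. This is a minimal illustration, not our production code: the type name `dp_desc_t`, the function `ring_first_bad_link`, and the `base_phys` parameter are hypothetical, and the sketch assumes a contiguous ring in which each "link" field holds the physical address of the next descriptor and the last one wraps to the first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Descriptor as three 32-bit words: link, cmdsts, buffer pointer. */
typedef struct {
    uint32_t link;    /* physical address of the next descriptor */
    uint32_t cmdsts;  /* command/status word */
    uint32_t bufptr;  /* physical address of the data buffer */
} dp_desc_t;

/* Walk the ring and return the index of the first descriptor whose
 * link does not point at its successor, or -1 if every link is intact.
 * base_phys is the (assumed) physical address of ring[0]. */
int ring_first_bad_link(const dp_desc_t *ring, size_t count, uint32_t base_phys)
{
    assert(ring != NULL && count > 0);
    for (size_t i = 0; i < count; i++) {
        size_t next = (i + 1) % count;               /* last entry wraps */
        uint32_t expected = base_phys + (uint32_t)(next * sizeof(dp_desc_t));
        if (ring[i].link != expected)
            return (int)i;
    }
    return -1;
}
```

Running such a check before and after each transmit is one way to confirm that the ring itself has not been disturbed, independent of whatever else in memory changed.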

The problem occurs differently on different boards (same design) and the altered memory address and data are not the same on each board.  For a given board, the data and altered memory locations are somewhat consistent.  Typically we see one bit set in a location that should contain zeros, e.g. 0x0100 or 0x0800.  The exact data pattern seems consistent for any particular board.  None of the data values seem to match expected "cmdsts" values.  The only memory write done during Transmit DMA is the two "cmdsts" bytes in the descriptor.

If we avoid doing transmit with the DP83816 or add the 100 microsecond delay between transmit frames we do not see the problem.  There are four DP83816 chips in our system.  When the problem occurs, one is sending about 8 frames and the others are mostly receiving.  We found that the amount of receive activity has an effect on the problem.  With less receive activity we don't see the problem (or not as often).

Each transmitted frame uses two descriptors.  The first descriptor points to a buffer containing the destination and source MAC address (12 bytes).  The second descriptor contains the Ethertype code and remainder of the frame.  Frames vary from 32 bytes to 1480 bytes in length.  The software typically transmits 33 frames but that varies depending on the system configuration.  The maximum configuration doesn't use more than 64 frames.

Here is a more detailed description of how the software transmits the frames.

  1. Disable interrupts
  2. Check to make sure that there are enough unused descriptors for the packet.
    This is done by counting the number of buffers (always two) and checking the transmit write index versus read index.
    If the write index equals the read index the software checks the descriptor buffer pointer to determine if the ring is empty or full.
  3. If there are not enough available descriptors, return "ring full" status
  4. Set buffer pointer and "cmdsts" (OWN bit cleared, MORE set) for first descriptor
  5. Set buffer pointer and "cmdsts" (OWN bit set, MORE cleared) for second descriptor
  6. Set "cmdsts" (OWN bit set, MORE bit set) for first descriptor
  7. Update transmit "write" index to indicate next available descriptor
  8. *Start transmitter (see below)
  9. Enable interrupts
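
The queueing steps above (2 through 7) can be sketched as below. Treat this as an illustration under stated assumptions rather than the actual driver: the names `tx_ring_t`, `tx_free_slots` and `tx_queue_frame` are hypothetical, the ring length of 8 is arbitrary, a zero buffer pointer is assumed to mark an unused descriptor, and the C11 fence is only a portable stand-in for the store fence (SFENCE) a real x86 driver would use. Interrupt masking and the transmitter start are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define CMDSTS_OWN   0x80000000u  /* descriptor is owned by the chip */
#define CMDSTS_MORE  0x40000000u  /* frame continues in the next descriptor */
#define CMDSTS_SIZE  0x00000FFFu  /* buffer size field */
#define RING_SIZE    8u           /* hypothetical ring length */

typedef struct { uint32_t link, cmdsts, bufptr; } dp_desc_t;

typedef struct {
    dp_desc_t desc[RING_SIZE];
    unsigned  wr;  /* next descriptor the driver will fill */
    unsigned  rd;  /* oldest descriptor not yet reclaimed */
} tx_ring_t;

/* When wr == rd the ring is either empty or full; a zero buffer
 * pointer is assumed here to mean "unused", i.e. empty. */
static unsigned tx_free_slots(const tx_ring_t *r)
{
    if (r->wr == r->rd)
        return r->desc[r->rd].bufptr == 0 ? RING_SIZE : 0;
    return (r->rd + RING_SIZE - r->wr) % RING_SIZE;
}

/* Queue one frame as two descriptors. Returns 0, or -1 for "ring full". */
int tx_queue_frame(tx_ring_t *r, uint32_t hdr_phys, uint32_t hdr_len,
                   uint32_t body_phys, uint32_t body_len)
{
    assert(hdr_phys != 0 && body_phys != 0);
    assert(hdr_len <= CMDSTS_SIZE && body_len <= CMDSTS_SIZE);
    if (tx_free_slots(r) < 2)
        return -1;

    dp_desc_t *first  = &r->desc[r->wr];
    dp_desc_t *second = &r->desc[(r->wr + 1) % RING_SIZE];

    first->bufptr  = hdr_phys;
    first->cmdsts  = CMDSTS_MORE | hdr_len;        /* OWN still clear */
    second->bufptr = body_phys;
    second->cmdsts = CMDSTS_OWN | body_len;        /* last fragment */
    atomic_thread_fence(memory_order_release);     /* stand-in for SFENCE */
    first->cmdsts  = CMDSTS_OWN | CMDSTS_MORE | hdr_len;  /* hand over last */

    r->wr = (r->wr + 2) % RING_SIZE;
    return 0;
}
```

Setting OWN on the first descriptor last means the chip cannot start on a frame whose second descriptor is still being filled in.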

*The software starts the transmitter as follows.

  1. Disable interrupts
  2. PCI I/O Input from Command Register (offset 0)
  3. If "TXE" bit (0x00000001) set, go to step 12
  4. PCI I/O Input from Transmit Descriptor Pointer register (0020h)
  5. Mask unused bits of address (1 and 0)
  6. Range check address versus descriptor ring
  7. If address out of range, go to step 12
  8. Calculate ring index from descriptor address
  9. If descriptor at address has "OWN" cleared, go to step 12
  10. PCI I/O output to Transmit Descriptor Pointer register (0020h)
  11. PCI I/O Output to Command Register (offset 0) with TXE bit set (0x00000001)
  12. Enable interrupts
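
As a rough sketch of the start sequence above (steps 2 through 11), assuming hypothetical accessors `io_in32` and `io_out32` that here read and write a simulated register array instead of real PCI I/O ports. The register offsets and the TXE bit value are the ones given in the steps; everything else (names, ring length) is made up for illustration, and interrupt masking is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REG_CR      0x00u        /* Command Register */
#define REG_TXDP    0x20u        /* Transmit Descriptor Pointer */
#define CR_TXE      0x00000001u  /* transmit enable */
#define CMDSTS_OWN  0x80000000u
#define RING_SIZE   8u           /* hypothetical ring length */

typedef struct { uint32_t link, cmdsts, bufptr; } dp_desc_t;

/* Simulated register file standing in for PCI I/O space. */
static uint32_t sim_regs[0x80 / 4];
static uint32_t io_in32(uint32_t off)              { return sim_regs[off / 4]; }
static void     io_out32(uint32_t off, uint32_t v) { sim_regs[off / 4] = v; }

/* Returns 1 if the transmitter was (re)started, 0 if nothing was done.
 * ring_phys is the assumed physical address of ring[0]. */
int tx_start(const dp_desc_t *ring, uint32_t ring_phys)
{
    assert(ring != NULL);
    if (io_in32(REG_CR) & CR_TXE)
        return 0;                                   /* already running */

    uint32_t txdp = io_in32(REG_TXDP) & ~0x3u;      /* mask bits 1 and 0 */
    uint32_t ring_bytes = (uint32_t)(RING_SIZE * sizeof(dp_desc_t));
    if (txdp < ring_phys || txdp >= ring_phys + ring_bytes)
        return 0;                                   /* outside the ring */

    size_t index = (txdp - ring_phys) / sizeof(dp_desc_t);
    if (!(ring[index].cmdsts & CMDSTS_OWN))
        return 0;                                   /* nothing queued there */

    io_out32(REG_TXDP, txdp);
    io_out32(REG_CR, CR_TXE);
    return 1;
}
```

In the simulation the TXE bit stays set once written; the sketch does not model the chip's own handling of that bit.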

The above steps are also done if the "TXIDLE" bit is set in the interrupt status.  Although the software now does not use interrupts it periodically polls the chip by reading the interrupt status register.

We have had this hardware and software running in the field at dozens of customer sites and have only one customer complaining of the problem.  We have also reproduced the problem in our engineering lab.  Changes to the software configuration or the amount of communication make the problem disappear.  Because the memory locations changed are somewhat random, the effect on the software is also unpredictable.  We've used the software exception and error handler to dump memory for analysis and found that the DMA descriptors are always correct for the DP83816 chips.  So far the only apparent workaround is adding a delay of about 100 microseconds between sending frames.

This might not be a DP83816 problem, since a chip-set or PCI bus problem could cause memory corruption.  I am trying to verify that our software for the DP83816 chip is correct.  Some of what we are doing is to work around a problem with the transmitter going idle.  I posted a thread about that problem.  I have also reviewed the errata for the chip-set in our system (440MX) and found nothing that matches this problem.

Our software is writing the Transmit Descriptor Pointer register.  I'm wondering if that could be related to the problem. The software checks to make sure that the transmit state machine is idle before writing the register.

Here are some more details of our hardware.  We have a custom CPU board using the Intel 440MX chip-set with a 650 MHz Celeron and a 100 MHz front-side bus.  There are four DP83816 Ethernet controller chips connected to the single PCI bus for the chip-set.  We have 64 MB of RAM using the built-in DRAM controller of the chip-set.  The only DMA devices are the Ethernet controllers and the only interrupt is the system timer interrupt.  We have tested the DP83816 with and without interrupts and it has no effect on the memory corruption problem.  The system has no keyboard, no video display and does not use the BIOS after the application software starts.

We use an embedded OS called Nucleus Plus and a network stack called Nucleus Net.  The DP83816 driver is one that we wrote based on example drivers for Nucleus Net and the DP83816 datasheet (manual).  I've reviewed the Linux driver for the DP83816 and found that it differs from ours in two areas.  It starts the transmitter merely by writing the "TXE" bit in the Command Register and it performs no processing for the "TXIDLE" interrupt.  When our driver was only writing the "TXE" bit in the Command Register we saw that the transmitter would sometimes go idle before all the frames in the ring were transmitted.

If the DP83816 is incorrectly reading the contents of the "link" field of a transmit descriptor that would explain why the transmitter is going idle, and also why the DMA might write to an incorrect memory location to store the "cmdsts" bytes.  The only problem with that hypothesis is that the data values appearing in memory do not match expected "cmdsts" values for either transmit or receive.  If the receive buffer address was being read incorrectly I would expect surrounding bytes to be incorrect and not just a few bytes in the middle of correct data.

I will appreciate any advice about the correct operation of the DP83816 chip or known problems that might relate to memory being changed unexpectedly.

  • Erik,

    Will your current discussion with Patrick on another DP83816 issue, posted on Dec-2, also cover this issue posted on Dec-8?

    I am the moderator of this forum and need to make sure that all DP83816 issues posted by one person are covered.

    For now I'm leaving this issue open and would appreciate your earliest feedback.

    Thank you,

    and best regards,

    Thomas

     

  • Thomas, I apologize for the delay in responding. Please keep this issue "Transmit Memory Corruption" open, and you can close or combine the other issue "Transmitter Goes Idle" with this problem. I am working to get our hardware design Engineer available and will ask him about monitoring the PCI bus. We have limited test equipment. I also asked our Software Engineering Dept. Manager about getting a DP83816 Ethernet Chip register snapshot after the problem occurs.
  • We tested the system using just an unconditional write of the TXE bit to the Command Register and still had problems with memory being corrupted.  Interrupts are disabled and only the transmit processing for the DP83816 is being done during the time that memory changes.  Adding a 10 microsecond delay between inserting each buffer into the ring and writing the TXE bit makes the problem go away.  We are using the required "SFENCE" instruction after storing the buffer pointer in the descriptor and again after writing the "cmdsts".  The memory containing the descriptors is Write-Back Cacheable and the Intel Chip-Set maintains cache coherency between PCI bus DMA and the CPU using Cache Snooping.

    Apparently this problem has nothing to do with reading or writing the Transmit Descriptor Pointer register.  It seems to be a problem with the actual DMA transfers during transmission.

  • Erik,

    I still think that the best approach will be to confirm the register contents and trace the PCI bus.  I understand that effort is in progress.  In the meantime, it sounds like you have simplified the scenario and narrowed down the conditions under which the issue occurs.  I think that is good progress. 

    Under the conditions you have described, I would not expect the DP83816 to be changing memory (aside from posting cmdsts).  Confirming the registers will allow us to verify the functional mode of the device.  Tracing the bus may help us exclude the DP83816 from causing the memory corruption.  If the DP83816 is accessing memory, the bus trace may give us an indication why it is doing so.

    Patrick

  • Here are two register dumps from two failures within a few hours of each other.  The two failures occurred on two different CPU boards in the same (redundant) system.  The CPU boards communicate via Ethernet port 1.  Ethernet port 0 is used to communicate with a host (PC with HMI).  Ethernet port 1 and 2 are used to communicate with other I/O processors that read and write analog and digital I/O signals.  These CPUs have only power connections and Ethernet connections.  They do not connect to any external devices, disk drives, keyboard, display, serial ports, etc.

    We have software checking the contents of the TXDP and RXDP registers and descriptors for incorrect pointers.  So far, TXDP, RXDP and the contents of descriptors have always been correct after the failures and at other times the software has checked them.

    The software detects the failure by checking known constant locations in the system against expected values.  The exact data and address that are wrong varies from failure to failure.  On this set of CPU boards we usually see a value of 100 (hex) in a location that should contain zero.  I don't have the exact memory address or data value from the two failures below.
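
    The kind of check described above might look roughly like this. It is a simplified sketch, not the actual diagnostic code: `sentinel_t`, `check_sentinels` and `is_single_bit` are hypothetical names, and the table of watched locations is assumed to be built elsewhere.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* One watched location and the value it must always hold. */
    typedef struct {
        const volatile uint32_t *addr;
        uint32_t expected;
    } sentinel_t;

    /* Returns the index of the first mismatching entry, or -1 if all match.
     * *diff receives actual XOR expected for that entry (0 when all match). */
    int check_sentinels(const sentinel_t *tab, size_t n, uint32_t *diff)
    {
        assert(tab != NULL && diff != NULL);
        for (size_t i = 0; i < n; i++) {
            uint32_t actual = *tab[i].addr;
            if (actual != tab[i].expected) {
                *diff = actual ^ tab[i].expected;
                return (int)i;
            }
        }
        *diff = 0;
        return -1;
    }

    /* True when exactly one bit differs, as in the 0x0100 pattern we see. */
    int is_single_bit(uint32_t diff)
    {
        return diff != 0 && (diff & (diff - 1u)) == 0;
    }
    ```

    Logging the XOR difference along with the address makes it easy to see whether a failure fits the usual single-bit pattern.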

    The contents of PCI configuration register 0C are typically 00000800 (PCI latency timer = 8).  These failures are from a test we ran with PCI latency set to 100 (decimal) so you will see a value of 00006400 for register 0C.
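
    For reference, the latency timer is byte 1 (bits 15:8) of the dword at PCI configuration offset 0C, so the two values above decode as shown in this small sketch. The helper names are hypothetical.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Extract the Latency Timer (bits 15:8) from config dword 0Ch. */
    static uint8_t pci_latency_timer(uint32_t cfg_0c)
    {
        return (uint8_t)((cfg_0c >> 8) & 0xFFu);
    }

    /* Return cfg_0c with the Latency Timer field replaced by 'clocks'. */
    static uint32_t pci_with_latency_timer(uint32_t cfg_0c, unsigned clocks)
    {
        assert(clocks <= 0xFFu);
        return (cfg_0c & ~0x0000FF00u) | ((uint32_t)clocks << 8);
    }
    ```

    With these, `pci_latency_timer(0x00000800)` gives 8 and `pci_latency_timer(0x00006400)` gives 100 (0x64).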

     


    Error log 1 from 12-22-2011
    =========================

    P A Internal Error MEMORY CORRUPTION  106  
    P A DIAG Ethernet port 0 register dump:
    P A DIAG     PCI Config
    P A DIAG     00:0020100B 04:02900145 08:02000000 0C:00006400
    P A DIAG     10:00004001 14:C0000000 2C:00000000 30:FF000000
    P A DIAG     34:00000040 3C:340B010B 40:7E020001 44:00000000
    P A DIAG     I/O Registers
    P A DIAG     00:00000004 04:E802E004 08:00000002 0C:00000000
    P A DIAG     10:00004282 14:03005AB2 18:00000001 1C:00000000
    P A DIAG     20:03FA7C38 24:D0F00E30 30:03FA738C 34:10700002
    P A DIAG     3C:00000000 40:00000000 44:00000000 48:C8000000
    P A DIAG     4C:00008000 50:FFFF0000 54:3D3D3D3D 58:00000505
    P A DIAG     5C:00000000 60:00000000 64:00000000 68:00000000
    P A DIAG     6C:00000000 70:00000000 74:00000000 78:00000000
    P A DIAG Ethernet port 1 register dump:
    P A DIAG     PCI Config
    P A DIAG     00:0020100B 04:02900145 08:02000000 0C:00006400
    P A DIAG     10:00004101 14:C0001000 2C:00000000 30:FF000000
    P A DIAG     34:00000040 3C:340B010B 40:7E020001 44:00000000
    P A DIAG     I/O Registers
    P A DIAG     00:00000004 04:E802A004 08:00000002 0C:00000000
    P A DIAG     10:00004282 14:03005AB2 18:00000001 1C:00000000
    P A DIAG     20:03FA9CFC 24:D0F00E30 30:03FA9834 34:10700002
    P A DIAG     3C:00000000 40:00000000 44:00000000 48:C8000000
    P A DIAG     4C:00008000 50:FFFF0000 54:3D3D3D3D 58:00000505
    P A DIAG     5C:00000000 60:00000000 64:00000000 68:00000000
    P A DIAG     6C:00000000 70:00000000 74:00000000 78:00000000
    P A DIAG Ethernet port 2 register dump:
    P A DIAG     PCI Config
    P A DIAG     00:0020100B 04:02900145 08:02000000 0C:00006400
    P A DIAG     10:00004201 14:C0002000 2C:00000000 30:FF000000
    P A DIAG     34:00000040 3C:340B010B 40:7E020001 44:00000000
    P A DIAG     I/O Registers
    P A DIAG     00:00000004 04:E802A004 08:00000002 0C:00000000
    P A DIAG     10:00004282 14:03005AB2 18:00000001 1C:00000000
    P A DIAG     20:03FA8608 24:D0F00E30 30:03FA841C 34:10700002
    P A DIAG     3C:00000000 40:00000000 44:00000000 48:C8000000
    P A DIAG     4C:00008000 50:FFFF0000 54:3D3D3D3D 58:00000505
    P A DIAG     5C:00000000 60:00000000 64:00000000 68:00000000
    P A DIAG     6C:00000000 70:00000000 74:00000000 78:00000000
    P A DIAG Ethernet port 3 register dump:
    P A DIAG     PCI Config
    P A DIAG     00:0020100B 04:02900145 08:02000000 0C:00006400
    P A DIAG     10:00004301 14:C0003000 2C:00000000 30:FF000000
    P A DIAG     34:00000040 3C:340B010B 40:7E020001 44:00000000
    P A DIAG     I/O Registers
    P A DIAG     00:00000004 04:E802A004 08:00000002 0C:00000000
    P A DIAG     10:00004282 14:03005AB2 18:00000001 1C:00000000
    P A DIAG     20:03FA9314 24:D0F00E30 30:03FA8AC8 34:10700002
    P A DIAG     3C:00000000 40:00000000 44:00000000 48:C8000000
    P A DIAG     4C:00008000 50:FFFF0000 54:3D3D3D3D 58:00000505
    P A DIAG     5C:00000000 60:00000000 64:00000000 68:00000000
    P A DIAG     6C:00000000 70:00000000 74:00000000 78:00000000

    * NOTE:
    I/O register 10 (hex) is not the actual contents of the Interrupt Status Register.
    It is a summary of all un-masked interrupts that occurred for the chip since boot.

     


    Error log 2 from 12-22-2011
    ===========================

    Message
    S B Internal Error MEMORY CORRUPTION  104  
    S B DIAG Ethernet port 0 register dump:
    S B DIAG     PCI Config
    S B DIAG     00:0020100B 04:02900145 08:02000000 0C:00006400
    S B DIAG     10:00004001 14:C0000000 2C:00000000 30:FF000000
    S B DIAG     34:00000040 3C:340B010B 40:7E020001 44:00000000
    S B DIAG     I/O Registers
    S B DIAG     00:00000004 04:E802E004 08:00000002 0C:00000000
    S B DIAG     10:00004282 14:03005AB2 18:00000001 1C:00000000
    S B DIAG     20:03FA7A40 24:D0F00E30 30:03FA73EC 34:10700002
    S B DIAG     3C:00000000 40:00000000 44:00000000 48:C8000000
    S B DIAG     4C:00008000 50:FFFF0000 54:3D3D3D3D 58:00000505
    S B DIAG     5C:00000000 60:00000000 64:00000000 68:00000000
    S B DIAG     6C:00000000 70:00000000 74:00000000 78:00000000
    S B DIAG Ethernet port 1 register dump:
    S B DIAG     PCI Config
    S B DIAG     00:0020100B 04:02900145 08:02000000 0C:00006400
    S B DIAG     10:00004101 14:C0001000 2C:00000000 30:FF000000
    S B DIAG     34:00000040 3C:340B010B 40:7E020001 44:00000000
    S B DIAG     I/O Registers
    S B DIAG     00:00000004 04:E802A004 08:00000002 0C:00000000
    S B DIAG     10:00004282 14:03005AB2 18:00000001 1C:00000000
    S B DIAG     20:03FA9E70 24:D0F00E30 30:03FA9B70 34:10700002
    S B DIAG     3C:00000000 40:00000000 44:00000000 48:C8000000
    S B DIAG     4C:00008000 50:FFFF0000 54:3D3D3D3D 58:00000505
    S B DIAG     5C:00000000 60:00000000 64:00000000 68:00000000
    S B DIAG     6C:00000000 70:00000000 74:00000000 78:00000000
    S B DIAG Ethernet port 2 register dump:
    S B DIAG     PCI Config
    S B DIAG     00:0020100B 04:02900145 08:02000000 0C:00006400
    S B DIAG     10:00004201 14:C0002000 2C:00000000 30:FF000000
    S B DIAG     34:00000040 3C:340B010B 40:7E020001 44:00000000
    S B DIAG     I/O Registers
    S B DIAG     00:00000004 04:E802A004 08:00000002 0C:00000000
    S B DIAG     10:00004282 14:03005AB2 18:00000001 1C:00000000
    S B DIAG     20:03FA8800 24:D0F00E30 30:03FA844C 34:10700002
    S B DIAG     3C:00000000 40:00000000 44:00000000 48:C8000000
    S B DIAG     4C:00008000 50:FFFF0000 54:3D3D3D3D 58:00000505
    S B DIAG     5C:00000000 60:00000000 64:00000000 68:00000000
    S B DIAG     6C:00000000 70:00000000 74:00000000 78:00000000
    S B DIAG Ethernet port 3 register dump:
    S B DIAG     PCI Config
    S B DIAG     00:0020100B 04:02900145 08:02000000 0C:00006400
    S B DIAG     10:00004301 14:C0003000 2C:00000000 30:FF000000
    S B DIAG     34:00000040 3C:340B010B 40:7E020001 44:00000000
    S B DIAG     I/O Registers
    S B DIAG     00:00000004 04:E802A004 08:00000002 0C:00000000
    S B DIAG     10:00004282 14:03005AB2 18:00000001 1C:00000000
    S B DIAG     20:03FA9344 24:D0F00E30 30:03FA8D08 34:10700002
    S B DIAG     3C:00000000 40:00000000 44:00000000 48:C8000000
    S B DIAG     4C:00008000 50:FFFF0000 54:3D3D3D3D 58:00000505
    S B DIAG     5C:00000000 60:00000000 64:00000000 68:00000000
    S B DIAG     6C:00000000 70:00000000 74:00000000 78:00000000

    * NOTE:
    I/O register 10 (hex) is not the actual contents of the Interrupt Status Register.
    It is a summary of all un-masked interrupts that occurred for the chip since boot.

  • I will review the register contents and see what I can learn.

    You have given me a good idea of how the DP83816 devices are connected on the Ethernet side.  Could you help me understand the devices that are sharing the PCI bus, in addition to the DP83816 devices?

  • The four DP83816 chips are the only PCI bus devices external to the chip-set. The chip-set has one PCI Bus addressed as 0.  The chip-set Host-PCI bridge is device 0 and acts as a bus master for I/O accesses to the Ethernet chip registers.  The chip-set also has device 7 with some built-in peripherals that we use as slave I/O devices.  We are not using any of the chip-set DMA controllers or IDE DMA capabilities.  Our IDE disk is a disk-on-chip (flash) and we use the Primary IDE controller in program-I/O (non-DMA) mode.  The four Ethernet chips are the only DMA devices in the system being used, and they are the only PCI bus masters other than the CPU Host-PCI bridge.  The chip-set is an Intel 440MX that is very similar to the Intel 440BX.

    You probably don't need this level of detail, but here are the exact device numbers and other information for the chips.

        {
            0x08,         // PCI device number (PCI slot #)
            PCI_PIRQA,    // PCI interrupt request (A-D)
            0x4000,       // I/O registers address
            11,           // Interrupt request line (0-15)
        },
        {
            0x09,         // PCI device number (PCI slot #)
            PCI_PIRQB,    // PCI interrupt request (A-D)
            0x4100,       // I/O registers address
            11,           // Interrupt request line (0-15)
        },
        {
            0x0A,         // PCI device number (PCI slot #)
            PCI_PIRQC,    // PCI interrupt request (A-D)
            0x4200,       // I/O registers address
            11,           // Interrupt request line (0-15)
        },
        {
            0x0B,         // PCI device number (PCI slot #)
            PCI_PIRQD,    // PCI interrupt request (A-D)
            0x4300,       // I/O registers address
            11,           // Interrupt request line (0-15)
        }

    The chip-set's Edge / Level Control Register has IRQ 11 for the Ethernet chips programmed to make them level triggered.  Each chip is connected to a separate PCI bus interrupt line but they are all routed to IRQ 11 in the Interrupt Routing Register.

    The chip-set supports four separate pairs of PCI bus master request / grant signals and we have each Ethernet chip connected directly to a separate pair of signals.  Arbitration for mastership is set to round-robin.

    If you need any other hardware details it will probably be better for me to have the hardware design engineer answer them.  I primarily deal with software.

  • Patrick,

    What did you conclude from your review of the registers dumped from the Ethernet chips?  We have not found a workaround that completely eliminates the problem.

  • I do not see any issues or anything concerning in the register settings.  Since you have seen memory corruption, I scrutinized the buffer management configuration.  The register settings for fill and drain thresholds, DMA, etc. all look correct. 

    I noticed one difference between the port1 registers and the registers for the other ports.  The Auto-Negotiation Select bits in the Configuration and Media Status Register, CFG[15:13], are set differently.  Depending on your desired configuration, either setting will work OK.  I was just curious if the difference is intentional.  Is the difference introduced via an explicit register write or via the EEPROM configuration? 

    At this point, I believe the best approach will be to connect a PCI bus analyzer and attempt to capture the problem condition.  I believe that activity is being tracked on the parallel E2E thread "DP83816 Ethernet - Transmitter Goes Idle". 
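
    To make the CFG[15:13] comparison concrete, here is a small sketch that pulls that field out of the I/O register 04 values in the dumps earlier in this thread (E802E004 for the first port listed, E802A004 for the other three). The helper `reg_field` is a hypothetical name, and the sketch only extracts the bits; it does not interpret what each encoding selects.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Return bits hi..lo (inclusive) of a 32-bit register value. */
    static uint32_t reg_field(uint32_t value, unsigned hi, unsigned lo)
    {
        assert(hi < 32 && lo <= hi);
        unsigned width = hi - lo + 1;
        uint32_t mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
        return (value >> lo) & mask;
    }
    ```

    `reg_field(0xE802E004, 15, 13)` yields 7 (binary 111), while `reg_field(0xE802A004, 15, 13)` yields 5 (binary 101), which is the one-bit difference noted above.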

  • Patrick O'Farrell said:

    I do not see any issues or anything concerning in the register settings.  Since you have seen memory corruption, I scrutinized the buffer management configuration.  The register settings for fill and drain thresholds, DMA, etc. all look correct. 

    I noticed one difference between the port1 registers and the registers for the other ports.  The Auto-Negotiation Select bits in the Configuration and Media Status Register, CFG[15:13], are set differently.  Depending on your desired configuration, either setting will work OK.  I was just curious if the difference is intentional.  Is the difference introduced via an explicit register write or via the EEPROM configuration? 

    At this point, I believe the best approach will be to connect a PCI bus analyzer and attempt to capture the problem condition.  I believe that activity is being tracked on the parallel E2E thread "DP83816 Ethernet - Transmitter Goes Idle". 

    The difference in the settings for the first port is intentional.  The first port connects to a host computer and allows 10 megabit operation.  The other ports are required to operate at 100 megabits.  Our system actually does not have an EEPROM connected to each Ethernet chip.  Instead the MAC address and other information are programmed by software.  The software obtains configuration information from the flash memory containing the CPU BIOS and application software.

    So far we have not been able to observe anything on the PCI bus that looks incorrect.  Our ability to monitor signals is quite limited due to the layout and the use of BGA (ball-grid-array) components.  We are also limited to a standard logic analyzer that is not tailored to PCI Bus connection or monitoring.  Is there anything specific that you can suggest for us to look at?

    I have been trying to keep the focus on this thread, as the transmitter issue is less serious and in fact may just be a symptom of the wrong memory address being accessed.  If you wish, you may combine the two threads or close the idle transmitter issue.