I have been investigating a problem in our embedded system product that results in random memory locations being changed. The problem happens after many hours or days of operation and depends on the amount of communication. After a series of experiments to narrow it down, we found that it is related to the transmit DMA on the DP83816 chip.
Here are the things we did to narrow down the problem.
- Eliminated interrupts and polled the chip for service.
- Used CPU debug registers to trap on CPU writes to the altered memory location.
We found that the CPU was not writing to the memory location (which implies that DMA changed it).
- Verified the contents of the transmit and receive DMA descriptors before and after DMA transfers.
The DMA descriptors are correct, including the "link" pointers.
- Verified the altered memory location before and after performing a transmit operation.
We found that the memory is always changed during the transmit DMA.
- Blocked interrupts during the entire transmit DMA.
We had the same results: memory was still altered by the transmit DMA.
- Added about 100 microseconds of delay between transmit frames.
We found that this avoided the problem and memory was not altered.
- Reviewed the transmit software for the chip carefully and could find no problems.
The software performs the same steps whether or not the 100 microsecond delay is there, yet the delay seems to avoid the problem.
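For anyone who wants to repeat the before/after check on a suspect memory region, here is a minimal sketch of the idea in C. It is not our actual code: the names (mem_watch_t, watch_arm, watch_check) and the region size are made up for illustration. Arm the watch just before queuing a transmit, then check it after the transmit completes.

```c
#include <stddef.h>
#include <stdint.h>

#define WATCH_WORDS 64u  /* assumed size of the watched region, in 32-bit words */

/* Hypothetical helper: remembers what a region looked like before a transmit. */
typedef struct {
    const volatile uint32_t *base;   /* region that the CPU does not write during transmit */
    uint32_t snapshot[WATCH_WORDS];  /* copy taken by watch_arm() */
} mem_watch_t;

/* Take a snapshot of the region. */
void watch_arm(mem_watch_t *w, const volatile uint32_t *base)
{
    w->base = base;
    for (size_t i = 0; i < WATCH_WORDS; i++)
        w->snapshot[i] = base[i];
}

/* Compare the region against the snapshot.
 * Returns the index of the first changed word, or -1 if nothing changed.
 * If diff_out is not NULL it receives old XOR new for that word. */
int watch_check(const mem_watch_t *w, uint32_t *diff_out)
{
    for (size_t i = 0; i < WATCH_WORDS; i++) {
        uint32_t now = w->base[i];
        if (now != w->snapshot[i]) {
            if (diff_out != NULL)
                *diff_out = now ^ w->snapshot[i];
            return (int)i;
        }
    }
    return -1;
}
```

Since the CPU debug registers showed no CPU write, a change reported by a check like this between arming and completion points at a bus master rather than the software.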
The problem occurs differently on different boards (same design), and the altered memory address and data are not the same on each board. For a given board, the data and altered memory locations are somewhat consistent. Typically we see one bit set in a location that should contain zeros, for example 0x0100 or 0x0800. The exact data pattern seems consistent for any particular board. None of the data values seem to match expected "cmdsts" values. The only memory write done during transmit DMA is the two "cmdsts" bytes in the descriptor.
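To sort memory dumps into "single bit changed" and "something else", a small helper like the following can be used. This is only a sketch; the function name is hypothetical and it assumes the expected value of the location is known (zero in the cases above).

```c
#include <stdint.h>

/* Hypothetical helper: if expected and observed differ in exactly one bit,
 * return that bit's index (0..31); otherwise return -1. */
int single_bit_index(uint32_t expected, uint32_t observed)
{
    uint32_t diff = expected ^ observed;

    /* diff & (diff - 1) clears the lowest set bit; a nonzero result
     * means more than one bit differs. */
    if (diff == 0u || (diff & (diff - 1u)) != 0u)
        return -1;

    int bit = 0;
    while ((diff & 1u) == 0u) {
        diff >>= 1;
        bit++;
    }
    return bit;
}
```

With an expected value of zero, 0x0100 is reported as bit 8 and 0x0800 as bit 11.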
If we avoid doing transmit with the DP83816 or add the 100 microsecond delay between transmit frames we do not see the problem. There are four DP83816 chips in our system. When the problem occurs, one is sending about 8 frames and the others are mostly receiving. We found that the amount of receive activity has an effect on the problem. With less receive activity we don't see the problem (or not as often).
Each transmitted frame uses two descriptors. The first descriptor points to a buffer containing the destination and source MAC address (12 bytes). The second descriptor contains the Ethertype code and remainder of the frame. Frames vary from 32 bytes to 1480 bytes in length. The software typically transmits 33 frames but that varies depending on the system configuration. The maximum configuration doesn't use more than 64 frames.
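To make the two-descriptor layout concrete, here is a rough sketch in C. The descriptor field order (link, cmdsts, bufptr) and the bit positions follow my reading of the DP83816 datasheet; the struct and function names are hypothetical and this is not a copy of our driver. The OWN bit is deliberately left clear here because ownership is handed to the chip in a separate step.

```c
#include <stdint.h>

#define CMDSTS_OWN        0x80000000u  /* descriptor owned by the chip */
#define CMDSTS_MORE       0x40000000u  /* frame continues in the next descriptor */
#define CMDSTS_SIZE_MASK  0x00000FFFu  /* buffer size in bytes */
#define MAC_HDR_LEN       12u          /* destination + source MAC address */

/* Assumed in-memory descriptor layout (three 32-bit words). */
typedef struct {
    uint32_t link;    /* physical address of the next descriptor */
    uint32_t cmdsts;  /* command/status word */
    uint32_t bufptr;  /* physical address of the data buffer */
} dp_desc_t;

/* Describe one frame with two descriptors: the first covers the 12-byte
 * MAC header, the second the Ethertype and the remainder of the frame.
 * Returns 0 on success, -1 if the length cannot be split this way. */
int frame_split(uint32_t hdr_phys, uint32_t rest_phys, uint32_t frame_len,
                dp_desc_t *first, dp_desc_t *second)
{
    if (frame_len <= MAC_HDR_LEN ||
        frame_len - MAC_HDR_LEN > CMDSTS_SIZE_MASK)
        return -1;

    first->bufptr  = hdr_phys;
    first->cmdsts  = CMDSTS_MORE | MAC_HDR_LEN;   /* OWN still clear */
    second->bufptr = rest_phys;
    second->cmdsts = frame_len - MAC_HDR_LEN;     /* last fragment, OWN still clear */
    return 0;
}
```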
Here is a more detailed description of how the software transmits the frames.
- Disable interrupts
- Check to make sure that there are enough unused descriptors for the packet.
This is done by counting the number of buffers (always two) and checking the transmit write index versus read index.
If the write index equals the read index the software checks the descriptor buffer pointer to determine if the ring is empty or full.
- If there are not enough available descriptors, return "ring full" status
- Set buffer pointer and "cmdsts" (OWN bit cleared, MORE set) for first descriptor
- Set buffer pointer and "cmdsts" (OWN bit set, MORE cleared) for second descriptor
- Set "cmdsts" (OWN bit set, MORE bit set) for first descriptor
- Update transmit "write" index to indicate next available descriptor
- *Start transmitter (see below)
- Enable interrupts
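The queuing steps above can be sketched roughly as follows. This is a simplified illustration, not our driver source. Assumptions: a ring of 8 descriptors, a buffer pointer of zero marking an unused descriptor (the reclaim code is assumed to clear it), "link" fields already initialized into a ring, and interrupt control and the transmitter start left as comments because they are platform calls.

```c
#include <stdint.h>

#define CMDSTS_OWN        0x80000000u
#define CMDSTS_MORE       0x40000000u
#define CMDSTS_SIZE_MASK  0x00000FFFu
#define MAC_HDR_LEN       12u
#define RING_SIZE         8u   /* assumed ring size for this sketch */

typedef struct {
    volatile uint32_t link;    /* assumed to be set up once at init */
    volatile uint32_t cmdsts;
    volatile uint32_t bufptr;  /* 0 = descriptor unused (assumption) */
} tx_desc_t;

typedef struct {
    tx_desc_t desc[RING_SIZE];
    unsigned  wr;  /* next descriptor the driver will fill */
    unsigned  rd;  /* next descriptor the driver will reclaim */
} tx_ring_t;

/* Number of free descriptors. When the indexes are equal the ring is
 * either empty or full; the buffer pointer tells the two cases apart. */
unsigned tx_free_count(const tx_ring_t *r)
{
    if (r->wr == r->rd)
        return (r->desc[r->wr].bufptr == 0u) ? RING_SIZE : 0u;
    return (r->rd + RING_SIZE - r->wr) % RING_SIZE;
}

/* Queue one frame as two descriptors. Returns 0, or -1 for "ring full". */
int tx_queue_frame(tx_ring_t *r, uint32_t hdr_phys,
                   uint32_t rest_phys, uint32_t rest_len)
{
    /* disable interrupts here (platform call, omitted) */
    if (tx_free_count(r) < 2u)
        return -1;  /* real code re-enables interrupts before returning */

    tx_desc_t *d0 = &r->desc[r->wr];
    tx_desc_t *d1 = &r->desc[(r->wr + 1u) % RING_SIZE];

    d0->bufptr = hdr_phys;
    d0->cmdsts = CMDSTS_MORE | MAC_HDR_LEN;                   /* OWN clear */
    d1->bufptr = rest_phys;
    d1->cmdsts = CMDSTS_OWN | (rest_len & CMDSTS_SIZE_MASK);  /* last fragment */

    /* Hand over the first descriptor last, so the chip never sees a
     * half-built frame. The volatile stores keep this order in the
     * compiler; other CPUs may also need a memory barrier here. */
    d0->cmdsts = CMDSTS_OWN | CMDSTS_MORE | MAC_HDR_LEN;

    r->wr = (r->wr + 2u) % RING_SIZE;
    /* start transmitter, then enable interrupts (omitted) */
    return 0;
}
```

The point of setting OWN on the first descriptor last is that the chip stops at a descriptor it does not own, so it cannot begin a frame whose second descriptor is not ready yet.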
*The software starts the transmitter as follows.
1. Disable interrupts
2. PCI I/O Input from Command Register (offset 0)
3. If "TXE" bit (0x00000001) set, go to step 12
4. PCI I/O Input from Transmit Descriptor Pointer register (0020h)
5. Mask unused bits of address (1 and 0)
6. Range check address versus descriptor ring
7. If address out of range, go to step 12
8. Calculate ring index from descriptor address
9. If descriptor at address has "OWN" cleared, go to step 12
10. PCI I/O output to Transmit Descriptor Pointer register (0020h)
11. PCI I/O Output to Command Register (offset 0) with TXE bit set (0x00000001)
12. Enable interrupts
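Here is the decision part of the start-transmitter sequence as a C sketch, with the PCI I/O accesses left to the caller so the logic can be examined on its own. All names are hypothetical. Two things in it are assumptions rather than facts from the steps above: the value written back to the Transmit Descriptor Pointer register is taken to be the masked address that was just read, and the check that the address falls on a descriptor boundary is an extra that is not in the listed steps.

```c
#include <stdint.h>

#define CR_TXE      0x00000001u  /* transmit enable bit in the Command Register */
#define CMDSTS_OWN  0x80000000u

typedef struct {
    uint32_t link;
    uint32_t cmdsts;
    uint32_t bufptr;
} tx_desc_t;

typedef enum {
    KICK_BUSY,            /* TXE already set: transmitter is running */
    KICK_BAD_PTR,         /* descriptor pointer is outside the ring */
    KICK_NOTHING_QUEUED,  /* descriptor at the pointer is not owned by the chip */
    KICK_RESTART          /* caller should write TXDP, then CR with TXE */
} kick_t;

/* cr and txdp are the values the caller just read from the Command Register
 * (offset 0) and the Transmit Descriptor Pointer register (0020h).
 * ring_phys is the physical address of desc[0]; ring is the CPU view of the
 * same memory. On KICK_RESTART, *restart_addr holds the value to write to TXDP. */
kick_t tx_kick_decide(uint32_t cr, uint32_t txdp, uint32_t ring_phys,
                      const tx_desc_t *ring, uint32_t count,
                      uint32_t *restart_addr)
{
    const uint32_t stride = (uint32_t)sizeof(tx_desc_t);

    if (cr & CR_TXE)
        return KICK_BUSY;

    uint32_t addr = txdp & ~0x3u;               /* mask unused bits 1 and 0 */

    if (addr < ring_phys || addr - ring_phys >= count * stride)
        return KICK_BAD_PTR;                    /* range check against the ring */

    uint32_t offset = addr - ring_phys;
    if (offset % stride != 0u)
        return KICK_BAD_PTR;                    /* extra check, not in the steps */

    uint32_t index = offset / stride;           /* ring index from address */
    if ((ring[index].cmdsts & CMDSTS_OWN) == 0u)
        return KICK_NOTHING_QUEUED;

    *restart_addr = addr;
    return KICK_RESTART;
}
```

The caller would disable interrupts, read the two registers, call this function, perform the two register writes only on KICK_RESTART, and then enable interrupts again.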
The above steps are also done if the "TXIDLE" bit is set in the interrupt status. Although the software no longer uses interrupts, it periodically polls the chip by reading the interrupt status register.
We have had this hardware and software running in the field at dozens of customer sites and have only one customer complaining of the problem. We have also reproduced the problem in our engineering lab. Changes to the software configuration or the amount of communication make the problem disappear. Because the memory locations changed are somewhat random, the effect on the software is also unpredictable. We've used the software exception and error handler to dump memory for analysis and found that the DMA descriptors are always correct for the DP83816 chips. So far the only apparent workaround is adding a delay of around 100 microseconds between sending frames.
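The inter-frame delay can be expressed as a minimum gap between transmit starts. The sketch below only shows the arithmetic; the function name is hypothetical and it assumes a free-running 32-bit microsecond counter is available on the platform. How the wait itself is done (busy loop or deferring the frame) is left to the caller.

```c
#include <stdint.h>

#define TX_MIN_GAP_US 100u  /* the delay that avoids the problem in our tests */

/* Returns how many microseconds the caller still has to wait before
 * starting the next frame, or 0 if the gap has already elapsed.
 * Unsigned subtraction keeps the result correct across counter wraparound. */
uint32_t tx_gap_remaining(uint32_t now_us, uint32_t last_tx_us,
                          uint32_t min_gap_us)
{
    uint32_t elapsed = now_us - last_tx_us;

    if (elapsed >= min_gap_us)
        return 0u;
    return min_gap_us - elapsed;
}
```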
This might not be a DP83816 problem, since a chip-set or PCI bus problem could cause memory corruption. I am trying to verify that our software for the DP83816 chip is correct. Some of what we are doing is to work around a problem with the transmitter going idle. I posted a thread about that problem. I have also reviewed the errata for the chip-set in our system (440MX) and found nothing that matches this problem.
Our software is writing the Transmit Descriptor Pointer register. I'm wondering if that could be related to the problem. The software checks to make sure that the transmit state machine is idle before writing the register.
Here are some more details of our hardware. We have a custom CPU board using the Intel 440MX chip-set, a 650 MHz Celeron and a 100 MHz front-side bus. There are four DP83816 Ethernet controller chips connected to the single PCI bus of the chip-set. We have 64 MB of RAM using the built-in DRAM controller of the chip-set. The only DMA devices are the Ethernet controllers and the only interrupt is the system timer interrupt. We have tested the DP83816 with and without interrupts and it has no effect on the memory corruption problem. The system has no keyboard, no video display and does not use the BIOS after the application software starts.
We use an embedded OS called Nucleus Plus and a network stack called Nucleus Net. The DP83816 driver is one that we wrote based on example drivers for Nucleus Net and the DP83816 datasheet (manual). I've reviewed the Linux driver for the DP83816 and found that it differs from ours in two areas. It starts the transmitter merely by writing the "TXE" bit in the Command Register and it performs no processing for the "TXIDLE" interrupt. When our driver was only writing the "TXE" bit in the Command Register we saw that the transmitter would sometimes go idle before all the frames in the ring were transmitted.
If the DP83816 is incorrectly reading the contents of the "link" field of a transmit descriptor, that would explain why the transmitter is going idle, and also why the DMA might write to an incorrect memory location to store the "cmdsts" bytes. The only problem with that hypothesis is that the data values appearing in memory do not match expected "cmdsts" values for either transmit or receive. If the receive buffer address were being read incorrectly, I would expect surrounding bytes to be incorrect and not just a few bytes in the middle of correct data.
I would appreciate any advice about the correct operation of the DP83816 chip or known problems that might relate to memory being changed unexpectedly.