Ethernet transmission stall on DM8148 EVM

Hardware:

DM8148 EVM @ 600 MHz, DDR2

I started from the SDK 5.x Linux source code and have been working on a QNX BSP. The QNX BSP I started with did not have an Ethernet driver, so I have been porting the Linux cpsw Ethernet driver to QNX.

The driver works well enough that pings succeed and I can use the QNX IDE to remotely download and debug a program.

Sometimes, the transmission side of the driver hangs. When it hangs, there is a value in TX_HDP[0] but the buffer it is pointing to is not transmitted. Subsequent packets are just added to the buffer chain. Packets are still being received.

The previous packet is completely finished before the stall packet is written to TX_HDP[0]. This is not an EOQ handshaking problem. I can see that the previous packet has been recovered.

Several scenarios trigger this, and one is very reliable: starting up QNET. The first packet QNET sends is not an IP packet; it uses a different Ethernet protocol. The packet contents are:

0x0000 FF FF FF FF FF FF 00 50-C2 49 DE 09 82 04 00 00 ÿÿÿÿÿÿ.PÂIÞ.‚...

0x0010 2A 0B 03 02 00 00 00 00-00 00 00 00 00 00 00 00 *...............

0x0020 00 00 00 00 00 00 00 00-00 00 00 00 70 00 00 00 ............p...

0x0030 00 80 00 00 01 00 01 00-70 00 00 00 04 00 00 00 .€......p.......

0x0040 08 00 00 00 0C 00 00 00-0A 00 00 00 16 00 00 00 ................

0x0050 10 00 00 00 26 00 00 00-08 00 00 00 2E 00 00 00 ....&...........

0x0060 0A 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................

0x0070 45 41 34 39 64 65 39 00-6E 65 74 2E 69 6E 74 72 EA49de9.net.intr

0x0080 61 00 10 01 00 50 C2 49-DE 09 00 00 00 00 00 00 a....PÂIÞ.......

0x0090 00 00 45 41 34 39 64 65-39 00 6E 65 74 2E 69 6E ..EA49de9.net.in

0x00A0 74 72 61 00 tra.

Sending out this packet also causes other QNX systems to send back packets for the same Ethernet protocol (0x8204).

After this packet is sent, the transmitter no longer makes progress.

I have compared the EMAC registers before and after the stall and, except for TX_HDP[0] staying stuck on the next packet, I don't see a difference.

I was going to try sending the above packet while Linux is running on the EVM board, but I am not sure how to use PF_PACKET or SOCK_RAW to send a packet with the exact contents above. The socket() call fails when I try.
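For reference, here is a minimal sketch of sending a pre-built frame from Linux through an AF_PACKET raw socket (PF_PACKET is the older name for the same family). The function names `send_raw_frame` and `send_qnet_frame` are illustrative assumptions and are not part of any driver. The call needs root or CAP_NET_RAW. If socket() itself fails with EAFNOSUPPORT (errno 97 on Linux), the kernel was most likely built without packet-socket support (CONFIG_PACKET), and no user-space change will fix that.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <arpa/inet.h>        /* htons */
#include <net/if.h>           /* if_nametoindex */
#include <net/ethernet.h>     /* ETH_ALEN, ETH_P_ALL */
#include <linux/if_packet.h>  /* struct sockaddr_ll */

/* The 164-byte frame from the dump above, copied byte for byte. */
static const uint8_t qnet_frame[] = {
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00,0x50,0xC2,0x49,0xDE,0x09,0x82,0x04,0x00,0x00,
    0x2A,0x0B,0x03,0x02,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x70,0x00,0x00,0x00,
    0x00,0x80,0x00,0x00,0x01,0x00,0x01,0x00,0x70,0x00,0x00,0x00,0x04,0x00,0x00,0x00,
    0x08,0x00,0x00,0x00,0x0C,0x00,0x00,0x00,0x0A,0x00,0x00,0x00,0x16,0x00,0x00,0x00,
    0x10,0x00,0x00,0x00,0x26,0x00,0x00,0x00,0x08,0x00,0x00,0x00,0x2E,0x00,0x00,0x00,
    0x0A,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    0x45,0x41,0x34,0x39,0x64,0x65,0x39,0x00,0x6E,0x65,0x74,0x2E,0x69,0x6E,0x74,0x72,
    0x61,0x00,0x10,0x01,0x00,0x50,0xC2,0x49,0xDE,0x09,0x00,0x00,0x00,0x00,0x00,0x00,
    0x00,0x00,0x45,0x41,0x34,0x39,0x64,0x65,0x39,0x00,0x6E,0x65,0x74,0x2E,0x69,0x6E,
    0x74,0x72,0x61,0x00
};

/*
 * Send one complete Ethernet frame (header included) on the named interface.
 * Returns the number of bytes sent, or -1 with errno set.
 */
ssize_t send_raw_frame(const char *ifname, const uint8_t *frame, size_t len)
{
    if (len < 14) {                       /* shorter than an Ethernet header */
        errno = EINVAL;
        return -1;
    }

    unsigned int ifindex = if_nametoindex(ifname);
    if (ifindex == 0)
        return -1;                        /* errno set by if_nametoindex() */

    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;                        /* EPERM: not root; EAFNOSUPPORT: no AF_PACKET */

    struct sockaddr_ll sll;
    memset(&sll, 0, sizeof(sll));
    sll.sll_family  = AF_PACKET;
    sll.sll_ifindex = (int)ifindex;
    sll.sll_halen   = ETH_ALEN;
    memcpy(sll.sll_addr, frame, ETH_ALEN); /* destination MAC = first 6 bytes */

    ssize_t n = sendto(fd, frame, len, 0,
                       (const struct sockaddr *)&sll, sizeof(sll));
    int saved = errno;                    /* keep sendto()'s errno across close() */
    close(fd);
    errno = saved;
    return n;
}

/* Convenience wrapper for the specific frame from this post. */
ssize_t send_qnet_frame(const char *ifname)
{
    return send_raw_frame(ifname, qnet_frame, sizeof(qnet_frame));
}
```

Calling `send_qnet_frame("eth0")` as root would put the frame on the wire unchanged; the interface name `eth0` is an assumption and may differ on the EVM.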

I don't completely understand how some parts of the device work (address lookup entries, switch part and using ports 0, 1, 2...) but it works for a while so it seems to be mostly right.

When it is working, I have used ping -f with different packet sizes and it runs as long as I let it run. It does seem to have trouble running as a ttcp receiver but it does not appear to be the same stall problem. 

Suggestions would be appreciated. 

  • Under what circumstances can the TX_HDP[0] register be non-zero while the packet is not being transmitted? Is there flow control? Can an ALE entry matter? What can I look at to determine what is preventing transmission? The CPDMA register values are:

    4a100100: 00180108 00000001 00000000 00000000   00180108 00000001 00000000 00000000
    4a100120: 00000000 80000000 00000000 00000000   00000000 00000000 00000000 00000000
    4a100140: 00000000 00000000 00000000 00000000   00000000 00000000 00000000 00000000
    4a100160: 00000000 00000000 00000000 00000000   00000000 00000000 00000000 00000000
    4a100180: 00000000 00000000 00000000 00000000   00000000 00000001 00000000 00000000
    4a1001a0: 00000000 00000000 00000001 00000001   00000000 00000000 00000000 00000000
    4a1001c0: 00000000 00000000 00000000 00000000   00000000 00000000 00000000 00000000
    4a1001e0: 00000000 00000000 00000000 00000000   00000000 00000000 00000000 00000000
    4a100200: 4a103050 00000000 00000000 00000000   00000000 00000000 00000000 00000000
    4a100220: 4a102670 00000000 00000000 00000000   00000000 00000000 00000000 00000000
    4a100240: 4a103040 00000000 00000000 00000000   00000000 00000000 00000000 00000000
    4a100260: 4a102660 00000000 00000000 00000000   00000000 00000000 00000000 00000000

    TX_CONTROL is 1, RX_CONTROL is 1, and TX_HDP[0] is 4a103050, but it never progresses. For packets sent before the special one, writing TX_HDP[0] causes the packet to be transmitted and the register clears, ready for the next packet.
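To make the dump easier to read, the sketch below holds its non-zero words in a small table and checks two things the post describes: a head descriptor is still pending, and the previous descriptor has completed. The addresses of TX_CONTROL, RX_CONTROL and TX_HDP[0] match the values quoted in the post. The TX0_CP address and the 16-byte descriptor size are assumptions based on a reading of the Linux davinci_cpdma driver, and should be checked against the TRM. All function and macro names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct reg_val { uint32_t addr; uint32_t val; };

/* Non-zero words from the dump above; every other word in the range reads 0. */
static const struct reg_val cpdma_nonzero[] = {
    { 0x4a100100u, 0x00180108u },
    { 0x4a100104u, 0x00000001u },   /* TX_CONTROL (value quoted in the post) */
    { 0x4a100110u, 0x00180108u },
    { 0x4a100114u, 0x00000001u },   /* RX_CONTROL (value quoted in the post) */
    { 0x4a100124u, 0x80000000u },
    { 0x4a100194u, 0x00000001u },
    { 0x4a1001a8u, 0x00000001u },
    { 0x4a1001acu, 0x00000001u },
    { 0x4a100200u, 0x4a103050u },   /* TX_HDP[0] (value quoted in the post) */
    { 0x4a100220u, 0x4a102670u },
    { 0x4a100240u, 0x4a103040u },
    { 0x4a100260u, 0x4a102660u },
};

#define REG_TX_CONTROL  0x4a100104u
#define REG_RX_CONTROL  0x4a100114u
#define REG_TX0_HDP     0x4a100200u
#define REG_TX0_CP      0x4a100240u  /* assumption: TX channel 0 completion pointer */
#define CPDMA_DESC_SIZE 16u          /* assumption: four 32-bit words per descriptor */

/* Look up one word of the dump by its bus address. */
uint32_t dump_read(uint32_t addr)
{
    for (size_t i = 0; i < sizeof(cpdma_nonzero) / sizeof(cpdma_nonzero[0]); i++)
        if (cpdma_nonzero[i].addr == addr)
            return cpdma_nonzero[i].val;
    return 0;
}

/* 1 if TX is enabled and a head descriptor is still waiting to be processed. */
int tx_head_pending(void)
{
    return (dump_read(REG_TX_CONTROL) & 1u) && dump_read(REG_TX0_HDP) != 0;
}

/* 1 if the last completed descriptor sits directly before the pending head. */
int prev_desc_completed(void)
{
    return dump_read(REG_TX0_HDP) - dump_read(REG_TX0_CP) == CPDMA_DESC_SIZE;
}
```

Under these assumptions both checks return 1 for this dump, which agrees with the description: the previous descriptor has completed and the next one is queued but not processed.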

  • John

    Can you provide more details on the nature of the packets: are they unicast or multicast/broadcast? Does the issue occur with only one of the packet types?

    If you have been referring to the Linux driver, please look up the portions of the init code that add unicast and multicast entries to the ALE.

     

    Also, if you are using only one port on the board, ensure that you configure only the relevant ports (port zero is the CPU port, and ports one and two are the downstream ports).

     

    Regards

    Sriram

  • The whole packet was included in the first post. It is a broadcast packet with ethernet protocol 0x8204. It is not an IP packet.
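As a small illustration of that classification, the sketch below inspects the first 14 bytes of a frame. A destination of all ones is broadcast, any other address with the least significant bit of the first octet set is multicast, and the EtherType is the big-endian value in bytes 12 and 13. The function and enum names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

enum mac_kind { MAC_UNICAST, MAC_MULTICAST, MAC_BROADCAST };

/* Classify a 6-byte destination MAC address. */
enum mac_kind classify_dest_mac(const uint8_t mac[6])
{
    static const uint8_t bcast[6] = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };

    if (memcmp(mac, bcast, 6) == 0)
        return MAC_BROADCAST;
    if (mac[0] & 0x01)              /* group bit set: multicast */
        return MAC_MULTICAST;
    return MAC_UNICAST;
}

/* EtherType: bytes 12 and 13 of the frame, most significant byte first. */
uint16_t frame_ethertype(const uint8_t *frame)
{
    return (uint16_t)((frame[12] << 8) | frame[13]);
}
```

Applied to the header of the dumped frame, this gives a broadcast destination with EtherType 0x8204.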

     

    I have other situations that produce a similar stall:

     

    1. Sending the broadcast packet initially described.

    2. After the link comes up under QNX, if pings are started right away, they seem to work. If the link comes up and sits idle for a while with no transmissions before ping starts, then transmission appears to be stalled.

    3. The stall has occurred at other times, but those scenarios have been difficult to characterize or reproduce. A ttcp benchmark test usually doesn't work; sometimes it looks like a stall and sometimes not.

     

    ALE initialization includes the following:

    1. enable the ale

    2. clear the ale table

    3. make it run in vlan unaware mode

    4. make the host port be ALE_PORT_STATE_FORWARD

    5. Add a unicast address for the MAC address of the host port, with ALE_SECURE set. (I tried it without ALE_SECURE set and it made no difference.)

    6. Add a multicast address for ff:ff:ff:ff:ff:ff for the host port

    7. For the slave ports: put slave port into ALE_PORT_STATE_FORWARD. I think this is done for both ports 1 and 2.

    8. Add a multicast address for ff:ff:ff:ff:ff:ff for each slave port (1 and 2). I think this changes the ALE entry so that ports 0, 1, and 2 are set for this address.
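The initialization steps above can be modeled in a few lines to show how the broadcast entry's port mask depends on the number of slave ports configured. This is a toy in-memory model, not hardware access: every struct and function name is hypothetical, and only the ALE_PORT_STATE_FORWARD name comes from the post (its value of 3 is an assumption taken from the Linux driver).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ALE_MODEL_ENTRIES 8   /* toy table size, far smaller than the real ALE */
#define ALE_NUM_PORTS     3   /* port 0 = host, ports 1 and 2 = slaves */

enum ale_port_state { ALE_PORT_STATE_DISABLE = 0, ALE_PORT_STATE_FORWARD = 3 };

struct ale_entry {
    int     used;
    uint8_t mac[6];
    uint8_t port_mask;        /* bit n set = forward to port n */
};

struct ale_model {
    struct ale_entry    entry[ALE_MODEL_ENTRIES];
    enum ale_port_state port_state[ALE_NUM_PORTS];
};

static const uint8_t ale_bcast[6] = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };

/* Find the entry for mac; optionally claim a free slot if none exists. */
static struct ale_entry *ale_lookup(struct ale_model *ale,
                                    const uint8_t mac[6], int create)
{
    struct ale_entry *free_slot = NULL;

    for (int i = 0; i < ALE_MODEL_ENTRIES; i++) {
        struct ale_entry *e = &ale->entry[i];
        if (e->used && memcmp(e->mac, mac, 6) == 0)
            return e;
        if (!e->used && free_slot == NULL)
            free_slot = e;
    }
    if (create && free_slot != NULL) {
        free_slot->used = 1;
        free_slot->port_mask = 0;
        memcpy(free_slot->mac, mac, 6);
        return free_slot;
    }
    return NULL;
}

/* Add ports to an address entry; an existing entry gets the new bits ORed in. */
int ale_add_addr(struct ale_model *ale, const uint8_t mac[6], uint8_t port_mask)
{
    struct ale_entry *e = ale_lookup(ale, mac, 1);
    if (e == NULL)
        return -1;
    e->port_mask |= port_mask;
    return 0;
}

/* Steps 1-8 from the list above, for num_slaves slave ports (1 or 2). */
void ale_model_init(struct ale_model *ale, const uint8_t host_mac[6],
                    int num_slaves)
{
    memset(ale, 0, sizeof(*ale));                   /* steps 1-3: start from an empty table */
    ale->port_state[0] = ALE_PORT_STATE_FORWARD;    /* step 4 */
    ale_add_addr(ale, host_mac, 1u << 0);           /* step 5 */
    ale_add_addr(ale, ale_bcast, 1u << 0);          /* step 6 */
    for (int port = 1; port <= num_slaves && port < ALE_NUM_PORTS; port++) {
        ale->port_state[port] = ALE_PORT_STATE_FORWARD;       /* step 7 */
        ale_add_addr(ale, ale_bcast, (uint8_t)(1u << port));  /* step 8 */
    }
}

/* Port mask currently stored for the broadcast address (0 if absent). */
uint8_t ale_bcast_mask(struct ale_model *ale)
{
    struct ale_entry *e = ale_lookup(ale, ale_bcast, 0);
    return e != NULL ? e->port_mask : 0;
}
```

In this model, initializing with two slaves leaves the broadcast entry with ports 0, 1 and 2 set (mask 0x07) and port 2 in the forward state. With one slave the mask is 0x03 and port 2 stays disabled.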

    I am only using one physical RJ45 port, so I infer from what you are saying that I should just use port 1 and not do any initialization for port 2? This is the TI EVM board for the 8148. I thought the U-Boot and Linux drivers initialized all the ports, but maybe I missed where they skip port 2.

    I have been trying to reproduce this under Linux by opening a raw, packet-based socket and sending the offending packet, but so far the socket() calls always fail with errno=97 (EAFNOSUPPORT), so I am not making much progress there.

    Thanks.

     

    John

     

     

  • I have tried a couple of changes:

     

    1. Changed it so only port 1 is initialized. Nothing is done with port 2.

    2. Changed the QNX interrupt service routine to just change RX_EN instead of using the hardware interrupt mask as I was doing before. At first, I had trouble with the non-hardware-mask routine, but I know a bit more now and could work around the previous problem.

    3. Turned off the debug messages.

     

    Now, the special packet from before no longer seems to cause a problem. QNET on QNX seems to be working. I have been running for several minutes without seeing the stall.

    I am not sure which of the above was most important. I am happy to see it working better though.

    I will do more tests and let you know how it goes.

     

    Thanks.

     

  • Hi

    Good to hear that it works for you now. I believe the fix is mostly due to configuring only one port. Please confirm once you have verified this.

    Regards

    Sriram

     


  • I changed the number of slaves back to 2 and the driver quit working. Changing it back to 1 makes it work again. It seems pretty conclusive that configuring the second slave port is the culprit.

     

    Thanks for your help.