Our system is designed to use Ethernet boot with the Keystone2, but booting has not been reliable. Only about 80% of the time are we able to successfully download u-boot. We have set up our system so that when u-boot fails to load within a set period of time, we toggle RESET on the Keystone2 to retry the boot process. The boot statistics we have recorded are:
loads u-boot over ethernet, with no retries – 53/67 – 79%
loads u-boot over ethernet, on the 1st retry (u-boot doesn’t load on the 1st attempt) – 11/67 – 16%
loads u-boot over ethernet, on the 2nd retry (u-boot doesn’t load on the 1st and 2nd attempts) – 2/67 – 3%
fails to load u-boot over ethernet after the 3rd retry (u-boot doesn’t load on the 1st, 2nd, and 3rd attempts) – 1/67 – 1.5%
It is plausible that, with a larger data set, we would continue to see roughly an 80% chance of a successful boot per attempt; an independent 80%-per-attempt model would predict 80%, 16%, 3.2%, and 0.8% for the four cases above, which is close to what we observe.
Our ethernet interface passes through an FPGA before going to a switch, and we monitor the traffic in the FPGA after converting the physical media to GMII. The concerning behavior we have seen is that whenever the boot fails, data errors are flagged on the GMII interface in the BOOTP request packet. We also see the Keystone2 re-send the BOOTP request after a designated sequence of timeouts. Every time the boot process fails, regardless of power cycles, resets, or timeouts, the BOOTP request packets sent by the Keystone2 all have data errors in the same exact byte positions, with the same bad data values. Although these errors are flagged by the FPGA transceivers as either disparity or “not in table” errors, the precise and consistent nature of the failure suggests that the Keystone2 is actually sending bad packets rather than this being a signal integrity issue.
Hi Yordan,
I am using mcsdk version 3.00.03.15
Does the mcsdk version matter here? Isn't the RBL the only software running until uboot is retrieved?
Thanks
-Guven
Zooming in on the first error, in the ethernet header, where the source MAC address is bad (it should be a0:f6:fd:a5:e5:10). We always see consecutive 0xA5’s when there is an error here. See figure 2.
Zooming in on the second error, in the payload, where the source MAC address is also bad. Sometimes this second instance of the MAC address is correct even when the first instance is bad. See figure 3.
Zooming in on the third error, near the end of the payload. This bad byte is always 0xC4 and is flagged as “not in table”. See figure 4.
Hi Rahul, I have attached a Wireshark log of a good ethernet boot. We have two servers that interact with the board here. I am running Wireshark on the server that serves the TFTP file. The KSII gets the DHCP response from another server, which gives it an IP address and uses the next_server argument to tell the KSII where to get the u-boot binary from, which is the server where Wireshark is running. Our KSII has a MAC address of a0:f6:fd:a5:e5:10.
For the other questions: yes, we are at room temperature, and the temperature sensors on the IO card with the KSII all read normal temperatures; the highest reading is 35.25 degrees Celsius.
In the failing case that we are trying to debug, the packets never make it out onto the network because they do not appear to be formed properly. So when this happens we see nothing in Wireshark.
Remove the .dat from the attachment's filename; the forum would not let me upload with the normal Wireshark file extension.
In these diagrams the GMII signals (gmii_rxd_i, gmii_rx_er_i, gmii_rx_dv_i) are delayed by a few clock cycles with respect to the transceiver status debug signals (gt_rxdisperr_dbg, gt_rxnotintable_dbg, rxchariscomma, rxcharisk). This is because we are tapping them off in chipscope at different levels of the logic. However, gmii_rx_er_i, gmii_rxd_i, and gmii_rx_dv_i are being displayed in synchronized time.
In the 2nd and 3rd diagrams, we are claiming that the second occurrence of 0xA5 is incorrect because the sequence of data in that part of the packet is clearly supposed to be the Keystone2’s MAC address. The gmii_rx_er_i signal is asserted because there is an 8b/10b disparity error, which can be seen flagged a few clock cycles earlier. That doesn’t mean that the data associated with it (0x10) is necessarily bad, though. It only means that the transceiver was expecting a positive or negative disparity encoding and received the opposite; the previous byte could still be at fault. 0xA5 has neutral disparity and the same symbol for both its positive and negative encodings, which means it cannot cause the disparity error. Here is what might be happening:
8B     10B-          10B+          Net disparity
-------------------------------------------------
0xA5   1010011010    same as 10B-  neutral
0x10   0110110100    1001001011    neutral
0xE5   1010011110    1010010001    +/-2
Before the two 0xA5s, we received 0xFD and are at a +1 running disparity count (the same idea applies if starting at -1). The expected sequence for a packet with the correct MAC address would be:
0xA5 -> 0xE5 -> 0x10
  0      -2      +0     (symbol’s disparity)
 +1      -1      -1     (running disparity count)
If we receive a rogue 0xA5 in the middle, but the 0x10 is still encoded with its positive symbol as if an 0xE5 had been sent, then the transceiver could be expecting the negative symbol for 0x10 even though it has neutral disparity:
0xA5 -> 0xA5 -> 0x10
  0       0      +0     (symbol’s disparity)
 +1      +1      +1     (running disparity count)
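To make the bookkeeping explicit, here is a small C sketch (mine, only for illustration, not an 8b/10b encoder) that walks the running disparity count through both sequences using the per-symbol disparities from the table above:

#include <stdio.h>

/* Per-symbol net disparity from the code-table entries quoted above:
 * 0xA5 (D5.5) neutral, 0xE5 (D5.7) +/-2, 0x10 (D16.0) neutral. */
struct sym { unsigned char byte; int disp; };

static void walk(const char *label, const struct sym *s, int n, int rd)
{
    printf("%s, starting running disparity %+d\n", label, rd);
    for (int i = 0; i < n; i++) {
        rd += s[i].disp;
        printf("  0x%02X  symbol %+d  running %+d\n", s[i].byte, s[i].disp, rd);
    }
}

int main(void)
{
    /* Expected tail of the MAC address after 0xFD: a5 e5 10 */
    const struct sym expected[] = { {0xA5, 0}, {0xE5, -2}, {0x10, 0} };
    /* What we actually capture when the boot fails: a5 a5 10 */
    const struct sym observed[] = { {0xA5, 0}, {0xA5, 0}, {0x10, 0} };

    walk("expected a5->e5->10", expected, 3, +1);
    walk("observed a5->a5->10", observed, 3, +1);
    /* In the expected case the count is -1 going into 0x10; in the observed
     * case it is still +1, so a 0x10 codeword chosen for the other running
     * disparity gets flagged even though 0x10 itself is net-neutral. */
    return 0;
}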
The 4th diagram is unrelated to the previous 2. It is a repeatable error that I have no theories on.
Does the K2H have a feature for sending out PRBS data on the SERDES Tx line such that we could capture an eye diagram?
Yes, clock is extracted from the data.
We do not think these symptoms are indicative of a signal integrity issue though. They are repeatable, and the same errors occur each time. We also test the same HSS lane in question at over 16 Gbps, albeit in a different mezzanine card configuration.
We had an internal discussion based on the data that you have provided on the E2E and looked into your wireshark log messages. However, these have not given us enough information to root cause the issue.
The Wireshark log that you provided is only for when the device boots up correctly. Do the packets that the FPGA detects as bad get filtered out? We would like to look at a Wireshark log from a failed boot, but it won't help if the packets are captured at the switch and the FPGA filters out the bad BOOTP packets. We have compiled a set of questions that Randy will forward to you.
In the meantime, we want to see if we can replicate your setup on our evaluation platform. For this we would need your BOOTMODE settings and PG version so that we can run the same tests. Our EVM is hooked up so that the PHY is the master and the K2 is the slave in auto-negotiation mode, so we may have some restrictions on what we can try, but getting your bootmode settings will definitely help.
If you have a JTAG connection to the KS2, you can run the GEL file that I have attached here and provide us a log.
This reads the JTAGID, the BOOTMODE pins, and the MACID programmed in your parts. If you don't have a JTAG connection, you may be able to read these registers from u-boot or Linux at the physical addresses provided in the GEL file.
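As an illustration only (the actual register addresses should be taken from the GEL file; the base address below is just a placeholder, and this assumes /dev/mem access is enabled in your kernel), a minimal Linux read of one of these physical addresses could look roughly like this:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    off_t phys = 0x02620000;                 /* placeholder; use GEL file address */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    int fd = open("/dev/mem", O_RDONLY | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    /* Map the page containing the register, then index to the word. */
    volatile uint32_t *map = mmap(NULL, page, PROT_READ, MAP_SHARED,
                                  fd, phys & ~(off_t)(page - 1));
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    uint32_t val = map[(phys & (off_t)(page - 1)) / 4];
    printf("0x%08lX = 0x%08X\n", (unsigned long)phys, val);

    munmap((void *)map, page);
    close(fd);
    return 0;
}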
Regards,
Rahul
Hi Rahul,
I ran the GEL file, and I also added a couple of register reads: MM_REVID and PSC_VCNTLID (the define was there, but there was no read function). Here is the output:
C66xx_0: GEL Output: *******************************************************************************************************
C66xx_0: GEL Output: ********************************** KEPLER EFUSE STATUS and SNAPSHOT ******************************
C66xx_0: GEL Output: *******************************************************************************************************
C66xx_0: GEL Output: DEVSTAT ---> 0x02000001
C66xx_0: GEL Output: PSC_VCNTLID ---> 0x002F0000
C66xx_0: GEL Output: MM_REVID ---> 0x00090003
C66xx_0: GEL Output: BOOTCFG_JTAGID ---> 0x2B98102F
C66xx_0: GEL Output: ********************************** Kepler MACID Register (MACID) ************************************
C66xx_0: GEL Output: MACID[31:0] ---> 0xFDA5E510
C66xx_0: GEL Output: MACID[32:47] ---> 0xA0F6
C66xx_0: GEL Output: BCAST[16](Broadcast Reception) ---> Broadcast
C66xx_0: GEL Output: BCAST[17](MAC Flow Control) ---> Off
C66xx_0: GEL Output: CHECKSUM[24:31] ---> 0x6F
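As a quick sanity check, the two MACID fields above assemble to the same address we see in our BOOTP captures (a0:f6:fd:a5:e5:10). A trivial host-side snippet (not part of the GEL file) that does the assembly:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t macid_lo = 0xFDA5E510;  /* MACID[31:0]  from the GEL dump */
    uint32_t macid_hi = 0xA0F6;      /* MACID[47:32] (second field above) */

    uint8_t mac[6] = {
        (uint8_t)(macid_hi >> 8), (uint8_t)macid_hi,
        (uint8_t)(macid_lo >> 24), (uint8_t)(macid_lo >> 16),
        (uint8_t)(macid_lo >> 8),  (uint8_t)macid_lo
    };

    printf("MAC = %02x:%02x:%02x:%02x:%02x:%02x\n",
           mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
    return 0;
}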
This session is over JTAG, and I booted the board into SLEEP mode. Whenever I have to use JTAG debug, that is the mode I boot into; is that OK?
I uploaded a Wireshark capture of a good boot earlier in the thread; if that is not sufficient, let me know and I will try again.
When we boot into ethernet mode, this is what our BOOTMODE settings are :
devstat: 0x6CEB bootmode: 0x3675 (DEVSTAT [16:1] bits map to BOOTMODE[15:0])
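For reference, the mapping is just a one-bit shift; a tiny sketch of how we derive the bootmode value from DEVSTAT:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t devstat  = 0x6CEB;
    /* DEVSTAT[16:1] -> BOOTMODE[15:0] */
    uint16_t bootmode = (uint16_t)((devstat >> 1) & 0xFFFF);
    printf("devstat 0x%04X -> bootmode 0x%04X\n",
           (unsigned)devstat, (unsigned)bootmode);  /* 0x6CEB -> 0x3675 */
    return 0;
}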
Thanks for providing the GEL dump. It looks like you are using PG 2.0 silicon, and your boot configuration is:
Can you confirm our understanding of the clocking in your setup, and also the link connection?
Regards,
Rahul
Hi Rahul,
Our FPGA is using the same reference clock source for gigabit ethernet that the Keystone2 is using. We weren't able to get auto-negotiation working between the FPGA and Keystone2, which is why we are using the forced max speed setting.
The FPGA does not tamper with the CRC/FCS sent by the Keystone2. However, the FPGA decodes the data for monitoring and then re-encodes it. Since the FPGA believes it is receiving packets with encoded data not in the 8b/10b table, the bad bytes are really just arbitrary values that still get re-encoded and passed to the switch, so those packets are guaranteed to arrive at the switch with FCS errors. I have implemented a CRC check within the FPGA to verify this, and on good packets I do get the magic 0xC704DD7B value.
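For anyone who wants to reproduce that check in software rather than in RTL, here is a rough C sketch of the byte-wise equivalent (an illustration only, not our FPGA logic). It uses the standard reflected CRC-32; over a good frame including the FCS the register should land on 0xDEBB20E3, which is the bit-reversed form of the 0xC704DD7B residue that MSB-first hardware checkers commonly report (after the final inversion the value is 0x2144DF1C):

#include <stddef.h>
#include <stdint.h>

/* Standard reflected CRC-32 (poly 0x04C11DB7, reflected form 0xEDB88320). */
static uint32_t crc32_update(uint32_t crc, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0u);
    }
    return crc;
}

/* frame = DA..payload..FCS, in wire byte order as seen on GMII
 * (preamble/SFD stripped). Call as fcs_is_good(frame, frame_len). */
static int fcs_is_good(const uint8_t *frame, size_t len)
{
    uint32_t reg = crc32_update(0xFFFFFFFFu, frame, len);
    return reg == 0xDEBB20E3u;   /* equivalently, ~reg == 0x2144DF1C */
}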
-Jonny
Another test we just tried was to look at ethernet statistics once the Keystone2 successfully boots into Linux. We used scp to transfer some large files over the network, and never saw any errors or dropped packets being reported. This gives further credence to the possibility that the transceiver settings on the Keystone2 are a little off when running from the boot ROM.
Rahul, isn't that the default boot when dipswitch is set to 0101? I thought that it wasn't close enough to our configuration.
I have a question about the keystone2_boot_examples; I'm not sure if this is the right place to ask, but maybe we can get it to the right people if I write it here:
In the keystone2_boot_examples, the K2E multi-stage ethernet boot example makes use of a source file named ethWard.c, but the K2H example does not. Is there a reason for this? I can get some of the examples to work on the EVMK2H, for example the single- and multi-stage UART ones. I have not had any success with the K2H multi-stage ethernet boot, though. I am not able to get the eval board to download the uartImage1.bin I generated via xmodem, even though the other UART examples work fine using dipswitch setting 0100. Our own design does download the uartImage1.bin file, but nothing happens after that. I feel like something is missing from this example, or I don't quite understand it, and the ethWard.c that is in the K2E example but not the K2H one is also confusing me.
Should I start a separate thread for this?
Thanks
-Guven
Jonny,
I am referring back to this picture:
This diagram is not consistent with the Ethernet specification. That may be contributing to the data corruption, as the ports may not be interpreting the data the way you expect them to. As has been said previously, the port in the FPGA adjacent to the DSP is a PHY port. However, as drawn above, this PHY block has GMII connected internally and the media side facing the DSP. (PCS and PMA are layers in a PHY between a media independent interface port and a media port.) Even though you refer to this as SGMII, it might not be. It appears to me to be a 1000BASE-X media port, not an SGMII media independent interface port. There are many similarities, but they are not identical. From an electrical point of view the SGMII interface is very similar to the 1000BASE-X interface: both use 8B/10B encoding, a serial interface, and an embedded clock. However, these two interfaces are located in distinctly different places within the Ethernet stack and cannot be mixed. Systems can operate with SGMII connected to a media port, but they are not guaranteed to operate, as they are not consistent with the Ethernet standard. Rather than a 'PHY block' in the FPGA, you need a 'MAC block' that supports SGMII. Then you configure one as an SGMII master and one as an SGMII slave to achieve a compliant interface. I would assume that the connection to the switch could be like this too, although perhaps the switch ports can be true 1000BASE-X media ports.
Tom
Hi Tom,
I am no Ethernet expert so I am open to having a discussion about this, but I don't believe we are doing anything that is non-compliant with Ethernet spec.
Allow me to clarify the diagram I posted. Everything in the diagram exists entirely on a single board, so technically everything in it is media independent. The "1/2.5G Ethernet PCS/PMA Core" may be confusing you because of its name. It is a verified piece of IP from Xilinx that can be configured to support either Cisco's SGMII standard or 1000BASE-X, since the two are so similar at the physical level. Note again that a PCS and PMA are not media dependent. The SGMII and 1000BASE-X labels in the diagram are only describing the media independent aspects of those interfaces. We are not playing around in different layers of the Ethernet stack. Only within the Ethernet switch (Marvell 88E6185) do we move to the next layer of a MAC, and only on a different switch port do we have a media dependent port (1000BASE-KX copper backplane I believe).
Do you think we could get error-free secure copies of 22 GB files once we boot into Linux if our connection were not Ethernet compliant? Does the fact that we only see issues when executing code from the Keystone2's boot ROM not concern you?
-Jonny
Jonny,
We have numerous other customers with boards in production that use the RBL Ethernet boot over SGMII. Because of this we are exploring your unique implementation to determine how it could result in corrupted packets during RBL boot. You indicate that once booted, Ethernet transfers are robust. Therefore, we need to explore what is different in your system during the boot processes that fail.
Tom
Jonny,
No, I was considering that there was something different between the trials that booted and then transferred data versus those that did not boot. Perhaps something adaptive that converged to a different operating point. One possibility is SERDES equalization that is adaptive. There could be other adaptive system components.
Tom