66AK2H14: Bad bootp packets on ethernet boot mode

Part Number: 66AK2H14

Our system is designed to use ethernet boot with the Keystone2, but we have not been experiencing reliable booting. Only about 80% of the time are we able to successfully download u-boot. We have our system set up such that when u-boot fails to load after a period of time, we toggle RESET on the Keystone2 to re-try the boot process. The boot statistics we have recorded are:

loads u-boot over ethernet, with no retries – 53/67 – 79%
loads u-boot over ethernet, on 1st retry (u-boot doesn’t load on 1st attempt) – 11/67 – 16%
loads u-boot over ethernet, on 2nd retry (u-boot doesn’t load on 1st and 2nd attempt) – 2/67 – 3%
fails to load u-boot over ethernet, 3rd retry (u-boot doesn’t load on 1st, 2nd, 3rd attempt) – 1/67 – 1.5%

It is plausible to imagine that given a larger set of data, we would continue to see about an 80% chance for a successful boot per re-try.
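
As a rough check of that claim, the counts above can be compared with a simple geometric model in which every attempt succeeds independently with the same probability. This is only a sketch: the value `p = 0.8` is the assumed per-attempt success rate, and the last row is interpreted here as "still not booted after three attempts", which is an assumption about how the statistics were recorded.

```python
# Observed outcomes from the statistics above: booted on attempt 1, 2, 3,
# and not booted after three attempts (interpretation assumed).
observed = [53, 11, 2, 1]
total = sum(observed)          # 67 recorded boot sequences
p = 0.8                        # assumed per-attempt success probability

# Geometric model: first success on attempt k has probability p * (1 - p)**(k - 1).
probs = [p * (1 - p) ** k for k in range(3)]
probs.append((1 - p) ** 3)     # all three attempts fail

expected = [total * q for q in probs]

labels = ["attempt 1", "attempt 2", "attempt 3", "not booted"]
for label, obs, exp in zip(labels, observed, expected):
    print(f"{label:>10}: observed {obs:2d}, expected {exp:5.2f}")
```

With only 67 samples the agreement cannot prove independence between attempts, but the observed counts sit close to what a constant 80% per-attempt success rate would predict.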

Our ethernet interface passes through an FPGA before going to a switch, where we monitor traffic after converting the physical media to GMII. The concerning behavior we have seen is that whenever the boot fails, we see data errors flagged on the GMII interface in the BOOTP request packet. We also see the Keystone2 re-send the BOOTP request after a designated sequence of timeouts. Every time the boot process fails, regardless of power cycles or resets or timeouts, the BOOTP request packets sent by the Keystone2 all have data errors in the same exact byte positions, with the same bad data values. Although these errors are flagged by the FPGA transceivers as either disparity or “not in table”, the precise and consistent nature of the failure suggests that the Keystone2 is actually sending bad packets rather than there being a signal integrity issue.

  • Hi,

    Which SDK are you using?

    Best Regards,
    Yordan
  • Hi Yordan,

    I am using mcsdk version 3.00.03.15

    Does the mcsdk version matter here? Isn't the RBL the only software running until uboot is retrieved?

    Thanks

    -Guven

  • Guven,

    Do you have monitor tools on the SGMII interface directly from the 66AK2H14?

    Are the boot statistics consistent for a single given board? Are some boards more reliable or less reliable than others?

    The Ethernet boot process is used reliably in many production systems.

    Regards,
    RandyP
  • Randy,

    The SGMII interface connects to an FPGA, which is where we monitor the packets. That is where we are able to observe those 'bad' packets.

    The boot statistics seem pretty consistent across the boards I have been using, they all show about 80% success.

    What areas do you think we should look into? We have confirmed CVDD looks ok, and we seem to be following the boot waveform in section 11.2 of the user guide, SPRS866F.

    We are open to looking into areas where we might be missing something. It just seems odd to us that it works most of the time.

    Thanks

    -Guven
  • View of the entire BOOTP req packet.  There are 4 cycles where rx_er is asserted, but the last one is simply calling for carrier extension (occurs after rx_dv goes to 0). See figure 1.

    Zooming in on first error in the ethernet header, where source MAC addr is bad (should be a0:f6:fd:a5:e5:10).  We always see consecutive 0xA5’s when there is an error here. See figure 2.

    Zooming in on second error in the payload where source MAC addr is also bad.  Sometimes this second instance of the MAC addr is correct, even when the first instance is bad. See figure 3.

    Zooming in on the third error near the end of the payload.  This bad piece of data is always 0xC4 and is flagged as not in table.  See figure 4.
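
The distinction between the three real errors and the final carrier-extension cycle follows from how GMII encodes receive status on rx_dv, rx_er and rxd. A minimal sketch of that decoding is below; the function name and the sample cycles are hypothetical and are not taken from the capture, and only the encodings needed for this discussion are decoded.

```python
def classify_gmii_rx(rx_dv: int, rx_er: int, rxd: int) -> str:
    """Classify one GMII receive clock cycle (simplified decoding)."""
    if rx_dv and rx_er:
        return "data reception error"   # error flagged inside a frame
    if rx_dv:
        return "normal data"
    if rx_er and rxd == 0x0F:
        return "carrier extend"         # rx_er without rx_dv, RXD = 0x0F
    if rx_er and rxd == 0x0E:
        return "false carrier"
    if rx_er:
        return "other (not decoded here)"
    return "inter-frame"

# Illustrative cycles only (rx_dv, rx_er, rxd); not values from the capture.
samples = [(1, 0, 0xA0), (1, 1, 0xA5), (0, 1, 0x0F), (0, 0, 0x00)]
for dv, er, data in samples:
    print(f"rx_dv={dv} rx_er={er} rxd=0x{data:02X} -> {classify_gmii_rx(dv, er, data)}")
```

Only cycles with rx_dv and rx_er both asserted indicate corrupted frame data, which is why the fourth rx_er assertion is not counted as an error.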

     

     

  • This is a block diagram representing our configuration:

  • Thanks for providing details using a block diagram and packet analyzer, but I’m not sure I understand the nature of the error. On the KSII side, if the Ethernet CRC passes then (most likely) the data in the packet was the data that was meant to be sent. There is no reason why the ROM would send a packet with invalid data. Are you performing all the experiments at room temperature?

    Can you provide a Wireshark log of the BOOTP packets received by the FPGA which it reports to be bad? Also, a similar Wireshark log when the boot passes successfully will be useful for us, and this will help us look at the packets sent over the SGMII TX side.

    Regards,
    Rahul
  • Rahul,

    The FPGA does not implement a MAC or check the CRC of the incoming traffic. It simply passes the traffic through from one port to another. Guven should be able to provide you with Wireshark logs of the BOOTP packets once they make their way out of the FPGA, through an ethernet switch, and onto our network. I do believe that there will be a frame check sequence error on these bad BOOTP requests. What we have provided you is a much lower level view of the same thing that Wireshark would see.

    While there is certainly no reason for the KSII to send packets with invalid data, that does not preclude the possibility of it happening. What explanations do you have for why we repeatedly see the same incorrect MAC addresses? We are performing these experiments at room temperature.

    -Jonny
  • Hi Rahul, I have attached a wireshark log of a good ethernet boot.  We have 2 servers that interact with the board here.  I am running wireshark from the server that serves the tftp file.  The KSII gets the DHCP response from another server, which gives it an IP address, and also uses the next_server argument to tell the KSII where to get the uboot binary from, which is the server where wireshark is run from.  Our KSII has a MAC address of a0:f6:fd:a5:e5:10

    For the other questions, yes we are at room temperature, and the temperature sensors on the IO card with the KSII are all normal temperatures, the highest read is 35.25 degrees celsius.

    In the case that we are trying to debug, the packets never make it out onto the network, due to this issue where it looks like the packets are not formed properly.  So, when this happens, we see nothing in wireshark, as the packets never make it out onto our network.

    Remove the .dat from the attachments filename, the forum would not let me upload with normal wireshark filename extension.

     good_uboot_eth.pcapng.dat

  •  

    In these diagrams the GMII signals (gmii_rxd_i, gmii_rx_er_i, gmii_rx_dv_i) are delayed by a few clock cycles with respect to the transceiver status debug signals (gt_rxdisperr_dbg, gt_rxnotintable_dbg, rxchariscomma, rxcharisk).  This is because we are tapping them off in chipscope at different levels of the logic.  However, gmii_rx_er_i, gmii_rxd_i, and gmii_rx_dv_i are being displayed in synchronized time.

     

    In the 2nd and 3rd diagrams, we are claiming that the second occurrence of 0xA5 is incorrect because the sequence of data in that part of the packet is clearly supposed to be the Keystone2’s MAC address.  The gmii_rx_er_i signal is asserted because there is an 8b/10b disparity error, which can be seen flagged a few clock cycles earlier.  That doesn’t mean that the data associated with it (0x10) is necessarily bad though.  It only means that the transceiver was expecting a positive or negative disparity encoding and received the opposite.  The previous byte could still be at fault.  0xA5 has neutral disparity and the same symbol for both its positive and negative encoding, which means it cannot cause the disparity error. Here is what might be happening:

     

    8B    10B-        10B+         Net disparity
    -------------------------------------------------
    0xA5  1010011010  same as neg  neutral
    0x10  0110110100  1001001011   neutral
    0xE5  1010011110  1010010001   +/- 2

     

    Before the two 0xA5s, we received 0xFD and are at a +1 disparity count (the same idea applies if starting at -1).  The expected sequence of a packet with the correct MAC address would be:

     

    0xA5 -> 0xE5 -> 0x10

    0      -2      +0      (symbol’s disparity)

    +1      -1      -1      (running disparity count)

     

    If we receive a rogue 0xA5 in the middle but the 0x10 is still encoded with its positive symbol as if a 0xE5 was sent, then the transceiver could be expecting the negative symbol for 0x10 even though it has neutral disparity:

     

    0xA5 -> 0xA5 -> 0x10

    0       0      +0

    +1      +1      +1
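
The hypothesis above can be sketched with a simplified receiver-side running-disparity check. This is not how any particular transceiver implements it: the rule used here (a non-neutral 6-bit or 4-bit sub-block must have the opposite sign to the current running disparity) is a simplification, and the function names are hypothetical. The symbols are the ones from the table, with the 0x10 symbol taken from the 10B- column, which is what a transmitter would choose after 0xE5 leaves the running disparity negative.

```python
def block_disparity(bits: str) -> int:
    """Ones minus zeros in a sub-block."""
    return bits.count("1") - bits.count("0")

def check_running_disparity(symbols, rd=+1):
    """Return (symbol, error_flag, rd_after) for each 10-bit symbol.

    Simplified rule: a non-neutral sub-block whose sign matches the
    current running disparity is flagged as a disparity error.
    """
    results = []
    for sym in symbols:
        error = False
        for block in (sym[:6], sym[6:]):       # 6-bit then 4-bit sub-block
            d = block_disparity(block)
            if d != 0:
                if (d > 0) == (rd > 0):
                    error = True
                rd = +1 if d > 0 else -1
        results.append((sym, error, rd))
    return results

A5 = "1010011010"          # 0xA5, neutral
E5 = "1010010001"          # 0xE5 symbol with net disparity -2
X10 = "0110110100"         # 0x10 symbol from the 10B- column

good = check_running_disparity([A5, E5, X10])    # correct MAC bytes
bad = check_running_disparity([A5, A5, X10])     # rogue 0xA5 replaces 0xE5

for name, run in (("expected", good), ("observed", bad)):
    print(name, [(sym, "ERR" if err else "ok", rd) for sym, err, rd in run])
```

Under these assumptions the expected sequence passes cleanly, while the sequence with the rogue 0xA5 flags the error on the 0x10 symbol, consistent with where gmii_rx_er_i is seen asserted.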

     

    The 4th diagram is unrelated to the previous 2.  It is a repeatable error that I have no theories on.

     

    Does the K2H have a feature for sending out PRBS data on the SERDES Tx line such that we could capture an eye diagram?

     

    Yes, clock is extracted from the data.

     

    We do not think these symptoms are indicative of a signal integrity issue though.  They are repeatable, and the same errors occur each time.  We also test the same HSS lane in question at over 16 Gbps, albeit in a different mezzanine card configuration.

     

  • In the 4th data capture screenshot posted above, the sequence of data in the payload is ascii code for the following string: "TCI c66x Bootp Boot= a0f6fda5e510".

    The 0xC4 byte should be 0x6F (second 'o' in "Boot"). I have no theories on this failure yet.
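
For reference, a small sketch comparing the expected and observed byte values at the bit level. It only restates the values given above; the variable names are illustrative.

```python
text = "TCI c66x Bootp Boot= a0f6fda5e510"
payload = text.encode("ascii")      # every byte of the intended string is 7-bit ASCII

expected = ord("o")                 # 0x6F
observed = 0xC4                     # value seen in the capture
diff = expected ^ observed          # bits that differ between the two
flipped = bin(diff).count("1")

print(f"expected 0x{expected:02X} = {expected:08b}")
print(f"observed 0x{observed:02X} = {observed:08b}")
print(f"xor      0x{diff:02X} = {diff:08b} ({flipped} bits differ)")
print("observed byte is valid ASCII:", observed < 0x80)
```

Five of the eight bits differ, so this does not look like a single-bit flip of the intended character.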
  • We had an internal discussion based on the data that you have provided on the E2E and looked into your wireshark log messages.  However, these have not given us enough information to root cause the issue.

    The Wireshark log that you provided is only for when the device boots up correctly. Do the packets that get detected as bad by the FPGA get filtered out? We would like to look at a Wireshark log when the boot fails, but it won't help if the packets are captured at the switch and the FPGA filters out the bad BOOTP packets. We have compiled a set of questions that Randy will forward to you.

    In the meantime, we want to try and see if we can replicate your setup with our evaluation platform. For this we would need your BOOTMODE settings and PG version so that we can run the same tests. Our EVM is hooked up so that the PHY is the master and K2 is the slave in auto-negotiation mode, so we may have some restrictions on what we can try, but getting your bootmode settings will definitely help.

    If you have a JTAG connection to the KS2, you can run the GEL that I have attached here and provide us a log.

    K2H_register_dump.gel

    This reads the JTAGID, BOOTMODE pins and the MACID programmed in your parts. If you don't have a JTAG connection, you may be able to read these register addresses from u-boot or Linux by reading from the physical addresses provided in the GEL file.

    Regards,

    Rahul

  • Hi Rahul,

    I ran the GEL file, I also added a couple of register reads too: MM_REVID, PSC_VCNTLID (define was there, but no read function.)  Here is the output:

    C66xx_0: GEL Output: *******************************************************************************************************
    C66xx_0: GEL Output: ********************************** KEPLER EFUSE STATUS and SNAPSHOT ******************************
    C66xx_0: GEL Output: *******************************************************************************************************

    C66xx_0: GEL Output: DEVSTAT ---> 0x02000001
    C66xx_0: GEL Output: PSC_VCNTLID ---> 0x002F0000
    C66xx_0: GEL Output: MM_REVID ---> 0x00090003
    C66xx_0: GEL Output: BOOTCFG_JTAGID ---> 0x2B98102F

    C66xx_0: GEL Output: ********************************** Kepler MACID Register (MACID) ************************************
    C66xx_0: GEL Output: MACID[31:0] ---> 0xFDA5E510
    C66xx_0: GEL Output: MACID[32:47] ---> 0xA0F6
    C66xx_0: GEL Output: BCAST[16](Broadcast Reception) ---> Broadcast
    C66xx_0: GEL Output: BCAST[17](MAC Flow Control) ---> Off
    C66xx_0: GEL Output: CHECKSUM[24:31] ---> 0x6F
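
As a quick cross-check, the two MACID fields in this dump can be combined into the MAC address quoted earlier in the thread. A minimal sketch, assuming the 16-bit field holds the upper two bytes of the address (variable names are illustrative):

```python
macid_low = 0xFDA5E510    # MACID[31:0] from the GEL output
macid_high = 0xA0F6       # upper 16 bits from the GEL output

mac_int = (macid_high << 32) | macid_low
mac = ":".join(f"{b:02x}" for b in mac_int.to_bytes(6, "big"))
print(mac)
```

The result matches a0:f6:fd:a5:e5:10, the address the BOOTP request is expected to carry.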

    This session is from jtag, and I booted the board into SLEEP mode.  Whenever I have to use JTAG debug, that is the mode I boot into, is that ok?  

    I uploaded a wireshark of a good boot earlier in the thread, if that is not sufficient let me know, I will try again.

    When we boot into ethernet mode, this is what our BOOTMODE settings are : 

    devstat: 0x6CEB bootmode: 0x3675 (DEVSTAT [16:1] bits map to BOOTMODE[15:0])
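
The mapping stated above (DEVSTAT[16:1] to BOOTMODE[15:0]) is a one-bit right shift followed by a 16-bit mask. A minimal sketch, with a hypothetical helper name:

```python
def bootmode_from_devstat(devstat: int) -> int:
    """Extract BOOTMODE[15:0] from DEVSTAT[16:1]."""
    return (devstat >> 1) & 0xFFFF

devstat = 0x6CEB
bootmode = bootmode_from_devstat(devstat)
print(f"devstat 0x{devstat:04X} -> bootmode 0x{bootmode:04X}")
```

This reproduces the 0x3675 value quoted for the ethernet boot configuration.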

  • Thanks for providing the GEL dump, looks like you are using PG 2.0 silicon and your boot configuration is

    •  ARM Little Endian Ethernet boot mode.
    •  SYSPLL uses input clock of 122.88 MHz and ARM PLL uses input clock of 312.5 MHz.
    •  External connection is forced to maximum speed.
    •  Reference clock of 156.25 MHz
    •  PA CLK is at same reference as core reference

    Can you confirm our understanding of the clocks in your setup, and also the link connection?

    Regards,

    Rahul

  • Hi Rahul,

    Yes, that is also our understanding of the bootmode bits.

    -Guven
  • Rahul,

    I ran the GEL script on another board we have, and all of the values were the same except for this one:

    C66xx_0: GEL Output: PSC_VCNTLID ---> 0x002E0000
  • Guven,

    The intent of the question regarding (bootmode-clocksetting) was to confirm that your system is using the clock setup that matches your boot switch settings. Can you confirm? We have an Ethernet PHY on our evaluation platform, so the SOC acts as the slave and uses the link parameters determined through auto-negotiation. I am checking internally if we can test this setup with a forced link and I will get back to you.

    The other question that we had was what happens to the packets which the FPGA determines are bad packets. Do they get filtered out? Have you run any CRC check experiments to see how the FPGA handles the CRC?

    Regards,
    Rahul
  • Rahul,

    Yes that is what we are using for clocks, so it does match our setup.

    I believe the bad packets never make it past the switch, so we do not see them on our network or in Wireshark.

    I will let Jonny answer the CRC question; I am not sure if he was looking into this.

    Guven
  • Hi Rahul,

    Our FPGA is using the same reference clock source for gigabit ethernet that the Keystone2 is using.  We weren't able to get auto-negotiation working between the FPGA and Keystone2, which is why we are using the forced max speed setting.

    The FPGA does not tamper with the CRC/FCS sent by the Keystone2.  However, the FPGA is decoding the data for monitoring, then re-encoding it.  Since the FPGA believes it is receiving packets with encoded data not in the 8b/10b table, the bad bytes are really just arbitrary values that still get re-encoded and passed to the switch.  There are guaranteed to be FCS errors on those packets when they arrive at the switch.  I have implemented a CRC to see this within the FPGA, and on good packets I do get the magic 0xC704DD7B value.
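
The "magic" value is the Ethernet CRC-32 residue: running the CRC over an intact frame including its FCS always leaves the same constant. The sketch below illustrates this with Python's `zlib.crc32`; the frame contents are an arbitrary stand-in, and the bit ordering in which a given FPGA CRC core reports the residue is an assumption (here, the bit-reversed, pre-inversion register value).

```python
import zlib

def bit_reverse32(x: int) -> int:
    """Reverse the bit order of a 32-bit value."""
    return int(f"{x:032b}"[::-1], 2)

frame = bytes(range(60))                      # arbitrary stand-in for a frame without FCS
fcs = zlib.crc32(frame)                       # Ethernet FCS of the frame
on_wire = frame + fcs.to_bytes(4, "little")   # FCS appended least significant byte first

check = zlib.crc32(on_wire)                   # same constant for any intact frame
register = check ^ 0xFFFFFFFF                 # CRC register before the final inversion
residue = bit_reverse32(register)             # same value in the opposite bit order

print(f"crc over frame+FCS: 0x{check:08X}")
print(f"raw register:       0x{register:08X}")
print(f"bit-reversed:       0x{residue:08X}")

# A single corrupted byte changes the result, so the residue check fails.
corrupted = bytearray(on_wire)
corrupted[10] ^= 0xFF
print("corrupted frame passes:", zlib.crc32(bytes(corrupted)) == check)
```

Any frame whose bytes were altered between the Keystone2 and the checker, as in the failing BOOTP requests, will not produce this constant.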

    -Jonny

  • Another test we just tried was to look at ethernet statistics once the Keystone2 successfully boots into Linux.  We used scp to transfer some large files over the network, and never saw any errors or dropped packets being reported.  This gives further credence to the possibility that the transceiver settings on the Keystone2 are a little off when running from the boot ROM.

  • Jonathan, Guven,

    We performed Ethernet boot reset tests to see if we observe any kind of boot issues with this boot mode. The Ethernet boot setup and the host Wireshark and TFTP settings are as described here:
    processors.wiki.ti.com/.../KeystoneII_Boot_Examples

    The EVM boot configuration setting (for DEVSTAT =0x115EEB) was used:
    • ARM Little Endian Ethernet boot mode.
    • SYSPLL and ARM PLL use input clock of 122.88 MHz.
    • External connection is slave with auto-negotiation.
    • Reference clock of 125 MHz
    • PA clocked at the same reference as the SerDes reference.

    We reset the SOC around 30+ times using the BMC console command “reboot por” and confirmed that we observe BOOTP on every reset.

    Here are some comments that we got from our hardware team :
    SGMII is a MAC-PHY interface which includes a master-slave protocol for activating the link. Configuring our MAC for ‘forced’ mode means that it is no longer compliant with SGMII. What mode is the FPGA port? Is it a MAC port? Does it support SGMII? ‘Forced’ ports will only function if they are configured identically.

    Regards,
    Rahul
  • Rahul, isn't that the default boot when dipswitch is set to 0101?  I thought that it wasn't close enough to our configuration.

    I have a question about the keystone2_boot_examples, not sure if this is the right place to ask though.  Maybe we can get it to the right people if I write it here:

    In the keystone2_boot_examples, the k2e examples multi stage ethernet boot makes use of a source file named ethWard.c, but the k2h does not, is there a reason for this?  I can get the examples to work on the EVMK2H for some of the examples, single and multi stage UART for example.  I have not had any success with the k2h multi stage ethernet boot though.  I am not able to get the eval board to download the uartImage1.bin I generated via xmodem.  The other UART examples work ok though, using dipswitch setting 0100. Our design does download the uartImage1.bin file, but nothing happens after that.  I feel like there is something missing with this example, or I don't quite understand it, also this ethWard.c that is in the k2e, but not in the k2h is confusing me.

    Should I start a separate thread for this?

    Thanks

    -Guven

  • Rahul,

    Our FPGA port is a PHY port, not a MAC port. It is able to handle SGMII or 1000BASE-X, both with auto-negotiation optionally turned on or off. If we set the Keystone2's external connection to slave with auto-negotiation (devstat[15:0] = 0x5CEB) and turn on auto-negotiation on our side, we still see the same errors described earlier in this thread.

    We are aware that you can run on an evaluation board with default settings and never see our issue. What else can you guys do to help us debug this? Have we not convinced you that the link is configured properly given that we have no ethernet issues at any time except when executing from boot ROM? If not, what other experiments can we run for you?

    Guven is attempting to do a dual stage boot (1st stage UART, 2nd stage ethernet) to see if we can trick the SGMII transceiver into using better settings. Even if this works though, it would not be a viable long term solution.

    -Jonny
  • Guven,

    The K2E device has a Boot ROM erratum that prevents the device from booting directly from Ethernet. You can refer to the details in Advisory 25 provided in the errata here:
    www.ti.com/.../sprz417b.pdf

    Due to this erratum, we are not recommending direct Ethernet boot on K2E and K2L devices. This issue doesn't impact K2H silicon, hence the EthWard code that is the workaround for this erratum is not required on K2H devices.

    Regards,
    Rahul
  • Rahul,

    Advisory 25 sounds awfully close to the problems we are seeing, although we believe the problem is on the transmit rather than receive.

    Is it possible that issue was mis-diagnosed and also present on K2H devices?

    -Jonny
  • Advisory 25 also claims that the issue was caused by uninitialized values in the boot ROM code, which has nothing to do with the silicon.
  • Advisory 25 in the K2E errata only applies to K2E devices and not to K2H devices. We treat Boot ROM errata issues as silicon issues even though it could be argued that ROM is software. This is reported as an erratum against the Boot ROM version but was introduced due to a different PA sub-system being used inside the NETCP IP on K2E devices. There was an uninitialized variable in the PA subsystem on K2E. This variable did not exist in the PA subsystem on K2H devices, so we didn't observe this on K2H and K2K devices.

    Regards,
    Rahul
  • Jonny,

    I am referring back to this picture:

    This diagram is not consistent with the Ethernet specification.  That may be contributing to the data corruption as the ports may not be interpreting the data the way you expect them to.  As has been said previously, the port in the FPGA adjacent to the DSP is a PHY port.  However, as drawn above, this PHY block has GMII connected internally and the media side facing the DSP.  (PCS and PMA are layers in a PHY between a media independent interface port and a media port.) Even though you refer to this as SGMII, it might not be.  It appears to me to be a 1000BASE-X media port, not an SGMII media independent interface port.  There are many similarities but they are not identical.  It is important to recognize that from an electrical point of view the SGMII interface is very similar to the 1000BASE-X interface. Both use 8B/10B encoding, a serial interface and an embedded clock. However, these two interfaces are located in distinctly different places within the Ethernet stack and cannot be mixed. Systems can operate with SGMII connected to a media port but they are not guaranteed to operate as they are not consistent with the Ethernet standard.  Rather than a 'PHY block' in the FPGA, you need a 'MAC block' that supports SGMII.  Then you configure one as an SGMII master and one as an SGMII slave to achieve a compliant interface.  I would assume that the connection to the switch could be like this too although perhaps the switch ports can be true 1000BASE-X media ports.

    Tom

  • Hi Tom,

    I am no Ethernet expert so I am open to having a discussion about this, but I don't believe we are doing anything that is non-compliant with Ethernet spec.

    Allow me to clarify the diagram I posted.  Everything in the diagram exists entirely on a single board, so technically everything in it is media independent.  The "1/2.5G Ethernet PCS/PMA Core" may be confusing you because of its name.  It is a verified piece of IP from Xilinx that can be configured to support either Cisco's SGMII standard or 1000BASE-X, since the two are so similar at the physical level.  Note again that a PCS and PMA are not media dependent.  The SGMII and 1000BASE-X labels in the diagram are only describing the media independent aspects of those interfaces.  We are not playing around in different layers of the Ethernet stack.  Only within the Ethernet switch (Marvell 88E6185) do we move to the next layer of a MAC, and only on a different switch port do we have a media dependent port (1000BASE-KX copper backplane I believe).

    Do you think we could have error-less secure copies of 22G files once we boot into Linux if our connection was not Ethernet compliant?  Does the fact that we only see issues when executing code from the Keystone2's boot ROM not concern you?

    -Jonny

  • Jonny,

    We have numerous other customers with boards in production that use the RBL Ethernet boot over SGMII.  Because of this we are exploring your unique implementation to determine how it could result in corrupted packets during RBL boot.  You indicate that once booted, Ethernet transfers are robust.  Therefore, we need to explore what is different in your system during the boot processes that fail.

    Tom

  • Hey Tom,

    Thanks for the quick reply.

    Are you inquiring if there is something different in our system during boot vs post-boot? I don't believe there are any differences...

    -Jonny
  • Jonny,

    No, I was considering that there was something different between the trials that booted and then transferred data versus those that did not boot.  Perhaps something adaptive that converged to a different operating point.  One possibility is SERDES equalization that is adaptive.  There could be other adaptive system components.

    Tom

  • Hey Tom,

    Since you had me thinking about RX equalization, I decided to mess around with the settings on our FPGA. Xilinx SERDES have two equalization modes, both of which are adaptive. We had been using the more robust mode (decision feedback equalization, aka DFE), but I switched to the more basic mode (linear equalizer) and our boot issues seem to have stopped occurring (tried 30 attempts).

    The only reason I can think of for why the DFE mode was problematic is related to this statement in Xilinx transceiver documentation:

    "The DFE allows better compensation of transmission channel losses by providing a closer adjustment of filter parameters than when using a linear equalizer. However, a DFE cannot remove the pre-cursor of a transmitted bit; it only compensates for the post cursors."

    Regardless, after we run some more tests tomorrow we may be able to consider this issue resolved.

    -Jonny