66AK2H06: U-boot Ethernet hang on MDIO control register read (New)

Part Number: 66AK2H06
Other Parts Discussed in Thread: 66AK2E05, 66AK2H14

I posted a question about this several months ago and closed it because we found what was causing the issue. But we are experiencing the same symptom with boards where that root cause has been resolved. 

My first question is where the best place to post a hardware-related issue is. 

Some of our Keystone 66AK2H06-based boards hang when trying to start the network in u-boot. These boards start out working but, later, some fail. A significant percentage of the boards fail in exactly the way described below, and we are trying to determine a root cause so that we can address it.

The hang is very specific and repeatable. It happens on the first read access to the MDIO controller register space (e.g., 0x02090300). By "hang," I mean that it never returns from the read. This happens in keystone2_eth_mdio_enable. I have added some debug code to show the problem:

static void keystone2_eth_mdio_enable(void)
{
	u_int32_t	clkdiv;

	clkdiv = (EMAC_MDIO_BUS_FREQ / EMAC_MDIO_CLOCK_FREQ) - 1;

	writel((clkdiv & 0xffff) |
	       MDIO_CONTROL_ENABLE |
	       MDIO_CONTROL_FAULT |
	       MDIO_CONTROL_FAULT_ENABLE,
	       &adap_mdio->CONTROL);

	/* Debug code added */
	printf("gets here\n");
	(void)(readl(&adap_mdio->CONTROL) & MDIO_CONTROL_IDLE);
	printf("doesn't get here\n");

	while (readl(&adap_mdio->CONTROL) & MDIO_CONTROL_IDLE)
		;
}

Everything else seems to run okay. That is, if I disable the network in u-boot, it is able to start and access non-network devices (NAND, SPI, ...).

I have checked all power supplies and the SYSCLK, PASSCLK, SGMIICLK, and ARMCLK, and they all look good. Note, however, that I cannot probe on the pins of the Keystone, so I probe as close as possible.

I checked the power and clock domain registers (in PSC) for the PA, SGMII, and SA and they all matched the values for a working board.

I do not believe this is a software problem. I also don't think that the Keystone, per se, is the root cause. But something is happening to these boards that is affecting the Keystone in a relatively specific way and I'd like to get closer to what it is.

So, I believe the issue is hardware -- perhaps we are killing Keystones -- and I would like to get some suggestions as to how to get closer to the root cause.

Thanks,

Lance

  • The SW and HW teams are notified. They will post their feedback directly here.

    BR
    Tsvetolin Shulev
  • Lance,

    You indicate that this problem was previously resolved.  What changed to make it come back?  Did you release new code?  Did you build a new batch of prototypes?

    It is known that the MDC/MDIO communications interface is susceptible to corruption due to glitches on the MDC line when it is fanned out.  This is a known issue on the 66AK2H14 EVM when the expansion boards are attached.  This was resolved on the 66AK2E05 EVM by adding proper clock buffering to the MDC line.  Could this cause the Linux read to hang?

    Tom

  • Tom,

    Thanks for the quick response. More precisely, the symptom returned, but the root cause appears to be different. Before, the problem turned out to be a cracked resistor (on several boards) in the 0.85V regulator. When we replaced the resistors, all boards worked, so we believe that was the root cause.

    On the present issue, it is the same rev of the boards, u-boot has not changed, and the resistor is intact. We are getting 0.845V on VDDALV. But these boards exhibit exactly the same symptom as those with the cracked resistor. That is, they hang as soon as the network code reads from an MDIO controller register.

    Note that this is not failing during an actual MDIO transaction with the PHY -- it never gets to that point. It fails just reading the MDIO control register (at 0x02090304), although it does not fail during a write (see code above). It hangs very early in u-boot (it never gets to Linux or even a u-boot prompt).

    As a test, I removed the resistor from a working board and saw that the failure was identical to the present board failure -- it hangs at the exact same instruction.

    There are a couple of differences in the present symptom from the cracked resistor symptom. Previously, boards would sometimes get further, but when they failed it was always this same way. Also, those boards would often work, for a short time, at elevated temperature (but not cold). Our theory is that the cracked resistors, due to thermal-mechanical forces, sometimes worked well enough. None of the boards failed after the resistor was replaced. With the present issue, the boards never get past that hanging instruction regardless of temperature.

    Thanks,
    Lance
  • Lance,

    As soon as you enable the MDIO peripheral, it begins polling PHY addresses to detect link status on any connected PHYs.  This is independent of explicit accesses to PHY registers.  The link status of any detected (or falsely detected) PHY is then reported in the MDIO registers.  MDC clock glitches are known to cause this polling to report false information.

    I still did not get an understanding of what changed.  You had the issue with VDDALV due to cracked resistors.  Once you fixed those resistors, was everything working reliably?  Assuming yes, what coincided with the re-appearance of the failure?

    Tom

  • Tom,

    Sorry, I wasn't very clear -- it's a long story.

    This symptom actually appeared before, during and after the cracked resistors issue. We had been putting the boards in a "bone pile" until we had a chance to debug them. Then we got a batch of boards in which a large percentage (25%+) exhibited this symptom. We finally tracked it down to a resistor that was cracked on most of these failing boards. We replaced that resistor on all boards (failing or not) from that batch. That resolved the problem. That is, they worked reliably. Note that all of these boards failed before we got very far along in our production bring-up.

    The boards we are looking at now come from the bone pile comprising boards from builds before and after the cracked resistor batch and also include boards from that batch. Of course, we checked the resistor (and, more to the point, the 0.85V supply) on these boards and it's fine. I'll have to do research, but I think one or two may have been ones that originally had a cracked resistor.

    A key difference is that most of these boards start out working but eventually fail. They are run for several days including several hours in a temperature chamber. The failure typically happens after a process such as conformal coating or ESS testing. A few have failed in the field in customers' hands.

    The failures present in a few different ways, with this network failure being among the most common. Another common mechanism is a similar hang reading DDR3 (not executing from it, just reading it). I am focusing on the network issue because it is a common failure mode and because it appears relatively "contained."

    Regarding the MDIO polling: the only MDIO register that has been accessed at the point of failure is the control register, to which we have just written (0x02090304 <= 0x400c0188). Is this sufficient to initiate the automatic polling? The Keystone hangs as soon as we read that control register back, which is normally the next instruction (where it starts to check for idle). But I believe it would hang reading from that register even if the MDIO controller had not been enabled (something I can check if it helps).

    Also, note that the symptom is not bad or garbled status; it is that the processor appears to stop executing instructions. Is this the MDC-related issue?

    Thanks,
    Lance
  • Lance,

    What percentage of production boards fail this way?

    It would be interesting to see whether a read to this register causes a stall even before the MDIO operation is enabled.

    Tom

  • Tom,

    I may have accidentally marked this as resolved (I got a reply to that effect). If so, it was not intentional.

    I don't have all the quantitative data, but, as of last month, we had built about 50 boards and about half of those failed in various ways (including never coming up to start). Of these, maybe five have failed with a hang on network initialization in u-boot. These have not all been verified to fail exactly the way described, but at least three have. (I only recently developed the test that identifies the failing instruction.)

    Regarding whether the register read causes a stall before MDIO is enabled, I am about to check that now.

    By the way, I will not be back in the office until next Monday.

    Thanks,
    Lance
  • Tom,

    I found that the hang occurs even if I read the register before I enable MDIO operation. (I am assuming that the write to the MDIO control register is what enables it.)

    Note that, in earlier testing, I found that I could read the MDIO register space without hanging up to a certain point in the initialization. I would only get zeros (for any of the registers), but the processor continued to run. Unfortunately, I didn't keep good notes on exactly when in the sequence a read would hang, but it's not too difficult to reproduce this. I'm guessing it was sometime after the PSC enable of the networking power and clock domains.

    Lance
  • Tom,

    Reading 0x02090300 returns 0 until after the call to psc_enable_module(KS2_LPSC_CPGMAC) (in keystone2_emac_initialize) after which the read hangs the processor. That enables the SGMII module.

    Does TI offer a service where we can send in the Keystone so that they can tell us whether it failed and, if so, what failed?

    Thanks,
    Lance
  • Lance,

    TI will take returned units for failure analysis.  However, you need to establish through debugging that the units are truly failed.  At this time you still appear to be debugging a low production board design that has many failures for various reasons.  Due to the low yield, I would suspect design or fabrication issues before I would assume that the parts are bad.

    Tom

  • Tom,

    I completely agree that there is likely something in the design, manufacturing, or handling of the board. I do not suspect that the parts are arriving defective from TI. We are hoping that failure analysis will help us to determine whether we are killing them and, if so, how we might be doing that.

    We have had limited success removing/replacing Keystone on our boards, so such tests are somewhat inconclusive.

    Thanks,
    Lance
  • Lance,

    Rather than modifying the u-boot application to perform these tests, you can also allow the u-boot application to complete without MDIO support and then use scripts or the command line to access the MDIO registers using the u-boot md or mw commands.

    Tom

  • Tom,

    That's a good idea and I had been doing that, but was having some trouble reproducing the issue that way. That is, I was neither getting the network to hang nor initialize. I had captured the I/O accesses and then scripted them with u-boot pokes. Apparently, I wasn't capturing all of the register accesses, so, to be sure, I just started reloading u-boot. It turns out that the whole change/build/load cycle is only a couple of minutes using UART boot.

    Thanks,

    Lance

  • Tom,

    I was able to write a u-boot script that evoked the same behavior as when u-boot ran the network init code. Basically, I traced/captured the init I/O then changed it to u-boot sequences with a SED equivalent.

    I ran the script on a failing board and on a good board to ensure the script was good. Everything is identical (except the MAC address read) right up to the point where the bad board hangs.

    I noticed that the good board consumes a little over 1W more with the network enabled than it does before I run the network enable script. On the bad board, there is no difference in power consumption.

    I checked the location in the script where the power steps up (on the good board). That happens immediately after the PA module is enabled (psc_enable_module(KS2_LPSC_PA)). The enable also occurs on the bad board (long before it hangs), but there is no power change.

    Does that help identify what might be happening in the Keystone or suggest areas for further investigation?

    Thanks,
    Lance
  • Lance,

    I am not sure.  Perhaps you are missing a clock input.  That might cause the behavior that you are seeing.  Have you x-ray'ed the part to look for opens or shorts in the BGA assembly process?

    Tom

  • Hi,

    After psc_enable_module(KS2_LPSC_PA) runs in this script, do you see the PA power domain change state on the bad board, or does it get stuck there? You can look at the u-boot code to see how this function is implemented. You can also refer to the equivalent GEL file (attached, 6366.xtcievmk2x.gel) for enabling the PA power domain: Set_PSC_State(PD2, LPSC_PA, PSC_ENABLE);

    Regards, Eric

  • Eric,

    First, note that I measured the supply again and I was wrong about the bad board not changing the power it used. It did change, by about the same amount as the good board. I also over-estimated the change. It was closer to 300 mW more on both boards after enabling the PA.

    Below is the captured sequence of reads/writes while enabling the module.

    keystone2_emac_initialize:1043
    RD(4) 02350a1c=00000100:psc_enable_module:194
    RD(4) 0235061c=000211ff:psc_set_state:129
    RD(4) 02350128=00000000:psc_wait:78
    RD(4) 02350308=00000000:psc_set_state:146
    WR(4) 02350308=00000001:psc_set_state:148
    RD(4) 02350a1c=00000100:psc_set_state:153
    WR(4) 02350a1c=00000103:psc_set_state:156
    RD(4) 02350120=00000000:psc_set_state:159
    WR(4) 02350120=00000004:psc_set_state:161
    RD(4) 02350308=00000001:psc_set_state:163
    RD(4) 02350208=00000301:psc_set_state:164
    RD(4) 02350a1c=00000103:psc_set_state:165
    RD(4) 0235081c=00001f03:psc_set_state:166
    RD(4) 02350128=00000000:psc_wait:78
    RD(4) 02350308=00000001:psc_set_state:168
    RD(4) 02350208=00000301:psc_set_state:169
    RD(4) 02350a1c=00000103:psc_set_state:170
    RD(4) 0235081c=00001f03:psc_set_state:171
    RD(4) 02350128=00000000:psc_wait:78
    keystone2_emac_initialize:1048
    

    I believe this means that it did change properly (psc_wait saw the transition pending bit cleared). This trace is from the bad board, but it is identical to the good board. They don't differ until much later.

    These are production boards, which do not have an emulator connector on them, so running the GEL file will be difficult at best. When we had similar hang problems in DDR3 on our prototype boards, we found that the hang also hung the emulator.

    Thanks,

    Lance

  • Hi,

    This means the issue is not PA related: both good and bad boards can turn on the PA, and both add about 300 mW of power consumption. Since you can turn it on with commands, there is no need to run the GEL file for the same purpose.

    Regards, Eric
  • Eric,

    Thank you for your input and suggestion. As I said earlier, the most demonstrable symptom is that the processor hangs when any read is done to the MDIO space -- including the status and ID registers. Coincidentally, the hang occurs only after the PA module is enabled (as in the trace). Before that, reads from the MDIO space return 0 which, while not correct, does not hang the processor.

    Thanks,

    Lance

  • Lance,

    Is there anything else pending for this thread?

    Tom

  • Tom,
    Sorry, I was out of the office for a couple of weeks. Unfortunately, I am no closer to figuring out how to determine the exact cause of the failure, or even whether the Keystone itself has failed. But I can't think of any additional questions to ask or experiments to run. Remember that it is not the boards themselves I am trying to save. Rather, I am trying to determine what is causing them to fail so we can take corrective action.

    I appreciate your time and input, but I think, at this point, it is probably best to close the issue.

    Thanks,
    Lance