66AK2H06: U-boot Ethernet hang on MDIO control register read (New)

Lance Jump

Part Number: 66AK2H06
Other Parts Discussed in Thread: 66AK2E05, 66AK2H14

I posted a question about this several months ago and closed it because we found what was causing the issue. But we are experiencing the same symptom with boards where that root cause has been resolved.

My first question is where the best place to post a hardware-related issue is.

Some of our Keystone 66AK2H06-based boards hang when trying to start the network in u-boot. These boards start out working but, later, some fail. A significant percentage of the boards fail in exactly the way described below, and we are trying to determine a root cause so that we can address it.

The hang is very specific and repeatable. It happens on the first read access to the MDIO controller register space (e.g., 0x02090300). By "hang," I mean that it never returns from the read. This happens in keystone2_eth_mdio_enable. I have added some debug code to show the problem:

static void keystone2_eth_mdio_enable(void)
{
	u_int32_t	clkdiv;

	clkdiv = (EMAC_MDIO_BUS_FREQ / EMAC_MDIO_CLOCK_FREQ) - 1;

	writel((clkdiv & 0xffff) |
	       MDIO_CONTROL_ENABLE |
	       MDIO_CONTROL_FAULT |
	       MDIO_CONTROL_FAULT_ENABLE,
	       &adap_mdio->CONTROL);

	/* Debug code added: a bare read of CONTROL is enough to trigger the hang */
	printf("gets here\n");
	(void)readl(&adap_mdio->CONTROL);
	printf("doesn't get here\n");

	while (readl(&adap_mdio->CONTROL) & MDIO_CONTROL_IDLE)
		;
}

Everything else seems to run okay. That is, if I disable the network in u-boot, it is able to start and access non-network devices (NAND, SPI, ...).

I have checked all power supplies and the SYSCLK, PASSCLK, SGMIICLK, and ARMCLK, and they all look good. Note, however, that I cannot probe on the pins of the Keystone, so I probe as close as possible.

I checked the power and clock domain registers (in PSC) for the PA, SGMII, and SA and they all matched the values for a working board.

I do not believe this is a software problem. I also don't think that the Keystone, per se, is the root cause. But something is happening to these boards that is affecting the Keystone in a relatively specific way and I'd like to get closer to what it is.

So, I believe the issue is hardware -- perhaps we are killing Keystones -- and I would like to get some suggestions as to how to get closer to the root cause.

Thanks,

Lance

over 5 years ago

0 Cvetolin Shulev-XID over 5 years ago

TI__Guru 65405 points

The SW and HW teams are notified. They will post their feedback directly here.

BR
Tsvetolin Shulev

0 Tom Johnson 16214 over 5 years ago in reply to Cvetolin Shulev-XID

TI__Mastermind 46460 points

Lance,

You indicate that this problem was previously resolved. What changed to make it come back? Did you release new code? Did you build a new batch of prototypes?

It is known that the MDC/MDIO communications interface is susceptible to corruption due to glitches on the MDC line when it is fanned out. This is a known issue on the 66AK2H14 EVM when the expansion boards are attached. This was resolved on the 66AK2E05 EVM by adding proper clock buffering to the MDC line. Could this cause the Linux read to hang?

Tom

0 Lance Jump over 5 years ago in reply to Tom Johnson 16214

Expert 1280 points

Tom,

Thanks for the quick response. More precisely, the symptom returned, but the root cause appears to be different. Before, the problem turned out to be a cracked resistor (on several boards) in the 0.85V regulator. When we replaced the resistors, all boards worked, so we believe that was the root cause.

On the present issue, it is the same rev of the boards, u-boot has not changed, and the resistor is intact. We are getting 0.845V on VDDALV. But they are exhibiting exactly the same symptom as those with the cracked resistor. That is, they hang as soon as the network code reads from an MDIO controller register.

Note that this is not failing during an actual MDIO transaction with the PHY -- it never gets to that point. It fails just reading the MDIO control register (at 0x02090304), although it does not fail during a write (see code above). It hangs very early in u-boot (it never gets to linux or even a u-boot prompt).

As a test, I removed the resistor from a working board and saw that the failure was identical to the present board failure -- it hangs at the exact same instruction.

There are a couple of differences in the present symptom from the cracked resistor symptom. Previously, boards would sometimes get further, but when they failed it was always this same way. Also, those boards would often work, for a short time, at elevated temperature (but not cold). Our theory is that the cracked resistors, due to thermal-mechanical forces, sometimes worked well enough. None of the boards failed after the resistor was replaced. With the present issue, the boards never get past that hanging instruction regardless of temperature.

Thanks,
Lance

0 Tom Johnson 16214 over 5 years ago in reply to Lance Jump

TI__Mastermind 46460 points

Lance,

As soon as you enable the MDIO peripheral, it begins polling PHY addresses to detect link status on any connected PHYs. This is independent of explicit accesses to PHY registers. The link status of any detected (or falsely detected) PHY is then reported in the MDIO registers. MDC clock glitches are known to cause this polling to report false information.

I still did not get an understanding of what changed. You had the issue with VDDALV due to cracked resistors. Once you fixed those resistors, was everything working reliably? Assuming yes, what coincided with the re-appearance of the failure?

Tom

0 Lance Jump over 5 years ago in reply to Tom Johnson 16214

Expert 1280 points

Tom,

Sorry, I wasn't very clear -- it's a long story.

This symptom actually appeared before, during and after the cracked resistors issue. We had been putting the boards in a "bone pile" until we had a chance to debug them. Then we got a batch of boards in which a large percentage (25%+) exhibited this symptom. We finally tracked it down to a resistor that was cracked on most of these failing boards. We replaced that resistor on all boards (failing or not) from that batch. That resolved the problem. That is, they worked reliably. Note that all of these boards failed before we got very far along in our production bring up.

The boards we are looking at now come from the bone pile comprising boards from builds before and after the cracked resistor batch and also include boards from that batch. Of course, we checked the resistor (and, more to the point, the 0.85V supply) on these boards and it's fine. I'll have to do research, but I think one or two may have been ones that originally had a cracked resistor.

A key difference is that most of these boards start out working but eventually fail. They are run for several days including several hours in a temperature chamber. The failure typically happens after a process such as conformal coating or ESS testing. A few have failed in the field in customers' hands.

The failures present in a few different ways, with this network failure being among the most common. Another common mechanism is a similar hang reading DDR3 (not executing from it, just reading it). I am focusing on the network issue because it is a common failure mode and because it appears relatively "contained."

Regarding the MDIO polling: the only MDIO register that has been accessed at the point of failure is the control register, to which we have just written (0x02090304 <= 0x400c0188). Is this sufficient to initiate the automatic polling? Keystone hangs as soon as we read that control register back, which is normally the next instruction (where it starts to check for idle). But I believe it would hang reading from that register even if the MDIO controller had not been enabled (something I can check if it helps).

Also, note that the symptom is not bad or garbled status; it is that the processor appears to stop executing instructions. Is this the MDC-related issue?

Thanks,
Lance

0 Tom Johnson 16214 over 5 years ago in reply to Lance Jump

TI__Mastermind 46460 points

Lance,

What percentage of production boards fail this way?

It would be interesting to see whether a read to this register causes a stall even before the MDIO operation is enabled.

Tom

0 Lance Jump over 5 years ago in reply to Tom Johnson 16214

Expert 1280 points

Tom,

I may have accidentally marked this as resolved (I got a reply to that effect). If so, it was not intentional.

I don't have all the quantitative data, but, as of last month, we had built about 50 boards and about half of those failed in various ways (including never coming up to start). Of these, maybe five have failed with a hang on network initialization in u-boot. These have not all been verified to fail exactly the way described, but at least three have. (I only recently developed the test that identifies the failing instruction.)

Regarding whether the register read causes a stall before MDIO is enabled, I am about to check that now.

By the way, I will not be back in the office until next Monday.

Thanks,
Lance

0 Lance Jump over 5 years ago in reply to Lance Jump

Expert 1280 points

Tom,

I found that the hang occurs even if I read the register before I enable MDIO operation. (I am assuming that the write to the MDIO control register is what enables it.)

Note that, in earlier testing, I found that I could read the MDIO register space without hanging up to a certain point in the initialization. I would only get zeros (for any of the registers), but the processor continued to run. Unfortunately, I didn't keep good notes on exactly when in the sequence a read would hang, but it's not too difficult to reproduce this. I'm guessing it was sometime after the PSC enable of the networking power and clock domains.

Lance

0 Lance Jump over 5 years ago in reply to Lance Jump

Expert 1280 points

Tom,

Reading 0x02090300 returns 0 until after the call to psc_enable_module(KS2_LPSC_CPGMAC) (in keystone2_emac_initialize), after which the read hangs the processor. That call enables the SGMII module.

Does TI offer a service where we can send in the Keystone so that they can tell us whether it failed and what failed?

Thanks,
Lance

0 Tom Johnson 16214 over 5 years ago in reply to Lance Jump

TI__Mastermind 46460 points

Lance,

TI will take returned units for failure analysis. However, you need to establish through debugging that the units have truly failed. At this time you still appear to be debugging a low-volume production board design that has many failures for various reasons. Due to the low yield, I would suspect design or fabrication issues before I would assume that the parts are bad.

Tom

0 Lance Jump over 5 years ago in reply to Tom Johnson 16214

Expert 1280 points

Tom,

I completely agree that there is likely something in the design, manufacturing, or handling of the board. I do not suspect that the parts are arriving defective from TI. We are hoping that failure analysis will help us to determine whether we are killing them and, if so, how we might be doing that.

We have had limited success removing/replacing Keystone on our boards, so such tests are somewhat inconclusive.

Thanks,
Lance

0 Tom Johnson 16214 over 5 years ago in reply to Lance Jump

TI__Mastermind 46460 points

Lance,

Rather than modifying the u-boot application to perform these tests, you can also allow u-boot to complete without MDIO support and then use scripts or the command line to access the MDIO registers using the u-boot md or mw commands.

Tom

0 Lance Jump over 5 years ago in reply to Tom Johnson 16214

Expert 1280 points

Tom,

That's a good idea and I had been doing that, but was having some trouble reproducing the issue that way. That is, I was neither getting the network to hang nor initialize. I had captured the I/O accesses and then scripted them with u-boot pokes. Apparently, I wasn't capturing all of the register accesses, so, to be sure, I just started reloading u-boot. It turns out that the whole change/build/load cycle is only a couple of minutes using UART boot.

Thanks,

Lance

0 Lance Jump over 5 years ago in reply to Lance Jump

Expert 1280 points

Tom,

I was able to write a u-boot script that evoked the same behavior as when u-boot ran the network init code. Basically, I traced/captured the init I/O then changed it to u-boot sequences with a SED equivalent.

I ran the script on a failing board and on a good board to ensure the script was good. Everything is identical (except the MAC address read) right up to the point where the bad board hangs.

I noticed that the good board consumes a little over 1W more with the network enabled than it does before I run the network enable script. On the bad board, there is no difference in power consumption.

I checked the location in the script where the power steps up (on the good board). That happens immediately after the PA module is enabled (psc_enable_module(KS2_LPSC_PA)). The enable also occurs on the bad board (long before it hangs), but there is no power change.

Does that help identify what might be happening in the Keystone or suggest areas for further investigation?

Thanks,
Lance

0 Tom Johnson 16214 over 5 years ago in reply to Lance Jump

TI__Mastermind 46460 points

Lance,

I am not sure. Perhaps you are missing a clock input. That might cause the behavior that you are seeing. Have you x-ray'ed the part to look for opens or shorts in the BGA assembly process?

Tom

0 lding over 5 years ago in reply to Tom Johnson 16214

TI__Guru* 95265 points

Hi,

Regarding psc_enable_module(KS2_LPSC_PA): on the bad board, do you see the PA power domain change after running this script, or does it get stuck there? You can look at the u-boot code to see how this function is implemented. You can also refer to the equivalent GEL file (attached, 6366.xtcievmk2x.gel) for enabling the PA power domain: Set_PSC_State(PD2, LPSC_PA, PSC_ENABLE);

Regards, Eric

0 Lance Jump over 5 years ago in reply to lding

Expert 1280 points

Eric,

First, note that I measured the supply again and I was wrong about the bad board not changing the power it used. It did change, by about the same amount as the good board. I also over-estimated the change. It was closer to 300mW more on both boards after enabling the PA.

Below is the captured sequence of reads/writes while enabling the module.

keystone2_emac_initialize:1043
RD(4) 02350a1c=00000100:psc_enable_module:194
RD(4) 0235061c=000211ff:psc_set_state:129
RD(4) 02350128=00000000:psc_wait:78
RD(4) 02350308=00000000:psc_set_state:146
WR(4) 02350308=00000001:psc_set_state:148
RD(4) 02350a1c=00000100:psc_set_state:153
WR(4) 02350a1c=00000103:psc_set_state:156
RD(4) 02350120=00000000:psc_set_state:159
WR(4) 02350120=00000004:psc_set_state:161
RD(4) 02350308=00000001:psc_set_state:163
RD(4) 02350208=00000301:psc_set_state:164
RD(4) 02350a1c=00000103:psc_set_state:165
RD(4) 0235081c=00001f03:psc_set_state:166
RD(4) 02350128=00000000:psc_wait:78
RD(4) 02350308=00000001:psc_set_state:168
RD(4) 02350208=00000301:psc_set_state:169
RD(4) 02350a1c=00000103:psc_set_state:170
RD(4) 0235081c=00001f03:psc_set_state:171
RD(4) 02350128=00000000:psc_wait:78
keystone2_emac_initialize:1048

I believe this means that it did change properly (psc_wait saw the transition pending bit cleared). This trace is from the bad board, but it is identical to the good board. They don't differ until much later.

These are production boards, which do not have an emulator connector on them, so running the GEL file will be difficult at best. When we had similar hang problems in DDR3 on our prototype boards, we found that the hang also hung the emulator.

Thanks,

Lance

0 lding over 5 years ago in reply to Lance Jump

TI__Guru* 95265 points

Hi,

This means the issue is not PA related: both good and bad boards can turn on the PA, which adds about 300 mW of power consumption. Since you can turn it on with commands, there is no need to run the GEL file for the same purpose.

Regards, Eric

0 Lance Jump over 5 years ago in reply to lding

Expert 1280 points

Eric,

Thank you for the input and suggestion. As I said earlier, the most demonstrable symptom is that the processor hangs when any read is done to the MDIO space -- including the status and ID registers. Coincidentally, the hang occurs only after the PA module is enabled (as in the trace). Before that, reads from the MDIO space return 0 which, while not correct, does not hang the processor.

Thanks,

Lance

0 Tom Johnson 16214 over 5 years ago in reply to Lance Jump

TI__Mastermind 46460 points

Lance,

Is there anything else pending for this thread?

Tom

0 Lance Jump over 5 years ago in reply to Tom Johnson 16214

Expert 1280 points

Tom,
Sorry, I was out of the office for a couple of weeks. Unfortunately, I am no closer to figuring out how to determine the exact cause of the failure, or even whether the Keystone itself has failed. But I can't think of any additional questions to ask or experiments to run. Remember that it is not the boards themselves I am trying to save. Rather, I am trying to determine what is causing them to fail so we can take corrective action.

I appreciate your time and input, but I think, at this point, it is probably best to close the issue.

Thanks,
Lance
