66AK2H06: U-boot Ethernet hang on MDIO control register read.

Lance Jump

Part Number: 66AK2H06

We have a custom Keystone 2 (66AK2H06) board design in which a significant fraction of boards (3 out of 12) hang in exactly the same way. The board is a very minor re-spin of a design that never showed this problem across 25+ boards.

The hang occurs when U-Boot tries to access the network (e.g., DHCP). Using the emulator, we have isolated the specific instruction that causes the hang. It is a read of the MDIO control register:

ctl = readl(&adap_mdio->CONTROL);

in drivers/net/keystone_net.c:keystone2_eth_mdio_enable

When stepping that instruction (the assembler instruction that actually performs the read), the emulator reports that it cannot halt the CPU because the pipeline is stalled. A system reset is required to regain control.

I also have a debug print before and after the offending instruction. When I run the code directly (without the emulator) and the board hangs, I see the pre-instruction message but not the post-instruction message.

I have also used the emulator to read that register directly (rather than stepping the instruction) and got a read error from the emulator. Reading other registers in the same area (Ethernet switch subsystem) causes the same problem.

Note that the board does not hang every time it reads that register, but every time the board does hang, it is while reading that register. Also, I put in two reads of that register (with intervening messages): the first read sometimes hangs, but the second read has never failed after the first succeeded. Note also that the routine writes to that register before reading it, and the write never hangs.
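For anyone wanting to reproduce this kind of instrumentation, here is a minimal sketch of the write followed by two bracketed reads. It is not the actual keystone_net.c code: the `mdio_regs_sim` struct, `readl_sim`/`writel_sim`, and `probe_control_twice` are hypothetical stand-ins that operate on ordinary memory so the sketch can be built on a host. On the target, the accesses would go through U-Boot's `readl`/`writel` against the real MDIO register block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the MDIO register block. The real layout
 * comes from the U-Boot headers; only a CONTROL-like field is modeled. */
struct mdio_regs_sim {
    volatile uint32_t version;
    volatile uint32_t control;
};

/* Host-side stand-ins for U-Boot's readl()/writel() (assumption: plain
 * volatile accesses are enough for this illustration). */
static inline uint32_t readl_sim(const volatile uint32_t *addr)
{
    return *addr;
}

static inline void writel_sim(uint32_t val, volatile uint32_t *addr)
{
    *addr = val;
}

/* Write the control register, then read it twice with a message before
 * and after each read. On failing hardware, the last message printed
 * shows which access never completed.
 * Returns 1 if both reads returned the same value, 0 otherwise. */
int probe_control_twice(struct mdio_regs_sim *regs, uint32_t init_val,
                        uint32_t *out)
{
    uint32_t first, second;

    assert(regs != NULL && out != NULL);

    writel_sim(init_val, &regs->control);
    printf("write done\n");

    printf("before first read\n");
    first = readl_sim(&regs->control);
    printf("after first read: 0x%08x\n", (unsigned)first);

    printf("before second read\n");
    second = readl_sim(&regs->control);
    printf("after second read: 0x%08x\n", (unsigned)second);

    *out = second;
    return first == second;
}
```

On a hung board, the output would stop after "before first read", which matches the symptom described above.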

Does any of this sound familiar?

A specific question I have is: can a problem in the network coprocessor subsystem cause the ARM core to hang this way when reading such a register?

Thanks,
Lance

over 8 years ago

0 Yordan Kovachev over 8 years ago

Hi Lance,

"The board is a very minor re-spin to a design that has never seen this problem for 25+ boards."

Can you explain how this board differs from the earlier design (the one that has never shown this problem)?
Which Linux SDK are you using?

Best Regards,
Yordan

0 Lance Jump over 8 years ago in reply to Yordan Kovachev

Yordan,

What is so puzzling about this is that almost none of the respin is related to the K2 -- it is mostly to other board functions and to incorporate stitch wire changes into the PCB. Among the things that might affect K2 are power supply changes (due to EOL parts), additional power/ground planes (for noise in sensitive analog circuits) and top-level clock generation (although the clock drivers to the K2 remain unchanged).

We are using MCSDK 3.01.01.04 and the respin required no changes here.

Since the original post, we have also tried slowing down the system clock (1200MHz to 800MHz), the PA clock (983MHz to 800MHz) and the DDR clock (800MHz to 500MHz) and saw no change.

I realize that it is hard to diagnose such problems with so little information. I am trying to understand what sorts of things could cause the specific symptom we are seeing (hanging on MDIO controller register access).

Is this something that can be caused by a problem in the network co-processor?

Is there a clock, other than the system peripheral clock, that could cause this?

Is there a specific power supply rail that would be indicated by this symptom?

Is there any way the external PHY could be at fault (the MDIO controller is the one inside the K2)?

Thanks,

Lance

0 Lance Jump over 8 years ago in reply to Yordan Kovachev

I just wanted to close the loop on this and report what we found -- maybe it will help someone else diagnosing issues in the future.

The problem turned out to be a manufacturing error that caused the 0.85V supply to be at 0.6V on the failing boards. We are working with our contract manufacturer to determine why so many boards had exactly the same error.
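To put the size of that error in perspective, here is a small sketch that computes how far a measured rail sits from its nominal value and checks it against a tolerance. The function names and the tolerance argument are hypothetical. The actual allowed range for this supply must come from the device data sheet, not from this example.

```c
#include <assert.h>

/* Percent deviation of a measured rail from its nominal voltage.
 * Negative means the rail is low. */
double rail_deviation_pct(double nominal_v, double measured_v)
{
    assert(nominal_v > 0.0);
    return (measured_v - nominal_v) / nominal_v * 100.0;
}

/* Returns 1 if the measured rail is within +/- tol_pct of nominal,
 * 0 otherwise. tol_pct is a placeholder; use the data sheet limits. */
int rail_in_tolerance(double nominal_v, double measured_v, double tol_pct)
{
    double dev = rail_deviation_pct(nominal_v, measured_v);
    return dev >= -tol_pct && dev <= tol_pct;
}
```

For the values reported here, `rail_deviation_pct(0.85, 0.60)` comes out to roughly -29.4%, far outside any plausible supply tolerance.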

Of interest is that we do not use USB, which, it seems, is the primary use of this supply on the chip. We do, however, use SGMII, which implies using a SERDES. Does the 0.85V supply operate the SGMII SERDES?

Also, all boards worked at elevated temperature (+70C ambient), but hung when running at room temperature or cold.

Thanks to those who considered the issue.

Lance
