TPS659037: Palmas register read failures

Part Number: TPS659037
Other Parts Discussed in Thread: AM5728

Intro:

We have a board built around Phy-Tec's AM5728-based SOM, which uses the recommended TPS659037 PMIC. We are using Phy-Tec's kernel (4.19.79) and u-boot (2019.01), both of which come from the TI SDK.

What Happened:

I was debugging an unrelated driver a few days ago, which caused a kernel panic. After that happened, Linux would get to the point in boot where the palmas-pmic driver is probed, and then the board would reset. I originally thought this was some sort of bug in the driver, so I added some printks to track down the problem. The driver does read-modify-write operations on the CTRL/status registers of each regulator on the PMIC, but the read operation was returning all zeros. When the driver wrote the modified value back, the regulator shut down (as expected, since the mode field was now set to "off").
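
For context, the failing path is essentially a read-modify-write like the sketch below (simplified, with illustrative register/mask names rather than the actual palmas-regulator identifiers):

#include <linux/regmap.h>

/* Illustrative names: SMPS12's CTRL register, with the mode field in
 * bits [1:0] and the status field in bits [5:4]. */
#define SMPS12_CTRL_REG   0x20
#define SMPS_MODE_MASK    0x03

static int smps_update_ctrl_field(struct regmap *map, unsigned int mask,
                                  unsigned int val)
{
        unsigned int ctrl;
        int ret;

        /* On the failing boards this read spuriously returns 0x00. */
        ret = regmap_read(map, SMPS12_CTRL_REG, &ctrl);
        if (ret)
                return ret;

        /* Update one field and preserve the rest (effectively what
         * regmap_update_bits() does). Because the read returned 0, the
         * preserved mode bits [1:0] are 0b00 ("off"), so the write-back
         * shuts the regulator down. */
        ctrl = (ctrl & ~mask) | (val & mask);
        return regmap_write(map, SMPS12_CTRL_REG, ctrl);
}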

I then thought that this was some sort of SW regression, so I went back to a commit that I knew worked and built completely from scratch, but the same thing happened. I then flashed a build that had been built before the issue started and it still happened.

So I decided to test out the issue in the u-boot console via the i2c command. Specifically, I issued "i2c md 0x58 0x20 1", which reads the CTRL register of SMPS12. I'd expect this to return 0x33, since this regulator should power up in mode 0b11 and report its status as 0b11 in bits 5:4. The first read returned 0. The next read returned the expected value, so I decided to see if it was some sort of random, intermittent failure. I noticed a pattern. After the first read of 0, I get 23 reads of 0x33, then this sequence: 0x02, 0x00, 0x01, 0x01, 0x03, 0x00, 0x01, 0x01, 0x00, then another 23 reads of 0x33, and so on. Reading other CTRL registers does not reset the sequence; those reads return the next value in the sequence. The voltage select registers seem affected as well, and follow the same pattern (with exactly the same bad values).
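
For reference, that expected value decodes as follows (just restating the bit fields above):

/* Expected SMPS12_CTRL read-back:
 *   0x33 = 0b00110011
 *   bits 1:0 (mode)   = 0b11
 *   bits 5:4 (status) = 0b11
 */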

Other notes:

This I2C bus is shared with an RTC @ 0x68 and an EEPROM @ 0x50. I disconnected the battery backup to the RTC but observed no change in symptoms. I don't think it's a bug in the other devices, though, since other registers on the PMIC, like the VENDOR_ID register, read back perfectly every time.

I found a board that hadn't been damaged yet and did the same probing in u-boot that I did with the damaged ones. The reads from the PMIC functioned normally.

Questions:

What are the possible causes? It happened with no relevant SW changes, but I suspect something like the OTP programming registers got written during the initial crash (but that's pure speculation).

Is this sort of thing recoverable, or does the PMIC need to be replaced? It seems to function somewhat normally aside from the faulty reads. For now, I'm going to patch the Linux driver to always assume that reads from the CTRL register include the correct mode bits (which seems a safe assumption: if it were not the case, there would be no power to the processor).
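
The patch I have in mind is roughly this (a sketch using the same illustrative names as above, not the actual driver code):

#define SMPS_MODE_ON      0x03  /* 0b11: the mode this rail powers up in */

static int smps_update_ctrl_field_assume_on(struct regmap *map,
                                            unsigned int mask, unsigned int val)
{
        unsigned int ctrl;
        int ret;

        ret = regmap_read(map, SMPS12_CTRL_REG, &ctrl);
        if (ret)
                return ret;

        /* Don't trust the mode bits that were read back: the rail feeding the
         * processor must already be on for this code to be running, so pretend
         * the read returned mode = on before writing the register back. */
        ctrl |= SMPS_MODE_ON;
        ctrl = (ctrl & ~mask) | (val & mask);
        return regmap_write(map, SMPS12_CTRL_REG, ctrl);
}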

  • I attempted the proposed workaround, and when the board boots there's a lot of physical memory corruption after regulator init. My suspicion is that the DDR power supply is getting set to an incorrect value.

  • Hello,

    Thank you for the update. I have forwarded this to the SME.

    Regards,
    Chris

  • To add extra context: though I'm unable to measure the power supplies directly right now since they're on a SOM, I was able to get more evidence for this suspicion. I ran u-boot's mtest (memory test) command and it ran fine for at least 10 minutes. Once Linux boots (and incorrectly configures the PMIC due to the faulty I/O), the system rapidly destabilizes. Typically, the ultimate cause of the crash is the kernel failing to service page faults because the page tables end up corrupted.

  • Hello,

    I'm looking at the issue; it will take me some time to digest what is going on. I will get back to you with an expected date for resolving this issue.

    Regards

  • Hi Zane,

    We can help with the PMIC hardware - if it ends up being a driver issue we will need to tie this back to the AM5728 part number you are using so the processor team can support it.

    To confirm the background, this issue is just occurring on one board, correct? Out of how many and when / how did it originally fail? We see a lot of unique issues during development due to irregular board handling (whether it be powering up in a weird way, ESD, accidentally shorting pins together while probing, etc) so if this is a one-off event during prototyping then you'd need to reproduce it to delve much deeper.

    From our standpoint, would you be able to provide the oscilloscope waveforms of the incorrect I2C reads? From there we can work to see if there is damage related to the I2C signal itself or whether the PMIC or processor is interpreting the commands incorrectly.

    Other options would include doing an ABA swap to see if the issue follows the board or the PMIC but that would require re-balling the unit which isn't always straightforward.

  • I managed to break two SOMs.  To clarify, as mentioned in the original post, this is a production AM5728 SOM from PHY-TEC (PHY-TEC model PCM-057). I didn't remove the unit from our carrier board until it was already malfunctioning. The carrier board + SOM combo has been somewhat stable (not production quality, but this was the first board failure) for quite a while.

    Since it's a SOM (very compact layout) and there aren't any test points on the I2C bus it's on, getting the waveforms will be tricky. I have investigated how to do this, but I'm not sure anyone in our office can handle the soldering involved. I don't suspect actual corruption on the wire, though: the failure is too perfect, and other registers on the PMIC aren't similarly affected. I can read the version registers correctly every time, but the CTRL/VOLTAGE/FORCE registers follow a very specific pattern. I also noticed, when I modified the Linux driver to allow the board to boot (it was otherwise shutting off the power during initialization), that writes may be similarly affected: the system is very unstable due to memory corruption, yet I can run the U-boot mtest command (memory test) for over 10 minutes (possibly longer), so the board, and thus the regulators, seem fine before the first PMIC I/O.

    I'm not sure we're capable of swapping the PMICs in house. If so, I'd have to mail it to another office and have them do it. Also, the PMIC is in a very awkward spot on the board.

    Software-wise, in my naive search for an answer to our overheating/CPU lockup issues, I attempted to upgrade to Linux 5.4. I now know why TI doesn't support that on the AM5728: it crashes in very bizarre ways (though judging by the mailing list, there's a decent amount of development still going on for this platform, so I expected... not this). The board that was used for this is a write-off as far as I am concerned. I'm very surprised that a kernel panic managed to damage the PMIC in this way, but for now I'll count that as a "learning experience". What was really surprising is that the second board did the same thing. That board, as far as I know (and I'm the only one who touched it), should never have seen that firmware. It has only ever run TI's 4.19.79 kernel + PHY-TEC patches + our patches (which were previously stable). Though unlikely, I suspect the build may have been contaminated when I switched back to 4.19, but I'd really like to know more about what's going on before I proceed.

    It's also possible that the lockup bug I've been hunting could be related somehow. I did run an extended test on the second board to see if I could still reproduce the issue. The board sat locked up for a long time (since I left it running overnight). This could have been the cause as well, though the failure mode is still surprising.

    Another interesting note: The SW_REVISION register reads 0x8A. I couldn't find a reference to this version in any of the user guides.

  • The first capture is of a bad read (read register 0x20 from I2C address 0x58, get 0x01 as the response).

    The second is a good read (same register, but this time it returned 0x33 as expected).

  • Is there any mechanism to verify the OTP on the PMIC via I2C?

    Could violating the power on/power off sequencing requirements for the PMIC cause this sort of failure?

  • Hi Zane,

    Thank you for posting the scope shots.

    To me, it looks like the first two bytes are good, but then there is a STOP condition at an odd point in what appears to be the 3rd byte:

    Is there a chance your I2C driver is corrupted? My expectation is that there should be a repeat start like in this example:

  • >Is there a chance your I2C driver is corrupted? My expectation is that there should be a repeat start like in this example:

    This was done in U-boot (specifically 2019.01 + TI patches + PHY-TEC patches). The board config does not set CONFIG_I2C_REPEATED_START, so it uses a Stop followed by a Start between the address and data phases. Here's the comment directly above the implementation of i2c_read for OMAP2+:

    /*
     * i2c_read: Function now uses a single I2C read transaction with bulk transfer
     *           of the requested number of bytes (note that the 'i2c md' command
     *           limits this to 16 bytes anyway). If CONFIG_I2C_REPEATED_START is
     *           defined in the board config header, this transaction shall be with
     *           Repeated Start (Sr) between the address and data phases; otherwise
     *           Stop-Start (P-S) shall be used (some I2C chips do require a P-S).
     *           The address (reg offset) may be 0, 1 or 2 bytes long.
     *           Function now reads correctly from chips that return more than one
     *           byte of data per addressed register (like TI temperature sensors),
     *           or that do not need a register address at all (such as some clock
     *           distributors).
     */
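
    For reference, the two framings look roughly like this on the bus (a sketch of the expected sequences, not a capture):

    /*
     * Stop-Start (P-S), as used here (CONFIG_I2C_REPEATED_START not set):
     *   S 0x58+W [reg 0x20] P    S 0x58+R [data] NA P
     *
     * Repeated Start (Sr), with CONFIG_I2C_REPEATED_START defined:
     *   S 0x58+W [reg 0x20] Sr   0x58+R [data] NA P
     */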

    I've noticed this morning that if I read 16 bytes at a time starting from register 0x20, the values alternate between the good ones and incorrect ones:

    => i2c md 0x58 0x20 10
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    => i2c md 0x58 0x20 10
    0020: 01 01 00 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    ...?3..[3..C3..=
    => i2c md 0x58 0x20 10
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    => i2c md 0x58 0x20 10
    0020: 01 01 00 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    ...?3..[3..C3..=
    => i2c md 0x58 0x20 10
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    => i2c md 0x58 0x20 10
    0020: 01 01 00 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    ...?3..[3..C3..=
    => i2c md 0x58 0x20 10
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    => i2c md 0x58 0x20 10
    0020: 01 01 00 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    ...?3..[3..C3..=
    => i2c md 0x58 0x20 10
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    => i2c md 0x58 0x20 10
    0020: 01 01 00 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    ...?3..[3..C3..=
    => i2c md 0x58 0x20 10
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    => i2c md 0x58 0x20 10
    0020: 01 01 00 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    ...?3..[3..C3..=
    => i2c md 0x58 0x20 10
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    => i2c md 0x58 0x20 10
    0020: 01 01 00 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    ...?3..[3..C3..=
    => i2c md 0x58 0x20 10
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    => i2c md 0x58 0x20 10
    0020: 01 01 00 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    ...?3..[3..C3..=
    => i2c md 0x58 0x20 10
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    => i2c md 0x58 0x20 10
    0020: 01 01 00 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    ...?3..[3..C3..=

    This is repeatable ad nauseam. If I start in the right "phase" by priming it with a 16-byte read and getting the wrong values back, I can then read 32 bytes from 0x20 correctly every time:

    => i2c md 0x58 0x20 10
    0020: 01 01 00 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    ...?3..[3..C3..=
    => i2c md 0x58 0x20 20
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    => i2c md 0x58 0x20 20
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    => i2c md 0x58 0x20 20
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    => i2c md 0x58 0x20 20
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    => i2c md 0x58 0x20 20
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    => i2c md 0x58 0x20 20
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    => i2c md 0x58 0x20 20
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    => i2c md 0x58 0x20 20
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    => i2c md 0x58 0x20 20
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    => i2c md 0x58 0x20 20
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    => i2c md 0x58 0x20 20
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    => i2c md 0x58 0x20 20
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01    3..?3..[3..C!...
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........

    Note that there were 3 devices total on this I2C bus: the PMIC at 0x58, 0x59, 0x5A, and 0x5B, an RTC at 0x68, and an EEPROM at 0x50. I removed the RTC yesterday and saw no change in behavior. The EEPROM is an M24C32. Issuing reads to it seems to have an effect on the PMIC reads. I need to look at it more.

  • Correction: both sets of values are wrong, but different registers are garbled each time.

    Example: in this dump, the registers for SMPS12 (the first bytes, starting at 0x20) are correct, but the registers for SMPS6 (the last four bytes) are not:

    => i2c md 0x58 0x20 10
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 21 00 01 01

    On the next read of 16 bytes, the opposite is true, with SMPS12 being corrupted and SMPS6 being correct:

    => i2c md 0x58 0x20 10
    0020: 01 01 00 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d

  • I moved one of the damaged SOMs onto the PCM-948 carrier board we have: the issue persists. I then booted the board off of the SDK SD card: the issue persists.

  • Reading continuously starting from the last BACKUP register always works:

    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..
    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..
    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..
    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..
    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..
    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..
    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..
    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..
    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..
    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..
    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..
    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..
    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..
    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..
    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..
    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..
    => i2c md 0x58 0x1f 10
    001f: 00 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7    .3..?3..[3..C3..

    As does just reading from 0x00 onwards:

    => i2c md 0x58 0x10 40
    0010: 00 80 00 00 00 27 00 00 00 00 00 00 00 00 00 00    .....'..........
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    3..?3..[3..C3..=
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    0040: 00 00 00 00 00 7f 7f ff 00 00 ff fe 07 00 00 00    ................
    => i2c md 0x58 0x10 40
    0010: 00 80 00 00 00 27 00 00 00 00 00 00 00 00 00 00    .....'..........
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    3..?3..[3..C3..=
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    0040: 00 00 00 00 00 7f 7f ff 00 00 ff fe 07 00 00 00    ................
    => i2c md 0x58 0x10 40
    0010: 00 80 00 00 00 27 00 00 00 00 00 00 00 00 00 00    .....'..........
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    3..?3..[3..C3..=
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    0040: 00 00 00 00 00 7f 7f ff 00 00 ff fe 07 00 00 00    ................
    => i2c md 0x58 0x10 40
    0010: 00 80 00 00 00 27 00 00 00 00 00 00 00 00 00 00    .....'..........
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    3..?3..[3..C3..=
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    0040: 00 00 00 00 00 7f 7f ff 00 00 ff fe 07 00 00 00    ................
    => i2c md 0x58 0x10 40
    0010: 00 80 00 00 00 27 00 00 00 00 00 00 00 00 00 00    .....'..........
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    3..?3..[3..C3..=
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    0040: 00 00 00 00 00 7f 7f ff 00 00 ff fe 07 00 00 00    ................
    => i2c md 0x58 0x10 40
    0010: 00 80 00 00 00 27 00 00 00 00 00 00 00 00 00 00    .....'..........
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    3..?3..[3..C3..=
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    0040: 00 00 00 00 00 7f 7f ff 00 00 ff fe 07 00 00 00    ................
    => i2c md 0x58 0x10 40
    0010: 00 80 00 00 00 27 00 00 00 00 00 00 00 00 00 00    .....'..........
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    3..?3..[3..C3..=
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    0040: 00 00 00 00 00 7f 7f ff 00 00 ff fe 07 00 00 00    ................
    => i2c md 0x58 0x10 40
    0010: 00 80 00 00 00 27 00 00 00 00 00 00 00 00 00 00    .....'..........
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    3..?3..[3..C3..=
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    0040: 00 00 00 00 00 7f 7f ff 00 00 ff fe 07 00 00 00    ................
    => i2c md 0x58 0x10 40
    0010: 00 80 00 00 00 27 00 00 00 00 00 00 00 00 00 00    .....'..........
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    3..?3..[3..C3..=
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    0040: 00 00 00 00 00 7f 7f ff 00 00 ff fe 07 00 00 00    ................
    => i2c md 0x58 0x10 40
    0010: 00 80 00 00 00 27 00 00 00 00 00 00 00 00 00 00    .....'..........
    0020: 33 03 c7 3f 33 00 00 5b 33 03 be 43 33 03 c7 3d    3..?3..[3..C3..=
    0030: 00 00 00 00 33 03 ae ae 00 00 00 00 d8 00 00 08    ....3...........
    0040: 00 00 00 00 00 7f 7f ff 00 00 ff fe 07 00 00 00    ................
    (...repeats dozens more times...)

     

    New theory: setting the address pointer in the PMIC occasionally glitches (in a regular, periodic manner) when the chosen register is an SMPS (possibly LDO) register.

     

    That doesn't get me any closer to answering the question of what happened in the first place or how to fix it, though.

  • Correction: the last example started at 0x10, not 0x00 as stated. Actually starting from 0x00 yields similar results.

  • Hi Zane,

    Are you able to use an external I2C controller to try to read out from the PMIC? I am still concerned with the I2C waveforms. I don't have a board with me to double check but a stop mid-command doesn't align with anything I've seen. A STOP between reads makes sense (instead of utilizing continuous read for example). 

  • The stop is between the address write and the read. I would have thought that to be normal. Nothing has been changed with the u-boot i2c driver vs the TI version that this u-boot is derived from. Additionally, I took a damaged SOM and placed it on the vendor's carrier board and booted it with the vendor's SDK image and it still had the same behavior.

    I also tried enabling CONFIG_I2C_REPEATED_START in U-boot with no noticeable change on the waveforms w.r.t the intermediate STOP. Also note that no board actually uses that option (and only omap24xx_i2c implements it), so the driver support might be broken.

    I don't have access to a proper external I2C controller right now, but I might be able to concoct something with a VGA cable. They have 3.3V I2C for EDID. It's either that or the Raspberry Pi that's hiding in my desk. I guess I'll try that tomorrow. I seriously doubt that will make a difference unless the SoC's I2C peripheral is damaged.

  • Hi Zane,

    One other thing you might try in the interim is whether a single I2C read works: https://stackoverflow.com/questions/24966553/reading-multiple-bytes-using-i2c-in-u-boot

  • That was the very first thing I tried since it uses the same path through the U-boot I2C driver as the palmas driver does.

    It doesn't work. That's how I discovered the sequence in the original post.

    What does work: sequential reads starting from an address that isn't a regulator register. For example, reading starting from 0x1f works every time.

  • Hi Zane,

    Thanks, that is interesting. Are you utilizing I2C2 lines? I don't really think that should impact anything. 

    And the output voltages aren't toggling or anything like that? Just want to make sure it's just the read out that is having an issue, not the actual content of the register changing.

    Also is LDOVRTC_OUT stable?

  • >Thanks, that is interesting. Are you utilizing I2C2 lines? I don't really think that should impact anything.

    We are using I2C2 for a touchscreen, though that hasn't been an issue before. The PMIC is on I2C1.

    >And the output voltages aren't toggling or anything like that? Just want to make sure it's just the read out that is having an issue, not the actual content of the register changing.

    I haven't checked, but the CTRL register for SMPS12, which feeds power to the processor, can read back values (such as 0x00) that indicate it's off. If it were actually off, I wouldn't be able to execute that read. Similar things happen for other registers.

    >Also is LDOVRTC_OUT stable?

    I'll check tomorrow.

    Some updates from today:

    I killed another SOM. That's 4 now.

    I discovered that the accessory board I use as an adapter to connect a USB-serial cable to our main board has left-over resistors on it that cause the microcontroller that controls power for the board to start the power-up sequence and then abort 9ms later. From the PMIC's perspective, 3.3V ramps up, 5V ramps up, a "button press" begins to occur, then the power dies. I'm also suspicious of the quality of the power supervisor implementation on the SOM, since it is unable to monitor the actual upstream power source.

    This does not explain why this issue only started happening recently, though.

    Another tidbit: different PMICs exhibit different patterns when they fail. It's not always the pattern posted upthread, but there is always a pattern.

    I did not get around to hacking together an external I2C master today.

    Here are some scope captures I took with the bad adapter board attached. (It's easier to just take a picture with my phone than to use the scope's USB, sorry.)

    Yellow is 3.3V to the SOM, blue is 5V.

    Yellow is 3.3V to SOM, blue is ONOFF.

    Obviously way out of spec, but I think it's been like this for a while, so I'm not confident that this is the cause of the issue at hand. If I try anything tomorrow, it'll be with a fixed adapter board and revised microcontroller code.

  • Hi Zane,

    Would you be able to provide the schematic? I am mostly interested in what is powered from 5V and what is powered from 3.3V. In particular, no net should have power before the net that includes VCC1 for example. If power is applied elsewhere prior to VCC1, then the internal leakage paths might create damage.

  • We'd probably be able to provide our carrier board schematic with an NDA. You'd have to get in touch over email (first initial last name at telinst.com) for that. The SOM belongs to PHY-TEC, so that would have to be set up with them.

    We tested another SOM today with the fixed serial adapter board (so no power-on spikes). It ran fine when we used older builds of firmware (as in, fairly recent commits/changes, but built before this issue came up). We then switched to an older version that was built recently. That broke the PMIC, which fits with the power spikes not being the culprit, since my first PMIC probably saw dozens to hundreds of cycles' worth of the power spike issue without dying.

    What I noticed is that, since we're using Buildroot, we don't have our own mirror of Linux, and I had pinned the Linux version to a branch instead of a similarly-named tag. As a result, my original build tree (prior to the Linux 5.4 experiment) contained Phy-Tec's kernel as of Aug 14, 2020, while later build trees, such as mine after the 5.4 experiment and my co-worker's (which is where the firmware used today was built), have the kernel as of Oct 19.

    Between those points, there's a commit that re-maps the RTCs to fixed aliases (RTC0 is the discrete RTC at I2C1 0x68, RTC1 is the OMAP RTC, RTC2 is the PMIC RTC). Previously, RTC2 had been the discrete RTC. I wrote an init script that, during shutdown, saves the current time to RTC2. This means that these new builds of firmware now write to the PMIC RTC, whereas the previous builds wrote to the discrete RTC.
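
    Whatever tool the script uses, saving the time to RTC2 boils down to roughly the following at the syscall level (a sketch, not our actual script); with the new alias ordering, /dev/rtc2 is the Palmas RTC, so this ends up as I2C writes to the PMIC:

    #include <fcntl.h>
    #include <linux/rtc.h>
    #include <sys/ioctl.h>
    #include <time.h>
    #include <unistd.h>

    int save_system_time_to_rtc2(void)
    {
            struct rtc_time rt = { 0 };
            time_t now = time(NULL);
            struct tm tm;
            int fd, ret;

            /* Break the current system time into calendar fields. */
            gmtime_r(&now, &tm);
            rt.tm_sec  = tm.tm_sec;
            rt.tm_min  = tm.tm_min;
            rt.tm_hour = tm.tm_hour;
            rt.tm_mday = tm.tm_mday;
            rt.tm_mon  = tm.tm_mon;
            rt.tm_year = tm.tm_year;

            fd = open("/dev/rtc2", O_WRONLY);   /* now the Palmas RTC */
            if (fd < 0)
                    return -1;

            /* The rtc-palmas driver services this with I2C writes to the PMIC. */
            ret = ioctl(fd, RTC_SET_TIME, &rt);
            close(fd);
            return ret;
    }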

    This change happened exactly when I first observed the failures. Additionally, to reiterate, this board ran a previously-built firmware image just fine, but exposure to the newly-built old image (best way to describe it) resulted in a failure after the first power-off/power-on sequence (the first boot is always fine).

    I believe it's guaranteed that no net will be powered before the one that includes VCC1 except for the following:

    1. The power button line is driven by a microcontroller that might not be handling the line as open drain. I will correct this when I get the chance. It's been like this for a while, though.

    2. VBUS_DET can have +5V from an attached USB host even when the power is off.

    Other notes: This is silicon revision 1.3. There is no pull down on LDO_VRTC. I'll try to get access to LDO_VRTC and measure it during shutdown. Our falling slew rate for VCC doesn't seem slow enough to generate a POR reliably, though the oscilloscope I have access to doesn't seem capable of measuring it. This problem has existed the entire development time on this platform, though (half a year).

  • Hi Zane,

    Unfortunately I won't be much help with the software side. Are you able to re-install the original software and get the board working again? Trying to see if the software is causing physical damage or if the change just needs to be reverted.

    We already tried installing the original software (as in, a known-good build). That worked. When we tried rebuilding a known working version (which was older than the previous image, since we were trying to be as safe as possible), that didn't work. The thought is that when we rebuilt, we picked up a change in Linux that we weren't expecting, and that's the difference between the builds.

    Basically, all working builds have a kernel version that has the RTCs in the order OMAP RTC, Palmas RTC, external RTC. The only difference the broken builds have is that their order is external RTC, OMAP RTC, Palmas RTC. We have a script that runs at shutdown that saves the current time to RTC2, which is now the Palmas RTC.

    I'm thinking that writing to the Palmas RTC, combined with the missing workaround for the Palmas POR erratum on silicon revision 1.3, and possibly combined with the falling slew rate/timing of the board power supplies, is fatal. An additional (circumstantial) piece of evidence is that the failing boards have always survived power-on and OPP throttling, which is where most of the U-boot/Linux writes to the regulator part of the PMIC happen. Powering down is when the failures occur. This is also when the RTC save script runs.

    I'll do some testing today.

  • This is a capture of suddenly removing power from the board (I can't do a safe shutdown on this board since it's broken).

    Blue is 3.3V input to PMIC. Yellow is LDOVRTC.

    Not sure what the ripples are in the middle. Probably the SMPS that supplies the PMIC trying to function despite power having failed since there's no supervisor on the carrier board.

    LDOVRTC settles at 500mV since this board is connected to a serial cable which is back-powering the processor (we are missing buffers). 3.3V in settles at 1.00V.

  • We're fairly confident that this has been resolved now. It did turn out that writing to the Palmas RTC was fatal for the PMIC. The solution was to disable it in software, since it shouldn't have been used anyway.

    One extra detail: the system, as expected, booted with the date set to the Unix epoch by default. Nothing updated this before power-down, so Jan 1, 1970 was written to the RTC (or at least handed to the driver). I suspect that whatever value the RTC driver wrote ended up being the ultimate cause. It seems like a pretty big hardware/driver flaw that writing the default datetime can destroy chips.