This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

Linux/AM5728: AM5728: PCIe communication and boot sequence

Part Number: AM5728

Tool/software: Linux

Hello,

We designed a carrier board for Phytec SOM.

Our board has two different PCIex1 connectors where we connect two PCIe-SATA cards (to give access to 4 SSD drives per card)

Both cards links succesfully at PCIe GEN1 and data can be read/written from/to our SSD drives.

However we have an issue with pcie-dra7xx driver (autor Kishon Vijay Abraham, from TI) under 4.14.67 Linux kernel.

When using this function without any modification:

static void dra7xx_pcie_shutdown(struct platform_device *pdev)
{
    struct device *dev = &pdev->dev;
    struct dra7xx_pcie *dra7xx = dev_get_drvdata(dev);
    int ret;

    dra7xx_pcie_stop_link(dra7xx->pci);

    ret = pm_runtime_put_sync(dev);
    if (ret < 0)
        dev_dbg(dev, "pm_runtime_put_sync failed\n");


    pm_runtime_disable(dev);


    dra7xx_pcie_disable_phy(dra7xx);
}

sometimes the system hangs and when restarting the system we get a lot of ata/scsi errors and one or both cards are not able to link or they do but disks are not recogniced.

However if we just add a little delay as follows in the mentioned function:

static void dra7xx_pcie_shutdown(struct platform_device *pdev)
{
    struct device *dev = &pdev->dev;
    struct dra7xx_pcie *dra7xx = dev_get_drvdata(dev);
    int ret;

    dra7xx_pcie_stop_link(dra7xx->pci);

    ret = pm_runtime_put_sync(dev);
    if (ret < 0)
        dev_dbg(dev, "pm_runtime_put_sync failed\n");

    msleep(500);    // delay

    pm_runtime_disable(dev);

    msleep(500);    // delay

    dra7xx_pcie_disable_phy(dra7xx);
}

we do not longer get any problems. Does anyone know the relation between this delay and identification/communication problems over the PCIe link?

How can the previous close affect the next boot? After a reboot (where global power has been restarted) everything should start from scrach. However at some point (PCIe cards?, PMIC?...) a previous state seems to be stored.

Has the PMIC any influence in the identification and communication over PCIe? The PMIC we are using is the one embeded in Phytec SOM so that we haven't made any changes.

Hope you can help me. Thank you in advance.

Regards,




  • Hello Jorge,

    We are sorry about the inconvenience but our expert on the topic is not available today and we should be able to give you feedback by early next week.

    Regards,
    Krunal
  • Hello Krunal

    Any feedback from your expert?

    Thanks in advance!
  • Hi, Jorge,

    You said without modification, sometimes the system hangs. Does it hang during the shutdown? Could you provide the ata/scsi error logs when system restarting?

    Thanks!

    Rex
  • Jorge,

    When you restart the system, do you reboot from kernel prompt or power cycle the system? Does the issue happen when power cycle the SOM?

    Rex
  • Dear Rex,

    Yes, it hangs during the shutdown.

    With the modification:

    111.383729] sd 4:0:0:0: [sdd] Synchronizing SCSI cache
    [  111.389687] sd 4:0:0:0: [sdd] Stopping disk
    [  111.397421] sd 3:0:0:0: [sdc] Synchronizing SCSI cache
    [  111.404092] sd 3:0:0:0: [sdc] Stopping disk
    [  111.412654] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
    [  111.419397] sd 1:0:0:0: [sdb] Stopping disk
    [  111.428405] sd 0:0:0:0: [sda] Synchronizing SCSI cache
    [  111.435216] sd 0:0:0:0: [sda] Stopping disk
    [  113.604451] reboot: Power down

    Without the modification:

    111.383729] sd 4:0:0:0: [sdd] Synchronizing SCSI cache
    [  111.389687] sd 4:0:0:0: [sdd] Stopping disk
    [  111.397421] sd 3:0:0:0: [sdc] Synchronizing SCSI cache
    [  111.404092] sd 3:0:0:0: [sdc] Stopping disk
    [  111.412654] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
    [  111.419397] sd 1:0:0:0: [sdb] Stopping disk
    [  111.428405] sd 0:0:0:0: [sda] Synchronizing SCSI cache
    [  111.435216] sd 0:0:0:0: [sda] Stopping disk  "<-- HANGS"

    This happens with TI SDK 5.01 and 5.00 but it does not happen with 4.03

    Then, after restarting (power cycle) some disks attached at PCIe interface are not recognized. Some errors we got are:

    [ 5.881123] ata2: SATA link down (SStatus 1 SControl 300)
    [ 6.203431] ata2: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
    [ 6.210890] ata2: irq_stat 0x80000040, connection status changed
    [ 6.216927] ata2: SError: { CommWake DevExch }
    [ 6.221435] ata2: hard resetting link
    [ 8.470266] ata2: SATA link down (SStatus 1 SControl 300)
    [ 8.475805] ata2: EH complete
    [ 8.702997] ata2: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
    [ 8.710471] ata2: irq_stat 0x80000040, connection status changed
    [ 8.716505] ata2: SError: { CommWake DevExch }
    [ 8.721010] ata2: limiting SATA link speed to 1.5 Gbps
    [ 8.726196] ata2: hard resetting link
    [ 9.360250] ata6: link is slow to respond, please be patient (ready=0)
    [ 10.950268] ata2: SATA link down (SStatus 1 SControl 310)
    [ 10.955781] ata2: EH complete
    [ 11.202575] ata2: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
    [ 11.210032] ata2: irq_stat 0x80000040, connection status changed
    [ 11.216082] ata2: SError: { CommWake DevExch }
    [ 11.220623] ata2: limiting SATA link speed to 1.5 Gbps
    [ 11.225815] ata2: hard resetting link
    [ 13.460265] ata2: SATA link down (SStatus 1 SControl 310)
    [ 13.465718] ata2: EH complete
    [ 13.630303] ata1: softreset failed (1st FIS failed)
    [ 13.690758] ata4: softreset failed (1st FIS failed)
    [ 23.630735] ata1: softreset failed (1st FIS failed)
    [ 23.691169] ata4: softreset failed (1st FIS failed)
    [ 58.630299] ata1: softreset failed (1st FIS failed)
    [ 58.635207] ata1: limiting SATA link speed to 3.0 Gbps
    [ 58.690576] ata4: softreset failed (1st FIS failed)
    [ 58.695481] ata4: limiting SATA link speed to 3.0 Gbps
    [ 63.640577] ata1: softreset failed (1st FIS failed)
    [ 63.645590] ata1: reset failed, giving up
    [ 63.700850] ata4: softreset failed (1st FIS failed)
    [ 63.705756] ata4: reset failed, giving up

    Any hint about the reason causing this behaviour?

    Thanks in advance!

  • Hi, Jorge,

    Are the errors for disk not recognized shown with the successful reboot using the delay in the shutdown?
    Does any of the errors happen using 4.3 release?
    I had a discussion internally. Our thought was the power not being stable.
    But I need more info from you to negate that conclusion.

    Rex
  • Jorge,

    I just checked with the system test again. We don't see different behaviors on SDK 4.3 (kernel 4.9) or SDK 5.0 (kernel 4.14) releases. We saw issues on PCIe ATA on other non-AM5728 platforms, but replacing the power supply seems to solve the issue. Did you said you use internal power for ATA? Does it have external power connector? Could you try to see if it resolve the disk not being recognized issue when restart.

    Rex
  • Hi, Jorge,

    We would have expected kernel to shutdown child devices (scsi disk) before parent (PCIe in this case). Either something is wrong in that order, or the order is fine, but there is some asynchronous operation going on while PCIe shuts down.

    We would like to first verify that PCIe shutdown is not erroneously being called before scsi subsystem shuts down all disks.

    To check this, can you insert "initcall_debug" (without quotes) into bootargs. This will give additional logs showing when each shutdown handler is called, and send me the logs (without delay in the code).

    Also, what is the exact command being used to shutdown the board? Does it happen if you run 'sync' before shutdown? I would also expect doing an unmount might mask the issue too. These will be good to try to help characterize the issue.

    Thanks!

    Rex
  • Dear Rex,

    I'll come back as soon as I have answers.

    I really appreciate your support.
  • Hello,

    Our SW developer has donde some test. Here are the results:

    Are the errors for disk not recognized shown with the successful reboot using the delay in the shutdown?

    ---> With the delay in driver pcie-dra7xx function dra7xx_pcie_shutdown disks are correctly recognized in the subsequent boot


    Does any of the errors happen using 4.3 release ?

    --> No.


    what is the exact command being used to shutdown the board? Does it happen if you run 'sync' before shutdown?

    --> The problem is the same with any shutdown command; reboot, shutdown, halt, poweroff.  Whether the discs have been used or not does not influence. Things like unmount or sync are irrelevant.


    Reports with initcall_debug=1

    --> Atached 4 files. The problem is easily recognizable
     
      A1_OK_shutdown_with_pcie_delays_initcall_debug..txt
      A2_OK_boot_with_previous_pcie_shutdelay_initcall_debug.txt
     
      B1_NOK_shutdown_without_pcie_delays_initcall_debug..txt   -> System Hangs. Hardware reset.
        Then
      B2_NOK_boot_without_previous_pcie_shutdelays_initcall_debug.txt
    2781.Logs.zip

    Comment:

    We have experienced a lot of problems with AM5728  PCIe for a long time. We had our doubts about PCIe connector, power supply, PMIC or cpu clock frequency regulation. But they were always erratic problems and hard to reproduce objectively. With the SDK version 4.03 (kernel 4.9.69) and various precautions, we managed to stabilize the system almost completely.

    But this problem in the SDK version 5.01 (kernel 4.14.67) is the first one that we can reproduce in a controlled way and we suspect that it is related to the previous problems.

    Thank you in advance for your support,

    Regards

  • Hi, Jorge,

    Could you do one more test? Could you see if the problem still exist with only one of the controllers is enabled (all other conditions remaining the same). Thanks!

    Rex
  • Hi, Jorge,

    Have you had the chance to try it with only one controller?

    Rex
  • Dear Rex,

    I will post the results as soon as I have the chance to do the test.

    Thank you for your help,

    Jorge

  • Dear Rex,

    Our SW developer has done the tests with only one of the controllers enabled. The result is the same.

    Attached are LOGs

    C1_NOK_boot_without_previous_pcie_shutdelays_initcall_debug.txt

        33.571917] watchdog: watchdog0: watchdog did not stop!
    [   33.582034] systemd-shutdow: 29 output lines suppressed due to ratelimiting
    [   33.761204] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
    [   33.778792] systemd-journald[229]: Received SIGTERM from PID 1 (systemd-shutdow).
    [   33.833655] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
    [   33.848465] systemd-shutdown[1]: Hardware watchdog 'OMAP Watchdog', version 0
    [   33.856916] systemd-shutdown[1]: Unmounting file systems.
    [   33.862928] systemd-shutdown[1]: Remounting '/' read-only with options 'data=ordered'.
    [   33.925200] EXT4-fs (mmcblk0p2): re-mounted. Opts: data=ordered
    [   33.931205] systemd-shutdown[1]: All filesystems unmounted.
    [   33.936889] systemd-shutdown[1]: Deactivating swaps.
    [   33.942134] systemd-shutdown[1]: All swaps deactivated.
    [   33.947617] systemd-shutdown[1]: Detaching loop devices.
    [   34.996639] pru-rproc 4b2b8000.pru: shutdown
    [   35.000931] pru-rproc 4b2b4000.pru: shutdown
    [   35.005230] pruss-intc 4b2a0000.intc: shutdown
    [   35.009697] pru-rproc 4b238000.pru: shutdown
    [   35.013992] pru-rproc 4b234000.pru: shutdown
    [   35.018278] pruss-intc 4b220000.intc: shutdown
    [   35.022772] pruss 4b280000.pruss: shutdown
    [   35.026884] pruss 4b200000.pruss: shutdown
    [   35.031023] xhci-hcd xhci-hcd.1.auto: shutdown
    [   35.035561] dwc3 488d0000.usb: shutdown
    [   35.039481] sd 4:0:0:0: shutdown
    [   35.042737] sd 3:0:0:0: shutdown
    [   35.045976] sd 3:0:0:0: [sdd] Synchronizing SCSI cache
    [   35.053706] sd 2:0:0:0: shutdown
    [   35.056947] sd 2:0:0:0: [sdc] Synchronizing SCSI cache
    [   35.064630] sd 1:0:0:0: shutdown
    [   35.067870] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
    [   35.075711] sd 0:0:0:0: shutdown
    [   35.078951] sd 0:0:0:0: [sda] Synchronizing SCSI cache
    [   35.102672] xhci-hcd xhci-hcd.0.auto: shutdown
    [   35.107260] mmcblk mmc1:0001: shutdown
    [   35.145605] dwc3 48890000.usb: shutdown
    [   35.149475] mmcblk mmc0:59b4: shutdown
    [   35.154341] cpufreq-dt cpufreq-dt: shutdown
    [   35.158550] tlv320aic3x-codec 3-0018: shutdown
    [   35.163038] i2c i2c-3: shutdown
    [   35.166193] i2c 0-0068: shutdown
    [   35.169434] at24 0-0050: shutdown
    [   35.172825] palmas-usb 48070000.i2c:tps659038@58:tps659038_usb: shutdown
    [   35.179583] palmas-rtc 48070000.i2c:tps659038@58:tps659038_rtc: shutdown
    [   35.186435] palmas-pmic 48070000.i2c:tps659038@58:tps659038_pmic: shutdown
    [   35.193401] dummy 0-005a: shutdown
    [   35.196842] dummy 0-0059: shutdown
    [   35.200301] palmas 0-0058: shutdown
    [   35.203846] i2c i2c-0: shutdown
    [   35.207053] snd-soc-dummy snd-soc-dummy: shutdown
    [   35.211812] ti-cpufreq ti-cpufreq: shutdown
    [   35.216094] cpsw-phy-sel 4a002554.cpsw-phy-sel: shutdown
    [   35.221566] davinci_mdio 48485000.mdio: shutdown
    [   35.226328] serial8250 serial8250: shutdown
    [   35.231038] aer 0000:00:00.0:pcie002: shutdown
    [   35.235515] pcie_pme 0000:00:00.0:pcie001: shutdown
    [   35.240413] ahci 0000:01:00.0: shutdown
    [   35.244275] pcieport 0000:00:00.0: shutdown
    [   35.248478] dra7-pcie 51000000.pcie: shutdown
    

  • Hi, Jorge,

    I tried on TI EVM with one PCIe port (which we only have 1 port any way) and 1 SATA drive. It works fine. With 2 SATA drive, both links comes up without problem, only when I do "ls /dev/sd*", it prints messages shown below, but was able to come up again. I don't see any hung issue or SATA drives not coming up. I also tried reboot without any issues either.

    root@am57xx-evm:~# ls /dev/sd*
    /dev/sda /dev/sda1 /dev/sda2 /dev/sda5 /dev/sdb /dev/sdb1 /dev/sdb2
    root@am57xx-evm:~# [ 48.530827] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x400000 action 0x6 frozen
    [ 48.538961] ata2: SError: { Handshk }
    [ 48.544797] ata2.00: failed command: READ FPDMA QUEUED
    [ 48.550062] ata2.00: cmd 60/01:00:6b:6a:00/00:00:00:00:00/40 tag 0 ncq dma 51
    [ 49.333903] ata2.00: status: { DRDY }
    [ 49.337610] ata2.00: failed command: READ FPDMA QUEUED
    [ 49.348809] ata2.00: cmd 60/01:f0:6a:6a:00/00:00:00:00:00/40 tag 30 ncq dma 512 in
    [ 49.348809] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
    [ 49.377970] ata2.00: status: { DRDY }
    [ 49.385417] ata2: hard resetting link
    [ 49.890787] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 330)
    [ 49.898286] ata2.00: configured for UDMA/100
    [ 49.904606] ata2.00: device reported invalid CHS sector 0
    [ 49.910034] ata2.00: device reported invalid CHS sector 0
    [ 49.917334] ata2.00: device reported invalid CHS sector 0
    [ 50.420543] print_req_error: I/O error, dev sdb, sector 27251
    [ 50.427905] sd 1:0:0:0: [sdb] tag#9 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
    [ 50.437216] sd 1:0:0:0: [sdb] tag#9 Sense Key : 0x5 [current]
    [ 50.444128] sd 1:0:0:0: [sdb] tag#9 ASC=0x21 ASCQ=0x4
    [ 50.449290] sd 1:0:0:0: [sdb] tag#9 CDB: opcode=0x28 28 00 00 00 6a 74 00 00 01 00
    [ 50.459008] print_req_error: I/O error, dev sdb, sector 27252
    [ 50.466059] ata2: EH complete
    root@am57xx-evm:~#
    root@am57xx-evm:~# ls /dev/sd*
    /dev/sda /dev/sda1 /dev/sda2 /dev/sda5 /dev/sdb /dev/sdb1 /dev/sdb2
    root@am57xx-evm:~#
    root@am57xx-evm:~#

    root@am57xx-evm:~# hdparm -N /dev/sda

    /dev/sda:
    max sectors = 488397168/488397168, HPA is disabled
    root@am57xx-evm:~# hdparm -N /dev/sdb

    /dev/sdb:
    max sectors = 1953525168/1953525168, HPA is disabled

    I am using Marvell SATA controller

    root@am57xx-evm:~# lspci -nn
    00:00.0 PCI bridge [0604]: Texas Instruments Multicore DSP+ARM KeyStone II SOC [104c:8888] (rev 01)
    01:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9120 SATA 6Gb/s Controller [1b4b:9120] (rev 12)
    01:00.1 IDE interface [0101]: Marvell Technology Group Ltd. 88SE912x IDE Controller [1b4b:91a4] (rev 12)

    and SATA drives: WD10EZEX(1TB) and WD2500AAKS(250GB).

    I suspect the messages printed is caused by bad signal of ATA which cause the ATA re-negotiation. I googled on the subject as a reference: unix.stackexchange.com/.../what-causes-the-ata-exceptions-in-my-syslog-and-how-to-solve-them

    I am not sure if trying different SATA drive would improve in some degree, and this may not be the PCIe issue.

    Rex