PCIESS hangs?

Other Parts Discussed in Thread: TLC59108, AM3874

I am using a DM8148 EVM with PCIESS configured as RC. In the ti81xx_pcie_setup() routine, as soon as the code enables link training (LTSSM), all subsequent reads from the PCIESS application registers cause the kernel to panic. The error message is:

"Unhandled fault: external abort on non-linefetch (0x1008) at 0xe8820004"

This is triggered when the driver next tries to read the CMD_STATUS register, which happens in disable_bars() -> set_dbi_mode(). What happened in PCIESS to make the ioremapped registers unreadable?

Thank you in advance!

Dennis

  • Dennis,

    Are you working with EZSDK 5.05.02.00 / PSP 04.04.00.01?

    Dennis McLeod said:
    I am using a DM8148 evm and have PCIESS configured as RC

    Have you followed the wiki page below?

    http://processors.wiki.ti.com/index.php/TI81XX_PSP_PCI_Express_Root_Complex_Driver_User_Guide

    Dennis McLeod said:
    What happened in PCIESS to cause inability to read ioremapped memory??? 

    Does the patch below fix this?

    http://arago-project.org/git/projects/?p=linux-omap3.git;a=commit;h=6369801405c5b10cf2d0837ad89b4a826e11615d

    BR
    Pavel

  • Hi Pavel, 

    Thank you for your reply. I did follow the RC driver guide quite closely. I just applied the patch you referred to and there is no change in behavior. 

    I can provide any register dumps you think might help solve the problem 

    Another interesting piece of information is that we have a PLX 8605 downstream. If we hold the PLX in reset, the kernel boots fine. Once the kernel is booted (and with PLX still held in reset), we can do "devmem 0x51000004 32 0xa07"  to initiate link training and there is no problem. If we then take the PLX out of reset and then initiate link training, the problem reappears.  Any accesses to the area at 0x5100xxxx  cause the "external abort" error to reappear.   
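
    For readers following along, the devmem command above can be unpacked with a little arithmetic. The sketch below is illustrative only: the macro and function names are made up here, and the only facts taken from the thread are the 0x51000000 register base, the 0x51000004 target address, and (from the boot logs further down) that enabling link training changes CMD_STATUS from a00 to a01, i.e. bit 0. What bits 1 and 2 of the written value 0xa07 do is not stated in this thread.

```c
#include <stdint.h>

/* Physical base of the PCIESS application registers, as given in the thread. */
#define PCIESS_REG_PHYS   0x51000000u
/* CMD_STATUS offset: the devmem target 0x51000004 minus the base. */
#define CMD_STATUS_OFFSET 0x004u
/* Link-training enable; the boot logs show CMD_STATUS going a00 -> a01. */
#define LTSSM_EN_BIT      (1u << 0)

/* Physical address of CMD_STATUS, i.e. the address passed to devmem. */
uint32_t cmd_status_phys(void)
{
    return PCIESS_REG_PHYS + CMD_STATUS_OFFSET;
}

/* Nonzero if the given CMD_STATUS value has link training enabled. */
int ltssm_enabled(uint32_t cmd_status)
{
    return (cmd_status & LTSSM_EN_BIT) != 0;
}

/* Bits that a write would newly set compared with the current value. */
uint32_t newly_set_bits(uint32_t current, uint32_t written)
{
    return written & ~current;
}
```

    For example, newly_set_bits(0xa00, 0xa07) yields 0x7: the devmem write sets bit 0 (link training enable) plus bits 1 and 2.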

    I would understand if the link training failed. But for it to suddenly have problems accessing memory, this is quite confusing. 

  • Dennis,

    Can you also apply the patches below? Do they make any difference?

    http://arago-project.org/git/projects/?p=linux-omap3.git;a=commit;h=7367129164936713aaa7e832fd4c22e3bc1c3a2a

    http://arago-project.org/git/projects/?p=linux-omap3.git;a=commit;h=937f9325a14db8a584af933f9b9b8c51fa34573c

    http://arago-project.org/git/projects/?p=linux-omap3.git;a=commit;h=d5352bda02e9af813632e0afa5bd25dcc997b086

    http://arago-project.org/git/projects/?p=linux-omap3.git;a=commit;h=3e1bd8effac5332322e1dbe98e2c7535f20c0416

    Regards,
    Pavel

  • The SERDES_STATUS register in the PCIESS application registers block is undocumented. Can anyone tell us what the bits represent? 

    I will apply the patches you listed and report back soon. 

  • No luck after applying the patches. The same behavior continues. 

    Just so it is clear, the issue happens instantly after the call to 

    writel(LTSSM_EN_VAL | readl(reg_virt + CMD_STATUS),
    reg_virt + CMD_STATUS);

    in ti81xx_pcie_setup().   Any call to readl or writel to the memory pointed to by "reg_virt" will cause the error. 

    If the writel above is allowed to execute during bootup, it results in a kernel halt due to the "Unhandled fault: external abort on non-linefetch (0x1008) at  e882000"  and a reboot is required.  

    I built a kernel with that line commented out, and it boots fine.  Once booted, I can initiate link training with 

    devmem2 0x51000004 32 0x0a07

    Once I do this, any devmem2 access to 0x5100xxxx prints the error (Unhandled fault: external abort on non-linefetch), but it does not halt the kernel.

    What is also very interesting is that if we use devmem2 to initiate link training and then wait a very long time (many minutes), we can again use devmem2 to query the PCIESS without getting the error.

    Please, any input is greatly appreciated  :) 

    Thank you again

  • Dennis,

    This is what we have in pcie-ti81xx.c

    static int ti81xx_pcie_setup(int nr, struct pci_sys_data *sys)
    {

        ......

        /* 16KB region is sufficient for reg(4KB) + configs(8KB) + IO(4KB) */
        reg_virt = (u32)ioremap_nocache(reg_phys, SZ_16K);

        if (!reg_virt) {
            pr_err(DRIVER_NAME ": PCIESS register memory remap failed\n");
            goto err_ioremap;
        }

        pr_info(DRIVER_NAME ": Register base mapped @0x%08x\n", (int)reg_virt);

        ......

        __raw_writel(DIR_SPD | __raw_readl(
                        reg_virt + SPACE0_LOCAL_CFG_OFFSET + PL_GEN2),
                    reg_virt + SPACE0_LOCAL_CFG_OFFSET + PL_GEN2);

        ......

        if (device_id)
            __raw_writew(device_id, reg_virt + SPACE0_LOCAL_CFG_OFFSET +
                    PCI_DEVICE_ID);

        ......

        __raw_writel(LTSSM_EN_VAL | __raw_readl(reg_virt + CMD_STATUS),
                     reg_virt + CMD_STATUS);

        /* 100ms */
        msleep(100);

        __raw_writew(PCI_CLASS_BRIDGE_PCI,
                    reg_virt + SPACE0_LOCAL_CFG_OFFSET + PCI_CLASS_DEVICE);

        .....

    }

    These are the messages I have on a successful boot:

    ti81xx_pcie: Register base mapped @0xd7020000

    This means that the PCIe register base address (physical address) 0x51000000 is mapped to the virtual address 0xD7020000. Do you have something similar?

    Also, do you mean that you have several successful __raw_readl() and __raw_writel() calls before the crash?

    Dennis McLeod said:

    Just so it is clear, the issue happens instantly after the call to 

    writel(LTSSM_EN_VAL | readl(reg_virt + CMD_STATUS),
    reg_virt + CMD_STATUS);


    Does your boot flow crash at the line below?

    __raw_writew(PCI_CLASS_BRIDGE_PCI,
                reg_virt + SPACE0_LOCAL_CFG_OFFSET + PCI_CLASS_DEVICE);

    BR
    Pavel

     

  • This means that the PCIe register base address (physical address) 0x51000000 is mapped to the virtual address 0xD7020000. Do you have something similar?

     

    Yes, that line is printed for us also. However, the virtual address is different. It says "ti81xx_pcie: Register base mapped @0xe8820000". (Probably different because we're on a DM8148 and you're using a DM8168?)

     

    Does your boot flow crash at the line below?

    __raw_writew(PCI_CLASS_BRIDGE_PCI,
                reg_virt + SPACE0_LOCAL_CFG_OFFSET + PCI_CLASS_DEVICE);

     

    I am sorry, I was incorrect when I said that "any reads/writes to reg_virt window" cause the error. During boot up, the crash happens when application registers are accessed.  So in the following code block, the crash happens inside the disable_bars() function. The first thing disable_bars() does is call set_dbi_mode(). The set_dbi_mode() function contains a readl(reg_virt + CMD_STATUS) that blows everything up. 

     

    writel(LTSSM_EN_VAL | readl(reg_virt + CMD_STATUS),
    reg_virt + CMD_STATUS);

    /* 100ms */
    msleep(100);

    /*
    * Identify ourselves as 'Bridge' for enumeration purpose. This also
    * avoids "Invalid class 0000 for header type 01" warnings from "lspci".
    *
    * If at all we want to restore the default class-subclass values, the
    * best place would be after returning from pci_common_init ().
    */
    writew(PCI_CLASS_BRIDGE_PCI,
    reg_virt + SPACE0_LOCAL_CFG_OFFSET + PCI_CLASS_DEVICE);

    /*
    * Prevent the enumeration code from assigning resources to our BARs. We
    * will set up them after the scan is complete.
    */
    disable_bars();       //  << ----- #### crash happens in here ####

     

    Best Regards, 

    Dennis

  • Dennis,

    Dennis McLeod said:
    Yes, that line is printed for us also. However, the virtual address is different. It says "ti81xx_pcie: Register base mapped @0xe8820000". (Probably different because we're on a DM8148 and you're using a DM8168?)

    No, I am also using a DM8148 EVM, but with nothing attached to the PCIe. If you remove the PCIe device attached to the DM8148 EVM, do you still get the same virtual address (0xE8820000)?

    Can you read the CMD_STATUS register right after you set the virtual address:

        /* 16KB region is sufficient for reg(4KB) + configs(8KB) + IO(4KB) */
        reg_virt = (u32)ioremap_nocache(reg_phys, SZ_16K);

        if (!reg_virt) {
            pr_err(DRIVER_NAME ": PCIESS register memory remap failed\n");
            goto err_ioremap;
        }

        pr_info(DRIVER_NAME ": Register base mapped @0x%08x\n", (int)reg_virt);

        pcie_ck = clk_get(NULL, "pcie_ck");
        if (IS_ERR(pcie_ck)) {
            pr_err(DRIVER_NAME ": Failed to get PCIESS clock\n");
            goto err_clkget;
        }

        if (clk_enable(pcie_ck))
            goto err_clken;

        __raw_readl(reg_virt + CMD_STATUS);  /* <---- Are you able to read the CMD_STATUS register here, or does the flow crash as before? */

    BR
    Pavel

  • I placed several reads between the ioremap and the "initiate link training", as you requested. Like this: 

    if (clk_enable(pcie_ck))
    goto err_clken;

    printk("pcie: %s %d reading CMD_STATUS\n", __FILE__, __LINE__);
    val32 = readl(reg_virt + CMD_STATUS);
    printk("pcie: %s %d read CMD_STATUS, value = %x\n", __FILE__, __LINE__, val32);

    /*
    * TI81xx devices do not support h/w autonomous link up-training to GEN2
    * from GEN1 in either EP/RC modes. The software needs to initiate speed
    * change.
    */
    writel(DIR_SPD | readl(
    reg_virt + SPACE0_LOCAL_CFG_OFFSET + PL_GEN2),
    reg_virt + SPACE0_LOCAL_CFG_OFFSET + PL_GEN2);

    printk("pcie: %s %d reading CMD_STATUS\n", __FILE__, __LINE__);
    val32 = readl(reg_virt + CMD_STATUS);
    printk("pcie: %s %d read CMD_STATUS, value = %x\n", __FILE__, __LINE__, val32);

    .....

    printk("pcie: %s %d reading CMD_STATUS\n", __FILE__, __LINE__);
    val32 = readl(reg_virt + CMD_STATUS);
    printk("pcie: %s %d read CMD_STATUS, value = %x\n", __FILE__, __LINE__, val32);
    /*
    * Override the default device ID if required - TI81XX devices generally
    * come up with ID 0x8888.
    */
    if (device_id)
    writew(device_id, reg_virt + SPACE0_LOCAL_CFG_OFFSET +
    PCI_DEVICE_ID);

    printk("pcie: %s %d reading CMD_STATUS\n", __FILE__, __LINE__);
    val32 = readl(reg_virt + CMD_STATUS);
    printk("pcie: %s %d read CMD_STATUS, value = %x\n", __FILE__, __LINE__, val32);
    /*
    * Initiate Link Training. We will delay for L0 as specified by
    * standard, but will still proceed and return success irrespective of
    * L0 status as this will be handled by explicit L0 state checks during
    * enumeration.
    */
    writel(LTSSM_EN_VAL | readl(reg_virt + CMD_STATUS),
    reg_virt + CMD_STATUS);

    /* 100ms */
    msleep(100);

    printk("pcie: %s %d reading CMD_STATUS\n", __FILE__, __LINE__);
    val32 = readl(reg_virt + CMD_STATUS);  // <--- #### causes crash ####
    printk("pcie: %s %d read CMD_STATUS, value = %x\n", __FILE__, __LINE__, val32);

    ----------------------

    Here is the output: 

    ti81xx_pcie: Invoking PCI BIOS...
    ti81xx_pcie: Setting up Host Controller...
    ti81xx_pcie: Register base mapped @0xe8820000
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 655 reading CMD_STATUS
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 657 read CMD_STATUS, value = a00
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 668 reading CMD_STATUS
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 670 read CMD_STATUS, value = a00
    ti81xx_pcie: forcing link width - x1
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 703 reading CMD_STATUS
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 705 read CMD_STATUS, value = a00
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 714 reading CMD_STATUS
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 716 read CMD_STATUS, value = a00
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 729 reading CMD_STATUS
    Unhandled fault: external abort on non-linefetch (0x1008) at 0xe8820004
    Internal error: : 1008 [#1]
    last sysfs file:
    Modules linked in:
    CPU: 0 Not tainted (2.6.37+ #2)
    PC is at ti81xx_pcie_setup+0x388/0x6c0
    LR is at release_console_sem+0x198/0x1ac
    pc : [<c005ca3c>] lr : [<c006cf04>] psr: 60000013
    sp : e783be40 ip : e783bd78 fp : e783be6c
    r10: 00000000 r9 : 00000000 r8 : e786ac9c
    r7 : e786ac40 r6 : c04c4610 r5 : e786ac80 r4 : c04c4610
    r3 : e8820000 r2 : c049fe00 r1 : 000015ca r0 : 00000042
    Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel
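
    The faulting address in this log can be translated back into a register offset by subtracting the mapped base printed a few lines earlier. The sketch below does that and sorts the offset into the three areas named in the driver comment quoted earlier in the thread ("reg(4KB) + configs(8KB) + IO(4KB)"). It assumes the areas are laid out in that order within the 16KB mapping; the enum and function names are invented for illustration.

```c
#include <stdint.h>

/* Physical base of the PCIESS register window, as given in the thread. */
#define PCIESS_REG_PHYS 0x51000000u

enum pcie_region {
    REGION_APP_REGS,   /* first 4KB: application registers        */
    REGION_CONFIG,     /* next 8KB: configuration space (assumed) */
    REGION_IO,         /* last 4KB: IO (assumed)                  */
    REGION_OUTSIDE     /* not inside the 16KB mapping             */
};

/* Offset of a faulting virtual address from the ioremap base. */
uint32_t fault_offset(uint32_t fault_virt, uint32_t reg_virt)
{
    return fault_virt - reg_virt;
}

/* Physical address that corresponds to the faulting virtual address. */
uint32_t fault_phys(uint32_t fault_virt, uint32_t reg_virt)
{
    return PCIESS_REG_PHYS + fault_offset(fault_virt, reg_virt);
}

/* Sort an offset into one of the areas of the 16KB mapping. */
enum pcie_region classify_offset(uint32_t off)
{
    if (off < 0x1000u)
        return REGION_APP_REGS;
    if (off < 0x3000u)
        return REGION_CONFIG;
    if (off < 0x4000u)
        return REGION_IO;
    return REGION_OUTSIDE;
}
```

    For the log above, 0xe8820004 - 0xe8820000 gives offset 0x4 in the application register area, which is physical address 0x51000004, the CMD_STATUS register.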

  • Dennis,

    Dennis McLeod said:

    /*
    * Initiate Link Training. We will delay for L0 as specified by
    * standard, but will still proceed and return success irrespective of
    * L0 status as this will be handled by explicit L0 state checks during
    * enumeration.
    */
    writel(LTSSM_EN_VAL | readl(reg_virt + CMD_STATUS),
    reg_virt + CMD_STATUS);

    /* 100ms */
    msleep(100);

    printk("pcie: %s %d reading CMD_STATUS\n", __FILE__, __LINE__);
    val32 = readl(reg_virt + CMD_STATUS);  // <--- #### causes crash ####

    Maybe this 100 ms delay is not long enough for link training to complete successfully. Can you check the value of DEBUG0[4:0] (LTSSM_STATE) right after the msleep(100) call? Does it show 0x11?
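
    A minimal sketch of the check being asked for here, assuming LTSSM_STATE occupies DEBUG0[4:0] and that 0x11 is the value reported once training has completed, as this post implies. The reader callback and all names are hypothetical; a real driver would read DEBUG0 with readl() and sleep between polls.

```c
#include <stdint.h>

#define LTSSM_STATE_MASK 0x1fu  /* DEBUG0[4:0] */
#define LTSSM_STATE_UP   0x11u  /* value asked about in this post */

/* Hypothetical callback that returns the current DEBUG0 value. */
typedef uint32_t (*read_debug0_fn)(void *ctx);

/* Extract the LTSSM_STATE field from a raw DEBUG0 value. */
unsigned ltssm_state(uint32_t debug0)
{
    return debug0 & LTSSM_STATE_MASK;
}

/* Nonzero if the DEBUG0 value reports the expected post-training state. */
int link_trained(uint32_t debug0)
{
    return ltssm_state(debug0) == LTSSM_STATE_UP;
}

/*
 * Poll DEBUG0 up to max_polls times instead of relying on one fixed delay.
 * Returns the number of reads it took, or -1 if the state never appeared.
 */
int wait_for_link(read_debug0_fn read_debug0, void *ctx, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (link_trained(read_debug0(ctx)))
            return i + 1;
    }
    return -1;
}
```

    Polling with an upper bound reports how long training actually took, which a single msleep(100) followed by one read cannot show.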

    BR
    Pavel

  • I think we are getting closer to a solution. I noticed that on the EVM, the u-boot "bootargs" contained mem=size@offs entries. I have never needed those for our board. Our board has 1GB of DDR, in contrast to the EVM's 2GB of DDR. So, just to try it, I added this to u-boot's bootargs:

    mem=512M@0x80000000  mem=512M@0xa0000000

    It booted!  It detected the PLX bridge chip and assigned resources for all 4 ports.

    But it only booted once! I cannot get it to boot again. Even with those mem= entries, it is back to behaving as it did before. So I think I have some memory allocation problem. The memory-related bootargs parameters are mem=512M@0x80000000 mem=512M@0xa0000000 and vmalloc=256M.

    I did change the ti81xx_pcie_resources entry in devices.c for "pcie-inbound0" to the following: 

    {
    /* Inbound memory window - DJM: EVM has 2GB ddr, we only have 1GB */
    .name = "pcie-inbound0",
    .start = PLAT_PHYS_OFFSET,
    .end = PLAT_PHYS_OFFSET + SZ_1G - 1,
    .flags = IORESOURCE_MEM,
    },

    Is there anywhere else I need to modify to reflect the smaller memory size?

    Maybe the 4 different ports of the PLX switch are getting enumerated and take up too much memory? And that is what is causing the crash?  I tried slightly smaller and slightly larger sizes for vmalloc, but it didn't make a difference. 

    BR, 

    Dennis

  • Dennis,

    Dennis McLeod said:
    Our board has 1GB of ddr, in contrast to the EVM's 2GB ddr.

    The DM8148 EVM has 1GB DDR3 memory.

    Dennis McLeod said:
    So I think I have some memory allocation problem.

    Dennis McLeod said:
    Maybe the 4 different ports of the PLX switch are getting enumerated and take up too much memory? And that is what is causing the crash?

    Yes, I also suspect that the root cause for the crash is the PLX switch and the memory it requires.

    BR
    Pavel

  • A possible solution is to shrink some of the regions in the EZSDK 1GB memory map and give the freed memory to the Linux kernel (via the mem argument):

    http://processors.wiki.ti.com/index.php/EZSDK_Memory_Map

    Regards,
    Pavel

  • Hi Pavel, 

    I was just reviewing that page too, the EZSDK_Memory_Map. It appears my mem= arguments were set incorrectly: I was passing in the entire DDR range.

    So, as instructed on that wiki page, I set the bootargs as follows:

    mem=364M@0x80000000 mem=320M@0x9fc00000  vmalloc=500M 

    The board booted again, without problem. 
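
    To see what these arguments hand to the kernel, each mem=size@offs entry can be converted into a start and end address. The sketch below is a hypothetical helper, not anything from the kernel or U-Boot, and it only handles the "M" suffix used in this thread.

```c
#include <stdint.h>
#include <stdio.h>

struct mem_region {
    uint64_t base;  /* first byte of the region */
    uint64_t size;  /* length in bytes          */
};

/* Parse "mem=<n>M@<hex base>". Returns 1 on success, 0 otherwise. */
int parse_mem_arg(const char *arg, struct mem_region *out)
{
    unsigned long long mb, base;

    if (sscanf(arg, "mem=%lluM@%llx", &mb, &base) != 2)
        return 0;
    out->base = base;
    out->size = (uint64_t)mb << 20;  /* megabytes to bytes */
    return 1;
}

/* Size in bytes described by the argument, or 0 if it does not parse. */
uint64_t mem_arg_size(const char *arg)
{
    struct mem_region r;

    return parse_mem_arg(arg, &r) ? r.size : 0;
}

/* First address past the region, or 0 if the argument does not parse. */
uint64_t mem_arg_end(const char *arg)
{
    struct mem_region r;

    return parse_mem_arg(arg, &r) ? r.base + r.size : 0;
}
```

    With this, mem=364M@0x80000000 ends at 0x96C00000 and mem=320M@0x9fc00000 ends at 0xB3C00000, whereas the earlier pair mem=512M@0x80000000 mem=512M@0xa0000000 covers everything from 0x80000000 up to 0xC0000000.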

    Here is part of the boot log: 

    ti81xx_pcie: forcing link width - x1
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 709 reading CMD_STATUS
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 711 read CMD_STATUS, value = a00
    pcie: cmregs PCIE_CFG = 2
    pcie: cmregs PCIE_PLLCFG0 = 70007017
    pcie: cmregs PCIE_PLLCFG1 = 640010
    pcie: cmregs PCIE_PLLCFG2 = 0
    pcie: cmregs PCIE_PLLCFG3 = 4008e0
    pcie: cmregs PCIE_PLLCFG4 = 609c
    pcie: cmregs PCIE_PLLSTATUS = 88cd
    pcie: cmregs PCIE_RXSTATUS = 0
    pcie: cmregs PCIE_TXSTATUS = 0
    pcie: cmregs SERDES_RFCK_CTL = 2
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 744 reading CMD_STATUS
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 746 read CMD_STATUS, value = a00
    pcie: cmregs PCIE_CFG = 2
    pcie: cmregs PCIE_PLLCFG0 = 70007017
    pcie: cmregs PCIE_PLLCFG1 = 640010
    pcie: cmregs PCIE_PLLCFG2 = 0
    pcie: cmregs PCIE_PLLCFG3 = 4008e0
    pcie: cmregs PCIE_PLLCFG4 = 609c
    pcie: cmregs PCIE_PLLSTATUS = 88cd
    pcie: cmregs PCIE_RXSTATUS = 0
    pcie: cmregs PCIE_TXSTATUS = 0
    pcie: cmregs SERDES_RFCK_CTL = 2
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 784 reading CMD_STATUS
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 786 read CMD_STATUS, value = a01
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 797 reading CMD_STATUS
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 799 read CMD_STATUS, value = a01
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 806 reading CMD_STATUS
    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 808 read CMD_STATUS, value = a01
    ti81xx_pcie: Starting PCI scan...
    PCI: bus0: Fast back to back transfers disabled
    pci 0000:01:00.0: unsupported PM cap regs version (7)
    PCI: bus1: Fast back to back transfers enabled
    PCI: bus2: Fast back to back transfers enabled
    ti81xx_pcie: PCI scan done.
    pci 0000:00:00.0: BAR 8: assigned [mem 0x20000000-0x200fffff]
    pci 0000:01:00.0: BAR 0: assigned [mem 0x20000000-0x20003fff]
    pci 0000:01:00.0: BAR 0: error updating (0x20000000 != 0xffffffff)
    pci 0000:01:00.0: BAR 0: set to [mem 0x20000000-0x20003fff] (PCI address [0x20000000-0x20003fff])
    pci 0000:01:00.0: PCI bridge to [bus 02-02]
    pci 0000:01:00.0: bridge window [io disabled]
    pci 0000:01:00.0: bridge window [mem disabled]
    pci 0000:01:00.0: bridge window [mem pref disabled]
    pci 0000:00:00.0: PCI bridge to [bus 01-02]
    pci 0000:00:00.0: bridge window [io disabled]
    pci 0000:00:00.0: bridge window [mem 0x20000000-0x200fffff]
    pci 0000:00:00.0: bridge window [mem pref disabled]
    pci 0000:00:00.0: Refused to change power state, currently in D3
    bio: create slab <bio-0> at 0

    ...

    But you're not going to believe this... 

    It wouldn't boot a second time.  When I cycled power, the pcie driver got to the same line and then crashed the same way. 

    pcie: arch/arm/mach-omap2/pcie-ti81xx.c 784 reading CMD_STATUS
    Unhandled fault: external abort on non-linefetch (0x1008) at 0xd7020004

    Why would it boot once, but never again? 

  • Dennis McLeod said:
    It wouldn't boot a second time.  When I cycled power, the pcie driver got to the same line and then crashed the same way. 

    When booting the second time, have you verified that your boot arguments are the same as for the first, successful boot? Is it possible that the boot args were wrong on the second boot?

    Regards

    Pavel

  • I did verify bootargs, yes. 

    I find it so strange that altering the memory arguments can cause it to boot successfully, but only once. 

    I just got it to boot again. Here are the lspci output and the bootargs:

    [root@BWS5F:~]# lspci
    00:00.0 Class 0604: 104c:b801
    01:00.0 Class 0604: 10b5:8605
    02:01.0 Class 0604: 10b5:8605
    02:02.0 Class 0604: 10b5:8605
    02:03.0 Class 0604: 10b5:8605
    [root@BWS5F:~]# cat /proc/cmdline
    console=ttyO0,115200n8 earlyprintk mem=364M@0x80000000 mem=93M@0x98000000 mem=320M@0x9fc00000 root=/dev/mtdblock3 rootfstype=jffs2 noinitrd ip=off vmalloc=500M

    but as soon as I reboot or power cycle, it crashes with the unhandled fault. This is so confusing.. 
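
    The lspci lines above follow the pattern bus:device.function, class code, vendor:device. A small sketch for picking them apart (the function name is hypothetical; class 0604 is the PCI-to-PCI bridge class that the driver writes as PCI_CLASS_BRIDGE_PCI, and 104c and 10b5 are the Texas Instruments and PLX vendor IDs):

```c
#include <stdio.h>

/* PCI-to-PCI bridge: base class 0x06, subclass 0x04. */
#define CLASS_PCI_BRIDGE 0x0604u

/*
 * Returns 1 if the line has the "bb:dd.f Class cccc: vvvv:dddd" shape shown
 * above, describes a PCI-to-PCI bridge, and carries the given vendor ID.
 */
int line_is_bridge_from(const char *line, unsigned vendor)
{
    unsigned bus, dev, fn, class_code, ven, device;

    if (sscanf(line, "%x:%x.%x Class %x: %x:%x",
               &bus, &dev, &fn, &class_code, &ven, &device) != 6)
        return 0;
    return class_code == CLASS_PCI_BRIDGE && ven == vendor;
}
```

    Applied to the listing, the first line is the root complex itself (vendor 104c) and the remaining four are the PLX 8605 (vendor 10b5): one upstream port on bus 01 and three downstream ports on bus 02.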

    BR, 

    Dennis

  • note: we are not using hdmi/video/dsp at all. 

  • Update: 

    I saw in ti81xx_pcie_setup there was a late call to hook_fault_code to register ti81xx_pcie_fault(). 

    This happens at the end of ti81xx_pcie_setup() .  I moved the registration of the fault handler to BEFORE the function tries to initiate link training (and where it was causing the crash). 

    printk("pcie: %s %d reading CMD_STATUS\n", __FILE__, __LINE__);
    val32 = readl(reg_virt + CMD_STATUS);
    printk("pcie: %s %d read CMD_STATUS, value = %x\n", __FILE__, __LINE__, val32);

    printk("pcie: calling hook_fault_code just before LTSSM_EN\n");
    hook_fault_code(8, ti81xx_pcie_fault, SIGBUS, 0,
    "Precise External Abort on non-linefetch");

    printk("pcie: %s %d reading CMD_STATUS\n", __FILE__, __LINE__);
    val32 = readl(reg_virt + CMD_STATUS);
    printk("pcie: %s %d read CMD_STATUS, value = %x\n", __FILE__, __LINE__, val32);

    /*
    * Initiate Link Training. We will delay for L0 as specified by
    * standard, but will still proceed and return success irrespective of
    * L0 status as this will be handled by explicit L0 state checks during
    * enumeration.
    */
    writel(LTSSM_EN_VAL | readl(reg_virt + CMD_STATUS),
    reg_virt + CMD_STATUS);

    /* 200ms */
    msleep(200);


    printk("pcie: %s %d reading CMD_STATUS\n", __FILE__, __LINE__);
    val32 = readl(reg_virt + CMD_STATUS);
    printk("pcie: %s %d read CMD_STATUS, value = %x\n", __FILE__, __LINE__, val32);

    ....

    /*
    * PCIe access errors that result into OCP errors are caught by ARM as
    * "External aborts" (Precise).
    */
    // printk("pcie: hooking fault at normal location\n");
    // hook_fault_code(8, ti81xx_pcie_fault, SIGBUS, 0,
    // "Precise External Abort on non-linefetch");

     

    The ti81xx_pcie_fault function gets called many many times after the LTSSM_EN bit is set.  At least now it doesn't crash, but it obviously also doesn't enumerate any PCI devices. 

    The fact that this fault handler is already hooked in the driver suggests that the driver author(s) also ran into this problem. Was there ever an answer as to why the faults happen?
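
    As a side note on the numbers involved: the 0x1008 in the abort message is the fault status value, and the 8 passed to hook_fault_code() is the fault index derived from it. A minimal sketch of that decoding, assuming the ARMv7 short-descriptor DFSR layout (status bits [3:0] and [10], write flag in bit 11, external-abort type in bit 12); the helper names are hypothetical:

```c
#include <stdint.h>

/* Assumed ARMv7 short-descriptor DFSR fields. */
#define FSR_FS3_0 0x0000000fu  /* fault status bits [3:0]   */
#define FSR_FS4   (1u << 10)   /* fault status bit [4]      */
#define FSR_WNR   (1u << 11)   /* 1 = write access, 0 = read */
#define FSR_EXT   (1u << 12)   /* external abort type bit   */

/* 5-bit fault index: FS[4] from bit 10 combined with FS[3:0]. */
unsigned fsr_fault_index(uint32_t fsr)
{
    return (fsr & FSR_FS3_0) | ((fsr & FSR_FS4) >> 6);
}

/* Nonzero if the aborted access was a write. */
int fsr_is_write(uint32_t fsr)
{
    return (fsr & FSR_WNR) != 0;
}

/* Nonzero if the external abort type bit is set. */
int fsr_ext_bit(uint32_t fsr)
{
    return (fsr & FSR_EXT) != 0;
}
```

    With fsr = 0x1008 this gives index 8 and a read access, which matches both the hook_fault_code(8, ...) call and the readl() that triggers the abort.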

     

     

     

  • Dennis McLeod said:
    The SERDES_STATUS register in the PCIESS application registers block is undocumented. Can anyone tell us what the bits represent? 

    The 32 bits of this register are mapped to the 16-bit STS_TX and STS_RX buses of the PCIe PHY. These registers/buses are not for customer use, but are occasionally useful for debugging with the factory. The description of the bits should be "Reserved" in the TRM:

    SERDES_STATUS[31:16] STS_TX
    SERDES_STATUS[15:0] STS_RX
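
    Splitting a raw SERDES_STATUS value into the two halves described above is a shift and a mask. A minimal sketch with hypothetical helper names; the meaning of the individual bits inside each half remains undocumented:

```c
#include <stdint.h>

/* SERDES_STATUS[31:16]: the STS_TX bus. */
uint16_t serdes_sts_tx(uint32_t serdes_status)
{
    return (uint16_t)(serdes_status >> 16);
}

/* SERDES_STATUS[15:0]: the STS_RX bus. */
uint16_t serdes_sts_rx(uint32_t serdes_status)
{
    return (uint16_t)(serdes_status & 0xffffu);
}
```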

    Regards

  • Dennis McLeod said:
    note: we are not using hdmi/video/dsp at all.

    You can try changing the EZSDK memory map to give more memory to the Linux kernel. See the E2E threads below for more info:

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/304274.aspx

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/294217.aspx

    Regards,
    Pavel

  • Some notes:

    Can you dump the DEBUG0 and DEBUG1 registers? addr: 0x51001728 and 0x5100172c. These registers show the status of the link during the training.

    How are you connecting the EP? Have you checked the cable modification/clocking scheme wiki page (http://processors.wiki.ti.com/index.php/DM816x_C6A816x_AM389x_PCIe_Clocking_Schemes)? These schemes are applicable to DM814x too.

    A read access error results in an abort; this cannot be avoided.

    The application must know that the region it tries to access is valid, or else be prepared for an abort.

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/240799.aspx

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/154734.aspx

    http://e2e.ti.com/support/embedded/linux/f/354/t/141733.aspx

     http://e2e.ti.com/support/embedded/linux/f/354/t/162993.aspx

    BR
    Pavel

  • Hi Pavel,
    I am reviewing everything you posted. Thank you.

    Some questions about modifying the memory map:

    1) The threads http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/304274.aspx and http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/294217.aspx explain which elements of the memory map can be reduced to 0MB. My system has 1GB of DDR, so the default map http://processors.wiki.ti.com/index.php/EZSDK_Memory_Map#Memory_Map_in_the_current_EZSDK_.28version_5.02_onwards.29 should be what my system currently uses. Our product does not use DSP or any video at all. Will it be OK if I make the following changes?
    - CMEM  = 0MB
    - DSP_ALG_HEAP = 0MB
    - IPC_SR_HOST_DSP = 0MB
    - DSP_DATA = 0MB
    - IPC_SR_MC_HDVICP2_HDVPSS = 0MB
    - MC_HDVPSS_INT_HEAP_CACHED = 0MB
    - MC_HDVICP2_INT_HEAP_CACHED = 0MB
    and
    - IPC_SR_FRAME_BUFFERS  = 0MB

    2) Once I do that, would the bootargs be mem=477M  mem=508M@0x9fc00000?

    3) Where do the "firmware loader" and mm_host_util burn the bin file? Is there embedded flash in the AM387x? Is it a "one-time burn", or does it need to happen on every boot?

    4) Some items in the memory map (mentioned above) are not present in the sources under board-support/media-controller-utils_2_05_00_17/src. For example, you mentioned changing IPC_SR_HOST_DSP and DSP_DATA, but they do not exist in the sources. How would I claim those regions?

    Thanks, Pavel, for all your help!

  • Dennis,

    Dennis McLeod said:
    Our product does not use DSP or any video at all.

    Does this mean you do not need Cortex-M3 ARM, HDVICP2 and HDVPSS cores? If yes, the easiest way is just to remove the hdvicp2 and hdvpss firmware auto loading.

    targetfs/etc/init.d/load-hd-firmware.sh

    case "$1" in
        start)
      #      echo "Loading HDVICP2 Firmware"
      #      prcm_config_app s
      #      modprobe syslink
      #      until [[ -e /dev/syslinkipc_ProcMgr && -e /dev/syslinkipc_ClientNotifyMgr ]]
      #      do                                                
      #          sleep 0.5
      #      done
      #      firmware_loader $HDVICP2_ID /usr/share/ti/ti-media-controller-utils/dm814x_hdvicp.xem3 start
      #      echo "Loading HDVPSS Firmware"
      #      firmware_loader $HDVPSS_ID /usr/share/ti/ti-media-controller-utils/dm814x_hdvpss.xem3 start
      #      modprobe vpss sbufaddr=0xBFB00000 mode=hdmi:1080p-60 i2c_mode=1
      #      modprobe ti81xxfb vram=0:24M,1:16M,2:6M
      #      configure_lcd
      #      modprobe ti81xxhdmi
      #      modprobe tlc59108
          ;;

    This removes the firmware load, so you can pass more memory to the Linux kernel.

    Dennis McLeod said:
    Will it be ok if I make the following changes? 
    - CMEM  = 0MB
    - DSP_ALG_HEAP = 0MB
    - IPC_SR_HOST_DSP = 0MB
    - DSP_DATA = 0MB
    - IPC_SR_MC_HDVICP2_HDVPSS = 0MB
    - MC_HDVPSS_INT_HEAP_CACHED = 0MB
    - MC_HDVICP2_INT_HEAP_CACHED = 0MB
    and 
    - IPC_SR_FRAME_BUFFERS  = 0MB

    I think this is OK.

    Dennis McLeod said:
    2) Once I do that, bootargs would be mem=477M  mem=508M@0x9fc00000  ?

     Correct.

    Dennis McLeod said:
    3) the "firmware loader" and mm_host_util,  where do they burn the bin file? Is there embedded flash in the AM387x ? Is it a "one time burn" or does it need to happen every boot? 

    You should generate a new bin file:

    ~/ti-ezsdk_dm814x-evm_5_05_02_00$ make media-controller-utils

    Then install the new bin file:

    ~/ti-ezsdk_dm814x-evm_5_05_02_00$ make media-controller-utils_install

    The console messages should show exactly where the new bin file is installed.

    Regards

  • For IPC_SR_HOST_DSP, it is used only when remote codec engine is used:

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/717/p/160814/589538.aspx#589538

    Regards,
    Pavel

  • For DSP_DATA (0x99500000), it is used only when OMX and/or RPE is used:

    ti-ezsdk_dm814x-evm_5_05_02_00/component-sources/omx_05_02_00_48/src/ti/omx/build/MemSegmentDefinition.xs

    memory[2] = ["DSP",
      {
              name: "DSP",
              base: 0x99500000,
              len:  0x00C00000,    //if you are planning to use OMX without DSP, you can change this to 0x0
              space: "code/data"
      }];

    ti-ezsdk_dm814x-evm_5_05_02_00/component-sources/rpe_1_00_01_13/examples/dm81xx/dspsubsys.xs

    memory[2] = ["DSP",
      {
              name: "DSP",
              base: 0x99500000,
              len:  0x00C00000,
              space: "code/data"
      }];
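
    When deciding whether a mem= range collides with a segment such as this one, a half-open interval overlap test is enough. The sketch below uses the base and length shown above; the function and macro names are assumptions made for illustration.

```c
#include <stdint.h>

/* Megabytes to bytes. */
#define MB(n) ((uint64_t)(n) << 20)

/* DSP segment as listed in the .xs excerpts above. */
#define DSP_SEG_BASE 0x99500000ull
#define DSP_SEG_LEN  0x00C00000ull

/*
 * Returns 1 if [a, a + alen) and [b, b + blen) share at least one byte.
 * A zero-length range never overlaps anything, which covers the case of a
 * segment whose len has been changed to 0x0.
 */
int ranges_overlap(uint64_t a, uint64_t alen, uint64_t b, uint64_t blen)
{
    if (alen == 0 || blen == 0)
        return 0;
    return a < b + blen && b < a + alen;
}
```

    The segment spans 0x99500000 to 0x9A100000 (12MB). It does not overlap mem=364M@0x80000000 or mem=320M@0x9fc00000, but it does fall inside a range such as mem=93M@0x98000000, which presumably requires the segment to be unused or set to length 0x0.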

    Best regards,
    Pavel

  • Pavel Botev said:
    Can you dump the DEBUG0 and DEBUG1 registers? addr: 0x51001728 and 0x5100172c. These registers show the status of the link during the training.

    I can't, actually. After setting the LTSSM_EN bit, readl() will cause the abort. 

    I am going to pursue the memory map changes, but I am pretty sure they are not the cause of the problem. The driver calls pci_common_init() in the probe() function and soon afterwards enters ti81xx_pcie_setup() to set up the RC. The enumeration of downstream devices doesn't happen until ti81xx_pcie_scan(), much later. The problem happens in ti81xx_pcie_setup(), so I think it is unlikely to be a problem with the downstream device or its BARs/resource allocation, since we're not even getting past setting up the RC. Would you agree?

  • I have more info that might help debug the problem. In the ti81xx_pcie_setup function, just before setting the LTSSM_EN bit, I set smart idle and smart standby to "off". I also still have the hook_fault_code call moved up above that point, so an abort no longer crashes the kernel.

    /*
    * PCIe access errors that result into OCP errors are caught by ARM as
    * "External aborts" (Precise).
    */
    printk("pcie: early fault hook register\n");
    hook_fault_code(8, ti81xx_pcie_fault, SIGBUS, 0,
    "Precise External Abort on non-linefetch");

    printk("pcie: removing smart idle caps\n");

    val32 = __raw_readl(reg_virt + CMD_STATUS);
    val32 = (val32 & ~(0x0f << 8)) | (0x5 << 8); // set bits 11:8 to 0101 (no idle, no standby)
    __raw_writel(val32, reg_virt + CMD_STATUS);

    /*
    * Initiate Link Training. We will delay for L0 as specified by
    * standard, but will still proceed and return success irrespective of
    * L0 status as this will be handled by explicit L0 state checks during
    * enumeration.
    */
    __raw_writel(LTSSM_EN_VAL | __raw_readl(reg_virt + CMD_STATUS),
    reg_virt + CMD_STATUS);
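
    The read-modify-write above can be checked on plain integers before touching hardware. A minimal sketch, assuming only what this post states (the idle/standby field is CMD_STATUS bits 11:8) and what the boot logs show (LTSSM enable is bit 0); the helper names are hypothetical:

```c
#include <stdint.h>

#define IDLE_STANDBY_SHIFT 8
#define IDLE_STANDBY_MASK  (0x0fu << IDLE_STANDBY_SHIFT)  /* bits 11:8 */
#define LTSSM_EN_BIT       (1u << 0)

/* Replace bits 11:8 of a CMD_STATUS value, leaving all other bits alone. */
uint32_t set_idle_standby(uint32_t cmd_status, uint32_t field)
{
    return (cmd_status & ~IDLE_STANDBY_MASK) |
           ((field & 0x0fu) << IDLE_STANDBY_SHIFT);
}

/* Set the link-training enable bit without disturbing other bits. */
uint32_t enable_ltssm(uint32_t cmd_status)
{
    return cmd_status | LTSSM_EN_BIT;
}
```

    Starting from the a00 value seen in the logs, set_idle_standby(0xa00, 0x5) gives 0x500, and enabling link training then gives 0x501.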

    The kernel always boots now, and maybe 50% of the time it finds the downstream PLX switch. Nothing changes between reboots, yet it "works" only about half the time. Here is part of the boot log from one of the "successful" boot attempts:

    ti81xx_pcie: Register base mapped @0xd7020000
    pcie: early fault hook register
    pcie: removing smart idle caps
    ti81xx_pcie: Starting PCI scan...
    PM: Adding info for No Bus:pci0000:00
    PM: Adding info for No Bus:0000:00
    pci 0000:00:00.0: [104c:b801] type 1 class 0x000604
    PCI: bus0: Fast back to back transfers disabled
    pci 0000:01:00.0: [10b5:8605] type 1 class 0x000604
    pcie: fault hook, addr = d702203c
    PCI: bus1: Fast back to back transfers enabled
    PCI: bus2: Fast back to back transfers enabled
    PM: Adding info for pci:0000:00:00.0
    PM: Adding info for pci:0000:01:00.0
    PM: Adding info for No Bus:0000:02
    PM: Adding info for No Bus:0000:01
    pci 0000:00:00.0: BAR 8: assigned [mem 0x20000000-0x201fffff]
    pci 0000:00:00.0: BAR 9: assigned [mem 0x20200000-0x203fffff pref]
    pci 0000:00:00.0: BAR 7: can't assign io (size 0x1000)
    pci 0000:01:00.0: BAR 8: assigned [mem 0x20000000-0x201fffff]
    pci 0000:01:00.0: BAR 9: assigned [mem 0x20200000-0x203fffff pref]
    pci 0000:01:00.0: BAR 7: can't assign io (size 0x1000)
    pci 0000:01:00.0: PCI bridge to [bus 02-02]
    pci 0000:01:00.0: bridge window [io disabled]
    pci 0000:01:00.0: bridge window [mem 0x20000000-0x201fffff]
    pci 0000:01:00.0: bridge window [mem 0x20200000-0x203fffff pref]
    pci 0000:00:00.0: PCI bridge to [bus 01-02]
    pci 0000:00:00.0: bridge window [io disabled]
    pci 0000:00:00.0: bridge window [mem 0x20000000-0x201fffff]
    pci 0000:00:00.0: bridge window [mem 0x20200000-0x203fffff pref]
    pci 0000:00:00.0: Refused to change power state, currently in D3
    pci_bus 0000:00: resource 0 [mem 0x20000000-0x2fffffff]
    pci_bus 0000:00: resource 1 [io 0x40000000-0x402fffff]
    pci_bus 0000:01: resource 1 [mem 0x20000000-0x201fffff]
    pci_bus 0000:01: resource 2 [mem 0x20200000-0x203fffff pref]
    pci_bus 0000:02: resource 1 [mem 0x20000000-0x201fffff]
    pci_bus 0000:02: resource 2 [mem 0x20200000-0x203fffff pref]
    pci 0000:00:00.0: BAR 7: can't assign io (size 0x1000)
    pci 0000:01:00.0: BAR 7: can't assign io (size 0x1000)
    PM: Adding info for No Bus:default

    lspci does show both devices. But oddly enough, even though this looks relatively successful, I still am not able to do "devmem2 0x51000004" without it printing this message:

    Unhandled fault: Precise External Abort on non-linefetch (0x1018) at 0x40077004
    Bus error

    Any suggestions on what to look at next? Or what could be going wrong? 

    BR, 

    Dennis

  • After adjusting the pre-emphasis settings on both ends of the link (between am3874 and PLX 8605), we have been able to achieve link-up most of the time. It is still not reliable, but it's better than it ever has been. 

    I still have that fault-hook function registered early, that keeps the kernel from crashing. 

    I also created a script to run when the PCIESS locks up. It resets the PCIESS and thereafter restores access to the 0x5100xxxx window: 

    [root@am3874:~]# cat /bin/resetpci
    #!/bin/sh

    devmem 0x48180b10 32 0x09c
    sleep 1

    devmem 0x48180578 32 0
    devmem 0x48180510 32 0
    sleep 1

    devmem 0x48180510 32 0x02
    devmem 0x48180578 32 0x02
    sleep 1

    devmem 0x48180b10 32 0x001c

    echo Done

    I am curious about the memory allotted during PCIe enumeration, though. The AM3874 has an embedded PCI-PCI bridge, whose BAR 8 looks like this: 

    cat /sys/bus/pci/devices/0000\:00\:00.0/resource
    0x0000000000000000 0x0000000000000000 0x0000000000000000
    0x0000000000000000 0x0000000000000000 0x0000000000000000
    0x0000000000000000 0x0000000000000000 0x0000000000000000
    0x0000000000000000 0x0000000000000000 0x0000000000000000
    0x0000000000000000 0x0000000000000000 0x0000000000000000
    0x0000000000000000 0x0000000000000000 0x0000000000000000
    0x0000000000000000 0x0000000000000000 0x0000000000000000
    0x0000000000000000 0x0000000000000000 0x0000000000000000
    0x0000000020000000 0x00000000200fffff 0x0000000000000200
    0x0000000000000000 0x0000000000000000 0x0000000000000000
    0x0000000000000000 0x0000000000000000 0x0000000000000000
    0x0000000000000000 0x0000000000000000 0x0000000000000000

    According to the L3 memory map, PCIe has a window 256 MB in size, from 0x20000000 to 0x2FFFFFFF. Why is the RC only using 0x20000000 to 0x200FFFFF, which is just 1 MB?

    Although I don't have any endpoints, pci enumeration sees the 4 downstream ports of the PLX switch: 

    cat /sys/bus/pci/devices/0000\:01\:00.0/resource
    0x0000000020000000 0x0000000020003FFF 0x0000000000040200
    0x0000000000000000 0x0000000000000000 0x0000000000000000
    0x0000000000000000 0x0000000000000000 0x0000000000000000
    0x0000000000000000 0x0000000000000000 0x0000000000000000
    0x0000000000000000 0x0000000000000000 0x0000000000000000
    0x0000000000000000 0x0000000000000000 0x0000000000000000
    0x0000000000000000 0x0000000000000000 0x0000000000000000

    So from that initial 1 MB window (0x20000000 to 0x200FFFFF), the first enumerated device gets a 16 KB chunk. Why isn't the whole available memory window used for the pool?
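
    To double-check the sizes, each line of the sysfs resource file is "start end flags" with an inclusive end address, so the size is end - start + 1. A minimal userspace sketch follows; struct pci_res, parse_resource_line and res_size are names invented here for illustration, not kernel APIs.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* One line of a sysfs "resource" file: start, end (inclusive), flags. */
struct pci_res {
    uint64_t start;
    uint64_t end;
    uint64_t flags;
};

/* Parse "0x<start> 0x<end> 0x<flags>" into r.
 * Returns 0 on success, -1 if the line does not hold three hex values. */
static int parse_resource_line(const char *line, struct pci_res *r)
{
    int n = sscanf(line, "%" SCNx64 " %" SCNx64 " %" SCNx64,
                   &r->start, &r->end, &r->flags);
    return n == 3 ? 0 : -1;
}

/* Size in bytes. An all-zero entry is treated as an unused resource. */
static uint64_t res_size(const struct pci_res *r)
{
    if (r->start == 0 && r->end == 0)
        return 0;
    return r->end - r->start + 1;
}
```

    Fed the two non-zero lines quoted above, this gives 0x100000 bytes (1 MB) for the root port's bridge window and 0x4000 bytes (16 KB) for the switch's BAR 0, against 0x10000000 bytes (256 MB) for the full 0x20000000 to 0x2FFFFFFF window.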