This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

pci_disable_msi Oops Bug

5773.pcie_test.c

I am trying to write a kernel module that will handle MSI interrupts for a PCIe device. I have written a simple skeleton outline for my driver currently and whenever I try to call 'pci_disable_msi(dev)' I get an unable to handle kernel NULL pointer dereference error. I am following along exactly as described from the /Documentation/PCI/MSI-HOWTO.txt and it seems to me that I should not be getting this error. Is this a bug or is my setup incorrect? I attatched my source code.

I am using Keystone II processor running 3.10.10 SMP Kernel

  • Thank you very much for keeping the stack trace a secret.

    (You should not use IRQF_SHARED for an MSI, and you forgot pci_disable_device(), but those are harmless bugs.)

  • Hi Bill,

    You shall follow Clemens suggestions posted above.

    Could you please share the complete log of error message? 

    Thanks.

  • I have tried what Clemens describes above but no change. Attatched is the log.

    7762.log.txt

  • (I told you these two bugs are harmless.)

    Your code calling pci_disabl_msi() looks OK.

    The crash is in arch_teardown_msi_irq().
    This looks like a bug in the MSI support code of your architecture.
    What specific kernel are you using? Where is its source?

  • Linux arago 3.10.10 #5 SMP Wed Jun 11 15:12:42 EDT 2014 armv7l GNU/Linux. Source is from:

     git://git.ti.com/keystone-linux/linux.git

    compiled with

    make keystone2_defconfig

  • There appears to be a bug in arch_teardown_msi_irq() in drivers/pci/host/pcie-keystone.c: It never checks if irq_data might be NULL.

    Try changing it like this:

        struct irq_data *irq_data;
        struct keystone_pcie_info *info;
       
        irq_data = irq_get_chip_data(irq);
        if (!irq_data)

            return;
        info = irq_data_get_irq_chip_data(irq_data);
        ...

  • Great idea, but unfortunately this does not solve the issue. Here is what I changed to:

    void arch_teardown_msi_irq(unsigned int irq)
    {
     printk(KERN_NOTICE "In arch_teardown...\n");
     struct irq_data *irq_data;
     struct keystone_pcie_info *info;

     printk(KERN_NOTICE "irq = %d\n",irq);
     if(!irq)
     {
      return;
     }

     irq_data = irq_get_chip_data(irq);
     printk(KERN_NOTICE "1\n");
     if(!irq_data)   // Trying to fix bug
      return;
     printk(KERN_NOTICE "2\n");
     info = irq_data_get_irq_chip_data(irq_data);
     printk(KERN_NOTICE "3\n");
     if (info) {
      printk(KERN_NOTICE "4\n");
      int pos = irq - irq_linear_revmap(info->msi_irqd, 0);
      printk(KERN_NOTICE "5\n");
      irq_set_chip_and_handler(irq, NULL, NULL);
      printk(KERN_NOTICE "6\n");
      irq_set_chip_data(irq, NULL);
      printk(KERN_NOTICE "7\n");
      clear_bit(pos, msi_irq_bits);
      printk(KERN_NOTICE "8\n");
      return;
     }
     pr_err(DRIVER_NAME
      ": arch_teardown_msi_irq, can't find driver data\n");
    }

     

    Judging by the printks it is failing at this line:

      int pos = irq - irq_linear_revmap(info->msi_irqd, 0);

    Attatched is the log

    0207.log.txt

  • irq_linear_revmap() is an inline function:

    static inline unsigned int irq_linear_revmap(struct irq_domain *domain,
                                                 irq_hw_number_t hwirq)
    {
            return hwirq < domain->revmap_size ? domain->linear_revmap[hwirq] : 0;
    }

    Apparently, domain (info->msi_irqd) is NULL, but how is that possible?

  • I have no clue, this is deep into the kernel for my experience level. Anything else I can give you that might help?

  • Hi Bill,

    We are working with factory team to support you. Thank you for your patience.

  • Has there been any resolution to this? My kernel module blows up the kernel whenever it is removed requiring a restart everytime due to this issue.

  • Bill,

    Can you explain who is PCIE RC, is that a TI K2H device running Linux kernel? And who is the PCIE EP, where did you get the Linux driver code for this device? Is there any readme for that device for integration?

    Regards, Eric

  • As mentioned above, I am running RC on the TI k2h. It is running 3.8.4 kernel (it was running 3.10 before). The EP is a xilinx FPGA which is running a xilinx core. I am able to enable the MSI according to the /Documentation/PCI/MSI-HOWTO.txt. Everything works fine when configuring the FPGA except for when I remove my module and try to  teardwon the irq. My code is above along with a very detailed description and debug trace down to just about the line of code in the kernel that causes the crash.

  • Eric,

    Rasersakan messaged me and said that you would be working with me to try to solve this issue. Has there been any resolution? This is causing serious delays in our development.

  • Bill,

    Sorry for the delay. I am working with our developers for this. The pci_disable_msi() finally calls into the arch_teardown_msi_irq() in our pci-keystone.c, and looks like info->msi_irqd is a NULL.

    AHCI SATA (drivers/ata/ahci.c) is known to use our Keystone MSI interrupts and works. Inside the driver code, it also have pci_disable_msi(), we are tracking if this is called.

    Regards, Eric

  • Bill,

    In the drivers/pci/host/pcie-keystone.c file where you added debug code in the past, can you help by collecting more debug info by printing some variables or pointers, we maninly wanted to compare how some structures became NULL from MSI setup to teardown.

    In arch_setup_msi_irq() function where MSI is setup:

    irq_set_chip_data(irq, info);
    irq_set_chip_and_handler(irq, &keystone_msi_chip, handle_level_irq);

    Please provide:
    irq,
    info->num_msi_host_irqs
    info->num_msi_irqs,
    info->msi_irqd

    Then in the arch_teardown_msi_irq()
    {
    struct irq_data *irq_data = irq_get_chip_data(irq);
    struct keystone_pcie_info *info = irq_data_get_irq_chip_data(irq_data);
    if (info) {
    int pos = irq - irq_linear_revmap(info->msi_irqd, 0);

    Please provide:
    irq_data->chip
    irq_data->domain
    irq_data->handler_data
    irq_data->chip_data
    info
    info->msi_irqd

    Thanks, Eric

  • Hi Eric, Here is the code that I added:

    void arch_teardown_msi_irq(unsigned int irq)
    {
        struct irq_data *irq_data = irq_get_chip_data(irq);
        struct keystone_pcie_info *info = irq_data_get_irq_chip_data(irq_data);
    
        if (info) {
    
            printk("chip = 0x%0X\n",irq_data->chip);	
            printk("domain = 0x%0X\n",irq_data->domain);
            printk("handler_data = 0x%0X\n",irq_data->handler_data);
            printk("chip_data = 0x%0X\n",irq_data->chip_data);
    
            int pos = irq - irq_linear_revmap(info->msi_irqd, 0);
    
    
    .
    .
    .
    int arch_setup_msi_irq(struct pci_dev *pdev, struct msi_desc *desc)
    {

    ......
                irq_set_chip_data(irq, info);
                irq_set_chip_and_handler(irq, &keystone_msi_chip,
                             handle_level_irq);
    
                printk("irq = 0x%0X\n",irq);
                printk("num_msi_host_irqs = 0x%0X\n",info->num_msi_host_irqs);
                printk("num_msi_irqs = 0x%0X\n",info->num_msi_irqs);
                printk("msi_irqd = 0x%0X\n",info->msi_irqd);



    The output is:
    [   30.316960] irq = 0x240
    [ 30.316961] num_msi_host_irqs = 0x8
    [   30.316963] num_msi_irqs = 0x20
    [   30.316965] msi_irqd = 0xEE28A000
    On unload:
    [   34.809903] chip = 0x21807FFF
    [   34.812859] domain = 0x0
    [   34.815391] handler_data = 0xC1376B50
    [   34.819042] chip_data = 0x200
  • Bill,

    Thanks, what is the info->msi_irqd during disabling the MSI, is this a NULL pointer or not?

    Regards, Eric

  • With this being the prints I place in the teardown:
    printk
    ("chip = 0x%0X\n",irq_data->chip); printk("domain = 0x%0X\n",irq_data->domain); printk("handler_data = 0x%0X\n",irq_data->handler_data); printk("chip_data = 0x%0X\n",irq_data->chip_data); printk("info = 0x%0X\n",info); printk("info->msi_irqd = 0x%0X\n", info->msi_irqd);

    I get this out:
    [   85.670042] chip = 0x21807FFF

    [   85.672997] domain = 0x0

    [   85.675530] handler_data = 0xC1376B50

    [   85.679180] chip_data = 0x200

    [   85.679182]info = 0x200

    [   85.679189] Unable to handle kernel NULL pointer dereference at virtual address 000002fc
  • Bill,

    The log shows info is invalid during teardown(): info is retrieved from irq_chip_data. In the setup() call, info is set to irq chip data in function irq_set_chip_data().

    So not sure when it is retrieved it is showing invalid value. irq number is same 0x240 in setup() and teardown(). So how info gets corrupted. This needs to be traced by printing irqdata at different places. Just after irq_set_chip_data(irq, info), in arch_setup_msi_irq(), make a call to 

       irq_data = irq_get_chip_data(irq);

       info = irq_data_get_irq_chip_data(irq_data);

    and print the info ptr. Did irq_set_chip_data work properly here?

    Regards, Eric

  • See if the attached code change fix your issue.0066.0001-pcie-keystone-fix-msi-irq-teardown-issue.patch

  • Any news?

    Regards, Eric

  • Hi Eric,

    Sorry for the delay. I was on vacation last week. I just applied the patch and everything is working fine now. Thank you all for you help, I hope this can help others who might be having the same issue. It does suprise me however that no one else has run into this problem yet. Anyways, I appreciate it.