AM5716: [NIC issue] The NIC driver reports an error if we repeatedly bring the network card down/up for 2 days.

Part Number: AM5716

Hi experts

We recently encountered a NIC issue on the AM5716. If we repeatedly bring the network card down and up for one day, the kernel reports the following error (note: the network card is transmitting network data during the down/up cycles):

[75972.247088] Unable to handle kernel paging request at virtual address f18fe000
[75972.254357] pgd = 0176da8e
[75972.257086] [f18fe000] *pgd=80000080007003, *pmd=ae855003, *pte=00000000
[75972.263836] Internal error: Oops: 207 [#1] PREEMPT SMP ARM
[75972.269344] Modules linked in:
[75972.272414] CPU: 0 PID: 5152 Comm: irq/227-eth2 Tainted: G W 4.19.94-rt39SunGrow-ga242ccf3f1 #1
[75972.282456] Hardware name: Generic DRA72X (Flattened Device Tree)
[75972.288580] PC is at memcpy+0xe8/0x330
[75972.292348] LR is at emac_rx_packet+0x1bc/0x840
[75972.296896] pc : [<c0e41de8>] lr : [<c0987588>] psr: 600f0013
[75972.303187] sp : eda2fde4 ip : 00000002 fp : eda2fe6c
[75972.308431] r10: f18fe000 r9 : 00000040 r8 : eda6a000
[75972.313677] r7 : eda74440 r6 : edaa5a40 r5 : edb75a42 r4 : 00000040
[75972.320230] r3 : 00000040 r2 : 0000003c r1 : f18fe000 r0 : edb75a42
[75972.326785] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[75972.333949] Control: 30c5387d Table: ad076dc0 DAC: fffffffd
[75972.339718] Process irq/227-eth2 (pid: 5152, stack limit = 0x306a9037)
[75972.346271] Stack: (0xeda2fde4 to 0xeda30000)
[75972.350647] fde0: edb75a42 00000040 c0987588 c06c0368 c06c0238 eda2fe1c c18524a8
[75972.358859] fe00: edb75a40 c18dc38c 00000000 00000040 c18dc38c c18064c8 fffffcba 00000000
[75972.367072] fe20: f1900000 00000000 eda74940 00000000 ef6561c0 ed1a1e40 ee53de40 e5a23cf6
[75972.375284] fe40: eda2fe5c f18b8000 00000008 eda74440 00000000 eda6a000 eda2febc f18b9ca0
[75972.383496] fe60: eda2ff04 eda2fe78 c0987dc8 c09873dc 00000000 00000002 00000040 00000000
[75972.391708] fe80: c1806400 c0e5f124 c18dc38c ee53de40 c18064c8 c10b4c68 c18dc374 00000000
[75972.399920] fea0: c10b4c68 00000004 c18dc38c 000005f2 eda74940 f18b8000 0e380000 00000000
[75972.408132] fec0: 00000002 00000040 00000000 c1806400 c0e5f124 e5a23cf6 00000004 ed071d00
[75972.416344] fee0: eda4e640 ffffe000 00000000 c0293998 ed071d00 00000001 eda2ff24 eda2ff08
[75972.424555] ff00: c02939bc c0987c18 ed071d24 eda4e640 ffffe000 00000000 eda2ff74 eda2ff28
[75972.432767] ff20: c0293cf8 c02939a4 c12ab968 c18064c8 c18763c0 eda2ff40 c0e5f0c8 00000000
[75972.440979] ff40: c0293adc e5a23cf6 c0252a78 ed071700 ed071c40 00000000 eda2e000 ed071d00
[75972.449190] ff60: c0293bb8 ed093b7c eda2ffac eda2ff78 c0252f84 c0293bc4 ed07171c ed07171c
[75972.457401] ff80: 00000000 ed071c40 c0252e24 00000000 00000000 00000000 00000000 00000000
[75972.465613] ffa0: 00000000 eda2ffb0 c02010e0 c0252e30 00000000 00000000 00000000 00000000
[75972.473824] ffc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[75972.482036] ffe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[75972.490246] Backtrace:
[75972.492706] [<c09873d0>] (emac_rx_packet) from [<c0987dc8>] (emac_rx_thread+0x1bc/0x2cc)
[75972.500833] r10:f18b9ca0 r9:eda2febc r8:eda6a000 r7:00000000 r6:eda74440 r5:00000008
[75972.508694] r4:f18b8000
[75972.511242] [<c0987c0c>] (emac_rx_thread) from [<c02939bc>] (irq_thread_fn+0x24/0x80)
[75972.519107] r10:00000001 r9:ed071d00 r8:c0293998 r7:00000000 r6:ffffe000 r5:eda4e640
[75972.526968] r4:ed071d00
[75972.529514] [<c0293998>] (irq_thread_fn) from [<c0293cf8>] (irq_thread+0x140/0x210)
[75972.537203] r7:00000000 r6:ffffe000 r5:eda4e640 r4:ed071d24
[75972.542890] [<c0293bb8>] (irq_thread) from [<c0252f84>] (kthread+0x160/0x168)
[75972.550055] r10:ed093b7c r9:c0293bb8 r8:ed071d00 r7:eda2e000 r6:00000000 r5:ed071c40
[75972.557916] r4:ed071700
[75972.560462] [<c0252e24>] (kthread) from [<c02010e0>] (ret_from_fork+0x14/0x34)
[75972.567713] Exception stack(0xeda2ffb0 to 0xeda2fff8)
[75972.572784] ffa0: 00000000 00000000 00000000 00000000
[75972.580996] ffc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[75972.589207] ffe0: 00000000 00000000 00000000 00000000 00000013 00000000
[75972.595849] r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c0252e24
[75972.603710] r4:ed071c40
[75972.606253] Code: e8bd8011 e26cc004 e35c0002 c4d13001 (a4d14001)
[75972.612394] ---[ end trace 3c621fe2e35a4e98 ]---

The source code that brings the network cards up and down is as follows:

We are using Linux, and the kernel version is 4.19.

I'd appreciate it if you could give some comments.

  • Hi,

    How are you sending Ethernet data? Which application are you using for that?

    Regards,
    Tanmay

  • Dear Tanmay,

    I didn't run any application. I just connected the network port to a switch, which may have had low-frequency network communication with the AM5716. In contrast, after unplugging the network cable, no errors occur when I keep bringing the network port up and down.

  • Hi,

    ps. The network card is transmitting network data during down/up

    So, in your statement above, it's just the background data and not any data you are specifically sending?

    Also, in your application, you are scheduling an interface down every 100 ms and bringing the interface up once every 500 ms. With the cable connected, I am guessing the interface won't be able to come up within the 100 ms before the next link-down command is scheduled. So the links would always be down while this script is running.

    Is this analysis correct?

    Can you please help me understand the need for such a test?

    Meanwhile, I am trying this out and I will let you know if I am able to reproduce this on my end.

    Regards,
    Tanmay

  • Dear Tanmay,

    Yes, it's just the background data and not any data we are specifically sending.

    After completing the current experiment, we can run another one with the 100 ms interval extended, for example to 1 s.

  • Hi,

    Have you also tried this experiment by just bringing the interface down when it is up and bringing the interface up only when it is down?
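
The state-aware toggle asked about above can be sketched as follows. This is a minimal illustration, not the script used in this thread: the function names `read_operstate` and `next_command` are hypothetical, and it assumes the interface state is read from the standard sysfs `operstate` file and changed with `ip link set`.

```python
from pathlib import Path


def read_operstate(ifname, sysfs_root="/sys/class/net"):
    """Return the kernel-reported operstate for ifname, e.g. 'up' or 'down'."""
    return (Path(sysfs_root) / ifname / "operstate").read_text().strip()


def next_command(ifname, operstate):
    """Pick the single toggle command that matches the current state.

    Returns None when the state is neither 'up' nor 'down' (for example
    'unknown' or 'dormant'), so the caller can wait and poll again.
    """
    if operstate == "up":
        return ["ip", "link", "set", "dev", ifname, "down"]
    if operstate == "down":
        return ["ip", "link", "set", "dev", ifname, "up"]
    return None
```

A test loop could poll `read_operstate("eth1")`, pass the returned list to `subprocess.run(cmd, check=True)` as root, and sleep between polls. Because an interface that is administratively up but still negotiating the link also reports `down`, the `up` command is simply repeated until the link is actually up, and only then is the interface taken down.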

    Regards,
    Tanmay

  • Dear Tanmay,

    Yes, we did. We ran the following script and got the same bad result as before.

  • Dear Tanmay,

    I would like to share the results of my analysis over the past few days. I also have two questions for you:

    1) The kernel error in my original post is caused by an exception that occurred during a network storm, and the faulting code is in the pruss network card driver. After both the cpsw network card and the pruss network card have been subjected to a network storm for a period of time, the pruss network card reports this kernel error. In addition, when the network storm stops, the cpsw network card still works normally, but the pruss network card does not. I want to know why. Is it because the pruss network card cannot withstand network storms?

    2) During a network storm, if I execute ifconfig eth1 down, it blocks in the emac_pru_stop()->disable_irq(emac->rx_irq) call. If I replace disable_irq(emac->rx_irq) with disable_irq_nosync(emac->rx_irq), it no longer blocks, and the network card can be brought down. So I would like to ask whether it is acceptable to replace disable_irq(emac->rx_irq) with disable_irq_nosync(emac->rx_irq).

  • Hi,

    Are the ports you are toggling (eth1 and eth2) cpsw ports or pru ports? I was assuming they were cpsw ports, but from the source code you are updating, they seem to be pru ports. Can you confirm?

    Can you please also explain exactly what you are trying to test with these experiments? I might be able to put together an easier-to-replicate setup, which would help me debug. Here is what I have understood so far:

    • After toggling the interfaces for a long time, you are observing a kernel panic.
    • There is some data being transmitted in the background.
      • I am not sure whether this data is actually being sent.
      • As you are toggling the link state pretty fast (~100 ms), are you sure the link is coming up and you are able to send/receive data on the link?

    I didn't understand why you are doing this experiment, or what the network storm you are talking about is.

    Regards,
    Tanmay

  • Dear Tanmay,

    Because we haven't communicated for many days, there are some issues that I need to raise with you again. There are also some things that you no longer need to pay attention to, such as the 100 ms interval. Let me describe the new problems we have encountered. We only need to discuss the following two issues:

    1) When the pruss network card and the cpsw network card receive a large number of network packets in a short time (for example, both types of network ports receive 1 million packets within 10 minutes), and the packets then stop flooding the two ports, the pruss network card can no longer be used normally, while the cpsw network card can. So I want to know whether the pruss network card is less stable than the cpsw network card.

    2) If I replace disable_irq(emac->rx_irq) with disable_irq_nosync(emac->rx_irq) in the pruss driver code, will this change affect the performance of the pruss network card?

  • Hello Tanmay

    Are there any new updates?

    Could this high traffic be putting too much load on the CPU and causing the kernel panic? However, it is not clear why the network interface using CPSW does not have this problem. Is it due to the operating mechanism of the PRU network interface, or is it a driver issue?

    Since the customer is currently using SDK6.03, have we made any changes to the PRU driver in subsequent SDK releases that address this issue?

    Looking forward to your reply.

  • Hi Tanmay

    Any update? This issue is very urgent; please give some comments as soon as possible. Thanks.