TDA4VM: System kswapd0 exception rollout error

Part Number: TDA4VM

Hi, in our product we found that the system reports a kswapd0 error. The log is shown below.

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000080
[ 1322.953192] Mem abort info:
[ 1322.956039]   ESR = 0x96000006
[ 1322.959119]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 1322.964438]   SET = 0, FnV = 0
[ 1322.967531]   EA = 0, S1PTW = 0
[ 1322.970688] Data abort info:
[ 1322.973575]   ISV = 0, ISS = 0x00000006
[ 1322.977400]   CM = 0, WnR = 0
[ 1322.980386] user pgtable: 4k pages, 48-bit VAs, pgdp=00000008af7a6000
[ 1322.986842] [0000000000000080] pgd=00000008f7c71003, p4d=00000008f7c71003, pud=00000008af0ab003, pmd=0000000000000000
[ 1322.997483] Internal error: Oops: 96000006 [#1] PREEMPT SMP
[ 1323.003045] Modules linked in: xag_product_info xag_open_expwr xag_status_leds rd_sync_gpio xag_ctrl_4g_sim_mode net_4g_gpio_control xag_rtk_rst_ctrl wifi_module_ctrl set_main_doamin_macs system_flag pvrsrvkm(O) vxd_dec vxe_enc videobuf2_dma_contig videobuf2_dma_sg videobuf2_memops v4l2_mem2mem videobuf2_v4l2 videobuf2_common sch_fq_codel cryptodev(O)
[ 1323.034138] CPU: 1 PID: 100 Comm: kswapd0 Tainted: G           O      5.10.162-generic #1
[ 1323.042299] Hardware name: Texas Instruments K3 J721E SoC (DT)
[ 1323.048123] pstate: 20000005 (nzCv daif -PAN -UAO -TCO BTYPE=--)
[ 1323.054136] pc : percpu_counter_add_batch+0x2c/0x120
[ 1323.059101] lr : clear_page_dirty_for_io+0x1b8/0x248
[ 1323.064052] sp : ffff8000118d3880
[ 1323.067359] x29: ffff8000118d3880 x28: fffffe00209a5788 
[ 1323.072664] x27: dead000000000100 x26: fffffe00209a5780 
[ 1323.077965] x25: ffff8000118d3a38 x24: 0000000000000000 
[ 1323.083264] x23: 0000000000000000 x22: 0000000000000000 
[ 1323.088563] x21: ffff00082ec14a60 x20: 0000000000000001 
[ 1323.093862] x19: 0000000000000060 x18: ffff00082f5dcc00 
[ 1323.099162] x17: 0000000000000000 x16: 0000000000000000 
[ 1323.104465] x15: 00008cb0584067c2 x14: 0000000000000004 
[ 1323.109765] x13: 0000000000000004 x12: 0000000000000000 
[ 1323.115064] x11: 0000000000000002 x10: ffff00082f316800 
[ 1323.120363] x9 : 000000000000000f x8 : 00000000ffffffff 
[ 1323.125662] x7 : 0000000000000020 x6 : 0000000000000000 
[ 1323.130961] x5 : ffff80086e9fa000 x4 : ffff000827ce3b00 
[ 1323.136261] x3 : 0000000000000001 x2 : 0000000000000010 
[ 1323.141558] x1 : ffffffffffffffff x0 : 0000000000000060 
[ 1323.143565] stress-ng-vm invoked oom-killer: gfp_mask=0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=1000
[ 1323.146857] Call trace:
[ 1323.146871]  percpu_counter_add_batch+0x2c/0x120
[ 1323.146877]  clear_page_dirty_for_io+0x1b8/0x248
[ 1323.146879]  pageout+0x80/0x360
[ 1323.146881]  shrink_page_list+0x800/0xb50
[ 1323.146884]  shrink_inactive_list+0x1c4/0x3a8
[ 1323.146886]  shrink_lruvec+0x308/0x390
[ 1323.146888]  shrink_node+0x3f4/0x6b8
[ 1323.146890]  balance_pgdat+0x258/0x4a8
[ 1323.146892]  kswapd+0x1c0/0x3a0
[ 1323.146899]  kthread+0x140/0x160
[ 1323.146903]  ret_from_fork+0x10/0x34
[ 1323.146911] Code: b9401083 11000463 b9001083 d538d085 (f9401003) 
[ 1323.146916] ---[ end trace cc243eb0715b0a4e ]---
[ 1323.146993] note: kswapd0[100] exited with preempt_count 1
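
The fields in the oops can be cross-checked by hand. Below is a minimal Python sketch (not part of the original report) that decodes the ESR value and the faulting instruction word from the `Code:` line, using only numbers copied from the log. The function names are made up for illustration, and the fault-code table covers only the translation-fault entries.

```python
# Values copied from the oops above.
ESR = 0x96000006
FAULT_INSN = 0xF9401003   # the word in parentheses on the "Code:" line
X0 = 0x60                 # x0 at the time of the fault
FAULT_ADDR = 0x80         # "virtual address 0000000000000080"

# Partial table: only the translation-fault status codes.
DFSC_NAMES = {
    0x04: "translation fault, level 0",
    0x05: "translation fault, level 1",
    0x06: "translation fault, level 2",
    0x07: "translation fault, level 3",
}


def decode_esr(esr):
    """Split an ESR_ELx value into the fields the kernel prints."""
    return {
        "EC": (esr >> 26) & 0x3F,   # exception class
        "IL": (esr >> 25) & 0x1,    # 1 = 32-bit instruction
        "ISS": esr & 0x1FFFFFF,     # instruction specific syndrome
        "WnR": (esr >> 6) & 0x1,    # 0 = read, 1 = write
        "DFSC": esr & 0x3F,         # data fault status code
    }


def decode_ldr_x_uimm(insn):
    """Decode a 64-bit LDR (immediate, unsigned offset).

    Returns (rt, rn, byte_offset).
    """
    if insn & 0xFFC00000 != 0xF9400000:
        raise ValueError("not a 64-bit LDR with unsigned immediate offset")
    imm12 = (insn >> 10) & 0xFFF
    return insn & 0x1F, (insn >> 5) & 0x1F, imm12 * 8  # imm12 is scaled by 8


if __name__ == "__main__":
    esr = decode_esr(ESR)
    print("EC = 0x%02x, WnR = %d" % (esr["EC"], esr["WnR"]))
    print("DFSC = 0x%02x (%s)" % (esr["DFSC"], DFSC_NAMES.get(esr["DFSC"], "other")))
    rt, rn, off = decode_ldr_x_uimm(FAULT_INSN)
    print("ldr x%d, [x%d, #%d]" % (rt, rn, off))
    print("x0 + offset == fault address:", X0 + off == FAULT_ADDR)
```

With the values from this log, the decode gives EC 0x25, a read access, and a level 2 translation fault, which agrees with the `pmd=0000000000000000` entry in the page table dump. The faulting instruction is a load from `[x0, #32]`, and 0x60 + 0x20 equals the reported address 0x80. One possible reading, which the log alone does not confirm, is that x0 holds a member offset computed from a NULL structure pointer rather than a valid kernel address.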

After the problem occurs, the system's iowait becomes abnormal. In fact the whole system runs abnormally; iowait is just the most obvious symptom.

The problem appears relatively easily when the system is under memory and disk pressure. Please help us see how to solve it.

Our suspicion is that under high memory and disk pressure the kernel's memory reclaim mechanism hits an error, causing kswapd0 to exit.

  • Hi Sheng,

    The error seems to come up at about 1300 seconds, which is roughly 21 minutes. Can you add the details below?

    • The use case that's running.
    • Is this a custom board or TI EVM?
    • Is this reproducible consistently?
    • What's the DDR size in case it's your custom board?
    • Any more details on what triggers this?

    Best Regards,

    Keerthy

  • As for the 1300 seconds: there is no error message before this one, only the normal messages from the kernel finishing boot and entering the file system.


    The board is self-developed, the DDR capacity is 4GB, the frequency is 3733, and the SDK is based on version 8.6.


    In our testing, the problem tends to recur when the system memory load is relatively high, such as 95% or more, and disk IO is high.


    Our current workaround is to turn off swap in the kernel and adjust Hugepagesize to 2MB. The system is much more stable: we ran our business workload and an fio disk stress test for more than 24 hours and the problem did not appear.
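
A quick way to confirm on the target that both settings took effect is to read `/proc/meminfo`. The following is a minimal Python sketch, not something from the original posts. It assumes that "swap off" shows up as `SwapTotal: 0 kB` and that the huge page setting shows up as `Hugepagesize: 2048 kB`. The helper names are hypothetical.

```python
import os


def parse_meminfo(text):
    """Return {field: integer value} from /proc/meminfo-style text."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # kB for fields with a unit
    return fields


def workaround_applied(fields):
    """True if swap is absent and the huge page size is 2 MB (2048 kB)."""
    swap_off = fields.get("SwapTotal", 0) == 0
    hugepage_2mb = fields.get("Hugepagesize") == 2048
    return swap_off and hugepage_2mb


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print("SwapTotal (kB):", info.get("SwapTotal"))
    print("Hugepagesize (kB):", info.get("Hugepagesize"))
    print("workaround applied:", workaround_applied(info))
```

If `Hugepagesize` is missing from the output entirely, the kernel was most likely built without hugetlb support, and the check reports the workaround as not applied.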

  • Hi Sheng,

    In our testing, the problem tends to recur when the system memory load is relatively high, such as 95% or more, and disk IO is high.


    Our current workaround is to turn off swap in the kernel and adjust Hugepagesize to 2MB. The system is much more stable: we ran our business workload and an fio disk stress test for more than 24 hours and the problem did not appear.

    How can this be reproduced on the EVM? What use case should we run, and on which SDK version? We do not see this issue on the EVM.

    - Keerthy

  • Hi Keerthy,

    This is inherently difficult to reproduce and needs more testing and validation.

    We did not run many tests on the EVM development boards we started out with, so it is not clear whether the problem would come up there.

    One scenario is that we have tiovx enabled. Is it possible to test this on your end? Our commands use stress-ng and fio.

  • Hi Sheng,

    One scenario is that we have tiovx enabled. Is it possible to test this on your end? Our commands use stress-ng and fio.

    Yes. If you can share the steps, we can try it on the EVM.

    - Keerthy

  • Hi Keerthy,

    After testing, we finally found that the problem happens when nginx is started and does not happen when nginx is shut down. What could the problem be?


    Setting up an EVM environment is difficult, and we are not in a position to do so at the moment.

  • Hi Sheng,

    From the traces I see that the oom-killer was invoked, which basically means the system is out of memory. I believe this does not happen without nginx running. Please check nginx to see whether memory is being leaked or some large allocation is needed.
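
One way to check for a leak is to sample the resident memory of the nginx processes over time and watch for steady growth. The sketch below is a hedged Python example, not an nginx or TI tool. It sums `VmRSS` from `/proc/<pid>/status` for every process whose name matches. The process name `nginx` and the helper names are assumptions for illustration.

```python
import os


def parse_status(text):
    """Return (name, rss_kb) from /proc/<pid>/status-style text."""
    name, rss_kb = None, 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            rss_kb = int(line.split()[1])  # value is reported in kB
    return name, rss_kb


def total_rss_kb(process_name, proc_root="/proc"):
    """Sum VmRSS (kB) over all processes with the given name."""
    total = 0
    if not os.path.isdir(proc_root):
        return total
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                name, rss_kb = parse_status(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if name == process_name:
            total += rss_kb
    return total


if __name__ == "__main__":
    # Assumed process name; adjust to match the actual binary.
    print("nginx RSS (kB):", total_rss_kb("nginx"))
```

Running this periodically during the stress test and logging the result would show whether the nginx footprint keeps climbing before the oom-killer fires.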

    Sorry I don't know much about nginx and hence cannot comment further.

    Best Regards,

    Keerthy 

  • Hi Keerthy,

    I think I have found the problem on my side: the cryptodev-linux module, which calls into the hardware crypto module. I don't really know how the kernel works internally, but the module affects the kswapd0 thread, causing it to crash and exit, which leads to the system exception.

    After removing the cryptodev-linux driver the problem no longer occurs. You can test it on your side to see what the exact cause is.
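
To confirm that a given boot really runs without the driver, the loaded module list can be checked. This is a minimal Python sketch under the assumption that the driver appears as `cryptodev`, the name shown in the `Modules linked in:` line of the oops. The helper name is hypothetical.

```python
import os


def loaded_modules(text):
    """Return the set of module names from /proc/modules-style text."""
    # Each line of /proc/modules starts with the module name.
    return {line.split()[0] for line in text.splitlines() if line.strip()}


if __name__ == "__main__" and os.path.exists("/proc/modules"):
    with open("/proc/modules") as f:
        modules = loaded_modules(f.read())
    # "cryptodev" is the name taken from the oops log in this thread.
    print("cryptodev loaded:", "cryptodev" in modules)
```

Recording this result alongside each stress run makes it easier to tie a pass or failure to the presence of the module.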

  • Hi,

    After removing the cryptodev-linux driver the problem no longer occurs. You can test it on your side to see what the exact cause is.

    By default the cryptodev-linux module is needed for crypto operations.

    http://cryptodev-linux.org/

    It puts the hardware crypto accelerators to use. Does nginx need the crypto operations? If they are not needed, then please go ahead and remove the module.
    Thanks for digging in and sharing the information.

    - Keerthy

  • then please go ahead and remove the module.

    Thanks for the discussion. After more than 24 hours of stress testing, the kswapd thread exit exception no longer appears. I intend to remove the module first, to prioritize resolving the system exception, and then troubleshoot step by step why the cryptodev-linux module has this impact.