This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

Linux/AM3352: AES driver random crashes

Part Number: AM3352

Tool/software: Linux

I have an AM3352 processor running a linux system with a linux kernel from the processor-sdk-linux-rt-04.02.00 branch (kernel v 4.9.59).

I use encrypted file systems and various user space processes using the cryptodev module (rev 1.9). This means that the aes driver in the kernel is having a busy time. This has worked fine in the 4.4.12 kernel (processor-sdk-linux-rt-03.00.00), but when I upgraded to the 4.9.59 kernel I get random crashes in omap-aes.c. It crashes in different places. Here is one typical crash that I get. It looks to me like there is some kind of memory corruption.

[ 3048.612234] omap_aes_prepare_req: Failed to copy SGs for unaligned cases
[ 3048.620382] Unable to handle kernel NULL pointer dereference at virtual address 00000004
[ 3048.628928] pgd = cdac8000
[ 3048.631766] [00000004] *pgd=00000000
[ 3048.635531] Internal error: Oops: 5 [#1] ARM
[ 3048.639999] Modules linked in: cryptodev(O)
[ 3048.644391] CPU: 0 PID: 1154 Comm: in:imklog Tainted: G           O    4.9.59-rt23+ #75
[ 3048.652768] Hardware name: Generic AM33XX (Flattened Device Tree)
[ 3048.659144] task: cc1d8140 task.stack: cc58c000
[ 3048.663894] PC is at scatterwalk_copychunks+0x110/0x190
[ 3048.669362] LR is at scatterwalk_copychunks+0x108/0x190
[ 3048.674831] pc : [<c03876b4>]    lr : [<c03876ac>]    psr: 200f0113
[ 3048.674831] sp : cc58de70  ip : cc58de70  fp : cc58de9c
[ 3048.686849] r10: c3af3644  r9 : 00008000  r8 : 00000001
[ 3048.692316] r7 : 00000001  r6 : c7df0030  r5 : cc58dea8  r4 : 00000674
[ 3048.699147] r3 : 00000000  r2 : 00000644  r1 : 00000674  r0 : 00000000
[ 3048.705981] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[ 3048.713451] Control: 10c5387d  Table: 8dac8019  DAC: 00000051
[ 3048.719463] Process in:imklog (pid: 1154, stack limit = 0xcc58c210)
[ 3048.726022] Stack: (0xcc58de70 to 0xcc58e000)
[ 3048.730584] de60:                                     00000001 cca3a010 c0118470 c7df0000
[ 3048.739150] de80: c3540000 00000100 c0d3fd84 40000006 cc58decc cc58dea0 c05cf8fc c03875b0
[ 3048.747715] dea0: 00000000 cca3a010 00000000 00000674 cca3a030 00000000 00000000 c08cf294
[ 3048.756279] dec0: cc58deec cc58ded0 c0134a28 c05cf6e0 00000000 00000006 cc58c000 c0d3fd9c
[ 3048.764843] dee0: cc58defc cc58def0 c0134abc c0134a04 cc58df5c cc58df00 c0134c2c c0134a98
[ 3048.773408] df00: c0160668 c0162910 c01e9e50 00400140 c0d13440 000431b2 00000009 c0803860
[ 3048.781973] df20: ffffe000 c0d3fd40 c0d0fcf0 c0d3fd84 cd815300 c0d1d358 00000014 00000000
[ 3048.790538] df40: 00000000 cd804000 00000000 004e004c cc58df6c cc58df60 c01352f8 c0134b4c
[ 3048.799103] df60: cc58df94 cc58df70 c015fe68 c01351d0 c0d6bc68 200f0030 ffffffff 10c5387d
[ 3048.807667] df80: 10c53c7d 00000000 cc58dfac cc58df98 c0101420 c015fe14 004b426a 200f0030
[ 3048.816232] dfa0: 00000000 cc58dfb0 c07025d4 c01013f0 00517a78 00503970 00000000 00000001
[ 3048.824796] dfc0: 00517a78 00000001 00503970 005081d0 00505cc0 00000000 004e004c b6877d30
[ 3048.833361] dfe0: 00503a00 b6877428 004b56d9 004b426a 200f0030 ffffffff 00000000 00000000
[ 3048.841917] Backtrace:
[ 3048.844488] [<c03875a4>] (scatterwalk_copychunks) from [<c05cf8fc>] (omap_aes_done_task+0x228/0x284)
[ 3048.854053]  r10:40000006 r9:c0d3fd84 r8:00000100 r7:c3540000 r6:c7df0000 r5:c0118470
[ 3048.862249]  r4:cca3a010 r3:00000001
[ 3048.866007] [<c05cf6d4>] (omap_aes_done_task) from [<c0134a28>] (__tasklet_action.constprop.6+0x30/0x94)
[ 3048.875933]  r7:c08cf294 r6:00000000 r5:00000000 r4:cca3a030
[ 3048.881861] [<c01349f8>] (__tasklet_action.constprop.6) from [<c0134abc>] (tasklet_action+0x30/0x34)
[ 3048.891423]  r7:c0d3fd9c r6:cc58c000 r5:00000006 r4:00000000
[ 3048.897351] [<c0134a8c>] (tasklet_action) from [<c0134c2c>] (__do_softirq+0xec/0x258)
[ 3048.905553] [<c0134b40>] (__do_softirq) from [<c01352f8>] (irq_exit+0x134/0x138)
[ 3048.913297]  r10:004e004c r9:00000000 r8:cd804000 r7:00000000 r6:00000000 r5:00000014
[ 3048.921491]  r4:c0d1d358
[ 3048.924152] [<c01351c4>] (irq_exit) from [<c015fe68>] (__handle_domain_irq+0x60/0xb0)
[ 3048.932353] [<c015fe08>] (__handle_domain_irq) from [<c0101420>] (omap_intc_handle_irq+0x3c/0x98)
[ 3048.941644]  r9:00000000 r8:10c53c7d r7:10c5387d r6:ffffffff r5:200f0030 r4:c0d6bc68
[ 3048.949764] [<c01013e4>] (omap_intc_handle_irq) from [<c07025d4>] (__irq_usr+0x54/0x80)
[ 3048.958141] Exception stack(0xcc58dfb0 to 0xcc58dff8)
[ 3048.963429] dfa0:                                     00517a78 00503970 00000000 00000001
[ 3048.971994] dfc0: 00517a78 00000001 00503970 005081d0 00505cc0 00000000 004e004c b6877d30
[ 3048.980557] dfe0: 00503a00 b6877428 004b56d9 004b426a 200f0030 ffffffff
[ 3048.987480]  r5:200f0030 r4:004b426a
[ 3048.991225] Code: e1a00003 eb012369 e5850000 e1a03000 (e5901004)
[ 3048.997694] ---[ end trace 85338e5eb66d6e98 ]---
[ 3049.002529] Kernel panic - not syncing: Fatal exception in interrupt
[ 3049.009181] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

----------------------

What could this be due to?

  • Magnus,

    Do you see the same failure with non-RT kernel in the Processor SDK?
  • Yes same thing. I also tried without pre-emption  (No Forced Preemption). But I still get the crash.

  • Magnus,

    Since it worked in v4.4 but is broken in v4.9, it might not be difficult to find the kernel patch which caused the regression.
    Can you please provide more details about your use case to help me replicate the failure? How do you use encrypted filesystem? Do you have to do any specific task to trigger the crash?
  • Magnus,

    I haven’t heard back from you, I’m assuming you were able to resolve your issue. If not, just post a reply below (or create a new thread if the thread has locked due to time-out). thanks.
  • Hello again,

    I haven't been able to create a generic method to recreate the error. But I have compiled the kernel with KDB so I can get the log out. It looks like we're out of 64kB buffers, but the memory seems to be very fragmented. Maybe you have a suggestion on how to resolve this. See log below:

    <4>[  319.763538] enter, keylen: 16
    <4>[  319.763602] nbytes: 48, enc: 1, cbc: 1
    <4>[  319.768733] enter, keylen: 16
    <4>[  319.768855] nbytes: 32816, enc: 0, cbc: 1
    <4>[  319.768942] 53500000.aes-en: page allocation failure: order:4, mode:0x2080020(GFP_ATOMIC)
    <4>[  319.768977] CPU: 0 PID: 821 Comm: 53500000.aes-en Tainted: G           O    4.9.59-rt23+ #84
    <4>[  319.768981] Hardware name: Generic AM33XX (Flattened Device Tree)
    <4>[  319.768985] Backtrace:
    <4>[  319.769020] [<c00197ec>] (dump_backtrace) from [<c0019aa8>] (show_stack+0x18/0x1c)
    <4>[  319.769030]  r7:c0a92954 r6:c0aa35c0 r5:00000030 r4:00000000
    <4>[  319.769046] [<c0019a90>] (show_stack) from [<c02d420c>] (dump_stack+0x24/0x28)
    <4>[  319.769064] [<c02d41e8>] (dump_stack) from [<c00c1cb4>] (warn_alloc+0x100/0x114)
    <4>[  319.769073] [<c00c1bb8>] (warn_alloc) from [<c00c2038>] (__alloc_pages_nodemask+0x2cc/0xb28)
    <4>[  319.769079]  r3:00000000 r2:00000004 r1:c06f129c
    <4>[  319.769083]  r4:02080020
    <4>[  319.769093] [<c00c1d6c>] (__alloc_pages_nodemask) from [<c00c28ac>] (__get_free_pages+0x18/0x34)
    <4>[  319.769102]  r10:00000000 r9:cdc30000 r8:00000004 r7:00008030 r6:cc61da00 r5:cdddd280
    <4>[  319.769106]  r4:cca42610
    <4>[  319.769123] [<c00c2894>] (__get_free_pages) from [<c04e432c>] (omap_aes_prepare_req+0xc0/0x270)
    <4>[  319.769143] [<c04e426c>] (omap_aes_prepare_req) from [<c0298b90>] (crypto_pump_work+0x1d0/0x298)
    <4>[  319.769152]  r10:00000000 r9:00000000 r8:cc91db7c r7:cc91db50 r6:a0030013 r5:cdddd280
    <4>[  319.769156]  r4:cc91dbd4
    <4>[  319.769179] [<c02989c0>] (crypto_pump_work) from [<c0053ec0>] (kthread_worker_fn+0xf4/0x198)
    <4>[  319.769188]  r9:00000000 r8:ffffe000 r7:00000001 r6:c0ae5978 r5:cca16000 r4:cc91dbb4
    <4>[  319.769200] [<c0053dcc>] (kthread_worker_fn) from [<c0053d80>] (kthread+0xf8/0x110)
    <4>[  319.769210]  r9:00000000 r8:c0053dcc r7:cc91dbb4 r6:cca16000 r5:cca45c80 r4:00000000
    <4>[  319.769232] [<c0053c88>] (kthread) from [<c0015720>] (ret_from_fork+0x14/0x34)
    <4>[  319.769240]  r8:00000000 r7:00000000 r6:00000000 r5:c0053c88 r4:cca45c80
    <4>[  319.769243] Mem-Info:
    <4>[  319.769265] active_anon:18689 inactive_anon:1451 isolated_anon:0
    <4>[  319.769265]  active_file:7223 inactive_file:7187 isolated_file:0
    <4>[  319.769308] lowmem_reserve[]: 0 0 0
    <4>[  319.769317] Normal: 104*4kB (UMEHC) 156*8kB (UMEH) 296*16kB (UMEH) 1128*32kB (UMEH) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 2*8192kB (C) = 58880kB
    <4>[  319.769358] 15965 total pagecache pages
    <4>[  319.769369] 0 pages in swap cache
    <4>[  319.769374] Swap cache stats: add 0, delete 0, find 0/0
    <4>[  319.769376] Free swap  = 0kB
    <4>[  319.769378] Total swap = 0kB
    <4>[  319.769382] 65536 pages RAM
    <4>[  319.769384] 0 pages HighMem/MovableOnly
    <4>[  319.769387] 2758 pages reserved
    <4>[  319.769389] 6144 pages cma reserved
    <3>[  319.769398]    omap_aes_copy_sgs: Couldn't allocated pages for unaligned cases.
    <3>[  319.777172] omap_aes_prepare_req: Failed to copy SGs for unaligned cases
    <4>[  319.784208] total: 32816
    <4>[  319.785243] enter done_task

  • Magnus,

    I think if you could explain how to use encrypted filesystem, I could try to reproduce it on my setup. I haven't used encrypted filesystem before.

    The second log you provided seems telling out of memory. How much DDR does your board have?

    Since your system works in v4.4 but failed in v4.9, looking at the crypto driver changes in kernel source code from git://gitorious.ti.com/ti-linux-kernel/ti-linux-kernel.git:

    $ git log --oneline ti2016.01..ti2017.05 drivers/crypto/omap-aes*

    02a424f69f73 Revert "crypto: omap-aes: prevent probe time race condition with multiple cores"
    3f3c3de21b5c crypto: omap-aes: Verify page zone of scatterlists before starting DMA
    4f15d74a4338 crypto: omap-aes: Use bool instead of int for omap_aes_check_aligned()
    18b7177f5e6e crypto: omap-aes: add spinlock init for GCM mode lock
    49f6bf75eced crypto: omap-aes: prevent probe time race condition with multiple cores
    ff9ca8c009a0 crypto: omap-aes: make queue length configurable
    fb73ee38836c crypto: omap-aes: make fallback size configurable
    ca1e39992036 crypto: omap-aes: Add support for GCM mode
    ecbcbd3f4306 crypto: omap-aes: export some AES driver functionality locally
    460e7be69189 crypto: omap-aes: fix context handling for multiple cores
    d7b43357ab94 crypto: omap-aes: move definitions over to a separate header file
    99f2c84ba255 crypto: omap-aes: drop unused flags attribute from omap_aes_ctx
    c98ef8dbca18 crypto: omap-aes - fix error return code in omap_aes_probe()
    0d0cda93de95 crypto: omap-aes - fix crypto engine initialization order
    9fcb191a84eb crypto: omap-aes - Add fallback support
    164f3ef301bf crypto: omap-aes - Add support for multiple cores
    f303b455bb95 crypto: omap-aes - use runtime_pm autosuspend for clock handling
    4cba7cf025f3 crypto: engine - permit to enqueue ashash_request
    2589ad84047f crypto: engine - move crypto engine to its own header
    7c001a8650e6 crypto: omap - Stop using crypto scatterwalk_bytes_sglen
    b973eaab68db crypto: omap - do not call dmaengine_terminate_all
    3741bbb207f7 crypto: omap-aes - Fix registration of algorithms
    da8b29a6c189 crypto: omap-aes - Use dma_request_chan() for requesting DMA channel
    0529900a01cb crypto: omap-aes - Support crypto engine framework
    16f080aaadcb crypto: omap - Removed unused variable "err"

    The two highlighted above are more interesting to the issue you have. Is it possible for you to manually revert the two changes one at a time to see if they are related?

  • Bin,

    To use encrypted files systems you should use the tool cryptsetup:

    # Create a key
    head -c 32 /dev/urandom  > /tmp/key
    
    # Map a partition to the decrypted device /dev/mapper/encpart
    cryptsetup --cipher=aes-cbc-essiv:sha256 --key-file=/tmp/key open --type=plain /dev/<partition> encpart
    
    # Create file system
    mkfs.ext4 /dev/mapper/encpart
    
    # Mount the decrypted partition
    mkdir /tmp/encpart
    mount /dev/mapper/encpart /tmp/encpart
    

    I have 256MB RAM which should be more than enough. I think that the main problem is that the memory gets very fragmented.

    I tried to revert the commits you suggested. I managed to revert "Stop using crypto scatterwalk_bytes_sglen", but failed with the other one. However this didn't solve the problem. There has been a major refactoring of the crypto drivers between the 4.4 and 4.9 release, which makes it hard to revert individual commits. Linus Torvalds made a large commit on Oct 10th 2016. I tried to go back to the version before this commit, but then md5 calculations seems to be incorrect. Actually I'm not sure how to proceed now.

  • Magnus,

    I followed your instruction and set the encrypted mount of /dev/ram0 which is configured to be 64MB, and I also set Uboot bootargs with 'mem=272M' to limited the memory for kernel, but I'm unable to see any kernel crash with constantly reading/writing data from/to /tmp/encpart/. How soon do you think I expect to see the failure with reading/writing data to the encrypted mount?

    Do you see the issue if not load omap-aes-driver to not use the hw crypto?
    Can you try to keep the old Uboot image but only change the kernel from v4.4 to v4.9 to see if the issue still happens?
  • Bin,

    I use both encrypted file systems and cryptodev from a user space application.

    Anyway, your suggestion to disable the hw crypto lead me to the commit causing the trouble. When I turned off "Support for OMAP MD5/SHA1/SHA2 hw accelerator" I got no crashes. So the root cause of the crash actually came from omap-sham.c. The reason why omap-aes.c crashed was that it got out of memory because the omap-sham had made the memory fragmented.

    The commit in omap-sham.c that caused the problem was:
    182e283f3e550d70ba5b7e6be66ea74726646166

    "crypto: omap-sham - shrink the internal buffer size

    The current internal buffer size is way too large for crypto core, so
    shrink it to be smaller. This makes the buffer to fit into the space
    reserved for the export/import buffers also."
  • I think I jumped to conclusions here. I now saw in the log that omap-sham initialization failed when I reverted the commit I mentioned.

    But omap-sham still seems to be the root cause. If omap-sham is disabled or fails to initialize I don't get any crashes. But I can't say exactly what is causing the crash when omap-sham is used.
  • Magnus,

    Can you please try the mainline v4.14 kernel to see if the issue still exist?
    Can you please get memory usage statistics with omap-sham to see if there is mem leak?
    How soon after kernel booted to see the failure?

  • Bin,

    I tried mainline 4.14 rc8 and I still get the crash. The crash happens when my user space application does a lot of md5 calculations through cryptodev.
    I have already spent too much time on this and I don't really have the time to start to debug the memory allocations in omap-sham. For now I will stick to the 4.4 kernel.
  • Magnus,

    Good to know you can still use v4.4 kernel to avoid the issue. I will keep focusing on md5/sham to see if I can reproduce the issue. I will update you if I make any progress.
  • Magnus,

    There are two patches just posted in the kernel community, which might be related to the issue you reported.

    Do you mind to give them a test to see if they fix the issue? Might be a good idea to test them on v4.14-rc8 which you tested recently, just in case the two patches don't apply to v4.9 cleanly.

    patchwork.kernel.org/.../
    patchwork.kernel.org/.../
  • Bin,

    Thanks for the suggestion of patches. I'm away for about a week but I will try these as soon as I get back.
  • Bin,

    The patches did not resolve my problem. But I have finally been able to make a small test program to trigger the error. It seems like it is the md5 calculation of a certain size that is problematic.

    The test program looks like this:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <crypto/cryptodev.h>
    
    #define BUF_LEN 32784
    
    char buf[BUF_LEN];
    char digest[16];
    
    int main(void)
    {
        int count = 0;
    
        while (1)
        {
            int cfd = 0;
            struct session_op sess = {0};
            struct crypt_op cryp = {0};
    
            cfd = open("/dev/crypto", O_RDWR, 0);
            if (cfd < 0) {
                perror("open(/dev/crypto)");
                exit(1);
            }
    
            sess.mac = CRYPTO_MD5;
            if (ioctl(cfd, CIOCGSESSION, &sess)) {
                perror("ioctl(CIOCGSESSION)");
                exit(1);
            }
    
            cryp.ses = sess.ses;
            cryp.flags = COP_FLAG_UPDATE;
            cryp.len = BUF_LEN;
            cryp.src = buf;
            cryp.mac = digest;
            if (ioctl(cfd, CIOCCRYPT, &cryp)) {
                perror("ioctl(CIOCCRYPT)");
                exit(1);
            }
    
            cryp.flags = COP_FLAG_FINAL;
            cryp.len = 0;
            cryp.src = NULL;
            if (ioctl(cfd, CIOCCRYPT, &cryp)) {
                perror("ioctl(CIOCCRYPT)");
                exit(1);
            }
    
            if (ioctl(cfd, CIOCFSESSION, &sess.ses)) {
                perror("ioctl(CIOCFSESSION)");
                exit(1);
            }
            close(cfd);
    
            printf("%d\n", count++);
        }
    }
    

    After a number of runs the kernel will run out of memory when I run on kernel version 4.14.

    Before I run the test program /proc/pagetypeinfo shows this:

    Page block order: 11
    Pages per block:  2048
    
    Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10     11 
    Node    0, zone   Normal, type    Unmovable      1      1      1      0      1      1      1      1      2      0      1      0 
    Node    0, zone   Normal, type      Movable      1      1      1      2      0      1      0      0      0      1      1      4 
    Node    0, zone   Normal, type  Reclaimable      0      0      0      1      0      1      1      1      1      1      0      0 
    Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0      0 
    Node    0, zone   Normal, type          CMA      1      0      1      1      1      1      0      1      1      1      1      2 
    Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0      0 
    
    Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate 
    Node 0, zone   Normal            4           24            1            0            3            0 
    

    After I run the test program it looks like this:

    Page block order: 11
    Pages per block:  2048
    
    Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10     11
    Node    0, zone   Normal, type    Unmovable    100     45     14    781      1      0      0      0      0      0      0      0
    Node    0, zone   Normal, type      Movable      6      0      0      0      0      0      0      0      0      0      0      0
    Node    0, zone   Normal, type  Reclaimable      3      0      0      0      0      0      0      0      0      0      0      0
    Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0      0
    Node    0, zone   Normal, type          CMA      5      2      1      1      1      1      1      0      1      1      1      2
    Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0      0
    
    Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
    Node 0, zone   Normal           10           19            0            0            3            0 

    It is clear that the memory has been very fragmented after the test. If I run this on the 4.4 kernel I don't get this fragmentation.

  • Magnus,

    Thanks for the update.
    I will look at your test program and let you know when I found something in the kernel.
  • Magnus,

    Thanks for your test program, it helps debugging the memleak.

    Can you please try the kernel patch below?

    diff --git a/drivers/crypto/omap-sham.c b/drivers/crypto/omap-sham.c
    index e731749d23bb..4c9e594f0df1 100644
    --- a/drivers/crypto/omap-sham.c
    +++ b/drivers/crypto/omap-sham.c
    @@ -1089,7 +1089,7 @@ static void omap_sham_finish_req(struct ahash_request *req, int err)
     
            if (test_bit(FLAGS_SGS_COPIED, &dd->flags))
                    free_pages((unsigned long)sg_virt(ctx->sg),
    -                          get_order(ctx->sg->length));
    +                          get_order(ctx->sg->length + ctx->bufcnt));
     
            if (test_bit(FLAGS_SGS_ALLOCED, &dd->flags))
                    kfree(ctx->sg);
    
  • Bin,

    Thanks a lot! Now it works. But it looks like this was a quite serious bug in the sham driver.
  • Magnus,

    Glad it works now!
    Agreed, it is a silly bug.
    I have sent the patch to the kernel community mailing list and the driver owner has approved it. Hopefully this patch will get into the mainline kernel soon and backported to all the affected stable kernel versions.

    Thank you again for reporting the issue and helping to find the root cause. It is much appreciated.