Kmalloc problem

_Ralph_

I'm trying to allocate a 10MB contiguous region of memory using kmalloc. I need this because it would be easier than setting up my peripheral to DMA into different areas of memory (i.e. "scatter-gather"); also, if I hand off the buffer that the ARM receives from the peripheral to the DSP, the DSP definitely requires contiguous memory. Whenever I try to allocate more than a very small amount it fails, so I tried increasing CONSISTENT_DMA_SIZE to 14MB in arch/arm/include/asm/memory.h, but this doesn't work either.

So in either case, when I try to allocate 10MB of memory with kmalloc, here is the resulting output:

------------[ cut here ]------------
WARNING: at mm/page_alloc.c:1990 __alloc_pages_nodemask+0x148/0x514()
Modules linked in: pciemodule(+)
[<c0042ff4>] (unwind_backtrace+0x0/0xec) from [<c0062174>] (warn_slowpath_common+0x4c/0x64)
[<c0062174>] (warn_slowpath_common+0x4c/0x64) from [<c00621a4>] (warn_slowpath_null+0x18/0x1c)
[<c00621a4>] (warn_slowpath_null+0x18/0x1c) from [<c0097b48>] (__alloc_pages_nodemask+0x148/0x514)
[<c0097b48>] (__alloc_pages_nodemask+0x148/0x514) from [<c0097f24>] (__get_free_pages+0x10/0x24)
[<c0097f24>] (__get_free_pages+0x10/0x24) from [<bf001590>] (pci_probe+0x294/0x320 [sharkpcie])
[<bf001590>] (pci_probe+0x294/0x320 [sharkpcie]) from [<c0190954>] (local_pci_probe+0x50/0xac)
[<c0190954>] (local_pci_probe+0x50/0xac) from [<c0190ce8>] (pci_device_probe+0x58/0x8c)
[<c0190ce8>] (pci_device_probe+0x58/0x8c) from [<c01c53b4>] (driver_probe_device+0xc8/0x188)
[<c01c53b4>] (driver_probe_device+0xc8/0x188) from [<c01c54d4>] (__driver_attach+0x60/0x84)
[<c01c54d4>] (__driver_attach+0x60/0x84) from [<c01c4c0c>] (bus_for_each_dev+0x44/0x74)
[<c01c4c0c>] (bus_for_each_dev+0x44/0x74) from [<c01c4564>] (bus_add_driver+0xa8/0x224)
[<c01c4564>] (bus_add_driver+0xa8/0x224) from [<c01c57c4>] (driver_register+0xa8/0x134)
[<c01c57c4>] (driver_register+0xa8/0x134) from [<c0190f18>] (__pci_register_driver+0x38/0xac)
[<c0190f18>] (__pci_register_driver+0x38/0xac) from [<bf0012c0>] (drv_init+0x14/0x50 [sharkpcie])
[<bf0012c0>] (drv_init+0x14/0x50 [sharkpcie]) from [<c00393f8>] (do_one_initcall+0xc8/0x19c)
[<c00393f8>] (do_one_initcall+0xc8/0x19c) from [<c008a6b4>] (sys_init_module+0x90/0x1ac)
[<c008a6b4>] (sys_init_module+0x90/0x1ac) from [<c003e180>] (ret_fast_syscall+0x0/0x30)
---[ end trace 29c1800fa352226a ]---

Any ideas how to fix this? I already have "vmalloc=500M" in my kernel parameters which I don't expect is helping things.

Thanks,
Ralph

over 13 years ago

_Ralph_ over 13 years ago

Guru 10685 points

Sorted after a massive amount of googling:

http://stackoverflow.com/questions/5940101/allocating-more-than-4-mb-of-pinned-contiguous-memory-in-the-linux-kernel

Chris Ring over 13 years ago in reply to _Ralph_

TI__Genius 17205 points

FWIW, we typically recommend CMEM (from the Linux Utils product in your EZSDK/DVSDK) for this:

http://processors.wiki.ti.com/index.php/CMEM_Overview

Chris

_Ralph_ over 13 years ago in reply to Chris Ring

Guru 10685 points

Thanks for the tip. Having quickly refreshed my memory on this, I can see that the main drawback is that you have to manually specify the start and end addresses of the physical block. This makes it unsuitable for dynamically allocating buffers, as there is no guarantee that those buffers will still be free after booting. Ideally the CMEM module would use the kernel page allocator to work out which contiguous address range to allocate... but then it would fail, because the kernel refuses to allocate more than 2^(MAX_ORDER-1) pages at a time. So overall it seems I may as well just alter MAX_ORDER and stick with the normal kernel way of doing things. Please correct me if I'm wrong!
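To put rough numbers on that limit, here is a small user-space sketch (not kernel code) of the order arithmetic. It assumes 4KB pages and a MAX_ORDER of 11, which I believe is the usual default but should be checked against the actual kernel config; the function and macro names are made up for illustration.

```c
#include <assert.h>

/* Assumed values for illustration only: 4 KiB pages and MAX_ORDER = 11.
 * Check the real values in your own kernel configuration. */
#define EXAMPLE_PAGE_SIZE 4096UL
#define EXAMPLE_MAX_ORDER 11U

/* Smallest order such that (page_size << order) >= size. */
unsigned int order_for_size(unsigned long size, unsigned long page_size)
{
    unsigned int order = 0;
    unsigned long span = page_size;

    assert(size > 0 && page_size > 0);
    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}

/* Largest single block the page allocator can hand out, in bytes:
 * 2^(max_order - 1) pages. */
unsigned long max_alloc_bytes(unsigned int max_order, unsigned long page_size)
{
    assert(max_order >= 1);
    return page_size << (max_order - 1);
}

/* Non-zero if a request of `size` bytes fits in one such block. */
int fits_in_buddy(unsigned long size, unsigned int max_order,
                  unsigned long page_size)
{
    return order_for_size(size, page_size) <= max_order - 1;
}
```

Under those assumptions a 10MB request needs order 12 (4096 pages), whereas the largest order available is 10 (1024 pages, i.e. 4MB), which lines up with the 4 MB figure in the Stack Overflow link above. Raising MAX_ORDER to 13 would make the request fit in principle, although a free block of that size still has to exist at allocation time.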

As a side note, the entire reason we have to use one large physical buffer is that the IOMMU on the DSP is broken (and IIRC Codec Engine uses physical addresses anyway by its fundamental design?). Our other bespoke peripheral does not have an IOMMU either, but it does have scatter-gather DMA, which means it doesn't care if our ARM-allocated buffer is not contiguous.

Ralph

Chris Ring over 13 years ago in reply to _Ralph_

TI__Genius 17205 points

_Ralph_ said:
Having quickly refreshed my memory on this, I can see that the main drawback is that you have to manually specify the start and end addresses of the physical block. This makes it unsuitable for dynamically allocating buffers, as there is no guarantee that those buffers will still be free after booting. Ideally the CMEM module would use the kernel page allocator to work out which contiguous address range to allocate... but then it would fail, because the kernel refuses to allocate more than 2^(MAX_ORDER-1) pages at a time.

Using CMEM, you create a carveout (typically using mem= on the Linux command line) so Linux doesn't really know about the memory. Assuming adequate memory was carved out, "dynamic" allocation will always succeed, since we're not using Linux's kernel allocator at all (CMEM's allocator takes care of it).
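As a rough illustration of the carveout idea, the sketch below works out which physical range is left outside Linux once mem= is applied, and then hands out buffers from that range with a toy bump allocator. The RAM base, RAM size and mem= value in the usage note are hypothetical, and the allocator is only a stand-in: it is not CMEM's actual allocator or API.

```c
#include <assert.h>
#include <stdint.h>

/* Physical range that Linux does not manage once "mem=" is applied. */
struct carveout {
    uint64_t start; /* first physical address outside Linux's view */
    uint64_t end;   /* one past the last address of the carveout */
    uint64_t next;  /* next free address for the toy allocator */
};

/* mem_bytes is what "mem=" leaves to Linux; the rest of RAM is carved out. */
struct carveout carveout_from_mem(uint64_t ram_base, uint64_t ram_bytes,
                                  uint64_t mem_bytes)
{
    struct carveout c;

    assert(mem_bytes <= ram_bytes);
    c.start = ram_base + mem_bytes;
    c.end = ram_base + ram_bytes;
    c.next = c.start;
    return c;
}

/* Toy bump allocator: returns a physical address inside the carveout,
 * or 0 if the request is empty or does not fit in what is left. */
uint64_t carveout_alloc(struct carveout *c, uint64_t size)
{
    uint64_t addr = c->next;

    if (size == 0 || size > c->end - c->next)
        return 0;
    c->next += size;
    return addr;
}
```

For example, assuming 256MB of RAM at 0x80000000 and mem=240M, the carveout would be 0x8F000000 to 0x90000000 (16MB). A 10MB request then succeeds at 0x8F000000 regardless of how fragmented Linux's own memory is, and only fails once the carveout itself is used up.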

Without better kernel support, this is the simple-to-understand approach we developed several years ago, and it remains today. Good news is that the kernel community sees this as an issue and is looking at adding a "CMA" (Contiguous Memory Allocator) feature into the kernel in the [hopefully near] future. We're monitoring this, and prototyping putting CMEM user mode APIs on top of CMA. This would give you the benefits of CMEM's API set (alloc/free, cache, virt2phys, etc) with the additional benefits of CMA (when the memory isn't allocated, the kernel can use it for other things). CMA is still being discussed and reviewed, but it looks like it'll be added to the mainline kernel this year.

_Ralph_ said:
As a side note, the entire reason we have to use one large physical buffer is that the IOMMU on the DSP is broken (and IIRC Codec Engine uses physical addresses anyway by its fundamental design?).

I can neither confirm nor deny that with certainty, but I think you're right about the DSP's MMU being broken. :/ Strictly speaking, CE is responsible for translating addresses from user-side virt to "whatever the remote proc can see". Today that's physical addresses (and worse, physically contiguous addresses) b/c the slave has no MMU. But assuming the slave has an MMU, CE could add a feature to scatter/gather memory into the slave-side's MMU (so it looks virtually contiguous to the slave), and pass the slave that [now virtual, and virtually contiguous] address for the slave to work on. Don't have that support today, though.
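To make the scatter/gather-into-MMU idea concrete, here is a toy user-space model of it. This is not Codec Engine code and not a real MMU driver; every name in it is invented for illustration. Scattered physical pages are entered into consecutive slots of a tiny page table, so the addresses the slave would use are contiguous even though the backing pages are not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_TABLE_ENTRIES 16u

/* One-level toy page table: virtual page number -> physical page base. */
struct toy_mmu {
    uint32_t frame[TOY_TABLE_ENTRIES];
    int valid[TOY_TABLE_ENTRIES];
};

void toy_mmu_init(struct toy_mmu *mmu)
{
    for (size_t i = 0; i < TOY_TABLE_ENTRIES; i++) {
        mmu->frame[i] = 0;
        mmu->valid[i] = 0;
    }
}

/* Map n scattered physical pages at consecutive virtual pages starting
 * at first_vpage. Returns 0 on success, -1 on a bad range or alignment. */
int toy_map_scattered(struct toy_mmu *mmu, uint32_t first_vpage,
                      const uint32_t *phys_pages, size_t n)
{
    if (first_vpage >= TOY_TABLE_ENTRIES || n > TOY_TABLE_ENTRIES - first_vpage)
        return -1;
    for (size_t i = 0; i < n; i++)
        if (phys_pages[i] % TOY_PAGE_SIZE != 0)
            return -1;
    for (size_t i = 0; i < n; i++) {
        mmu->frame[first_vpage + i] = phys_pages[i];
        mmu->valid[first_vpage + i] = 1;
    }
    return 0;
}

/* Translate a slave-side virtual address. Returns 0 and stores the
 * physical address in *paddr, or -1 if the page is not mapped. */
int toy_translate(const struct toy_mmu *mmu, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpage = vaddr / TOY_PAGE_SIZE;

    if (vpage >= TOY_TABLE_ENTRIES || !mmu->valid[vpage])
        return -1;
    *paddr = mmu->frame[vpage] + vaddr % TOY_PAGE_SIZE;
    return 0;
}
```

In this model, mapping three non-adjacent physical pages at virtual pages 2 to 4 gives the slave one unbroken 12KB range starting at 0x2000, which is the "virtually contiguous" view described above.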

Chris
