faster memory allocation/freeing

Other Parts Discussed in Thread: SYSBIOS

Hi All,

As part of the IPC User Guide, I came across this statement:

"The HeapBufMP manager is intended as a very fast memory manager which can only allocate blocks of a single size."

From my understanding of the TI IPC examples, let me put it in points.

1. If a SharedRegion is created, a HeapMemMP instance is associated with it by default. ["It creates a HeapMemMP instance which spans the full size of the region. The other processors open the same HeapMemMP instance."]

2. For use across multiple cores [i.e. doing a malloc across multiple cores], a single instance is enough, right? In other words, I will use the heap handle of the HeapMemMP [i.e. obtained through SharedRegion_getHeap(), which gives a heap handle that can be used by all cores].

Ptr Osal_cppiMalloc (UInt32 num_bytes)
{
    Error_Block errorBlock;

    /* Allocate memory. */
    Error_init(&errorBlock);
    cppiMallocCounter++;

    /* Allocate a buffer from the default HeapMemMP. */
    return Memory_alloc(SharedRegion_getHeap(0), num_bytes, 0, &errorBlock);
}

Please note I am using the HeapMemMP instance ALREADY CREATED by the SharedRegion.

3. Similar to step 2, I want to create a HeapBufMP instance out of a SharedRegion. I saw an example of that in consumer_srio.c:

if (selfId == 0)
{
    /* Create the heap that will be used to allocate messages. */
    HeapBufMP_Params_init(&heapBufParams);
    heapBufParams.regionId  = 0; /* sharedRegionId */
    heapBufParams.name      = HEAP_NAME;
    heapBufParams.numBlocks = NUM_HOST_DESC;
    heapBufParams.blockSize = SRIO_MTU_SIZE;
    heapHandle = HeapBufMP_create(&heapBufParams);
    if (heapHandle == NULL)
    {
        System_abort("HeapBufMP_create failed\n");
    }
}
else
{
    /* Open the heap created by the other processor. Loop until opened. */
    do
    {
        status = HeapBufMP_open(HEAP_NAME, &heapHandle);
    } while (status < 0);
}

[A]. Is the above the correct way to create a HeapBufMP handle? Can I tell the SharedRegion to create a default HeapBufMP instance instead of a HeapMemMP instance?

[B]. Do I need to call HeapBufMP_open() mandatorily? It looks like each core has its own handle to the HeapBufMP; is this correct? Why can't there be one global handle, i.e. [heapHandle]?

[C]. Point me to an example where no HeapBufMP_open() is called on each core: only one global handle created by one core and used by all cores [since a handle is nothing but a structure, why should there be multiple handles on each core when there could be one global handle?].

Thanks

RC Reddy

  • RC Reddy,

    A.  Yes, that's the correct way of creating a HeapBufMP.  Yes, you can specify which SharedRegion to create the HeapBufMP in.  The region should have a heap in it.

    B.  You need to call HeapBufMP_open() on the cores not doing the create.  Each core has a local object which points to the HeapBufMP in shared memory.  Having a global handle has its own challenges, especially when you want to delete the object.  We chose to implement it this way.

    C.  There is no example of this because it's not supported.

    Judah

  • Hi Judah,

    Today I profiled HeapBufMP_alloc:

    1. I created a SharedRegion.

    2. Release build mode.

    3. Entire project built with the -o3 option.

    4. Platform: 6670.

    5. The SharedRegion owner is core 0; it holds 32 blocks [each block is 128 bytes] in DDR3. The cache alignment option is given for the SharedRegion in the .cfg.

    6. Created a HeapBufMP handle using almost the same code [pasted below]:

    if (selfId == 0)
    {
        /* Create the heap that will be used to allocate messages. */
        HeapBufMP_Params_init(&heapBufParams);
        heapBufParams.regionId  = 0;
        heapBufParams.name      = HEAP_NAME;
        heapBufParams.numBlocks = 20;
        heapBufParams.blockSize = 128;
        heapHandle = HeapBufMP_create(&heapBufParams);
        if (heapHandle == NULL)
        {
            System_abort("HeapBufMP_create failed\n");
        }
    }
    else
    {
        /* Open the heap created by the other processor. Loop until opened. */
        do
        {
            status = HeapBufMP_open(HEAP_NAME, &heapHandle);
        } while (status < 0);
    }

    7. Profiled the HeapBufMP_alloc function using the TSCL register, separately on each core:

    (u32 *)HeapBufMP_alloc(heapHandle, 128, 128)

    8. Please note I am allocating a SharedRegion of 128*32 bytes and creating a HeapBufMP in it of around 20*128 bytes, i.e.:

    heapBufParams.numBlocks = 20;
    heapBufParams.blockSize = 128;

    9. Here are the results:

    core0 = 3322 cycles

    core1 = 3347 cycles

    core2 = 3574 cycles

    core3 = 3336 cycles

    10. Frankly speaking, this is not a very fast memory allocation, taking around 3+ microseconds for a 128-byte allocation.

    11. Let me know if you see any issue in the simulation/test above.

    12. Also let me know if there is any faster memory allocation mechanism from TI [care should be taken in terms of multiple cores accessing it simultaneously].

    13. Also kindly provide the results you [I mean TI] have for the various allocation mechanisms.

    14. Also, I had a brief look at the code; can you please explain the need for the two gate enters/exits?

    =================================

    ti_sdo_ipc_heaps_HeapBufMP_alloc
    {
        ...

        /* Enter the gate */
        key = GateMP_enter((GateMP_Handle)obj->gate);

        /* Get the first block */
        block = ListMP_getHead((ListMP_Handle)obj->freeList);
    }

    /*
     * ======== ListMP_getHead ========
     */
    Ptr ListMP_getHead(ListMP_Handle handle)
    {
        ti_sdo_ipc_ListMP_Object *obj = (ti_sdo_ipc_ListMP_Object *)handle;
        ListMP_Elem *elem;
        ListMP_Elem *localHeadNext;
        ListMP_Elem *localNext;
        Bool localNextIsCached;
        UInt key;

        /* prevent another thread or processor from modifying the ListMP */
        key = GateMP_enter((GateMP_Handle)obj->gate);

        ...
    }

    Please let me know of any faster mechanism for allocating and freeing memory that takes on the order of hundreds of nanoseconds.

    Thanks

    RC Reddy

  • RC,

    I don't see anything wrong with your simulation.  I will say that I tried a quick experiment myself and came up with a slightly better number, but this was just between 2 cores.  So, just between Core 0 and Core 1, I measured a MessageQ_alloc of 128 bytes on Core 0 to be 2336 cycles.  I added a couple of lines into the .cfg file to get this:

        var BIOS        = xdc.useModule('ti.sysbios.BIOS');
        BIOS.libType = BIOS.LibType_Custom;
        BIOS.logsEnabled = false;
        BIOS.assertsEnabled = false;
        BIOS.clockEnabled = false;

    I don't know of any faster memory allocation mechanism from TI for a multicore-aware heap.  I'm sure that a good portion of that time is the cache coherency calls to make sure each CPU's view is coherent.

    As far as the 2 GateMP enters...  First, you should only be getting a function-call hit here, as the second GateMP enter simply increments a count and returns.  The reason why it's entered twice is simply an implementation detail.  ListMP can be used independent of HeapBufMP, so obviously it must have a GateMP enter when manipulating something in shared memory.  In ListMP, we conditionally perform some additional operations, and these need to be protected within a GateMP.  The bottom line is that entering a GateMP twice should not have a significant timing impact.

    Judah

  • Hi Judah,

    Thanks for your reply. So even in your case it's around 2+ microseconds, which in any case is a huge number. Can you please check with TI experts [internal] for any possible tricks/tips for a faster memory allocation/free mechanism [any other mechanism which can alloc/free in hundreds of cycles]?

    Thanks

    RC Reddy

  • RC Reddy: do you really require allocating/freeing *shared* memory in less than 1000 cycles?  Can you use non-shared memory for your application?  Shared memory, regardless of implementation, even if you linked the buffers to hw descriptors (QMSS), would worst-case take around 1000 cycles.  (Assume all cores issue a request at the exact same time.  Somebody has to be arbitrated last.  Assuming 8 cores, this is about 500-1000 cycles.)

    One way to avoid this bottleneck is to pull several buffers off the shared heap at startup and put them into a locally managed list.  The local list will always have constant alloc time, because it doesn't bottleneck at a central resource.

  • Hi John,

    Thanks for the reply.

    1. Do you really require allocating/freeing *shared* memory in less than 1000 cycles?

    Yes.

    2. Can you use non-shared memory for your application?

    No, my code/application requires shared memory alloc/free in real time.

    3. Shared memory, regardless of implementation, even if you linked the buffers to hw descriptors (QMSS), would worst-case take around 1000 cycles. (Assume all cores issue a request at the exact same time.)

    Can you please give more detail on this?

    Thanks

    RC Reddy



  • Hi John,

    I am waiting for a reply on this.

    Thanks

    RC Reddy

  • Let's assume that one pop (alloc) costs 60 cycles.  If you issue one alloc, you incur a delay of 60 cycles.

    If you issue 2 allocs, one is returned in 60 cycles, the other in 120 (even if they come from different cores).

    If you issue 8 at the same time, the first is returned in 60 cycles, the last in 480 cycles, even if they originated on different cores.

    You generally don't know which core will take 60 and which core will take 480.

  • Hi John,

    Can you please detail which method you are suggesting and how malloc/free works with QMSS?

    Thanks

    RC Reddy

  • You can use QMSS as a memory allocator.  A "pop" is a malloc, and a "push" is a free.  Since this is a pure software use case, you don't have to restrict yourself to monolithic or host-mode descriptors: the descriptors themselves can be the memory that is allocated/freed.  You could also use standard monolithic descriptors (where the data area is the portion used by the application) or host-mode descriptors (where the linked buffer is the memory used by the allocation).  The choice of how to do this is mostly based on the number of regions available and the supported descriptor sizes.  You can make a multi-sized heap by putting different-sized allocation units on different queues.  However, each queue would have fixed-size allocations (whether in custom sw descriptors, mono, or host descriptors).

  • Hi John,

    Thanks for your reply. What is the maximum size of "custom sw descriptors" I can allocate? In other words, I just tried an experiment using 8k [= 8192-byte] descriptors and it worked; is there any restriction you see [w.r.t. the size of "custom sw descriptors"]? Are there any other things that need to be taken care of? Please detail.

    Thanks

    R.Ravi Chandra

  • There are 13 bits reserved for DESC_SIZE.  Thus, the maximum size is 128K: 13 bits is 0x1FFF, and 131072 == (0x1FFF + 1) * 16.  Remember the size also has to be a multiple of 16.