
TI-RTOS-MCU: How to write code to reduce thread's stack size

Part Number: TI-RTOS-MCU
Other Parts Discussed in Thread: CC1310, CCSTUDIO, SYSBIOS

Hi all, 

I'm working with CC1310 but the question is more related to TI-RTOS in general. 

I'm trying to figure out how to write my code so that I can reduce a thread's stack usage.

Since I heavily use structs and struct pointers, I would like to understand their impact on stack consumption.

Assume I have a struct like this one:

typedef struct MyStruct_S
{
    char _arr[128];
    char _v;
} MyStruct_T;

char getV(MyStruct_T *structPtr)
{
    char myV = structPtr->_v;
    return myV;
}

I ran some experiments on my code and inspected the results with ROV. This is what I get:

experiment 1)

void myThreadFunc(void *arg)
{
    ...
    //stack peak is 60
    MyStruct_T *myPtr = (MyStruct_T *) malloc(sizeof(MyStruct_T));
    //now peak is 268
    char v = getV(myPtr);
    //now peak is still 268
    ...
}

experiment 2) Struct allocated outside the thread and accessed inside the thread

static MyStruct_T myStruct;

...

void myThreadFunc(void *arg)
{
    //stack peak is 60
    char v = getV(&myStruct);
    //stack peak is 112
    ...
}

experiment 3) same as 2) but with a modified struct field order.

typedef struct MyStruct_S
{
    char _v;
    char _arr[128];
} MyStruct_T;

void myThreadFunc(void *arg)
{
    //stack peak is 60
    char v = getV(&myStruct);
    //stack peak is 60 again
    ...
}

I really cannot understand the results, so I have these questions:

  • By how much does calling malloc() inside a task increase stack usage? Does it take sizeof(MyStruct_T) bytes in the heap and another sizeof(MyStruct_T) bytes on the stack?
  • When a struct field is accessed through a struct pointer, is the whole struct loaded into the task stack? How does the struct field order change stack usage?

Thank you very much

  • Hi, I am also a programmer.

    Have you checked your code in any simulator or interpreter?

    Thanks

  • Hi,

    I use the runtime ROV feature of CCStudio to retrieve the numbers above ("stack peak is ..."), so I run the code above directly on the target.

    Best Regards

  • The question seems to be related primarily to C Language "automatic variables" versus dynamic, global or static.

    When variables are declared local to a function, they are considered automatic variables, and they consume space on the stack. You can avoid this by moving the variables to global space. If you add the `static` keyword, the variables will remain private in scope to the function. If you run low on global SRAM space, and if you are very careful, you can move the variables to be truly global in scope, so long as one function does not overwrite the variables of another function.

    Embedded firmware design often has to deal with careful variable declarations to manage the limited amount of memory available. I'm not using the CC1310, but the chip I have is limited to 256 KB of SRAM. My stack is only 1 KB (but I'm not using TI-RTOS for my projects). You're still limited by total SRAM, but moving variables off the stack should help a lot, provided that you design for the concurrency of multiple threads.

  • OK, I am installing CCStudio (Code Composer Studio IDE).



  • Hi Brian, 

    thanks for your reply. I already have a background on variable storage schemes. My question is more specific to the TI-RTOS and TI compiler world.

    It's quite obvious that an automatic variable declared in a task entry point goes onto the task's stack. In experiment 2) myStruct has static storage duration, so it lives in .bss rather than on any task-specific stack.

    Less obvious are the two questions I posed, which I repeat here:

    1. Does a call to malloc in a task entry point put the memory on the task's stack or elsewhere in the heap? The latter should be correct, but I cannot reconcile that with the fact that, after stepping over the malloc statement in the experiments above, the task's stack grows by much more than the size of a pointer.
    2. Does accessing a struct field through a pointer in a task entry point pull the whole struct into the task's stack, or only the dereferenced field? Does the behavior change with a different field order?

    Luca

  • Part Number: CC1310

    Hi all, 

    I have two questions:

    1. Does a call to malloc in a task entry point put the memory on the task's stack or elsewhere in the heap? The latter should be correct, but I cannot reconcile that with the fact that, after stepping over a malloc statement, I see the task's stack grow by much more than the size of a pointer (using ROV).
    2. Does accessing a struct field through a pointer in a task entry point pull the whole struct into the task's stack, or only the dereferenced field? Does the behavior change with a different field order?

    Info:

    • Platform: CC1310
    • TI compiler v18.12.1.LTS
    • Optimization: off
    • Debugging mode: fully symbolic

    Reference code: the same three experiments posted above, with the stack peak from ROV annotated as comments.

    Thanks

  • Hi Luca,

    A good way to deepen your understanding of the stack consumption is to open the .map file generated at the compilation.

    Another important element to take into consideration is the role of the code optimizer. Depending on its configuration, it can do a lot of more or less surprising things. The same goes for the "Debug" configuration: to produce easy-to-debug code, it can also do surprising things.

    Luca Di Crescenzo said:
    • Does a call to malloc in a task entry point put the memory on the task's stack or elsewhere in the heap? The latter should be correct, but I cannot reconcile that with the fact that, after stepping over the malloc statement in the experiments above, the task's stack grows by much more than the size of a pointer.
    • Does accessing a struct field through a pointer in a task entry point pull the whole struct into the task's stack, or only the dereferenced field? Does the behavior change with a different field order?

    1- malloc allocates the memory on the heap; only the returned pointer lives on the stack. The stack-peak increase you observe most likely comes from malloc's own internal call chain, which temporarily uses the task stack while it executes.

    2- This depends on the optimization level. The order of the fields might play a role in the size occupied by the structure (16-bit and 32-bit data have to be aligned, so it is better to declare the 8-bit data first, then the 16-bit data, and finally the 32-bit data).

    I hope this will help,

    Regards,

  • Hi Clément, 

    thanks for your reply. I verified what you wrote and double-checked my tests. I can confirm there are no problems with malloc or struct pointers.

    My high stack usage does not come from the points above. I'm trying to understand more deeply what's going on, and I discovered something that looks weird to me. The following code is located within a task entry point (running state). I annotated the ROV stack peaks before and after calling the routines.

    Semaphore_Handle semHandle;
    Error_Block eb;
    Semaphore_Params semParams;
    Semaphore_Params_init(&semParams); //stack peak: before: 164 - after: 272 - total stack usage >= 108
    semParams.mode = ti_sysbios_knl_Semaphore_Mode_BINARY;

    semHandle = Semaphore_create(0, & semParams, & eb); //stack peak: before: 164 (since Semaphore_Params_init returned) - after: 508 - total >= 344 bytes!!

    From .map I see:

    callee addr  tramp addr   call addr  call info

    ti_sysbios_knl_Semaphore_create $Tramp$TT$L$PI$$ti_sysbios_knl_Semaphore_create

    1001b5a9 00007f08 00003914 helib.lib : hub_os.obj (.text)

    The same large stack usage happens for mailboxes:

    mailBoxHandle = Mailbox_create(params._itemSize, params._itemNum, &mailBoxParams, &eb);

    //stack peak: before: 240 - after: 492 : stack usage >= 252

    Just not to mention this one:

    NVS_init(); //stack usage >= 340

    I'm in the very unlucky (or stupid, depending on the point of view) situation where I need to allocate a mailbox at the very end of my worst-case nested function calls. This means that if I need X bytes of stack for my own code, I have to add an extra 340 bytes to avoid stack overflow. Lesson learned: it is better to perform kernel-related allocations at the very beginning of the entry point.

    Nevertheless, I really cannot understand why the kernel requires so many bytes just to create a semaphore. Such overhead seems totally unwarranted from my point of view. Maybe I have some sub-optimal configuration of the kernel. How can I check that?

    Thanks

    Luca

  • Thanks for that note, Clément. I always thought it was best to first declare the 32-bit data, then the 16-bit data, and finally the 8-bit data (especially strings and other arrays). I think that perhaps the key is that all 8-bit data should be adjacent for best savings, and all 16-bit data should be adjacent. The order may not matter so much as avoiding a mix.

  • Hi Brian,

    I think you are right. Reordering the structure members by decreasing alignment is the best way to reduce RAM footprint. You may be aware of this:

    http://www.catb.org/esr/structure-packing/

    Some exceptions apply to the general rule: "Curiously, strictly ordering your structure fields by increasing size also works to minimize padding. You can minimize padding with any order in which (a) all fields of any one size are in a continuous span (completely eliminating padding between them), and (b) the gaps between those spans are such that the sizes on either side have as few doubling steps of difference from each other as possible. Usually this means no padding at all on one side."

    Anyway guys, these padding problems are just a part of the story, and a very small part. I don't think I will ever need 1000 struct instances in a deeply embedded µC project, so padding does not actually matter in this context.

    A big chapter of the story is how much stack library functions external to user code consume. As I wrote in the previous message, you may see an increase of 344 bytes just because you are dynamically instantiating a semaphore:

    semHandle = Semaphore_create(0, & semParams, & eb); //stack peak: before: 164 (since Semaphore_Params_init returned) - after: 508 - total >= 344 bytes!!

    I'll write a new post about TI-RTOS stack requirements.

    Thanks to all

  • I suppose the issue is that your stack allocation, per thread, has to be large enough to handle the deepest usage at any instant, regardless of the fact that the stack returns to its previous usage after each function call. If you have a lot of threads, but little SRAM, then you could easily run out of resources before you even start.

    Perhaps a solution here is to create a single thread for initialization of all other threads, and give this initialization thread a large stack allocation. The initialization thread would call Semaphore_create() multiple times, then collect and organize all of the handles. Perhaps you can even terminate that thread when it has completed, and regain the stack space. Your normal, run-time threads might not need as large of a stack if they can rely on all semaphore handles being created before the threads begin execution. This would be the equivalent of "static" data allocation, even though it's actually using function calls to dynamically allocate semaphores. You'd obviously need some way to know which semaphores belong to which threads.

  • Hi Brian,

    I got your suggestion. Thanks for chatting here, I find it stimulating. I have some comments about your proposal.

    I typically perform factory work in the main() function to reuse the system stack for interrupts. But in the example above, I had to allocate a mailbox not at startup but only when the system fetches some data from outside. This is something that a factory task could address well.

    Your solution needs some additional messaging mechanism to allow workers to ask the factory task to build specific kernel objects. The request also has to be acknowledged, and the worker has to block until completion. This makes the code more complicated, but let's see if it is worth it.

    Consider that a thread has some overhead of its own (200 bytes of stack for a task that just calls sleep()), and the object allocation takes 344 bytes. So overall such a factory would take 544 bytes of heap space.

    And it will free each worker's stack of 344 bytes. So overall your solution works great if the number of worker tasks requesting dynamic kernel allocations at the end of nested function calls is larger than one. Guess what? One is my case! Anyway, thanks for the inspiration; the scheme looks promising for more complex designs.

    Luca