
C++ standard libraries <vector> and memory allocation

Other Parts Discussed in Thread: SYSBIOS

Hello,

I just wanted to share a tidbit that took me some time to figure out and for which I did not find any suggestions on the forum.  My task was to make a C++ application run on the DSPs.  In theory it is certainly possible, but as an embedded engineer, C++ is not my strong suit or favorite choice.  My first challenge was the standard library <vector>.  This library was using the C++ new() function, which was defaulting to memory allocation from L2 SRAM.  Maybe this is okay for some, but these vectors get quite large and very quickly don't fit in L2.  I spent a few days looking for a solution, so I would like to save others some time.

You have to use a custom allocator.  I found one here that started me on the right foot:

http://www.josuttis.com/libbook/memory/myalloc1.cpp.html

Here I replaced the new() and delete() calls with the SYS/BIOS Memory_alloc and Memory_free calls and told the allocator which heap to use.  I had created the heap statically in my .cfg file.

Then you add the custom allocator to your vector variable declaration.
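For anyone following along, here is a minimal generic sketch of that pattern. Everything TI-specific is stubbed out so it compiles anywhere: `HeapHandle`, `heap_alloc`, and `heap_free` are stand-ins for the SYS/BIOS heap handle and the Memory_alloc/Memory_free calls described above, and the allocator is trimmed to the C++11 minimum rather than the full pre-C++11 interface shown in the Josuttis example.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Stand-ins so this sketch compiles off-target. On the DSP, HeapHandle
// would be a SYS/BIOS heap handle and these helpers would wrap
// Memory_alloc / Memory_free with the chosen heap, as described above.
typedef int HeapHandle;

static void* heap_alloc(HeapHandle /*heap*/, std::size_t size) {
    return std::malloc(size);
}
static void heap_free(HeapHandle /*heap*/, void* p) {
    std::free(p);
}

// Minimal C++11-style allocator that routes all storage requests through
// the heap helpers. (An older compiler needs the full pre-C++11
// boilerplate from the Josuttis example: pointer/reference typedefs,
// rebind, construct, destroy, max_size.)
template <class T>
struct HeapAllocator {
    typedef T value_type;
    HeapHandle mHeap;

    explicit HeapAllocator(HeapHandle heap) : mHeap(heap) {}
    template <class U>
    HeapAllocator(const HeapAllocator<U>& other) : mHeap(other.mHeap) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(heap_alloc(mHeap, n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { heap_free(mHeap, p); }
};

// Equal allocators means memory from one can be freed by the other.
template <class T, class U>
bool operator==(const HeapAllocator<T>& a, const HeapAllocator<U>& b) {
    return a.mHeap == b.mHeap;
}
template <class T, class U>
bool operator!=(const HeapAllocator<T>& a, const HeapAllocator<U>& b) {
    return !(a == b);
}
```

On the DSP the constructor argument would be one of the heaps created in the .cfg file; off-target any integer works: `HeapAllocator<int> alloc(1); std::vector<int, HeapAllocator<int> > v(alloc);`.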

Problem #2 was multicore.  I now need to change this custom allocator to accept a variable in the constructor telling it which heap to use, so that I don't have to have a separate allocator for each core.  It seems the behavior of the vector class is to use the copy constructors as defined in the example above.  So I created a member variable, mHeap, and populate it - first through a constructor that takes a specific heap, and then again in the copy constructors.

Now my variable declaration looks like this:

const myLib::myAlloc<ObservationDataStruct> allocSrc1(myHeap);

vector<ObservationDataStruct, myLib::myAlloc<ObservationDataStruct> > myVect(allocSrc1);

The first line creates an allocator instance that uses a specific heap.  The second line declares a vector that uses the custom allocator as its allocator template argument and passes in an instance of this allocator.  When the vector is created, it calls the copy constructor from the example file; in that copy constructor, you need to add a line that copies the heap variable from the original allocator instance.

//Create an allocator instance and give it the heap you want to use.
myAlloc(const ti_sysbios_heaps_HeapMem_Handle heap) throw()
{ mHeap = heap; }

//Container classes will call this copy constructor first.
//Give it the proper heap.
template <class U>
myAlloc(const myAlloc<U>& ma) throw()
{ mHeap = ma.mHeap; }

//Container classes will call this copy constructor second.
//Again, give it the proper heap.
//No one knows why it does so many copies...
myAlloc(const myAlloc& ma) throw()
{ mHeap = ma.mHeap; }
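One detail worth flagging about the snippet above: the templated copy constructor reads `ma.mHeap` from a *different* instantiation, `myAlloc<U>`, so `mHeap` must either be public or the class must befriend its other instantiations. Here is a compilable generic sketch of the whole class (the heap handle is reduced to a plain integer and Memory_alloc/Memory_free to malloc/free so it runs off-target):

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

typedef int HeapId;  // stand-in for the ti_sysbios_heaps_HeapMem_Handle

template <class T>
class myAlloc {
    template <class U> friend class myAlloc;  // let myAlloc<U> read our mHeap
    HeapId mHeap;
public:
    typedef T value_type;

    // Create an allocator instance and give it the heap you want to use.
    explicit myAlloc(HeapId heap) : mHeap(heap) {}

    // Container classes call this rebinding copy constructor; it must
    // carry the heap across or the copy will allocate from garbage.
    template <class U>
    myAlloc(const myAlloc<U>& ma) : mHeap(ma.mHeap) {}

    // ...and the plain copy constructor as well.
    myAlloc(const myAlloc& ma) : mHeap(ma.mHeap) {}

    HeapId heap() const { return mHeap; }

    T* allocate(std::size_t n) {
        // On the DSP this would be a Memory_alloc call against mHeap.
        return static_cast<T*>(std::malloc(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) {
        // On the DSP this would be the matching Memory_free call.
        std::free(p);
    }
};

template <class T, class U>
bool operator==(const myAlloc<T>& a, const myAlloc<U>& b) { return a.heap() == b.heap(); }
template <class T, class U>
bool operator!=(const myAlloc<T>& a, const myAlloc<U>& b) { return !(a == b); }
```

Constructing a vector from a differently-typed allocator instance goes through the rebinding constructor, which is exactly the path the post describes.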

Well, anyhow, hopefully this will help someone else.  It works great for me and lets me control where these vectors get created.  I think I will have to use the same approach for maps too.

Brandy

  • Brandy,

    Thanks for the input.  I'm moving this thread to the TI C/C++ Compiler Forum as it will probably do more good there.

    Best Regards,

    Chad

  • I'm pretty sure that new() [and malloc()] allocate from the RTS-defined heap, which is not necessarily in L2 RAM (it depends on your linker command file).  If you put the heap in a better place, e.g. external SDRAM, you won't need to mess around with the way that constructors work, which is in effect to use new().

I thought so too, and I will double-check, but when I create a vector that I want to be very large (2048*48 bytes), I can only create 300 items before I get a memory error.  Then I see that the vectors were created in L2SRAM.  I can also see that the very large heap I designated for the stack/task is not full; in fact it is hardly used.  When I change the allocator for the vector, it works quite fine.

     

    I am happy to have you look at my config file to see if I missed something; it is attached.  It turns out this is a huge undertaking and gets much messier, because the code I am trying to port has a "vector of structs that contain vectors", and if I can't get those internal vectors to use a custom allocator, they will again use up all my L2SRAM.  It's quite a predicament.  On the other hand, I could attempt to rewrite the algorithm without the STL containers, but I think that would be much worse.
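A generic sketch of one way to attack the inner-vector problem: give the struct a constructor that accepts the heap and forwards it to the inner vector, so both levels of storage come from the chosen heap. All names here are illustrative, and the heap handle is a plain integer stand-in with malloc/free in place of the real heap calls.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

typedef int HeapId;  // stand-in for a SYS/BIOS heap handle

// Heap-tagged allocator, trimmed to the C++11 minimum; malloc/free stand
// in for the real heap calls.
template <class T>
struct HeapAlloc {
    typedef T value_type;
    HeapId mHeap;
    explicit HeapAlloc(HeapId h) : mHeap(h) {}
    template <class U> HeapAlloc(const HeapAlloc<U>& o) : mHeap(o.mHeap) {}
    T* allocate(std::size_t n) {
        return static_cast<T*>(std::malloc(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
};
template <class T, class U>
bool operator==(const HeapAlloc<T>& a, const HeapAlloc<U>& b) { return a.mHeap == b.mHeap; }
template <class T, class U>
bool operator!=(const HeapAlloc<T>& a, const HeapAlloc<U>& b) { return !(a == b); }

typedef std::vector<float, HeapAlloc<float> > InnerVec;

// The struct takes the heap at construction and hands it to its inner
// vector, so the inner storage also lands on the chosen heap instead of
// the default (L2) one.
struct Observation {
    InnerVec samples;
    explicit Observation(HeapId heap) : samples(HeapAlloc<float>(heap)) {}
};

// The outer vector gets the allocator too, so both levels are covered.
typedef std::vector<Observation, HeapAlloc<Observation> > OuterVec;
```

C++11's `std::scoped_allocator_adaptor` automates this propagation, but the TI compiler of that era predates it, so the explicit-constructor route is the portable one here.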

    Thanks for any advice you might have!

    Brandy

     

    Here is my config file:

    heapMemParams.size = 3000*1024;//Very large for recursion

    heapMemParams.sectionName = ".myHeapSection";

    Program.global.myHeap0 = HeapMem.create(heapMemParams);

    Program.global.myHeap1 = HeapMem.create(heapMemParams);

    Program.global.myHeap2 = HeapMem.create(heapMemParams);

    Program.global.myHeap3 = HeapMem.create(heapMemParams);

    Program.global.myHeap4 = HeapMem.create(heapMemParams);

    Program.global.myHeap5 = HeapMem.create(heapMemParams);

    Program.global.myHeap6 = HeapMem.create(heapMemParams);

    Program.global.myHeap7 = HeapMem.create(heapMemParams);

     

    heapMemParams.size = 0x320000; //Size of 1280*1280

    heapMemParams.sectionName = ".myScratchSection";

    Program.global.myScratchCore0 = HeapMem.create(heapMemParams);

    Program.global.myScratchCore1 = HeapMem.create(heapMemParams);

    Program.global.myScratchCore2 = HeapMem.create(heapMemParams);

    Program.global.myScratchCore3 = HeapMem.create(heapMemParams);

    Program.global.myScratchCore4 = HeapMem.create(heapMemParams);

    Program.global.myScratchCore5 = HeapMem.create(heapMemParams);

    Program.global.myScratchCore6 = HeapMem.create(heapMemParams);

    Program.global.myScratchCore7 = HeapMem.create(heapMemParams);

     

    heapMemParams.size = 2048*4*4+1024; //numBuffers*MaxBlobSets*4Bytes*4variables + for misc

    heapMemParams.sectionName = ".msmcsramHeap";

    Program.global.msmcsramHeap1 = HeapMem.create(heapMemParams);

    Program.global.msmcsramHeap2 = HeapMem.create(heapMemParams);

    Program.global.msmcsramHeap3 = HeapMem.create(heapMemParams);

    Program.global.msmcsramHeap4 = HeapMem.create(heapMemParams);

    Program.global.msmcsramHeap5 = HeapMem.create(heapMemParams);

    Program.global.msmcsramHeap6 = HeapMem.create(heapMemParams);

    Program.global.msmcsramHeap7 = HeapMem.create(heapMemParams);

     

    heapMemParams.size = 0x300000;// 4*1024;//0x100;

    heapMemParams.sectionName = ".myHeapSection";

    Program.global.myNetworkHeap0 = HeapMem.create(heapMemParams);

     

     

    Program.sectMap["systemHeap"]            = "L2SRAM";
    Program.sectMap[".cio"]                  = "L2SRAM";
    Program.sectMap[".far"]                  = "L2SRAM";
    Program.sectMap[".rodata"]               = "L2SRAM";
    Program.sectMap[".neardata"]             = "L2SRAM";
    Program.sectMap[".bss"]                  = "L2SRAM";
    Program.sectMap[".stack"]                = "L2SRAM";
    Program.sectMap[".nimu_eth_ll2"]         = "L2SRAM";
    Program.sectMap[".myLocalMemory"]        = "L2SRAM";
    Program.sectMap[".cinit"]                = "L2SRAM";  // might work in DDR3
    Program.sectMap[".fardata"]              = "L2SRAM";
    //Program.sectMap[".vecs"]               = "L2SRAM";
    Program.sectMap[".csl_vect"]             = "L2SRAM";
    Program.sectMap[".msmcsramHeap"]         = "MSMCSRAM";
    Program.sectMap[".srioSharedMem"]        = "MSMCSRAM";
    Program.sectMap[".imgHeaders"]           = "MSMCSRAM_IMG_HDR";
    Program.sectMap[".cppi"]                 = "DDR3";
    Program.sectMap[".qmss"]                 = "DDR3";
    Program.sectMap[".const"]                = "DDR3";
    Program.sectMap[".text"]                 = "DDR3";
    Program.sectMap[".switch"]               = "DDR3";
    Program.sectMap["platform_lib"]          = "DDR3";  // confirm this is only code
    Program.sectMap[".myHeapSection"]        = "DDR3";
    Program.sectMap[".far:taskStackSection"] = "DDR3";
    Program.sectMap[".myScratchSection"]     = "DDR3_SCRATCH";
    Program.sectMap[".resmgr_memregion"]     = {loadSegment: "L2SRAM", loadAlign: 128};  /* QMSS descriptors region */
    Program.sectMap[".resmgr_handles"]       = {loadSegment: "L2SRAM", loadAlign: 16};   /* CPPI/QMSS/PA handles */
    Program.sectMap[".resmgr_pa"]            = {loadSegment: "L2SRAM", loadAlign: 8};    /* PA memory */
    Program.sectMap[".far:NDK_OBJMEM"]       = {loadSegment: "MSMCSRAM_NDK", loadAlign: 8};
    Program.sectMap[".far:NDK_PACKETMEM"]    = {loadSegment: "MSMCSRAM_NDK", loadAlign: 128};

    Program.sectMap[".windowData"] = new Program.SectionSpec();
    Program.sectMap[".windowData"] = "DDR3_IMAGERY";

     

    Here are some code samples showing usage:

        Task_Params_init(&taskParams);

        taskParams.stackSize = 3000*1024;

        taskParams.stackHeap = HeapMem_Handle_to_xdc_runtime_IHeap(myHeap0); //DDR3

        taskParams.arg0 = CORE0_MAIN_TASK;

        taskParams.instance->name= "Core0Task";

        Task_create((ti_sysbios_knl_Task_FuncPtr)core0Task, &taskParams, NULL);

     

    static void core0Task(UArg arg0, UArg arg1)

    {

        __SmartAlloc<ObservationDataStruct> smartAllocCore0(myScratchCore0);

        vector<ObservationDataStruct, __SmartAlloc<ObservationDataStruct> > myVector(smartAllocCore0);

    ….

    }

  • Also, here is a screenshot.  In this program, I have already changed all the vectors to use my allocators - so you can see their memory is in DDR3.  However, I haven't gotten to the <map> variables yet.  They are still allocating from the default L2SRAM even though my code is running from Core0Task (as shown above).

    Thanks again for any thoughts!

  • Hi Brandy,

    Very good job.  Am I asking too much if I ask you for a C++ example showing where you used those allocators?

    I'm not so familiar with some of the C++ features.

    Thanks in advance.

  • Since I haven't been using DSP/BIOS (aka SYS/BIOS), I am unfamiliar with what options it provides for specifying allocation of linker sections.

    When using a plain old linker.cmd file, you can just include in the SECTIONS specifications the entry

      .sysmem > BIGRAM /* malloc/new heap */

    where BIGRAM is suitably defined in the MEMORY specifications, as well as specifying (near the beginning of linker.cmd)

    --heap_size=0x100000 /* or whatever*/

    There is presumably some way to do the same thing when using SYS/BIOS.

  • Hi Johannes,

    Please start a private conversation with me and we can discuss.

     

    Hi Douglas,

    You know, I think you are right, but if I move the "systemHeap" section to DDR3, it presents many other problems.  I should think about it again and determine what is best.

     

    Brandy

  • Hello Douglas,

    I thought about what you said a bit more and I would like to explain what other problems it will cause, at least in the multi-core case.

    If I move that to DDR3, each core will use the same DDR3 address range whenever it uses sysmem, so the cores could potentially corrupt each other.  The problem I see is that I am not entirely sure which of the TI libraries use sysmem, and so I wonder if there might be an adverse effect that I would miss.  If I leave sysmem in L2, each core has its own L2 and I don't have this problem.  Of course, I still have the other STL library problem ...
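To sketch the per-core heap idea generically: keep a table of heap handles indexed by core number, so a single allocator class serves every core. The `currentCore()` stub stands in for however the core ID is obtained on the target (e.g. the C66x DNUM register or IPC's MultiProc_self()), and the handles are plain integers here.

```cpp
#include <cassert>

typedef int HeapHandle;  // stand-in for a SYS/BIOS heap handle

static const int NUM_CORES = 8;

// One heap per core, e.g. the myScratchCore0..7 handles created in the
// .cfg file. Plain integers here so the sketch runs off-target.
static const HeapHandle coreHeaps[NUM_CORES] = {
    100, 101, 102, 103, 104, 105, 106, 107
};

// Stand-in for the core-ID query; on the target this would be something
// like the C66x DNUM register or IPC's MultiProc_self().
static int currentCore() { return 3; }

// Each core indexes the table with its own ID, so one allocator class,
// constructed with myCoreHeap(), serves all cores without per-core code.
static HeapHandle myCoreHeap() { return coreHeaps[currentCore()]; }
```

The custom allocator from earlier in the thread would then be constructed with `myCoreHeap()` instead of a hard-coded handle.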

    But anyhow, I knew I had thought about this already; this time I will jot it down here so I and others can reference the thought.  Also, here is an earlier post about the multicore memory map:

    http://e2e.ti.com/support/embedded/bios/f/355/p/180742/654302.aspx#654302

    Thanks,
    brandy

  • Generally in a multi-threaded environment, malloc() etc. is designed to use a single heap in a "thread-safe" manner, but possibly the one shipped with CCS isn't (I haven't examined it).  Another approach would be to link the modules for each core with a separate linker command file that puts .sysmem in a different location depending on the core.  Note that the C language was never intended for multi-threading, so there are a *lot* of details you have to worry about to construct a reliable multi-threaded app in C (or in any other programming language not designed for that environment).
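To illustrate what "thread-safe via a single heap" boils down to, here is a minimal sketch: serialize every heap operation with a lock. `std::mutex` and malloc/free are stand-ins; on the target, SYS/BIOS gate modules play the locking role, and HeapMemMP (mentioned later in this thread) extends the idea across cores with a multiprocessor gate.

```cpp
#include <cassert>
#include <cstdlib>
#include <mutex>

// A single shared heap made safe for concurrent callers by serializing
// every operation with a lock. malloc/free stand in for the real heap
// operations; std::mutex stands in for a SYS/BIOS gate.
class LockedHeap {
    std::mutex mLock;
public:
    void* alloc(std::size_t size) {
        std::lock_guard<std::mutex> guard(mLock);
        return std::malloc(size);
    }
    void release(void* p) {
        std::lock_guard<std::mutex> guard(mLock);
        std::free(p);
    }
};
```

The lock makes concurrent calls safe but also serializes them, which is one reason per-core heaps can be attractive despite the extra bookkeeping.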

  • Hello, just an update.

    For the smart allocator, I was originally using Memory_alloc and Memory_free.  While this worked most of the time, there were still some errors, especially when I enabled cache.  Therefore I determined that it was not multicore-safe.  To fix this I had to switch to the TI library calls HeapMemMP_alloc() and HeapMemMP_free().

    I was hesitant to add any more TI libraries than I needed, so I had a discussion with TI about the minimum code needed.  You can follow that conversation here:

    http://e2e.ti.com/support/embedded/bios/f/355/t/210699.aspx?pi239031349=2

    thanks,

    Brandy

  • Brandy,

    Thanks for the update!  I am in the same position now as you were many months ago.  (I have C++ code, full of std::vectors, that I'm trying to use in a multicore SYS/BIOS environment.)

    One thing I noticed today is that the following code takes ~2.5 milliseconds to run:

    UInt32 t1 = Timestamp_get32();
    
    {  //inside a SYS/BIOS thread...
     
    //create a vector of vectors, resize, & initialize values
       vector< vector<float> > abc;
       abc.resize( 32, vector<float>( 50, 0.987 ) );
    }
    
    UInt32 t2 = Timestamp_get32();
    std::cout << "Time: " << (t2-t1) << std::endl;

    I read your post about having something like a struct of vectors full of vectors.  My question for you: were your algorithms using any special features of the C++ vector like I am above, i.e. 'dynamic' features like resize(), push_back(), etc.?  These seem to take a while (>1 ms per resize seems long to me); maybe for your application it wasn't an issue?

    I think I can have enough memory in my DSP device to just assume maximum sizes for my large vectors and code them as C/C++ arrays.  However, I do use C++ "maps", so I believe I still need to follow your advice in this thread regarding TI's HeapMemMP_alloc().
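For what it's worth, the same allocator pattern carries over to std::map: the map's allocator parameter is spelled in terms of the pair type, and the map rebinds it internally to its node type through the templated converting constructor discussed earlier in the thread. A generic sketch with a plain-integer heap ID and malloc/free standing in for the HeapMemMP calls:

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <map>
#include <utility>

typedef int HeapId;  // stand-in for a SYS/BIOS heap handle

// Same heap-tagged allocator pattern as earlier in the thread, trimmed to
// the C++11 minimum; malloc/free stand in for the real heap calls.
template <class T>
struct HeapAlloc {
    typedef T value_type;
    HeapId mHeap;
    explicit HeapAlloc(HeapId h) : mHeap(h) {}
    template <class U> HeapAlloc(const HeapAlloc<U>& o) : mHeap(o.mHeap) {}
    T* allocate(std::size_t n) {
        return static_cast<T*>(std::malloc(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
};
template <class T, class U>
bool operator==(const HeapAlloc<T>& a, const HeapAlloc<U>& b) { return a.mHeap == b.mHeap; }
template <class T, class U>
bool operator!=(const HeapAlloc<T>& a, const HeapAlloc<U>& b) { return !(a == b); }

// The map's allocator is declared for pair<const Key, T>; the map rebinds
// it to its internal node type via the templated constructor above.
typedef std::map<int, float, std::less<int>,
                 HeapAlloc<std::pair<const int, float> > > HeapMap;
```

Note that unlike vector, a map allocates one node per element, so every insertion hits the heap; reserving up front isn't an option there.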

    Thank you for your thoughts and advice!

  • Hi Chris,

    In the end, I had to remove all the C++ vectors and maps to get the speed I needed.  The dynamic memory allocation just takes too much time.  In my case, every time the vector grew, or elements were deleted, it did an entire memory copy of a large data set, and I had to use external memory because of the size too!  So, long story short, even though I switched the memory allocation functions to DSP functions, it still wasn't fast enough, and I had to rewrite everything using plain arrays, circular lists, etc.  Like you, I had enough memory to do this.

    The funny thing is, I was pretty much in denial about this being the solution: because I was so unfamiliar with the algorithm, I didn't want to change the memory structures.  But by messing with the memory allocation so much, I learned the algorithm and could then quickly change the structures.

    If you are porting the algorithm for the sole purpose of speed, I recommend skipping the C++ containers immediately and saving yourself the month or so of time!  Dynamic memory allocation will never be fast enough.

    My algorithm had to run at 2 Hz, and with the C++ containers I was getting maybe 1 to 1.5 seconds per loop on average.  Some loops were as high as 4 seconds!  When I removed all the C++ containers, it dropped to 200 ms and was actually consistent!

    Good luck!

  • You can mitigate the effects of frequent reallocation due to growth by using vector<T>::reserve(); this causes the vector to reserve enough memory right up front to hold that many entries. However, if the sheer size of the data forces you to external memory, there's not much that can be done about that.
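The effect of reserve() is easy to see with an allocator that counts allocation calls. A generic, off-target sketch (the counting allocator and the element counts here are purely illustrative):

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

static int g_allocCalls = 0;  // counts every allocation the vector makes

// Minimal stateless allocator that increments a counter on each allocate.
template <class T>
struct CountingAlloc {
    typedef T value_type;
    CountingAlloc() {}
    template <class U> CountingAlloc(const CountingAlloc<U>&) {}
    T* allocate(std::size_t n) {
        ++g_allocCalls;
        return static_cast<T*>(std::malloc(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
};
template <class T, class U>
bool operator==(const CountingAlloc<T>&, const CountingAlloc<U>&) { return true; }
template <class T, class U>
bool operator!=(const CountingAlloc<T>&, const CountingAlloc<U>&) { return false; }

// Growing one element at a time reallocates (and copies) repeatedly...
static int countWithoutReserve(int n) {
    g_allocCalls = 0;
    std::vector<int, CountingAlloc<int> > v;
    for (int i = 0; i < n; ++i) v.push_back(i);
    return g_allocCalls;
}

// ...while reserve() pays for the storage exactly once, up front.
static int countWithReserve(int n) {
    g_allocCalls = 0;
    std::vector<int, CountingAlloc<int> > v;
    v.reserve(n);
    for (int i = 0; i < n; ++i) v.push_back(i);
    return g_allocCalls;
}
```

Each reallocation also copies every existing element, which is the cost the posts above ran into with large data sets in external memory.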

  • Brandy, Archaeologist:

    This information is extremely helpful, thank you so much for taking the time to post!

    I just had a discussion with my algorithm developer and we are definitely in favour of modifying the code to remove Vectors.  Luckily it should be possible without too much headache, and we can still meet our desired performance results by using large arrays.

    I appreciate your help.

  • You're welcome!  Good luck!