Method for assigning some arrays to local memory?



Hello,

I am running a large program on a C6748 Experimenter kit, meaning most of it is running in mDDR.

I am trying to optimize a certain section that uses a lot of static float arrays, some of them large. Since the whole section is too big to assign to local memory, is there a "pragma" or other type of instruction that can be placed in the code to assign certain arrays to local memory to improve performance?

Thanks in advance.

Dan.

 

  • Dan,

    In the C Compiler User's Guide, you will want to look at #pragma DATA_SECTION. This is not for temporarily locating data in a specific place, but for permanently placing this memory element in a specific section.

    Then you will want to look at the Assembly Language Tools Reference Guide to see how to create a linker command file that places a certain section into the defined memory component that you want to use. This is assuming that you will create a new memory section.
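    As a minimal sketch of the two steps above: the array name `coeffs`, the section name `.fast_data` and the memory range name `L2RAM` are all made up for illustration, and the linker fragment in the comment assumes a matching range is already declared in your MEMORY directive. The pragma is guarded so the file still builds with a non-TI compiler.

    ```c
    #include <stddef.h>

    #define N_COEFFS 256

    /* TI C6000 compiler, C syntax: #pragma DATA_SECTION(symbol, "section name").
     * The pragma has to appear before the definition of the symbol.
     * ".fast_data" is a hypothetical section name. */
    #ifdef __TI_COMPILER_VERSION__
    #pragma DATA_SECTION(coeffs, ".fast_data")
    #endif
    float coeffs[N_COEFFS];

    /* Matching linker command file fragment (not C, shown for reference only).
     * L2RAM is a hypothetical range name from the MEMORY directive:
     *
     *   SECTIONS
     *   {
     *       .fast_data > L2RAM
     *   }
     */

    /* Fill the array so there is something to observe; returns the element count. */
    size_t fill_coeffs(float value)
    {
        size_t i;
        for (i = 0; i < N_COEFFS; i++) {
            coeffs[i] = value;
        }
        return N_COEFFS;
    }

    /* Read back the first element. */
    float first_coeff(void)
    {
        return coeffs[0];
    }
    ```

    After linking, the map file should show `coeffs` at an address inside the range you chose.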

    You may also want to make some tradeoffs between L1D cache size and L1D SRAM size, or between L2 cache size and L2 SRAM size.

    Regards,
    RandyP

     


  • Hi Randy,

    Thanks for your help. I used those assignments to move static arrays and pointers that were formerly in mDDR into L2 memory that is cached by L1D, but so far it does not make much difference in performance. I am playing around with the L1D cache/SRAM settings to see if I can get a boost. I'm guessing that since the program portion is still in mDDR, it may not make much difference to assign just the data portion to faster memories?

    I will try the CODE_SECTION pragma next for select functions.
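    For reference, a minimal sketch of what that could look like. The function `dot_product` and the section name `.fast_code` are hypothetical, and the section would still need to be placed in internal memory by the linker command file, the same way as a data section.

    ```c
    /* TI C6000 compiler, C syntax: #pragma CODE_SECTION(symbol, "section name").
     * The pragma has to appear before the function is declared or defined.
     * ".fast_code" is a hypothetical section name. */
    #ifdef __TI_COMPILER_VERSION__
    #pragma CODE_SECTION(dot_product, ".fast_code")
    #endif
    float dot_product(const float *a, const float *b, int n)
    {
        float sum = 0.0f;
        int i;
        for (i = 0; i < n; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
    ```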

    Thanks again.

     

    Dan.

  • Dan,

    Which versions are you using: CCS, BIOS, Code Generation?

    Since all of this is intended to get better performance, have you enabled caching?

    Have you set the MAR bits, MAR192-MAR223?
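    To show where that range comes from, here is a small sketch. It assumes each MAR register covers one 16 MB region and that MAR0 sits at 0x01848000, so please check both against the megamodule reference guide for your device. The helper names are made up.

    ```c
    #include <stdint.h>

    #define MAR_BASE_ADDR    0x01848000u /* assumed address of MAR0 */
    #define MAR_REGION_SHIFT 24          /* assumed: one MAR per 16 MB region */

    /* Index of the MAR register that controls cacheability of addr. */
    unsigned mar_index(uint32_t addr)
    {
        return (unsigned)(addr >> MAR_REGION_SHIFT);
    }

    /* Address of the memory-mapped MARn register. */
    uint32_t mar_reg_addr(unsigned index)
    {
        return MAR_BASE_ADDR + 4u * index;
    }
    ```

    With the mDDR window at 0xC0000000 to 0xDFFFFFFF, `mar_index()` gives 192 for the first address and 223 for the last.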

    Randy

  • Hi Randy,

    CCS: 3.3.83.19

    BIOS: 5.33.05

    Caching enabled.

    MAR192-MAR223 enabled

    So are MAR17 and MAR128 (when those memory sections are used).

    Optimizations used:

    -O3

    -ms0

    -mt

    -pm

    -mf5

     

    Dan.

    From what you have said, you have done everything right. Since moving from mDDR to L2 did not improve performance, the memory location may not be the problem. The mDDR location would have been cached in L1D and L2, so perhaps data accesses are not the bottleneck.

    The profiler in CCS is supposed to be helpful for locating problem areas. I have not used it much, but I know the tools are very powerful. And I think CCSv3 has some good tutorials. The tools are even better with the simulator, at least for catching cache miss rates and so on.

    You can also embed STS objects or your own benchmarking arrays, and use the CLK_gethtime() call to measure the time between different points in your code.
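    A rough sketch of the "own benchmarking" approach follows. The `bench_t` type and the helper names are made up, and `read_ticks()` uses the standard `clock()` as a stand-in so the example builds on a host; on the target you would return `CLK_gethtime()` from it instead.

    ```c
    #include <stdint.h>
    #include <time.h>

    /* Stand-in tick source for a host build.
     * Assumption: on the target this would return CLK_gethtime(). */
    uint32_t read_ticks(void)
    {
        return (uint32_t)clock();
    }

    /* Accumulates elapsed ticks over repeated measurements of one code region. */
    typedef struct {
        uint32_t start; /* tick value at the last bench_begin() */
        uint32_t total; /* sum of elapsed ticks */
        uint32_t count; /* number of completed measurements */
    } bench_t;

    void bench_begin(bench_t *b)
    {
        b->start = read_ticks();
    }

    void bench_end(bench_t *b)
    {
        /* Unsigned subtraction stays correct across one wrap of a 32-bit counter. */
        b->total += read_ticks() - b->start;
        b->count++;
    }

    /* Average ticks per measurement, or 0 if nothing was measured. */
    uint32_t bench_average(const bench_t *b)
    {
        return b->count ? b->total / b->count : 0u;
    }
    ```

    Put `bench_begin()` and `bench_end()` around the region of interest and read the average after a run. The tick units depend on your CLK configuration, so check the DSP/BIOS API reference before converting to time.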

    To make sure you have all the settings enabled as you have listed above, I recommend disabling one and re-running to see the performance effect. Then restore it and disable the next, and so on.

    If you have questions about using the Profiler or STS objects, I recommend searching the DSP/BIOS documents that you have (or getting updates from ti.com), going through the relevant CCS Tutorials, and searching the TI Wiki Pages and the E2E BIOS forum. If it is still not clear, post your questions on the BIOS forum under Embedded Software.

  • Hi Randy,

    I am following up here to keep the flow consistent, but if you think this should be a separate post let me know.

    I used CLK_gethtime() and related tools to benchmark the function in question. Almost all of it runs very well, which in this case means in about 5 ms, but one (relatively small) for-loop costs about 110 ms, so I am focusing on this part since the function is called a few hundred times in a processing cycle.

    I have created a representative version of the for-loop (i.e. the actual names are different, but the code is the same). Based on a few experiments (e.g. replacing some of the array accesses with constants), I believe I may be running into dependency problems that are preventing the pipeline from doing its job.

    In the representation below, arrays with name extensions like "_M1" or "_M2" are in fact pointers into the base array with the index shifted. For example:

    A base array might be KR, while KR_M1 is a pointer into the KR array shifted back by one element, i.e. &KR[0] - 1 (yes, the zero case is handled so things don't get nasty), so that KR_M1[m] refers to KR[m-1]. An index of "mM1" represents index m minus 1. I have done some re-arranging of the indexing (not included here) to no avail.
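    To make the naming concrete, here is a minimal sketch of one way such a shifted pointer can be set up. The padded `KR_storage` array is only an assumption about how the zero case might be handled, not the actual code. The padding keeps `KR_M1` pointing at real storage, since forming a pointer to one element before the start of an array is not defined by ISO C.

    ```c
    #define M_MAX 25

    /* One extra leading element so that "index -1" of KR is valid storage. */
    static float KR_storage[M_MAX + 2];

    float *const KR    = &KR_storage[1]; /* KR[0] is KR_storage[1] */
    float *const KR_M1 = &KR_storage[0]; /* KR_M1[m] is KR[m - 1]  */
    ```

    With this layout, `KR_M1[m]` and `KR[m - 1]` name the same element for every m from 0 to M_MAX.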

    The loop is iterated ~25 times (i.e. M = 25) for each pass (and there are many passes). All data arrays and pointers involved are stored in L1D SRAM. There are some dependencies in the loop, but I suspect the number of accesses to the same memory space may be part of the problem?

    FYI: There are functions defined in the h-file (e.g. Cmr) which I believe are already optimized, as they are called many times throughout the rest of the program with little delay. I also tried optimizing these specific functions using intrinsics etc., with no improvement.

    Representative code:

    **************************

    //(In h-file)
    #define Cmr(ar,ai,br,bi) ((ar*br)-(ai*bi))
    #define Cmi(ar,ai,br,bi) ((ar*bi)+(ai*br))
    #define CSqMod(ar,ai) ((ar*ar)+(ai*ai))

    //In while loop
    for(m = 1;m <= M;m++)
    {
        mM1 = m-1;

        FR[m] = FR[mM1]-
            (Cmr(KR_M1[m],-KI_M1[m],BR_M1[mM1],BI_M1[mM1])/RBR_M2[mM1]);
        FI[m] = FI[mM1]-
            (Cmi(KR_M1[m],-KI_M1[m],BR_M1[mM1],BI_M1[mM1])/RBR_M2[mM1]);

        BR[m] = BR_M1[mM1]-
            (Cmr(KR_M1[m],KI_M1[m],FR[mM1],FI[mM1])/RFR_M1[mM1]);
        BI[m] = BI_M1[mM1]-
            (Cmi(KR_M1[m],KI_M1[m],FR[mM1],FI[mM1])/RFR_M1[mM1]);

        KR[m] = (FF*KR_M1[m])+(AR_M1[mM1]*
            Cmr(FR[mM1],-FI[mM1],BR_M1[mM1],BI_M1[mM1]));
        KI[m] = (FF*KI_M1[m])+(AR_M1[mM1]*
            Cmi(FR[mM1],-FI[mM1],BR_M1[mM1],BI_M1[mM1]));

        if(m < M)
        {
            RFR[m] = RFR[mM1]-
                (CSqMod(KR[m],KI[m])/RBR_M1[mM1]);

            RBR[m] = RBR_M1[mM1]-
                (CSqMod(KR[m],KI[m])/RFR[mM1]);
        }

        RTR[m] = RTR[mM1]+
            (Cmr(KxR[m],-KxI_M1[m],BR[mM1],BI[mM1])/RBR_M1[mM1]);
        RTI[m] = RTI[mM1]+
            (Cmi(KxR[m],-KxI_M1[m],BR[mM1],BI[mM1])/RBR_M1[mM1]);

        AR[m] = AR[mM1]-
            (AR[mM1]*AR[mM1]*
            CSqMod(BR[mM1],BI[mM1])/RBR[mM1]);
    }

    **********************************************
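    For readers following the math, the three macros compute the real part, imaginary part and squared modulus of complex values held as separate real/imaginary floats. Below is a sketch of equivalent inline functions; the function names are made up and are not part of the actual program. One difference from the macros is that the arguments are evaluated once and need no extra parentheses, whereas a call such as Cmr(a+b, ...) would expand incorrectly because the macro arguments are not parenthesized.

    ```c
    /* Real part of (ar + j*ai) * (br + j*bi). */
    static inline float cmul_re(float ar, float ai, float br, float bi)
    {
        return (ar * br) - (ai * bi);
    }

    /* Imaginary part of (ar + j*ai) * (br + j*bi). */
    static inline float cmul_im(float ar, float ai, float br, float bi)
    {
        return (ar * bi) + (ai * br);
    }

    /* Squared modulus |ar + j*ai|^2. */
    static inline float csq_mod(float ar, float ai)
    {
        return (ar * ar) + (ai * ai);
    }
    ```

    Passing a negated imaginary part, as the loop does with -KI_M1[m], multiplies by the complex conjugate of the first operand.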

    Any thoughts you might have to improve the performance would be much appreciated.

    Thanks in advance.

     

    Dan.