DM648: optimizing to improve code performance

Recently I ported my algorithm to the DM648. It ran so slowly that I have to optimize it to improve performance.

I found some suggestions in the E2E community. Most of them say something like this:

1. The CPU has to access external memory (DDR), and DDR is very slow; it is the bottleneck.

2. Using on-chip memory such as L1D cache/RAM, L1P cache/RAM, or L2 cache/RAM can improve code performance.

3. Some people say you can place your most time-consuming code in L1D or L1P.

Following the third point, I placed Function1 (the most time-consuming function in my project) into L1P SRAM. On the DM648, L1P can be configured as 0 KB, 16 KB, or 32 KB; in my project I configured L1P as 16 KB cache and 16 KB SRAM. It did not improve performance at all: placing Function1 in DDR or in L1P SRAM makes no difference, and it is still very slow. All of my DDR is cache-enabled, but that is not enough to improve performance, so I want to use the on-chip memory (L1D, L1P, L2).

 

Can anyone tell me what the problem is in my case? Why does it make no difference whether the code is placed in DDR or in on-chip SRAM? How should I use on-chip memory to improve code performance?

Thanks!
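For reference, placing a function in a dedicated memory region with the TI compiler is usually done with the CODE_SECTION pragma plus a matching entry in the linker command file. The sketch below is only an illustration: the function body is a placeholder, the section name `.l1p_code` is hypothetical, and the linker command file must separately map that section to the L1P SRAM range. One possible reason such placement makes no difference is that a function executed repeatedly from cacheable DDR is mostly served from L1P cache after the first pass, so the remaining time may be dominated by data accesses rather than instruction fetches.

```c
#include <stddef.h>
#include <stdint.h>

/* ".l1p_code" is a hypothetical section name. The linker command file
 * has to map it to the L1P SRAM region; that mapping is not shown here. */
#ifdef __TI_COMPILER_VERSION__
#pragma CODE_SECTION(function1, ".l1p_code")
#endif
uint32_t function1(const uint8_t *src, size_t n)
{
    /* Placeholder body: sums n bytes. The real hot function goes here. */
    uint32_t acc = 0;
    size_t i;

    for (i = 0; i < n; i++)
        acc += src[i];
    return acc;
}
```

On a host compiler the pragma is skipped, so the function can still be built and checked off-target.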

  • Hongke Zhang,

    When you say "wasting most time", do you mean that you spent many hours getting this to work or that you use a lot of DSP performance time to put the program into L1PSRAM? I recommend setting both L1P and L1D to 32KB cache and let the DSP manage the use of the L1 resources. You may want to use a mix of cache and SRAM in L2, but if your entire program will not fit into L2SRAM, it will be easier for you to use maximum L2 cache and put everything in external memory.

    Are you using the Debug Build configuration or the Release Build configuration? This is the first place to look for performance gains: use Debug to get the program functionally correct, and then switch to Release to get the optimization for performance.

    Concentrate on using cache and using the compiler's optimization to get better performance.

    When you look at memory in the CCSv5 memory browser, it will show you if an external memory location is stored in cache or not. This is marked by the color shading in the browser window. If everything has a white background then it is not stored in cache and you may have some cache configurations incorrect.

    Regards,
    RandyP
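As one illustration of leaning on the compiler's optimization, the sketch below gives the compiler extra information about a loop. It assumes a simple dot-product kernel that is not from this thread: `restrict` promises the two arrays do not overlap, and the TI-specific MUST_ITERATE pragma promises that the trip count is at least 8 and a multiple of 8. Both are promises made by the programmer; if they do not hold, the generated code may be wrong.

```c
#include <stdint.h>

/* Hypothetical kernel: the caller guarantees that a and b do not alias,
 * and that n is at least 8 and a multiple of 8. */
int32_t dot_product(const int16_t *restrict a,
                    const int16_t *restrict b, int n)
{
    int32_t sum = 0;
    int i;

#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(8, , 8)
#endif
    for (i = 0; i < n; i++)
        sum += (int32_t)a[i] * b[i];

    return sum;
}
```

Whether this helps a particular algorithm depends on its loops; the compiler's software pipelining feedback for the Release build is the place to check.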

  • RandyP:

    1. When I said "wasting most time" I meant that the code is time-consuming and performs poorly, so I have to optimize it to improve performance.

    2. Of course I used Release with -O3 in my project, but it is still too slow, so I have to try other methods such as EDMA and on-chip memory.

    3. I believe that using cache and the compiler's optimization may be useful, but I think using EDMA and on-chip memory is more efficient than that. Previously, on the DM6437, with L1D cache set to 32 KB and L1D SRAM set to 48 KB, I put my code in L1D SRAM for processing and moved the output from L1D SRAM to DDR. I got much better performance: the function took 32 ms before optimization and 4 ms after. With cache and the compiler's optimization alone you cannot get that much improvement.


    4. Comparing the DM648 and DM6437 datasheets: on the DM6437 you can configure 32 KB of L1D cache and 48 KB of L1D SRAM, but on the DM648 you cannot. If you configure 32 KB of L1D cache, the L1D SRAM size drops to 0 KB, because the total size (L1D cache plus SRAM) is 32 KB.

    My question is:

    On the DM648, performance is the same whether I configure L1D cache as 16 KB or 32 KB, so I want to use the remaining 16 KB of SRAM to process image data and move the result to DDR (external memory) using EDMA. But using the same method on the DM6437 and the DM648 I cannot get the same performance; the performance is still poor, and I do not know why. I only know that using on-chip SRAM helps improve code performance, but I do not know how to use it to get better performance.

    Regards!
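The scheme described above (process in on-chip SRAM, move results to DDR with EDMA) is often organized as ping-pong buffering. The sketch below shows only the control flow, under stated assumptions: `dma_start` and `dma_wait` are hypothetical stand-ins implemented with `memcpy`, not EDMA3 calls; `BLOCK_BYTES` and the pixel-inversion kernel are invented for illustration; and on the target the buffers would have to be placed in on-chip SRAM through the linker command file.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 1024  /* hypothetical block size; must fit the SRAM budget */

/* On the target these buffers would live in on-chip SRAM. */
static uint8_t in_buf[2][BLOCK_BYTES];
static uint8_t out_buf[2][BLOCK_BYTES];

/* Hypothetical stand-ins for EDMA transfers. memcpy is synchronous, so
 * nothing overlaps here; a real DMA would run while the CPU computes. */
static void dma_start(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

static void dma_wait(void)
{
    /* Real code would wait here for all outstanding transfers. */
}

/* Example kernel (an assumption): invert every pixel. */
static void process_block(const uint8_t *in, uint8_t *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = (uint8_t)(255u - in[i]);
}

/* Assumes total is a multiple of BLOCK_BYTES. */
void process_image(const uint8_t *src_ddr, uint8_t *dst_ddr, size_t total)
{
    size_t nblocks = total / BLOCK_BYTES;
    size_t b;
    int cur = 0;

    if (nblocks == 0)
        return;

    dma_start(in_buf[0], src_ddr, BLOCK_BYTES);        /* prefetch block 0 */
    for (b = 0; b < nblocks; b++) {
        dma_wait();           /* block b is in; previous output is out */
        if (b + 1 < nblocks)  /* fetch the next block into the other buffer */
            dma_start(in_buf[cur ^ 1],
                      src_ddr + (b + 1) * BLOCK_BYTES, BLOCK_BYTES);
        process_block(in_buf[cur], out_buf[cur], BLOCK_BYTES);
        dma_start(dst_ddr + b * BLOCK_BYTES, out_buf[cur], BLOCK_BYTES);
        cur ^= 1;
    }
    dma_wait();               /* last output block has been written */
}
```

With a real asynchronous DMA, the point of the structure is that the fetch of block b+1 and the write-back of block b-1 proceed while the CPU works on block b. If the kernel itself is slow, for example because its inner loops do not software pipeline, this structure alone may not change the run time much.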

  • Hongke Zhang,

    The best answer to the simple question "how ... can I get better performance?" is to use the cache and compiler to their full advantage. You seem to be asking for more advanced methods, but you have also used those methods, too, so I am confused. You may need to look at the resources I will list at the end of this post.

    For different algorithms, cache and internal SRAM will have different effects. Simply switching to SRAM instead of cache may not be a good solution.

    hongke zhang said:
    On chip DM6437, L1D CACHE set as 32KB,L1DSRAM set as 48KB; I put my code in L1DSRAM to processing and move the output from

    L1DSRAM to DDR

    If you were able to put your code in L1DSRAM in the DM6437, then you must know how to use the linker command file or CCS tools to place data sections into L1DSRAM. The DM6437 is one of only a few DSPs that have the extra L1DSRAM space. This is an expensive resource to include on the chip so it is not included in other DSPs, in general.

    hongke zhang said:
    On DM648 performance didn't make difference when I configure L1D cache as 16KB or 32KB

    This is an unusual result. Please tell me how the performance changes when you configure L1D cache as 8KB, 4KB, 2KB, and 0KB.

    What is the nature of your algorithm in terms of how data memory is accessed? Is it accessed sequentially, in blocks, "randomly", by large span? How much data is used?

    How did you fit it into the DM6437 L1DSRAM?

    How is L2 configured for cache vs. SRAM?

    hongke zhang said:
    But using the same method on DM6437 and DM648 I cannot get the same performance,

    the performance was still poor.

    a. Why did you switch from the DM6437 if you got good enough performance?

    b. Is your problem that the DM648 does not have as much L1DSRAM as the DM6437 or how to use what it does have?

    c. What does the memory browser show for the location of data in cache when you inspect data that is being processed? This is to determine if you are effectively using cache or not.

    Additional resources:

    EDMA coding: Please search the forum for "C6455_Edma.zip" (no quotes). I do not know if the same CSL is available for the DM648, but if so you can use this code as an example for using the EDMA to move data. Otherwise, look for information on the EDMA3 LLD in the TI Wiki Pages and at TI.com.

    Optimization training: In the TI Wiki Pages you can search for "C6000 Optimization" (no quotes) for advice on optimization methods. Also add Training or Workshop to that search and you can download our workshop material.

    EDMA training: In the Training section of TI.com, there is a training video set for the C6474. It has the same DSP core and EDMA3 module as the DM648, so it could be helpful for you to review all of the modules. But in particular, the EDMA3/QDMA/IDMA Module may help you understand some of the features and options available within the EDMA3 module. The Cache module may help you with use of the cache. You can find the complete video set here.

    Design support: From the Home Page at TI.com, click the Support & Community tab then click the TI Design Network. You can enter information on the type of design you are doing and get a list of Design Support providers who may be able to help you with your application. This would be helpful if you decide to do a complete analysis of your project to determine the best possible optimization, which is a highly technical process. These C64x+ DSPs and their architecture are designed to allow you to do most work without having to do that, but if you have more difficult requirements then their help may be vital for you and your success.

    Regards,
    RandyP

  • RandyP:

    I found that the examples in TI's technical documents about optimization are very simple, but the code in my algorithms is complex.

    For example:

    //////////////////////////////////////////////////////////

    for (i = 0; i < num1; i++)
    {
        for (j = 0; j < num2; j++)
        {
            for (k = 0; k < num3; k++)
            {
                if (a) continue;

                if (b)
                {
                    // do something
                }
                else
                {
                    // do something
                }

                for (p = 0; p < num4; p++)
                {
                    if (d) break;
                }
            }
        }
    }

    ///////////////////////////////////////////////////////////

    According to TI's optimization tips, if statements and continue and break statements disturb software pipelining. But how should I modify the code? I am also confused by the nested loops (three levels deep).

    I think it is hard to unroll a three-level loop!
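One common way to approach a loop of this shape is to rewrite the innermost loops so that their bodies contain no jumps, which gives the compiler a chance to use predicated instructions instead of branches. The sketch below is a minimal illustration with invented conditions and data (`thr`, `limit`, and all function names are hypothetical). It turns `continue` and `if`/`else` into conditional selects, and replaces an early `break` with a fixed-trip-count scan. Only the innermost loops are targeted here; the outer loops are left as they are rather than unrolled.

```c
#include <stdint.h>

/* Branchy form, mirroring "if (a) continue; if (b) ... else ...". */
int32_t sum_branchy(const int16_t *x, int n, int16_t thr)
{
    int32_t acc = 0;
    int k;
    for (k = 0; k < n; k++) {
        if (x[k] == 0) continue;
        if (x[k] > thr) acc += x[k] - thr;
        else            acc += 1;
    }
    return acc;
}

/* Same result with selects only: every iteration does the same work. */
int32_t sum_flat(const int16_t *x, int n, int16_t thr)
{
    int32_t acc = 0;
    int k;
    for (k = 0; k < n; k++) {
        int32_t v = x[k];
        int32_t contrib = (v > thr) ? (v - thr) : 1;
        acc += (v != 0) ? contrib : 0;
    }
    return acc;
}

/* Search with an early exit, mirroring "if (d) break;". */
int first_above_break(const int16_t *t, int n, int16_t limit)
{
    int p;
    for (p = 0; p < n; p++)
        if (t[p] > limit) break;
    return p;   /* n when nothing matches */
}

/* Same result without break: scan backwards over all n elements so the
 * last value stored is the smallest matching index. */
int first_above_flat(const int16_t *t, int n, int16_t limit)
{
    int first = n;
    int p;
    for (p = n - 1; p >= 0; p--)
        first = (t[p] > limit) ? p : first;
    return first;
}
```

The break-free scan always runs all n iterations, so it is a trade-off: it tends to pay off when n is small or the match usually comes late, and it can lose when the original loop normally exits after a few iterations.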