
Compiler/TMS320C6678: Optimized compilation and performance with/without RTSC/OMP

Part Number: TMS320C6678

Tool/software: TI C/C++ Compiler

Hi there,

I'm having some performance issues with a function I'm running without RTSC (fast) and with RTSC (slow).

In the first example, I link with the following C6678.cmd:

MEMORY
{
    SHRAM:           o = 0x0C000000 l = 0x00400000   /* 4MB Multicore Shared Memory */
  
    CORE0_L2_SRAM:   o = 0x10800000 l = 0x00080000   /* 512kB CORE0 L2/SRAM */
    CORE0_L1P_SRAM:  o = 0x10E00000 l = 0x00008000   /* 32kB CORE0 L1P/SRAM */
    CORE0_L1D_SRAM:  o = 0x10F00000 l = 0x00008000   /* 32kB CORE0 L1D/SRAM */
    /* goes on with CORE1-CORE7 */
}
SECTIONS
{
#ifdef CORE0
    .myfastsection > CORE0_L2_SRAM
    .text:optimized: 	load >> CORE0_L2_SRAM
    /* goes on with other sections, all of them placed in L2SRAM */
#endif
}

The corresponding functions are placed in .text:optimized using #pragma CODE_SECTION, and the arrays are placed in .myfastsection using #pragma DATA_SECTION and double-word aligned using #pragma DATA_ALIGN(., 8). The performance is very satisfying, and looking at the generated assembly code, the compiler seems to software-pipeline well.
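
Concretely, the placement looks like this sketch; coeffs and filter_kernel are placeholder names for my actual arrays and function:

#pragma DATA_SECTION(coeffs, ".myfastsection")
#pragma DATA_ALIGN(coeffs, 8)                  /* double-word (8-byte) aligned */
float coeffs[256];

#pragma CODE_SECTION(filter_kernel, ".text:optimized")
void filter_kernel(const float *restrict in, float *restrict out, int n)
{
    int i;
    for (i = 0; i < n; i++)                    /* time-critical loop; software-pipelines at -O3 */
        out[i] = coeffs[i & 255] * in[i];
}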

In the second example, I'm adding RTSC because in another code section (unrelated to the one above) I plan to use OMP. However, with the same compiler optimization options, the performance of the function above deteriorates greatly (half the speed, measured with both TSCL and omp_get_wtime()). The generated assembly code for the function is identical, so my first guess was that I'm doing something wrong with the memory sections. In my modified .cfg file I added:


Program.sectMap[".text:optimized"] = new Program.SectionSpec();
Program.sectMap[".myfastsection"]  = new Program.SectionSpec();
Program.sectMap[".text:optimized"].loadSegment = "L2SRAM";
Program.sectMap[".myfastsection"].loadSegment  = "L2SRAM";

Shouldn't that be identical to the linker.cmd above? Is it also possible (and necessary) to partition the L2SRAM for the different cores as above? If I don't use any OMP in my code (even though I'm compiling with the RTSC components), the performance is fine. However, as soon as I use OMP in a different function, called after my initial function, the performance is halved. The initial function is called after omp_set_num_threads().

My second guess was that OMP introduces some overhead. However, I don't understand why it would, since the initial function is totally unrelated to OMP. Any additional insight would be very helpful: in some cases it would be really useful to actually use OMP, but this performance degradation is not acceptable in our case.

NB: In the first case, code is loaded onto core 0 only. In the second case (compiled with RTSC, no use of OMP in the code) and in the third case (compiled with RTSC, OMP used in a different function), code is loaded onto all cores. The same optimizer flags are used in all cases, the arrays are double-word aligned and placed in L2SRAM in all cases, and the functions are called 4 times in a row in all cases.
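
For reference, the cycle counts are measured essentially as in this sketch (my_optimized_function stands in for the real kernel):

#include <c6x.h>                            /* TSCL/TSCH for the C6000 compiler */

extern void my_optimized_function(void);    /* placeholder for the real kernel */

unsigned long long benchmark_cycles(void)
{
    unsigned long long t0, t1;

    TSCL = 0;                               /* any write to TSCL starts the counter */
    t0 = _itoll(TSCH, TSCL);
    my_optimized_function();
    t1 = _itoll(TSCH, TSCL);
    return t1 - t0;                         /* elapsed cycles */
}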

Please let me know if you need additional information. Thank you very much in advance.

Best wishes,

Idris

  • Hi Idris,

    Did you check the memory map to find out if the function you benchmark is allocated into the same memory region you defined?
    Also, has the priority of the task you profile changed with/without RTSC?

    Regards,
    Garrett
  • Hi Garrett,

    Please excuse my late answer, and thank you for yours. I had to focus on something else; I will come back to you as soon as possible.

    Best wishes,

    Idris

    Ok Garrett, you were right: I examined the map file, found a mismatch, and after fixing it the performance matched. May I ask you some additional questions, please?

    I looked into the OMP platform package, where I found the following:

                customMemoryMap: [
                    ["L2SRAM",    
                                    {name: "L2SRAM",  base: 0x00800000, 
                                    len: 0x00060000, access: "RW"}],
                    ["OMP_MSMC_NC_VIRT",   
                                    {name: "OMP_MSMC_NC_VIRT", base: 0xA0000000, 
                                    len: 0x00020000, access: "RW"}],
                    ["OMP_MSMC_NC_PHY",   
                                   {name: "OMP_MSMC_NC_PHY", base: 0x0C000000, 
                                    len: 0x00020000, access: "RW"}],
                    ["MSMCSRAM",   
                                    {name: "MSMCSRAM", base: 0x0C020000, 
                                    len: 0x003E0000, access: "RWX"}],
                    ["DDR3",   
                                    {name: "DDR3", base: 0x80000000, 
                                    len: 0x20000000, access: "RWX"}],
                ],
                l2Mode:"128k",
                l1PMode:"32k",
                l1DMode:"32k",

    and I have here the example linker.cmd:

        SHRAM:           o = 0x0C000000 l = 0x00400000   /* 4MB Multicore Shared Memory */
      
        CORE0_L2_SRAM:   o = 0x10800000 l = 0x00080000   /* 512kB CORE0 L2/SRAM */
        CORE0_L1P_SRAM:  o = 0x10E00000 l = 0x00008000   /* 32kB CORE0 L1P/SRAM */
        CORE0_L1D_SRAM:  o = 0x10F00000 l = 0x00008000   /* 32kB CORE0 L1D/SRAM */

    1.) Are the OMP platform addresses per core?

    2.) I guess that l2Mode:"128k" sets 128k of L2 cache; is that why this linker.cmd has length 0x00080000 while the platform package has length 0x00060000?

    3.) With the linker.cmd, I'm able to assign different things to the L2 sections of different cores. How can I do this in the OMP case?

    4.) In which spru*.pdf can I find more information on the memory configuration in the platform package?

    Thank you very much!

  • Idris,

    1) The addresses are not per core.
    2) The first case (the OMP platform package) sets 128K of cache, leaving (512 - 128) = 384K (0x60000) of SRAM; the second case (the linker.cmd) configures all 512K (0x80000) of L2 as SRAM with no cache.
    3) You can configure the cache/SRAM split according to the core number, something like this (a fuller sketch follows after this list):

        coreNum = CSL_chipReadReg(CSL_CHIP_DNUM);
        if (coreNum == 0)
            CACHE_setL2Size(CACHE_128KCACHE);
        else
            ...
    4) The details of the ti.runtime.openmp.platforms package can be viewed from CCS -> Project -> RTSC Tools -> Platform -> Edit/View: browse to C:\ti\openmp_dsp_c667x_2_06_03_00\packages as the Platform Package Repository and select the OMP platform from the Package Name drop-down list to see the device name/family/clock speed, custom memory map, cache and memory sections, etc.
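
    For item 3, here is a slightly fuller sketch. The CSL header names and the per-core split below are an example under my assumptions (verify against your PDK version), not a recommendation:

        #include <ti/csl/csl_chipAux.h>     /* CSL_chipReadReg(), CSL_CHIP_DNUM */
        #include <ti/csl/csl_cacheAux.h>    /* CACHE_setL2Size() and size enums */

        /* Call this early on each core, before touching any code/data that
           must live in the SRAM portion of L2. Sizes below are examples. */
        void configure_l2_per_core(void)
        {
            Uint32 coreNum = CSL_chipReadReg(CSL_CHIP_DNUM);

            if (coreNum == 0)
                CACHE_setL2Size(CACHE_128KCACHE);   /* 128K cache + 384K SRAM */
            else
                CACHE_setL2Size(CACHE_0KCACHE);     /* all 512K of L2 as SRAM */
        }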

    Regards,
    Garrett