C64+ Megamodule Cycle Accurate Simulator - OMAP3530 DSP Cache studies

Hi,

I am posting for a customer here since the topic might be interesting for all DSP optimizers using OMAP3530.  CCSv4.2.01.4 is being used.

While optimizing IVA2/DSP code generated by C6Run on the OMAP3530, the simulator is the only tool we have for studying cache behaviour. The application has been stripped down to be loadable into the simulator, and the remaining DSP algorithm is almost the same as in the real application. The algorithm is tiny in code size but makes heavy use of many data buffers, so the L1P cache contents almost never change, while roughly 25% of the L1D cache accesses are misses. We need to get this figure way down.

Here are a number of related questions that need to be answered:

Q1: Running the profiler, the L1D cache, L1P cache and CPU stall statistics are visible, but the L2 statistics seem to be all zeros. Is the L2 cache not modeled?

Q2: By what means can the simulator's L2 cache be configured to provide statistics?

Q3: What is the best practice for using L2 in a heavy data processing scenario? As L2 SRAM or L2 data cache? Philosophies to follow?

Q4: The Cache Ram Viewer window always shows up empty, even though the L1D cache statistics show that the cache is in use. Is this not modeled? Broken?

Q5: The penalty cycles for L1D cache misses are important, and we would like to change them if possible to get closer to the real HW, which obviously executes many more cycles for the same code than the simulator. What are the means to configure L1D cache miss penalty values in the simulator?

Q6: How can we get a measure on the L1D cache miss penalty from the real chip (in time or DSP cycles)?

Q7: When checking the box within the Profile Setup - "Code Coverage enable" - CCS will crash. Why? 

Thanks for advising us, so that we can get a better grip on the situation.

/Magnus Aman

 

  • Magnus,

    For Q7, are any crash log files created?   http://processors.wiki.ti.com/index.php/Troubleshooting_CCS#Crash_Dump_File

    Regards,

    John

     

  • All my responses below refer to the C64x+ Megamodule Cycle Accurate Simulator. Please let me know if you are using a different simulator.

    Q1: Running the profiler, the L1D cache, L1P cache and CPU stall statistics are visible, but the L2 statistics seem to be all zeros. Is the L2 cache not modeled?
    A1: By default, L2 is in full RAM mode, so no cache-related events are generated.

    Q2: By what means can the simulator's L2 cache be configured to provide statistics?
    A2: The application should configure L2 in cache mode. This is done by setting the L2MODE bits in the L2CFG register. Please refer to the DSP spec for the details.
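A minimal sketch of what A2 describes, setting the L2MODE field from application code. It assumes L2CFG is memory mapped at 0x01840000 with L2MODE in bits 2:0 and the encodings 0 = no cache, 1 = 32 KB, 2 = 64 KB, which is my reading of the C64x+ megamodule documentation; please verify these against the spec for your device. The function and macro names are made up for this example, and the register is passed in as a pointer so the logic can also be tried on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed address of L2CFG on the C64x+ megamodule; verify for your device. */
#define L2CFG_ADDR   0x01840000u
#define L2MODE_MASK  0x7u          /* L2MODE field, bits 2:0 (assumed) */

/* Assumed L2MODE encodings; larger sizes exist on devices with more L2. */
enum { L2MODE_0K = 0, L2MODE_32K = 1, L2MODE_64K = 2 };

/* Return the L2CFG value with only the L2MODE field replaced. */
static uint32_t l2cfg_with_mode(uint32_t l2cfg, uint32_t mode)
{
    assert(mode <= L2MODE_MASK);
    return (l2cfg & ~L2MODE_MASK) | mode;
}

/* Read-modify-write the register, then read it back so the CPU waits
 * until the mode change has taken effect before continuing. */
static void l2_set_cache_mode(volatile uint32_t *l2cfg, uint32_t mode)
{
    *l2cfg = l2cfg_with_mode(*l2cfg, mode);
    (void)*l2cfg;
}
```

On the target the call would look like `l2_set_cache_mode((volatile uint32_t *)L2CFG_ADDR, L2MODE_64K);` early in the application. Code and data must not be linked into the part of L2 that is turned into cache.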

    Q3: What is the best practice for using L2 in a heavy data processing scenario? As L2 SRAM or L2 data cache? Philosophies to follow?
    A3: I have sent this question to some more people to get their opinions. Ideally, if your application and data fit in L2 RAM, it is better to keep L2 in full RAM mode.

    Q4: The Cache Ram Viewer window always shows up empty, even though the L1D cache statistics show that the cache is in use. Is this not modeled? Broken?
    A4: By default, L1 Data and L1 Program are in full RAM mode. If you configure the L1DCFG and L1PCFG registers to put them in cache mode, the Cache Ram Viewer should show valid values.

    Q5: The penalty cycles for L1D cache misses are important, and we would like to change them if possible to get closer to the real HW, which obviously executes many more cycles for the same code than the simulator. What are the means to configure L1D cache miss penalty values in the simulator?
    A5: The cache miss penalty depends on the following factors:
          1 Miss penalty inside L1DMC
          2 Miss penalty inside L2
          3 All path delays from the CPU down to the target
          4 For an external access, the latency of the external memory
    The model gives accurate values for items 1, 2 and 3, but item 4 depends on the external memory latency, which can vary.
    The only configurable latency the model provides is the L2 RAM wait latency, which can be changed using the parameter "RAM_WAIT_STATES" in the cfg file.

    Please give me some more detail on which area you are trying to access.
         
    Q6: How can we get a measure on the L1D cache miss penalty from the real chip (in time or DSP cycles)?
    A6: The application can use the time stamp counter (the TSCL register).

     1 Write a simple application with benchmark code that accesses data in the L1D RAM region (for example, a loop of data accesses to L1D RAM). Make sure L1D is in full RAM mode.
     2 Read the TSCL register before and after the benchmark code.
     3 Now move the data to external memory and repeat the same experiment.
     4 Accesses to L1D RAM should not cause data stalls, so you can treat any extra cycles as stall cycles.
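The steps above can be sketched roughly as follows. This assumes the TI compiler, where `<c6x.h>` declares `TSCL` and `_TMS320C6X` is predefined; the helper names are invented for this example, and the host branch is only a stand-in so the code compiles off-target (its numbers are not real timings).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef _TMS320C6X
#include <c6x.h>   /* TI compiler header that declares TSCL */
/* The counter is stopped after reset; any write to TSCL starts it. */
static void start_counter(void) { TSCL = 0; }
static uint32_t read_cycles(void) { return TSCL; }
#else
/* Host stand-in (assumption): advances by a fixed 100 "cycles" per read. */
static uint32_t host_ticks;
static void start_counter(void) { host_ticks = 0; }
static uint32_t read_cycles(void) { host_ticks += 100u; return host_ticks; }
#endif

/* Time n word reads from buf. The sum is stored through sink so the
 * reads have a visible effect. */
static uint32_t time_reads(const volatile uint32_t *buf, size_t n, uint32_t *sink)
{
    uint32_t sum = 0, t0, t1;
    size_t i;

    t0 = read_cycles();
    for (i = 0; i < n; i++)
        sum += buf[i];
    t1 = read_cycles();

    *sink = sum;
    return t1 - t0;   /* unsigned subtraction tolerates one 32-bit wrap */
}

/* Extra cycles per access when the same loop runs on slower memory. */
static uint32_t stall_cycles_per_access(uint32_t cycles_l1d, uint32_t cycles_ext, uint32_t n)
{
    assert(n > 0);
    if (cycles_ext <= cycles_l1d)
        return 0;
    return (cycles_ext - cycles_l1d) / n;
}
```

On the target, call `start_counter()` once, run `time_reads()` with the buffer linked into L1D RAM and again with it in external memory, and pass both results to `stall_cycles_per_access()`. If a cache sits in the path, sequential reads hit within a line after the first miss, so the result is an average per access rather than a per-miss penalty.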

    Q7: When checking the box within the Profile Setup - "Code Coverage enable" - CCS will crash. Why?
    A7: I have tried this without any issue. Let me try it on the exact same version of CCS that you are using and get back to you on this.

    Also, there is an interesting wiki page on how to optimize your code for cache at the link below. Please have a look.

    http://processors.wiki.ti.com/index.php/Cache_Analysis_Using_Simulator

    Thanks and regards

    Abhilash

  • Abhilash,

    Thank you for your answers and explanations.  I have a couple of follow-up questions below:

     

    Q2: By what means can the simulator's L2 cache be configured to provide statistics?
    A2: The application should configure L2 in cache mode. This is done by setting the L2MODE bits in the L2CFG register. Please refer to the DSP spec for the details.

    Q2.1: When building a DSP binary with cl6x only (i.e. not through CCS), where do I find/change all cache settings that are configured by software?  (See also Q4.1 below.)

     

    Q4: The Cache Ram Viewer window always shows up empty, even though the L1D cache statistics show that the cache is in use. Is this not modeled? Broken?
    A4: By default, L1 Data and L1 Program are in full RAM mode. If you configure the L1DCFG and L1PCFG registers to put them in cache mode, the Cache Ram Viewer should show valid values.

    A4.1: This seems to be determined by the "Device Memory Map" selection under Cpu Properties.  For C6421 the cache view is always empty, but for C6424 and C645x the cache content  is indeed shown.

    Q4.1: For the settings that are determined by the Device/Device Memory Map under Cpu Properties: where can the source of these settings be found?  They are not visible in the associated GEL files.  http://processors.wiki.ti.com/index.php/C64x+_Cycle_Accurate_Simulator gives the impression that this is available in clear text only for CCS v3.3.

     

    As you can probably tell, we are not sure which parts of the cache configuration are part of the DSP binary and which are part of the execution environment (in this case the simulator).

     

    Q5: The penalty cycles for L1D cache misses are important, and we would like to change them if possible to get closer to the real HW, which obviously executes many more cycles for the same code than the simulator. What are the means to configure L1D cache miss penalty values in the simulator?
    A5: The only configurable latency the model provides is the L2 RAM wait latency, which can be changed using the parameter "RAM_WAIT_STATES" in the cfg file.

    Q5.1: Which file contains the WAIT_STATES parameter?

     

    Thank you,

    Orjan

  • Hi Orjan Friberg,

    Before answering some of the questions, it would be good to know which device you are using for the evaluation. I assume you are using the device that is closest to IVA2. I am asking because we have a lot of configurations for C64x+ devices, and each has a different subsystem configuration.

    Please let me know on this. I would prefer to respond based on device.

    Also, regarding changing the WAIT_STATES for RAM: it is advised not to change the configuration file, and I am not sure whether it is accessible to the user. Please let me know the device name and I will try to see if there is a way for the user to change this.

    Thanks and regards

    Abhilash

  • Abhilash,

    We are using the OMAP 3530 (we will possibly move to OMAP 3730 later).  Sorry for not stating this previously.

    Thanks,

    Orjan

  • Orjan,

    When you select a target configuration in CCSv4.2, which configuration do you select? I cannot see OMAP3530 in my CCSv4 setup and am not sure whether it is supported in CCSv4.

     

    Thanks and regards

    Abhilash

  • Abhilash,

    We are using the C64x+ Megamodule Cycle Accurate Simulator.  (At the moment we are only interested in the DSP part of the 3530.)

  • Sorry for the delayed response

    Q2.1: When building a DSP binary with cl6x only (i.e. not through CCS), where do I find/change all cache settings that are configured by software?  (See also Q4.1 below.)

    A2.1: I am not in a good position to answer this, since I am not sure whether you are using a specific Chip Support Library to program the cache. The user can change these cache settings directly by programming the cache registers, which are memory mapped. Please have a look at the C64x+ device spec for more details.

    I have found a couple of good documents on how to use the C64x+ cache:
    http://www.ti.com/lit/spru871  http://www.ti.com/lit/spru862
    Please check whether they give you enough detail.

     Q5.1: Which file contains the WAIT_STATES parameter?

    A5.1: I need to correct my earlier statement. The wait state for L2 RAM access is not configurable; it is hardcoded as per the C64x+ spec.

    I will get back to you on Q4.1.

     

     

     

  • Is there a correct device memory map for an OMAP 3530 in the C64x+ Megamodule Cycle Accurate Simulator?  None of the existing ones is an exact match for both L1 and L2 cache/IRAM.

    Regarding the Cache view not working  for the C6421: the L1PCFG and L1DCFG are both initialized to 0x7 (i.e. maximum cache) so that is not the cause for the Cache view not working in that configuration.  Again, we don't need to use that particular Device memory map; the Cache view works for other Device memory maps.
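For reference, a small sketch of how a value such as 0x7 in L1PCFG or L1DCFG maps to a cache size. It assumes the mode field sits in bits 2:0 with the encodings 0 = 0 KB, 1 = 4 KB, 2 = 8 KB, 3 = 16 KB, 4 = 32 KB and 5 to 7 = maximum (32 KB), as I read the C64x+ documentation; please check this against the spec. The function names are invented for this example, and the 80 KB total used in the usage note is only an example figure for L1D.

```c
#include <assert.h>
#include <stdint.h>

#define L1_MODE_MASK 0x7u   /* mode field, bits 2:0 (assumed) */

/* Cache size in KB selected by the mode field of L1PCFG or L1DCFG. */
static unsigned l1_cache_kbytes(uint32_t cfg)
{
    static const unsigned kb[8] = { 0, 4, 8, 16, 32, 32, 32, 32 };
    return kb[cfg & L1_MODE_MASK];
}

/* SRAM left over once the cache has been carved out of the total L1 memory. */
static unsigned l1_sram_kbytes(unsigned total_kb, uint32_t cfg)
{
    unsigned cache_kb = l1_cache_kbytes(cfg);
    assert(cache_kb <= total_kb);
    return total_kb - cache_kb;
}
```

With these assumptions, `l1_cache_kbytes(0x7)` gives 32, and `l1_sram_kbytes(80, 0x7)` gives 48 for an example L1D with 80 KB in total.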

    As for the L2 cache and the Cache view, turning on the L2 cache in software seems to work.  When we write the L2MODE bits from the application, the L2 cache shows up (as 64 kB).

    Thank you,

    Orjan

  • Orjan Friberg said:

    Is there a correct device memory map for an OMAP 3530 in the C64x+ Megamodule Cycle Accurate Simulator?  None of the existing ones is an exact match for both L1 and L2 cache/IRAM.

    As an example, the DM6446 is close regarding the memory map but lacks the 32 kB dedicated L2 RAM that is in the 3530, which means if we decide to put data there through the linker command file then that will not be simulated correctly.

    Thanks,

    Orjan

  • Here are some hooks for modifying the configuration file to get the right DMC/UMC/PMC cache and RAM parameters.

    When configuring the target, select the C6421 Device Cycle Accurate Simulator.

    Modify the file "tisim_c6421_ca.cfg" located in D:\CCS_4.2.0.10017\ccsv4\simulation\bin\configurations.

    You need to modify the DMC, PMC and UMC sections. Please refer to the comment given before each module to understand how to change the cache size.

    For example, for 32 KB of cache and 48 KB of RAM at init time (80 KB of memory in total):

                ///////////////////////////////////////////////////////
                // DMC Configuration                                 //
                // *****************                                 //
                //                                                   //
                // Total Mem @ Sim Init  :- 80   Kbytes ----(1)       //
                // Cache Size @ Sim Init :- 32   Kbytes ----(2)       //
                // SRAM  Size @ Sim Init :- (1) - (2) = 48   Kbytes   //
                // SRAM Start Address    :- 0x00F04000               //
                //////////////////////////////////////////////////////

     

    Thanks and regards

    Abhilash


  • Thank you.  I will try it out and report back within a couple of days.

    Thanks,

    Orjan