TMS320C6x+: cache stats

Other Parts Discussed in Thread: TMS320DM6446

hi there

are there any registers in the TMS320C6x+ that would allow me to get an idea of the number of read/write misses occurring in the various levels of caching (L1/L2)? maybe some simple counters?

i know there is some information available in the CCS (i don't have CCS), so there must be some (undocumented) way of getting access to this programmatically?

cheers,

sam

  • Sam,

    C6000 is a very broad family of devices, and the answer can vary from one device to another. Please specify the device you are interested in.

    Generally, the simulator is the best place to get information like this. The Cycle Accurate or Cycle Approximate simulators that do implement cache will often have these statistics available. That would give you information similar to what is sometimes available through CCS emulation.

    You get the simulator in CCS, and you can download and install CCSv5 from the TI Wiki Pages. Search for CCSv5.

    Why do you need this information? It is most practical to build your application and benchmark that application. If it does not run as well as you need, then cache optimization may be one of the areas to examine, but it would not be the first or even near the top of the list for the average application.

    Regards,
    RandyP

  • hi randyp

    thanks for the quick reply.

    the device we use is a DaVinci TMS320DM6446 with a TMS320C64x+. On the c64x+ there is an application with multiple tasks, and we're using DSP/BIOS which we wrapped with some extra functionality we need. We have our own implementation of an ARM to DSP communication mechanism, and the task context switch functionality is instrumented in a way that allows us to run a tool on the ARM (similar to linux' top) that displays things like CPU load, memory usage, number of context switches and calls to new() per second etc. at run-time and on target, like so:

    general stuff:
    
        version: 00000000
        flags:  00000001 WAIT_ARM
    
    arg buffer:  0x87a51808
    
        arg buffer size:  512
        arg  0: /usr/share/taitsignalpath/tbc-dsp-p25.b0
        arg  1: --loglevel=INFO
        arg  2: --fw-stats
        arg  3: --supervised=TRUE
    
    sync table at 0x11f0fdb0:
    
        ARM state:     0x1
        DSP state:     0x3 (running)
        DSP exit code:  0 (0x0)
    
    system status at 0x87a51a98:
    
      # tasks at 0x11f0f920
    
       ID name                 prio  st  load   maxload minload   stksz  stkused   cswitch  cs/s  max cs/s  new/s 
        1 (null)                0   run 70.688% 78.468% 50.471%    1024      700    889357   465      4220      0
        2 main                  8   blk  0.000%  4.845%  0.000%   65536      188      6819     1      4117      0
        3 tbc-dsp.m_high_ec    12   blk  7.278%  7.748%  5.531%    4096     2332   1228198   666       670      0
        4 tbc-dsp.m_med_ec     11   blk  5.771% 11.803%  1.219%    6400     2328    751387   415       528      0
        5 tbc-dsp.m_low_ec     10   blk 15.530% 27.359% 13.947%   32768     2956    735034   385       527     75
        6 c-dsp.m_dont_care_ec  9   blk  0.264%  8.519%  0.096%   32768     2828     22756     6       149     34
    
      # heap info at 0x11f0fd6c
    
      standard heap: total:  2097152   used:  1017520( 48%)   largest free block: 1030152
      fast heap:     total:    32768   used:      576(  1%)   largest free block: 32192
    
      # generic stats at 0x11f0fd68 (0x11f0fd48)
    
       name                                 count count/s   currval    minimum  maximum     mean    
        dalink ISR [us]                    106648  106648        2          0      173       2.977787
        evDigIQSamples latency [us]             0       0        0          0        0       0.000000
        dropped syslog messages                 0       0        0          0        0       0.000000
        txstream latency [us]              337137  337137      330        307     1421     393.682002
        key on delta [scr ticks]               65      65     2337       2320     2376    2356.184615
        txstream samples slack [us]        337137  337137     4484       3342    29900    4400.523200
        txstream queue length             1527709 1527709        6          0       42       3.986401
        txstream isr duration [us]        1527710 1527710       28          8       77      22.692355
        rxstream buffer length            1527710 1527710        1          1        2       1.541668
        rxstream isr duration [us]        1527710 1527710       23         13       82      22.485233
        tx data latency [us]              1527710 1527710       54         30      115      45.650171
        rx data latency [us]              1527710 1527710       39         24       92      33.937602
    
    buffer table at 0x11f0fca0:
    
                                       total                   avail   free       total      bytes     total   qblocks
       ix id  direction   location     size   iRead   iWrite   bytes   bytes      bytes      per s     qblock  per s   
        0  0   DSP->ARM  0x87a00000    8192    3573     3573       0    8191      99322          0          0      0
        1  2   DSP->ARM  0x87a02000   16384    9342     9342       0   16383    3376590       2000          0      0
        2  3   DSP->ARM  0x87a08000    8192    3162     3162       0    8191    1186713        561          0      0
        3  4   ARM->DSP  0x87a06000    8192    3920     3920       0    8191       2588          0        116      0
        4  5   ARM->DSP  0x87a0a000    8192    5138     5138       0    8191     339758        102       6145      2
        5  6   DSP->ARM  0x87a0c000  262144       0        0       0  262143          0          0          0      0
        6  8   DSP->ARM  0x87a4e000    8192    6743     6743       0    8191      60028          0          0      0
        7  9   ARM->DSP  0x87a4c000    8192    8188     8188       0    8191     826268          0       5889      0
    
    

    the application does not have performance problems in general, but once in a blue moon the most critical task/loop has large spikes in the processing time. i am suspecting that either one of the interrupts or some unimportant, low-priority task does something that causes a burst of cache evictions, and i would like to extend my per-task statistics with cache miss stats. if i had a counter for each level of cache i could easily snapshot it in my task switch handler to get an idea how often and where this happens. i could then play around e.g. with freezing the cache for non-time critical tasks etc. to see how much this changes the picture.

    i don't see the simulator being a help here as there is so much ARM <-> DSP interaction that makes up our real world application. i wouldn't be able to run a realistic system for days and see how these numbers track in the simulator.

    hope my explanation makes sense,

    sam
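
The per-task snapshot idea described in the post above could look roughly like the following. This is only a sketch under stated assumptions: `read_event_counter()`, `fake_counter`, `task_events`, and `on_task_switch()` are hypothetical names, and the counter is a plain variable standing in for whatever free-running 32-bit hardware counter a given device might expose. Nothing here is a DM6446 register or a DSP/BIOS API.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TASKS 8

/* Hypothetical stand-in for a free-running 32-bit event counter
 * (for example, cache misses). On real hardware this would read a
 * device register; here it is a variable so the sketch runs on a host. */
static uint32_t fake_counter;

static uint32_t read_event_counter(void)
{
    return fake_counter;
}

typedef struct {
    uint64_t total;      /* events charged to this task so far */
    uint32_t max_slice;  /* largest count seen in one scheduling slice */
} task_events_t;

static task_events_t task_events[MAX_TASKS];
static uint32_t last_snapshot;

/* Intended to be called from the task-switch hook. Charges every event
 * counted since the previous switch to the task that is being switched out. */
static void on_task_switch(unsigned outgoing_task)
{
    uint32_t now = read_event_counter();
    /* Unsigned subtraction stays correct across one counter wraparound. */
    uint32_t delta = now - last_snapshot;

    assert(outgoing_task < MAX_TASKS);
    last_snapshot = now;

    task_events[outgoing_task].total += delta;
    if (delta > task_events[outgoing_task].max_slice) {
        task_events[outgoing_task].max_slice = delta;
    }
}
```

Tracking the per-slice maximum alongside the running total is a design choice aimed at the "once in a blue moon" spikes: a rare burst barely moves the mean but shows up directly in `max_slice`.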

  • Sam,

    I do not believe there is anything like what you want in the DM6446. If there were, it would be described in the C64x+ Megamodule Reference Guide or the C64x+ CPU & Instruction Set Reference Guide. There are also the DMSoC DSP Subsystem Reference Guide and the DSP Cache User's Guide to try. But I do not believe you will find any statistics trackers in the DM6446.

    We might move this thread to the Code Composer Forum where they talk about emulation details with great knowledge. Let me know if you would like this thread moved and I will get a Moderator to do that for you.

    Regards,
    RandyP

  • Hi sam,

    I have moved this thread to the Code Composer forum, where it is more likely to be answered.

     

    Regards,

    Shankari

  • Samuel Nobs said:
    i know there is some information available in the CCS (i don't have CCS), so there must be some (undocumented) way of getting access to this programmatically?

    I believe this information is stored in counters that are part of the emulation logic on the chip, which CCS has access to. I am not aware of any way to read this data programmatically from the application itself.

  • Unfortunately the version of C64x+ used in DM6446 doesn't implement these counters. 

    Regards,

    Oliver