What do the EVMK2H DDR3 performance counters really count?

On the EVMK2H, I use the PERF_CNT_1 and PERF_CNT_2 registers to count the number of DDR3 reads and writes.  They appear to count only a fraction of the reads and writes, though.  On the DSP, I configured the counters as follows:

    hXmc->XMPAX[15].XMPAXL = 0x12101000 | 0xBF; // remap DDR3A controller into 32-bit address space
    hXmc->XMPAX[15].XMPAXH = 0x21010000 | 0x0B;
    hEmif->PERF_CNT_CFG = 0x00030002; // enable read & write counting
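
The increments quoted in this post are differences between two snapshots of a counter, taken before and after the loop. A minimal sketch of that subtraction follows, assuming the counters are free-running 32-bit registers that wrap around; `perf_cnt_delta` is a hypothetical helper name, not part of any TI library.

```c
#include <stdint.h>

/* Hypothetical helper: number of increments between two snapshots of a
 * free-running 32-bit counter (assumption: the counter wraps at 2^32).
 * Unsigned subtraction is performed modulo 2^32, so the result stays
 * correct across a single wrap-around. */
static uint32_t perf_cnt_delta(uint32_t before, uint32_t after)
{
    return after - before;
}
```

In use, one would read `hEmif->PERF_CNT_1` into `before`, run the loop, read it again into `after`, and pass both values to the helper.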

When executing this code:

    for (int j = 0; j < 10000000; j ++)
      dst[j] = dst[j+1];

I see that PERF_CNT_1 (configured as read counter) was incremented by 1252042 and PERF_CNT_2 (configured as write counter) was incremented by 5002263.  So in this case, the read counter is incremented by 1 for every 64 bytes read, and the write counter once for every 16 bytes written.

However, when executing this code:

    for (int j = 0; j < 10000000; j ++)
      dst[j] = 0;

the write counter is incremented by 2502540, thus once for every 32 bytes written.
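
The bytes-per-increment figures above can be reproduced with a small calculation. This is a sketch that assumes `dst` holds 8-byte elements (the post does not state the element type, but the quoted 64, 16, and 32 bytes imply it); `bytes_per_count` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: average number of bytes moved per counter increment.
 * elements      - number of array elements touched by the loop
 * element_size  - size of one element in bytes (assumed to be 8 here)
 * counter_delta - observed increment of the performance counter */
static double bytes_per_count(uint64_t elements, uint32_t element_size,
                              uint32_t counter_delta)
{
    assert(counter_delta > 0);  /* avoid division by zero */
    return (double)(elements * element_size) / (double)counter_delta;
}
```

With 10000000 elements of 8 bytes each, the three measured increments (1252042, 5002263, and 2502540) give roughly 64, 16, and 32 bytes per increment, respectively.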


I would have assumed that each 8-byte read/write would count as one, but this is clearly not the case.  Can somebody please explain to me what these counters are really counting?  And how can I count the total number of bytes transferred?


Thanks,  John

  • Hi John Romein,

    Have you by any chance gone through the document below, which gives detailed information on the PERF_CNT_1 and PERF_CNT_2 registers?
    http://www.ti.com/lit/ug/spruhn7b/spruhn7b.pdf

    ( section : 2.19 Performance Monitoring)

    Regards,

    Shankari

  • Actually, I used a slightly older document (sprugv8d), but the latest revision provides the same information. Table 2-12 (spruhn7b) mentions "total reads" and "total writes", but I do not understand how these can be used to determine the number of bytes read/written. Simply multiplying by 8 (the bus width) is wrong, as the examples mentioned above read/write much more data than the counters indicate. So how many bytes are read/written in one read/write action?

    Thanks, John
  • John
    Can you share your L1D and L2 configuration in your test setup, and whether EMIF/DDR was marked cacheable and, if so, whether it was marked write-back or write-through?

    Regards
    Mukul
  • Dear Mukul,

    I use the OpenCL setup, so the L2 SRAM is split into 768 KB software-managed memory, 128 KB reserved, and 128 KB hardware-managed cache. The tests described above were done with a 32 KB hardware-managed L1D cache (but if it matters: I sometimes change it to 28 KB software-managed + 4 KB hardware-managed cache). I did not change the cacheability settings; I assume the default is write-back (but I did not verify this).

    Thanks, John
  • Thanks
    Is it possible for you to share your actual L1DCFG, L2CFG and MAR register settings (EMIF cacheable)?
    The power-on default settings have L1D cache on, L2 cache off, and EMIF non-cacheable.
    I am not an expert on OpenCL, but my assumption is that you wouldn't be working with those default states.

    There is a "simpler" explanation of the performance counters in an older device TRM (SPRUH77A), but the topology on K2H is more complicated.

    In general (as explained to me by the chip architects):

    For requests from the DSP, all requests are less than or equal to one DDR burst for a 64-bit DDR (assuming you are using 64-bit DDR). The counter increments might depend on cache settings and compiler settings, which in turn determine how the C66x DSP merges data when issuing requests to DDR.
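
For reference, the size of one DDR burst can be worked out from the bus width and the burst length. The sketch below assumes a 64-bit data bus and the standard DDR3 burst length of 8; `ddr_burst_bytes` is a hypothetical helper name used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes transferred by one DDR burst.
 * bus_width_bits - width of the DDR data bus in bits (assumed 64 here)
 * burst_length   - number of beats per burst (8 for standard DDR3) */
static uint32_t ddr_burst_bytes(uint32_t bus_width_bits, uint32_t burst_length)
{
    assert(bus_width_bits % 8 == 0);  /* bus width must be whole bytes */
    return (bus_width_bits / 8) * burst_length;
}
```

Under these assumptions, `ddr_burst_bytes(64, 8)` evaluates to 64, which is consistent with the read counter in the first test incrementing about once per 64 bytes.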

    For requests from other masters, the counter value will typically increment based on the master's default burst size; however, it also depends on the topology (SCR interconnect) and on the configuration of any bridges between the initiating master and the DDR controller. For example, some bridges might break commands into smaller chunks.

    In your case, once the cache configuration is understood (which determines how the DSP/cache controller sends the requests for your loops to DDR), perhaps we can explain the counter behavior further.

    At a high level, I have always found these counters hard to use because, for other masters, the results are so topology and access side dependent. They were intended for internal testing, but we now document them for completeness of the memory map, and in some circumstances they may be useful too.

    Regards
    Mukul