
C6657 L1P benchmark regarding straight-line code

Hello,

I want to understand the effect of L1P on straight-line code, with no for-loops or if-branches. I expected L1P disabled to be faster than L1P enabled, because every access from L1P to L2SRAM is a cache miss. I benchmarked both settings; the results are below, and L1P enabled is faster. What causes this result? Please see the attached project file. (CCSv6.1.3 / CGTv7.4.16)

L1P enabled: 841 cycles
L1P disabled: 1133 cycles
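
A minimal sketch of one way such a cycle count can be taken, using the C66x time-stamp counter (TSCL, exposed by the TI CGT header c6x.h); bench_kernel is a hypothetical stand-in for the straight-line code under test:

    #include <c6x.h>                  /* exposes the TSCL time-stamp counter */

    extern void bench_kernel(void);   /* hypothetical: the code under test */

    unsigned int bench_cycles(void)
    {
        unsigned int t0, t1;
        TSCL = 0;                     /* any write starts the free-running counter */
        t0 = TSCL;                    /* timestamp before the kernel */
        bench_kernel();
        t1 = TSCL;                    /* timestamp after the kernel */
        return t1 - t0;               /* elapsed CPU cycles (small read overhead) */
    }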

I have read the following document. I understand that some of the cache-miss overhead can be overlapped with dispatch stalls that occur in the fetch pipeline. But I don't think that explains the difference between L1P enabled and disabled, since dispatch stalls also occur when L1P is disabled. Please give me some advice.

TMS320C66x DSP CorePac User Guide (SPRUGW0C) / 2.6 L1P Performance (P.39)

Regards,
Kazu

BenchDSftSat.zip

  • Hi Kazu,

    This has been forwarded to the corresponding experts. Their feedback will be posted directly here.

    Best Regards,
    Yordan
  • Interesting post.

    Would you please disable the prefetch as well? Look at section 4.4.5 of the C66x CorePac User Guide (http://www.ti.com/lit/ug/sprugw0c/sprugw0c.pdf) and disable the prefetch for the memory region where the program code resides. Then repeat the experiment.
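
    A minimal sketch of what that would look like, assuming the MAR base address and PFX bit position from sprugw0c (verify both for your device; note that the MARs covering CorePac-internal addresses are read-only):

        /* Sketch: clear the PFX (prefetchable) bit in the MAR register that
         * covers the 16MB region holding the code. Base address and bit
         * position taken from sprugw0c; verify them for your device. */
        #include <stdint.h>

        #define MAR_BASE 0x01848000u                 /* XMC MAR0 address (assumed) */
        #define MAR(n)   (*(volatile uint32_t *)(MAR_BASE + 4u * (n)))
        #define MAR_PFX  (1u << 3)                   /* PFX: prefetchability bit */

        static void disable_prefetch(uint32_t code_addr)
        {
            MAR(code_addr >> 24) &= ~MAR_PFX;        /* one MAR per 16MB region */
        }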

    Let's see if the prefetch for the cache is what causes the difference.

    Please report back here

    Thanks

    Ran

  • Hi Ran,

    ran35366 said:
    Would you please disable the prefetch as well? Please report back here.

    Thank you for your reply. Here are my results. My program is mapped in L2SRAM. Reading MAR0 gives the value 0x01; the PFX bit for prefetch is 0, and I cannot set it to 1. I found the following description in the CorePac UG, so I think the prefetch functionality does not apply to a program mapped in L2SRAM. Please give me advice.

    Prefetch support in XMC aims to reduce the read miss penalty for streams of data entering C66x CorePac. Hence prefetching helps to reduce stall cycles and thereby improves memory read performance to MSMC RAM and EMIF.

    Regards,
    Kazu

  • Well, I reproduced your results. As I said, this is very interesting and I need to consult with others. Give me a week or so, and if I do not answer by then, ping me again (ask for an update in this post).

    Ran
  • Kazu Kon

    I am not sure why we see these results. The only theory that I have involves the bandwidth management within the core itself.
    Look at chapter 8.2 of www.ti.com/.../sprugw0c.pdf. My theory is that, since you read data from L2 as well, when the L1P cache is enabled it has higher priority than fetching instructions from L2. But this is only a theory.

    Does it make sense to you?

    Ran
  • Hello Ran,

    Thank you for your reply. I have two small changes for my test program. Please see attached file.

    I warm some global variables into the L1D cache before the benchmark starts, by reading the data once. I believe the L1D controller therefore does not access L2SRAM during the benchmark.

    I also reduced the number of data accesses: the data is now initialized only once, before the benchmark starts. I think this makes a cleaner L1P benchmark. With this change, there is no compiler-optimized code left in the assembly file.
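
    A minimal sketch of the warm-up idea, with a hypothetical buffer; the point is only that every line is read once before the timed region starts:

        /* Sketch: touch the benchmark's data once so it is resident in L1D
         * before timing begins (buffer name and size are hypothetical). */
        volatile int sink;

        void warm_l1d(const int *buf, unsigned n)
        {
            unsigned i;
            int acc = 0;
            for (i = 0; i < n; i++)
                acc += buf[i];        /* each read pulls its line into L1D */
            sink = acc;               /* keep the loop from being optimized away */
        }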

    The results are essentially unchanged. Could you give me some advice?

    L1P enabled: 776 cycles
    L1P disabled: 1057 cycles

    Regards,
    Kazu

    7776.BenchDSftSat.zip

  • I do not have another theory, but I have a suggestion: configure the L1P to the minimum cache size (I believe this is 4K) and see if you still get the speed-up.

    Then structure your code placement to achieve the maximum efficiency. Your L1P SRAM will be only 28KB, but it may be enough for whatever code you want to put there.
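
    A minimal sketch, assuming the L1PCFG register address from sprugw0c (the CSL cache API offers an equivalent call):

        /* Sketch: shrink the L1P cache to 4KB by writing the L1PCFG mode
         * field; the rest of L1P then behaves as addressable SRAM. */
        #include <stdint.h>

        #define L1PCFG (*(volatile uint32_t *)0x01840020u)   /* address assumed */

        static void set_l1p_cache_4k(void)
        {
            L1PCFG = 1;        /* mode: 0 = off, 1 = 4K, 2 = 8K, 3 = 16K, 4 = 32K */
            (void)L1PCFG;      /* read back so the mode switch completes */
        }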

    Ran
  • Hello Ran,

    Thank you for your reply. The result does not change at all when I set L1PCFG from 4 to 1, because the text section mapped to L2SRAM is less than 4KB; the map file shows a code size of 4032 bytes (0xfc0).

    In any case, what I want to understand is not an optimization technique based on code placement but the architectural difference between L1P enabled and disabled. Could you advise me on why I get these results?

    Regards,
    Kazu

  • As I said, my theory is that the cache move has higher priority than a regular instruction fetch, but if you want a deeper (better?) answer, the people who designed the hardware should answer. I will forward this posting to that team.

    Ran

  • Hello Ran,

    Thank you for your kind support. I'm not sure whether the phenomenon is caused by the priority you described or simply by latency. I believe the only thing accessing L2SRAM in my test program is the L1P controller. I want to know the difference in cycles between an L1P cache miss and a direct L2SRAM access. I will wait for the hardware team.

    Regards,
    Kazu

  • Hi Kazu-san
    This is still under investigation.
    We worked with the design team to look at this further, and so far their simulations show no difference in performance between the two use cases, which is what they were expecting.

    We are now investigating why you see the difference in the application/bench setup (we were able to reproduce your results too). We are checking whether the compiler is still doing some optimizations or changes even when no optimization options were chosen, and we are also looking at whether there are any issues in the way we do the time stamping for the cycle count.

    We will keep you posted if we have more updates on this.


    Regards
    Mukul
  • Hi Mukul,

    I appreciate your continued support. Thank you.

    Regards,
    Kazu

  • Hi Kazu

    This is my theory. I am not sure that this is the way the core works, but it may explain what we see.

    The way I understand the fetch mechanism is the following. The fetch packet is always 32 bytes (256 bits); this is the program bus width. The execute packet is however many instructions execute in parallel, which can be one instruction (4 bytes), two, three, and so on up to eight.

    Now, my guess is that the mechanism controlling this has a 64-byte FIFO-like buffer, holding two fetch packets. As the core executes, instructions are read from the buffer; when eight or fewer instructions remain in the buffer, a new fetch packet is fetched from memory into it.

    Now, what if the fetch packet is not aligned on a 32-byte boundary?

    Because the L1P controller reads aligned 256-bit words, it must read twice to load a misaligned 256 bits into the buffer. So assume the fetch packets are aligned on 16 bytes rather than 32 bytes, starting at address P + 16.

    For the first fetch packet, it reads address P (for the first 16 bytes) and address P + 32 (for the last 16 bytes).

    The second fetch packet is read from address P + 32 (for the first 16 bytes) and from address P + 64 (for the last 16 bytes).

    And so on and so forth

    So if the cache is enabled, the second fetch reads P + 32 from the cache rather than from L2 memory, and only P + 64 is read from memory; the next fetch packet's first read is then already in the cache.
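
    A small sketch of the address arithmetic, with hypothetical addresses (P = 0):

        /* Each fetch packet starting at P + 16 straddles two aligned 32-byte
         * program lines, so it needs two aligned reads, and consecutive
         * packets share one line - the line L1P can supply on a hit. */
        #include <stdio.h>

        int main(void)
        {
            unsigned p = 0x10;                  /* first packet at P + 16 */
            int i;
            for (i = 0; i < 4; i++, p += 32) {
                unsigned lo = p & ~31u;         /* line for the first 16 bytes */
                unsigned hi = (p + 16) & ~31u;  /* line for the last 16 bytes */
                printf("fetch packet %d: lines 0x%02X and 0x%02X\n", i, lo, hi);
            }
            return 0;
        }

    Running it shows that lines 0x20, 0x40, and 0x60 each appear in two consecutive packets: without the cache they are read from L2 twice, while with the cache the second read is a hit.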

    Think about it and tell me if it makes sense to you

    Ran

  • Hi Ran,

    I appreciate your advice.

    ran35366 said:
    So assume the fetch packets are aligned on 16 bytes rather than 32 bytes, starting at address P + 16.

    For the first fetch packet, it reads address P (for the first 16 bytes) and address P + 32 (for the last 16 bytes).

    The second fetch packet is read from address P + 32 (for the first 16 bytes) and from address P + 64 (for the last 16 bytes).

    And so on and so forth

    So if the cache is enabled, the second fetch reads P + 32 from the cache rather than from L2 memory, and only P + 64 is read from memory; the next fetch packet's first read is then already in the cache.

    Think about it and tell me if it makes sense to you

    I have illustrated your theory in the attached file. Is my understanding correct? Note, however, that since the program is straight-line code built with no optimization options, there are in fact no parallel instructions at all.

    Regards,
    Kazu

    fetch_packet.xlsx

  • Kazu


    I looked at your file. Thank you so much; this is exactly what I was trying to convey. You understood it and illustrated it very well, and I will use your file if I need to explain this in the future.

    By the way, it does not matter whether there are parallel instructions or not. What matters is that all fetch packets are 32 bytes, and the alignment of those packets.

    Ran

  • Hello Ran,

    Thank you so much for your support. Could you tell me about the case of low parallelism? Please see the attached file. The blue, red, and green blocks are execute packets of three to four parallel instructions each. I understand that the red packet is read twice from memory. Is the green packet read directly from the FIFO, without being read twice from memory?

    Regards,
    Kazu

    fetch_packet2.xlsx

  • Some comments -

    First, in your drawing, where you say byte (first byte, last byte) you mean word (32 bits).

    Second, the system does not insert NOPs into memory; it just executes NOPs.

    The last stage should be the execute packet, not the fetch packet; the execute packet contains the NOPs that are generated by the hardware.

    And yes, my understanding is that the green execute packet will be fetched from memory only once, when the second fetch packet is loaded.

    Now, I am not a hardware guy, and this is only my theory. Mukul may have a better theory.

    Best regards

    Ran

  • Hello Ran,

    Thank you for your comment, and I apologize for any confusion. I'll redraw my diagram and post it again.

    Regards,
    Kazu

  • Hello,

    Since I'm always looking for the best performance, I am intrigued by this discussion. I work with the C6678, but I have verified that the behaviour, in terms of clock count, is the same.

    I investigated the problem using the CCS Hardware Trace Analyzer on the C6678 EVM.

    I'm not sure if this really adds something to the discussion, but I see that the "L1P disabled" run, compared with the "L1P enabled" run, pays a penalty of 7 to 8 cycles due to a pipeline stall at the instruction that precedes a BNOP. For instance, at the line "temp_a = (check_a > (INT32)0x0000000000000000L) ? (INT32)0x000000007FFFFFFFL : temp_a;":

    Delta  Cycle  Trace status              Stall event name       ASM                      Notes
    1      30                                                      STW.D2T2 B4,*B15[2]
    8      31     Stall (1 pipeline stall)  L1P Stall (L1P Miss)   STW.D2T2 B4,*B15[2]      same instruction as previous cycle, stalled for 8 cycles
    1      39                                                      MV.L2 B4,B0
    4      40                                                      [B'] BNOP.S1 0xXXXX,3    4 delta cycles due to branch + 3 NOPs
    ...

    When L1P is enabled, the stall is not present.

    It seems to me that the behaviour is not related to the fetch packets, which, in my understanding of the documentation, are always aligned to 32 bytes (execute packets can be misaligned, but not in this case, since each consists of a single instruction).

    The total number of stalls at the BNOPs corresponds to the total difference in execution cycles between the cached and non-cached runs.

    It seems that, even when disabled, L1P continues to play an active role in the processing of branches.

  • Hello Alberto,

    I appreciate your helpful information. I had thought that branch instructions, including BNOP, would cause a pipeline stall regardless of whether the L1P cache is enabled or disabled. But according to your report, the stall does not occur when the L1P cache is enabled. It seems as if L1P is working like a prefetch.

    Regards,
    Kazu

  • Hello Ran,

    Thank you for your document. I understand. I really appreciate your help.

    Regards,

    Kazu