Cache coherence operations taking a long time to complete

One of the developers has been having intermittent issues with the performance of the C6678 and, on analysis, a contributor (or, possibly, a symptom) is that sometimes the cache coherence operations are taking an astonishing amount of time to complete. The case that I'm looking at has core 1 performing two cache coherence operations involved in receiving a message from the sRIO peripheral (one on the free memory being added back onto the FDQ and the other on the descriptor).

Both are WB/Inv and the first is of 268 bytes in DDR3 and takes ~15us; the second is of 40 bytes in MSMC shared memory and takes ~12us!

For all of this time, four of the other cores are idle (for most of it, only one other core is non-idle!). Core 0 is also performing cache operations at this time: invalidates of just over 4K of MSMC memory where an inbound DirectIO transfer from the sRIO peripheral is expected (the transfer takes longer than expected to turn up; it is not yet clear whether this is due to the same root cause as the cache operation times or to congestion elsewhere in the sRIO fabric).

Can anyone suggest why the cache operations are taking so long and, more importantly, how we can ensure that they perform better in the future?

TIA,

SPH.

  • SPH,

    Can you provide what exact commands are being given for the coherence operation?

    What else is going on in terms of data traffic on the device?

    What are you using to time this operation, and what are you using to indicate the start and stop points?

    Best Regards,

    Chad

  • Hi Chad,


    Sorry for the delay; I've been travelling and haven't had time to sit down and reply to your questions.

    >> Can you provide what exact commands are being given for the coherence operation?

    >> What are you using to time this operation, and what are you using to indicate the start and stop points?

    I am doing the following sequence for each WB/Inv:

    1. Mask interrupts (clear GIE in CSR)
    2. Store a TSCL snapshot, which is what I use to time the operation, along with the pointer and number of bytes for the WB/Inv
    3. Write '1' to the INV bit of X_PF_CMD to invalidate the XMC prefetch buffer
    4. Write the pointer to the L2_WI_BAR register
    5. Round the byte count up to a whole number of words and write it to the L2_WI_WC register
    6. Double MFENCE to wait for the operation to complete
    7. Restore interrupts (replace GIE in CSR with the value prior to step 1)

    In terms of the timing, there are two calls to this function (back-to-back) followed by a write to a QMSS register before TSCL is stored again. There are no interrupts in this sequence. The times I mention are between the 3 TSCL timestamps that have been stored.
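
    For reference, the sequence in code is roughly the following. This is a simplified sketch: the register addresses are as I read them from the CorePac and XMC user guides (so please double-check them), the function name is just for illustration, and I've left out the logging of the pointer/byte count that goes alongside the timestamp.

    #include <c6x.h>      /* TSCL cregister, _disable_interrupts(), _mfence() */
    #include <stdint.h>

    #define L2WIBAR (*(volatile uint32_t *)0x01844010) /* L2 WB/INV base address (verify) */
    #define L2WIWC  (*(volatile uint32_t *)0x01844014) /* L2 WB/INV word count (verify)   */
    #define XPFCMD  (*(volatile uint32_t *)0x08000300) /* XMC prefetch command (verify)   */

    uint32_t timedWbInvL2(void *ptr, uint32_t byteCnt)
    {
        uint32_t csr, t0, t1;

        csr = _disable_interrupts();     /* 1. clear GIE, remember the old CSR    */
        t0  = TSCL;                      /* 2. timestamp the start                */

        XPFCMD  = 1;                     /* 3. invalidate the XMC prefetch buffer */
        L2WIBAR = (uint32_t)ptr;         /* 4. base address of the block          */
        L2WIWC  = (byteCnt + 3) >> 2;    /* 5. byte count rounded up to words     */

        _mfence();                       /* 6. double MFENCE: stall until the     */
        _mfence();                       /*    coherence operation has completed  */

        t1 = TSCL;
        _restore_interrupts(csr);        /* 7. restore GIE                        */

        return t1 - t0;                  /* elapsed cycles for this WB/Inv        */
    }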

    >> What else is going on in terms of data traffic on the device?

    As far as I can see, the only other activity on the device at this time is as described in the 3rd para of my post. Are there any particular forms of activity that could cause this type of behaviour?

    Cheers,

    SPH.

  • Ok, the sequence appears to be ok.  

    That said, if there's other activity going on, the MFENCE will block until that activity has completed and thus give you a longer-than-expected time.

    It sounds like you're using the QMSS; if it is actively moving data, it could potentially hold off the CorePac until all of that activity has completed, and thus effectively extend the time of the WB/INV process.

    What are your UCARBx and CPUARBx register values set to?  You may want to change the UCARBU MAXWAIT to 2 or 4 and see what kind of improvement you get; if something is blocking the coherence traffic, this should guarantee that much more of it gets through.
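
    Something along these lines is what I have in mind. The register addresses are as I read them from the CorePac UG bandwidth-management section, and the MAXWAIT field is assumed to sit in the low bits, so please verify both against the document before trying it.

    #include <stdint.h>

    /* Addresses and field position are assumptions from the CorePac UG -- verify. */
    #define CPUARBU (*(volatile uint32_t *)0x01841000) /* CPU arbitration, UMC (L2)  */
    #define UCARBU  (*(volatile uint32_t *)0x0184100C) /* user coherence arb, UMC    */
    #define CPUARBD (*(volatile uint32_t *)0x01841040) /* CPU arbitration, DMC (L1D) */
    #define UCARBD  (*(volatile uint32_t *)0x0184104C) /* user coherence arb, DMC    */

    /* Lower the MAXWAIT for user coherence requests (e.g. to 2 or 4) so that a
     * queued coherence command can pre-empt long bursts of other traffic sooner. */
    static inline void setCoherenceMaxwait(uint32_t maxwait)
    {
        UCARBU = (UCARBU & ~0x3Fu) | (maxwait & 0x3Fu); /* MAXWAIT assumed in bits 5:0 */
        UCARBD = (UCARBD & ~0x3Fu) | (maxwait & 0x3Fu);
    }

    Reading back CPUARBU/CPUARBD and UCARBU/UCARBD before and after the change will also tell us what the defaults actually are on your device.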

    Best Regards,

    Chad

  • Hi Chad,


    I've had a bit more of a look at the bandwidth management features of the C6678 - currently I have left them all at their default values. In this case, all of L2 is configured as cache, I believe, so would I be right in thinking that it is the MDMAARBU register that is important? (The UCARBU and CPUARBU appear to relate to use of L2 as memory, no?)

    While I can certainly try increasing the priorities in MDMAARBU, I'm not sure why it should be necessary. Given that sRIO only allows a maximum of 16Gbps, and it is only another core performing cache operations of its own that can be contending with this, the TeraNet should be providing sufficient bandwidth that this sort of delay shouldn't be happening due to TeraNet congestion, as far as I can see. What am I missing here?

    Thanks for your help,

    SPH

  • You'll probably see more effect from modifying the CPUARBU register.

    The accesses that would be getting blocked, and causing the core to stall, are requests from the CPU; to change their priority and MAXWAIT you'd modify the respective CPUARB registers.  The MDMAARB registers are going to change the priority of other IP masters.

    I'm not sure what else may be going on, but I'd like to see the impact of reducing the maxwait, and increasing the priority on CPUARBU.

    Best Regards,

    Chad

  • Ah, this is where I think that I might be missing something. I am assuming that the core is being stalled on the MFENCE. I also assume that this is stalling until (and only until) the writeback to DDR3 or MSMC memory for all affected cache lines is complete.

    According to the CorePac UG, "The default values of CPUARB, IDMAARB, SDMAARB, and UCARB are sufficient for most applications. These registers define priorities that are internal to the C66x CorePac". Since we are not using IDMA, L1 and L2 are entirely cache (so no SDMA accesses, I assume) and the CPU is stalled on the MFENCE, my assumption is that there is no contention here (the only traffic going on is the user coherence operation) and the issue must be about getting the data out of the CorePac. Again, according to the UG, "The MDMAARBU register defines priority for MDMA transactions outside of the C66x CorePac".

    Could you explain which of my many assumptions is wrong?

    Many thanks for your help in getting to the bottom of this,

    SPH

  • Sorry, I was thinking about another thread when I wrote that reply.  It should be UCARB for the coherence operation, which you probably want to make sure is elevated so the coherence operations get done efficiently.  The MFENCE could also be blocked by other activity, which would be governed by MDMAARB, but as you mentioned, the device doesn't sound like it's highly active.

    I'd focus on changing and testing the UCARBU first.

    Best Regards,
    Chad

  • Hi Chad,

    This is a rare occurrence - I've only seen it (such that it causes a real-time deadline to be missed and the system to fail) a couple of times in the last month or more across all our systems, though I suspect that it is happening more frequently than that without causing a fatal error in our system. As such, I really need to understand further what could be causing this problem before I can start pressing keys on this.

    Can you explain how changing UCARBU will affect the cache operation performance, given the set of steps that I've outlined in a previous post? I think that I must be missing something in my understanding of this register.

    Thanks again,

    SPH

  • SPH,

    Basically, UCARBU sets the MAXWAIT for the coherence operations.  Since it is a coherence operation, it is by default set to the highest priority level, but there could be other items at the same priority level going on.  Changing the MAXWAIT to a smaller value, so that the coherence activity can interrupt large blocks of other transfers, can reduce the blocking.

    Until the coherence operations are done, the CPU will be stalled.  Here is a list of things the MFENCE stalls for.

    The MFENCE instruction is a new instruction introduced on the C66x DSP. This instruction will create a DSP stall
    until the completion of all the DSP-triggered memory transactions, including:
    • Cache line fills
    • Writes from L1D to L2 or from the CorePac to MSMC and/or other system endpoints
    • Victim write backs
    • Block or global coherence operations
    • Cache mode changes
    • Outstanding XMC prefetch requests

    If there's contention for the DDR3 from other masters, changing the MAXWAIT should help improve it.  Lowering the MDMAARBU priority should help as well.
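
    For the MDMAARBU side, a rough sketch of what I mean is below. The address and the PRI field position are my reading of the CorePac UG, and I'm assuming the usual KeyStone convention that a lower numeric priority value wins arbitration, so please verify before trying it.

    #include <stdint.h>

    #define MDMAARBU (*(volatile uint32_t *)0x01841010) /* MDMA arbitration, UMC (verify) */

    /* Adjust the priority of the CorePac's MDMA transactions out to MSMC/DDR3.
     * PRI is assumed to be bits 2:0; 0 is the highest priority, 7 the lowest.   */
    static inline void setMdmaPriority(uint32_t pri)
    {
        MDMAARBU = (MDMAARBU & ~0x7u) | (pri & 0x7u);
    }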

    It's hard to say, what specifically is causing the intermittent (and rare) blocking in your system.

    Best Regards,
    Chad

  • Chad Courtney said:

    Basically, UCARBU sets the MAXWAIT for the coherence operations.  Since it is a coherence operation, it is by default set to the highest priority level, but there could be other items at the same priority level going on.  Changing the MAXWAIT to a smaller value, so that the coherence activity can interrupt large blocks of other transfers, can reduce the blocking.


    Ah, so you are saying that changing the MAXWAIT in UCARBU can change the priority used on the CorePAC's MDMA port? That is certainly not clear in the documentation, which seems to suggest that the UCARBU register only affects things happening internally to the CorePAC!

    Cheers,

    SPH

  • No, it's not changing the priority, as it already has the highest priority.  I'm saying that if it's being blocked by something else of equal priority (which is possible if that something has large blocks), then reducing the MAXWAIT will prevent it being blocked for longer than the MAXWAIT value.

    This is described in the C66x CorePac UG.

    Best Regards,

    Chad

  • Thank you Chad. Yes, I have read the section on bandwidth management in the UG (hence my quotes above) and, given my description of our system's use of L2, I don't see how this can explain a writeback to MSMC shared memory of 40 bytes (cache-line aligned - so a single cache line) taking ~12us.

    I don't mind making changes to our codebase, but I would first like to come up with a hypothesis regarding what is going on. This can then be tested, the fix applied and retested, so I can be sure that the change has really fixed the problem. As yet, I'm not sure what you are suggesting is causing this enormous delay that could be fixed by MAXWAIT. The IDMA is not used, the core is sitting in an MFENCE, and all of L2 is deployed as cache, so nothing can be accessing L2 via the SDMA port; I don't see where there is potential for contention. From my understanding, MAXWAIT is only used to elevate priorities in the case of contention and, even then, only within the CorePAC. Please help me to understand where I've got the wrong end of the stick...

    Thanks again,

    SPH

  • Hi Chad,

    Just a quick update on this. One of the developers here recently came across a scenario where a 28-byte writeback/invalidate was taking ~14us almost all the time. Having nothing else to try, I tried changing UCARBD and UCARBU to 1; this resulted in the time dropping to ~6us. This leads me to two questions:

    1. Can you please explain how these values can affect the time taken to write back the cache line, given that the CPU is sitting in an MFENCE, the IDMA is not used and, while this particular core (unlike the initial case) has some L2 configured as RAM, there was no evidence of any access across the SDMA port? Is there any way that this register value can affect the priority used for the MSMC transaction through the MDMA port?
    2. 6us is still a huge number of cycles to burn waiting for a single cache line to write back; what else should I be looking at to further trim this time?

    Thanks for your help,

    SPH

  • Hi SPH,

    Have you tried setting the starvation bound registers (SBND) in the MSMC?
    I don't know whether these registers will clear your problem, but I hope they will.

    Please look at the MSMC User Guide (sprugw7a), page 21,
    section 2.3.3 "MSMC Bandwidth Management", for the details.

    best regards,
    g.f.

  • Hi g.f.,

    Thanks for the idea. I'll get the developer to try it out when he has a moment and see what impact it has.

    Cheers,

    SPH

  • Hi g.f.,


    I got the developer to try this but, unfortunately, it made no difference. So, I'm still in the situation where it takes 6us to write back 28 bytes. This is now, at least, faster than the 8-bit micros that I was playing with at school in the mid-80s... but only just.

    Does anyone have any other ideas of things that might improve the situation?

    SPH.

  • Hi Chad,

    Chad Courtney said:

    Sorry, I was thinking about another thread when I wrote that reply.  It should be UCARB for the coherence operation, which you probably want to make sure is elevated so the coherence operations get done efficiently.  The MFENCE could also be blocked by other activity, which would be governed by MDMAARB, but as you mentioned, the device doesn't sound like it's highly active.

    I'd focus on changing and testing the UCARBU first.

    I see that you have suggested this as an answer. I agree that it has resulted in a speed-up of the cache operations, and I am certainly grateful for that. However, 6us to write back a single cache line is still astonishingly slow, so while it is an improvement and possibly part of the solution, I can't really view it as a complete answer.

    In general we are getting quite disillusioned with the C6678 - it hasn't really given us the performance we were expecting - and the single biggest factor in compromising its performance is the cost of the cache coherence operations (at times - and it's not clear when and why they sometimes take soooo long).

    If I am able to understand why they are taking so long (and, ideally, have a means of speeding them up!), then I would say that I have an answer. Of course, whether it is the answer that I want is another matter...

    Cheers,

    SPH