TMS320C6657: MFENCE for atomic operations

Daisuke Maeda

Guru 20245 points

Part Number: TMS320C6657
Other Parts Discussed in Thread: TMS320C6678

Our customer wants to keep the order of accesses in an atomic operation.

Should the MFENCE instruction (_mfence() in Compiler Intrinsics) be used to wait for completion of writes to memory-mapped registers (except CorePac) and data spaces (except L1D and L2)?

Best regards,

Daisuke

over 5 years ago

0 lding over 5 years ago

TI__Guru* 95265 points

Hi,

Yes, see:

TMS320C66x DSP
Literature Number: SPRUGH7
November 2010
CPU and Instruction Set

3.8.12.1 MFENCE Restrictions
The MFENCE instruction is a new instruction introduced on the C66x DSP. This
instruction will stall until the completion of all the CPU-triggered memory
transactions, including:
• Cache line fills
• Writes from L1D to L2 or from CorePac to MSMC and/or other system
endpoints
• Victim write backs
• Block or global coherence operations
• Cache mode changes
• Outstanding XMC prefetch requests
To determine if all the memory transactions are completed, the MFENCE instruction
checks an internal busy flag. MFENCE always waits at least 5 clock cycles before
checking the busy flag in order to account for pipeline delays.

Also, see the advisory 27 in Errata:

TMS320C6678
Multicore Fixed and Floating-Point Digital Signal Processor
Silicon Errata

The MFENCE instruction is used to stall the instruction fetch pipeline until the
completion of all CPU-triggered memory transactions.
Under very particular circumstances, MFENCE may allow the transaction after the
MFENCE to proceed before the preceding STORE completes.

Workaround: Replace a single MFENCE with two MFENCEs back to back. This remedies the issue
by resuming the stall in the case where the memory system prematurely indicated that
it was idle when STORE_A passed from L1D to L2.
1. STORE_A
2. MFENCE
3. MFENCE
4. TRANSACTION_B
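
The four steps above can be sketched in C roughly as follows. This is only an illustration under assumptions: double_mfence() and store_then_transaction() are hypothetical helper names, _TMS320C6600 is assumed to be the macro the TI compiler predefines for C66x targets, and the host-side counter is just a stand-in so the sketch also builds off-target.

```c
#include <stdint.h>

#if defined(_TMS320C6600)
#include <c6x.h>                      /* assumed to declare the _mfence() intrinsic */
#define FENCE_ONCE() _mfence()
#else
static unsigned fence_calls;          /* host-only stand-in: counts fence requests */
#define FENCE_ONCE() ((void)++fence_calls)
#endif

/* Hypothetical helper: two back-to-back MFENCEs, as in the errata workaround. */
static inline void double_mfence(void)
{
    FENCE_ONCE();
    FENCE_ONCE();
}

/* Hypothetical helper showing the STORE_A / MFENCE / MFENCE / TRANSACTION_B order. */
static void store_then_transaction(volatile uint32_t *store_a, uint32_t value,
                                   volatile uint32_t *transaction_b)
{
    *store_a = value;       /* 1. STORE_A */
    double_mfence();        /* 2. and 3. MFENCE, MFENCE */
    *transaction_b = 1u;    /* 4. TRANSACTION_B, issued only after the stall ends */
}
```

On the DSP the two pointers would be the real endpoint addresses; the helper only fixes the order of the steps.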

Regards, Eric

0 Daisuke Maeda over 5 years ago in reply to lding

Guru 20245 points

Hi Eric-san,

Thank you for your reply.

Does the compiler add MFENCE instructions to keep the order of accesses to variables that are declared volatile?

Example code:

unsigned int myBuff[3];

int myFunc(void)
{
     volatile unsigned int *ioCtrl_0 = (volatile unsigned int *)0x00000000;
     volatile unsigned int *ioCtrl_1 = (volatile unsigned int *)0x10000000;
     volatile unsigned int *ioCtrl_2 = (volatile unsigned int *)0x20000000;

     /* Beginning of atomic operation */
     *ioCtrl_0 = 0;
     *ioCtrl_1 = 1;
     *ioCtrl_2 = 2;

     myBuff[0] = *ioCtrl_0;
     myBuff[1] = *ioCtrl_1;
     myBuff[2] = *ioCtrl_2;
     /* End of atomic operation */

     return 0;
}

Best regards,

Daisuke

0 lding over 5 years ago in reply to Daisuke Maeda

TI__Guru* 95265 points

Daisuke,

No, you need to add the _mfence() yourself in the code.
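
A rough sketch of what adding the fences by hand could look like for the three-register sequence in the question. It assumes each write has to complete before the next access is issued; if only the final completion matters, one double fence after the last write would be the equivalent placement. write_and_wait() and ordered_sequence() are hypothetical names, the registers are passed in as pointers so the sketch does not depend on real addresses, and the host-side counter only stands in for the intrinsic when building off-target.

```c
#include <stdint.h>

#if defined(_TMS320C6600)             /* assumed TI predefined macro for C66x */
#include <c6x.h>                      /* assumed to declare the _mfence() intrinsic */
#define FENCE_ONCE() _mfence()
#else
static unsigned fence_calls;          /* host-only stand-in: counts fence requests */
#define FENCE_ONCE() ((void)++fence_calls)
#endif

/* Hypothetical helper: write one register, then stall until the write completes. */
static void write_and_wait(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
    FENCE_ONCE();                     /* double MFENCE per the errata workaround */
    FENCE_ONCE();
}

/* volatile keeps the compiler from reordering or dropping these accesses,
 * but the fences are what wait for the memory system to finish each write. */
static void ordered_sequence(volatile uint32_t *io0, volatile uint32_t *io1,
                             volatile uint32_t *io2, uint32_t out[3])
{
    write_and_wait(io0, 0u);
    write_and_wait(io1, 1u);
    write_and_wait(io2, 2u);

    out[0] = *io0;
    out[1] = *io1;
    out[2] = *io2;
}
```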

Regards, Eric

0 Daisuke Maeda over 5 years ago in reply to lding

Guru 20245 points

Hi Eric-san,

Thank you for your reply.

I understand that the MFENCE-related advisories in the Errata, i.e. Advisory 9 and Advisory 13, apply only when both L1D and L2 are configured as cache. If L2 is configured as SRAM, MFENCE is NOT required for L1D memory block and global coherence operations, and Single MFENCE can be used to wait for completion of writes.

Is my understanding correct?

Best regards,

Daisuke

0 lding over 5 years ago in reply to Daisuke Maeda

TI__Guru* 95265 points

Daisuke,

Are we talking about the same errata, http://www.ti.com/lit/er/sprz334h/sprz334h.pdf?

- Advisory 9: DDR3 Automatic Leveling Issue

- Advisory 13: SRIO Messaging in Highly Oversubscribed System Issue

From Advisory 27:

For the following advisory, double MFENCE can also be used as a workaround in addition to the workarounds already listed:
• Advisory 7 - Potential L2 Cache Corruption During Block Coherence Operations Issue
• Advisory 22 - L2 Cache Corruption During Block and Global Coherence Operations Issue

How do you read "If L2 is configured as SRAM, MFENCE is NOT required for L1D memory block and global coherence operations, and Single MFENCE can be used to wait for completion of writes."?

I think it is good practice to use double _mfence() in all scenarios to avoid corner cases. This is what we did in the CSL code, ti\csl\csl_cacheAux.h, which uses double _mfence():

_mfence();
/* Add another mfence to address single mfence issue
* Under very particular circumstances, MFENCE may allow
* the transaction after the MFENCE to proceed before
* the preceding STORE completes */
_mfence();

Regards, Eric

0 Daisuke Maeda over 5 years ago in reply to lding

Guru 20245 points

Hi Eric-san,

Thank you for your reply.

I think we are discussing the same advisories, but I am referring to the C6657 errata, not the C6678 one:

http://www.ti.com/lit/er/sprz381c/sprz381c.pdf

- Advisory 9: L2 Cache Corruption During Block and Global Coherence Operations Issue

- Advisory 13: Single MFENCE Issue

lding said:

I think it is good practice to use double _mfence() in all scenarios to avoid corner cases. This is what we did in the CSL code, ti\csl\csl_cacheAux.h, which uses double _mfence().

I understand that the method is redundant but reliable.

Our customer does not currently use MFENCE for atomic operations and has not seen any problems.

Should they add double MFENCE to their code?

I am concerned that the issue described in the Errata, although rare, will cause a problem later.

Best regards,

Daisuke

0 lding over 5 years ago in reply to Daisuke Maeda

TI__Guru* 95265 points

Daisuke,

Sorry, I looked at the C6678 errata instead of the C6657 one. The advisory content is the same. The issue came from a customer use case many years ago; we found the problem and created the advisory.

Although your customer has not seen any issue so far, it is good to add double _mfence() to avoid a later surprise that takes a lot of effort to debug.

Regards, Eric

0 Daisuke Maeda over 5 years ago in reply to lding

Guru 20245 points

Hi Eric-san,

Thank you for your reply.

I will recommend that our customer add double MFENCE to their code.

Best regards,

Daisuke

0 Daisuke Maeda over 5 years ago in reply to lding

Guru 20245 points

Hi Eric-san,

I have an additional question.

If the CPU writes data to memory and then reads it back, is the following code a correct example of MFENCE usage?

1) Both L1D and L2 are configured as SRAM:

int noncacheable(void)
{
    volatile unsigned int *ptrL2SRAM = (volatile unsigned int *)0x00800000; /* LOCAL L2_SRAM */
    volatile unsigned int *ptrL1DSRAM = (volatile unsigned int *)0x00F00000; /* LOCAL L1D_SRAM */
    volatile unsigned int *ptrSHRAM = (volatile unsigned int *)0x0C000000;   /* Multicore shared Memory */

    /* CPU writes a data to L1D_SRAM and then reads it *
     * The data is written to L1D_SRAM                 */
    *ptrL1DSRAM = 0xBEEFCAFE;
    /***********************************
     * MFENCE instruction not required *
     ***********************************/
    if (*ptrL1DSRAM != 0xBEEFCAFE)
        return -1;

    /* CPU writes a data to L2_SRAM and then reads it *
     * The data is written to L2_SRAM                 */
    *ptrL2SRAM = 0xBEEFCAFE;
    /***********************************
     * MFENCE instruction not required *
     ***********************************/
    if (*ptrL2SRAM != 0xBEEFCAFE)
        return -1;

    /* CPU writes a data to SHRAM and then reads it *
     * The data is written to SHRAM                 */
    *ptrSHRAM = 0xBEEFCAFE;
    _mfence();
    _mfence();
    if (*ptrSHRAM != 0xBEEFCAFE)
        return -1;

    return 0;
}

2) L1D is configured as cache and L2 is configured as SRAM:

int cacheableL1D(void)
{
    volatile unsigned int *ptrL2SRAM = (volatile unsigned int *)0x00800000; /* LOCAL L2_SRAM */
    volatile unsigned int *ptrSHRAM = (volatile unsigned int *)0x0C000000;  /* Multicore shared Memory */

    /* CPU writes a data to L2_SRAM and then reads it                       *
     * On a write hit, the data is written to L1D cache                     *
     * On a write miss, the data is written to L2_SRAM, bypassing L1D cache */
    *ptrL2SRAM = 0xBEEFCAFE;
    /***********************************
     * MFENCE instruction not required *
     ***********************************/
    if (*ptrL2SRAM != 0xBEEFCAFE)
        return -1;

    /* CPU writes a data to SHRAM and then reads it                           *
     * On a L1D write hit, the data is written to L1D cache                   *
     * On a L1D write miss, the data is written to SHRAM, bypassing L1D cache */
    *ptrSHRAM = 0xBEEFCAFE;
    _mfence();
    _mfence();
    if (*ptrSHRAM != 0xBEEFCAFE)
        return -1;

    return 0;
}

3) Both L1D and L2 are configured as cache:

int cacheableL2(void)
{
    volatile unsigned int *ptrSHRAM = (volatile unsigned int *)0x0C000000; /* Multicore shared Memory */

    /* CPU writes a data to SHRAM and then reads it                              *
     * On a L1D write hit, the data is written to L1D cache                      *
     * On a L1D write miss, the data is written to L2 cache, bypassing L1D cache */
    *ptrSHRAM = 0xBEEFCAFE;
    /***********************************
     * MFENCE instruction not required *
     ***********************************/
    if (*ptrSHRAM != 0xBEEFCAFE)
        return -1;

    return 0;
}

If the global address is used instead of the local address to access the local L1D_SRAM or the local L2_SRAM, does the CPU access go outside the CorePac?

volatile unsigned int *ptrL2SRAM = (volatile unsigned int *)0x10800000; /* CorePac0 L2_SRAM */
volatile unsigned int *ptrL1DSRAM = (volatile unsigned int *)0x10F00000; /* CorePac0 L1D_SRAM */

Best regards,

Daisuke

0 Daisuke Maeda over 5 years ago in reply to lding

Guru 20245 points

Hi Eric-san,

I'm sorry for asking so many times.

Is MFENCE required when the CPU writes data and then reads it back, as I asked above?

If it is required, our customer will need to add double MFENCE throughout their code.

As far as I know, MFENCE after a STORE instruction is meant for the case where data stored by the CPU is then used by other system masters.

Best regards,

Daisuke

0 lding over 5 years ago in reply to Daisuke Maeda

TI__Guru* 95265 points

Daisuke,

Sorry for the late response! Your cases cover different L1D and L2 cache combinations as well as direct L1D access, so I need to study this further and will come back to you.

Regards, Eric

0 lding over 5 years ago in reply to lding

TI__Guru* 95265 points

Hi,

Can you explain whether the mfence usage above is for a single-core or a multicore application (the C6657 has two cores)?

A single CPU is coherent with itself when writing/reading memory directly or through its L1 and/or L2 caches, so mfence is not needed.

The requirement for a fence operation is when there is a CPU action that is dependent on the completion of a previous action to a system resource.

E.g. STORE_A goes to an MMR or to some memory through path_A, and Transaction_B is dependent on STORE_A having completed. An example would be writing to a DMA transfer descriptor in memory (STORE_A), then triggering the DMA channel that depends on the descriptors to start (TRANSACTION_B). If the descriptor write were to be blocked for any reason, but the DMA is started, then it will consume a stale or corrupted descriptor.

Another example would be a shared memory in DDR between two cores.

Here STORE_A is the update to shared memory (a direct write, or a write followed by a flush), and TRANSACTION_B is the write to the semaphore that triggers the other CPU. You'd want to gate the notification until you're sure the external memory is updated.
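
As a rough sketch of the descriptor case described above, the shape is: fill in the descriptor, fence, then trigger. Everything named here is hypothetical (the dma_desc_t layout, start_dma(), the trigger register), _TMS320C6600 is assumed to be the TI predefined macro for C66x, and the counter only stands in for the intrinsic when building on a host. The shared-memory and semaphore case follows the same pattern, with the semaphore write in place of the trigger.

```c
#include <stdint.h>

#if defined(_TMS320C6600)
#include <c6x.h>                      /* assumed to declare the _mfence() intrinsic */
#define FENCE_ONCE() _mfence()
#else
static unsigned fence_calls;          /* host-only stand-in: counts fence requests */
#define FENCE_ONCE() ((void)++fence_calls)
#endif

/* Hypothetical descriptor layout, for illustration only. */
typedef struct {
    uint32_t src;
    uint32_t dst;
    uint32_t len;
} dma_desc_t;

/* Hypothetical helper: publish the descriptor, then start the channel. */
static void start_dma(volatile dma_desc_t *desc, volatile uint32_t *trigger_reg,
                      uint32_t src, uint32_t dst, uint32_t len)
{
    /* STORE_A: the descriptor writes the DMA engine will read.
     * If the descriptor sits in cached memory, a cache writeback
     * would also be needed before the fence (not shown here). */
    desc->src = src;
    desc->dst = dst;
    desc->len = len;

    FENCE_ONCE();                     /* double MFENCE per the errata workaround */
    FENCE_ONCE();

    *trigger_reg = 1u;                /* TRANSACTION_B: depends on STORE_A being done */
}
```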

Regards, Eric

0 Daisuke Maeda over 5 years ago in reply to lding

Guru 20245 points

Hi Eric-san,

Thank you for your reply.

lding said:

A single CPU is coherent with itself when writing/reading memory directly or through its L1 and/or L2 caches, so mfence is not needed.

Thanks to you, my concerns have cleared up.

Best regards,

Daisuke
