C66x Reading another Core's L2 inconsistent state

Hi


We are using the TI C6678 processor in our product and are encountering a bug that, for lack of understanding, we are not yet able to fix. Due to IP restrictions we cannot post our original code, so I have created a simple project that illustrates the problem. Our product uses code that has been compiled with optimization turned on (-O3 -ms3), and we only encounter the issue after about 20 minutes. My sample project, however, only demonstrates the issue when optimization is turned off. So the exact scenario might not be duplicated, but the underlying cause may be the same.


The overall architecture of our system is that one core is responsible for communication with a desktop PC (TCP core), one core is responsible for updating a status buffer (MONITOR core), and the other cores perform other tasks and write their results to a circular buffer. The status buffer (written by the MONITOR core) is in the L2 of the MONITOR core. The circular buffer is in DDR3. To guarantee that what is read for the tail value is coherent, we created the following structure:

Structure {
    Systematic row {
        Absolute position - 64 bit value
        Relative position - 32 bit value
        Update counter - 32 bit value
    }
    Redundant row {
        Absolute position - 64 bit value
        Relative position - 32 bit value
        Update counter - 32 bit value
    }
}

The MONITOR core's job is to update the systematic row first (its very first write is to the 64 bit absolute position), then copy the systematic row to the redundant row. The PC reads the whole structure at various / random intervals. When the PC sees non-matching systematic and redundant values, it discards those values and rereads the structure. The ISSUE is that when the PC reads a structure where the systematic and redundant rows are equal, SOMETIMES the relative position is inconsistent with the absolute position. In fact, we see that the SYSTEMATIC and REDUNDANT absolute positions (the two 64 bit values) lag the RELATIVE positions by 1 update, WHEREAS the MONITOR core never puts the structure in such a state (where the ABSOLUTE value lags the RELATIVE value).


When I examine the disassembly from my sample project (and run it in the debugger in assembly step mode), I can see that the MONITOR core writes the 64 bit absolute position first, then writes the 32 bit relative position, etc., in the order we wrote. We can also see that the reader reads (in double words) in the same order. But we can't understand how the REDUNDANT row can be EQUAL to the SYSTEMATIC row while the 64 bit ABSOLUTE value is inconsistent with the RELATIVE value.
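One interleaving worth ruling out needs no cache or hardware effect at all. If the reader copies the fields in the same order in which the writer stores them, the reader's four reads can fall between the writer's stores in such a way that both rows compare equal while the absolute position is one update behind. The sketch below replays that interleaving step by step in a single thread on ordinary memory. The names `Row`, `Shared`, `torn_read`, `rows_match` and `low16_consistent` are hypothetical and only mirror the structure described above. This illustrates one possible cause under those assumptions; it is not a confirmed diagnosis.

```c
#include <stdint.h>

/* Hypothetical mirror of the structure described in the post. */
typedef struct { uint64_t absolute; uint32_t relative, version; } Row;
typedef struct { Row sys, red; } Shared;

/* Replays one specific interleaving of reader (R) and writer (W) steps.
 * Both sides touch the fields in the same order: sys first, then red. */
Shared torn_read(Shared *m, uint32_t inc) {
    Shared snap;
    snap.sys.absolute = m->sys.absolute;  /* R1: sees OLD absolute       */
    m->sys.absolute += inc;               /* W1                          */
    m->sys.relative += inc;               /* W2                          */
    m->sys.version++;                     /* W3                          */
    snap.sys.relative = m->sys.relative;  /* R2: sees NEW relative       */
    snap.sys.version  = m->sys.version;   /*     and NEW version         */
    snap.red.absolute = m->red.absolute;  /* R3: sees OLD red.absolute   */
    m->red.absolute = m->sys.absolute;    /* W4: copy sys -> red         */
    m->red.relative = m->sys.relative;    /* W5                          */
    m->red.version  = m->sys.version;     /* W6                          */
    snap.red.relative = m->red.relative;  /* R4: sees NEW red.relative   */
    snap.red.version  = m->red.version;   /*     and NEW red.version     */
    return snap;
}

/* The reader's acceptance test: systematic row equals redundant row. */
int rows_match(const Shared *s) {
    return s->sys.absolute == s->red.absolute &&
           s->sys.relative == s->red.relative &&
           s->sys.version  == s->red.version;
}

/* The consistency check used by the sample consumer (low 16 bits). */
int low16_consistent(const Shared *s) {
    return (s->sys.absolute & 0xFFFFu) == (s->sys.relative & 0xFFFFu);
}
```

Starting from a state in which every position field holds 0x11 and the version is 1, `torn_read(&m, 0x11)` returns a snapshot where both rows hold absolute 0x11, relative 0x22 and version 2: `rows_match` accepts it, yet the absolute position lags by one update.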

Please note that many minor changes to the attached sample program will "remove" the issue, but possibly only mask the problem, whereas our actual product will still experience the bug. If you compile the sample program, please use CGTool 7.4.8 in debug mode (no optimization). Note that if you change the #if 0 to #if 1 in the generate_pattern function (that is, temporarily assign the pointer to a local stack variable and dereference through that), the problem is masked...


Or if you change the delay() call in the consume_pattern to remove the rand() call or to change it to  rand() & 1 or 0, the problem is masked.


We are trying to understand why the L2 of the MONITOR core (generate_pattern), when read from the TCP core (consume_pattern), is in an inconsistent state.

Best Regards

TestParallel.rar

  • Have you tried a cache writeback of the data? It sounds like the data may be resident in the local core's L1D cache, leaving stale data in the local core's L2 SRAM. It will eventually get evicted (and written back to L2) as the cache is used, but if it is in L1D and modified at the time, it will not be written back to L2 SRAM until it is either evicted or a cache writeback is performed. Also, the core that reads the data needs to invalidate its cache for this location before reading new data. Cache coherence between one CorePac and another CorePac's L1/L2 memory space is not maintained automatically and must be handled manually.

    Please see the C66x Cache User Guide and C66x CorePac to understand how data caching is handled.

    Best Regards,
    Chad
  • CACHE is disabled for the L2 address range, and MONITOR is writing to its L2 through the cache-disabled address range. Since the whole structure is 64 + 32 + 32 + 64 + 32 + 32 = 256 bits = 32 bytes, it fits within one L1D cache line, so if the data were in L1D cache, then the TCP core reading that region would see either ALL VALID data or ALL INVALID data. This structure is aligned to a 128 bit address. CACHE is disabled by the cache initialization function. But yes, we did put cache_wb and cache_inv functions in our product, and they did not fix the issue. Thanks.
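The size arithmetic above can be checked mechanically. The sketch below assumes a 64-byte L1D line, which is the figure given in the C66x cache documentation; the helper name `spans_multiple_lines` is hypothetical. It also shows that 128-bit (16-byte) alignment alone would not be enough to keep a 32-byte structure inside one line, although the address 0x00801440 used later in this thread happens to be 64-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed L1D cache line size in bytes for the C66x. */
#define L1D_LINE_BYTES 64u

/* Returns 1 if [addr, addr + size) crosses a cache line boundary, else 0.
 * size must be greater than zero. */
int spans_multiple_lines(uint32_t addr, size_t size, uint32_t line_bytes) {
    uint32_t first_line = addr / line_bytes;
    uint32_t last_line  = (addr + (uint32_t)size - 1u) / line_bytes;
    return first_line != last_line;
}
```

With these assumptions, a 32-byte structure at 0x00801440 stays within one line, while the same structure at the 16-byte-aligned address 0x00801430 would straddle two lines.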
  • Just to clarify, we have specifically turned OFF the cache-ability of the L2 address block. Also, we have called CACHE_wbL1d, as well as patched the cache functions to execute two MFENCE instructions. They are not solving the issue. The basic question is,

    After executing

    A.a = 0, A.b = 0, (a is 64 bit, b is 32 bit)

    A.a += x
    A.b += x
    A.c++

    B = A

    ===

    Why would A.a and A.b not share the same lower order bits, yet B is still an exact copy of A?
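A commonly used alternative to comparing two copies is a sequence counter that the writer bumps before and after each update, with the reader retrying whenever the counter is odd or has changed during its copy. The sketch below is a host-side illustration using C11 atomics, which the toolchain in this thread may not provide; on the C66x the fences would presumably need to be replaced by whatever barrier the platform offers (for example the MFENCE already used in the sample). The names `SeqRow`, `seq_write_begin`, `seq_write_end`, `seq_read_begin` and `seq_read_retry` are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical row guarded by a sequence counter.
 * seq even: row is stable. seq odd: an update is in progress. */
typedef struct {
    atomic_uint seq;
    uint64_t absolute;
    uint32_t relative;
} SeqRow;

/* Writer: call before modifying the row. Makes seq odd. */
void seq_write_begin(SeqRow *r) {
    atomic_fetch_add_explicit(&r->seq, 1u, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
}

/* Writer: call after modifying the row. Makes seq even again. */
void seq_write_end(SeqRow *r) {
    atomic_fetch_add_explicit(&r->seq, 1u, memory_order_release);
}

/* Reader: sample the counter before copying the row. */
unsigned seq_read_begin(SeqRow *r) {
    return atomic_load_explicit(&r->seq, memory_order_acquire);
}

/* Reader: returns 1 if the copy must be discarded and retried. */
int seq_read_retry(SeqRow *r, unsigned start) {
    atomic_thread_fence(memory_order_acquire);
    return (start & 1u) ||
           atomic_load_explicit(&r->seq, memory_order_relaxed) != start;
}
```

A reader would loop: take `start = seq_read_begin(&row)`, copy the fields, and repeat while `seq_read_retry(&row, start)` returns 1. Unlike the two-row comparison, a copy that straddles an update is rejected regardless of which fields were read before or after the writer's stores.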
  • I was sidetracked to work on something else, but I have recently gotten back to this issue. From what I have experimented with, L1/L2 coherence is not the issue here. In fact, CACHE is DISABLED in the code I have posted. Both cores were accessing the L2 SRAM directly, with the MAR bit set to disabled. I have since modified my sample code to use the local L2 address on Core 3, and the global L2 address (for Core 3) on Core 2:

    SHARED_DATA *shared_write = (SHARED_DATA*)0x00801440; // Core 3 L2, referenced by Core 3
    SHARED_DATA *shared_read = (SHARED_DATA*)0x13801440; // Core 3 L2, referenced by Core 2
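The relationship between the two addresses above can be written as a small helper. This is a sketch assuming the C6678 memory map, in which CorePac n's local memory at 0x00xxxxxx is also visible globally at 0x1nxxxxxx; the function name `l2_local_to_global` is hypothetical.

```c
#include <stdint.h>

/* Converts a CorePac-local address (0x00xxxxxx) into the global alias
 * for the given core, assuming the C6678 layout 0x1n000000 + offset. */
uint32_t l2_local_to_global(uint32_t local_addr, unsigned core) {
    return 0x10000000u | ((uint32_t)core << 24) | (local_addr & 0x00FFFFFFu);
}
```

For example, `l2_local_to_global(0x00801440, 3)` yields 0x13801440, matching the `shared_read` pointer used on Core 2.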

    Keep in mind when examining the result that on Core 3, Absolute is modified before Relative.
    The program is modified so that the function that modifies the shared location is NOT optimized (NOTE that absolute, the 64 bit value, is incremented first):

    #pragma FUNCTION_OPTIONS(add_address64, "--opt_level=off -O0")
    void add_address64(unsigned int *relativeAddress, unsigned long long *absolute, unsigned int inc) {
    (*absolute) += inc;
    (*relativeAddress) += inc;
    int outputSize = 0x80000;
    if ((*relativeAddress) >= outputSize)
    (*relativeAddress) -= outputSize;
    }


    But when Core 2 breaks (breakpoint set inside __DO_WRITE_ADDR_0 function), the value in the relative field is NEWER than the value in absolute field.
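The breakpoint relies on the consumer's check that the low 16 bits of absolute and relative agree. That check is valid across the wrap of relative because the wrap size 0x80000 is a multiple of 0x10000, so subtracting it never changes the low 16 bits. The sketch below replays the update on a host to confirm the invariant; `advance`, `low16_match` and `count_mismatches` are hypothetical names, with `advance` modeled on add_address64.

```c
#include <stdint.h>

#define OUTPUT_SIZE 0x80000u  /* wrap point of the relative position */

/* Same update as add_address64: absolute first, then relative with wrap. */
void advance(uint32_t *relative, uint64_t *absolute, uint32_t inc) {
    *absolute += inc;
    *relative += inc;
    if (*relative >= OUTPUT_SIZE)
        *relative -= OUTPUT_SIZE;
}

/* The consumer's consistency check on the low 16 bits. */
int low16_match(uint64_t absolute, uint32_t relative) {
    return (absolute & 0xFFFFu) == (relative & 0xFFFFu);
}

/* Counts how often the check fails over a number of updates
 * performed by a single writer with no concurrent reader. */
unsigned count_mismatches(unsigned iterations, uint32_t inc) {
    uint64_t absolute = 0;
    uint32_t relative = 0;
    unsigned mismatches = 0;
    unsigned i;
    for (i = 0; i < iterations; i++) {
        advance(&relative, &absolute, inc);
        if (!low16_match(absolute, relative))
            mismatches++;
    }
    return mismatches;
}
```

With an increment of 0x11 the count stays at zero over several wraps, so a mismatch seen by the consumer means it observed the two fields from different updates.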

    Can someone at TI please investigate this?

    Still patiently waiting for answer.

    ================== PROGRAM BELOW =================
    /*
    * main.c
    */

    #include <ti/csl/csl_cache.h>
    #include <ti/csl/csl_cacheAux.h>

    #include <stdlib.h>
    #include <string.h>

    extern __cregister volatile unsigned int DNUM;

    // #define RANDOM_DELAY
    // #define INDIRECT_COPY

    #include <stdio.h>
    #include <time.h>

    #pragma FUNC_CANNOT_INLINE(__DO_WRITE_ADDR_0)
    #pragma FUNCTION_OPTIONS(__DO_WRITE_ADDR_0, "-O0")
    #pragma RETAIN(__DO_WRITE_ADDR_0)
    #pragma NO_HOOKS(__DO_WRITE_ADDR_0)
    void __DO_WRITE_ADDR_0() {
    *((unsigned*)0) = 0;
    printf("");
    }

    typedef struct _SHARED_ROW {
    unsigned long long absolute;
    unsigned int relative, version;
    } SHARED_ROW;

    typedef struct _SHARED_DATA {
    SHARED_ROW sys;
    SHARED_ROW red;
    } SHARED_DATA;

    SHARED_DATA *shared_write = (SHARED_DATA*)0x00801440; // Core 3 L2
    SHARED_DATA *shared_read = (SHARED_DATA*)0x13801440; // Core 3 L2

    #pragma FUNCTION_OPTIONS(add_address64, "--opt_level=off -O0")
    void add_address64(unsigned int *relativeAddress, unsigned long long *absolute, unsigned int inc) {
    (*absolute) += inc;
    (*relativeAddress) += inc;
    int outputSize = 0x80000;
    if ((*relativeAddress) >= outputSize)
    (*relativeAddress) -= outputSize;
    }

    #pragma FUNCTION_OPTIONS(CopyInt, "--opt_level=off -O0")
    inline void CopyInt(int* src, int* dst, int length){
    int i;
    for (i = 0; i< length; i++)
    *dst++ = *src++;
    }

    #pragma FUNCTION_OPTIONS(ApplyRedundancy, "--opt_level=off -O0")
    void ApplyRedundancy(unsigned int* dst, size_t length32) {
    int lengtHalf = length32 >> 1;
    CopyInt((int*)dst, (int*)dst + lengtHalf, lengtHalf);
    }

    #ifdef INDIRECT_COPY

    #pragma FUNC_CANNOT_INLINE(indirect_copy)
    #pragma FUNCTION_OPTIONS(indirect_copy, "-O0")
    void indirect_copy(void *dst, void *src, int size) {
    memcpy(dst, src, size);
    }

    #else

    #define indirect_copy(x, y, z) memcpy(x, y, z)

    #endif

    void ti_cache_init() {
    int mar;
    CACHE_setL1PSize(CACHE_L1_32KCACHE); // set L1 to max // SPRU871J 3.4.3.1
    CACHE_setL1DSize(CACHE_L1_32KCACHE); // set L1 to max // SPRU871J 3.4.3.1

    // MAR0 - MAR15 are READ-ONLY
    // MAR16 - MAR127 are for addresses from 0x10000000 - 0x7FFFFFFF
    for (mar = 16; mar < 128; mar++)
    CACHE_disableCaching(mar);
    // MAR128 - MAR255 cover 0x80000000 - 0xFFFFFFFF; the first 512 MB (0x20000000), 0x80000000 - 0x9FFFFFFF, correspond to MAR128 - MAR159
    for (; mar < 256; mar++)
    CACHE_enableCaching(mar);
    CACHE_setL2Size(CACHE_256KCACHE); // L2 cache size to 256k (maximum) // SPRU871J 4.4.5
    }

    #pragma FUNCTION_OPTIONS(delay, "--opt_level=off -O0")
    void delay(unsigned x) {
    while (x--)
    asm(" NOP ");
    }

    // #define delay(x) while (x--) asm(" NOP ");

    void generate_pattern() {
    #if 0
    SHARED_DATA *xaddr = shared_write;
    printf("Initializing Core %i Generator\n", DNUM);
    ti_cache_init();
    srand(time(NULL));
    xaddr->sys.absolute = 0;
    xaddr->sys.relative = 0;
    xaddr->sys.version = 0;

    while (1) {
    add_address64(&xaddr->sys.relative, &xaddr->sys.absolute, 0x11);
    xaddr->sys.version++;
    asm(" MFENCE ");
    asm(" MFENCE ");
    ApplyRedundancy((unsigned*)xaddr, sizeof(SHARED_DATA) >> 2);
    }
    #else
    printf("Initializing Core %i Generator\n", DNUM);
    ti_cache_init();
    srand(time(NULL));
    shared_write->sys.absolute = 0;
    shared_write->sys.relative = 0;
    shared_write->sys.version = 0;

    while (1) {
    add_address64(&shared_write->sys.relative, &shared_write->sys.absolute, 0x11);
    shared_write->sys.version++;
    asm(" MFENCE ");
    asm(" MFENCE ");
    ApplyRedundancy((unsigned*)shared_write, sizeof(SHARED_DATA) >> 2);
    }
    #endif
    }

    void print_difference(unsigned lastIteration, unsigned currentIteration, SHARED_DATA *last, SHARED_DATA *current) {
    printf(
    "ON CORE\tNotValid:\tITERATION \tVERSION \tRELATIVE\tABSOLUTE\n" \
    " %i\tLast Sys:\t%10X\t%8X\t%8X\t%16llX\n" \
    " %i\tLast Red:\t%10X\t%8X\t%8X\t%16llX\n" \
    " %i\tCurr Sys:\t%10X\t%8X\t%8X\t%16llX\n" \
    " %i\tCurr Red:\t%10X\t%8X\t%8X\t%16llX\n",
    DNUM, lastIteration, last->sys.version, last->sys.relative, last->sys.absolute,
    DNUM, lastIteration, last->red.version, last->red.relative, last->red.absolute,
    DNUM, currentIteration, current->sys.version, current->sys.relative, current->sys.absolute,
    DNUM, currentIteration, current->red.version, current->red.relative, current->red.absolute);
    }

    inline unsigned compare_sys_red(SHARED_DATA *data) {
    if (data->sys.absolute == data->red.absolute &&
    data->sys.relative == data->red.relative &&
    data->sys.version == data->red.version)
    return 1;
    return 0;
    }

    // consume_pattern is RUNNING in CORE 2
    void consume_pattern() {
    SHARED_DATA last, current;
    unsigned r;
    // unsigned cycle = 0, lastIteration = 0; // Leaving this line commented out exhibits the behavior; uncommenting it removes the behavior

    printf("Initializing Core %i Consumer\n", DNUM);

    memset(&last, 0, sizeof(SHARED_DATA));
    ti_cache_init();
    srand(time(NULL));

    while (1) {
    /* To Get Failures:
    * add following
    r = rand() & Number > 2 (6, 12, 25, 50, 100, 101...)
    delay(r);
    *
    */
    r = rand() & 6;
    delay(r);

    indirect_copy(&current, shared_read, sizeof(SHARED_DATA));
    if (compare_sys_red(&current) && current.sys.version != last.sys.version) {
    if ((current.sys.absolute & 0xFFFF) != (current.sys.relative & 0xFFFF)) {
    // printf("CORE 2 CONSUMER\n");
    // print_difference(lastIteration, cycle, &last, &current);
    __DO_WRITE_ADDR_0();
    }
    last = current;
    // lastIteration = cycle++;
    }
    }
    }

    // Device to Function mapping:
    // DNUM == 3: Generator, generate data pattern into Local L2
    // DNUM == 2: Consumer / Transmitter, read data pattern from 0x138xxxxx

    int main(void) {
    if (DNUM == 3) {
    generate_pattern();
    } else if (DNUM == 2) {
    consume_pattern();
    }

    return 0;
    }
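The MAR loops in ti_cache_init can be related to addresses with a little arithmetic. The sketch below assumes each MAR register governs one 16 MB region, as described in the C66x CorePac documentation; the helper names `mar_index`, `mar_region_base` and `mar_region_last` are hypothetical.

```c
#include <stdint.h>

/* Index of the MAR register that governs a given address,
 * assuming one MAR per 16 MB (2^24 byte) region. */
unsigned mar_index(uint32_t addr) {
    return addr >> 24;
}

/* First address of the region governed by a MAR register (0..255). */
uint32_t mar_region_base(unsigned mar) {
    return (uint32_t)mar << 24;
}

/* Last address of the region governed by a MAR register (0..255). */
uint32_t mar_region_last(unsigned mar) {
    return mar_region_base(mar) | 0x00FFFFFFu;
}
```

Under this assumption the consumer's address 0x13801440 falls under MAR19, inside the 16..127 range where the listing disables caching, and the second loop (MAR128 through MAR255) enables caching for everything from 0x80000000 to 0xFFFFFFFF, not only the 512 MB mentioned in the comment.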