Hi
We are using the TI C6678 processor in our product and are encountering a bug that, for lack of understanding, we have not yet been able to fix. Due to IP rights issues, we cannot post our original code, so I have attempted to create a simple project that illustrates the problem. Our product uses code compiled with optimization turned on (-O3 -ms3), and we only encounter the issue after about 20 minutes. My sample project, however, only demonstrates the issue when optimization is turned off. So the exact scenario might not be reproduced, but the underlying cause may be the same.
The overall architecture of our system is as follows: one core is responsible for communication with the desktop PC (TCP core), one core is responsible for updating a status buffer (MONITOR core), and the other cores perform other tasks and write their results to a circular buffer. The status buffer (written by the MONITOR core) is in the MONITOR core's L2. The circular buffer is in DDR3. To guarantee that what is read for the circular buffer's tail value is coherent, we created the following structure:
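Because the status buffer lives in the MONITOR core's L2, the TCP core has to reach it through that core's global address rather than through the local L2 alias, which on every core refers to that core's own L2. A minimal sketch of the translation, assuming the C6678 memory map (local L2 SRAM at 0x00800000, core n's L2 SRAM at 0x10800000 + n * 0x01000000; please verify against the device data manual). The helper name local_to_global is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: C6678 memory map, where each core sees its own L2 SRAM at the
 * local alias 0x00800000 and core n's L2 is globally visible at
 * 0x10800000 + (n << 24). Hypothetical helper, not a TI API. */
static uint32_t local_to_global(uint32_t local_addr, uint32_t core_id)
{
    assert(core_id < 8u);   /* the C6678 has 8 cores */

    /* Only addresses in the local alias window (top byte 0x00) are rewritten. */
    if ((local_addr & 0xFF000000u) == 0u)
        return local_addr | ((0x10u + core_id) << 24);

    /* DDR3, MSMC, or already-global addresses pass through unchanged. */
    return local_addr;
}
```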
Structure {
    Systematic Row {
        Absolute position - 64 bit value
        Relative position - 32 bit value
        Update counter - 32 bit value
    }
    Redundant Row {
        Absolute position - 64 bit value
        Relative position - 32 bit value
        Update counter - 32 bit value
    }
}
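For reference, the layout above might be declared in C roughly as follows. This is a sketch: the type and field names (row_t, tail_status_t, abs_pos, rel_pos, counter) are hypothetical, and the static_assert checks assume a C11-capable compiler (an older toolchain may need a different compile-time check). The checks pin down the byte layout that both the C6678 side and the PC side would need to agree on: two 16-byte rows with no padding.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* One row of the tail status structure (hypothetical names). */
typedef struct {
    uint64_t abs_pos;   /* Absolute position, 64 bit */
    uint32_t rel_pos;   /* Relative position, 32 bit */
    uint32_t counter;   /* Update counter, 32 bit */
} row_t;

typedef struct {
    row_t sys;   /* Systematic row, written first by the MONITOR core */
    row_t red;   /* Redundant row, copied from the systematic row */
} tail_status_t;

/* Compile-time layout checks: two 16-byte rows, no padding. */
static_assert(sizeof(row_t) == 16, "row_t must be 16 bytes");
static_assert(offsetof(tail_status_t, red) == 16, "redundant row starts at byte 16");
static_assert(sizeof(tail_status_t) == 32, "tail_status_t must be 32 bytes");
```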
The MONITOR core's job is to update the systematic row first (its very first write is to the 64-bit absolute position), then copy the systematic row to the redundant row. The PC reads the whole structure at various, random intervals. When the PC sees non-matching systematic and redundant values, it discards them and rereads the structure. The ISSUE is that when the PC reads a structure whose systematic and redundant rows are equal, SOMETIMES the relative position is inconsistent with the absolute position. Specifically, we see that the SYSTEMATIC and REDUNDANT absolute positions (the two 64-bit values) lag the RELATIVE positions by one update, whereas the MONITOR core never puts the structure into such a state (with the ABSOLUTE value lagging the RELATIVE value).
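The update and read sequence described above can be sketched as follows, reusing the hypothetical row_t / tail_status_t names. The functions monitor_update and read_tail_status are illustrative stand-ins, not the original generate_pattern / consume_pattern. Note that volatile only stops the compiler from removing, merging, or reordering these accesses relative to each other; it says nothing about caching or about when another core observes the stores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; layout follows the structure described above. */
typedef struct { uint64_t abs_pos; uint32_t rel_pos; uint32_t counter; } row_t;
typedef struct { row_t sys; row_t red; } tail_status_t;

/* MONITOR side: write the systematic row (absolute position first),
 * then copy it field by field into the redundant row. */
static void monitor_update(volatile tail_status_t *s, uint64_t abs_pos, uint32_t rel_pos)
{
    s->sys.abs_pos = abs_pos;
    s->sys.rel_pos = rel_pos;
    s->sys.counter = s->sys.counter + 1u;

    s->red.abs_pos = s->sys.abs_pos;
    s->red.rel_pos = s->sys.rel_pos;
    s->red.counter = s->sys.counter;
}

/* Reader side: read the whole structure in memory order, then accept the
 * snapshot only if both rows match. Returns false if the caller should reread. */
static bool read_tail_status(const volatile tail_status_t *s, row_t *out)
{
    row_t sys, red;

    assert(s && out);
    sys.abs_pos = s->sys.abs_pos;
    sys.rel_pos = s->sys.rel_pos;
    sys.counter = s->sys.counter;
    red.abs_pos = s->red.abs_pos;
    red.rel_pos = s->red.rel_pos;
    red.counter = s->red.counter;

    if (sys.abs_pos != red.abs_pos || sys.rel_pos != red.rel_pos ||
        sys.counter != red.counter)
        return false;   /* rows differ: discard and reread */

    *out = sys;
    return true;
}
```

A caller would simply loop on read_tail_status until it returns true.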
When I examine the disassembly of my sample project (and run it in the debugger in assembly step mode), I can see that the MONITOR core writes the 64-bit absolute position first, then the 32-bit relative position, and so on, in the order we wrote the code. We can also see that the reader reads (in double words) in the same order. What we can't understand is how the REDUNDANT row can be EQUAL to the SYSTEMATIC row, yet the 64-bit ABSOLUTE value can be inconsistent with the RELATIVE value.
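It may help to check whether in-order accesses alone could produce the observed snapshot. The sketch below replays, single-threaded, one hypothetical interleaving: the reader loads the systematic absolute position just before the MONITOR writes update 8, and loads the redundant absolute position just before the MONITOR copies the row. It assumes every load and store becomes visible in program order, treats each "relative position + counter" pair as one double-word load, and takes the relative position to be the absolute position modulo a hypothetical RING_SIZE of 64. All names are hypothetical; this illustrates one possible window, not a diagnosis of the real code.

```c
#include <assert.h>   /* assert() for checking the replayed snapshot */
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t abs_pos; uint32_t rel_pos; uint32_t counter; } row_t;
typedef struct { row_t sys; row_t red; } tail_status_t;

#define RING_SIZE 64u   /* hypothetical: rel_pos == abs_pos % RING_SIZE */

static bool rows_equal(row_t a, row_t b)
{
    return a.abs_pos == b.abs_pos && a.rel_pos == b.rel_pos && a.counter == b.counter;
}

/* Replays one writer/reader interleaving, every access in program order. */
static tail_status_t replay_interleaving(void)
{
    tail_status_t shared = { {100u, 36u, 7u}, {100u, 36u, 7u} };  /* update 7 */
    tail_status_t snap;

    snap.sys.abs_pos = shared.sys.abs_pos;    /* R: systematic absolute (old: 100) */

    shared.sys.abs_pos = 101u;                /* W: update 8, systematic row */
    shared.sys.rel_pos = 37u;
    shared.sys.counter = 8u;

    snap.sys.rel_pos = shared.sys.rel_pos;    /* R: systematic rel + counter (new) */
    snap.sys.counter = shared.sys.counter;
    snap.red.abs_pos = shared.red.abs_pos;    /* R: redundant absolute (old: 100) */

    shared.red.abs_pos = shared.sys.abs_pos;  /* W: copy systematic -> redundant */
    shared.red.rel_pos = shared.sys.rel_pos;
    shared.red.counter = shared.sys.counter;

    snap.red.rel_pos = shared.red.rel_pos;    /* R: redundant rel + counter (new) */
    snap.red.counter = shared.red.counter;
    return snap;
}
```

In this replay both rows read as absolute 100, relative 37, counter 8: the rows match, yet the absolute position lags the relative position by one update. Whether the real reader ever spends long enough between those loads to hit such a window is not something the sketch establishes.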
Please note that many minor changes to the attached sample program will "remove" the issue, but possibly only mask the problem, whereas our actual product will still experience the bug. If you compile the sample program, please use CGTool 7.4.8 in debug mode (no optimization). Note that if you change the #if 0 to #if 1 in the generate_pattern function (that is, temporarily use a local stack variable for the dereference), the problem is masked.
Likewise, if you change the delay() call in consume_pattern to remove the rand() call, or change it to rand() & 1 or to 0, the problem is masked.
We are trying to understand why the structure in the MONITOR core's L2 (written by generate_pattern) is in an inconsistent state when read by the TCP core (consume_pattern).
Best Regards