TMS570LC4357: Wrong ECC address when testing DMA in SDL(SL_SelfTest_DMA)

Part Number: TMS570LC4357
Other Parts Discussed in Thread: HALCOGEN

I'm trying to use the SafeTI Diagnostic Library (SDL) on the TMS570LC4357. I'm using SDL version 2.4.0 and HALCoGen 4.7.1.
I ran into a problem when running the DMA self-test (SL_SelfTest_DMA) with DMA_ECC_TEST_MODE_1BIT, DMA_ECC_TEST_MODE_2BIT, and DMA_SOFTWARE_TEST. The two tests with FAULT_INJECT finished successfully.

Here is the code I used to test:

dmaRAMBASE_t *dmaRAM = (dmaRAMBASE_t *)0xfff80000u;
dmaDisableECC();
dmaDisable();
dmaEnable();
dmaEnableECC();

SL_Init_Memory(RAMTYPE_DMA_RAM);
execResult = SL_SelfTest_DMA(dmaTests[subIndex].testType);

Looking deeper into the SL_SelfTest_DMA function, I found that the DMAECCSBE register gets the wrong value.

The function writes a wrong value to address 0xFFF80010, but DMAECCSBE is set to 0x00000018 instead of 0x00000010. Also, when I clear the error information by writing 1 to the SBERR bit and then read the DMAECCSBE register, it resets to 0x00000008 instead of 0x00000000 as stated in the documentation.

I followed the conclusion in this link (e2e.ti.com/.../tms570lc4357-dma-ecc-self-test-failing-randomly-due-to-incorrect-address-stored-in-dmaeccsbe) about resetting the DMA RAM before testing, but the problem persists.

There are several things I noticed:

1. Unlike in the linked post, my test result is consistent and not random. The problematic address is always 0x00000018.

2. After initializing the DMA RAM, I expected it to be cleared to 0. However, it held a non-zero value at every 0xn9 address (0x09, 0x19, 0x29, etc.). The following image shows the memory after initialization.


Interestingly, these locations are part of the problematic address when read as 4-byte data (0x18). When I manually reset that memory to 0 after initialization, the DMAECCSBE register is set correctly to 0x00000010 and the test passes, as in the following code:

dmaRAMBASE_t *dmaRAM = (dmaRAMBASE_t *)0xfff80000u;
dmaDisableECC();
dmaDisable();
dmaEnable();
dmaEnableECC();

SL_Init_Memory(RAMTYPE_DMA_RAM);

*(uint32_t *)0xfff80008 = 0;
*(uint32_t *)0xfff80018 = 0;

execResult = SL_SelfTest_DMA(dmaTests[subIndex].testType);
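
To see whether locations other than 0xfff80008 and 0xfff80018 are affected, the whole region could be scanned for non-zero words after initialization. The following is only a sketch: `count_nonzero_words` is a hypothetical helper, not part of the SDL or HALCoGen, and the region size passed to it is an assumption that has to come from the device documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an SDL/HALCoGen API).
 * Counts the non-zero 32-bit words in [base, base + n_words).
 * If at least one is found and first_idx is not NULL, the index of the
 * first non-zero word is stored in *first_idx; otherwise it is untouched. */
static size_t count_nonzero_words(const volatile uint32_t *base,
                                  size_t n_words, size_t *first_idx)
{
    size_t count = 0u;

    assert(base != NULL);

    for (size_t i = 0u; i < n_words; ++i) {
        if (base[i] != 0u) {
            if ((count == 0u) && (first_idx != NULL)) {
                *first_idx = i;
            }
            ++count;
        }
    }
    return count;
}
```

On the target this would be called with `(const volatile uint32_t *)0xfff80000u` and the DMA RAM size in words. Reading locations whose ECC bits are inconsistent may itself flag an error, so the scan may need to run with DMA ECC checking disabled.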

Can someone help me figure out what's going on, and how I can fix the underlying issue?

Thank you!
  • Hi Kim,

    Apologies for the delay.

    Understanding the Problem and Your Findings:

    1. DMAECCSBE Value: The DMAECCSBE (DMA ECC Single-Bit Error Address) register stores the byte address where a single-bit ECC error occurred. When it shows 0x00000018, it means an ECC error was detected at byte address 0xFFF80018 (relative to the DMA RAM base 0xFFF80000).
    2. SL_Init_Memory Issue: Your observation that SL_Init_Memory(RAMTYPE_DMA_RAM) does not fully clear the DMA RAM, specifically leaving 0xn9 values, is the key.
      • DMA RAM on the TMS570LC4357 is typically 2KB, from 0xFFF80000 to 0xFFF807FF.
      • The 0xn9 pattern is suspicious. When you read 4-byte data, 0x18 can be formed from these patterns. For example, if 0xFFF80018 contains 0xXX000009 (where XX is some other byte), and the ECC logic expects 0x00000000 for a clean state, this pre-existing 0x09 (or 0x19, 0x29, etc.) could be interpreted as a single-bit error when the ECC is enabled and the memory is accessed.
    3. Manual Clear Fix: When you manually set *(uint32_t *)0xfff80008 = 0; and *(uint32_t *)0xfff80018 = 0;, you are effectively clearing the problematic pre-existing data that was causing the ECC error at those specific locations. This confirms that the content of the memory, not necessarily the SDL's fault injection, is the initial problem.
    4. DMA ECC Self-Test Expectation: The SDL's DMA ECC self-tests (especially DMA_ECC_TEST_MODE_1BIT and DMA_ECC_TEST_MODE_2BIT) rely on a pristine memory state. They inject a specific fault (1-bit or 2-bit) and then verify that only that injected fault is detected. If there are pre-existing ECC errors due to uninitialized memory, the test will fail because it detects an unexpected error.
    5. 0x00000008 Residual: The 0x00000008 residual after clearing the SBERR bit might indicate another minor issue or a specific behavior of the DMA ECC register on your device, but it's secondary to the main problem of the initial 0x18 error. The primary goal is to get the test to pass, which means preventing the initial 0x18 error.
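
The address arithmetic in point 1 can be written out as a small sketch. It assumes, as described above, that DMAECCSBE holds a byte offset from the DMA RAM base. `dma_ecc_offset_to_address` is a hypothetical helper name, and the 2KB size is the figure quoted in point 2, not one verified against the device manual.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_RAM_BASE       0xFFF80000u /* base address used in this thread */
#define DMA_RAM_SIZE_BYTES 0x800u      /* assumption: 2KB, as quoted above */

/* Hypothetical helper: converts an offset read from DMAECCSBE into an
 * absolute address, assuming the register holds a byte offset from the
 * DMA RAM base. */
static uint32_t dma_ecc_offset_to_address(uint32_t offset)
{
    assert(offset < DMA_RAM_SIZE_BYTES);
    return DMA_RAM_BASE + offset;
}
```

With these assumptions, the observed value 0x00000018 maps to 0xFFF80018, while the expected value 0x00000010 maps to 0xFFF80010, the location the self-test writes to.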

    Root Cause:

    The fundamental issue is that SL_Init_Memory(RAMTYPE_DMA_RAM) in your SDL version (2.4.0) on the TMS570LC4357 is not fully or correctly initializing the entire DMA RAM to a known, ECC-clean state (e.g., all zeros). This leaves residual data that triggers ECC errors when the DMA controller accesses these locations with ECC enabled, interfering with the self-test.

    Solution:

    The most robust and fundamental solution is to explicitly zero out the entire DMA RAM before running the SDL DMA self-tests. This ensures a clean slate, allowing the SDL's fault injection to be the only source of ECC errors during the test.
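
A minimal sketch of such a clearing step is shown below. `zero_words` is a hypothetical helper, not an SDL or HALCoGen function, and the size passed to it is an assumption that should be checked against the device documentation. Whether 32-bit writes are the right width to regenerate the ECC bits on this device is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an SDL/HALCoGen API).
 * Writes zero to every 32-bit word in [base, base + n_bytes).
 * The pointer is volatile so the compiler keeps every store. */
static void zero_words(volatile uint32_t *base, size_t n_bytes)
{
    assert(base != NULL);
    assert((n_bytes % sizeof(uint32_t)) == 0u);

    for (size_t i = 0u; i < n_bytes / sizeof(uint32_t); ++i) {
        base[i] = 0u;
    }
}
```

In the test sequence from the original post, a call such as `zero_words((volatile uint32_t *)0xfff80000u, 0x800u);` would go between SL_Init_Memory(RAMTYPE_DMA_RAM) and SL_SelfTest_DMA(...), with 0x800 being the 2KB size assumed above.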

    --
    Thanks & regards,
    Jagadish.

  • OK, but why does SL_Init_Memory not fully initialize the memory? It's supposed to do so.

  • Does anyone have any idea?

  • Hi Kim,

    Apologies for the delayed response. I was off for a few days, so I didn't get time to work on this issue further.

    While the provided technical documents do not contain specific information about the SL_Init_Memory function, they do describe the principles of RAM initialization and several potential failure modes that can lead to an incomplete initialization and subsequent ECC errors.

    The primary reasons for incomplete memory initialization, based on the documentation, are improper handling of the hardware initialization sequence or partial memory writes that do not cover a full ECC block.

    Detailed Elaboration

    The provided context highlights several key concepts regarding memory initialization and ECC error prevention:

    1. Hardware-Based RAM Initialization

    The TMS320F2837xS Real-Time Microcontrollers manual describes a hardware feature for RAM initialization.

    • To prevent ECC or parity errors from reads of uninitialized RAM, a RAM_INIT feature is available for each memory block.
    • This process is started by setting an INIT bit for the specific RAM block. This initializes the block with 0x0 data and calculates the corresponding ECC/Parity bits.
    • A critical condition for success is that no master can access the memory while initialization is taking place. The software must poll the INITDONE bit to confirm completion. If a memory access occurs before INITDONE is set, the documentation explicitly states that "the memory read/write as well as initialization does not happen correctly."

    2. Uninitialized State and ECC

    The DRA74x_75x and DRA72x Performance document explains the state of memory at power-on and the need for software intervention.

    • The ECC parity bits are uninitialized when the device is first powered on.
    • It is crucial for software to initialize every block of memory (e.g., every 128 bits) that will be read after ECC is enabled. This ensures the parity bits are correctly set before any access occurs.

    3. Partial Memory Writes

    The same document warns against initializing only a portion of an ECC-protected memory line.

    • If software writes to only a part of an ECC memory block (e.g., one 32-bit word within a 128-bit line), the ECC controller performs a read-modify-write operation.
    • This operation reads the existing data and the associated uninitialized (and therefore incorrect) parity bits from the ECC memory.
    • The controller then attempts to check for errors using this invalid parity data, which can lead to false ECC error generation. To avoid this, the full memory line must be initialized at once, for example, by using a CPU memset or an EDMA transfer that is aligned to the ECC block size.
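
The alignment requirement in the last bullet can be illustrated with a little arithmetic: given a write of some length at some address, compute the span of whole ECC blocks that has to be initialized together. This is a generic sketch. `span_t` and `ecc_aligned_span` are hypothetical names, and the 16-byte block in the example comes from the 128-bit line mentioned above, which is not confirmed to apply to the TMS570LC4357 DMA RAM.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open address range [start, end). */
typedef struct {
    uint32_t start;
    uint32_t end;
} span_t;

/* Hypothetical helper: expands [start, start + len) to whole ECC blocks.
 * block must be a power of two. Overflow near the top of the 32-bit
 * address space is not handled in this sketch. */
static span_t ecc_aligned_span(uint32_t start, uint32_t len, uint32_t block)
{
    span_t s;

    assert((block != 0u) && ((block & (block - 1u)) == 0u));

    s.start = start & ~(block - 1u);                       /* round down */
    s.end   = (start + len + block - 1u) & ~(block - 1u);  /* round up   */
    return s;
}
```

For example, a 4-byte write at offset 0x24 with a 16-byte block expands to the range 0x20 to 0x30, so the whole 16-byte line would be written in one pass rather than a single word.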

    --
    Thanks & regards,
    Jagadish.

  • Both the manual and the document are NOT for the device that I'm using. Are you sure they apply to my case?