TDA4VM: Question about ECC SelfTest

Part Number: TDA4VM
Other Parts Discussed in Thread: DRA829

Hi TI,

1. Do all RAM IDs support inject-only mode? For the ECC self-test, should ECC be opened for fault injection with "ECC_ENABLE = 1, ECC_CHECK = 0"?

I enabled ECC this way, disabled ECC after the self-check, and then re-enabled ECC once the fault had been cleared.

When the ESM is associated with the error pin (errpin), the pin is pulled low by an unrecoverable fault from the ECC_MEMTYPE_MCU_CBASS_ECC_AGGR0 module.

It appears that the fault raised during the self-test was not actually cleared.

In SDL, a module such as ECC_MEMTYPE_MCU_CBASS_ECC_AGGR0 has no RAM ID that can be read directly in memory, so clearing should not involve RAM initialization or writing back the address content. Is there another way to clear the self-check fault?

2. In SDL, the group of the following RAM IDs is 0. How can I run the ECC self-check on these RAM IDs?

3. MCU_NAVSS0_UDMASS_ECC_AGGR0_NAVSS_MCU_J7_UDMASS_PSILSS0_L2P_NAVSS_PSIL_EDC_CTRL_0_RAM_ID:
Why does this RamId report a 2bit fault after ECC initialization (no injection fault)?

MCU_NAVSS0_UDMASS_ECC_AGGR0: Why does the board restart after 90 seconds of ECC self-check in this module?

4. Why do the following modules or RAM IDs show no response in the ESM_STS and ESM_RAW registers after fault injection?
PSRAMECC0_PSRAM256X32EC_ECC_AGGR
PSRAMECC0_PSRAM256X32EC_ECC_AGGR_PSRAM256X32E_16FFC_PSRAM0_ECC_RAM_IDECC

FS_PSRAMECC0_PSRAM256X32EC_ECC_AGGR_PSRAM256X32E_16FFC_PSRAM0_ECC_RAM_ID
R5FSS1_CORE0_ECC_AGGR (ICache, DCache, and TCM)
R5FSS0_CORE1_ECC_AGGR (ICache, DCache, and TCM)
R5FSS1_CORE1_ECC_AGGR (ICache, DCache, and TCM)

5. The following three modules involve high address mapping.
COMPUTE_CLUSTER0_A72SS0_COMMON_ECC_AGGR
COMPUTE_CLUSTER0_A72SS0_CORE0_ECC_AGGR
COMPUTE_CLUSTER0_A72SS0_CORE1_ECC_AGGR

1) After the mapping is complete, the COMMON_ECC module can access the mapped aggregator address to inject the fault. The self-check passes for interconnect-type RAM IDs but fails for wrapper-type RAM IDs. Why do the ESM_STS and ESM_RAW registers show no fault response after fault injection?
2) Why do accesses to the mapped address hang for the two modules CORE0_ECC and CORE1_ECC?

6. CBASS_ECC_AGGR0_MSRAM32KX256E_ECC_AGGR
In SDL, why is the base address of the ECC AGGR corresponding to this module 0, and why is there no ESM EventId? How can I run the ECC self-check on this module?

Thanks,

Yanni

  • Hi Yanni,

    I am checking internally with the experts. Will provide a response by the end of this week.

    Thanks,

    Josiitaa

  • Hi Yanni,

    1. I am not sure if I understand your question correctly. Could you please elaborate?

    When ECC self-test, use "ECC_ENABLE = 1, ECC_CHECK=0" to open ECC for fault injection self-test?

    Was this programmed manually or are you using code defined in SDL?

    Do all Ramids support inject only?

    Most aggregators support injection along with error detection and correction, while the others are inject only. You can refer to the INJECT_TYPE defined for each aggregator in the sdlr_soc_ecc_aggr.h file in sdl/include/soc/j721e/. An INJECT_TYPE of 1 means inject only, whereas 0 means RAM IDs where error injection can be done along with detection and correction.
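
    To make the distinction concrete, here is a minimal sketch of how a self-test could use the inject type to decide which RAM IDs are expected to raise the aggregator's own ESM event. The struct, enum, and function names are hypothetical and are not the SDL definitions; only the meaning of the values 0 and 1 is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values as described above: 1 = inject only, 0 = injection plus detection
 * and correction. The enum names themselves are hypothetical. */
enum { INJECT_WITH_EDC = 0, INJECT_ONLY = 1 };

/* Hypothetical, simplified stand-in for one RAM ID entry in the metadata. */
typedef struct {
    uint32_t ramId;
    uint32_t injectType;
} RamIdEntry;

/* True when the aggregator's own ESM event is the expected response. */
bool expects_aggregator_event(const RamIdEntry *entry)
{
    assert(entry != NULL);
    return entry->injectType == INJECT_WITH_EDC;
}

/* Count the RAM IDs whose self-test can be verified through the
 * aggregator's ESM event; inject-only entries need another check. */
size_t count_verifiable(const RamIdEntry *table, size_t count)
{
    size_t verifiable = 0;

    assert(table != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (expects_aggregator_event(&table[i])) {
            verifiable++;
        }
    }
    return verifiable;
}
```

    A self-test loop built this way would skip the ESM_STS check for inject-only entries instead of reporting them as failures.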

    2. Which SDK and SDL version are you using? Is it the SDL that comes along with the SDK release?

    3. I am in discussion with the HW team regarding this issue. Have you tried clearing the ESM event?

    4. Are you using any SDL test examples here?

    5. For the high memory regions, check the RAT to see if the addresses have been mapped correctly. 

    wrapper type self-check does not pass, why ESM_STS and ESM_RAW registers do not receive fault response after fault injection?

    This occurs as this aggregator is an Inject only type aggregator.

    Why do two modules CORE0_ECC and CORE1_ECC get stuck when accessing the mapped address?

    This might be because the cores have not been powered on. Before you read/write to an aggregator, the associated IP must be enabled properly.

    6.

    In SDL, why is the base address of the ECC AGGR corresponding to this module 0, and there is no ESM EventId?

    Are you trying to test any specific IPs? Or just a general test? This is a known issue that is part of the release notes as incomplete metadata on MSRAM and can be ignored.

    Thanks,

    Josiitaa

  • 1. The code is written manually with reference to SDL. We want to self-test the ECC during each ignition cycle.
    Following SDL, for RAM IDs that cannot be read directly in memory and are not inject-only, the self-test forces the error event to occur. After I do this, the fault cannot be cleared.
    If I do not force the event, the ESM_STS register shows no fault response.
    Could the fact that this fault cannot be cleared be related to the timing of the forced event? How can I clear forced faults?

    2. I refer to "SDL_RLS_01.00.00".

    3. I tried to clear ESM events, but it didn't have the desired effect.

    4. No, I used my own code to test, written with reference to "SDL_RLS_01.00.00".

    5. Can you provide a way to see if the RAT address mapping is successful? My call to CSL_ratConfigRegionTranslation returns success.

    How does an injection-only aggregator perform an ECC self-test? Does the ESM respond to events in ECC mode?

    Can you provide a way to see if the relevant IP is enabled?

    6. Can it be understood that this module does not support the ECC function?

  • Hi Yanni,

    2. I refer to "SDL_RLS_01.00.00".

    This is a known issue with the metadata in the SDL_RLS_01.00.00 release. We have an update coming up in SDK 9.0.

    3. I tried to clear ESM events, but it didn't have the desired effect.

    We have been discussing this internally. While we do not have an official response yet, we would like to provide the information below in the hope of moving this forward.

    The issue you are seeing is very similar to erratum i2191. If PSIL traffic is introduced into the scenario described in i2191, the issue is slightly different from a hardware perspective, but the workaround is likely to remain the same.

    Are you able to put the workaround for i2191 in place? The expectation is that this would resolve the issue.

    The workaround is described in: J721E DRA829/TDA4VM Processors Silicon Revision 1.1/1.0 (Rev. C) (ti.com)

    4. No, I used my own code to test, written with reference to "SDL_RLS_01.00.00".

    The modules should receive ESM_STS and ESM_RAW register responses. Is the error being propagated after error injection? You must trigger an access to the RAM by either reading or writing to the memory.
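
    The point about triggering an access can be pictured with a toy software model: arming an injection alone changes nothing, and the error is only reported once the memory is read or written. This is an illustrative sketch, not a model of the actual ECC aggregator registers; every name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one ECC-protected word (hypothetical, not the hardware). */
typedef struct {
    uint32_t data;
    bool injectArmed;  /* a fault has been armed by the injection logic */
    bool esmRaw;       /* stands in for the raw ESM status bit */
} ToyEccRam;

/* Arm a fault; nothing is reported until the word is accessed. */
void toy_inject(ToyEccRam *ram)
{
    assert(ram != NULL);
    ram->injectArmed = true;
}

/* Reading the word lets the checker see the armed fault and report it. */
uint32_t toy_read(ToyEccRam *ram)
{
    assert(ram != NULL);
    if (ram->injectArmed) {
        ram->esmRaw = true;
        ram->injectArmed = false;
    }
    return ram->data;
}
```

    In this model a self-test that checks the status right after `toy_inject` sees nothing, which resembles the "no ESM_STS response" symptom; the status only changes after `toy_read`.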

    6.Can it be understood that this module does not support the ECC function?

    There was a duplicate MSRAM instance that was accidentally included in the metadata and therefore can be ignored.

    I will get back to you with responses for the other questions by the end of this week.

    Thanks,

    Josiitaa

  • Hi Josiitaa

    3. I looked at the i2191 workaround and have the following questions:

    1) I run the ECC self-check after MCU McalDriver_Init and am not sure whether all voltage domains are functional. Is there a way to determine this? Or what can be done to make all voltage domains functional?

    2) Does the ECC function of this module need to be enabled in the main domain?

    4) If the main domain is in sleep state, could the MCU domain also report an unrecoverable failure?

    4. In SDL, the start address of this RAM ID is 0. How can I access it from the MCU domain?

    6. Can you pinpoint exactly which instance it duplicates?

  • Hi Yanni,

    I run the ECC self-check after MCU McalDriver_Init and am not sure whether all voltage domains are functional. Is there a way to determine this? Or what can be done to make all voltage domains functional?

    In this case, the MCU domain appears to be active because the code is running from the MCU R5F. The thing to check is whether the Main domain is functional. You could use the get_device_state TISCI call to get the current status of the Main domain. Link: https://software-dl.ti.com/tisci/esd/latest/2_tisci_msgs/pm/devices.html#pm-devices-msg-get-device

    Does the ECC function of this module need to be enabled in the main domain?
    I’m not clear on this question. If you are asking if the ECC aggregator for SDL_MCU_NAVSS0_UDMASS needs to be programmed/initialized from a core in the Main domain, the answer is no. It can be initialized from the MCU domain.
    In SDL, the start address of this RAM ID is 0. How can I access it from the MCU domain?

    It needs to be programmed in the RAT to access the address 0x0.

    Can you provide a way to see if the RAT address mapping is successful? My call to CSL_ratConfigRegionTranslation returns success.

    If the API doesn’t return any error, then it should be successful.
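
    Beyond the API return value, one possible sanity check is to confirm on paper that the local address being dereferenced really falls inside the configured window and lands on the intended system address. The sketch below does that arithmetic. The struct and function names are hypothetical (this is not the CSL RAT API), the addresses in the usage note are arbitrary, and the power-of-two size and alignment rules are stated here as assumptions to be checked against the TRM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one translation window. */
typedef struct {
    uint32_t localBase;    /* 32-bit address used by the R5F */
    uint64_t sizeInBytes;  /* window size, assumed to be a power of two */
    uint64_t systemBase;   /* translated SoC address */
} RatWindow;

/* Assumed rules: nonzero power-of-two size, both bases aligned to it. */
bool rat_window_valid(const RatWindow *w)
{
    assert(w != NULL);
    if (w->sizeInBytes == 0u || (w->sizeInBytes & (w->sizeInBytes - 1u)) != 0u) {
        return false;
    }
    if (((uint64_t)w->localBase & (w->sizeInBytes - 1u)) != 0u) {
        return false;
    }
    return (w->systemBase & (w->sizeInBytes - 1u)) == 0u;
}

/* Compute the system address for a local address; false if outside. */
bool rat_translate(const RatWindow *w, uint32_t local, uint64_t *system)
{
    uint64_t offset;

    assert(w != NULL && system != NULL);
    if (local < w->localBase) {
        return false;
    }
    offset = (uint64_t)local - w->localBase;
    if (offset >= w->sizeInBytes) {
        return false;
    }
    *system = w->systemBase + offset;
    return true;
}
```

    For example, with a 64 KB window from local 0x60000000 to system 0x4D20000000, local address 0x60000400 maps to 0x4D20000400, while 0x60010000 is already outside the window.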

    How does an injection-only aggregator perform an ECC self-test?

    If it is inject-only, then the behavior depends on the IP. In the inject-only case, the ECC aggregator is there only for error injection. Detection and correction happen in the IP itself. In some cases, the error may also be routed to ESM, but not to the same ESM event as the ECC aggregator. In some cases there may not be an ESM event (as with the j721e R5F inject-only endpoints).

    Can you provide a way to see if the relevant IP is enabled?

    The same get_device_state TISCI call can be used to check the state of the IPs as well.
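
    A minimal sketch of gating aggregator access on the reported device state is shown below. The query is passed in as a function pointer so the example stays self-contained. `fake_query`, the device IDs, and the numeric state encodings are assumptions made for illustration; in a real system the query would wrap the get_device_state TISCI call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed encodings of the reported hardware state (illustrative). */
enum { HW_STATE_OFF = 0, HW_STATE_ON = 1, HW_STATE_TRANS = 2 };

/* Hypothetical query: returns 0 on success and fills *hwState. */
typedef int (*QueryDeviceStateFn)(uint32_t deviceId, uint32_t *hwState);

typedef enum {
    ACCESS_OK,
    ACCESS_QUERY_FAILED,
    ACCESS_DEVICE_NOT_ON
} AccessCheck;

/* Decide whether it is safe to touch the aggregator of a given IP. */
AccessCheck check_before_aggr_access(QueryDeviceStateFn query, uint32_t deviceId)
{
    uint32_t state = HW_STATE_OFF;

    assert(query != NULL);
    if (query(deviceId, &state) != 0) {
        return ACCESS_QUERY_FAILED;
    }
    return (state == HW_STATE_ON) ? ACCESS_OK : ACCESS_DEVICE_NOT_ON;
}

/* Fake query for illustration: device 1 is on, device 2 is off,
 * any other ID makes the query itself fail. */
int fake_query(uint32_t deviceId, uint32_t *hwState)
{
    if (deviceId == 1u) {
        *hwState = HW_STATE_ON;
        return 0;
    }
    if (deviceId == 2u) {
        *hwState = HW_STATE_OFF;
        return 0;
    }
    return -1;
}
```

    Running a check like this before reading or writing an aggregator would turn a potential hang on an unpowered core into an explicit, skippable test result.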

    Can you pinpoint exactly which instance it repeats with?
    The duplicate one is the one that is not listed in the supported instances for the device in the sdl_ecc.h header file. If you are looking at the sdlr_soc_ecc_aggr.h header file for j721e, you will find metadata for:
    SDL_MCU_MSRAM_1MB0_MSRAM128KX64E_ECC_AGGR
    SDL_MSRAM_512K0_MSRAM16KX256E_ECC_AGGR
    SDL_CBASS_ECC_AGGR0_MSRAM32KX256E_ECC_AGGR
     
    You will find that the 3rd one, SDL_CBASS_ECC_AGGR0_MSRAM32KX256E_ECC_AGGR, is never used in SDL testing and also not added as a supported instance.
    Thanks,
    Josiitaa
  • If TDA4 goes to sleep, will an ECC error be reported?

  • Hi Josiitaa,

    About inject-only

    If the corresponding event is not routed to the ECC aggregator, what events respond?
    If there is no route to the ESM, how do you check that the injection failure has occurred?

    Thanks,

    Yanni

  • Hi Yanni,

    The inject-only endpoints will come through a different event than the ESM event associated with the ECC aggregator. The behavior depends on the IP. In the inject-only case, the ECC aggregator is there only for error injection. Detection and correction happen in the IP itself. In some cases, the error may also be routed to ESM, but not to the same ESM event as the ECC aggregator. In some cases there may not be an ESM event (as with the j721e R5F inject-only endpoints).

    SDL_ECC_callBackFunction (SDL_ECC_applicationCallbackFunction) is a callback that plugs into the exception handler, so that the application is notified of MCU R5F inject-only errors when they occur. It is an application-provided external callback function for ECC handling, called inside the reference functions when ECC errors occur. NOTE: This is application supplied and not part of the SDL. The SDL ECC module calls the SDL_ECC_applicationCallbackFunction API to notify the application that the error has occurred, since the notification does not go through ESM.
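
    The callback mechanism can be sketched generically as follows. This is not the SDL signature or implementation: the callback type, the registration function, and the notify function are hypothetical and only show the pattern of an exception-handler path handing an inject-only error to application code without going through ESM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type supplied by the application. */
typedef void (*InjectOnlyCallback)(uint32_t memType, uint32_t ramId);

InjectOnlyCallback g_injectOnlyCallback = NULL;

/* Bookkeeping written by the example application callback. */
unsigned g_appErrorCount = 0;
uint32_t g_appLastRamId = 0;

void register_inject_only_callback(InjectOnlyCallback cb)
{
    g_injectOnlyCallback = cb;
}

/* Stands in for the path taken from the exception handler when an
 * inject-only error is detected inside the IP itself. */
void notify_inject_only_error(uint32_t memType, uint32_t ramId)
{
    if (g_injectOnlyCallback != NULL) {
        g_injectOnlyCallback(memType, ramId);
    }
}

/* Example application callback: records that the error was seen. */
void app_record_error(uint32_t memType, uint32_t ramId)
{
    (void)memType;
    g_appErrorCount++;
    g_appLastRamId = ramId;
}
```

    A self-test for an inject-only endpoint would then check the application's own record (here `g_appErrorCount`) after injection rather than polling ESM_STS.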

    Thanks,

    Josiitaa

  • Will the mcu receive an error if the main domain is powered off?

  • Hi Josiitaa,

    I would like to confirm whether the fault detection mechanism for ECC faults is the same when injection faults occur during self-test and when they occur during runtime?

    Thanks,

    Yanni

  • Hi Josiitaa,

    This is the same question as the previous one; I want to describe it more clearly.

    For the safety mechanisms "Memory ECC" and "Software Test of Memory ECC" across all modules in the chip, is the mechanism for identifying ECC faults the same within the TI chip?
    An example is as follows:
    In "SPRUIR1_DRA829_TDA4VM_Safety_Manual_Automotive.pdf", the ADC module has two safety mechanisms, ADC2 and ADC-T1. For ADC-T1 it says, "Reporting of forced errors uses same mechanism that reports unforced errors,"
    I would also like to know whether the ECC fault detection methods of these two mechanisms are the same inside the chip.

    Thanks,

    Yanni

  • Hi Josiitaa,

    I created a task to perform the ECC self-check of the ADC module. After fault injection, the callback cleared the fault, but the ESM kept reporting errors.

    Then I added the ECC self-check for the following modules, and this time the program hung immediately.
    SDL_ECC_MEMTYPE_MCU_CBASS_ECC_AGGR0
    SDL_ECC_MEMTYPE_MCU_NAVSS0
    SDL_MCU_NAVSS0_UDMASS_ECC_AGGR0

    If the Software Test of Memory ECC fails to clear or the program is stuck, does it mean that the software test of memory ECC fails or the program is stuck?

    Thanks,

    Yanni

  • Hi Josiitaa,

    I used get_device_state and confirmed that their state is active.
    But the ECC self-test still fails.

    Is there any other reason?

    Thanks,

    Yanni

  • Hi Yanni,

    Sorry for the delayed response.

    Please help me summarize your doubts.

    1. Are you asking whether the fault detection mechanisms are the same for MemoryECC and Software Test of MemoryECC?

    consistent inside the chip?

    Could you explain what you mean by if they are consistent?

    2.

    this time the program hung immediately.

    Could you please share the output logs? What do you mean by the program is stuck?

    Thanks,

    Josiitaa

  • Hi Josiitaa,
    Here's the background.
    Both MemoryECC and Software Test of MemoryECC should be implemented. Given the problems above with Software Test of MemoryECC in our implementation, my concern is this: if the two fault detection mechanisms are the same and the Software Test of MemoryECC fails, will there be problems when MemoryECC is enabled?
    About the hang, the log is as follows:
    Here are a few questions we need your help with:

    1. The code is written manually with reference to SDL. We want to self-test the ECC during each ignition cycle.
    Following SDL, for RAM IDs that cannot be read directly in memory and are not inject-only, the self-test forces the error event to occur. After I do this, the fault cannot be cleared.
    If I do not force the event, the ESM_STS register shows no fault response.
    Could the fact that this fault cannot be cleared be related to the timing of the forced event? How can I clear forced faults?

    Can you help me with this question?

    We have been discussing this internally. While we do not have an official response yet, we would like to provide the information below in the hope of moving this forward.

    The issue you are seeing is very similar to erratum i2191. If PSIL traffic is introduced into the scenario described in i2191, the issue is slightly different from a hardware perspective, but the workaround is likely to remain the same.

    Are you able to put the workaround for i2191 in place? The expectation is that this would resolve the issue.

    The workaround is described in: J721E DRA829/TDA4VM Processors Silicon Revision 1.1/1.0 (Rev. C) (ti.com)

    I checked the errata, but it does not mention that fault injection would cause a restart. Is it possible that there are unknown negative effects after fault injection?
    Can you provide a way to see if the relevant IP is enabled?

    I used get_device_state and confirmed that their state is active.
    But the ECC self-test still fails.
    Is there any other reason?

    Thank you very much for your support.

    Thanks,

    Yanni

  • Hi Yanni,

    About the hang, the log is as follows:

    Where are you seeing these traces? Could you provide details about which modules you are testing and what APIs are being used?

    It is recommended that you run these tests for diagnostics once at startup, before entering into your safety application. When are you running these diagnostics?

    This fault can not be cleared, will it have anything to do with the forced time? How can I clean forced failures?

    The steps to follow to clear bits are in the SDL ECC documentation:

    Look for the section about “to clear and acknowledge the ECC interrupt”

    Additionally, depending on the ESM configuration, the ESM error pin may need to be cleared as well.

    Look for the section “If an error pin is asserted…”
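
    As a rough picture of the order implied by those two sections, the toy model below clears the source at the ECC aggregator first, then the ESM event, and only then releases the error pin. The register layout, the write-1-to-clear behavior, and all names are assumptions made for illustration; the authoritative sequence is the one in the SDL ECC documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy register set (hypothetical, not the real register map). */
typedef struct {
    uint32_t eccPending;  /* aggregator interrupt status, one bit per RAM ID */
    uint32_t esmStatus;   /* ESM status, one bit per event */
    bool errorPin;        /* true while the error pin is asserted */
} ToySafetyRegs;

/* Model of a write-1-to-clear register write (assumed behavior). */
static void write_one_to_clear(uint32_t *reg, uint32_t mask)
{
    *reg &= ~mask;
}

/* Returns true only when nothing is left pending after the sequence. */
bool clear_after_self_test(ToySafetyRegs *regs, uint32_t eccMask, uint32_t esmMask)
{
    assert(regs != NULL);

    write_one_to_clear(&regs->eccPending, eccMask);  /* 1. clear at the source */
    write_one_to_clear(&regs->esmStatus, esmMask);   /* 2. clear the ESM event */

    /* 3. release the pin only when no source or event remains pending */
    if (regs->eccPending == 0u && regs->esmStatus == 0u) {
        regs->errorPin = false;
    }
    return !regs->errorPin;
}
```

    If the function returns false, something other than the injected fault is still pending, which is worth logging before the self-test moves on to the next module.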

    How are you forcing these errors? Are you performing error injection or setting any bits manually? Which modules are you testing?

    Is it possible that there are unknown negative effects after injection failure?

    If that error bit is hooked up to the restart line when you set up the ESM, and the ESM is configured so that the pin stays active, then the PMIC can cause a restart.

    But the ecc self-test is still a failure. Is there any other reason?

    Which modules are the ECC tests failing for?

    Regards,

    Josiitaa