MCU-PLUS-SDK-AM243X: ECC: During 1bit error injection, SDL_ecc_aggrIsSVBUSRegReadDone is getting stuck

Part Number: MCU-PLUS-SDK-AM243X


We are trying to inject 1-bit errors into all the memory regions of BLAZAR_IIRAM_ECC and BLAZAR_IDRAM_ECC. However, while reading the ECC RAM control registers, SDL_ecc_aggrIsSVBUSRegReadDone randomly gets stuck. You can verify this from the call stack below.

 

As per "TRM_spruim2h.pdf", this is the right procedure to read the ECC control and status register.

But there is no timeout for this function, which causes our program to get stuck in a never-ending loop.
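
For illustration only, a bounded wait around a read-done poll could look like the sketch below. It does not reproduce the SDL implementation: `read_done_fn`, `poll_read_done_bounded`, the `POLL_*` codes, and the retry budget are hypothetical names, and `fake_is_done` is a host-side stub standing in for the hardware status check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status codes for this sketch (not SDL return values). */
#define POLL_OK       (0)
#define POLL_TIMEOUT  (-1)

/* Hypothetical callback type: returns true once the read-done flag is set. */
typedef bool (*read_done_fn)(void *ctx);

/* Poll the "done" condition at most max_polls times instead of forever. */
static int32_t poll_read_done_bounded(read_done_fn is_done, void *ctx,
                                      uint32_t max_polls)
{
    for (uint32_t i = 0u; i < max_polls; i++) {
        if (is_done(ctx)) {
            return POLL_OK;
        }
    }
    return POLL_TIMEOUT;   /* the caller decides how to report the fault */
}

/* Host-side stub: reports "done" after a given number of polls. */
typedef struct {
    uint32_t polls_until_done;
    uint32_t polls_seen;
} fake_svbus_t;

static bool fake_is_done(void *ctx)
{
    fake_svbus_t *bus = (fake_svbus_t *)ctx;
    bus->polls_seen++;
    return bus->polls_seen >= bus->polls_until_done;
}
```

With a budget like this, a read that never completes surfaces as an error code that the application can handle, rather than as a hang.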

Do you have any suggestions on what could lead to such a scenario?

  • Hello,

    Can you please share the parameters you are passing to the injectError API when the code gets stuck?

    Regards,

    Nihar Potturu. 

  • We are injecting an SEC error; the parameter details are below.

    st_inject_error_config =
        {
           .pErrMem = ADDRESS,
           .flipBitMask = 0x1,
           .chkGrp = 0x0,
        };

    SDL_ECC_injectError(SDL_MCU_M4FSS0_BLAZAR_ECCAGGR,
                        SDL_MCU_M4FSS0_BLAZAR_ECC_BLAZAR_IDRAM_ECC_RAM_ID,
                        SDL_INJECT_ECC_ERROR_FORCING_1BIT_ONCE,
                        &st_inject_error_config);

    We have found that it gets stuck in SDL_ecc_aggrReadEccRamCtrlReg, after
    the error configuration has been written to the register in SDL_ecc_aggrWriteEccRamCtrlReg.

  • Hello,

    We found that two changes fix our problem of ECC failing due to corruption in the ECC registers: disabling interrupts before, and re-enabling them after, the accesses in both SDL_ecc_aggrReadSVBUSReg() and SDL_ecc_aggrWriteSVBUSReg(), and limiting the total number of base address entries to 1 (SDL_ECC_Base_Address_TOTAL_ENTRIES defined as 1).

    Since SDL is a safety-certified library, could you provide more feedback on the impact of doing it like this?

  • Hello,

    Apologies for the delay here.

    Since SDL is a safety-certified library, could you provide more feedback on the impact of doing it like this?

    Making changes to the SDL driver code is not recommended, as it would affect the safety certification.

    Are you running this code from the M4F core? I suspect this might be happening because the example runs from the same memory into which the errors are being injected. Are you running the code from M4F IRAM and DRAM? (You can check the linker command file of your project.) Can you try placing the code in DDR and check whether you still see the issue?

    Can you also confirm if you are using debug build or release build when you observe this issue? 

    Regards,

    Nihar Potturu. 

  • Hello Nihar. Yes, we are running the code from the M4F core. In our architecture, this core is our safety channel; the R5F core is for non-safe code.

    Could you confirm whether we see this issue in both debug and release? I believe this happens on both.

    I know it is recommended not to make changes to the SDL, but I believe this change is necessary due to our architecture, and we will present the changes to TUV for impact analysis. I wanted to confirm with TI the impact or concerns, and whether this could be related to SDL integration. But it seems the issue is due to running this test from the M4F.

    Thanks

  • Yes, we were able to see this error in both debug and release builds.

  • Hello,

    I work with Manmath and Luis and investigated this issue.

    To summarize quickly,
     - We are operating from the M4F core
     - We are injecting single bit errors only
     - Code is running from the M4F SRAM only. Running from DDR is not an option due to the architecture and functional safety requirements.

    After investigation, I concluded that the trouble arises in certain cases because the aggrBus read and write functions are called both from the main code path and from the ISR, combined with the way the bus read interface works.
    Below is a diagram of my guess:




    Disabling interrupts before the read/write operations and re-enabling them afterwards solved the problem.
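
One plausible interleaving behind this, and the effect of the guard, can be modelled on a host with the sketch below. Everything in it is an assumption made for illustration: `sim_bus_t`, `bus_request`, `bus_collect`, `sim_isr`, and `read_reg` are hypothetical stand-ins, not SDL functions, and the "interrupt" is simulated by an explicit preemption point between the request and the collect step.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated bus that supports a single outstanding read request. */
typedef struct {
    uint32_t regs[4];
    uint32_t pending_reg;  /* register selected by the last request */
    bool     done;         /* read-done flag, cleared when data is collected */
    bool     irq_enabled;  /* simulated global interrupt enable */
    bool     irq_pending;  /* a simulated ECC interrupt is waiting to fire */
} sim_bus_t;

static void bus_request(sim_bus_t *b, uint32_t reg)
{
    b->pending_reg = reg;
    b->done = true;
}

static bool bus_collect(sim_bus_t *b, uint32_t *val)
{
    if (!b->done) {
        return false;      /* on hardware, the caller would keep polling */
    }
    *val = b->regs[b->pending_reg];
    b->done = false;
    return true;
}

/* Simulated ISR: performs its own request/collect on the shared bus. */
static void sim_isr(sim_bus_t *b)
{
    uint32_t tmp;
    bus_request(b, 3u);
    (void)bus_collect(b, &tmp);
}

/* The ISR runs here only if interrupts are enabled and one is pending. */
static void preemption_point(sim_bus_t *b)
{
    if (b->irq_enabled && b->irq_pending) {
        b->irq_pending = false;
        sim_isr(b);
    }
}

/* Main-context read, optionally wrapped in a critical section. */
static bool read_reg(sim_bus_t *b, uint32_t reg, bool guard, uint32_t *val)
{
    bool saved = b->irq_enabled;
    if (guard) {
        b->irq_enabled = false;    /* enter critical section */
    }
    bus_request(b, reg);
    preemption_point(b);           /* window between request and collect */
    bool ok = bus_collect(b, val);
    if (guard) {
        b->irq_enabled = saved;    /* leave critical section */
        preemption_point(b);       /* deferred interrupt is serviced now */
    }
    return ok;
}
```

In this model the unguarded read loses its done flag to the ISR and would poll forever, while the guarded read completes and the interrupt is serviced afterwards. On the target, the save/restore would presumably map to the SDK's interrupt disable/restore calls (for example `HwiP_disable()` and `HwiP_restore()` from the MCU+ SDK DPL); that mapping is an assumption and is not stated in this thread.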

  • Hello,

    Apologies for the delay here.

    I have tried injecting error into the same RAM ID as what you were trying and I don't see any issues at my end. The code is not getting stuck inside the Injecterror function.

    I am attaching the example function below:

    static SDL_ECC_MemSubType ECC_Test_AGGR1_A0subMemTypeList[MAIN_AGGR1_AGGR1_MAX_MEM_SECTIONS] =
    {
        SDL_MCU_M4FSS0_BLAZAR_ECC_BLAZAR_IIRAM_ECC_RAM_ID,
        SDL_MCU_M4FSS0_BLAZAR_ECC_BLAZAR_IDRAM_ECC_RAM_ID,
    };
    
    
    static SDL_ECC_InitConfig_t ECC_Test_AGGR1A0ECCInitConfig =
    {
        .numRams = MAIN_AGGR1_AGGR1_MAX_MEM_SECTIONS,
        /**< Number of Rams ECC is enabled  */
        .pMemSubTypeList = &(ECC_Test_AGGR1_A0subMemTypeList[0]),
        /**< Sub type list  */
    };
    
    // Attaching only ECC Init code
    
    if (retValue == SDL_APP_TEST_PASS) {
        /* Initialize ECC */
        result = SDL_ECC_init(SDL_MCU_M4FSS0_BLAZAR_ECCAGGR, &ECC_Test_AGGR1A0ECCInitConfig);
        if (result != SDL_APP_TEST_PASS) {
            /* print error and quit */
            DebugP_log("SDTF_init: Error initializing M4F core ECC: result = %d\n\n", result);
    
            retValue = SDL_APP_TEST_FAILED;
        } else {
            DebugP_log("\n\nSDTF_init: AGGR1 ECC Init complete \n\n");
        }
    }
    
    // Error Injection Code
    int32_t runECC2BitAGGR1_InjectTest(void)
    {
        SDL_ErrType_t result;
        int32_t retVal=0;
        uint32_t subType;
    
        SDL_ECC_InjectErrorConfig_t injectErrorConfig;
    
        memset(&injectErrorConfig, 0, sizeof(injectErrorConfig));
    
        DebugP_log("\n\n AGGR1 Single bit error inject Example test UC-1: starting");
    
        /* Run one shot test for AGGR1 1 bit error */
        /* Note the address is relative to start of ram */
        injectErrorConfig.pErrMem = (uint32_t *)(0x00030000u);
    
        injectErrorConfig.flipBitMask = 0x1;
        injectErrorConfig.chkGrp = 0x0;
    
        /* Report the RAM ID that is actually passed to SDL_ECC_injectError() below */
        subType = SDL_MCU_M4FSS0_BLAZAR_ECC_BLAZAR_IDRAM_ECC_RAM_ID;
    
        result = SDL_ECC_injectError(SDL_MCU_M4FSS0_BLAZAR_ECCAGGR,
                                  SDL_MCU_M4FSS0_BLAZAR_ECC_BLAZAR_IDRAM_ECC_RAM_ID,
                                  SDL_INJECT_ECC_ERROR_FORCING_1BIT_ONCE,
                                  &injectErrorConfig);
    
        if (result != SDL_APP_TEST_PASS ) {
            DebugP_log("\n\n AGGR1 Single bit error inject test: Subtype %d: test failed",
                        subType);
            retVal = SDL_APP_TEST_FAILED;
        } else {
            DebugP_log("\n\n AGGR1 Single bit error inject test: Subtype 0x%x test complete",
                        subType);
        }
    
        return retVal;
    }

    Can you please re-verify if the SDL_ECC_init is happening properly and if there are any differences between my test and yours?

    Regards,

    Nihar Potturu. 

  • Hello Nihar,

    Thank you for your reply.

    I checked the SDL initialisation code and it is very similar and the returned value is checked and correct.

    There are, in my opinion, two major differences between your test and ours.

    1/ Memory coverage
    To be able to reproduce the problem quickly, we had to write a while() loop that injects errors into the memory at the highest possible rate.
    The memory address is not fixed: it iterates from 0x30000 to the end of memory and then loops back to 0x30000.

    2/ Interrupt handling
    We are using an SDL interrupt handler very similar to the one in the SDK example.
    See below a code extract:

    static int32_t esm_callback_fun(SDL_ESM_Inst esm_inst,
                                    SDL_ESM_IntType esm_intr_type,
                                    uint32_t u32_grp_channel,
                                    uint32_t u32_index,
                                    uint32_t u32_int_src,
                                    void *const p_arg)
    {
        int32_t s32_return = 0;
        int32_t s32_status = 0;
        SDL_ECC_MemType ecc_memtype;
        SDL_Ecc_AggrIntrSrc ecc_intr_src;
        SDL_ECC_ErrorInfo_t st_ecc_error_info;
        /*Get the ECC error information from the ESM error information*/
        s32_status = SDL_ECC_getESMErrorInfo(esm_inst, u32_int_src, &ecc_memtype, &ecc_intr_src);
        
        if(s32_status == 0)
        {   /* Any additional customer specific actions can be added here */
            s32_status = SDL_ECC_getErrorInfo(ecc_memtype, ecc_intr_src, &st_ecc_error_info);
            
            /*If error info return is of no error then register the bit error else clear the interrupts*/
            if(s32_status == 0)
            {
                /*Check if we need inject error flag is on*/
                if (st_ecc_error_info.injectBitErrCnt != (uint32_t)0)
                {
                    /*Clears the pending interrupt if any*/
                    SDL_ECC_clearNIntrPending(ecc_memtype, st_ecc_error_info.memSubType, ecc_intr_src, SDL_ECC_AGGR_ERROR_SUBTYPE_INJECT, st_ecc_error_info.injectBitErrCnt);
                }
                else
                {
                    SDL_ECC_clearNIntrPending(ecc_memtype, st_ecc_error_info.memSubType, ecc_intr_src, SDL_ECC_AGGR_ERROR_SUBTYPE_NORMAL, st_ecc_error_info.bitErrCnt);
                }
                s32_status = SDL_ECC_ackIntr(ecc_memtype, ecc_intr_src);
                
                /* user specific and error handling */
            }
        }
        
        return s32_return;
    }
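
The address iteration described under "Memory coverage" above can be sketched roughly as follows. Only the 0x30000 start offset comes from this thread; `INJECT_END_OFFSET`, `INJECT_STEP_BYTES`, and `next_inject_offset` are hypothetical and would need to match the actual memory size and injection granularity.

```c
#include <stdint.h>

/* Start offset as given in the thread (relative to the start of RAM). */
#define INJECT_START_OFFSET  (0x00030000u)
/* Hypothetical exclusive end offset; adjust to the real memory size. */
#define INJECT_END_OFFSET    (0x00040000u)
/* Hypothetical step: one 32-bit word per injection. */
#define INJECT_STEP_BYTES    (4u)

/* Return the next offset to inject at, wrapping back to the start. */
static uint32_t next_inject_offset(uint32_t current)
{
    uint32_t next = current + INJECT_STEP_BYTES;

    /* Wrap when leaving the window (also recovers from a bad input). */
    if ((next < INJECT_START_OFFSET) || (next >= INJECT_END_OFFSET)) {
        next = INJECT_START_OFFSET;
    }
    return next;
}
```

In a continuous test, the loop body would set `pErrMem` to the current offset, call the injection API, and then advance with `next_inject_offset()`.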

  • Hello Fabien,

    To be able to quickly reproduce the problem, we had to make a while() loop which injects errors at the highest possible rate into the memory.
    The memory address is not fixed but iterates from 0x30000 to memory end and then loop back to 0x30000.

    Can you share what you are doing here? I will try to inject errors in a similar way to see if I can reproduce the issue.

    Also, when you don't have this while loop where you inject errors continuously, do you see the issue of getting stuck inside the SDL_ecc_aggrReadSVBUSReg function? 

    Regards,

    Nihar Potturu. 

  • Hello Nihar. Without the while loop this condition doesn't happen. But we need the loop as this is part of our continuous test (CBIT). We will share the scenario with TUV and let you know if they need more information. Thanks

  • Hello Luis, 

    Thank you for the update. Closing this thread for now. Feel free to reply back if you have any additional questions. 

    Regards,

    Nihar Potturu.