AM6526: Sporadic issue with SBL when calling SciClient functions

Ruediger Wurth

Part Number: AM6526

Dear TI experts,

we have a problem with SBL and SciClient using TI RTOS SDK7.3.
We know that this SDK is deprecated, but we cannot move to SDK8.x because we need an RTOS on the A53 (no Linux).
The issue appeared when we migrated from SDK6.3 to SDK7.3, using our custom board.

It is a sporadic issue: on average, roughly once in every 500 boot cycles a board fails to boot.
Instead, the boot phase "freezes" and after about 180 seconds the ROM bootloader reboots from the secondary boot medium, which in our case is UART. The primary boot medium is QSPI from NOR flash, using the OSPI peripheral.
In some cases the ROM bootloader does not trigger, and the SBL stays stuck indefinitely.

We have already spent a lot of time (!) on this issue and tried to narrow it down with UART_printf() calls inside the SBL.

This issue usually appears in one of 2 functions:
1. either in Sciclient_procBootGetProcessorState() called by SBL_ConfigMcuLockStep(false)
2. or in SBL_ReleaseCore() called by SBL_ReleaseAllCores()

The most common symptom is that the SciClient function simply does not return until the ROM bootloader watchdog triggers.
Whether the issue appears in the first or the second function listed above seems to depend on how many UART_printf() calls we added.
So it also looks like a timing issue, since each UART_printf() adds some delay.

When the issue is in the first function, I was able to analyze it more deeply:
Sciclient_procBootGetProcessorState() calls Sciclient_serviceSecureProxy(), which contains several while() loops.
The second while() loop is waiting for the answer from DMSC:

int32_t Sciclient_serviceSecureProxy(const Sciclient_ReqPrm_t *pReqPrm,
                                     Sciclient_RespPrm_t      *pRespPrm)
{
   ...

        /* Check if some message is received*/
        while (((HW_RD_REG32(Sciclient_threadStatusReg(rxThread)) &
                CSL_SEC_PROXY_RT_THREAD_STATUS_CUR_CNT_MASK) - initialCount) <= 0U)
        {
            if (timeToWait > 0U)
            {
                timeToWait--;
                if (g_u32DebugVerboseFlag) UART_printf("SciW2x loop\n");
            }
            else
            {
                status = CSL_ETIMEOUT;
                break;
            }
        }
    ...
}

In the normal case this loop is not entered at all because the while() condition is initially false.
But in the error case the loop spins 'endlessly' until the ROM bootloader watchdog triggers.
I confirmed this with a UART_printf() inside the loop: in the error case I get the loop output on the UART for 180 seconds, then the reset occurs, followed by the UART chip ident string and 'CCCC...'.
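A minimal sketch of the same polling pattern, written against a simulated register so it can run on a host. It is not the PDK implementation: fake_reg_t, fake_read() and poll_for_message() are hypothetical names used only for illustration. It shows two properties of the loop quoted above: with unsigned arithmetic the "<= 0U" test can only mean "== 0U", and the timeout budget is counted in loop iterations rather than in wall-clock time, so a UART_printf() inside the loop makes the same budget last much longer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POLL_OK       0
#define POLL_TIMEOUT  (-1)

/* Hypothetical stand-in for reading the secure proxy thread status count. */
typedef uint32_t (*read_count_fn)(void *ctx);

/* Simulated register: the count increases once after 'arrive_after' reads. */
typedef struct {
    uint32_t reads;        /* number of reads performed so far */
    uint32_t arrive_after; /* reads before a message "arrives"; 0 = never */
    uint32_t initial;      /* count value before any message arrives */
} fake_reg_t;

uint32_t fake_read(void *ctx)
{
    fake_reg_t *r = ctx;
    r->reads++;
    if ((r->arrive_after != 0U) && (r->reads > r->arrive_after)) {
        return r->initial + 1U;
    }
    return r->initial;
}

/* Poll until the count differs from 'initial' or the iteration budget is
 * used up. 'iterations' receives the number of wait iterations spent. */
int poll_for_message(read_count_fn rd, void *ctx, uint32_t initial,
                     uint32_t time_to_wait, uint32_t *iterations)
{
    uint32_t spent = 0U;

    assert(iterations != NULL);
    /* Unsigned subtraction: the difference is never negative, so the
     * original "<= 0U" test is equivalent to "== 0U". */
    while ((rd(ctx) - initial) == 0U) {
        if (time_to_wait == 0U) {
            *iterations = spent;
            return POLL_TIMEOUT;
        }
        time_to_wait--;   /* budget is in iterations, not in time */
        spent++;
    }
    *iterations = spent;
    return POLL_OK;
}
```

If the caller passes a very large iteration budget, the CSL_ETIMEOUT branch may in practice never be reached before the ROM watchdog fires, which would match the behaviour described above.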

The tricky part is that it appears quite sporadically, so every test needs an endurance run of several hours that continuously power-cycles a device.

We also have complete systems with 20 or more devices, but there it is not possible to connect UARTs. With such complete systems, however, it is much more likely that the issue can be reproduced. Typically it appears on some devices much more often than on others. On some devices it does not seem to appear at all.

To rule out a hardware issue, we ran the same systems with an older firmware based on SDK6.3, and everything was absolutely stable: no issue during 5 days of continuously power-cycling the whole system (~100,000 total boots).

We already analyzed some potential general root causes:

Do we have an issue when reading from NOR flash? No. Checked by copying the image to DDR, computing checksums, and then booting the image from DDR instead of via OSPI FFS.
Do we have a problem with DDR RAM access? Normally DDR RAM is not used by SBL. We also did RAM tests.
Do we have a problem with starting the A53 firmware? No, checked via several methods, e.g. a very early A53 boot delay function.
Do we have an SBL internal memory range issue? No, checked linker file and map files.
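
The NOR flash check above can be sketched as follows. The thread does not say which checksum was used; this example assumes a standard CRC-32 (reflected polynomial 0xEDB88320), and crc32_update() and images_match() are hypothetical helper names, not PDK functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 with the reflected polynomial 0xEDB88320. Pass 0 as 'crc'
 * for the first chunk and the previous return value for later chunks. */
uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    assert((data != NULL) || (len == 0U));
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* If the low bit is set, shift and XOR in the polynomial. */
            crc = (crc >> 1) ^ (0xEDB88320U & (0U - (crc & 1U)));
        }
    }
    return ~crc;
}

/* Compare two buffers (e.g. the image read via OSPI and its copy in DDR)
 * by checksum. Returns 1 if the checksums are equal, 0 otherwise. */
int images_match(const uint8_t *a, const uint8_t *b, size_t len)
{
    return crc32_update(0U, a, len) == crc32_update(0U, b, len);
}
```

Repeating the comparison on every boot cycle of an endurance run, rather than once, is what would make such a check meaningful for a fault that shows up only about once in 500 boots.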

Now some questions to TI:

Is there a known issue in DMSC firmware in PDK 7.3 for AM65x?
Is there a known issue in SciClient library in PDK 7.3 for AM65x?
How could we narrow this issue down further?
Would it make sense to use the TISCI "Trace Layer" feature? If yes, could you support us here?

Regards, Ruediger

over 2 years ago

0 Ruediger Wurth over 2 years ago

Intellectual 705 points

Update: in the last few days I was able to identify and fix the issue.

It was a combination of a mistake we made a long time ago and an issue in the DMSC firmware:

We run our R5 application in split mode, and to this day we use only one R5 core. But when we developed the SBL for our very first custom boards (in 2020), we set bootCoreOpts = 0 in the X.509 certificate of the SBL by calling the certificate generation script with -m EFUSE_DEFAULT, instead of the original bootCoreOpts = 2 via -m SPLIT_MODE.

According to TRM 4.5.4.1 Boot Info (OID 1.3.6.1.4.1.294.1.1) the effect is that SBL runs in lockstep mode.

This change was never a problem until we used the DMSC firmware of SDK7.3: after calling SBL_ConfigMcuLockStep(false), R5 processing becomes unstable.

Our fix is to change bootCoreOpts to 2 (split mode), and everything is now stable. Both the SBL and the application now run in split mode.

One processor where the unstable behaviour occurs is labelled: AM6526BACDXEAF
According to the data sheet, the last character 'F' means: "Safety features enabled including lock-step MCU"
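
The combination described above can be written down as a small consistency check. This is only a sketch of the reasoning in this post, not TI tooling: the two bootCoreOpts values are the ones named here, while the function names and the device_has_lockstep and app_wants_split flags are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as stated in this thread:
 * -m EFUSE_DEFAULT gives bootCoreOpts = 0, -m SPLIT_MODE gives 2. */
typedef enum {
    BOOT_CORE_OPTS_EFUSE_DEFAULT = 0,
    BOOT_CORE_OPTS_SPLIT_MODE    = 2
} boot_core_opts_t;

/* Assumption based on this thread: with EFUSE_DEFAULT the SBL starts in
 * lockstep on a device whose safety features include lock-step MCU. */
bool sbl_starts_in_lockstep(boot_core_opts_t opts, bool device_has_lockstep)
{
    /* Only the two values named in the thread are modelled. */
    assert((opts == BOOT_CORE_OPTS_EFUSE_DEFAULT) ||
           (opts == BOOT_CORE_OPTS_SPLIT_MODE));
    return (opts == BOOT_CORE_OPTS_EFUSE_DEFAULT) && device_has_lockstep;
}

/* True when the SBL would have to switch from lockstep to split at run
 * time, which is the case that became unstable in this report. */
bool needs_runtime_mode_switch(boot_core_opts_t opts, bool device_has_lockstep,
                               bool app_wants_split)
{
    return sbl_starts_in_lockstep(opts, device_has_lockstep) && app_wants_split;
}
```

With bootCoreOpts = 2 the check returns false for every device variant, which corresponds to the fix described above.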

Question: is it a known issue when the SBL starts in lockstep mode but then switches the R5 application to split mode? Or is this perhaps not supported?

Regards, Ruediger

0 Mukul Bhatnagar over 2 years ago in reply to Ruediger Wurth

TI__Guru* 84655 points

Hello Ruediger

Thanks for the update. Glad to see that you were able to figure out the issue in your setup. I regret the delay in response. As mentioned previously, support for SDK 7.x and for developing non-Linux-based secure solutions remains challenging for us and is outside our support scope at the moment due to legacy and expertise gaps.

I will try to check on your query on lock step vs split mode and get back to you - hopefully this is addressable looking at even the latest SBL offering etc.

Regards

Mukul

0 Mukul Bhatnagar over 2 years ago in reply to Mukul Bhatnagar

TI__Guru* 84655 points

Hello Ruediger

I have not yet heard back from a few folks who may be familiar with the specifics of this particular release. I will update the thread once I have additional information.

0 Mukul Bhatnagar over 1 year ago in reply to Mukul Bhatnagar

TI__Guru* 84655 points

Ruediger Wurth said:
We run our R5 application in split mode, and to this day we use only one R5 core. But when we developed the SBL for our very first custom boards (in 2020), we set bootCoreOpts = 0 in the X.509 certificate of the SBL by calling the certificate generation script with -m EFUSE_DEFAULT, instead of the original bootCoreOpts = 2 via -m SPLIT_MODE.

It appears that, because all of our EVMs use non-functional-safety variants (lockstep not enabled), our SBL flows do not account for lockstep mode. The switch from lockstep to split mode requires an R5 reset. This is likely what is causing the instability.

If the switch is managed through ROM via the certificates, then ROM manages this reset. But if you do it in the SBL --> Application hand-off, you would have needed to perform this reset as part of your SBL code.

Hope this helps some.

0 Ruediger Wurth over 1 year ago in reply to Mukul Bhatnagar

Intellectual 705 points

Thank you Mukul for your explanation.
Anyway the issue is fixed on our side and we can close this topic.

Regards, Ruediger
