MCU-PLUS-SDK-AM243X: Flash-driver sometimes reads garbage when using OSPI-PHY-mode with Gigadevice-Flash

Part Number: MCU-PLUS-SDK-AM243X

Hello,

we are using MCU Plus SDK 08.04 without the new flash driver, since there is still an issue with ISSI flashes that we are working on with TI. We also use different flashes on our PCBAs. The correct flash is identified by its ID on startup and the matching configuration is then loaded. So currently there can be an ISSI or a GigaDevice (GD25LX256E) flash.

The ISSI flashes work in PHY mode, but the GigaDevice flashes do not work properly. At least that's our first assumption. The problem may also be appearing now because we received some new batches of AM2434 Sitara SoCs, which are used in combination with the GigaDevice flash.

We noticed that at boot time, when booting from flash, the Flash_read commands in our SBL return success but the data read is not correct. This happens randomly, in roughly 20% of the power-ups.

This does not happen with the ISSI flash. We set the configuration of the GigaDevice flash according to its data sheet. The RBL works, and the read-ID command in 1s-1s-1s mode also works. Switching to octal SPI works as well, but then using PHY mode seems to break the operation.

The speed is set to 200 MHz or to 133 MHz; in both cases the problem occurs when the PHY is enabled. When the PHY is disabled, the problem does not occur.

We have 1k Pull-Down connected to DQS, which should be suitable.

To catch the issue I inserted a while-loop in our SBL-code when the read data is not correct. Interestingly when I connect via CCS the memory browser shows the correct data inside the flash and if I trigger a read-operation (so jumping to the read-operation in CCS and execute it) it suddenly reads the correct data.
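
The check described above can be sketched roughly as follows. This is only an illustration under assumptions: `FlashReadFn`, `flashRead_matches`, `FakeFlash` and `fakeFlash_read` are hypothetical names that are not part of the SDK, and the callback merely stands in for whatever read call the SBL actually uses.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical read callback standing in for the SBL's flash read.
 * Returns 0 on success, non-zero on failure. */
typedef int32_t (*FlashReadFn)(uint32_t offset, uint8_t *buf, uint32_t len, void *ctx);

/* Read `len` bytes at `offset` into `scratch` and compare them with the
 * known-good `expected` data. A "successful" read with wrong data is
 * reported as a mismatch, which is the case described in this post. */
static bool flashRead_matches(FlashReadFn readFn, void *ctx, uint32_t offset,
                              const uint8_t *expected, uint8_t *scratch, uint32_t len)
{
    if (readFn(offset, scratch, len, ctx) != 0)
    {
        return false; /* the read itself reported an error */
    }
    return memcmp(scratch, expected, len) == 0;
}

/* In-memory stand-in for a flash device, for host-side testing only. */
typedef struct
{
    const uint8_t *data;
    uint32_t       size;
} FakeFlash;

static int32_t fakeFlash_read(uint32_t offset, uint8_t *buf, uint32_t len, void *ctx)
{
    const FakeFlash *f = (const FakeFlash *)ctx;
    if ((offset > f->size) || (len > (f->size - offset)))
    {
        return -1; /* out of range */
    }
    memcpy(buf, f->data + offset, len);
    return 0;
}
```

On target, the mismatch branch is where a trap such as a while-loop would go so that a debugger can be attached.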

I thought as a temporary workaround that maybe a sleep would then provoke the correct behaviour but even a 200 ms-sleep after Flash_open still does not help.

I saw there is a big file for PHY-tuning. Does this take care of the mentioned bug of the PHY-mode in the errata (i2189)?

Our SBL uses sensitive data located in the flash that is important for booting, so a wrong read leads to a fallback and the correct firmware is no longer booted. This means that devices produced from now on will end up in an unusable state for our customers. This is a blocking point for us.

As a workaround we can disable the PHY-mode, but that would mean we would only run with 50 MHz. And we also noticed that the read-operations will take much longer. This would slow the application since we also use a webserver which needs flash-access and a file-system and so on.

Best regards

Felix

  • Hi ,

    The best guess is that their curve is slightly shifted here and there, and the algorithm doesn’t start at the correct ends. We have some stuff we can try, will discuss the same in the call scheduled.

    The one external factor which can fail the tuning is temperature (maybe ?).

    Best regards,
    Aakash

  • Hi Aakash,

    I will follow up with a phyGraph-Curve in a bad case and send it to you and Anand as we agreed in the meeting. The temperature thing could probably be a reason since it does not happen at first power-up but at like the 2nd or 3rd off-on-action when the device already has some temperature.

    Best regards

    Felix

  • Hi ,

    As discussed, please generate the graph data with OSPI_phyTuneGrapher, which needs to be called independently. This data can then be plotted, and from the plot we can determine the optimum parameters for the PHY. Do let us know if you have any update on this.

    Best Regards,
    Aakash

  • Hey Aakash, so I implemented the function and wanted to catch that case again.

    Well. I think we need to consider multiple cases.

    The first products that showed the problem were newly finished products with a housing and a potting compound and with the latest Sitara HS-FS derivatives. Since we are still in the development phase, we have multiple hardware stages out there. I now have a PCBA without potting compound, with a GigaDevice flash and a Sitara GP, which was produced some months earlier. Here I can't recreate the issue. I also tried heating it up to see if this affects it, but it does not.

    Also, it did not happen with all of the most recently produced devices, just with a few of them. I am also checking whether something in the layout changed. But it seems that this only occurs with the latest Sitara batch we received. I will come back with more detailed information and will also try to find a faulty device again; my last one got damaged in the process (we needed to scratch it open and some components were possibly damaged).

    I think I will receive a new device next week and then I try to recreate the issue and create the graph.

    Best regards,

    Felix

  • Hi Felix,

    I have requested help from the expert in this. He can help you find the problem much more in detail.

    Best Regards,
    Aakash

  • Hi Felix, 

    Do you have an estimate of how many new boards there are and on how many of them the issue appears? Also, once you find a failing board, is the issue reproducible constantly or does it only happen intermittently? Is the product also showing issues at boot time with the GD flash, or is this exclusive to runtime?

    Please let us know if you find any differences between the revisions of the schematics, while in the meantime I'll try to find out if there is anything between GP and HS-FS parts that could cause this behavior. I'll update the thread if we figure out something that could be cause for concern. 

    Best,

    Daniel

  • Hey Daniel,

    I coordinated with our colleagues. We will provide the information as soon as possible.

    Currently the issue seems to happen in 75% of the devices. It was definitely fixed by the workaround of disabling the OSPI PHY; our service department verified this. The issue was reproducible with the one board I had, but "constantly" only in the sense that every third to fourth boot showed this behaviour.

    The issues occur in the bootloader and also in the application. Note that the bootloader is a separate "application" in this case, and it uses the OSPI PHY. So the boot can succeed, but the application that follows then initializes the OSPI again, also with the OSPI PHY enabled. Here we also noticed at startup that our file system could not read the data correctly and ran into one of our custom asserts, which fires when the data is corrupt. So one of two cases occurred: either the bootloader failed to read correct data from flash, or the application's flash read failed.

    Sadly the mentioned device was damaged when opening it and thus I currently have no device here with which I can reproduce it now. We are opening another device currently. So I will keep you also updated and provide a PHY-Graph like mentioned in the beginning as soon as I can.

    Best regards

    Felix

  • Hi Felix, 

    Thanks for the update. I am still trying to find information on my side regarding the type of device; so far it seems that there shouldn't be anything between GP and HS-FS parts that could cause this issue.

    Best,

    Daniel

  • Attachments: ISSI_25WX256_working_02_out.txt, GigaDevice_25LX256E_working_02_out.txt
    The issue with the OSPI PHY could not yet be reproduced, but the phyTuneGraph could be recorded for the ISSI device (working without issues) and the GigaDevice device (which sometimes has issues). Both graphs were captured on working flash devices at room temperature with a 200 MHz clock. Can you see any reason why the GigaDevice flash might start with a bad DDR tuning?

  • Hi Robert,

    From the plots, it looks like RD delay is already being set to 2. From some research, switching the device to HS-FS most likely has no impact on the behavior observed before. A few things:

    • So you no longer have access to the board that was originally failing? An interesting experiment would have been to place an HS-FS device on it and check for functionality.
    • What is your current setup? Did you get a new board with an HS-FS device and GigaDevice flash, and are you currently testing on it?
    • Could you try lowering the speed to 166 MHz and see whether the issue is reproduced in the new setup or not?

    Best,

    Daniel

  • Hi Felix,

    This thread has been unlocked. Please answer Daniel's question.

    Best regards,

    Ming

  • Hi Daniel,
    sorry for replying so late.

    Our setup runs the second-stage bootloader; OSPI is configured with:

    /* OSPI attributes */
    static OSPI_Attrs gOspiAttrs[CONFIG_OSPI_NUM_INSTANCES] =
    {
        {
            .baseAddr             = CSL_FSS0_OSPI0_CTRL_BASE,
            .dataBaseAddr         = CSL_FSS0_DAT_REG1_BASE,
            .inputClkFreq         = 166666666U,
            .intrNum              = 171U,
            .intrEnable           = FALSE,
            .intrPriority         = 4U,
            .dtrEnable            = TRUE,
            .dmaEnable            = TRUE,
            .phyEnable            = TRUE,
            .dacEnable            = FALSE,
            .xferLines            = OSPI_XFER_LINES_OCTAL,
            .chipSelect           = OSPI_CS0,
            .frmFmt               = OSPI_FF_POL0_PHA0,
            .decChipSelect        = OSPI_DECODER_SELECT4,
            .baudRateDiv          = 4,
            .dmaRestrictedRegions = gOspiDmaRestrictRegions,
        },
    };

    We found a device showing the issue. In the failure case, a bad txDll/rxDll pair had been chosen. Tracing the DDR tuning algorithm showed that a singularity (an isolated passing point) was found in the low txDll/rxDll corner, outside the main passing region, with a successful read of the attack vector. The trace was logged to RAM, one entry per PHY tuning setting. Each entry is a uint32 whose first byte indicates an attack vector hit (0x01) or miss (0x00); the second byte is the rxDll value, the third the txDll value, and the fourth the rdDelay value.

    selectedPhyConfig struct OSPI_PhyConfig {txDLL=8,rxDLL=6,rdDelay=2} 0x700123F4
    otp1bottomLeft struct OSPI_PhyConfig {txDLL=8,rxDLL=6,rdDelay=2} 0x70012368
    otp1topRight struct OSPI_PhyConfig {txDLL=9,rxDLL=6,rdDelay=2} 0x700123C8
    otp1gapLow struct OSPI_PhyConfig {txDLL=9,rxDLL=6,rdDelay=2} 0x70012388
    otp1gapHigh struct OSPI_PhyConfig {txDLL=0,rxDLL=0,rdDelay=0} 0x70012378
    otp1rxLow struct OSPI_PhyConfig {txDLL=18,rxDLL=6,rdDelay=2} 0x700123A8
    otp1rxHigh struct OSPI_PhyConfig {txDLL=18,rxDLL=42,rdDelay=2} 0x70012398
    otp1txLow struct OSPI_PhyConfig {txDLL=8,rxDLL=36,rdDelay=2} 0x700123E8
    otp1txHigh struct OSPI_PhyConfig {txDLL=63,rxDLL=12,rdDelay=2} 0x700123D8
    otp1temp struct OSPI_PhyConfig {txDLL=59,rxDLL=38,rdDelay=2} 0x700123B8
    otp1slope float 0.654545426 0x7007637C
    otp1intercept float 0.763637543 0x70076378
    @70010768 --> phyTrace
    stat rx tx rd
    01 00 00 00 : --> Inital check if attack vector exists
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX


    In this log, line 167 shows that the singularity was chosen as the start of the diagonal. Line 172 shows that the next value on the diagonal fails. In line 185 the setting is finally tested again, with an attack vector hit, and is then used by the driver. Further operation of the bootloader failed, possibly due to temperature, because all surrounding settings fail.
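
    The trace format described in this post can be decoded, and a suspicious isolated hit flagged, roughly as follows. This is a sketch under assumptions: `PhyTraceEntry`, `phyTrace_decode` and `phyTrace_countNeighbourHits` are hypothetical helper names, the bytes are taken in memory order as in the `stat rx tx rd` header of the log, and rdDelay is ignored when counting neighbours.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical decoded view of one 4-byte trace entry. */
typedef struct
{
    bool    hit;     /* byte 0: 0x01 = attack vector read back, 0x00 = miss */
    uint8_t rxDll;   /* byte 1 */
    uint8_t txDll;   /* byte 2 */
    uint8_t rdDelay; /* byte 3 */
} PhyTraceEntry;

/* Decode from raw bytes in memory order, so no endianness assumption is needed. */
static PhyTraceEntry phyTrace_decode(const uint8_t raw[4])
{
    PhyTraceEntry e;
    e.hit     = (raw[0] == 0x01U);
    e.rxDll   = raw[1];
    e.txDll   = raw[2];
    e.rdDelay = raw[3];
    return e;
}

static uint8_t absDiffU8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a > b) ? (a - b) : (b - a));
}

/* Count traced hits whose txDll and rxDll both lie within `radius` of the
 * candidate (tx, rx), excluding the candidate itself. A candidate with zero
 * neighbouring hits is likely a singularity like the one seen in the trace. */
static uint32_t phyTrace_countNeighbourHits(const uint8_t *trace, uint32_t numEntries,
                                            uint8_t tx, uint8_t rx, uint8_t radius)
{
    uint32_t count = 0U;
    for (uint32_t i = 0U; i < numEntries; i++)
    {
        PhyTraceEntry e = phyTrace_decode(trace + (4U * i));
        if (!e.hit || ((e.txDll == tx) && (e.rxDll == rx)))
        {
            continue; /* skip misses and the candidate itself */
        }
        if ((absDiffU8(e.txDll, tx) <= radius) && (absDiffU8(e.rxDll, rx) <= radius))
        {
            count++;
        }
    }
    return count;
}
```

    Only neighbours that the tuning run actually probed appear in the trace, so a low count is a hint rather than proof that a point is isolated.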



  • Our idea is now to replace the DDR tuning algorithm entirely with the following solution, which has a fast and fixed runtime:

    /*
    *
    The algorithm here is used to find the most stable settings for OSPI DDR operation.
    Typical map of working settings (X=hit, 0=miss):
    _________________________________________ RX_Dll
    |0000000000000000000000000000000000000000
    |0000000000000000000000000000000000000000
    |00000XXXXXXXXXXXX00000000000000000000000
    |0000XXXXXXXXXXXXX00000000000000000000000
    |000XXXXXXXXXXXXXX00000000000000000000000
    |00XXXXXXXXXXXXXXX00000000000000000000000
    |00XXXXXXXXXXXXXXX00000000000000000000000
    |00XXXXXXXXXXXXXXX00000000000000000000000
    |00XXXXXXXXXXXXXXX00000000000000000000000
    |00XXXXXXXXXXXXXXX00000000000000000000000
    |00XXXXXXXXXXXXXXX00000000000000000000000
    |0000000000000000000000000000000000000000
    |0000000000000000000000000000000000000000
    |0000000000000000000000000000000000000000
    |0000000000000000000000000000000000000000
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    The idea is to limit the search to rdDelay 2 only. Additionally, we reduce the rxDll and txDll range to a small window where we expect the hits (the existing algorithm already does this). The algorithm then just tests points within this window on a coarse grid. By weighting the values of the successful txDll and rxDll settings, a txDll/rxDll pair in the middle of the successful region is chosen.

    Con:
    - scan uses only the rdDelay 2

    Pro:
    + weighting the values avoids choosing singular values (as with our GigaDevice)
    + deterministic runtime
    + faster runtime due to very few test settings

    What do you think about this algorithm? Do you see any issues?
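
    A minimal sketch of the proposal above, assuming a plain unweighted mean of all grid hits as the "weighting" (the post does not specify the exact scheme). `PhyProbeFn`, `PhyScanWindow`, `phyGridScan_centroid` and `simulatedProbe` are hypothetical names, not SDK APIs; on target the probe would set txDll/rxDll with rdDelay fixed at 2 and try to read the attack vector.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical probe callback: returns true if the attack vector reads back
 * correctly for the given txDll/rxDll (rdDelay is fixed by the caller). */
typedef bool (*PhyProbeFn)(uint8_t txDll, uint8_t rxDll, void *ctx);

typedef struct
{
    uint8_t txMin, txMax; /* inclusive txDll search window */
    uint8_t rxMin, rxMax; /* inclusive rxDll search window */
    uint8_t step;         /* grid spacing, must be > 0 */
} PhyScanWindow;

/* Probe every grid point in the window and return the number of hits.
 * If there is at least one hit, the rounded mean (centroid) of all hit
 * coordinates is written to *txOut / *rxOut. The number of probes depends
 * only on the window and step, so the runtime is fixed. */
static uint32_t phyGridScan_centroid(const PhyScanWindow *w, PhyProbeFn probe, void *ctx,
                                     uint8_t *txOut, uint8_t *rxOut)
{
    uint32_t hits = 0U, txSum = 0U, rxSum = 0U;

    if (w->step == 0U)
    {
        return 0U;
    }
    /* 32-bit loop counters avoid uint8_t wrap-around near 255. */
    for (uint32_t tx = w->txMin; tx <= w->txMax; tx += w->step)
    {
        for (uint32_t rx = w->rxMin; rx <= w->rxMax; rx += w->step)
        {
            if (probe((uint8_t)tx, (uint8_t)rx, ctx))
            {
                hits++;
                txSum += tx;
                rxSum += rx;
            }
        }
    }
    if (hits > 0U)
    {
        *txOut = (uint8_t)((txSum + (hits / 2U)) / hits);
        *rxOut = (uint8_t)((rxSum + (hits / 2U)) / hits);
    }
    return hits;
}

/* Host-side stand-in for the hardware probe: a rectangular passing region
 * plus one isolated hit at (tx=8, rx=8), loosely mimicking the bad case. */
static bool simulatedProbe(uint8_t txDll, uint8_t rxDll, void *ctx)
{
    (void)ctx;
    bool inRegion = (txDll >= 20U) && (txDll <= 40U) && (rxDll >= 10U) && (rxDll <= 30U);
    bool singular = (txDll == 8U) && (rxDll == 8U);
    return inRegion || singular;
}
```

    With the simulated probe, a window of 0..48 on both axes and a step of 4, the isolated hit moves the result by only one DLL step and the chosen point stays well inside the passing region. A centroid can fall outside a non-convex region, so re-probing the chosen point and its neighbours before use seems advisable.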

  • Hi ,

    What do you think about this algorithm? Do you see any issues?

    How often do you intend to call this function ?

    • On every Boot ?
    • In a low priority task periodically ?

    Best Regards,
    Aakash

  • Hi Aakash,

    The idea is to call this algorithm on each boot up.

    What is your experience with temperature changes: is it sufficient to determine the ideal settings once? For example, in a worst-case scenario, let's say we clock the OSPI at 200 MHz, the device starts booting at -10°C, and later in the application the device heats itself up to about 90°C. By how many rxDll/txDll steps will the setting drift? If it is still in the valid region, it will not be necessary to readjust the settings. Otherwise we will need a task that measures the temperature and reruns the algorithm once we leave the temperature range in which we expect the settings to remain valid.
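
    The retune decision for such a task could look roughly like this. It is a sketch only: `PhyTempGuard` and `phyTempGuard_needsRetune` are hypothetical names, and the tolerated drift `maxDriftDegC` is an assumed parameter that would have to come from characterization, since the thread gives no figure for it.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for a periodic temperature check. */
typedef struct
{
    int16_t tunedAtDegC;  /* temperature at which the PHY was last tuned */
    int16_t maxDriftDegC; /* assumed tolerable drift; must be characterized */
} PhyTempGuard;

/* Returns true once the current temperature has moved further from the
 * tuning temperature than the assumed tolerable drift. */
static bool phyTempGuard_needsRetune(const PhyTempGuard *g, int16_t nowDegC)
{
    int32_t delta = (int32_t)nowDegC - (int32_t)g->tunedAtDegC;
    if (delta < 0)
    {
        delta = -delta;
    }
    return delta > (int32_t)g->maxDriftDegC;
}
```

    After a successful retune, the task would store the new temperature in `tunedAtDegC` so that the window follows the device as it warms up.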

    Best Regards,
    Robert

  • Hi Robert,

    The current algorithm in the SDK does not contain the temperature optimization needed to choose the best possible point. Because of this, the scenario you are describing will most likely result in a failure as the temperature rises. It is hard to say how large the variation would be without testing or without having a working temperature optimization in the point search. I will discuss this with some of the software experts and get back to you with possible solutions. I'm mostly looking to evaluate how much time it would take to optimize this code to take temperature into account, to see if we can come up with a fast fix. I'll update you on Monday with more details after our discussion.

    Best,

    Daniel 

  • Hi Robert,

    I apologize for the delay on this topic; we have run into issues trying to contact the right experts to help with integrating temperature optimization into the code. I will bring this up in discussions again next week and update you by Wednesday.

    Thanks,

    Daniel

  • Hi Robert, 

    I will file a ticket to include temperature optimization in future releases of the AM243x SDK. I have found an implementation in the Processor SDK that takes the temperature range into account when choosing a tuning point: nor_spi_phy_tune.c, located under packages/ti/board/src/flash/nor/ospi in the processor-sdk/pdk repository.

    The code provided in the CGIT repo is applicable to AM64x so you can use it as a reference for AM243x. You'll be able to see how it is implemented in lines 1108-1110:

    Please let me know if this helps you move forward with your issue.

    Best,

    Daniel