This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

AM6442: Boot time increased in SDK 08.00.00.21

Part Number: AM6442

Hi,

My customer is evaluating AM6442 on TI EVM and found boot time was increased (~1sec) in SDK 08.00.00.21 compared to 07.03.01.006.
In customer's application, the boot up time is critical and ~1 second is big loss.

Here is u-boot log. As you can see, just after the line "Trying to boot from MMC2", the time increases from 0.72sec to 1.59sec.
07.03.01.006

[1.143601 0.002127] SYSFW ABI: 3.1 (firmware rev 0x0015 '21.1.1--v2021.01a (Terrific Lla')
[1.145946 0.002345] Trying to boot from MMC2
[1.872576 0.726630]
[1.872684 0.000108]
[1.872730 0.000046] U-Boot 2020.01-ge995ed0ec1 (May 28 2021 - 14:48:57 +0000)
[1.874734 0.002004] 


08.00.00.21
[1.052047 0.002139] SYSFW ABI: 3.1 (firmware rev 0x0015 '21.5.0--v2021.05 (Terrific Llam')
[1.054408 0.002361] Trying to boot from MMC2
[2.645767 1.591359]
[2.645872 0.000105]
[2.645919 0.000047] U-Boot 2021.01-g53e79d0e89 (Aug 10 2021 - 21:47:02 +0000)
[2.647926 0.002007]


Can you fix this?
Full logs are also attached.
[0.000001 0.000001] ��
[0.682366 0.682365] U-Boot SPL 2020.01-ge995ed0ec1 (May 28 2021 - 16:20:13 +0000)
[0.696507 0.014141] EEPROM not available at 80, trying to read at 81
[0.742221 0.045714] SYSFW ABI: 3.1 (firmware rev 0x0015 '21.1.1--v2021.01a (Terrific Lla')
[0.815961 0.073740] SPL initial stack usage: 13396 bytes
[0.839675 0.023714] Trying to boot from MMC2
[0.943860 0.104185] Starting ATF on ARM64 core...
[0.944943 0.001083] 
[0.953376 0.008433] NOTICE:  BL31: v2.4(release):2021.00.003-dirty
[0.954930 0.001554] NOTICE:  BL31: Built : 14:41:43, May 28 2021
[1.141400 0.186470] 
[1.141474 0.000074] U-Boot SPL 2020.01-ge995ed0ec1 (May 28 2021 - 14:48:57 +0000)
[1.143601 0.002127] SYSFW ABI: 3.1 (firmware rev 0x0015 '21.1.1--v2021.01a (Terrific Lla')
[1.145946 0.002345] Trying to boot from MMC2
[1.872576 0.726630] 
[1.872684 0.000108] 
[1.872730 0.000046] U-Boot 2020.01-ge995ed0ec1 (May 28 2021 - 14:48:57 +0000)
[1.874734 0.002004] 
[1.874778 0.000044] SoC:   AM64X SR1.0
[1.878339 0.003561] Model: Texas Instruments AM642 SK
[1.879479 0.001140] Board: AM64-SKEVM rev E2
[1.880276 0.000797] DRAM:  2 GiB
[1.916881 0.036605] MMC:   sdhci@fa00000: 1
[1.924966 0.008085] Loading Environment from FAT... In:    serial@2800000
[1.934926 0.009960] Out:   serial@2800000
[1.935624 0.000698] Err:   serial@2800000
[1.936274 0.000650] Net:   
[1.953973 0.017699] Warning: ethernet@8000000 using MAC address from ROM
[1.955636 0.001663] eth0: ethernet@8000000
[1.958270 0.002634] Hit any key to stop autoboot:  0 

[0.000001 0.000001] ��
[0.555449 0.555448] U-Boot SPL 2021.01-g53e79d0e89 (Aug 10 2021 - 21:47:42 +0000)
[0.565159 0.009710] EEPROM not available at 80, trying to read at 81
[0.600891 0.035732] SYSFW ABI: 3.1 (firmware rev 0x0015 '21.5.0--v2021.05 (Terrific Llam')
[0.663679 0.062788] SPL initial stack usage: 13392 bytes
[0.687864 0.024185] Trying to boot from MMC2
[0.884963 0.197099] Starting ATF on ARM64 core...
[0.886210 0.001247] 
[0.894609 0.008399] NOTICE:  BL31: v2.5(release):08.00.00.004-dirty
[0.896256 0.001647] NOTICE:  BL31: Built : 21:46:29, Aug 10 2021
[1.049839 0.153583] 
[1.049908 0.000069] U-Boot SPL 2021.01-g53e79d0e89 (Aug 10 2021 - 21:47:02 +0000)
[1.052047 0.002139] SYSFW ABI: 3.1 (firmware rev 0x0015 '21.5.0--v2021.05 (Terrific Llam')
[1.054408 0.002361] Trying to boot from MMC2
[2.645767 1.591359] 
[2.645872 0.000105] 
[2.645919 0.000047] U-Boot 2021.01-g53e79d0e89 (Aug 10 2021 - 21:47:02 +0000)
[2.647926 0.002007] 
[2.653722 0.005796] SoC:   AM64X SR1.0
[2.654566 0.000844] Model: Texas Instruments AM642 SK
[2.655698 0.001132] Board: AM64-SKEVM rev E2
[2.656540 0.000842] DRAM:  2 GiB
[2.696028 0.039488] MMC:   mmc@fa00000: 1
[2.709749 0.013721] In:    serial@2800000
[2.710670 0.000921] Out:   serial@2800000
[2.711413 0.000743] Err:   serial@2800000
[2.712157 0.000744] Net:   eth2: ethernet@8000000
[2.726094 0.013937] Hit any key to stop autoboot:  0 



Thanks and regards,
Koichiro Tashiro

  • Hi Koichiro-san,
    I'll look into it, and get back to you early next week.
    Best,
    -Hong

  • Hi Koichiro-san,
    I tested the SDK 8.2, and uploaded the SD boot time log in attachment.
    www.ti.com/.../08.02.00.14
    Can we check the SDK 8.2 on your side?
    Best,
    -Hong

    sitara@sitara-Latitude-E6410:~/test/SDK_debug/grabserial$ grabserial -d /dev/ttyUSB0 -t -m "U-Boot SPL" -e 30
    [0.000001 0.000001] 
    [0.000332 0.000331] U-Boot SPL 2021.01-g44a87e3ab8 (Mar 30 2022 - 16:29:38 +0000)
    [0.031618 0.031618] SYSFW ABI: 3.1 (firmware rev 0x0016 '22.1.1--v2022.01 (Terrific Llam')
    [0.079510 0.047892] SPL initial stack usage: 13392 bytes
    [0.095564 0.016054] Trying to boot from MMC2
    [0.303653 0.208089] Starting ATF on ARM64 core...
    [0.304561 0.000908] 
    [0.304609 0.000048] NOTICE:  BL31: v2.5(release):08.01.00.006-dirty
    [0.306290 0.001681] NOTICE:  BL31: Built : 16:24:35, Mar 30 2022
    [0.447660 0.141370] 
    [0.447759 0.000099] U-Boot SPL 2021.01-g44a87e3ab8 (Mar 30 2022 - 16:28:20 +0000)
    [0.002094 0.002094] SYSFW ABI: 3.1 (firmware rev 0x0016 '22.1.1--v2022.01 (Terrific Llam')
    [0.079975 0.077881] Trying to boot from MMC2
    [0.815820 0.735845] 
    [0.815915 0.000095] 
    [0.815974 0.000059] U-Boot 2021.01-g44a87e3ab8 (Mar 30 2022 - 16:28:20 +0000)
    [0.818169 0.002195] 
    [0.818222 0.000053] SoC:   AM64X SR1.0
    [0.818833 0.000611] Model: Texas Instruments AM642 EVM
    [0.832392 0.013559] Board: AM64-GPEVM rev E2
    [0.833412 0.001020] DRAM:  2 GiB
    [0.863792 0.030380] NAND:  0 MiB
    [0.864151 0.000359] MMC:   mmc@fa10000: 0, mmc@fa00000: 1
    [0.880716 0.016565] In:    serial@2800000
    [0.881816 0.001100] Out:   serial@2800000
    [0.895946 0.014130] Err:   serial@2800000
    [0.943746 0.047800] Net:   eth0: ethernet@8000000port@1
    [0.960563 0.016817] Hit any key to stop autoboot:  2 
    
    

  • Hi Hong,

    Thanks for your reply. Customer confirmed the boot time is decreased in SDK8.2.
    But it is not clear why SDK 8.0 was longer. Customer does not have a plan to migrate SDK 8.2 soon, so they wants to know how to fix it in SDK8.0.

    Customer did some debug and found memcopy in u-boot takes longer time in SDK8.0.
    SDK 07.03.01.006: memcopy takes 0.006s.

    [2022-04-19 10:47:58.432] External data: dst=80800000, offset=360, size=d54d8
    [2022-04-19 10:47:58.436] memcpy(dest = 0x80800000, src = 0x80800020, n = 0xd54d8
    [2022-04-19 10:47:58.442] memcpy() done.
    [2022-04-19 10:47:58.444] Image OS is U-Boot
    


    SDK 08.00.00.21: memcopy takes 0.697s.
    [2022-04-19 10:43:18.877] External data: dst=80800000, offset=514, size=e80f8
    [2022-04-19 10:43:18.881] memcpy(dest = 0x80800000, src = 0x80800014, n = 0xe80f8
    [2022-04-19 10:43:19.578] memcpy() done.
    [2022-04-19 10:43:19.580] Image OS is U-Boot
    


    BTW, above debug logs were taken with below changes.
    diff --git a/common/spl/spl_fit.c b/common/spl/spl_fit.c
    index 55511a90d0..74dd182a1d 100644
    --- a/common/spl/spl_fit.c
    +++ b/common/spl/spl_fit.c
    @@ -16,6 +16,9 @@
    #include <asm/cache.h>
    #include <linux/libfdt.h>
    
    +// Enable Debug
    +#define _DEBUG 1
    +
    DECLARE_GLOBAL_DATA_PTR;
    
     #ifndef CONFIG_SPL_LOAD_FIT_APPLY_OVERLAY_BUF_SZ
    @@ -329,8 +332,41 @@ static int spl_load_fit_image(struct spl_load_info *info, ulong sector,
                               }
                               length = size;
                 } else {
    +                         printf("memcpy(dest = 0x%lx, src = 0x%p, n = 0x%x\n", load_addr, src, (unsigned int) length);
                               memcpy((void *)load_addr, src, length);
    +                         printf("memcpy() done.\n");
    +           }
    +
                  }
    
                 if (image_info) {
    


    Why memcopy performance was different in each SDK version?

    Thanks and regards,
    Koichiro Tashiro

  • Tashiro-san,

    as per related email discussion, the issue seems to be caused by U-Boot's memcpy() implementation in combination how the data in the u-boot.img FIT image happens to end up getting aligned for a particular build, which can depend on many factors related what all goes into a particular FIT image and how large all those individual items are. Specifically, on U-Boot SPL running on the A53 core, if the memcpy() source data is not 8-byte aligned (aarch64 native word size), a byte-wise memcpy() subroutine is used, as shown in below object/source dump from the A53 SPL code:

    a0797059@dasso:~/git/u-boot (HEAD detached at 594c170701)
    $ $TOOLCHAIN_PATH_ARMV8/bin/aarch64-none-linux-gnu-objdump -S .out_a53/spl/u-boot-spl | grep -A 30 '<memcpy>\:'
    000000008008ea90 <memcpy>:
    void * memcpy(void *dest, const void *src, size_t count)
    {
            unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src;
            char *d8, *s8;
    
            if (src == dest)
        8008ea90:   eb01001f        cmp     x0, x1
        8008ea94:   54000100        b.eq    8008eab4 <memcpy+0x24>  // b.none
                    return dest;
    
            /* while all data is aligned (common case), copy a word at a time */
            if ( (((ulong)dest | (ulong)src) & (sizeof(*dl) - 1)) == 0) {
        8008ea98:   aa010003        orr     x3, x0, x1
        8008ea9c:   f2400863        ands    x3, x3, #0x7
        8008eaa0:   54000120        b.eq    8008eac4 <memcpy+0x34>  // b.none
        8008eaa4:   aa0003e4        mov     x4, x0
        8008eaa8:   d2800003        mov     x3, #0x0                        // #0
                    }
            }
            /* copy the reset one byte at a time */
            d8 = (char *)dl;
            s8 = (char *)sl;
            while (count--)
        8008eaac:   eb03005f        cmp     x2, x3
        8008eab0:   540001c1        b.ne    8008eae8 <memcpy+0x58>  // b.any
                    *d8++ = *s8++;
    
            return dest;
    }
        8008eab4:   d65f03c0        ret

    A solution to this is to force 8-byte alignment (instead of the default 4-byte alignment) for the generation of the u-boot.img FIT image by updating the mkimage build parameters for that image like this:

    $ git show
    commit ddafbea530be93485fc5b75c63e172fbd51f5223 (HEAD -> ti-u-boot-2021.01-fast-memcpy-dev)
    Author: Andreas Dannenberg <dannenberg@ti.com>
    Date:   Thu Jan 12 11:53:29 2023 -0600
    
        mkimage: Set FIT image align size to 8 to better support 64-bit devices
    
        The default FIT image header and data align size of 4 as defined within
        U-Boot's mkimage tool itself can lead to inefficiencies when used on
        64-bit architectures, for example the memcpy() operations performed
        during FIT image data extractions could end up being performed as
        byte-wise copy operations instead of the much faster full native
        word-wise (8 byte) operations, depending on how the binary data in the
        FIT image happends to be aligned (based on FIT header size, for example).
    
        To improve and make this behavior more consistent force an 8-byte
        alignment for the U-Boot based FIT image usually loaded from SPL.
    
        Signed-off-by: Andreas Dannenberg <dannenberg@ti.com>
    
    diff --git a/Makefile b/Makefile
    index c8a878a4ce..01e0b8c1da 100644
    --- a/Makefile
    +++ b/Makefile
    @@ -1383,7 +1383,7 @@ endif
     endif
    
     ifdef CONFIG_SPL_LOAD_FIT
    -MKIMAGEFLAGS_u-boot.img = -f auto -A $(ARCH) -T firmware -C none -O u-boot \
    +MKIMAGEFLAGS_u-boot.img = -f auto -A $(ARCH) -B 8 -T firmware -C none -O u-boot \
            -a $(CONFIG_SYS_TEXT_BASE) -e $(CONFIG_SYS_UBOOT_START) \
            -p $(CONFIG_FIT_EXTERNAL_OFFSET) \
            -n "U-Boot $(UBOOTRELEASE) for $(BOARD) board" -E \
    

    Please let me know if this addresses the concern on your end, and I'll see how we can improve the official U-Boot code base with such a change.

    Regards, Andreas

  • Hi Andreas,

    Thanks for your answer. I asked customer to check it.

    Thanks and regards,
    Koichiro Tashiro

  • Tashiro-san,

    thanks. Please close the loop once you hear back, so we can drive an official fix as needed.

    Regards, Andreas

  • Hi Andreas,

    Customer is trying to check the proposed solution.
    Meanwhile, could you answer below customer's question?

    Q) The reason why boot time is slower in SDK8.x is understood.
    But why the same problem was not observed in SDK7.x?
    What causes the boot time differences between SDK7.x and 8.x?

    Thanks and regards,
    Koichiro Tashiro

  • Hi Tashiro-san,

    But why the same problem was not observed in SDK7.x?
    What causes the boot time differences between SDK7.x and 8.x?

    I suspect this is caused by differences in how the DTB blobs are generated in SDKv7.x vs SDKv8.x. The later SDK may include additional items (bytes) into one of the headers (possibly due to changes to tools/fit_image.c between "old" and "new" U-Boot), thus shifting the alignment of the DTB-included data section(s) away from a 8-byte (64-bit) alignment boundary. You could reverse-compile the U-Boot FIT image and see if there are any differences in number of entries and entry length between the "old" and "new" U-Boot to check this theory. Something like this...

    $ dtc -I dtb .out_a53/u-boot.img

    Regards, Andreas

  • Hi Andreas,

    The customer understood the reason why boot time is slower with SDK8.x than SDK7.x.
    He has another question.
    We understood memcopy with unaligned data gets slower because unaligned data are handled byte-wise.
    But the customer observed unaligned memcopy was not so slow in SDK7.x.
    Is there any reason why unaligned memcopy is faster in SDK7.x than SDK8.x?

    SDK7.3

    [2022-12-27 11:29:03.968] Benchmark memcpy()
    [2022-12-27 11:29:03.969] memcpy(dest = 0x80900000, src = 80800000, n = 0x100000
    [2022-12-27 11:29:03.969] memcpy() done.
    [2022-12-27 11:29:03.969] memcpy(dest = 0x80900000, src = 80800014, n = 0x100000
    [2022-12-27 11:29:03.985] memcpy() done.
    [2022-12-27 11:29:03.985] memcpy(dest = 0x80900000, src = 80800020, n = 0x100000
    [2022-12-27 11:29:03.985] memcpy() done.
    [2022-12-27 11:29:03.985] memcpy(dest = 0x80900000, src = 80800024, n = 0x100000
    [2022-12-27 11:29:04.002] memcpy() done.
    


    SDK8.2
    [2022-12-27 11:20:04.961] Benchmark memcpy()
    [2022-12-27 11:20:04.961] memcpy(dest = 0x80900000, src = 0x80800000, n = 0x100000
    [2022-12-27 11:20:05.030] memcpy() done.
    [2022-12-27 11:20:05.030] memcpy(dest = 0x80900000, src = 0x80800014, n = 0x100000
    [2022-12-27 11:20:06.000] memcpy() done.
    [2022-12-27 11:20:06.000] memcpy(dest = 0x80900000, src = 0x80800020, n = 0x100000
    [2022-12-27 11:20:06.064] memcpy() done.
    [2022-12-27 11:20:06.064] memcpy(dest = 0x80900000, src = 0x80800024, n = 0x100000
    [2022-12-27 11:20:07.036] memcpy() done.
    


    Thanks and regards,
    Koichiro Tashiro

  • Hi Tashiro-san,

    can you please double-check the concern is now between SDK v7.3 and SDK v8.2? We have had discussed different versions/SDKs before, so I want to make sure I look at the right thing.

    I had another look at the U-Boot source code between v7.3 and v8.2 (via `git diff 07.03.00.005 08.02.00.006`) but nothing really spiked out. I also looked at the disassembler code of both memcpy() routines, and they look identical. I suspect it'll be down to a difference in clock system setup (by way of SYSFW built-in defaults, which may be different?), and/or DDR settings. But I need to run some tests on hardware to dig a bit deeper and see if I can confirm the customer findings. But first I want to make sure I'm using the correct SDKs.

    Regards, Andreas

  • Hi Andreas,

    The SDK versions the customer compared memcopy() are SDK7.3, SDK8.2 and SDK8.4.
    I attached SDK7.3 and SDK8.2 in the last post as example.
    SDK8.4 showed similar behavior as SDK8.2, i.e. unaligned memcopy() takes long time.

    SDK8.4

    [2022-12-27 11:10:46.048] Benchmark memcpy()
    [2022-12-27 11:10:46.048] memcpy(dest = 0x80900000, src = 0x80800000, n = 0x100000
    [2022-12-27 11:10:46.079] memcpy() done.
    [2022-12-27 11:10:46.079] memcpy(dest = 0x80900000, src = 0x80800014, n = 0x100000
    [2022-12-27 11:10:46.533] memcpy() done.
    [2022-12-27 11:10:46.533] memcpy(dest = 0x80900000, src = 0x80800020, n = 0x100000
    [2022-12-27 11:10:46.565] memcpy() done.
    [2022-12-27 11:10:46.565] memcpy(dest = 0x80900000, src = 0x80800024, n = 0x100000
    [2022-12-27 11:10:47.035] memcpy() done.
    


    Thanks and regards,
    Koichiro Tashiro

  • Hi Tashiro-san,

    ok, thanks. I'll focus my investigation on 7.3 vs 8.2 then. It'll take a couple of hours to run some tests to see if I can find an additional root cause for the still-existing difference in the memcpy() behavior itself (independent of DTB alignment issues discussed earlier) so I need to work this in, hopefully this week.

    Regards, Andreas

  • Hi Tashiro-san,

    as per offline discussion, un-locking this thread so we can continue the investigation. I was able to clear my plate some and allocate time to work this issue this week. Will get back in a day or two with an update.

    Regards, Andreas

  • Hi Tashiro-san,

    I think I figured out the last piece of the differences in speed between memcpy() in U-Boot from out SDK v7.3 vs v8.2 in addition to the concern with the un-aligned memcpy() of DTB artifacts that we discussed earlier.

    Long story short, U-Boot SPL (A53) from SDK v7.3 has DCACHE enabled using this Git commit...

    https://git.ti.com/cgit/ti-u-boot/ti-u-boot/commit/common/spl/spl.c?h=ti-u-boot-2020.01&id=bd236384ceef74d51f4661f79c629377d7705fb2

    ...whereas v8.2 has not because this setting did not get migrated when we moved from SDK v7.x to SDK v8.x, causing a regression. This behavior was fixed for AM64x starting SDK v8.4 (Git tag 08.04.01.001 or later), which re-introduced the enabling of DCACHE in U-Boot SPL (A53) using this Git commit here...

    https://git.ti.com/cgit/ti-u-boot/ti-u-boot/commit/arch/arm/mach-k3/am642_init.c?h=ti-u-boot-2021.01&id=3f031bb1baad799bec97f36ca12ca74fd7c79b05

    Note 1: The method of how/where DCACHE is enabled is different between SDK v7.3 and SDK v8.4+. In the older SDK it was done in common/spl/spl.c:board_init_r(), and in newer SDKs it is done in arch/arm/mach-k3/am642_init.c:board_init_f().

    Note 2: This discussion only applies to U-Boot SPL (A53). In U-Boot (A53) itself the DCACHE is always enabled across those SDK versions.

    Note 3: Interestingly customer reported in SDK v8.4 the un-aligned access is also slow, which it should not. I would like to ask them to verify/double-check which specific U-Boot commit they used for testing. Anything newer than Git tag 08.04.01.001 should not show this issue. They should also inspect their v8.4 source code to see whether it contains the function call to enable the cache or not as shown in the linked commit if they still see issues (there should be a call to spl_enable_dcache() from arch/arm/mach-k3/am642_init.c:board_init_f() ).

    You can actually simulate the impact of DCACHE on memcpy() from the U-Boot commandline by turning on CONFIG_CMD_CACHE=y as follows (this testing was done using SDK v8.6):

    # Turn on DCACHE (note that it is default ON already)
    => dcache on
    
    # Perform "aligned" memcpy
    => time cp 0x80800000 0x80900000 0x100000
    
    time: 0.006 seconds
    
    # Perform "unaligned" memcpy
    => time cp 0x80800014 0x80900000 0x100000
    
    time: 0.026 seconds
    
    # Turn off DCACHE (to simulate U-Boot SPL A53 behavior with SDK v8.2)
    => dcache off
    
    # Perform "aligned" memcpy
    => time cp 0x80800000 0x80900000 0x100000
    
    time: 0.188 seconds
    
    # Perform "unaligned" memcpy
    => time cp 0x80800014 0x80900000 0x100000
    
    time: 1.489 seconds

    This experiment pretty much confirms the originally reported concern. The biggest performance impact is caused by having un-aligned data and DCACHE off. When DCACHE is on, the performance difference is still visible between aligned and un-aligned data block but it is much smaller (delta of ~1.3s reduces to 20ms for 1MB).

    Regards, Andreas