TDA4VM: In-line ECC error handling in J721E

Part Number: TDA4VM

Tool/software:

Dear Expert, 

Once DRAM in-line ECC is enabled, R5 u-boot performs the initial write of the ECC-protected area in advance.

This increases system boot latency (even when the BIST module does the initial write) and hurts the user experience.

To reduce boot latency, I have an idea: postpone the initial write of the ECC-protected area from the R5 u-boot stage to a post-boot stage (e.g., after Linux boots up).

Given the SoC's RMW (read-modify-write) policy on the ECC-protected area, this means:

  1. After the system boots up, I do the initial write of a specific pattern to the ECC-protected area (which was not written in R5 u-boot, to reduce boot latency).
  2. Because of RMW, a read is issued ahead of each write; this extra read returns garbage data and generates an ECC error interrupt.
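The RMW effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the 64-bit word size, the 8-bit XOR check code, and all names (`ecc_word`, `toy_check`, `write64`, `write32_rmw`) are hypothetical and do not reflect the real ECC quantum or SECDED code used by the MSMC2DDR bridge.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: one 64-bit data word plus an 8-bit check code. */
struct ecc_word {
    uint64_t data;
    uint8_t  check;
};

/* Hypothetical check code: XOR of the eight data bytes (detects, cannot correct). */
static uint8_t toy_check(uint64_t data)
{
    uint8_t c = 0;
    for (int i = 0; i < 8; i++)
        c ^= (uint8_t)(data >> (8 * i));
    return c;
}

/* Full-width write: no read is needed, the check code is simply regenerated. */
static void write64(struct ecc_word *w, uint64_t value)
{
    w->data = value;
    w->check = toy_check(value);
}

/*
 * Partial (32-bit) write: the whole word is read first, so the stale check
 * code is evaluated. Returns true if that read saw a mismatch.
 */
static bool write32_rmw(struct ecc_word *w, unsigned half, uint32_t value)
{
    bool ecc_error = (toy_check(w->data) != w->check);      /* "R" */
    unsigned shift = 32u * (half & 1u);
    uint64_t merged = (w->data & ~(0xffffffffULL << shift)) |
                      ((uint64_t)value << shift);            /* "M" */
    write64(w, merged);                                      /* "W" */
    return ecc_error;
}
```

In this model, the first partial write to a never-initialized word reports an error because the stored check code is garbage, while later partial writes to the same word do not.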

My workaround is to disable the ECC interrupt after the system boots up, before the initial write of the ECC-protected area, and then re-enable it after the initial write.

I then ran the experiment below with SDK ti-processor-sdk-linux-adas-j721e-evm-10_01_00_04.

In the R5 u-boot stage:

  • add ti,ecc-enable = "true" and the ss_cfg register range to k3-j721e-ddr.dtsi to enable the in-line ECC function in the MSMC2DDR bridge
  • configure the ECC-protected area as 0xC000-0000 ~ 0xD000-1000
  • then do the initial write of DRAM area 0xC000-0000 ~ 0xD000-0000 with the specific pattern 0x12345678deadbeef
  • that is to say, 0xD000-0000 ~ 0xD000-1000 is ECC protected but not initially written.

In the A72 u-boot stage:

=> md.l 0xc0000000 0x4                                                                    // 0xC000-0000 ~ 0xD000-0000 has already been initially written
c0000000: deadbeef 12345678 deadbeef 12345678 ....xV4.....xV4.
=> md.l 0x029800a8 0x1                                                                    // DDRSS_V2A_INT_SET_REG, find ECC interrupt enabled
029800a8: 00000038                                                                          // 0x38 means multi 1-bit/1-bit/2-bit error interrupt enabled            
=> mw.l 0x029800ac 0x38 0x1                                                            // DDRSS_V2A_INT_CLR_REG, write 0x38 to disable ECC interrupt
=> md.l 0x029800a8 0x1                                                                    // DDRSS_V2A_INT_SET_REG, now ECC interrupt disabled
029800a8: 00000000

=> mw.l 0xd0000000 0x12345678 0x1000                                            // Try to initial write ECC protected area 0xD000-0000 ~ 0xD000-1000
"Error" handler, esr 0xbf000002                                                            // But unexpected exception occurs
elr: 000000008081c158 lr : 000000008081c150 (reloc)
elr: 000000008feda158 lr : 000000008feda150

x0 : 0000000000000faf x1 : 00000000d0000144
x2 : 0000000000000004 x3 : 000000008deb8e26
x4 : 0000000000000100 x5 : 0000000000000000
x6 : 000000008ff9065a x7 : 0000000000000044
x8 : 0000000000000010 x9 : 0000000000000000
x10: 000000000000000d x11: 0000000000000006
x12: 000000008de799a8 x13: 000000008de79b00
x14: 0000000000000008 x15: 000000008de79741
x16: 000000008feda0d0 x17: 0000000000000000
x18: 000000008de9dda0 x19: 00000000d0000000
x20: 0000000000000004 x21: 0000000000123456
x22: 000000008deb6c10 x23: 0000000000000004
x24: 000000008ffd8294 x25: 0000000000000000
x26: 0000000000000000 x27: 0000000000000000
x28: 000000008deb8ea0 x29: 000000008de797f0

Code: d2800001 94029cfd 93407e82 aa1303e1 (d1000400)
// add debug print in exception handler, DDRSS_V2A_INT_STAT_REG reads zero, no ECC interrupt generated
// where does this exception come from ?       how to avoid or handle it ?
--> leonwang add arch/arm/lib/interrupts_64.c do_error() 271: 0x029800a4 is:00000000 // DDRSS_V2A_INT_STAT_REG
--> leonwang add arch/arm/lib/interrupts_64.c do_error() 272: 0x02980150 is:00000018 // DDRSS_ECC_1B_ERR_CNT_REG
--> leonwang add arch/arm/lib/interrupts_64.c do_error() 273: 0x02980158 is:01400000 // DDRSS_ECC_1B_ERR_ADR_LOG_REG
--> leonwang add arch/arm/lib/interrupts_64.c do_error() 274: 0x02980160 is:01400000 // DDRSS_ECC_2B_ERR_ADR_LOG_REG
Resetting CPU ...

resetting ...


Then here are my questions:

  1. With the ECC interrupt disabled, why does the initial write still cause an exception? How can I avoid or handle such an exception?
  2. Is there any ECC error handling sample code for the MSMC2DDR bridge?

BTW, I already found ECC error handling sample code in ti-processor-sdk-rtos-j721e-evm-10_01_00_04/sdl/examples/ecc

But it uses the ECC implementation in the DDRSS, which does not follow the recommendations in the SDK doc.

ECC error handling sample code for the MSMC2DDR bridge would be appreciated :-)

I'm looking forward to your feedback. 

Thanks.

  • Hi,

    Our expert is at a workshop; kindly expect a delay in response.

    Best Regards,
    Sudheer

  • Hi,

    Thank you for the detailed email.   There is an overview link available, which you likely have already seen:

    8.13. Enabling TI’s inline ECC for DDR — Processor SDK RTOS J721E

    In regard to specific details, will loop in someone more familiar with the UBoot bootflow.   In parallel it may assist if you post the dtsi modifications.

    Regards,

    kb

  • Hi, KB, 

    Thanks for your feedback. 

    See my modifications below:

    R5 u-boot dts file, enable in-line ECC
    file: ti-processor-sdk-linux-adas-j721e-evm-10_01_00_04/board-support/ti-u-boot-2024.04+git/arch/arm/dts/k3-j721e-ddr.dtsi

    memorycontroller@0298e000 {
        compatible = "ti,j721e-ddrss";
        reg = <0x0 0x02990000 0x0 0x4000>,
              <0x0 0x0114000 0x0 0x100>,
              <0x0 0x02980000 0x0 0x200>;
        reg-names = "cfg", "ctrl_mmr_lp4", "ss_cfg";
        ti,ecc-enable = "true";

    R5 u-boot DRAM host driver: configure the ECC-protected area and initially write part of it
    file: ti-processor-sdk-linux-adas-j721e-evm-10_01_00_04/board-support/ti-u-boot-2024.04+git/drivers/ram/k3-ddrss/k3-ddrss.c

    static void k3_ddrss_lpddr4_ecc_init(struct k3_ddrss_desc *ddrss)
    {
        ......

        // ECC protected DRAM area, mapping to 0xC000-0000, +0x1000-1000
        k3_ddrss_set_ecc_range_r0(base, 0x40000000, 0x10001000);

        ......

        // initial write DRAM area 0xC000-0000, +0x1000-0000
        // postpone 0x1000 DRAM area initial write to A72 u-boot
        // But unexpected error occurs even with ECC interrupt disabled
        k3_ddrss_preload_ecc_mem_region((u64 *)0xC0000000, 0x10000000, 0x12345678deadbeef);

        ......
    }
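The comments in the fragment above imply that the ECC range is programmed as an offset into DDR (0x40000000) rather than as a CPU address (0xC000-0000). A minimal sketch of that conversion follows, assuming the low DRAM bank starts at CPU address 0x80000000; the helper names (`cpu_to_ddr_offset`, `ddr_offset_to_cpu`, `in_ecc_range`) are hypothetical and not part of the k3-ddrss driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the low DRAM bank is mapped at this CPU address. */
#define DDR_LOW_BASE 0x80000000ULL

/* Convert a CPU address in the low DRAM bank to a DDR offset. */
static uint64_t cpu_to_ddr_offset(uint64_t cpu_addr)
{
    return cpu_addr - DDR_LOW_BASE;
}

/* Convert a DDR offset back to a CPU address in the low DRAM bank. */
static uint64_t ddr_offset_to_cpu(uint64_t ddr_off)
{
    return ddr_off + DDR_LOW_BASE;
}

/* True if cpu_addr lies inside an ECC range given as DDR offset + size. */
static bool in_ecc_range(uint64_t cpu_addr, uint64_t range_off, uint64_t range_size)
{
    if (cpu_addr < DDR_LOW_BASE)
        return false;
    uint64_t off = cpu_to_ddr_offset(cpu_addr);
    return off >= range_off && (off - range_off) < range_size;
}
```

With the values used here, offset 0x40000000 maps to 0xC0000000, and a range of size 0x10001000 covers CPU addresses up to, but not including, 0xD0001000.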

    I want to check whether this "delayed initial write" approach works in A72 u-boot.

    If an ECC error handling flow is available and works in u-boot, I can then apply it to Linux.

    Thanks for your support in advance !

  • Hi,

    Within the u-boot k3-ddrss driver, can you try removing the ECC_CK enable in k3_ddrss_lpddr4_ecc_init?

    Then add it back in the u-boot prompt after the initial write.

    Best,
    Jared

  • Hi, Jared, 

    Thanks for your quick reply. 

    But it is recommended to set ECC_CK (as well as RMW_EN) before using DDR.

    I'm not sure whether it's feasible to postpone enabling ECC_CK from the R5 u-boot stage to a post-boot stage (e.g., after u-boot or Linux boots).

    Anyway, I will try it once I go to the lab tomorrow.

    Thanks.

  • Hi,

    Can you try setting register emif_ew_ctlcfg_DDRSS_CTL_206's ECC_ENABLE field to 1 so that ECC is enabled, but does no error detection or correction?

    Best,
    Jared

  • Hi, Jared, 

    Please check update below

    Test #01: Postpone ECC_CK enable to A72 u-boot prompt

    It seems OK to enable ECC_CK at runtime, but the ECC error interrupt is not triggered as expected.

    // 01. R5 u-boot config ECC protected area 0xD000-0000 ~ 0xD000-1000, but no initial write, ECC_CK disabled 
    // 02. check ECC protected area content before initial write, no ECC error as expected due to ECC_CK disabled
    Hit any key to stop autoboot: 0
    => md.b 0xd00001e0 0x40
    d00001e0: 08 40 00 80 94 84 10 52 80 00 41 13 08 01 01 20 .@.....R..A....
    d00001f0: 81 1c 0a 00 00 08 88 04 00 00 01 0a 20 00 03 80 ............ ...
    d0000200: 00 11 20 00 00 00 c1 18 12 01 00 00 00 90 19 20 .. ............
    d0000210: 00 20 01 00 88 00 80 00 03 88 00 11 18 40 10 c2 . ...........@..
    // 03. initial write on part of ECC area (0xD000-0000, +0x200) with pattern "a5", then check content
    => mw.b 0xd0000000 0xa5 0x200
    => md.b 0xd00001e0 0x40
    d00001e0: a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 ................
    d00001f0: a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 ................
    d0000200: 00 11 20 00 00 00 c1 18 12 01 00 00 00 90 19 20 .. ............
    d0000210: 00 20 01 00 88 00 80 00 03 88 00 11 18 40 10 c2 . ...........@..
    // 04. enable ECC_CK in DDRSS_ECC_CTRL_REG
    => md.l 0x02980120 0x1
    02980120: 00000013 ....
    => mw.l 0x02980120 0x17 0x1
    => md.l 0x02980120 0x1
    02980120: 00000017

    // 05. check ECC protected area again
    => md.b 0xd00001e0 0x20     //  initial written area (0xD000-0000, +0x200) no ECC error as expected
    d00001e0: a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 ................
    d00001f0: a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 a5 ................
    =>                                        //  unwritten area, not trigger ECC error, why !!??
    d0000200: 00 11 20 00 00 00 c1 18 12 01 00 00 00 90 19 20 .. ............
    d0000210: 00 20 01 00 88 00 80 00 03 88 00 11 18 40 10 c2 . ...........@..

    // ... skip... keep press enter to seq read more DRAM area, not trigger ECC error...
    =>
    d0000560: b1 14 20 00 80 29 03 18 06 40 00 11 03 24 18 00 .. ..)...@...$..
    d0000570: a9 00 18 10 38 01 00 32 38 08 00 20 90 00 00 00 ....8..28.. ....
    =>               // finally, trigger ECC error on 0xD000-0580; I retry above seq 3 times, always fail at this addr, why?
    "Synchronous Abort" handler, esr 0x96000210, far 0xd0000580
    elr: 00000000808bfa6c lr : 00000000808bf964 (reloc)
    elr: 000000008ff7da6c lr : 000000008ff7d964
    x0 : 0000000000000009 x1 : 000000008de79738
    x2 : 00000000fffffffe x3 : 0000000000000020
    x4 : 0000000000000000 x5 : 000000008de79730
    x6 : 0000000000000030 x7 : 000000008de79680
    x8 : 0000000000000000 x9 : 00000000ffffffd8
    x10: 000000000000000d x11: 0000000000000006
    x12: 000000008de799a8 x13: 000000008de79b00
    x14: 0000000000000008 x15: 000000008de79730
    x16: 000000008fed99f4 x17: 0000000000000000
    x18: 000000008de9dda0 x19: 0000000000000010
    x20: 0000000000000001 x21: 0000000000000010
    x22: 00000000d0000580 x23: 000000008ffb31e9
    x24: 000000008de79739 x25: 0000000000000000
    x26: 000000008de796e8 x27: 0000000000000002
    x28: 0000000000000001 x29: 000000008de79680

    Code: 794002c3 92403c63 78397b43 17ffffee (394002c3)
    --> leonwang add arch/arm/lib/interrupts_64.c do_sync() 215: 0x029800a4 is:00000030  // DDRSS_V2A_INT_STAT_REG
    --> leonwang add arch/arm/lib/interrupts_64.c do_sync() 216: 0x02980158 is:01400016
    --> leonwang add arch/arm/lib/interrupts_64.c do_sync() 217: 0x02980160 is:01400016
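The two syndrome values in the logs (esr 0xbf000002 from the first post and esr 0x96000210 from Test #01) can be decoded with the Armv8-A ESR_ELx field layout: EC in bits [31:26], IL in bit 25, ISS in bits [24:0], and, for data aborts, DFSC in ISS[5:0] and WnR in ISS[6]. A minimal sketch follows; the helper names are illustrative and are not part of u-boot.

```c
#include <stdint.h>

/* Exception class, bits [31:26]. */
static unsigned esr_ec(uint32_t esr)   { return (esr >> 26) & 0x3f; }
/* Instruction length bit, bit 25. */
static unsigned esr_il(uint32_t esr)   { return (esr >> 25) & 0x1; }
/* Instruction specific syndrome, bits [24:0]. */
static uint32_t esr_iss(uint32_t esr)  { return esr & 0x01ffffffu; }
/* Data abort only: data fault status code, ISS[5:0]. */
static unsigned esr_dfsc(uint32_t esr) { return esr & 0x3f; }
/* Data abort only: 1 = caused by a write, 0 = caused by a read, ISS[6]. */
static unsigned esr_wnr(uint32_t esr)  { return (esr >> 6) & 0x1; }

/* Names for the few exception classes relevant to this thread. */
static const char *esr_ec_name(unsigned ec)
{
    switch (ec) {
    case 0x24: return "data abort from a lower exception level";
    case 0x25: return "data abort taken without a change in exception level";
    case 0x2f: return "SError interrupt";
    default:   return "other (not decoded in this sketch)";
    }
}
```

With this layout, 0xbf000002 decodes to EC 0x2f (SError interrupt), and 0x96000210 decodes to EC 0x25 with DFSC 0x10 (synchronous external abort, not on a translation table walk) and WnR 0 (a read). Both are external aborts delivered to the CPU as exceptions, which would be consistent with masking the DDRSS interrupt not preventing them, although the thread does not confirm the exact reporting path.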

    Test #02: Access DDRSS_CTL_206 register 

    => md.l 0x02990338 0x1
    02990338: 02000fff ....
    => mw.l 0x02990338 0x02010fff       // set ECC_ENABLE field bits[17-16] to 1
    => md.l 0x02990338 0x1
    02990338: 02010fff

    Note that the ECC feature is available in both the MSMC2DDR bridge and the DDR controller, but only one can be enabled.

    In ti-processor-sdk-linux-adas-j721e-evm-10_01_00_04, it seems that only the MSMC2DDR ECC driver is available in R5 u-boot.

    Do you suggest use DDR controller ECC feature?  any u-boot patch available ? 

    Thanks.

  • Hi Leon,

    Do you suggest use DDR controller ECC feature?

    You should use the MSMC2DDR controller as it's more optimal.

    any u-boot patch available ? 

    No, there is no u-boot patch available for this, because this is not something done before.

    I will do some testing separately, but can you answer my question in the meantime? How will this improve your system in the end? You will still have to set the initial writes to the DDR before you can properly do anything in Linux that requires ECC, and it will be slower because it isn't using the BIST. 

    Additionally, how will you prime the memory while Linux is using the DDR at the same time?

    Best,
    Jared

  • Hi, Jared, 

    Let me clarify. 

    Once the in-line ECC feature is enabled, the extra initial write to ECC-protected DRAM during the R5 u-boot stage (even using BIST) increases boot latency.

    That is a problem for boot-latency-sensitive systems, e.g., automotive.

    My idea is to postpone the initial write from the R5 u-boot stage to a post-Linux-boot stage.

    This way Linux can boot with short latency (e.g., screen lights up, camera on), and background software then runs the initial write without hurting the user experience.

    I already have a software solution that ensures the post-boot write to the specific DRAM area does not impact the Linux kernel.

    The final missing piece is ECC error handling for the initial write to the ECC-protected area (errors caused by the RMW policy).

    Can I get your email, or can we discuss in a private chat? I can share more details.

    Thanks for your support in advance :-)

  • Hi,

    I see. I will continue testing and see what I can accomplish on my side.

    If you have an FAE, they can post to the internal processors forum on your behalf.

    Best,
    Jared

  • Hi,

    You're currently testing memory that doesn't fall into the DRAM range. If you run bdinfo, you can see where the DRAM memory is placed in the addressing.

    => bdinfo
    boot_params = 0x0000000000000000
    DRAM bank   = 0x0000000000000000
    -> start    = 0x0000000080000000
    -> size     = 0x0000000080000000
    DRAM bank   = 0x0000000000000001
    -> start    = 0x0000000880000000
    -> size     = 0x0000000080000000
    flashstart  = 0x0000000000000000
    flashsize   = 0x0000000000000000
    flashoffset = 0x0000000000000000
    baudrate    = 115200 bps
    relocaddr   = 0x00000000ffebe000
    reloc off   = 0x000000007f6be000
    Build       = 64-bit
    current eth = ethernet@46000000port@1
    ethaddr     = 24:76:25:a5:73:11
    IP addr     = <NULL>
    fdt_blob    = 0x00000000fde79740
    new_fdt     = 0x00000000fde79740
    fdt_size    = 0x0000000000024660
    multi_dtb_fit= 0x0000000000000000
    lmb_dump_all:
     memory.cnt = 0x2 / max = 0x10
     memory[0]	[0x80000000-0xffffffff], 0x80000000 bytes flags: 0
     memory[1]	[0x880000000-0x8ffffffff], 0x80000000 bytes flags: 0
     reserved.cnt = 0x5 / max = 0x10
     reserved[0]	[0x9e800000-0xb7ffffff], 0x19800000 bytes flags: 4
     reserved[1]	[0xb8000000-0xd7ffffff], 0x20000000 bytes flags: 0
     reserved[2]	[0xd8000000-0xe5ffffff], 0x0e000000 bytes flags: 4
     reserved[3]	[0xfce75000-0xffffffff], 0x0318b000 bytes flags: 0
     reserved[4]	[0x880000000-0x8a6ffffff], 0x27000000 bytes flags: 4
    devicetree  = separate
    serial addr = 0x0000000002800000
     width      = 0x0000000000000000
     shift      = 0x0000000000000002
     offset     = 0x0000000000000000
     clock      = 0x0000000002dc6c00
    arch_number = 0x0000000000000000
    TLB addr    = 0x00000000fffe0000
    irq_sp      = 0x00000000fde79730
    sp start    = 0x00000000fde79730
    Early malloc usage: 3648 / 8000

    The DRAM falls into the range: 0x80000000 to 0x900000000. Can you run tests with setting the memory within this range?

    Another note, looking at the memory the BIST would set, the range is 0xe0000000 (8/9 * 0x100000000).
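The arithmetic behind the 0xe0000000 figure can be sketched as follows. An exact 8/9 of 0x100000000 is 0xE38E38E3, so some rounding must be involved; the sketch assumes the reserved ninth is rounded up to a power of two, which is an assumption about how the value is derived and is not stated in this thread. The function names are hypothetical.

```c
#include <stdint.h>

/* Position of the highest set bit, 1-based; returns 0 for x == 0. */
static unsigned fls64(uint64_t x)
{
    unsigned n = 0;
    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/*
 * Assumed rounding: reserve ddr_size / 9 for ECC codes, then round that
 * up to the next power of two above it.
 */
static uint64_t ecc_reserved_bytes(uint64_t ddr_size)
{
    return 1ULL << fls64(ddr_size / 9);
}

/* Remaining space usable as ECC-protected data under that assumption. */
static uint64_t ecc_protected_bytes(uint64_t ddr_size)
{
    return ddr_size - ecc_reserved_bytes(ddr_size);
}
```

For a 4 GiB part, 0x100000000 / 9 is 0x1C71C71C, which rounds up to 0x20000000, leaving 0xE0000000 for data under this assumption.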

    The errors that you are encountering are likely due to you trying to access inaccessible system registers.

    Best,
    Jared

  • Hi, Jared, 

    Thanks for your feedback

    I double-checked my previous posts and found a typo in the description in my 1st post: I added one extra zero to the start address...

    I have already corrected it in my 1st post; the ECC-protected area should be 0xC000-0000 ~ 0xD000-1000. Sorry for the confusion...

    But my test command sequence is correct; you can check the log in my 1st post.

    Let's focus on the test sequence in my 1st post; the test area falls into the bank-0 range.

    Regarding BIST, thanks for the reminder.

    I'm not familiar with BIST, so to avoid using it, I wrote a function that does the initial write of part of the ECC-protected area (0xC000-0000 ~ 0xD000-0000) with the CPU as a temporary solution. The 0xD000-0000 ~ 0xD000-1000 area is left unwritten.

    After all, my final goal is to not run the initial write of the ECC-protected area in u-boot at all.

    static void k3_ddrss_preload_ecc_mem_region(u64 *addr, u64 size, u64 pattern)
    {
        u64 i = 0; 
        static u64 count = 0; 
        for (i = 0; i < (size / 8); i++) 
        {    
            if( count % 0x01000000 == 0)
                printf(".");
    
            addr[i] = pattern;
            count++;
        }    
    }
    
    static void k3_ddrss_lpddr4_ecc_init(struct k3_ddrss_desc *ddrss)
    {
        ... skip ...
        // init ecc area on low mem
        printf("---> leonwang add %s %d init write to cpu addr: 0xC000_0000 + 0x1000_0000\n", __FILE__, __LINE__);
        k3_ddrss_preload_ecc_mem_region( (u64 *)0xC0000000, 0x10000000, 0x12345678deadbeef);
        ... skip ...
    }

    The technical barrier blocking me now is how to handle ECC errors during the initial write at the u-boot prompt (as well as in the post-Linux-boot stage).

    There is no ECC error handler in u-boot; once an ECC error occurs, u-boot resets. (I want to handle the ECC error and keep the system running.)

    There seem to be two options in front of me:

    1. Workaround: ignore ECC errors by temporarily disabling the ECC error interrupt during the initial write.
      1. But in my test, an unexpected "sync abort" occurs, and I still don't know how to avoid or handle it.
    2. Service the ECC interrupt: I studied the J721E TDA4VM TRM; it lists several ECC-related registers but gives no detail about the handling flow, e.g.,
      1. DDRSS_ECC_1B_ERR_ADR_LOG_REG: I patched do_sync() to dump it when the error occurs, but how do I decode the error address from the register content?
      2. DDRSS_V2A_INT_STAT_REG: the TRM says "write 1 to clear status after interrupt has been serviced", but how do I service the ECC interrupt?
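The "write 1 to clear" wording quoted from the TRM can be illustrated with a simulated status register. This is only a sketch of the acknowledge step: the register is modelled in memory, the 0x38 mask is assumed to apply to the status register in the same way as to the enable register, and the names (`w1c_reg`, `ecc_service`) are hypothetical. How the error address is decoded from the ADR_LOG registers is not covered in this thread and is not sketched here.

```c
#include <stdint.h>

/* Simulated write-1-to-clear status register. */
struct w1c_reg {
    uint32_t bits;
};

/* Assumption: the three ECC status bits share the 0x38 mask used for the enables. */
#define ECC_INT_MASK 0x38u

static uint32_t w1c_read(const struct w1c_reg *r)
{
    return r->bits;
}

/* Writing a 1 clears that bit; writing a 0 leaves the bit unchanged. */
static void w1c_write(struct w1c_reg *r, uint32_t value)
{
    r->bits &= ~value;
}

/*
 * Read the pending ECC bits and acknowledge exactly those bits.
 * Returns the bits that were pending so the caller can log or act on them.
 */
static uint32_t ecc_service(struct w1c_reg *stat)
{
    uint32_t pending = w1c_read(stat) & ECC_INT_MASK;
    if (pending)
        w1c_write(stat, pending);
    return pending;
}
```

Acknowledging only the bits that were read avoids clearing a status bit that became set between the read and the write.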

    Do you have any suggestions?

    Is there any ECC error handling document or sample code I can refer to?

    Thanks a lot for your support again :-)

    Best Regards

    Leon

  • Hi Leon,

    There is documentation for the RTOS SDK on ECC handling: https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/10_01_00_04/exports/docs/sdl/sdl_docs/userguide/j721e/modules/ecc.html 

    For your Linux image, I also see the sync abort once I enable the ECC. I haven't seen any issues as long as I postpone the error count reg clear, ECC1BERR_EN, ECC2BERR_EN, ECCM1BERR_EN, and ECC_CK.

    Can you try holding off on setting the following registers until after you've written the initial memory within Linux?

    Address      Value   Description
    0x02980150   0x01    Clear ECC count
    0x029800A8   0x38    ECC1BERR_EN, ECC2BERR_EN, and ECCM1BERR_EN
    0x02980120   0x17    ECC_CK
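The deferred register writes in the table above can be sketched as a small sequence. The addresses and values come from the table; applying them in table order is an assumption, and the accessor type, the dry-run logger, and all names (`reg_write_fn`, `ecc_apply_late_enable`, `reg_log`, `log_write`) are hypothetical. On real hardware the accessor would wrap whatever MMIO mechanism is in use.

```c
#include <stddef.h>
#include <stdint.h>

struct reg_write {
    uint64_t addr;
    uint32_t value;
};

/* Addresses and values as listed in the table. */
static const struct reg_write ecc_late_enable[] = {
    { 0x02980150, 0x01 },  /* clear ECC error count */
    { 0x029800A8, 0x38 },  /* ECC1BERR_EN, ECC2BERR_EN, ECCM1BERR_EN */
    { 0x02980120, 0x17 },  /* ECC_CK */
};

/* Hypothetical accessor: the caller decides how a 32-bit register is written. */
typedef void (*reg_write_fn)(uint64_t addr, uint32_t value, void *ctx);

/* Issue the deferred writes in table order; returns the number of writes. */
static size_t ecc_apply_late_enable(reg_write_fn wr, void *ctx)
{
    size_t n = sizeof(ecc_late_enable) / sizeof(ecc_late_enable[0]);
    for (size_t i = 0; i < n; i++)
        wr(ecc_late_enable[i].addr, ecc_late_enable[i].value, ctx);
    return n;
}

/* Dry-run accessor that only records the writes, for inspection without hardware. */
struct reg_log {
    uint64_t addr[8];
    uint32_t value[8];
    size_t   count;
};

static void log_write(uint64_t addr, uint32_t value, void *ctx)
{
    struct reg_log *log = ctx;
    if (log->count < 8) {
        log->addr[log->count] = addr;
        log->value[log->count] = value;
        log->count++;
    }
}
```

Calling `ecc_apply_late_enable(log_write, &log)` with a zero-initialized `struct reg_log` records the three writes without touching hardware, which makes the sequence easy to review before it is run after the initial memory write.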

    Another option would be to try only setting a specific range for ECC (and writing this portion) instead of the entire RAM.

    Best,
    Jared

  • Hi, Jared, 

    I really appreciate your suggestion!

    Postponing the ECC_CK enable until the initial write of the ECC-protected area has finished at the Linux prompt works :-)

    It should be a good workaround that avoids extra ECC handling work in the Linux kernel.

    But one concern is that postponing ECC_CK is not typical usage; it goes against the datasheet recommendation.

    Do you know whether any customer has done it like this before?

    Or do you expect any potential risk?

    Thanks for your support in advance.

    Best, 
    Leon

  • Hi Leon,

    I do not know of anyone doing this before. I also don't know whether there will be any increased risk due to this change.

    I will ask some of our other experts if they have any opinions/knowledge and get back to you.

    Best,
    Jared

  • Hi Leon,

    Once again, I have no knowledge of customers doing this before. The issue in my eyes is that Linux executes from the DDR. Enabling ECC in this way is not recommended, and any issues that result from it, you will have to debug yourself.

    If you are trying to minimize boot time, you can try looking at the RTOS documentation for more solutions: https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/10_01_00_04/exports/docs/psdk_rtos/docs/user_guide/index.html 

    You could also try only setting a portion of the DDR to ECC, so that the BIST writing takes less time.

    Best,
    Jared

  • Hi, Jared, 

    OK, I get it. I really appreciate your support again!

    Best, 

    Leon