66AK2H12: ECC error handler triggered only once

Part Number: 66AK2H12

Hello Team,

General motivation:

Our third-party provider of 66AK2H12 boards is struggling to build boards because of yield issues caused by ECC errors.

We want to help with board production by providing a better way to handle 2-bit ECC errors.

The original design reboots the device (panic) when a 2-bit ECC error occurs.

I would like to change this behavior. I removed the panic call and instead print the 2-bit ECC error address.

For testing, I inject a 2-bit ECC error from user space. The handler is triggered, but only the first time.

From the second injection onward, the handler is not triggered, although the DDR controller registers do change and show the ECC error.

How can I make the handler trigger for every 2-bit ECC error?

Attached are snapshots of the DDR controller register state before and after injecting the 2-bit ECC error.

Changed handler code:

static irqreturn_t ddr3_ecc_err_irq_handler(int irq, void *reg_virt)
{
	int ret = IRQ_NONE;
	int i;
	u32 irq_status;
	u32 err_ddr_address_2b = 0;
	u32 err_ddr_address_1b = 0;
	u32 data;

	void __iomem *ddr_reg = (void __iomem *)reg_virt;

	irq_status = readl(ddr_reg + DDR3_IRQ_STATUS_SYS);
	pr_warn("DDR3 ECC irq status 0x%x \n", irq_status);
	if (irq_status > 0) {
		if ((irq_status & DDR3_2B_ECC_ERR) ||
		    (irq_status & DDR3_WR_ECC_ERR)) {
			// panic("Unrecoverable DDR3 ECC error, irq status 0x%x, "
			// "rebooting kernel ..\n", irq_status);

			pr_warn("Unrecoverable DDR3 2 bits ECC error, irq status 0x%x \n", irq_status);
		}

		ret = IRQ_HANDLED;
	}
	return ret;
}
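
One thing the handler above does not do is clear the status bits it reads. The init code quoted in this thread clears the ECC interrupt status by writing the bits back to the status register (write-1-to-clear), so a handler would typically do the same before returning. The following is a minimal user-space sketch of that pattern, not a confirmed fix: `fake_emif`, `fake_readl`, `fake_writel`, and `ecc_handle_and_ack` are hypothetical names, the register is simulated in memory, and the 0x10 bit used for the 2-bit error is an assumption taken from the value seen at 0x210100ac in the register dump after injection.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed bit position, inferred from the 0x10 seen in the status dump. */
#define ECC_2B_ERR 0x10u

/* Hypothetical in-memory stand-in for the interrupt status register. */
struct fake_emif {
	uint32_t irq_status;
};

uint32_t fake_readl(const struct fake_emif *emif)
{
	return emif->irq_status;
}

/* Write-1-to-clear: every bit set in val clears the matching status bit. */
void fake_writel(struct fake_emif *emif, uint32_t val)
{
	emif->irq_status &= ~val;
}

/*
 * Report the error, then acknowledge it by writing the status back.
 * Returns 1 if an error was pending (handled), 0 otherwise.
 */
int ecc_handle_and_ack(struct fake_emif *emif)
{
	uint32_t status = fake_readl(emif);

	if (!status)
		return 0;

	if (status & ECC_2B_ERR)
		printf("2-bit ECC error, irq status 0x%x\n", (unsigned)status);

	/* Acknowledge so the same bit can latch again on the next error. */
	fake_writel(emif, status);

	/* In this model, everything that was read must now be cleared. */
	assert(fake_readl(emif) == 0);
	return 1;
}
```

In a real handler the equivalent step would be something like `writel(irq_status, ddr_reg + DDR3_IRQ_STATUS_SYS)` before returning. Whether that alone re-arms the interrupt on this SoC, or whether an additional end-of-interrupt write is needed, would have to be confirmed against the DDR3 controller documentation.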

Injecting ECC error from user space:

dmesg | grep -A 2 -B 2 "ECC"
md.l 0x21010000
md.l 0x21010100
mw.l 0x960000000 0xffffffff 80
sleep 1
echo 1 > /proc/sys/vm/drop_caches
sleep 1
md.l 0x960000000
sleep 1
devmem 0x21010110 w 0x0
sleep 1
devmem 0x960000068 w 0xfffffffc
sleep 1
devmem 0x21010134 w 0x01000000
sleep 1
devmem 0x21010114 w 0xCFFFA000
sleep 1
devmem 0x21010110 w 0xF0000001
sleep 1
echo 1 > /proc/sys/vm/drop_caches
sleep 1
md.l 0x960000000

md.l 0x21010000
md.l 0x21010100
dmesg | grep -A 2 -B 2 "ECC"

BEFORE INJECTING 2bit ECC ERROR:

# md.l 0x21010000
21010000: 40461c02 40000004 6200ce63 00000000    ..F@...@c..b....
21010010: 000017cd 00000000 16709885 00001d4a    ..........p.J...
21010020: 4461ff53 00000000 543f111f 00000000    S.aD......?T....
21010030: 00000000 00000000 00000000 00000000    ................
21010040: 00000000 00000000 00000000 00000000    ................
21010050: 00000000 00ffffff c0071410 00021c1c    ................
21010060: 00002010 00000000 00000000 00000000    . ..............
21010070: 00000000 00000000 00000000 00000000    ................
21010080: 1b0ee523 034b724c 00010000 00000000    #...LrK.........
21010090: 14a8c3fe 00000000 00000000 00000000    ................
210100a0: 00000000 00000000 00000000 00000000    ................
210100b0: 00000000 00000038 00000000 00000038    ....8.......8...
210100c0: 00000000 00000000 70073200 00000000    .........2.p....
210100d0: 00000000 00000000 00000000 00000000    ................
210100e0: 00000000 00000000 00000000 00000000    ................
210100f0: 00000000 00000000 00000000 00000000    ................
# md.l 0x21010100
21010100: 00000000 00000000 00000000 00000000    ................
21010110: b0000000 00000000 00000000 00000000    ................
21010120: 00001f1f 00000000 00000000 00000000    ................
21010130: 00000000 00000000 00000000 00000000    ................
21010140: 00000000 00000000 00000000 00000000    ................
21010150: 00000000 00000000 00000000 00000000    ................
21010160: 00000000 00000000 00000000 00000000    ................
21010170: 00000000 00000000 00000000 00000000    ................
21010180: 00000000 00000000 00000000 00000000    ................
21010190: 00000000 00000000 00000000 00000000    ................
210101a0: 00000000 00000000 00000000 00000000    ................
210101b0: 00000000 00000000 00000000 00000000    ................
210101c0: 00000000 00000000 00000000 00000000    ................
210101d0: 00000000 00000000 00000000 00000000    ................
210101e0: 00000000 00000000 00000000 00000000    ................
210101f0: 00000000 00000000 00000000 00000000    ................


AFTER INJECTING 2bit ECC ERROR:

# md.l 0x21010000
21010000: 40461c02 40000004 6200ce63 00000000    ..F@...@c..b....
21010010: 000017cd 00000000 16709885 00001d4a    ..........p.J...
21010020: 4461ff53 00000000 543f111f 00000000    S.aD......?T....
21010030: 00000000 00000000 00000000 00000000    ................
21010040: 00000000 00000000 00000000 00000000    ................
21010050: 00000000 00ffffff c0071410 00021c1c    ................
21010060: 00002010 00000000 00000000 00000000    . ..............
21010070: 00000000 00000000 00000000 00000000    ................
21010080: 1b10fc7d 034c7009 00010000 00000000    }....pL.........
21010090: e720bc2c 00000000 00000000 00000000    ,. .............
210100a0: 00000000 00000010 00000000 00000010    ................
210100b0: 00000000 00000038 00000000 00000038    ....8.......8...
210100c0: 00000000 00000000 70073200 00000000    .........2.p....
210100d0: 00000000 00000000 00000000 00000000    ................
210100e0: 00000000 00000000 00000000 00000000    ................
210100f0: 00000000 00000000 00000000 00000000    ................
# md.l 0x21010100
21010100: 00000000 00000000 00000000 00000000    ................
21010110: f0000001 cfffa000 00000000 00000000    ................
21010120: 00001f1f 00000000 00000000 00000000    ................
21010130: 00000000 01000000 00000000 00000000    ................
21010140: b0000020 00000000 00000000 00000000     ...............
21010150: 00000000 00000000 00000000 00000000    ................
21010160: 00000000 00000000 00000000 00000000    ................
21010170: 00000000 00000000 00000000 00000000    ................
21010180: 00000000 00000000 00000000 00000000    ................
21010190: 00000000 00000000 00000000 00000000    ................
210101a0: 00000000 00000000 00000000 00000000    ................
210101b0: 00000000 00000000 00000000 00000000    ................
210101c0: 00000000 00000000 00000000 00000000    ................
210101d0: 00000000 00000000 00000000 00000000    ................
210101e0: 00000000 00000000 00000000 00000000    ................
210101f0: 00000000 00000000 00000000 00000000    ................
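
To make the two dumps easier to compare, the changed words can be decoded. Between the dumps, the words at 0x210100a4 and 0x210100ac go from 0 to 0x10, while 0x210100b4 and 0x210100bc read 0x38 in both. The sketch below decodes such a value into flag names. The bit assignments (bit 3 write ECC, bit 4 2-bit, bit 5 1-bit) are assumptions, inferred from these dumps and from the three interrupt sources that the init code enables; they should be checked against the controller's register description. `ecc_irq_decode` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ECC_WR_ERR 0x08u	/* assumed: bit 3 */
#define ECC_2B_ERR 0x10u	/* assumed: bit 4 */
#define ECC_1B_ERR 0x20u	/* assumed: bit 5 */

/* Write a space-terminated list of flag names for val into buf. */
const char *ecc_irq_decode(uint32_t val, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);

	buf[0] = '\0';
	/* strncat appends at most (remaining space) chars plus the NUL. */
	if (val & ECC_1B_ERR)
		strncat(buf, "1B_ECC ", len - strlen(buf) - 1);
	if (val & ECC_2B_ERR)
		strncat(buf, "2B_ECC ", len - strlen(buf) - 1);
	if (val & ECC_WR_ERR)
		strncat(buf, "WR_ECC ", len - strlen(buf) - 1);
	return buf;
}
```

Under these assumptions, 0x10 decodes to a pending 2-bit error and 0x38 to all three sources, which would suggest that the interrupt sources are still enabled after the first error and that the 2-bit status bit remains set.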

ECC interrupt initialization:

int keystone_init_ddr3_ecc(struct device_node *node)
{
	void __iomem *ddr_reg;
	int error_irq = 0;
	int ret;

	/* ddr3 controller reg is configured in the sysctrl node at index 0 */
	ddr_reg = of_iomap(node, 0);
	if (!ddr_reg) {
		pr_warn("Warning!! DDR3 controller regs not defined\n");
		return -ENODEV;
	}

	/* add DDR3 ECC error handler */
	error_irq = irq_of_parse_and_map(node, 1);
	if (!error_irq) {
		/* No GIC interrupt, need to map CIC2 interrupt to GIC */
		pr_warn("Warning!! DDR3 ECC irq number not defined\n");
		return -ENODEV;
	}

	ret = request_irq(error_irq, ddr3_ecc_err_irq_handler, 0,
			  "ddr3-ecc-err-irq", (void *)ddr_reg);
	if (ret) {
		WARN_ON("request_irq fail for DDR3 ECC error irq\n");
		return ret;
	}

	return 0;
}

DDR configuration:

void ddr3_init_ecc(u32 base)
{
	u32 ddr3_size;

	if (!ddr3_ecc_support_rmw(base)) {
		ddr3_disable_ecc(base);
		return;
	}

	ddr3_ecc_init_range(base);
	ddr3_size = ddr3_get_size();
	ddr3_reset_data(CONFIG_SYS_SDRAM_BASE, ddr3_size);

	ddr3_enable_ecc(base, 0);
}

void ddr3_enable_ecc(u32 base, int test)
{
	u32 ecc_val = KS2_DDR3_ECC_ENABLE;
	u32 rmw = ddr3_ecc_support_rmw(base);

	if (test)
		ecc_val |= KS2_DDR3_ECC_ADDR_RNG_1_EN;

	if (!rmw) {
		if (!test)
			/* by default, disable ecc when rmw = 0 and no
			   ecc test */
			ecc_val = 0;
	} else {
		ecc_val |= KS2_DDR3_ECC_RMW_EN;
	}

	ddr3_ecc_config(base, ecc_val);
}

static void ddr3_ecc_config(u32 base, u32 value)
{
	u32 data;

	__raw_writel(value, base + KS2_DDR3_ECC_CTRL_OFFSET);
	udelay(100000); /* delay required to synchronize across clock domains */

	if (value & KS2_DDR3_ECC_EN) {
		/* Clear the 1-bit error count */
		data = __raw_readl(base + KS2_DDR3_ONE_BIT_ECC_ERR_CNT_OFFSET);
		__raw_writel(data, base + KS2_DDR3_ONE_BIT_ECC_ERR_CNT_OFFSET);

		__raw_writel(KS2_DDR3_1B_ECC_ERR_THRESH_VAL(0) |
			     KS2_DDR3_1B_ECC_ERR_WIN_VAL(0),
			     base + KS2_DDR3_ONE_BIT_ECC_ERR_THRESH);

		/* enable the ECC interrupt */
		__raw_writel(KS2_DDR3_1B_ECC_ERR_SYS | KS2_DDR3_2B_ECC_ERR_SYS |
			     KS2_DDR3_WR_ECC_ERR_SYS,
			     base + KS2_DDR3_ECC_INT_ENABLE_SET_SYS_OFFSET);

		/* Clear the ECC error interrupt status */
		__raw_writel(KS2_DDR3_1B_ECC_ERR_SYS | KS2_DDR3_2B_ECC_ERR_SYS |
			     KS2_DDR3_WR_ECC_ERR_SYS,
			     base + KS2_DDR3_ECC_INT_STATUS_OFFSET);
	}
}

The picture below shows the DTS section that declares the ECC interrupt:

  • Hello Shankari,

    I read those links; the work above is based on them.

    The problem is that an ECC error triggers the IRQ handler only once!
    If I inject more ECC errors, the handler is not triggered again.

    What do I need to change to solve this problem?

    Regards,

    Yaniv 

  • Hello Shankari,

    Please let me know if more information is required.

    Thank you,

    Yaniv

  • Hi Yaniv,

    ret = IRQ_HANDLED;

    Could you comment out this line and try again? Once the interrupt is set to "IRQ_HANDLED", the ECC error is not handled.

    Also try the following values in the "ret" variable:

    IRQ_NONE
    IRQ_WAKE_THREAD

    Thanks,

    Rajarajan U

  • Hi Rajarajan,

    Thank you for your response, but this didn't solve my issue.

    What more can be done?

    Thank you,

    Yaniv

  • Hi Yaniv,

    After setting "ret" to IRQ_NONE or IRQ_WAKE_THREAD, is there any change in the response?

    IRQ_NONE
    IRQ_WAKE_THREAD

    Thanks,

    Rajarajan

  • Hi Rajarajan,

    No, there was no change in response... I mean, the handler was triggered only once!

    Thanks,

    Yaniv

  • Hi Yaniv,

    After the code changes, you recompiled the whole Linux kernel, right?

    Thanks

    Rajarajan 

  • Hi Rajarajan,

    Yes, I compile the whole Linux kernel and use the new bin file. Thank you for checking.

    I can see the changes that I make to the print statements (see handler code above). This is how I know the handler is triggered only once.

    Can we have a face-to-face meeting? I would like to show you the procedure and results.

    Thanks,

    Yaniv

  • Hi Yaniv,

    We need to analyze:

    1. Where the ECC interrupt gets initialized.
    2. Whether the interrupt mask was disabled after the interrupt was triggered.

    I am analyzing this and will provide inputs in the following posts.

    Thanks,

    Rajarajan U

  • Hi Rajarajan,

    I edited the original message and added how the ECC interrupt and the DDR controller are configured/initialized.

    About disabling: I don't disable the interrupt at all.

    Thank you,

    Yaniv 

  • Hi Rajarajan,

    Do you have any news?

    Thank you in advance,

    Yaniv

  • Hi Rajarajan,

    I'm not sure what you mean by "full" log.

    Here is the full /proc/interrupts list you requested (see entry 480):

  • Hi Yaniv,

    int keystone_init_ddr3_ecc(struct device_node *node)
    {
    	void __iomem *ddr_reg;
    	int error_irq = 0;
    	int ret;
    
    	/* ddr3 controller reg is configured in the sysctrl node at index 0 */
    	ddr_reg = of_iomap(node, 0);
    	if (!ddr_reg) {
    		pr_warn("Warning!! DDR3 controller regs not defined\n");
    		return -ENODEV;
    	}
    
    	/* disable and clear unused ECC interrupts */
    	writel(DDR3_1B_ECC_ERR | DDR3_SYS_ERR,
    	       ddr_reg + DDR3_IRQ_ENABLE_CLR_SYS);
    
    	writel(DDR3_1B_ECC_ERR | DDR3_SYS_ERR,
    	       ddr_reg + DDR3_IRQ_STATUS_SYS);
    
    	/*
    	 * check if we already have unrecoverable errors
    	 * reboot in that case
    	 */
    	check_ecc_error(ddr_reg);
    
    	writel(DDR3_2B_ECC_ERR | DDR3_WR_ECC_ERR,
    	       ddr_reg + DDR3_IRQ_ENABLE_CLR_SYS);
    
    	/* add DDR3 ECC error handler */
    	error_irq = irq_of_parse_and_map(node, 1);
    	if (!error_irq) {
    		/* No GIC interrupt, need to map CIC2 interrupt to GIC */
    		pr_warn("Warning!! DDR3 ECC irq number not defined\n");
    		ret = -ENODEV;
    		goto err;
    	}
    
    	ret = request_irq(error_irq, ddr3_ecc_err_irq_handler, 0,
    		"ddr3-ecc-err-irq", (void *)ddr_reg);
    	if (ret) {
    		WARN_ON("request_irq fail for DDR3 ECC error irq\n");
    		goto err;
    	}
    
    	writel(DDR3_2B_ECC_ERR | DDR3_WR_ECC_ERR,
    	       ddr_reg +  DDR3_IRQ_ENABLE_SET_SYS);
    
    	return 0;
    err:
    	iounmap(ddr_reg);
    	return ret;
    }

    Please use this as the definition of "keystone_init_ddr3_ecc()" in your changed code and check the interrupt count in "cat /proc/interrupts".

    Thanks

    Rajarajan U

  • Hi Rajarajan,

    Thank you for your response. I'll test it and will return to you with the results.

    One question:

    What is the "check_ecc_error" function?

    It's not documented in the code or on your site.

    Thank you,

    Yaniv Shiber

  • Hi,

    Except for the call to "check_ecc_error(ddr_reg);", I implemented all your changes.

    The problem didn't change. It still works only once after reboot.

    The terminal output is attached below.

    What do you suggest I do?

    # md.l 0x21010000
    21010000: 40461c02 40000004 6200ce63 00000000    ..F@...@c..b....
    21010010: 00001869 00000000 166c9875 00001d4a    i.......u.l.J...
    21010020: 447dff53 00000000 543f117f 00000000    S.}D......?T....
    21010030: 00000000 00000000 00000000 00000000    ................
    21010040: 00000000 00000000 00000000 00000000    ................
    21010050: 00000000 00ffffff c0071410 00021c1c    ................
    21010060: 00002010 00000000 00000000 00000000    . ..............
    21010070: 00000000 00000000 00000000 00000000    ................
    21010080: 33361e18 099d87a2 00010000 00000000    ..63............
    21010090: 2cc3d9d6 00000000 00000000 00000000    ...,............
    210100a0: 00000000 00000020 00000000 00000020    .... ....... ...
    210100b0: 00000000 00000038 00000000 00000038    ....8.......8...
    210100c0: 00000000 00000000 70073200 00000000    .........2.p....
    210100d0: 00000000 00000000 00000000 00000000    ................
    210100e0: 00000000 00000000 00000000 00000000    ................
    210100f0: 00000000 00000000 00000000 00000000    ................
    # md.l 0x21010100
    21010100: 00000000 00000000 00000000 00000000    ................
    21010110: f0000001 cfffa000 00000000 00000000    ................
    21010120: 00001f1f 00000000 00000000 00000000    ................
    21010130: 00000011 01000000 00000002 b0000020    ............ ...
    21010140: 00000000 00000000 00000000 00000000    ................
    21010150: 00000000 00000000 00000000 00000000    ................
    21010160: 00000000 00000000 00000000 00000000    ................
    21010170: 00000000 00000000 00000000 00000000    ................
    21010180: 00000000 00000000 00000000 00000000    ................
    21010190: 00000000 00000000 00000000 00000000    ................
    210101a0: 00000000 00000000 00000000 00000000    ................
    210101b0: 00000000 00000000 00000000 00000000    ................
    210101c0: 00000000 00000000 00000000 00000000    ................
    210101d0: 00000000 00000000 00000000 00000000    ................
    210101e0: 00000000 00000000 00000000 00000000    ................
    210101f0: 00000000 00000000 00000000 00000000    ................
    # dmesg | grep -A 2 -B 2 "ECC"
    [    0.000000] switching to high address space at 0x800000000
    [    0.000000] cma: CMA: reserved 16 MiB at 1b000000
    [    0.000000] Memory policy: ECC disabled, Data cache writealloc
    [    0.000000] On node 0 totalpages: 507904
    [    0.000000] free_area_init_node: node 0, pgdat c06d3700, node_mem_map c6500000
    # cat /proc/interrupts
                CPU0       CPU1       CPU2       CPU3
     29:          0          0          0          0       GIC  arch_timer
     30:   15768974   15922364   15770546   15764548       GIC  arch_timer
     56:          0          0          0          0       GIC  a15-l1l2-ecc-err-irq
     70:          0          0          0          0       GIC  pcie-error-irq
     76:          0          0          0          0       GIC  qpend0.7
     77:          0          0          0          0       GIC  qpend1.8
     78:          0          0          0          0       GIC  qpend2.9
     79:          0          0          0          0       GIC  qpend3.10
     86:   55032572          0          0          0       GIC  hwqueue-8710
     88:          0          0          0          0       GIC  hwqueue-8712
     89:          0          0          0          0       GIC  hwqueue-8713
    142:          0          0          0          0       GIC  timer64-event
    184:         23          0          0          0       GIC  SRIO
    185:         30          0          0          0       GIC  SRIO LSU
    248:       3627          0          0          0       GIC  hwqueue-acc-37
    309:        208          0          0          0       GIC  serial
    419:          0          0          0          0       GIC  hyperlink0.39
    420:          0          0          0          0       GIC  hyperlink1.40
    480:          1          0          0          0       GIC  ddr3-ecc-err-irq
    483:          0          0          0          0       GIC  cic2_out32.11
    484:          0          0          0          0       GIC  cic2_out33.12
    485:          0          0          0          0       GIC  cic2_out34.13
    486:          0          0          0          0       GIC  cic2_out35.14
    487:          0          0          0          0       GIC  cic2_out36.15
    488:          0          0          0          0       GIC  cic2_out37.16
    489:          0          0          0          0       GIC  cic2_out38.17
    490:          0          0          0          0       GIC  cic2_out39.18
    491:          0          0          0          0       GIC  cic2_out40.19
    492:          0          0          0          0       GIC  cic2_out41.20
    493:          0          0          0          0       GIC  cic2_out42.21
    494:          0          0          0          0       GIC  cic2_out43.22
    495:          0          0          0          0       GIC  cic2_out44.23
    496:          0          0          0          0       GIC  cic2_out45.24
    497:          0          0          0          0       GIC  cic2_out46.25
    498:          0          0          0          0       GIC  cic2_out47.26
    499:          0          0          0          0       GIC  cic2_out18.27
    500:          0          0          0          0       GIC  cic2_out19.28
    501:          0          0          0          0       GIC  cic2_out22.29
    502:          0          0          0          0       GIC  cic2_out23.30
    503:          0          0          0          0       GIC  cic2_out50.31
    504:          0          0          0          0       GIC  cic2_out51.32
    505:          0          0          0          0       GIC  cic2_out66.33
    506:          0          0          0          0       GIC  cic2_out67.34
    520:          0          0          0          0  keystone-ipc-irq  2620040.dsp0
    521:          0          0          0          0  keystone-ipc-irq  2620044.dsp1
    522:          0          0          0          0  keystone-ipc-irq  2620048.dsp2
    523:          0          0          0          0  keystone-ipc-irq  262004c.dsp3
    524:          0          0          0          0  keystone-ipc-irq  2620050.dsp4
    525:          0          0          0          0  keystone-ipc-irq  2620054.dsp5
    526:          0          0          0          0  keystone-ipc-irq  2620058.dsp6
    527:          0          0          0          0  keystone-ipc-irq  262005c.dsp7
    IPI0:          0          0          0          0  CPU wakeup interrupts
    IPI1:          0          0          0          0  Timer broadcast interrupts
    IPI2:       1320       4500       2406       2174  Rescheduling interrupts
    IPI3:         11         19         23         16  Function call interrupts
    IPI4:         11      27299      25299      36183  Single function call interrupts
    IPI5:          0          0          0          0  CPU stop interrupts
    Err:          0
    #
    

    Thank you,

    Yaniv

  • Hi Yaniv,

    Sorry for the delayed response. I wanted to check on the interrupt count of the ECC. Does it always stay 1?
    I cannot tell from the above image which one is the ECC interrupt.

    Can you attach a text file?

    - Keerthy

  • Hi,

    In the file above there are two sections:

    1. A printout of the DDR controller's registers. The 1-bit error count is at address 0x21010130; you can see that the error happened many times (0x11). I expect the Arm to receive an interrupt each time it happens, but Linux gets only one interrupt.
    2. The second printout is /proc/interrupts. Look at entry 480, "ddr3-ecc-err-irq": Linux was interrupted only once (on CPU0). This is the problem! I expected many interrupts.
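
One possible explanation for an error count of 0x11 alongside a single interrupt is that the status bit, once set, is never cleared, so later errors produce no new 0-to-1 transition on the interrupt line. The toy model below illustrates that hypothesis only. It assumes the interrupt is raised on the rising edge of the status bit, which has not been confirmed for this device, and all names in it are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ECC_1B_ERR 0x20u	/* assumed bit position, taken from the dump */

/* Hypothetical model: one status bit plus error and interrupt counters. */
struct ecc_model {
	uint32_t status;
	unsigned errors;
	unsigned interrupts;
};

/*
 * Inject one error. An interrupt is counted only when the status bit goes
 * from 0 to 1 (assumed edge behaviour). If ack is non-zero, the simulated
 * handler clears the bit after being called.
 */
void ecc_model_inject(struct ecc_model *m, int ack)
{
	int was_pending = (m->status & ECC_1B_ERR) != 0;

	m->errors++;
	m->status |= ECC_1B_ERR;
	if (!was_pending) {
		m->interrupts++;
		if (ack)
			m->status &= ~ECC_1B_ERR;
	}
}

/* Run n injections from a clean state and return the interrupt count. */
unsigned ecc_model_run(unsigned n, int ack)
{
	struct ecc_model m = { 0, 0, 0 };
	unsigned i;

	for (i = 0; i < n; i++)
		ecc_model_inject(&m, ack);

	assert(m.errors == n);
	return m.interrupts;
}
```

In this model, `ecc_model_run(0x11, 0)` returns 1 interrupt for 17 errors, which matches the observation, while `ecc_model_run(0x11, 1)` returns 17.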

    I'll be happy to set up a video session to demonstrate this issue to your Linux expert.

    This ticket has been open for a very long time.

    Your help is needed ASAP.

    Thank you,

    Yaniv