This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

AM625: Issue about tidss rcu_preempt self-detected stall on CPU

Part Number: AM625
Other Parts Discussed in Thread: TFP410

Tool/software:

Hi Ti expert,


We are using AM625 custom board with VGA connected to LCD monitor.

We got an error log with "rcu_preempt self-detected stall on CPU" and system hang.

 

Fullscreen
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
[ 34.281580] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 34.281611] rcu: 0-....: (2 GPs behind) idle=9104/1/0x4000000000000000 softirq=0/0 fqs=4819 rcuc=21107 jiffies(starved)
[ 34.281624] (t=21000 jiffies g=6881 q=1364 ncpus=4)
[ 34.281637] CPU: 0 PID: 137 Comm: irq/289-tidss Tainted: G O 6.1.46-rt13-BSP_12.4--g17da321871 #1
[ 34.281643] Hardware name: Texas Instruments AM625 SK (DT)
[ 34.281648] pstate: a0000005 (NzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 34.281654] pc : _raw_spin_unlock_irq+0x18/0x60
[ 34.281669] lr : irq_finalize_oneshot.part.0+0x64/0x100
[ 34.281687] sp : ffff000000eb3d90
[ 34.281689] x29: ffff000000eb3d90 x28: ffff800008089000 x27: ffff000001ccfb10
[ 34.281700] x26: ffff000001ccfadc x25: ffff800008089ee0 x24: ffff000001daee00
[ 34.281708] x23: ffff000001ccfa00 x22: ffff000001ccfa60 x21: ffff000001ccfadc
[ 34.281715] x20: ffff000001daee00 x19: ffff000001ccfa00 x18: ffff8000091ee000
[ 34.281723] x17: 0000000000000000 x16: 0000000000000000 x15: 000000000000003c
[ 34.281730] x14: ffffffffffffffff x13: 0000000000000000 x12: 0000000000000000
[ 34.281737] x11: ffff000001ccf680 x10: ffff8000091ee000 x9 : 0000000000000000
[ 34.281744] x8 : ffff800008b594e8 x7 : 000000000000002b x6 : ffffffffffffffff
[ 34.281751] x5 : ffff000001ccfa60 x4 : ffff000001ccfa60 x3 : 0000000000100000
[ 34.281760] x2 : ffff800009220000 x1 : ffff0000015d6c00 x0 : 0000000100000001
[ 34.281769] Call trace:
[ 34.281772] _raw_spin_unlock_irq+0x18/0x60
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

full log :
dmesg_error.txt
Fullscreen
1
2
3
4
U-Boot SPL 2023.04-ga37da23008 (Apr 02 2024 - 05:22:32 +0000)
SYSFW ABI: 3.1 (firmware rev 0x0009 '9.1.8--v09.01.08 (Kool Koala)')
SPL initial stack usage: 13384 bytes
Trying to boot from MMC1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX



Please help check if this error is caused by Tidss.


Thanks,
Allen


  • Hi Johathan,

    Please apply the kernel patch below and capture the ftrace (cat /sys/kernel/tracing/trace_pipe in a telnet session maybe) along with the Linux console log.

    kernel-61-dss-irq-dump.diff
    Fullscreen
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    diff --git a/drivers/gpu/drm/tidss/tidss_dispc.c b/drivers/gpu/drm/tidss/tidss_dispc.c
    index 5bcc9153a977..92ceff0a049f 100644
    --- a/drivers/gpu/drm/tidss/tidss_dispc.c
    +++ b/drivers/gpu/drm/tidss/tidss_dispc.c
    @@ -870,6 +870,21 @@ void dispc_k2g_set_irqenable(struct dispc_device *dispc, dispc_irq_t mask)
    dispc_k2g_read_irqenable(dispc);
    }
    +void dispc_dump_irq_regs(struct dispc_device *dispc, int pos)
    +{
    + u32 reg[5];
    + char *prefix = pos ? "<<" : "__";
    +
    + reg[0] = dispc_read(dispc, DISPC_IRQSTATUS);
    + reg[1] = dispc_read(dispc, DISPC_VID_IRQSTATUS(0));
    + reg[2] = dispc_read(dispc, DISPC_VID_IRQSTATUS(1));
    + reg[3] = dispc_read(dispc, DISPC_VP_IRQSTATUS(0));
    + reg[4] = dispc_read(dispc, DISPC_VP_IRQSTATUS(1));
    +
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

  • Okay testing restarted

  • Uploaded 4 rcu_preempt failures over the weekend, with the tracing.  Note I don't see any difference between the dispc_dump_irq_regs output before and after the preempt happens. But maybe I missed something.

    Connection-6252-TX-XXD-RI-23026345-2024-09-15-15-08-01.log

    Connection-6252-TX-XXD-RI-23026301-2024-09-16-05-45-07.log

    Connection-6252-TX-XXD-RI-23026301-2024-09-16-05-01-55.log

    Connection-6252-TX-XXD-RI-23026301-2024-09-15-05-55-22.log

    As side note: With this trace code running, we saw a huge uptick in kernel panics relating to the camera test. All of them pointing to csi2rx_get_frame_desc, even though the panic itself was different.

    It would be interesting to see if these odd errors continue even if we aren't cat-ing the trace_pipe... But either way, this seems to instigate some error or instability in the camera drivers.

    Fullscreen
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    [50.117] TESTFIXTURE:Capturing test patterns
    [50.120]
    [50.120] Setting pipeline to PAUSED ...
    [50.280] Pipeline is live and does not need PREROLL ...
    [50.284] Pipeline is PREROLLED ...
    [50.286] Setting pipeline to PLAYING ...
    [50.289] New clock: GstSystemClock
    [50.291] [ 31.852262] Unable to handle kernel paging request at virtual address ffff80008289ba5c
    [50.299] [ 31.860237] Mem abort info:
    [50.302] [ 31.863260] ESR = 0x0000000096000061
    [50.306] [ 31.867073] EC = 0x25: DABT (current EL), IL = 32 bits
    [50.311] [ 31.872424] SET = 0, FnV = 0
    [50.314] [ 31.875498] EA = 0, S1PTW = 0
    [50.317] [ 31.878646] FSC = 0x21: alignment fault
    [50.321] [ 31.882661] Data abort info:
    [50.324] [ 31.885532] ISV = 0, ISS = 0x00000061, ISS2 = 0x00000000
    [50.329] [ 31.891020] CM = 0, WnR = 1, TnD = 0, TagAccess = 0
    [50.334] [ 31.896079] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
    [50.340] [ 31.901391] swapper pgtable: 4k pages, 48-bit VAs, pgdp=000000008333b000
    [50.346] [ 31.908094] [ffff80008289ba5c] pgd=10000000f7fff003, p4d=10000000f7fff003, pud=10000000f7ffe003, pmd=10000000818ac003, pte=006800004bc9b713
    [50.359] [ 31.920637] Internal error: Oops: 0000000096000061 [#1] PREEMPT SMP
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

  •  Thank you for these logs, they are valuable. From the logs this looks like your board, have you also been able to see/reproduce this on the TI EVM?

    Bin had a setup running since Friday, but has not seen a fail, so thought we will just cross check. 

    In your case it seems like some how DSS is not clearing up the interrupt and interrupt storm happens, hogging the A53 core 0 and eventually causing RCU_stall. 

    Fullscreen
    1
    2
    3
    4
    5
    6
    7
    [31.195] <idle>-0 [000] d.h1. 15.816868: dispc_dump_irq_regs: __: irq 0x1, vid0 0x0, vid1 0x0, vp0 0x2, vp1 0x0
    [31.205] <idle>-0 [000] d.h1. 15.816874: dispc_dump_irq_regs: <<: irq 0x0, vid0 0x0, vid1 0x0, vp0 0x0, vp1 0x0
    [31.216] CPU:0 [LOST 39936 EVENTS]
    [31.219] memtester-301 [000] dNH1. 20.602278: dispc_dump_irq_regs: __: irq 0x1, vid0 0x0, vid1 0x0, vp0 0x0, vp1 0x0
    [31.282] memtester-301 [000] dNH1. 20.602282: dispc_dump_irq_regs: <<: irq 0x1, vid0 0x0, vid1 0x0, vp0 0x0, vp1 0x0
    [31.282] memtester-301 [000] dNH1. 20.602286: dispc_dump_irq_regs: __: irq 0x1, vid0 0x0, vid1 0x0, vp0 0x0, vp1 0x0
    [31.283] memtester-301 [000] dNH1. 20.602291: dispc_dump_irq_regs: <<: irq 0x1, vid0 0x0, vid1 0x0, vp0 0x0, vp1 0x0
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

  • We have not tried to reproduce this on the TI EVM.

    In your case it seems like some how DSS is not clearing up the interrupt and interrupt storm happens, hogging the A53 core 0 and eventually causing RCU_stall. 

    Okay breaking this down so I understand it. I agree that the logs are consistent with a IRQ storm and getting stuck handling DSS interrupts.

    Fullscreen
    1
    2
    [31.219] memtester-301 [000] dNH1. 20.602278: dispc_dump_irq_regs: __: irq 0x1, vid0 0x0, vid1 0x0, vp0 0x0, vp1 0x0
    [31.282] memtester-301 [000] dNH1. 20.602282: dispc_dump_irq_regs: <<: irq 0x1, vid0 0x0, vid1 0x0, vp0 0x0, vp1 0x0
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    [000] -
    This indicates the traces are being made by CPU0. 
    Thus CPU0 isn't hung

    dNH1. -

     d - irqs are disabled.
    This makes sense since we are printing this from within an interrupt handler

     N - both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set. 
    This seems to indicate that the scheduler has stuff to do that isn't getting done.

     H - hard irq occurred inside a softirq
     h - hard irq is running
    The fact that it switches from h -> H when the problem starts seems consistent with an IRQ storm. Since the hard irq would occur as soon as we switched to softirq and thus the interrupts were re-enabled.

     1 - preempt-depth of 1

    Couldn't find a description of the last ., buts its always a period so i guess its not relevent.

     irq 0x1, vid0 0x0, vid1 0x0, vp0 0x0, vp1 0x0

    This is showing that the VID/VP interrupts aren't the issue.  Somehow the DISPC_IRQSTATUS interrupt field isn't getting cleared, as it should be 0x0 by the end of the handler...

  • This is showing that the VID/VP interrupts aren't the issue.  Somehow the DISPC_IRQSTATUS interrupt field isn't getting cleared, as it should be 0x0 by the end of the handler...

    Note the current implementation of dispc_k3_clear_irqstatus, uses the VP/VID interrupt statuses to clear the DISPC_IRQSTATUS field instead of the value from DISPC_IRQSTATUS. This means if we ever (however unlikely) end up in this situation where an IRQ is set in DISPC_IRQSTATUS but isn't in either VP_IRQSTATUS or VID_IRQSTATUS, then the DISPC_IRQSTATUS register will never be cleared. And we will end up in this infinite loop.

    Fullscreen
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    static
    void dispc_k3_clear_irqstatus(struct dispc_device *dispc, dispc_irq_t clearmask)
    {
    unsigned int i;
    u32 top_clear = 0;
    for (i = 0; i < dispc->feat->num_vps; ++i) {
    if (clearmask & DSS_IRQ_VP_MASK(i)) {
    dispc_k3_vp_write_irqstatus(dispc, i, clearmask);
    top_clear |= BIT(i);
    }
    }
    for (i = 0; i < dispc->feat->num_planes; ++i) {
    if (clearmask & DSS_IRQ_PLANE_MASK(i)) {
    dispc_k3_vid_write_irqstatus(dispc, i, clearmask);
    top_clear |= BIT(4 + i);
    }
    }
    if (dispc->feat->subrev == DISPC_K2G)
    return;
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    Note that the implementation of dispc_k2g_read_and_clear_irqstatus makes more sense as it is correctly using the value of DISPC_IRQSTATUS to clear DISPC_IRQSTATUS.

    Fullscreen
    1
    2
    3
    4
    5
    6
    7
    8
    static
    dispc_irq_t dispc_k2g_read_and_clear_irqstatus(struct dispc_device *dispc)
    {
    dispc_irq_t stat = 0;
    /* always clear the top level irqstatus */
    dispc_write(dispc, DISPC_IRQSTATUS,
    dispc_read(dispc, DISPC_IRQSTATUS));
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    I will test using the code from the k2g, however this would have the side effect that we might be clearing interrupts that the function as written wasn't intending to clear.  Example dispc_k3_set_irqenable wants to only clear interrupts it's currently enabling and so in that situation, we'd clear too many interrupts.

        dispc_write(dispc, DISPC_IRQSTATUS,
                dispc_read(dispc, DISPC_IRQSTATUS));

  •  Could you please try cherry picking https://github.com/torvalds/linux/commit/76c0b99d614127ceadcd3563dee4983c20627e09 and see if it fixes the camera issue reported in previous comment:  csi2rx_get_frame_desc+0x11c/0x1b0 [cdns_csi2rx] ?

  • Hi Jonathan,

    In additional to the the csi patch Devarsh mentioned above, please apply the following kernel patch to see if it resolves the rcu stall problem. (You don't need to dump the ftrace in this test.)

    With this patch, I expect one of the following messages in kernel dmesg:

    - tidss 30200000.dss: clearing irq 0x1 instead of 0x0
    - tidss 30200000.dss: irq 0x1 not cleared

    If the first message is printed, the rcu stall issue should be fixed. But if the second message is printed, I would expect the rcu issue still exists.

    kernel-61-dss-irq-workaround.diff
    Fullscreen
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    diff --git a/drivers/gpu/drm/tidss/tidss_dispc.c b/drivers/gpu/drm/tidss/tidss_dispc.c
    index 5bcc9153a977..2cd18955d174 100644
    --- a/drivers/gpu/drm/tidss/tidss_dispc.c
    +++ b/drivers/gpu/drm/tidss/tidss_dispc.c
    @@ -939,6 +954,7 @@ void dispc_k3_clear_irqstatus(struct dispc_device *dispc, dispc_irq_t clearmask)
    {
    unsigned int i;
    u32 top_clear = 0;
    + u32 val;
    for (i = 0; i < dispc->feat->num_vps; ++i) {
    if (clearmask & DSS_IRQ_VP_MASK(i)) {
    @@ -955,10 +971,18 @@ void dispc_k3_clear_irqstatus(struct dispc_device *dispc, dispc_irq_t clearmask)
    if (dispc->feat->subrev == DISPC_K2G)
    return;
    + val = dispc_read(dispc, DISPC_IRQSTATUS);
    + if (val > top_clear) {
    + dev_warn_ratelimited(dispc->dev,
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

  • I don't understand the use of the > check on a bit field.  Shouldn't we be checking for them being not equal or specifically checking for when top_clear is zero and val is not?

    val = dispc_read(dispc, DISPC_IRQSTATUS);
    if (val != top_clear) {
    if (val && !top_clear) {
  • Checking for (val != top_clear) is not correct. There are cases in which val is 0 but top_clear is not.

    Checking for (val && !top_clear) probably is more accurate, I just didn't want to check for 2 conditions.

    Please hold on this patch. The sw dev team just shared with me a different solution, let me test it and add some dbg printk then share the patch with you.

  • Please hold on this patch. The sw dev team just shared with me a different solution, let me test it and add some dbg printk then share the patch with you.

    Gotcha.  I do agree that adding some debug prints to try to catch how we ended up in this state would be good.

    I've attached my proposed fix.

    0001-drm-tidss-Fix-chance-of-irq-storm-with-k3_clear_irqs.diff
    Fullscreen
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    From 6a3e30438d518223c0613329c99240b8507df116 Mon Sep 17 00:00:00 2001
    From: Jonathan Cormier <jcormier@criticallink.com>
    Date: Tue, 17 Sep 2024 10:22:18 -0400
    Subject: [PATCH] drm/tidss: Fix chance of irq storm with k3_clear_irqstatus
    dispc_k3_clear_irqstatus was using a clearmask based on its submodule
    irqstatuses to clear the main dispc_irqstatus register. This led to
    the scenario were if ever the submodules irqstatuses were clear but
    the main dispc_irqstatus wasn't, the driver would never clear the irq
    and an interrupt storm would occur.
    Make sure we are setting the dispc clearmask based on the dispc_irqstatus register.
    Signed-off-by: Jonathan Cormier <jcormier@criticallink.com>
    ---
    drivers/gpu/drm/tidss/tidss_dispc.c | 21 ++++++++++++---------
    1 file changed, 12 insertions(+), 9 deletions(-)
    diff --git a/drivers/gpu/drm/tidss/tidss_dispc.c b/drivers/gpu/drm/tidss/tidss_dispc.c
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

  • Hi Jonathan,

    Attached below is the patch from our sw dev, it is currently under discussion with the DSS driver maintainer. What the patch basically does is to clear DISPC_IRQSTATUS register with its current value at the end of dispc_k3_clear_irqstatus().

    0001-drm-tidss-Clear-global-irq-status-unconditionally.patch

  • Attached below is the patch from our sw dev, it is currently under discussion with the DSS driver maintainer. What the patch basically does is to clear DISPC_IRQSTATUS register with its current value at the end of dispc_k3_clear_irqstatus().

    I did test this change last night.  We got about 1200 test cycles without any rcu_preempts.  This test was not long enough to guarantee a rcu_preempt but we probably should have seen one. 

    The reason I didn't stick with this simple implementation is because of the irqenable implementation.  It was possible the irqenable code could accidentally consume an interrupt that wouldn't then be handled.  I don't know enough about the dss to know if this would be critical or not.

  • I guess your last night test is still running?

    I don't work on DSS and don't know this module either, but I will relay your concern to our sw dev team.

  • I guess your last night test is still running?

    I don't work on DSS and don't know this module either, but I will relay your concern to our sw dev team.

    We had already stopped it and started a variation of your last post xD.  I am curious exactly how we end up in this state.

    I appreciate that, thanks for the help you've given so far.

  • Testing with my last patch is also fine, all the last few patches are all basically do the same thing.

    I am curious exactly how we end up in this state.

    Please see the inline comments in patch "0001-drm-tidss-Clear-global-irq-status-unconditionally.patch", the problem seems related to a hardware bug in which the software doesn't handle the corner case correctly.

    I appreciate that, thanks for the help you've given so far.

    No worries, I am glad we finally found the root cause of the rcu stall problem. Thank you for the patience.

  • So far with this patch, we've not seen any of the camera induced kernel panics.

  • This is great! Thanks for the update.

    The reason I didn't stick with this simple implementation is because of the irqenable implementation.  It was possible the irqenable code could accidentally consume an interrupt that wouldn't then be handled.  I don't know enough about the dss to know if this would be critical or not.

    Here is the response from our sw dev about this concern:

    The concern about clearing any extra level 1 interrupt which was not enable in first place should not happen in the first place since we are not relying on level 1 interrupts altogether and checking all level 2 status directly instead.

  • Testing with my last patch is also fine, all the last few patches are all basically do the same thing.

    At just shy of 2000 tests, no preempts/irq storms yet.  Will check tests again tomorrow.

    Note: I have seen 4 "clearing irq 0x1 instead of 0x0" prints between the two units. I assume that each of these would have put us in the irq storm or at least given us a chance of being stuck.

    "clearing irq 0x0 instead of 0x1" happen a couple times every boot.  This does seem a little odd as I'd assume you cannot have a VP or VID irq without the parent irqstatus also being set. But maybe thats part of the hardware bug mentioned above.  Regardless clearing an irq that isn't set should be okay, so I'm ignoring these.

  • The concern about clearing any extra level 1 interrupt which was not enable in first place should not happen in the first place since we are not relying on level 1 interrupts altogether and checking all level 2 status directly instead.

    Hmm, does that mean there are separate interrupt lines for the level 2 interrupts to the processor, so they don't have to go through the level 1?   I had assumed this was an irq chain or vector? I don't recall the exact terminology.

  • Note: I have seen 4 "clearing irq 0x1 instead of 0x0" prints between the two units. I assume that each of these would have put us in the irq storm or at least given us a chance of being stuck.

    Yes, this is the case which would trigger irq storm without the patch.

    "clearing irq 0x0 instead of 0x1" happen a couple times every boot. 

    I believe my version of the patch won't print this message, since I have "if (val > top_clear)", and '1' will be written to DISPC_IRQSTATUS register.

    What the "if" condition in the patch have you changed to? is '0' or '1' written to DISPC_IRQSTATUS register in this case?

  • Right now its just doing !=.  So 0 gets written, clearing no interrupts.

    Attached below is the patch from our sw dev, it is currently under discussion with the DSS driver maintainer. What the patch basically does is to clear DISPC_IRQSTATUS register with its current value at the end of dispc_k3_clear_irqstatus().

    0001-drm-tidss-Clear-global-irq-status-unconditionally.patch

    Note that this patch seems to accidentally get rid of the "Flush posted writes" line

  • Right now its just doing !=.  So 0 gets written, clearing no interrupts.

    Okay. I guess this is okay, since the interrupt has been clear in the level 2 register.

    With !=, the patch is basically does the same thing as our sw dev's version:
    dispc_write(dispc, DISPC_IRQSTATUS, dispc_read(dispc, DISPC_IRQSTATUS));

    Note that this patch seems to accidentally get rid of the "Flush posted writes" line

    Thanks for catching it, I am not sure if this flush read() is important or not, but I will let our sw dev know.

  • dispc_write(dispc, DISPC_IRQSTATUS, dispc_read(dispc, DISPC_IRQSTATUS));

    Agreed.  I'm mostly running this version of the fix to get some indication of how many storms we've prevented.

  • The concern about clearing any extra level 1 interrupt which was not enable in first place should not happen in the first place since we are not relying on level 1 interrupts altogether and checking all level 2 status directly instead.

    Assuming the testing continues to look good tomorrow, we are planning on providing a patch to our customer.  Can I get an updated copy of the patch, including commit message that TI plans to use?

  • I will ask for the final patch but not sure I will get it though. As far as I know, the patch is still under review with the DSS maintainer in the kernel upstream.

    Regarding missing "Flush posted writes" in the patch, it was intentional. But our sw dev told me that the DSS maintainer wants to keep it, so it should be added back in the final version.

  • Is there a mailing list thread?  I'd like to add my reviewed-by/tested-by on it.

  • Is there a mailing list thread?  I'd like to add my reviewed-by/tested-by on it.

    I will ask out sw dev for it.

    To narrow down the irq storm issue, can you please help to run another test? Please revert the previous kernel patch which fixes the irq storm, but apply the following debug patch. The irq storm or rcu stall would happen with this patch, but the kernel log should provide more information.

    Please note that I got the patch from our sw dev, and only did kernel compile test. Please let me know if you run into any issue with this patch.

    kernel-61-dss-irq-dbg-0919.diff
    Fullscreen
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    diff --git a/drivers/gpu/drm/tidss/tidss_dispc.c b/drivers/gpu/drm/tidss/tidss_dispc.c
    index 5bcc9153a977..64fad287c7fd 100644
    --- a/drivers/gpu/drm/tidss/tidss_dispc.c
    +++ b/drivers/gpu/drm/tidss/tidss_dispc.c
    @@ -38,6 +38,8 @@
    #include "tidss_dispc_regs.h"
    #include "tidss_scale_coefs.h"
    +bool toggle = 0;
    +
    static const u16 tidss_k2g_common_regs[DISPC_COMMON_REG_TABLE_LEN] = {
    [DSS_REVISION_OFF] = 0x00,
    [DSS_SYSCONFIG_OFF] = 0x04,
    @@ -884,6 +886,15 @@ static void dispc_k3_vp_write_irqstatus(struct dispc_device *dispc,
    u32 stat = dispc_vp_irq_to_raw(vpstat, vp_idx);
    dispc_write(dispc, DISPC_VP_IRQSTATUS(vp_idx), stat);
    +
    + if (toggle && vp_idx == 0) {
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

  • I saw 4 irq storms overnight with the above debug.  Each printed the following.

    [24.006] [   11.970174] dispc_k3_clear_irqstatus: Mismatch detected : VP_IRQ_STATUS = 2, DISPC_IRQ status = 1 toggle = 0
    [24.007] [   11.980029] dispc_k3_clear_irqstatus: Mismatch detected : VP_IRQ_STATUS = 0, DISPC_IRQ status = 1 toggle = 0
    [24.007] [   11.989900] dispc_k3_clear_irqstatus: Mismatch detected : VP_IRQ_STATUS = 0, DISPC_IRQ status = 1 toggle = 0
    ... Repeat last line forever

    We've added some additional print statements to track this down.
  • Thanks for the log.

    Note: I have seen 4 "clearing irq 0x1 instead of 0x0" prints between the two units. I assume that each of these would have put us in the irq storm or at least given us a chance of being stuck.

    "clearing irq 0x0 instead of 0x1" happen a couple times every boot.  This does seem a little odd as I'd assume you cannot have a VP or VID irq without the parent irqstatus also being set. But maybe thats part of the hardware bug mentioned above.  Regardless clearing an irq that isn't set should be okay, so I'm ignoring these.

    Do you still have the log showing both "clearing" messages? If so, please share.

  • Do you still have the log showing both "clearing" messages? If so, please share.

    Connection-6252-TX-XXD-RI-23026301-2024-09-17-15-00-31.log

  • We've added some additional print statements to track this down.

    Alright so some interesting developments.  I added on to the Mismatch detected debug print to get some additional info.

    I also thought it would be interesting to know if the irqenable function was getting called, since it also can call dispc_k3_clear_irqstatus.  However when I added a pr_err in that function, suddenly irq storm error rates jumped from 1 in 600 tests to 7 in 10 tests.  It seems that adding some amount of delay to the dispc_k3_set_irqenable() function, greatly increases the chance of this bug.

    Removing the print statement or using trace_printk instead, returns the error rate to normal slow rates.

    Note:  A debug countdown counter was added to ensure the next 10 dispc_k3_clear_irqstatus calls would print out after an irqenable.  I wanted to make sure I wasn't missing calls.

    In this log, right before we enter the irq storm, we see that some IRQs were disabled.  These are bits 5 and 6, DSS_IRQ_VP_VSYNC_EVEN(0) | DSS_IRQ_VP_VSYNC_ODD(0). Presumably from tidss_irq_disable_vblank().  Note: From other logs, I have seen an irq storm trigger from the opposite tidss_irq_enable_vblank(). 

    The clear_mask ends up as 1, which doesn't end up doing anything. And yet the very next clear_irqstatus, the VP_IRQ_STATUS register is no longer 2.  I haven't yet found anything that clears that register between the two calls...  So maybe, if the hardware is clearing the VP irq status itself, thats what is creating the difference between the vp status and the dispc status. Don't know, pretty strange.

    Fullscreen
    1
    2
    3
    4
    [ 11.970224] dispc_k3_set_irqenable : irqenabled - mask = 91, old = f0, clr = 1
    [24.014] [ 11.975895] dispc_k3_clear_irqstatus: Mismatch detected : VP_IRQ_STATUS = 2, DISPC_IRQ status = 1 toggle = 0, clearmask = 1, top_stat = 1, top_clear = 0
    [24.014] [ 11.989591] dispc_k3_clear_irqstatus: Mismatch detected : VP_IRQ_STATUS = 0, DISPC_IRQ status = 1 toggle = 0, clearmask = 0, top_stat = 1, top_clear = 0
    [24.014] [ 12.003395] dispc_k3_clear_irqstatus: Mismatch detected : VP_IRQ_STATUS = 0, DISPC_IRQ status = 1 toggle = 0, clearmask = 0, top_stat = 1, top_clear = 0
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    Next, I commented out the dispc_k3_clear_irqstatus call from the irqenable function.  And the irq storm triggered anyways so it doesn't look like it's the clearmask of 1 that is at fault. Even though its a bit out of place. 

    Note: Bit(0) is DSS_IRQ_DEVICE_OCP_ERR. This doesn't seem to exist on this hardware. And all the VP/VID irq functions ignore it.  Not sure if its an artifact from a different driver or an older version of the driver...

    Fullscreen
    1
    2
    3
    4
    [ 11.970224] dispc_k3_set_irqenable : irqenabled - mask = 91, old = f0, clr = 1
    [24.042] [ 11.977565] dispc_k3_clear_irqstatus: Mismatch detected : VP_IRQ_STATUS = 0, DISPC_IRQ status = 1 toggle = 0, clearmask = 0, top_stat = 0, top_clear = 0, counter = 9
    [24.042] [ 11.992459] dispc_k3_clear_irqstatus: Mismatch detected : VP_IRQ_STATUS = 0, DISPC_IRQ status = 1 toggle = 0, clearmask = 0, top_stat = 0, top_clear = 0, counter = 8
    [24.042] [ 12.007354] dispc_k3_clear_irqstatus: Mismatch detected : VP_IRQ_STATUS = 0, DISPC_IRQ status = 1 toggle = 0, clearmask = 0, top_stat = 0, top_clear = 0, counter = 7
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    dbg.diff
    Fullscreen
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    diff --git a/drivers/gpu/drm/tidss/tidss_dispc.c b/drivers/gpu/drm/tidss/tidss_dispc.c
    index 5bcc9153a977..e769cc95c590 100644
    --- a/drivers/gpu/drm/tidss/tidss_dispc.c
    +++ b/drivers/gpu/drm/tidss/tidss_dispc.c
    @@ -38,6 +38,9 @@
    #include "tidss_dispc_regs.h"
    #include "tidss_scale_coefs.h"
    +bool toggle = 0;
    +u32 counter = 0;
    +
    static const u16 tidss_k2g_common_regs[DISPC_COMMON_REG_TABLE_LEN] = {
    [DSS_REVISION_OFF] = 0x00,
    [DSS_SYSCONFIG_OFF] = 0x04,
    @@ -884,6 +887,15 @@ static void dispc_k3_vp_write_irqstatus(struct dispc_device *dispc,
    u32 stat = dispc_vp_irq_to_raw(vpstat, vp_idx);
    dispc_write(dispc, DISPC_VP_IRQSTATUS(vp_idx), stat);
    +
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    I'm not sure where to look next.  Hopefully, yall have some insight that I'm missing.

    Note I do think the DISPC_IRQ clear change, still makes sense.  Regardless what we find the cause is for the start of the irq storm.   It doesn't make sense to me to clear DISPC_IRQ bits based on the assumptions of its level 2 interrupts.

  • > The clear_mask ends up as 1, which doesn't end up doing anything. And yet the very next clear_irqstatus, the VP_IRQ_STATUS register is no longer 2.

    Yes this looks pretty strange, not something driver intended to from from my POV and thinking if attached patch should fix this behavior although I am not 100% sure about the intention of this code from original author.

    > haven't yet found anything that clears that register between the two calls...  So maybe, if the hardware is clearing the VP irq status itself, thats what is creating the difference between the vp status and the dispc status. Don't know, pretty strange.

    [24.014] [   11.975895] dispc_k3_clear_irqstatus: Mismatch detected : VP_IRQ_STATUS = 2, DISPC_IRQ status = 1 toggle = 0, clearmask = 1, top_stat = 1, top_clear = 0

    Yes that looks pretty strange on who clears the vp status after above log :) we will try to see the code again and also check with folks here if any clue we can get. But thanks a lot for observations :

  • Thanks Jonathan, this information helps. Could you please revert all the patches and try with just the attached patch ?  /cfs-file/__key/communityserver-discussions-components-files/791/0001_2D00_drm_2D00_tidss_2D00_Clear_2D00_the_2D00_interrupt_2D00_status_2D00_for_2D00_interrupts_2D00_.patch

    We started testing this patch this morning, so we still need some time to get good results.

    I'm unclear if this patch makes sense.  I would think you would want to clear any possible pending interrupts before you enable the interrupts, like it was doing before.  And it could also make sense to clear them after disabling them, though not before disabling them as that would leave a small window for them to be set again.

    But I guess all of this assumes the irq status bits are sticky and don't auto-clear when the irqs are disabled.  If they did auto-clear that might cause the problem we are seeing, atleast if the 2nd levels auto-cleared and the 1st level didn't.

  • > We started testing this patch this morning, so we still need some time to get good results.

    Thanks Jonathan.

    > I would think you would want to clear any possible pending interrupts before you enable the interrupts, like it was doing before.  And it could also make sense to clear them after disabling them, though not before disabling them as that would leave a small window for them to be set again.

    Well I thought about calling the clear function towards the end i.e. after disabling the irq but that may also clear the top level irq status for any newly enabled interrupt unknowingly. Furthermore the way I see this clearing is as a fail-safe or backup option since ideally the isr should be clearing any pending interrupts anyhow.

    >However when I added a pr_err in that function, suddenly irq storm error rates jumped from 1 in 600 tests to 7 in 10 tests.  It seems that adding some amount of delay to the dispc_k3_set_irqenable() function, greatly increases the chance of this bug.

    I tried to look into this and I see some possibilty of race condition between different functions handling the clearing of irq status related registers. Attached patch should fix it, although I am not entirely sure if the issue we are facing is related to the race condition. If you could try out with just the attached patch too without any other changes that would help root-cause this further.

    /cfs-file/__key/communityserver-discussions-components-files/791/0001_2D00_drm_2D00_tidss_2D00_Avoid_2D00_race_2D00_condition_2D00_while_2D00_handling_2D00_interr.patch

    In this log, right before we enter the irq storm, we see that some IRQs were disabled.  These are bits 5 and 6, DSS_IRQ_VP_VSYNC_EVEN(0DSS_IRQ_VP_VSYNC_ODD(0). Presumably from tidss_irq_disable_vblank().  Note: From other logs, I have seen an irq storm trigger from the opposite tidss_irq_enable_vblank(). 

    That's strange but for some reason, your system going into suspend state or display is timing out after staying idle ? If so then, it may trigger tidss_irq_disable_vblank() ? If you could put prints in dispc_runtime_resume() and dispc_runtime_suspend(), that may help confirm the same.

    >But I guess all of this assumes the irq status bits are sticky and don't auto-clear when the irqs are disabled.  If they did auto-clear that might cause the problem we are seeing, atleast if the 2nd levels auto-cleared and the 1st level didn't.

    [24.014] [   11.975895] dispc_k3_clear_irqstatus: Mismatch detected : VP_IRQ_STATUS = 2, DISPC_IRQ status = 1 toggle = 0, clearmask = 1, top_stat = 1, top_clear = 0

    This i presume is w.r.t above log, per my conversation with hardware team here, auto-clear on disable is not supported. But if IP is powered off than the status may get cleared including both the VP status and top level IRQ status, but as I infer you saw VP status as reset but top level IRQ status was set to 0x1, if you could confirm that this is happenning right after suspend/resume cycle i.e. after resume top level irq status is still set as 0x1, then it is an abnormal behaviour and I can follow up with hardware team here.

    Also thanks for the help and sorry for suggesting multiple experiments on this. BTW is it possible to share your test setup related details like how you are testing or if you are using any test scripts and any hacks to reproduce the issue quickly?

    Regards

    Devarsh

  • Thanks Jonathan, this information helps. Could you please revert all the patches and try with just the attached patch ?  /cfs-file/__key/communityserver-discussions-components-files/791/0001_2D00_drm_2D00_tidss_2D00_Clear_2D00_the_2D00_interrupt_2D00_status_2D00_for_2D00_interrupts_2D00_.patch

    Tests ran about 2400 times.  I'd expect about 2-4 rcu_preempts but saw none. 

    This seems to push this issue more towards it being a software bug, not a hardware bug. 

    Based on this, I think what is happening is the VSYNC irqs get disabled, but the code doesn't clear the statuses.  And unfortunately the level 2 statuses clear themselves, leaving the level 1 DISPC status still set.  And the previous code will never clear the level 1 status if the level 2 aren't set.

    I do think it still makes sense to clear the DISPC interrupts, like mine and Bins patch does. And clear the level 2 interrupts in the irqenable.

    I updated my original patch to include this new information.  Let me know if you think its worth testing this variant of the fix.

    0001-v2-drm-tidss-Prevent-potential-IRQ-storm-in-k3_clear_irq.diff
    Fullscreen
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    From c0574ff68313e96dd975ec9ba8b441f79c0420de Mon Sep 17 00:00:00 2001
    From: Jonathan Cormier <jcormier@criticallink.com>
    Date: Tue, 17 Sep 2024 10:22:18 -0400
    Subject: [PATCH] drm/tidss: Prevent potential IRQ storm in k3_clear_irqstatus
    The function dispc_k3_clear_irqstatus was previously using a clear mask
    derived from its submodule irqstatus values to clear the main
    dispc_irqstatus register. This created a scenario where, if the
    submodule irqstatus bits were cleared but the main dispc_irqstatus
    wasn't, the driver would fail to clear the interrupt, leading to an
    interrupt storm.
    Additionally, when vp/vid irqenable values were being cleared while the
    corresponding IRQ status bits remained set. The vp/vid IRQ status bits
    would unexpectedly clear themselves, while the associated
    dispc_irqstatus bits remained set. This mismatch would put us in the
    interrupt storm mentioned above.
    This patch ensures the main clear mask is correctly derived from the
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

  • > Tests ran about 2400 times.  I'd expect about 2-4 rcu_preempts but saw none. 

    Thanks Jonathan this is good information.

    >Based on this, I think what is happening is the VSYNC irqs get disabled, but the code doesn't clear the statuses.  And unfortunately the level 2 statuses clear themselves, leaving the level 1 DISPC status still set.  And the previous code will never clear the level 1 status if the level 2 aren't set.

    Well I had discussed the same with IP team previously and they mentioned this scenario is not possible that HW clear the VP status automatically. I am planning to run some experiment to see if this is actually happenning in which case it is an IP/H/W bug.

    > I updated my original patch to include this new information.  Let me know if you think its worth testing this variant of the fix.

    This looks similar to /cfs-file/__key/communityserver-discussions-components-files/791/0001_2D00_drm_2D00_tidss_2D00_Clear_2D00_global_2D00_irq_2D00_status_2D00_unconditionally.patch and the version Bin made so I am pretty sure it would work :) as previous test worked, but last I checked with maintainer on these approaches I was asked to get clarification on exact HW issue landing us into this state and H/W team has denied existence of any such issues faced in their testing, so for now my focus is to get some proof and steps to reproduce which clearly convey any H/W abnormally if any and also rule out any s/w bugs and later decide on which version of patch is the best approach. But if we decide to go with this and maintainer is okay than irrespective of  which version gets chosen we will add developed-by/signed-off for you and Bin.

    Also could you please help to put a test with all changes reverted and just applying the avoid race condition patch shared in previous patch : /cfs-file/__key/communityserver-discussions-components-files/791/0001_2D00_drm_2D00_tidss_2D00_Avoid_2D00_race_2D00_condition_2D00_while_2D00_handling_2D00_interr.patch ?

    I just want to rule out any race conditions causing this behaviour :).

    Regards

    Devarsh

  • Also could you please help to put a test with all changes reverted and just applying the avoid race condition patch shared in previous patch : /cfs-file/__key/communityserver-discussions-components-files/791/0001_2D00_drm_2D00_tidss_2D00_Avoid_2D00_race_2D00_condition_2D00_while_2D00_handling_2D00_interr.patch ?

    Can you work me through this patch? The k2g functions don't run on the k3 hardware. The hw_init function gets called before the irq handlers are setup.  I don't see where the race conditions are, at least on the am62x.

  • There is a possibility of race between irq handler (tidss_irq_handler) and other functions to program/clear/handle the interrupt status related registers for e.g. tidss_irq_disable_vblank. This patch also adds locking to the irq handler to avoid this scenario. 

  • Ah okay, I didn't notice those locks were already in place.

  • Thanks Jonathan, this information helps. Could you please revert all the patches and try with just the attached patch ?  /cfs-file/__key/communityserver-discussions-components-files/791/0001_2D00_drm_2D00_tidss_2D00_Clear_2D00_the_2D00_interrupt_2D00_status_2D00_for_2D00_interrupts_2D00_.patch

    Okay stopped testing this patch.  Two units ran about 5k tests each without a rcu_preempt.

    Starting the spin_lock patch testing.

  • Also could you please help to put a test with all changes reverted and just applying the avoid race condition patch shared in previous patch : /cfs-file/__key/communityserver-discussions-components-files/791/0001_2D00_drm_2D00_tidss_2D00_Avoid_2D00_race_2D00_condition_2D00_while_2D00_handling_2D00_interr.patch ?

    Both test units have had several rcu_preempts with only this patch.

  • Thanks Jontathan, this helps I will probably try some experiments and have some conversations with IP team for any unexpected behaviours from the IP before concluding. But anyways I hope you are unblocked on this thread with the set of patches we have on this thread.

  • Devarsh,

    We have shared a version of the patches discussed here with our customer and initial testing has shown an elimination of the rcu_preempts during operation. Thanks for your help.

    -Bob

  • But anyways I hope you are unblocked on this thread with the set of patches we have on this thread

    Thanks Devarsh,  Our customer has been testing with Bins Clear global irq patch, and it looks good.  So yes we are unblocked.

    However, I am still waiting to see what we think the final official patch should look like. Let me know if we can help.

  • Thanks Jonathan, this information helps. Could you please revert all the patches and try with just the attached patch ?  /cfs-file/__key/communityserver-discussions-components-files/791/0001_2D00_drm_2D00_tidss_2D00_Clear_2D00_the_2D00_interrupt_2D00_status_2D00_for_2D00_interrupts_2D00_.patch

    We have been testing this patch by itself, and while it reduces the failure rate, we still see rcu preempts.  Our latest testing showed a failure in ~1/5000 tests.

  • Thanks Jonathan, that's an interesting observation, I would be curious to know the observations with both

    /cfs-file/__key/communityserver-discussions-components-files/791/0001_2D00_drm_2D00_tidss_2D00_Avoid_2D00_race_2D00_condition_2D00_while_2D00_handling_2D00_interr.patch and https://e2e.ti.com/cfs-

    file/__key/communityserver-discussions-components-files/791/0001_2D00_drm_2D00_tidss_2D00_Clear_2D00_the_2D00_interrupt_2D00_status_2D00_for_2D00_interrupts_2D00_.patch applied together.

    If RCU stall is still seen with above two patches applied, then I guess it could be that some H/W IP bug is landing into a situation where level 1 and level 2 irq are getting mismatched and we then mandatorily need /cfs-file/__key/communityserver-discussions-components-files/791/0001_2D00_drm_2D00_tidss_2D00_Clear_2D00_global_2D00_irq_2D00_status_2D00_unconditionally.patch 

    to force clear the global irq unconditionally. 

  • Thanks Jonathan, that's an interesting observation, I would be curious to know the observations with both

    Sure we will start this testing.

    If RCU stall is still seen with above two patches applied, then I guess it could be that some H/W IP bug is landing into a situation where level 1 and level 2 irq are getting mismatched and we then mandatorily need

    We are currently planning on using both the clear irq status patch and the clear irq unconditionally patch as that is super solid.  Granted just the clear irq unconditionally is enough to remove the chance of the infinite irq storm.

  • Hi Jonathan,

    Just wanted to check if tests ran well or not.

    Regards

    Devarsh

1 2 3