This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

PROCESSOR-SDK-AM65X: Issue with nested interrupts on R5F

Part Number: PROCESSOR-SDK-AM65X
Other Parts Discussed in Thread: SYSBIOS

Dear TI team,

I've been investigating an issue with our own application that looks very similar to another issue posted before (https://e2e.ti.com/support/processors/f/791/t/787929), but that thread unfortunately ended without a solution since the problem "disappeared".

The symptoms are very similar to what the original post describes: Our application has a thread that repeatedly calls Task_sleep(1), and eventually that thread "dies" because according to ROV the timeout is 4 billion + (actually a small negative number). In our case there are also two interrupts involved, the timer interrupt and a PRU interrupt (from the ethercat slave firmware).

What happens is that the timer interrupt is triggered, and the interrupt processing gets within ti_sysbios_knl_Clock_doTick__I() up to the point where (&ti_sysbios_knl_Clock_Module__state__V)->ticks++ has been incremented but ->swiCount++ not yet. At this point the timer interrupt is preempted by a PRU interrupt which runs until completion. At that moment a NEW timer interrupt is being processed, i.e. the core is not resuming processing of the original timer interrupt, but starts executing from ti_sysbios_family_arm_v7r_keystone3_Hwi_dispatchIRQ__I(). That interrupt processing also moves through ti_sysbios_knl_Clock_doTick__I(), increments ->ticks++ once more, and then triggers the clock SWI. At that point ->ticks has been incremented twice, but ->swi_count was only incremented once, causing the timeout for our Task_sleep(1) to be missed. If I manually "correct" the corresponding Clock object's ->currTimeout processing resumes just fine.

The real issue is apparently that the timer interrupt is preempted by the PRU interrupt but then doesn't get to finish processing, but rather takes the timer interrupt again.

The VIM chapter in the TRM (revision D) has a small paragraph on interrupt priorization that seems to suggest that the VIM would keep track of preempted interrupts:

If the CPU switches this interrupt to active (by reading the
VIM_FIQVEC/VIM_IRQVEC register), then the currently active interrupt will be pushed onto a stack. When
an interrupt is cleared by reading the VIM_FIQVEC/VIM_IRQVEC register, if there are any interrupts on
the stack, the first entry is popped off and put back into the VIM_ACTFIQ/VIM_ACTIRQ register, so that
software may continue where it left off.

Unfortunately the TRM is otherwise rather sparse on the details of this "stack", e.g. it doesn't say anywhere how deep that stack is.

I'm currently using SYS-BIOS from 6.75.02 from processor SDK 06.00, but I also gave SYS-BIOS 6.76.02 (latest download on the website) a try, unfortunately with the same results.

Regards,

Dominic

  • Hi Dominic,

    Would it be possible for you to share a sample project that shows this issue?

    Regards,
    Frank

  • Hello Frank,

    no, unfortunately I don't have a sample project that I can share.

    I've uploaded a part of an instruction trace that shows the problem:

    /cfs-file/__key/communityserver-discussions-components-files/791/clock_5F00_misses_5F00_timeout_5F00_cut.xls

    Index 50323 shows a PRU interrupt that is processed just fine

    Index 51093 is the "first" timer interrupt that runs until index 51323. At that point it incremented ticks but not yet swi_count.

    Index 51324 is another PRU interrupt that is preempting the timer interrupt. That runs just fine until index 51593.

    Index 51594 is the "second" timer interrupt. This one runs until it is finished, and then the clock SWI is executed.

    Unfortunately the TRM doesn't provide enough information for me to be able to tell if the current implementation in SYS/BIOS is correct. For example if there really is a stack of preempted interrupts in the VIM hardware that stack would probably have some limits, i.e. a maximum nesting depth. It also fails to explain how interrupt priorization works with the hardware handshaking (vector interface). I believe it also contains a minor bug, since in one sentence it says that "an interrupt is cleared by reading the VIM_FIQVEC/VIM_IRQVEC register" while it is obviously cleared by writing to those registers, and made active by reading the registers.

    I probably wont have time to create a sample that shows the issue since due to the many problems we had with the R5F we're going to use the A53 for now. We'll probably have to use the R5F again at a later point, so it would be great if you could verify if there is an issue with the current implementation.

    Regards,

    Dominic

  • Dominic,

    Thanks for the trace and the additional info. Have you tried using a longer tick period? It may be a workaround until we can determine if this is a bug. I am working on trying to replicate this, I will get back to you by this coming Wednesday to let you know of progress.

    Thanks,

    Sean

  • A few things in the meantime:

    - Could you provide the kernel cfg file?

    - Can you describe what is being done in the PRU interrupt?

    - In the trace you provided, we can see that the first clockDoTick has no logging but the second clockDoTick has a logTick immediately afterwards. Do you know why logging is only happening on some ticks?

    Thanks,

    Sean

  • We may have found the issue. Could you change your Hwi.c to look like this and seeing if the rebuilt kernel still sees the problem?

    705 /*
    706  *  ======== Hwi_dispatchIRQC ========
    707  *  Configurable IRQ interrupt dispatcher.
    708  */
    709 Void Hwi_dispatchIRQC(Hwi_Irp irp)
    710 {
    711     Int i;
    712     Int swiKey;
    713     UInt intNum;
    714     Hwi_Object *hwi;
    715     BIOS_ThreadType prevThreadType;
    716 
    717     /* Read register to make active and mask all other irqs
    718      * equal or lower priority */
    719     Hwi_vim.IRQVECADDRESS;
    720 
    721     if (Hwi_vim.ACTIVEIRQ & 0x80000000) {
    722         intNum = Hwi_vim.ACTIVEIRQ & 0x3FF;
    723     }
    724     else {
    725         Hwi_module->spuriousInts++;
    726         Hwi_vim.IRQVECADDRESS = 0;
    727         return;
    728     }

    Hwi.c should be in (bios_folder)/packages/ti/sysbios/family/arm/v7r/keystone3/

    Line 719 reads a VIM register which should cause VIM to mask all equal and lower priority interrupts as well as a few other side effects.

  • Hello Sean,

    thank you for working on this.

    Unfortunately that change didn't solve my problem. I'm still getting freezes, but they are actually worse / less predictable than the issue that I had before. The CPU was still executing, but the system time (->ticks) didn't increase anymore. I haven't had time to further anaylze this behaviour yet.

    I'm also not sure if that change is correct, since as far as I understand reading from VIM_IRQVEC is the equivalent of the hardware handshake, and the VE bit in SCTLR is set to 1, so SYS/BIOS seems to be using the hardware handshake mechanism.

    I've spent some more time debugging this issue. It seems that the interrupt preemption happens rather often, but that doesn't always cause a timeout to be missed. I guess that makes sense, since only if the tick ISR is preempted AFTER incrementing ->ticks and BEFORE incrementing ->swi_count does the clock code miss a timeout.

    I've modified the code in ti_sysbios_family_xxx_Hwi_switchAndRunDispatchC slightly to be able to break on the "nested IRQ" case, and that breakpoint triggers often (i.e. almost immediately). This matches a trace of Clock module logging events (from ROV, see attached document: /cfs-file/__key/communityserver-discussions-components-files/791/clock_5F00_misses_5F00_timeout_5F00_ClockEvents.xls) that shows (for the brief period of time) a lot of ti.sysbios.knl.Clock.LM_tick events that appear way too soon after the previous tick event.

    I've debugged a case of the nested IRQ (according to the trace the first IRQ was preempted right after HWIs were being re-enabled in Hwi_dispatchIRQC) and checked the VIM_ACTIRQ register right before and after the nested IRQ writes to VIM_IRQVEC (Hwi_vim.IRQVECADDRESS = intNum;). Before the write VIM_ACTIRQ contains 0x800000be, after the write it contains 0xbe. According to the documentation this should have caused the preempted interrupt to be put back into VIM_ACTIRQ, but that's not the case. Once the nested IRQ finishes and interrupts get re-enabled the interrupt processing starts again from ti_sysbios_family_arm_v7r_keystone3_Hwi_dispatchIRQ__I.

    Regarding your questions:

    - Could you provide the kernel cfg file?

    Would it be ok to send you the configuration file via a private message? That makes it easier for me to share sources.

    - Can you describe what is being done in the PRU interrupt?

    Our code is based on TI's ethercat slave sample application. From what I can see PRUICSS_hwiIntHandler calls EscCmdLowAckIsr which merely acknolwedges the interrupt, then PRUICSS_hwiIntHandler posts a semaphore, and the ISR is done.

    - In the trace you provided, we can see that the first clockDoTick has no logging but the second clockDoTick has a logTick immediately afterwards. Do you know why logging is only happening on some ticks?

    Sorry, I forgot to "unfold" part of the trace. If you check the excel sheet you'll see that rows 872-989 have their height set to 0. I set that to make the trace easier readable.

    Regards,

    Dominic

  • Thanks for trying that Dominic,

    Sending the cfg file via PM would be great thanks. We're still working the issue, I will let you know by the 24th what the status is. Thanks for your in-depth debugging and responses, it is very helpful. I will get back to you soon.

    Thanks,

    Sean

  • Dominic, we've reproduced the issue and have filed a bug for it (SYSBIOS-1218). We've confirmed that setting the Hwi priorities equal to each other prevents the bad behavior as noted in the other thread you posted (if a work around is needed). We are looking for the fix, I will post an update within a week.

    Thanks,

    Sean

  • Hi Dominic,

    We think we have a fix. The testing team ran this one continuously for several hours and reported no issues so we have more confidence in this one. Please replace the files in <bios>/packages/ti/sysbios/family/arm/v7r/keystone3 with the two attached files (making sure to save the originals somewhere just in case). Then rebuild your project and you should be ready to test.

    /cfs-file/__key/communityserver-discussions-components-files/791/Hwi.c

    /cfs-file/__key/communityserver-discussions-components-files/791/Hwi_5F00_asm.sv7R

    Please try those and let us know if it resolves your issue. Thanks!

    Sean

  • Hello Sean,

    thanks for providing this fix. I've only had time to briefly test it, but so far haven't seen the problem anymore.

    I diffed the original version of these files with your changes and it looks like you're dropping the vectored mode (hardware handshake) that you've been using before and switched to the fixed exception vector at 0x00000018. Does this mean there's a hardware issue with vectored mode? I don't think this is going to be much of a problem for SYS/BIOS applications, but a bare metal application might want to use the vectored interrupt for improved performance. If there is a confirmed hardware issue it would be nice to have this documented in the processor errata.

    One thing I noticed is that if I just load the application while our modified SBL is sitting in the command line interface, the VE bit is already set, and I have to manually clear it. Since the SBL supports a compile time option that skips the reset (SBL_SKIP_MCU_RESET) which we've been using for other reasons this might be an issue for someone else who comes across this thread, too.

    I'll perform further testing and reply here again.

    Regards,

    Dominic

  • Hi Dominic,

    Glad that the fix appears to work for you. We are still investigating whether there is an issue in hardware, but if it is determined that there is one it will be added to an errata. The SYS/BIOS fix will likely just forgo the hardware interrupt acceleration as SYS/BIOS wouldn't benefit from it anyways as you noted.

    Thanks again for your feedback and testing.

    Sean

  • Can I mark this as resolved?

  • I believe we currently have a workaround, but as Sean wrote it is still unclear whether there is a hardware issue or if there's just some steps missing in software.

    Since in theory the vectoring interface would provide improved performance (SYS/BIOS doesn't make use of this right now) we would be very interested in hearing back from TI once it has been determined if there's a hardware issue. Is there a possibility to get that kind of feedback after acknowleding that the issue is resolved (for now)?

    Regards,

    Dominic

  • Hi Dominic,

    We've pinged the hardware team on this and hope to have an answer soon.

    Todd

  • Hi Dominic,

    This is Anish from TI HW Design team. We have determined that this issue is indeed a problem with the hardware. Nesting of interrupts with Vector Interface in R5F will be unavailable in this revision of the hardware

    Thanks,

    Anish

  • We gave the proposed workaround more testing, and unfortunately the system seems to be very unreliable with the changes in Hwi.c and Hwi_asm.sv7R.

    Sometimes it crashes very early, with an Abort because the Hwi_dispatchIRQC received intNum = 0x26 from the VIM, but there is no HWI installed for MCU_TIMER_0_INT. Since the HWI pointer is NULL in that case the code tries to jump to 0xe59ff018, which is of course an invalid pointer. The BIOS appears to use the second timer, since there is a HWI installed for MCU_TIMER_1_INT (0x27). Not sure if this interrupt is "valid". I load the application via the debugger while the SBL is running, i.e. without performing an actual reset of the MCU, but this approach has been working just fine before.

    If the application gets past this initial issue it often dies because it doesn't get any further tick interrupts (tick count remains at 0). Sometimes the application just works, but then stops working after several minutes.

    I haven't had time to further debug these issues, but they only occur with the proposed workaround. I'll post here again if I have more input regarding the issues with the workaround.

    Anish Reghunath said:
    This is Anish from TI HW Design team. We have determined that this issue is indeed a problem with the hardware. Nesting of interrupts with Vector Interface in R5F will be unavailable in this revision of the hardware

    Thanks for confirming that there's an issue with the hardware. Do you know if nesting should work without using the vector interface?

    Regards,

    Dominic

  • Hi Dominic,

    Most of the team is out this week due to holiday. Someone will take a look at this early next week.

    Todd

  • I've debugged the "Abort" issue a bit more.

    Like I wrote before the processor is trying to execute the handler for HWI 0x26, but that HWI is NULL, so it is using some data from the vector table and is trying to execute 0xe59ff018. Inside the VIM in INTR_EN_SET[1] only 0x80 is enabled, and that is interrupt number 0x27, not 0x26.

    I'm guessing that intNum 0x26 is a leftover from the SBL. At the time when I load the application the SBL is still running. Interrupts were currently disabled (I bit set). I then load the application via the debugger, and the application runs up to main. If I let it continue from there it crashes.

    While in SBL the VIM has 0x26 in ACTIRQ (i.e. ACTIRQ is not valid). The RAW[1] interrupt status has 0x40 set, as has the IRQ_EN register, but since the CPSR.I bit is set the processor wont take the interrupt.

    At the time when the application reaches main() the I bit is still set, but the application went through Hwi_Module_startup which called Hwi_initIntController. This function disabled and cleared all interrupts. When I check the VIM register ACTIRQ is still 0x26, the RAW[1] status still has 0x40 set (MCU_TIMER0 is a level interrupt, so the interrupt is still pending), but the IRQ_EN[1] register is cleared. If I read the PRIIRQ register it contains 0x80000026, indicating that 0x26 is the highest priority pending interrupt.

    Somewhere in main our application calls BIOS_start which eventually calls Hwi_startup. That function enables interrupts, and at that point my application crashes:

    106750,0x9B1D1EFC,0xEB001077,ti_sysbios_BIOS_startFunc__I,2707,package/cfg/sysbios_r5f_per5f.c,E:\IBV\PROJEKTE\JADE\WORKSPACE\FieldbusSlave\Release_R5\configPkg,106749,
    106751,0x9B1D60E0,0xF1080040,ti_sysbios_family_arm_v7r_keystone3_Hwi_startup__E,347,E:/IBV/PROJEKTE/JADE/WORKSPACE/ti/bios_6_76_02_02/packages/ti/sysbios/family/arm/v7r/keystone3/Hwi.c,,106750,
    106752,0x9B1D60E4,0xF1080080,ti_sysbios_family_arm_v7r_keystone3_Hwi_startup__E,350,E:/IBV/PROJEKTE/JADE/WORKSPACE/ti/bios_6_76_02_02/packages/ti/sysbios/family/arm/v7r/keystone3/Hwi.c,,106751,
    106753,0x00000018,0xE59FF018,n/a,,n/a,n/a,106752,Normal interrupt (IRQ)
    106754,0x9B1C6270,0xE24EE004,ti_sysbios_family_arm_v7r_keystone3_Hwi_dispatchIRQ__I,58,E:/IBV/PROJEKTE/JADE/WORKSPACE/ti/bios_6_76_02_02/packages/ti/sysbios/family/arm/v7r/keystone3/Hwi_asm.sv7R,,106753,
    ...
    106832,0x9B1AB1D4,0xE12FFF3C,ti_sysbios_family_arm_v7r_keystone3_Hwi_dispatchIRQC__I,761,E:/IBV/PROJEKTE/JADE/WORKSPACE/ti/bios_6_76_02_02/packages/ti/sysbios/family/arm/v7r/keystone3/Hwi.c,,106831,

    The trace stops at that point, but the application sits on the prefetch abort vector (0xc). The link register points to the address following the branch to the (hwi->fxn)(hwi->arg). From the registers I can tell that hwi was NULL (which is not caught by the Hwi_dispatchIRQC code).

    I believe the issue is that the VIM has "accepted" 0x26 as the highest priority pending interrupt, and disabling that interrupt doesn't "undo" that. At the point when the (modified) BIOS code read IRQVEC the NUM field from PRIIRQ is copied to ACTIRQ and the interrupt is made active. Unfortunately the VIM chapter in the AM65x TRM is a bit sparse on details, e.g. it contains the word pending exactly 5 times:

    When an interrupt goes from pending to active...

    ... then the FIQn/IRQn will be asserted and that interrupt made pending as normal

    ... when there are no pending unmasked interrupts ...

    ... determine which groups have IRQs pending ...

    ... determine which groups have IRQs pending ...

    The only reference to an IRQ being MADE pending (i.e. what makes an interrupt pending, what side-effects does that have, ...) is in the chapter regarding interrupt priorization, where it tells me that a higher priority interrupt can be "made pending as normal". The wording "pending unmasked interrupts" makes it sound like an interrupt could be pending but masked?

    I'm guessing that the "highest priority pending IRQ" in PRIIRQ refers to the highest priority IRQ that is active from the peripheral and unmasked, and masking that interrupt doesn't change anything about that.

    What changes have an effect on PRIIRQ?

    • "Taking" the IRQ by having CPSR.I cleared and the VE bit enabled certainly does
    • "Taking" the IRQ by reading vom IRQVEC does, too
    • I'm guessing that arrival of an even-higher priority IRQ would make that IRQ go into PRIIRQ?
    • Is there any other means of "clearing" a pending IRQ from PRIIRQ?

    It might well be that all of this is just caused by our workflow of booting the processor via SBL and then loading the application via the debugger. If the application gets loaded via the SBL the R5F is undergoing a reset (unless SBL_SKIP_MCU_RESET is set...). Does resetting the R5F via DMSC/SYSFW/SciClient also reset the VIM? Is there any way to fully reset the VIM?

    Even if this is only the result of our workflow it would still be great to A) understand this and B) solve this, since for example SBL_SKIP_MCU_RESET is a documented "feature" of the SBL, and as far as I can tell right now could lead to the same (mis-)behaviour.

    Regards,

    Dominic

    P.S.: Thanks, I'm aware of US holidays. It would be great to have an answer some time next week.

  • Dominic Rath said:
    What changes have an effect on PRIIRQ?

    I've made some more experiments regarding PRIIRQ:

    • If an unmasked higher-priority IRQ becomes pending, PRIIRQ gets updated
    • If there are two unmasked and pending IRQs and the higher priority IRQ gets masked, PRIIRQ gets updated
    • If there is only one unmasked and pending IRQ and that IRQ gets masked, PRIIRQ remains as before

    This was all verified with VE disabled and reading IRQVEC instead.

    I've implemented a workaround for our application that clears pending, masked IRQs from PRIIRQ by making them active and acknolweding them while the processor's I bit is set, and so far this is working reliably.

    It would be nice to get confirmation on the observed behaviour of the VIM with regard to PRIIRQ.

    I've also got one more question regarding Anish's answer:

    Anish Reghunath said:
    ... will be unavailable in this revision of the hardware

    When you say "in this revision of the hardware" does that mean it is going to be fixed in a future revision, and do you know which revision/when this is going to be fixed?

    Regards,

    Dominic

  • I had another issue debugging my workaround to make sure it works as intended:

    • When I interrupt the SBL using the debugger, CPSR.I bit is set, PRIIRQ is valid and contains 0x26, ACTIRQ is invalid
    • I load the application with a breakpoint set to 0x0
    • At 0x0, CPSW.I, PRIIRQ and ACTIRQ are still the same
    • As the first step, my application makes sure CPSR.I and .F are set, in case I interrupted the SBL with interrupts enabled
    • At the beginning of TI_decompress_lzss CPSW.I, PRIIRQ and ACTIRQ are still the same
    • At the end of TI_decompress_lzss, PRIIRQ is invalid and ACTIRQ is valid and contains 0x26

    I've tracked this down to the point where TI_decompress_lzss writes to ti_sysbios_family_arm_v7r_keystone3_Hwi_Module__root__V. If I step past the strb that initializes the fourth byte of that struct, the IRQ moves from PRIIRQ to ACTIRQ.

    Since that store couldn't possibly affect the VIM I suspected the debugger interfered with the VIM. I closed the ROV classic view, and the problem disappeared. With the ROV closed I can load the application up to main, and the IRQ is still in PRIIRQ, not in ACTIRQ. If I then open the ROV classic with the Hwi module selected, the IRQ moves from PRIIRQ to ACTIRQ. If another module is selected, nothing happens. If I use the "new" Runtime Object View instead of ROV classic and have the Hwi module opened, the irq moves from PRIIRQ to ACTIRQ.

    It looks like the ROV Hwi module reads VIM.IRQVEC, since that is the only thing that should have the effect of moving an IRQ from PRIIRQ to ACTIRQ.

    • What kind of reads does the ROV classic in CCS 9.2.0 perform?

    This looks like a bug in ROV, since it definitely shouldn't affect target state in this way?

  • Dominic Rath said:

    It would be nice to get confirmation on the observed behaviour of the VIM with regard to PRIIRQ.

    The observed behaviour matches with the VIM design intent and specification

    To provide a simplified explanation, you can consider the VIM as being in three states always: IDLE_st, PRIIRQ_st, ACTIRQ_st

    1. State transition is always IDLE_st -> PRIIRQ_st -> ACTIRQ_st -> IDLE_st . It cannot change state from PRIIRQ_st to IDLE_st without going through ACTIRQ_st
    2. An unmasked interrupt triggers transition from IDLE_st to PRIIRQ_st
    3. In PRIIRQ_st, all actions which results in VIM staying in PRIIRQ_st or moving to ACTIRQ_st are allowed. So from your observations,
      1. If an unmasked higher-priority IRQ becomes pending, PRIIRQ gets updated – vim allows this as the state continues to be PRIIRQ_st
      2. If there are two unmasked and pending IRQs and the higher priority IRQ gets masked, PRIIRQ gets updated - vim allows this as the state continues to be PRIIRQ_st
      3. If there is only one unmasked and pending IRQ and that IRQ gets masked, PRIIRQ remains as before – vim cannot go back to IDLE_st, so it remains in PRIIRQ_st with the previous status
    4. If you leave VIM in PRIIRQ_st the only way to bring it back to IDLE is to take it through ACTIRQ_st (as you are doing in the work-around) or reset the VIM (resetting the R5 resets the VIM too)

    Dominic Rath said:
    When you say "in this revision of the hardware" does that mean it is going to be fixed in a future revision, and do you know which revision/when this is going to be fixed?

    This is still under internal discussion. We do not have the timelines and version numbers yet

  • Hi Dominic,

    Dominic Rath said:
    What kind of reads does the ROV classic in CCS 9.2.0 perform?

    Unfortunately, ROV (both Classic and new version) reads the entire VIM register set for the "Basic" view, even though it only needs VIM.GROUP[].ENABLEDSET & VIM.GROUP[].RAWSTATUSSET.  I will file a bug report for SYS/BIOS for this, and hopefully a fix makes it into a release quite soon.

    One thing that hasn't been mentioned yet in this thread, we recently found a bug where dynamically-created Hwi instances neglect to honor the Hwi_Params.triggerType parameter, causing every Hwi instance created at run-time (in C code) to be configured as Hwi.TriggerType_LEVEL (which is the default setting).  Statically-created Hwi instances (created in .cfg file) do recognize the triggerType correctly.  I'm not sure if this affects your code or not, but please consider this if you are creating Hwi instances at run-time in your C code that are PULSE triggered.  We have a fix that will be coming in the next SYS/BIOS 6.76 release (within a week or two).

    Regards,

    - Rob