
AM625: DMA interrupt latency issue

Hi,

This is a follow-up to the thread https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1200987/am625-rerouting-interrupt-to-dma-event-and-dma_dev_to_mem-channel-selection.

The original question in the referred thread was closed once all the provided kernel patches were applied. Now the system runs into a DMA interrupt latency issue that causes GPMC/FPGA data overflow. Please see the detailed description below.

=============

At last - I realised I was being dumb and I figured out where I was going wrong - it is now working as expected...

So now the driver initialises correctly. I can then enable the FPGA by writing to an 'enable' register; the FPGA raises the GPIO1_15 line and the DMA transfer is triggered. On completion, the DMA completion callback is called, which advises user space that data is available and sends an interrupt acknowledge to the FPGA. I now see a continuous stream of interrupts until I either stop it, or the interrupt acknowledge arrives too late, in which case it crashes...

There are two things I see. Firstly, if I stop the FPGA and attempt to deallocate the DMA, I see the error 'ti-udma 485c0100.dma-controller: chan0 teardown timeout!'. This does not appear to have any serious negative side effect, but it is of concern.

Of more concern is the jitter I see on the interrupt line. With the configuration I am running at the moment, the interrupt frequency is 1 kHz and the interrupt typically remains high for ~40 us; however, I occasionally see this high duration increase to more than 1 ms, in which case the FPGA stops and reports an error that the interrupt acknowledge was not received in time. The first few times I ran my test I saw this overrun situation within 60 s on each run. There is no other load on the system - top shows 0% CPU load. If I load the system using stress-ng, I see much larger jitter on this interrupt.

  • Hi Hamish,

    I saw the teardown timeout in my test when trying to close the transfer before it completed.

    About the DMA completion callback jitter issue, do you see it in the software-triggered DMA scenario as well?

  • Hi Bin,

    Yes, we also see this jitter in the software-triggered scenario - however, we do not see it on our am335x product.

  • Hi Bin, I thought I would not bring this topic up until we had verified the operation with the interrupt re-routing, in the hope that it would be cured once we had essentially the same behavior between the am335x platform and the am62x platforms.

  • Hi Hamish,

    No worries. Thanks for the details.

    I am trying to understand whether the jitter is due to the kernel handling the DMA completion interrupt too late, or the DMA completion callback being executed too late. Then we can see how to solve it.

    I know the GPMC acknowledgement is currently done in the DMA completion callback in your FPGA driver, by writing to a GPMC memory address. Can you please move the ACK to the UDMA driver's DMA completion interrupt handler, as shown in the pseudo code below?

    diff --git a/drivers/dma/ti/k3-udma.c b/drivers/dma/ti/k3-udma.c
    index 1026212f5b30..b1c5720c416c 100644
    --- a/drivers/dma/ti/k3-udma.c
    +++ b/drivers/dma/ti/k3-udma.c
    @@ -1235,6 +1235,8 @@ static irqreturn_t udma_udma_irq_handler(int irq, void *data)
                    d->tr_idx = (d->tr_idx + 1) % d->sglen;
     
                    if (uc->cyclic) {
    +                       // if (uc->bchan->id == channel_id for GPMC)
    +                       //     write GPMC ACK register
                            vchan_cyclic_callback(&d->vd);
                    } else {
                            /* TODO: figure out the real amount of data */
    

  • Hi Bin,

    Yes, I will do this in the morning - I need to call it a day now

  • Hi Hamish,

    Yes, I will do this in the morning - I need to call it a day now

    Before doing this, please check the following first:

    Please check whether you have CONFIG_HW_RANDOM_OPTEE=m defined in your kernel .config. If so, please run the command 'modprobe -r optee-rng' on your board's Linux to remove the optee-rng driver, then test whether you still see the DMA completion interrupt jitter problem. This optee-rng driver can interfere with kernel preemption and cause long interrupt latency.

  • Hi Bin, 

    We have '# CONFIG_HW_RANDOM_OPTEE is not set' in our kernel config, so this is probably not affecting us.

  • Hi Hamish,

    Just wanted to check on the progress - have you had a chance to modify the code to send the completion notification in udma_udma_irq_handler()?

  • Hi Bin,

    I did this in a slightly different way, but it has the same effect: I call the DMA completion callback directly within the interrupt handler rather than just sending the acknowledge to the FPGA. The callback increments a frame counter, sends the acknowledge to the FPGA, and calls wake_up_interruptible() on the wait queue (i.e. notifies user space that data is ready) - and I still see the jitter. I also commented out the frame-counter increment and the notification to user space, and I still see the jitter.

  • Hi Hamish,

    Thanks for the update. This has now become a system performance problem... I can't tell how to solve it until we understand the cause.

    You don't use the RT kernel, do you? I have asked around here to see if the RT kernel would improve the jitter, but one of our senior engineers doesn't believe the RT kernel alone would help in this scenario, so I don't want to suggest the RT kernel path, since you would have to migrate all your kernel code to RT to test it out.

    And you also don't do printk() in your DMA completion callback function, right? A kernel printk() would add latency.

    I want to see if I can replicate the issue on my setup and then look into it. My plan is to find a PWM pin on the EVM and use it to toggle gpio1_15 at a 1 ms interval, then profile the DMA completion ISR and the completion callback to see the jitter.

    I have a couple of other customer escalations at the moment, but I will work on this profiling test alongside them. I will keep you posted.

  • Hi Bin,

    I have tried using the RT kernel. Although there is some improvement, we still see jitter, and if we load the system we experience the DMA overruns in the FPGA. Fortunately, in the Yocto environment this migration to RT is trivial - it is essentially a one-liner in a configuration file!

    In my DMA driver, the only printk's are in init or teardown code. There is one in the callback, but it is only reached after the FPGA has signalled a missed interrupt acknowledge, and that is an unrecoverable error from the FPGA's point of view.

    I did also try moving the DMA interrupt to a core on its own, as well as making that core tickless, with no improvement.

    I will be leaving on vacation on Saturday, but I will be checking mail from time to time - I also have a meeting with colleagues tomorrow to determine how to proceed during my absence, and it is likely that one of these colleagues will continue on this effort while I am away, but I will make the necessary introductions before I do go.

  • Hi Hamish,

    I did also try moving the DMA interrupt to a core on its own, as well as making that core tickless, with no improvement.

    Our sw engineer told me AM62x (and all K3 devices) cannot move DMA interrupts. Could you please explain how you moved it to a different core? Any log showing the change and result, etc.?

  • Hi Hamish,

    Now the issue is no longer about support for GPIO-triggered DMA for GPMC, but about interrupt latency. Do you think it makes sense to start a new e2e thread about the latency problem? This thread is now about 3 pages long, and it is becoming difficult to open the thread and navigate to the latest part of the conversation every time.

  • Hi Bin,

    Yes, I think we have closed the topic of interrupt rerouting, and this is a completely different topic, so I would agree with moving to a new thread. Would you like to create the new thread, or should I?

  • I have never created a thread on a customer's behalf, but this time I can give it a try, then copy some of the related conversation here to the new thread.

    I will post the link to the new thread here.

  • Hi Bin,

    Before moving the interrupt:

    cat /proc/interrupts | grep MSI-INTA | grep Edge
     78:        168          0          0          0  MSI-INTA 1712640 Edge      485c0100.dma-controller chan0

    Then

    echo 3 > /proc/irq/78/smp_affinity_list

    Then 

     78:        953          0          0          9  MSI-INTA 1712640 Edge      485c0100.dma-controller chan0
     78:        953          0          0         24  MSI-INTA 1712640 Edge      485c0100.dma-controller chan0
     ...
     78:        953          0          0       1102  MSI-INTA 1712640 Edge      485c0100.dma-controller chan0

    As you can see, there are no more interrupts on core 0 for the DMA interrupt; they are all occurring on core 3.

  • Hi Hamish,

    Thanks for the details.

    Our sw engineer told me AM62x (and all K3 devices) cannot move DMA interrupts.

    The sw engineer implemented some IRQ affinity support for Ethernet and ICSS some time ago; it looks like it works for all IRQs.

  • Hi Hamish,

    With the RT kernel, can you please try increasing the kernel ksoftirqd process priority to see if it helps?

    - move the DMA IRQ to core 3;

    - use the command 'ps aux | grep ksoftirq' to find the PID of core 3's ksoftirqd;

    - use the command 'chrt -f -p 10 <ksoftirq_pid>' to increase its priority;

    If you have a user-space program which triggers the DMA to the GPMC, you might have to adjust its priority as well (maybe to 9, one step lower than its ksoftirqd).

  • I heard something about CPU isolation (a kernel cmdline parameter?), which is likely more important than the ksoftirqd priority in your use case. I will look it up to see how to use it.

  • Hi Bin,

    I used CPU isolation (yes, a kernel cmdline parameter) and interrupt affinity in another project - it was for critical real-time processing, which is how I knew about it. I just came to check my email but I am exhausted now, so I will test this in the morning and report back.

  • Hi Bin,

    I just tried increasing the ksoftirqd priority and I see no noticeable change in the jitter.

  • Hi Hamish,

    Have you tried the isolcpus=3 cmdline parameter and setting the DMA IRQ affinity to core 3? In this way core 3 is dedicated to the DMA interrupt handler and nothing else; I am wondering whether this would help with the jitter issue.
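
    For reference, the combined tuning might look like the sketch below (the IRQ number 78 comes from the /proc/interrupts output earlier in this thread; how you edit the kernel command line depends on your bootloader, and the pgrep pattern is just one way of finding the ksoftirqd PID):

    ```
    # Kernel command line (e.g. appended to bootargs in U-Boot):
    #   isolcpus=3

    # After boot, pin the UDMA completion IRQ to the isolated core 3:
    echo 3 > /proc/irq/78/smp_affinity_list

    # On an RT kernel, optionally raise the priority of core 3's ksoftirqd:
    chrt -f -p 10 "$(pgrep -f 'ksoftirqd/3')"
    ```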

  • Hi Bin,

    I have tried the isolcpus=3 cmdline parameter and set dma irq affinity to core3 as you suggested to Hamish (I'm Hamish's colleague).

    Indeed, this reduces the jitter significantly in the few short manual tests that I have done so far: the DMA acknowledgement duration never seems to exceed 40 us (the typical duration is 20 us for the 332-byte DMA transfers I used in my test), even under full load of the remaining 3 cores (using stress-ng). I did the tests with an RT kernel.

    This is looking good so far and I will do some longer running measurements. I will also run some tests with a non-RT kernel.

  • Hi Martin,

    This is really great news! Thanks for the update.

    Looking forward to the final test result!

  • Hi Bin,

    I let the test with the RT kernel run overnight (about 12 hours) under full load. The good news is that the DMA transfers were still running the next morning, which means that the DMA acknowledgement never exceeded 1 millisecond.

    I also set up a scope with a trigger that would fire when the DMA pulse width on the GPIO1_15 line exceeds 100 microseconds. Unfortunately the trigger has fired when I checked it in the morning. This means that the jitter is bigger than what I estimated from the manual test runs.

    I also did some manual tests with the non-RT kernel: The jitter there exceeded 1 ms within 30 min even without any artificial load from stress-ng. I have to admit that I forgot to increase the priority of the ksoftirq process for these tests.

    I will try to come up with a measurement setup that would allow me to generate a histogram of the jitter to have more reliable numbers.

    The other pending issue is the timeout reported from the udma driver when we stop the cyclic DMA transfer. We suspect that this prevents us from re-starting a DMA transfer with different settings (buf_len, period_len parameters). Did you have time to look into the cause of the timeout yet?

  • Hi Martin,

    I also set up a scope with a trigger that would fire when the DMA pulse width on the GPIO1_15 line exceeds 100 microseconds. Unfortunately the trigger has fired when I checked it in the morning. This means that the jitter is bigger than what I estimated from the manual test runs.

    Interrupt jitter > 100us is not an issue as long as it is not > 1ms, right?

    I also did some manual tests with the non-RT kernel: The jitter there exceeded 1 ms within 30 min even without any artificial load from stress-ng. I have to admit that I forgot to increase the priority of the ksoftirq process for these tests.

    With this 1 ms jitter requirement, you would have to use the RT kernel, plus all the tuning (CPU isolation and IRQ affinity) that you have in place.

    The other pending issue is the timeout reported from the udma driver when we stop the cyclic DMA transfer. We suspect that this prevents us from re-starting a DMA transfer with different settings (buf_len, period_len parameters). Did you have time to look into the cause of the timeout yet?

    I am currently debugging another issue. Once I have made progress on finding its root cause, I can start to look into the DMA timeout problem. I don't have an ETA at this moment, but I will keep you posted.

  • Hi Bin,

    Interrupt jitter > 100us is not an issue as long as it is not > 1ms, right?

    Yes, that's correct. It's just good to know the limits and how much headroom there is for future products.

    I am currently debugging another issue. Once I have made progress on finding its root cause, I can start to look into the DMA timeout problem. I don't have an ETA at this moment, but I will keep you posted.

    Thanks for the heads up.

  • Hi Martin,

    When your application tries to tear down the DMA, does the FPGA still continue firing the GPIO at a 1 ms interval, or has the FPGA driver already told it to stop firing the GPIO?

    And can you please describe where/when your driver calls dmaengine_terminate_async() to tear down the DMA channel?

  • Hi Bin,

    When your application tries to tear down the DMA, does the FPGA still continue firing the GPIO at a 1 ms interval, or has the FPGA driver already told it to stop firing the GPIO?

    The FPGA driver first tells the FPGA to stop firing GPIO, then tears down the DMA.

    And can you please describe where/when your driver calls dmaengine_terminate_async() to tear down the DMA channel?

    There are 2 cases in which the driver terminates the DMA transfer:

    1. Userspace asks the driver to terminate the DMA via the sysfs interface:
      In this case the driver first tells the FPGA to stop firing GPIO, then tears down the DMA with a dmaengine_terminate_sync() call (which basically calls dmaengine_terminate_async() followed by dmaengine_synchronize())
    2. The FPGA reports an error during DMA transfer:
      The driver detects the FPGA error in the DMA transfer complete callback (dma_tc_callback()) and tells the FPGA to stop firing GPIO. Originally the driver also called dmaengine_terminate_sync() in the dma_tc_callback(), which caused a kernel panic because the dmaengine_synchronize() function may not be called from within the callback.
      I worked around the kernel panic by not calling dmaengine_terminate_sync() in dma_tc_callback() at all, and instead requiring userspace to terminate the DMA via the sysfs interface as well.
      I also tried calling dmaengine_terminate_async() in dma_tc_callback() and dmaengine_synchronize() via the sysfs interface, but the behaviour is the same as just calling dmaengine_terminate_sync() via the sysfs interface.

    I hope this explanation is understandable - otherwise feel free to ask.

  • Hi Martin,

    It appears the DMA driver needs some work on DMA channel termination; I haven't figured out how to fix it in the DMA driver yet.

    But one workaround is: in your driver, don't call dmaengine_terminate_sync(); instead use dmaengine_terminate_async() and dmaengine_synchronize(), and fire the DMA trigger once between these two function calls. You could use the software trigger (a write to the DMA register), as you did before I implemented the GPIO directly triggering the DMA.

    Please let me know if this can be used as a short-term workaround in your project.
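
    In pseudo code, the teardown sequence in the FPGA driver would then look roughly like this (the FPGA-side function names are placeholders for whatever your driver already does; the software trigger is the register write you used before the GPIO triggering patch):

    ```
    /* pseudo code: teardown workaround for the cyclic channel */
    tell_fpga_to_stop_firing_gpio();   /* no new GPIO triggers arrive */
    dmaengine_terminate_async(chan);   /* request channel teardown */
    write_dma_sw_trigger_register();   /* fire one extra trigger so the
                                          teardown can complete */
    dmaengine_synchronize(chan);       /* wait for the teardown to finish */
    ```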

  • Hi Bin,

    But one workaround is: in your driver, don't call dmaengine_terminate_sync(); instead use dmaengine_terminate_async() and dmaengine_synchronize(), and fire the DMA trigger once between these two function calls. You could use the software trigger (a write to the DMA register), as you did before I implemented the GPIO directly triggering the DMA.

    That's a good suggestion. I will try it out.

  • Hi Martin,

    Instead of the "extra trigger" workaround, can you please apply the following kernel patch to see if it fixes the teardown problem for you?

    diff --git a/drivers/dma/ti/k3-udma.c b/drivers/dma/ti/k3-udma.c
    index 22013bfd8ad6..4f3564444084 100644
    --- a/drivers/dma/ti/k3-udma.c
    +++ b/drivers/dma/ti/k3-udma.c
    @@ -1037,9 +1037,13 @@ static int udma_stop(struct udma_chan *uc)
     				   UDMA_CHAN_RT_CTL_TDOWN);
     		break;
    -	case DMA_MEM_TO_MEM:
    -		udma_tchanrt_write(uc, UDMA_CHAN_RT_CTL_REG,
    -				   UDMA_CHAN_RT_CTL_EN |
    -				   UDMA_CHAN_RT_CTL_TDOWN);
    +	case DMA_MEM_TO_MEM: {
    +		u32 val = UDMA_CHAN_RT_CTL_EN | UDMA_CHAN_RT_CTL_TDOWN;
    +
    +		if (uc->cyclic)
    +			val |= UDMA_CHAN_RT_CTL_FTDOWN;
    +
    +		udma_tchanrt_write(uc, UDMA_CHAN_RT_CTL_REG, val);
     		break;
    +	}
     	default:
     		uc->state = old_state;
    

  • Hi Bin,

    I am back from my vacation... I tested the above patch and it appears to work. We no longer see the timeout in the teardown, and I can re-program the geometry of the DMA buffer and then resume transfers (as we are able to do on our am335x platform). So it looks like this has solved the issue - thank you very much!

    I have now started cleaning up my FPGA driver patches, and there is now only one difference between the driver for the am335x and the am625x, and that is in the dmaengine_prep_dma_cyclic() function call. In the am335x call, the period_len parameter must be the frame length / src_addr_width, whereas in the am625x call, the period_len parameter must be the actual period length in bytes. The src_addr_width on both am335x and am625x is 2, because the GPMC bus width we use is 16 bits.

    Of course, I could abstract this difference in my driver at compile time, based on CONFIG_ARM64 for example, but I think that should not be necessary - the interface to dmaengine_prep_dma_cyclic() should be identical.

  • Hi Hamish,

    Glad to hear it is all working now!

    In the am335x call, the period_len parameter must be the frame length / src_addr_width, whereas in the am625x call, the period_len parameter must be the actual period length in bytes.

    My understanding from the kernel source code (include/linux/dmaengine.h)

      * @device_prep_dma_cyclic: prepare a cyclic dma operation suitable for audio.  
      *      The function takes a buffer of size buf_len. The callback function will 
      *      be called after period_len bytes have been transferred. 

    that the "period_len" is the FIFO depth in bytes which is frame length in your use case. But it is trivial to make it to "frame_length / src_addr_width". I will adjust the kernel driver and send you the patch soon.

  • Hi Hamish,

    I just compared my cyclic-mode implementation with the one in the EDMA driver for AM335x, but I don't see any difference regarding the period_len parameter between the two DMA drivers.

    In the am335x call, the period_len parameter must be the frame length / src_addr_width, whereas in the am625x call, the period_len parameter must be the actual period length in bytes.

    In your project, what is the relation between the frame length and the GPIO trigger? Does the DMA transfer frame_length bytes per trigger, or half of frame_length bytes per trigger?

    And what is the relation of period_len on am625x to frame_length?

  • Hi Bin,

    In both the am335x and am625x FPGA environments, each DMA transfer should be the full frame length. I am just wondering whether this may in fact be a bug in the am335x EDMA driver? I always wondered why we had to divide the frame length by the bus width - it does not correspond with the documentation, but it has always been this way...

  • Hi Hamish,

    I can multiply the "period_len" parameter by src_addr_width in the K3 UDMA driver so that you can pass in "frame_length / src_addr_width" from your application driver. I will send you the patch shortly, once I am in the office.

    Or you could modify the UDMA driver yourself. At the very beginning of the new function udma_prep_dma_cyclic_triggered_tr() in k3-udma.c, add


    period_len *= uc->cfg.src_addr_width;

    Please note that in the patch file "udma-add-triggered-cyclic-mode-1115.diff", line 27, period_len is used to calculate the number of periods, so the calculation of "periods" has to be moved down to after period_len is adjusted.
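
    To illustrate the ordering concern, here is a small user-space sketch (compute_periods() is a made-up helper, not the driver function; it just mirrors the arithmetic of scaling period_len before deriving the period count):

    ```c
    #include <stdio.h>

    /* Made-up helper mirroring the suggested change: scale period_len by
     * src_addr_width first, then derive the number of periods from it. */
    static unsigned int compute_periods(unsigned long buf_len,
                                        unsigned long period_len,
                                        unsigned int src_addr_width)
    {
        period_len *= src_addr_width;   /* adjust before using period_len */
        return (unsigned int)(buf_len / period_len);
    }

    int main(void)
    {
        /* 16-bit GPMC bus (src_addr_width = 2), 332-byte frames, 2-frame
         * buffer: the application passes period_len = 332 / 2 = 166. */
        printf("%u\n", compute_periods(664, 166, 2)); /* prints 2 */
        return 0;
    }
    ```

    Computing "periods" before the scaling would instead give 4 here, i.e. twice the real period count.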

  • Hi Bin,

    I created a patch and it is working fine, so there is no need to worry any further about this.

  • Hi Hamish,

    Thanks.

    We don't have any other open issues on this project at this moment, right?

    I am still curious why the EDMA driver requires the period_len to be half of the FIFO size. I see some other kernel ADC drivers also set it this way. I will continue looking into this whenever I have time, and will update this thread if I find anything that should be changed in the UDMA driver.

  • Hi Bin,

    I do believe that we now have a reasonable workaround for the issue, although it does mean that we are isolating one CPU. I am just curious as to why we have this jitter on the am625x but do not see it on the am335x.

    I would also be keen to hear if you find anything about the edma driver and the period_len!

    Thank you very much for all of your help - it has been a long road, but we have reached a solution! I will mark the issue resolved.

  • Hi Hamish,

    I do believe that we now have a reasonable workaround for the issue, although it does mean that we are isolating one CPU. I am just curious as to why we have this jitter on the am625x but do not see it on the am335x.

    This has multiple factors. Compared to AM335x, the K3 architecture is much more complex: AM335x has a single Cortex-A8 core, while AM62x has 4x Cortex-A53 in one cluster, with a much more complex interconnect fabric than on AM335x. This architectural complexity on AM62x contributes to the latency jitter. Our dev engineer also told me that ARM64 in general has worse jitter than 32-bit ARM.

    I would also be keen to hear if you find anything about the edma driver and the period_len!

    I went through all the kernel drivers which call dmaengine_prep_dma_cyclic(); some use period_len as the full buffer size, while others use it as half of the buffer size. So I think this is DMA controller driver implementation specific.

    Then I looked at the EDMA driver again (I am not familiar with EDMA); it seems the driver handles dmaengine_prep_dma_cyclic() using multiple "slots" for the DMA transfer.

    nslots = (buf_len / period_len) + 1;

    each "slot" handles one DMA transfer for "period_len" length of data. (well, in your case nslots = 3, without hands-on testing it, I am not sure what the 3rd slot is doing...)

    In my UDMA driver, udma_prep_dma_cyclic_triggered_tr() only uses one "slot" (one TR in DMA terms), so period_len has to be equal to your FIFO length.

    Hopefully this explains the difference.
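
    As a sanity check on the arithmetic, a tiny user-space sketch (edma_nslots() is an illustrative helper; the assumption that buf_len is twice period_len in your case is mine, inferred from nslots = 3):

    ```c
    #include <stdio.h>

    /* Illustrative helper for the EDMA cyclic-mode slot count quoted above. */
    static unsigned int edma_nslots(unsigned long buf_len,
                                    unsigned long period_len)
    {
        return (unsigned int)(buf_len / period_len) + 1;
    }

    int main(void)
    {
        /* A double-buffered cyclic transfer (buf_len = 2 * period_len)
         * yields the nslots = 3 mentioned above. */
        printf("%u\n", edma_nslots(664, 332)); /* prints 3 */
        return 0;
    }
    ```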