AM625: DMA transfer latency

Part Number: AM625

Hi,

This is a follow up to https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1323189/am625-dma-interrupt-latency-issue.

We are now trying to maximize the DMA transfer frequency to test the limits of our DAQ implementation.

To recap: With the help of Bin Liu we were able to use interrupt driven cyclic DMA transfers of the DMA engine to transfer data from the FPGA to the AM625. The interrupt line was reset in the transfer complete callback function. With core isolation this allowed us to handle up to 1000 DMA transfers per second.

Now we are trying to increase the DMA transfer frequency with some modifications: We modified the FPGA to reset the interrupt line on its own when the DMA transfer starts and the FPGA buffer falls below a given theshold. This way the data transfer can run without any kernel driver code on the AM625. Measuring the width of the DMA interrupt pulse gives us the latency of the DMA transfer, which we found to be about 3 microsecond.

But starting from about 8000 DMA transfers per second we started to see sporadic delays of about 20 microseconds to the start of the DMA transfer. The delay increases with the transfer frequency (up to over 50 microseconds at 10'000 DMA transfers per second).

Since there should be no kernel scheduling involved in the DMA transfer, we are puzzled by this behavior and would like to understand it better.

Chapter 11.1.1.5 "Block Copy DMA (BCDMA)" of the AM62x Reference Manual states: "An internal DMA scheduler is used to control the ordering and rate at which this multiplexing occurs.". Could this be the reason for the sporadic delays?

In the referenced thread in the last message, Bin Liu mentioned that on the AM625 the DMA transfers are distributed to slots by the EDMA driver. Could this be the reason for the sporadic delays?

Any tips on how to optimize or explain the sporadic delays would be welcome.

  • Hello Martin,

    Thank you for your query !

    Please allow us some time to internally discuss and crosscheck with a TI expert.

    Please expect our response early next week.  

    Thanks for your patience !

    Kind Regards

    Anastas Yordanov

  • Hi Anastas,

    Thanks for looking into it. Are there any update on the issue yet?

    Regards

    Martin

  • Hi Martin,

    I would like to apologize that I had incidentally missed to catch up with our BCDMA hardware and Linux software experts the last week.
    I have notified them and expect to have their response by Thursday (Feb-19) COB.

    In the meantime, I am not sure to understand how many BCDMA channels were involved in your previous setup. Would you please specify along with their set OrderID, priority (or clarify again if already mentioned in previous thread) ? 


    Thank you for your understanding !
    Best Regards
    Anastas Yordanov

  • Hi Anastas,

    In the meantime, I am not sure to understand how many BCDMA channels were involved in your previous setup. Would you please specify along with their set OrderID, priority (or clarify again if already mentioned in previous thread) ? 

    We use the the DMA Engine API of the Linux kernel to to setup the cyclic DMA transfer for one channel "rx" - but I don't know the OrderID or priority (I assume that's handled by the BCDMA driver for the DMA Engine ti,am64-dmss-bcdma).

    In the device tree we link our driver to the main_bcdma node of the TI Linux Kernel (see https://git.kernel.org/pub/scm/linux/kernel/git/ti/linux.git/tree/arch/arm64/boot/dts/ti/k3-am62-main.dtsi#n121) like this:
    dmas = <&main_bcdma 1 0 0>;
    dma-names = "rx";

    I hope this helps.

    Regards,

    Martin

  • Hi Martin,

    I am still waiting for a response from the hardware expert.

    Regarding your new dtsi input, I will need to internally check with a TI Linux software DMSS DMAs driver expert.

    Please expect that some delay in the response is possible. I hope I can reach out to the expert and get the answers early next week.

    Thank you for your patience !

     Best Regards

    Anastas Yordanov

  • Hi Anastas,

    Did you get a response yet?

    Regards

    Martin

  • Hello, 

    Anastas is currently out of office, returning tomorrow. We will work to get an update as soon as possible. 

    Thanks,

    Chris 

  • Hi Martin,

    Sorry for the delay.

    Now we are trying to increase the DMA transfer frequency with some modifications: We modified the FPGA to reset the interrupt line on its own when the DMA transfer starts and the FPGA buffer falls below a given theshold. This way the data transfer can run without any kernel driver code on the AM625.

    The previous work has been done about 2 years ago, can you please remind me some of the implementation details? By each DMA transfer, you meant each cyclic transfer which is about a few hundreds bytes (depending on the configuration for your use case), right?

    Does the interrupt line get reset by FPGA for each cyclic transfer? If so, how would the FPGA know the DMA transfer is complete without IRQ handling on AM625 kernel side?

  • Hi Bin,

    The previous work has been done about 2 years ago, can you please remind me some of the implementation details? By each DMA transfer, you meant each cyclic transfer which is about a few hundreds bytes (depending on the configuration for your use case), right?

    Yes, I meant each cyclic transfer.

    Does the interrupt line get reset by FPGA for each cyclic transfer? If so, how would the FPGA know the DMA transfer is complete without IRQ handling on AM625 kernel side?

    The FPGA knows the amount of data transferred with each cycle and it knows the amount of data currently in the fifo and together with the threshold the state of the DMA transfer can be determined.

    But we are currently relaxing some guarantees, to explore how fast we can get the system running with the least latency - and the transfer confirmation in the kernel added latency that may be unnecessary with this technique.

  • Hi Martin,

    Thanks for the details.

    Now we are trying to increase the DMA transfer frequency with some modifications: We modified the FPGA to reset the interrupt line on its own when the DMA transfer starts and the FPGA buffer falls below a given theshold.

    Now the interrupt line is no longer reset by AM625, but does the kernel on AM625 still handle the DMA completion interrupt? for example updating the DDR buffer pointer per DMA cyclic cycle? If so, 8000 transfers per second means 125usec per interrupt, which would be kind of high for Linux to handle all the interrupts right on time, given Linux is not a real-time OS.

  • Now the interrupt line is no longer reset by AM625, but does the kernel on AM625 still handle the DMA completion interrupt?

    No, for this test we don't register a transfer complete callback function to the DMA Engine anymore - precisely to exclude the kernel. If we have done it right (and if no lower level part of the DMA Engine is handling the interrupt) the data should end up in the RAM without any involvement of the kernel.

    We setup a test to transfer 72 bytes per cycle and measured that it takes 2usec to transfer the data per cycle.

    The duration of the interrupt signal now gives us a good indication of how long it takes the DMA hardware to start the transfer (for each cycle).

    With no kernel involvement, we would expect this duration to be constant (within limits) independent of the interrupt rate - staying below 10kHz gives a 100usec window for the transfer (plenty of headroom).

    This is the case up to a interrupt rate of about 7kHz, the interrupt signal duration is 2-4usec.

    But starting from 8kHz, outliers started to appear with durations of 10 - 25 usec (most durations are still 2-4usec).

    Since there is no kernel involvement anymore, we can't explain the cause of these outliers.

  • If we have done it right (and if no lower level part of the DMA Engine is handling the interrupt) the data should end up in the RAM without any involvement of the kernel.

    I would have to check the kernel. I really haven't touched the kernel dma drivers for quite a while and forgot some details.

    But starting from 8kHz, outliers started to appear with durations of 10 - 25 usec (most durations are still 2-4usec).

    Just to confirm, you see in those outliers, it is the DMA transfer time gets longer, not the DMA transfer start time gets delayed, correct?

  • Just to confirm, you see in those outliers, it is the DMA transfer time gets longer, not the DMA transfer start time gets delayed, correct?

    No, the way I setup the test with the threshold, the duration indicates the time until the DMA transfer has started (read the first 12 bytes). So this appears to us to reveal latencies in the DMA Engine.

  • Okay, the outliers means the DMA transfers started late. 

    No, for this test we don't register a transfer complete callback function to the DMA Engine anymore - precisely to exclude the kernel. If we have done it right (and if no lower level part of the DMA Engine is handling the interrupt) the data should end up in the RAM without any involvement of the kernel.

    I guess you remove the DMA completion interrupt handler in your GPMC GPFA driver, but I am not sure either if it completely removes the DMA completion interrupts or kernel still handles them in lower level.

    Can you please measure the latency again with the following kernel DMA patch? The patch should disable the DMA completion interrupts in very transfer cycle.

    diff --git a/drivers/dma/ti/k3-udma.c b/drivers/dma/ti/k3-udma.c
    index e0845d7908e9..284fae333fef 100644
    --- a/drivers/dma/ti/k3-udma.c
    +++ b/drivers/dma/ti/k3-udma.c
    @@ -3633,7 +3633,7 @@ udma_prep_dma_cyclic_triggered_tr(struct udma_chan *uc, dma_addr_t buf_addr,
                            buf_len, period_len, periods);
     
            cppi5_tr_init(&tr_req[tr_idx].flags, CPPI5_TR_TYPE15, false,
    -                     true, CPPI5_TR_EVENT_SIZE_ICNT2_DEC, 0);
    +                     true, CPPI5_TR_EVENT_SIZE_COMPLETION, 0);
            cppi5_tr_set_trigger(&tr_req[tr_idx].flags, uc->config.tr_trigger_type,
                            CPPI5_TR_TRIGGER_TYPE_ICNT2_DEC, 0, 0);

  • Hi Bin,

    I applied the patch and ran the measurement again for several interrupt rates/cyclic transfer rates - Unfortunately it doesn't seem to have an effect.

    For the measurement I use a digital oscilloscope that records the DMA signal at 1MHz sampling rate (1microsecond resolution) for 60 second and measures the width of each DMA pulse. I can then plot a histogram of all those measurements.

    Here are the histrograms (I just saw that I can attach images):

    5000 transfers per second:

    7000 transfers per second:

    8000 transfers per second - here you see that latencies bigger than 10 microseconds start to appear:

    I hope these results help.

  • Hi Martin,

    I applied the patch and ran the measurement again for several interrupt rates/cyclic transfer rates - Unfortunately it doesn't seem to have an effect.

    Thanks for the test.

    measures the width of each DMA pulse.

    I guess the DMA pulse is the GPIO trigger signal which the FPGA drives high to start the signaling then drives low when AM625/DMA started reading the data from GPGA, right?

    A couple more questions before I start the dialog with our silicon design team:

    - I know your current test the transfer is 72 bytes in each cycle. Have you measured with different transfer size, such as 64, 128, 256 bytes to see if different transfer size affects the observation? I think the AM625 DMA burst line is 128 bytes.

    - Are you able to measure the time of the transfer time, from start to the end of transferring 72 bytes or other length? I want to know if the outliers are because of DMA starts late, or the previous transfer takes too long then delays the next DMA start time.

  • Hi Bin,

    I guess the DMA pulse is the GPIO trigger signal which the FPGA drives high to start the signaling then drives low when AM625/DMA started reading the data from GPGA, right?

    Yes, correct.

    - I know your current test the transfer is 72 bytes in each cycle. Have you measured with different transfer size, such as 64, 128, 256 bytes to see if different transfer size affects the observation? I think the AM625 DMA burst line is 128 bytes.

    I made a calculation error with the 72 bytes - it actually is 332 bytes per cycle. I'm sorry for the confusion.

    I will do some measurements with values below 128 bytes and post the results here.

    - Are you able to measure the time of the transfer time, from start to the end of transferring 72 bytes or other length? I want to know if the outliers are because of DMA starts late, or the previous transfer takes too long then delays the next DMA start time.

    I have to check with our FPGA designer whether we can achieve that in the FPGA, since we can't do that in the kernel.

    The AM625 DMA component doesn't provide registers to drive a pin where I could measure the transfer duration, right?

  • The AM625 DMA component doesn't provide registers to drive a pin where I could measure the transfer duration, right?

    Right, I am not aware of such direct pin exists.

  • Hi Bin

    - Are you able to measure the time of the transfer time, from start to the end of transferring 72 bytes or other length? I want to know if the outliers are because of DMA starts late, or the previous transfer takes too long then delays the next DMA start time.

    I managed to measure the GPMC 100MHz clock signal (the clock is active during a DMA transfer) together with the interrupt signal on my oscilloscope:
    In this measurement the transfer size was 76 Bytes at 8000 transfers per second - there is a delay of the transfer happening, the transfer duration did not change.

    Most of the time it is like this (clock signal is brown, interrupt is green):

    But sometimes there are delays:

      

    And for this one I has to zoom out:

  • Hi Martin,

    Thank you for these test results.

    I am not familiar with the GPMC module, but I am wondering if possible the delay is not from the DMA rather the GPMC.

    Can you please dump the GPMC configuration? I believe you can enable CONFIG_OMAP_GPMC_DEBUG in kernel .config, then the gpmc driver will dump all the configuration in kernel boot log. I might take this information and talk to our GPMC expert first.

  • Hi Bin,

    Here's the kernel boot log of the GPMC:

    [    1.045762] omap-gpmc 3b000000.memory-controller: GPMC revision 6.0
    [    1.052227] gpmc_mem_init: disabling cs 0 mapped at 0x0-0x1000000
    [    1.058926] gpmc cs1 before gpmc_cs_program_settings:
    [    1.064112] cs1 GPMC_CS_CONFIG1: 0x00001000
    [    1.068389] cs1 GPMC_CS_CONFIG2: 0x00101001
    [    1.072666] cs1 GPMC_CS_CONFIG3: 0x22060514
    [    1.076943] cs1 GPMC_CS_CONFIG4: 0x10057016
    [    1.081219] cs1 GPMC_CS_CONFIG5: 0x010f1111
    [    1.085501] cs1 GPMC_CS_CONFIG6: 0x8f070000
    [    1.089777] gpmc cs1 access configuration:
    [    1.093964] gpmc,mux-add-data = <0>;
    [    1.097618] gpmc,device-width = <2>;
    [    1.101272] gpmc,wait-pin = <0>;
    [    1.104571] gpmc,burst-length = <4>;
    [    1.108228] gpmc cs1 timings configuration:
    [    1.112506] gpmc,cs-on-ns = <10>; /* 1 ns - 10 ns; 1 ticks */
    [    1.118383] gpmc,cs-rd-off-ns = <160>; /* 151 ns - 160 ns; 16 ticks */
    [    1.125059] gpmc,cs-wr-off-ns = <160>; /* 151 ns - 160 ns; 16 ticks */
    [    1.131736] gpmc,adv-on-ns = <40>; /* 31 ns - 40 ns; 4 ticks */
    [    1.137790] gpmc,adv-rd-off-ns = <50>; /* 41 ns - 50 ns; 5 ticks */
    [    1.144199] gpmc,adv-wr-off-ns = <60>; /* 51 ns - 60 ns; 6 ticks */
    [    1.150610] gpmc,adv-aad-mux-on-ns = <10>; /* 1 ns - 10 ns; 1 ticks */
    [    1.157291] gpmc,adv-aad-mux-rd-off-ns = <20>; /* 11 ns - 20 ns; 2 ticks */
    [    1.164417] gpmc,adv-aad-mux-wr-off-ns = <20>; /* 11 ns - 20 ns; 2 ticks */
    [    1.171546] gpmc,oe-on-ns = <60>; /* 51 ns - 60 ns; 6 ticks */
    [    1.177526] gpmc,oe-off-ns = <160>; /* 151 ns - 160 ns; 16 ticks */
    [    1.183938] gpmc,oe-aad-mux-on-ns = <10>; /* 1 ns - 10 ns; 1 ticks */
    [    1.190533] gpmc,oe-aad-mux-off-ns = <30>; /* 21 ns - 30 ns; 3 ticks */
    [    1.197318] gpmc,we-on-ns = <50>; /* 41 ns - 50 ns; 5 ticks */
    [    1.203296] gpmc,we-off-ns = <160>; /* 151 ns - 160 ns; 16 ticks */
    [    1.209710] gpmc,rd-cycle-ns = <170>; /* 161 ns - 170 ns; 17 ticks */
    [    1.216299] gpmc,wr-cycle-ns = <170>; /* 161 ns - 170 ns; 17 ticks */
    [    1.222888] gpmc,access-ns = <150>; /* 141 ns - 150 ns; 15 ticks */
    [    1.229297] gpmc,page-burst-access-ns = <10>; /* 1 ns - 10 ns; 1 ticks */
    [    1.236241] gpmc,bus-turnaround-ns = <0>; /* 0 ns - 0 ns; 0 ticks */
    [    1.242739] gpmc,cycle2cycle-delay-ns = <0>; /* 0 ns - 0 ns; 0 ticks */
    [    1.249506] gpmc,wait-monitoring-ns = <0>; /* 0 ns - 0 ns; 0 ticks */
    [    1.256093] gpmc,clk-activation-ns = <0>; /* 0 ns - 0 ns; 0 ticks */
    [    1.262593] gpmc,wr-data-mux-bus-ns = <70>; /* 61 ns - 70 ns; 7 ticks */
    [    1.269448] gpmc,wr-access-ns = <150>; /* 141 ns - 150 ns; 15 ticks */
    [    1.276128] GPMC CS1: cs_on            :   0 ticks,   0 ns (was   1 ticks)   0 ns
    [    1.283784] GPMC CS1: cs_rd_off        :   6 ticks,  60 ns (was  16 ticks)  60 ns
    [    1.291445] GPMC CS1: cs_wr_off        :   8 ticks,  80 ns (was  16 ticks)  80 ns
    [    1.299113] GPMC CS1: adv_on           :   0 ticks,   0 ns (was   4 ticks)   0 ns
    [    1.306768] GPMC CS1: adv_rd_off       :   2 ticks,  20 ns (was   5 ticks)  20 ns
    [    1.314424] GPMC CS1: adv_wr_off       :   2 ticks,  20 ns (was   6 ticks)  20 ns
    [    1.322080] GPMC CS1: adv_aad_mux_on   :   0 ticks,   0 ns (was   1 ticks)   0 ns
    [    1.329734] GPMC CS1: adv_aad_mux_rd_off:   0 ticks,   0 ns (was   2 ticks)   0 ns
    [    1.337477] GPMC CS1: adv_aad_mux_wr_off:   0 ticks,   0 ns (was   2 ticks)   0 ns
    [    1.345221] GPMC CS1: oe_on            :   2 ticks,  20 ns (was   6 ticks)  20 ns
    [    1.352876] GPMC CS1: oe_off           :   6 ticks,  60 ns (was  16 ticks)  60 ns
    [    1.360531] GPMC CS1: oe_aad_mux_on    :   0 ticks,   0 ns (was   1 ticks)   0 ns
    [    1.368185] GPMC CS1: oe_aad_mux_off   :   0 ticks,   0 ns (was   3 ticks)   0 ns
    [    1.375840] GPMC CS1: we_on            :   0 ticks,   0 ns (was   5 ticks)   0 ns
    [    1.383495] GPMC CS1: we_off           :   2 ticks,  20 ns (was  16 ticks)  20 ns
    [    1.391164] GPMC CS1: rd_cycle         :   6 ticks,  60 ns (was  17 ticks)  60 ns
    [    1.398820] GPMC CS1: wr_cycle         :   8 ticks,  80 ns (was  17 ticks)  80 ns
    [    1.406478] GPMC CS1: access           :   5 ticks,  50 ns (was  15 ticks)  50 ns
    [    1.414133] GPMC CS1: page_burst_access:   0 ticks,   0 ns (was   1 ticks)   0 ns
    [    1.421792] GPMC CS1: bus_turnaround   :   0 ticks,   0 ns (was   0 ticks)   0 ns
    [    1.429447] GPMC CS1: cycle2cycle_delay:   0 ticks,   0 ns (was   0 ticks)   0 ns
    [    1.437102] GPMC CS1: wr_data_mux_bus  :   3 ticks,  30 ns (was   7 ticks)  30 ns
    [    1.444757] GPMC CS1: wr_access        :   4 ticks,  40 ns (was  15 ticks)  40 ns
    [    1.452413] GPMC CS1: wait_monitoring  :   0 ticks,   0 ns (was   0 ticks)   0 ns
    [    1.460068] GPMC CS1: clk_activation   :   2 ticks,  20 ns (was   0 ticks)  20 ns
    [    1.467722] GPMC CS1 CLK period is 10 ns (div 1)
    [    1.472446] gpmc cs1 after gpmc_cs_set_timings:
    [    1.477077] cs1 GPMC_CS_CONFIG1: 0x2c601200
    [    1.481354] cs1 GPMC_CS_CONFIG2: 0x00080600
    [    1.485631] cs1 GPMC_CS_CONFIG3: 0x00020200
    [    1.489907] cs1 GPMC_CS_CONFIG4: 0x02000602
    [    1.494183] cs1 GPMC_CS_CONFIG5: 0x00050806
    [    1.498460] cs1 GPMC_CS_CONFIG6: 0x84030000
    [    1.502736] gpmc cs1 access configuration:
    [    1.506923] gpmc,mux-add-data = <2>;
    [    1.510577] gpmc,device-width = <2>;
    [    1.514234] gpmc,wait-pin = <0>;
    [    1.517534] gpmc,wait-on-write = <1>;
    [    1.521277] gpmc,wait-on-read = <1>;
    [    1.524932] gpmc,burst-length = <4>;
    [    1.528585] gpmc,sync-write = <1>;
    [    1.532062] gpmc,gpmc,sync-read = <1>;
    [    1.535896] gpmc cs1 timings configuration:
    [    1.540171] gpmc,cs-on-ns = <0>; /* 0 ns - 0 ns; 0 ticks */
    [    1.545876] gpmc,cs-rd-off-ns = <60>; /* 51 ns - 60 ns; 6 ticks */
    [    1.552201] gpmc,cs-wr-off-ns = <80>; /* 71 ns - 80 ns; 8 ticks */
    [    1.558524] gpmc,adv-on-ns = <0>; /* 0 ns - 0 ns; 0 ticks */
    [    1.564312] gpmc,adv-rd-off-ns = <20>; /* 11 ns - 20 ns; 2 ticks */
    [    1.570722] gpmc,adv-wr-off-ns = <20>; /* 11 ns - 20 ns; 2 ticks */
    [    1.577132] gpmc,adv-aad-mux-on-ns = <0>; /* 0 ns - 0 ns; 0 ticks */
    [    1.583631] gpmc,adv-aad-mux-rd-off-ns = <0>; /* 0 ns - 0 ns; 0 ticks */
    [    1.590485] gpmc,adv-aad-mux-wr-off-ns = <0>; /* 0 ns - 0 ns; 0 ticks */
    [    1.597339] gpmc,oe-on-ns = <20>; /* 11 ns - 20 ns; 2 ticks */
    [    1.603305] gpmc,oe-off-ns = <60>; /* 51 ns - 60 ns; 6 ticks */
    [    1.609360] gpmc,oe-aad-mux-on-ns = <0>; /* 0 ns - 0 ns; 0 ticks */
    [    1.615769] gpmc,oe-aad-mux-off-ns = <0>; /* 0 ns - 0 ns; 0 ticks */
    [    1.622271] gpmc,we-on-ns = <0>; /* 0 ns - 0 ns; 0 ticks */
    [    1.627970] gpmc,we-off-ns = <20>; /* 11 ns - 20 ns; 2 ticks */
    [    1.634026] gpmc,rd-cycle-ns = <60>; /* 51 ns - 60 ns; 6 ticks */
    [    1.640257] gpmc,wr-cycle-ns = <80>; /* 71 ns - 80 ns; 8 ticks */
    [    1.646490] gpmc,access-ns = <50>; /* 41 ns - 50 ns; 5 ticks */
    [    1.652545] gpmc,page-burst-access-ns = <0>; /* 0 ns - 0 ns; 0 ticks */
    [    1.659311] gpmc,bus-turnaround-ns = <0>; /* 0 ns - 0 ns; 0 ticks */
    [    1.665810] gpmc,cycle2cycle-delay-ns = <0>; /* 0 ns - 0 ns; 0 ticks */
    [    1.672576] gpmc,wait-monitoring-ns = <0>; /* 0 ns - 0 ns; 0 ticks */
    [    1.679167] gpmc,clk-activation-ns = <20>; /* 11 ns - 20 ns; 2 ticks */
    [    1.685933] gpmc,wr-data-mux-bus-ns = <30>; /* 21 ns - 30 ns; 3 ticks */
    [    1.692787] gpmc,wr-access-ns = <40>; /* 31 ns - 40 ns; 4 ticks */

  • Hi Martin,

    I talked to our DMA designer, the DMA transfer start latency should be way shorter than 1us. So it seems the 10-25us latency in your test comes from the GPMC. I will talk to our GPMC expert on what to check next.

  • Hi Bin,

    Are there any updates on this issue yet?

    Regards

    Martin

  • Hi Martin,

    Sorry for the late response. I talked to our GPMC expert and we'd like to check some of the GPMC signals.

    Your kernel dump shows CS1 is used to connect to the FPGA. We want to see other CS config registers. You probably can read them using devmem2 program.

    On the scope probe, along with the interrupt signal and GPMC clk, can you please also probe the GPMC OEn_REn signal? It is the Output/Read Enable.