[FAQ] AM6x: Optimizing SPI-transfer inter-byte gaps using the DMA in Linux

Part Number: AM625
Other Parts Discussed in Thread: SK-AM62B

Overview

In some applications it is critical that SPI transfers happen with the smallest possible inter-byte gaps, either to optimize throughput or to meet overall system real-time requirements (e.g., an external SPI device may require a consistent burst of activity to operate correctly). This E2E FAQ discusses how the AM6x DMA can be used to optimize SPI transfers by moving data to and from the on-chip SPI controller, which in turn tightens the timing of the transfers with any external devices. The alternative is the programmed I/O (PIO) approach used by default in the Linux driver stack of the TI Linux SDKs v9.x, which is rather sensitive to the overall timing behavior of the Linux Kernel. The discussion and examples use the AM625 device on the SK-AM62B EVM, but the same concept applies to all other AM6x devices in a similar manner.

Setting up the SPI Controller to use DMA for Transfers

In order for the DMA to be used for SPI transfers, both of the following two prerequisites must be met:

  1. The SPI controller must be configured and enabled for DMA during driver probe, and
  2. A certain minimum transfer size requirement must be met

To enable the SPI controller to use the DMA, the device tree node for the desired module must contain valid, instance-specific dmas and dma-names properties. These are not currently present in the device tree files for some AM6x devices such as AM62x, but can be added by extending the respective device tree nodes. See the TISCI user guide at https://software-dl.ti.com/tisci/esd/latest/index.html for the device- and controller-instance-specific DMA thread numbers to use. The patch below shows how to modify the SK-AM62B main device tree file for SPI0 operation with DMA. Note that most of this patch was taken from the k3-am625-sk-mcspi-loopback.dtso file already present in the Kernel tree, with the two key additions being the dmas and dma-names properties. Folding the changes into the main device tree file makes the approach discussed here easier to reproduce, since no device tree overlay has to be applied.

diff --git a/arch/arm64/boot/dts/ti/k3-am625-sk.dts b/arch/arm64/boot/dts/ti/k3-am625-sk.dts
index f9b7fa2e8156b..21e01afc6a191 100644
--- a/arch/arm64/boot/dts/ti/k3-am625-sk.dts
+++ b/arch/arm64/boot/dts/ti/k3-am625-sk.dts
@@ -366,3 +366,36 @@ K3_TS_OFFSET(12, 17)
 			>;
 	};
 };
+
+/* The below was taken from k3-am625-sk-mcspi-loopback.dtso */
+&main_pmx0 {
+	main_spi0_pins_default: main-spi0-pins-default {
+		pinctrl-single,pins = <
+			AM62X_IOPAD(0x01bc, PIN_INPUT, 0) /* (A14) SPI0_CLK */
+			AM62X_IOPAD(0x01c0, PIN_INPUT, 0) /* (B13) SPI0_D0 */
+			AM62X_IOPAD(0x01c4, PIN_INPUT, 0) /* (B14) SPI0_D1 */
+			AM62X_IOPAD(0x01b4, PIN_INPUT, 0) /* (A13) SPI0_CS0 */
+		>;
+	};
+};
+
+/* The below was taken from k3-am625-sk-mcspi-loopback.dtso w/ added dma properties */
+&main_spi0 {
+	status = "okay";
+	#address-cells = <1>;
+	#size-cells = <0>;
+	pinctrl-0 = <&main_spi0_pins_default>;
+	pinctrl-names = "default";
+	dmas = <&main_pktdma 0xc300 0>, <&main_pktdma 0x4300 0>;
+	dma-names = "tx0", "rx0";
+	spidev@0 {
+		/*
+		 * Using spidev compatible is warned loudly,
+		 * thus use another equivalent compatible id
+		 * from spidev.
+		 */
+		compatible = "rohm,dh2228fv";
+		spi-max-frequency = <24000000>;
+		reg = <0>;
+	};
+};

Next, consider that the drivers/spi/spi-omap2-mcspi.c SPI controller driver used on AM6x devices only uses the DMA when the transfer size is at least DMA_MIN_BYTES, which is 160 bytes by default. This means that even if the DMA was configured in the device tree as discussed above, it will NOT be used for any transfer smaller than DMA_MIN_BYTES; the SPI controller will still be operated in PIO mode for those transfers. However, we can force the DMA to be used for smaller transfers by changing DMA_MIN_BYTES to suit application needs. The patch below shows, as an example, how this limit can be lowered to 8 bytes.

$ git show
commit 844240c526f2287889ef6591005e2473fb73b94a (HEAD -> ti-linux-6.1.y-am62-spi0-dma-dev)
Author: Andreas Dannenberg <dannenberg@ti.com>
Date:   Wed May 1 03:56:05 2024 -0500

    spi: omap2-mcspi: Use DMA for transfers as small as 8 byte
    
    To optimize overhead the McSPI driver only uses DMA for transfers of at
    least 160 bytes in size. However if the goal is to minimize 'inter-byte'
    gaps we need to lower this limit to our application and system needs.
    
    Signed-off-by: Andreas Dannenberg <dannenberg@ti.com>

diff --git a/drivers/spi/spi-omap2-mcspi.c b/drivers/spi/spi-omap2-mcspi.c
index 90c3d80b95fae..69c4a1ce2f9f6 100644
--- a/drivers/spi/spi-omap2-mcspi.c
+++ b/drivers/spi/spi-omap2-mcspi.c
@@ -98,7 +98,7 @@ struct omap2_mcspi_dma {
 /* use PIO for small transfers, avoiding DMA setup/teardown overhead and
  * cache operations; better heuristics consider wordsize and bitrate.
  */
-#define DMA_MIN_BYTES                  160
+#define DMA_MIN_BYTES                  8
 
 
 /*
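
To make the interplay of the two prerequisites concrete, here is a minimal sketch of the resulting decision. It is a simplified model for illustration only and not code from the driver: the helper name xfer_would_use_dma and its dma_channels_ok parameter are hypothetical, and the real driver may apply further conditions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Same name and value as the patched driver macro shown above. */
#define DMA_MIN_BYTES 8

/*
 * Hypothetical helper, NOT driver code: returns true when a transfer of
 * 'len' bytes would be eligible for DMA under the two prerequisites
 * discussed in this FAQ.
 */
static bool xfer_would_use_dma(bool dma_channels_ok, size_t len)
{
	/* Prerequisite 1: dmas/dma-names were valid and channels acquired. */
	if (!dma_channels_ok)
		return false;

	/* Prerequisite 2: transfer length is at or above the cutoff. */
	return len >= DMA_MIN_BYTES;
}
```

With the cutoff at 8, a 7-byte transfer stays in PIO mode and an 8-byte transfer becomes DMA-eligible. Without working DMA channels, even a 160-byte transfer stays in PIO mode.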

Test Results PIO vs. DMA Transfers

Let's do a quick analysis to see the impact of using the DMA on optimizing inter-byte gap timing during SPI transfers. For this we added the changes discussed earlier to an SDK v9.x-based Linux kernel tree and used the tools/spi/spidev_test.c tool available in the Linux Kernel tree, cross-compiled and installed on an SK-AM62B board.

First, let's do a SPI transfer of 7 bytes in length. This is below the DMA_MIN_BYTES limit configured earlier and therefore results in a PIO-based transfer.

root@am62xx-evm:~# ./spidev_test -D /dev/spidev1.0 -v -p '1234567'
spi mode: 0x0
bits per word: 8
max speed: 500000 Hz (500 kHz)
TX | 31 32 33 34 35 36 37 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __  |1234567|
RX | FF FF FF FF FF FF FF __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __  |.......|

Probing the SPI signals on the user expansion headers of the SK-AM62B board during the transfer we can observe an inter-byte gap of about 4us, which is caused by software activity re-loading and accessing the SPI controller registers.

Now let's do a similar transfer, but with 8 bytes of data. With the DMA_MIN_BYTES limit configured earlier, this should result in a SPI transfer using the DMA.

root@am62xx-evm:~# ./spidev_test -D /dev/spidev1.0 -v -p '12345678'
spi mode: 0x0
bits per word: 8
max speed: 500000 Hz (500 kHz)
TX | 31 32 33 34 35 36 37 38 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __  |12345678|
RX | FF FF FF FF FF FF FF FF __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __  |........|

Observing the logic analyzer traces of the SPI signals we can indeed see the inter-byte gaps have been reduced to about 2us, showing the impact of the DMA servicing the SPI controller. Note that this value is not representative of the 'best case' as those inter-byte gaps will be much smaller at higher SPI CLK speeds.
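
For applications that issue transfers themselves rather than through spidev_test, the following is a minimal sketch of an equivalent user-space transfer through the spidev ioctl interface. The helper name spi_burst is hypothetical and error handling is kept to a minimum. The point it illustrates is that the whole payload is handed to the kernel as ONE spi_ioc_transfer, so its length is what gets compared against DMA_MIN_BYTES. Splitting the same payload into several shorter transfers would presumably leave each of them below the cutoff and hence in PIO mode.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/spi/spidev.h>

/*
 * Hypothetical helper: full-duplex transfer of 'len' bytes as a single
 * spi_ioc_transfer. Returns the number of bytes transferred, or -1 on
 * any error (device cannot be opened, ioctl fails).
 */
static int spi_burst(const char *dev, const uint8_t *tx, uint8_t *rx,
		     size_t len, uint32_t speed_hz)
{
	struct spi_ioc_transfer tr;
	int fd, ret;

	fd = open(dev, O_RDWR);
	if (fd < 0)
		return -1;

	memset(&tr, 0, sizeof(tr));
	tr.tx_buf = (unsigned long)tx;	/* buffers are passed as integers */
	tr.rx_buf = (unsigned long)rx;
	tr.len = len;			/* one transfer, one length check */
	tr.speed_hz = speed_hz;
	tr.bits_per_word = 8;

	ret = ioctl(fd, SPI_IOC_MESSAGE(1), &tr);
	close(fd);

	return ret < 0 ? -1 : ret;
}
```

A call such as spi_burst("/dev/spidev1.0", tx, rx, 8, 500000) would correspond to the 8-byte spidev_test invocation shown above.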

Debugging Aids

Since the use of the DMA, or lack thereof, happens somewhat silently, the user may be left wondering whether the desired mode is in fact active. For debugging purposes, consider adding the below patch to the drivers/spi/spi-omap2-mcspi.c SPI controller driver to make the behavior more observable:

diff --git a/drivers/spi/spi-omap2-mcspi.c b/drivers/spi/spi-omap2-mcspi.c
index 6a27f8315ff5d..90c3d80b95fae 100644
--- a/drivers/spi/spi-omap2-mcspi.c
+++ b/drivers/spi/spi-omap2-mcspi.c
@@ -384,6 +384,8 @@ static void omap2_mcspi_tx_dma(struct spi_device *spi,
 	mcspi = spi_master_get_devdata(spi->master);
 	mcspi_dma = &mcspi->dma_channels[spi->chip_select];
 
+	dev_info_once(mcspi->dev, "%s:\n", __func__);
+
 	dmaengine_slave_config(mcspi_dma->dma_tx, &cfg);
 
 	tx = dmaengine_prep_slave_sg(mcspi_dma->dma_tx, xfer->tx_sg.sgl,
@@ -423,6 +425,8 @@ omap2_mcspi_rx_dma(struct spi_device *spi, struct spi_transfer *xfer,
 	mcspi_dma = &mcspi->dma_channels[spi->chip_select];
 	count = xfer->len;
 
+	dev_info_once(mcspi->dev, "%s:\n", __func__);
+
 	/*
 	 *  In the "End-of-Transfer Procedure" section for DMA RX in OMAP35x TRM
 	 *  it mentions reducing DMA transfer length by one element in master
@@ -573,6 +577,8 @@ omap2_mcspi_txrx_dma(struct spi_device *spi, struct spi_transfer *xfer)
 
 	mcspi = spi_master_get_devdata(spi->master);
 
+	dev_info_once(mcspi->dev, "%s:\n", __func__);
+
 	if (cs->word_len <= 8) {
 		width = DMA_SLAVE_BUSWIDTH_1_BYTE;
 		es = 1;
@@ -959,6 +965,12 @@ static int omap2_mcspi_request_dma(struct omap2_mcspi *mcspi,
 		mcspi_dma->dma_rx = NULL;
 	}
 
+	if (mcspi_dma->dma_rx)
+		dev_info(mcspi->dev, "RX DMA enabled\n");
+
+	if (mcspi_dma->dma_tx)
+		dev_info(mcspi->dev, "TX DMA enabled\n");
+
 no_dma:
 	return ret;
 }

The patch should result in the following diagnostic messages added to the Kernel log:

root@am62xx-evm:~# dmesg | grep omap2_mcspi
[    5.432797] omap2_mcspi 20100000.spi: RX DMA enabled
[    5.439199] omap2_mcspi 20100000.spi: TX DMA enabled
[   22.773984] omap2_mcspi 20100000.spi: omap2_mcspi_txrx_dma:
[   22.780566] omap2_mcspi 20100000.spi: omap2_mcspi_tx_dma:
[   22.790539] omap2_mcspi 20100000.spi: omap2_mcspi_rx_dma:

The two messages RX DMA enabled and TX DMA enabled are output during the boot process when the SPI controller driver is probed. Seeing these messages confirms that the DMA is configured correctly (via device tree) and available for use. As discussed earlier, this does not imply that the DMA is also used for all transfers. This is where the omap2_mcspi_*_dma log messages come in, which confirm that the actual DMA transfer functions were triggered. Note that those messages are only output once (per Linux session) so as not to negatively affect the system's real-time behavior when doing repeated transfers.

TURBO Mode (Experimental)

This section discusses the use of TURBO mode to further reduce the inter-byte gaps during SPI transfers. TURBO mode is a hardware feature of the McSPI peripheral module; please refer to the device TRM for additional technical background. There is a partial implementation of this feature in the Linux Kernel, but it is dormant (there is no easy way to activate it) and limited to RX-only transfers, so for all practical purposes it is not usable.

The attached patch builds on this and extends the McSPI driver to expose the TURBO mode feature via a device tree property, and lifts the RX-only limitation. The patch set also includes the enabling of the DMA as discussed earlier in this FAQ, as well as a reduction of the minimum transfer size cutoff for which the DMA is used down to 8 bytes, optimizing for best-possible overall performance. Note that the use of TURBO mode has only undergone very limited bench testing, and hence the implementation provided here should be considered EXPERIMENTAL for now.

Let's look at the performance gains that can be achieved. For testing, we use the spidev_test-based test command that was used earlier, but we also add a parameter to set the SPI operating frequency to the maximum of 50MHz so we can better observe the impact of the TURBO mode.

root@am62xx-evm:~# ./spidev_test -D /dev/spidev1.0 -s 50000000 -v -p '12345678'
spi mode: 0x0
bits per word: 8
max speed: 50000000 Hz (50000 kHz)
TX | 31 32 33 34 35 36 37 38 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __  |12345678|
RX | 31 32 33 34 35 36 37 38 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __  |12345678|

Below is a timing diagram of the 8-byte transfer at SPI_CLK=50MHz using only the DMA. An inter-byte gap of 170ns can be observed. For this test, TURBO mode was disabled by removing the ti,spi-turbo-mode property from the device tree file.

Now here is a timing diagram of the same transfer, but with TURBO mode enabled. It can be seen that the inter-byte gap is now only 72ns long, which is significantly shorter than the 170ns observed without TURBO mode. This much-improved inter-byte gap will greatly increase effective throughput of the SPI communication protocol.
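
To put the measured gaps into perspective, the effective data rate can be estimated from the SPI clock, the word size and the inter-byte gap. The sketch below is a rough back-of-the-envelope model, not a measurement. The helper name spi_effective_bps is hypothetical. It assumes one gap per word (an N-word burst only has N-1 gaps, so the estimate is slightly pessimistic) and ignores chip-select setup and hold times.

```c
/*
 * Hypothetical helper: estimated effective data rate in bit/s for
 * back-to-back words of 'bits_per_word' bits clocked at 'clk_hz',
 * with an idle gap of 'gap_s' seconds following each word.
 */
static double spi_effective_bps(double clk_hz, unsigned int bits_per_word,
				double gap_s)
{
	/* Time spent shifting out one word, in seconds. */
	double word_s = bits_per_word / clk_hz;

	/* Payload bits divided by the time for one word plus its gap. */
	return bits_per_word / (word_s + gap_s);
}
```

At 50MHz an 8-bit word takes 160ns on the wire. With the 170ns gap measured for DMA only, spi_effective_bps(50e6, 8, 170e-9) gives about 24.2 Mbit/s. With the 72ns gap measured with TURBO mode, spi_effective_bps(50e6, 8, 72e-9) gives about 34.5 Mbit/s, an improvement of roughly 42% under this model. The same formula applied to the earlier 500kHz measurements gives 400 kbit/s for the 4us PIO gap and about 444 kbit/s for the 2us DMA gap, which illustrates why the gaps matter far more at high clock rates.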

The McSPI driver also has an added diagnostic print that prints (only once!) to the Kernel message log when the TURBO mode was actually enabled/used in hardware. You can check the Kernel log after some transfers to see if the below log entry can be found as a sign that the driver stack was configured and is working correctly.

root@am62xx-evm:~# dmesg | grep TURBO 
[   63.447266] spidev spi1.0: omap2_mcspi_transfer_one: enabling TURBO mode

ti-linux-6.6.y-mcspi-turbo-dev-4-nov-2024.tar.gz