This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

Linux/AM3358: SPI DMA issue

Part Number: AM3358

Tool/software: Linux

I'm working with an am3358 running linux kernel 4.9.69. Our board has a few Nordic radios connected as SPI slaves to the am3358. We operate the SPI bus at 8MHz. All of our SPI transactions are pretty small: less than or equal to 64 bytes, but there are a lot of them.

Everything works flawlessly when DMA is not involved, although we are seeing pretty high CPU usage. The system has run flawlessly for more than a week.

Recently I've started looking into using DMA for the SPI transactions to try to reduce the CPU load. The driver spi-omap2-mcspi.c uses the following to decide whether or not to use DMA:
#define DMA_MIN_BYTES 160

I changed this value to 0, so that the DMA is used for every transaction, and I'm seeing a pretty significant reduction in CPU usage by doing that. However, SPI doesn't work flawlessly anymore. I see the SPI transactions stop after about 1/2 hour to 1 hour.  I also see this happen if I set DMA_MIN_BYTES to 32 or 64.

Do you have any idea why that would be? Do you know of any reason enabling DMA on smaller transactions would cause the SPI to stop working? There don't appear to be any errors propagated up to our driver that is using the SPI subsystem. And there aren't any errors I see when I enable debug in spi.c and spi-omap2-mcspi.c

Thanks!

  • Can you add debug to see if you are stuck in a wait_for_completion call?

    Steve K.
  • It looks like I am getting stuck in a wait_for_completion call ...

    As an experiment, I converted the wait_for_completion() calls in spi-omap2-mcspi to wait_for_completion_timeout(), and I hit the timeout in omap2_mcspi_rx_dma()

    Additionally, when I remove the module after getting into the error condition, I see the following:

    [58123.213768] omap_hwmod: spi1: _wait_target_disable failed
    [58128.850390] spi_master spi2: could not stop message queue
    [58128.855883] spi_master spi2: problem destroying queue
    [58128.869961] spi_master spi2: queue remove failed

  • I dumped the peripheral registers when I hit the error condition:

    [ 515.038618] JZ omap2_mcspi_rx_dma() timed out
    [ 515.043069] OMAP2_MCSPI_REVISION = 0x0000002b
    [ 515.047805] OMAP2_MCSPI_SYSSTATUS = 0x00000001
    [ 515.057359] OMAP2_MCSPI_IRQSTATUS = 0x00000001
    [ 515.063953] OMAP2_MCSPI_IRQENABLE = 0x00000000
    [ 515.070601] OMAP2_MCSPI_WAKEUPENABLE = 0x00000001
    [ 515.075353] OMAP2_MCSPI_SYST = 0x00000000
    [ 515.083063] OMAP2_MCSPI_MODULCTRL = 0x00000001
    [ 515.087815] OMAP2_MCSPI_XFERLEVEL = 0x00401f1f
    [ 515.095545] OMAP2_MCSPI_CHCONF0 = 0x3811c3d4
    [ 515.102957] OMAP2_MCSPI_CHSTAT0 = 0x0000002e
    [ 515.107711] OMAP2_MCSPI_CHCTRL0 = 0x00000001
    [ 515.115953] OMAP2_MCSPI_TX0 = 0x00000000
    [ 515.123027] OMAP2_MCSPI_RX0 = 0x00000000

    [ 540.640711] JZ omap2_mcspi_rx_dma() timed out
    [ 540.645390] OMAP2_MCSPI_REVISION = 0x0000002b
    [ 540.650129] OMAP2_MCSPI_SYSSTATUS = 0x00000001
    [ 540.660151] OMAP2_MCSPI_IRQSTATUS = 0x00000001
    [ 540.666923] OMAP2_MCSPI_IRQENABLE = 0x00000000
    [ 540.674266] OMAP2_MCSPI_WAKEUPENABLE = 0x00000001
    [ 540.679022] OMAP2_MCSPI_SYST = 0x00000000
    [ 540.686883] OMAP2_MCSPI_MODULCTRL = 0x00000001
    [ 540.694689] OMAP2_MCSPI_XFERLEVEL = 0x00401f1f
    [ 540.699448] OMAP2_MCSPI_CHCONF0 = 0x3811c3d4
    [ 540.707502] OMAP2_MCSPI_CHSTAT0 = 0x0000002e
    [ 540.714205] OMAP2_MCSPI_CHCTRL0 = 0x00000001
    [ 540.718959] OMAP2_MCSPI_TX0 = 0x00000000
    [ 540.727341] OMAP2_MCSPI_RX0 = 0x00000000

  • Thanks. I am analyzing this.

    I wrote a small test program that uses spidev to transmit and receive data. I'm using a Beaglebone Black and have a jumper between spi0_d0 and spi0_d1. I modified the mcspi driver so that the DMA threshold is 32 bytes and in the program I'm sending and receiving 200 bytes of data. After 4 hours still no errors.

    Steve K.
  • What if you send/receive 64 bytes of data instead of 200?  Maybe it has to do with the small packets I'm sending?  I can try adjusting my packet size to 200 and see if the problem goes away.

  • I ran all weekend with 200 byte packets.  The issue must be related to using DMA with smaller packets.  Let me know if you are able to reproduce with 64 byte packets.

  • I'm unable to repro with spidev and spidev_test (modified so it runs in a loop) with 64 byte packets. My best guess is that it doesn't repro because the timing is different than our custom driver...

    We have a custom driver on top of spi: it submits a spi transfer immediately after the previous one finishes.

    I've instrumented the code such that I should be able to tell if a new transfer is attempted while one is still in progress. So far I've been unable to find any issues.

    The problem seems to be that a dma transfer is started, but the completion never gets completed. Do you have any other ideas about what would cause that to happen? Or what to look for in the memory-mapped registers for spi/dma?

    Thanks!
  • I'll look around some more for what would cause the completion not to complete. Can you dump the registers before you hit the error? I'd like to compare them.

    Steve K.
  • Here's a register dump on a typical spi transaction just before wait_for_completion is called in omap2_mcspi_rx_dma. If you're looking for the register dump just before the error happens, I'll see if I can make that happen or let you know if it impacts the ability to reproduce the bug ...

    [ 266.365716] BEFORE:
    [ 266.367941] OMAP2_MCSPI_REVISION = 0x0000002b
    [ 266.379863] OMAP2_MCSPI_SYSSTATUS = 0x00000001
    [ 266.387904] OMAP2_MCSPI_IRQSTATUS = 0x00020001
    [ 266.394725] OMAP2_MCSPI_IRQENABLE = 0x00000000
    [ 266.399477] OMAP2_MCSPI_WAKEUPENABLE = 0x00000001
    [ 266.407469] OMAP2_MCSPI_SYST = 0x00000000
    [ 266.414152] OMAP2_MCSPI_MODULCTRL = 0x00000001
    [ 266.418905] OMAP2_MCSPI_XFERLEVEL = 0x00001f1f
    [ 266.426557] OMAP2_MCSPI_CHCONF0 = 0x381103d4
    [ 266.433036] OMAP2_MCSPI_CHSTAT0 = 0x0000002e
    [ 266.437789] OMAP2_MCSPI_CHCTRL0 = 0x00000001
    [ 266.445536] OMAP2_MCSPI_TX0 = 0x00000000
    [ 266.450286] OMAP2_MCSPI_RX0 = 0x00000000
  • I put in a check to see if OMAP2_MCSPI_CHSTAT_EOT (end of transfer status) is cleared before attempting to modify any registers in omap2_mcspi_rx_dma(). When I use spidev I don't see any issues: EOT is always cleared. However with our custom driver I see EOT is occasionally set. That doesn't seem right, however I'm not sure why it's happening yet. I put in some other checks to see if a new transfer is ever attempted before the previous one completes and I haven't hit any of that debug code so far. Any idea what could cause that?
  • I got a salae logic analyzer capture when the failure happens.  Attached.

    DEBUG_1 err goes high just after the wait_for_completion_timeout in omap2_mcspi_rx_dma.  The timeout is set to 140 jiffies.

    DEBUG_0 xfer goes high at the start of omap2_mcspi_transferone and low at the end

    DRDY8 is drive by the slave device and goes high when the slave device has received the spi data.  It goes low when the slave device is ready to for the master to send another transaction

    The rest of the signals are the standard spi signals miso, mosi, clock, chip select.

    If you backtrack about 1.4 seconds from where DEBUG_1 err goes high, the last successful spi transaction happens, and then the failed transaction starts.  DEBUG_0 xfer goes high, but none of the other lines toggle.

    The only thing that's weird about the last successful transaction is that there is a brief hiccup in the spi clock 33 bytes into the transaction, which seems a little weird since dma is being used.

  • It turns out that I can reproduce with spidev.  Previously I was using a 12 byte packet and was unable to reproduce.  However, now I'm using a 64 byte packet I can reproduce it.

    There's a short pause (61.96 usec in the following screenshot) that happens just after the 33rd byte, but then the SPI transfer resumes and finishes.  However the next transfer after that doesn't work.

    I'm also attaching the salaea logic analyzer trace.

    spidev - repro.logicdata.zip

    DEBUG_1 err goes high when wait_for completion times out.

    DEBUG_0 xfer has a double pulse at transfer_one entry and a single pulse at transfer_one exit

      

  • I tried over the weekend using our custom spi driver and 32 byte packets and was unable to reproduce the issue.

    I can reproduce with 64 byte packets with spidev and with our custom driver.  I am going to try different spi transfer sizes and see what happens...

    Have you been able to reproduce yet?

  • I think it has something to do with using the fifo when the transfer size is a multiple of 32 bytes. I added some code to bypass enabling the fifo: I added an unconditional goto disable_fifo in omap2_mcspi_set_fifo and am not seeing the issue any more. Ideally I think we want to use the fifo because we want to reduce cpu usage as much as possible.

    Have you been able to reproduce it? I mentioned I was able to reproduce with spidev, so I would imagine you should be able to now. Is there any additional information I can provide you? So far this issue hasn't been a huge priority over here, but it is starting to become one due to schedule reasons.
  • Hi James,
    I'll try to reproduce it later today with spidev.

    Steve K.
  • I am able to reproduce it with 64 byte packets. I'll start digging into this.

    Steve K.

  • Thanks for looking into it!

    I've tried experimenting with the fifo depth and XFERLEVEL AFL/AEL settings to see if I can make things work, but have been unable to find any settings that work. 

  • I added a gpio pulse in omap2_mcspi_tx_callback, and it looks like that completion fires right around the 33rd byte of the transfer.  Not sure if it's related to the issue or not, but it does seem suspicious...

  • I think I found a fix. The trick appears to be setting up the tx dma transfer first without setting CHCONF_DMAW, and then setting up the rx dma transfer, and then when they're both set up, set DMAW/DMAR at the same time. I'm going to let this run overnight to double confirm things are working, and then I'll send you a patch.
  • Here is a patch, my spi test ran overnight with it.   Not sure if there's a better way to fix it though, can you take a look and let me know?

    start-dma-simultaneously.patch.zip

  • Here's the set of patches I came up with:

    1. change wait_for_completion to wait_for_completion_timeout

    2. fix the race condition by starting tx/rx dma at the same time

    3. device tree binding for dma minimum length

    4. device tree binding for completion timeout

    dma-patches.tar.gz

  • Hi James,

    Please allow me to setup the test condition, reproduce, and I will follow-up with feedback.

    Marcus

  • This might help you reproduce it.  I've attached a modified spidev_test program that runs in a loop.  I was able to reproduce it with this.  You can invoke it as follows:

    sudo ./spidev_loop -D /dev/spidev2.0 -s 8000000 -p "0123456789012345678901234567890123456789012345678901234567890123"

    spidev_loop.c.zip

  • OK I'm able to reproduce this issue with the latest test app you provided. Allow me to understand the root cause in relation to the workaround you provided.
  • The workaround seems to be a good candidate for a permanent update in the TI driver.
    I will submit your patches to our software development team.
  • I reported the problem to the TI software team and they are root causing the problem.