DK-TM4C129X: Estimating uDMA processing time for a scatter gather transfer

Part Number: DK-TM4C129X

I'm working on a timing-critical ADC unloading process which I have working, but with downstream processing taking about 50% longer than I want. The ADC is an 8-channel 18-bit SAR that I'm unloading in about 4.8 us using 20 MHz SPI (QSSI) and uDMA. That gets me a 9 x 16-bit block with the 8 x 18-bit samples "barber-poled" through it. At present I'm using a uDMA memory scatter-gather transfer to unpack the 18-bit samples into 32-bit words. I then left-shift the 32-bit words to get the data correctly aligned and copy the result to an output queue. The align-and-copy process takes about 1.5 us with locally optimized code.

However, the scatter-gather uDMA process seems to take almost 6 us working out of processor memory with the processor running at 60 MHz. The process makes 8 transfers of 3 bytes each, and there is no other DMA processing contending for the bus. Ideally I'd like to get the unpack processing under 3 us, which would let me run the ADC at 100 kHz.
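For what it's worth, the arithmetic behind that 3 us target can be written out as a compile-time check (the stage names are mine; the numbers come from the figures quoted above, with 3 us as the unpack target):

```cpp
// Sanity check on the timing budget: at a 100 kHz sample rate each block
// gets 10 us, so the three stages have to fit inside one period.
constexpr double kUnloadUs = 4.8;   // SPI + uDMA unload of the 9-word block
constexpr double kUnpackUs = 3.0;   // target for the scatter-gather unpack
constexpr double kAlignUs  = 1.5;   // shift/mask align and copy to the queue
constexpr double kPeriodUs = 10.0;  // one sample period at 100 kHz

static_assert(kUnloadUs + kUnpackUs + kAlignUs < kPeriodUs,
              "the three stages must fit inside one 10 us sample period");
```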

Does 6 us sound about right for 8 x 3-byte uDMA scatter-gather transactions?

What may influence the total transaction time?
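A back-of-envelope model may partly answer my own question: in memory scatter-gather mode the controller copies each four-word task structure into the channel's alternate control structure before executing it, so every 3-byte task also pays for an eight-access task copy. A sketch of the estimate (EstimateMicroseconds and the cycles-per-access figures are my assumptions, not controller specifications):

```cpp
// Back-of-envelope model of a memory scatter-gather transfer. Assumptions:
// each task costs a 4-word task-structure copy (4 reads + 4 writes) plus one
// read and one write per data byte, and each bus access averages
// cyclesPerAccess cycles once CPU contention on the SRAM bus is included.
constexpr double EstimateMicroseconds(int tasks, int bytesPerTask,
                                      double cyclesPerAccess, double busMHz)
   {
   const int taskFetchAccesses = 4 + 4;        // read + write four 32-bit words
   const int dataAccesses = 2 * bytesPerTask;  // read + write each byte
   const int totalAccesses = tasks * (taskFetchAccesses + dataAccesses);
   return totalAccesses * cyclesPerAccess / busMHz;
   }
```

For 8 tasks of 3 bytes at 60 MHz this gives roughly 3.7 us at 2 cycles per access and 5.6 us at 3 cycles per access, so the observed 6 us is at least in the right ballpark.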

My setup code is:

   // Set up scatter gather processing to unpack 18 bit samples from DMA
   // buffer into 32 bit sample block
   uint8_t *byteSource = reinterpret_cast<uint8_t *>(sDMARxBuffer);
   size_t taskIdx = 0;

   while (*inputNums && taskIdx < 8)
      {
      size_t srcByteOffset = (*inputNums - 1) * 18 / 8;

      ++inputNums;
      sSampleXferTable[taskIdx] = uDMATaskStructEntry(
         3, UDMA_SIZE_8,
         UDMA_SRC_INC_8, byteSource + srcByteOffset,
         UDMA_DST_INC_8, sSamplesUnpackBuffer + taskIdx,
         UDMA_ARB_2,
         !*inputNums ? UDMA_MODE_AUTO : UDMA_MODE_MEM_SCATTER_GATHER
         );
      ++taskIdx;
      }

   IntRegister(UDMA_INT_SW, CommitSamples);
   IntEnable(UDMA_INT_SW);
   uDMAChannelAttributeEnable(UDMA_CHANNEL_SW, 0);

   // Give scatter gather DMA processing high priority
   UDMA_PRIOSET_R = 1 << UDMA_CHANNEL_SW;

   // Set up unpack pointers
   uDMAChannelScatterGatherSet(UDMA_CHANNEL_SW, 8, sSampleXferTable, 0);

Is there another way I could solve this problem? One possibility is overlapping the ADC unloading and sample unpacking uDMA processes, but I haven't had much success in making that work.

  • Unpacking using scatter-gather will not be very efficient. Since the number of samples is dynamic, you spend CPU time setting up the task list. Then, for every 3 bytes you transfer, the controller has to make additional uDMA transfers to load the next task structure.

    Have you considered using Peripheral Scatter-Gather to copy from the SSI to separate memory arrays for each channel? That way, the transfer and unpack happen while the SSI is receiving the next data.
  • Hi Bob,

    the setup only happens once, at the start of sampling, when I know which channels are required, so the setup overhead is pretty much irrelevant. The CPU overhead per sample block is in a couple of interrupt handlers:

    void UnpackSamples()
       {
       // Interrupt handler for SSI1 uDMA following DMA transfer.
       SetRGBLed(BLULed | GRNLed);
    
       // Debugging
       ++*sSampleTxBuffer;
    
       // Clean up after SSI1 transfer
       SSIDMADisable(SSI1_BASE, SSI_DMA_TX | SSI_DMA_RX);
       SSI1_ICR_R = SSI1_RIS_R; // Clear the pending interrupts
    
       // Reset DMA interrupts to allow next transfer to happen
       SSI1_DMACTL_R &= ~(SSI_DMA_TX | SSI_DMA_RX);
    
       // Release the ADC chip select
       *(GPIO_PORTB_AHB_DATA_BITS_R + ADC_CS_BIT) = ADC_CS_BIT;
    
       // Kick off the sample unpack and commit processing
       uDMAChannelScatterGatherSet(UDMA_CHANNEL_SW, 8, sSampleXferTable, 0);
       NVIC_EN1_R |= 1 << (44 - 32); // Enable the uDMA SW interrupt
       UDMA_ENASET_R |= 1 << UDMA_CHANNEL_SW;
       UDMA_SWREQ_R = 1 << UDMA_CHANNEL_SW;
       SetRGBLed(GRNLed);
       }
    
    
    void CommitSamples()
       {
       SetRGBLed(BLULed);
       NVIC_UNPEND1_R = 1 << (44 - 32); // Clear pending uDMA SW interrupt
       NVIC_DIS1_R = 1 << (44 - 32); // and disable it
    
       int32_t *to = gSampleBuffer->mBuffer[gSampleBuffer->mNextIn].mChanSamples;
       int32_t *from = sSamplesUnpackBuffer;
    
       // Each 18-bit sample sits at a different bit offset within its three
       // unpacked bytes, so the shift needed to left-align it at bits 31:14
       // cycles 14, 12, 10, 8 through the block.
       *to++ = (*from++ << 14) & 0xFFFFC000;
       *to++ = (*from++ << 12) & 0xFFFFC000;
       *to++ = (*from++ << 10) & 0xFFFFC000;
       *to++ = (*from++ <<  8) & 0xFFFFC000;
       *to++ = (*from++ << 14) & 0xFFFFC000;
       *to++ = (*from++ << 12) & 0xFFFFC000;
       *to++ = (*from++ << 10) & 0xFFFFC000;
       *to   = (*from   <<  8) & 0xFFFFC000;
    
       // Commit the new sample to the sample buffer
       if (++gSampleBuffer->mNextIn >= kSampleBufferSize)
          gSampleBuffer->mNextIn = 0;
    
       if (gSampleBuffer->mNextOut == gSampleBuffer->mNextIn)
          {
          // The buffer has overflowed. Drop the oldest sample.
          if (++gSampleBuffer->mNextOut >= kSampleBufferSize)
             gSampleBuffer->mNextOut = 0;
          }
    
       SetRGBLed(LEDSOff);
       }
    

    which run in about 1.5 us each with local optimization for speed turned on.

    The killer for this whole task is that the ADC (ADS8598S - 8-channel 18-bit SAR) generates a packed stream of 8 x 18 bits, giving a 144-bit block (9 x 16-bit words) with the 18-bit sample words barber-poled through it. SSI works with word sizes from 4 to 16 bits, so it doesn't natively handle the 18-bit samples from the ADC. I can get around that by transferring 9 x 16-bit words and manually handling the SPI enable line for the ADC, but then I have the unpacking problem.
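    To make the packing concrete: the unpack that the scatter-gather plus shift code performs can also be written as plain shifts and masks. A minimal sketch (Extract18 is a made-up helper; it assumes MSB-first bit order with sample 0's 18 bits at the top of the 144-bit block, and that the received bytes have already been merged big-endian into 16-bit words, both worth checking against the ADS8598S data sheet):

    ```cpp
    #include <cstdint>

    // Extract sample k (0..7) from the 9 x 16-bit "barber pole" receive block.
    inline int32_t Extract18(const uint16_t *w, int k)
       {
       const int bit = 18 * k;   // bit offset of the sample from the block start
       const int wi = bit / 16;  // first 16-bit word holding part of the sample
       const uint64_t chunk =    // 48-bit window covering the whole sample
          ((uint64_t)w[wi] << 32) |
          ((uint64_t)w[wi + 1] << 16) |
          (wi + 2 < 9 ? w[wi + 2] : 0);
       const uint32_t raw = (uint32_t)(chunk >> (48 - 18 - bit % 16)) & 0x3FFFF;
       // Sign-extend the 18-bit two's-complement value to 32 bits
       return (int32_t)(raw << 14) >> 14;
       }
    ```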

    Hmm, I could set the SSI up to use 9-bit words. I'd still have to do some post-processing to combine the two 9-bit values for each sample into an appropriately aligned 32-bit value, but I'm doing most of that work already and it's pretty fast.
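    If the 9-bit route works, the combine step is just a shift and an OR per sample. A sketch (Combine9 is a made-up name; it assumes the high 9 bits of each sample arrive first and uses the same bits-31:14 alignment as the existing CommitSamples code):

    ```cpp
    #include <cstdint>

    // Combine two 9-bit SSI receive words into one sample aligned at bits
    // 31:14, matching the 0xFFFFC000 alignment in CommitSamples. Assumes the
    // high half arrives first (check against the ADC's bit order).
    inline int32_t Combine9(uint16_t hi9, uint16_t lo9)
       {
       const uint32_t raw18 = ((uint32_t)(hi9 & 0x1FF) << 9) | (lo9 & 0x1FF);
       return (int32_t)(raw18 << 14);  // left-align the 18-bit value at bit 31
       }
    ```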

    I'll let you know how I get on. It may take a few days to get back to that task though - it's been put aside while I work on other things.

  • Using 9-bit SSI transfers instead of 16-bit gets me from about 16 us per sample down to about 8 us.

    There is some variability in either the uDMA processing time or the interrupt latency at the end of the uDMA transfer. Most of the time I see about 0.6 us from the end of the SPI transfer to the start of interrupt processing for the end-of-transfer interrupt. Every 1 ms or 0.7 ms (the period alternates!) I see 3-4 us of interrupt latency. Any idea what that may be? I'd guess USB SOF if it were a regular 1 ms, but it isn't.
  • The amount of CPU activity on the RAM bus and the peripheral bus will affect the time taken by the uDMA transfers; the CPU always has priority. The time to respond to an interrupt request varies with the number of cycles the current instruction takes to execute, even when no other interrupts are being processed. However, the alternating nature of this delay implies it is application related.
  • It turned out to be a couple of sections of code that were being a bit aggressive in disabling interrupts! A little more thought about just what needs to be disabled, and about interrupt priority levels, is warranted.

    I tracked the problem down by waggling spare I/O bits, watching them with a Saleae logic analyzer, and commenting out chunks of code. The logic analyzer let me easily search for long pulses (8 us - 11 us!) spaced hundreds of pulses apart, using pulse-width-based event detection. Great tool!