USB-API Buffers & DMA



I've been trying to understand how the USB-API handles its buffers. Everything in this question / discussion relates to CDC TX.


First, some context:

st = USBCDC_intfStatus(intfNum, &bytesSent, &bytesReceived);

if ((st & kUSBCDC_busNotAvailable) || (st & kUSBCDC_waitingForSend)) {
    return 0;
}

I use the above code snippet to wait until it's OK to send more data to the USB interface. This snippet works as advertised.

However, I've noticed that this reports the CDC interface as ready to accept more data _before_ the API calls USBCDC_handleSendCompleted(). I noticed because I was using that handler to prepare the next dataset. If I instead use a flag that I set before initiating a send and clear in the handler, my data integrity problems go away, at the expense of greatly reduced throughput (throughput measured as the actual number of bytes transferred, not the number of good bytes). What I really need is to find out when the source buffer can be nuked, i.e. when its contents have made their way into the EP buffer, not when they have been sent off to the host. While trying to figure out what was going on, I noticed the following:
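
For reference, the flag-based gating I described is essentially this (simplified; the wrapper is mine):

#include <stdint.h>
#include "USB_API/USB_CDC_API/UsbCdc.h"

static volatile uint8_t txBusy = 0;      // set before a send, cleared by the event handler

uint8_t sendChunk(const uint8_t *buf, uint16_t len, uint8_t intfNum)
{
    txBusy = 1;                          // set *before* initiating the send
    return USBCDC_sendData(buf, len, intfNum);
}

// Only once this event fires do I treat the source buffer as reusable again.
uint8_t USBCDC_handleSendCompleted(uint8_t intfNum)
{
    (void)intfNum;
    txBusy = 0;
    return 1;                            // TRUE: wake the main loop (exit LPM)
}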

There seem to be two endpoint buffers in the API (X and Y), and the API switches between them. It looks like it allows one to be filled from the source while the other is sent out over USB, which seems tailor-made for handling large source buffers. This is behaviour that I would like to be able to make use of from application space as well, with smaller source buffers.

The actual transfer from the source buffer to the EP buffer is done by memcpyDMA() in usbdma.c. This function is essentially called from within an interrupt context, and if I'm reading this right, it triggers the DMA and then loops on the interrupt flag!?

I'm also having some trouble following this line of code:

  *pCT1 = byte_count;                                                         //Set counter for usb In-Transaction

Am I correct in assuming that this is what tells the USB peripheral hardware (and not the API) that the endpoint buffer is ready to be transmitted? If so, could someone help me understand what's going on? If not, how does the endpoint buffer, after being filled by memcpyDMA(), get transmitted to the host?

  • Some USB benchmarks for MSP430 are here...

    http://forum.43oh.com/topic/2775-msp430-usb-benchmark

    And basic / average calculations for my USB assembler stack are here...

    http://forum.43oh.com/topic/3931-native-usb-port-on-lm4f120-board

    The question is what you need and what you want to do. DMA can improve some things, but the opposite scenario is possible too. An easier / simpler / faster way is to change the (almost 2 KB of) IEP/OEP buffer pointers on the fly, instead of using the (only two 64-byte) fixed X/Y buffers. Indication that data has been sent / received is given by the NAK bit.
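
    In C, with the TI API's usb.h definitions, that NAK check looks roughly like this (untested; the EDB index is a placeholder for your data-IN endpoint):

    #include <stdint.h>
    #include "USB_API/USB_Common/usb.h"    // tEDB, EPBCNT_NAK, tInputEndPointDescriptorBlock

    #define CDC_IN_EDB_INDEX 2             // placeholder: endpoint n uses EDB index n-1

    // Nonzero when both hardware IN buffers (X and Y) have been picked up by
    // the host: the UBM sets the NAK bit in the byte-count register after sending.
    static uint8_t inBuffersSent(void)
    {
        tEDB *edb = &tInputEndPointDescriptorBlock[CDC_IN_EDB_INDEX];
        return (edb->bEPBCTX & EPBCNT_NAK) && (edb->bEPBCTY & EPBCNT_NAK);
    }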

  • The USB module hardware is described in the User's Guide. See especially section 40.3.3.2 (bulk in transfer).

    The CDC API copies data from the user buffer to the UBM's X/Y buffers, but handleSendCompleted() is called only when the next attempt to fill the X/Y buffer from the user buffer fails.

    As far as I can tell, calling USBCDC_sendData() after the user buffer has been emptied but before handleSendCompleted() is called will continue the previous sending operation, i.e., handleSendCompleted() is not yet called, and no zero packet will be sent.

    If you want to get a notification as soon as the user buffer is empty, you could add another check for nCdcBytesToSendLeft==0 at the end of CdcToHostFromBuffer() (see the sketch at the end of this post). However, the new data cannot be copied to the X/Y buffers until one of those becomes empty, so this will not speed up your transfers.

    What data integrity problems are you alluding to?

    The DMA interrupt flag just indicates that DMA has completed; this interrupt is not enabled. (In theory, DMA stops the CPU while it's copying, so checking the interrupt flag should not be needed, but it might be possible that somebody else has reconfigured the DMA.)

    The assignment to *pCT1 indeed tells the UBM that the buffer is ready to be sent (it both sets the size, and clears the NAK flag).
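
    Untested sketch of that extra check in UsbCdc.c (the flag is new; the other names are the ones already used in CdcToHostFromBuffer()):

    // new application-visible flag, declared near the top of UsbCdc.c
    volatile uint8_t bCdcUserBufferEmptied = FALSE;

    // ...and at the end of CdcToHostFromBuffer(), after the copy into the X/Y buffer:
    if (CdcWriteCtrl[INTFNUM_OFFSET(intfNum)].nCdcBytesToSendLeft == 0) {
        // The user buffer has been copied completely into USB buffer memory and
        // can be reused, even though USBCDC_handleSendCompleted() has not fired
        // yet and the last packet may still be in flight.
        bCdcUserBufferEmptied = TRUE;
    }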

  • Clemens,

    The data integrity problems are/were all mine: not a result of the API itself, but of improper use of it. I took USBCDC_intfStatus() reporting that no write is in progress to mean that handleSendCompleted() had already run.

    Essentially, I was relying on handleSendCompleted() to update pointers within the source buffer, while the intfStatus result was used to trigger another write. The actual order in which those events occur is the reverse of what I expected, so the last chunk of data from my buffer was being sent twice because of the stale pointers being used.

    This complication arises because I'm trying to tie an application-space ring buffer into the USB CDC sends (a sketch of what I mean is at the end of this post). An X/Y-style double buffer in application space would be far easier to implement and, it seems, would offer almost all of the advantages of a ring buffer. I just happen to feel the ring buffer solution, if I can get it to work, would be more elegant in some intangible sense.

    My ideal implementation would rarely send a zero packet, just continuously streaming data as fast as it possibly can.

    In exactly one set of circumstances, setting a global flag from within CdcToHostFromBuffer(), right at the beginning of the pre-existing check for nCdcBytesToSendLeft==0, doubled effective throughput (from about 180-190 kbps to 320 kbps). Sadly, I failed to record exactly what those circumstances were. This is probably because my application code usually ends up calling the API send functions with only about one buffer's worth of data ready to go, which amplifies the other API/peripheral/protocol overhead as a fraction of the real data being sent. I'll look into how I can avoid sending the zero packet without creating a whole new set of race conditions.

    I've made some changes to the application code which have stabilized the transmit at about 180 kbps without changing any USB-API code. The application code is no spring chicken either, so it may be slower than the actual USB transmit; I'm looking into ways to do some crude profiling. Presently, my sends are requested with (64*n) - 1 bytes, where n is the largest number that results in a copy smaller than the data presently ready.

    The DMA halting the CPU during the transfer is not something I had realized; I should probably read up on the DMA documentation as well. This means that using DMA is not as useful as I'd hoped. A naive memcpy implementation operating on 16-bit words could probably handle the copy at 4-5 clocks per byte, while DMA can do it at 1. That's an improvement, certainly, but the USB API only uses the DMA for a 64-byte copy. To do this, it makes a series of driverlib DMA calls, each of which is a function in the driverlib sources. Since these are implemented in separate C files, gcc is likely to put everything into separate .o files (unless the entire application is compiled as one monolithic executable, which is how the USB-API examples do it). With no inlining possible after that, that makes for a lot of function calls, and I'd be surprised if their cost is lower than, or even comparable to, the cost of copying 64 bytes.
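
    For concreteness, the ring-buffer glue I have in mind is roughly this (sizes and names are placeholders, there is no overflow check, and both rbWrite() and rbService() are assumed to run from the main loop):

    #include <stdint.h>
    #include "USB_API/USB_CDC_API/UsbCdc.h"

    #define RB_SIZE  512u      // power of two; placeholder size
    #define CDC_INTF 0         // placeholder: your CDC interface number (e.g. CDC0_INTFNUM)

    static uint8_t rb[RB_SIZE];
    static volatile uint16_t rbHead;        // producer index (application)
    static volatile uint16_t rbTail;        // consumer index (USB side)
    static volatile uint16_t rbInFlight;    // bytes currently handed to the API
    static volatile uint8_t  rbSendDone;    // set by USBCDC_handleSendCompleted()

    // Producer side.
    void rbWrite(const uint8_t *p, uint16_t len)
    {
        while (len--) {
            rb[rbHead] = *p++;
            rbHead = (rbHead + 1) & (RB_SIZE - 1);
        }
    }

    // Main loop: retire a finished chunk, then hand the next contiguous run
    // (up to the wrap point) to the API in a single sendData() call.
    void rbService(void)
    {
        if (rbSendDone) {
            rbSendDone = 0;
            rbTail = (rbTail + rbInFlight) & (RB_SIZE - 1);
            rbInFlight = 0;
        }
        if (rbInFlight == 0 && rbHead != rbTail) {
            uint16_t n = (rbHead > rbTail) ? (rbHead - rbTail) : (RB_SIZE - rbTail);
            rbInFlight = n;
            USBCDC_sendData(&rb[rbTail], n, CDC_INTF);
        }
    }

    // Send-complete event: just flag completion and wake the main loop.
    uint8_t USBCDC_handleSendCompleted(uint8_t intfNum)
    {
        (void)intfNum;
        rbSendDone = 1;
        return 1;    // TRUE: exit LPM so the main loop runs rbService()
    }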
  • The driverlib functions can be inlined by gcc if you use -flto.

    As for the USB CDC API, it is designed to handle small amounts of data at a low rate. If you want to send CDC data at a high rate, you need to share the ring buffer between the CDC API and your application, and replace all the code in UsbCdc.c that handles the send buffer. That code wouldn't be too complex if you can restrict it to always sending full 64-byte packets.

    A further optimization would be to avoid copying by putting the ring buffer into USB buffer memory, and modifying the X/Y buffer base address registers to point to the next packet to be sent.
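
    I haven't tried it, but the core of that zero-copy variant would look something like this (the EDB index, the buffer base, and especially the 8-byte address encoding are assumptions to verify against usb.h and the User's Guide):

    #include <stdint.h>
    #include "USB_API/USB_Common/usb.h"    // tEDB, tInputEndPointDescriptorBlock

    #define CDC_IN_EDB_INDEX 2             // placeholder: endpoint n -> EDB index n-1
    #define USB_BUF_BASE     0x1C00u       // start of USB buffer RAM

    // Hand a packet that already sits in USB RAM to the X buffer: repoint the
    // X base address at it, then write the byte count (NAK bit clear) so the
    // UBM may transmit it.  Assumption: the base-address register holds the
    // offset within USB RAM in 8-byte units, so packets must be 8-byte aligned.
    static void armInX(uint16_t pktAddr, uint8_t len)
    {
        tEDB *edb = &tInputEndPointDescriptorBlock[CDC_IN_EDB_INDEX];
        edb->bEPBBAX = (uint8_t)((pktAddr - USB_BUF_BASE) >> 3);
        edb->bEPBCTX = len;                // same hand-off as the *pCT1 = byte_count above
    }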

  • I think I'll hold off on doing that for the moment, though moving the application buffer into USB memory seems like the way to go. I've been avoiding doing that, and avoiding looking too closely at what is inside the USB RAM. The endpoint buffers account for 896 bytes of USB RAM, leaving about 1100 bytes of other USB-related memory. Depending on how much of that is free or freeable, moving whole buffers into USB memory will mean reducing the number of endpoints.

    This is fine, of course, but I'm presently implementing the protocols that I think will be useful on top of the standard endpoints. Once that is done, I should see whether the overall design holds up from usability and functionality perspectives at least, before gutting the innards of the USB-API code.
  • The USB endpoint buffer memory is 0x1C00-0x2300, 1792 bytes. It is possible to change an endpoint's buffer pointer on the fly, after the endpoint is configured. Also, different endpoints that are not used at the same time can share the same buffers. As I mentioned before, it depends on what you need and what you want to do. Here is a detailed description of changing a receive-then-process algorithm into a streaming one (receiving and processing in parallel) that I did with my USB stack for my flasher:

    http://forum.43oh.com/topic/2972-sbw-msp430f550x-based-programmer/?p=41143
