I successfully used the Starterware bulk USB example to create my own class-compliant device on USB0. I'm running with Bulk-in and Bulk-out endpoints using 512-byte packets with a high-speed host (Windows 7, 64-bit). Performance when writing to the Bulk-in endpoint (device to host) FIFO is surprisingly fast even though the low-level driver code uses PIO. That code is able to transfer 512 bytes to the FIFO in about 15 usec. But even though the driver uses similar code when reading 512 bytes from the Bulk-out endpoint (host to device) FIFO, the read takes 9 times longer! This comes out to over 270 nsec/byte, whereas writing takes only about 30 nsec/byte. All endpoints use single buffering. What's going on?
At first I thought this was DDR memory latency (my target board is a BBB), so I wrote some test code that reads the FIFO into a CPU register only, without writing the data to a buffer in RAM. That test is just as slow, so the root problem appears to be the memory access that reads the FIFO register itself.
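A minimal sketch of that kind of register-only test is below. It is not the Starterware driver code: `fifo_drain_words`, `fifo_reg` and `ns_per_byte` are hypothetical names for illustration, and the 32-bit access width is an assumption. The FIFO address has to come from the device headers, and the timing source is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read nbytes from a memory-mapped FIFO register, one
 * 32-bit word at a time, and discard the data. The volatile qualifier forces
 * one bus read per iteration, and nothing is stored to a buffer in RAM.
 * Returns the XOR of the words read so the result can be sanity-checked. */
static uint32_t fifo_drain_words(volatile const uint32_t *fifo_reg, size_t nbytes)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < nbytes / 4u; i++) {
        acc ^= *fifo_reg;   /* same address each time: the FIFO pops on read */
    }
    return acc;
}

/* Convert a measured transfer time in microseconds to a per-byte cost in ns. */
static double ns_per_byte(double elapsed_usec, size_t nbytes)
{
    return elapsed_usec * 1000.0 / (double)nbytes;
}
```

On the target, `fifo_reg` would point at the Bulk-out endpoint FIFO and the call would be bracketed by two timer reads. As a check on the figures above, `ns_per_byte(15.0, 512)` gives roughly 29 nsec/byte for the write path.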