DM8148 eDMA over PCIe reads all Zeros

We are using the DM8148 eDMA to read/write data to EP memory over the PCIe bus. CPU and eDMA reads of 4 bytes work fine, as expected. However, if the eDMA transfer length is set to anything greater than 4, all zeros are read; in the write case, the contents of EP memory remain unchanged.

The eDMA transfer completion interrupt is generated successfully and ACNT reads zero, indicating that the entire requested data length has been transferred on the bus.

Please help us identify whether the issue lies on the eDMA side or the PCIe driver side. As far as the PCIe driver is concerned, we are successfully running USB 3.0 over PCIe. Do you still think a bug in the inbound/outbound translation could cause this behavior with eDMA burst transfers, or is there a PCIe-specific setting required?

Exactly the same behavior has already been reported by other people. Your urgent help in identifying the cause of the issue would be highly appreciated.


Thanks,

Zeeshan Aslam

  • Hi Zeeshan,

    Can you go through the link below:

    http://processors.wiki.ti.com/index.php/DM81xx_AM38xx_PCI_Express_Endpoint_Driver_User_Guide#EDMA_Kernel_Module

    Please check the sample configurations. 

    Regards,

    Mahesh

  • Hi Mahesh,

    We have already gone through this link and have not found anything that can help us. It just describes how to set up and build the RC and EP sample applications.

    The question I want to ask is this:

    Assume that our RC driver enumerates and maps endpoint devices correctly, and that our eDMA implementation is also tested on memory-to-memory transfers. In that case, if eDMA reads/writes to PCI Express memory don't work correctly, where could the issue be: on the eDMA side, or in the PCI Express root complex driver?

    So far we have not seen any functional issue with the RC driver apart from the DMA transfers. What could possibly be wrong here, and what would be the right direction to explore this issue?

    We are using an FPGA-based ML507 memory controller as the EP.

    Thanks,

    Zeeshan Aslam

  • Are the start and end of your transfers 16-byte aligned (per errata advisory 3.0.6)?

  • Also, be sure to check the EDMA TC used for the transfer for errors: if a problem occurs with the read, the write is not aborted but writes zeroes instead.

  • I have again verified 16-byte alignment at the start and end of transfers. This does not resolve our issue.

    As a first step, we are focusing only on EDMA writes to PCIe memory, verifying the data back using CPU reads from PCIe memory. The EDMA write transfer doesn't update the memory contents of the EP. I have checked the error status in the EDMA TC: all of the following error registers read zero after the transfer completion interrupt of the write transfer is received, so there seems to be no error in the EDMA write transfer.

    Error Register (offset 0x120): reads all 0

    Error Details Register (offset 0x12C): reads all 0

    Error Interrupt Command Register (offset 0x130): reads all 0
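
    For reference, this is roughly how we read them (a user-space sketch via /dev/mem; the TC base address below is a placeholder, the real address of the TC servicing our channel comes from the DM814x TRM):

    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <sys/mman.h>

    #define TC_BASE    0x49800000UL  /* placeholder: actual TC base per DM814x TRM */
    #define TC_ERRSTAT 0x120         /* Error Register */
    #define TC_ERRDET  0x12C         /* Error Details Register */
    #define TC_ERRCMD  0x130         /* Error Interrupt Command Register */

    int main(void)
    {
        int fd = open("/dev/mem", O_RDONLY | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        volatile uint32_t *tc = mmap(NULL, 0x1000, PROT_READ, MAP_SHARED,
                                     fd, TC_BASE);
        if (tc == MAP_FAILED) { perror("mmap"); return 1; }

        /* All three read zero after the write-transfer completion IRQ. */
        printf("ERRSTAT=0x%08x ERRDET=0x%08x ERRCMD=0x%08x\n",
               tc[TC_ERRSTAT / 4], tc[TC_ERRDET / 4], tc[TC_ERRCMD / 4]);
        return 0;
    }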

  • Section 19.2.11.1 of the DM814x TRM states:

    There are bandwidth implications of using an external DMA. If the PCIe core has been programmed to
    establish a link in PCIe 2.5 Gbps rate, then the DMA controller that drives PCIESS slave port must be able
    to write/read data at about 85% of 2 Gbps bandwidth per PCIe link. For a PCIe link speed of 5.0 Gbps, the
    DMA controller must be able to provide bandwidth of about 85% of 4 Gbps per PCIe link.
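
    (In concrete numbers, assuming 8b/10b encoding: a 2.5 GT/s link carries 2 Gbps of data per lane, so 85% of that is 1.7 Gbps, roughly 212 MB/s of sustained DMA throughput per lane; at 5.0 GT/s the target is about 425 MB/s.)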


    What are the EDMA transfer implications of not meeting this 85% bandwidth requirement? Could it have anything to do with the issue we are facing, or is it fine if the requirement is not met? And otherwise, how can we ensure we meet it?

    Thanks,

    Zeeshan Aslam

  • It is a little suspicious that the TI81XX PSP Feature Performance Guide shows DMA being used with PCIe only in RC mode ("Individual EPs may use Inbound DMA"), with no explanation of why outbound DMA wouldn't be used. Maybe someone from TI can comment on this?

     

    As far as I know, EDMA can easily saturate the interconnect bandwidth, which itself is far in excess of the PCIe bandwidth, so that should not be a problem. In any case, it should be easier to meet the bandwidth requirements with EDMA than with the CPU.

     

    To see if the issue is with burst writes in general, it may be interesting to try a 16-byte burst write from the CPU instead of EDMA.  The easiest way to do this is by using the NEON intrinsics (#include <arm_neon.h>), for example:

    *(volatile uint32x4_t *)dst = (uint32x4_t){ w0, w1, w2, w3 };  // write four words

    or

    *(volatile uint8x16_t *)dst = vld1q_u8( src );  // copy 16 bytes from src

    where dst is a 16-byte-aligned pointer into PCIe memory.
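
    A self-contained user-space version of this experiment could look like the sketch below (the PCIe window address and test pattern are made up; compile with -mfpu=neon, and note that the memory attributes of the mapping determine whether the store actually appears on the bus as a burst):

    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <arm_neon.h>

    #define PCIE_WIN_PHYS 0x20000000UL  /* placeholder: your outbound PCIe window */

    int main(void)
    {
        int i;
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        volatile uint32_t *dst = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, PCIE_WIN_PHYS);
        if (dst == MAP_FAILED) { perror("mmap"); return 1; }

        /* One 16-byte NEON quadword store into EP memory
         * (page-aligned mapping, so 16-byte alignment is guaranteed). */
        *(volatile uint32x4_t *)dst = (uint32x4_t){ 0x11111111, 0x22222222,
                                                    0x33333333, 0x44444444 };

        /* Read back with plain 32-bit loads and check what arrived. */
        for (i = 0; i < 4; i++)
            printf("word %d = 0x%08x\n", i, dst[i]);

        close(fd);
        return 0;
    }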

     

    As a workaround, since you mentioned single word transactions work correctly, you should be able to force EDMA to use single word transactions for the whole job by configuring a single 4-byte transfer per event and enabling self-chaining on intermediate transfer completion.  Note however that this makes very inefficient use of the L3 interconnect.
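
    For reference, here is a rough sketch of such a PaRAM configuration; the field layout and OPT bit positions follow the EDMA3 chapter of the TRM, while the channel number and buffer addresses are placeholders you would substitute:

    #include <stdint.h>

    struct edma3_param {                /* one EDMA3 PaRAM set, per the TRM */
        uint32_t opt;
        uint32_t src;
        uint32_t a_b_cnt;               /* BCNT[31:16] | ACNT[15:0]   */
        uint32_t dst;
        uint32_t src_dst_bidx;          /* DSTBIDX[31:16] | SRCBIDX[15:0] */
        uint32_t link_bcntrld;
        uint32_t src_dst_cidx;
        uint32_t ccnt;
    };

    /* Configure 'prm' so each trigger moves one 4-byte word and chains
     * back to the same channel until all 'nwords' words are done. */
    static void setup_single_word_chain(volatile struct edma3_param *prm,
                                        unsigned chan, uint32_t src,
                                        uint32_t dst, unsigned nwords)
    {
        prm->opt = (chan << 12)          /* TCC = own channel number                  */
                 | (1u << 23)            /* ITCCHEN: chain on intermediate completion */
                 | (1u << 20);           /* TCINTEN: interrupt on final completion    */
        prm->src = src;
        prm->a_b_cnt = (nwords << 16) | 4;   /* BCNT = nwords, ACNT = 4 bytes */
        prm->dst = dst;
        prm->src_dst_bidx = (4u << 16) | 4;  /* advance 4 bytes per array     */
        prm->link_bcntrld = 0xFFFF;          /* NULL link, no BCNT reload     */
        prm->src_dst_cidx = 0;
        prm->ccnt = 1;                       /* single frame                  */
    }

    Since SYNCDIM is left clear the transfer is A-synchronized, so each (self-chained) trigger moves exactly one 4-byte array, which is what forces single-word transactions on the bus.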

  • Hi Matthijs,

    Thanks for your help.


    We have tried 16-byte burst writes using NEON into the PCIe EP memory to see if the issue is with burst writes, as you suggested. The NEON burst write doesn't update the EP memory contents, although 16-byte burst reads from the EP memory seem to work fine. With eDMA, neither writes nor reads work. Since even the NEON burst writes into the EP memory are not working, we are considering trying a different EP to rule out an issue with the EP memory (the FPGA-based Xilinx ML507).

    Alternatively, do you think there could be anything on the RC side (PCIESS configuration) that could cause this issue with burst writes, such as an incorrect inbound/outbound address translation setup, even though simple CPU reads/writes work fine? We are also able to successfully run the mass storage demo over PCIESS using an xHCI USB EP. With this much verification of RC functionality, do you think something could still be missing in the PCIESS configuration that would cause this issue with burst transfers? We are basically trying to isolate whether the issue is on the DMA side, the RC side, or in the EP memory.


    Thanks for your support,

    Best Regards,

    Zeeshan

  • Assuming the PCIe memory is mapped on the Cortex-A8 as strongly-ordered or device memory, a NEON load is executed as four single-word transactions, not as a burst, which explains why it succeeded. (To perform a burst read from the Cortex-A8 you would need to map it as normal (non-cacheable) memory, but this is not safe: it permits the Cortex-A8 to combine single write accesses into bursts which are unlikely to satisfy the constraints of PCIESS.)
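
    To make the distinction concrete, under Linux the two mapping flavours are obtained roughly as below (a sketch; on ARM, ioremap() yields Device memory, while ioremap_wc() yields normal non-cacheable memory):

    #include <linux/errno.h>
    #include <linux/io.h>

    static void __iomem *pcie_dev_map, *pcie_wc_map;

    static int map_pcie_window(phys_addr_t bar_phys, size_t len)
    {
        /* Device memory: CPU accesses are not merged, safe for PCIe. */
        pcie_dev_map = ioremap(bar_phys, len);
        if (!pcie_dev_map)
            return -ENOMEM;

        /* Normal non-cacheable: the CPU may merge writes into bursts,
         * which is exactly what is unsafe here, as noted above. */
        pcie_wc_map = ioremap_wc(bar_phys, len);
        if (!pcie_wc_map) {
            iounmap(pcie_dev_map);
            return -ENOMEM;
        }
        return 0;
    }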

    Therefore I'm inclined to conclude that bursts aren't working at all, regardless of whether DMA is used, so the problem isn't with DMA but with PCIESS, the EP, or some interaction between them.

    Don't forget to check PCIe status registers for errors, e.g. completion timeouts if the EP is somewhat slow when processing bursts.  If such a cause is found, there are timeout registers which can be tweaked.
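
    For instance, from a Linux driver the standard Device Status error bits can be checked roughly as follows (a sketch using the generic config-space helpers of recent kernels; on older kernels the same register can be read with pci_read_config_word() at the PCIe capability offset):

    #include <linux/pci.h>

    /* Check the PCIe Device Status register of 'pdev' for error bits
     * after a failed burst, then clear the sticky bits. */
    static void check_pcie_errors(struct pci_dev *pdev)
    {
        u16 sta;

        pcie_capability_read_word(pdev, PCI_EXP_DEVSTA, &sta);

        if (sta & PCI_EXP_DEVSTA_CED)
            dev_warn(&pdev->dev, "correctable error detected\n");
        if (sta & PCI_EXP_DEVSTA_NFED)
            dev_warn(&pdev->dev, "non-fatal error detected\n");
        if (sta & PCI_EXP_DEVSTA_FED)
            dev_warn(&pdev->dev, "fatal error detected\n");
        if (sta & PCI_EXP_DEVSTA_URD)
            dev_warn(&pdev->dev, "unsupported request detected\n");

        pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, sta);  /* W1C */
    }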

    Also, if possible/applicable, you could try having the EP initiate the transfers; inbound transactions seem to be much less restricted than outbound ones.

  • Hi Matthijs,

    The issue with burst transfers raised questions about the reliability of our EP (ML507), as it is a reference memory design based on the Xilinx MIG (Memory Interface Generator), which can generate memory designs only for burst lengths of 2, 4, or 8.

    So we managed to arrange Broadcom and Intel PCIe-based Ethernet adapter cards, along with a Teledyne PCIe analyzer which we can use to inspect the packets at the physical layer.

    Before developing and using an Ethernet driver to carry large data transfers over the PCIe bus, we tried DMA transfers on the internal SRAM/flash memories of these cards, which are accessible through BARs, but without any success. I get Completer Abort (CA) and Unsupported Request (UR) statuses from the EP when I try to read the SRAM and flash memories using DMA with a TLP data length greater than 4 or 8 bytes.

    So the main question is:

    If an EP responds with CA or UR status to memory read requests larger than 4 or 8 bytes, does it really mean that these memories are not designed to fulfill burst read/write requests above a certain size (in our case 4 and 8 bytes for SRAM and flash, respectively), and that we cannot do anything about this EP hardware limitation? Or is there anything we are missing or getting wrong?


    We really appreciate your technical support on this issue.

    Thanks,

    Zeeshan

  • I'm really not all that familiar with PCIe, but unless there's something wrong with the requests the PCIe subsystem produces for bursts (which should presumably be visible in the analyzer?), it seems hard to avoid your conclusion that the endpoint doesn't support bursts. If so, you'll either need to avoid DMA or implement the workaround I mentioned (or some variation thereof) of forcing DMA to break the transfer up into single requests.

    Although the minimum payload size that all devices need to support is 128 bytes, the specs do permit devices to have a "restricted programming model" and issue a CA status if they don't like the request:

    When a device's programming model restricts (vs. what is otherwise permitted in PCI Express) the characteristics of a Request, that device is permitted to 'Completer Abort' any Requests which violate the programming model. Examples include unaligned or wrong-size access to a register block and unsupported size of request to a Memory Space.

    A UR status is stranger, since it means "invalid address" if I'm reading the specs right ("For Memory and I/O Requests, this determination is based on the address ranges the Function has been programmed to respond to.")

    Update: replaced the specs quote with a more appropriate one.

  • Hi Matthijs,

    On the analyzer, I don't see anything wrong with the TLP requests produced by the PCIe subsystem; the length is the only difference between successful and failed TLP requests. Our goal is to measure the maximum data transfer rate supported by the PCIe bus, so breaking the transfers into small TLP requests will not let us reach the bus's full potential. We need a way to perform large TLP transfers in the RC-to-EP direction, and every EP we have tried so far turns out to be unable to support such a use case.

    Yes, the minimum payload size in the specs is 128 bytes, but probably only for normal EP operation, such as an Ethernet EP acting as a bus-master DMA; it seems devices can restrict the supported payload size to 8 or 4 bytes for other access patterns, such as the host accessing the Ethernet card's internal SRAM for configuration purposes, as you also pointed out.

    A UR status is not only due to an invalid address. According to the specification (see the link below), it is the least clearly defined of all the errors, because much depends on the implementation.

    Examples of unsupported requests:
    - Unsupported request type or invalid command (by design or configuration)
    - An unsupported non-posted request causes the receiver to generate a completion with UR status
    - When in the D1, D2, or D3hot power states, any received memory TLP is treated as a UR
    - Unsupported message code
    - Many others; see table 6-2, section 6.2.7, for a list of specification references (14 of them!)

    In our case, the UR error status seems to be due to an unsupported non-posted request or invalid command.

    Reference:
    https://www.pcisig.com/developers/main/training_materials/get_document?doc_id=f96ce648eb6af1a8423e6f073f03420b1611cc4c

    Now we plan to develop an Ethernet driver, but for our exact use case we would like the RC to initiate large data transfers (up to the maximum supported payload size) to an EP that can support this. Do you have any suggestions for such an EP?

    Thanks for your useful insight.

    Best Regards,

    Zeeshan Aslam