AM6442: AM64x PCIe PTM issue with flow-control

Part Number: AM6442
Other Parts Discussed in Thread: BEAGLEY-AI

Dear TI team,

We've encountered a rather serious bug in the AM64x's PCIe implementation regarding flow control for PTM messages.

We've got a setup consisting of a custom AM64x board and a Lattice FPGA device connected via PCIe (AM64x is RC running Linux, FPGA is EP).

PTM works as long as the FPGA announces unlimited credits for "posted request data payload", but the AM64x freezes after some time if flow control is enabled, i.e. if the FPGA announces only a limited number of posted request data payload credits.

We discussed this issue in the past (~2022) over email with some TI colleagues, but there was no solution or even confirmation at that time. We had a workaround, but recent changes to the FPGA design mean we can no longer use it (the FPGA now requires flow control).

The same issue is visible when using an Intel i225 as an EP on an AM64x EVM. The Intel i225 NIC operates properly with PTM enabled when used on e.g. an Intel Elkhart Lake x86.

The AM64x completely freezes after a number of PTM cycles. The core that's been accessing the PCI bus appears to be completely stuck and remains stuck indefinitely (at least for several minutes, pipeline is stalled). We can break into the core using the JTAG debugger, but we only see a few registers, and all of memory is inaccessible. We can still access the SoC memory map from another processor core (e.g. R5f) or the debug MEM-AP.

We've noticed that before the A53 core freezes, an error bit indicating a flow control protocol error (FCPE) gets set in the AM64x PCIe RC registers after ~2000 PTM cycles. If we then continue triggering PCIe PTM cycles, the A53 freezes after a few hundred more cycles.

Using the AM64x registers that track the received credits (e.g. PCIE0_I_TRANSM_CRED_LIM_0_REG) we can see that the EP returns two posted data credits for the write that actually triggers the PTM cycle in the EP: one for the memory write itself, and one for the ResponseD message. Keeping track of all the writes we've performed, all the PTM cycles (and thus PCIe messages) these caused, and all the credits we've received back from the EP, we're pretty sure that the FCPE gets set because the AM64x believes that the EP announced more than 2047 outstanding data credits. This is one of only three reasons given in the PCIe specification for the FCPE error, and the only one that makes sense in this scenario (the EP doesn't announce unlimited credits of either type, so the other two reasons for FCPE can be ruled out).

The only explanation that we have is that the AM64x' RC doesn't count the ResponseD messages against the PD credit limit, and thus receives more credits back from the EP than what it assumes it has spent.
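
To illustrate the suspected mismatch, here is a minimal Python sketch of the credit accounting. It is a toy model, not the behavior of the actual Cadence IP: the function name `cycles_until_fcpe`, the initial advertisement of 8 credits, and the assumption that each PTM cycle costs exactly one data credit for the write plus one for ResponseD are all assumptions made for illustration. The 12-bit counter width follows the PCIe data credit field; the 2047 limit is the one mentioned above. The cycle count at which the model trips depends on the assumed initial advertisement, so it is not expected to match the exact count seen on hardware.

```python
PD_MOD = 1 << 12         # posted data credit counters are 12 bits wide
MAX_OUTSTANDING = 2047   # more unused data credits than this is an FCPE


def cycles_until_fcpe(initial_credits, count_responsed, max_cycles=10000):
    """Return the PTM cycle at which the RC would flag FCPE, or None.

    initial_credits  -- hypothetical PD credits the EP advertises at init
    count_responsed  -- True if the RC charges ResponseD against PD credits
    """
    credits_consumed = 0             # RC-side count of PD credits spent
    credit_limit = initial_credits   # cumulative limit advertised by the EP

    for cycle in range(1, max_cycles + 1):
        # The RC sends the triggering memory write (1 PD credit) and later
        # the ResponseD message (1 PD credit if it is counted at all).
        spent = 2 if count_responsed else 1
        credits_consumed = (credits_consumed + spent) % PD_MOD

        # The EP frees both buffers and returns two PD credits.
        credit_limit = (credit_limit + 2) % PD_MOD

        # Unused credits as seen by the RC, using modulo arithmetic.
        outstanding = (credit_limit - credits_consumed) % PD_MOD
        if outstanding > MAX_OUTSTANDING:
            return cycle
    return None


if __name__ == "__main__":
    print("ResponseD counted:    ", cycles_until_fcpe(8, True))
    print("ResponseD not counted:", cycles_until_fcpe(8, False))
```

With ResponseD counted, the unused credit total stays constant and the limit is never exceeded. With ResponseD not counted, the RC's view of unused credits grows by one per PTM cycle until it passes 2047, which is in the same range as the ~2000 cycles we observe.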

The error is set exactly when the 2048th ResponseD message has been transmitted by the RC. From that point on, the AM64x no longer seems to update the posted data credits it receives back from the EP, until it runs out of credits. At this point the AM64x freezes. I guess this alone might be considered a serious bug, because apparently no timeout exists that would allow this transaction to be aborted.

The PCIe specification is pretty clear that "Message Requests with data" count against the posted data credit limit and e.g. the Intel i225 documentation (one of the few devices that support PTM and are known to work) specifically says that messages consume posted data credits. The i225 documentation doesn't differentiate between "messages" and "messages with data", but from the AM64x' registers that track received credits we're reasonably sure that the i225 returns only a header credit for the PTM Response, and a header credit + a posted data credit for the PTM ResponseD message.

I've been able to verify that the same issue exists on an AM67x (BeagleY-AI), too.

Is there a chance that you could figure out how the Cadence PCIe IP core used in the AM64x handles PCIe PTM messages with regard to PCIe flow control?

I'm open to other explanations as to what causes this error, but so far it seems pretty conclusive. Let me know if you have any questions.

Best Regards,

Dominic