TMS320C6678: EDMA3TC ERRSTAT Bus Error

Brad Petrus

Part Number: TMS320C6678

Hello,

I am developing software for a custom board comprising an FPGA and two C6678 DSPs, which connects to a host computer via PCIe Gen3.

The EDMA3 engine is used to DMA data between each DSP and the FPGA, and that normally works fine. However, when running a utility that transfers a large amount of data from the host to the DDR memory attached to the FPGA, the data being DMAd between the DSPs and the FPGA becomes corrupted.

Using a Blackhawk debugger, I found that two of the EDMA3TCs are indicating a bus error (i.e. ERRSTAT = 0x00000001), and the ERRDET register indicates that the errors occur during read operations.
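For reference, the register values described above can be decoded in software as well as in the debugger. The following is only a sketch: the bit positions (BUSERR in bit 0 of ERRSTAT, the STAT field in ERRDET bits 3:0 with codes 1h-7h meaning read errors and 8h-Fh meaning write errors, TCC in bits 13:8) are my reading of the KeyStone EDMA3 user guide and should be checked against it. The function and type names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed EDMA3TC bit layout; verify against the EDMA3 user guide. */
#define TC_ERRSTAT_BUSERR    (1u << 0)   /* bus error reported by the TC */
#define TC_ERRDET_STAT_MASK  0x0000000Fu /* error code, bits 3:0        */
#define TC_ERRDET_TCC_SHIFT  8           /* TCC of the failing TR       */
#define TC_ERRDET_TCC_MASK   0x3Fu

typedef enum { TC_BUS_NONE, TC_BUS_READ, TC_BUS_WRITE } tc_bus_dir_t;

typedef struct {
    bool         bus_error; /* ERRSTAT.BUSERR was set                  */
    tc_bus_dir_t dir;       /* read or write side, from ERRDET.STAT    */
    uint32_t     stat;      /* raw ERRDET.STAT code                    */
    uint32_t     tcc;       /* transfer completion code of the TR      */
} tc_error_info_t;

/* Decode raw ERRSTAT/ERRDET values (hypothetical helper). */
tc_error_info_t tc_decode_error(uint32_t errstat, uint32_t errdet)
{
    tc_error_info_t info;

    info.bus_error = (errstat & TC_ERRSTAT_BUSERR) != 0;
    info.stat      = errdet & TC_ERRDET_STAT_MASK;
    info.tcc       = (errdet >> TC_ERRDET_TCC_SHIFT) & TC_ERRDET_TCC_MASK;

    if (info.stat == 0u) {
        info.dir = TC_BUS_NONE;   /* no error detail latched */
    } else if (info.stat < 8u) {
        info.dir = TC_BUS_READ;   /* codes 1h-7h: read side  */
    } else {
        info.dir = TC_BUS_WRITE;  /* codes 8h-Fh: write side */
    }
    return info;
}
```

Passing ERRSTAT = 0x00000001 together with the ERRDET value read from the same TC gives the direction and the TCC, which identifies the transfer that failed.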

So, I have the following questions:
1. Are the bus errors occurring due to (PCIe read) timeouts?
2. What can be done to eliminate these errors?
3. What can be done to recover from these errors when they occur (i.e. if I enable EDMA3TC error interrupts, what can be done in ISR to recover from the error)?

Thanks,
Brad

over 6 years ago

0 Brad Petrus over 6 years ago

Intellectual 370 points

Oops, I forgot to mention that the DMAs between the DSPs and the FPGA go over the PCIe bus, too.

0 Victor Kazmirenko over 6 years ago

Guru 13202 points

Hello!

The PCIe subsystem has error reporting capability too, so you may wish to take a look at it.

Also, both PCIe and EDMA3 can trigger interrupts on error conditions, so one may design recovery procedures, or at least signal to the user that something is happening there. Perhaps a peripheral reset/re-init may be used, though I had no luck with a PCIe restart.

0 Brad Petrus over 6 years ago in reply to Victor Kazmirenko

Intellectual 370 points

Thank you, rrlagic!

I am aware of the EDMA3 error interrupt support but am more interested in what can be done to recover from this type of error. For example, can the associated EDMA3TC(s) be reset?

Thanks,
Brad

0 Brad Petrus over 6 years ago

Intellectual 370 points

I see that the PCIe subsystem provides the ability to disable or increase the completion timeout via the DEV_STAT_CTRL2 register. Would disabling or increasing the timeout eliminate the issue? (I guess disabling the CTO is not really an option.)
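To make the question concrete, here is a sketch of how the timeout range might be changed. It assumes DEV_STAT_CTRL2 follows the standard PCIe Device Control 2 layout (completion timeout value in bits 3:0, completion timeout disable in bit 4) and uses the range encodings from the PCIe base specification; please check both against the KeyStone PCIe user guide. The helper only computes the new register value; the actual register access, and any write-enable step the PCIESS may need, is left out.

```c
#include <stdint.h>

/* Assumed field layout (standard PCIe Device Control 2). */
#define CMPL_TO_MASK  0x0000000Fu  /* completion timeout value, bits 3:0 */
#define CMPL_TO_DIS   (1u << 4)    /* completion timeout disable         */

/* Range encodings as given in the PCIe base specification. */
enum {
    CMPL_TO_DEFAULT     = 0x0,  /* default range          */
    CMPL_TO_16MS_55MS   = 0x5,  /* 16 ms to 55 ms         */
    CMPL_TO_65MS_210MS  = 0x6,  /* 65 ms to 210 ms        */
    CMPL_TO_260MS_900MS = 0x9,  /* 260 ms to 900 ms       */
    CMPL_TO_1S_3P5S     = 0xA   /* 1 s to 3.5 s           */
};

/*
 * Return the new DEV_STAT_CTRL2 value for a given range code
 * (hypothetical helper). The timeout stays enabled: the disable
 * bit is cleared, and all other bits are preserved.
 */
uint32_t cmpl_to_set_range(uint32_t reg, uint32_t range_code)
{
    reg &= ~(CMPL_TO_MASK | CMPL_TO_DIS);
    reg |= (range_code & CMPL_TO_MASK);
    return reg;
}
```

A longer range would only give slow completions more time to arrive; it would not remove whatever is delaying them.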

Thanks,
Brad

0 Victor Kazmirenko over 6 years ago in reply to Brad Petrus

Guru 13202 points

Hello Brad,

Though I understand your wish to develop a recovery mechanism, I still suggest hunting for the root cause. Perhaps, if you eliminate that, you'll need no recovery. Let me give an example.

We have a system with a C6670 connected to a Spartan 6 FPGA over PCIe. The original solution was to have the DMA engine on the FPGA side. That engine was far from perfect; in particular, if there was a PIO request during a DMA transfer, the machinery deadlocked. We found that by looking at the PCIe errors, where we noticed a completion timeout. The quick fix was careful scheduling, with no PIO requests during DMA transfers, which indeed was a crutch rather than a solution. So later we developed a simpler responder-type engine on the FPGA side and let the DSP's EDMA3 do the job. Since then we have not needed a recovery mechanism on the PCIe link.

What you have already found is an EDMA read error. Do you also see errors in the PCIe subsystem? I don't think there is trouble in the EDMA's communication with the PCIESS; I think PCIe itself is more likely to fail.

As per section 3.7.12, Device Status and Control Register 2 (DEV_STAT_CTRL2), in SPRUGS6D, the KeyStone Architecture Peripheral Component Interconnect Express (PCIe) User Guide, the default timeout value is 50 s to 50 ms. Can you guess what happens on the host side so that it can't respond within 50 seconds?

Now imagine you have developed a recovery mechanism which restores the operational state after a timeout. But it takes 50 seconds to catch that situation. Would it really help you?

0 Brad Petrus over 6 years ago in reply to Victor Kazmirenko

Intellectual 370 points

Thanks again, rrlagic!

I think I'm now on the same page as you (and agree totally with your point about the 50 second PCIe timeout).

By rearranging the chaining of DMAs on the DSP, I believe I can now synchronize the DMAd inputs to the DSP processing algorithm with the processed DMAd outputs. I previously had independent DMA chains associated with the input and output buffers processed by the DSP. Since they were independent, read delays (or even completion timeouts) delayed the acquisition of the input buffer with respect to the output buffer, and this caused corruption of the DMAd output buffer.

By adapting the DMA arrangement so that the output buffer is DMAd only after the DMA of the input buffer completes (or times out), I believe the corruption will no longer be present. This should solve the DMA corruption issue; the only side effect is that, if a read DMA transaction takes too long, samples may be lost, which for our use case is undesirable but not fatal.
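As a rough illustration of that arrangement, the input transfer's PaRAM OPT word can be set up so that its final completion chains to the output channel. This is a sketch, not my actual code: the field positions (TCC in bits 17:12, TCCHEN in bit 22) are taken from my reading of the EDMA3 user guide and should be verified, and the helper name and channel numbers are hypothetical. How the chain behaves when the read ends in a bus error also needs to be confirmed on hardware.

```c
#include <stdint.h>

/* Assumed PaRAM OPT field layout; verify against the EDMA3 user guide. */
#define OPT_TCC_SHIFT  12
#define OPT_TCC_MASK   (0x3Fu << OPT_TCC_SHIFT) /* transfer completion code  */
#define OPT_TCCHEN     (1u << 22)               /* chain on final completion */

/*
 * Return an OPT word that chains to 'next_channel' when the transfer
 * completes (hypothetical helper). All other OPT bits are preserved.
 */
uint32_t opt_chain_to(uint32_t opt, uint32_t next_channel)
{
    opt &= ~OPT_TCC_MASK;                                   /* drop the old TCC     */
    opt |= (next_channel << OPT_TCC_SHIFT) & OPT_TCC_MASK;  /* TCC = chained channel */
    opt |= OPT_TCCHEN;                                      /* enable chaining       */
    return opt;
}
```

For example, if the input DMA uses one channel and the output DMA uses channel 5, then applying opt_chain_to(opt, 5) to the input PaRAM set means the output transfer is triggered by completion of the input transfer rather than by an independent event.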

Together with the re-arrangement of the DMAs, I plan to enable the error interrupts associated with the EDMA3TCs and maintain statistics on the occurrence of errors.
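The statistics handler I have in mind looks roughly like the sketch below. It assumes the TC error registers sit contiguously in the order ERRSTAT, ERREN, ERRCLR, ERRDET, ERRCMD, that writing 1s to ERRCLR clears the matching ERRSTAT bits, and that writing 1 to ERRCMD (EVAL) re-evaluates the error interrupt; all of that is from my reading of the EDMA3 user guide and needs checking. The struct and function names are hypothetical, and the handler only counts and clears errors; it does not retry the failed transfer.

```c
#include <stdint.h>

/* Assumed ERRSTAT/ERRDET bit layout; verify against the EDMA3 user guide. */
#define ERRSTAT_BUSERR    (1u << 0)
#define ERRSTAT_TRERR     (1u << 2)
#define ERRSTAT_MMRAERR   (1u << 3)
#define ERRDET_STAT_MASK  0x0000000Fu
#define ERRCMD_EVAL       (1u << 0)

/* Assumed order of the TC error registers (hypothetical struct). */
typedef struct {
    volatile uint32_t ERRSTAT; /* error status                */
    volatile uint32_t ERREN;   /* error interrupt enable      */
    volatile uint32_t ERRCLR;  /* write 1 to clear status     */
    volatile uint32_t ERRDET;  /* details of the bus error    */
    volatile uint32_t ERRCMD;  /* error interrupt re-evaluate */
} tc_err_regs_t;

typedef struct {
    uint32_t bus_read;  /* bus errors on the read side  */
    uint32_t bus_write; /* bus errors on the write side */
    uint32_t tr_err;    /* transfer request errors      */
    uint32_t mmr_err;   /* MMR address errors           */
} tc_err_stats_t;

/* Count and clear pending TC errors; call from the TC error ISR. */
void tc_error_isr(tc_err_regs_t *regs, tc_err_stats_t *stats)
{
    uint32_t errstat = regs->ERRSTAT;
    uint32_t stat    = regs->ERRDET & ERRDET_STAT_MASK;

    if (errstat & ERRSTAT_BUSERR) {
        if (stat >= 8u) {
            stats->bus_write++;      /* codes 8h-Fh: write side */
        } else if (stat != 0u) {
            stats->bus_read++;       /* codes 1h-7h: read side  */
        }
    }
    if (errstat & ERRSTAT_TRERR)   { stats->tr_err++;  }
    if (errstat & ERRSTAT_MMRAERR) { stats->mmr_err++; }

    regs->ERRCLR = errstat;      /* clear exactly the bits we saw    */
    regs->ERRCMD = ERRCMD_EVAL;  /* re-pulse if more errors are pending */
}
```

One tc_err_stats_t per TC would let me see which transfer controllers are affected and how often the errors coincide with the host transfers.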

(I discussed this with the hardware team and they did not think any modifications could be made to the FPGA to prioritize, for example, the PCIe accesses by the DSPs; there are multiple contenders for the PCIe bus.)

Thanks!
Brad
