AM5728: PCIe access hangs DSP and ARM

Part Number: AM5728

Running a custom AM5728 board with the ARM running Linux and DSP running Sys/Bios.

The DSP is a PCIe RC talking to an FPGA EP, Gen 2, 1 lane.  DSP Core 1 connects via PCIESS2.

The driver code for the DSP borrows heavily from the PCIe example from the Processor SDK.

The overall process works like this:

ARM Boots Linux

ARM Loads DSP application

ARM puts FPGA Image in shared memory

ARM tells DSP (via MessageQ) to load the FPGA via GPMC interface.

Once the FPGA is loaded and ready, the DSP starts the PCIE Initialization.

The PCIE initialization goes like this:

1)  HW reset the FPGA PCIE Subsystem via a GPIO

2) Call Pcie_init()

3) Call pcieSerdesCfg -> PlatformPCIESS2ClockEnable(), PlatformPCIESS2PllConfig(), PlatformPCIESS2CtrlConfig(), PlatformPCIESS2Reset(), PlatformPCIESS2PhyConfig()

4) Call Pcie_open()

5) Call pcieCfgRC()

6) Call Pcie_cfgBar()

7) Call pcieIbTransCfg()

8) Call pcieObTransCfg()

9) Call Pcie_getMemSpaceRange()

10) Call pcieLtssmCtrl() <- Enable link training

11) Call waitLinkup() <- returns when the link is up

Our ARM application reloads the DSP and FPGA with different images several times a minute.  After some random amount of time, the DSP will not respond.  When we try to connect with the JTAG debugger we get the "Device core is hung" error.   

To help isolate the issue, I looped the PCIE initialization steps above in the DSP, plus a close and shutdown at the end, over and over again.  I can run anywhere from 50 to 1000 iterations before the DSP core hangs.
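
The stress loop described above can be sketched as a table of named steps that is run repeatedly, with a marker emitted before each call so that a hang leaves the last step entered on record. This is only a sketch: the step names are the ones listed in the post, but the `int (void)` step signature, the stub functions, and `stress_loop` itself are hypothetical and are not the Processor SDK API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* One bring-up or teardown step. The int(void) signature is hypothetical;
 * the real driver calls take handles and configuration structures. */
typedef int (*step_fn)(void);

struct step {
    const char *name;   /* label printed before the call */
    step_fn     run;    /* returns 0 on success */
};

/* Name of the most recently entered step. After a hang or a failure this
 * identifies where execution stopped. */
static const char *last_step_entered = "(none)";

/* Runs the whole sequence up to max_iter times and returns the number of
 * fully completed iterations. Stops at the first failing step. */
static unsigned stress_loop(const struct step *steps, size_t n_steps,
                            unsigned max_iter)
{
    assert(steps != NULL && n_steps > 0);
    for (unsigned iter = 0; iter < max_iter; ++iter) {
        for (size_t i = 0; i < n_steps; ++i) {
            last_step_entered = steps[i].name;
            printf("iter %u: entering %s\n", iter, steps[i].name);
            fflush(stdout);   /* push the marker out before a possible hang */
            if (steps[i].run() != 0) {
                return iter;
            }
        }
    }
    return max_iter;
}

/* Placeholder steps standing in for the real calls (hypothetical). */
static int stub_ok(void) { return 0; }

static unsigned cfg_rc_calls = 0;
static int stub_fail_on_third_call(void)
{
    return (++cfg_rc_calls == 3u) ? -1 : 0;
}

/* Shortened version of the sequence from the post, wired to the stubs. */
static const struct step demo_steps[] = {
    { "Pcie_init",     stub_ok },
    { "pcieSerdesCfg", stub_ok },
    { "Pcie_open",     stub_ok },
    { "pcieCfgRC",     stub_fail_on_third_call },
    { "pcieLtssmCtrl", stub_ok },
    { "waitLinkup",    stub_ok },
};
```

With the stubs above, the third pass fails inside the `pcieCfgRC` placeholder, so the loop reports two completed iterations and `last_step_entered` names that step. A true bus hang would never return, which is why the marker is written before the call rather than after it.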

After much debugging via printf logging, I found it always hung in the pcieCfgRC() function.  After more printf debugging, it always hung at the code:

Mem_set(&getRegs, 0, sizeof(getRegs));
getRegs.statusCmd = &statusCmd;

TRACE(DspTrace::ERROR_LOG, "pcieCfgRC Enable memory read");
NEO_sleep(10);
if ((retVal = Pcie_readRegs(handle, pcie_LOCATION_LOCAL, &getRegs)) != pcie_RET_OK) {
    TRACE(DspTrace::ERROR_LOG, "Read Status Comand register failed!");
}

TRACE is a logging function I provide that writes a string to a circular buffer. The sleep call lets our low-level task run; that task sends the string just written up to the ARM for display on a console.
I can also read that buffer memory directly from the ARM. In every failure case, the last thing logged is "pcieCfgRC Enable memory read".
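
For readers unfamiliar with this kind of logging, here is a minimal sketch of a circular trace buffer whose most recent entry can be recovered after a hang. It is not the author's TRACE implementation; the structure layout, slot count, message length, and function names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TRACE_SLOTS   8u    /* hypothetical depth */
#define TRACE_MSG_LEN 64u   /* hypothetical slot size, including the NUL */

struct trace_ring {
    uint32_t write_count;                      /* total messages written */
    char     msg[TRACE_SLOTS][TRACE_MSG_LEN];  /* fixed-size message slots */
};

/* Store a message in the next slot, overwriting the oldest once full. */
static void trace_put(struct trace_ring *r, const char *text)
{
    assert(r != NULL && text != NULL);
    char *slot = r->msg[r->write_count % TRACE_SLOTS];
    strncpy(slot, text, TRACE_MSG_LEN - 1u);
    slot[TRACE_MSG_LEN - 1u] = '\0';  /* strncpy may leave it unterminated */
    r->write_count++;                  /* bump the count after the text is in */
}

/* Most recent message, or NULL if nothing has been logged yet. */
static const char *trace_last(const struct trace_ring *r)
{
    assert(r != NULL);
    if (r->write_count == 0u) {
        return NULL;
    }
    return r->msg[(r->write_count - 1u) % TRACE_SLOTS];
}
```

Because the message is stored before the risky register access, a second core that can still read the buffer sees which call was in progress. In memory shared between cores, cache maintenance and ordering would also have to be handled; that is not shown here.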

I dug into the pcie V1 source code and found the code above is reading the Status/Command register at address 0x51800004. So, I replaced the call to Pcie_readRegs with the following code:

UInt32 statusCmdVal = *((volatile uint32_t *)((uint32_t *)0x51800004));

and it still hung exactly in that spot.

So, after hundreds of iterations, running the same code over and over, the core will suddenly hang on a read to 0x51800004.
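
The direct read used above can be wrapped in a small helper. This is a sketch; `mmio_read32` and the macro name are hypothetical, and only the address comes from the post. The inner `(uint32_t *)` cast in the original line is redundant, so a single cast to a volatile pointer is enough.

```c
#include <assert.h>
#include <stdint.h>

/* Address taken from the post: Status/Command register in the local
 * PCIe configuration space. */
#define PCIE_STATUS_CMD_ADDR ((uintptr_t)0x51800004u)

/* Perform exactly one 32-bit load from addr. */
static inline uint32_t mmio_read32(uintptr_t addr)
{
    assert((addr & 0x3u) == 0u);   /* 32-bit registers are word aligned */
    return *(volatile const uint32_t *)addr;
}
```

On the target this would be called as `mmio_read32(PCIE_STATUS_CMD_ADDR)`. Note that `volatile` only forces the compiler to emit the load. It cannot make the load complete if the interconnect never returns a response, which is consistent with the core stalling on this single instruction.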

It was then suggested to read that same address from the ARM. I used the command omapconf read 0x51800004 and Linux then hung. I had to press POR on the board to reboot.

When Linux is booted, but before the DSP is loaded or PCIE configured, the omapconf read 0x51800004 returns this:

-bash-4.3# omapconf read 0x51800004


!!! OUPS... MEMORY ERROR @ 0x00000000 !!!
Are you sure that:
    MEMORY ADDRESS IS VALID?
    TARGETED MODULE IS CLOCKED?

I guess that is expected since PCIe is not up and running yet.  (BTW, all references to PCIE have been removed from the Linux device tree.)

After the DSP/FPGA is loaded and looping, the read does work:

-bash-4.3# omapconf read 0x51800004
00100146
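
Assuming this dword follows the standard PCI configuration header layout (Command register in the low 16 bits, Status register in the high 16 bits), the value 00100146 can be decoded as sketched below. The bit positions are the ones defined for the standard header; the helper names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Command register bits (low 16 bits of the dword at offset 0x04). */
#define PCI_CMD_IO_SPACE     (1u << 0)
#define PCI_CMD_MEM_SPACE    (1u << 1)
#define PCI_CMD_BUS_MASTER   (1u << 2)
#define PCI_CMD_PARITY_RESP  (1u << 6)
#define PCI_CMD_SERR_ENABLE  (1u << 8)

/* Status register bits (high 16 bits of the same dword). */
#define PCI_STS_CAP_LIST     (1u << 4)

static uint16_t pci_command(uint32_t status_cmd)
{
    return (uint16_t)(status_cmd & 0xFFFFu);
}

static uint16_t pci_status(uint32_t status_cmd)
{
    return (uint16_t)(status_cmd >> 16);
}

/* Print the fields that are relevant to the value seen in the post. */
static void print_status_cmd(uint32_t v)
{
    uint16_t cmd = pci_command(v);
    uint16_t sts = pci_status(v);
    printf("command = 0x%04X, status = 0x%04X\n", (unsigned)cmd, (unsigned)sts);
    printf("  I/O space enable     : %d\n", (cmd & PCI_CMD_IO_SPACE) != 0);
    printf("  memory space enable  : %d\n", (cmd & PCI_CMD_MEM_SPACE) != 0);
    printf("  bus master enable    : %d\n", (cmd & PCI_CMD_BUS_MASTER) != 0);
    printf("  parity error response: %d\n", (cmd & PCI_CMD_PARITY_RESP) != 0);
    printf("  SERR# enable         : %d\n", (cmd & PCI_CMD_SERR_ENABLE) != 0);
    printf("  capabilities list    : %d\n", (sts & PCI_STS_CAP_LIST) != 0);
}
```

For 0x00100146 the command half is 0x0146 (memory space, bus master, parity error response, and SERR# enabled; I/O space disabled) and the status half is 0x0010 (capabilities list present). In other words, the read returns a plausible configured RC header while the subsystem is healthy.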

When the DSP finally hangs, I try the same command and the Linux prompt never returns. Ctrl-C does not work and I cannot SSH into the board; I have to press the Reset button.

Not sure where to go with this next.  Obviously something in the PCIE subsystem is locking up and causing the memory access to it to hang the memory bus.  Just not sure how to prove that or fix it.

  • The software team have been notified. They will respond here.
  • Hi,

    In the past we had a setup with one AM5728 IDK EVM connected to another, one as RC and the other as EP. The connection was PCIe SS1 x1 at Gen2 speed. The software on both ends was Processor SDK RTOS. We also repeatedly ran your test sequence, resetting PCIe SS1, to check link stability over hundreds of iterations. We didn't see any issue.

    We can't do the same on PCIe SS2 due to a hardware connection limitation. Are you able to use PCIe SS1 for the test with the FPGA? Would reducing the speed to Gen1 help?

    Regards, Eric
  • Late last week, I dug into this a bit deeper myself.  I decided to see under what conditions a DSP/ARM access to the PCIe register at 0x51800004 would hang.  When I boot the DSP and have not initialized PCIe yet, I get the MEMORY ERROR response.  I then stepped through the PCIe startup code.  I don't recall exactly, but when I executed either PlatformPCIESS1ClockEnable() or PlatformPCIESS1PllConfig(), I would get past the MEMORY ERROR, but then the read from the ARM would hang.  I was then able to step the DSP some more, and when the DSP executed the PlatformPCIESS1Reset() function, the hung command on the ARM would finish.  This was a major breakthrough.   I could get it unstuck.  YAY.

    I then saw that the RM_PCIESS_RSTCTRL register was at 0x4AE07310 and not in the 0x51000000 address space.  So, I ran the test again until, as before, the DSP hung. From the ARM, using the register at 0x4AE07310, I was able to reset the PCIE Subsystem and sure enough the DSP would resume processing.  So, without a doubt the PCIE Subsystem was responsible for hanging the DSP core, and even the ARM core if you tried to access memory addresses in its subsystem.

    Some history ....

    In another thread, we have been discussing problems in general reloading DSP/FPGA over and over again.  Part of an attempted solution to that was to "reset" the PCIe subsystem as part of the DSP shutdown.  We wanted to be sure the PCIE subsystem was not doing any possible memory access or interrupt when the DSP code went IDLE causing a memory hang issue.

    So, when the DSP was sent a shutdown message from the ARM (via MessageQ) I called the PCIe class destructor.  This was coded to call the Pcie_close() driver function.  However, this only had the effect of setting the driver handle to null, which did not seem like a very complete close implementation.  So, I looked at the Reset function provided in the example code, the same function called on initialization in pcieSerdesCfg() and I created a new function called PlatformPCIESS1Stop().  This was the same as the Reset version, but at the end it did not take the Subsystem out of reset, it left the reset bit in RM_PCIESS_RSTCTRL set. 

    Part of my stress testing was to call init and then shutdown over and over, so I was calling this new Stop function at the end of my loop.  But, keep in mind, at the start of the initialization, the PCIE constructor, I call pcieSerdesCfg(), which in turn calls PlatformPCIESS1Reset().  I would have assumed that leaving the reset bit high would have no effect on the future more complete use of the Reset function.  I was wrong.

    In another test, instead of calling my Stop function in the destructor, I called the Reset function.  When I did this, I was able to complete runs of over 5000 and over 7000 iterations without failure, and I am being told the system level testing runs better also.

    However, I am very concerned about this ability for the PCIe Subsystem to hang the DSP and ARM when something goes wrong.  A DSP Core Hung error seems like a real black hole.  I am lucky I happened to stumble across something that has made it better. 

    So, my question is, are there steps to take to help prevent or recover from this type of problem?