I am running a custom AM5728 board with the ARM running Linux and the DSP running SYS/BIOS.
The DSP is a PCIe RC talking to an FPGA EP, Gen 2, 1 lane. DSP Core 1 connects via PCIESS2.
The driver code for the DSP borrows heavily from the PCIe example from the Processor SDK.
The overall process works like this:
ARM Boots Linux
ARM Loads DSP application
ARM puts FPGA Image in shared memory
ARM tells DSP (via MessageQ) to load the FPGA via GPMC interface.
Once the FPGA is loaded and ready, the DSP starts the PCIE Initialization.
The PCIE initialization goes like this:
1) HW reset the FPGA PCIE Subsystem via a GPIO
2) Call Pcie_init()
3) Call pcieSerdesCfg -> PlatformPCIESS2ClockEnable(), PlatformPCIESS2PllConfig(), PlatformPCIESS2CtrlConfig(), PlatformPCIESS2Reset(), PlatformPCIESS2PhyConfig()
4) Call Pcie_open()
5) Call pcieCfgRC()
6) Call Pcie_cfgBar()
7) Call pcieIbTransCfg()
8) Call pcieObTransCfg()
9) Call Pcie_getMemSpaceRange()
10) Call pcieLtssmCtrl() <- Enable link training
11) Call waitLinkup() <- returns when the link is up
Our ARM application reloads the DSP and FPGA with different images several times a minute. After some random amount of time, the DSP will not respond. When we try to connect with the JTAG debugger we get the "Device core is hung" error.
To help isolate the issue, I made the DSP loop over the PCIE initialization steps above, plus a close and shutdown at the end, over and over again. I get anywhere from 50 to 1000 iterations before the DSP core hangs.
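A minimal sketch of how such a stress loop could be structured so that the hanging step is identifiable afterwards. It is illustrative only, not the code running on the board: init_step_t, breadcrumb_t, run_sequence() and the placeholder steps are hypothetical names, and on the target each step would wrap one of the real calls listed above (Pcie_init(), pcieCfgRC(), and so on). The breadcrumb is assumed to live in memory the ARM can also read.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per initialization step (hypothetical type). */
typedef int (*step_fn)(void);

typedef struct {
    const char *name;   /* for logging on the DSP side */
    step_fn     run;    /* returns 0 on success */
} init_step_t;

/* Breadcrumb, assumed to be placed in memory the ARM can read. */
typedef struct {
    volatile uint32_t iteration;   /* loop count when the step started */
    volatile uint32_t step_index;  /* index of the step about to run */
} breadcrumb_t;

/* Runs every step in order. The breadcrumb is written BEFORE each step,
 * so if the core hangs inside a step, the breadcrumb still names it. */
static int run_sequence(const init_step_t *steps, size_t count,
                        uint32_t iteration, breadcrumb_t *crumb)
{
    for (size_t i = 0; i < count; i++) {
        crumb->iteration  = iteration;
        crumb->step_index = (uint32_t)i;
        int rc = steps[i].run();
        if (rc != 0) {
            return rc;   /* breadcrumb keeps pointing at the failed step */
        }
    }
    return 0;
}

/* Placeholder steps standing in for the real PCIe calls. */
static int step_ok(void)   { return 0; }
static int step_fail(void) { return -1; }
```

On the target, the loop would call run_sequence() once per iteration with a table of the eleven steps, followed by the close and shutdown. A cache write-back of the breadcrumb may be needed before each step so the ARM sees the latest value.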
After much debugging via printf logging, I found it always hung in the pcieCfgRC() function. More printf debugging showed that it always hung at this code:
Mem_set(&getRegs, 0, sizeof(getRegs));
getRegs.statusCmd = &statusCmd;
TRACE(DspTrace::ERROR_LOG,"pcieCfgRC Enable memory read");
NEO_sleep(10);
if ((retVal = Pcie_readRegs(handle, pcie_LOCATION_LOCAL, &getRegs)) != pcie_RET_OK) {
    TRACE(DspTrace::ERROR_LOG,"Read Status Comand register failed!");
}
TRACE is a logging function I provide that writes a string to a circular buffer. The sleep lets our low-level task run; that task sends the string just written up to the ARM for display on a console.
I can also access that buffer memory from the ARM. In every failure case, the last thing logged is "pcieCfgRC Enable memory read".
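For anyone unfamiliar with this kind of logging, here is a minimal sketch of a circular trace buffer of the sort described. It is not the actual TRACE implementation: trace_ring_t, trace_write(), trace_last() and the two size constants are hypothetical, and the real buffer would also need whatever cache maintenance the shared memory region requires.

```c
#include <stdint.h>
#include <string.h>

#define TRACE_SLOTS   8u    /* hypothetical number of messages kept */
#define TRACE_MSG_LEN 64u   /* hypothetical max length, including NUL */

typedef struct {
    volatile uint32_t head;                 /* total messages written */
    char msg[TRACE_SLOTS][TRACE_MSG_LEN];   /* oldest entries get overwritten */
} trace_ring_t;

/* Copies the text into the next slot, truncating if needed, then
 * advances head so a reader only sees the slot once it is filled. */
static void trace_write(trace_ring_t *ring, const char *text)
{
    char *slot = ring->msg[ring->head % TRACE_SLOTS];
    strncpy(slot, text, TRACE_MSG_LEN - 1u);
    slot[TRACE_MSG_LEN - 1u] = '\0';
    ring->head++;
}

/* Returns the most recently written message, or "" if none yet. */
static const char *trace_last(const trace_ring_t *ring)
{
    if (ring->head == 0u) {
        return "";
    }
    return ring->msg[(ring->head - 1u) % TRACE_SLOTS];
}
```

Because the newest entry is always at head - 1, the ARM can find the last message written before a hang without the DSP's cooperation.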
I dug into the pcie V1 source code and found the code above is reading the Status/Command register at address 0x51800004. So, I replaced the call to Pcie_readRegs with the following code:
UInt32 statusCmdVal = *((volatile uint32_t *)((uint32_t *)0x51800004));
and it still hung exactly in that spot.
So, after hundreds of iterations, running the same code over and over, the core will suddenly hang on a read to 0x51800004.
It was then suggested to read that same address from the ARM. I used the command omapconf read 0x51800004 and Linux then hung. I had to press POR on the board to reboot.
When Linux is booted, but before the DSP is loaded or PCIE configured, the omapconf read 0x51800004 returns this:
-bash-4.3# omapconf read 0x51800004
!!! OUPS... MEMORY ERROR @ 0x00000000 !!!
Are you sure that:
MEMORY ADDRESS IS VALID?
TARGETED MODULE IS CLOCKED?
I guess that is expected since PCIe is not up and running yet. (BTW, all references to PCIE have been removed from the Linux device tree.)
After the DSP/FPGA is loaded and looping, the read does work:
-bash-4.3# omapconf read 0x51800004
00100146
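For reference, the value 00100146 decodes as Command = 0x0146 (memory space, bus master, parity error response and SERR# enabled) and Status = 0x0010 (capabilities list present), which looks like a normally configured RC. The sketch below shows that decoding using the standard PCI configuration header bit positions; the macro and function names are my own, not from the PCIe driver.

```c
#include <stdint.h>
#include <stdio.h>

/* Command register bits (low 16 bits of the dword at offset 0x04). */
#define PCI_CMD_IO_SPACE     (1u << 0)
#define PCI_CMD_MEM_SPACE    (1u << 1)
#define PCI_CMD_BUS_MASTER   (1u << 2)
#define PCI_CMD_PARITY_RESP  (1u << 6)
#define PCI_CMD_SERR_ENABLE  (1u << 8)

/* Status register bit (high 16 bits of the same dword). */
#define PCI_STS_CAP_LIST     (1u << 4)

static uint16_t pci_command(uint32_t status_cmd)
{
    return (uint16_t)(status_cmd & 0xFFFFu);
}

static uint16_t pci_status(uint32_t status_cmd)
{
    return (uint16_t)(status_cmd >> 16);
}

/* Prints the fields that matter for this post. */
static void pci_print_status_cmd(uint32_t status_cmd)
{
    uint16_t cmd = pci_command(status_cmd);
    uint16_t sts = pci_status(status_cmd);

    printf("command=0x%04X mem=%d master=%d parity=%d serr=%d\n",
           (unsigned)cmd,
           (cmd & PCI_CMD_MEM_SPACE) != 0,
           (cmd & PCI_CMD_BUS_MASTER) != 0,
           (cmd & PCI_CMD_PARITY_RESP) != 0,
           (cmd & PCI_CMD_SERR_ENABLE) != 0);
    printf("status=0x%04X caplist=%d\n",
           (unsigned)sts, (sts & PCI_STS_CAP_LIST) != 0);
}
```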
When the DSP finally hangs, I try the same command and the Linux prompt never returns. Ctrl-C does not work and I cannot SSH into the board, so I have to press the Reset button.
Not sure where to go with this next. Obviously something in the PCIE subsystem is locking up and causing the memory access to it to hang the memory bus. Just not sure how to prove that or fix it.
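Since the omapconf error above asks whether the targeted module is clocked, one thing that might help narrow this down is to log the PCIESS2 module clock state just before the read that hangs. The sketch below only decodes a CLKCTRL register value that has already been read. It assumes the usual PRCM layout with IDLEST in bits [17:16]; the exact register for PCIESS2, its address and its field layout need to be confirmed against the AM572x TRM. All names here are hypothetical. A "functional" result would not prove the read is safe, but a non-functional one would point at clocking rather than at the link.

```c
#include <stdint.h>

/* ASSUMPTION: IDLEST occupies bits [17:16] of the module CLKCTRL
 * register, as in the usual PRCM layout. Verify against the TRM. */
#define CLKCTRL_IDLEST_SHIFT 16u
#define CLKCTRL_IDLEST_MASK  0x3u

typedef enum {
    IDLEST_FUNCTIONAL = 0,   /* module fully functional */
    IDLEST_TRANSITION = 1,   /* waking up, going to sleep, or aborting */
    IDLEST_IDLE       = 2,   /* idle; interface clock may be gated */
    IDLEST_DISABLED   = 3    /* disabled; cannot be accessed */
} idlest_t;

/* Extracts the IDLEST field from a raw CLKCTRL value. */
static idlest_t clkctrl_idlest(uint32_t clkctrl)
{
    return (idlest_t)((clkctrl >> CLKCTRL_IDLEST_SHIFT) & CLKCTRL_IDLEST_MASK);
}

/* Human-readable state name, e.g. for a TRACE-style log line. */
static const char *idlest_name(idlest_t state)
{
    switch (state) {
    case IDLEST_FUNCTIONAL: return "functional";
    case IDLEST_TRANSITION: return "transition";
    case IDLEST_IDLE:       return "idle";
    case IDLEST_DISABLED:   return "disabled";
    default:                return "unknown";
    }
}

/* Non-zero only when the module reports fully functional. */
static int pcie_cfg_read_looks_safe(uint32_t clkctrl)
{
    return clkctrl_idlest(clkctrl) == IDLEST_FUNCTIONAL;
}
```

The idea would be to read the CLKCTRL value, log idlest_name() through TRACE, and skip the 0x51800004 read when pcie_cfg_read_looks_safe() returns 0, so the log survives instead of the core hanging.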