Linux/AM5716: OpenCL related TIOCL FATAL: Communication to a DSP has been lost (likely due to an MMU fault)

Part Number: AM5716

Tool/software: Linux

Hi,

A customer is running an algorithm within the OpenCL framework. It works on most of their boards, but some boards experience the following error:

Trace of Running Program from Command Line:
-------------------------------------------

Enqueued task: scan -- @5781ms
recvfrom failed: Link has been severed (67)
rpmsgThreadFxn: transportGet failed on fd 11, returned -20
TIOCL FATAL: Communication to a DSP has been lost (likely due to an MMU fault). Please wait while the DSPs are reset and the runtime attempts to terminate. A reboot may be required before running another OpenCL application if this fails. See the kernel log for fault information.

The customer deploys the exact same software to multiple boards. On most of them (about 7) everything works, but on 3 boards they get the error “TIOCL FATAL: Communication to a DSP has been lost (likely due to an MMU fault).” After this error occurs, the DSP stays unavailable, even though the kernel log suggests that the firmware is reloaded and reports the DSP as “up” and “available” again. A reboot is required before even the platforms example program from ti/examples/opencl will run successfully.
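To compare good and bad boards, it may help to scan each board's captured program trace for the failure signature quoted in the trace above. Below is a minimal sketch in Python, assuming the traces are saved as plain text. The names `SIGNATURES`, `find_signatures`, and `board_failed` are made up for illustration, and the patterns are copied only from the trace lines in this post, so other runs may show different fd or return values.

```python
import re

# Patterns taken from the trace lines quoted in this post (assumption:
# other failing boards print the same messages; fd and return codes may vary).
SIGNATURES = {
    "link_severed": re.compile(r"recvfrom failed: Link has been severed"),
    "transport_get": re.compile(
        r"rpmsgThreadFxn: transportGet failed on fd \d+, returned -?\d+"
    ),
    "tiocl_fatal": re.compile(r"TIOCL FATAL: Communication to a DSP has been lost"),
}


def find_signatures(trace_text):
    """Map each signature name to the 1-based line numbers where it appears."""
    hits = {name: [] for name in SIGNATURES}
    for lineno, line in enumerate(trace_text.splitlines(), start=1):
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name].append(lineno)
    return hits


def board_failed(trace_text):
    """True if the trace contains the TIOCL FATAL message."""
    return bool(find_signatures(trace_text)["tiocl_fatal"])
```

Running `find_signatures` over the saved trace from each board would show whether the failing boards all hit the same sequence (link severed, then transportGet failure, then TIOCL FATAL) or whether the order differs between boards.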

 

Attached are the program trace, the dmesg output corresponding to the failure, and the LAD log output.

Could someone review the attached traces to see whether anything looks systemically wrong or suspicious?

What could cause this type of failure? The boards have undergone significant DDR testing under temperature without failures.

6116.opencl_lad_issue.zip

Thanks,

--Gunter