Hi,
we have two TI K2H EVMs connected over Hyperlink using two breakout cards from Mistral. We followed the instructions in http://processors.wiki.ti.com/index.php/MCSDK_HPC_3.x_Getting_Started_Guide#EVM_Setup to the letter. We also applied the U-Boot FDT command from the MCSDK User Guide to make sure Hyperlink is enabled, and we downscaled the Hyperlink clock via mpm-config.json as advised.

We verified the testmpi example application over TCP, which works fine:

/opt/ti-openmpi/bin/mpirun --mca btl_base_verbose 100 --mca btl self,tcp -np 2 -host k2hnode1,k2hnode2 ./testmpi

The same run over Hyperlink, however, fails:

/opt/ti-openmpi/bin/mpirun --mca btl_base_verbose 100 --mca btl self,hlink -np 2 -host k2hnode1,k2hnode2 ./testmpi

Has anyone seen this error before? Any help would be much appreciated!
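For completeness, the U-Boot device-tree step mentioned above looked roughly like this (a sketch from memory; the hyperlink node path and the ${addr_fdt} variable are assumptions from our setup and may differ in your environment -- check with fdt print):

    # sketch only: enable the Hyperlink node in the loaded dtb from the U-Boot prompt
    fdt addr ${addr_fdt}                      # point U-Boot at the loaded device tree blob
    fdt print /soc/hyperlink0                 # assumed node name -- verify it exists and its status
    fdt set /soc/hyperlink0 status "okay"     # mark the Hyperlink port as enabled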
The full output of the failing hlink run is below.

Version info (not the first combination we tried...):
BMC_ver: 1.0.2.5
EVM type: 0.0.0.1
EVM Superset: K2KH-EVM
One EVM is rev 3.0, the other is rev 4.0; boot mode is ARM-SPI.

Installed components:
imglib_c66x_3_1_1_0
mcsdk-hpc_03_00_01_08
mcsdk_linux_3_01_01_04
ndk_2_22_02_16
openem_1_10_0_0
openmp_dsp_2_01_16_02
pdk_keystone2_3_01_01_04
ti-cgt-c6000_8.0.0
ti-llvm-3.3-3.3
ti-opencl_1.0.0
ti-openmpacc_1.2.0
ti-openmpi-1.0.0.18
transport_net_lib_1_1_0_2
uia_1_03_02_10
xdctools_3_25_06_96
xdctools_3_25_06_96_core6x

Output:

[k2hnode2:01877] mca: base: components_open: Looking for btl components
[k2hnode1:01954] mca: base: components_open: Looking for btl components
[k2hnode2:01877] mca: base: components_open: opening btl components
[k2hnode2:01877] mca: base: components_open: found loaded component hlink
[k2hnode2:01877] BTL_HLINK TIMPIDBG: hlink_component_register!!!
[k2hnode2:01877] This is EVM, using hl0 only!
[k2hnode2:01877] mca: base: components_open: component hlink register function successful
[k2hnode2:01877] BTL_HLINK TIMPIDBG: hlink_component_open!!!
[k2hnode2:01877] BTL_HLINK BTL HLINK start of HYPLNKINITCFG: 0xb6a63dfc
[k2hnode2:01877] BTL_HLINK [0x21400000]
[k2hnode2:01877] BTL_HLINK [0x40000000]
[k2hnode2:01877] BTL_HLINK [0x21400100]
[k2hnode2:01877] BTL_HLINK [0x28000000]
[k2hnode2:01877] BTL_HLINK [(nil)]
[k2hnode2:01877] BTL_HLINK [(nil)]
[k2hnode2:01877] BTL_HLINK [(nil)]
[k2hnode2:01877] BTL_HLINK [(nil)]
[k2hnode2:01877] BTL_HLINK BTL HLINK end of HYPLNKINITCFG
[k2hnode2:01877] BTL_HLINK: CMEM_init OK!
[k2hnode2:01877] mca: base: components_open: component hlink open function successful
[k2hnode2:01877] mca: base: components_open: found loaded component self
[k2hnode2:01877] mca: base: components_open: component self has no register function
[k2hnode2:01877] mca: base: components_open: component self open function successful
[k2hnode1:01954] mca: base: components_open: opening btl components
[k2hnode1:01954] mca: base: components_open: found loaded component hlink
[k2hnode1:01954] BTL_HLINK TIMPIDBG: hlink_component_register!!!
[k2hnode1:01954] This is EVM, using hl0 only!
[k2hnode1:01954] mca: base: components_open: component hlink register function successful
[k2hnode1:01954] BTL_HLINK TIMPIDBG: hlink_component_open!!!
[k2hnode1:01954] BTL_HLINK BTL HLINK start of HYPLNKINITCFG: 0xb6afcdfc
[k2hnode1:01954] BTL_HLINK [0x21400000]
[k2hnode1:01954] BTL_HLINK [0x40000000]
[k2hnode1:01954] BTL_HLINK [0x21400100]
[k2hnode1:01954] BTL_HLINK [0x28000000]
[k2hnode1:01954] BTL_HLINK [(nil)]
[k2hnode1:01954] BTL_HLINK [(nil)]
[k2hnode1:01954] BTL_HLINK [(nil)]
[k2hnode1:01954] BTL_HLINK [(nil)]
[k2hnode1:01954] BTL_HLINK BTL HLINK end of HYPLNKINITCFG
[k2hnode1:01954] BTL_HLINK: CMEM_init OK!
[k2hnode1:01954] mca: base: components_open: component hlink open function successful
[k2hnode1:01954] mca: base: components_open: found loaded component self
[k2hnode1:01954] mca: base: components_open: component self has no register function
[k2hnode1:01954] mca: base: components_open: component self open function successful
[k2hnode2:01877] select: initializing btl component hlink
[k2hnode2:01877] BTL_HLINK TIMPIDBG: hlink_component_init!!!
[k2hnode1:01954] select: initializing btl component hlink
[k2hnode1:01954] BTL_HLINK TIMPIDBG: hlink_component_init!!!
[k2hnode2:01877] BTL_HLINK shmem open successfull!!
[k2hnode2:01877] BTL_HLINK: CMEM physAddr: 22000000 (to a2000000) userAddr:0xb59a8000
[k2hnode2:01877] BTL_HLINK shmem MSMC0 mmap successfull!!
[k2hnode2:01877] BTL_HLINK shmem MSMC0 mmap successfull!!
[k2hnode2:01877] BTL_HLINK attempt HyperLink0 then HyperLink1
[k2hnode2:01877] BTL_HLINK hyplnk0 attempt opening
[k2hnode1:01954] BTL_HLINK shmem open successfull!!
[k2hnode1:01954] BTL_HLINK: CMEM physAddr: 22000000 (to a2000000) userAddr:0xb5a41000
[k2hnode1:01954] BTL_HLINK shmem MSMC0 mmap successfull!!
[k2hnode1:01954] BTL_HLINK shmem MSMC0 mmap successfull!!
[k2hnode1:01954] BTL_HLINK attempt HyperLink0 then HyperLink1
[k2hnode1:01954] BTL_HLINK hyplnk0 attempt opening
[k2hnode2:01877] BTL_HLINK hyplnk0 open failed
[k2hnode2:01877] BTL_HLINK hyplnk1 attempt opening
[k2hnode1:01954] BTL_HLINK hyplnk0 open failed
[k2hnode1:01954] BTL_HLINK hyplnk1 attempt opening
[k2hnode1:01954] BTL_HLINK hyplnk1 open failed
[k2hnode1:01954] BTL_HLINK hyplnk0=(nil) hyplnk1=(nil)
[k2hnode1:01954] HLINK turned off !!!
[k2hnode1:01954] select: init of component hlink returned failure
[k2hnode1:01954] select: module hlink unloaded
[k2hnode1:01954] select: initializing btl component self
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for MPI communications. This means that no Open MPI device has indicated that it can be used to communicate between these processes. This is an error; Open MPI requires that all MPI processes be able to reach each other. This error can sometimes be the result of forgetting to specify the "self" BTL.

  Process 1 ([[62988,1],1]) is on host: k2hnode2
  Process 2 ([[62988,1],0]) is on host: k2hnode1
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable from another. This *usually* means that an underlying communication plugin -- such as a BTL or an MTL -- has either not loaded or not allowed itself to be used. Your MPI job will now abort.

You may wish to try to narrow down the problem;
* Check the output of ompi_info to see which BTL/MTL plugins are available.
* Run your application with MPI_THREAD_SINGLE.
* Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose, if using MTL-based communications) to see exactly which communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[k2hnode1:1954] *** An error occurred in MPI_Init
[k2hnode1:1954] *** reported by process [4127981569,0]
[k2hnode1:1954] *** on a NULL communicator
[k2hnode1:1954] *** Unknown error
[k2hnode1:1954] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[k2hnode1:1954] *** and potentially your MPI job)
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly. You should double check that everything has shut down cleanly.

  Reason: Before MPI_INIT completed
  Local host: k2hnode1
  PID: 1954
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 1954 on node k2hnode1 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter orte_create_session_dirs is set to false. In this case, the run-time cannot detect that the abort call was an abnormal termination. Hence, the only error message you will receive is this one.

This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.
--------------------------------------------------------------------------
[k2hnode2:01875] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[k2hnode2:01875] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[k2hnode2:01875] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
[k2hnode2:01875] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[k2hnode2:01875] 1 more process has sent h[k2hnode1:01954] select: init of component self returned success
[k2hnode2:01877] BTL_HLINK hyplnk1 open failed
[k2hnode2:01877] BTL_HLINK hyplnk0=(nil) hyplnk1=(nil)
[k2hnode2:01877] HLINK turned off !!!
[k2hnode2:01877] select: init of component hlink returned failure
[k2hnode2:01877] select: module hlink unloaded
[k2hnode2:01877] select: initializing btl component self
[k2hnode2:01877] select: init of component self returned success