
OpenMPI example on two K2H boards via hyperlink transport fails

Hi,

{This is a moved thread from the C66x forum.
 The history of our discussion can be found here:

   e2e.ti.com/.../411087}


We've got two TI K2H EVMs connected through Hyperlink using two breakout cards from Mistral. We followed all the instructions in http://processors.wiki.ti.com/index.php/MCSDK_HPC_3.x_Getting_Started_Guide#EVM_Setup word for word. We also used the U-Boot FDT command described in the MCSDK UG to make sure Hyperlink is enabled. Additionally, we downscaled the Hyperlink clock using mpm-config.json as advised.

Running the testmpi example application over TCP works fine:

/opt/ti-openmpi/bin/mpirun --mca btl_base_verbose 100 --mca btl self,tcp -np 2 -host k2hnode1,k2hnode2 ./testmpi

However, the same run over the hlink BTL fails:

/opt/ti-openmpi/bin/mpirun --mca btl_base_verbose 100 --mca btl self,hlink -np 2 -host k2hnode1,k2hnode2 ./testmpi

Has anyone ever experienced this error? Any help would be much appreciated!


The output of the preceding command can be found below.

Version info (not the first we tried...):

BMC_ver: 1.0.2.5
EVM type: 0.0.0.1
EVM Superset: K2KH-EVM
one EVM is rev 3.0 and the other is rev 4.0
boot mode ARM-SPI

imglib_c66x_3_1_1_0
mcsdk-hpc_03_00_01_08
mcsdk_linux_3_01_01_04
ndk_2_22_02_16
openem_1_10_0_0
openmp_dsp_2_01_16_02
pdk_keystone2_3_01_01_04
ti-cgt-c6000_8.0.0
ti-llvm-3.3-3.3
ti-opencl_1.0.0
ti-openmpacc_1.2.0
ti-openmpi-1.0.0.18
transport_net_lib_1_1_0_2
uia_1_03_02_10
xdctools_3_25_06_96
xdctools_3_25_06_96_core6x

[k2hnode2:01877] mca: base: components_open: Looking for btl components
[k2hnode1:01954] mca: base: components_open: Looking for btl components
[k2hnode2:01877] mca: base: components_open: opening btl components
[k2hnode2:01877] mca: base: components_open: found loaded component hlink
[k2hnode2:01877] BTL_HLINK TIMPIDBG: hlink_component_register!!!
[k2hnode2:01877] This is EVM, using hl0 only!
[k2hnode2:01877] mca: base: components_open: component hlink register function successful
[k2hnode2:01877] BTL_HLINK TIMPIDBG: hlink_component_open!!!
[k2hnode2:01877] BTL_HLINK BTL HLINK start of HYPLNKINITCFG: 0xb6a63dfc
[k2hnode2:01877] BTL_HLINK [0x21400000]
[k2hnode2:01877] BTL_HLINK [0x40000000]
[k2hnode2:01877] BTL_HLINK [0x21400100]
[k2hnode2:01877] BTL_HLINK [0x28000000]
[k2hnode2:01877] BTL_HLINK [(nil)]
[k2hnode2:01877] BTL_HLINK [(nil)]
[k2hnode2:01877] BTL_HLINK [(nil)]
[k2hnode2:01877] BTL_HLINK [(nil)]
[k2hnode2:01877] BTL_HLINK BTL HLINK end of HYPLNKINITCFG
[k2hnode2:01877] BTL_HLINK: CMEM_init OK!
[k2hnode2:01877] mca: base: components_open: component hlink open function successful
[k2hnode2:01877] mca: base: components_open: found loaded component self
[k2hnode2:01877] mca: base: components_open: component self has no register function
[k2hnode2:01877] mca: base: components_open: component self open function successful
[k2hnode1:01954] mca: base: components_open: opening btl components
[k2hnode1:01954] mca: base: components_open: found loaded component hlink
[k2hnode1:01954] BTL_HLINK TIMPIDBG: hlink_component_register!!!
[k2hnode1:01954] This is EVM, using hl0 only!
[k2hnode1:01954] mca: base: components_open: component hlink register function successful
[k2hnode1:01954] BTL_HLINK TIMPIDBG: hlink_component_open!!!
[k2hnode1:01954] BTL_HLINK BTL HLINK start of HYPLNKINITCFG: 0xb6afcdfc
[k2hnode1:01954] BTL_HLINK [0x21400000]
[k2hnode1:01954] BTL_HLINK [0x40000000]
[k2hnode1:01954] BTL_HLINK [0x21400100]
[k2hnode1:01954] BTL_HLINK [0x28000000]
[k2hnode1:01954] BTL_HLINK [(nil)]
[k2hnode1:01954] BTL_HLINK [(nil)]
[k2hnode1:01954] BTL_HLINK [(nil)]
[k2hnode1:01954] BTL_HLINK [(nil)]
[k2hnode1:01954] BTL_HLINK BTL HLINK end of HYPLNKINITCFG
[k2hnode1:01954] BTL_HLINK: CMEM_init OK!
[k2hnode1:01954] mca: base: components_open: component hlink open function successful
[k2hnode1:01954] mca: base: components_open: found loaded component self
[k2hnode1:01954] mca: base: components_open: component self has no register function
[k2hnode1:01954] mca: base: components_open: component self open function successful
[k2hnode2:01877] select: initializing btl component hlink
[k2hnode2:01877] BTL_HLINK TIMPIDBG: hlink_component_init!!!
[k2hnode1:01954] select: initializing btl component hlink
[k2hnode1:01954] BTL_HLINK TIMPIDBG: hlink_component_init!!!
[k2hnode2:01877] BTL_HLINK shmem open successfull!!
[k2hnode2:01877] BTL_HLINK: CMEM physAddr: 22000000 (to a2000000) userAddr:0xb59a8000
[k2hnode2:01877] BTL_HLINK shmem MSMC0 mmap successfull!!
[k2hnode2:01877] BTL_HLINK shmem MSMC0 mmap successfull!!
[k2hnode2:01877] BTL_HLINK attempt HyperLink0 then HyperLink1
[k2hnode2:01877] BTL_HLINK hyplnk0 attempt opening
[k2hnode1:01954] BTL_HLINK shmem open successfull!!
[k2hnode1:01954] BTL_HLINK: CMEM physAddr: 22000000 (to a2000000) userAddr:0xb5a41000
[k2hnode1:01954] BTL_HLINK shmem MSMC0 mmap successfull!!
[k2hnode1:01954] BTL_HLINK shmem MSMC0 mmap successfull!!
[k2hnode1:01954] BTL_HLINK attempt HyperLink0 then HyperLink1
[k2hnode1:01954] BTL_HLINK hyplnk0 attempt opening
[k2hnode2:01877] BTL_HLINK hyplnk0 open failed
[k2hnode2:01877] BTL_HLINK hyplnk1 attempt opening
[k2hnode1:01954] BTL_HLINK hyplnk0 open failed
[k2hnode1:01954] BTL_HLINK hyplnk1 attempt opening
[k2hnode1:01954] BTL_HLINK hyplnk1 open failed
[k2hnode1:01954] BTL_HLINK hyplnk0=(nil) hyplnk1=(nil)
[k2hnode1:01954] HLINK turned off !!!
[k2hnode1:01954] select: init of component hlink returned failure
[k2hnode1:01954] select: module hlink unloaded
[k2hnode1:01954] select: initializing btl component self
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[62988,1],1]) is on host: k2hnode2
  Process 2 ([[62988,1],0]) is on host: k2hnode1
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[k2hnode1:1954] *** An error occurred in MPI_Init
[k2hnode1:1954] *** reported by process [4127981569,0]
[k2hnode1:1954] *** on a NULL communicator
[k2hnode1:1954] *** Unknown error
[k2hnode1:1954] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[k2hnode1:1954] ***    and potentially your MPI job)
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: k2hnode1
  PID:        1954
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 1954 on
node k2hnode1 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.
--------------------------------------------------------------------------
[k2hnode2:01875] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[k2hnode2:01875] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[k2hnode2:01875] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
[k2hnode2:01875] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[k2hnode2:01875] 1 more process has sent h[k2hnode1:01954] select: init of component self returned success
[k2hnode2:01877] BTL_HLINK hyplnk1 open failed
[k2hnode2:01877] BTL_HLINK hyplnk0=(nil) hyplnk1=(nil)
[k2hnode2:01877] HLINK turned off !!!
[k2hnode2:01877] select: init of component hlink returned failure
[k2hnode2:01877] select: module hlink unloaded
[k2hnode2:01877] select: initializing btl component self
[k2hnode2:01877] select: init of component self returned success


On Raja's recommendation we've also tried the project with the latest MCSDK version, but the latest wouldn't even boot for us.

Regards,

Janos


  • The faulty(?) latest MCSDK version we've tried:
    MCSDK 3_01_03_06
  • Just to clarify this issue, I've started a separate thread for the MCSDK problem in the C66x forum:
    e2e.ti.com/.../413110
  • It looks like the Hyperlink open failed:

    [k2hnode1:01954] BTL_HLINK attempt HyperLink0 then HyperLink1
    [k2hnode1:01954] BTL_HLINK hyplnk0 attempt opening
    [k2hnode2:01877] BTL_HLINK hyplnk0 open failed
    [k2hnode2:01877] BTL_HLINK hyplnk1 attempt opening
    [k2hnode1:01954] BTL_HLINK hyplnk0 open failed
    [k2hnode1:01954] BTL_HLINK hyplnk1 attempt opening
    [k2hnode1:01954] BTL_HLINK hyplnk1 open failed

    When you run the OpenMPI test over Hyperlink, the library underneath reads the setup from a JSON file, configures the Hyperlink, and then polls the Hyperlink link status registers to check whether the link is up. The link was not up, which is why you saw the error above. Please make sure you have a stable Hyperlink connection between the two EVMs.
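
    For illustration only, here is a minimal user-space sketch (not TI code) of that last polling step: it maps the HyperLink 0 register block whose address shows up in the BTL_HLINK log above (0x21400000) and checks a link bit in the status register. The status-register offset and the bit position are assumptions and must be verified against the KeyStone II HyperLink user guide.

    /*
     * hl0_status.c -- quick check whether HyperLink 0 reports the link as up.
     * Assumptions: register base 0x21400000 (as printed in the log above);
     * the status-register offset and "link" bit are placeholders, verify them
     * against the KeyStone II HyperLink user guide before trusting the result.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define HLINK0_REG_BASE  0x21400000UL   /* HyperLink 0 register block     */
    #define HLINK_STATUS_OFF 0x08           /* assumed status-register offset */
    #define HLINK_LINK_BIT   (1u << 0)      /* assumed "link up" bit          */

    int main(void)
    {
        int fd = open("/dev/mem", O_RDONLY | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        volatile uint32_t *regs = mmap(NULL, 0x100, PROT_READ, MAP_SHARED,
                                       fd, HLINK0_REG_BASE);
        if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        uint32_t status = regs[HLINK_STATUS_OFF / 4];
        printf("HyperLink0 status = 0x%08x -> link %s\n",
               status, (status & HLINK_LINK_BIT) ? "up" : "down");

        munmap((void *)regs, 0x100);
        close(fd);
        return 0;
    }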

    Regards, Eric

  • Hi Eric,

    thanks for the reply! Quoting from the original thread:

    Meanwhile we've managed to make some progress:

    We noticed that during mpmsrv startup it logs to '/var/log/mpmsrv.log'
    that it couldn't find slave devices. Tracing the details of this
    process, it appears to look for a particular device under /sys/class/uio by searching
    for a specific name in the 'name' files. It turns out the Hyperlink devices (uio8-9)
    _do not_ follow the naming convention mpmsrv relies on.
    By editing the mpm_config.json file we could guide it to find the Hyperlink device for hyperlink0 (that's the one that is
    connected).
    However, we're stuck again with the following errors (unable to map the remote MSMC):

    [k2hnode1:02907] BTL_HLINK TIMPIDBG: hlink_component_init!!!
    [k2hnode1:02907] BTL_HLINK shmem open successfull!!
    [k2hnode1:02907] BTL_HLINK: CMEM physAddr: 22000000 (to a2000000)
    userAddr:0xb5a05000
    [k2hnode1:02907] BTL_HLINK shmem MSMC0 mmap successfull!!
    [k2hnode1:02907] BTL_HLINK shmem MSMC0 mmap successfull!!
    [k2hnode1:02907] BTL_HLINK attempt HyperLink0 then HyperLink1
    [k2hnode1:02907] BTL_HLINK hyplnk0 attempt opening
    [k2hnode1:02907] BTL_HLINK hyplnk0 open successfull!!
    [k2hnode1:02907] BTL_HLINK hyplnk1 attempt opening
    [k2hnode2:01828] select: initializing btl component hlink
    [k2hnode2:01828] BTL_HLINK TIMPIDBG: hlink_component_init!!!
    [k2hnode2:01828] BTL_HLINK shmem open successfull!!
    [k2hnode2:01828] BTL_HLINK: CMEM physAddr: 22000000 (to a2000000)
    userAddr:0xb59a1000
    [k2hnode2:01828] BTL_HLINK shmem MSMC0 mmap successfull!!
    [k2hnode2:01828] BTL_HLINK shmem MSMC0 mmap successfull!!
    [k2hnode2:01828] BTL_HLINK attempt HyperLink0 then HyperLink1
    [k2hnode2:01828] BTL_HLINK hyplnk0 attempt opening
    [k2hnode2:01828] BTL_HLINK hyplnk0 open successfull!!
    [k2hnode2:01828] BTL_HLINK hyplnk1 attempt opening
    [k2hnode1:02907] BTL_HLINK hyplnk1 open failed
    [k2hnode1:02907] BTL_HLINK hyplnk0=0x5c4e8 hyplnk1=(nil)
    [k2hnode1:02907] mmap_failed_hl_win_msmc_rmt (MSMC over hyplnk0)!
    [k2hnode1:02907] select: init of component hlink returned failure
    [k2hnode1:02907] select: module hlink unloaded
    [k2hnode1:02907] select: initializing btl component self
    [k2hnode1:02907] select: init of component self returned success
    [k2hnode2:01828] BTL_HLINK hyplnk1 open failed
    [k2hnode2:01828] BTL_HLINK hyplnk0=0x5c500 hyplnk1=(nil)
    [k2hnode2:01828] mmap_failed_hl_win_msmc_rmt (MSMC over hyplnk0)!
    [k2hnode2:01828] select: init of component hlink returned failure
    [k2hnode2:01828] select: module hlink unloaded

    The fact that the /sys filesystem interface names don't match what the user-space mpm implementation expects suggests that the mpmsrv version does not match the kernel version, although it is supposed to. Unfortunately, we couldn't find any .json configuration guide for OpenMPI over Hyperlink. We are simply trying to make the stock OpenMPI example work over hyperlink0.

    The hardware setup seems OK, since the DSP-side MCSDK Hyperlink test works: the link is up and running.

    Thank you!

    Regards,

    Janos
  • Eric,

    Has the MCSDK-HPC 3.00.01.08 been validated with MCSDK 3_01_03_06?
    The download page only mentions MCSDK 3.00.04.18 for MCSDK-HPC 3.00.01.08:
    http://software-dl.ti.com/sdoemb/sdoemb_public_sw/mcsdk_hpc/latest/index_FDS.html

    If not, are there plans to validate it with the latest MCSDK? What are the plans/roadmap in terms of MCSDK and MCSDK-HPC?

    Thanks!
    Anthony

  • Eric,

    could you please also confirm that on your side the OpenMPI test application (with the following two parameter sets) runs successfully over the Hyperlink transport on the K2H EVM using MCSDK 3.00.04.18 and MCSDK-HPC 3.00.01.08:

    /opt/ti-openmpi/bin/mpirun --mca btl_base_verbose 100 --mca btl self,tcp -np 2 -host k2hnode1,k2hnode2 ./testmpi

    /opt/ti-openmpi/bin/mpirun --mca btl_base_verbose 100 --mca btl self,hlink -np 2 -host k2hnode1,k2hnode2 ./testmpi

    Thanks in advance,

    Anthony

  • Anthony,

    On the page where you download MCSDK HPC, it lists the MCSDK version compatible with MCSDK HPC; just that MCSDK version, not newer ones.

    In this case, MCSDK HPC 3.00.01.08 works with MCSDK 3.00.04.18. If you use these versions, the Hyperlink test should work as-is. If not, please send us log files for both EVMs from power-on to the point where the failure happened. Thanks.

    Best regards,

    David

  • Janos,

    It seems that you still face the same errors using MCSDK-HPC 3.00.01.08 (Build date: 02112015) on top of MCSDK 3.00.04.18. Can you confirm?

    - Have you flashed the board with ALL the MCSDK 3.00.04.18 components (meaning SPL, U-Boot, Linux kernel and FS)?
    For debugging: since SW validation is done for the SW provided in a given MCSDK version, you should use the SW components of that MCSDK and not mix in components from other versions.

    - Can you confirm that the HW setup is similar to the picture posted in the below post:
    https://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/411087

    - Can you confirm that the Hyperlink test (DSP-to-DSP) works with this setup? Can you please post the new logs?

    - Can you confirm that the OpenMPI tests on the ARM side do not work? Can you please post the new logs?



    Thanks in advance,

    Anthony

  • Hi Anthony,

    AnBer said:

    It seems that you still face the same errors using MCSDK-HPC 3.00.01.08 (Build date: 02112015) on top of MCSDK 3.00.04.18. Can you confirm?



    Yes, I can confirm.

    AnBer said:

    - Have you flashed the board with ALL the MCSDK 3.00.04.18 components (meaning SPL, U-boot, linux kernel and FS)?

    Yes, we have.

    AnBer said:

    - Can you confirm that the HW setup is similar to the picture posted in the below post:
    https://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/411087

    A: Yes, but only hyperlink0 is connected.

    AnBer said:

    - Can you confirm that with this setup is the hyperlink tests (DSP2DSP) works? Can you please post the new logs?

    The results of the successful DSP2DSP test (copied from thread https://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/411087):

    [C66xx_0] Version #: 0x02010001; string HYPLNK LLD Revision:
    02.01.00.01:Mar 30 2015:11:04:06
    About to do system setup (PLL, PSC, and DDR)
    Constructed SERDES configs: PLL=0x00000228; RX=0x0046c485; TX=0x000cc305
    system setup worked
    About to set up HyperLink Peripheral
    ============================Hyperlink Testing Port 0
    ========================================== begin registers before
    initialization ===========
    Revision register contents:
      Raw    = 0x4e902101
    Status register contents:
      Raw        = 0x00003004
    Link status register contents:
      Raw       = 0x00000000
    Control register contents:
      Raw             = 0x00000000
    Control register contents:
      Raw        = 0x00000000
    ============== end registers before initialization ===========
    Waiting for other side to come up (       0)
    Version #: 0x02010001; string HYPLNK LLD Revision: 02.01.00.01:Mar 30
    2015:11:04:06
    About to do system setup (PLL, PSC, and DDR)
    Constructed SERDES configs: PLL=0x00000228; RX=0x0046c485; TX=0x000cc305
    system setup worked
    About to set up HyperLink Peripheral
    ============================Hyperlink Testing Port 0
    ========================================== begin registers before
    initialization ===========
    Revision register contents:
      Raw    = 0x4e902101
    Status register contents:
      Raw        = 0x00003004
    Link status register contents:
      Raw       = 0x00000000
    Control register contents:
      Raw             = 0x00000000
    Control register contents:
      Raw        = 0x00000000
    ============== end registers before initialization ===========
    ============== begin registers after initialization ===========
    Status register contents:
      Raw        = 0x04402005
    Link status register contents:
      Raw       = 0xccf00cf0
    Control register contents:
      Raw             = 0x00006204
    ============== end registers after initialization ===========
    Waiting 5 seconds to check link stability
    ============== begin registers after initialization ===========
    Status register contents:
      Raw        = 0x04402005
    Link status register contents:
      Raw       = 0xccf00cff
    Control register contents:
      Raw             = 0x00006204
    ============== end registers after initialization ===========
    Waiting 5 seconds to check link stability
    Precursors 0
    Postcursors: 19
    Link seems stable
    About to try to read remote registers
    ============== begin REMOTE registers after initialization ===========
    Status register contents:
      Raw        = 0x0440200b
    Link status register contents:
      Raw       = 0xfdf0bdf0
    Control register contents:
      Raw             = 0x00006204
    ============== end REMOTE registers after initialization ===========
    Peripheral setup worked
    About to read/write once
    Precursors 0
    Postcursors: 19
    Link seems stable
    About to try to read remote registers
    ============== begin REMOTE registers after initialization ===========
    Status register contents:
      Raw        = 0x0440000b
    Link status register contents:
      Raw       = 0xfdf0bdf0
    Control register contents:
      Raw             = 0x00006200
    ============== end REMOTE registers after initialization ===========
    Peripheral setup worked
    About to read/write once
    Single write test passed
    About to pass 65536 tokens; iteration = 0
    Single write test passed
    About to pass 65536 tokens; iteration = 0
    === this is not an optimized example ===
    === this is not an optimized example ===
    Link Speed is 4 * 6.25 Gbps
    Link Speed is 4 * 6.25 Gbps
    Passed 65536 tokens round trip (read+write through hyplnk) in 16117 Mcycles
    Passed 65536 tokens round trip (read+write through hyplnk) in 16117 Mcycles
    Approximately 245938 cycles per round-trip
    Approximately 245938 cycles per round-trip
    === this is not an optimized example ===
    === this is not an optimized example ===
    Checking statistics
    Checking statistics
    About to pass 65536 tokens; iteration = 1
    About to pass 65536 tokens; iteration = 1
    === this is not an optimized example ===
    === this is not an optimized example ===
    Link Speed is 4 * 6.25 Gbps
    Link Speed is 4 * 6.25 Gbps
    Passed 65536 tokens round trip (read+write through hyplnk) in 16117 Mcycles
    Passed 65536 tokens round trip (read+write through hyplnk) in 16117 Mcycles
    Approximately 245938 cycles per round-trip
    Approximately 245938 cycles per round-trip
    === this is not an optimized example ===
    === this is not an optimized example ===
    Checking statistics
    Checking statistics

    AnBer said:

    - Can you confirm that the openmpi tests on the ARM side does not work? Can you please post the new logs?

    Yes, please find the logs below.

    One more thing I'd like to add. We placed traces in a recompiled mpm server, as
    we found that mpmsrv searches /sys/class/uio to find the appropriate device. We noticed
    that while all other devices are represented in /sys/class/uio/uioX/name in a numeric_id.name format,
    the Hyperlink devices are not. This matters because mpmsrv trims off the first half of the line in the .../name file and matches the part
    after the '.' (see the sketch after the listing below). It surely wouldn't run out of the box without our modifications. To this day we couldn't find any hints on configuring this in mpm_config.json, so we edited mpmsrv to find the right Hyperlink device. Without the edits we get an 'Open failed' error.

    We checked many MCSDK versions before; /sys/class/uio/uio8/name seems to change in each one, but the basic format
    remains the same. We have not checked all versions of mpmsrv, though.

    /sys/class/uio/uio6/name:
        2620058.dsp6

    /sys/class/uio/uio7/name:
        262005c.dsp7

    /sys/class/uio/uio8/name:
        hyperlink0.4

    /sys/class/uio/uio9/name:
        hyperlink1.5
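
    To make the matching behaviour concrete, here is a rough sketch of what we believe mpmsrv is doing (reconstructed from our tracing, not the actual mpmsrv source): scan /sys/class/uio/uioN/name and compare the part after the last '.' with the slave "name" from mpm_config.json. With a name like "hyperlink0.4", that only matches if the configured slave name is "4", which is why the change below helps.

    /* Sketch of the /sys/class/uio name matching we believe mpmsrv performs.
     * This is a reconstruction from tracing, not the actual mpmsrv code. */
    #include <stdio.h>
    #include <string.h>

    /* Fill 'dev' (e.g. "uio8") and return 0 if some uioN's name, taken after
     * the last '.', equals 'wanted' (the slave "name" from mpm_config.json). */
    static int find_uio_by_suffix(const char *wanted, char *dev, size_t devlen)
    {
        char path[64], line[128];

        for (int i = 0; i < 32; i++) {
            snprintf(path, sizeof(path), "/sys/class/uio/uio%d/name", i);
            FILE *f = fopen(path, "r");
            if (!f)
                continue;
            if (fgets(line, sizeof(line), f)) {
                line[strcspn(line, "\n")] = '\0';
                const char *dot = strrchr(line, '.');      /* "2620058.dsp6" -> "dsp6" */
                const char *suffix = dot ? dot + 1 : line; /* "hyperlink0.4" -> "4"    */
                if (strcmp(suffix, wanted) == 0) {
                    snprintf(dev, devlen, "uio%d", i);
                    fclose(f);
                    return 0;
                }
            }
            fclose(f);
        }
        return -1;
    }

    int main(void)
    {
        char dev[16];
        /* "arm-remote-hyplnk-1" never matches "hyperlink0.4", but "4" does. */
        if (find_uio_by_suffix("4", dev, sizeof(dev)) == 0)
            printf("matched %s\n", dev);
        else
            printf("no matching uio device\n");
        return 0;
    }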



    in /etc/mpm/mpm_config.json
    "slaves": [
    ...
    {                                                               
      "name": "4",                                            
      "transport": "hyplnk0-remote",                          
      "dma": "dma-profile-1",                                 
      "memorymap": ["local-msmc", "local-ddr"]                
    }
    ...

    where we changed "arm-remote-hyplnk-1" => "4" just to make the mpm server find something in /sys.
    This way we get the following output (otherwise both Hyperlink opens fail, as in our original problem):

    [k2hnode2:01790] mca: base: components_open: Looking for btl components
    [k2hnode1:01814] mca: base: components_open: Looking for btl components
    [k2hnode1:01814] mca: base: components_open: opening btl components
    [k2hnode1:01814] mca: base: components_open: found loaded component hlink
    [k2hnode1:01814] BTL_HLINK TIMPIDBG: hlink_component_register!!!
    [k2hnode1:01814] This is EVM, using hl0 only!
    [k2hnode2:01790] mca: base: components_open: opening btl components
    [k2hnode2:01790] mca: base: components_open: found loaded component hlink
    [k2hnode2:01790] BTL_HLINK TIMPIDBG: hlink_component_register!!!
    [k2hnode2:01790] This is EVM, using hl0 only!
    [k2hnode1:01814] mca: base: components_open: component hlink register function successful
    [k2hnode2:01790] mca: base: components_open: component hlink register function successful
    [k2hnode2:01790] BTL_HLINK TIMPIDBG: hlink_component_open!!!
    [k2hnode1:01814] BTL_HLINK TIMPIDBG: hlink_component_open!!!
    [k2hnode2:01790] BTL_HLINK  BTL HLINK start of HYPLNKINITCFG: 0xb6b26e58
    [k2hnode2:01790] BTL_HLINK [0x21400000]
    [k2hnode2:01790] BTL_HLINK [0x40000000]
    [k2hnode2:01790] BTL_HLINK [0x21400100]
    [k2hnode2:01790] BTL_HLINK [0x28000000]
    [k2hnode2:01790] BTL_HLINK [(nil)]
    [k2hnode2:01790] BTL_HLINK [(nil)]
    [k2hnode2:01790] BTL_HLINK [(nil)]
    [k2hnode2:01790] BTL_HLINK [(nil)]
    [k2hnode2:01790] BTL_HLINK BTL HLINK end of HYPLNKINITCFG
    [k2hnode2:01790] BTL_HLINK: CMEM_init OK!
    [k2hnode2:01790] mca: base: components_open: component hlink open function successful
    [k2hnode2:01790] mca: base: components_open: found loaded component self
    [k2hnode2:01790] mca: base: components_open: component self has no register function
    [k2hnode1:01814] BTL_HLINK  BTL HLINK start of HYPLNKINITCFG: 0xb6a4ee58
    [k2hnode1:01814] BTL_HLINK [0x21400000]
    [k2hnode2:01790] mca: base: components_open: component self open function successful
    [k2hnode1:01814] BTL_HLINK [0x40000000]
    [k2hnode1:01814] BTL_HLINK [0x21400100]
    [k2hnode1:01814] BTL_HLINK [0x28000000]
    [k2hnode1:01814] BTL_HLINK [(nil)]
    [k2hnode1:01814] BTL_HLINK [(nil)]
    [k2hnode1:01814] BTL_HLINK [(nil)]
    [k2hnode1:01814] BTL_HLINK [(nil)]
    [k2hnode1:01814] BTL_HLINK BTL HLINK end of HYPLNKINITCFG
    [k2hnode1:01814] BTL_HLINK: CMEM_init OK!
    [k2hnode1:01814] mca: base: components_open: component hlink open function successful
    [k2hnode1:01814] mca: base: components_open: found loaded component self
    [k2hnode1:01814] mca: base: components_open: component self has no register function
    [k2hnode1:01814] mca: base: components_open: component self open function successful
    [k2hnode1:01814] select: initializing btl component hlink
    [k2hnode2:01790] select: initializing btl component hlink
    [k2hnode2:01790] BTL_HLINK TIMPIDBG: hlink_component_init!!!
    [k2hnode1:01814] BTL_HLINK TIMPIDBG: hlink_component_init!!!
    [k2hnode2:01790] BTL_HLINK shmem open successfull!!
    [k2hnode2:01790] BTL_HLINK: CMEM physAddr:        22000000 (to a2000000) userAddr:0xb5a65000
    [k2hnode1:01814] BTL_HLINK shmem open successfull!!
    [k2hnode2:01790] BTL_HLINK shmem MSMC0 mmap successfull!!
    [k2hnode1:01814] BTL_HLINK: CMEM physAddr:        22000000 (to a2000000) userAddr:0xb598d000
    [k2hnode2:01790] BTL_HLINK shmem MSMC0 mmap successfull!!
    [k2hnode2:01790] BTL_HLINK attempt HyperLink0 then HyperLink1
    [k2hnode2:01790] BTL_HLINK hyplnk0 attempt opening
    [k2hnode1:01814] BTL_HLINK shmem MSMC0 mmap successfull!!
    [k2hnode1:01814] BTL_HLINK shmem MSMC0 mmap successfull!!
    [k2hnode1:01814] BTL_HLINK attempt HyperLink0 then HyperLink1
    [k2hnode1:01814] BTL_HLINK hyplnk0 attempt opening
    [k2hnode2:01790] BTL_HLINK hyplnk0 open successfull!!
    [k2hnode2:01790] BTL_HLINK hyplnk1 attempt opening
    [k2hnode1:01814] BTL_HLINK hyplnk0 open successfull!!
    [k2hnode1:01814] BTL_HLINK hyplnk1 attempt opening
    [k2hnode2:01790] BTL_HLINK hyplnk1 open failed
    [k2hnode2:01790] BTL_HLINK hyplnk0=0x5c4e0 hyplnk1=(nil)
    [k2hnode2:01790] mmap_failed_hl_win_msmc_rmt (MSMC over hyplnk0)!
    [k2hnode2:01790] select: init of component hlink returned failure
    [k2hnode2:01790] select: module hlink unloaded
    [k2hnode2:01790] select: initializing btl component self
    [k2hnode2:01790] select: init of component self returned success
    --------------------------------------------------------------------------
    At least one pair of MPI processes are unable to reach each other for
    MPI communications.  This means that no Open MPI device has indicated
    that it can be used to communicate between these processes.  This is
    an error; Open MPI requires that all MPI processes be able to reach
    each other.  This error can sometimes be the result of forgetting to
    specify the "self" BTL.

      Process 1 ([[16559,1],0]) is on host: k2hnode1
      Process 2 ([[16559,1],1]) is on host: k2hnode2
      BTLs attempted: self

    Your MPI job is now going to abort; sorry.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    MPI_INIT has failed because at least one MPI process is unreachable
    from another.  This *usually* means that an underlying communication
    plugin -- such as a BTL or an MTL -- has either not loaded or not
    allowed itself to be used.  Your MPI job will now abort.

    You may wish to try to narrow down the problem;

     * Check the output of ompi_info to see which BTL/MTL plugins are
       available.
     * Run your application with MPI_THREAD_SINGLE.
     * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
       if using MTL-based communications) to see exactly which
       communication plugins were considered and/or discarded.
    --------------------------------------------------------------------------
    [k2hnode2:1790] *** An error occurred in MPI_Init
    [k2hnode2:1790] *** reported by process [1085210625,1]
    [k2hnode2:1790] *** on a NULL communicator
    [k2hnode2:1790] *** Unknown error
    [k2hnode2:1790] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    [k2hnode2:1790] ***    and potentially your MPI job)
    --------------------------------------------------------------------------
    An MPI process is aborting at a time when it cannot guarantee that all
    of its peer processes in the job will be killed properly.  You should
    double check that everything has shut down cleanly.

      Reason:     Before MPI_INIT completed
      Local host: k2hnode2
      PID:        1790
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun has exited due to process rank 1 with PID 1790 on
    node k2hnode2 exiting improperly. There are three reasons this could occur:

    1. this process did not call "init" before exiting, but others in
    the job did. This can cause a job to hang indefinitely while it waits
    for all processes to call "init". By rule, if one process calls "init",
    then ALL processes must call "init" prior to termination.

    2. this process called "init", but exited without calling "finalize".
    By rule, all processes that call "init" MUST call "finalize" prior to
    exiting or it will be considered an "abnormal termination"

    3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
    orte_create_session_dirs is set to false. In this case, the run-time cannot
    detect that the abort call was an abnormal termination. Hence, the only
    error message you will receive is this one.

    This may have caused other processes in the application to be
    terminated by signals sent by mpirun (as reported here).

    You can avoid this message by specifying -quiet on the mpirun command line.

    --------------------------------------------------------------------------
    [k2hnode1:01812] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
    [k2hnode1:01812] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
    [k2hnode1:01812] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
    [k2hnode1:01812] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
    [k2hnode1:01812] 1 more process has sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
    [k2hnode1:01814] BTL_HLINK hyplnk1 open failed
    [k2hnode1:01814] BTL_HLINK hyplnk0=0x5c4e0 hyplnk1=(nil)
    [k2hnode1:01814] mmap_failed_hl_win_msmc_rmt (MSMC over hyplnk0)!
    [k2hnode1:01814] select: init of component hlink returned failure
    [k2hnode1:01814] select: module hlink unloaded
    [k2hnode1:01814] select: initializing btl component self
    [k2hnode1:01814] select: init of component self returned success



    Regards,

    Janos

  • Hi Janos,

    There should be a new release of the HPC coming out soon.
    When it is available, can you try it (MCSDK HPC 3.00.01.12) with MCSDK 3.00.04.18?

    Thanks.
    A.