My simple Open MPI program used to work on a 2-node system. After freshly reconfiguring Libfabric and Open MPI, it hangs when I run it.
1. I get output from only one rank.
2. ps output on the other node shows that the program is actually running there.
#include <mpi.h>
#include <stdio.h>   /* needed for printf */

int main(int argc, char *argv[]) {
    int rank;
    int size;
    char name[MPI_MAX_PROCESSOR_NAME];
    int length;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &length);
    /* Each rank reports its rank number and the host it runs on. */
    printf(" rank:%d name:%s \n", rank, name);
    MPI_Finalize();
    return 0;
}
Expected output:
rank:0 name:beta-nvidia
rank:1 name:alpha-nvidia
Actual output (only one rank prints):
rank:1 name:alpha-nvidia
mpirun command used:
/opt/openmpi/bin/mpirun -v -hostfile hosts.txt --mca btl self,sm,tcp --mca btl_base_verbose 30 --mca oob_tcp_if_include enp175s0f0np0 --prtemca prte_if_include enp175s0f0np0 --mca btl_tcp_if_include enp175s0f0np0 -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LD_LIBRARY_PATH:/opt/openmpi/lib/:/usr/local/lib:/opt/rdma-core/build/lib:/opt/gdrcopy/lib/ ./a.out
Open MPI configure options:
Configure command line: '--prefix=/opt/openmpi' '--enable-mpi-ext=cuda'
'--with-cuda=/usr/local/cuda'
'--with-cuda-libdir=/usr/local/cuda/lib64/stubs'
'--with-libfabric' '--enable-builtin-atomics'
'--without-cma' '--with-libevent=external'
'--with-hwloc=external' '--disable-silent-rules'
'--enable-ipv6' '--with-devel-headers'
'--with-slurm' '--with-sge' '--without-tm'
'--with-zlib' '--enable-heterogeneous'
'--with-pmix=internal' '--with-prrte=internal'
'--enable-mca-no-build=btl-uct'
'--with-libfabric=/opt/libfabric'
'--disable-sphinx' '--without-ucx'
marvell@neutron-nvidia:/opt/libfabric$ ./bin/fi_info -l
usnic:
version: 1.0
verbs:
version: 201.0
ofi_rxm:
version: 201.0
ofi_rxd:
version: 201.0
shm:
version: 201.0
udp:
version: 201.0
tcp:
version: 201.0
sockets:
version: 201.0
ofi_hook_perf:
version: 201.0
ofi_hook_trace:
version: 201.0
ofi_hook_profile:
version: 201.0
ofi_hook_debug:
version: 201.0
ofi_hook_noop:
version: 201.0
ofi_hook_hmem:
version: 201.0
ofi_hook_dmabuf_peer_mem:
version: 201.0
off_coll:
version: 201.0
sm2:
version: 201.0
ofi_mrail:
version: 201.0
lnx:
version: 201.0
The run hangs after the following output:
[proton-nvidia:111996] mca: base: components_register: registering framework btl components
[proton-nvidia:111996] mca: base: components_register: found loaded component self
[proton-nvidia:111996] mca: base: components_register: component self register function successful
[proton-nvidia:111996] mca: base: components_register: found loaded component sm
[proton-nvidia:111996] mca: base: components_register: component sm register function successful
[proton-nvidia:111996] mca: base: components_register: found loaded component tcp
[proton-nvidia:111996] mca: base: components_register: component tcp register function successful
[proton-nvidia:111996] mca: base: components_open: opening btl components
[proton-nvidia:111996] mca: base: components_open: found loaded component self
[proton-nvidia:111996] mca: base: components_open: component self open function successful
[proton-nvidia:111996] mca: base: components_open: found loaded component sm
[proton-nvidia:111996] mca: base: components_open: component sm open function successful
[proton-nvidia:111996] mca: base: components_open: found loaded component tcp
[proton-nvidia:111996] mca: base: components_open: component tcp open function successful
[neutron-nvidia:120919] mca: base: components_register: registering framework btl components
[neutron-nvidia:120919] mca: base: components_register: found loaded component self
[neutron-nvidia:120919] mca: base: components_register: component self register function successful
[neutron-nvidia:120919] mca: base: components_register: found loaded component sm
[neutron-nvidia:120919] mca: base: components_register: component sm register function successful
[neutron-nvidia:120919] mca: base: components_register: found loaded component tcp
[neutron-nvidia:120919] mca: base: components_register: component tcp register function successful
[neutron-nvidia:120919] mca: base: components_open: opening btl components
[neutron-nvidia:120919] mca: base: components_open: found loaded component self
[neutron-nvidia:120919] mca: base: components_open: component self open function successful
[neutron-nvidia:120919] mca: base: components_open: found loaded component sm
[neutron-nvidia:120919] mca: base: components_open: component sm open function successful
[neutron-nvidia:120919] mca: base: components_open: found loaded component tcp
[neutron-nvidia:120919] mca: base: components_open: component tcp open function successful
[proton-nvidia:111996] select: initializing btl component self
[proton-nvidia:111996] select: init of component self returned success
[proton-nvidia:111996] select: initializing btl component sm
[proton-nvidia:111996] select: init of component sm returned failure
[proton-nvidia:111996] mca: base: close: component sm closed
[proton-nvidia:111996] mca: base: close: unloading component sm
[proton-nvidia:111996] select: initializing btl component tcp
[proton-nvidia:111996] btl: tcp: Using interface: enp175s0f0np0
[proton-nvidia:111996] btl:tcp: 0x55e8c7096d90: if enp175s0f0np0 kidx 4 cnt 0 addr 50.50.50.1 IPv4 bw 100000 lt 100
[proton-nvidia:111996] btl:tcp: Attempting to bind to AF_INET port 1024
[proton-nvidia:111996] btl:tcp: Successfully bound to AF_INET port 1024
[proton-nvidia:111996] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[proton-nvidia:111996] btl:tcp: Attempting to bind to AF_INET6 port 1024
[proton-nvidia:111996] btl:tcp: Successfully bound to AF_INET6 port 1024
[proton-nvidia:111996] btl:tcp: my listening v6 socket port is 1024
[proton-nvidia:111996] btl: tcp: exchange: 0 4 IPv4 50.50.50.1
[proton-nvidia:111996] select: init of component tcp returned success
[neutron-nvidia:120919] select: initializing btl component self
[neutron-nvidia:120919] select: init of component self returned success
[neutron-nvidia:120919] select: initializing btl component sm
[neutron-nvidia:120919] select: init of component sm returned failure
[neutron-nvidia:120919] mca: base: close: component sm closed
[neutron-nvidia:120919] mca: base: close: unloading component sm
[neutron-nvidia:120919] select: initializing btl component tcp
[neutron-nvidia:120919] btl: tcp: Using interface: enp175s0f0np0
[neutron-nvidia:120919] btl:tcp: 0x55f31adeac60: if enp175s0f0np0 kidx 11 cnt 0 addr 50.50.50.2 IPv4 bw 100000 lt 100
[neutron-nvidia:120919] btl:tcp: Attempting to bind to AF_INET port 1024
[neutron-nvidia:120919] btl:tcp: Successfully bound to AF_INET port 1024
[neutron-nvidia:120919] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[neutron-nvidia:120919] btl:tcp: Attempting to bind to AF_INET6 port 1024
[neutron-nvidia:120919] btl:tcp: Successfully bound to AF_INET6 port 1024
[neutron-nvidia:120919] btl:tcp: my listening v6 socket port is 1024
[neutron-nvidia:120919] btl: tcp: exchange: 0 11 IPv4 50.50.50.2
[neutron-nvidia:120919] select: init of component tcp returned success
rank:1 name:proton-nvidia
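
The log above shows both nodes listening on TCP port 1024 (50.50.50.1 on proton-nvidia, 50.50.50.2 on neutron-nvidia). One thing that could be ruled out while the job is hung is whether each node can actually open a TCP connection to the other node's listening port. The following is a minimal sketch of such a probe, not part of Open MPI: the function name tcp_probe, the TCP_PROBE_MAIN macro, and the 2-second timeout are all made up for illustration. A successful connection only shows that the port is reachable; it does not by itself identify the cause of the hang.

```c
/* Hypothetical diagnostic helper (not part of Open MPI or Libfabric). */
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 0 if a TCP connection to ip:port completes within timeout_ms,
 * -1 on bad arguments, refusal, timeout, or any other error. */
int tcp_probe(const char *ip, int port, int timeout_ms)
{
    struct sockaddr_in addr;

    if (ip == NULL || port <= 0 || port > 65535)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons((unsigned short)port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;                       /* not a valid IPv4 literal */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Non-blocking mode so that connect() can be bounded by a timeout. */
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }

    int rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    if (rc < 0 && errno == EINPROGRESS) {
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        rc = -1;
        if (poll(&pfd, 1, timeout_ms) == 1) {
            /* The socket became writable: read the final connect status. */
            int err = 0;
            socklen_t len = sizeof(err);
            if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0 && err == 0)
                rc = 0;
        }
    } else if (rc < 0) {
        rc = -1;                         /* immediate failure */
    }

    close(fd);
    return rc;
}

#ifdef TCP_PROBE_MAIN
/* Standalone tool: build with -DTCP_PROBE_MAIN. */
int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <ipv4-address> <port>\n", argv[0]);
        return 2;
    }
    int ok = tcp_probe(argv[1], atoi(argv[2]), 2000) == 0;
    printf("%s:%s %s\n", argv[1], argv[2], ok ? "reachable" : "unreachable");
    return ok ? 0 : 1;
}
#endif
```

Assuming the file is saved as tcp_probe.c, it could be built with "cc -DTCP_PROBE_MAIN tcp_probe.c -o tcp_probe" and run on proton-nvidia as "./tcp_probe 50.50.50.2 1024" (and the reverse on neutron-nvidia) while the MPI job is still hanging, since the port is only open while the ranks are alive.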