OpenMPI program hangs over RoCE using libfabric


My simple Open MPI program used to work on a 2-node system. After freshly reconfiguring libfabric and Open MPI, the program hangs when I run it:

1. I get output from only one rank.
2. ps output on the other node shows that the program is actually running there.

#include <mpi.h>
#include <stdio.h> /* required for printf */

int main(int argc, char *argv[]) {
  int rank;
  int size;
  char name[MPI_MAX_PROCESSOR_NAME];
  int length;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Get_processor_name(name, &length);
  printf(" rank:%d name:%s  \n", rank, name);
  MPI_Finalize();
  return 0;
}
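
For completeness, a typical build-and-run sequence would look like the following (a sketch: the original compile command was not shown, and the source file name mpi_hello.c is assumed):

 # compile with the wrapper from the same Open MPI install
 /opt/openmpi/bin/mpicc mpi_hello.c -o a.out
 # launch one rank per node using the same hostfile
 /opt/openmpi/bin/mpirun -np 2 -hostfile hosts.txt ./a.out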

The expected output is:

 rank:0 name:beta-nvidia
 rank:1 name:alpha-nvidia

But I am only getting:

 rank:1 name:alpha-nvidia
The mpirun command used:

 /opt/openmpi/bin/mpirun -v -hostfile hosts.txt --mca btl self,sm,tcp --mca btl_base_verbose 30 --mca oob_tcp_if_include enp175s0f0np0 --prtemca prte_if_include enp175s0f0np0 --mca btl_tcp_if_include enp175s0f0np0 -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/openmpi/lib/:/usr/local/lib:/opt/rdma-core/build/lib:/opt/gdrcopy/lib/ ./a.out
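
Note that even though only the self, sm, and tcp BTLs are requested, an Open MPI build that finds libfabric can still select the cm PML with the OFI MTL for point-to-point traffic, which would route messages through libfabric's verbs provider over RoCE. One diagnostic sketch (pml and mtl are standard MCA frameworks; if the run completes with these flags, the hang is likely in the OFI path rather than the TCP BTL):

 # force the BTL-based PML and exclude the OFI MTL
 /opt/openmpi/bin/mpirun --mca pml ob1 --mca mtl ^ofi \
     --mca btl self,sm,tcp -hostfile hosts.txt ./a.out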

Open MPI configure command:

  Configure command line: '--prefix=/opt/openmpi' '--enable-mpi-ext=cuda'
                          '--with-cuda=/usr/local/cuda'
                          '--with-cuda-libdir=/usr/local/cuda/lib64/stubs'
                          '--with-libfabric' '--enable-builtin-atomics'
                          '--without-cma' '--with-libevent=external'
                          '--with-hwloc=external' '--disable-silent-rules'
                          '--enable-ipv6' '--with-devel-headers'
                          '--with-slurm' '--with-sge' '--without-tm'
                          '--with-zlib' '--enable-heterogeneous'
                          '--with-pmix=internal' '--with-prrte=internal'
                          '--enable-mca-no-build=btl-uct'
                          '--with-libfabric=/opt/libfabric'
                          '--disable-sphinx' '--without-ucx'
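
To confirm which libfabric-related components this build actually contains, the component list can be inspected with ompi_info, which ships with Open MPI (a sketch, assuming the install prefix above):

 /opt/openmpi/bin/ompi_info | grep -i ofi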
marvell@neutron-nvidia:/opt/libfabric$ ./bin/fi_info -l
usnic:
    version: 1.0
verbs:
    version: 201.0
ofi_rxm:
    version: 201.0
ofi_rxd:
    version: 201.0
shm:
    version: 201.0
udp:
    version: 201.0
tcp:
    version: 201.0
sockets:
    version: 201.0
ofi_hook_perf:
    version: 201.0
ofi_hook_trace:
    version: 201.0
ofi_hook_profile:
    version: 201.0
ofi_hook_debug:
    version: 201.0
ofi_hook_noop:
    version: 201.0
ofi_hook_hmem:
    version: 201.0
ofi_hook_dmabuf_peer_mem:
    version: 201.0
off_coll:
    version: 201.0
sm2:
    version: 201.0
ofi_mrail:
    version: 201.0
lnx:
    version: 201.0
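
Since RoCE traffic would go through the verbs provider, it is worth checking that verbs actually enumerates the RoCE device and interface (a sketch; -p selects a provider and -v prints full details, both standard fi_info options, and FI_LOG_LEVEL is libfabric's logging knob):

 FI_LOG_LEVEL=info ./bin/fi_info -p verbs -v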

The run gets stuck at the point below:

[proton-nvidia:111996] mca: base: components_register: registering framework btl components
[proton-nvidia:111996] mca: base: components_register: found loaded component self
[proton-nvidia:111996] mca: base: components_register: component self register function successful
[proton-nvidia:111996] mca: base: components_register: found loaded component sm
[proton-nvidia:111996] mca: base: components_register: component sm register function successful
[proton-nvidia:111996] mca: base: components_register: found loaded component tcp
[proton-nvidia:111996] mca: base: components_register: component tcp register function successful
[proton-nvidia:111996] mca: base: components_open: opening btl components
[proton-nvidia:111996] mca: base: components_open: found loaded component self
[proton-nvidia:111996] mca: base: components_open: component self open function successful
[proton-nvidia:111996] mca: base: components_open: found loaded component sm
[proton-nvidia:111996] mca: base: components_open: component sm open function successful
[proton-nvidia:111996] mca: base: components_open: found loaded component tcp
[proton-nvidia:111996] mca: base: components_open: component tcp open function successful
[neutron-nvidia:120919] mca: base: components_register: registering framework btl components
[neutron-nvidia:120919] mca: base: components_register: found loaded component self
[neutron-nvidia:120919] mca: base: components_register: component self register function successful
[neutron-nvidia:120919] mca: base: components_register: found loaded component sm
[neutron-nvidia:120919] mca: base: components_register: component sm register function successful
[neutron-nvidia:120919] mca: base: components_register: found loaded component tcp
[neutron-nvidia:120919] mca: base: components_register: component tcp register function successful
[neutron-nvidia:120919] mca: base: components_open: opening btl components
[neutron-nvidia:120919] mca: base: components_open: found loaded component self
[neutron-nvidia:120919] mca: base: components_open: component self open function successful
[neutron-nvidia:120919] mca: base: components_open: found loaded component sm
[neutron-nvidia:120919] mca: base: components_open: component sm open function successful
[neutron-nvidia:120919] mca: base: components_open: found loaded component tcp
[neutron-nvidia:120919] mca: base: components_open: component tcp open function successful
[proton-nvidia:111996] select: initializing btl component self
[proton-nvidia:111996] select: init of component self returned success
[proton-nvidia:111996] select: initializing btl component sm
[proton-nvidia:111996] select: init of component sm returned failure
[proton-nvidia:111996] mca: base: close: component sm closed
[proton-nvidia:111996] mca: base: close: unloading component sm
[proton-nvidia:111996] select: initializing btl component tcp
[proton-nvidia:111996] btl: tcp: Using interface: enp175s0f0np0
[proton-nvidia:111996] btl:tcp: 0x55e8c7096d90: if enp175s0f0np0 kidx 4 cnt 0 addr 50.50.50.1 IPv4 bw 100000 lt 100
[proton-nvidia:111996] btl:tcp: Attempting to bind to AF_INET port 1024
[proton-nvidia:111996] btl:tcp: Successfully bound to AF_INET port 1024
[proton-nvidia:111996] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[proton-nvidia:111996] btl:tcp: Attempting to bind to AF_INET6 port 1024
[proton-nvidia:111996] btl:tcp: Successfully bound to AF_INET6 port 1024
[proton-nvidia:111996] btl:tcp: my listening v6 socket port is 1024
[proton-nvidia:111996] btl: tcp: exchange: 0 4 IPv4 50.50.50.1
[proton-nvidia:111996] select: init of component tcp returned success
[neutron-nvidia:120919] select: initializing btl component self
[neutron-nvidia:120919] select: init of component self returned success
[neutron-nvidia:120919] select: initializing btl component sm
[neutron-nvidia:120919] select: init of component sm returned failure
[neutron-nvidia:120919] mca: base: close: component sm closed
[neutron-nvidia:120919] mca: base: close: unloading component sm
[neutron-nvidia:120919] select: initializing btl component tcp
[neutron-nvidia:120919] btl: tcp: Using interface: enp175s0f0np0
[neutron-nvidia:120919] btl:tcp: 0x55f31adeac60: if enp175s0f0np0 kidx 11 cnt 0 addr 50.50.50.2 IPv4 bw 100000 lt 100
[neutron-nvidia:120919] btl:tcp: Attempting to bind to AF_INET port 1024
[neutron-nvidia:120919] btl:tcp: Successfully bound to AF_INET port 1024
[neutron-nvidia:120919] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[neutron-nvidia:120919] btl:tcp: Attempting to bind to AF_INET6 port 1024
[neutron-nvidia:120919] btl:tcp: Successfully bound to AF_INET6 port 1024
[neutron-nvidia:120919] btl:tcp: my listening v6 socket port is 1024
[neutron-nvidia:120919] btl: tcp: exchange: 0 11 IPv4 50.50.50.2
[neutron-nvidia:120919] select: init of component tcp returned success
 rank:1 name:proton-nvidia
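
When it hangs like this, attaching a debugger to the silent rank shows which call it is blocked in (a generic sketch; 111996 is the PID taken from the proton-nvidia log lines above):

 # on proton-nvidia: attach to the stuck rank and dump all thread backtraces
 sudo gdb -batch -p 111996 -ex 'thread apply all bt'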