Data corruption with OpenMPI 4.0.4 and UCX 1.9.0

There seems to be a data corruption issue with OpenMPI 4.0.4 + UCX 1.9.0 (and OpenMPI 4.1.0 + UCX 1.9.0) when communicating data between processes using MPI. The attached test code is an MPI program in C++ that reproduces the issue when run on more than one node of a local cluster. Restricting the UCX transports to sm, tcp, ud is a workaround for the issue (not setting UCX_TLS, or setting it to rc, dc, results in data corruption). The issue does not show up with the other MPI library (Intel MPI) installed on the system. This issue may be related to open-mpi/ompi#8321.

More information on the issue is provided below:

  • The issue shows up only when running the program on more than 1 node. Running the attached test program with 2 processes that span 2 nodes reproduces the issue on our local cluster.
  • The test program gathers data for multiple variables (an hvector of indexed variables) to rank 0, using MPI point-to-point communication rather than MPI_Gather. The number of variables gathered can be controlled with the "--nvars=x" command line option. The data corruption appears to depend on the amount of data gathered, since it shows up only when gathering many variables. On our cluster the issue does not show up when gathering <= 10 variables, and it does not show up for every case with > 10 variables either.
  • The workaround is to set the environment variable "UCX_TLS" to "sm,ud" to restrict the UCX transports used by OpenMPI+UCX.
  • I built the latest version of OpenMPI, 4.1.0, on the system and the issue still exists.

    The OpenMPI and UCX install information and the test program are available at https://gist.github.com/jayeshkrishna/5d053d3d5bba11359ea2dc82c435c3ea. On a successful run the test program prints "SUCCESS: Validation successful". When data validation fails it prints "ERROR: Validation failed", followed by the first index in the gather buffer where validation failed.
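The workaround above can be sketched as follows. This is a minimal sketch, assuming the mpiexec on the PATH is Open MPI's launcher, whose -x option forwards an environment variable to the launched processes. The process count and --nvars=13 are simply the values mentioned in this report, and the launch is skipped if mpiexec or the test binary is missing.

```shell
# Restrict UCX to the shared-memory and UD transports (the workaround from this report).
UCX_TLS=sm,ud
export UCX_TLS

# Only launch when an mpiexec and the test binary are actually present.
if command -v mpiexec >/dev/null 2>&1 && [ -x ./mpi_gather_openmpi ]; then
  # -x forwards UCX_TLS to ranks started on other nodes (Open MPI launcher option).
  mpiexec -x UCX_TLS -n 2 ./mpi_gather_openmpi --nvars=13
fi
```

Unsetting UCX_TLS again and rerunning the same command is a quick way to compare against the default transport selection.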

Building the test program

Note that argparser.cpp and argparser.h are helper code; you only need to look at mpi_gather.cpp (the main() and createsndrcvdt() routines).

mpicxx argparser.cpp mpi_gather.cpp -o mpi_gather_openmpi

Running the test program

The issue shows up with certain values of "--nvars" (for example, --nvars=13) on our local cluster when the processes span multiple nodes, so I use a loop to test it:

for iproc in 2 4 16 64 96 128
do
  for i in 2 4 8 9 10 11 12 13 14 15 16 18 19 20 32
  do
    echo "--------- MPI Gather nprocs = $iproc (nvars = $i) ----------------"
    mpiexec -n $iproc ./mpi_gather_openmpi --nvars=$i
  done
done
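To make a sweep like the one above easier to scan, the output of each run can be reduced to a one-word verdict. The helper below is a hypothetical sketch, not part of the reproducer: classify_run is a name introduced here, and it relies only on the two messages the test program is described as printing ("SUCCESS: Validation successful" and "ERROR: Validation failed"). The nvars values in the example sweep are an arbitrary subset.

```shell
# classify_run: read the test program's output on stdin and print PASS, FAIL, or UNKNOWN.
# The failure message is checked first, so a run containing any error is reported as FAIL.
classify_run() {
  out=$(cat)
  case "$out" in
    *"ERROR: Validation failed"*) echo FAIL ;;
    *"SUCCESS: Validation successful"*) echo PASS ;;
    *) echo UNKNOWN ;;
  esac
}

# Example sweep, run only when mpiexec and the test binary are available.
if command -v mpiexec >/dev/null 2>&1 && [ -x ./mpi_gather_openmpi ]; then
  for i in 10 11 12 13 14; do
    result=$(mpiexec -n 2 ./mpi_gather_openmpi --nvars=$i 2>&1 | classify_run)
    echo "nprocs=2 nvars=$i $result"
  done
fi
```

UNKNOWN covers runs that crash or print neither message, which keeps launcher errors from being counted as passes.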

Please let me know if you need further information on this issue.

Asked Oct 06 '21 03:10
jayeshkrishna

6 Answers:

@jayeshkrishna can you please try setting UCX_MEM_EVENTS=n to check whether the issue is related to memory hooks?

1
Answered Feb 03 '21 at 15:00
yosefe

I just tried setting "UCX_MEM_EVENTS=n" (env variable), but that did not help (the reproducer validation still failed).

1
Answered Feb 03 '21 at 16:51
jayeshkrishna

Hi @jayeshkrishna,

Thank you for the bug report.

I tried to reproduce your issue, and it seems our compiler is too old to handle the regex code. I see this error:

--------- MPI Gather nprocs = 2 (nvars = 2) ----------------
terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error
[jazz17:111574] *** Process received signal ***
[jazz17:111574] Signal: Aborted (6)
[jazz17:111574] Signal code:  (-6)

We are using GCC 4.8.5 20150623 (Red Hat 4.8.5-36).

Which compiler do you use? I'm going to re-implement your reproducer to parse arguments with getopt, and will let you know about progress.

1
Answered Feb 05 '21 at 06:23
hoopoepg

@hoopoepg Jayesh used Intel-20.0.4 (from the gist above). GNU 4.8.5 is too old; a recent gcc shouldn't have issues as far as I know.

1
Answered Feb 05 '21 at 06:26
sarats

@hoopoepg Apparently, gcc 4.9 added "Improved support for C++11, including: support for <regex>": https://gcc.gnu.org/gcc-4.9/changes.html

1
Answered Feb 05 '21 at 06:33
sarats

Yes, a newer version of gcc should work fine. We use gcc 8.2.0 for our gnu tests (We require gcc 4.9+ for building Scorpio + E3SM due to limited C++ regex support in gcc < 4.9).
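A quick way to check a build host against the gcc 4.9+ requirement mentioned above is sketched below. The function name gcc_has_full_regex is hypothetical and introduced only for illustration. It compares just the major and minor numbers of a version string, and the live check assumes the compiler is invoked as gcc.

```shell
# gcc_has_full_regex: succeed if the given gcc version string is 4.9 or newer.
# Accepts forms such as "4.8.5", "4.9", or "8" (newer gcc may print only the major number).
gcc_has_full_regex() {
  ver="$1"
  major="${ver%%.*}"
  rest="${ver#*.}"
  if [ "$rest" = "$ver" ]; then
    minor=0            # no dot in the string, so only a major number was given
  else
    minor="${rest%%.*}"
  fi
  [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 9 ]; }
}

# Check the local compiler, if one is installed under the name gcc.
if command -v gcc >/dev/null 2>&1; then
  if gcc_has_full_regex "$(gcc -dumpversion)"; then
    echo "gcc is 4.9 or newer"
  else
    echo "gcc is older than 4.9; std::regex support is limited"
  fi
fi
```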

1
Answered Feb 05 '21 at 15:17
jayeshkrishna