Data corruption with OpenMPI 4.0.4 and UCX 1.9.0
There seems to be a data corruption issue with OpenMPI 4.0.4 + UCX 1.9.0 (and OpenMPI 4.1.0 + UCX 1.9.0) when communicating data between processes using MPI. The attached test code is an MPI program in C++ that shows the issue when run on more than one node in a local cluster. Restricting the UCX transport protocols to sm, tcp, ud is a workaround for the issue (leaving UCX_TLS unset, or setting it to rc or dc, results in data corruption). The issue does not show up with other MPI libraries (Intel MPI) installed on the system. This issue may be related to open-mpi/ompi#8321.
More information on the issue is provided below:
- The issue shows up only when running the program on more than 1 node. Running the attached test program with 2 processes that span 2 nodes reproduces the issue on our local cluster.
- The test program gathers data for multiple variables (an hvector of indexed variables) to rank 0, using MPI point-to-point communication rather than MPI_Gather. The number of variables gathered can be controlled with the "--nvars=x" command line option. The data corruption appears to be related to the amount of data gathered, since it shows up only when gathering many variables. On our cluster the issue does not show up when gathering <= 10 variables, and it does not show up for every case with > 10 variables either.
- The workaround is to set the environment variable "UCX_TLS" to "sm,ud" to restrict the UCX transport protocols used by OpenMPI+UCX.
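To compare the workaround against a failing transport setting side by side, something like the following sketch may help. It assumes Open MPI's mpiexec, whose -x option exports an environment variable to the launched ranks, and the binary name from the build step in this report. The run_case helper and the DRY_RUN switch are hypothetical names introduced only for this illustration. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: run_case and DRY_RUN are illustrative names, not part of the reproducer.
# DRY_RUN=1 (default) prints the command; DRY_RUN=0 actually launches it.
DRY_RUN="${DRY_RUN:-1}"

run_case() {
    tls="$1"; nprocs="$2"; nvars="$3"
    # -x VAR=value asks Open MPI's mpiexec to export VAR to the launched ranks.
    set -- mpiexec -n "$nprocs" -x "UCX_TLS=$tls" ./mpi_gather_openmpi "--nvars=$nvars"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run_case sm,ud 2 13   # workaround transports from this report
run_case rc 2 13      # transport setting reported to corrupt data
```

Running it with DRY_RUN=0 on two nodes should, if the report holds on your system, pass validation for the first case and fail for the second.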
I built the latest version of OpenMPI, 4.1.0, on the system and the issue still exists.
The OpenMPI and UCX install information and the test program are available at https://gist.github.com/jayeshkrishna/5d053d3d5bba11359ea2dc82c435c3ea. On a successful run the test program prints "SUCCESS: Validation successful". When data validation fails it prints "ERROR: Validation failed" along with the first index in the gather buffer where validation failed.
Building the test program
Note that argparser.[cpp/h] are helper code; you only need to look at mpi_gather.cpp (the main() and createsndrcvdt() routines).
mpicxx argparser.cpp mpi_gather.cpp -o mpi_gather_openmpi
Running the test program
The issue shows up only with certain values of "--nvars" (for example, --nvars=13) on our local cluster with processes across multiple nodes, so I use a loop to test it out:
for iproc in 2 4 16 64 96 128
do
    for i in 2 4 8 9 10 11 12 13 14 15 16 18 19 20 32
    do
        echo "--------- MPI Gather nprocs = $iproc (nvars = $i) ----------------"
        mpiexec -n "$iproc" ./mpi_gather_openmpi --nvars="$i"
    done
done
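When the loop above produces a lot of output, it can help to reduce each run to a single verdict. The following is a minimal sketch that keys off the two messages the test program is described as printing. The classify_run name is hypothetical, and the demonstration feeds it canned strings rather than real program output.

```shell
#!/bin/sh
# classify_run (illustrative name) reads one run's output on stdin
# and prints PASS, FAIL, or UNKNOWN based on the reproducer's messages.
classify_run() {
    out=$(cat)
    case "$out" in
        *"ERROR: Validation failed"*)       echo FAIL ;;
        *"SUCCESS: Validation successful"*) echo PASS ;;
        *)                                  echo UNKNOWN ;;
    esac
}

# Demonstration on canned output. In the loop, one would instead pipe
# the mpiexec invocation (with 2>&1) into classify_run.
printf 'SUCCESS: Validation successful\n' | classify_run
printf 'ERROR: Validation failed\n' | classify_run
```

The failure pattern is checked first so that a run in which any rank reports an error counts as a failure, even if other output also contains the success message.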
Please let me know if you need further information on this issue.
6 Answers:
@jayeshkrishna can you please try setting UCX_MEM_EVENTS=n to check if the issue is related to memory hooks?
I just tried setting "UCX_MEM_EVENTS=n" (env variable) but that did not help (the reproducer validation still failed).
Hi @jayeshkrishna,
thank you for the bug report.
I tried to reproduce your issue, but it seems our compiler is too old to handle the regex code. I see this error:
--------- MPI Gather nprocs = 2 (nvars = 2) ----------------
terminate called after throwing an instance of 'std::regex_error'
what(): regex_error
[jazz17:111574] *** Process received signal ***
[jazz17:111574] Signal: Aborted (6)
[jazz17:111574] Signal code: (-6)
We are using GCC 4.8.5 20150623 (Red Hat 4.8.5-36).
What compiler do you use? I'm going to re-implement your reproducer to use the getopt function to parse arguments, and will let you know about progress.
@hoopoepg Jayesh used Intel-20.0.4 (from the gist above). GNU 4.8.5 is too old; a recent gcc shouldn't have issues as far as I know.
@hoopoepg Apparently, gcc 4.9 added "Improved support for C++11, including: support for \
Yes, a newer version of gcc should work fine. We use gcc 8.2.0 for our gnu tests (We require gcc 4.9+ for building Scorpio + E3SM due to limited C++ regex support in gcc < 4.9).
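A quick pre-build check for the gcc 4.9+ requirement mentioned above could look like the sketch below. The gcc_has_regex helper name is hypothetical, and the check only compares version numbers; it does not test regex support itself.

```shell
#!/bin/sh
# gcc_has_regex (illustrative name): succeeds if the given GCC version
# string is 4.9 or newer, the threshold quoted in this thread.
gcc_has_regex() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # second version component
    [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 9 ]; }
}

# Versions mentioned in this thread; in practice one might pass the
# output of `gcc -dumpversion` instead.
for v in 4.8.5 8.2.0; do
    if gcc_has_regex "$v"; then
        echo "$v: new enough"
    else
        echo "$v: too old"
    fi
done
```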