Parallel runs with routing crash #370
Comments
Attached is a small, reproducible example that fails when ngen is built with gcc 8.3.1 using `cmake3 -DQUIET:=On -DBMI_C_LIB_ACTIVE:=On -DNGEN_ACTIVATE_PYTHON:BOOL=ON -DNGEN_ACTIVATE_ROUTING:BOOL=ON -DMPI_ACTIVE:=On ..`. Running the example causes ngen to crash after the catchment formulations are run, but before routing is launched. |
Could NOT reproduce using gcc 6.3.1 |
Just to note, as I continue to investigate this issue, I have noticed that even in the gcc 8.3.1 environment, this crash is not deterministic. It acts like a The seg fault does not happen on MPI rank 0, where the routing adapter is run, but instead comes from one of the other ranks. On these ranks, a |
Completely unrelated execution (this was a non-parallel build and had no routing in the realization config), but I happened to get this error when trying to run valgrind for a completely different reason:
Possibly related??? |
Notably, I encountered some stability issues with routing and ngen/Python. These were on rank 0, but had to do with the HDF5 library: pytables brings along its own binary, and we were building ngen with another. I solved those issues by building pytables from source against the same libhdf5 as ngen. In theory, this could lead to an issue like this one on a non-rank-0 process if either pytables or HDF5 were used there... maybe? In any case, this mismatch, and any other binary libraries loaded by Python modules through pybind that may not match the libraries loaded in ngen, should be looked at in relation to this. |
This may not be entirely parallel-related. A similar issue seems to have come up during calibration runs. It was reported as random seg faults during the calibration runs, but the symptoms are eerily similar. Notes on a reported crash: compiler:
Finished 59161 timesteps.
Warning! ***HDF5 library version mismatched error***
The HDF5 header files used to compile this application do not match
the version used by the HDF5 library to which this application is linked.
Data corruption or segmentation faults may occur if the application continues.
This can happen when an application was compiled by one version of HDF5 but
linked with a different version of static or shared HDF5 library.
You should recompile the application or check your shared library related
settings such as 'LD_LIBRARY_PATH'.
'HDF5_DISABLE_VERSION_CHECK' environment variable is set to 1, application will
continue at your own risk.
Headers are 1.12.2, library is 1.10.4
SUMMARY OF THE HDF5 CONFIGURATION
=================================
General Information:
-------------------
HDF5 Version: 1.10.4
Configured on: Mon, 13 Apr 2020 12:15:08 +0000
Configured by: Debian
Host system: x86_64-pc-linux-gnu
Uname information: Debian
Byte sex: little-endian
Installation point: /usr
Flavor name: serial
Compiling Options:
------------------
Build Mode: production
Debugging Symbols: no
Asserts: no
Profiling: no
Optimization Level: high
Linking Options:
----------------
Libraries: static, shared
Statically Linked Executables:
LDFLAGS: -Wl,-Bsymbolic-functions -Wl,-z,relro
H5_LDFLAGS: -Wl,--version-script,$(top_srcdir)/debian/map_serial.ver
AM_LDFLAGS:
Extra libraries: -lpthread -lsz -lz -ldl -lm
Archiver: ar
AR_FLAGS: cr
Ranlib: x86_64-linux-gnu-ranlib
Languages:
----------
C: yes
C Compiler: /usr/bin/gcc
CPPFLAGS: -Wdate-time -D_FORTIFY_SOURCE=2
H5_CPPFLAGS: -D_GNU_SOURCE -D_POSIX_C_SOURCE=200112L -DNDEBUG -UH5_DEBUG_API
AM_CPPFLAGS:
C Flags: -g -O2 -fdebug-prefix-map=$(top_srcdir)=. -fstack-protector-strong -Wformat -Werror=format-security
H5 C Flags: -std=c99 -pedantic -Wall -Wextra -Wbad-function-cast -Wc++-compat -Wcast-align -Wcast-qual -Wconversion -Wdeclaration-after-statement -Wdisabled-optimization -Wfloat-equal -Wformat=2 -Winit-self -Winvalid-pch -Wmissing-declarations -Wmissing-include-dirs -Wmissing-prototypes -Wnested-externs -Wold-style-definition -Wpacked -Wpointer-arith -Wredundant-decls -Wshadow -Wstrict-prototypes -Wswitch-default -Wswitch-enum -Wundef -Wunused-macros -Wunsafe-loop-optimizations -Wwrite-strings -finline-functions -s -Wno-inline -Wno-aggregate-return -Wno-missing-format-attribute -Wno-missing-noreturn -O
AM C Flags:
Shared C Library: yes
Static C Library: yes
Fortran: yes
Fortran Compiler: /usr/bin/gfortran
Fortran Flags: -g -O2 -fdebug-prefix-map=$(top_srcdir)=. -fstack-protector-strong
H5 Fortran Flags: -pedantic -Wall -Wextra -Wunderflow -Wimplicit-interface -Wsurprising -Wno-c-binding-type -s -O2
AM Fortran Flags:
Shared Fortran Library: yes
Static Fortran Library: yes
C++: yes
C++ Compiler: /usr/bin/g++
C++ Flags: -g -O2 -fdebug-prefix-map=$(top_srcdir)=. -fstack-protector-strong -Wformat -Werror=format-security
H5 C++ Flags: -pedantic -Wall -W -Wundef -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Wredundant-decls -Winline -Wsign-promo -Woverloaded-virtual -Wold-style-cast -Weffc++ -Wreorder -Wnon-virtual-dtor -Wctor-dtor-privacy -Wabi -finline-functions -s -O
AM C++ Flags:
Shared C++ Library: yes
Static C++ Library: yes
Java: yes
Java Compiler: /usr/bin/java (openjdk 11.0.7-ea 2020-04-14)
Features:
---------
Parallel HDF5: no
Parallel Filtered Dataset Writes: no
Large Parallel I/O: no
High-level library: yes
Threadsafety: yes
Default API mapping: v18
With deprecated public symbols: yes
I/O filters (external): deflate(zlib),szip(encoder)
MPE: no
Direct VFD: no
dmalloc: no
Packages w/ extra debug output: none
API tracing: no
Using memory checker: no
Memory allocation sanity checks: no
Metadata trace file: no
Function stack tracing: no
Strict file format checks: no
Optimization instrumentation: no
Finished routing
/home/west/git_repositories/ngen_10242022/ngen/venv/lib/python3.8/site-packages/h5py/__init__.py:36: UserWarning: h5py is running against HDF5 1.10.4 when it was built against 1.12.2, this may cause problems
_warn(("h5py is running against HDF5 {0} when it was built against {1}, "
/home/west/git_repositories/ngen_10242022/ngen/extern/t-route/src/python_routing_v02/troute/routing/compute.py:597: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
pd.Series(index=lastobs_df_sub.index, name="Null"),
/home/west/git_repositories/ngen_10242022/ngen/extern/t-route/src/python_routing_v02/troute/routing/compute.py:601: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
pd.Series(index=lastobs_df_sub.index, name="Null"),
/home/west/git_repositories/ngen_10242022/ngen/extern/t-route/src/python_routing_v02/troute/routing/compute.py:597: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
pd.Series(index=lastobs_df_sub.index, name="Null"),
/home/west/git_repositories/ngen_10242022/ngen/extern/t-route/src/python_routing_v02/troute/routing/compute.py:601: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
pd.Series(index=lastobs_df_sub.index, name="Null"),
/home/west/git_repositories/ngen_10242022/ngen/extern/t-route/src/nwm_routing/src/nwm_routing/__main__.py:566: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->axis0] [items->None]
flowveldepth.loc[csv_output_segments].to_hdf(output_path.joinpath(filename_fvd), key="qvd")
/home/west/git_repositories/ngen_10242022/ngen/extern/t-route/src/nwm_routing/src/nwm_routing/__main__.py:566: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_items] [items->None]
flowveldepth.loc[csv_output_segments].to_hdf(output_path.joinpath(filename_fvd), key="qvd")
creating supernetwork connections set
supernetwork connections set complete
... in 0.007495403289794922 seconds.
setting channel initial states ...
channel initial states complete
... in 1.430511474609375e-06 seconds.
creating qlateral array ...
qlateral array complete
... in 61.56017208099365 seconds.
WARNING: Lateral flow time series is larger than provided nts. Adjusting nts.
If this was unintended, double check the configuration number of time steps and the lateral flow input time series
executing routing computation ...
JIT Preprocessing time 5.0067901611328125e-05 seconds.
starting Parallel JIT calculation
PARALLEL TIME 0.4558742046356201 seconds.
ordered reach computation complete
... in 0.45807576179504395 seconds.
Handling output ...
- writing flow, velocity, and depth results to .csv
output complete
... in 4.025044918060303 seconds.
process complete
66.17824506759644 seconds.
Segmentation fault (core dumped)
In the output, you can see that routing has finished. This is very hard to debug with certainty, as it may take hundreds of executions to reproduce this error (the calibration runs were up to 300+ iterations when this randomly occurred). The difference between this serial run and the parallel one is that in parallel, the "non-routing" ranks cause the error, so routing never finishes. When this is triggered in serial, routing is able to finish before the seg fault occurs in the destruction chain. As @mattw-nws noted, this MAY be related to modules built against and linked with mismatched binary versions at runtime in the embedded interpreter. I'm not really sure what that would do in the destructor chain here. |
I was also finally able to reproduce this on serial runs using Apple clang version 14.0.0 (clang-1400.0.29.102)
Target: arm64-apple-darwin21.6.0
It is indeed an issue with the order of destruction of resources: the Python interpreter is shut down before the module/objects held locally by the utility are destroyed. I'm pretty sure this is due to the use of a static singleton, since the destruction order of static variables across compilation units is not well defined. I have a fix that I should push to a PR soon; it at least resolves the persistent seg fault I was able to produce locally. I will need to test that fix on the reproducible example above using the known failing compiler configuration. |
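For readers following along, here is a minimal sketch of one common way to avoid this class of destruction-order problem: each object that depends on the interpreter holds a shared handle to it, so the interpreter cannot be torn down until the last dependent is gone. This is not the actual fix from the PR and it does not use pybind11. `MockInterpreter`, `MockRoutingAdapter`, `event_log`, and `run_demo` are hypothetical names invented for illustration only.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Records lifecycle events so the destruction order can be inspected.
static std::vector<std::string> event_log;

// Hypothetical stand-in for an embedded Python interpreter (assumption:
// the real code would wrap the interpreter guard in something similar).
struct MockInterpreter {
    MockInterpreter()  { event_log.push_back("interpreter start"); }
    ~MockInterpreter() { event_log.push_back("interpreter shutdown"); }

    // Hand out a shared handle, creating the interpreter on first use.
    // Only a weak_ptr is cached, so the interpreter's lifetime is driven
    // by its users rather than by static destruction order.
    static std::shared_ptr<MockInterpreter> instance() {
        static std::weak_ptr<MockInterpreter> cached;
        std::shared_ptr<MockInterpreter> handle = cached.lock();
        if (!handle) {
            handle = std::make_shared<MockInterpreter>();
            cached = handle;
        }
        return handle;
    }
};

// Hypothetical stand-in for a utility that holds Python modules/objects.
class MockRoutingAdapter {
public:
    MockRoutingAdapter() : interpreter_(MockInterpreter::instance()) {
        event_log.push_back("adapter created");
    }
    ~MockRoutingAdapter() {
        // The destructor body runs before members are destroyed, so the
        // interpreter handle is still valid while the module is released.
        assert(interpreter_ != nullptr);
        event_log.push_back("adapter released module");
    }
private:
    std::shared_ptr<MockInterpreter> interpreter_;
};

// Creates two adapters, destroys them, and returns the recorded events.
std::vector<std::string> run_demo() {
    event_log.clear();
    std::unique_ptr<MockRoutingAdapter> first(new MockRoutingAdapter());
    std::unique_ptr<MockRoutingAdapter> second(new MockRoutingAdapter());
    first.reset();   // interpreter stays alive: second still holds it
    second.reset();  // last holder gone: interpreter shuts down now
    return event_log;
}
```

Calling `run_demo()` returns the event sequence, in which the interpreter shutdown is recorded only after both adapters have released their modules, regardless of which adapter is destroyed first.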
Tested #470 on gcc 8.3.1 in parallel, and I no longer get crashes on non-routing ranks after they finish running catchments and shut down. |
When running the framework in parallel with routing enabled, ngen crashes while trying to load the routing module. This crash occurs after the catchment formulations complete.
Current behavior
A segmentation fault occurs trying to use pybind for the routing integration in parallel.
Expected behavior
The parallel formulation execution should finish, and rank 0 should initialize and execute the t-route routing module.