HDF5-DIAG warnings calling open_mfdataset with more than file_cache_maxsize datasets (hdf5 1.12.2) #7549
It's not clear to me whether this is a bug or just some verbose warnings. @kmuehlbauer do you have any thoughts?
@dcherian Thanks for the ping. I can reproduce this in a fresh conda-forge env with pip-installed netcdf4, xarray and dask. @mx-moth A search brought up this likely related issue over at netcdf-c, Unidata/netcdf-c#2458. The corresponding PR with a fix, Unidata/netcdf-c#2461, is milestoned for netcdf-c 4.9.1.
I just tested this with netcdf-c 4.9.1, but these errors still show up, also with a conda-forge-only install. To make this even weirder, I've checked creation/reading with only hdf5/h5py/h5netcdf in the environment, and everything seems to work well (a sketch of such a check is shown after the traceback below).

import argparse
import pathlib
import tempfile
from typing import List

import h5netcdf.legacyapi as nc
import xarray

HERE = pathlib.Path(__file__).parent


def add_arguments(parser: argparse.ArgumentParser):
    parser.add_argument('count', type=int, default=200, nargs='?')
    parser.add_argument('--file-cache-maxsize', type=int, required=False)


def main():
    parser = argparse.ArgumentParser()
    add_arguments(parser)
    opts = parser.parse_args()

    if opts.file_cache_maxsize is not None:
        xarray.set_options(file_cache_maxsize=opts.file_cache_maxsize)

    temp_dir = tempfile.mkdtemp(dir=HERE, prefix='work-dir-')
    work_dir = pathlib.Path(temp_dir)
    print("Working in", work_dir.name)

    print("Making", opts.count, "datasets")
    dataset_paths = make_many_datasets(work_dir, count=opts.count)

    print("Combining", len(dataset_paths), "datasets")
    dataset = xarray.open_mfdataset(dataset_paths, lock=False, engine="h5netcdf")
    dataset.to_netcdf(work_dir / 'combined.nc', engine="h5netcdf")


def make_many_datasets(
    work_dir: pathlib.Path,
    count: int = 200,
) -> List[pathlib.Path]:
    dataset_paths = []
    for i in range(count):
        variable = f'var_{i}'
        path = work_dir / f'{variable}.nc'
        dataset_paths.append(path)
        make_dataset(path, variable)
    return dataset_paths


def make_dataset(
    path: pathlib.Path,
    variable: str,
) -> None:
    ds = nc.Dataset(path, "w")
    ds.createDimension("x", 1)
    var = ds.createVariable(variable, "i8", ("x",))
    var[:] = 1
    ds.close()


if __name__ == '__main__':
    main()
Working in work-dir-40mwn69y
Making 11 datasets
Combining 11 datasets
Traceback (most recent call last):
File "/home/kai/python/gists/xarray/7549.py", line 65, in <module>
main()
File "/home/kai/python/gists/xarray/7549.py", line 36, in main
dataset.to_netcdf(work_dir / 'combined.nc', engine="h5netcdf")
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/xarray/core/dataset.py", line 1911, in to_netcdf
return to_netcdf( # type: ignore # mypy cannot resolve the overloads:(
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/xarray/backends/api.py", line 1226, in to_netcdf
writes = writer.sync(compute=compute)
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/xarray/backends/common.py", line 172, in sync
delayed_store = da.store(
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/dask/array/core.py", line 1236, in store
compute_as_if_collection(Array, store_dsk, map_keys, **kwargs)
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/dask/base.py", line 341, in compute_as_if_collection
return schedule(dsk2, keys, **kwargs)
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/dask/threaded.py", line 89, in get
results = get_async(
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/dask/local.py", line 511, in get_async
raise_exception(exc, tb)
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/dask/local.py", line 319, in reraise
raise exc
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/dask/local.py", line 224, in execute_task
result = _execute_task(task, data)
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/dask/core.py", line 119, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/dask/array/core.py", line 126, in getter
c = np.asarray(c)
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/xarray/core/indexing.py", line 459, in __array__
return np.asarray(self.array, dtype=dtype)
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/xarray/core/indexing.py", line 623, in __array__
return np.asarray(self.array, dtype=dtype)
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/xarray/core/indexing.py", line 524, in __array__
return np.asarray(array[self.key], dtype=None)
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py", line 43, in __getitem__
return indexing.explicit_indexing_adapter(
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/xarray/core/indexing.py", line 815, in explicit_indexing_adapter
result = raw_indexing_method(raw_key.tuple)
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py", line 50, in _getitem
return array[key]
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/h5netcdf/core.py", line 337, in __getitem__
padding = self._get_padding(key)
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/h5netcdf/core.py", line 291, in _get_padding
shape = self.shape
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/h5netcdf/core.py", line 268, in shape
return tuple([self._parent._all_dimensions[d].size for d in self.dimensions])
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/h5netcdf/core.py", line 268, in <listcomp>
return tuple([self._parent._all_dimensions[d].size for d in self.dimensions])
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/h5netcdf/dimensions.py", line 113, in size
if self.isunlimited():
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/h5netcdf/dimensions.py", line 133, in isunlimited
return self._h5ds.maxshape == (None,)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "/home/kai/miniconda/envs/test-netcdf4/lib/python3.9/site-packages/h5py/_hl/dataset.py", line 588, in maxshape
space = self.id.get_space()
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5d.pyx", line 299, in h5py.h5d.DatasetID.get_space
ValueError: Invalid dataset identifier (invalid dataset identifier)

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.14.21-150400.24.46-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: None

Update: added …
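For reference, a creation/read round trip of the kind mentioned at the top of this comment, using only h5netcdf/h5py (no xarray or dask), might look roughly like the following sketch; the file name is hypothetical and this is not necessarily the exact check that was run:

```python
import pathlib

import h5netcdf.legacyapi as nc

# Write a tiny file with the h5netcdf legacy API and read it back without
# xarray/dask, to see whether plain h5netcdf/h5py emit any HDF5-DIAG output.
path = pathlib.Path("h5netcdf_only_check.nc")  # hypothetical file name

ds = nc.Dataset(path, "w")
ds.createDimension("x", 1)
var = ds.createVariable("var_0", "i8", ("x",))
var[:] = 1
ds.close()

ds = nc.Dataset(path, "r")
print(ds.variables["var_0"][:])  # expected: [1]
ds.close()
```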
This is as far as I can get for the moment. @mx-moth I'd suggest taking this upstream (netCDF4/netcdf-c) with the details about this issue. At least we can rule out an issue related only to … Maybe @DennisHeimbigner can shed more light here?
@pp-mo 👀
See our issues here: SciTools/iris#5187
Before v1.6.1, I believe …
Oops, fixed.
Thanks, @trexfeathers. I ran into the same HDF5-DIAG warnings after upgrading to xarray v2023.3.0 yesterday. Your diagnosis in SciTools/iris#5187 helped isolate the issue to libnetcdf v4.9.1. Downgrading libnetcdf to v4.8.1 resulted in no HDF5 warnings.
Thanks, everybody. Like @gewitterblitz, and based on SciTools/iris#5187, pinning libnetcdf to v4.8.1 did the trick.
This merged pull request fixes the issue in netCDF: Unidata/netcdf-c#2675
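For anyone checking whether their environment already has a fixed libnetcdf, the INSTALLED VERSIONS blocks quoted in this issue are the output of `xarray.show_versions()`, which includes the `libhdf5` and `libnetcdf` lines:

```python
import xarray

# Prints the "INSTALLED VERSIONS" report shown elsewhere in this issue,
# including the libhdf5 and libnetcdf versions of the current environment.
xarray.show_versions()
```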
What happened?
Using `ds = open_mfdataset(...)` to open more than `file_cache_maxsize` files, then saving the combined dataset with `ds.to_netcdf()`, prints many HDF5-DIAG warnings. This happens when using `hdf5==1.12.2` as bundled with `netcdf4~=1.6.0`. Downgrading to `netcdf4~=1.5.8`, which bundles `hdf5==1.12.0`, stops the warnings from being printed. The output file appears to be correct in either case.
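One quick way to test the cache-size connection (and a possible workaround) is to raise `file_cache_maxsize` above the number of files before opening them. This is only a sketch of that idea; the cache value and the glob for the input files are placeholders:

```python
import pathlib

import xarray

# Raise the file cache above the number of files being opened; xarray's
# default file_cache_maxsize is 128, so opening more files than that is
# what triggers the close/reopen behaviour described above.
xarray.set_options(file_cache_maxsize=512)

# Placeholder glob: the per-variable files written by the example script.
dataset_paths = sorted(pathlib.Path(".").glob("var_*.nc"))

combined = xarray.open_mfdataset(dataset_paths)
combined.to_netcdf("combined.nc")
```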
What did you expect to happen?

No warnings from HDF5-DIAG: either raise an error because of the number of files being opened at once, or behave as `open_mfdataset()` normally does.

Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
Anything else we need to know?
The example is a script to run on the command line. Assuming the file is named `test.py`, invoke it as `python3 ./test.py`, optionally passing a dataset count and `--file-cache-maxsize` (see `add_arguments` in the script). The log output is from `python3 ./test.py 128`. The length of the log output depends on the number of files opened: the more files, the longer the log. The log output from `python3 ./test.py` using the default of 200 datasets was too long to include in this issue! All output files are restricted to directories created in the current working directory, named `work-dir-XXXXXXX`, and are retained after the program exits. 200 datasets take up a total of 1.7 MB.

Environment
Failing environment:
INSTALLED VERSIONS
commit: None
python: 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: ('en_AU', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.0
xarray: 2023.1.0
pandas: 1.5.3
numpy: 1.24.2
scipy: None
netCDF4: 1.6.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2023.2.0
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 44.0.0
pip: 23.0.1
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None
Working environment:
INSTALLED VERSIONS
commit: None
python: 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: ('en_AU', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4
xarray: 2023.1.0
pandas: 1.5.3
numpy: 1.24.2
scipy: None
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2023.2.0
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 44.0.0
pip: 23.0.1
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None