Errors using to_zarr for an s3 store #3831

Closed
JarrodBWong opened this issue Mar 5, 2020 · 15 comments
Labels
needs mcve (https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports) · plan to close (may be closeable, needs more eyeballs)


@JarrodBWong

JarrodBWong commented Mar 5, 2020

Hello,
I have been trying to write zarr files from xarray directly into an s3 store, but I keep getting errors about missing arrays. It looks like the structure of the zarr archive is created in my s3 bucket (I can see the .zarray and .zattrs files), but the 0.0.0, 0.0.1, etc. chunk files are missing. I have been able to write the same arrays directly to disk, so I don't think it's an issue with the dataset itself.

MCVE Code Sample

import s3fs
# ds is an existing xarray.Dataset

s3 = s3fs.S3FileSystem(anon=False)
store = s3fs.S3Map(root='s3://my-bucket/data.zarr', s3=s3, check=False)

ds.to_zarr(store=store,
           consolidated=True,
           mode='w')

Output

The variable name in the error changes from run to run; it is not always the same array that is reported missing.

Logs:

---------------------------------------------------------------------------
NoSuchKey                                 Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/s3fs/core.py in _fetch_range(client, bucket, key, version_id, start, end, max_attempts, req_kw)
   1196                 Range='bytes=%i-%i' % (start, end - 1),
-> 1197                 **kwargs)
   1198         return resp['Body'].read()

~/.local/lib/python3.7/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
315 # The "self" in this scope is referring to the BaseClient.
--> 316 return self._make_api_call(operation_name, kwargs)
317

~/.local/lib/python3.7/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
625 error_class = self.exceptions.from_code(error_code)
--> 626 raise error_class(parsed_response, operation_name)
627 else:

NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.

During handling of the above exception, another exception occurred:

FileNotFoundError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/fsspec/mapping.py in __getitem__(self, key, default)
75 try:
---> 76 result = self.fs.cat(key)
77 except: # noqa: E722

/opt/conda/lib/python3.7/site-packages/fsspec/spec.py in cat(self, path)
545 """ Get the content of a file """
--> 546 return self.open(path, "rb").read()
547

/opt/conda/lib/python3.7/site-packages/fsspec/spec.py in read(self, length)
1129 return b""
-> 1130 out = self.cache._fetch(self.loc, self.loc + length)
1131 self.loc += len(out)

/opt/conda/lib/python3.7/site-packages/fsspec/caching.py in _fetch(self, start, end)
338 # First read, or extending both before and after
--> 339 self.cache = self.fetcher(start, bend)
340 self.start = start

/opt/conda/lib/python3.7/site-packages/s3fs/core.py in _fetch_range(self, start, end)
1059 def _fetch_range(self, start, end):
-> 1060 return _fetch_range(self.fs.s3, self.bucket, self.key, self.version_id, start, end, req_kw=self.req_kw)
1061

/opt/conda/lib/python3.7/site-packages/s3fs/core.py in _fetch_range(client, bucket, key, version_id, start, end, max_attempts, req_kw)
1212 return b''
-> 1213 raise translate_boto_error(e)
1214 except Exception as e:

FileNotFoundError: The specified key does not exist.

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/zarr/core.py in _load_metadata_nosync(self)
149 mkey = self._key_prefix + array_meta_key
--> 150 meta_bytes = self._store[mkey]
151 except KeyError:

/opt/conda/lib/python3.7/site-packages/fsspec/mapping.py in __getitem__(self, key, default)
79 return default
---> 80 raise KeyError(key)
81 return result

KeyError: 'my-bucket/data.zarr/lv_HTGL7_l1/.zarray'

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
<ipython-input-...> in <module>
7 ds.to_zarr(store=s3_store_dest,
8 consolidated=True,
----> 9 mode='w')

/opt/conda/lib/python3.7/site-packages/xarray/core/dataset.py in to_zarr(self, store, mode, synchronizer, group, encoding, compute, consolidated, append_dim)
1623 compute=compute,
1624 consolidated=consolidated,
-> 1625 append_dim=append_dim,
1626 )
1627

/opt/conda/lib/python3.7/site-packages/xarray/backends/api.py in to_zarr(dataset, store, mode, synchronizer, group, encoding, compute, consolidated, append_dim)
1341 writer = ArrayWriter()
1342 # TODO: figure out how to properly handle unlimited_dims
-> 1343 dump_to_store(dataset, zstore, writer, encoding=encoding)
1344 writes = writer.sync(compute=compute)
1345

/opt/conda/lib/python3.7/site-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
1133 variables, attrs = encoder(variables, attrs)
1134
-> 1135 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
1136
1137

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
385 self.set_dimensions(variables_encoded, unlimited_dims=unlimited_dims)
386 self.set_variables(
--> 387 variables_encoded, check_encoding_set, writer, unlimited_dims=unlimited_dims
388 )
389

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in set_variables(self, variables, check_encoding_set, writer, unlimited_dims)
444 dtype = str
445 zarr_array = self.ds.create(
--> 446 name, shape=shape, dtype=dtype, fill_value=fill_value, **encoding
447 )
448 zarr_array.attrs.put(encoded_attrs)

/opt/conda/lib/python3.7/site-packages/zarr/hierarchy.py in create(self, name, **kwargs)
877 """Create an array. Keyword arguments as per
878 :func:zarr.creation.create."""
--> 879 return self._write_op(self._create_nosync, name, **kwargs)
880
881 def _create_nosync(self, name, **kwargs):

/opt/conda/lib/python3.7/site-packages/zarr/hierarchy.py in _write_op(self, f, *args, **kwargs)
656
657 with lock:
--> 658 return f(*args, **kwargs)
659
660 def create_group(self, name, overwrite=False):

/opt/conda/lib/python3.7/site-packages/zarr/hierarchy.py in _create_nosync(self, name, **kwargs)
884 kwargs.setdefault('cache_attrs', self.attrs.cache)
885 return create(store=self._store, path=path, chunk_store=self._chunk_store,
--> 886 **kwargs)
887
888 def empty(self, name, **kwargs):

/opt/conda/lib/python3.7/site-packages/zarr/creation.py in create(shape, chunks, dtype, compressor, fill_value, order, store, synchronizer, overwrite, path, chunk_store, filters, cache_metadata, cache_attrs, read_only, object_codec, **kwargs)
123 # instantiate array
124 z = Array(store, path=path, chunk_store=chunk_store, synchronizer=synchronizer,
--> 125 cache_metadata=cache_metadata, cache_attrs=cache_attrs, read_only=read_only)
126
127 return z

/opt/conda/lib/python3.7/site-packages/zarr/core.py in __init__(self, store, path, read_only, chunk_store, synchronizer, cache_metadata, cache_attrs)
122
123 # initialize metadata
--> 124 self._load_metadata()
125
126 # initialize attributes

/opt/conda/lib/python3.7/site-packages/zarr/core.py in _load_metadata(self)
139 """(Re)load metadata from store."""
140 if self._synchronizer is None:
--> 141 self._load_metadata_nosync()
142 else:
143 mkey = self._key_prefix + array_meta_key

/opt/conda/lib/python3.7/site-packages/zarr/core.py in _load_metadata_nosync(self)
150 meta_bytes = self._store[mkey]
151 except KeyError:
--> 152 err_array_not_found(self._path)
153 else:
154

/opt/conda/lib/python3.7/site-packages/zarr/errors.py in err_array_not_found(path)
19
20 def err_array_not_found(path):
---> 21 raise ValueError('array not found at path %r' % path)
22
23
ValueError: array not found at path 'lv_HTGL7_l1'

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:33:48)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.14.165-133.209.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.15.0
pandas: 1.0.1
numpy: 1.18.1
scipy: 1.4.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: 1.5.5
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: None
bottleneck: None
dask: 2.11.0
distributed: 2.11.0
matplotlib: 3.1.3
cartopy: None
seaborn: 0.10.0
numbagg: None
setuptools: 45.2.0.post20200209
pip: 20.0.2
conda: 4.7.12
pytest: None
IPython: 7.12.0
sphinx: None

@max-sixty (Collaborator)

Thanks for the issue @JarrodBWong

@jhamman (or @rabernat?) -- what's the best way of identifying whether this is an xarray or zarr issue? There are a few similar issues in the backlog, and they often go unanswered. To the extent we can help people work out where the issue lies and push it to the relevant library, that would keep things moving. Thank you.

@rabernat (Contributor)

rabernat commented Mar 5, 2020

These are tricky issues because they involve the integration of at least three libraries (xarray, zarr, and s3fs, possibly with dask as well).

Are you using dask?

There could be some issues with s3fs caching (see fsspec/s3fs#285). If you start fresh on a new path with nothing in it (so you don't need mode='w'), does it work?
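A minimal sketch of that experiment (the key name data-fresh.zarr is hypothetical; ds and s3 are the objects from the MCVE above):

# Write to a brand-new key that has never been written to,
# so nothing is overwritten and mode='w' can be omitted.
fresh_store = s3fs.S3Map(root='s3://my-bucket/data-fresh.zarr', s3=s3, check=False)
ds.to_zarr(store=fresh_store, consolidated=True)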

@JarrodBWong (Author)

Thanks for the responses. I have tried this both with and without dask so far.

Taking out mode='w' did go through successfully, which surprises me. I had tried deleting the whole uploaded directory structure between attempts to get that same "fresh path" effect.

@rabernat (Contributor)

rabernat commented Mar 5, 2020

In that case, I'm fairly certain it is fsspec/s3fs#285.

There is a bug in s3fs where it caches the directory listing and then doesn't update it again, even if you delete files. This would potentially cause problems when trying to overwrite, since s3fs would think the objects are already there, even after they have been deleted.

The same bug means consolidated metadata usually doesn't work. Perhaps @martindurant can weigh in.

@martindurant (Contributor)

fsspec/filesystem_spec#243 is where my attempt to fix this kind of thing will live.

However, writing or deleting keys should invalidate the appropriate part of the cache as it currently stands, so I don't know why the problem has arisen. If it is a cache problem, then s3.invalidate_cache() can always be called.
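For example, a sketch of that fallback against the MCVE above (reusing its s3 and store objects):

# Drop s3fs's cached directory listings before retrying the write,
# so deleted objects are no longer "visible" through a stale cache.
s3.invalidate_cache()
ds.to_zarr(store=store, consolidated=True, mode='w')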

@rabernat (Contributor)

rabernat commented Mar 5, 2020

I had tried to delete all of the uploaded directory structure before in between attempts to give it that same effect of a "fresh path".

Key question: did you restart your kernel or call s3.invalidate_cache() in between attempts as well? If not, it again points to a caching problem.

The goal here is to drill down into the stack and find the point where s3fs is failing to update its cache correctly.

@JarrodBWong (Author)

I never called s3.invalidate_cache() in between, but I was restarting my kernel fairly regularly, though not intentionally as a way to clear the cache.

@max-sixty (Collaborator)

Not to hijack this specific issue for the general case, but any thoughts on the best way for users & us to identify the appropriate library for users to direct their issues? Is it just the last item in the call stack? Does xarray need to build diagnostics / assertions for highlighting where the problem is?

A quick survey of the first two pages of xarray issues yields a bunch of issues that receive no response from us, and those that do often involve a decent amount of back and forth:
#3815 (zarr?)
#3781 (scipy? dask?)
#3776 (probably xarray, maybe netcdf)
#3767 (scipy? netcdf? this did get a response, from @dcherian )
#3754 (pydap. @dcherian worked through this one with some back and forth)

@rabernat (Contributor)

rabernat commented Mar 6, 2020

but any thoughts on the best way for users & us to identify the appropriate library for users to direct their issues?

This is basically an integration problem. What we are lacking is a comprehensive set of integration tests for this ecosystem (xarray + dask + zarr + fsspec and all its implementations). Pangeo has served as a de facto point for this discussion, since we are using the whole stack, and several similar issues have come up there.

All of these libraries understandably want to push the issues somewhere else, since they tend to be complex and hard to reduce to an MCVE. But there are fundamental issues related to integration that have to be addressed somewhere.

Is it just the last item in the call stack?

Yes and no. The details of how xarray talks to these stores may matter. Continuing our planned refactor of the backend classes to use entry points, and formalizing the interface for backends, should help surface problems. The way we do consolidated metadata, for example, is pretty ad hoc:

def close(self):
    if self._consolidate_on_close:
        import zarr
        zarr.consolidate_metadata(self.ds.store)

Does xarray need to build diagnostics / assertions for highlighting where the problem is? pangeo-data/pangeo#691

Better diagnostics and more informative errors are always good. But we don't want to do this haphazardly; I think it should be part of the backend refactor. When is that happening, btw? 🙃

We can't hope to test every possible permutation of xarray / zarr store / fsspec implementation within xarray. One idea I have thought about is an "integration bot" that watches all of these libraries and runs its own integration tests. This bot could even be configured to watch the repos and comment on PRs. That would be a cool project!

Would appreciate @jhamman's thoughts here.

@dcherian (Contributor)

dcherian commented Mar 6, 2020

One idea I have thought about is an "integration bot"

I think this should live under pangeo/stack-integration-tests (or similar) and run CI nightly against both git-master and stable-release versions of xarray / dask / zarr / gcsfs etc.

@alimanfoo (Contributor)

Just to say having some kind of stack integration tests is a marvellous idea. Another example of an issue that's very hard to pin down is zarr-developers/zarr-python#528.

Btw, we have also run into issues with fsspec caching directory listings and not invalidating the cache when changes are made to a store, although I haven't checked against the latest master. We have a lot of workarounds in our code where we reopen everything after making changes to a store. Probably an area where some more digging and careful testing is needed.

@martindurant (Contributor)

Note that s3fs and gcsfs now expose the kwargs skip_instance_cache, use_listings_cache, listings_expiry_time, and max_paths and pass them through to fsspec. See
https://filesystem-spec.readthedocs.io/en/latest/features.html#instance-caching and
https://filesystem-spec.readthedocs.io/en/latest/features.html#listings-caching

(The new releases of both already include the change that accessing a file's contents or metadata no longer requires a directory listing, which is the right behaviour for zarr, where the full paths are known.)
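For reference, a sketch of how those kwargs can be combined (the values are illustrative, not recommendations):

import s3fs

s3 = s3fs.S3FileSystem(
    anon=False,
    skip_instance_cache=True,   # always construct a fresh filesystem instance
    use_listings_cache=True,    # keep listings caching on...
    listings_expiry_time=10,    # ...but expire cached listings after 10 seconds
    max_paths=100,              # and cap how many listings are retained
)
# Alternatively, disable listings caching entirely with use_listings_cache=False.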

@max-sixty added the "needs mcve" and "plan to close" labels on Oct 14, 2023
@max-sixty (Collaborator)

Do we know whether the reported issue is still present?

The general question of integration tests would be good to solve but is probably a separate, broader topic...

@JarrodBWong (Author)

This is still an issue if a file path is provided instead of a mapping. The workaround I've found is:

import s3fs

fs = s3fs.S3FileSystem(anon=False, skip_instance_cache=True)

# outpath is the s3:// target path
ds.to_zarr(fs.get_mapper(outpath), mode="w")
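(Presumably skip_instance_cache=True helps because fsspec otherwise hands back a cached S3FileSystem instance, so stale directory listings from an earlier run can leak into the new write.)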

@max-sixty (Collaborator)

fs = s3fs.S3FileSystem(anon=False, skip_instance_cache=True)

Should fs work as a store though?

store is typed as store: MutableMapping | str | PathLike[str] | None = None,...

I'm closing issues to keep the open count below 1K, but I can reopen this if we can show it should work.
