Problems faced while storing onto Zarr store using ABSStore #528
This is a bit of a guess, but are you sure all of the input netCDF files are there? The errors suggest that, during an attempt to read the netCDF input, something is requested which does not exist.
On Thu, 12 Dec 2019, 12:47 Nima Dokoohaki wrote:
```python
# Your code here
import zarr
from azure.storage.blob import BlockBlobService

store = zarr.ABSStore(container='zarrstoreall', prefix='zarrstoreall',
                      account_name='xxxx', account_key='xxxx',
                      blob_service_kwargs={'is_emulated': False})
compressor = zarr.Blosc(cname='zstd', clevel=3)
encoding = {vname: {'compressor': compressor} for vname in ds.data_vars}
ds.to_zarr(store=store, encoding=encoding, consolidated=True)
```
Problem description
I'm trying to use ABSStore to store a large xarray dataset into a zarr store backed by blob storage (see the code in the previous section). I am currently facing two issues:
1. I am first getting some sort of network error when loading "certain" variables into the store:

   ![image](https://user-images.githubusercontent.com/164987/70712978-5c4f2280-1ce5-11ea-8fbe-cadffe2d20aa.png)

2. After some time has passed I get this error:

   ![image](https://user-images.githubusercontent.com/164987/70712591-6290cf00-1ce4-11ea-974c-df2615ea0a0a.png)
Needless to say, with relatively smaller xarray datasets I did not face these issues.
I appreciate your kind attention.
Version and installation information
Please provide the following:
- Value of zarr.__version__ = '2.3.2'
- Value of numcodecs.__version__ = '0.6.4'
- Version of Python interpreter = Python 3.7.3
- Operating system (Linux/Windows/Mac) = Databricks Runtime Version 6.1 (includes Apache Spark 2.4.4, Scala 2.11)
- How Zarr was installed (e.g., "using pip into virtual environment", or "using conda") = !pip install zarr
Also, if you think it might be relevant, please provide the output from pip freeze or conda env export, depending on which was used to install Zarr.
pip freeze output:
adal==1.2.2
asciitree==0.3.3
asn1crypto==0.24.0
azure==4.0.0
azure-applicationinsights==0.1.0
azure-batch==4.1.3
azure-common==1.1.23
azure-cosmosdb-nspkg==2.0.2
azure-cosmosdb-table==1.0.6
azure-datalake-store==0.0.48
azure-eventgrid==1.3.0
azure-graphrbac==0.40.0
azure-keyvault==1.1.0
azure-loganalytics==0.1.0
azure-mgmt==4.0.0
azure-mgmt-advisor==1.0.1
azure-mgmt-applicationinsights==0.1.1
azure-mgmt-authorization==0.50.0
azure-mgmt-batch==5.0.1
azure-mgmt-batchai==2.0.0
azure-mgmt-billing==0.2.0
azure-mgmt-cdn==3.1.0
azure-mgmt-cognitiveservices==3.0.0
azure-mgmt-commerce==1.0.1
azure-mgmt-compute==4.6.2
azure-mgmt-consumption==2.0.0
azure-mgmt-containerinstance==1.5.0
azure-mgmt-containerregistry==2.8.0
azure-mgmt-containerservice==4.4.0
azure-mgmt-cosmosdb==0.4.1
azure-mgmt-datafactory==0.6.0
azure-mgmt-datalake-analytics==0.6.0
azure-mgmt-datalake-nspkg==3.0.1
azure-mgmt-datalake-store==0.5.0
azure-mgmt-datamigration==1.0.0
azure-mgmt-devspaces==0.1.0
azure-mgmt-devtestlabs==2.2.0
azure-mgmt-dns==2.1.0
azure-mgmt-eventgrid==1.0.0
azure-mgmt-eventhub==2.6.0
azure-mgmt-hanaonazure==0.1.1
azure-mgmt-iotcentral==0.1.0
azure-mgmt-iothub==0.5.0
azure-mgmt-iothubprovisioningservices==0.2.0
azure-mgmt-keyvault==1.1.0
azure-mgmt-loganalytics==0.2.0
azure-mgmt-logic==3.0.0
azure-mgmt-machinelearningcompute==0.4.1
azure-mgmt-managementgroups==0.1.0
azure-mgmt-managementpartner==0.1.1
azure-mgmt-maps==0.1.0
azure-mgmt-marketplaceordering==0.1.0
azure-mgmt-media==1.0.0
azure-mgmt-monitor==0.5.2
azure-mgmt-msi==0.2.0
azure-mgmt-network==2.7.0
azure-mgmt-notificationhubs==2.1.0
azure-mgmt-nspkg==3.0.2
azure-mgmt-policyinsights==0.1.0
azure-mgmt-powerbiembedded==2.0.0
azure-mgmt-rdbms==1.9.0
azure-mgmt-recoveryservices==0.3.0
azure-mgmt-recoveryservicesbackup==0.3.0
azure-mgmt-redis==5.0.0
azure-mgmt-relay==0.1.0
azure-mgmt-reservations==0.2.1
azure-mgmt-resource==2.2.0
azure-mgmt-scheduler==2.0.0
azure-mgmt-search==2.1.0
azure-mgmt-servicebus==0.5.3
azure-mgmt-servicefabric==0.2.0
azure-mgmt-signalr==0.1.1
azure-mgmt-sql==0.9.1
azure-mgmt-storage==2.0.0
azure-mgmt-subscription==0.2.0
azure-mgmt-trafficmanager==0.50.0
azure-mgmt-web==0.35.0
azure-nspkg==3.0.2
azure-servicebus==0.21.1
azure-servicefabric==6.3.0.0
azure-servicemanagement-legacy==0.20.6
azure-storage-blob==1.5.0
azure-storage-common==1.4.2
azure-storage-file==1.4.0
azure-storage-queue==1.4.0
backcall==0.1.0
boto==2.49.0
boto3==1.9.162
botocore==1.12.163
certifi==2019.3.9
cffi==1.12.2
cftime==1.0.4.2
chardet==3.0.4
cryptography==2.6.1
cycler==0.10.0
Cython==0.29.6
dask==2.9.0
decorator==4.4.0
docutils==0.14
fasteners==0.15
fsspec==0.6.1
idna==2.8
ipython==7.4.0
ipython-genutils==0.2.0
isodate==0.6.0
jedi==0.13.3
jmespath==0.9.4
kiwisolver==1.1.0
koalas==0.23.0
locket==0.2.0
matplotlib==3.0.3
monotonic==1.5
msrest==0.6.10
msrestazure==0.6.2
netCDF4==1.5.3
numcodecs==0.6.4
numpy==1.16.2
oauthlib==3.1.0
pandas==0.24.2
parso==0.3.4
partd==1.1.0
patsy==0.5.1
pexpect==4.6.0
pickleshare==0.7.5
prompt-toolkit==2.0.9
psycopg2==2.7.6.1
ptyprocess==0.6.0
pyarrow==0.13.0
pycparser==2.19
pycurl==7.43.0
Pygments==2.3.1
pygobject==3.20.0
PyJWT==1.7.1
pyOpenSSL==19.0.0
pyparsing==2.4.2
PySocks==1.6.8
python-apt==1.1.0b1+ubuntu0.16.4.5
python-dateutil==2.8.0
pytz==2018.9
requests==2.21.0
requests-oauthlib==1.3.0
s3transfer==0.2.1
scikit-learn==0.20.3
scipy==1.2.1
seaborn==0.9.0
six==1.12.0
ssh-import-id==5.5
statsmodels==0.9.0
toolz==0.10.0
traitlets==4.3.2
unattended-upgrades==0.1
urllib3==1.24.1
virtualenv==16.4.1
wcwidth==0.1.7
xarray==0.14.1
zarr==2.3.2
I believe the first error is actually a warning, and occurs when Zarr looks for metadata files that do not exist. This has been solved in newer versions of the Azure SDK; I would try upgrading azure-storage-blob to v2.1. It's worth noting that while investigating this I learned that there is a major new release of the Azure SDK that looks like it will break ABSStore entirely. We are going to need to figure out how to deal with this, probably soon; it's not obvious how we are going to handle two versions of the SDK that are essentially incompatible. I will probably start a new issue to work on this eventually.
Thanks @alimanfoo. I use the mfdataset method and I do get some warnings during the import, e.g. `/local_disk0/tmp/1576146109393-0/PythonShell.py:4: FutureWarning: In xarray version 0.15 the default behaviour of` ... Would this suggest that some of the files were not loaded into xarray? I will try experimenting with the combine options to check this.
Maybe not related, but did you see PR #526?
Agree with @tjcrone here: the first warning goes away after updating to the newer version. As for the error above, I have faced various errors while transferring large amounts of netCDF data to zarr, mostly out-of-memory errors (so it's worth monitoring the memory of your device/VM while doing the above), but also the one above. My solution was to transfer the data to zarr "in parts", which is easily possible now with xarray's new "append" feature for zarr.
Hi,
I would not have thought so; at least on the side of writing the zarr data, zarr should be ignorant of what the actual data values are, it will just write them. But it's still unclear to me whether the errors are being generated during the read from netCDF or the write to zarr. The error messages suggest it's the read from netCDF that's triggering the error, but I may have misunderstood. Are you reading the netCDF data from ABS, or is it being read from a local file system? Apologies if I'm barking up the wrong tree.
Thanks for your kind follow-up. We are reading the netCDF from the local file system through xarray and then writing it onto zarr.
Re: getting NaN values, I think I might have found out why this happens, as I ran into this myself. Zarr has a `fill_value` for each array, and xarray uses this same attribute as the missing value, so data equal to the `fill_value` comes back as NaN when read through xarray. @zarr-developers/core-devs is this a correct interpretation? If so, where should this be fixed, in xarray or in zarr? @dokooh I fixed this temporarily by giving a different `fill_value`.
Hi @shikharsg, thanks a lot for following up. Yes, zarr has a `fill_value` for each array. I don't know the details of how the xarray zarr backend uses the `fill_value`. I'm still not sure what the underlying problem is here. @shikharsg do you have a handle on where the problem is? Could you elaborate?
If you are curious, would you be willing to trial adlfs?
So I had a large number of NaN values. Here is a reproduction:

```python
Python 3.7.6 (default, Jan  8 2020, 19:59:22)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import zarr
>>> import xarray as xr
>>> import numpy as np
>>> zarr.__version__, xr.__version__, np.__version__
('2.4.0', '0.15.0', '1.18.1')
>>>
>>> # in memory zarr array
>>> store = zarr.MemoryStore()
>>> grp = zarr.open_group(store)
>>> arr = zarr.open_array(store, path='foo', shape=(2, 10), fill_value=0.0, chunks=(1, 10))
>>> arr[0] = np.zeros((10,))
>>> arr[0] = np.ones((10,))
>>>
>>> arr[:]
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
>>>
>>> # manually build up dimensions
>>> dim1 = zarr.open_array(store, path='dim1', shape=(2,))
>>> dim1[:] = np.array(list(range(1, 3)))
>>> dim1.attrs['_ARRAY_DIMENSIONS'] = ['dim1']
>>> dim2 = zarr.open_array(store, path='dim2', shape=(10,))
>>> dim2[:] = np.array(list(range(1, 11)))
>>> dim2.attrs['_ARRAY_DIMENSIONS'] = ['dim2']
>>> arr.attrs['_ARRAY_DIMENSIONS'] = ['dim1', 'dim2']
>>>
>>> xr.open_zarr(store)['foo'].values
array([[ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]])
>>>
>>> arr[:]
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
```

As you can see, zarr and xarray return different results. This is because xarray uses the `fill_value` as the missing value.
Perhaps this is a more appropriate example, where you can see zarr and xarray are reading the same store yet returning different values:

```python
>>> zarr.__version__, xr.__version__, np.__version__
('2.4.0', '0.15.0', '1.18.1')
>>>
>>> # in memory zarr array
>>> store = zarr.MemoryStore()
>>> grp = zarr.open_group(store)
>>> arr = zarr.open_array(store, path='foo', shape=(2, 10), fill_value=0.0, chunks=(1, 10))
>>> arr[0] = np.ones((10,))
>>>
>>> arr[:]
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
>>>
>>> # manually build up dimensions
>>> dim1 = zarr.open_array(store, path='dim1', shape=(2,))
>>> dim1[:] = np.array(list(range(1, 3)))
>>> dim1.attrs['_ARRAY_DIMENSIONS'] = ['dim1']
>>> dim2 = zarr.open_array(store, path='dim2', shape=(10,))
>>> dim2[:] = np.array(list(range(1, 11)))
>>> dim2.attrs['_ARRAY_DIMENSIONS'] = ['dim2']
>>> arr.attrs['_ARRAY_DIMENSIONS'] = ['dim1', 'dim2']
>>>
>>> print(xr.open_zarr(store)['foo'].values)
[[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
 [nan nan nan nan nan nan nan nan nan nan]]
```
@martindurant Is the adlfs work appropriate for files stored in standard Azure blob storage? From the description it looks like it targets datalake storage?
It implements both datalake and blob. The latter is more recent, but I believe it is complete.
@martindurant is this in the context of the current issue, or just in general?
In hindsight, it probably makes no difference to how the NaN value is inferred by zarr versus xarray; so, in general.
Would love to try it. Will try to check it out over the next couple of days.
@martindurant Does it support SAS tokens? I see the example mentions only
I have no idea what SAS tokens are :|
@rabernat @jhamman would be great to have your comments on this
@martindurant FYI, SAS tokens are a way of allowing access to your blob stores with more fine-grained and potentially time-limited access. For example, you could give someone read-only access for a period of one month via a token.
OK, so some sort of delegation thing...
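To make the delegation idea concrete, here is a toy sketch of how a SAS-style token works in principle: the account holder HMAC-signs the granted resource, permissions, and expiry with the account key, so anyone holding the key can later verify the grant was not altered. This is purely illustrative; every name here is hypothetical, and it is not Azure's actual SAS format (the real scheme is defined by the Azure Storage REST API and generated via the SDK).

```python
import base64
import hashlib
import hmac
import time

ACCOUNT_KEY = b"secret-account-key"  # hypothetical shared account key

def make_token(resource: str, permissions: str, expiry: int) -> str:
    """Sign (resource, permissions, expiry) so the grant cannot be altered."""
    string_to_sign = f"{resource}\n{permissions}\n{expiry}".encode()
    sig = hmac.new(ACCOUNT_KEY, string_to_sign, hashlib.sha256).digest()
    return (f"r={resource}&p={permissions}&e={expiry}&sig="
            + base64.urlsafe_b64encode(sig).decode())

def check_token(token: str, resource: str, permission: str, now: int) -> bool:
    """Verify signature, resource, requested permission, and expiry."""
    fields = dict(part.split("=", 1) for part in token.split("&"))
    expected = make_token(fields["r"], fields["p"], int(fields["e"]))
    return (hmac.compare_digest(token, expected)   # signature intact
            and fields["r"] == resource            # right container/blob
            and permission in fields["p"]          # e.g. "r" for read-only
            and now < int(fields["e"]))            # not yet expired

# Read-and-list token valid for one hour:
tok = make_token("zarrstoreall", "rl", int(time.time()) + 3600)
assert check_token(tok, "zarrstoreall", "r", now=int(time.time()))      # read OK
assert not check_token(tok, "zarrstoreall", "w", now=int(time.time()))  # no write
```

The point is only that the holder of such a token gets exactly the signed permissions on the signed resource until the signed expiry, without ever seeing the account key.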
@tjcrone Did you ever create a new issue for this? I can't seem to find one. Unfortunately, version 12 of the azure-storage-blob SDK does break ABSStore.
cc @TomAugspurger (who may have thoughts here 🙂)
The ABSStore has been deprecated in favor of using adlfs.