Caravan data hosted on OpenDAP server #30

Open
BSchilperoort opened this issue Apr 9, 2024 · 3 comments

Comments

@BSchilperoort

Thanks for working on this and putting the data online! For our ewatercycle project we wanted easier access to the individual basins contained in the Caravan dataset. A data hosting service we have access to has an OpenDAP server, so we put the data there.

I reorganized the data: I added the attributes (units, basin properties) to the netCDF files, merged the files per collection (e.g. one file for Camels), and compressed the netCDF files.

The data is available on:
https://doi.org/10.4121/ca13056c-c347-4a27-b320-930c2a4dd207

And can be accessed like this in xarray:

import xarray as xr

# open camels US:
ds = xr.open_dataset("https://opendap.4tu.nl/thredds/dodsC/data2/djht/ca13056c-c347-4a27-b320-930c2a4dd207/1/camels.nc")

# select the basin of interest (basin IDs are byte strings):
basin = ds.sel(basin_id=b"camels_01022500")

# plot the air temperature:
basin["temperature_2m_mean"].plot()
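
The byte-string keys in the snippet above suggest that `basin_id` is stored as bytes. If plain `str` keys are more convenient, the coordinate can be decoded once after opening. The following is a minimal sketch on a small synthetic dataset rather than the hosted file; the second basin ID and all values are made up for illustration, and the encoding of the real coordinate is assumed to be UTF-8.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the hosted file (assumption: basin IDs are bytes).
time = pd.date_range("2000-01-01", periods=3, freq="D")
basins = np.array([b"camels_01022500", b"camels_01031500"])
values = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
ds = xr.Dataset(
    {"temperature_2m_mean": (("basin_id", "time"), values)},
    coords={"basin_id": basins, "time": time},
)

# Decode the byte strings once so that plain str keys work in .sel().
decoded = [b.decode("utf-8") for b in ds["basin_id"].values]
ds_str = ds.assign_coords(basin_id=decoded)

# Select a basin with an ordinary string key.
basin = ds_str.sel(basin_id="camels_01022500")
```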
@kratzert
Owner

Hi Bart,

thanks for the post. I do agree that the current file structure and the way we share the data are not ideal. It will become even worse in the next days, since I am about to add more than 10k additional basins to Caravan. The first thing I will do for the update is to provide two separate downloads, so that not everybody needs to download the csv and nc versions together but can choose to download only one of the two.

One netCDF file per subdataset is also neat. I could also imagine having a single zarr/netcdf file at some point with all data combined, but that would require recreating this file every time there is an extension. A zarr file hosted online could also be an interesting idea. That could allow users to query only the basins, bands, and time periods they are interested in.
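
As a rough illustration of that kind of query, the sketch below selects two basins, one variable, and one month from a combined dataset. It uses a small in-memory synthetic dataset, not a hosted store; the basin IDs, variable names, and values are illustrative assumptions. With a lazily opened remote store, the same selection would be expected to limit which data needs to be read.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic combined dataset (all names and values are illustrative).
time = pd.date_range("2000-01-01", periods=366, freq="D")
basins = ["camels_01022500", "camels_01031500", "camels_01047000"]
shape = (len(basins), len(time))
rng = np.random.default_rng(0)
combined = xr.Dataset(
    {
        "temperature_2m_mean": (("basin_id", "time"), rng.normal(10, 5, shape)),
        "total_precipitation_sum": (("basin_id", "time"), rng.gamma(2.0, 1.5, shape)),
    },
    coords={"basin_id": basins, "time": time},
)

# Query only the variable, basins, and time period of interest.
subset = combined[["temperature_2m_mean"]].sel(
    basin_id=["camels_01022500", "camels_01047000"],
    time=slice("2000-03-01", "2000-03-31"),
)
```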

@BSchilperoort
Author

Hi Frederik,

Having separate downloads for netCDF and csv would already be much better. However, separate files for each basin still isn't ideal as it adds a lot of overhead when copying them or opening them as a multi-file dataset.

> I could also imagine having a single zarr/netcdf file at some point with all data combined, but that would require recreating this file every time there is an extension.

Not necessarily; netCDF datasets can be split up over multiple files. I believe zarr also has some support for appending along a dimension. Of course this also depends on how/where the data is hosted.
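
A minimal in-memory sketch of the idea: an extension is kept as its own dataset and joined to the existing one along the basin dimension, so the existing part is not rebuilt. The helper `make_part`, the basin ID `newset_0001`, and the zero-filled `streamflow` values are hypothetical. On disk, the rough counterparts would be `xr.open_mfdataset` for several netCDF files or `Dataset.to_zarr(..., append_dim="basin_id")` for a zarr store; whether those fit depends on how the data is hosted.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_part(basin_ids, time):
    """Build a small synthetic dataset for the given basins (illustrative only)."""
    data = np.zeros((len(basin_ids), len(time)))
    return xr.Dataset(
        {"streamflow": (("basin_id", "time"), data)},
        coords={"basin_id": basin_ids, "time": time},
    )


time = pd.date_range("2000-01-01", periods=5, freq="D")
existing = make_part(["camels_01022500", "camels_01031500"], time)
extension = make_part(["newset_0001"], time)  # hypothetical new basin

# Join along the basin dimension; the existing part is left untouched.
combined = xr.concat([existing, extension], dim="basin_id")
```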

I will also be attending EGU next week so we can discuss this there.

@BSchilperoort
Author

I saw the ARCO-ERA5 dataset from Google some time back (https://github.com/google-research/arco-era5). It would be nice if Caravan were accessible in the same way (as a Zarr store on Google Cloud Public Datasets).
