In Memory netcdf subdatasets do not persist order when buffer is closed #1388

Open
lagamura opened this issue Nov 17, 2024 · 8 comments

@lagamura

To report a non-security related issue, please provide:

  • the version of the software with which you are encountering an issue
    netcdf4 1.7.1 nompi_py311hae66bec_102 conda-forge

  • environmental information (i.e. Operating System, compiler info, java version, python version, etc.)
    OS: Almalinux-9.3, python: 3.11

  • a description of the issue with the steps needed to reproduce it:
    When writing subdatasets to an in-memory netCDF dataset, their index order changes once the buffer is written out as a netCDF file. A minimal example follows:

import numpy as np
from netCDF4 import Dataset
from osgeo import gdal

list_of_subds = ["first_subdataset", "c_subdataset", "b_subdataset"]

ds = Dataset(
    "dump_ds.nc", mode="w", memory=1028, format="NETCDF4"
) 

ds.createDimension("lon", 100)
ds.createDimension("lat", 100)
ds.createDimension("time", None)

for subds in list_of_subds:

    data = ds.createVariable(
        subds,
        "f8",
        ("time", "lat", "lon"),
        zlib=True,
        fill_value=-1,
    )
    data[0, :, :] = np.arange(100)

print(ds)
nc_buf = ds.close()
with open("dump_ds.nc", "wb") as f:
    f.write(nc_buf)

print(gdal.Info("dump_ds.nc"))

In the output of print(ds), the subdatasets are still in creation order:

root group (NETCDF4 data model, file format HDF5):
dimensions(sizes): lon(100), lat(100), time(1)
variables(dimensions): float64 first_subdataset(time, lat, lon), float64 c_subdataset(time, lat, lon), float64 b_subdataset(time, lat, lon)
groups:

Output of gdal.Info after dumping the nc file:

Subdatasets:
SUBDATASET_1_NAME=NETCDF:"dump_ds.nc":b_subdataset
SUBDATASET_1_DESC=[1x100x100] b_subdataset (64-bit floating-point)
SUBDATASET_2_NAME=NETCDF:"dump_ds.nc":c_subdataset
SUBDATASET_2_DESC=[1x100x100] c_subdataset (64-bit floating-point)
SUBDATASET_3_NAME=NETCDF:"dump_ds.nc":first_subdataset
SUBDATASET_3_DESC=[1x100x100] first_subdataset (64-bit floating-point)
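
The reordering can be checked mechanically from the listing above. The following is a minimal sketch, not part of the original report: it pulls the variable names out of the `SUBDATASET_n_NAME` lines and compares them with the creation order. `reported_names` is a hypothetical helper (it is not a GDAL or netCDF4 function), and the input text is the output pasted above.

```python
import re

# Creation order from the reproduction script above.
list_of_subds = ["first_subdataset", "c_subdataset", "b_subdataset"]

# The SUBDATASET_n_NAME lines reported by gdal.Info, copied from above.
gdal_info_text = '''
SUBDATASET_1_NAME=NETCDF:"dump_ds.nc":b_subdataset
SUBDATASET_2_NAME=NETCDF:"dump_ds.nc":c_subdataset
SUBDATASET_3_NAME=NETCDF:"dump_ds.nc":first_subdataset
'''


def reported_names(info_text):
    """Return variable names ordered by their SUBDATASET index."""
    pattern = re.compile(r'SUBDATASET_(\d+)_NAME=NETCDF:"[^"]*":(\S+)')
    found = [(int(i), name) for i, name in pattern.findall(info_text)]
    return [name for _, name in sorted(found)]


names = reported_names(gdal_info_text)
print(names == list_of_subds)          # False: creation order was not kept
print(names == sorted(list_of_subds))  # True: the listing is alphabetical
```

For this listing the reported order matches a plain alphabetical sort of the variable names, which is consistent with the guess made later in the thread.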

@jswhit
Collaborator

jswhit commented Nov 17, 2024

I don't know how gdal chooses the order of the variables - maybe alphabetical? I don't believe this is a bug in netcdf4-python.

@jswhit
Collaborator

jswhit commented Nov 17, 2024

Looks like the order of the variables does change when the memory buffer is written out and re-read (ncdump shows the same thing as gdal). I don't know if the order should be preserved - perhaps @DennisHeimbigner would know.

@lagamura
Author

> Looks like the order of the variables does change when the memory buffer is written out and re-read (ncdump shows the same thing as gdal). I don't know if the order should be preserved - perhaps @DennisHeimbigner would know.

Thanks for the quick look.
In my opinion the order should be preserved for consistency, as it is when the Dataset class is used in the usual way to store a netCDF file on disk. Currently, when the in-memory mode is used, the subdatasets are written in alphabetical order, as you pointed out.

@jswhit
Collaborator

jswhit commented Nov 18, 2024

Just curious - why does the order matter for your use case?

@lagamura
Author

To stay compliant with the previous version of the product we are working on. A more specific case: someone opens two netCDF files of the same product and tries to compare the subdatasets by index.
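
Where consumers are able to match subdatasets by variable name instead of by position, the comparison no longer depends on the order in the file. A minimal sketch under that assumption follows; `index_by_name` and `pair_indices` are hypothetical helpers, and the two lists stand in for the subdataset names read from two files of the same product.

```python
def index_by_name(names):
    """Map each variable name to its 1-based subdataset index."""
    return {name: i for i, name in enumerate(names, start=1)}


def pair_indices(names_a, names_b):
    """For every name present in both products, return (index_a, index_b)."""
    idx_a, idx_b = index_by_name(names_a), index_by_name(names_b)
    return {name: (idx_a[name], idx_b[name]) for name in names_a if name in idx_b}


# Illustrative inputs: creation order versus the order reported after the dump.
old_product = ["first_subdataset", "c_subdataset", "b_subdataset"]
new_product = ["b_subdataset", "c_subdataset", "first_subdataset"]

pairs = pair_indices(old_product, new_product)
print(pairs["first_subdataset"])  # (1, 3)
```

This does not restore the original order; it only avoids relying on it.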

@jswhit
Collaborator

jswhit commented Nov 18, 2024

netcdf-c keeps track of creation order, and preserves that order when a dataset is written to disk. Since you are bypassing the C library when writing the memory buffer to disk directly, my guess is that the logic that preserves creation order is also bypassed. Unfortunately, I don't see any way to tell the C library to write the memory buffer to disk preserving the creation order.
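
One possible workaround, taking at face value the statement above that the on-disk write path preserves creation order, is to let the library write a temporary file and then read its bytes back. This is only a sketch: `bytes_via_tempfile` and `write_example` are hypothetical names, and whether the resulting bytes keep the creation order has not been verified here.

```python
import os
import tempfile


def bytes_via_tempfile(write_to_path, suffix=".nc"):
    """Call write_to_path(path) on a temporary file and return the file's bytes."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # the writer reopens the path itself
    try:
        write_to_path(path)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)  # always clean up the temporary file


def write_example(path):
    # Assumption: netCDF4 is installed. Imported here so the helper above
    # only depends on the standard library.
    from netCDF4 import Dataset

    with Dataset(path, mode="w", format="NETCDF4") as ds:
        ds.createDimension("x", 4)
        for name in ["first_subdataset", "c_subdataset", "b_subdataset"]:
            ds.createVariable(name, "f8", ("x",))


# nc_bytes = bytes_via_tempfile(write_example)  # bytes of the file written on disk
```

The trade-off is an extra round trip through the local filesystem, which gives up part of the benefit of the in-memory mode.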

@lagamura
Author

Just to clarify the use case: we want to combine the in-memory feature with writing the resulting buffer directly to S3.
It should be possible to write directly to S3 storage using the netcdf driver with gdal, so I will make a minimal example and check the subdataset order.

@lagamura
Author

Apparently, the netcdf gdal driver does not support writing a file directly to S3 (/vsis3).
