Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NetCDF unable to read some HDF5 enums #267

Closed
shoyer opened this issue May 23, 2016 · 28 comments · Fixed by #2655
Closed

NetCDF unable to read some HDF5 enums #267

shoyer opened this issue May 23, 2016 · 28 comments · Fixed by #2655

Comments

@shoyer
Copy link
Contributor

shoyer commented May 23, 2016

h5py stores boolean data using an HDF5 enum: http://docs.h5py.org/en/latest/faq.html

The attached file is a such an example: test.nc.zip

However, the netCDF-C library doesn't report these variables at all:

$ ncdump test.nc
netcdf test {
}

This is somewhat unfortunate.

Here's the output of h5dump on the file, along with a file created via netCDF4:

$ h5dump test.nc  $ made with h5py
HDF5 "test.nc" {
GROUP "/" {
   DATASET "foo" {
      DATATYPE  H5T_ENUM {
         H5T_STD_I8LE;
         "FALSE"            0;
         "TRUE"             1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): TRUE
      }
   }
}
}
$ h5dump test2.nc  # made with netCDF4-python
HDF5 "test2.nc" {
GROUP "/" {
   DATATYPE "boolean" H5T_ENUM {
      H5T_STD_I8LE;
      "FALSE"            0;
      "TRUE"             1;
   };
   DATASET "foo" {
      DATATYPE  H5T_ENUM {
         H5T_STD_I8LE;
         "FALSE"            0;
         "TRUE"             1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): TRUE
      }
   }
}
}

Scripts to produce these files:

import h5py
import numpy as np

with h5py.File('test.nc') as f:
    f.create_dataset('foo', data=np.array(True))
import netCDF4
import numpy as np
with netCDF4.Dataset('test2.nc', 'w') as ds:
    dt = ds.createEnumType('i1', 'boolean', {'FALSE': 0, 'TRUE': 1})
    v = ds.createVariable('foo', dt, ())
    v[:] = True

ncdump reports "netcdf library version 4.4.0"

@shoyer
Copy link
Contributor Author

shoyer commented May 23, 2016

This is also true for reading data written with a complex dtype, which is encoded as an HDF5 compound dtype.

@WardF
Copy link
Member

WardF commented May 23, 2016

I'll take a look at this; netCDF4 does not support all features of libhdf5, but if this is something that's straightforward to implement, we should be able to add support in.

@edhartnett
Copy link
Contributor

I suspect that the problem is that I wrote the enum code for integer types only. Who knew that you could use a binary type in an enum!?!

For example, in nc4type.c we have this function, which assumes that only integer types are involved:

/* Get enum name from enum value. Name size will be <= NC_MAX_NAME. */
int
NC4_inq_enum_ident(int ncid, nc_type xtype, long long value, char *identifier)
{
   NC_GRP_INFO_T *grp;
   NC_TYPE_INFO_T *type;
   NC_ENUM_MEMBER_INFO_T *enum_member;
   long long ll_val;
   int i;
   int retval;

   LOG((3, "nc_inq_enum_ident: xtype %d value %d\n", xtype, value));
   
   /* Find group metadata. */
   if ((retval = nc4_find_nc4_grp(ncid, &grp)))
      return retval;
   
   /* Find this type. */
   if (!(type = nc4_rec_find_nc_type(grp->nc4_info->root_grp, xtype)))
      return NC_EBADTYPE;
   
   /* Complain if they are confused about the type. */
   if (type->nc_type_class != NC_ENUM)
      return NC_EBADTYPE;
   
   /* Move to the desired enum member in the list. */
   enum_member = type->u.e.enum_member;
   for (i = 0; i < type->u.e.num_members; i++)
   {
      switch (type->u.e.base_nc_typeid)
      {
	 case NC_BYTE:
	    ll_val = *(char *)enum_member->value;
	    break;
	 case NC_UBYTE:
	    ll_val = *(unsigned char *)enum_member->value;
	    break;
	 case NC_SHORT:
	    ll_val = *(short *)enum_member->value;
	    break;
	 case NC_USHORT:
	    ll_val = *(unsigned short *)enum_member->value;
	    break;
	 case NC_INT:
	    ll_val = *(int *)enum_member->value;
	    break;
	 case NC_UINT:
	    ll_val = *(unsigned int *)enum_member->value;
	    break;
	 case NC_INT64:
	 case NC_UINT64:
	    ll_val = *(long long *)enum_member->value;
	    break;
	 default:
	    return NC_EINVAL;
      }
      LOG((4, "ll_val=%d", ll_val));
      if (ll_val == value)
      {
	 if (identifier)
	    strcpy(identifier, enum_member->name);
	 break;
      }
      else
	 enum_member = enum_member->l.next;
   }

   /* If we didn't find it, life sucks for us. :-( */
   if (i == type->u.e.num_members)
      return NC_EINVAL;

   return NC_NOERR;
}

I can take a look at this - I will put it on the list. ;-)

But @shoyer I am not familiar with a dtype. What's that all about?

@shoyer
Copy link
Contributor Author

shoyer commented Nov 13, 2017

@edhartnett thanks for you interest!

My guess was that the issue is netCDF4 always expects data types to be defined at the group level, but h5py only defines data types on dataset objects.

I suspect that the problem is that I wrote the enum code for integer types only. Who knew that you could use a binary type in an enum!?!

I'm not quite what you mean by a binary type in an enum. "TRUE"/"FALSE" are just names for enum level, which themselves have H5T_STD_I8LE (aka H5T_NATIVE_CHAR) values.

I am not familiar with a dtype. What's that all about?

"dtype" is Python shorthand for "data type", i.e., the type for each element in an array.

@DennisHeimbigner
Copy link
Collaborator

The assumption that enum basetypes are integers is not only in the netcdf-c library
but also in the netcdf-Java library. I am very reluctant to change that. Sometimes we have
to make sure that users understand that legal netcdf-4 files are a subset of the set of legal
HDF5 files.

@shoyer
Copy link
Contributor Author

shoyer commented Nov 13, 2017

@edhartnett @DennisHeimbigner I'm pretty sure h5py is still using integer enum codes -- where did you get the impression that they are something else? Both my examples in my first post, created with netCDF4-Python and h5py, report the underlying H5T_STD_I8LE data type (i.e., 8-bit integer).

Looking at the HDF5 docs, it seems that they also insist on enums as a mapping between characters strings and integer values.

An HDF enumeration datatype is a 1:1 mapping between a set of symbols and a set of integer values, and an order is imposed on the symbols by their integer values. The symbols are passed between the application and library as character strings and all the values for a particular enumeration type are of the same integer type, which is not necessarily a native type.

@edhartnett
Copy link
Contributor

Ok sorry got that wrong. I will take another look...

@edhartnett
Copy link
Contributor

OK, the problem is that the enum is somehow defined inside the dataset in HDF5. How did you do that?

Here's the h5dump of your file:

HDF5 "ref_hdf5_enum.nc" {
GROUP "/" {
   DATASET "foo" {
      DATATYPE  H5T_ENUM {
         H5T_STD_I8LE;
         "FALSE"            0;
         "TRUE"             1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): TRUE
      }
   }
}
}

Here's the output of h5dump on an enum file created by nc_test4/tst_enum.c:

HDF5 "tst_enums.nc" {
GROUP "/" {
   ATTRIBUTE "_NCProperties" {
      DATATYPE  H5T_STRING {
         STRSIZE 67;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "version=1|netcdflibversion=4.6.2-development|hdf5libversion=1.10.1"
      }
   }
   DATATYPE "Mysterous_Word" H5T_ENUM {
      H5T_STD_I32LE;
      "Mene1"            0;
      "Mene2"            99;
      "Tekel"            81232;
      "Upharsin"         12;
   };
}
}

Note that in the netCDF-4 file, the data type is defined at root level in the group. But in your mystery file, the data type is defined inside the dataset, instead of at the group level.

How did you do that?

@shoyer
Copy link
Contributor Author

shoyer commented Aug 20, 2018

I made this file using h5py using the following Python code:

import h5py

with h5py.File('test.nc') as f:
    f.create_dataset('foo', data=True)

h5py creates HDF5 dtypes using its low-level py_create function:
https://github.com/h5py/h5py/blob/cae0c01d4d3589cde20fe49fc7df750dfbe94a51/h5py/h5t.pyx#L1614

Booleans in particular use the following function:

cdef TypeEnumID _c_bool(dtype dt):
    # Booleans
    global cfg

    cdef TypeEnumID out
    out = TypeEnumID(H5Tenum_create(H5T_NATIVE_INT8))

    out.enum_insert(cfg._f_name, 0)
    out.enum_insert(cfg._t_name, 1)

    return out

https://github.com/h5py/h5py/blob/cae0c01d4d3589cde20fe49fc7df750dfbe94a51/h5py/h5t.pyx#L1452-L1462

cdef class TypeEnumID(TypeCompositeID):

    """
        Represents an enumerated type
    """

    cdef int enum_convert(self, long long *buf, int reverse) except -1:
        # Convert the long long value in "buf" to the native representation
        # of this (enumerated) type.  Conversion performed in-place.
        # Reverse: false => llong->type; true => type->llong

        cdef hid_t basetype
        cdef H5T_class_t class_code

        class_code = H5Tget_class(self.id)
        if class_code != H5T_ENUM:
            raise ValueError("This type (class %d) is not of class ENUM" % class_code)

        basetype = H5Tget_super(self.id)
        assert basetype > 0

        try:
            if not reverse:
                H5Tconvert(H5T_NATIVE_LLONG, basetype, 1, buf, NULL, H5P_DEFAULT)
            else:
                H5Tconvert(basetype, H5T_NATIVE_LLONG, 1, buf, NULL, H5P_DEFAULT)
        finally:
            H5Tclose(basetype)


    @with_phil
    def enum_insert(self, char* name, long long value):
        """(STRING name, INT/LONG value)
        Define a new member of an enumerated type.  The value will be
        automatically converted to the base type defined for this enum.  If
        the conversion results in overflow, the value will be silently
        clipped.
        """
        cdef long long buf

        buf = value
        self.enum_convert(&buf, 0)
        H5Tenum_insert(self.id, name, &buf)

https://github.com/h5py/h5py/blob/cae0c01d4d3589cde20fe49fc7df750dfbe94a51/h5py/h5t.pyx#L1215

@edhartnett
Copy link
Contributor

OK, much as I love python, I cannot follow that to figure out how you created the file (with the HDF5 C API).

Do you understand what the above code is doing?

Do you understand where it is creating the type with an H5Tcreate() command? At that time, is it providing the dataset ID as the locid?

I will try and see if C code like that will reproduce the situation...

@shoyer
Copy link
Contributor Author

shoyer commented Aug 20, 2018

It looks like it's using H5Tenum_create (rather than H5Tcreate) and H5Tenum_insert to make the enum type:
https://support.hdfgroup.org/HDF5/doc/H5.user/DatatypesEnum.html

How does netCDF associate types with HDF5 groups? There doesn't seem to be any reference to groups in either H5Tenum_create or H5Tenum_insert.

@DennisHeimbigner
Copy link
Collaborator

I think Ed has it right. Creating the enum type inside the dataset is not
recognized by netcdf. It is possible to create HDF5 files that are not legal
netcdf files. This appears to be one case.

@shoyer
Copy link
Contributor Author

shoyer commented Aug 20, 2018

OK, the docs on HDF5 datatypes seem to be helpful here: https://support.hdfgroup.org/HDF5/doc/H5.user/Datatypes.html
https://support.hdfgroup.org/HDF5/doc1.6/UG/11_Datatypes.html (see especially "8. Life Cycle of the Datatype Object")

In particular, netCDF4 only seems to support "named datatypes", but h5py here is writing a file with a "transient datatype":

  • From an API perspective, the difference is simply where or not H5Tcommit was called on the data type to store it in a group.
  • From a storage perspective, the difference is whether the datatype is stored on the HDF5 dataset (transient datatypes) or on an HDF5 group with only a pointer in the dataset (named datatypes).

So I guess the request here is to "support transient datatypes" in netCDF. I don't see any particular reason for why not to do this, but the implementation might be involved since we would need to lookup data types in a new place.

@edhartnett
Copy link
Contributor

I am right in the middle of a major refactor of the file opening code, which is where this happens, so this is timely.

If I can support transient datatypes without too many contortions, I am happy to do so. I would like netCDF-4 to have wide capabilities to read existing HDF5 files.

@kmuehlbauer
Copy link

@edhartnett I'm assuming you've already finished the code refactor, considering your latest comment from almost three years ago. I just wanted to check the status on this issue regarding "transient datatypes" in netCDF-4. Thanks!

@edwardhartnett
Copy link
Contributor

Did not happen, sorry!

@kmuehlbauer
Copy link

No worries! Do you think efforts to tackle this will be resumed in near future? I'd be happy to help with testing.

@edwardhartnett
Copy link
Contributor

No, I don't. Unless someone else attempts it.

Unfortunately (or fortunately) I am working on other tasks, much more urgent to NOAA needs. ;-)

@kmuehlbauer
Copy link

Thanks for letting me know and good to hear that my fellow researchers at NOAA receive your attention ;-)

If I would try this myself (which would mean to dive into C again after 25 years) which code do you recommend reading first. Any pointers are welcome.

@edwardhartnett
Copy link
Contributor

I don't recommend that this be anyone's first dive into C in 25 years. ;-)

What is the big picture here? Is there some application or user workflow that is broken because of this?

Or are you trying to fill in missing functionality but there is no user waiting for it?

@kmuehlbauer
Copy link

kmuehlbauer commented Mar 3, 2021

I thought that this might be adventurous.

This functionality would be upstreamed downstreamed to h5netcdf/h5netcdf#70 and finally to pydata/xarray#4068.

@edwardhartnett
Copy link
Contributor

OK, let me know if you have questions. It would be well to understand the discussion in #541 before looking at the code. The monsters of #597 may also rear their ugly snouts.

image

@kmuehlbauer
Copy link

Thanks, if I'm not back in a year or so, consider me lost in space.

@DennisHeimbigner
Copy link
Collaborator

I too am curious about the use-case. And also the semantics of this.
In terms of netcdf-4, types are named, so what type name do we assign
to the transient type?

@edwardhartnett
Copy link
Contributor

I think the idea is to have enums accept other base types (i.e. not just integers). Apparently HDF5 does this, and I just missed it.

So no transient type would be needed.

@DennisHeimbigner
Copy link
Collaborator

What other base types?

@DennisHeimbigner
Copy link
Collaborator

FWIW: If I were doing a boolean type in netcdf-4, I would probably do it this way

netcdf bool {
types:
    ubyte enum boolean {F = 0, T = 1};
dimensions:
    d=5;
variables:
    boolean bitmap(d);
	boolean bitmap:bits = T,F,F,F,F,T; 
data:
   bitmap = F,T,F,F,T;
}

@edwardhartnett
Copy link
Contributor

Wow, that's perseverance! Thanks for sticking with this @shoyer and improving netCDF for everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants