NetCDF needs better and faster compression, especially for HPC applications #1545

Closed
edwardhartnett opened this issue Nov 21, 2019 · 37 comments


@edwardhartnett
Contributor

This issue is for general discussion about improving the compression capability of netCDF. I will address specific ideas in their own issues.

Compression is a key feature of netCDF-4. Whenever I ask why netCDF-4 was selected, the answer is always that it's netCDF with built-in compression. That is, the API is already understood and all the read programs are working, and it's easy to turn on compression and suddenly save half your disk space. (The NOAA GFS is compressing a 33 GB file down to 6 GB. Without compression, storing and moving the file is much more expensive.)
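
As a rough check on the GFS figures quoted above, the arithmetic can be sketched as follows. The function names are hypothetical; only the 33 GB and 6 GB values come from this thread.

```c
#include <assert.h>

/* Ratio of original size to compressed size (5.5 means 5.5:1). */
double compression_ratio(double original_gb, double compressed_gb)
{
    assert(original_gb > 0.0 && compressed_gb > 0.0);
    return original_gb / compressed_gb;
}

/* Fraction of disk space saved, for compressed_gb <= original_gb. */
double space_saved_fraction(double original_gb, double compressed_gb)
{
    assert(original_gb > 0.0 && compressed_gb > 0.0);
    return 1.0 - compressed_gb / original_gb;
}
```

With the quoted sizes, `compression_ratio(33.0, 6.0)` is 5.5 and `space_saved_fraction(33.0, 6.0)` is roughly 0.82, so the GFS case saves well over the "half your disk space" mentioned as typical.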

However, currently we only offer zlib compression, which is not the best or most commonly desired kind. The work of @DennisHeimbigner with filters offers the chance to expose other HDF5 compression filters.

And what about the classic formats? Is there not a way to turn on compression? As a simple case, the entire file can be piped through gzip on write and automatically piped through gunzip on read. It's pretty easy to do that, but there are much better ways.

When and if we add more compression options, we must of course continue to honor the existing API and code that has been written to it. Currently we do:

nc_def_var(...)
nc_def_var_deflate(...)

So we can imagine other forms of compression getting their own nc_def_var_XXXX function, which would be called after the var is defined and before enddef.

@gsjaardema
Contributor

Re: other forms of compression "getting their own nc_def_var_XXXX function". Would it not be OK from an API standpoint to instead add an nc_def_var_compress(compression_type, compression_value) or something similar, so that a new compression type could be added without adding a new API function? Maybe even an nc_def_var_property(). There has to be some way that this can be supported without adding one or more new API functions per compression type. There would still be the need to add some defines, but please try to limit the number of API functions.

@gsjaardema
Contributor

Re: compression for classic formats. I remember some work from about a decade ago where there were some patches or something to the NetCDF library which provided compression. I think it wasn't just compressing the resulting file, but instead messed with NetCDF internals to compress on a dataset by dataset basis. Will try to see if I can find anything referencing that work...

@WardF
Member

WardF commented Nov 22, 2019

@gsjaardema The API @DennisHeimbigner wrote to support filters (nc_def_var_filter, etc) is generalizable to different filters. I'm on my phone so can't check the reference at the moment. I don't think we need to define a function for each compression individually.

@edwardhartnett
Contributor Author

I don't want to get into the idea of just passing a void pointer and it means different things. That's a bit much. So I believe we will have separate functions for separate compression methods. I'm willing to reuse as much as possible, but different methods have different numbers and types of parameters.

@WardF
Member

WardF commented Nov 22, 2019

This may need to be a broader discussion before we go too far with the implementation; it’s important that we remain consistent in our approach. Are you suggesting a handful of specialized API calls for a closed set of new compression algorithms? Or are you suggesting an open-ended “every new compression method requires new functions”? The latter seems like it could expand quickly, and is not in line with the current approach.

What do you mean by passing a void pointer means different things?

@sluehrs

sluehrs commented Nov 22, 2019

From my side I think the available nc_def_var_filter is already pretty powerful for the netcdf4 format. For example, I did some tests using ZFP (https://computing.llnl.gov/projects/floating-point-compression) within a parallel program (in a brute force way, by removing the parallel checks from the netcdf4 code, which normally block parallel filter calls). This worked fine, e.g.:

H5Pset_zfp_reversible_cdata(cd_nelmts, cd_values);
nc_def_var_filter(nc_file_id,nc_variable,H5Z_FILTER_ZFP,cd_nelmts,cd_values);

Of course you have to include some H5 plugin code here, to configure the individual plugin.
Adding additional compression mechanics directly into netcdf also means a lot of additional dependencies. This HDF5 plugin mechanism already solved this problem. So I think at least for the netcdf4 format the proper parallel access to the filter function can already offer a lot of new netcdf HPC approaches, without a lot of additional overhead.

@edwardhartnett
Contributor Author

I believe having a different nc_def_var_XXXX() for each compression method is perfectly acceptable and also in accordance with what we already have.

The post of @sluehrs actually proves my point. It does not sound easy! If I am to describe how to turn on compression to NOAA programmers, I need to just tell them a function, like nc_def_var_deflate().

I am perfectly willing to have a function cover multiple compression methods, but since each method has its own type and number of parameters, how will that work?

I welcome concrete proposals...

@gsjaardema
Contributor

I do not have a concrete proposal at this time. But I know of several compression algorithms, both lossy and non-lossy, and the thought of adding support for all of these to NetCDF through a new API function each seems untenable. Some of these may be niche solutions -- e.g. they work only for an unstructured mesh representation -- and it may not be worthwhile to have them fully supported in every installation of NetCDF.

I haven't looked closely at the filter implementation, but maybe something similar to that is needed. I know that HDF5 has a somewhat extensible filter API which supports compression; maybe there is something there that can be adopted.

While you are looking at the compression algorithms, please look at BLOSC (https://github.com/Blosc/hdf5-blosc and https://blosc.org/pages/blosc-in-depth/) as it looks like it gives good performance.

I guess I don't have a good API suggestion, but I would like it to be easy to add support for (or at least test) a new experimental lossy or non-lossy compression algorithm without having to modify the API. Lossy, problem-specific compression algorithms are an active area of research, so many are being proposed, and it would be nice to be able to evaluate them on my application area as easily as possible.

I'm also not sure how the API function per compression algorithm approach works for optional support. NetCDF is currently able to support multiple optional formats (CDF1, CDF2, ..., CDF5) in parallel and serial without much effect on the API; hopefully the same can be done for compression and filter support.

I also see many older algorithms going away or not being used very often anymore and don't want to see the NetCDF API in 5 or 10 years cluttered with a bunch of functions for compression algorithms that are no longer supported.

@edwardhartnett
Contributor Author

@gsjaardema you make some excellent points, and all of these must be kept in mind as we move forward.

To address one issue, which is how do compression methods interact with formats: right now, none of the classic formats can use zlib. When one (or more) can, then yes, I would expect we would use the same nc_def_var_deflate() function; it would then work for both netcdf-4 and whatever classic formats can handle it. In other words, the APIs are independent of format, though not supported in all formats.

I am thinking of fewer than 5 new compression methods, so I don't believe a general purpose solution is needed. Also recall that whatever we do must be usable from Fortran, which restricts how clever we can be with parameters. What works best, and matches the rest of the API, is a specific type-safe function wherever we can provide one, even if they are redundant (for example, we could use nc_put_vara() for all array puts, but we also have a function for each type, so the user gets type-checking). And this way, everything looks very obvious in Fortran codes.
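
The trade-off between one generic function and a family of type-safe ones can be illustrated with a small stand-alone sketch. Everything here is hypothetical (`put_generic`, `put_int`, `put_float`, `elem_type`); it only mimics the relationship between a void-pointer put and its typed variants and does not call the netCDF library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical element types, standing in for netCDF's data types. */
typedef enum { ELEM_INT, ELEM_FLOAT } elem_type;

/* Generic put: the caller must state the element type correctly,
 * because the compiler cannot check what a void pointer points to. */
size_t put_generic(void *dest, const void *src, size_t count, elem_type type)
{
    size_t elem_size = (type == ELEM_INT) ? sizeof(int) : sizeof(float);
    assert(dest != NULL && src != NULL);
    memcpy(dest, src, count * elem_size);
    return count * elem_size; /* bytes copied */
}

/* Type-safe wrappers: passing a float array to put_int() is now
 * diagnosed by the compiler as an incompatible pointer type. */
size_t put_int(int *dest, const int *src, size_t count)
{
    return put_generic(dest, src, count, ELEM_INT);
}

size_t put_float(float *dest, const float *src, size_t count)
{
    return put_generic(dest, src, count, ELEM_FLOAT);
}
```

The wrappers add no capability beyond the generic call; their only purpose is to move a class of mistakes from run time to compile time, which is the argument being made for per-method compression functions.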

Let us discuss the algorithms on a case by case basis. I agree we only want ones that will stand the test of time. I have put a bunch out for discussion, but that doesn't mean all of them will be added. I just put out all the ones I can think of.

Thanks for pointing out BLOSC, I will add it to the list.

We should also learn in detail what is being done by the netcdf4 python library. I believe it handles unpacking, for example. @dopplershift do you know about that?

@dopplershift
Member

I know some. What do you mean by "handles unpacking"?

@dopplershift
Member

Also please bear in mind that any core netCDF functionality that gets added to the C library has to be added to netCDF-Java, at least as far as read support is concerned.

@edhartnett
Contributor

@dopplershift by "handles unpacking" I mean it will automatically apply the scale/offset parameters and change a packed short into an unpacked float. Am I correct that the Python netCDF-4 library does this?
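
For readers unfamiliar with packing, a minimal sketch of the unpacking step follows. It assumes the conventional relation unpacked = packed * scale_factor + add_offset; the function name is hypothetical, and fill values and missing data are not handled.

```c
#include <assert.h>
#include <stddef.h>

/* Unpack `count` packed shorts into floats using
 * unpacked = packed * scale_factor + add_offset. */
void unpack_short_to_float(const short *packed, float *unpacked, size_t count,
                           double scale_factor, double add_offset)
{
    assert(packed != NULL && unpacked != NULL);
    for (size_t i = 0; i < count; i++) {
        unpacked[i] = (float)(packed[i] * scale_factor + add_offset);
    }
}
```

With a scale of 0.5 and an offset of 10.0, for example, a packed value of 100 unpacks to 60.0.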

@dopplershift
Member

Yes, it can (optionally) handle scale/offset unpacking.

@edwardhartnett
Contributor Author

Seems like automatic handling of unpacking is something the netcdf-c library could do as well.

Perhaps a new mode flag for nc_open/nc_create, NC_UNPACK, which would indicate that this should be done.

@dopplershift
Member

netCDF should tread carefully in considering this kind of feature. Right now, as far as I'm aware, the netCDF library does not do any semantic interpretation of metadata. Its only functionality is managing the mapping of the data model to the storage layer.

Interpreting user metadata is a whole new ballgame, and arguably increases the scope of the library. I'm not expressing support or disdain for it--just that I think there are implications beyond the simple implementation of the feature.

@epourmal

epourmal commented Dec 2, 2019

I hope this information will help you to decide how compression can be extended in netCDF-4.

My personal view is that it is nice to have APIs for the most commonly used compression methods and to provide a generic API, similar to H5Pset_filter, that uses a filter identifier to set compression for user-defined methods.

On the other hand, please notice that all filters registered with The HDF Group are required to provide a filter name in the H5Z_class2_t structure. The name (or part of it) can be used as a string parameter to identify a filter. Of course, passing the filter's data would require some care, especially from Fortran.

The HDF Group tests the following filters with the maintenance releases. We have been posting binaries for decoding only with our HDF5 binaries, but starting next year we will add binaries for both encoding and decoding. Currently we provide decoders for BZ2, BLOSC, LZ4, LZF, ZFP, and MAFISC.

Please let me know if you have any questions.
Thank you!
Elena

@DennisHeimbigner
Collaborator

I guess I need someone to spell out to me why the current netcdf-c filter API is inadequate for this.

@edwardhartnett
Contributor Author

Perhaps what is needed is for someone to explain how the filter API can achieve this? ;-)

I will dig into this further after AGU, but the first question I have is, what would it take to turn on one of those other compression filters using the filter API? Because I think the NOAA GFS team would want to use that immediately.

@edwardhartnett
Contributor Author

Seems like this work is going to happen, at least initially, in the community codec repo: https://github.com/ccr/ccr.

The point of the CCR is that users can install the ccr tarball and get all the filters and support code installed to read and write with the additional compression filters.

I have added bzip2 and lz4 to the CCR (BLOSC will be next, once the HDF5 team answers some support questions). The CCR library uses the filter API to turn on the compression, and provides user-friendly nc_def_var/nc_inq_var wrapper functions for each compression method. For example:

    int nc_def_var_bzip2(int ncid, int varid, int level);
    int nc_inq_var_bzip2(int ncid, int varid, int *bzip2p, int *levelp);
    int nc_def_var_lz4(int ncid, int varid, int level);
    int nc_inq_var_lz4(int ncid, int varid, int *lz4p, int *levelp);
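
To illustrate how per-method wrappers can sit on top of one generic filter call, here is a stand-alone sketch. It does not call netCDF or HDF5: `filter_request`, `def_var_filter_stub` and the `*_sketch` wrappers are hypothetical, and the stub merely records what a real wrapper would presumably hand to nc_def_var_filter(). The filter IDs (1 for deflate, 307 for bzip2) are quoted from memory of the HDF5 filter registry and should be verified there.

```c
#include <assert.h>
#include <stddef.h>

#define FILTER_ID_DEFLATE 1u    /* assumed HDF5 ID for deflate */
#define FILTER_ID_BZIP2   307u  /* assumed registered ID for bzip2 */

/* Hypothetical record of one generic request: a filter ID plus an
 * array of unsigned parameters. */
typedef struct {
    unsigned int id;
    size_t nparams;
    unsigned int params[4];
} filter_request;

/* Stand-in for the generic call; it stores the values instead of
 * passing them to a library. Returns 0 on success, -1 on error. */
int def_var_filter_stub(filter_request *req, unsigned int id,
                        size_t nparams, const unsigned int *params)
{
    if (req == NULL || nparams > 4) return -1;
    assert(params != NULL || nparams == 0);
    req->id = id;
    req->nparams = nparams;
    for (size_t i = 0; i < nparams; i++) req->params[i] = params[i];
    return 0;
}

/* Type-safe wrapper: one int argument, range-checked, then packed. */
int def_var_bzip2_sketch(filter_request *req, int level)
{
    unsigned int params[1];
    if (level < 1 || level > 9) return -1; /* bzip2 block size is 1..9 */
    params[0] = (unsigned int)level;
    return def_var_filter_stub(req, FILTER_ID_BZIP2, 1, params);
}

int def_var_deflate_sketch(filter_request *req, int level)
{
    unsigned int params[1];
    if (level < 0 || level > 9) return -1; /* zlib levels are 0..9 */
    params[0] = (unsigned int)level;
    return def_var_filter_stub(req, FILTER_ID_DEFLATE, 1, params);
}
```

In this arrangement the generic call stays the single extension point, while each wrapper contributes only argument validation and a readable name, which may be a way to reconcile the two positions taken earlier in the thread.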

The CCR will allow us to evaluate how additional compression functions will work, and will allow big producers of data (like NOAA) to meet immediate and looming operational requirements, currently not being met due to zlib performance issues.

The issues of netcdf-java with filters remain, and must be resolved by the netcdf-java team. To summarize the outcomes suggested in this thread:
1 - Abandon the java-only reader of HDF5 files, and use the HDF5 C library to both write and read HDF5 files.
2 - Some energetic Java programmer(s) writes or finds the code to handle the new decompression methods, and netcdf-java is modified.
3 - netcdf-java accepts that there are going to be some netCDF files it cannot read. :-(

Getting and installing the filters remains a separate step for the users, which is a pain (though the ccr project will at least allow them to get a standard set of filters, all at once). It seems like HDF5 could and should ship some of these filter codes with the HDF5 tarball, which would vastly reduce the problem. I will pursue this with the HDF5 team.

I will close this issue, but welcome any further comments or discussion.

@dopplershift
Member

@edwardhartnett So just to correct something here:

The issues of netcdf-java with filters remain, and must be resolved by the netcdf-java team. To summarize the outcomes suggested in this thread:
1 - Abandon the java-only reader of HDF5 files, and use the HDF5 C library to both write and read HDF5 files.
2 - Some energetic Java programmer(s) writes or finds the code to handle the new decompression methods, and netcdf-java is modified.
3 - netcdf-java accepts that there are going to be some netCDF files it cannot read. :-(

These issues remain for the entire netCDF team and community of developers, across all languages, to figure out how to address so as not to create confusion for our community of netCDF data users. There are clients in many languages on a diverse set of platforms that need to be considered. IMO, anyone proposing additions should be considering the impacts of additions on the broader ecosystem. It is incredibly poor form to add incompatible data format variations and then leave it for other clients to pick up the pieces. I understand and support the goals of improved compression to support user needs. I'm just not sure why you continue to consider it someone else's job to solve the challenges introduced by the innovations you want to make.

@epourmal

epourmal commented Jan 2, 2020

Happy New Year to everyone!

Ed,
I am not sure how your users will benefit from nc_inq_var_ functions. How will they find that they need to query a specific compression? Maybe you should add a generalized function (similar to NC_inq_var_all) to deal with user-defined compression instead of just gzip (or maybe I am missing something...)

All,

Please remember that an HDF5-based (or netCDF-4-based) application in any language doesn't need to do anything special to read HDF5 compressed data. The HDF5 library should be built with gzip and szip, and the third-party filters should be available as shared libraries installed in a specified location. This is really a matter of building and packaging HDF5 and netCDF-4.

If a filter is not found, HDF5 prints an error message about the missing filter. Please let me know if any new HDF5 functionality is needed to facilitate error reporting in netCDF-4 for a missing compression method.

Now about pure Java HDF5 reader.

First of all, if the current Java implementation can deal with GZIP compression, it already has code to discover chunks and applied filters, and to decode them. How would any other compression be different? It shouldn't be hard to update the code to use bzip2, etc.
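
The idea that a reader which already dispatches one filter can be extended to others might look like the following lookup table, sketched here in C rather than Java. All names are hypothetical, and the decoder is a pass-through placeholder rather than a real gzip or bzip2 decoder.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decoder signature: returns bytes produced, or -1. */
typedef long (*decode_fn)(const unsigned char *in, size_t in_len,
                          unsigned char *out, size_t out_cap);

typedef struct {
    unsigned int filter_id;
    decode_fn decode;
} decoder_entry;

/* Placeholder decoder that copies bytes through unchanged; a real
 * reader would call a decompression library here. */
long decode_passthrough(const unsigned char *in, size_t in_len,
                        unsigned char *out, size_t out_cap)
{
    if (in_len > out_cap) return -1;
    for (size_t i = 0; i < in_len; i++) out[i] = in[i];
    return (long)in_len;
}

/* Look up the decoder for a filter ID. NULL means the reader cannot
 * decode chunks written with that filter. */
decode_fn find_decoder(const decoder_entry *table, size_t n, unsigned int id)
{
    assert(table != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (table[i].filter_id == id) return table[i].decode;
    }
    return NULL;
}
```

Supporting a new compression method then amounts to adding one table entry plus its decoder, while a NULL result corresponds to the "missing filter" case that has to be reported to the user.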

Second, it is my understanding that Java reader can deal only with HDF5 file format that was available in HDF5 1.8.0 release. It was not updated to deal with extensions introduced in 1.10.0 and later, for example, new chunk indexing schemas. Thus, new compression is not the only reason why there will be HDF5 (or netCDF-4) files that cannot be read by netcdf-java. The HDF Group will be more than happy to work with the netcdf-java maintainers as we worked with John Caron when he was developing the reader and help to update it.

I am not sure why the strategy of adding just two new APIs (one API to set an arbitrary compression and another one to find out about compression) and keeping netCDF in sync with the latest HDF5 releases is not the right one. The HDF5 library will do all the heavy lifting! And any HDF5 file can be read/modified with netCDF-based tools. Great win for everyone!

Thank you!
Elena

@Dave-Allured
Contributor

@epourmal,

Second, it is my understanding that Java reader can deal only with HDF5 file format that was available in HDF5 1.8.0 release. It was not updated to deal with extensions introduced in 1.10.0 and later ...

This is not an issue. There should be few or no netcdf-4 files with 1.10 extensions in the wild. All recent versions of the netcdf-C library have rigorously enforced 1.8 format compatibility, since shortly after HDF5 1.10 was first released.

Otherwise, I am very supportive of expanding use of HDF5 filter and compression abilities in netcdf-4 format.

@epourmal

epourmal commented Jan 2, 2020

This is not an issue. There should be few or no netcdf-4 files with 1.10 extensions in the wild. All recent versions of the netcdf-C library have rigorously enforced 1.8 format compatibility, since shortly after HDF5 1.10 was first released.

Yes, I understand and agree that it is not an issue right now if people use netCDF library to create netCDF files.

Unfortunately, sticking with 1.8 file format features will deprive netCDF users of HDF5 internal I/O improvements, for example, better-performing appends (up to 30%) to unlimited-dimension datasets, introduced in the 1.10.0 release, or the single writer/multiple readers virtual file driver (VFD SWMR) based on paged allocation (also introduced in the 1.10.0 release), which will allow reading a netCDF file while it is under construction.

@Dave-Allured
Contributor

Unfortunately, sticking with 1.8 file format features ...

Yes. Please see #951.

@Dave-Allured
Contributor

@dopplershift wrote:

There are clients in many languages on a diverse set of platforms that need to be considered. IMO, anyone proposing additions should be considering the impacts of additions on the broader ecosystem.

Please help me better understand the scope of this. I think the main community issue is format read compatibility. Writing and enquiry can be considered specialties in this context.

It seems to me that the majority of science apps are reading through the netcdf and HDF5 libraries, thus will automatically get read capability as @epourmal said, just by updating libraries. Would you agree?

Are there any other significant readers besides netcdf, HDF5, and netcdf-java libraries?

@dopplershift
Member

@Dave-Allured I agree that the focus is mostly on the read functionality; it's certainly my chief concern. The netcdf-java library has a completely independent, pure-Java implementation for reading netCDF data. As a result, it's not enough to just update the netcdf-c library, the netcdf-java library needs to be updated as well.

@edhartnett
Contributor

Wow, even more great commentary and information. Clearly this is a topic which generates a lot of interest!

@epourmal good suggestions, I will make them issues in the CCR project where they can be discussed fully.

@Dave-Allured I agree it would be a great thing if we can achieve #951. Let me know if you want some help putting a PR together. Given the performance improvements @epourmal mentions, this may be a priority for NOAA. 30% increase in performance would be very useful to the GFS, allowing more science to be done.

@dopplershift it's normal that I innovate and other people have to help with the challenges. That's my job as a programmer, and netcdf-4/HDF5 is a good example. ;-) But it feels like you are assigning me credit (or blame) for the work Dennis did with filters. That is where your netcdf-java issues originate. Those issues seem readily resolvable, but they are not of my making.

Allow me to explain NOAA's operational situation. Right now, the GFS team have a ~350 second budget to write one model-hour of output in netCDF/HDF5. But it's taking 450 seconds. So they need that 100 seconds back.

The problem is all zlib. If they turn that off, they get great performance, but the file is too big. They didn't buy enough disk to hold all the data uncompressed! So they need to compress the file, but they need it to be a lot faster.

Of secondary concern, there are a lot of downstream programs that read the data file later. These programs are not as operationally important, but they are still important. And they are taking too long to read the data, because of zlib.

What they want is compression that need not be as good as zlib (they don't care that much about final size, but the data must be compressed somewhat), that is fast enough to write, and that does not slow down all the reading programs downstream. The LZ4 filter looks like a much better choice for them. According to Charlie's paper, it performs >5X faster on read and write, and gives good enough compression.
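
A back-of-the-envelope model of the write budget can make the requirement concrete. The thread gives the 450 second actual time, the 350 second budget, and the ">5X" figure; the share of the 450 seconds actually spent inside zlib is not stated, so `codec_s` below is a hypothetical input. The model also treats the speedup as exactly 5 and ignores any change in I/O volume from a different compression ratio.

```c
#include <assert.h>

/* Write time after swapping in a codec that is `speedup` times faster,
 * assuming `codec_s` of the `total_s` seconds are spent compressing. */
double time_with_faster_codec(double total_s, double codec_s, double speedup)
{
    assert(speedup > 0.0 && codec_s >= 0.0 && codec_s <= total_s);
    return total_s - codec_s + codec_s / speedup;
}

/* Smallest compression time (seconds) that a `speedup`-times-faster
 * codec must be replacing for the total to drop to `budget_s`. */
double min_codec_seconds(double total_s, double budget_s, double speedup)
{
    assert(speedup > 1.0 && total_s >= budget_s);
    return (total_s - budget_s) / (1.0 - 1.0 / speedup);
}
```

Under these assumptions, `min_codec_seconds(450.0, 350.0, 5.0)` is about 125: a 5x faster codec recovers the missing 100 seconds only if zlib currently accounts for at least 125 of the 450 seconds, which is the kind of number profiling would need to confirm.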

NOAA is not going to use a new compression filter because of me - it's not my decision. They are going to use it (if it performs as expected) to meet their operational requirements. And if it does that, it helps NOAA and also the science community. Because in the past, these files were all in an impenetrable binary format that only NCEP could read. Now these files will be readable by everyone, leading to better scientific analysis of the GFS.

Furthermore, there is a trade off between science and operational needs. They don't write all the data that the scientists want, because of time constraints. If we can improve write performance, they can save more science data and still meet operational requirements.

Finally, the model resolution is determined by what they can compute and write in time. If we can reduce the I/O we make it easier for the computational people to increase the resolution, which is better for the science. (And something they very much desire.)

Anyway, this is something that will happen if the GFS team in Silver Spring decides it will happen. I don't have the authority to stop it or make it happen. They are aware of the filter API and the availability of other compression filters.

My goal with this effort is to facilitate this in the most useful way possible for the netCDF community. I think the CCR project (originally proposed by Charlie Zender), is a great way to develop these solutions further, and make available to the community what is needed to read these NOAA files. Also it will help us converge on a small set of useful compression filters, which everyone will have access to.

@epourmal

epourmal commented Jan 2, 2020

@edwardhartnett ,

The HDF Group needs to look into performance of filter pipeline too. We do know that it slows down tremendously vs. compressing individual chunks outside the library and writing them to the file using direct chunk I/O functions.

Have you profiled applications and checked how much time is spent in the HDF5 filter pipeline? It will help us a lot if you look into profiling and provide us with an I/O benchmark.

Thank you!

@DennisHeimbigner
Collaborator

Your explanation of the NOAA situation is quite clear. Thanks.
I think that if NOAA takes the lead here and says that it plans to use some specific compressor (e.g., LZ4) as the default, and can provide Unidata with a C implementation (and if possible Java) that they consider reliable, then we can add that to the netcdf-c library as an always-available compressor (like zlib is now). We would make it available in the same way as SZIP or ZLIB is now.

@Dave-Allured
Contributor

@edwardhartnett wrote:

@Dave-Allured I agree it would be a great thing if we can achieve #951. Let me know if you want some help putting a PR together. Given the performance improvements @epourmal mentions, this may be a priority for NOAA. 30% increase in performance would be very useful to the GFS, allowing more science to be done.

Yes, I would appreciate your help on #951. That issue was either forgotten or waiting for a response, I am not sure which. I do not have the proper training or resources to make PRs, so I can only make feature suggestions.

However, #951 would only raise the HDF5 feature level from 1.6 to 1.8 forward compatibility. The improvements that @epourmal mentioned just now are for 1.10. That level remains out of reach for the time being, due to broad file compatibility concerns in the same vein as what we are discussing for expanded filters.

@Dave-Allured
Contributor

@dopplershift wrote:

... As a result, it's not enough to just update the netcdf-c library, the netcdf-java library needs to be updated as well.

Yes I understand. I did not state my actual question clearly, sorry. Do you know of any other significant netcdf-4 readers that would be affected by new filters, other than netcdf, HDF5, and netcdf-java libraries?

@lesserwhirls
Contributor

netCDF-C is not the only library that attempts to read and/or write netCDF-4 files. The NCAR RAL developed Nujan library is a pure java writer for netCDF-4 and HDF files. h5netcdf is a python library that both reads and writes netCDF-4 files using h5py (which wraps HDF5 C) so it also does not use the netCDF-C library. The impact on h5netcdf depends on where new filters are implemented.

The bigger question here, from my point of view, is “what is netcdf?” Up until the addition of HDF5 as a persistence format (and the only persistence format supporting the extended data model), I think the situation was pretty clear, as the formats were well defined, such that anyone could implement read/write without much ambiguity. I believe this helped gather a large, language agnostic community around netCDF, and the associated persistence formats. If that’s a discussion to be had, let’s do that over on Unidata/netcdf#50 for now.

The issues of netcdf-java with filters remain, and must be resolved by the netcdf-java team. To summarize the outcomes suggested in this thread:
1 - Abandon the java-only reader of HDF5 files, and use the HDF5 C library to both write and read HDF5 files.
2 - Some energetic Java programmer(s) writes or finds the code to handle the new decompression methods, and netcdf-java is modified.
3 - netcdf-java accepts that there are going to be some netCDF files it cannot read. :-(

Option 2 is how this will play out, and the modification is not terrible as we already support a few filters so the hooks are there. However, it would be nice to be considered when making these changes (thank you @WardF and @dopplershift for tagging me), as it impacts planned development. Again, in my opinion, 1 and 3 are unacceptable (most especially 3), unless it is truly the case that the netCDF Enhanced Data Model and its (currently single) persistence format are only fully defined by the C library implementation in combination with HDF5 (including all configuration options and version information), and all other documentation regarding the Enhanced Data Model and its persistence format is only meant to be a conceptual guideline. If so, I would say that’s less than desirable, but ok. It'd be nice to get that viewpoint clearly stated in some documentation.

@edwardhartnett
Contributor Author

I have just released version 1.0 of the Community Codec Repo project. It provides two additional forms of compression, bzip2 and lz4. It can be obtained here: https://github.com/ccr/ccr/releases/download/v1_0/ccr-1.0.tar.gz

@edwardhartnett
Contributor Author

Well I just reread this thread, and it's nice to see how much progress has been made since this issue was opened about two years ago.

Since then we have:

  • Modified the netcdf-c code so that it works with HDF5 1.10 and 1.12. @Dave-Allured does this resolve all the issues in Update HDF5 format compatibility #951?
  • We have released version 1.3.0 of CCR, and Zstandard has emerged as the best compression filter so far.
  • We have added the bitgroom quantization to the C and Fortran libraries (not yet released), allowing lossy compression.
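
For readers new to bit grooming, the core idea can be sketched in a few lines. This is an illustrative simplification, not the library's implementation: the function name is hypothetical, it takes a count of mantissa bits to keep (whereas the real feature, as I understand it, is specified in significant decimal digits), it assumes IEEE 754 single precision, and it ignores fill values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Quantize floats in place, keeping `keep_bits` explicit mantissa bits.
 * Even-indexed values have the discarded bits cleared ("shaved") and
 * odd-indexed values have them set, so the rounding bias tends to cancel.
 * The long runs of identical trailing bits then compress well. */
void bitgroom_sketch(float *data, size_t count, int keep_bits)
{
    assert(data != NULL && keep_bits >= 1 && keep_bits <= 23);
    uint32_t keep_mask = 0xFFFFFFFFu << (23 - keep_bits);
    uint32_t set_mask = ~keep_mask;

    for (size_t i = 0; i < count; i++) {
        uint32_t bits;
        memcpy(&bits, &data[i], sizeof bits);
        if ((bits & 0x7FFFFFFFu) == 0u) continue;      /* +0 or -0 */
        if (((bits >> 23) & 0xFFu) == 0xFFu) continue; /* Inf or NaN */
        if (i % 2 == 0) bits &= keep_mask;  /* shave trailing bits */
        else bits |= set_mask;              /* set trailing bits */
        memcpy(&data[i], &bits, sizeof bits);
    }
}
```

The quantization itself saves no space; it is lossy preprocessing that makes a subsequent lossless compressor such as zlib or Zstandard more effective.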

We have managed to do all this in a way that has been fully compatible with netCDF-Java, which is nice.

I think our next step should be to support the zstandard compression library, as we now support szip. I will add a separate issue for that discussion...

@edwardhartnett
Contributor Author

I'm reopening this issue because it is a good issue for the general discussion of how to improve compression in netCDF.

In the immediate future the 4.9.0 release will include quantization.
In the next release, ZStandard compression will be supported.

Other forms of compression are under discussion including BLOSC and LZ4.

@edwardhartnett
Contributor Author

Here's a slide from a presentation by @czender which adds an additional motivation for supporting and encouraging better compression:

[Image: slide from the presentation by @czender]

@edwardhartnett
Contributor Author

Great progress has been made here and zstandard compression is supported in released code, which is fantastic. Thanks to all the Unidites who made this happen and also to @czender for getting the ball rolling with CCR.

NOAA is currently examining zstandard compression on some projects. I have no doubt that many or all will decide to convert to zstandard for better performance.

I will close this issue for now. Further compression work will continue in the CCR project and we welcome contributors and comments: https://github.com/ccr/ccr
