How should netCDF support BLOSC compression? #2227
-
Blosc is somewhere in the PRs, but I do not recall where at this point.
-
Can I bump this up to underline the importance of Blosc/other lossless codecs in netcdf? Especially zstd gives a really wide range of performance/compression trade-offs. Given the increasing need for compression directly within netcdf, probably without fancy physical transformations (zfp, tensor trains...), the bitround+lossless framework seems to be a very good candidate for advanced compression without the need to implement lots of new stuff (see Klöwer 2021, nco/nco#250, zarr-developers/numcodecs#298). This compression framework would greatly benefit from more modern lossless codecs!
-
I had not previously noticed the HDF5 wrapper at https://github.com/Blosc/hdf5-blosc.
-
@milankl zstd has been agreed upon as an addition to netCDF codecs. It's a clear winner on performance, is standardized, and very widely available on netCDF platforms. Also, crucially, it has a pure-Java implementation for netCDF-Java to use. Further work on BLOSC and other codecs will take place in the Community Codec Repository (CCR), where we invite your contributions and help. CCR is a set of plugins, and some C and Fortran glue code, which add new codecs for netCDF C/Fortran users. For more information see: https://github.com/ccr/ccr. It's safe to assume that BLOSC and other codecs will not be considered for addition to the netcdf-c/netcdf-fortran libraries until they have proved their value and functionality with CCR.
-
Not quite. Blosc is essential for zarr support in netcdf.
-
Introduction
How should netCDF support BLOSC?
BLOSC is not the same kind of animal as zstandard or zlib. Instead it is a "meta-compressor" that delegates the actual compression to a codec such as zlib, zstandard, or LZ4.
The point of BLOSC is to manage the flow of data into the CPU cache as efficiently as possible. BLOSC then uses one of its compression codecs (LZ4, zstd, or zlib) to compress the data.
So BLOSC promises substantial performance improvements, especially on multi-processor systems.
More info: https://www.blosc.org/pages/blosc-in-depth/
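To make the "meta-compressor" idea concrete, here is a minimal sketch in Python, using only the standard library. It is not the Blosc implementation: it only illustrates the general pattern of splitting a buffer into small blocks (sized to fit in cache), byte-shuffling each block, and handing each block to an ordinary codec (zlib here). The function names, the 32 KiB block size, and the sequential loop are assumptions made for illustration; real Blosc chooses block sizes itself and can process blocks in multiple threads.

```python
import struct
import zlib


def shuffle(block, typesize):
    """Byte-shuffle: byte 0 of every element first, then byte 1, and so on."""
    n = len(block) // typesize
    body = block[: n * typesize]
    shuffled = b"".join(body[i::typesize] for i in range(typesize))
    # Any trailing bytes that do not form a whole element are left as-is.
    return shuffled + block[n * typesize:]


def unshuffle(block, typesize):
    """Inverse of shuffle()."""
    n = len(block) // typesize
    body = block[: n * typesize]
    out = bytearray(n * typesize)
    for i in range(typesize):
        out[i::typesize] = body[i * n:(i + 1) * n]
    return bytes(out) + block[n * typesize:]


def compress_blocks(data, typesize, blocksize=32 * 1024, level=5):
    """Split data into blocks, shuffle each one, then compress it with zlib.

    blocksize is a hypothetical value chosen to suggest "fits in cache".
    """
    blocks = []
    for start in range(0, len(data), blocksize):
        block = data[start:start + blocksize]
        blocks.append(zlib.compress(shuffle(block, typesize), level))
    return blocks


def decompress_blocks(blocks, typesize):
    """Decompress and unshuffle each block, then reassemble the buffer."""
    return b"".join(unshuffle(zlib.decompress(b), typesize) for b in blocks)


if __name__ == "__main__":
    # 100000 little-endian 4-byte integers as sample data.
    data = struct.pack("<100000I", *range(100000))
    blocks = compress_blocks(data, typesize=4)
    assert decompress_blocks(blocks, typesize=4) == data
    print(len(data), "bytes in,", sum(len(b) for b in blocks), "bytes out")
```

Because each block is compressed independently, the blocks could be handed to separate threads, which is where the multi-processor benefit mentioned above would come from.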
HDF5 Support
A tested BLOSC filter is available for HDF5 (see https://github.com/Blosc/hdf5-blosc). This is working well.
netcdf-c Support
BLOSC will be pretty easy to support in netcdf-c. There is a well-tested HDF5 filter, and HDF5 will automatically apply the filter when reading BLOSC-compressed data.
BLOSC can be turned on with filter commands in netcdf-c.
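As a rough illustration of what "filter commands" means here: in netcdf-c a filter is attached to a variable with nc_def_var_filter(ncid, varid, id, nparams, params), where id is the registered HDF5 filter ID and params is an array of unsigned integers. The sketch below builds that parameter array for the hdf5-blosc plugin. The layout (four reserved slots, then compression level, shuffle flag, and compressor code) and the filter ID 32001 follow my reading of the hdf5-blosc README and C-Blosc's blosc.h, so treat them as assumptions to check against the plugin version you have. The helper names are hypothetical.

```python
# Registered HDF5 filter ID of the hdf5-blosc plugin (assumption: per its README).
BLOSC_FILTER_ID = 32001

# Compressor codes as defined in C-Blosc's blosc.h (assumption: Blosc 1.x).
COMPRESSORS = {"blosclz": 0, "lz4": 1, "lz4hc": 2, "zlib": 4, "zstd": 5}


def blosc_filter_params(compressor="zstd", level=5, shuffle=True):
    """Build the 7-element parameter list for the hdf5-blosc filter.

    Slots 0-3 are reserved and filled in by the plugin itself, so they
    are passed as zeros. Slot 4 is the compression level, slot 5 the
    shuffle flag, and slot 6 the compressor code.
    """
    if compressor not in COMPRESSORS:
        raise ValueError("unknown compressor: %r" % compressor)
    if not 0 <= level <= 9:
        raise ValueError("level must be between 0 and 9")
    return [0, 0, 0, 0, level, int(bool(shuffle)), COMPRESSORS[compressor]]


def nccopy_filter_spec(varname, params):
    """Format a 'var,filterid,params...' string as accepted by nccopy -F."""
    fields = [varname, str(BLOSC_FILTER_ID)] + [str(p) for p in params]
    return ",".join(fields)


if __name__ == "__main__":
    params = blosc_filter_params("zstd", level=5, shuffle=True)
    # "temperature" is a hypothetical variable name.
    print(nccopy_filter_spec("temperature", params))
```

The same seven values would be passed as the params argument of nc_def_var_filter in C, with nparams set to 7. The plugin must be discoverable by HDF5 at run time (for example through HDF5_PLUGIN_PATH) for the filter to be applied.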
Probably, if BLOSC is further supported by netCDF, it would get its own def/inq functions, like the other compression methods.
@DennisHeimbigner you mentioned that you had an implementation of BLOSC working with netCDF - is that on a branch?
netcdf-fortran Support
BLOSC will be easy to support in the Fortran APIs, if we wrap the filter calls with a well-behaved def/inq set of functions.
Fortran is how most large data producers interact with netCDF, so BLOSC is not going to be used much until it is available in the Fortran APIs.
Community Codec Repository
We intend to support BLOSC in the CCR project (https://github.com/ccr/ccr). But this has not happened yet. If I can get a look at Dennis' implementation I can perhaps steal from that and get BLOSC out in CCR.
This would make BLOSC available to users to play with.
netcdf-java Support
Here we have the sticking point: there does not seem to be a native Java implementation of BLOSC. There is this Java project: https://github.com/Blosc/JBlosc
Notes
Beware of ruling out BLOSC due to Java issues. The technology is already available in the C/Fortran netCDF libraries, due to HDF5 and filter support. The speed results look very compelling, especially for multiprocessor systems (i.e. the HPC systems used to produce so much netCDF data).
As I have mentioned before, large data producers are unlikely to be much concerned with netcdf-java issues when examining this technology.
Recently we did a paper on different compression methods for the AGU. In the next such paper, I will try to include BLOSC so it can be directly compared to what is already available.