
Possible bug relating to the setting of Variable chunksizes #1323

Open
davidhassell opened this issue Jun 3, 2024 · 8 comments


davidhassell commented Jun 3, 2024

Hello,

I have found it impossible (at v1.6.5) to get netCDF4 to write a file with the default chunking strategy - it writes either contiguous storage or explicitly set chunksizes, but never the library's default chunks.

To test this I used the following function:

import netCDF4
import numpy as np

def write(**kwargs):
    nc = netCDF4.Dataset('chunk.nc', 'w')
    x = nc.createDimension('x', 80000)
    y = nc.createDimension('y', 4000)
    tas = nc.createVariable('tas', 'f8', ('y', 'x'), **kwargs)
    tas[...] = np.random.random(320000000).reshape(4000, 80000)
    print(tas.chunking())
    nc.close()

and ran it as follows:

In [2]: write()  # Not as expected - expected default chunking
contiguous
In [3]: !ncdump -sh chunk.nc | grep tas:
		tas:_Storage = "contiguous" ;
		tas:_Endianness = "little" ;

In [4]: write(contiguous=False)  # Not as expected - expected default chunking
contiguous
In [5]: !ncdump -sh chunk.nc | grep tas:
		tas:_Storage = "contiguous" ;
		tas:_Endianness = "little" ;

In [6]: write(contiguous=True)  # As expected 
contiguous
In [7]: !ncdump -sh chunk.nc | grep tas:
		tas:_Storage = "contiguous" ;
		tas:_Endianness = "little" ;

In [8]: write(chunksizes=(400, 8000))  # As expected 
[400, 8000]
In [9]: !ncdump -sh chunk.nc | grep tas:
		tas:_Storage = "chunked" ;
		tas:_ChunkSizes = 400, 8000 ;
		tas:_Endianness = "little" ;

Surely, if contiguous=False and chunksizes=None, the netCDF default chunking strategy should be used?

I found that if I changed the line at https://github.com/Unidata/netcdf4-python/blob/v1.6.5rel/src/netCDF4/_netCDF4.pyx#L4307 to read:

                    if chunksizes is not None or not contiguous:  # was: if chunksizes is not None or contiguous

then I could get the default chunking to work as expected:

In [2]: write()  # With modified code
[308, 6154]
In [3]: !ncdump -sh chunk.nc | grep tas:
		tas:_Storage = "chunked" ;
		tas:_ChunkSizes = 308, 6154 ;
		tas:_Endianness = "little" ;

In [4]: write(contiguous=False)  # With modified code
[308, 6154]
In [5]: !ncdump -sh chunk.nc | grep tas:
		tas:_Storage = "chunked" ;
		tas:_ChunkSizes = 308, 6154 ;

In [6]: write(contiguous=True) # With modified code 
contiguous
In [7]: !ncdump -sh chunk.nc | grep tas:
		tas:_Storage = "contiguous" ;
		tas:_Endianness = "little" ;

In [8]: write(chunksizes=(400, 8000))  # With modified code
[400, 8000]
In [9]: !ncdump -sh chunk.nc | grep tas:
		tas:_Storage = "chunked" ;
		tas:_ChunkSizes = 400, 8000 ;
		tas:_Endianness = "little" ;

However, this might not be the best way to do things - what do you think?

Many thanks,
David

>>> netCDF4.__version__
1.6.5

jswhit commented Jun 4, 2024

The current code will not call nc_def_var_chunking at all if chunksizes=None and contiguous=False, which I would think would result in the library default chunking strategy.
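
For reference, here is that guard restated as plain Python (a paraphrase of the v1.6.5 Cython source, not a verbatim quote):

def will_call_nc_def_var_chunking(contiguous, chunksizes):
    # v1.6.5: the C call is skipped when contiguous=False and
    # chunksizes=None, so such variables end up contiguous.
    # @davidhassell's proposed fix flips `contiguous` to `not contiguous`,
    # so the default case would reach nc_def_var_chunking as well.
    return chunksizes is not None or contiguous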


jswhit commented Jun 4, 2024

I think chunking is only used by default if there is an unlimited dimension. Try this:

import netCDF4
import numpy as np

def write(**kwargs):
    nc = netCDF4.Dataset('chunk.nc', 'w')
    x = nc.createDimension('x', 8000)
    y = nc.createDimension('y', 400)
    z = nc.createDimension('z', None)
    tas = nc.createVariable('tas', 'f8', ('z','y', 'x'), **kwargs)
    tas[0:10,:,:] = np.random.random(32000000).reshape(10,400, 8000)
    print(tas.chunking())
    nc.close()

write()
[1, 200, 4000]

So even if you specify contiguous=False, you won't get chunking by default unless there is an unlimited dimension. If there is no unlimited dimension, you have to specify the chunksizes to get chunking.

I can see how this can be confusing, since the default for the contiguous kwarg is False, yet the library default is contiguous storage unless there is an unlimited dimension. The netcdf4-python docs do say this, though: "Fixed size variables (with no unlimited dimension) with no compression filters are contiguous by default."

DennisHeimbigner commented

As near as I can tell, when a variable is created, it has default chunksizes computed automatically.
Then, if nc_def_var_chunking is called later, those default sizes should get overwritten.

davidhassell commented

Thanks for the background, @jswhit and @DennisHeimbigner - it's very useful.

So, not a bug then, but maybe a feature request! Would it be possible for netCDF4-python to write a variable that has no unlimited dimensions with the default chunking strategy? I guess that you don't want to change the existing API, so perhaps that could be controlled by a new keyword to createVariable, as sketched below?
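
For illustration only, such a keyword might look like this (use_default_chunking is hypothetical and does not exist in netCDF4-python; it is just a name for the idea):

# Hypothetical keyword, named here only to illustrate the idea:
tas = nc.createVariable('tas', 'f8', ('y', 'x'), use_default_chunking=True)
print(tas.chunking())  # would report the C library's default chunk shape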

Thanks,
David


jswhit commented Jun 5, 2024

@davidhassell it is already being reported - variables with no unlimited dimension are not chunked by default (they are contiguous).

davidhassell commented

Hi @jswhit, I see that what I wrote was ambiguous - sorry! I'll try again:

I would like to create chunked variables, chunked with the netCDF default chunk sizes, that have no unlimited dimensions. As far as I can tell this is not currently possible, but would you be open to creating this option?


jswhit commented Jun 6, 2024

@davidhassell thanks for clarifying, I understand now. Since the python interface doesn't have access to the default chunking algorithm in the C library, I don't know how this would be done. I'm open to suggestions though.
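
For what it's worth, the general idea can be approximated in pure Python: scale every dimension by a common factor so one chunk holds roughly a fixed number of bytes. This is a rough sketch under that assumption, not the actual algorithm in the C library (nc4_find_default_chunksizes2 applies further corrections, and the target size varies by version):

import math

TARGET_CHUNK_BYTES = 16 * 1024 * 1024  # assumed target; the real constant differs by version

def approx_default_chunksizes(shape, itemsize):
    # Scale all dimensions by the same factor so one chunk holds
    # roughly TARGET_CHUNK_BYTES. Rough approximation only.
    target_values = TARGET_CHUNK_BYTES / itemsize
    factor = (target_values / math.prod(shape)) ** (1.0 / len(shape))
    return [max(1, min(n, round(n * factor))) for n in shape]

print(approx_default_chunksizes((4000, 80000), 8))
# prints [324, 6476]; the C library itself reported [308, 6154] above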


jswhit commented Jun 6, 2024

A potential workaround that doesn't require an unlimited dimension is to turn on compression (zlib=True, complevel=1) or the Fletcher32 checksum algorithm (fletcher32=True), since any filter forces chunked storage.
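
For example, adapting the reproducer from the top of this issue:

import netCDF4

nc = netCDF4.Dataset('chunk.nc', 'w')
nc.createDimension('x', 80000)
nc.createDimension('y', 4000)
# Turning on a filter forces chunked storage, so the C library
# picks its default chunk sizes even with no unlimited dimension.
tas = nc.createVariable('tas', 'f8', ('y', 'x'), zlib=True, complevel=1)
print(tas.chunking())  # expect a list of chunk sizes, not 'contiguous'
nc.close()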
