Feat/write empty chunks #2429
I think this is a good approach. Should we add some backwards compatibility thing for the write_empty_chunks kwarg in zarr.open?
I think that
In [41]: x = np.random.randn(10, 10, 10)
In [42]: x2, y = np.broadcast_arrays(x, 0)
In [43]: x is x2  # No copy of the array is created
Out[43]: True
In [44]: y.base  # Only a single value is allocated for the fill value array data.
It'd be nice to avoid the equality check when writing, at least under some circumstances, but I haven't thought of an easy way to do that.
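A minimal sketch of the broadcasting idea from the session above, assuming NumPy is available. It uses `np.broadcast_to` rather than `np.broadcast_arrays`, and the names `chunk` and `fill_view` are illustrative only. The point is that the fill-value "array" is a zero-stride view over a single element, so no chunk-sized fill array is allocated.

```python
import numpy as np

# A chunk that happens to contain only the fill value (0.0 here).
chunk = np.zeros((10, 10, 10))

# Broadcast a single scalar to the chunk's shape. The result is a read-only
# view whose strides are all zero, so every index maps to the same element.
fill_view = np.broadcast_to(np.float64(0), chunk.shape)

all_zero_strides = all(stride == 0 for stride in fill_view.strides)

# The view can be compared against the chunk like any other array.
is_empty = bool(np.array_equal(chunk, fill_view))
```

The comparison itself still touches every element of the chunk, so this only removes the allocation of the fill array, not the equality check.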
Are you thinking of something like a warning to guide people to use the configuration approach, if they pass in
Yes, a warning and maybe even setting the config for them?
I think a warning is a good idea but I'm hesitant to have any runtime code that sets config variables beyond the initial setup. IMO we are better off treating it as immutable, and leaving it to users to set. I think we can afford to just do a warning here because user code won't break if
That sounds reasonable
I've forgotten the code path now, but if zarr creates the empty chunk using
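The warn-only approach discussed above could look roughly like the following sketch. `open_array` is a hypothetical stand-in, not zarr's real function, and the warning category and message are assumptions; the only point is that the legacy keyword is accepted, reported, and then ignored without touching any config.

```python
import warnings


def open_array(path, **kwargs):
    """Hypothetical open function that tolerates a legacy keyword."""
    if "write_empty_chunks" in kwargs:
        # Drop the legacy keyword and point the caller at the config setting.
        # No config value is changed on the caller's behalf.
        kwargs.pop("write_empty_chunks")
        warnings.warn(
            "write_empty_chunks is ignored here; set the "
            "'array.write_empty_chunks' config value instead.",
            UserWarning,
            stacklevel=2,
        )
    # Return a plain dict so the sketch stays self-contained.
    return {"path": path, **kwargs}


# Usage: the call succeeds, and exactly one warning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = open_array("data.zarr", mode="r", write_empty_chunks=True)
```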
@@ -331,6 +331,7 @@ async def write_batch(
value: NDBuffer,
drop_axes: tuple[int, ...] = (),
) -> None:
write_empty_chunks = config.get("array.write_empty_chunks") == True # noqa: E712
I have some concerns about unpacking this config value so deep in the stack. I'd rather make this a property of the Array so that we can guarantee consistent write behavior after an Array has been initialized.
I'd rather make this a property of the Array so that we can guarantee consistent write behavior after an Array has been initialized.
If we do that, then we also require that users create a brand new array if they want to write just some parts of the data with different empty chunks handling (or we introduce write_empty_chunks as a mutable attribute, which I would rather avoid).
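The trade-off in this thread can be illustrated with a small sketch. `CONFIG`, `SnapshotArray`, and `LiveArray` are hypothetical names and a plain dict stands in for the real config object: one class captures the setting once at construction, the other reads it on every access.

```python
# Plain dict standing in for a global config (an assumption for illustration).
CONFIG = {"array.write_empty_chunks": False}


class SnapshotArray:
    """Captures the setting at construction; later config changes are ignored."""

    def __init__(self):
        self.write_empty_chunks = bool(CONFIG["array.write_empty_chunks"])


class LiveArray:
    """Reads the setting on every access; behavior follows the config."""

    @property
    def write_empty_chunks(self):
        return bool(CONFIG["array.write_empty_chunks"])


snapshot = SnapshotArray()
live = LiveArray()

# Flip the global setting after both objects exist.
CONFIG["array.write_empty_chunks"] = True

snapshot_value = snapshot.write_empty_chunks  # still the value seen at init
live_value = live.write_empty_chunks          # follows the current config
```

The snapshot variant gives consistent writes for the lifetime of the object, at the cost of needing a new object to change behavior; the live variant allows per-write changes but makes behavior depend on global state at call time.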
This PR adds a boolean array.write_empty_chunks value to the global config, and uses this value to control whether chunks that are "empty", i.e. filled with values equivalent to the array's fill value, are written to storage.
In zarr-python 2.x, write_empty_chunks was a property of an Array that users specified when creating the Array object. This had pros and cons which I'm happy to discuss if people are interested, but the tl;dr is that the cons of that approach are driving my decision in this PR to make write_empty_chunks a global runtime property accessible via the config API.
Usage looks something like this (donfig experts please correct me if there's a better way):
If people hate this, then we can definitely change this API. I'm very open to discussion here.
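As a rough illustration of config-driven control over empty-chunk writes, here is a self-contained sketch. The `Config` class is a hypothetical stand-in written for this example (it is not donfig or zarr code), and `write_chunk` with its dict-backed store is likewise assumed; only the `array.write_empty_chunks` key comes from the description above.

```python
from contextlib import contextmanager


class Config:
    """Tiny stand-in for a config object with a context-manager override."""

    def __init__(self, defaults):
        self._values = dict(defaults)

    def get(self, key):
        return self._values[key]

    @contextmanager
    def set(self, overrides):
        previous = dict(self._values)
        self._values.update(overrides)
        try:
            yield self
        finally:
            # Restore the earlier values when the block exits.
            self._values = previous


config = Config({"array.write_empty_chunks": False})


def write_chunk(store, key, chunk, fill_value):
    """Store a chunk unless it is empty and empty writes are disabled."""
    is_empty = all(value == fill_value for value in chunk)
    if is_empty and not config.get("array.write_empty_chunks"):
        store.pop(key, None)  # in this sketch, also drop any stale copy
        return
    store[key] = list(chunk)


store = {}
write_chunk(store, "c/0", [0, 0, 0], fill_value=0)      # skipped
with config.set({"array.write_empty_chunks": True}):
    write_chunk(store, "c/1", [0, 0, 0], fill_value=0)  # written
write_chunk(store, "c/2", [0, 0, 0], fill_value=0)      # skipped again
write_chunk(store, "c/3", [0, 1, 0], fill_value=0)      # non-empty, written
```

The override only applies inside the `with` block, which is what lets a caller write some regions with different empty-chunk handling without creating a new array object.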
Also worth noting:
Our check for whether a chunk is equal to the fill value is pretty inefficient: it allocates a new array for every check invocation. This can definitely be made more efficient, in a stupid way by caching an all-fill-value chunk on the array instance and using that for the comparison, or a smarter way by doing the (chunk, fill_value) comparison without allocating a new array. But I think this is an effort for a separate PR.
closes #2409
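One possible shape for the "smarter" comparison mentioned above, as a sketch that assumes NumPy-backed chunks. `chunk_equals_fill` is a hypothetical helper, not the function used in this PR. It compares the chunk against a 0-d fill value so that no chunk-sized fill array is built, and treats NaN fill values specially because NaN never compares equal to itself.

```python
import numpy as np


def chunk_equals_fill(chunk, fill_value):
    """Return True if every element of chunk equals fill_value."""
    fill = np.asarray(fill_value, dtype=chunk.dtype)
    # NaN != NaN, so a NaN fill value needs an explicit isnan check.
    if chunk.dtype.kind in "fc" and np.isnan(fill):
        return bool(np.isnan(chunk).all())
    # Broadcasting the 0-d fill avoids allocating a chunk-sized fill array.
    # The boolean result of the comparison is still chunk-sized.
    return bool((chunk == fill).all())


empty_chunk = np.zeros((4, 4))
mixed_chunk = np.arange(16).reshape(4, 4)
nan_chunk = np.full((2, 2), np.nan)

empty_result = chunk_equals_fill(empty_chunk, 0)
mixed_result = chunk_equals_fill(mixed_chunk, 0)
nan_result = chunk_equals_fill(nan_chunk, np.nan)
```

This still allocates a temporary boolean array for the elementwise comparison, so it reduces rather than eliminates allocation.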
TODO: