What are correct metadata attributes when adding a zarr array #333
As you found, specifying the `_ARRAY_DIMENSIONS` attribute in the `.zattrs` file works; that provides compatibility with xarray. However, neuroglancer currently does not support a way to specify units or a non-zero offset for a zarr array, as I was not aware of any standard representation for that information and was hesitant to invent one.
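For context, the xarray convention referred to here is a `.zattrs` file carrying a `_ARRAY_DIMENSIONS` list of per-dimension names; the axis names below are just an example:

```json
{
  "_ARRAY_DIMENSIONS": ["z", "y", "x"]
}
```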
I am also not aware of any standard representation of that information, but I'd like to propose one. I don't think adding specific fields/attributes would break any reader in other tools. We could double-check with the zarr community. I can try to submit a pull request if we reach a consensus.
I think it would indeed be helpful to check with the zarr community to try to reach some consensus on these attributes. Note that there is the additional issue of non-scalar and multi-field dtypes, such as `[["a", "<u2", [4, 5]], ["b", "<i4", [3]]]`. Neuroglancer currently doesn't support that, but in general it might be desirable to specify names, units, and offsets for those inner dimensions as well. Or maybe inner dimensions are sufficiently rare that it doesn't matter. Within Neuroglancer, offsets would be fairly simple to support, but for a library like tensorstore, should the offset affect indexing operations as well, or solely be used for visualization? If it affects indexing, then it would probably belong in `.zarray` rather than `.zattrs`.
Okay, let us post an issue over at the zarr community once we come up with a reasonable proposal. I am sure they have thought about it as well; even if it isn't considered for their core specs, they might simply give an OK for root field names that do not interfere with anything else. The relevant zarr docs for structured data types are here. Do you have a particular dataset in mind where you have come across this? How would you imagine it being supported in Neuroglancer if subfields are multidimensional? numpy has a good way to specify time units for datetime64 dtypes. Hmm, regarding the question of supporting offsets for indexing operations in libraries like tensorstore: I think it might be quite a contentious topic. I would tend to keep the offset only for visualization use cases for now, and keep the assumption of zero-offset indexing operations in tensorstore, so it would go in `.zattrs`. Can you make a proposal for how you would specify the information we just discussed?
As far as structured data types: no, I haven't seen them used much myself, and in general the interleaved storage they result in is probably less desirable in most cases than columnar storage. But I would suppose that if they are to be officially supported by zarr then the issue should be addressed, as I imagine there are some users of structured dtypes. Neuroglancer currently supports "m", "s", "Hz" and "rad/s" (along with all of the SI prefixes) as units. Mostly I think just "m" (for meters) and "s" (for seconds) are used. Possibly in the future it could support more, e.g. knowing that "meters" and "m" are the same thing. Units are tricky, though; on the one hand it is pretty important for neuroglancer to support at least SI prefixes, so that it can display "nm" and "um" at appropriate relative scales and show scale bars using SI prefixes, but on the other hand it would be nice to allow arbitrary unit identifiers to support varied use cases. I have taken a look at the units syntax used by the udunits2 library (https://www.unidata.ucar.edu/software/udunits/udunits-2.1.24/udunits2lib.html#Syntax), and it seems like a reasonable standard. However, udunits2 does depend on a units database; while users can define their own custom units, that is not really recommended by the udunits2 author. I would suggest that zarr use a representation like: `"units": [[450, "nm"], [450, "nm"], [1, "um"], [1, ""], null]`. The units for each dimension would be specified either as a multiplier/unit pair, or as `null` if unknown. As far as offsets: note that tensorstore itself supports non-zero origins, and the neuroglancer precomputed format has a non-zero origin via the "voxel_offset" attribute.
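To make the proposed representation concrete, here is a hypothetical sketch of how a reader might normalize such per-dimension unit entries to SI base units. The prefix table is abbreviated, and nothing here is an established zarr or neuroglancer API:

```python
# Hypothetical sketch of normalizing the proposed per-dimension "units"
# entries ([multiplier, unit] or null/None) to SI base units.
# The prefix table is abbreviated; "rad/s" and compound units are omitted.
SI_PREFIXES = {"n": 1e-9, "u": 1e-6, "m": 1e-3, "k": 1e3}
BASE_UNITS = {"m", "s", "Hz", ""}

def parse_unit_entry(entry):
    """Return (scale_in_base_units, base_unit), or None for an unknown dimension."""
    if entry is None:
        return None
    multiplier, unit = entry
    if unit in BASE_UNITS:  # unprefixed, e.g. "m", "s", or dimensionless ""
        return (float(multiplier), unit)
    prefix, base = unit[0], unit[1:]  # e.g. "nm" -> ("n", "m")
    if base not in BASE_UNITS or prefix not in SI_PREFIXES:
        raise ValueError(f"unrecognized unit: {unit!r}")
    return (multiplier * SI_PREFIXES[prefix], base)

units = [[450, "nm"], [450, "nm"], [1, "um"], [1, ""], None]
print([parse_unit_entry(e) for e in units])
```

With this scheme, a viewer could compare `[450, "nm"]` and `[1, "um"]` dimensions on a common scale while still displaying the original prefix.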
I do think it might be somewhat confusing to have an integer offset vector for visualization but then not respect it when accessing the array programmatically; however, I can see how that would be problematic for zarr-python to support, and also there would be the backward compatibility issue that older versions of zarr would ignore the offset attribute even if newer versions respected it. If the offset is purely for visualization purposes, it might instead make more sense to allow non-integer offsets or a full affine transform matrix. Currently zarr seems designed to pretty closely follow the numpy array data model, and I can see that there may be hesitancy to officially support non-zero origins and units as they are not part of the numpy array data model.
Note: An alternative could be to use a string to encode both the multiplier and the unit, e.g. "nm" or "4nm" or "4" (for dimensionless unit). That would be more consistent with the udunits2 syntax, but would require slightly more complicated parsing.
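A sketch of parsing that combined string form; the grammar here is an assumption, not a spec:

```python
import re

# Parse the combined "multiplier + unit" string form, e.g. "4nm", "nm",
# or "4" (dimensionless). A bare unit implies a multiplier of 1.0.
UNIT_RE = re.compile(
    r"^(?P<number>[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)?(?P<unit>[a-zA-Z/]*)$")

def parse_unit_string(s):
    """Split a string like "4nm" into (4.0, "nm")."""
    m = UNIT_RE.match(s)
    if m is None:
        raise ValueError(f"cannot parse unit string: {s!r}")
    number = float(m.group("number")) if m.group("number") else 1.0
    return number, m.group("unit")

print(parse_unit_string("4nm"), parse_unit_string("nm"), parse_unit_string("4"))
```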
I suspect spatial metadata will be ruled out of scope for zarr-python. A related question came up here: zarr-developers/zarr-specs#50, and the decision was made to punt the issue over to a higher abstraction level (the nascent OME-Zarr specification). For my own purposes, I have been happy with expressing spatial metadata in Zarr / N5 containers as follows:

```json
{
  "transform": {
    "axes": ["z", "y", "x"],
    "scale": [5.24, 4.0, 4.0],
    "translate": [0.0, 0.0, 0.0],
    "units": ["nm", "nm", "nm"]
  }
}
```

...but I am not pushing this as a standard, because I'm hoping something more encompassing comes out of community developments to standardize this kind of stuff: see ome/ngff#28
Adding support for OME/ngff metadata to neuroglancer would also be reasonable, I think. It is a bit unfortunate that ome/ngff is rather narrowly focused on bio imaging, e.g. in that xyztc is specifically baked in, though it seems to be moving in the direction of being more general.
Yes, we are making a constant effort to un-bake the magic dimension names from the spec :) |
Thanks @d-v-b for pointing out the discussion around ome/ngff. There seems to be a lot going on, and it is very nice that many tools already support, or intend to support, ngff. I think it would make sense to contribute some of the discussion/proposals here toward augmenting the axes-labeling proposal in the ngff specification, and then add support for OME/ngff to Neuroglancer, instead of another special-purpose zarr metadata spec that is not supported by a wider community. A few points for discussion may be:
Thanks for your thoughts @unidesigner.

Re 1:

Re 5:

Re 6: I think we can distinguish a couple of different cases:
Thanks for the response, @jbms - I will cross-link this issue with the relevant ome-zarr issue 35 in the hope of cross-pollination.

Re 1:

Re 2:

Re 4:

Re 5: In the NG Python bindings, I figured that I had to specify

Re 6:
Re: Re 1: To be clear, by "columnar" data storage (and I believe this is consistent with standard database terminology, https://en.wikipedia.org/wiki/Column-oriented_DBMS), in the context of dense array storage, I mean that if, for example, we have a regular x-y grid of points and we want to store two different variables, like "temperature" and "pressure", or "red" and "green" from a microscope, then we would store chunks containing just temperature, chunks containing just pressure, chunks containing just red, and chunks containing just green. That means when compressing we have just a sequence of "temperature" values, which will likely compress better than interleaved temperature and pressure values, and if we only need to read "temperature" then we don't have to read "pressure" as well, so we only need to process half as much data. We can already achieve this by storing each variable as a separate zarr array, where all the arrays have the same dimensions. Potentially, though, there could be some specification by which we could indicate more explicitly that (some of) the dimensions are shared between these different arrays. In contrast, "row" storage would be achieved by using a single zarr array with a structured dtype. It is true that for point cloud storage, "units" as we have been discussing is not a useful attribute. Instead, you may wish to associate units with the data type itself, e.g. you would say that the data type of "x" is float32 and specifies a value in units of "4nm", while the data type of "y" is float32 and specifies values in units of "6nm". That would have to be specified with a different attribute than the "units" attribute we were discussing. xarray has the concept of a "coordinate array", where you say that the coordinates along one dimension correspond to values of another 1-d array; however, typically there is just one coordinate array.
For the point cloud data we would have both the "x" coordinate array and the "y" coordinate array associated with the "color" array. The neuroglancer precomputed annotation format is actually a "row" format, not a "columnar" format, since we store multiple variables together. As far as annotation storage: for small amounts of data the format does not matter much, since we can afford to load it all into memory, but for large amounts of data, to display it interactively, the data needs to be indexed suitably so that the viewer can retrieve just the relevant portion needed for the current view. For that reason the precomputed annotation format indexes the data using a multi-scale spatial index as well as "relationship" indexes. There is a distinction between the high-level overall organization of the data, which determines which access patterns can be supported efficiently, and the lower-level encoding of individual chunks, where each chunk will always be read in its entirety. For the lower-level encoding, it is perhaps a bit unfortunate that the neuroglancer precomputed annotation format has its own custom simple binary format --- using an established format like avro or apache arrow would probably have been a better choice, and perhaps could be done in a future version. For the higher-level organization, we are getting into the realm of databases rather than pure data streams. Of course there are a multitude of existing general-purpose and specialized database formats. There are a few desirable properties:
Re: Re 5: Currently there is no way to specify the "display dimensions" for Neuroglancer in the data format itself; Neuroglancer simply defaults to the first 3 dimensions. Instead you have to specify that separately in the Neuroglancer state. I think that should logically be separate from the data itself, as you will in general have many different views of the same data.

Re: Re 6:
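The "row" vs "columnar" distinction discussed above can be sketched with numpy; this is an illustration of the layouts, not any particular tool's storage format:

```python
import numpy as np

# "Row" storage: one array with a structured dtype; the two fields are
# interleaved element-by-element in memory (and would be interleaved
# inside each zarr chunk).
row = np.zeros((4, 5), dtype=[("temperature", "<f4"), ("pressure", "<f4")])

# "Columnar" storage: two plain arrays sharing the same dimensions.
# Reading only temperature never touches pressure bytes.
temperature = np.zeros((4, 5), dtype="<f4")
pressure = np.zeros((4, 5), dtype="<f4")

# The interleaving shows up in the itemsize: each element of `row`
# carries both float32 fields.
print(row.dtype.itemsize)          # two float32 fields per element
print(temperature.dtype.itemsize)  # one float32 per element
```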
Thanks for your very detailed and informative answer, @jbms . I think this issue and answer contains a lot of useful information for any ome-zarr standardization efforts. I'd like to reply with a few more points/questions to your answer, but it will take me a bit of time. |
Is the _ARRAY_DIMENSIONS attribute still supported by neuroglancer? I can't even get the axis names to work the way the OP did. I tried to follow the suggested xarray example, but I'm still getting the axis names 'd0', 'd1', 'd2' rather than the desired 'x', 'y', 'z' in the viewer: e.g. serve with cors_webserver.py, then go to the demo appspot site.
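For comparison, here is a minimal sketch of writing that attribute by hand into a zarr v2 array directory (with zarr-python you would instead set `arr.attrs["_ARRAY_DIMENSIONS"]`); the axis names are just an example:

```python
import json
import os
import tempfile

# Write an xarray-style _ARRAY_DIMENSIONS attribute into a .zattrs file.
# The directory stands in for a zarr v2 array directory that also
# contains a .zarray file and chunk data.
array_dir = tempfile.mkdtemp()
with open(os.path.join(array_dir, ".zattrs"), "w") as f:
    json.dump({"_ARRAY_DIMENSIONS": ["z", "y", "x"]}, f)

# Re-read to confirm what a reader such as neuroglancer would see.
with open(os.path.join(array_dir, ".zattrs")) as f:
    dims = json.load(f)["_ARRAY_DIMENSIONS"]
print(dims)
```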
I have an image layer with prechunked data loaded with x,y,z dimensions at 4,4,40 nm voxel resolution. I want to add a zarr volume which is stored in z,y,x orientation with 40,8,8 nm voxel resolution. I get the following source info after adding, where the zarr array has the following `.zattrs` and `.zarray` info. What is the correct way to do this? Perhaps I can add the resolution and bounds information directly to the metadata files.