Expose a Zarr interface to Tiled #562
Comments
Inspired by discussion with @joshmoore at SciPy 2023, but I think not written down until now. |
My assumption is that these would be "layouts" of zarr arrays. @ivirshup can say more on
I don't think I understand well enough what you mean by wide containers. If this would be 1M zarrays in a single zgroup, that's doable, but you would never want to list the zgroup. Is this where pagination comes in? If it's a wide array, then I don't think there's a particular issue, because the pagination would map to chunking, right? |
I think I actually talked with @danielballan a bit about this in Seattle in May. In anndata, we've got on disk formats for tables and sparse built on top of hdf5/ zarr (description). What Josh describes is basically how we do tables, plus a little bit for more complicated column dtypes like categoricals and nullable bool/ int. Tools like IIRC, Our @trevormanz has also been interested in the idea of a server-backed zarr store. |
OK, so a "layout" is a standard Zarr on-disk format with a special interpretation layered on top---i.e. "You should interpret this group of arrays as a table and assume that they have equal length"? Does Zarr have an official way to encode a layout? Yes, "wide containers" would be like 1M zarrays in a single zgroup---too large to list in a single request. Tiled provides filtering (search) and paginated access to make that tractable. Yes, Tiled's sparse support has chunked-array semantics (like dask.array) where each chunk is transmitted as a table of COO data. We have built in a path for other sparse layouts (CSR, CSC) but have not yet implemented them. Isaac meant to tag @manzt, I think. |
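To make the pagination point concrete, here is a minimal sketch of how a client might walk a container that is too wide to list in one request. It is not Tiled's client API: `iter_pages`, `fake_fetch`, and the offset/limit convention are hypothetical names chosen for illustration, and the "server" is an in-memory list.

```python
from typing import Callable, Iterator, List


def iter_pages(
    fetch_page: Callable[[int, int], List[str]], page_size: int = 100
) -> Iterator[str]:
    """Yield keys page by page; a short (or empty) page signals the end."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size


# Hypothetical stand-in for a server holding a very wide group.
# 1,000 keys here; the real case discussed above is ~1M.
ALL_KEYS = [f"array_{i:07d}" for i in range(1_000)]


def fake_fetch(offset: int, limit: int) -> List[str]:
    """Pretend to request one page of keys from the server."""
    return ALL_KEYS[offset : offset + limit]


keys = list(iter_pages(fake_fetch, page_size=128))
```

The client never asks for the full listing, so the cost of each request stays bounded regardless of how many entries the group holds.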
Yes! Thanks for tagging me. I made |
Nice! I don't think I'd seen that one yet. I'm working on a branch to update our docs on "How Tiled Fits Into the Ecosystem". I'll include Yes, we do go the other way: exposing a Zarr store with Tiled's existing endpoints, layering on the things that Tiled gets you for the extra weight:
|
It doesn't, oops! I just copied the CLI from uvicorn since it just forwards args to creating the server. Sorry for the confusion.
Definitely! Let's find a time.
Wow, awesome. It's been a while since I checked in on things here. The transcoding is something I've always wanted access to from a web app. One thing I've been thinking about for a while is letting a zarr client "request" preferred encodings through something like request headers or query params.
# cat a basic zarr store
curl -sL https://my-zarr-service.com/data.zarr/.zarray
# { "dtype": "<u8", "shape": [10000, 10000], "chunks": [1024, 1024], "compression": ... }
# provide "preferred" overrides (quote the URL so the shell does not interpret "&")
curl -sL "https://my-zarr-service.com/data.zarr/.zarray?dtype=%3Cu2&chunk_x=256&chunk_y=256&compression=gzip"
# { "dtype": "<u2", "shape": [10000, 10000], "chunks": [256, 256], "compression": { "codec_id": "gzip" ...} } |
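For anyone who wants to try the query-parameter idea from Python rather than curl, here is a rough sketch that only builds the URL. The service URL and the parameter names (`dtype`, `chunk_x`, `chunk_y`, `compression`) are taken from the curl example above and are a proposal, not an existing API; `zarray_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def zarray_url(base: str, **preferred: object) -> str:
    """Build the .zarray metadata URL, appending any preferred overrides."""
    url = f"{base.rstrip('/')}/.zarray"
    if preferred:
        # urlencode percent-escapes characters such as "<" in "<u2".
        url += "?" + urlencode(preferred)
    return url


url = zarray_url(
    "https://my-zarr-service.com/data.zarr",
    dtype="<u2",
    chunk_x=256,
    chunk_y=256,
    compression="gzip",
)
print(url)
```

Letting `urlencode` do the escaping avoids hand-writing `%3C` and sidesteps the shell-quoting problem with `&`.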
😅, yes. Thanks for figuring that out.
|
@manzt We have exactly the same vision. Choosing the format and compression encoding works now, via HTTP content negotiation headers, e.g.
We also support a custom query parameter
We have not addressed re-chunking or requesting a coarser dtype, but these ideas have been raised and are certainly in scope. I've sent you an email. Thanks for the reference, @ivirshup. That ZEP was not yet on our radar. I like the idea of specified higher-level interpretations for Zarr data. It would cover some of our use cases, though not all of them. For example, we sometimes handle very wide tables---a snapshot of the state of a large amount of scientific hardware before and after an experiment---which can be 2 rows long and hundreds of columns wide. I believe this is not a good fit for a Zarr group, performance-wise, but it's a great fit for Arrow or Parquet. |
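As a rough illustration of the header-based content negotiation mentioned above, the sketch below constructs (but does not send) a request carrying standard HTTP `Accept` and `Accept-Encoding` headers. The URL and the media type are placeholders I made up for the example, not confirmed Tiled routes or formats, and `build_request` is a hypothetical helper.

```python
from urllib.request import Request


def build_request(url: str, media_type: str, encoding: str = "gzip") -> Request:
    """Ask for a given format (Accept) and compression (Accept-Encoding)."""
    return Request(
        url,
        headers={"Accept": media_type, "Accept-Encoding": encoding},
    )


# Hypothetical URL and media type, for illustration only.
req = build_request(
    "https://tiled.example.com/hypothetical/array/path",
    "application/octet-stream",
)
print(req.header_items())
```

A server that honors these headers can pick the representation per request, which is what makes it possible to serve the same stored data in several formats.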
This is like With my understanding of parquet and arrow, this would also be a bad case since there's an overhead per column. However, using a This may be a bit off topic though, so happy to refocus. |
Actually I think this may be relevant. If we expose Tiled as "a Zarr", i.e. add a
But I don't want to dwell too much on what may be an edge case. A best-effort that we can kick the tires on may be the place to begin. Thanks for GraphBLAS/binsparse-specification#16, by the way, a great overview of options. I believe Tiled is doing "Logical chunking of sparse arrays" at the API level. Internally, for the one mode of storage we currently support, we are also doing "Logical chunking of storage arrays", but this is not a strongly-held design choice, and alternatives could be added without any major changes. |
Honestly, a 10x performance drop is way better than I was expecting 😆 But also, this case is another order of magnitude faster to write to JSON, so I strongly think it's an edge case.
In [10]: %time df.to_json("test.json")
CPU times: user 1.69 ms, sys: 1.63 ms, total: 3.32 ms
Wall time: 3.53 ms
I think it would be fine to expose them. The way anndata does tables will only be reasonable for columnar access (or pretty large chunks of contiguous rows) once you have a larger number of rows.
Thanks!
Am I remembering correctly that you were doing global indices as opposed to chunk local indices + an offset? And was that at storage or API level? |
Yeah, that's fair enough, for sure. At an API level, we support:
or
where Storage uses block-local indices within each Parquet file. There is a convenience constructor for building local blocks from a global reference frame. Did that answer your question? It's been a while since sparse has been on the "front burner" of my brain. Reading more through the anndata links, I see how it shows clear patterns for presenting all of the structure families currently supported by Tiled as Zarr. This seems like a great place to start. Thanks for doing all the work. :-D |
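To spell out the global versus block-local distinction for COO data, here is a small pure-Python sketch. It is not Tiled's convenience constructor; `to_block_local` and `to_global` are hypothetical helpers, and the 2-D case with a fixed chunk shape is assumed for simplicity.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[int, int]
Block = Tuple[List[Coord], List[float]]


def to_block_local(
    coords: Sequence[Coord], values: Sequence[float], chunk_shape: Coord
) -> Dict[Coord, Block]:
    """Split global COO entries into per-block entries with local indices."""
    blocks: Dict[Coord, Block] = defaultdict(lambda: ([], []))
    for (row, col), value in zip(coords, values):
        block_id = (row // chunk_shape[0], col // chunk_shape[1])
        local = (row % chunk_shape[0], col % chunk_shape[1])
        blocks[block_id][0].append(local)
        blocks[block_id][1].append(value)
    return dict(blocks)


def to_global(block_id: Coord, local: Coord, chunk_shape: Coord) -> Coord:
    """Recover a global index from a block id (the offset) and a local index."""
    return (
        block_id[0] * chunk_shape[0] + local[0],
        block_id[1] * chunk_shape[1] + local[1],
    )


coords = [(0, 1), (5, 7), (12, 3)]
values = [1.0, 2.0, 3.0]
blocks = to_block_local(coords, values, chunk_shape=(10, 10))
```

With block-local indices, each block can be read and interpreted on its own, and the block id times the chunk shape gives the offset needed to place it back in the global frame.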
As we dip our toes into zarr, we would love to have an endpoint that could serve zarr directories and files in a similar fashion as a plain old web server could, enforcing Tiled authN/authZ, while still also exposing those data sets in the tiled |
Great. I think that is exactly what we intend with this issue. |
We could add a dedicated route, like /zarr/v3/, that exposes Tiled's contents as Zarr, such that fsspec would "just work" with it.
Open questions:
- What should we do about structures like table, sparse, and awkward that do not (yet, at least) fit cleanly into Zarr's data model?
- While container structures map perfectly onto Zarr groups, Tiled also supports extremely wide containers (~1M entries) and exposes a paginated API. Is there an analog in Zarr?
To start, one option is to simply filter out nodes that do not map cleanly into Zarr, exposing only those nodes that do.
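A minimal sketch of the "filter out what does not map cleanly" option, assuming that array and container are the structure families that do map onto Zarr arrays and groups. The nested-dict tree and the `zarr_view` helper are hypothetical stand-ins for illustration, not Tiled's internal data model.

```python
from typing import Any, Dict, Optional

# Assumption: these are the families that map cleanly onto Zarr arrays/groups.
ZARR_COMPATIBLE = {"array", "container"}

Node = Dict[str, Any]


def zarr_view(node: Node) -> Optional[Node]:
    """Return a copy of the tree keeping only Zarr-compatible nodes."""
    family = node["structure_family"]
    if family not in ZARR_COMPATIBLE:
        return None  # e.g. table, sparse, awkward are hidden for now
    if family == "container":
        children = {}
        for key, child in node.get("children", {}).items():
            kept = zarr_view(child)
            if kept is not None:
                children[key] = kept
        return {"structure_family": "container", "children": children}
    return dict(node)


tree = {
    "structure_family": "container",
    "children": {
        "image": {"structure_family": "array"},
        "readings": {"structure_family": "table"},
        "nested": {
            "structure_family": "container",
            "children": {"coo": {"structure_family": "sparse"}},
        },
    },
}
filtered = zarr_view(tree)
```

Containers whose children are all filtered out are kept as empty groups in this sketch; dropping them instead would be an equally reasonable choice.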