Use Arrow stream interface for public API #69
Conversation
Looks good, thanks. Just a question on one point.
Do you think the existing tests are sufficient to cover this? In particular, do we have any tests with enough data / small enough chunk size to hit multiple batches?
```python
if isinstance(schema, InferredSchema):
    schema = schema.inner

for batch in read_json_chunked(path, chunk_size=chunk_size):
    yield stac_items_to_arrow(batch, schema=schema)
batches_iter = (
```
This is probably fine, but I wanted to clarify a couple things:

- Do we have 2 schemas here: an input schema (named `schema`) and an output schema (named `resolved_schema`)?
- Could we somehow derive `resolved_schema` from just the input schema, and not from actual data? Something like `stac_items_to_arrow([], schema=schema)`? Or do we need actual items to figure out the resolved schema?
- Yes. It's the difference between the `StacJsonBatch` and `StacArrowBatch` classes.

  stac-geoparquet/stac_geoparquet/arrow/_batch.py, lines 36 to 52 in ab2701f:

  ```python
  class StacJsonBatch:
      """
      An Arrow RecordBatch of STAC Items that has been **minimally converted**
      to Arrow. That is, it aligns as much as possible to the raw STAC JSON
      representation.

      The **only** transformations that have already been applied here are
      those that are necessary to represent the core STAC items in Arrow.

      - `geometry` has been converted to WKB binary
      - `properties.proj:geometry`, if it exists, has been converted to WKB
        binary ISO encoding
      - The `proj:geometry` in any asset properties, if it exists, has been
        converted to WKB binary.

      No other transformations have yet been applied. I.e. all properties are
      still in a top-level `properties` struct column.
      """
  ```

  stac-geoparquet/stac_geoparquet/arrow/_batch.py, lines 176 to 180 in ab2701f:

  ```python
  class StacArrowBatch:
      """
      An Arrow RecordBatch of STAC Items that has been processed to match the
      STAC-GeoParquet specification.
      """
  ```

  One is the schema of the input data, which is as close to the original STAC JSON as possible (only with geometry pre-coerced to WKB), and the other is the schema of the output data, after any STAC GeoParquet transformations.
- No, not as it stands, and it's very annoying. I think the main blocker to this is that we transform the bounding box from an Arrow List to an Arrow Struct, which we do to take advantage of GeoParquet 1.1, which defines a bounding box struct column that can be used for predicate pushdown. However, we don't know in advance whether the bounding box of each Item is 2D or 3D, and so we don't know in advance how many struct fields to create (see the sketch below).

  This also means that STAC conversion will fail on mixed 2D/3D input. Are there any real-world STAC collections that have mixed 2D/3D bounding boxes?
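To make the field-count problem concrete, here is a minimal pyarrow sketch, not this library's actual code. The 2D field names follow GeoParquet 1.1's bbox covering; the 3D names are an assumption, since the spec only defines the 2D case:

```python
import pyarrow as pa

# In raw STAC JSON, bbox is a variable-length list, so a single Arrow type
# covers both 2D ([xmin, ymin, xmax, ymax]) and 3D items:
bbox_list = pa.list_(pa.float64())

# A GeoParquet 1.1-style bbox struct fixes its fields up front, and a 2D
# struct has a different shape than a 3D one:
bbox_struct_2d = pa.struct(
    [
        ("xmin", pa.float64()),
        ("ymin", pa.float64()),
        ("xmax", pa.float64()),
        ("ymax", pa.float64()),
    ]
)
# Hypothetical 3D variant; the field names here are an assumption:
bbox_struct_3d = pa.struct(
    [
        ("xmin", pa.float64()),
        ("ymin", pa.float64()),
        ("zmin", pa.float64()),
        ("xmax", pa.float64()),
        ("ymax", pa.float64()),
        ("zmax", pa.float64()),
    ]
)

# Without scanning actual items, there is no way to choose between the two
# struct types, which is why the output schema can't be derived from the
# input schema alone.
assert bbox_struct_2d != bbox_struct_3d
```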
OK, thanks. We might consider revisiting this later, but that all makes sense for now.

> Are there any real-world STAC collections that have mixed 2D/3D bounding boxes?

I think we only have one collection with 3D bounding boxes, and that should have 3D for each item.
> However we don't know in advance whether the bounding box of each Item is 2D or 3D, and so we don't know in advance how many struct fields to create.

I suppose this is something we could keep track of in our `InferredSchema` class while we're doing a scan of the input data: keep track of whether bounding boxes are only 2D, only 3D, or a mix of the two.

Though if a user passed in their own schema (which describes the input data, not the output, and so describes `bbox` as a `List`), they'd also need to pass in whether the bbox is 2D or 3D.
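A minimal sketch of that bookkeeping, under the assumption that `InferredSchema` would grow a field like this (nothing here is the current API):

```python
from enum import Enum

class BboxDims(Enum):
    """Running record of the bbox dimensionality seen during a scan."""
    UNKNOWN = "unknown"
    XY = "2d"       # bbox of length 4
    XYZ = "3d"      # bbox of length 6
    MIXED = "mixed"

def update_bbox_dims(state: BboxDims, bbox: list[float]) -> BboxDims:
    """Fold one STAC Item's bbox into the running dimensionality state."""
    seen = BboxDims.XY if len(bbox) == 4 else BboxDims.XYZ
    if state is BboxDims.UNKNOWN:
        return seen
    return state if state is seen else BboxDims.MIXED
```

Schema resolution could then pick the 2D or 3D struct from the final state, or raise early on `MIXED`.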
We should add tests with a small chunk size that specifically check for multi-chunk behavior.
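A rough shape for such a test. This is hypothetical: the `parse_stac_ndjson_to_arrow` entry point and its `chunk_size` parameter are assumed from the diff above, and the item dict is a stand-in rather than a valid STAC Item, so the library calls are left commented:

```python
import json

def test_yields_multiple_batches(tmp_path):
    # Write more newline-delimited items than a single chunk holds.
    path = tmp_path / "items.ndjson"
    with open(path, "w") as f:
        for i in range(10):
            f.write(json.dumps({"id": f"item-{i}"}) + "\n")

    # A chunk_size smaller than the item count forces multiple batches:
    # reader = parse_stac_ndjson_to_arrow(path, chunk_size=3)
    # batches = list(reader)
    # assert len(batches) > 1
```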
Change list

- Use `pyarrow.RecordBatchReader` for public APIs. This is better for memory usage because it never materializes the entire stream in memory at a time.
- `pyarrow.RecordBatchReader`s are also compliant with the Arrow PyCapsule Interface. This means that there's zero-copy interop with stac-arrow in Rust, pyogrio, geoarrow-rs, etc.

Closes #66
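As a rough illustration of the consumption pattern (a generic pyarrow sketch, not this library's API; the producer function is made up):

```python
import pyarrow as pa

schema = pa.schema([("id", pa.string())])

def produce_batches():
    # Stand-in producer; batches are generated lazily, so only one is
    # materialized in memory at a time.
    for i in range(3):
        yield pa.RecordBatch.from_pydict({"id": [f"item-{i}"]}, schema=schema)

reader = pa.RecordBatchReader.from_batches(schema, produce_batches())

# Consume the stream batch by batch; a PyCapsule-aware consumer (Rust,
# pyogrio, geoarrow-rs, ...) could instead take `reader` zero-copy via
# __arrow_c_stream__ before it is drained.
for batch in reader:
    assert batch.num_rows == 1
```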