
Use Arrow stream interface for public API #69

Merged 6 commits into main on Jun 25, 2024

Conversation

kylebarron (Collaborator)

Change list

  • Use pyarrow.RecordBatchReader for public APIs. This is better for memory usage because it never materializes the entire stream in memory at once.
  • Return types and function arguments using pyarrow.RecordBatchReader are also compliant with the Arrow PyCapsule Interface. This means that there's zero-copy interop with stac-arrow in Rust, pyogrio, geoarrow-rs, etc. (see the sketch after this list).
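
As a rough, hypothetical sketch of the streaming pattern (not the exact code in this PR): a lazy iterator of RecordBatches can be wrapped in a pyarrow.RecordBatchReader, and any library that understands the Arrow PyCapsule Interface can then consume the stream without copying. The helper name here is illustrative only.

import pyarrow as pa

def batches_to_reader(batch_iter, schema: pa.Schema) -> pa.RecordBatchReader:
    # Wrap a lazy iterator of RecordBatches; consumers pull one batch at a
    # time, so the full stream is never materialized in memory.
    return pa.RecordBatchReader.from_batches(schema, batch_iter)

# In recent pyarrow versions the reader exposes __arrow_c_stream__(), which is
# what PyCapsule-aware libraries (pyogrio, geoarrow-rs, ...) use for zero-copy
# handoff.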

Closes #66

@TomAugspurger (Collaborator) left a comment

Looks good, thanks. Just a question on one point.

Do you think the existing tests are sufficient to cover this? In particular, do we have any tests with enough data / small enough chunk size to hit multiple batches?


if isinstance(schema, InferredSchema):
    schema = schema.inner

for batch in read_json_chunked(path, chunk_size=chunk_size):
    yield stac_items_to_arrow(batch, schema=schema)
batches_iter = (
@TomAugspurger (Collaborator)

This is probably fine, but I wanted to clarify a couple things:

  1. Do we have 2 schemas here: an input schema (named schema) and an output schema (named resolved_schema)?
  2. Could we somehow derive resolved_schema from just the input schema, and not from actual data? Something like stac_items_to_arrow([], schema=schema)? Or do we need actual items to figure out the resolved schema?

@kylebarron (Collaborator Author)

  1. Yes. It's the difference between the StacJsonBatch

         class StacJsonBatch:
             """
             An Arrow RecordBatch of STAC Items that has been **minimally converted** to Arrow.
             That is, it aligns as much as possible to the raw STAC JSON representation.

             The **only** transformations that have already been applied here are those that are
             necessary to represent the core STAC items in Arrow.

             - `geometry` has been converted to WKB binary
             - `properties.proj:geometry`, if it exists, has been converted to WKB binary
             ISO encoding
             - The `proj:geometry` in any asset properties, if it exists, has been converted to
               WKB binary.

             No other transformations have yet been applied. I.e. all properties are still in a
             top-level `properties` struct column.
             """

     and StacArrowBatch

         class StacArrowBatch:
             """
             An Arrow RecordBatch of STAC Items that has been processed to match the
             STAC-GeoParquet specification.
             """

     One is the schema of the input data, which is as close to the original STAC JSON as possible (only with geometry pre-coerced to WKB), and the other is the schema of the output data, after any STAC GeoParquet transformations.

  2. No, not as it stands, and it's very annoying. I think the main blocker is that we transform the bounding box from an Arrow List to an Arrow Struct (which we do to take advantage of GeoParquet 1.1, which defines a bounding box struct column that can be used for predicate pushdown). However, we don't know in advance whether the bounding box of each Item is 2D or 3D, and so we don't know in advance how many struct fields to create (see the sketch after this list).

     This also means that STAC conversion will fail on mixed 2D/3D input. Are there any real-world STAC collections that have mixed 2D/3D bounding boxes?
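
For illustration, a minimal sketch of the List-to-Struct bbox conversion described above. The helper is hypothetical (not the library's implementation), the 2D field names follow GeoParquet 1.1's bbox covering column, the 3D names are a guess, and it assumes there are no null bboxes.

import pyarrow as pa
import pyarrow.compute as pc

def bbox_list_to_struct(bbox: pa.ListArray) -> pa.StructArray:
    # The struct layout depends on whether every bbox has 4 values (2D) or
    # 6 values (3D); mixed input is ambiguous, which is the problem above.
    lengths = set(pc.list_value_length(bbox).to_pylist())
    if lengths == {4}:
        names = ["xmin", "ymin", "xmax", "ymax"]  # GeoParquet 1.1 bbox fields
    elif lengths == {6}:
        names = ["xmin", "ymin", "zmin", "xmax", "ymax", "zmax"]  # 3D naming is a guess
    else:
        raise ValueError(f"Mixed or unexpected bbox lengths: {lengths}")

    flat = bbox.flatten()  # assumes no null bboxes, so values stay aligned
    step = len(names)
    columns = [flat.take(pa.array(range(i, len(flat), step))) for i in range(step)]
    return pa.StructArray.from_arrays(columns, names=names)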

@TomAugspurger (Collaborator)

OK, thanks. We might consider revisiting this later, but that all makes sense for now.

> Are there any real-world STAC collections that have mixed 2D/3D bounding boxes?

I think we only have one collection with 3D bounding boxes, and that should have 3D for each item.

@kylebarron (Collaborator Author)

> However we don't know in advance whether the bounding box of each Item is 2D or 3D, and so we don't know in advance how many struct fields to create.

I suppose this is something we could keep track of in our InferredSchema class while we're doing a scan of the input data: whether bounding boxes are only 2D, only 3D, or a mix of the two.

Though if a user passed in their own schema (which describes the input, not the output data, and so describes bbox as a List), they'd also need to pass in whether the bbox is 2D or 3D.
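
A minimal sketch of that tracking idea, with hypothetical names (not the library's actual InferredSchema API):

class BboxDimensionTracker:
    """Hypothetical helper: record whether scanned bboxes are 2D, 3D, or mixed."""

    def __init__(self) -> None:
        self.lengths: set[int] = set()

    def update(self, items: list[dict]) -> None:
        # Called once per chunk during the schema-inference scan.
        for item in items:
            bbox = item.get("bbox")
            if bbox is not None:
                self.lengths.add(len(bbox))

    @property
    def dimensionality(self) -> int:
        if self.lengths == {4}:
            return 2
        if self.lengths == {6}:
            return 3
        raise ValueError(f"Mixed or unexpected bbox lengths: {self.lengths}")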

@TomAugspurger merged commit 7df15b3 into main on Jun 25, 2024. 1 check passed.
@TomAugspurger deleted the kyle/record-batch-reader branch on June 25, 2024 at 13:29.
@kylebarron (Collaborator Author)

> Do you think the existing tests are sufficient to cover this? In particular, do we have any tests with enough data / small enough chunk size to hit multiple batches?

We should add tests with a small chunk size that specifically check for multi-chunk behavior.
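
Something along these lines, assuming an entry point like parse_stac_ndjson_to_arrow(path, chunk_size=...) that returns a stream of RecordBatches; the import path, function name, and fixture are assumptions for illustration:

import json

from stac_geoparquet.arrow import parse_stac_ndjson_to_arrow  # assumed import path

def test_small_chunk_size_yields_multiple_batches(tmp_path, stac_items):
    # stac_items is an assumed fixture providing more items than chunk_size.
    path = tmp_path / "items.ndjson"
    with open(path, "w") as f:
        for item in stac_items:
            f.write(json.dumps(item) + "\n")

    batches = list(parse_stac_ndjson_to_arrow(path, chunk_size=2))

    assert len(batches) > 1
    assert sum(batch.num_rows for batch in batches) == len(stac_items)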

Development

Successfully merging this pull request may close these issues.

Standardize on Arrow RecordBatchReader in function parameters and return types