
Use Arrow stream interface for public API #69

Merged 6 commits into main on Jun 25, 2024

Conversation

kylebarron (Collaborator)

Change list

  • Use pyarrow.RecordBatchReader for public APIs. This is better for memory usage because it never materializes the entire stream in memory at once.
  • Return types and function arguments using pyarrow.RecordBatchReader are also compliant with the Arrow PyCapsule Interface. This means that there's zero-copy interop with stac-arrow in Rust, pyogrio, geoarrow-rs, etc. (see the sketch after this list).
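
As a rough, hypothetical sketch of the streaming pattern (not the exact code in this PR): a lazy iterator of RecordBatches can be wrapped in a pyarrow.RecordBatchReader, and any library that understands the Arrow PyCapsule Interface can then consume the stream without copying. The helper name here is illustrative only.

import pyarrow as pa

def batches_to_reader(batch_iter, schema: pa.Schema) -> pa.RecordBatchReader:
    # Wrap a lazy iterator of RecordBatches; consumers pull one batch at a
    # time, so the full stream is never materialized in memory.
    return pa.RecordBatchReader.from_batches(schema, batch_iter)

# In recent pyarrow versions the reader exposes __arrow_c_stream__(), which is
# what PyCapsule-aware libraries (pyogrio, geoarrow-rs, ...) use for zero-copy
# handoff.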

Closes #66

@TomAugspurger (Collaborator) left a comment

Looks good, thanks. Just a question on one point.

Do you think the existing tests are sufficient to cover this? In particular, do we have any tests with enough data / small enough chunk size to hit multiple batches?


if isinstance(schema, InferredSchema):
    schema = schema.inner

for batch in read_json_chunked(path, chunk_size=chunk_size):
    yield stac_items_to_arrow(batch, schema=schema)
batches_iter = (
@TomAugspurger (Collaborator)

This is probably fine, but I wanted to clarify a couple things:

  1. Do we have 2 schemas here: an input schema (named schema) and an output schema (named resolved_schema)?
  2. Could we somehow derive resolved_schema from just the input schema, and not from actual data? Something like stac_items_to_arrow([], schema=schema)? Or do we need actual items to figure out the resolved schema?

@kylebarron (Collaborator Author)

  1. Yes. It's the difference between the StacJsonBatch

         class StacJsonBatch:
             """
             An Arrow RecordBatch of STAC Items that has been **minimally converted** to Arrow.
             That is, it aligns as much as possible to the raw STAC JSON representation.

             The **only** transformations that have already been applied here are those that are
             necessary to represent the core STAC items in Arrow.

             - `geometry` has been converted to WKB binary
             - `properties.proj:geometry`, if it exists, has been converted to WKB binary
             ISO encoding
             - The `proj:geometry` in any asset properties, if it exists, has been converted to
               WKB binary.

             No other transformations have yet been applied. I.e. all properties are still in a
             top-level `properties` struct column.
             """

     and StacArrowBatch

         class StacArrowBatch:
             """
             An Arrow RecordBatch of STAC Items that has been processed to match the
             STAC-GeoParquet specification.
             """

     One is the schema of the input data, which is as close to the original STAC JSON as possible (only with geometry pre-coerced to WKB), and the other is the schema of the output data, after any STAC GeoParquet transformations.

  2. No, not as it stands, and it's very annoying. I think the main blocker is that we transform the bounding box from an Arrow List to an Arrow Struct (which we do to take advantage of GeoParquet 1.1, which defines a bounding box struct column that can be used for predicate pushdown). However, we don't know in advance whether the bounding box of each Item is 2D or 3D, and so we don't know in advance how many struct fields to create (see the sketch after this list).

     This also means that STAC conversion will fail on mixed 2D/3D input. Are there any real-world STAC collections that have mixed 2D/3D bounding boxes?
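
For illustration, a minimal sketch of the List-to-Struct bbox conversion described above. The helper is hypothetical (not the library's implementation), the 2D field names follow GeoParquet 1.1's bbox covering column, the 3D names are a guess, and it assumes there are no null bboxes.

import pyarrow as pa
import pyarrow.compute as pc

def bbox_list_to_struct(bbox: pa.ListArray) -> pa.StructArray:
    # The struct layout depends on whether every bbox has 4 values (2D) or
    # 6 values (3D); mixed input is ambiguous, which is the problem above.
    lengths = set(pc.list_value_length(bbox).to_pylist())
    if lengths == {4}:
        names = ["xmin", "ymin", "xmax", "ymax"]  # GeoParquet 1.1 bbox fields
    elif lengths == {6}:
        names = ["xmin", "ymin", "zmin", "xmax", "ymax", "zmax"]  # 3D naming is a guess
    else:
        raise ValueError(f"Mixed or unexpected bbox lengths: {lengths}")

    flat = bbox.flatten()  # assumes no null bboxes, so values stay aligned
    step = len(names)
    columns = [flat.take(pa.array(range(i, len(flat), step))) for i in range(step)]
    return pa.StructArray.from_arrays(columns, names=names)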

@TomAugspurger (Collaborator)

OK, thanks. We might consider revisiting this later, but that all makes sense for now.

> Are there any real-world STAC collections that have mixed 2D/3D bounding boxes?

I think we only have one collection with 3D bounding boxes, and that should have 3D for each item.

@kylebarron (Collaborator Author)

> However we don't know in advance whether the bounding box of each Item is 2D or 3D, and so we don't know in advance how many struct fields to create.

I suppose this is something we could keep track of in our InferredSchema class while we're doing a scan of the input data: whether bounding boxes are only 2D, only 3D, or a mix of the two.

Though if a user passed in their own schema (which describes the input, not the output data, and so describes bbox as a List), they'd also need to pass in whether the bbox is 2D or 3D.
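
A minimal sketch of that tracking idea, with hypothetical names (not the library's actual InferredSchema API):

class BboxDimensionTracker:
    """Hypothetical helper: record whether scanned bboxes are 2D, 3D, or mixed."""

    def __init__(self) -> None:
        self.lengths: set[int] = set()

    def update(self, items: list[dict]) -> None:
        # Called once per chunk during the schema-inference scan.
        for item in items:
            bbox = item.get("bbox")
            if bbox is not None:
                self.lengths.add(len(bbox))

    @property
    def dimensionality(self) -> int:
        if self.lengths == {4}:
            return 2
        if self.lengths == {6}:
            return 3
        raise ValueError(f"Mixed or unexpected bbox lengths: {self.lengths}")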

@TomAugspurger merged commit 7df15b3 into main on Jun 25, 2024. 1 check passed.
@TomAugspurger deleted the kyle/record-batch-reader branch on June 25, 2024 at 13:29.
@kylebarron (Collaborator Author)

> Do you think the existing tests are sufficient to cover this? In particular, do we have any tests with enough data / small enough chunk size to hit multiple batches?

We should add tests with a small chunk size that specifically check for multi-chunk behavior.
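
Something along these lines, assuming an entry point like parse_stac_ndjson_to_arrow(path, chunk_size=...) that returns a stream of RecordBatches; the import path, function name, and fixture are assumptions for illustration:

import json

from stac_geoparquet.arrow import parse_stac_ndjson_to_arrow  # assumed import path

def test_small_chunk_size_yields_multiple_batches(tmp_path, stac_items):
    # stac_items is an assumed fixture providing more items than chunk_size.
    path = tmp_path / "items.ndjson"
    with open(path, "w") as f:
        for item in stac_items:
            f.write(json.dumps(item) + "\n")

    batches = list(parse_stac_ndjson_to_arrow(path, chunk_size=2))

    assert len(batches) > 1
    assert sum(batch.num_rows for batch in batches) == len(stac_items)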

Development

Successfully merging this pull request may close these issues.

Standardize on Arrow RecordBatchReader in function parameters and return types