
Handle merging of evolved schemas in ParquetExec #1622

Merged (14 commits) on Jan 23, 2022

Conversation

thinkharderdev (Contributor):

Which issue does this PR close?

Closes #132

Rationale for this change

Currently, it is assumed that all parquet files in a listing scan have the same schema. This change enables support for schema evolution in the underlying storage layer by merging parquet schemas on read.

What changes are included in this PR?

There are three parts (a sketch of the merge and backfill follows this list):

  1. Modify ParquetFormat to merge schemas for all listed parquet files using the underlying Schema::try_merge method in arrow-rs
  2. When reading individual parquet files, map projected column indexes from the merged schema to the file schema.
  3. "Backfill" any missing columns from the merged schema with null-valued columns.

Are there any user-facing changes?

Scans that contain heterogeneous schemas will now attempt to merge the schemas; where a schema-mismatch error was previously expected, the observed behavior will differ.

github-actions bot added the datafusion label (Changes in the datafusion crate) on Jan 20, 2022.
alamb (Contributor) commented on Jan 20, 2022:

Thanks @thinkharderdev! I am hoping to review this tomorrow.

alamb (Contributor) left a review:

Thank you so much @thinkharderdev -- this is a wonderful first contribution.

I think other than some additional testing, this is basically ready to go. I left comments about what cases I think should be covered and a suggestion of how to do so (and avoid having to mess with parquet-data)

@@ -473,6 +536,69 @@ mod tests {
schema::types::SchemaDescPtr,
};

#[tokio::test]
Contributor:

Thanks! This is a great start on testing

I was thinking there is no reason to use checked in parquet files for this test -- we can create files as part of the test. I went ahead and coded this up as coralogix#1 as the scaffolding was a bit annoying.

With that code I suggest we test:

  1. columns in different orders in different files (e.g. one file has (a, b, c) columns and one has (b, a))
  2. projection (which I do see is covered here)
  3. Column with the same name and different types
  4. Columns covering different subsets of the data, e.g. (a, b) and (b, c)

thinkharderdev (author):

On point 1, I actually noticed this morning that my implementation would fail in the case where the columns are in different orders but all projected columns are present in the file. The easiest way to fix that would be to remove the condition on re-mapping the columns in the output batch (so do that mapping in all cases).

Are we concerned about the runtime cost of that operation, and should we try to avoid it when unnecessary? I'm relatively new to Rust, so I'm not sure how expensive cloning an Arc is.

Contributor:

> I'm relatively new to Rust, so I'm not sure how expensive cloning an Arc is.

Cloning an Arc is very fast (it increments an atomic counter) 🚤
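
(A minimal standalone sketch of that point, not code from this PR:)

```rust
use std::sync::Arc;

fn main() {
    let schema = Arc::new(vec!["a", "b", "c"]); // stand-in for an Arc<Schema>
    // Cloning the Arc bumps an atomic reference count; the underlying
    // data is never copied.
    let shared = Arc::clone(&schema);
    assert_eq!(Arc::strong_count(&schema), 2);
    drop(shared);
    assert_eq!(Arc::strong_count(&schema), 1);
}
```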

[Resolved review thread on datafusion/src/physical_plan/file_format/parquet.rs (outdated)]
thinkharderdev (author):

> Thank you so much @thinkharderdev -- this is a wonderful first contribution.
>
> I think other than some additional testing, this is basically ready to go. I left comments about what cases I think should be covered and a suggestion of how to do so (and avoid having to mess with parquet-data)

Thanks! I'll add the additional test coverage.

Add round trip parquet testing
@@ -385,9 +388,33 @@ fn build_row_group_predicate(
}
}

// Map projections from the schema which merges all file schemas to projections on a particular
// file
fn map_projections(
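
A minimal sketch of what this mapping might do — the signature and error handling below are assumptions, not the PR's exact code. Each projected index into the merged schema is resolved by column name against the file's own schema; columns the file lacks are skipped (and later backfilled with nulls), and a type conflict surfaces as an error:

```rust
use arrow::datatypes::Schema;
use arrow::error::ArrowError;

fn map_projections(
    merged_schema: &Schema,
    file_schema: &Schema,
    projections: &[usize],
) -> Result<Vec<usize>, ArrowError> {
    projections
        .iter()
        .filter_map(|&i| {
            let merged_field = merged_schema.field(i);
            // Look the column up by name; skip it if this file lacks it.
            file_schema.index_of(merged_field.name()).ok().map(|idx| {
                if file_schema.field(idx).data_type() == merged_field.data_type() {
                    Ok(idx)
                } else {
                    Err(ArrowError::SchemaError(format!(
                        "column {} has a type conflicting with the merged schema",
                        merged_field.name()
                    )))
                }
            })
        })
        .collect()
}
```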
tustvold (Contributor) commented on Jan 21, 2022:

I wonder if this logic and the logic in read_partition might be extracted into some sort of SchemaAdapter, akin to PartitionColumnProjector. This would allow the logic to be reused with other file formats, e.g. JSON or CSV, whilst also allowing it to be tested in isolation.

Contributor:

That is a good idea. If not in this PR, I'll file a ticket to do it in a follow-on.
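
For reference, a rough sketch of what such a SchemaAdapter might look like (entirely hypothetical; no such type exists in this PR as written):

```rust
use arrow::array::new_null_array;
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Hypothetical adapter: owns the merged ("table") schema and fixes up
// batches read with a file's own schema so they conform to it.
struct SchemaAdapter {
    merged_schema: SchemaRef,
}

impl SchemaAdapter {
    fn adapt(&self, batch: RecordBatch) -> Result<RecordBatch, ArrowError> {
        let columns = self
            .merged_schema
            .fields()
            .iter()
            .map(|field| match batch.schema().index_of(field.name()) {
                // Column present in the file: reorder it into place.
                Ok(idx) => batch.column(idx).clone(),
                // Column missing from the file: backfill with nulls.
                Err(_) => new_null_array(field.data_type(), batch.num_rows()),
            })
            .collect();
        RecordBatch::try_new(self.merged_schema.clone(), columns)
    }
}
```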

1. Add additional test cases

2. Map projected column indexes in all cases

3. Raise an explicit error in the case where there is a conflict between a file schema and the merged schema.
use crate::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
use arrow::array::{
BinaryArray, BooleanArray, Float32Array, Float64Array, Int32Array,
TimestampNanosecondArray,
};
use futures::StreamExt;

#[tokio::test]
async fn test_merge_schema() -> Result<()> {
thinkharderdev (author):

@alamb Not sure we need a test case here since it gets tested implicitly in the ParquetExec tests? If you'd rather have an explicit test case here, then I can use the same utility from those test cases over here.

Contributor:

I think coverage at the ParquetExec level is more than adequate.

alamb (Contributor) left a review:

This is looking really close @thinkharderdev -- thank you so much. I think the test for incompatible schema needs to be fixed but then this is ready to merge

[Resolved review thread on datafusion/src/physical_plan/file_format/parquet.rs (outdated)]

// read/write the files:
let read =
round_trip_to_parquet(vec![batch1, batch2], Some(vec![0, 3]), None).await;
Contributor:

Using the projection index 3 is good 👍 (as it requires projecting on the merged schema).

}

#[tokio::test]
async fn evolved_schema_incompatible_types() {
Contributor:

This test name implies it is for incompatible types, but then merges two files with compatible types.

Perhaps you could switch one of the column types and then assert a panic like

Suggested change:
- async fn evolved_schema_incompatible_types() {
+ #[should_panic(expected = "incorrect types")]
+ async fn evolved_schema_incompatible_types() {

thinkharderdev (author):

That is what I tried to do originally, but the issue is that the panic happens on the reader thread. In effect, it just causes the read on that partition to get dropped. We could test for a panic by calling read_partition directly from the test, if you'd rather do it that way.

Contributor:

👍 I think this is good enough for now -- I'll file a ticket to improve the behavior (error) on schema mismatch

Contributor:

Filed #1651

Contributor:

Fixed in #1837

thinkharderdev and others added 4 commits January 22, 2022 09:56
…ser error rather than something to investigate) Using a DataFusionError rather than one from Parquet (the rationale being that this error comes from DataFusion, and is not related to being able to read the parquet files)

Co-authored-by: Andrew Lamb <[email protected]>
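
As a hedged illustration of that choice (the helper function and message text below are assumptions, not the PR's code):

```rust
use datafusion::error::DataFusionError;

// Hypothetical: report a schema conflict detected while merging file
// schemas as a DataFusion execution error rather than a parquet error,
// since the parquet file itself was read successfully and the mismatch
// is detected by DataFusion.
fn schema_conflict_error(column: &str) -> DataFusionError {
    DataFusionError::Execution(format!(
        "file schema for column {} conflicts with the merged schema",
        column
    ))
}
```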
alamb (Contributor) commented on Jan 23, 2022:

Thanks @thinkharderdev!

thinkharderdev (author):

> Thanks @thinkharderdev!

Thank you for your help!

alamb (Contributor) commented on Jan 24, 2022:

FYI @thinkharderdev this also fixed @capkurmagati's issue in #1527 🎉
