[BEAM-11460] Support reading Parquet files with unknown schema #13554
@danielxjd @lgajowy @jbonofre Can you review the feature for reading Parquet files with unknown schema?
…arseFiles<T>` implementation for supporting files with unknown schema.
Minor comments; looks pretty good. Thanks for this contribution, @anantdamle.
sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java
1. Fix Javadoc example by using consistent words
2. Other indentation and space fixes
@anantdamle Thank you for your contribution! Please make your changes in a feature branch, not in your master.
Looks nice now. I think I am going to merge this eagerly once the last fix is done, @aromanenko-dev. Don't hesitate to add any extra comments afterwards if you see things we can still improve.
@@ -58,6 +58,7 @@
* ReadFromMongoDB/WriteToMongoDB will mask password in display_data (Python) ([BEAM-11444](https://issues.apache.org/jira/browse/BEAM-11444).)
* Support for X source added (Java/Python) ([BEAM-X](https://issues.apache.org/jira/browse/BEAM-X)).
* There is a new transform `ReadAllFromBigQuery` that can receive multiple requests to read data from BigQuery at pipeline runtime. See [PR 13170](https://github.com/apache/beam/pull/13170), and [BEAM-9650](https://issues.apache.org/jira/browse/BEAM-9650).
* ParquetIO can now read files with an unknown schema. See [PR-13554](https://github.com/apache/beam/pull/13554) and ([BEAM-11460](https://issues.apache.org/jira/browse/BEAM-11460))
Can you move this up to the 2.28.0 section? 2.27.0 was already merged (sorry, I missed this in the previous check).
Since the branch needs to change, I created a new PR: #13616.
Data engineers sometimes do not know the schema of a Parquet file when writing a pipeline, or different files may carry different schemas. Reading Parquet files with ParquetIO requires providing an Avro (equivalent) schema, but it is often not possible to know the schema of the Parquet files in advance.
AvroIO, on the other hand, supports reading files with an unknown schema by accepting a parse function:
#parseGenericRecords(SerializableFunction<GenericRecord,T>)
Supporting this functionality in ParquetIO is simple and requires minimal changes to the ParquetIO surface.
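
To illustrate the idea behind a parse function, here is a minimal sketch in plain Java. It does not use Beam: `ParseFnSketch`, `record`, and `parseRecords` are hypothetical names, and a `Map<String, Object>` stands in for the self-describing `GenericRecord`. The point is only that the reader needs no schema up front, because each record carries its own field names and the caller's function decides how to turn it into a `T`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative stand-in only; none of these names come from Beam. */
public class ParseFnSketch {

  /** Builds a self-describing record (the role GenericRecord plays) from name/value pairs. */
  public static Map<String, Object> record(Object... namesAndValues) {
    Map<String, Object> r = new LinkedHashMap<>(); // keeps field order
    for (int i = 0; i < namesAndValues.length; i += 2) {
      r.put((String) namesAndValues[i], namesAndValues[i + 1]);
    }
    return r;
  }

  /** Hypothetical reader: applies the caller's parse function to every record, with no schema given. */
  public static <T> List<T> parseRecords(
      List<Map<String, Object>> records, Function<Map<String, Object>, T> parseFn) {
    List<T> out = new ArrayList<>();
    for (Map<String, Object> r : records) {
      out.add(parseFn.apply(r));
    }
    return out;
  }

  public static void main(String[] args) {
    // Two records whose field sets differ, as if read from files with different schemas.
    List<Map<String, Object>> records =
        Arrays.asList(record("id", 1, "name", "a"), record("id", 2, "price", 9.5));

    // The parse function discovers the fields from each record at read time.
    List<String> fieldLists = parseRecords(records, r -> String.join(",", r.keySet()));
    System.out.println(fieldLists); // prints [id,name, id,price]
  }
}
```

In a real pipeline the parse function would receive a `GenericRecord` and could inspect its schema in the same way; the map-based record here is only an assumption made to keep the sketch dependency-free.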
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
ParquetIOTest
R: @lgajowy and @jbonofre
Update CHANGES.md with noteworthy changes.