-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Table Scan: Add Row Selection Filtering #565
Table Scan: Add Row Selection Filtering #565
Conversation
18dd9f9
to
7d61858
Compare
9e12056
to
c2e118b
Compare
@liurenjie1024 / @Xuanwo - PTAL when you get chance, thanks 👍🏼 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sdd for this great pr! Genarally LGTM, and I've left some minor points to improve, but overall it looks good!
)); | ||
}; | ||
|
||
let row_counts = self.calc_row_counts(offset_index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be computed only once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good spot. I think the optimal approach here would not be to calc row counts for all columns up-front because only the ones for columns that appear in the predicate would be needed. Caching the result would make more sense. If it is ok with you I'll add a TODO for this rather than address in this PR, and we can address in a follow-up, as it is not broken and caching would not save us a huge amount of time here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable to me.
|
||
if datum.is_nan() { | ||
// NaN indicates unreliable bounds. | ||
return Ok(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct, but java's implementation follows jdk's default behavior: https://docs.oracle.com/javase/8/docs/api/java/lang/Double.html#compareTo-java.lang.Double-
I think in future we could unify the logic of datum comparison, but it doesn't have to be a blocker here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's defer then considering we want this in a 0.4.0 release.
Back to you @liurenjie1024, thanks for the thorough review - some good spots! 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sdd for this great pr!
)); | ||
}; | ||
|
||
let row_counts = self.calc_row_counts(offset_index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable to me.
commit cda4a0c Author: congyi wang <[email protected]> Date: Mon Sep 30 11:07:36 2024 +0800 chore: fix typo in FileIO Schemes (apache#653) * fix typo * fix typo commit af9609d Author: Scott Donnelly <[email protected]> Date: Mon Sep 30 04:06:14 2024 +0100 fix: page index evaluator min/max args inverted (apache#648) * fix: page index evaluator min/max args inverted * style: fix clippy lint in test commit a6a3fd7 Author: Alon Agmon <[email protected]> Date: Sat Sep 28 10:10:08 2024 +0300 test (datafusion): add test for table provider creation (apache#651) * add test for table provider creation * fix formatting * fixing yet another formatting issue * testing schema using data fusion --------- Co-authored-by: Alon Agmon <[email protected]> commit 87483b4 Author: Alon Agmon <[email protected]> Date: Fri Sep 27 04:40:08 2024 +0300 making table provider pub (apache#650) Co-authored-by: Alon Agmon <[email protected]> commit 984c91e Author: ZENOTME <[email protected]> Date: Thu Sep 26 17:56:02 2024 +0800 avoid to create memory schema operator every time (apache#635) Co-authored-by: ZENOTME <[email protected]> commit 4171275 Author: Matheus Alcantara <[email protected]> Date: Wed Sep 25 08:28:42 2024 -0300 scan: change ErrorKind when table dont have spanshots (apache#608) commit ab51355 Author: xxchan <[email protected]> Date: Tue Sep 24 21:25:45 2024 +0800 fix: compile error due to merge stale PR (apache#646) Signed-off-by: xxchan <[email protected]> commit 420b4e2 Author: Scott Donnelly <[email protected]> Date: Tue Sep 24 08:20:23 2024 +0100 Table Scan: Add Row Selection Filtering (apache#565) * feat(scan): add row selection capability via PageIndexEvaluator * test(row-selection): add first few row selection tests * feat(scan): add more tests, fix bug where min/max args swapped * fix: ad test and fix for logic bug in PageIndexEvaluator in-clause handler * feat: changes suggested from PR review commit b3709ba Author: Christian <[email protected]> Date: Tue Sep 24 04:47:04 2024 +0200 feat: Add NamespaceIdent.parent() (apache#641) * Add NamespaceIdent.parent() * Use split_last commit 1533c43 Author: Alon Agmon <[email protected]> Date: Mon Sep 23 13:39:46 2024 +0300 feat (datafusion integration): convert datafusion expr filters to Iceberg Predicate (apache#588) * adding main function and tests * adding tests, removing integration test for now * fixing typos and lints * fixing typing issue * - added support in schmema to convert Date32 to correct arrow type - refactored scan to use new predicate converter as visitor and seperated it to a new mod - added support for simple predicates with column cast expressions - added testing, mostly around date functions * fixing format and lic * reducing number of tests (17 -> 7) * fix formats * fix naming * refactoring to use TreeNodeVisitor * fixing fmt * small refactor * adding swapped op and fixing CR comments --------- Co-authored-by: Alon Agmon <[email protected]> commit e967deb Author: xxchan <[email protected]> Date: Mon Sep 23 18:34:59 2024 +0800 feat: expose remove_all in FileIO (apache#643) Signed-off-by: xxchan <[email protected]> commit d03c4f8 Author: Scott Donnelly <[email protected]> Date: Mon Sep 23 08:28:52 2024 +0100 Migrate to arrow-* v53 (apache#626) * chore: migrate to arrow-* v53 * chore: update datafusion to 42 * test: fix incorrect test assertion * chore: update python bindings to arrow 53 commit 88e5e4a Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Sep 23 15:26:18 2024 +0800 chore(deps): Bump crate-ci/typos from 1.24.5 to 1.24.6 (apache#640) Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.5 to 1.24.6. - [Release notes](https://github.com/crate-ci/typos/releases) - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md) - [Commits](crate-ci/typos@v1.24.5...v1.24.6) --- updated-dependencies: - dependency-name: crate-ci/typos dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit c354983 Author: xxchan <[email protected]> Date: Mon Sep 23 14:50:18 2024 +0800 doc: improve FileIO doc (apache#642) Signed-off-by: xxchan <[email protected]> commit 12e12e2 Author: xxchan <[email protected]> Date: Fri Sep 20 19:59:55 2024 +0800 feat: expose arrow type <-> iceberg type (apache#637) * feat: expose arrow type <-> iceberg type Previously we only exposed the schema conversion. Signed-off-by: xxchan <[email protected]> * add tests Signed-off-by: xxchan <[email protected]> --------- Signed-off-by: xxchan <[email protected]> commit 3b27c9e Author: xxchan <[email protected]> Date: Fri Sep 20 18:32:31 2024 +0800 feat: add Sync to TransformFunction (apache#638) Signed-off-by: xxchan <[email protected]> commit 34cb81c Author: Xuanwo <[email protected]> Date: Wed Sep 18 20:18:40 2024 +0800 chore: Bump opendal to 0.50 (apache#634) commit cde35ab Author: FANNG <[email protected]> Date: Fri Sep 13 10:01:16 2024 +0800 feat: support projection pushdown for datafusion iceberg (apache#594) * support projection pushdown for datafusion iceberg * support projection pushdown for datafusion iceberg * fix ci * fix field id * remove depencences * remove depencences commit eae9464 Author: Xuanwo <[email protected]> Date: Thu Sep 12 02:06:31 2024 +0800 refactor(python): Expose transform as a submodule for pyiceberg_core (apache#628) commit 8a3de4e Author: Christian <[email protected]> Date: Mon Sep 9 14:45:16 2024 +0200 Feat: Normalize TableMetadata (apache#611) * Normalize Table Metadata * Improve readability & comments commit e08c0e5 Author: Renjie Liu <[email protected]> Date: Mon Sep 9 11:57:22 2024 +0800 fix: Correctly calculate highest_field_id in schema (apache#590) commit f78c59b Author: Jack <[email protected]> Date: Mon Sep 9 03:35:16 2024 +0100 feat: add `client.region` (apache#623) commit a5aba9a Author: Christian <[email protected]> Date: Sun Sep 8 18:36:05 2024 +0200 feat: SortOrder methods should take schema ref if possible (apache#613) * SortOrder methods should take schema ref if possible * Fix test type * with_order_id should not take reference commit 5812399 Author: Christian <[email protected]> Date: Sun Sep 8 18:18:41 2024 +0200 feat: partition compatibility (apache#612) * Partition compatability * Partition compatability * Rename compatible_with -> is_compatible_with commit ede4720 Author: Christian <[email protected]> Date: Sun Sep 8 16:49:39 2024 +0200 fix: Less Panics for Snapshot timestamps (apache#614) commit ced661f Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Sun Sep 8 22:43:38 2024 +0800 chore(deps): Bump crate-ci/typos from 1.24.3 to 1.24.5 (apache#616) Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.3 to 1.24.5. - [Release notes](https://github.com/crate-ci/typos/releases) - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md) - [Commits](crate-ci/typos@v1.24.3...v1.24.5) --- updated-dependencies: - dependency-name: crate-ci/typos dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit cbbd086 Author: Xuanwo <[email protected]> Date: Sun Sep 8 10:29:31 2024 +0800 feat: Add more fields in FileScanTask (apache#609) Signed-off-by: Xuanwo <[email protected]> commit 620d58e Author: Callum Ryan <[email protected]> Date: Thu Sep 5 03:44:55 2024 +0100 feat: SQL Catalog - namespaces (apache#534) * feat: SQL Catalog - namespaces Signed-off-by: callum-ryan <[email protected]> * feat: use transaction for updates and creates Signed-off-by: callum-ryan <[email protected]> * fix: pull out query param builder to fn Signed-off-by: callum-ryan <[email protected]> * feat: add drop and tests Signed-off-by: callum-ryan <[email protected]> * fix: String to str, remove pub and optimise query builder Signed-off-by: callum-ryan <[email protected]> * fix: nested match, remove ok() Signed-off-by: callum-ryan <[email protected]> * fix: remove pub, add set, add comments Signed-off-by: callum-ryan <[email protected]> * fix: refactor list_namespaces slightly Signed-off-by: callum-ryan <[email protected]> * fix: add default properties to all new namespaces Signed-off-by: callum-ryan <[email protected]> * fix: remove check for nested namespace Signed-off-by: callum-ryan <[email protected]> * chore: add more comments to the CatalogConfig to explain bind styles Signed-off-by: callum-ryan <[email protected]> * fix: edit test for nested namespaces Signed-off-by: callum-ryan <[email protected]> --------- Signed-off-by: callum-ryan <[email protected]> commit ae75f96 Author: Søren Dalby Larsen <[email protected]> Date: Tue Sep 3 13:46:48 2024 +0200 chore: bump crate-ci/typos to 1.24.3 (apache#598) commit 7aa8bdd Author: Scott Donnelly <[email protected]> Date: Thu Aug 29 04:37:48 2024 +0100 Table Scan: Add Row Group Skipping (apache#558) * feat(scan): add row group and page index row selection filtering * fix(row selection): off-by-one error * feat: remove row selection to defer to a second PR * feat: better min/max val conversion in RowGroupMetricsEvaluator * test(row_group_filtering): first three tests * test(row_group_filtering): next few tests * test: add more tests for RowGroupMetricsEvaluator * chore: refactor test assertions to silence clippy lints * refactor: consolidate parquet stat min/max parsing in one place commit da08e8d Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed Aug 28 14:35:55 2024 +0800 chore(deps): Bump crate-ci/typos from 1.23.6 to 1.24.1 (apache#583) commit ecbb4c3 Author: Sung Yun <[email protected]> Date: Mon Aug 26 23:57:01 2024 -0400 Expose Transforms to Python Binding (apache#556) * bucket transform rust binding * format * poetry x maturin * ignore poetry.lock in license check * update bindings_python_ci to use makefile * newline * python-poetry/poetry#9135 * use hatch instead of poetry * refactor * revert licenserc change * adopt review feedback * comments * unused dependency * adopt review comment * newline * I like this approach a lot better * more tests commit 905ebd2 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Aug 26 20:49:07 2024 +0800 chore(deps): Update typed-builder requirement from 0.19 to 0.20 (apache#582) --- updated-dependencies: - dependency-name: typed-builder dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit f9c92b7 Author: FANNG <[email protected]> Date: Sun Aug 25 22:31:36 2024 +0800 fix: Update sqlx from 0.8.0 to 0.8.1 (apache#584) commit ba66665 Author: FANNG <[email protected]> Date: Sat Aug 24 12:35:36 2024 +0800 fix: correct partition-id to field-id in UnboundPartitionField (apache#576) * correct partition-id to field id in PartitionSpec * correct partition-id to field id in PartitionSpec * correct partition-id to field id in PartitionSpec * xx
commit 2c9d7e4 Author: Christian Thiel <[email protected]> Date: Wed Oct 2 07:51:50 2024 +0200 New builder commit fea1817 Author: Christian Thiel <[email protected]> Date: Tue Oct 1 12:56:10 2024 +0200 Merge branch 'feat/safe-partition-spec' commit 1f49c95 Author: Christian Thiel <[email protected]> Date: Tue Oct 1 12:55:02 2024 +0200 Merge branch 'feat/schema-reassign-field-ids' commit cda4a0c Author: congyi wang <[email protected]> Date: Mon Sep 30 11:07:36 2024 +0800 chore: fix typo in FileIO Schemes (apache#653) * fix typo * fix typo commit af9609d Author: Scott Donnelly <[email protected]> Date: Mon Sep 30 04:06:14 2024 +0100 fix: page index evaluator min/max args inverted (apache#648) * fix: page index evaluator min/max args inverted * style: fix clippy lint in test commit a6a3fd7 Author: Alon Agmon <[email protected]> Date: Sat Sep 28 10:10:08 2024 +0300 test (datafusion): add test for table provider creation (apache#651) * add test for table provider creation * fix formatting * fixing yet another formatting issue * testing schema using data fusion --------- Co-authored-by: Alon Agmon <[email protected]> commit 87483b4 Author: Alon Agmon <[email protected]> Date: Fri Sep 27 04:40:08 2024 +0300 making table provider pub (apache#650) Co-authored-by: Alon Agmon <[email protected]> commit 984c91e Author: ZENOTME <[email protected]> Date: Thu Sep 26 17:56:02 2024 +0800 avoid to create memory schema operator every time (apache#635) Co-authored-by: ZENOTME <[email protected]> commit 4171275 Author: Matheus Alcantara <[email protected]> Date: Wed Sep 25 08:28:42 2024 -0300 scan: change ErrorKind when table dont have spanshots (apache#608) commit ab51355 Author: xxchan <[email protected]> Date: Tue Sep 24 21:25:45 2024 +0800 fix: compile error due to merge stale PR (apache#646) Signed-off-by: xxchan <[email protected]> commit 420b4e2 Author: Scott Donnelly <[email protected]> Date: Tue Sep 24 08:20:23 2024 +0100 Table Scan: Add Row Selection Filtering (apache#565) * feat(scan): add row selection capability via PageIndexEvaluator * test(row-selection): add first few row selection tests * feat(scan): add more tests, fix bug where min/max args swapped * fix: ad test and fix for logic bug in PageIndexEvaluator in-clause handler * feat: changes suggested from PR review commit b3709ba Author: Christian <[email protected]> Date: Tue Sep 24 04:47:04 2024 +0200 feat: Add NamespaceIdent.parent() (apache#641) * Add NamespaceIdent.parent() * Use split_last commit 1533c43 Author: Alon Agmon <[email protected]> Date: Mon Sep 23 13:39:46 2024 +0300 feat (datafusion integration): convert datafusion expr filters to Iceberg Predicate (apache#588) * adding main function and tests * adding tests, removing integration test for now * fixing typos and lints * fixing typing issue * - added support in schmema to convert Date32 to correct arrow type - refactored scan to use new predicate converter as visitor and seperated it to a new mod - added support for simple predicates with column cast expressions - added testing, mostly around date functions * fixing format and lic * reducing number of tests (17 -> 7) * fix formats * fix naming * refactoring to use TreeNodeVisitor * fixing fmt * small refactor * adding swapped op and fixing CR comments --------- Co-authored-by: Alon Agmon <[email protected]> commit e967deb Author: xxchan <[email protected]> Date: Mon Sep 23 18:34:59 2024 +0800 feat: expose remove_all in FileIO (apache#643) Signed-off-by: xxchan <[email protected]> commit d03c4f8 Author: Scott Donnelly <[email protected]> Date: Mon Sep 23 08:28:52 2024 +0100 Migrate to arrow-* v53 (apache#626) * chore: migrate to arrow-* v53 * chore: update datafusion to 42 * test: fix incorrect test assertion * chore: update python bindings to arrow 53 commit 88e5e4a Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Sep 23 15:26:18 2024 +0800 chore(deps): Bump crate-ci/typos from 1.24.5 to 1.24.6 (apache#640) Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.5 to 1.24.6. - [Release notes](https://github.com/crate-ci/typos/releases) - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md) - [Commits](crate-ci/typos@v1.24.5...v1.24.6) --- updated-dependencies: - dependency-name: crate-ci/typos dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit c354983 Author: xxchan <[email protected]> Date: Mon Sep 23 14:50:18 2024 +0800 doc: improve FileIO doc (apache#642) Signed-off-by: xxchan <[email protected]> commit 12e12e2 Author: xxchan <[email protected]> Date: Fri Sep 20 19:59:55 2024 +0800 feat: expose arrow type <-> iceberg type (apache#637) * feat: expose arrow type <-> iceberg type Previously we only exposed the schema conversion. Signed-off-by: xxchan <[email protected]> * add tests Signed-off-by: xxchan <[email protected]> --------- Signed-off-by: xxchan <[email protected]> commit 3b27c9e Author: xxchan <[email protected]> Date: Fri Sep 20 18:32:31 2024 +0800 feat: add Sync to TransformFunction (apache#638) Signed-off-by: xxchan <[email protected]> commit 34cb81c Author: Xuanwo <[email protected]> Date: Wed Sep 18 20:18:40 2024 +0800 chore: Bump opendal to 0.50 (apache#634) commit cde35ab Author: FANNG <[email protected]> Date: Fri Sep 13 10:01:16 2024 +0800 feat: support projection pushdown for datafusion iceberg (apache#594) * support projection pushdown for datafusion iceberg * support projection pushdown for datafusion iceberg * fix ci * fix field id * remove depencences * remove depencences commit eae9464 Author: Xuanwo <[email protected]> Date: Thu Sep 12 02:06:31 2024 +0800 refactor(python): Expose transform as a submodule for pyiceberg_core (apache#628) commit 8a3de4e Author: Christian <[email protected]> Date: Mon Sep 9 14:45:16 2024 +0200 Feat: Normalize TableMetadata (apache#611) * Normalize Table Metadata * Improve readability & comments commit e08c0e5 Author: Renjie Liu <[email protected]> Date: Mon Sep 9 11:57:22 2024 +0800 fix: Correctly calculate highest_field_id in schema (apache#590) commit f78c59b Author: Jack <[email protected]> Date: Mon Sep 9 03:35:16 2024 +0100 feat: add `client.region` (apache#623) commit a5aba9a Author: Christian <[email protected]> Date: Sun Sep 8 18:36:05 2024 +0200 feat: SortOrder methods should take schema ref if possible (apache#613) * SortOrder methods should take schema ref if possible * Fix test type * with_order_id should not take reference commit 5812399 Author: Christian <[email protected]> Date: Sun Sep 8 18:18:41 2024 +0200 feat: partition compatibility (apache#612) * Partition compatability * Partition compatability * Rename compatible_with -> is_compatible_with commit ede4720 Author: Christian <[email protected]> Date: Sun Sep 8 16:49:39 2024 +0200 fix: Less Panics for Snapshot timestamps (apache#614) commit ced661f Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Sun Sep 8 22:43:38 2024 +0800 chore(deps): Bump crate-ci/typos from 1.24.3 to 1.24.5 (apache#616) Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.3 to 1.24.5. - [Release notes](https://github.com/crate-ci/typos/releases) - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md) - [Commits](crate-ci/typos@v1.24.3...v1.24.5) --- updated-dependencies: - dependency-name: crate-ci/typos dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit cbbd086 Author: Xuanwo <[email protected]> Date: Sun Sep 8 10:29:31 2024 +0800 feat: Add more fields in FileScanTask (apache#609) Signed-off-by: Xuanwo <[email protected]> commit 620d58e Author: Callum Ryan <[email protected]> Date: Thu Sep 5 03:44:55 2024 +0100 feat: SQL Catalog - namespaces (apache#534) * feat: SQL Catalog - namespaces Signed-off-by: callum-ryan <[email protected]> * feat: use transaction for updates and creates Signed-off-by: callum-ryan <[email protected]> * fix: pull out query param builder to fn Signed-off-by: callum-ryan <[email protected]> * feat: add drop and tests Signed-off-by: callum-ryan <[email protected]> * fix: String to str, remove pub and optimise query builder Signed-off-by: callum-ryan <[email protected]> * fix: nested match, remove ok() Signed-off-by: callum-ryan <[email protected]> * fix: remove pub, add set, add comments Signed-off-by: callum-ryan <[email protected]> * fix: refactor list_namespaces slightly Signed-off-by: callum-ryan <[email protected]> * fix: add default properties to all new namespaces Signed-off-by: callum-ryan <[email protected]> * fix: remove check for nested namespace Signed-off-by: callum-ryan <[email protected]> * chore: add more comments to the CatalogConfig to explain bind styles Signed-off-by: callum-ryan <[email protected]> * fix: edit test for nested namespaces Signed-off-by: callum-ryan <[email protected]> --------- Signed-off-by: callum-ryan <[email protected]> commit ae75f96 Author: Søren Dalby Larsen <[email protected]> Date: Tue Sep 3 13:46:48 2024 +0200 chore: bump crate-ci/typos to 1.24.3 (apache#598) commit 7aa8bdd Author: Scott Donnelly <[email protected]> Date: Thu Aug 29 04:37:48 2024 +0100 Table Scan: Add Row Group Skipping (apache#558) * feat(scan): add row group and page index row selection filtering * fix(row selection): off-by-one error * feat: remove row selection to defer to a second PR * feat: better min/max val conversion in RowGroupMetricsEvaluator * test(row_group_filtering): first three tests * test(row_group_filtering): next few tests * test: add more tests for RowGroupMetricsEvaluator * chore: refactor test assertions to silence clippy lints * refactor: consolidate parquet stat min/max parsing in one place commit da08e8d Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed Aug 28 14:35:55 2024 +0800 chore(deps): Bump crate-ci/typos from 1.23.6 to 1.24.1 (apache#583) commit ecbb4c3 Author: Sung Yun <[email protected]> Date: Mon Aug 26 23:57:01 2024 -0400 Expose Transforms to Python Binding (apache#556) * bucket transform rust binding * format * poetry x maturin * ignore poetry.lock in license check * update bindings_python_ci to use makefile * newline * python-poetry/poetry#9135 * use hatch instead of poetry * refactor * revert licenserc change * adopt review feedback * comments * unused dependency * adopt review comment * newline * I like this approach a lot better * more tests commit 905ebd2 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Aug 26 20:49:07 2024 +0800 chore(deps): Update typed-builder requirement from 0.19 to 0.20 (apache#582) --- updated-dependencies: - dependency-name: typed-builder dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit f9c92b7 Author: FANNG <[email protected]> Date: Sun Aug 25 22:31:36 2024 +0800 fix: Update sqlx from 0.8.0 to 0.8.1 (apache#584) commit ba66665 Author: FANNG <[email protected]> Date: Sat Aug 24 12:35:36 2024 +0800 fix: correct partition-id to field-id in UnboundPartitionField (apache#576) * correct partition-id to field id in PartitionSpec * correct partition-id to field id in PartitionSpec * correct partition-id to field id in PartitionSpec * xx
* feat(scan): add row selection capability via PageIndexEvaluator * test(row-selection): add first few row selection tests * feat(scan): add more tests, fix bug where min/max args swapped * fix: ad test and fix for logic bug in PageIndexEvaluator in-clause handler * feat: changes suggested from PR review
Row Selection Skipping
This PR adds the PageIndexEvaluator, which applies an iceberg filter predicate to a parquet Page Index, allowing the fetching into memory and decoding of entire slices of rows within a row group to be skipped if the PageIndexEvaluator determines from the Page Index that they don't match the filter. The PageIndexEvaluator builds a RowSelection that it passes to ParquetRecordBatchStreamBuilder::with_row_selection.
Unlike the row group metadata, the Page Index is not parsed by default when reading in the parquet file metadata. When present, especially for files with many columns, it can be quite large and intensive to parse. In fact, during experiments on the performance testing suite on my other PR, it turns out that often the additional time required to parse the Page Index outweighs the time saved fetching, decoding, and filtering all of the rows within a row group. For this reason, I've kept Row Selection Skipping disabled by default. It can be enabled by calling
with_row_selection_enabled(true)
when building the scan (There's a corresponding method on the ArrowReaderBuilder if you are using that directly). If you're interested in this feature, you may want to experiment with thewrite.parquet.page-size-bytes
andwrite.parquet.page-row-limit
table settings (which default to 1Mb and 20,000 rows respectively) in order to get optimal performance, as well as the sort settings for your table.More info can be found in the preceding PR to this one, #558