Speedup init of ReadParquetPyarrowFS #909

Merged: 4 commits into dask:main on Feb 29, 2024
Conversation

@fjetter (Member) commented on Feb 29, 2024

This includes a couple of fixes to the ReadParquetPyarrowFS expr that speed up the initial kick-off significantly. On the TPC-H scale-1000 dataset, creating and optimizing the query for Q1 initially took 10-11s on my machine (with me sitting in Europe and the dataset in us-east-2). With these fixes, we're down to 1-2s, which is on par with the fsspec implementation.
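
For context, a minimal sketch of how such a timing could be reproduced (an illustration, not code from this PR): it measures only building and optimizing the expression, not computing it. It assumes dask-expr is installed, that `read_parquet(..., filesystem="arrow")` selects the pyarrow-filesystem reader (`ReadParquetPyarrowFS`), and it uses a hypothetical S3 path for the scale-1000 lineitem data.

```python
import time

import dask_expr as dx

start = time.perf_counter()

# filesystem="arrow" is assumed to route to the pyarrow-filesystem-based
# reader (ReadParquetPyarrowFS) rather than the fsspec-based one.
lineitem = dx.read_parquet(
    "s3://my-tpch-bucket/scale-1000/lineitem/",  # hypothetical location
    filesystem="arrow",
)

# A Q1-style filter; .optimize() triggers the expression optimization
# whose startup cost this PR reduces.
expr = lineitem[lineitem.l_shipdate <= "1998-09-02"].optimize()

print(f"build + optimize took {time.perf_counter() - start:.1f}s")
```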

@fjetter changed the title from "Speedup parquet init" to "Speedup init of ReadParquetPyarrowFS" on Feb 29, 2024
@@ -2867,7 +2867,7 @@ def are_co_aligned(*exprs, allow_broadcast=True):
     ancestors = [set(non_blockwise_ancestors(e)) for e in exprs]
     unique_ancestors = {
         # Account for column projection within IO expressions
-        _tokenize_partial(item, ["columns", "_series"])
+        _tokenize_partial(item, ["columns", "_series", "_dataset_info_cache"])
@fjetter (Member Author) commented on this change:
See #907 for the root cause discussion. What I'm doing here is a pretty brittle patch imo

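Roughly, what the exclusion above buys (a sketch under assumptions, not dask-expr's actual `_tokenize_partial`): hashing an expression's operands while skipping the named attributes means two IO expressions that differ only in, say, a lazily populated `_dataset_info_cache` or a column projection still produce the same token, so `are_co_aligned` keeps treating them as the same ancestor. The sketch assumes dask-expr expressions expose their operand names via `_parameters` and their values via `operand()`.

```python
from dask.base import tokenize


def tokenize_partial(expr, ignore=("columns", "_series", "_dataset_info_cache")):
    # Simplified stand-in for dask-expr's ``_tokenize_partial``: hash every
    # operand except those listed in ``ignore``, so volatile or cached
    # attributes do not change the resulting token.
    return tokenize(
        *(expr.operand(name) for name in expr._parameters if name not in ignore)
    )
```
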
@phofl merged commit d720db9 into dask:main on Feb 29, 2024 (7 checks passed)
@phofl (Collaborator) commented on Feb 29, 2024

thx

@fjetter deleted the speedup_parquet_init branch on February 29, 2024 at 15:39