Speedup init of ReadParquetPyarrowFS #909

Merged: 4 commits into dask:main on Feb 29, 2024
Conversation

@fjetter (Member) commented on Feb 29, 2024

This includes a couple of fixes to the ReadParquetPyarrowFS expr that speed up the initial kick-off significantly. On the TPC-H scale-1000 dataset, creating and optimizing the query for Q1 initially took 10-11s on my machine (with me sitting in Europe and the dataset in us-east-2). With these fixes, we're down to 1-2s, which is on par with the fsspec implementation.
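
For context, a minimal sketch of how such a timing could be reproduced (an illustration, not code from this PR): it measures only building and optimizing the expression, not computing it. It assumes dask-expr is installed, that `read_parquet(..., filesystem="arrow")` selects the pyarrow-filesystem reader (`ReadParquetPyarrowFS`), and it uses a hypothetical S3 path for the scale-1000 lineitem data.

```python
import time

import dask_expr as dx

start = time.perf_counter()

# filesystem="arrow" is assumed to route to the pyarrow-filesystem-based
# reader (ReadParquetPyarrowFS) rather than the fsspec-based one.
lineitem = dx.read_parquet(
    "s3://my-tpch-bucket/scale-1000/lineitem/",  # hypothetical location
    filesystem="arrow",
)

# A Q1-style filter; .optimize() triggers the expression optimization
# whose startup cost this PR reduces.
expr = lineitem[lineitem.l_shipdate <= "1998-09-02"].optimize()

print(f"build + optimize took {time.perf_counter() - start:.1f}s")
```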

@fjetter changed the title from "Speedup parquet init" to "Speedup init of ReadParquetPyarrowFS" on Feb 29, 2024
@@ -2867,7 +2867,7 @@ def are_co_aligned(*exprs, allow_broadcast=True):
     ancestors = [set(non_blockwise_ancestors(e)) for e in exprs]
     unique_ancestors = {
         # Account for column projection within IO expressions
-        _tokenize_partial(item, ["columns", "_series"])
+        _tokenize_partial(item, ["columns", "_series", "_dataset_info_cache"])
@fjetter (Member Author) commented on this change:
See #907 for the root cause discussion. What I'm doing here is a pretty brittle patch imo

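Roughly, what the exclusion above buys (a sketch under assumptions, not dask-expr's actual `_tokenize_partial`): hashing an expression's operands while skipping the named attributes means two IO expressions that differ only in, say, a lazily populated `_dataset_info_cache` or a column projection still produce the same token, so `are_co_aligned` keeps treating them as the same ancestor. The sketch assumes dask-expr expressions expose their operand names via `_parameters` and their values via `operand()`.

```python
from dask.base import tokenize


def tokenize_partial(expr, ignore=("columns", "_series", "_dataset_info_cache")):
    # Simplified stand-in for dask-expr's ``_tokenize_partial``: hash every
    # operand except those listed in ``ignore``, so volatile or cached
    # attributes do not change the resulting token.
    return tokenize(
        *(expr.operand(name) for name in expr._parameters if name not in ignore)
    )
```
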
@phofl merged commit d720db9 into dask:main on Feb 29, 2024 (7 checks passed)
@phofl (Collaborator) commented on Feb 29, 2024

thx

@fjetter deleted the speedup_parquet_init branch on February 29, 2024 at 15:39