Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37630: [C++][Python][Dataset] Allow disabling fragment metadata caching #45330

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Jan 22, 2025

Rationale for this change

Parquet file fragments currently cache their (Parquet) metadata for later accesses when scanning has finished.
This can produce surprisingly high memory consumption in cases where:

  1. the dataset is only scanned once, rather than repeatedly (this is very common)
  2. there is a high metadata-to-data ratio; this can happen when the schemas on disk are very wide, with few rows per file and/or a low number of columns selected for reading

What changes are included in this PR?

Add an option to disable metadata caching on Parquet file fragments.

Are these changes tested?

Yes, by new unit tests. Also, reading a wide dataset locally has been confirmed to consume much less memory when the new option is toggled.

Are there any user-facing changes?

No.

Copy link

⚠️ GitHub issue #37630 has been automatically assigned in GitHub to PR creator.

@pitrou
Copy link
Member Author

pitrou commented Jan 29, 2025

In #45287 (comment) it was mentioned that clearing physical_schema_ further improved the memory profile, but it's not obvious that it would be correct.

@icexelloss
Copy link
Contributor

I wonder if we can have a mode to release the fragment once the data for that fragment has been read in a "scan once" usage pattern. But also I don't know how hard it is to change that.

Per physical_schema_ - I am not familiar with the details but it seems fine to me if this is "scan once" and the fragment is not used again?

@timothydijamco
Copy link

I posted some thoughts about clearing physical_schema_ in #45287 (comment)

@pitrou pitrou force-pushed the gh37630-dataset-cached-metadata branch from 01bb19e to 09755d6 Compare February 4, 2025 09:59
@pitrou
Copy link
Member Author

pitrou commented Feb 4, 2025

I added a change that clears the cached physical schema, but keeps the original schema when it was passed via the constructor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants