GH-37630: [C++][Python][Dataset] Allow disabling fragment metadata caching #45330
In #45287 (comment) it was mentioned that clearing |
I wonder if we can have a mode to release the fragment once the data for that fragment has been read in a "scan once" usage pattern. But also I don't know how hard it is to change that. Per |
I posted some thoughts about clearing |
I added a change that clears the cached physical schema, but keeps the original schema when it was passed via the constructor.
Rationale for this change
Parquet file fragments currently cache their Parquet metadata so that later accesses can reuse it, and they keep it cached even after scanning has finished.
This can produce surprisingly high memory consumption in cases where:
What changes are included in this PR?
Add an option to disable metadata caching on Parquet file fragments.
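To illustrate the trade-off, here is a minimal toy model in plain Python. It is not Arrow's API: `ToyFragment`, `scan`, the `cache_metadata` flag, and `clear_cached_metadata` are hypothetical names chosen for this sketch. It assumes the behavior described in this PR and its comments: metadata is loaded lazily, can be dropped after a scan, and a schema passed via the constructor is kept even when the cached physical schema is cleared.

```python
from typing import Callable, List, Optional


class ToyFragment:
    """Hypothetical stand-in for a file fragment; not Arrow's real API."""

    def __init__(self, load_metadata: Callable[[], dict],
                 schema: Optional[List[str]] = None):
        self._load_metadata = load_metadata
        self._given_schema = schema    # passed by the caller, never cleared
        self._metadata = None          # cached metadata (can be large)
        self._physical_schema = None   # cached schema derived from metadata

    def ensure_metadata(self) -> dict:
        # Load the metadata on first use and remember it.
        if self._metadata is None:
            self._metadata = self._load_metadata()
            self._physical_schema = self._metadata["schema"]
        return self._metadata

    def clear_cached_metadata(self) -> None:
        # Drop everything derived from the file; keep the caller's schema.
        self._metadata = None
        self._physical_schema = None

    def scan(self, cache_metadata: bool = True) -> int:
        # "Scanning" here only reads the row count from the metadata.
        num_rows = self.ensure_metadata()["num_rows"]
        if not cache_metadata:
            self.clear_cached_metadata()
        return num_rows

    @property
    def schema(self) -> Optional[List[str]]:
        if self._given_schema is not None:
            return self._given_schema
        return self._physical_schema

    @property
    def has_cached_metadata(self) -> bool:
        return self._metadata is not None


def make_loader(counter: List[int]) -> Callable[[], dict]:
    # Returns a loader that records each (simulated) metadata read.
    def load() -> dict:
        counter.append(1)
        return {"schema": ["a", "b"], "num_rows": 3}
    return load
```

In this model, a fragment scanned with `cache_metadata=False` re-reads its metadata on every scan but holds nothing in memory between scans, which is the trade-off a "scan once" workload over a wide dataset would want.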
Are these changes tested?
Yes, by new unit tests. Also, reading a wide dataset locally has been confirmed to consume much less memory when metadata caching is disabled via the new option.
Are there any user-facing changes?
No.