Add option for output tables to be written as parquet files. #762

Closed
stefancoe opened this issue Dec 12, 2023 · 0 comments
Labels
Feature New feature or request

Comments

@stefancoe
Contributor

stefancoe commented Dec 12, 2023

Is your feature request related to a problem? Please describe.
Current output options are csv and h5. Parquet is a better file-based option than csv in both size and speed. PSRC's trip table is 2,100,000 KB as a csv, compared to 459,700 KB as a parquet file. Loading the csv file into a pandas DataFrame on my laptop takes 20 seconds, compared to 3 seconds for the parquet file. ActivitySim already uses parquet to store pipeline files.

Describe the solution you'd like
Currently, there is a config setting called 'h5_store' that writes h5 when set to True and csv when set to False or omitted, so csv is the default. I propose adding a setting called 'file_type' that accepts one of three values: 'csv', 'h5', or 'parquet'. Its default would also be 'csv'. The h5_store setting would remain, and its current behavior would be unchanged. The two settings would interact as follows:

  • When h5_store is set to True, outputs are written to h5.
  • When h5_store is set to False (default) and file_type is not specified, outputs are written as csv.
  • When h5_store is set to False (default) and file_type is specified, outputs are written in the format it names: csv, parquet, or h5.
  • file_type is validated against the allowed values (csv, parquet, h5) using pydantic. ActivitySim will fail almost immediately with a useful error message if this setting is given an invalid value.
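
The precedence described in the bullets above can be sketched as a small function. This is only an illustration of the proposed rules, not the code from the fork: the helper name `resolve_output_format` and the constant `ALLOWED_FILE_TYPES` are hypothetical, and a plain membership check stands in for the pydantic validation so the sketch runs without extra dependencies. It also assumes that h5_store set to True wins over any file_type value, which is how the first bullet reads.

```python
# Hypothetical sketch of the proposed h5_store / file_type precedence.
# Names here are illustrative and are not taken from ActivitySim.

ALLOWED_FILE_TYPES = ("csv", "h5", "parquet")


def resolve_output_format(h5_store: bool = False, file_type: str = "csv") -> str:
    """Return the output format implied by the two settings."""
    # Validate first so an invalid value fails fast, even if h5_store is True.
    # (The issue proposes pydantic for this; a plain check stands in here.)
    if file_type not in ALLOWED_FILE_TYPES:
        raise ValueError(
            f"file_type must be one of {ALLOWED_FILE_TYPES}, got {file_type!r}"
        )

    # Existing behavior is unchanged: h5_store=True always writes h5.
    # (Assumption: this takes precedence over file_type.)
    if h5_store:
        return "h5"

    # Otherwise file_type decides, falling back to its default of csv.
    return file_type
```

With these rules, a config that sets neither option still produces csv, and an existing config that sets h5_store to True keeps producing h5 regardless of file_type.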

Describe alternatives you've considered
Another option would be to add a boolean setting such as use_parquet, but conflicts would arise if both it and h5_store were set to True in a config file. If this request is accepted and we go with file_type, it may make sense to deprecate the h5_store setting at some point, especially if more file types are supported in the future.

Additional context
I have made these changes on a fork and will issue a pull request.

@stefancoe stefancoe added the Feature New feature or request label Dec 12, 2023
@stefancoe stefancoe changed the title from "Enable output tables to be written as parquet files." to "Add option for output tables to be written as parquet files." Dec 12, 2023
@dhensle dhensle closed this as completed Oct 1, 2024