Add option for output tables to be written as parquet files. #762

stefancoe · 2023-12-12T17:14:40Z

Is your feature request related to a problem? Please describe.
The current output options are csv and h5. Parquet is a better file-based option than csv in both size and speed. PSRC's trip table is 2,100,000 KB as a csv compared to 459,700 KB as a parquet file (roughly 4.6 times smaller). Loading the csv file into a pandas DataFrame on my laptop takes 20 seconds, compared to 3 seconds for the parquet file. ActivitySim already uses parquet to store pipeline files.

Describe the solution you'd like
Currently, there is a config setting called 'h5_store' that writes h5 when set to True and csv when set to False or omitted, so csv is the default. I propose adding a setting called 'file_type' that would allow 3 options: 'csv', 'h5', or 'parquet'. Its default would also be 'csv'. The h5_store setting would remain and its current behavior would be unchanged. The two settings would interact as follows:

When h5_store is set to True, outputs are written to h5.
When h5_store is set to False (default) and file_type is not specified, outputs are written as csv.
When h5_store is set to False (default) and file_type is specified, outputs are written in the format it names: csv, parquet, or h5.
file_type is validated against the allowed values (csv, parquet, h5) using pydantic. If the setting is included with an invalid value, ActivitySim will stop almost immediately with a useful error message.
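
The rules above can be sketched in a few lines of Python. This is only an illustration, not the code in the pull request: the class name `OutputTableSettings` and the function `resolve_output_format` are hypothetical, and a plain dataclass check stands in for the pydantic validation so the snippet has no third-party dependencies. It also assumes, following the first rule, that h5_store set to True wins even when file_type names another format.

```python
from dataclasses import dataclass

# Allowed values for the proposed file_type setting, as listed in this issue.
ALLOWED_FILE_TYPES = ("csv", "parquet", "h5")


@dataclass(frozen=True)
class OutputTableSettings:
    """Hypothetical stand-in for the relevant output-table settings."""

    h5_store: bool = False   # existing flag; False is the default
    file_type: str = "csv"   # proposed setting; csv is the default

    def __post_init__(self) -> None:
        # Fail fast on an unsupported value, mimicking the proposed validation.
        # (The issue proposes pydantic for this; a plain check is used here.)
        if self.file_type not in ALLOWED_FILE_TYPES:
            raise ValueError(
                f"file_type must be one of {ALLOWED_FILE_TYPES}, "
                f"got {self.file_type!r}"
            )


def resolve_output_format(settings: OutputTableSettings) -> str:
    """Return the output format implied by the precedence rules above."""
    if settings.h5_store:
        # Existing behavior is unchanged: h5_store=True always means h5.
        return "h5"
    # Otherwise file_type decides, falling back to its csv default.
    return settings.file_type
```

With this sketch, `resolve_output_format(OutputTableSettings(file_type="parquet"))` returns `"parquet"`, while constructing `OutputTableSettings(file_type="feather")` raises a `ValueError` before any model step runs.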

Describe alternatives you've considered
Another option would be to add a boolean setting like use_parquet, but conflicts would arise if both settings were set to True in a config file. If this request is accepted and we go with file_type, it may make sense to deprecate the h5_store setting at some point, especially if more file types are supported in the future.

Additional context
I have made these changes on a fork and will issue a pull request.

stefancoe added the Feature New feature or request label Dec 12, 2023

stefancoe changed the title ~~Enable output tables to be written as parquet files.~~ Add option for output tables to be written as parquet files. Dec 12, 2023

stefancoe mentioned this issue Dec 12, 2023

Option to write output tables as parquet files #763

Merged

dhensle closed this as completed Oct 1, 2024
