Is your feature request related to a problem? Please describe.
Output tables can currently be written only as csv or h5. Parquet is a better file-based option than csv for both size and speed. PSRC's trip table is 2,100,000 KB as a csv compared to 459,700 KB as a parquet file. Loading the csv file into a pandas DataFrame on my laptop takes 20 seconds, compared to 3 seconds for the parquet file. ActivitySim already uses parquet to store pipeline files.
Describe the solution you'd like
Currently, there is a config setting called 'h5_store' that writes outputs to h5 when set to True and to csv when set to False or omitted, so csv is the default. I propose adding a setting called 'file_type' that allows three options: 'csv', 'h5', or 'parquet'. Its default would also be 'csv'. The h5_store setting would remain, and its current behavior would be unchanged. The two settings would interact as follows:
When h5_store is set to True, outputs are written to h5.
When h5_store is set to False (default) and file_type is not specified, outputs are written as csv.
When h5_store is set to False (default) and file_type is specified, outputs are written in the format it names: csv, parquet, or h5.
file_type is validated against the allowed values (csv, parquet, h5) using pydantic. ActivitySim will fail almost immediately with a useful error message if this setting is given an invalid value.
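The precedence rules above can be sketched roughly as follows. This is only an illustration of the proposed behavior, not the code on the fork: the function name resolve_output_file_type and the constant ALLOWED_FILE_TYPES are hypothetical, and a plain ValueError stands in for the pydantic validation so the sketch has no dependencies.

```python
# Hypothetical names; this mirrors the rules described in the issue,
# not the actual ActivitySim implementation.
ALLOWED_FILE_TYPES = ("csv", "h5", "parquet")


def resolve_output_file_type(h5_store: bool = False, file_type: str = "csv") -> str:
    """Return the format that output tables would be written in."""
    # Validate first so that a wrong value fails fast with a clear message,
    # even when h5_store would otherwise take precedence.
    if file_type not in ALLOWED_FILE_TYPES:
        raise ValueError(
            f"file_type must be one of {ALLOWED_FILE_TYPES}, got {file_type!r}"
        )
    # h5_store keeps its existing behavior: True always means h5.
    if h5_store:
        return "h5"
    # Otherwise file_type decides, falling back to its 'csv' default.
    return file_type
```

With these assumptions, a config that sets neither option resolves to csv, a config with h5_store set to True resolves to h5 regardless of file_type, and a config with only file_type set to 'parquet' resolves to parquet.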
Describe alternatives you've considered
Another option would be to add a boolean setting like use_parquet, but conflicts would arise if both it and h5_store were set to True in a config file. If this request is accepted and we go with file_type, it may make sense to deprecate the h5_store setting at some point, especially if more file types are supported in the future.
Additional context
I have made these changes on a fork and will issue a pull request.
stefancoe changed the title from "Enable output tables to be written as parquet files." to "Add option for output tables to be written as parquet files." on Dec 12, 2023