Create `benchmark_single_table` method #151

katxiao · 2022-10-26T16:48:50Z

Problem Description

Create the benchmark_single_table method which will benchmark a single table model on the SDV demo datasets. This is very similar to our existing benchmark method, but with some API changes.

New functionality:
Instead of only running metrics, we should be able to run reports and include the overall report score in the benchmarking results. For now, we will add an option to include the Quality Report's overall score.

Expected behavior

sdgym.benchmark_single_table

synthesizers: A list of synthesizer classes to use.
- (default) [GaussianCopulaSynthesizer, FASTMLPreset, CTGANSynthesizer]
- Options: any of the SDV synthesizers, as available through SDGym
sdv_datasets: A list of strings corresponding to the SDV demo datasets
- (default) [adult, alarm, census, child, expedia_hotel_logs, insurance, intrusion, news, covtype]
- Options: Various strings, corresponding to single-table datasets
- Use None to disable using any sdv datasets
additional_datasets_folder: A string containing the path to a folder (local or an S3 bucket). Datasets found in this folder are run in addition to the SDV datasets.
- (default) None: Do not run any additional datasets
limit_dataset_size: Use this flag to limit the size of the datasets for faster evaluation
- (default) False: Do not limit the size. Use the full dataset
- True: Limit the size of every table to 100 rows (randomly sampled) and the first 10 columns.
evaluate_quality: A boolean representing whether or not to evaluate an overall quality score
- (default) True: Compute an overall data quality score
- False: Do not evaluate data quality. If you set this, only performance and error-related information will be available.
sdmetrics: A list of strings corresponding to the different SDMetrics to use. If you'd like to input specific parameters into the metric, provide a tuple with the metric name followed by a dictionary of the parameters.
- (default) [('NewRowSynthesis', {'synthetic_sample_size': 1_000})]: Run NewRowSynthesis with a synthetic sample size of 1,000 and no other additional metrics. Otherwise, only show the basic scores
- Options: Various (see SDMetrics guide for details)
timeout: The maximum number of seconds to wait for synthetic data creation
- (default) None: Run for as long as it takes to finish the synthetic data creation
output_filepath: A file path for where to write the output (as a csv file)
- (default) None: Do not write the output anywhere
- A file path: Write the output to this file as a CSV file
detailed_results_folder: The folder for where to store the intermediary results
- (default) None: Do not store the intermediary results anywhere
- A folder path: Store the intermediary results in this folder. If the program crashes or takes a long time to run, you can view the results in the folder at any time.
show_progress: Same as before
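
As a rough illustration of two of the options above, here is a minimal sketch of how `sdmetrics` entries and `limit_dataset_size` might be interpreted. The helper names `normalize_sdmetrics` and `limit_table` are hypothetical and not part of SDGym, and a table is modeled as a plain list of row dictionaries so that the sketch has no dependencies. A real implementation would presumably operate on pandas DataFrames.

```python
import random
from typing import Any, Dict, List, Optional, Tuple, Union

# An sdmetrics entry is either a metric name or a (name, parameters) tuple.
MetricSpec = Union[str, Tuple[str, Dict[str, Any]]]


def normalize_sdmetrics(sdmetrics: List[MetricSpec]) -> List[Tuple[str, Dict[str, Any]]]:
    """Hypothetical helper: turn every entry into a (name, parameters) pair."""
    normalized = []
    for entry in sdmetrics:
        if isinstance(entry, str):
            # A bare metric name means "use the metric's default parameters".
            normalized.append((entry, {}))
        else:
            name, params = entry
            normalized.append((name, dict(params)))
    return normalized


def limit_table(
    rows: List[Dict[str, Any]],
    max_rows: int = 100,
    max_columns: int = 10,
    seed: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep the first `max_columns` columns and a
    random sample of at most `max_rows` rows, as limit_dataset_size=True describes."""
    if not rows:
        return []
    columns = list(rows[0])[:max_columns]
    rng = random.Random(seed)
    sampled = rows if len(rows) <= max_rows else rng.sample(rows, max_rows)
    return [{column: row[column] for column in columns} for row in sampled]
```

Tables that already fit within the limits are returned with all of their rows, in their original order.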

Returns a pandas.DataFrame with the results from running the benchmark
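
The parameters and defaults listed above can be summarized as a Python signature. The following is only a sketch: `resolve_benchmark_options` is a hypothetical stand-in that collects the options into a dictionary instead of running a benchmark or returning a DataFrame. Synthesizers are given as names rather than classes to keep the sketch dependency-free, and `show_progress` is left out because the issue does not restate its default.

```python
from typing import Any, Dict, Optional, Sequence, Tuple, Union

MetricSpec = Union[str, Tuple[str, Dict[str, Any]]]

# Defaults copied from the issue text (tuples, so they are safe as default arguments).
DEFAULT_SYNTHESIZERS = ("GaussianCopulaSynthesizer", "FASTMLPreset", "CTGANSynthesizer")
DEFAULT_SDV_DATASETS = (
    "adult", "alarm", "census", "child", "expedia_hotel_logs",
    "insurance", "intrusion", "news", "covtype",
)
DEFAULT_SDMETRICS = (("NewRowSynthesis", {"synthetic_sample_size": 1_000}),)


def resolve_benchmark_options(
    synthesizers: Sequence[str] = DEFAULT_SYNTHESIZERS,
    sdv_datasets: Optional[Sequence[str]] = DEFAULT_SDV_DATASETS,
    additional_datasets_folder: Optional[str] = None,
    limit_dataset_size: bool = False,
    evaluate_quality: bool = True,
    sdmetrics: Sequence[MetricSpec] = DEFAULT_SDMETRICS,
    timeout: Optional[int] = None,
    output_filepath: Optional[str] = None,
    detailed_results_folder: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical stand-in mirroring the proposed signature of
    sdgym.benchmark_single_table; it only gathers the resolved options."""
    return {
        "synthesizers": list(synthesizers),
        # None disables the SDV demo datasets entirely.
        "sdv_datasets": [] if sdv_datasets is None else list(sdv_datasets),
        "additional_datasets_folder": additional_datasets_folder,
        "limit_dataset_size": limit_dataset_size,
        "evaluate_quality": evaluate_quality,
        "sdmetrics": list(sdmetrics),
        "timeout": timeout,
        "output_filepath": output_filepath,
        "detailed_results_folder": detailed_results_folder,
    }


# Example: skip the demo datasets, use a local folder, and cap each run at 10 minutes.
options = resolve_benchmark_options(
    sdv_datasets=None,
    additional_datasets_folder="my_datasets/",  # hypothetical path
    timeout=600,
)
```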


katxiao added the feature request label Oct 26, 2022

katxiao mentioned this issue Dec 15, 2022

Create single table benchmarking method #164

Merged

katxiao self-assigned this Dec 16, 2022

katxiao added this to the 0.6.0 milestone Dec 16, 2022

katxiao closed this as completed in #164 Dec 19, 2022
