Problem Description

Create the benchmark_single_table method, which will benchmark a single-table model on the SDV demo datasets. This is very similar to our existing benchmark method, but with some API changes.

New functionality:
Instead of only running metrics, we should be able to run reports and include the overall report score in the benchmarking results. For now, we will add an option to include the Quality Report's overall score.
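For concreteness, the overall score could come from the SDMetrics single-table Quality Report. The sketch below assumes that SDMetrics API and uses a hypothetical helper name; it is not the final SDGym implementation.

```python
import pandas as pd
from sdmetrics.reports.single_table import QualityReport


def compute_quality_score(real_data: pd.DataFrame,
                          synthetic_data: pd.DataFrame,
                          metadata: dict) -> float:
    """Return the Quality Report's overall score for one dataset/synthesizer pair.

    Hypothetical helper: the benchmark would do something equivalent internally
    whenever evaluate_quality=True.
    """
    report = QualityReport()
    report.generate(real_data, synthetic_data, metadata, verbose=False)
    return report.get_score()
```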
Expected behavior
sdgym.benchmark_single_table
synthesizers: A list of synthesizer classes to use.
    (default) [GaussianCopulaSynthesizer, FASTMLPreset, CTGANSynthesizer]
sdv_datasets: A list of strings corresponding to the SDV demo datasets.
    (default) [adult, alarm, census, child, expedia_hotel_logs, insurance, intrusion, news, covtype]
    Options: Various strings, corresponding to single-table datasets
    None: Do not use any SDV datasets
additional_datasets_folder: A string containing the path to a folder (local or an S3 bucket). Datasets found in this folder are run in addition to the SDV datasets.
    (default) None: Do not run any additional datasets
limit_dataset_size: Use this flag to limit the size of the datasets for faster evaluation.
    (default) False: Do not limit the size. Use the full dataset.
    True: Limit the size of every table to 100 rows (randomly sampled) and the first 10 columns.
evaluate_quality: A boolean representing whether or not to evaluate an overall quality score.
    (default) True: Compute an overall data quality score
    False: Do not evaluate data quality. If you set this, only performance and error-related information will be available.
sdmetrics: A list of strings corresponding to the different SDMetrics to use. To pass specific parameters to a metric, provide a tuple with the metric name followed by a dictionary of the parameters.
    (default) [('NewRowSynthesis', {'synthetic_sample_size': 1_000})]: Do not run any additional metrics. Only show the basic scores.
    Options: Various (see the SDMetrics guide for details)
timeout: The maximum number of seconds to wait for synthetic data creation.
    (default) None: Run for as long as it takes to finish the synthetic data creation
output_filepath: A file path for where to write the output (as a CSV file).
    (default) None: Do not write the output anywhere
    <file path>: Write the output to this file as a CSV file
detailed_results_folder: The folder in which to store the intermediary results.
    (default) None: Do not store the intermediary results anywhere
    <folder path>: Store the intermediary results in this folder. If the program crashes or takes a long time to run, you can view the results in the folder at any time.
show_progress: Same as before.

Returns a pandas.DataFrame with the results from running the benchmark.
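To make the proposed API concrete, here is a hypothetical call following the signature described above. The exact argument forms (e.g. synthesizer classes vs. names), the S3 path, and the timeout value are illustrative assumptions, not part of the spec.

```python
import sdgym

# Hypothetical usage of the proposed method. Synthesizers are passed here by
# name for brevity; the issue describes them as classes.
results = sdgym.benchmark_single_table(
    synthesizers=['GaussianCopulaSynthesizer', 'CTGANSynthesizer'],
    sdv_datasets=['adult', 'census', 'insurance'],
    additional_datasets_folder='s3://my-bucket/my-datasets/',  # assumed bucket path
    limit_dataset_size=True,    # subsample every table to 100 rows / 10 columns
    evaluate_quality=True,      # include the Quality Report's overall score
    sdmetrics=[('NewRowSynthesis', {'synthetic_sample_size': 1_000})],
    timeout=600,                # give up on synthetic data creation after 10 minutes
    output_filepath='benchmark_results.csv',
    detailed_results_folder='detailed_results/',
    show_progress=True,
)

print(results.head())  # results is a pandas.DataFrame
```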