This repository contains code used to compare various different enzyme-substrate promiscuity strategies on family-wide enzyme screening data.
A python environment can be created directly using the environment.yml
file included:
conda env create -f environment.yml
Once an enviornment has been activated, the package can be installed with:
python setup.py install
All featurizations can be handeled directly by the build_features
file. Due to the cost of repeatedly using language models to featurize proteins, features are automatically cached in data/program_cache
for later use if the --cache-dir
argument is set in the program.
Precomputed JT-VAE embeddings were also precomputed using a forked repositoryfrom the original JT-VAE paper. These can be found in data/processed/precomputed_features/
.
The datasets used in this study and the corresponding structure reference files can be downloaded from the following github repository: https://github.com/samgoldman97/enzyme-datasets
, which contains instructions for how datasets, alignments, and structure references were created and processed.
These dataset files are also included within this package directly for convenience in data/processed/
.
A simple, example program can be executed using the following run call:
python run_scripts/run_combinations_slurm.py configs/2021_06_30_example_launch.json
This will launch an evaluation of a KNN based model that uses the levenshtein distance between enzymes sequences to make predictions about held out enzyme activity for each substrate in the esterase_binary dataset.
Experiments can be run using python train_model.py
. Experiments can also be run from config files located in configs using the launcher scripts contained in run_scripts
. Specifically, python run_scripts/run_combinations_slurm.py [config file]
will launch the expriments defined in the config file, with instructions for config files contained at the top of run_combinations_slurm.py
. The config files have an optional flag to run the program on a SLURM cluster for parallelization as done in the original study.
The various provided config files are detailed here:
-
configs/2021_05_25_psar_olea_hyperopt.json
: Perform hyperoptimization for various model types on the OleA dataset for PSAR models that try to generalize to new enzymes. -
configs/2021_05_25_qsar_olea_hyperopt.json
: Perform hyperoptimization for various model types on the OleA dataset for QSAR models that try to generalize to new substrates. -
configs/2021_05_27_psar_multi.json
: Use the resulting hyperoptimized parameters to run PSAR analyses on all other datasets. -
configs/2021_05_28_qsar_multi.json
: Use the resulting hyperoptimized parameters to run QSAR analyses. -
configs/2021_05_25_psar_olea_hyperopt.json
: Use the resulting PSAR hyperoptimized parameters to run pooling comparison experiments in the PSAR direction. -
configs/2021_06_30_example_launch.json
: Run an example program launch
After completing a set of experiments, all the results entries from the specific experiment can be collected into a single results file using the script run_scripts/combine_csvs.py
. For instance, to combine any experiments in the example launch:
python run_scripts/combine_csvs.py --results-dir results/dense/2021_06_30_example_launch -
-out-file results/dense/2021_06_30_example_launch/combined_csv.csv
These combined results files are used
Figures can be constructed usign the scripts contained in the folder make_figs/
. Assumign the proper folders. All figure scripts can be run using the command:
source make_figs/make_all_figs.sh