Improving the performance of models for one-step retrosynthesis through re-ranking

Retrosynthesis is at the core of organic chemistry. Recently, the rapid growth of artificial intelligence (AI) has spurred a variety of novel machine learning approaches for data-driven synthesis planning. These methods learn complex patterns from reaction databases in order to predict, for a given product, sets of reactants that can be used to synthesise that product. However, their performance as measured by the top-N accuracy in matching published reaction precedents still leaves room for improvement. This work aims to enhance these models by learning to re-rank their reactant predictions. Specifically, we design and train an energy-based model to re-rank, for each product, the published reaction as the top suggestion and the remaining reactant predictions as lower-ranked. We show that re-ranking can improve one-step models significantly using the standard USPTO-50k benchmark dataset, such as RetroSim, a similarity-based method, from 35.7 to 51.8% top-1 accuracy and NeuralSym, a deep learning method, from 45.7 to 51.3%, and also that re-ranking the union of two models’ suggestions can lead to better performance than either alone. However, the state-of-the-art top-1 accuracy is not improved by this method.
If you have used our code or referred to our paper, we would appreciate it if you could cite our work:
@article{lin2022improving,
title={Improving the performance of models for one-step retrosynthesis through re-ranking},
author={Lin, Min Htoo and Tu, Zhengkai and Coley, Connor W},
journal={Journal of cheminformatics},
volume={14},
number={1},
pages={1--13},
year={2022},
publisher={Springer}
}
Lin, M.H., Tu, Z. & Coley, C.W. Improving the performance of models for one-step retrosynthesis through re-ranking. J Cheminform 14, 15 (2022).
# ensure conda is already initialized
bash setup.sh
conda activate rxnebm
To get the results for our paper, we train each of the 4 one-step models on 3 seeds. Specifically, we used the following:
- GLN: 19260817, 20210423, 77777777
- RetroXpert: 11111111, 20210423, 77777777
- NeuralSym: 0, 20210423, 77777777
- RetroSim: no random seed needed
Thus, we have 3 sets of CSV files (train + valid + test) per one-step model (except RetroSim, which has no seed), which belong in rxnebm/data/cleaned_data/. We then train one EBM re-ranker with a specified random seed (ebm_seed) on one set of CSV files, for a total of 3 repeats per one-step model, e.g. Graph-EBM (seed 0) on NeuralSym seed 0, Graph-EBM (seed 20210423) on NeuralSym seed 20210423, and Graph-EBM (seed 77777777) on NeuralSym seed 77777777. For GLN seed 19260817 and RetroXpert seed 11111111, we use ebm_seed = 0. For RetroSim, we use ebm_seed values of 0, 20210423 and 77777777. We provide all 39 proposal CSV files on both figshare and Google Drive. For a useful tool to download an entire folder to a Linux server, see prasmussen's gdrive.
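The seed pairing described above can be enumerated with a short sketch. The seed values come from the lists above; the `Run` tuple and `enumerate_runs` are hypothetical names used only for illustration and are not part of the repository.

```python
from typing import List, NamedTuple, Optional


class Run(NamedTuple):
    proposer: str
    proposer_seed: Optional[int]  # None for RetroSim, which has no seed
    ebm_seed: int


# Seeds listed above for each seeded one-step model.
PROPOSER_SEEDS = {
    "GLN": [19260817, 20210423, 77777777],
    "RetroXpert": [11111111, 20210423, 77777777],
    "NeuralSym": [0, 20210423, 77777777],
}
EBM_SEEDS = [0, 20210423, 77777777]


def enumerate_runs() -> List[Run]:
    """List every (proposer, proposer_seed, ebm_seed) combination described above."""
    runs = []
    for proposer, seeds in PROPOSER_SEEDS.items():
        # The i-th proposer seed is paired with the i-th EBM seed, so
        # GLN 19260817 and RetroXpert 11111111 are both paired with ebm_seed 0.
        for proposer_seed, ebm_seed in zip(seeds, EBM_SEEDS):
            runs.append(Run(proposer, proposer_seed, ebm_seed))
    # RetroSim has no seed of its own; only the EBM seed varies.
    for ebm_seed in EBM_SEEDS:
        runs.append(Run("RetroSim", None, ebm_seed))
    return runs


if __name__ == "__main__":
    for run in enumerate_runs():
        print(run)
```

This gives 3 repeats for each of the 4 one-step models, i.e. 12 runs per EBM architecture.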
The training proposal CSV files are quite large (~200 MB), so please ensure you have enough storage space (4.4 GB total). Note that we have not uploaded fingerprints and graph features, as these files are much larger. The graph features (train + val + test) can take up as much as 30 GB, while the fingerprints take ~1 GB. See each proposer section below for how to generate them yourself. If there is enough demand for us to upload these (very big) files, we may consider doing so.
Before training, ensure you have 1) the 3 CSV files and 2) the 3 precomputed reaction data files (be it fingerprints, rxn_smi, graphs etc.). See the proposer sections below for how we generate the reaction data files for a proposer. Note that <ebm_seed> refers to the random seed to be used for training the EBM re-ranker, and <proposer_seed> refers to the random seed that was used to train the one-step model.
Note: As RetroSim has no random seed, you do not need to provide <proposer_seed>.
If you are reloading a trained checkpoint for whatever reason, you additionally need to provide --old_expt_name <name>, --date_trained <DD_MM_YYYY> and --load_checkpoint.
For FF-EBM
bash scripts/<proposer>/FeedforwardEBM.sh <ebm_seed> <proposer_seed>
For Graph-EBM
bash scripts/<proposer>/GraphEBM.sh <ebm_seed> <proposer_seed>
For Transformer-EBM (note that this yields poor results and we only report results on RetroSim). To train this, you just need the 3 CSV files, e.g. rxnebm/data/cleaned_data/retrosim_200topk_200maxk_noGT_<phase>.csv
bash scripts/retrosim/TransformerEBM.sh <ebm_seed>
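As a rough illustration of how the three training commands above are assembled, the sketch below builds the argument list for a given proposer and EBM type. The helper `training_command` is hypothetical (it is not part of the repository); the script names and directory names are the ones that appear in this README.

```python
from typing import List, Optional

# Script names taken from the commands above.
EBM_SCRIPTS = {"FeedforwardEBM", "GraphEBM", "TransformerEBM"}


def training_command(
    proposer_dir: str,
    ebm_script: str,
    ebm_seed: int,
    proposer_seed: Optional[int] = None,
) -> List[str]:
    """Build `bash scripts/<proposer>/<EBM>.sh <ebm_seed> [<proposer_seed>]`."""
    if ebm_script not in EBM_SCRIPTS:
        raise ValueError(f"unknown EBM script: {ebm_script}")
    cmd = ["bash", f"scripts/{proposer_dir}/{ebm_script}.sh", str(ebm_seed)]
    # RetroSim has no random seed, so <proposer_seed> is simply omitted.
    if proposer_seed is not None:
        cmd.append(str(proposer_seed))
    return cmd


if __name__ == "__main__":
    print(" ".join(training_command("neuralsym", "GraphEBM", 0, 0)))
    print(" ".join(training_command("retrosim", "TransformerEBM", 0)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root once the rxnebm conda environment is active.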
The data was obtained from the Dropbox folder provided by the authors of GLN.
We renamed these 3 CSV files from raw_{phase}.csv to 'schneider50k_train.csv', 'schneider50k_test.csv' and 'schneider50k_valid.csv', and saved them to rxnebm/data/original_data (already included in this repo).
For the re-ranking task, we trained four different retrosynthesis models. We use a single, extra-clean USPTO_50k dataset, split roughly 80/10/10. The splits are derived from the three schneider50k_{phase}.csv files, using the script rxnebm/data/preprocess/clean_smiles.py, i.e.
python -m rxnebm.data.preprocess.clean_smiles
This data has been included in this repository under rxnebm/data/cleaned_data/ as 50k_clean_rxnsmi_noreagent_allmapped_cano_{phase}.pickle.
Note that these 3 .pickle files are extremely important, as we will use them as inputs to generate proposals & ground-truth for each one-step model.
Specifically, we perform these steps:
- Keep all atom mapping
- Remove reaction SMILES strings with product molecules that are too small and clearly incorrect. The criterion used was len(prod_smi) < 3. 4 reaction SMILES strings were caught by this criterion:
'CN[C@H]1CC[C@@H](c2ccc(Cl)c(Cl)c2)c2ccc([I:19])cc21>>[IH:19]'
'O=C(CO)N1CCC(C(=O)[OH:28])CC1>>[OH2:28]'
'CC(=O)[Br:4]>>[BrH:4]'
'O=C(Cn1c(-c2ccc(Cl)c(Cl)c2)nc2cccnc21)[OH:10]>>[OH2:10]'
- Remove all duplicate reaction SMILES strings
- Remove reaction SMILES in the training data that overlap with validation/test sets + validation data that overlap with the test set.
- test_appears_in_train: 50
- test_appears_in_valid: 6
- valid_appears_in_train: 44
- Finally, we obtain an (extra) clean dataset of reaction SMILES:
- Train: 39713
- Valid: 4989
- Test: 5005
- Canonicalization: After running clean_smiles.py, we run canonicalize.py in the same folder:
python -m rxnebm.data.preprocess.canonicalize
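The filtering steps listed above can be sketched in plain Python. This is only an illustration of the logic, not the code in clean_smiles.py: the function name `clean_splits` is hypothetical, and the sketch assumes the product SMILES on the right of `>>` has already had its atom-map numbers stripped before its length is checked.

```python
from typing import Dict, List


def clean_splits(splits: Dict[str, List[str]], min_prod_len: int = 3) -> Dict[str, List[str]]:
    """Drop tiny products, duplicates, and reactions that leak across splits.

    `splits` maps "train" / "valid" / "test" to lists of reaction SMILES
    of the form "reactants>>product".
    """
    cleaned: Dict[str, List[str]] = {}
    for phase, rxns in splits.items():
        seen = set()
        kept = []
        for rxn in rxns:
            prod_smi = rxn.split(">>")[-1]
            if len(prod_smi) < min_prod_len:
                continue  # product too small, clearly incorrect
            if rxn in seen:
                continue  # duplicate reaction within the same split
            seen.add(rxn)
            kept.append(rxn)
        cleaned[phase] = kept

    # Validation reactions must not appear in the test set.
    test = set(cleaned["test"])
    cleaned["valid"] = [r for r in cleaned["valid"] if r not in test]
    # Training reactions must not appear in the validation or test sets.
    valid = set(cleaned["valid"])
    cleaned["train"] = [r for r in cleaned["train"] if r not in test and r not in valid]
    return cleaned


if __name__ == "__main__":
    toy = {
        "train": ["CC(=O)Br>>Br", "CCO.CC(=O)O>>CCOC(C)=O", "CCO.CC(=O)O>>CCOC(C)=O", "CCBr.N>>CCN"],
        "valid": ["CCBr.N>>CCN", "CCCl.O>>CCO"],
        "test": ["CCCl.O>>CCO"],
    }
    print(clean_splits(toy))
```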
-
3 CSV files of proposals from RetroSim. These are first generated by running
python -m rxnebm.proposer.retrosim_model
This step takes 13 hours on an 8 core machine. You only need to run this python script again if you wish to get more than top-200 predictions, or beyond 200 max precedents, or modify the underlying RetroSim model; otherwise, you can just grab the CSV files from figshare or Google Drive, as stated above under the section Data preparation. As a precaution, we canonicalize all these precursors again and ensure no training reaction has duplicate ground-truth, by running:
bash scripts/retrosim/clean.sh
-
3 .npz files of sparse reaction fingerprints, retrosim_rxn_fps_{phase}.npz. These are generated by running
bash scripts/retrosim/make_fp.sh
It takes about 8 minutes on a 32 core machine. Please refer to gen_proposals/gen_fps_from_proposals.py for detailed arguments. Since RetroSim will not generate the full 50/200 proposals for every product, we pad the reaction fingerprints with all-zero vectors for batching and mask these during training & testing.
-
3 sets of graph features (1 for each phase). Each set consists of: cache_feat_index.npz, cache_feat.npz, cache_mask.pkl, cache_smi.pkl. Note that these 3 sets in total take up between 20 and 30 GB, so ensure you have sufficient disk space. We again provide them in our Drive, but you can also generate them yourself using:
bash scripts/retrosim/make_graphfeat.sh
It takes about 12 minutes on 32 cores.
-
As stated in our paper, just training on the top-50 proposals (--topk 50) is sufficient and yields the same performance as training on more predictions (e.g. top-100/200); for testing, we still keep the top-200 proposals (--maxk 200) to maximize the chances of the published reaction appearing in those 200 proposals for re-ranking.
Once either the reaction fingerprints or the graphs have been generated, follow the instructions under Training above to train the EBMs.
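The truncation (--topk / --maxk) and the zero-padding with masking mentioned in this section can be pictured with the pure-Python sketch below. It is an illustration under stated assumptions, not the repository's implementation: `truncate_and_pad` is a hypothetical helper, fingerprints are shown as dense lists rather than sparse matrices, and every non-training phase is assumed to use the maxk limit.

```python
from typing import List, Tuple

Fingerprint = List[int]


def truncate_and_pad(
    fps: List[Fingerprint],
    phase: str,
    fp_dim: int,
    topk: int = 50,
    maxk: int = 200,
) -> Tuple[List[Fingerprint], List[bool]]:
    """Keep at most `limit` proposals, then pad with all-zero vectors.

    Returns the padded fingerprints and a mask that is True for real
    proposals and False for padding.
    """
    # Assumption: training uses the top-k proposals, other phases use max-k.
    limit = topk if phase == "train" else maxk
    kept = fps[:limit]
    n_pad = limit - len(kept)
    padded = kept + [[0] * fp_dim for _ in range(n_pad)]
    mask = [True] * len(kept) + [False] * n_pad
    return padded, mask


if __name__ == "__main__":
    # A product for which the proposer returned only 3 proposals.
    proposals = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0]]
    padded, mask = truncate_and_pad(proposals, "train", fp_dim=4, topk=5)
    print(len(padded), mask)
```

Every product then yields a batch entry of the same length, and the mask lets the loss and the ranking ignore the padded slots.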
-
First we need to train GLN itself. We already include the 3 CSV files to train GLN, which contain the atom-mapped, canonicalized, extra-clean reaction SMILES from USPTO-50K, in rxnebm/proposer/gln_openretro/data/gln_schneider50k/.
- To generate these yourself, just run:
python prep_data_for_retro_models.py --output_format gln
This takes as input the 3 .pickle files generated using clean_smiles.py above.
-
We created a wrapper of the original GLN repo, to ease some issues installing GLN as a package, as well as to standardize training, testing & proposing. Our official wrapper, openretro, is still under development and will be released soon; we include the GLN portion in this repo at:
cd rxnebm/proposer/gln_openretro
-
To install GLN: once you're in gln_openretro, run:
bash scripts/setup.sh
This creates a conda environment called gln_openretro, which you need to activate to train/test/propose with GLN. Note that the GLN authors compiled custom CUDA ops to speed up model training/testing, so you need to install GLN on a GPU machine with CUDA properly set up.
-
To preprocess training data:
bash scripts/preprocess.sh
-
To train (takes ~2 hours on 1 RTX2080Ti). Note that you need to specify a training seed.
bash scripts/train.sh <gln_seed>
e.g. bash scripts/train.sh 0
-
To test (takes ~4 hours on 1 RTX2080Ti, because it tests all 10 checkpoints). Testing is important because it tells you the best checkpoint (by validation top-1 accuracy) to use for proposing. For example, on seed 77777777, this should be model-6.dump.
bash scripts/test.sh <seed>
-
To propose, we need to go back up to root with:
cd ../../../
Then run (takes ~12 hours on 1 RTX2080Ti):
bash scripts/gln/propose_and_compile.sh <gln_seed> <best_ckpt>
You need to provide the gln_seed and best_ckpt arguments. For example, if your best checkpoint was model-6.dump trained on seed 77777777, then:
bash scripts/gln/propose_and_compile.sh 77777777 6
which will output 3 cleaned CSV files in rxnebm/data/cleaned_data of the format GLN_200topk_200maxk_noGT_<gln_seed>_<phase>.csv
-
The last step is to generate either the fingerprints or graphs. This step is very similar across all 4 proposers.
- Fingerprints:
bash scripts/gln/make_fp.sh <gln_seed>
- Graphs:
bash scripts/gln/make_graphfeat.sh <gln_seed>
-
Finally, we can train the EBMs to re-rank GLN! Whew! That took a while. Alternatively, if you just want to reproduce our results, you can just grab the fingerprints and/or graphs of the proposals off our Google Drive.
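To make the hand-off between the GLN test and propose steps concrete, the sketch below picks the checkpoint with the highest validation top-1 accuracy and builds the matching propose command. The helper `propose_command` and the accuracy values in the usage example are hypothetical; in practice you would read the accuracies from the output of the test step.

```python
import re
from typing import Dict, List


def propose_command(valid_top1: Dict[str, float], gln_seed: int) -> List[str]:
    """Pick the best checkpoint by validation top-1 and build the propose command."""
    best_ckpt = max(valid_top1, key=valid_top1.get)
    # Checkpoints are named like model-6.dump; the script expects just the number.
    match = re.fullmatch(r"model-(\d+)\.dump", best_ckpt)
    if match is None:
        raise ValueError(f"unexpected checkpoint name: {best_ckpt}")
    return ["bash", "scripts/gln/propose_and_compile.sh", str(gln_seed), match.group(1)]


if __name__ == "__main__":
    # Hypothetical validation accuracies, for illustration only.
    accuracies = {"model-5.dump": 0.50, "model-6.dump": 0.52, "model-7.dump": 0.51}
    print(" ".join(propose_command(accuracies, 77777777)))
```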
-
To train RetroXpert, we include the 3 USPTO-50K CSV files in rxnebm/proposer/retroxpert/data/USPTO50K/canonicalized_csv.
- To generate these yourself, run:
python prep_data_for_retro_models.py --output_format retroxpert
This takes as input the 3 .pickle files generated using clean_smiles.py above.
-
We cloned the original RetroXpert repo, and added python scripts in order to generate proposals across the entire dataset. We also slightly modified the workflow to include a random seed for training. The folder is at:
cd rxnebm/proposer/retroxpert
-
To set up the environment for RetroXpert: once you're in retroxpert, run:
bash scripts/setup.sh
This creates a conda environment called retroxpert, which you need to activate to train/test/propose with RetroXpert.
-
To preprocess training data, note that there is a slight RDKit bug in the template extraction step, where the same input data can generate a different number of templates each time the following script is run:
python extract_semi_template_pattern.py --extract_pattern
If you are simply reproducing our results, you do not need to extract the templates again, as we already include them in data/USPTO50K/product_patterns.txt, which has 527 templates. So, you just need to run:
bash scripts/preprocess.sh
and use 527 as <num_templates> for later steps.
However, if you are using your own dataset, then you must extract the templates, by including the --extract_pattern flag when running preprocess.sh:
bash scripts/preprocess.sh --extract_pattern
Afterwards, note down the EXACT number of templates extracted (<num_templates>), as you need to provide this to later scripts.
-
To train EGAT (~6 hours on 1 RTX2080Ti). You need to specify a training seed and the number of extracted templates.
bash scripts/train_EGAT.sh <retroxpert_seed> <num_templates>
e.g. bash scripts/train_EGAT.sh 0 527
-
To train RGN (~30 hours on 2 RTX2080Ti). You need to specify a training seed.
bash scripts/train_RGN.sh <retroxpert_seed>
-
To translate using RGN and test the overall two-step performance (15 mins on 1 RTX2080Ti). We recommend running this step to double-check that the model is trained well without bugs.
bash scripts/translate_RGN.sh <retroxpert_seed>
-
The proposal stage contains two sub-steps. In the first sub-step, we directly use the existing RetroXpert code, only with minor modifications, to generate reactant-set predictions and evaluate the top-200 predictions. This sub-step takes ~8.5 hours on 1 RTX2080Ti:
bash scripts/propose.sh <seed> <num_templates>
-
In the second sub-step, because the output format from RetroXpert does not align with the input format we need for the EBM, we further process those top-200 proposals. This includes cleaning up invalid SMILES, de-duplicating proposals and ensuring there is only one ground-truth in each training reaction. This sub-step is much faster, ~10 mins on 8 cores (no GPU needed).
First, go back to root:
cd ../../../
And then run:
bash scripts/retroxpert/compile.sh <retroxpert_seed>
which will output 3 cleaned CSV files in rxnebm/data/cleaned_data of the format retroxpert_200topk_200maxk_noGT_<retroxpert_seed>_<phase>.csv
-
The last step is to generate either the fingerprints or graphs using those 3 cleaned CSV files. This step is very similar across all 4 proposers.
- Fingerprints:
bash scripts/retroxpert/make_fp.sh <retroxpert_seed>
- Graphs:
bash scripts/retroxpert/make_graphfeat.sh <retroxpert_seed>
-
Finally, we can train the EBMs to re-rank RetroXpert!
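The second proposal sub-step above (dropping invalid SMILES, de-duplicating proposals, and keeping a single ground-truth per reaction) can be sketched as follows. This is not the code behind compile.sh: `compile_proposals` is a hypothetical helper, the validity check is supplied by the caller, and the sketch assumes the ground-truth is stored separately from the proposal list so that it never appears twice.

```python
from typing import Callable, List


def compile_proposals(
    proposals: List[str],
    ground_truth: str,
    is_valid: Callable[[str], bool],
) -> List[str]:
    """Return valid, de-duplicated proposals with the ground-truth excluded.

    Rank order of the surviving proposals is preserved.
    """
    # Seeding `seen` with the ground-truth removes any proposal equal to it.
    seen = {ground_truth}
    cleaned = []
    for smi in proposals:
        if not is_valid(smi) or smi in seen:
            continue
        seen.add(smi)
        cleaned.append(smi)
    return cleaned


if __name__ == "__main__":
    # Toy validity check for illustration; with RDKit one would typically test
    # whether Chem.MolFromSmiles(smi) returns None.
    def toy_is_valid(smi: str) -> bool:
        return bool(smi) and " " not in smi

    props = ["CCO.CC(=O)O", "CCO.CC(=O)Cl", "", "CCO.CC(=O)Cl", "CCO.CC(=O)Br"]
    print(compile_proposals(props, "CCO.CC(=O)O", toy_is_valid))
```

Comparing strings directly only works if proposals and ground-truth are canonicalized the same way, which is why the pipeline canonicalizes the precursors first.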
-
To train NeuralSym, we simply use the 3 .pickle files 50k_clean_rxnsmi_noreagent_allmapped_canon_<phase>.pickle generated using clean_smiles.py above, which contain the extra-cleaned USPTO-50K reactions. They've also been placed in NeuralSym's input data folder rxnebm/proposer/neuralsym/data/.
-
As the original authors did not open-source NeuralSym, we re-implemented it from scratch following their paper and the repo is placed at:
cd rxnebm/proposer/neuralsym
For reference, our re-implementation can also be found at: https://github.com/linminhtoo/neuralsym
-
To set up the environment for NeuralSym: once you're in neuralsym, run:
bash setup.sh
This creates a conda environment called neuralsym, which you need to activate to train/test/propose with NeuralSym.
-
To preprocess training data into 32681-dim fingerprints. As we've heavily optimized this step, it takes only ~10 mins on 16 cores, and probably ~15-20 mins on 8 cores.
python prepare_data.py
-
To train (<5 mins on 1 RTX2080Ti, yep you read that correctly). Note that the accuracies used during training are template-matching accuracies, which are lower than reactant-matching accuracy (the actual metric for evaluating one-step retrosynthesis), because a particular reactant-set can be obtained from a given product through multiple templates. However, calculating template accuracy is faster (batchable) and more convenient, which is why we use it, given that our reactant-matching results are in agreement with literature values (in fact slightly better).
You need to specify a training seed.
bash train.sh <neuralsym_seed>
e.g. bash train.sh 0
-
We combine the testing and proposing into a single step, and evaluate reactant-matching accuracy here.
bash propose_and_compile.sh <neuralsym_seed>
This will output 3 cleaned CSV files in rxnebm/data/cleaned_data of the format neuralsym_200topk_200maxk_noGT_<neuralsym_seed>_<phase>.csv
-
Now, go back to root:
cd ../../../
The last step is to generate either the fingerprints or graphs using those 3 cleaned CSV files. This step is very similar across all 4 proposers.
- Fingerprints:
bash scripts/neuralsym/make_fp.sh <neuralsym_seed>
- Graphs:
bash scripts/neuralsym/make_graphfeat.sh <neuralsym_seed>
-
Finally, we can train the EBMs to re-rank NeuralSym!
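The gap between template-matching and reactant-matching accuracy noted in the training step can be seen on a toy example. The sketch below is purely illustrative: the template names, reactant strings, and the helper `top1_accuracies` are hypothetical, and applying a template is replaced by a lookup table mapping each template to the reactant-set it would give for that product.

```python
from typing import Dict, List, Tuple


def top1_accuracies(
    pred_templates: List[str],
    true_templates: List[str],
    template_to_reactants: List[Dict[str, str]],
) -> Tuple[float, float]:
    """Return (template-matching, reactant-matching) top-1 accuracy.

    template_to_reactants[i] maps each template to the reactant-set it
    yields when applied to product i.
    """
    n = len(true_templates)
    template_hits = 0
    reactant_hits = 0
    for pred, true, mapping in zip(pred_templates, true_templates, template_to_reactants):
        # Template match: the predicted template is exactly the recorded one.
        template_hits += int(pred == true)
        # Reactant match: a different template still counts if it gives the same reactants.
        reactant_hits += int(mapping[pred] == mapping[true])
    return template_hits / n, reactant_hits / n


if __name__ == "__main__":
    preds = ["t1", "t2", "t4"]
    trues = ["t1", "t3", "t5"]
    mappings = [
        {"t1": "CCO.CC(=O)O"},
        {"t2": "CCBr.N", "t3": "CCBr.N"},   # two templates, same reactants
        {"t4": "CCCl.O", "t5": "CCI.O"},    # different reactants
    ]
    print(top1_accuracies(preds, trues, mappings))
```

On the second product the predicted template differs from the recorded one but gives the same reactants, so it counts for reactant matching only; template accuracy can therefore never exceed reactant accuracy.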
-
First, ensure you have generated the 3 proposal CSV files for both GLN and RetroSim by following the instructions in their respective sections above. This means you need both GLN_200topk_200maxk_noGT_<gln_seed>_<phase>.csv and retrosim_200topk_200maxk_noGT_<phase>.csv in rxnebm/data/cleaned_data.
-
To compile the union of proposals into a single CSV file for each phase, run:
bash scripts/gln_sim/compile.sh <gln_seed>
which will output 3 cleaned CSV files in rxnebm/data/cleaned_data of the format GLN_50topk_200maxk_<gln_seed>_retrosim_50topk_200maxk_noGT_<phase>.csv.
-
The last step is to generate either the fingerprints or graphs using those 3 cleaned CSV files. This step is very similar across all 4 proposers.
- Fingerprints:
bash scripts/gln_sim/make_fp.sh <gln_seed>
- Graphs:
bash scripts/gln_sim/make_graphfeat.sh <gln_seed>
-
Finally, we can train the Graph-EBM to re-rank the union of GLN and RetroSim!
bash scripts/gln_sim/GraphEBM.sh <ebm_seed> <gln_seed>
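Conceptually, the union step merges the two proposers' suggestions for each product and removes duplicates. The sketch below shows one way this could look; it is not the logic of scripts/gln_sim/compile.sh. The helper `union_proposals` is hypothetical, and placing GLN's proposals before RetroSim's is an assumption made for illustration.

```python
from typing import List


def union_proposals(gln: List[str], retrosim: List[str], topk: int = 50) -> List[str]:
    """Merge the top-k proposals of both models, dropping duplicates.

    Assumes both lists are ranked and use the same canonical SMILES form.
    """
    merged: List[str] = []
    seen = set()
    # Assumed ordering: GLN's proposals first, then RetroSim's new ones.
    for smi in gln[:topk] + retrosim[:topk]:
        if smi not in seen:
            seen.add(smi)
            merged.append(smi)
    return merged


if __name__ == "__main__":
    gln_props = ["CCO.CC(=O)O", "CCO.CC(=O)Cl"]
    sim_props = ["CCO.CC(=O)Cl", "CCO.CC(=O)Br"]
    print(union_proposals(gln_props, sim_props))
```

Because the EBM re-scores every candidate, the merged order mostly matters for bookkeeping; what the union adds is the chance that at least one of the two models has proposed the published reactants.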