This repo holds the code for the competition "Central Nervous System (CNS) drug development: drug screening and optimization" and serves as a semester project for "AI for Chemistry" (CH-457).
- Python 3.11
- Conda environment (Recommended)
- CUDA 12.1 (Recommended, for PyTorch)
git clone https://github.com/GardevoirX/CNS_drug_screening.git
cd CNS_drug_screening
pip install --upgrade pip
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
pip install -r requirements.txt
python ./inference.py --data_file your_smiles.csv
cd CNS_drug_screening
pytest
python ./train.py
Treating central nervous system (CNS) diseases is difficult because of the blood-brain barrier (BBB). The BBB is a highly selective barrier between the circulatory system and the CNS. It protects the brain from harmful substances in the blood, but it also keeps drugs against CNS diseases from reaching the site of the disease. Nearly 98% of small-molecule drugs and almost all macromolecular drugs cannot cross this barrier.
A quantitative structure-activity relationship (QSAR) is a model that relates a set of molecular properties (X, descriptors) to the activities of the molecules (Y, labels). Hansch and Fujita first proposed a linear model relating molar concentrations to Hammett constants and partition coefficients:
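$$\log\frac{1}{C} = k_1 \log P + k_2 \sigma + k_3$$

where $C$ is the molar concentration that produces a given biological response, $P$ is the partition coefficient, $\sigma$ is the Hammett constant, and $k_1$, $k_2$, $k_3$ are fitted coefficients.

This model can be further generalized as:

$$Y = f(X)$$

where the function $f$ maps the descriptors $X$ to the labels $Y$.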
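As a minimal sketch of such a linear QSAR model, the function below evaluates a Hansch-type relation, log(1/C) = k1 * log P + k2 * sigma + k3. The function names and the coefficient values in the usage lines are hypothetical and chosen only for illustration; in practice the coefficients are fitted to measured activities.

```python
def hansch_log_inverse_conc(log_p: float, sigma: float,
                            k1: float, k2: float, k3: float) -> float:
    """Return log10(1/C) predicted by a linear Hansch-type model."""
    return k1 * log_p + k2 * sigma + k3


def molar_concentration(log_inverse_conc: float) -> float:
    """Convert log10(1/C) back to the molar concentration C."""
    return 10.0 ** (-log_inverse_conc)


# Hypothetical coefficients, for illustration only (not fitted to any data).
activity = hansch_log_inverse_conc(log_p=2.0, sigma=0.5, k1=1.0, k2=2.0, k3=0.5)
concentration = molar_concentration(activity)
```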
The dataset is organized as:
SMILES | Target |
---|---|
CC(=O)Nc1ccc(cc1)O | 1 |
CC1OC1P(=O)(O)O | 0 |
... | ... |
Here 1 marks CNS drugs and 0 marks non-CNS drugs. The training set contains 701 entries and the test set contains 368. The composition of the dataset is 453 non-CNS drugs and 247 CNS drugs.
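A file in this layout can be read with Python's standard `csv` module. The sketch below assumes a comma-separated file with a header row `SMILES,Target`; the actual file name and delimiter of the competition data are not stated here, so the inline sample simply reuses the two example rows shown above. The function names are illustrative.

```python
import csv
import io
from collections import Counter

# Inline sample reusing the two example rows from the table above.
# A real file would be opened with open(path, newline="") instead.
SAMPLE = """SMILES,Target
CC(=O)Nc1ccc(cc1)O,1
CC1OC1P(=O)(O)O,0
"""


def load_dataset(handle):
    """Return (smiles, labels) lists read from a CSV with SMILES and Target columns."""
    smiles, labels = [], []
    for row in csv.DictReader(handle):
        smiles.append(row["SMILES"])
        labels.append(int(row["Target"]))
    return smiles, labels


def class_fractions(labels):
    """Return the fraction of each class label, e.g. {0: 0.5, 1: 0.5}."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


smiles, labels = load_dataset(io.StringIO(SAMPLE))
fractions = class_fractions(labels)
```

With the composition reported above (453 non-CNS and 247 CNS drugs), the positive class makes up about 35% of the labelled molecules, so the data is moderately imbalanced.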
Descriptors are mainly calculated with the descriptor module of RDKit. We use 14 descriptors in total, which fall into 6 types:
Type | Descriptor | # of features |
---|---|---|
Molecular characteristics | MW, abs. net charge, abs. max./min. partial charge, # of rotatable bonds, # of heavy atoms | 6 |
Topological descriptors | USR, USR-CAT, 2D autocorrelation | 164 |
Quantum descriptor | MQN | 42 |
Electronegative descriptor | PEOE | 10 |
Partition coefficients | VSA-logP | 12 |
Topological fingerprints | topological torsion, Morgan fingerprints | 3072 (bits) |
During training, some features were found to take only a single value across the dataset. These features were removed, leaving a total of 2912 features in the final set.
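Removing such features amounts to keeping only the columns that take more than one distinct value. The sketch below shows the idea in plain Python on a hypothetical toy matrix; the function name is illustrative and this is not necessarily how the repository implements it. scikit-learn's `VarianceThreshold` offers a comparable filter for NumPy arrays.

```python
def drop_constant_columns(rows):
    """Drop columns that hold a single value across all rows.

    Returns the filtered rows and the indices of the columns that were kept.
    """
    if not rows:
        return [], []
    n_cols = len(rows[0])
    kept = [j for j in range(n_cols) if len({row[j] for row in rows}) > 1]
    filtered = [[row[j] for j in kept] for row in rows]
    return filtered, kept


# Hypothetical toy feature matrix: columns 1 and 3 never change.
features = [
    [0.1, 1, 0, 7.0],
    [0.4, 1, 1, 7.0],
    [0.9, 1, 0, 7.0],
]
filtered, kept = drop_constant_columns(features)
```

The kept indices should be stored and reused at inference time so that new molecules are reduced to the same columns.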
Models can be either simple models provided by scikit-learn or more complex models built with PyTorch.
Our final model is a perceptron with five hidden layers. The numbers of neurons in these layers are 3076, 2048, 1024, 512 and 128, respectively. Every layer is equipped with LayerNorm, a ReLU activation function and dropout. The dropout rate varies by layer: 0.8, 0.6, 0.4, 0.4 and 0.4, respectively. Below is a schematic figure of our model.
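To make the layer layout concrete, the sketch below writes the five hidden layers and their dropout rates as a plain Python specification and counts the trainable parameters. The input width (2912, the number of retained features), the single output unit, and the learnable scale and shift in each LayerNorm are assumptions that are not stated above; all names are illustrative.

```python
HIDDEN_SIZES = [3076, 2048, 1024, 512, 128]
DROPOUT_RATES = [0.8, 0.6, 0.4, 0.4, 0.4]


def layer_specs(n_inputs, hidden_sizes, dropout_rates):
    """Pair each hidden layer's input/output width with its dropout rate."""
    widths = [n_inputs] + list(hidden_sizes)
    return [
        {"in": widths[i], "out": widths[i + 1], "dropout": dropout_rates[i]}
        for i in range(len(hidden_sizes))
    ]


def count_parameters(n_inputs, hidden_sizes, n_outputs=1):
    """Count linear weights and biases plus LayerNorm scale and shift."""
    total = 0
    fan_in = n_inputs
    for width in hidden_sizes:
        total += fan_in * width + width  # linear layer: weights + biases
        total += 2 * width               # LayerNorm: scale + shift (assumed affine)
        fan_in = width
    total += fan_in * n_outputs + n_outputs  # output layer (assumed single unit)
    return total


specs = layer_specs(2912, HIDDEN_SIZES, DROPOUT_RATES)
n_params = count_parameters(2912, HIDDEN_SIZES)
```

Under these assumptions the network has roughly 18 million parameters, most of them in the first two layers. That is large relative to a training set of about 700 molecules, which may be why such high dropout rates are used.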
Our final model achieved an F2 score of 0.838 in the online test provided by the Bohrium platform.
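The F2 score is the F-beta score with beta = 2, which weights recall more heavily than precision. The following is a minimal pure-Python sketch of the metric for binary labels; the function name is illustrative, and scikit-learn's `fbeta_score(y_true, y_pred, beta=2)` computes the same quantity.

```python
def f_beta(y_true, y_pred, beta=2.0):
    """F-beta score for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    # If there are no positives in either the labels or the predictions,
    # the score is taken to be 0.
    return (1 + b2) * tp / denom if denom else 0.0


# Toy labels for illustration: 3 true positives, 1 false positive, 0 false negatives.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0]
score = f_beta(y_true, y_pred)  # 15 / 16 = 0.9375
```

A classifier that never predicts the positive class has no true positives and therefore scores 0, which is one way a value of 0.000 can arise.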
The table below compares the performance of different models. Although Bayesian regression performs best on the validation set, it falls well behind the perceptrons on the test set, which might be explained by the stronger generalization ability of the more complex models.
Model | F2-score (validation) | F2-score (test) |
---|---|---|
Logistic | 0.747 | |
Linear | 0.679 | |
Ridge | 0.649 | |
Lasso | 0.676 | |
ElasticNet | 0.746 | |
Bayesian | 0.843 | 0.702 |
SGD | 0.623 | |
Kernel | 0.731 | |
SVC | 0.000 | |
KNN | 0.675 | |
KMeans | 0.441 | |
GMM | 0.783 | |
3-Layer perceptron | 0.783 | 0.811 |
5-Layer perceptron | 0.796 | 0.838 |
- https://bohrium.dp.tech/competitions/9169114995?tab=datasets (You can change the language in the menu behind the icon in the upper-right corner)
- https://www.rdkit.org/docs/index.html