GitHub - lhm30/PIDGINv3: Protein target prediction using random forests and reliability-density neighbourhood analysis

Prediction IncluDinG INactivity (PIDGIN) Version 3

Author : Lewis Mervin, [email protected]

Supervisor : Dr. A. Bender

Protein target prediction using Random Forests (RFs) trained on bioactivity data from PubChem (extracted 07/06/18) and ChEMBL (version 24). The models are built with RDKit and Scikit-learn and employ a modification of the reliability-density neighbourhood Applicability Domain (AD) analysis by Aniceto [1]. This project is the successor to PIDGIN version 1 [2] and PIDGIN version 2 [3]. Target prediction with extended NCBI pathway and DisGeNET disease enrichment calculation is available as implemented in [4].

Molecular Descriptors: 2048-bit RDKit Extended Connectivity FingerPrints (ECFP) [5]
Algorithm: Random Forests with a dynamic number of trees (see docs for details), class weight = 'balanced', sample weight = ratio Inactive:Active
Models generated at four different cut-offs: 100μM, 10μM, 1μM and 0.1μM
Models generated both with and without mapping to orthologues, as implemented in [3]
Pathway information from NCBI BioSystems
Disease information from DisGeNET
Target/pathway/disease enrichment calculated using Fisher's exact test and the Chi-squared test
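
As an illustration of the enrichment statistics named in the list above, here is a minimal sketch of a one-sided Fisher's exact test and a Pearson chi-squared test on a 2x2 contingency table, using only the Python standard library. This is not PIDGIN's own code: the function names, the table layout (query set versus background set, predicted versus not predicted for one target) and the example counts are assumptions made for illustration.

```python
from math import comb, erfc, sqrt


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical, for illustration):
      a = query compounds predicted for the target
      b = query compounds not predicted for the target
      c = background compounds predicted for the target
      d = background compounds not predicted for the target
    Returns P(X >= a) under the hypergeometric null distribution.
    """
    row1 = a + b          # size of the query set
    col1 = a + c          # all compounds predicted for the target
    n = a + b + c + d     # all compounds
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1))
    return tail / comb(n, row1)


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value) with one degree of freedom.
    """
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column of the table must be non-empty")
    statistic = n * (a * d - b * c) ** 2 / margins
    # For 1 degree of freedom the survival function is erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Example counts are made up for illustration.
print(fisher_exact_greater(3, 1, 1, 3))
print(chi_squared_2x2(3, 1, 1, 3))
```

Fisher's exact test is usually preferred when any cell count is small, while the chi-squared approximation is cheaper for large tables.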

Details for sizes across all activity cut-offs:

	Without orthologues	With orthologues
Distinct Models	10,446	14,678
Distinct Targets [exhaustive total]	7,075 [7,075]	16,623 [60,437]
Total Bioactivities Over all models	39,424,168	398,340,769
Actives	3,204,038	35,009,629
Inactives [Of which are Sphere Exclusion (SE)]	36,220,130 [27,435,133]	363,331,140 [248,782,698]

Full details on all models are provided in the uniprot_information.txt files in the orthologue and no_orthologue directories
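
The totals in the table show how strongly inactives outnumber actives, which is the motivation for weighting. The sketch below computes the Inactive:Active ratio and the weights given by scikit-learn's documented 'balanced' heuristic, n_samples / (n_classes * count_per_class), from the "without orthologues" totals. It is illustrative only: PIDGIN trains each model on its own data, so applying the formula to the aggregate totals, and the helper name used here, are assumptions.

```python
def balanced_class_weights(counts):
    """Weights as in scikit-learn's 'balanced' mode: n_samples / (n_classes * count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


# Aggregate "without orthologues" totals from the table above.
counts = {"active": 3_204_038, "inactive": 36_220_130}

weights = balanced_class_weights(counts)
ratio = counts["inactive"] / counts["active"]  # Inactive:Active ratio

print(f"Inactive:Active ratio = {ratio:.2f}")
print(f"balanced weights = {weights}")
```

Under the 'balanced' heuristic the ratio of the active weight to the inactive weight equals the Inactive:Active ratio, so the minority class is up-weighted in proportion to its rarity.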

INSTRUCTIONS

Development occurs on GitHub.

Install with Conda

Documentation, installation and instructions are on ReadtheDocs.

IMPORTANT

Use the ReadtheDocs! You MUST download the models before running!
The program accepts as input either line-separated SMILES (.smi/.smiles files) or SD files (.sdf)
If a SMILES input line contains data in addition to the SMILES string, the first entries after the SMILES are automatically interpreted as identifiers (see the OpenSMILES specification §4.5), although there are options to change this behaviour
Molecules are automatically standardized when running models (can be turned off)
Do not modify the 'pkls', 'ad_data' etc. names or directories
Files in the examples directory are included for testing, as used in the ReadtheDocs tutorials.
For installation and usage instructions, see the documentation.
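
To make the SMILES input convention above concrete, here is a minimal sketch of reading line-separated SMILES with an optional identifier after each string. It is not PIDGIN's parser: the function names are hypothetical, and taking only the first whitespace-separated field after the SMILES as the identifier is a simplifying assumption.

```python
def parse_smiles_line(line, delimiter=None):
    """Split one input line into (smiles, identifier).

    With delimiter=None the line is split on any whitespace (spaces or tabs).
    Returns None for blank lines; identifier is None when the line holds
    only a SMILES string.
    """
    fields = line.strip().split(delimiter)
    if not fields or not fields[0]:
        return None
    smiles = fields[0]
    identifier = fields[1] if len(fields) > 1 else None
    return smiles, identifier


def read_smiles(text):
    """Parse line-separated SMILES text into a list of (smiles, identifier)."""
    records = []
    for line in text.splitlines():
        parsed = parse_smiles_line(line)
        if parsed is not None:
            records.append(parsed)
    return records


example = "CCO ethanol\nc1ccccc1\tbenzene\n\nCC(=O)O\n"
for smiles, identifier in read_smiles(example):
    print(smiles, identifier)
```

Molecules without an identifier would need a fallback name in practice, for example the line number; how PIDGIN handles that case is covered in its documentation.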

License

PIDGINv3 is available under the GNU General Public License v3.0 (GPLv3).

References

[1]	Aniceto, N, et al. A novel applicability domain technique for mapping predictive reliability across the chemical space of a QSAR: Reliability-density neighbourhood. J. Cheminform. 8: 69 (2016).

[2]	Mervin, L H., et al. Target prediction utilising negative bioactivity data covering large chemical space. J. Cheminform. 7: 51 (2015).

[3]	Mervin, L H., et al. Orthologue chemical space and its influence on target prediction. Bioinformatics. 34: 72–79 (2018).

[4]	Mervin, L H., et al. Understanding Cytotoxicity and Cytostaticity in a High-Throughput Screening Collection. ACS Chem. Biol. 11: 11 (2016).

[5]	Rogers D & Hahn M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50: 742–754 (2010).
