Smart Distributed Data Factory

This repository hosts the source code for the experiments in the paper "Smart Distributed Data Factory: Volunteer Computing Platform for Active Learning-Driven Molecular Data Acquisition". It provides scripts for training energy prediction models and running inference with them, as well as for simulating the active-learning framework.

The pre-print, which describes the methods and implementation in detail, is available on bioRxiv.
The conformational energy dataset and the benchmark for machine learning models are available on Zenodo.

Conformational energy prediction

We provide a script for running inference with our conformational energy prediction models. For example, you can run the GENConv model on input conformations (each in a separate .sdf file) as follows:

python -m force_field_models.inference.inference --config force_field_models/model_configs/ConfigurableGNNModel_GENConv_new_normals.yaml --checkpoint GENConv.pth --data_dir dataset/molecules --output_file predictions.csv
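The command above expects one conformation per .sdf file in the data directory. If your conformations are stored in a single multi-record SDF file, a minimal sketch like the following could split them first. It relies only on the SDF convention that a line containing just `$$$$` ends a record; the function name `split_sdf` and the output file naming pattern are hypothetical and not part of this repository.

```python
from pathlib import Path


def split_sdf(multi_sdf: Path, out_dir: Path) -> list[Path]:
    """Write each record of a multi-record SDF file to its own .sdf file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    record: list[str] = []
    for line in multi_sdf.read_text().splitlines(keepends=True):
        record.append(line)
        # In the SDF format, a line containing only "$$$$" terminates a record.
        if line.strip() == "$$$$":
            # Hypothetical naming scheme: <input stem>_<4-digit index>.sdf
            path = out_dir / f"{multi_sdf.stem}_{len(written):04d}.sdf"
            path.write_text("".join(record))
            written.append(path)
            record = []
    # Any trailing lines without a closing "$$$$" are ignored.
    return written
```

For example, `split_sdf(Path("conformers.sdf"), Path("dataset/molecules"))` would fill the directory passed to `--data_dir` above, where `conformers.sdf` is a placeholder for your own file.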

The predicted energies are written to the file given by --output_file and are in Hartree units.
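If you prefer kcal/mol, the Hartree values can be converted after inference. The sketch below assumes that the output CSV has a header row and that the energies are in a column named `energy`; both are assumptions, so check the actual layout of your predictions.csv and adjust `energy_column` accordingly. The helper names are hypothetical.

```python
import csv

HARTREE_TO_KCAL_PER_MOL = 627.509474  # 1 Hartree expressed in kcal/mol


def hartree_to_kcal_per_mol(energy_hartree: float) -> float:
    """Convert an energy from Hartree to kcal/mol."""
    return energy_hartree * HARTREE_TO_KCAL_PER_MOL


def add_kcal_column(in_csv: str, out_csv: str, energy_column: str = "energy") -> None:
    """Copy a CSV file, appending a column with the energy in kcal/mol.

    Assumes a header row and an energy column called `energy_column`
    (an assumption; the real column name may differ).
    """
    with open(in_csv, newline="") as src, open(out_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        new_column = f"{energy_column}_kcal_per_mol"
        writer = csv.DictWriter(dst, fieldnames=[*reader.fieldnames, new_column])
        writer.writeheader()
        for row in reader:
            row[new_column] = hartree_to_kcal_per_mol(float(row[energy_column]))
            writer.writerow(row)
```

Relative energies between conformations are usually what matters, so it can be convenient to subtract the lowest energy of a molecule before converting.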

Model checkpoints:

GENConv: [Download link] (config file: ConfigurableGNNModel_GENConv_new_normals.yaml)
PNAConv: [Download link] (config file: ConfigurableGNNModel_PNAConv_new_normals.yaml)
ResGatedConv: [Download link] (config file: ConfigurableGNNModel_ResGatedConv.yaml)
GeneralConv: [Download link] (config file: ConfigurableGNNModel_GeneralConv_new_normals.yaml)
TransformerConv: [Download link] (config file: ConfigurableGNNModel_TransformerConv_new_normals_1.yaml)

Installation steps

To run the code, you need Python 3.11 or Python 3.12 installed on your system. Then install the remaining dependencies using:

pip install -r requirements.txt
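Before installing, it can help to confirm that the active interpreter is one of the versions named above. This is a small optional check, not a script shipped with the repository; the name `is_supported` is hypothetical.

```python
import sys

# Versions named in this README; other versions are untested here.
SUPPORTED_VERSIONS = {(3, 11), (3, 12)}


def is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version is 3.11 or 3.12."""
    return (version_info[0], version_info[1]) in SUPPORTED_VERSIONS


current = f"{sys.version_info[0]}.{sys.version_info[1]}"
status = "supported" if is_supported() else "not listed as supported"
print(f"Python {current} is {status}.")
```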

Active learning-based conformation sampling

We also provide a script (force_field_models/train/cycle.py) for simulating active learning-based dataset sampling. To run it, you should specify the model configs, the initial datasets, and the training-related hyperparameters. A detailed explanation and an example are given in cycle_README.md.
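For readers new to active learning, the selection step of one sampling round can be illustrated generically. This README does not state which selection criterion cycle.py uses, so the sketch below is not the repository's implementation: it assumes a committee of models and picks the conformations on which their energy predictions disagree most (largest variance). The function name and the input layout are hypothetical.

```python
from statistics import pvariance


def select_most_uncertain(
    committee_predictions: dict[str, list[float]], k: int
) -> list[str]:
    """Return the ids of the k conformations with the largest prediction variance.

    `committee_predictions` maps a conformation id to the energies predicted
    for it by each committee member (an assumed layout, for illustration only).
    """
    disagreement = {
        conf_id: pvariance(preds)
        for conf_id, preds in committee_predictions.items()
    }
    # Rank conformations from most to least disagreement and keep the top k.
    ranked = sorted(disagreement, key=disagreement.get, reverse=True)
    return ranked[:k]
```

In a full cycle, the selected conformations would be labeled with reference energies, added to the training set, and the models retrained before the next round; see cycle_README.md for how the provided script is actually configured.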

How to cite this work

For citation please use:

The paper (pre-print):
Ghukasyan, T., Altunyan, V., Bughdaryan, A., Aghajanyan, T., Smbatyan, K., Papoian, G. A., & Petrosyan, G. (2024). SMART DATA FACTORY: VOLUNTEER COMPUTING PLATFORM FOR ACTIVE LEARNING-DRIVEN MOLECULAR DATA ACQUISITION. bioRxiv, 2024-10.
The dataset:
Altunyan, V., Ghukasyan, T., Bughdaryan, A., Aghajanyan, T., Smbatyan, K., Papoian, G., & Petrosyan, G. (2024). SDDF Energy Dataset (2024-Q3) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14008357
