🔀 Merge pull request #63 from RasmussenLab/developer

♻️ 🎨 📝 Developer
RasmussenLab · Jan 2, 2023 · 17db92c
2 parents 4d312b8 + 4a42031

Showing 30 changed files with 1,019 additions and 146 deletions.
diff --git a/.readthedocs.yaml b/.readthedocs.yaml
@@ -0,0 +1,16 @@
+version: 2
+
+build:
+  os: ubuntu-20.04
+  tools:
+    python: "3.9"
+
+sphinx:
+  configuration: docs/source/conf.py
+
+python:
+  install:
+    - requirements: docs/requirements.txt
+    - requirements: requirements.txt
+    - method: pip
+      path: .
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@ The code in this repository can be used to run our Multi-Omics Variational
 autoEncoder (MOVE) framework for integration of omics and clinical variabels
 spanning both categorial and continuous data. Our approach includes training
 ensemble VAE models and using *in silico* perturbation experiments to identify
-cross omics associations. The manuscript has been accepted and we will provide 
+cross omics associations. The manuscript has been accepted and we will provide
 the link when it is published.
 
 We developed the method based on a Type 2 Diabetes cohort from the IMI DIRECT
@@ -68,29 +68,8 @@ MOVE has five-six steps:
 
 ## How to run MOVE
 
-You can run the move-dl pipeline from the command line or within a Jupyter
-notebook.
-
-You can run MOVE as Python module with the following command. Details on how
-to set up the configuration for the data and task can be found our
-[tutorial](https://github.com/RasmussenLab/MOVE/tree/main/tutorial) folder.
-
-```bash
->>> move-dl data=[name of data config] task=[name of task config]
-```
-
-Feel free to
-[open an issue](https://github.com/RasmussenLab/MOVE/issues/new/choose) if you
-need any help.
-
-### How to use MOVE with your data
-
-Your data files should be tab separated, include a header and the first column
-should be the IDs of your samples. The configuration of MOVE is done using YAML
-files that describe the input data and the task specification. These should be
-placed in a `config` directory in the working directory. Please see the
-[tutorial](https://github.com/RasmussenLab/MOVE/tree/main/tutorial)
-for more information.
+Please refer to our [**documentation**](https://move-dl.readthedocs.io/) for
+examples and tutorials on how to run MOVE.
 
 
 # Data sets
@@ -110,5 +89,13 @@ available [here](https://directdiabetes.org).
 
 ## Simulated and publicaly available data sets
 
-We have therefore provided two datasets to test the workflow: a simulated 
+We have therefore provided two datasets to test the workflow: a simulated
 dataset and a publicly-available maize rhizosphere microbiome data set.
+
+# Citation
+
+To cite MOVE, use the following information:
+
+Allesøe, R.L., Lundgaard, A.T., Hernández Medina, R. et al. Discovery of
+drug–omics associations in type 2 diabetes with generative deep-learning models.
+*Nat Biotechnol* (2023). https://doi.org/10.1038/s41587-022-01520-x
diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -1,2 +1,4 @@
 sphinx==5.3.0
-sphinx_rtd_theme=1.1.1
+sphinx-rtd-theme
+sphinx-autodoc-typehints
+sphinxemoji
diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -11,18 +11,23 @@
 
 sys.path.insert(0, str(Path("../src").resolve()))
 
-project = "move-dl"
-copyright = "2022, Valentas Brasas, Ricardo Hernandez Medina"
-author = "Valentas Brasas, Ricardo Hernandez Medina"
-release = "1.0.0"
+import move
+
+project = "MOVE"
+copyright = "2022, Rasmussen Lab"
+author = "Rasmussen Lab"
+release = ".".join(map(str, move.__version__))
 
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
 
 extensions = [
     "sphinx.ext.autodoc",
+    "sphinx.ext.autosectionlabel",
     "sphinx.ext.autosummary",
     "sphinx.ext.napoleon",
+    "sphinx_autodoc_typehints",
+    "sphinxemoji.sphinxemoji",
 ]
 
 templates_path = ["_templates"]
@@ -32,6 +37,9 @@
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
 
 html_theme = "sphinx_rtd_theme"
+html_theme_options = {
+    "collapse_navigation" : False,
+}
 html_static_path = []
 
 # -- Napoleon settings --------------------------------------------------------
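The new `release` line in `conf.py` only yields a dotted version string if `move.__version__` is a sequence of components such as `(1, 0, 0)`; that is an assumption here, since the diff does not show how `__version__` is defined. A minimal sketch of the same expression, with a hypothetical helper `format_release` that also tolerates a plain string:

```python
def format_release(version):
    """Return a dotted release string for Sphinx.

    `version` is assumed to be either a string such as "1.0.0" or a
    sequence of components such as (1, 0, 0). The helper name is
    hypothetical and not part of MOVE.
    """
    if isinstance(version, str):
        # Joining a string would put a dot between every character,
        # so strings are returned unchanged.
        return version
    return ".".join(map(str, version))


print(format_release((1, 0, 0)))
```

Without the string check, passing `"1.0.0"` to the join would insert a dot between every character, so the guard matters if the version is ever stored as a string.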

diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -1,15 +1,22 @@
-.. move-dl documentation master file, created by
-   sphinx-quickstart on Sat Nov  5 15:48:56 2022.
-   You can adapt this file completely to your liking, but it should at least
-   contain the root `toctree` directive.
-
-Welcome to move-dl's documentation!
-===================================
+Welcome to MOVE's documentation!
+================================
 
 .. toctree::
+   :hidden:
    :maxdepth: 1
    :caption: Contents:
 
-   pages/installation
-   pages/tutorial
-   pages/api/API
+   install
+   method
+   tutorial/index
+
+MOVE (**m**\ ulti-\ **o**\ mics **v**\ ariational auto\ **e**\ ncoder) is a
+framework for integration of omics and other data modalities (including both
+categorical and continuous data). Our approach consists of training an ensemble
+of VAE (variational autoencoder) models and performing *in silico* perturbation
+experiments to identify associations across the different omics datasets.
+
+We invite you to read `our publication`_ presenting this method, or read
+about the method :doc:`here</method>`.
+
+.. _`our publication`: https://www.nature.com/articles/s41587-022-01520-x
diff --git a/docs/source/install.rst b/docs/source/install.rst
@@ -0,0 +1,47 @@
+Install
+=======
+
+MOVE is distributed as ``move-dl``, a Python package.
+
+It requires Python 3.9 (or later) and third-party libraries, such as `PyTorch`_
+and `Hydra`_. These dependencies will be installed automatically when you
+install with ``pip``.
+
+Install the stable version
+--------------------------
+
+We recommend installing ``move-dl`` in a fresh virtual environment. If you wish
+to learn how to create and manage virtual environments with Conda, please
+follow `these instructions`_.
+
+The latest stable version of ``move-dl`` can be installed with ``pip``.
+
+.. code-block:: bash
+
+    $ pip install move-dl
+
+Install the development version
+-------------------------------
+
+If you wish to install the development version of ``move-dl``, create a new
+virtual environment and run:
+
+.. code-block:: bash
+
+    $ pip install git+https://github.com/RasmussenLab/MOVE@developer
+
+Alternatively, you can clone ``move-dl`` from `GitHub`_ and install by
+running the following command from the top-level source directory:
+
+.. code-block:: bash
+
+    $ pip install -e .
+
+The ``-e`` flag installs the project in "editable" mode, so you can follow the
+development branch and update your installation by pulling from GitHub.
+
+.. _PyTorch: https://pytorch.org/
+.. _Hydra: https://hydra.cc/
+.. _GitHub: https://github.com/RasmussenLab/MOVE
+
+.. _these instructions: https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html
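After either install route, one way to confirm that the package is visible to the active environment is to query its metadata from Python. This is a small sketch using only the standard library; the helper name `installed_version` is hypothetical, and `move-dl` is the distribution name given in the install page above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="move-dl"):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
print("move-dl is not installed" if found is None else f"move-dl {found}")
```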
diff --git a/docs/source/method.rst b/docs/source/method.rst
@@ -0,0 +1,98 @@
+About the method
+================
+
+MOVE is based on the VAE (variational autoencoder) model, a deep learning model
+that transforms high-dimensional data into a lower-dimensional space (the
+so-called latent representation). The autoencoder is made up of two neural networks: an
+encoder, which compresses the input variables; and a decoder, which tries to
+reconstruct the original input from the compressed representation. In doing so,
+the model learns the structure and associations between the input variables.
+
+In `our publication`_, we used this type of model to integrate different data
+modalities, including: genomics, transcriptomics, proteomics, metabolomics,
+microbiomes, medication data, diet questionnaires, and clinical measurements.
+Once we obtained a trained model, we exploited the decoder network to identify
+cross-omics associations.
+
+Our approach consists of performing *in silico* perturbations of the original
+data and using either univariate statistical methods or Bayesian decision
+theory to identify significant differences between the reconstruction with or
+without perturbation. Thus, we are able to detect associations between the
+input variables.
+
+.. _`our publication`: https://www.nature.com/articles/s41587-022-01520-x
+
+.. image:: method/fig1.svg
+
+VAE design
+-----------
+
+The VAE was designed to account for a variable number of fully-connected hidden
+layers in both encoder and decoder. Each hidden layer is followed by batch
+normalization, dropout, and a leaky rectified linear unit (leaky ReLU).
+
+To integrate different modalities, each dataset is reshaped and concatenated
+into an input matrix. The reconstruction error is calculated per dataset:
+binary cross-entropy for binary and categorical datasets, and mean squared
+error for continuous datasets. Each error :math:`E_i` is then multiplied by a
+given weight :math:`W_i` and added up to form the loss function:
+
+:math:`L = \sum_i W_i E_i + W_\textnormal{KL} D_\textnormal{KL}`
+
+Note that the :math:`D_\textnormal{KL}` (Kullback–Leibler divergence) penalizes
+deviation of the latent representation from the standard normal distribution. It
+is also subject to a weight :math:`W_\textnormal{KL}`, which warms up as the
+model is trained.
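The loss described in this section can be sketched in plain Python. This is an illustration under stated assumptions, not MOVE's implementation: all function names are hypothetical, the KL term uses the standard closed form for a diagonal Gaussian against the standard normal, and the linear warm-up schedule is an assumption (the text only says that the weight "warms up").

```python
import math


def binary_cross_entropy(targets, probs, eps=1e-7):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(targets)


def mean_squared_error(targets, preds):
    """Mean squared error between targets and reconstructed values."""
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)


def kl_divergence(mu, logvar):
    """KL divergence of N(mu, exp(logvar)) from N(0, 1), summed over latent dims."""
    return -0.5 * sum(1.0 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, logvar))


def kl_weight(epoch, warmup_epochs, max_weight):
    """Assumed linear warm-up: the weight grows from 0 to max_weight."""
    return max_weight * min(1.0, epoch / warmup_epochs)


def total_loss(errors, weights, kl, w_kl):
    """L = sum_i W_i * E_i + W_KL * D_KL."""
    return sum(w * e for w, e in zip(weights, errors)) + w_kl * kl


# One categorical and one continuous dataset, halfway through warm-up.
errors = [
    binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.7]),
    mean_squared_error([0.5, -1.0], [0.4, -0.8]),
]
loss = total_loss(
    errors,
    weights=[1.0, 1.0],
    kl=kl_divergence([0.1, -0.2], [0.0, 0.1]),
    w_kl=kl_weight(epoch=5, warmup_epochs=10, max_weight=1.0),
)
print(loss)
```

Keeping one error term per dataset makes it possible to reweight a modality through its :math:`W_i` without touching the others.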
+
+Extracting associations
+-----------------------
+
+After determining the right set of hyperparameters, associations are extracted
+by perturbing the original input data and passing it through an ensemble of
+trained models. The reason behind using an ensemble is that VAE models are
+stochastic, so we need to ensure that the results we obtain are not a product
+of chance.
+
+We perturb categorical data by changing its value from one category to
+another (e.g., drug status changed from "not received" to "received"). Then, we
+compare the reconstruction generated from the original data against the one
+generated from the perturbed data. To achieve this, we propose two approaches:
+a *t*\ -test and Bayes factors. Both are described below:
+
+MOVE *t*\ -test
+^^^^^^^^^^^^^^^
+
+#. Perturb a variable in one dataset.
+#. Repeat 10 times for 4 different latent space sizes:
+
+    #. Train VAE model with original data.
+    #. Obtain reconstruction of original data (baseline reconstruction).
+    #. Obtain 10 additional reconstructions of original data and calculate
+       difference from the first (baseline difference).
+    #. Obtain reconstruction of perturbed data (perturbed reconstruction) and
+       subtract from baseline reconstruction (perturbed difference).
+    #. Compute p-value between baseline and perturbed differences with t-test.
+
+#. Correct p-values using Bonferroni method.
+#. Select features that are significant (p-value lower than 0.05).
+#. Select significant features that overlap in at least half of the refits and
+   3 out of 4 architectures. These features are associated with the
+   perturbed variable.
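The last three steps of the list above (correction, significance filter, overlap across refits and architectures) can be sketched in plain Python. This is a minimal illustration, not MOVE's implementation: the function names, the nested-list layout `pvalues[a][r][f]`, and the choice to apply the Bonferroni correction across the features of each refit are all assumptions.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capping at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def select_features(pvalues, alpha=0.05, min_refit_fraction=0.5,
                    min_architectures=3):
    """Return indices of features that pass the overlap criteria.

    pvalues[a][r][f] is the raw p-value of feature f in refit r of
    architecture (latent space size) a.
    """
    num_features = len(pvalues[0][0])
    arch_votes = [0] * num_features
    for refits in pvalues:
        refit_votes = [0] * num_features
        for raw in refits:
            for f, p in enumerate(bonferroni(raw)):
                if p < alpha:
                    refit_votes[f] += 1
        # A feature counts for this architecture if it is significant
        # in at least the required fraction of refits.
        for f, votes in enumerate(refit_votes):
            if votes >= min_refit_fraction * len(refits):
                arch_votes[f] += 1
    return [f for f, votes in enumerate(arch_votes)
            if votes >= min_architectures]


# Toy input: 4 architectures, 2 refits each, 3 features.
strong = [0.001, 0.02, 0.001]
weak = [0.001, 0.02, 0.5]
print(select_features([[strong, strong]] * 3 + [[weak, weak]]))
```

The p-values themselves would come from a t-test between the baseline and perturbed differences; that step is left out here because it needs a statistics library.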
+
+MOVE Bayes
+^^^^^^^^^^
+
+#. Perturb a variable in one dataset.
+#. Repeat 30 times:
+
+    #. Train VAE model with original data.
+    #. Obtain reconstruction of original data (baseline reconstruction).
+    #. Obtain reconstruction of perturbed data (perturbed reconstruction).
+    #. Record difference between baseline and perturbed reconstruction.
+
+#. Compute probability of difference being greater than 0.
+#. Compute Bayes factor from probability: :math:`K = \log p - \log (1 - p)`.
+#. Sort probabilities by Bayes factor, from highest to lowest.
+#. Compute false discovery rate (FDR) as cumulative evidence.
+#. Select features whose FDR is below the desired threshold (e.g., 0.05). These
+   features are associated with the perturbed variable.
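The Bayes procedure above can be sketched in plain Python. This is a hedged illustration rather than MOVE's implementation: the function names and the `differences[r][f]` layout are hypothetical, the clamping constant `eps` is added only to avoid `log(0)`, and "FDR as cumulative evidence" is read here as the running mean of `1 - p` over the ranked features, which is an assumption.

```python
import math


def bayes_factors(differences, eps=1e-8):
    """Return per-feature probabilities and Bayes factors.

    differences[r][f] is the difference between baseline and perturbed
    reconstruction of feature f in refit r.
    """
    num_refits = len(differences)
    num_features = len(differences[0])
    probs, factors = [], []
    for f in range(num_features):
        positive = sum(1 for r in range(num_refits) if differences[r][f] > 0)
        p = min(max(positive / num_refits, eps), 1.0 - eps)  # avoid log(0)
        probs.append(p)
        factors.append(math.log(p) - math.log(1.0 - p))  # K = log p - log(1 - p)
    return probs, factors


def select_by_fdr(probs, factors, threshold=0.05):
    """Rank features by Bayes factor and keep those whose FDR stays below threshold."""
    order = sorted(range(len(factors)), key=lambda f: factors[f], reverse=True)
    selected, cumulative = [], 0.0
    for rank, f in enumerate(order, start=1):
        cumulative += 1.0 - probs[f]
        if cumulative / rank > threshold:  # assumed FDR estimate at this rank
            break
        selected.append(f)
    return selected


# Toy input: 4 refits, 3 features.
diffs = [
    [0.3, 0.1, -0.2],
    [0.4, -0.1, -0.3],
    [0.2, 0.2, -0.1],
    [0.5, -0.2, -0.4],
]
probs, factors = bayes_factors(diffs)
print(select_by_fdr(probs, factors))
```

This sketch follows the listed steps literally, so a feature that consistently decreases under perturbation receives a low Bayes factor and is not selected.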
diff --git a/docs/source/method/fig1.svg b/docs/source/method/fig1.svg
diff --git a/docs/source/pages/installation.rst b/docs/source/pages/installation.rst
diff --git a/docs/source/pages/tutorial.rst b/docs/source/pages/tutorial.rst