
Commit

porting changes from JB's edits
jaybee84 committed Feb 16, 2023
1 parent 2f01c30 commit 9760283
Showing 9 changed files with 57 additions and 35 deletions.
2 changes: 1 addition & 1 deletion content/03.combining-datasets.md
@@ -22,4 +22,4 @@ Ideally, the structure of the composite dataset reflects differences in variable
If the samples from the same cohort tend to group together regardless of phenotype, this suggests that the datasets used to generate the composite dataset need to be corrected to overcome differences in how the data were generated or collected.
In the next section, we will discuss approaches that can aid in identifying and visualizing structure in datasets to determine whether composite rare disease datasets are appropriate for use in ML.

![Combining datasets to increase data for training machine learning models. a) Appropriate methods are required to combine smaller datasets into a larger composite dataset: The left panel shows multiple small rare disease datasets that need to be combined to form a dataset of higher sample size. The color of the samples indicates classes or groups present in the datasets, and the shape represents the dataset of origin. The middle panel shows methods that may be used to combine the datasets while accounting for dataset-specific technical differences. The right panel shows principal component analysis of the combined datasets to verify proper integration of samples in the larger dataset. b) Composite datasets can be used to make training, evaluation, and validation datasets for machine learning: The left panel shows the division of the composite dataset into a training dataset and a held-out validation dataset (top). Shapes indicate the study of origin. The held-out validation set is a separate study that has not been seen by the model. The training set is further divided into training and evaluation datasets for k-fold cross-validation (in this example, k=4), where each fold contains all samples from an individual study. This approach is termed study-wise cross-validation, and supports the goal of training models that generalize to unseen cohorts. The panel on the right shows the class distribution of the training, evaluation, and held-out validation datasets.](images/figures/lower-res-figures/figure-1-combining-datasets.png){#fig:1}
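As an illustrative sketch of the study-wise cross-validation described in the caption, scikit-learn's `LeaveOneGroupOut` can hold out all samples from one study per fold; the dataset, study labels, and classifier below are synthetic placeholders, not from any cited study.

```python
# Study-wise cross-validation sketch: every fold's evaluation set is one
# whole study, so performance reflects generalization to unseen cohorts.
# All data below are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))                  # 40 samples, 5 features
y = rng.integers(0, 2, size=40)               # binary phenotype labels
study = np.repeat(["A", "B", "C", "D"], 10)   # study of origin (k=4)

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=study):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    # the held-out fold contains samples from exactly one study
    assert len(set(study[test_idx])) == 1
```

With four studies, this yields k=4 folds, matching the schematic in the figure.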

5 changes: 2 additions & 3 deletions content/04.heterogeneity.md
@@ -1,7 +1,7 @@
## Learning representations from rare disease data

- Dimensionality reduction methods can help explore and visualize underlying structure in the data (e.g., [@doi:10.1038/s41467-019-13056-x]), to define sample subgroups (e.g., [@doi:10.1038/s41467-020-15351-4], or for feature selection and extraction during application of specific machine learning models [@doi:10.1007/978-3-030-03243-2_299-1] (Figure [@fig:2]c).
- These methods ‘compress’ information from a large number of features into a smaller number of features in an unsupervised manner [@doi:10.1007/978-3-540-33037-0; @doi:10.1098/rsta.2015.0202, @https://www.jmlr.org/papers/v9/vandermaaten08a.html; @https://arxiv.org/abs/1802.03426; @doi:10.1016/j.media.2020.101660; @doi:10.1038/ncomms14825] (Figure {@fig:2}).
+ Dimensionality reduction methods can help explore and visualize underlying structure in the data (e.g., [@doi:10.1038/s41467-019-13056-x]), to define sample subgroups (e.g., [@doi:10.1038/s41467-020-15351-4]), or for feature selection and extraction during application of specific machine learning models [@doi:10.1007/978-3-030-03243-2_299-1] (Figure [@fig:2]c).
+ These methods ‘compress’ information from a large number of features into a smaller number of features in an unsupervised manner [@doi:10.1007/978-3-540-33037-0; @doi:10.1098/rsta.2015.0202; @https://www.jmlr.org/papers/v9/vandermaaten08a.html; @https://arxiv.org/abs/1802.03426; @doi:10.1016/j.media.2020.101660; @doi:10.1038/ncomms14825] (Figure [@fig:2], Box 3a).
An example of a method that is commonly used for dimensionality reduction is principal components analysis (PCA).
PCA identifies new features or dimensions, termed _principal components_ (PCs), that are combinations of original features.
The PCs are calculated in a way that maximizes the amount of information (variance) they contain and ensures that each PC is uncorrelated with the other PCs. [@doi:10.1098/rsta.2015.0202]
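A minimal PCA sketch on synthetic data (the dataset and dimensions are hypothetical) illustrates both properties: components are ordered by the variance they explain and are mutually uncorrelated.

```python
# PCA sketch on synthetic data: PCs are linear combinations of the
# original features, ordered by explained variance and uncorrelated.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))            # 100 samples, 20 features
pca = PCA(n_components=5).fit(X)
scores = pca.transform(X)                 # samples in the new 5-PC space

# explained variance decreases from PC1 to PC5
assert np.all(np.diff(pca.explained_variance_ratio_) <= 0)
# PCs are mutually uncorrelated (off-diagonal correlations ~ 0)
corr = np.corrcoef(scores, rowvar=False)
assert np.allclose(corr, np.eye(5), atol=1e-8)
```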
@@ -19,4 +19,3 @@ However, it can be a powerful tool to learn low-dimensional patterns from large
In later sections, we will discuss this method of leveraging large datasets to reduce dimensionality in smaller datasets, also known as feature-representation-transfer learning.
Once the dimensions of the training dataset have been reduced, model training can proceed using the experimental design as outlined in Box 1.

![Representation learning can extract useful features from high dimensional data. a) The data (e.g., transcriptomic data) are high-dimensional, having thousands of features (displayed as Fa-Fz). Samples come from two separate classes (purple and green row annotation). b) In the original feature space, Fa and Fb do not separate the two classes (purple and green) well. c) A representation learning approach learns new features (e.g., New Feature 1, a combination of Fa, Fb .... Fz, and New Feature 2, a different combination of Fa, Fb .... Fz). New Feature 2 distinguishes class, whereas New Feature 1 may capture some other variable such as batch (not represented). New features from the model can be used to interrogate the biology of the input samples, develop classification models, or apply other analytical techniques that would have been more difficult with the original dataset dimensions.](images/figures/pdfs/figure2-representation-learning-resized.png){#fig:2}
11 changes: 5 additions & 6 deletions content/05.model-complexity.md
@@ -11,14 +11,14 @@ All of these contribute to a low signal-to-noise ratio in rare disease datasets.
Thus, applying ML to rare disease data without addressing the aforementioned shortcomings may lead to models that have low reproducibility or are hard to interpret.

Class imbalance in datasets can be addressed using decision tree-based ensemble learning methods (e.g., random forests). [@doi:10.1007/s11634-019-00354-x] (Figure [@fig:3]a)
- Random forests use resampling (with replacement) based techniques to form a consensus about the important predictive features identified by the decision trees. [@https://doi.org/10.1023/A:1010933404324; @doi:10.1186/1472-6947-13-134]
- Additional approaches like combining random forests with resampling without replacement can generate confidence intervals for the model predictions by iteratively exposing the models to incomplete datasets, mimicking real world cases where most rare disease datasets are incomplete [@doi:10.3390/genes11020226].
+ Random forests use resampling (with replacement) based techniques to form a consensus about the important predictive features identified by the decision trees (Box 3c). [@https://doi.org/10.1023/A:1010933404324; @doi:10.1186/1472-6947-13-134]
+ Additional approaches like combining random forests with resampling without replacement can generate confidence intervals for the model predictions by iteratively exposing the models to incomplete datasets, mimicking real world cases where most rare disease datasets are incomplete [@doi:10.3390/genes11020226] (Box 3d).
Resampling approaches are most helpful in constructing confidence intervals for algorithms that generate the same outcome every time they are run (i.e., deterministic models).
For decision trees that choose features at random for selecting a path to the outcome (i.e., are non-deterministic), resampling approaches can be helpful in estimating the reproducibility of the model.
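A hedged sketch of the resampling idea on synthetic data: refitting a random forest on bootstrap resamples yields a distribution, and hence a confidence interval, for each feature importance. The dataset, forest settings, and interval width below are illustrative assumptions, not taken from the cited studies.

```python
# Bootstrap resampling sketch: refit a random forest on resamples (with
# replacement) to obtain a distribution of feature importances, from
# which per-feature confidence intervals can be read. Synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=60) > 0).astype(int)  # feature 0 informative

importances = []
for b in range(25):                        # bootstrap iterations
    idx = rng.integers(0, len(X), len(X))  # resample with replacement
    rf = RandomForestClassifier(n_estimators=50, random_state=b)
    rf.fit(X[idx], y[idx])
    importances.append(rf.feature_importances_)
importances = np.array(importances)

# 95% bootstrap interval per feature
lo, hi = np.percentile(importances, [2.5, 97.5], axis=0)
# the truly informative feature dominates on average
assert importances.mean(axis=0).argmax() == 0
```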

In situations where decision tree-based ensemble methods fail on rare disease datasets, cascade learning is a viable alternative. [@pmc:PMC6371307]
In cascade learning, multiple methods leveraging distinct underlying assumptions are used in tandem to capture stable patterns existing in the dataset [@doi:10.1109/CVPR.2001.990537; @doi:10.1007/978-3-540-75175-5_16; @doi:10.1109/icpr.2004.1334680].
- For example, a cascade learning approach for identifying rare disease patients from electronic health record data incorporated independent steps for feature extraction (word2vec [@arxiv:1301.3781]), preliminary prediction with ensembled decision trees, and then prediction refinement using data similarity metrics. [@pmc:PMC6371307]
+ For example, a cascade learning approach for identifying rare disease patients from electronic health record data (Box 3a) incorporated independent steps for feature extraction (word2vec [@arxiv:1301.3781]), preliminary prediction with ensembled decision trees, and then prediction refinement using data similarity metrics. [@pmc:PMC6371307]
Combining these three methods resulted in better overall prediction when implemented on a silver standard dataset, as compared to a model that used ensemble-based prediction alone.
In addition to cascade learning, approaches that better represent rare classes through class re-balancing, such as inverse sampling probability weighting [@doi:10.1186/s12911-021-01688-3], inverse class frequency weighting [@doi:10.1197/jamia.M3095], oversampling of rare classes [@https://doi.org/10.1613/jair.953], or uniformly random undersampling of the majority class [@doi:10.48550/arXiv.1608.06048], may also help mitigate limitations due to class imbalance.
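Inverse class frequency weighting can be sketched as follows on a synthetic imbalanced dataset; the weighting formula matches scikit-learn's `class_weight="balanced"` heuristic, and all data here are hypothetical.

```python
# Inverse class frequency weighting sketch: each class's weight is
# n_samples / (n_classes * class_count), so the rare class contributes
# as much total weight as the common one. Synthetic imbalanced data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
y = np.array([0] * 90 + [1] * 10)          # 10% rare class
X = rng.normal(size=(100, 3)) + y[:, None]

counts = np.bincount(y)
weights = len(y) / (len(counts) * counts)  # n / (n_classes * class_count)
assert weights[1] > weights[0]             # rare class weighted up

# the same heuristic is built into scikit-learn as class_weight="balanced"
model = LogisticRegression(class_weight="balanced").fit(X, y)
```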

@@ -28,14 +28,13 @@ Regularization can help in these scenarios.
Regularization is an approach by which a penalty or constraint is added to a model to avoid making large prediction errors.
These procedures not only protect ML models and learned representations from poor generalizability caused by overfitting, but also reduce model complexity by reducing the feature space available for training [@doi:10.1371/journal.pgen.1004754; @doi:10.1002/sim.6782] (Figure [@fig:3]a).
Some examples of ML methods with regularization include ridge regression, LASSO regression, and elastic net regression [@doi:10.1111/j.1467-9868.2005.00503.x], among others.
- Regularization is often used in rare variant discovery and immune cell signature discovery studies; much like rare disease, these examples need to accommodate sparsity in data.
+ Regularization is often used in exploring the functional role of rare variants in rare disease and in immune cell signature discovery studies; much like rare disease, these examples need to accommodate sparsity in data.
For example, LASSO has been used to capture combinations of rare and common variants associated with specific traits. [@doi:10.1186/1753-6561-5-s9-s113]
In this example, applying LASSO regularization reduced the number of common variants included as features in the final analysis, generating a simpler model while reducing error in the association of common and rare variants with a specific trait.
In the context of rare immune cell signature discovery, variations of elastic-net regression were found to outperform other regression approaches [@doi:10.1016/j.compbiomed.2015.10.008; @doi:10.1186/s12859-019-2994-z].
Thus, regularization methods like LASSO or elastic-net are beneficial in ML with rare observations, and are worth exploring in the context of rare diseases. [@doi:10.1371/journal.pgen.1004754]
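A minimal LASSO sketch on synthetic data (not the cited variant study): the L1 penalty shrinks coefficients of uninformative features to exactly zero, yielding a sparser, simpler model than unregularized regression.

```python
# LASSO sketch on synthetic data: only 3 of 30 features carry signal,
# and the L1 penalty zeroes out most of the irrelevant coefficients.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 30))
beta = np.zeros(30)
beta[:3] = [2.0, -1.5, 1.0]                # the only true predictors
y = X @ beta + 0.1 * rng.normal(size=80)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# LASSO produces a sparser (simpler) model than ordinary least squares
assert np.sum(lasso.coef_ == 0) > np.sum(ols.coef_ == 0)
```

An elastic net (mixing L1 and L2 penalties, as in scikit-learn's `ElasticNet`) interpolates between this behavior and ridge regression's uniform shrinkage.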
Other examples of regularization that have been successfully applied to rare disease ML include Kullback–Leibler (KL) divergence loss or dropout during neural network training.
In a study using a variational autoencoder (VAE) (see Box 2: Definitions) for dimensionality reduction in gene expression data from acute myeloid leukemia (AML) samples, the KL loss between the input data and its low dimensional representation provided the regularizing penalty for the model. [@doi:10.1101/278739; @doi:10.48550/arXiv.1312.6114]
- In a study using a convolutional neural network (CNN) to identify tubers in MRI images from tuberous sclerosis patients, overfitting was minimized using the dropout regularization method which removed randomly chosen network nodes in each iteration of the CNN model generating simpler models in each iteration.[@doi:10.1371/journal.pone.0232376]
+ In a study using a convolutional neural network (CNN) to identify tubers in MRI images from tuberous sclerosis patients (an application relevant to Box 3a), overfitting was minimized using dropout regularization, which removes randomly chosen network nodes in each iteration of training, generating a simpler model in each iteration. [@doi:10.1371/journal.pone.0232376]
Thus, depending on the learning method that is used, regularization approaches should be incorporated into data analysis when working with rare disease datasets.
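Dropout can be sketched in a few lines of numpy (the layer shape and drop rate are hypothetical): each pass zeroes a random subset of units, so every iteration effectively trains a thinned sub-network.

```python
# Inverted-dropout sketch (hypothetical layer size and rate): each unit
# is zeroed with probability `rate`; surviving units are rescaled so the
# expected activation is unchanged.
import numpy as np

def dropout(activations, rate, rng):
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

rng = np.random.default_rng(0)
h = np.ones((4, 10))                  # a batch of hidden activations
h_drop = dropout(h, rate=0.5, rng=rng)

# units are either dropped (0) or rescaled by 1 / (1 - rate) = 2
assert np.all((h_drop == 0.0) | (h_drop == 2.0))
```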

![Strategies to reduce misinterpretation of machine learning model output in rare disease. a) Bootstrapping: Left panel shows a small rare disease dataset, which can be resampled with replacement using bootstrap to form a large resampled dataset (middle panel). Running the same ML model on multiple resampled datasets generates a distribution of values for the importance scores for each feature utilized by the ML model (right panel). b) Cascade Learning: A schematic showing the different steps in a cascade learning approach for identifying rare disease patients from electronic health record data. The bar plot in the middle panel schematically represents patient classification accuracy after ensemble learning. The accuracy is high for non-rare diseases, but low for rare diseases. The bar plot on the right panel depicts classification accuracy after implementation of cascade learning. The accuracy is high for both non-rare and rare diseases. c) Regularization: A schematic showing the concept of regularization to selectively learn relevant features. The samples (green and blue circles) in the rare disease dataset on the left panel can be represented as a combination of features. Each horizontal bar in the middle panel (training set) represents a feature-by-sample heatmap for one sample. In the held-out validation dataset, for a sample of unknown class (open circle), some features recapitulate the pattern present in the training set, while others do not. The right panel depicts accuracy of predicting the class of the open circles with or without using regularization during implementation of the ML models on rare disease data. Without regularization the classification accuracy is low due to the presence of only a subset of learned features (denoted by dashed rectangle in middle panel), but with regularization this subset of features is sufficient to gain high classification accuracy.](images/figures/pdfs/figure-3-resized.png){#fig:3}
