Jaclyn taroni/2023 02 19 edits #242

Merged Feb 21, 2023 (21 commits; changes shown from 18 commits)
3334802
Make Box 3 adhere to one sentence per line
jaclyn-taroni Feb 19, 2023
10c9283
Box 2 minor formatting edits
jaclyn-taroni Feb 19, 2023
416da4a
Reference panel "(c)" -> "c)"
jaclyn-taroni Feb 19, 2023
d90b494
Use "held-out validation set" everywhere
jaclyn-taroni Feb 19, 2023
5a7a900
Save a few words
jaclyn-taroni Feb 19, 2023
ca4c5a0
Remove use of "complex" from synopsis
jaclyn-taroni Feb 19, 2023
13eba04
replace "cumulatively used" with "combined and used"
jaclyn-taroni Feb 19, 2023
800959d
Revise wording around evaluation dataset in CV
jaclyn-taroni Feb 19, 2023
86f61ac
Remove reference to shapes in Fig 1B legend
jaclyn-taroni Feb 19, 2023
488e7f4
Add TODO for the LASSO reference
jaclyn-taroni Feb 19, 2023
340b623
Minor changes to Box 3 wording
jaclyn-taroni Feb 19, 2023
8c83a1f
Box 3 style consistent with Box 2
jaclyn-taroni Feb 21, 2023
f568cad
Change order of boxes – move Box 3 ("tasks") up to Box 1 slot
jaclyn-taroni Feb 21, 2023
b07a522
Add reference to common tasks in the introduction
jaclyn-taroni Feb 21, 2023
e5b2125
Make diagnostic decision support more general
jaclyn-taroni Feb 21, 2023
7e364a9
Move representation learning citations and Box 1a ref down to RL section
jaclyn-taroni Feb 21, 2023
0a1b190
Merge pull request #243 from jaybee84/jaclyn-taroni/2023-02-21-change…
jaclyn-taroni Feb 21, 2023
c03e152
Make a few tweaks to where Box 1 is referenced
jaclyn-taroni Feb 21, 2023
d174f55
Update content/05.model-complexity.md
jaclyn-taroni Feb 21, 2023
dbf36c0
Apply suggestions from code review
jaclyn-taroni Feb 21, 2023
09365eb
Merge branch 'jaybee84/edits021523' into jaclyn-taroni/2023-02-19-edits
jaclyn-taroni Feb 21, 2023
6 changes: 3 additions & 3 deletions content/01.synopsis.md
@@ -1,11 +1,11 @@
## Abstract {.page_break_before}

The advent of high-throughput profiling technologies, such as genomic sequencing, has accelerated basic research and made deep molecular characterization of patient samples routine.
These approaches provide a rich portrait of genes, molecular pathways, and cell types involved in complex phenotypes.
These approaches provide a rich portrait of genes, molecular pathways, and cell types involved in rare disease phenotypes.
Machine learning (ML) can be a useful tool to extract disease-relevant patterns from high dimensional datasets.
However, depending on the complexity of the biological question, machine learning often requires a large number of samples to identify recurrent and biologically meaningful patterns.
Rare diseases are inherently limited in clinical cases and thus have few samples to study.
Rare diseases are inherently limited in clinical cases and thus have few samples to study.
Precision medicine also presents a similar challenge, where patients with common diseases are partitioned into small subsets of patients based on particular characteristics.
In this perspective, we outline the challenges and emerging solutions for using machine learning in the context of small sample sets, specifically that of rare diseases.
Advances in machine learning methods for rare disease are likely to be informative for applications beyond rare diseases in which sample sizes are small but datasets are complex (e.g., using genomics data for predictive modeling in precision medicine).
Advances in machine learning methods for rare disease are likely to be informative for applications beyond rare diseases in which sample sizes are small but datasets are high-dimensional (e.g., using genomics data for predictive modeling in precision medicine).
We propose that the methods community prioritizes the development of machine learning techniques for rare disease research.
1 change: 1 addition & 0 deletions content/02.intro.md
@@ -12,6 +12,7 @@ Therefore, if the goal of a study is to classify patients with a rare disease in
Conversely, unsupervised learning algorithms can learn patterns or features from unlabeled training data.
In the absence of known molecular subtypes, unsupervised ML approaches can be applied to identify groups of samples that are similar and may have distinct patterns of pathway activation [@doi:10.1158/0008-5472.CAN-08-2100].
Unsupervised approaches can also extract combinations of features (e.g., genes) that may describe a certain cell type or pathway.
See Box 1 "Common uses for machine learning in rare disease" for more examples of how ML can be used in rare disease research.

While ML can be a useful tool, there are challenges in applying ML to rare disease datasets.
ML methods are generally most effective when using large datasets; thus, analyzing high-dimensional biomedical data (i.e., data with > 1000 features, e.g., 20,000 genes) from rare disease datasets that typically contain 20 to 99 samples is challenging [@https://www.fda.gov/media/99546/download; @doi:10.1186/s13023-020-01424-6].
2 changes: 1 addition & 1 deletion content/03.combining-datasets.md
@@ -6,7 +6,7 @@ More features often mean increased missing observations (_sparsity_), more dissi

One of the important factors in machine learning is performance (e.g. the accuracy of a supervised model in identifying patterns relevant for the biological question of interest, or the reliability of an unsupervised model in identifying hypothetical biological patterns that are supported by post-hoc validation and research).
When small sample sizes compromise an ML model’s performance, two approaches can be taken to manage sparsity, variance, and multicollinearity: 1) increase the number of samples, or 2) improve the quality of samples.
In the first approach, appropriate training, evaluation, and held-out validation sets could be constructed by combining multiple rare disease cohorts (Figure [@fig:1]a, Box 1).
In the first approach, appropriate training, evaluation, and held-out validation sets could be constructed by combining multiple rare disease cohorts (Figure [@fig:1]a, Box 2).
When combining datasets, special attention should be directed towards data harmonization since data collection methods can differ from cohort to cohort.
Without careful selection of aggregation methods, one may introduce variability into the combined dataset that can negatively impact the ML model’s ability to learn or detect meaningful signals.
Steps such as reprocessing the data using a single pipeline, using batch correction methods [@doi:10.1093/biostatistics/kxj037; @doi:10.1093/nar/gku864], and normalizing raw values appropriately without affecting the underlying variance in the data [@doi:10.1186/gb-2010-11-3-r25; @doi:10.1371/journal.pcbi.1003531; @doi:10.1186/s13059-014-0550-8] may be necessary to mitigate unwanted variability (Figure [@fig:1]a).
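One simple form of the normalization step above can be sketched directly: the snippet below applies a per-cohort z-score to a toy expression matrix. This is a deliberately simplified stand-in for dedicated batch-correction methods such as ComBat; the data and the `per_cohort_zscore` helper are hypothetical.

```python
import numpy as np

def per_cohort_zscore(X, cohorts):
    """Standardize each feature within each cohort (gene-wise z-score).

    X: (n_samples, n_features) expression matrix.
    cohorts: length-n_samples array of cohort labels.
    Removes per-cohort location/scale differences only; real batch
    correction methods model more complex batch effects.
    """
    X = np.asarray(X, dtype=float)
    cohorts = np.asarray(cohorts)
    out = np.empty_like(X)
    for c in np.unique(cohorts):
        mask = cohorts == c
        block = X[mask]
        mu = block.mean(axis=0)
        sd = block.std(axis=0)
        sd[sd == 0] = 1.0  # avoid division by zero for constant genes
        out[mask] = (block - mu) / sd
    return out

# Two toy cohorts measured on the same 3 genes, with a strong offset in cohort B
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 3)), rng.normal(5, 2, (12, 3))])
labels = np.array(["A"] * 10 + ["B"] * 12)
X_adj = per_cohort_zscore(X, labels)
```

After adjustment, each cohort is centered and scaled identically, so the cohort offset no longer dominates downstream analysis.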
6 changes: 3 additions & 3 deletions content/04.heterogeneity.md
@@ -1,7 +1,7 @@
## Learning representations from rare disease data

Dimensionality reduction methods can help explore and visualize underlying structure in the data (e.g., [@doi:10.1038/s41467-019-13056-x]), to define sample subgroups (e.g., [@doi:10.1038/s41467-020-15351-4]), or for feature selection and extraction during application of specific machine learning models [@doi:10.1007/978-3-030-03243-2_299-1] (Figure [@fig:2]c).
These methods ‘compress’ information from a large number of features into a smaller number of features in an unsupervised manner [@doi:10.1007/978-3-540-33037-0; @doi:10.1098/rsta.2015.0202, @https://www.jmlr.org/papers/v9/vandermaaten08a.html; @https://arxiv.org/abs/1802.03426; @doi:10.1016/j.media.2020.101660; @doi:10.1038/ncomms14825] (Figure [@fig:2], Box 3a).
These methods ‘compress’ information from a large number of features into a smaller number of features in an unsupervised manner [@doi:10.1007/978-3-540-33037-0; @doi:10.1098/rsta.2015.0202, @https://www.jmlr.org/papers/v9/vandermaaten08a.html; @https://arxiv.org/abs/1802.03426] (Figure [@fig:2]).
An example of a method that is commonly used for dimensionality reduction is principal components analysis (PCA).
PCA identifies new features or dimensions, termed _principal components_ (PCs), that are combinations of original features.
The PCs are calculated in a way that maximizes the amount of information (variance) they contain and ensures that each PC is uncorrelated with the other PCs. [@doi:10.1098/rsta.2015.0202]
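The two PCA properties just described (PCs capture maximal variance and are mutually uncorrelated) can be illustrated with a minimal SVD-based implementation on simulated data; the `pca` helper and the toy expression matrix below are illustrative only.

```python
import numpy as np

def pca(X, n_components):
    """Minimal PCA via singular value decomposition.

    Returns the sample scores (projections onto the top PCs) and the
    fraction of total variance each retained PC explains.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)               # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T     # project samples onto PCs
    var_explained = (S ** 2) / (S ** 2).sum()
    return scores, var_explained[:n_components]

# Toy "expression matrix": 50 samples x 20 genes with one dominant axis of variation
rng = np.random.default_rng(1)
signal = rng.normal(size=(50, 1)) @ rng.normal(size=(1, 20))
X = signal + 0.1 * rng.normal(size=(50, 20))
scores, var_exp = pca(X, n_components=2)
```

Because the simulated data have a single strong axis of variation, PC1 captures most of the variance, and the two PC score vectors are uncorrelated by construction.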
@@ -13,9 +13,9 @@ Beyond dimensionality reduction, other unsupervised learning approaches such as

Representation learning approaches (which include dimensionality reduction) learn low-dimensional representations (composite features) from the raw data.
For example, representation learning through matrix factorization methods can extract features from transcriptomics datasets that are made of combinations of gene expression values. [@doi:10.1038/s41467-020-14666-6; @doi:10.1093/bioinformatics/btq503; @doi:10.1186/s13059-020-02021-3]
Representation learning can also be utilized to predict rare pathologies from images [@doi:10.1016/j.media.2020.101660] (Box 1a) or detect cell populations associated with rare diseases in single-cell mass cytometry data [@doi:10.1038/ncomms14825].

When applied to complex biological systems, representation learning generally requires many samples and therefore may appear to aggravate the curse of dimensionality.
However, it can be a powerful tool to learn low-dimensional patterns from large datasets and then find those patterns in smaller, related datasets.
In later sections, we will discuss this method of leveraging large datasets to reduce dimensionality in smaller datasets, also known as feature-representation-transfer learning.
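A minimal sketch of this feature-representation-transfer idea: learn a low-dimensional basis from a large (hypothetical) reference compendium, then project a small rare-disease cohort onto that basis. Dataset sizes, the shared "expression programs", and all variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical large reference compendium: 500 samples x 100 genes,
# generated from a few strong shared expression programs.
programs = rng.normal(size=(4, 100))
large = rng.normal(size=(500, 4)) @ programs + 0.2 * rng.normal(size=(500, 100))

# Learn a low-dimensional basis from the LARGE dataset only.
mean = large.mean(axis=0)
_, _, Vt = np.linalg.svd(large - mean, full_matrices=False)
basis = Vt[:4]                      # top 4 components, shape (4, 100)

# Small rare-disease cohort: 15 samples on the same 100 genes.
small = rng.normal(size=(15, 4)) @ programs + 0.2 * rng.normal(size=(15, 100))

# Transfer: represent the small cohort in the reference feature space.
small_embedded = (small - mean) @ basis.T   # shape (15, 4)
```

Because the basis was learned on the large dataset, the small cohort is described by only four composite features, yet those features reconstruct it well whenever both datasets share the same underlying patterns.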
Once the dimensions of the training dataset have been reduced, model training can proceed using the experimental design as outlined in Box 1.

Once the dimensions of the training dataset have been reduced, model training can proceed using the experimental design as outlined in Box 2.
10 changes: 5 additions & 5 deletions content/05.model-complexity.md
@@ -11,14 +11,14 @@ All of these contribute to low signal to noise ratio in rare disease datasets.
Thus, applying ML to rare disease data without addressing the aforementioned shortcomings may lead to models that have low reproducibility or are hard to interpret.

Class imbalance in datasets can be addressed using decision tree-based ensemble learning methods (e.g., random forests) [@doi:10.1007/s11634-019-00354-x] (Figure [@fig:3]a)
Random forests use resampling (with replacement) based techniques to form a consensus about the important predictive features identified by the decision trees (Box 3c). [@https://doi.org/10.1023/A:1010933404324; @doi:10.1186/1472-6947-13-134]
Additional approaches like combining random forests with resampling without replacement can generate confidence intervals for the model predictions by iteratively exposing the models to incomplete datasets, mimicking real world cases where most rare disease datasets are incomplete [@doi:10.3390/genes11020226] (Box 3d).
Random forests use resampling (with replacement) based techniques to form a consensus about the important predictive features identified by the decision trees (e.g., Box 1c). [@https://doi.org/10.1023/A:1010933404324; @doi:10.1186/1472-6947-13-134]
Additional approaches like combining random forests with resampling without replacement can generate confidence intervals for the model predictions by iteratively exposing the models to incomplete datasets, mimicking real world cases where most rare disease datasets are incomplete [@doi:10.3390/genes11020226].
Resampling approaches are most helpful in constructing confidence intervals for algorithms that generate the same outcome every time they are run (i.e., deterministic models).
For decision trees that choose features at random for selecting a path to the outcome (i.e., are non-deterministic), resampling approaches can be helpful in estimating the reproducibility of the model.
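The resampling idea described above can be sketched as follows: repeatedly refit a deterministic base model on bootstrap resamples (sampling with replacement) and read a confidence interval off the spread of its predictions. To keep the sketch short, ordinary least squares stands in for a full random forest; all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy dataset: outcome depends linearly on 3 features, plus noise.
X = rng.normal(size=(40, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 * rng.normal(size=40)
x_new = np.array([0.5, 0.5, 0.5])   # a new sample to predict

def fit_predict(X, y, x_new):
    """Deterministic base model: ordinary least squares prediction."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return x_new @ coef

# Bootstrap: refit on resampled (with replacement) datasets and
# collect the spread of predictions for the new sample.
preds = []
for _ in range(500):
    idx = rng.integers(0, len(y), size=len(y))   # resample with replacement
    preds.append(fit_predict(X[idx], y[idx], x_new))
preds = np.array(preds)
ci_low, ci_high = np.percentile(preds, [2.5, 97.5])   # 95% interval
```

The width of the interval reflects how sensitive the model's prediction is to which samples happened to be observed, which is exactly the uncertainty that matters in small cohorts.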

In situations where decision tree-based ensemble methods fail when they are applied to rare disease datasets, cascade learning is a viable alternative. [@pmc:PMC6371307]
In cascade learning, multiple methods leveraging distinct underlying assumptions are used in tandem to capture stable patterns existing in the dataset [@doi:10.1109/CVPR.2001.990537; @doi:10.1007/978-3-540-75175-5_16; @doi:10.1109/icpr.2004.1334680].
For example, a cascade learning approach for identifying rare disease patients from electronic health record data (Box 3a) incorporated independent steps for feature extraction (word2vec [@arxiv:1301.3781]), preliminary prediction with ensembled decision trees, and then prediction refinement using data similarity metrics. [@pmc:PMC6371307]
For example, a cascade learning approach for identifying rare disease patients from electronic health record data (Box 1a) incorporated independent steps for feature extraction (word2vec [@arxiv:1301.3781]), preliminary prediction with ensembled decision trees, and then prediction refinement using data similarity metrics. [@pmc:PMC6371307]
Combining these three methods resulted in better overall prediction when implemented on a silver standard dataset, as compared to a model that used ensemble-based prediction alone.
In addition to cascade learning, class re-balancing techniques that better represent rare classes, such as inverse sampling probability weighting [@doi:10.1186/s12911-021-01688-3], inverse class frequency weighting [@doi:10.1197/jamia.M3095], oversampling of rare classes [@https://doi.org/10.1613/jair.953], or uniformly random undersampling of the majority class [@doi:10.48550/arXiv.1608.06048], may also help mitigate limitations due to class imbalance.
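Two of these re-balancing techniques, inverse class frequency weighting and random oversampling of the rare class, are simple enough to sketch directly. The toy labels below mimic a hypothetical 95-control / 5-case rare disease cohort; the helper names are illustrative.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Weight each sample inversely to its class frequency, so a rare
    class contributes as much total weight as a common one."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    freq = dict(zip(classes, counts / len(labels)))
    return np.array([1.0 / freq[l] for l in labels])

def oversample_minority(X, labels, rng):
    """Naive random oversampling: resample minority-class rows (with
    replacement) until all classes match the largest class size."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    keep = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(labels == c)
        extra = rng.choice(idx, size=target - n, replace=True)
        keep.extend(idx)
        keep.extend(extra)
    keep = np.array(keep)
    return X[keep], labels[keep]

# 95 controls vs. 5 rare-disease cases
labels = np.array([0] * 95 + [1] * 5)
X = np.arange(100).reshape(100, 1)
w = inverse_frequency_weights(labels)
X_bal, y_bal = oversample_minority(X, labels, np.random.default_rng(4))
```

With inverse-frequency weights, the five cases carry the same total weight as the 95 controls; with oversampling, the rebalanced dataset contains equal numbers of each class (at the cost of duplicated case samples, which risks overfitting to those few cases).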

@@ -32,9 +32,9 @@ Regularization is often used in exploring functional role of rare variants in ra
For example, LASSO has been used as a feature selection method for amyotrophic lateral sclerosis (ALS) gene expression data. [@doi:10.1186/s10020-023-00603-y]
In this example, applying LASSO regularization reduced the number of genes included as features in a machine learning model designed to classify brain tissue regions from ALS patients.
In the context of rare immune cell signature discovery, variations of elastic-net regression were found to outperform other regression approaches [@doi:10.1016/j.compbiomed.2015.10.008; @doi:10.1186/s12859-019-2994-z].
Thus, regularization methods like LASSO or elastic-net are beneficial in ML with rare observations, and are worth exploring in the context of rare diseases.[@doi:10.1371/journal.pgen.1004754]
Thus, regularization methods like LASSO or elastic-net are beneficial in ML with rare observations, and are worth exploring in the context of rare diseases.[@doi:10.1371/journal.pgen.1004754] <!--TODO: replace this reference-->
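To make the feature-selection behavior of LASSO concrete, the sketch below minimizes the LASSO objective with a minimal proximal-gradient (ISTA) loop on simulated expression data in which only two of fifty genes carry signal. In practice one would use an established implementation (e.g., glmnet or scikit-learn); this solver and the data are illustrative only.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimal LASSO solver via proximal gradient descent (ISTA).

    Minimizes (1/2n)||y - Xw||^2 + lam * ||w||_1.  The L1 penalty
    drives coefficients of uninformative features exactly to zero,
    which is what makes LASSO usable for feature selection.
    """
    n, p = X.shape
    w = np.zeros(p)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = w - step * grad
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold
    return w

# Toy expression data: 30 samples x 50 genes, only genes 0 and 1 matter.
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 50))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=30)
w = lasso_ista(X, y, lam=0.2)
selected = np.flatnonzero(w != 0)
```

Even with more genes than samples, the L1 penalty recovers the two informative genes and zeroes out nearly all of the uninformative ones, illustrating why this family of methods suits rare disease sample sizes.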
Other examples of regularization that have been successfully applied to rare disease ML include Kullback–Leibler (KL) divergence loss or dropout during neural network training.
In a study using a variational autoencoder (VAE) (see Box 2: Definitions) for dimensionality reduction in gene expression data from acute myeloid leukemia (AML) samples, the KL loss between the input data and its low dimensional representation provided the regularizing penalty for the model. [@doi:10.1101/278739; @doi:10.48550/arXiv.1312.6114]
In a study using a variational autoencoder (VAE) (see Box 3: Definitions) for dimensionality reduction in gene expression data from acute myeloid leukemia (AML) samples, the KL loss between the input data and its low dimensional representation provided the regularizing penalty for the model. [@doi:10.1101/278739; @doi:10.48550/arXiv.1312.6114]
In a study using a convolutional neural network (CNN) to identify tubers in MRI images from tuberous sclerosis patients (an application that can facilitate Box 1a), overfitting was minimized using the dropout regularization method, which removed randomly chosen network nodes in each iteration of the CNN model, generating simpler models [@doi:10.1371/journal.pone.0232376].
Thus, depending on the learning method that is used, regularization approaches should be incorporated into data analysis when working with rare disease datasets.
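The dropout mechanism mentioned above is simple to sketch outside any deep learning framework: during training, randomly zero units and rescale the survivors so that expected activations are unchanged ("inverted" dropout); at inference, pass activations through untouched. This toy version illustrates the mechanism only and is not the cited studies' implementation.

```python
import numpy as np

def dropout(activations, p, rng, train=True):
    """Inverted dropout: during training, zero each unit with
    probability p and rescale survivors by 1/(1-p) so the expected
    activation is unchanged; at inference, pass through unchanged."""
    if not train or p == 0:
        return activations
    mask = rng.random(activations.shape) >= p   # keep with probability 1-p
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(6)
h = np.ones((10000, 1))            # a layer of constant unit activations
h_drop = dropout(h, p=0.5, rng=rng)
```

Each training pass sees a different random sub-network, which discourages any single unit from dominating, the same intuition behind using dropout as a regularizer on small datasets.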

6 changes: 3 additions & 3 deletions content/06.prior-knowledge.md
@@ -7,15 +7,15 @@ Knowledge graphs (KGs) which integrate related-but-different data types, provide
These graphs connect genetic, functional, chemical, clinical, and ontological data so that relationships of data with disease phenotypes can be explored through manual review [@doi:10.1093/database/baaa015] or computational methods [@doi:10.1016/j.jbi.2021.103838; @doi:10.1142/9789811215636_0041; @doi:10.1186/s12911-019-0938-1] (Figure [@fig:3]a)
KGs may include links (also called edges) or nodes that are specific to the rare disease of interest (e.g., an FDA approved treatment would be a specific disease-compound edge in the KG) as well as edges that are more generalized (e.g., gene-gene interactions noted in the literature for a different disease). (Figure [@fig:4]a)

Rare disease researchers can repurpose general (i.e., not rare disease-specific) biological or chemical knowledge graphs to answer rare disease-based research questions [@doi:10.1142/9789811215636_0041] (e.g. Box 3b).
There are a variety of tactics to sift through the large amounts of complex data in knowledge graphs.
Rare disease researchers can repurpose general (i.e., not rare disease-specific) biological or chemical knowledge graphs to answer rare disease-based research questions [@doi:10.1142/9789811215636_0041] (e.g. Box 1b).
There are a variety of tactics to sift through the data encoded in knowledge graphs.
One such tactic is to calculate the distances between nodes of interest (e.g., diseases and drugs to identify drugs for repurposing in rare disease [@doi:10.1142/9789811215636_0041]); this is often done by determining the "embeddings" (linear representations of the position and connections of a particular point in the graph) for nodes in the knowledge graph, and calculating the similarity between these embeddings.
Developing effective methods to calculate node embeddings that can generate actionable insights for rare diseases is an active area of research [@doi:10.1142/9789811215636_0041].
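A toy version of this embedding-and-similarity tactic: the sketch below uses the leading eigenvectors of a small graph's adjacency matrix as stand-in node embeddings (real applications would use learned embeddings such as node2vec) and compares cosine similarities within and across clusters. The graph and its disease/gene/drug interpretation are hypothetical.

```python
import numpy as np

def spectral_embeddings(adj, dim):
    """Embed nodes using the leading eigenvectors of the symmetric
    adjacency matrix -- a simple stand-in for learned embeddings."""
    vals, vecs = np.linalg.eigh(adj)
    order = np.argsort(vals)[::-1]          # largest eigenvalues first
    return vecs[:, order[:dim]]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Tiny hypothetical KG projected to an undirected graph:
# nodes 0-2 form one disease/gene/drug cluster, nodes 3-5 another,
# with a single bridging edge (2, 3).
adj = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0

emb = spectral_embeddings(adj, dim=2)
within = cosine(emb[0], emb[1])    # nodes in the same cluster
across = cosine(emb[0], emb[5])    # nodes in different clusters
```

Nodes in the same cluster end up with nearly identical embeddings while nodes in different clusters do not, which is the property that lets embedding similarity surface candidate disease-drug links.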

Another application of KGs is to augment or refine a dataset [@doi:10.1186/s12911-019-0752-9; @doi:10.1186/s12911-019-0938-1].
For example, Li et al. [@doi:10.1186/s12911-019-0938-1] used a KG to identify linked terms in a medical corpus from a large number of patients, some with rare disease diagnoses.
They were able to augment their text dataset by identifying related terms in the clinical text and mapping them to the same term (e.g., mapping "cancer" and "malignancy" in different patients to the same clinical concept).
With this augmented and improved dataset, they were able to train and test a variety of text classification algorithms to identify rare disease patients within their corpus. (Figure [@fig:4]b, Box 3a)
With this augmented and improved dataset, they were able to train and test a variety of text classification algorithms to identify rare disease patients within their corpus. (Figure [@fig:4]b, Box 1a)

Finally, another possible tactic for rare disease researchers is to take a knowledge graph, or an integration of several knowledge graphs, and apply neural network-based algorithms optimized for graph data, such as a graph convolutional neural network.
Rao and colleagues [@doi:10.1186/s12920-018-0372-8] describe the construction of a KG using phenotype information (Human Phenotype Ontology) and rare disease information (Orphanet) and curated gene interaction/pathway data (Lit-BM-13, WikiPathways) [@pmc:PMC7778952; @doi:10.1016/j.cell.2014.10.050; @doi:10.1093/nar/gkaa1024].