Jaclyn taroni/2023 02 19 edits #242

Merged Feb 21, 2023 (21 commits; changes shown from 18 commits)
3334802
Make Box 3 adhere to one sentence per line
jaclyn-taroni Feb 19, 2023
10c9283
Box 2 minor formatting edits
jaclyn-taroni Feb 19, 2023
416da4a
Reference panel "(c)" -> "c)"
jaclyn-taroni Feb 19, 2023
d90b494
Use "held-out validation set" everywhere
jaclyn-taroni Feb 19, 2023
5a7a900
Save a few words
jaclyn-taroni Feb 19, 2023
ca4c5a0
Remove use of "complex" from synopsis
jaclyn-taroni Feb 19, 2023
13eba04
replace "cumulatively used" with "combined and used"
jaclyn-taroni Feb 19, 2023
800959d
Revise wording around evaluation dataset in CV
jaclyn-taroni Feb 19, 2023
86f61ac
Remove reference to shapes in Fig 1B legend
jaclyn-taroni Feb 19, 2023
488e7f4
Add TODO for the LASSO reference
jaclyn-taroni Feb 19, 2023
340b623
Minor changes to Box 3 wording
jaclyn-taroni Feb 19, 2023
8c83a1f
Box 3 style consistent with Box 2
jaclyn-taroni Feb 21, 2023
f568cad
Change order of boxes – move Box 3 ("tasks") up to Box 1 slot
jaclyn-taroni Feb 21, 2023
b07a522
Add reference to common tasks in the introduction
jaclyn-taroni Feb 21, 2023
e5b2125
Make diagnostic decision support more general
jaclyn-taroni Feb 21, 2023
7e364a9
Move representation learning citations and Box 1a ref down to RL section
jaclyn-taroni Feb 21, 2023
0a1b190
Merge pull request #243 from jaybee84/jaclyn-taroni/2023-02-21-change…
jaclyn-taroni Feb 21, 2023
c03e152
Make a few tweaks to where Box 1 is referenced
jaclyn-taroni Feb 21, 2023
d174f55
Update content/05.model-complexity.md
jaclyn-taroni Feb 21, 2023
dbf36c0
Apply suggestions from code review
jaclyn-taroni Feb 21, 2023
09365eb
Merge branch 'jaybee84/edits021523' into jaclyn-taroni/2023-02-19-edits
jaclyn-taroni Feb 21, 2023
6 changes: 3 additions & 3 deletions content/01.synopsis.md
@@ -1,11 +1,11 @@
## Abstract {.page_break_before}

The advent of high-throughput profiling technologies, such as genomic sequencing, has accelerated basic research and made deep molecular characterization of patient samples routine.
These approaches provide a rich portrait of genes, molecular pathways, and cell types involved in complex phenotypes.
These approaches provide a rich portrait of genes, molecular pathways, and cell types involved in rare disease phenotypes.
Machine learning (ML) can be a useful tool to extract disease-relevant patterns from high dimensional datasets.
However, depending on the complexity of the biological question, machine learning often requires a large number of samples to identify recurrent and biologically meaningful patterns.
Rare diseases are inherently limited in clinical cases and thus have few samples to study.
Rare diseases are inherently limited in clinical cases and thus have few samples to study.
Precision medicine also presents a similar challenge, where patients with common diseases are partitioned into small subsets of patients based on particular characteristics.
In this perspective, we outline the challenges and emerging solutions for using machine learning in the context of small sample sets, specifically that of rare diseases.
Advances in machine learning methods for rare disease are likely to be informative for applications beyond rare diseases in which sample sizes are small but datasets are complex (e.g., using genomics data for predictive modeling in precision medicine).
Advances in machine learning methods for rare disease are likely to be informative for applications beyond rare diseases in which sample sizes are small but datasets are high-dimensional (e.g., using genomics data for predictive modeling in precision medicine).
We propose that the methods community prioritizes the development of machine learning techniques for rare disease research.
1 change: 1 addition & 0 deletions content/02.intro.md
@@ -12,6 +12,7 @@ Therefore, if the goal of a study is to classify patients with a rare disease in
Conversely, unsupervised learning algorithms can learn patterns or features from unlabeled training data.
In the absence of known molecular subtypes, unsupervised ML approaches can be applied to identify groups of samples that are similar and may have distinct patterns of pathway activation [@doi:10.1158/0008-5472.CAN-08-2100].
Unsupervised approaches can also extract combinations of features (e.g., genes) that may describe a certain cell type or pathway.
See Box 1 "Common uses for machine learning in rare disease" for more examples of how ML can be used in rare disease research.

While ML can be a useful tool, there are challenges in applying ML to rare disease datasets.
ML methods are generally most effective when using large datasets; thus, analyzing high-dimensional biomedical data (i.e., data with > 1000 features, e.g., 20,000 genes) from rare disease datasets that typically contain 20 to 99 samples is challenging [@https://www.fda.gov/media/99546/download; @doi:10.1186/s13023-020-01424-6].
2 changes: 1 addition & 1 deletion content/03.combining-datasets.md
@@ -6,7 +6,7 @@ More features often mean increased missing observations (_sparsity_), more dissi

One of the important factors in machine learning is performance (e.g. the accuracy of a supervised model in identifying patterns relevant for the biological question of interest, or the reliability of an unsupervised model in identifying hypothetical biological patterns that are supported by post-hoc validation and research).
When small sample sizes compromise an ML model’s performance, two approaches can be taken to manage sparsity, variance, and multicollinearity: 1) increase the number of samples, or 2) improve the quality of samples.
In the first approach, appropriate training, evaluation, and held-out validation sets could be constructed by combining multiple rare disease cohorts (Figure [@fig:1]a, Box 1).
In the first approach, appropriate training, evaluation, and held-out validation sets could be constructed by combining multiple rare disease cohorts (Figure [@fig:1]a, Box 2).
When combining datasets, special attention should be directed towards data harmonization since data collection methods can differ from cohort to cohort.
Without careful selection of aggregation methods, one may introduce variability into the combined dataset that can negatively impact the ML model’s ability to learn or detect meaningful signals.
Steps such as reprocessing the data using a single pipeline, using batch correction methods [@doi:10.1093/biostatistics/kxj037; @doi:10.1093/nar/gku864], and normalizing raw values appropriately without affecting the underlying variance in the data [@doi:10.1186/gb-2010-11-3-r25; @doi:10.1371/journal.pcbi.1003531; @doi:10.1186/s13059-014-0550-8] may be necessary to mitigate unwanted variability (Figure [@fig:1]a).
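One simple form of the normalization step above can be sketched directly: the snippet below applies a per-cohort z-score to a toy expression matrix. This is a deliberately simplified stand-in for dedicated batch-correction methods such as ComBat; the data and the `per_cohort_zscore` helper are hypothetical.

```python
import numpy as np

def per_cohort_zscore(X, cohorts):
    """Standardize each feature within each cohort (gene-wise z-score).

    X: (n_samples, n_features) expression matrix.
    cohorts: length-n_samples array of cohort labels.
    Removes per-cohort location/scale differences only; real batch
    correction methods model more complex batch effects.
    """
    X = np.asarray(X, dtype=float)
    cohorts = np.asarray(cohorts)
    out = np.empty_like(X)
    for c in np.unique(cohorts):
        mask = cohorts == c
        block = X[mask]
        mu = block.mean(axis=0)
        sd = block.std(axis=0)
        sd[sd == 0] = 1.0  # avoid division by zero for constant genes
        out[mask] = (block - mu) / sd
    return out

# Two toy cohorts measured on the same 3 genes, with a strong offset in cohort B
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 3)), rng.normal(5, 2, (12, 3))])
labels = np.array(["A"] * 10 + ["B"] * 12)
X_adj = per_cohort_zscore(X, labels)
```

After adjustment, each cohort is centered and scaled identically, so the cohort offset no longer dominates downstream analysis.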
6 changes: 3 additions & 3 deletions content/04.heterogeneity.md
@@ -1,7 +1,7 @@
## Learning representations from rare disease data

Dimensionality reduction methods can help explore and visualize underlying structure in the data (e.g., [@doi:10.1038/s41467-019-13056-x]), to define sample subgroups (e.g., [@doi:10.1038/s41467-020-15351-4]), or for feature selection and extraction during application of specific machine learning models [@doi:10.1007/978-3-030-03243-2_299-1] (Figure [@fig:2]c).
These methods ‘compress’ information from a large number of features into a smaller number of features in an unsupervised manner [@doi:10.1007/978-3-540-33037-0; @doi:10.1098/rsta.2015.0202, @https://www.jmlr.org/papers/v9/vandermaaten08a.html; @https://arxiv.org/abs/1802.03426; @doi:10.1016/j.media.2020.101660; @doi:10.1038/ncomms14825] (Figure [@fig:2], Box 3a).
These methods ‘compress’ information from a large number of features into a smaller number of features in an unsupervised manner [@doi:10.1007/978-3-540-33037-0; @doi:10.1098/rsta.2015.0202, @https://www.jmlr.org/papers/v9/vandermaaten08a.html; @https://arxiv.org/abs/1802.03426] (Figure [@fig:2]).
An example of a method that is commonly used for dimensionality reduction is principal components analysis (PCA).
PCA identifies new features or dimensions, termed _principal components_ (PCs), that are combinations of original features.
The PCs are calculated in a way that maximizes the amount of information (variance) they contain and ensures that each PC is uncorrelated with the other PCs. [@doi:10.1098/rsta.2015.0202]
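The two PCA properties just described (PCs capture maximal variance and are mutually uncorrelated) can be illustrated with a minimal SVD-based implementation on simulated data; the `pca` helper and the toy expression matrix below are illustrative only.

```python
import numpy as np

def pca(X, n_components):
    """Minimal PCA via singular value decomposition.

    Returns the sample scores (projections onto the top PCs) and the
    fraction of total variance each retained PC explains.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)               # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T     # project samples onto PCs
    var_explained = (S ** 2) / (S ** 2).sum()
    return scores, var_explained[:n_components]

# Toy "expression matrix": 50 samples x 20 genes with one dominant axis of variation
rng = np.random.default_rng(1)
signal = rng.normal(size=(50, 1)) @ rng.normal(size=(1, 20))
X = signal + 0.1 * rng.normal(size=(50, 20))
scores, var_exp = pca(X, n_components=2)
```

Because the simulated data have a single strong axis of variation, PC1 captures most of the variance, and the two PC score vectors are uncorrelated by construction.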
@@ -13,9 +13,9 @@ Beyond dimensionality reduction, other unsupervised learning approaches such as

Representation learning approaches (which include dimensionality reduction) learn low-dimensional representations (composite features) from the raw data.
For example, representation learning through matrix factorization methods can extract features from transcriptomics datasets that are made of combinations of gene expression values. [@doi:10.1038/s41467-020-14666-6; @doi:10.1093/bioinformatics/btq503; @doi:10.1186/s13059-020-02021-3]
Representation learning can also be utilized to predict rare pathologies from images [@doi:10.1016/j.media.2020.101660] (Box 1a) or detect cell populations associated with rare diseases in single-cell mass cytometry data [@doi:10.1038/ncomms14825].

When applied to complex biological systems, representation learning generally requires many samples and therefore may appear to aggravate the curse of dimensionality.
However, it can be a powerful tool to learn low-dimensional patterns from large datasets and then find those patterns in smaller, related datasets.
In later sections, we will discuss this method of leveraging large datasets to reduce dimensionality in smaller datasets, also known as feature-representation-transfer learning.
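A minimal sketch of this feature-representation-transfer idea: learn a low-dimensional basis from a large (hypothetical) reference compendium, then project a small rare-disease cohort onto that basis. Dataset sizes, the shared "expression programs", and all variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical large reference compendium: 500 samples x 100 genes,
# generated from a few strong shared expression programs.
programs = rng.normal(size=(4, 100))
large = rng.normal(size=(500, 4)) @ programs + 0.2 * rng.normal(size=(500, 100))

# Learn a low-dimensional basis from the LARGE dataset only.
mean = large.mean(axis=0)
_, _, Vt = np.linalg.svd(large - mean, full_matrices=False)
basis = Vt[:4]                      # top 4 components, shape (4, 100)

# Small rare-disease cohort: 15 samples on the same 100 genes.
small = rng.normal(size=(15, 4)) @ programs + 0.2 * rng.normal(size=(15, 100))

# Transfer: represent the small cohort in the reference feature space.
small_embedded = (small - mean) @ basis.T   # shape (15, 4)
```

Because the basis was learned on the large dataset, the small cohort is described by only four composite features, yet those features reconstruct it well whenever both datasets share the same underlying patterns.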
Once the dimensions of the training dataset have been reduced, model training can proceed using the experimental design as outlined in Box 1.

Once the dimensions of the training dataset have been reduced, model training can proceed using the experimental design as outlined in Box 2.
10 changes: 5 additions & 5 deletions content/05.model-complexity.md
@@ -11,14 +11,14 @@ All of these contribute to low signal to noise ratio in rare disease datasets.
Thus, applying ML to rare disease data without addressing the aforementioned shortcomings may lead to models that have low reproducibility or are hard to interpret.

Class imbalance in datasets can be addressed using decision tree-based ensemble learning methods (e.g., random forests) [@doi:10.1007/s11634-019-00354-x] (Figure [@fig:3]a)
Random forests use resampling (with replacement) based techniques to form a consensus about the important predictive features identified by the decision trees (Box 3c). [@https://doi.org/10.1023/A:1010933404324; @doi:10.1186/1472-6947-13-134]
Additional approaches like combining random forests with resampling without replacement can generate confidence intervals for the model predictions by iteratively exposing the models to incomplete datasets, mimicking real world cases where most rare disease datasets are incomplete [@doi:10.3390/genes11020226] (Box 3d).
Random forests use resampling (with replacement) based techniques to form a consensus about the important predictive features identified by the decision trees (e.g., Box 1c). [@https://doi.org/10.1023/A:1010933404324; @doi:10.1186/1472-6947-13-134]
Additional approaches like combining random forests with resampling without replacement can generate confidence intervals for the model predictions by iteratively exposing the models to incomplete datasets, mimicking real world cases where most rare disease datasets are incomplete [@doi:10.3390/genes11020226].
Resampling approaches are most helpful in constructing confidence intervals for algorithms that generate the same outcome every time they are run (i.e., deterministic models).
For decision trees that choose features at random for selecting a path to the outcome (i.e., are non-deterministic), resampling approaches can be helpful in estimating the reproducibility of the model.
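The resampling idea described above can be sketched as follows: repeatedly refit a deterministic base model on bootstrap resamples (sampling with replacement) and read a confidence interval off the spread of its predictions. To keep the sketch short, ordinary least squares stands in for a full random forest; all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy dataset: outcome depends linearly on 3 features, plus noise.
X = rng.normal(size=(40, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 * rng.normal(size=40)
x_new = np.array([0.5, 0.5, 0.5])   # a new sample to predict

def fit_predict(X, y, x_new):
    """Deterministic base model: ordinary least squares prediction."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return x_new @ coef

# Bootstrap: refit on resampled (with replacement) datasets and
# collect the spread of predictions for the new sample.
preds = []
for _ in range(500):
    idx = rng.integers(0, len(y), size=len(y))   # resample with replacement
    preds.append(fit_predict(X[idx], y[idx], x_new))
preds = np.array(preds)
ci_low, ci_high = np.percentile(preds, [2.5, 97.5])   # 95% interval
```

The width of the interval reflects how sensitive the model's prediction is to which samples happened to be observed, which is exactly the uncertainty that matters in small cohorts.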

In situations where decision tree-based ensemble methods fail when they are applied to rare disease datasets, cascade learning is a viable alternative. [@pmc:PMC6371307]
In cascade learning, multiple methods leveraging distinct underlying assumptions are used in tandem to capture stable patterns existing in the dataset [@doi:10.1109/CVPR.2001.990537; @doi:10.1007/978-3-540-75175-5_16; @doi:10.1109/icpr.2004.1334680].
For example, a cascade learning approach for identifying rare disease patients from electronic health record data (Box 3a) incorporated independent steps for feature extraction (word2vec [@arxiv:1301.3781]), preliminary prediction with ensembled decision trees, and then prediction refinement using data similarity metrics. [@pmc:PMC6371307]
For example, a cascade learning approach for identifying rare disease patients from electronic health record data (Box 1a) incorporated independent steps for feature extraction (word2vec [@arxiv:1301.3781]), preliminary prediction with ensembled decision trees, and then prediction refinement using data similarity metrics. [@pmc:PMC6371307]
Combining these three methods resulted in better overall prediction when implemented on a silver standard dataset, as compared to a model that used ensemble-based prediction alone.
In addition to cascade learning, class re-balancing techniques that better represent rare classes, such as inverse sampling probability weighting [@doi:10.1186/s12911-021-01688-3], inverse class frequency weighting [@doi:10.1197/jamia.M3095], oversampling of rare classes [@https://doi.org/10.1613/jair.953], or uniformly random undersampling of the majority class [@doi:10.48550/arXiv.1608.06048], may also help mitigate limitations due to class imbalance.
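Two of these re-balancing techniques, inverse class frequency weighting and random oversampling of the rare class, are simple enough to sketch directly. The toy labels below mimic a hypothetical 95-control / 5-case rare disease cohort; the helper names are illustrative.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Weight each sample inversely to its class frequency, so a rare
    class contributes as much total weight as a common one."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    freq = dict(zip(classes, counts / len(labels)))
    return np.array([1.0 / freq[l] for l in labels])

def oversample_minority(X, labels, rng):
    """Naive random oversampling: resample minority-class rows (with
    replacement) until all classes match the largest class size."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    keep = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(labels == c)
        extra = rng.choice(idx, size=target - n, replace=True)
        keep.extend(idx)
        keep.extend(extra)
    keep = np.array(keep)
    return X[keep], labels[keep]

# 95 controls vs. 5 rare-disease cases
labels = np.array([0] * 95 + [1] * 5)
X = np.arange(100).reshape(100, 1)
w = inverse_frequency_weights(labels)
X_bal, y_bal = oversample_minority(X, labels, np.random.default_rng(4))
```

With inverse-frequency weights, the five cases carry the same total weight as the 95 controls; with oversampling, the rebalanced dataset contains equal numbers of each class (at the cost of duplicated case samples, which risks overfitting to those few cases).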

@@ -32,9 +32,9 @@ Regularization is often used in exploring functional role of rare variants in ra
For example, LASSO has been used as a feature selection method for amyotrophic lateral sclerosis (ALS) gene expression data. [@doi:10.1186/s10020-023-00603-y]
In this example, applying LASSO regularization reduced the number of genes included as features in a machine learning model designed to classify brain tissue regions from ALS patients.
In the context of rare immune cell signature discovery, variations of elastic-net regression were found to outperform other regression approaches [@doi:10.1016/j.compbiomed.2015.10.008; @doi:10.1186/s12859-019-2994-z].
Thus, regularization methods like LASSO or elastic-net are beneficial in ML with rare observations, and are worth exploring in the context of rare diseases.[@doi:10.1371/journal.pgen.1004754]
Thus, regularization methods like LASSO or elastic-net are beneficial in ML with rare observations, and are worth exploring in the context of rare diseases.[@doi:10.1371/journal.pgen.1004754] <!--TODO: replace this reference-->
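To make the feature-selection behavior of LASSO concrete, the sketch below minimizes the LASSO objective with a minimal proximal-gradient (ISTA) loop on simulated expression data in which only two of fifty genes carry signal. In practice one would use an established implementation (e.g., glmnet or scikit-learn); this solver and the data are illustrative only.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimal LASSO solver via proximal gradient descent (ISTA).

    Minimizes (1/2n)||y - Xw||^2 + lam * ||w||_1.  The L1 penalty
    drives coefficients of uninformative features exactly to zero,
    which is what makes LASSO usable for feature selection.
    """
    n, p = X.shape
    w = np.zeros(p)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = w - step * grad
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold
    return w

# Toy expression data: 30 samples x 50 genes, only genes 0 and 1 matter.
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 50))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=30)
w = lasso_ista(X, y, lam=0.2)
selected = np.flatnonzero(w != 0)
```

Even with more genes than samples, the L1 penalty recovers the two informative genes and zeroes out nearly all of the uninformative ones, illustrating why this family of methods suits rare disease sample sizes.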
Other examples of regularization that have been successfully applied to rare disease ML include Kullback–Leibler (KL) divergence loss or dropout during neural network training.
In a study using a variational autoencoder (VAE) (see Box 2: Definitions) for dimensionality reduction in gene expression data from acute myeloid leukemia (AML) samples, the KL loss between the input data and its low dimensional representation provided the regularizing penalty for the model. [@doi:10.1101/278739; @doi:10.48550/arXiv.1312.6114]
In a study using a variational autoencoder (VAE) (see Box 3: Definitions) for dimensionality reduction in gene expression data from acute myeloid leukemia (AML) samples, the KL loss between the input data and its low dimensional representation provided the regularizing penalty for the model. [@doi:10.1101/278739; @doi:10.48550/arXiv.1312.6114]
In a study using a convolutional neural network (CNN) to identify tubers in MRI images from tuberous sclerosis patients (an application that can facilitate Box 1a), overfitting was minimized using the dropout regularization method, which removed randomly chosen network nodes in each iteration of the CNN model, generating simpler models [@doi:10.1371/journal.pone.0232376].
Thus, depending on the learning method that is used, regularization approaches should be incorporated into data analysis when working with rare disease datasets.
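The dropout mechanism mentioned above is simple to sketch outside any deep learning framework: during training, randomly zero units and rescale the survivors so that expected activations are unchanged ("inverted" dropout); at inference, pass activations through untouched. This toy version illustrates the mechanism only and is not the cited studies' implementation.

```python
import numpy as np

def dropout(activations, p, rng, train=True):
    """Inverted dropout: during training, zero each unit with
    probability p and rescale survivors by 1/(1-p) so the expected
    activation is unchanged; at inference, pass through unchanged."""
    if not train or p == 0:
        return activations
    mask = rng.random(activations.shape) >= p   # keep with probability 1-p
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(6)
h = np.ones((10000, 1))            # a layer of constant unit activations
h_drop = dropout(h, p=0.5, rng=rng)
```

Each training pass sees a different random sub-network, which discourages any single unit from dominating, the same intuition behind using dropout as a regularizer on small datasets.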

6 changes: 3 additions & 3 deletions content/06.prior-knowledge.md
@@ -7,15 +7,15 @@ Knowledge graphs (KGs) which integrate related-but-different data types, provide
These graphs connect genetic, functional, chemical, clinical, and ontological data so that relationships of data with disease phenotypes can be explored through manual review [@doi:10.1093/database/baaa015] or computational methods [@doi:10.1016/j.jbi.2021.103838; @doi:10.1142/9789811215636_0041; @doi:10.1186/s12911-019-0938-1] (Figure [@fig:3]a)
KGs may include links (also called edges) or nodes that are specific to the rare disease of interest (e.g., an FDA approved treatment would be a specific disease-compound edge in the KG) as well as edges that are more generalized (e.g., gene-gene interactions noted in the literature for a different disease). (Figure [@fig:4]a)

Rare disease researchers can repurpose general (i.e., not rare disease-specific) biological or chemical knowledge graphs to answer rare disease-based research questions [@doi:10.1142/9789811215636_0041] (e.g. Box 3b).
There are a variety of tactics to sift through the large amounts of complex data in knowledge graphs.
Rare disease researchers can repurpose general (i.e., not rare disease-specific) biological or chemical knowledge graphs to answer rare disease-based research questions [@doi:10.1142/9789811215636_0041] (e.g. Box 1b).
There are a variety of tactics to sift through the data encoded in knowledge graphs.
One such tactic is to calculate the distances between nodes of interest (e.g., diseases and drugs to identify drugs for repurposing in rare disease [@doi:10.1142/9789811215636_0041]); this is often done by determining the "embeddings" (linear representations of the position and connections of a particular point in the graph) for nodes in the knowledge graph, and calculating the similarity between these embeddings.
Developing effective methods to calculate node embeddings that can generate actionable insights for rare diseases is an active area of research [@doi:10.1142/9789811215636_0041].
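A toy version of this embedding-and-similarity tactic: the sketch below uses the leading eigenvectors of a small graph's adjacency matrix as stand-in node embeddings (real applications would use learned embeddings such as node2vec) and compares cosine similarities within and across clusters. The graph and its disease/gene/drug interpretation are hypothetical.

```python
import numpy as np

def spectral_embeddings(adj, dim):
    """Embed nodes using the leading eigenvectors of the symmetric
    adjacency matrix -- a simple stand-in for learned embeddings."""
    vals, vecs = np.linalg.eigh(adj)
    order = np.argsort(vals)[::-1]          # largest eigenvalues first
    return vecs[:, order[:dim]]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Tiny hypothetical KG projected to an undirected graph:
# nodes 0-2 form one disease/gene/drug cluster, nodes 3-5 another,
# with a single bridging edge (2, 3).
adj = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0

emb = spectral_embeddings(adj, dim=2)
within = cosine(emb[0], emb[1])    # nodes in the same cluster
across = cosine(emb[0], emb[5])    # nodes in different clusters
```

Nodes in the same cluster end up with nearly identical embeddings while nodes in different clusters do not, which is the property that lets embedding similarity surface candidate disease-drug links.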

Another application of KGs is to augment or refine a dataset [@doi:10.1186/s12911-019-0752-9; @doi:10.1186/s12911-019-0938-1].
For example, Li et al. [@doi:10.1186/s12911-019-0938-1] used a KG to identify linked terms in a medical corpus from a large number of patients, some with rare disease diagnoses.
They were able to augment their text dataset by identifying related terms in the clinical text and mapping them to the same term (e.g., mapping "cancer" and "malignancy" in different patients to the same clinical concept).
With this augmented and improved dataset, they were able to train and test a variety of text classification algorithms to identify rare disease patients within their corpus. (Figure [@fig:4]b, Box 3a)
With this augmented and improved dataset, they were able to train and test a variety of text classification algorithms to identify rare disease patients within their corpus. (Figure [@fig:4]b, Box 1a)

Finally, another possible tactic for rare disease researchers is to take a knowledge graph, or an integration of several knowledge graphs, and apply neural network-based algorithms optimized for graph data, such as a graph convolutional neural network.
Rao and colleagues [@doi:10.1186/s12920-018-0372-8] describe the construction of a KG using phenotype information (Human Phenotype Ontology) and rare disease information (Orphanet) and curated gene interaction/pathway data (Lit-BM-13, WikiPathways) [@pmc:PMC7778952; @doi:10.1016/j.cell.2014.10.050; @doi:10.1093/nar/gkaa1024].