README.rmd

---
output: github_document
---

<!-- edit README.rmd -->

```{r knitr options, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
devtools::load_all()
```

# ordr

<!-- badges: start -->
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![CRAN](http://www.r-pkg.org/badges/version/ordr)](https://cran.r-project.org/package=ordr)
<!-- badges: end -->

**ordr** integrates ordination analysis and biplot visualization into [**tidyverse**](https://github.com/tidyverse/tidyverse) workflows.

## motivation

> Wherever there is an SVD, there is a biplot.[^svd-quote]

[^svd-quote]: Greenacre MJ (2010) _Biplots in Practice_. Fundacion BBVA, ISBN: 978-84-923846. <https://www.fbbva.es/microsite/multivariate-statistics/biplots.html>

### ordination and biplots

_Ordination_ is a catch-all term for a variety of statistical techniques that introduce an artificial coordinate system for a data set in such a way that a few coordinates capture a large amount of the data structure [^ordination].
The branch of mathematical statistics called [geometric data analysis](https://link.springer.com/book/10.1007/1-4020-2236-0) (GDA) provides the theoretical basis for (most of) these techniques.
Ordination overlaps with regression and with dimension reduction, which can be [contrasted to clustering and classification](https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d) in that they assign continuous rather than categorical values to data elements [^continuous-categorical].

[^ordination]: The term _ordination_ is most prevalent among ecologists; no catch-all term seems to be in common use outside ecology.
[^continuous-categorical]: This is not a hard rule: PCA is often used to compress data before clustering, and LDA uses dimension reduction to perform classification tasks.

Most ordination techniques decompose a numeric rectangular data set into the product of two matrices, often using singular value decomposition (SVD). The coordinates of the shared dimensions of these matrices (over which they are multiplied) are the artificial coordinates.
In some cases, such as principal components analysis, the decomposition is exact; in others, such as non-negative matrix factorization, it is approximate. Some techniques, such as correspondence analysis, transform the data before decomposition.
Ordination techniques may be supervised, like linear discriminant analysis, or unsupervised, like multidimensional scaling.

Analysis pipelines that use these techniques may use the artificial coordinates directly, in place of natural coordinates, to arrange and compare data elements or to predict responses. This is possible because both the rows and the columns of the original table can be located, or positioned, along these shared coordinates.
The number of artificial coordinates used in an application, such as regression or visualization, is called the _rank_ of the ordination [^lm-kmeans].
A common application is the _biplot_, which positions the rows and columns of the original table in a scatterplot in 1, 2, or 3 artificial coordinates, usually those that explain the most variation in the data.

[^lm-kmeans]: Regression and clustering models, like classical [linear regression](https://www.fbbva.es/microsite/multivariate-statistics/) and [_k_-means](http://joelcadwell.blogspot.com/2015/08/matrix-factorization-comes-in-many.html), can also be understood as matrix decomposition approximations and even visualized in biplots. Their shared coordinates, which are pre-defined rather than artificial, are the predictor coefficients and the cluster assignments, respectively. Methods for `stats::lm()` and `stats::kmeans()`, for example, are implemented for the sake of novelty and instruction, but are not widely used in practice.

### implementations in R

An extensive range of ordination techniques are implemented in R, from classical multidimensional scaling (`stats::cmdscale()`) and principal components analysis (`stats::prcomp()` and `stats::princomp()`) in the **stats** package distributed with base R, across widely-used implementations of linear discriminant analysis (`MASS::lda()`) and correspondence analysis (`ca::ca()`) in general-use statistical packages, to highly specialized packages that implement cutting-edge techniques or adapt conventional techniques to challenging settings. These implementations come with their own conventions, tailored to the research communities that produced them, and it would be impractical (and probably unhelpful) to try to consolidate them.

Instead, **ordr** provides a streamlined process by which the models output by these methods&mdash;in particular, the matrix factors into which the original data are approximately decomposed and the artificial coordinates they share&mdash;can be inspected, annotated, tabulated, summarized, and visualized. On this last point, most biplot implementations in R provide limited customizability. **ordr** adopts the grammar of graphics paradigm from [**ggplot2**](https://github.com/tidyverse/ggplot2) to modularize and standardize biplot elements [^virtue]. Overall, the package is designed to follow the broader syntactic conventions of the **tidyverse**, so that users familiar with a this workflow can more easily and quickly integrate ordination models into practice.

[^virtue]: Biplot elments must be chosen with care, and it is useful and appropriate that many model-specific biplot methods have limited flexibility. This package adopts the trade-off articulated in [Wilkinson's _The Grammar of Graphics_](https://www.google.com/books/edition/_/iI1kcgAACAAJ) (p. 15): "This system is capable of producing some hideous graphics. There is nothing in its design to prevent its misuse. ... This system cannot produce a meaningless graphic, however."

## usage

### installation

**ordr** is now on CRAN and can be installed using base R:

```{r install release, eval=FALSE}
install.packages("ordr")
```

The development version can be installed from the (default) `main` branch using [**remotes**](https://github.com/r-lib/remotes):

```{r install, eval=FALSE}
remotes::install_github("corybrunson/ordr")
```

### example

> Morphologically, _Iris versicolor_ is much closer to _Iris virginica_ than to _Iris setosa_, though in every character by which it differs from _Iris virginica_ it departs in the direction of _Iris setosa_.[^iris-quote]

[^iris-quote]: Anderson E (1936) "The Species Problem in Iris". _Annals of the Missouri Botanical Garden_ **23**(3), 457-469+471-483+485-501+503-509. <https://doi.org/10.2307/2394164>

A very common illustration of ordination in R applies principal components analysis (PCA) to Anderson's iris measurements. These data consist of lengths and widths of the petals and surrounding sepals from 50 each of three species of iris:

```{r iris}
head(iris)
summary(iris)
```

**ordr** provides a convenience function to send a subset of columns to an ordination function, wrap the resulting model in the [**tibble**](https://github.com/tidyverse/tibble)-derived 'tbl_ord' class, and append both model diagnostics and other original data columns as annotations to the appropriate matrix factors:[^ordinate]

[^ordinate]: The data must be in the form of a data frame that can be understood by the modeling function. Step-by-step methods also exist to build and annotate a 'tbl_ord' from a fitted ordination model.

```{r pca}
(iris_pca <- ordinate(iris, cols = 1:4, model = ~ prcomp(., scale. = TRUE)))
```

Additional annotations can be added using several row- and column-specific **dplyr**-style verbs:

```{r bind}
iris_meta <- data.frame(
  Species = c("setosa", "versicolor", "virginica"),
  Colony = c(1L, 1L, 2L),
  Cytotype = c("diploid", "hexaploid", "tetraploid")
)
(iris_pca <- left_join_rows(iris_pca, iris_meta, by = "Species"))
```

Following the [**broom**](https://github.com/tidymodels/broom) package, the `tidy()` method produces a tibble describing the model components, in this case the principal coordinates, which is suitable for scree plotting:

```{r model components and scree plot}
tidy(iris_pca) %T>% print() %>%
  ggplot(aes(x = name, y = prop_var)) +
  geom_col() +
  labs(x = "", y = "Proportion of inertia") +
  ggtitle("PCA of Anderson's iris measurements",
          "Distribution of inertia")
```

Following **ggplot2**, the `fortify()` method row-binds the factor tibbles with an additional `.matrix` column. This is used by `ggbiplot()` to redirect row- and column-specific plot layers to the appropriate subsets:[^text-radiate]

```{r interpolative biplot}
ggbiplot(iris_pca, sec.axes = "cols", scale.factor = 2) +
  geom_rows_point(aes(color = Species, shape = Species)) +
  stat_rows_ellipse(aes(color = Species), alpha = .5, level = .99) +
  geom_cols_vector(aes(label = name)) +
  expand_limits(y = c(-3.5, NA)) +
  ggtitle("PCA of Anderson's iris measurements",
          "99% confidence ellipses; variables use top & right axes")
```

[^text-radiate]: The radiating text geom, like several other features, is adapted from the **ggbiplot** package.

When variables are represented in standard coordinates, as typically in PCA, their rules can be rescaled to yield a predictive biplot.[^predictive]
For legibility, the axes are limited to the data range and offset from the origin:

[^predictive]: This is an experimental feature only available for linear methods, namely eigendecomposition, singular value decomposition, and principal components analysis.

```{r predictive biplot}
ggbiplot(iris_pca, axis.type = "predictive", axis.percents = FALSE) +
  theme_scaffold() +
  geom_rows_point(aes(color = Species, shape = Species)) +
  stat_rows_center(
    aes(color = Species, shape = Species),
    size = 5, alpha = .5, fun.data = mean_se
  ) +
  stat_cols_rule(aes(label = name, center = center, scale = scale)) +
  ggtitle("Predictive biplot of Anderson's iris measurements",
          "Project a marker onto an axis to approximate its measurement")
aggregate(iris[, 1:4], by = iris[, "Species", drop = FALSE], FUN = mean)
```

### more methods

The auxiliary package [**ordr.extra**](https://github.com/corybrunson/ordr.extra) provides recovery methods for several additional ordination models---and has room for several more!

## acknowledgments

### contribute

Any feedback on the package is very welcome! If you encounter confusion or errors, do create an issue, with a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) if feasible. If you have requests, suggestions, or your own implementations for new features, feel free to create an issue or submit a pull request.
Methods for additional ordination classes (see the `methods-*.r` scripts in the `R` folder) are especially welcome, as are new plot layers.
Please try to follow the
[contributing guidelines](https://github.com/corybrunson/ordr/blob/main/CONTRIBUTING.md) and respect the [Code of
Conduct](https://github.com/corybrunson/ordr/blob/main/CODE_OF_CONDUCT.md).

### inspirations

This package was originally inspired by the **ggbiplot** extension developed by [Vincent Q. Vu](https://github.com/vqv/ggbiplot), [Richard J Telford](https://github.com/richardjtelford/ggbiplot), and [Vilmantas Gegzna](https://github.com/forked-packages/ggbiplot), among others. It probably first brought biplots into the **tidyverse** framework.
The motivation to unify a variety of ordination methods came from several books and articles by [Michael Greenacre](https://www.fbbva.es/microsite/multivariate-statistics/resources.html), in particular [_Biplots in Practice_](https://www.fbbva.es/microsite/multivariate-statistics/resources.html#biplots).
Several answers at CrossValidated, in particular by [amoeba](https://stats.stackexchange.com/users/28666/amoeba) and [ttnphns](https://stats.stackexchange.com/users/3277/ttnphns), provided theoretical insights and informed design choices.
Thomas Lin Pedersen's [**tidygraph**](https://github.com/thomasp85/tidygraph) prequel to **ggraph** finally induced the shift from the downstream generation of scatterplots to the upstream handling and manipulating of models.
Additional design elements and features have been informed by the monograph [_Biplots_](https://www.google.com/books/edition/Biplots/lTxiedIxRpgC) and the textbook [_Understanding Biplots_](https://www.wiley.com/en-us/Understanding+Biplots-p-9781119972907) by John C. Gower, David J. Hand, Sugnet Gardner--Lubbe, and Niël J. Le Roux, and by the volume [_Principal Components Analysis_](https://link.springer.com/book/10.1007/b98835) by I. T. Jolliffe.

### exposition

This work was presented ([slideshow PDF](https://raw.githubusercontent.com/corybrunson/tidy-factor/main/tidy-factor-x/tidy-factor-x.pdf)) at an invited panel on [New Developments in Graphing Multivariate Data](https://ww2.amstat.org/meetings/jsm/2022/onlineprogram/ActivityDetails.cfm?SessionID=222053) at the [Joint Statistical Meetings](https://ww2.amstat.org/meetings/jsm/2022/), on 2022 August 8 in Washington DC.
I'm grateful to Joyce Robbins for the invitation and for organizing such a fun first experience, to Naomi Robbins for chairing the event, and to my co-panelists Ursula Laa and Hengrui Luo for sharing and sparking such exciting ideas and conversations.
An update was presented to the [ggplot2 extenders](https://teunbrand.github.io/ggplot-extension-club/), which elicited additional valuable feedback.

### resources

Development of this package benefitted from the use of equipment and the
support of colleagues at [UConn Health](https://health.uconn.edu/) and
at [UF Health](https://ufhealth.org/).