diff --git a/README.md b/README.md
index 543d16d..4ec9b8a 100644
--- a/README.md
+++ b/README.md
@@ -24,7 +24,7 @@ To install EMBEDR, we recommend cloning this repository before installing using
 `pip` in the main project directory. Specifically:
 
 ```bash
- pip install .
+pip install .
 ```
 
 The package requires numpy, scikit-learn, scipy, conda, and numba for
@@ -38,7 +38,7 @@ the t-SNE algorithm. You can install fftw using
 ## Getting Started
 
 Once you've installed EMBEDR, you can easily generate an embedding colored by
-EMBEDR *p*-value by calling the `fit` method in the EMBEDR class as below:
+EMBEDR *p*-value by calling the `fit` method in the EMBEDR class as below.
 
 ```python
 from EMBEDR import EMBEDR, EMBEDR_sweep
@@ -46,13 +46,42 @@ import numpy as np
 
 X = np.loadtxt("./data/mnist2500_X.txt").astype(float)
 
-embObj = EMBEDR()
+embObj = EMBEDR(project_dir='./')
 embObj.fit(X)
 embObj.plot()
 ```
 
 ![Example EMBEDR Plot](EasyUseExample.png)
 
+In the example above, we embed 2500 MNIST digits once using t-SNE and we embed
+a marginally-resampled null data set once as well. The quality of the data
+embedding, based on the correspondence between the neighborhoods of each sample
+in the original space and the shown projection, is compared to that expected
+to be generated by signalless data (as generated by the null data set). This
+comparison results in a "*p*-value," which we use to color the samples in the
+embedding. For complete details, see our
+[preprint](https://www.biorxiv.org/content/10.1101/2020.11.18.389031v2).
+
+The EMBEDR package primarily works through the `EMBEDR` class object, as in the
+example above. Importantly, because EMBEDR generates several embeddings of a
+data set (and a generated null data set), the method stores intermediate
+results in a project directory. In the example above, the `project_dir`
+variable is set to the current working directory, but we recommend that you set
+a specified "projects" directory. The default value for `project_dir` is
+`./projects/`. To facilitate this organization, a `project_name` parameter can
+also be specified. If you don't want to do file caching, set `do_cache=False`
+when initializing the EMBEDR object.
+
+Other useful parameters are:
+- `DRA`: the dimensionality reduction algorithm; currently only `tSNE` and
+  `UMAP` are supported.
+- `perplexity`/`nearest_neighbors`: Set the algorithm hyperparameters for
+  t-SNE or UMAP. Defaults are to set these at 10% of the number of samples.
+- `n_data_embed` and `n_null_embed`: The number of data and null embeddings to
+  generate before calculating EMBEDR *p*-values. Defaults are set at 1, but in
+  practice using 3-10 embeddings is recommended.
+For a complete list of options, check the `EMBEDR` class documentation.
+
 ## New in Version 2.0
 
 The updated version of the EMBEDR package better facilitates the EMBEDR
@@ -81,14 +110,4 @@ which they were created has been amended and will be backwards compatible with
 previous versions. Objects can now be loaded from any relative path
 specification for the project directory.
 
-## To-Do
-
-- Plotting Utility
-  - Sweep results:
-    - EES vs hyperparameter (null and data)
-    - p-Values vs hyperparameter
-  - EMBEDR results:
-    - color plot by other metadata / supplied array.
-- k-Effective Calculator
-
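
Note on the added README text: the phrase "marginally-resampled null data set" may be unclear to new readers. The sketch below shows one plausible reading, in which each feature column is resampled independently with replacement so that per-feature distributions are kept but relationships between features are destroyed. The function name `marginal_resample` and the with-replacement choice are assumptions made for illustration; this is not taken from the EMBEDR source, which may build its null differently.

```python
import numpy as np


def marginal_resample(X, seed=None):
    """Hypothetical helper: build a null data set from X by resampling
    every feature (column) independently, with replacement."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    # For each feature, draw an independent set of row indices.
    idx = rng.integers(0, n_samples, size=(n_samples, n_features))
    # X_null[i, j] = X[idx[i, j], j]: values stay within their own column,
    # but rows no longer correspond to real samples.
    return np.take_along_axis(X, idx, axis=0)


# Small toy matrix: 4 samples, 3 features.
X = np.arange(12, dtype=float).reshape(4, 3)
X_null = marginal_resample(X, seed=0)
```

Under this assumption, each column of `X_null` contains only values drawn from the same column of `X`, while the joint structure across columns is lost. That matches the "signalless data" the README text describes.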