feature request: UMAP connectivity and diagnostic plotting #65

Open
maddyduran opened this issue Jul 30, 2020 · 7 comments
Labels
enhancement New feature or request

Comments

@maddyduran

It would be really useful to have the connectivity and diagnostic plotting features seen in the Python UMAP implementation.

Thanks for the great work!

@jlmelville jlmelville added the enhancement New feature or request label Jul 30, 2020
@vertesy

vertesy commented Feb 12, 2023

This would be really great!

@jlmelville
Owner

I agree that some kind of diagnostic plotting is necessary for any dimensionality reduction method that embeds a neighbor graph. I have written substantial amounts of R (and Python) plotting code for visualizing UMAP output, but I don't really want to add it to uwot because I think it would drastically increase the maintenance burden.

Also, I admit to being skeptical that connectivity plots are all that useful for static output. Interactive plotting is a different matter: I think they are very informative there. But I am not sure what would constitute a useful contribution. plotly is adequate for my needs. It seems like I could end up having to support multiple output styles (e.g. base graphics, ggplot2, plotly) and still not offer something that fits into most people's workflows or graphics needs.

That said it's a bit hypocritical of me to say that diagnostic plotting is necessary and then resolutely refuse to provide any help.

@vertesy

vertesy commented Feb 15, 2023

I think the reason a static connectivity plot is helpful is that it shows you which distances are actually meaningful on a standard 2D UMAP.

E.g. two clusters may sit equally close to a third cluster, but only one of them is close because of connectedness, and is thus meaningful; the other may end up at the same distance only because of the dimensionality compression/reduction.

I understand and agree that implementing different plotting frameworks can cause a large burden, but it may not be necessary.

@jlmelville
Owner

E.g. 2 clusters may sit equally close to a third cluster but only one of them is close due to connectedness, thus meaningful, the other may only end up at the same distance because of the dimensionality compression/reduction.

Agreed about the intention. I suppose I should try and implement it and then be prepared to eat my words.

@jlmelville
Owner

My initial experiments with connectivity plotting have confirmed my suspicions that without access to something that works like datashader (which the Python connectivity plotter makes use of), the naïve approach of plotting lines between the n_neighbors nearest neighbors from the original space quickly scales beyond feasibility.
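To put rough numbers on that scaling, the naive approach draws up to one line segment per point per neighbor. The sketch below just does that arithmetic; the function name is made up for illustration, and the inputs (150 points for iris, 70,000 for MNIST, 15 neighbors) are the figures quoted elsewhere in this comment, not measurements of anything.

```r
# Hypothetical helper: upper bound on the number of line segments drawn when
# every point is connected to all of its n_neighbors nearest neighbors.
naive_segment_count <- function(n_points, n_neighbors) {
  n_points * n_neighbors
}

iris_segments <- naive_segment_count(150, 15)
mnist_segments <- naive_segment_count(70000, 15)

# Compare the number of segments with the number of points being plotted
cat("iris: ", iris_segments, "segments for 150 points\n")
cat("MNIST:", format(mnist_segments, big.mark = ","), "segments for 70,000 points\n")
```

Without some form of aggregation or rasterization, each of those segments is a separate drawing operation.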

As an alternative, I considered plotting just the connection between each point and its furthest nearest neighbor, i.e. the last of its n_neighbors nearest neighbors. Closer neighbors are more likely to be embedded close to the point, so plotting them would probably give a higher proportion of uninteresting within-cluster lines.

Here's what this looks like for iris:

image

That looks ok, although I should stress that I have zero evidence that displaying the connection to the furthest nearest neighbor gives useful information about clusters or connectivity.

But iris only contains 150 points. Here is a bog-standard UMAP of the MNIST digits (N = 70,000), a more realistic case:

image

And here are the 15-neighbor connectivities (the equivalent of the iris plot above):

image

I still don't consider that static output to be all that useful, and don't actually have a way to produce an equivalent interactive plot for this yet. The very simplified method of producing those connections may also be misleading or unhelpful. A more sophisticated method processing all the neighbor connectivities to leave only the "useful" ones seems like a substantial research project on its own.

Not sure when or if I will pursue this further, but if you are able to get the data into a form that lets you use uwot directly on a matrix or data frame (not sure how easy that is to extract from e.g. Seurat workflows), you can play about with this yourself:

conn_plot <-
  function(model,
           X,
           alpha_scale = 0.5,
           color = "black",
           lwd = 1,
           nn = NULL) {
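    # Convert the input to a matrix using uwot's internal helper
    X <- uwot:::x2m(X)
    if (is.null(nn)) {
      if (!is.null(model$nn)) {
        # Reuse the neighbors cached by ret_nn = TRUE
        nn <- model$nn[[1]]
      }
      else {
        # Otherwise query the stored index (slow for large datasets)
        nn <-
          uwot:::annoy_search(X, k = model$n_neighbors, ann = model$nn_index)
      }
    }

    # Keep only the last (furthest) neighbor column for each point
    nnf <- nn$idx[, model$n_neighbors, drop = FALSE]
    # Two-column matrix of (point index, neighbor index) pairs
    pairs <- as.matrix(reshape2::melt(nnf)[, c(1, 3)])

    coords <- model$embedding

    # Segment start: embedded coordinates of each point
    x0 <- coords[pairs[, 1], 1]
    y0 <- coords[pairs[, 1], 2]

    # Segment end: embedded coordinates of its furthest neighbor
    x1 <- coords[pairs[, 2], 1]
    y1 <- coords[pairs[, 2], 2]
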
    segments(
      x0 = x0,
      y0 = y0,
      x1 = x1,
      y1 = y1,
      col = grDevices::adjustcolor(color, alpha.f = alpha_scale),
      lwd = lwd
    )
  }

Example of using it with iris:

library(uwot)
# ret_nn = TRUE is optional but strongly recommended
model <- umap(iris, ret_model = TRUE, ret_nn = TRUE)
plot(model$embedding, col=iris$Species)
# or vizier::embed_plot(model$embedding, iris)
conn_plot(model, iris, alpha_scale = 0.1)

Note:

  • You need to have reshape2 installed.
  • You need to plot the embedding yourself first, e.g. with something as simple as plot(model$embedding), and work out point sizes, colors and so on.
  • On an MNIST-sized dataset, the function takes a while to run because it has to find the nearest neighbors, and plotting all those lines then takes ages even after the function returns. Caching the nearest neighbors helps, which you can do by generating the original UMAP model with ret_nn = TRUE. Even then, be prepared to wait several minutes with seemingly nothing happening.
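For small datasets, the furthest-nearest-neighbor idea can also be tried without uwot internals or reshape2. The following is a minimal base R sketch, not the function above: the helper names furthest_nn_pairs and draw_pairs are hypothetical, neighbors are found by brute force with dist (so it is only practical for a few thousand points), PCA stands in for the UMAP embedding so the example runs with base R alone, and k = 15 merely mirrors the neighbor count mentioned for the MNIST plot.

```r
# Hypothetical helper: for each row of X, find its k-th nearest neighbor
# (excluding the point itself) by brute force.
furthest_nn_pairs <- function(X, k) {
  X <- as.matrix(X)
  stopifnot(k >= 1, k < nrow(X))
  d <- as.matrix(dist(X))  # full N x N Euclidean distance matrix
  diag(d) <- Inf           # a point is never its own neighbor here
  # Sort each row by distance and take the index in position k
  kth <- apply(d, 1, function(row) order(row)[k])
  cbind(from = seq_len(nrow(X)), to = unname(kth))
}

# Hypothetical helper: draw the pairs on top of an existing scatter plot
draw_pairs <- function(coords, pairs, color = "black", alpha = 0.1, lwd = 1) {
  segments(
    x0 = coords[pairs[, "from"], 1], y0 = coords[pairs[, "from"], 2],
    x1 = coords[pairs[, "to"], 1], y1 = coords[pairs[, "to"], 2],
    col = grDevices::adjustcolor(color, alpha.f = alpha), lwd = lwd
  )
}

X <- iris[, 1:4]
pairs <- furthest_nn_pairs(X, k = 15)
coords <- prcomp(X)$x[, 1:2]  # assumption: PCA as a stand-in 2D embedding
plot(coords, col = iris$Species)
draw_pairs(coords, pairs, alpha = 0.3)
```

To use it with a real UMAP result, replace coords with the embedding matrix returned by umap. The brute-force distance matrix needs memory proportional to N squared, so this does nothing to address the MNIST-scale problem described above.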

@vertesy

vertesy commented Feb 26, 2023

Thank you!

@jlmelville
Owner

https://schochastics.github.io/edgebundle/ seems worth exploring
