
modified _knn_from_dists to enable very large distance matrices #33

Merged · 1 commit · Oct 13, 2020

Conversation

@kragol (Contributor) commented Sep 16, 2020

This is a simple code modification that has enabled me to use very large distance matrices as input to UMAP (e.g. 100,000 × 100,000 distance matrices). The matrices may be so large that they do not fit into memory unless they are stored in a sparse way (I have been using DefaultArrays; not really the most efficient implementation, but it does the job).

The modified code is equivalent to the old one but doesn't seem to require nearly as much memory.

With the older code, Julia would run out of memory in the _knn_from_dists function while trying to allocate very large matrices. I am not sure why that happened, but I suspect it is related to how array comprehensions are materialized.
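To make the idea concrete, here is a minimal sketch of the pattern, assuming a column-per-point layout and a zero diagonal; the function name and signature are illustrative, not the package's actual internals:

```julia
# Illustrative sketch only — names and layout are assumptions, not the
# actual UMAP.jl code. Instead of building the k-nearest-neighbor
# index/distance matrices via array comprehensions (which materialize
# intermediate arrays), preallocate the two k × n outputs and fill them
# column by column, keeping per-iteration allocations small.
function knn_from_dists_sketch(dists::AbstractMatrix{S}, k::Integer) where {S}
    n = size(dists, 2)
    knns = Matrix{Int}(undef, k, n)
    knn_dists = Matrix{S}(undef, k, n)
    for i in 1:n
        col = view(dists, :, i)                        # no copy of the column
        knns[:, i] .= partialsortperm(col, 2:(k + 1))  # rank 1 is the point itself
        knn_dists[:, i] .= col[view(knns, :, i)]
    end
    return knns, knn_dists
end
```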

On a side note, two modifications I used that are not included in this PR are:

  • using Threads.@threads on the for loop in _knn_from_dists, which provides a great speedup
  • adding a ProgressMeter to a few functions (for when UMAP takes 10 minutes...)

These are basic modifications, but they really improve quality of life when dealing with large datasets (a rough sketch of both follows below). Let me know if you'd be interested in any of them.
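For reference, a hedged sketch of what those two tweaks could look like on top of the loop above; the function name is again illustrative, and the thread-safety note relies on recent ProgressMeter.jl versions documenting next! as safe to call from multiple threads:

```julia
using ProgressMeter

# Hypothetical combination of both quality-of-life tweaks; not UMAP.jl code.
function knn_from_dists_threaded(dists::AbstractMatrix{S}, k::Integer) where {S}
    n = size(dists, 2)
    knns = Matrix{Int}(undef, k, n)
    knn_dists = Matrix{S}(undef, k, n)
    prog = Progress(n; desc = "knn from dists: ")
    Threads.@threads for i in 1:n
        col = view(dists, :, i)
        knns[:, i] .= partialsortperm(col, 2:(k + 1))
        knn_dists[:, i] .= col[view(knns, :, i)]
        next!(prog)  # documented as thread-safe in recent ProgressMeter versions
    end
    return knns, knn_dists
end
```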

@dillondaudert (Owner)

Thanks for the PR! This seems like a good change; I should have time in the next few days to review it.

@dillondaudert (Owner)

Looks good to me. In some benchmarking comparing the two implementations, I didn't notice a significant difference in memory usage, so I might need more context on your specific use case to reproduce that.

However, your implementation is faster anyway, even single-threaded, so that's good enough for me. If you want to add in the @threads macro, I'll accept that as well; otherwise I can merge this.

@kragol (Contributor, Author) commented Oct 12, 2020

Maybe the memory issue is related to the fact that I am using DefaultArrays to implement my very large distance matrices. I don't have much time to test other implementations to see exactly where the issue comes from, but I'll let you know if I figure it out. Right now it does not make sense to me why this would happen, which is why I suspect it is something related to the way array comprehensions are materialized.

For the threaded implementation, it should probably be optional, with a sensible default depending on matrix size (it may well be slower for small matrices), so I think that could be a separate PR.
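For illustration, one hedged way such an opt-in switch could look; the keyword name and the 10,000-column threshold are placeholders, not anything from UMAP.jl:

```julia
# Hypothetical wrapper: thread only when the matrix is large enough that the
# parallel speedup is likely to outweigh scheduling overhead. The threshold
# is a made-up placeholder and would need benchmarking.
function knn_from_dists(dists::AbstractMatrix, k::Integer;
                        threaded::Bool = size(dists, 2) > 10_000)
    threaded ? knn_from_dists_threaded(dists, k) :
               knn_from_dists_sketch(dists, k)
end
```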

@dillondaudert merged commit f7a44fd into dillondaudert:master on Oct 13, 2020