UMap support #110

Open
MohammadFakhreddin opened this issue Oct 12, 2024 · 6 comments

Comments

@MohammadFakhreddin

Hello,

First of all, I want to thank you for this library. I've been looking for a library that I can use to integrate dimensionality reduction techniques into our tool for our paper, and this is perfect for that. (I will make sure to cite it :))

I would like to ask about the current status of UMAP support. Is it ready to use?

Also, as a side question, are you guys aware of any good library for a k-NN classifier?

@iglesias
Collaborator

Hello @MohammadFakhreddin,

Thank you for your interest; it is nice to read that you are finding the library useful.

I might be able to help with the k-NN question. tapkee already includes tree data structures for it, since several dimensionality reduction techniques are based on nearest neighbors, but you could also take a look at Shogun. This notebook should help to get a quick idea of how you can do k-NN in Shogun using the Python interface. Even though the notebook is about LMNN (you can think of it as an extension to k-NN), see e.g. code cell [14] for an example applying k-NN to a metagenomics dataset.

So :)
if you are already using tapkee and are up for diving into its code a bit, you will find the NN code and eventually be able to modify it to get a k-NN classifier, so you won't need any other library;
if you want a more readily available solution where you can call a k-NN classifier directly, Shogun could serve you better, but it may require some effort to get it working, which can vary widely depending on what system you are using (OS, package manager, compiler, ...) and what version of Shogun you would like to use.
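
For a sense of scale, a k-NN classifier with no dependencies can be sketched in a few lines of standard C++. This is only a brute-force illustration: it does not use tapkee's tree structures or Shogun, and the names `LabeledPoint`, `squared_distance` and `knn_classify` are hypothetical, made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical types for illustration; not part of tapkee or Shogun.
using Point = std::vector<double>;

struct LabeledPoint {
    Point features;
    int label;
};

// Squared Euclidean distance; sufficient for ranking neighbors.
double squared_distance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Brute-force k-NN: majority vote among the k closest training points.
// Ties in the vote are broken in favor of the smallest label.
int knn_classify(const std::vector<LabeledPoint>& train, const Point& query,
                 std::size_t k) {
    assert(!train.empty() && k > 0);
    k = std::min(k, train.size());

    std::vector<std::pair<double, int>> dist_label;
    dist_label.reserve(train.size());
    for (const auto& p : train)
        dist_label.emplace_back(squared_distance(p.features, query), p.label);

    // Only the k smallest distances need to be ordered.
    std::partial_sort(dist_label.begin(),
                      dist_label.begin() + static_cast<std::ptrdiff_t>(k),
                      dist_label.end());

    std::map<int, std::size_t> votes;
    for (std::size_t i = 0; i < k; ++i)
        ++votes[dist_label[i].second];

    int best_label = votes.begin()->first;
    std::size_t best_count = 0;
    for (const auto& entry : votes) {
        if (entry.second > best_count) {
            best_count = entry.second;
            best_label = entry.first;
        }
    }
    return best_label;
}
```

Each query scans the whole training set, so this only makes sense for small to medium datasets; a tree-based neighbor search would be the natural replacement for the linear scan.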

@MohammadFakhreddin
Author

@iglesias Thanks a lot! I'll look into it and try to implement something based on that.

At the moment, I'm trying to keep the build as simple as possible, so I do my best to avoid a complex library. One of our goals is the project's accessibility. We have some prototypes in Python using Scikit Learn, but currently, my aim is the project's longevity and ease of build.

As a side note, I think the cmake minimum version is too high :)

Let me know if anyone knows about the current state of the UMAP library.

@lisitsyn
Owner

Hi @MohammadFakhreddin

thanks for reaching out! UMAP would be a good addition to the library, but neither of us has had enough time recently to implement it. As of now there is no implementation, not even in a branch.

@iglesias
Collaborator

Indeed. On the open source side, I am busy with CodeQL work and contributing to GitHub’s coding-standards repo. If I were to look into something in tapkee at the moment, I’d be more interested in a topic related to that (even loosely, such as safety with Circle, or simply trying the new clang real-time sanitizer on it).

I recall that there was already #95 about UMAP.

The umap Python repo on GitHub looks quite popular, and there’s also a C++ repo. What would be the goal of adding a new DR method to tapkee now? I wondered about it and couldn’t think of any besides completeness in tapkee.

@MohammadFakhreddin
Author

So, I integrated Tapkee into my project and fed it the dataset I used for testing PCA using OpenCV. Strangely, it took 3-4 seconds for OpenCV PCA, while for Tapkee, it took 4-5 minutes. I noticed that OpenCL was present in the cmake. Are you guys using OpenCL for optimization? Can it be that by not including OpenCL in my project, I made Tapkee much slower than OpenCV?
(I used the Passiflora dataset, which is a very large dataset. It worked well with smaller datasets :))

cmake_minimum_required (VERSION 3.16)

project (Tapkee  LANGUAGES CXX)

set (CMAKE_CXX_STANDARD 23)
set (TAPKEE_INCLUDE_DIR "${CMAKE_CURRENT_SOURCE_DIR}/include")

include_directories("${TAPKEE_INCLUDE_DIR}")

add_library(tapkee_library INTERFACE)
target_include_directories(tapkee_library INTERFACE "${TAPKEE_INCLUDE_DIR}")

@iglesias
Collaborator

Hello @MohammadFakhreddin,

Assuming I understood your message and questions correctly after reading them a few times: a comparison between OpenCV with GPU acceleration and Tapkee without it, provided that PCA is amenable to data parallelism, would obviously show a large difference on a large dataset.
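
To illustrate the data-parallelism point: the covariance matrix at the core of PCA can be built from per-chunk sums that are computed independently and then added together. The sketch below uses plain standard C++ and hypothetical names (`PartialSums`, `accumulate`, `merge`, `covariance`); it is not how either OpenCV or Tapkee implements PCA, and the chunks are processed sequentially here only to keep the example dependency-free.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper types for illustration; not part of Tapkee or OpenCV.
using Sample = std::vector<double>;

struct PartialSums {
    std::size_t count = 0;
    std::vector<double> sum;    // per-feature sums, size d
    std::vector<double> outer;  // sum of x * x^T, row-major d x d
};

// Accumulate sums over samples [begin, end). Each chunk touches only its own
// samples, so different chunks could run on different threads or devices.
PartialSums accumulate(const std::vector<Sample>& data, std::size_t begin,
                       std::size_t end, std::size_t d) {
    PartialSums p;
    p.sum.assign(d, 0.0);
    p.outer.assign(d * d, 0.0);
    for (std::size_t n = begin; n < end; ++n) {
        const Sample& x = data[n];
        assert(x.size() == d);
        ++p.count;
        for (std::size_t i = 0; i < d; ++i) {
            p.sum[i] += x[i];
            for (std::size_t j = 0; j < d; ++j)
                p.outer[i * d + j] += x[i] * x[j];
        }
    }
    return p;
}

// Combining two chunks is plain element-wise addition.
PartialSums merge(PartialSums a, const PartialSums& b) {
    assert(a.sum.size() == b.sum.size());
    a.count += b.count;
    for (std::size_t i = 0; i < a.sum.size(); ++i) a.sum[i] += b.sum[i];
    for (std::size_t i = 0; i < a.outer.size(); ++i) a.outer[i] += b.outer[i];
    return a;
}

// Population covariance: E[x x^T] - mean * mean^T, row-major d x d.
// Simple but not the most numerically stable formulation.
std::vector<double> covariance(const PartialSums& p) {
    assert(p.count > 0);
    const std::size_t d = p.sum.size();
    const double n = static_cast<double>(p.count);
    std::vector<double> cov(d * d);
    for (std::size_t i = 0; i < d; ++i)
        for (std::size_t j = 0; j < d; ++j)
            cov[i * d + j] = p.outer[i * d + j] / n - (p.sum[i] / n) * (p.sum[j] / n);
    return cov;
}
```

The eigendecomposition of the resulting d x d matrix is a separate step whose cost depends on the number of features rather than the number of samples.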
