Code for Deduplication #8

Open
qtli opened this issue Aug 20, 2024 · 2 comments

Comments


qtli commented Aug 20, 2024

Hi, thanks so much for your promising work!
I was hoping to inquire if it's possible for you to provide me with the code for the "Deduplication" section. Thank you in advance for your help!


zachschillaci27 commented Oct 15, 2024

> Hi, thanks so much for your promising work!
>
> I was hoping to inquire if it's possible for you to provide me with the code for the "Deduplication" section. Thank you in advance for your help!

Plus one! I'm thinking of implementing something similar myself. The similarity-based deduplication with embeddings seems straightforward enough, but I'd like to see their MinHash implementation.
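For anyone who wants a starting point in the meantime, here is a minimal MinHash sketch in pure Python. It is not the authors' implementation. The word-level shingles, the signature size of 128, and the helper names (`shingles`, `minhash_signature`, `estimated_jaccard`) are all assumptions made for illustration.

```python
import hashlib
import re


def shingles(text, n=1):
    """Return the set of lowercased word n-grams in text (n=1 is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _seeded_hash(token, seed):
    # 64-bit hash of the token; the seed is passed as a salt so that each
    # seed behaves like a different hash function.
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_perm=128):
    """For each of num_perm seeded hash functions, keep the minimum hash over all tokens."""
    if not tokens:
        raise ValueError("cannot build a signature from an empty token set")
    return [min(_seeded_hash(t, seed) for t in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = "A nurse who works night shifts at a city hospital"
    b = "A nurse who works night shifts at a rural hospital"
    sim = estimated_jaccard(
        minhash_signature(shingles(a)),
        minhash_signature(shingles(b)),
    )
    # Treat the pair as duplicates if sim exceeds a threshold of your choosing.
    print(f"estimated Jaccard similarity: {sim:.2f}")
```

The estimate gets tighter as `num_perm` grows, at the cost of more hashing per text. A library such as `datasketch` offers the same idea with faster hashing if pure Python turns out to be too slow.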

@zhangzx-uiuc

Also, there is another concern. The embedding-based similarity approach seems to have O(n^2) complexity, since every persona has to be compared against every other one. Does it take a lot of time in practice when the number of personas scales to 1 billion?
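To put a number on the concern, and to show one common way around it, here is a small sketch. Whether the authors did anything like this is unknown. The banding scheme below is the standard locality-sensitive hashing trick for MinHash signatures, and the names `all_pairs_count` and `candidate_pairs` are hypothetical. For embeddings, an approximate nearest-neighbor index would play the analogous role.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of unordered pairs a brute-force comparison has to score."""
    return n * (n - 1) // 2


def candidate_pairs(signatures, bands):
    """Return pairs of ids whose signatures agree on at least one full band.

    signatures: dict mapping an id to an equal-length list of ints.
    Only items that land in the same bucket are ever paired, so most
    dissimilar items are never compared at all.
    """
    sig_len = len(next(iter(signatures.values())))
    rows = sig_len // bands  # slots per band
    buckets = defaultdict(list)
    for key, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(key)

    pairs = set()
    for ids in buckets.values():
        for x, y in combinations(sorted(ids), 2):
            pairs.add((x, y))
    return pairs


if __name__ == "__main__":
    print(all_pairs_count(10**9))  # 499999999500000000 pairs for 1 billion items

    # Toy signatures written by hand so the result is easy to follow.
    toy = {
        "a": [1, 2, 3, 4],
        "b": [1, 2, 9, 9],  # shares the first band with "a"
        "c": [7, 7, 7, 7],  # shares nothing
    }
    print(candidate_pairs(toy, bands=2))  # {('a', 'b')}
```

With bucketing, the exact similarity check only runs on the candidate pairs, so the cost depends on how many near-duplicates exist and on the choice of `bands`, not directly on n^2. More bands with fewer rows each catches lower-similarity pairs but produces more candidates to verify.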
