Sentence and paragraph (etc) distances? #36
Comments
I agree that is a common use. I mean, my PhD thesis was on the fact that such simple linear combinations of word embeddings often outperform more sophisticated methods. But I am not sure it is worth including in the package.

    const get_word_index = Dict(word=>ii for (ii,word) in enumerate(embtable.vocab))
    get_embedding(word) = embtable.embeddings[:, get_word_index[word]]

Which allows them to do something fancier if they have, for example, loaded their words into a [...]. Similarly, things like sums of embeddings are also one-liners.

    using Statistics: mean  # needed for `mean`
    sowe(words) = sum(get_embedding, words)
    mowe(words) = mean(get_embedding, words)

And if they want to do something fancier to handle out-of-vocabulary words etc., then they are free to do so.
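To make the "something fancier" concrete, here is a minimal sketch of a sentence distance built on the one-liners above. It is not part of Embeddings.jl: `mowe_known`, `cosine_distance`, and `mowe_distance` are hypothetical names, the `embtable` is a toy stand-in with made-up values (it only mimics the `vocab` and `embeddings` fields used above), and skipping out-of-vocabulary words is just one possible policy.

```julia
using Statistics: mean
using LinearAlgebra: dot, norm

# Toy stand-in for a loaded embedding table (hypothetical values).
# Only the `vocab` and `embeddings` fields used in the thread are mimicked.
embtable = (vocab = ["cat", "dog", "car"],
            embeddings = [1.0 0.9 0.0;
                          0.0 0.1 1.0])   # one column per word

const word_index = Dict(word => ii for (ii, word) in enumerate(embtable.vocab))
get_embedding(word) = embtable.embeddings[:, word_index[word]]

# Mean of word embeddings over the in-vocabulary words only.
# Assumption: out-of-vocabulary words are simply dropped.
function mowe_known(words)
    known = [w for w in words if haskey(word_index, w)]
    isempty(known) && throw(ArgumentError("no in-vocabulary words"))
    return mean(get_embedding, known)
end

# Cosine distance between two embedding vectors.
cosine_distance(a, b) = 1 - dot(a, b) / (norm(a) * norm(b))

# Distance between two tokenised sentences via their mean embeddings.
mowe_distance(words_a, words_b) =
    cosine_distance(mowe_known(words_a), mowe_known(words_b))
```

With the toy table, `mowe_distance(["cat"], ["dog"])` comes out smaller than `mowe_distance(["cat"], ["car"])`, and an unknown word such as `"unicorn"` is ignored rather than raising a `KeyError`.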
Yes, I saw your thesis (but haven't read it all). Sure, it's simple enough to keep it out. I figured not everyone who needs sentence/paragraph distances would know about sowe/mowe, so having it in a package might make it easier, but maybe many do. Anyway, no problem. BTW, would you recommend straight mowe/sowe on all the words of a paragraph (well, potentially excluding stop words etc.), or rather computing pairwise similarities between sentences and then aggregating them in some way? I haven't explored it much for larger batches of text, and my intuition tells me that just taking the mean would lose "resolution" at some point. Do you know of some papers investigating this empirically?
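For concreteness, the two options in the question could be sketched roughly as follows. This is only an illustration under stated assumptions: the embedding table is a toy stand-in with made-up values, `paragraph_similarity` and `pairwise_similarity` are hypothetical names, and the "best match per sentence, then average" aggregation is one arbitrary choice among many, not something proposed in this thread.

```julia
using Statistics: mean
using LinearAlgebra: dot, norm

# Toy stand-in for a loaded embedding table (hypothetical values).
embtable = (vocab = ["cat", "dog", "car", "road"],
            embeddings = [1.0 0.9 0.0 0.1;
                          0.0 0.1 1.0 0.9])   # one column per word

const word_index = Dict(word => ii for (ii, word) in enumerate(embtable.vocab))
get_embedding(word) = embtable.embeddings[:, word_index[word]]
mowe(words) = mean(get_embedding, words)

cosine_similarity(a, b) = dot(a, b) / (norm(a) * norm(b))

# A paragraph is a vector of tokenised sentences (vectors of words).

# Option 1: a single mean vector over all words of each paragraph.
paragraph_similarity(p, q) =
    cosine_similarity(mowe(reduce(vcat, p)), mowe(reduce(vcat, q)))

# Option 2: compare sentences pairwise, take each sentence's best match
# in the other paragraph, and average (symmetrised over both directions).
function pairwise_similarity(p, q)
    sims = [cosine_similarity(mowe(s), mowe(t)) for s in p, t in q]
    best_for_p = mean(maximum(sims; dims=2))  # best match for each sentence of p
    best_for_q = mean(maximum(sims; dims=1))  # best match for each sentence of q
    return (best_for_p + best_for_q) / 2
end

p = [["cat"], ["car"]]
q = [["dog"], ["road"]]
```

On this toy data the whole-paragraph means of `p` and `q` coincide, so `paragraph_similarity(p, q)` reports them as identical, while `pairwise_similarity(p, q)` stays below 1. That is the loss of "resolution" the question worries about, though a toy table says nothing about how large the effect is on real text.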
Straight mowe/sowe is so simple to implement it should be the first thing you try (possibly after plain BoW).
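A plain bag-of-words baseline needs no embeddings at all. The sketch below is one common way to set it up (raw word counts compared by cosine similarity); `bag_of_words` and `bow_cosine_similarity` are hypothetical helper names, not functions from any JuliaText package.

```julia
using LinearAlgebra: dot, norm

# Count how often each word occurs in a tokenised text.
function bag_of_words(words)
    counts = Dict{String,Int}()
    for w in words
        counts[w] = get(counts, w, 0) + 1
    end
    return counts
end

# Cosine similarity between the count vectors of two tokenised texts,
# built over the union of their vocabularies.
function bow_cosine_similarity(words_a, words_b)
    a, b = bag_of_words(words_a), bag_of_words(words_b)
    vocab = collect(union(keys(a), keys(b)))
    va = [get(a, w, 0) for w in vocab]
    vb = [get(b, w, 0) for w in vocab]
    return dot(va, vb) / (norm(va) * norm(vb))
end
```

Unlike mowe/sowe, this baseline treats `"cat"` and `"dog"` as completely unrelated, so comparing it against the embedding-based distance shows how much the embeddings actually contribute.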
Thanks for this package; very useful.
Would it make sense to include simple multi-word distance metrics like MOWE (mean/median of word embeddings) etc. in this package, or is that already available in other JuliaText packages? I didn't find it, but it seems a quite common use case for people who download Embeddings.jl. An alternative might be to make these part of StringDistances.jl instead.