is_oov in comparison to in nlp.vocab #136
Comments
@eliorc - if you take a look at the issue on SpaCy I linked to above, you can see that this is a general feature of spacy models and not specific to scispacy. Sorry about that! It seems like the spacy vocab is also used as a cache, so it's possible that new words are added to it which were not originally included in the vocab. You might have better luck checking that the |
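To illustrate the cache behaviour described above, here is a toy sketch. `ToyCachingVocab` and its `lookup` method are hypothetical names invented for this example and are not spaCy's implementation; the sketch only shows how a membership test can start returning True for a word that was never part of the original vocabulary.

```python
class ToyCachingVocab:
    """Toy vocabulary that doubles as a cache (hypothetical, not spaCy's code)."""

    def __init__(self, known_words):
        # Words assumed to ship with the model; these are not OOV.
        self._entries = {word: {"is_oov": False} for word in known_words}

    def __contains__(self, word):
        # Membership only says an entry exists, not that it has a vector.
        return word in self._entries

    def lookup(self, word):
        # Looking up an unseen word creates and caches an entry flagged as OOV.
        if word not in self._entries:
            self._entries[word] = {"is_oov": True}
        return self._entries[word]


vocab = ToyCachingVocab(["cell", "protein"])
before = "smelling" in vocab      # word not seen yet
entry = vocab.lookup("smelling")  # processing the word caches it
after = "smelling" in vocab       # now present, but still flagged OOV
```

In this toy model the word is "in the vocab" after being processed once while its entry is still marked out-of-vocabulary, which mirrors the behaviour reported in the issue.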
What is the meaning of |
Ah sorry - spacy uses log probabilities (for computational precision) for its |
Seems legit, do we know for sure that any OOV vocabulary will be -20 or less? Is it safe to use this condition for OOV assertion?
Umm, I think so? Not 100% sure about that, but I think it's the default minimum value, so maybe? Sorry I can't be more helpful!
Thanks, I'll keep following the issue you referenced
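A minimal sketch of the check being discussed, assuming the -20 floor mentioned in this thread really is the default minimum (this was not confirmed above). `looks_oov_by_prob` and `ASSUMED_PROB_FLOOR` are hypothetical names; the function only needs an object with a `prob` attribute holding a log probability, so stand-in objects are used here instead of a loaded model.

```python
from types import SimpleNamespace

# Assumption taken from the discussion above, not a verified constant.
ASSUMED_PROB_FLOOR = -20.0


def looks_oov_by_prob(lexeme, floor=ASSUMED_PROB_FLOOR):
    """Treat an entry as OOV when its log probability is at or below the floor."""
    return lexeme.prob <= floor


# Stand-ins for entries exposing a `prob` attribute.
common_word = SimpleNamespace(text="cell", prob=-9.3)
unseen_word = SimpleNamespace(text="smelling", prob=-20.0)

flags = {w.text: looks_oov_by_prob(w) for w in (common_word, unseen_word)}
```

Because the floor is unverified, it is probably safer to treat this as a heuristic and cross-check it against whether the token actually has a non-zero vector.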
I'm looking to use scispacy's en_core_sci_md model for various purposes, one being using its word vectors as an input to a neural network.

As I was checking the coverage of the existing embedding, I noticed a weird phenomenon where a given token's token.is_oov == True, though token.text in nlp.vocab == True. When this happens, token.vector.sum() == 0.

I can't figure out how this makes sense: if it is in the vocabulary, how come it is OOV and has an all-zero vector? Also, some basic words are missing, for example:

How come it is OOV yet returns True when checking in nlp.vocab?

Is it expected that basic words like smelling won't have a vector?
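The coverage check described above could be sketched roughly as follows. The helper names `find_oov_but_in_vocab` and `zero_vector_tokens` are hypothetical; they rely only on the token attributes `text`, `is_oov`, `has_vector` and `vector_norm`, plus a vocab that supports the `in` operator. Stand-in objects are used so the sketch runs without downloading a model.

```python
from types import SimpleNamespace


def find_oov_but_in_vocab(tokens, vocab):
    """Texts of tokens flagged OOV that still pass the `in vocab` test."""
    return [t.text for t in tokens if t.is_oov and t.text in vocab]


def zero_vector_tokens(tokens):
    """Texts of tokens with no vector or an all-zero vector."""
    return [t.text for t in tokens if not t.has_vector or t.vector_norm == 0.0]


# Stand-in tokens; with a real pipeline these would come from nlp(text).
tokens = [
    SimpleNamespace(text="protein", is_oov=False, has_vector=True, vector_norm=4.2),
    SimpleNamespace(text="smelling", is_oov=True, has_vector=False, vector_norm=0.0),
]
toy_vocab = {"protein", "smelling"}  # a set stands in for nlp.vocab

inconsistent = find_oov_but_in_vocab(tokens, toy_vocab)
missing_vectors = zero_vector_tokens(tokens)
```

With a loaded model, the same functions should work on `nlp(text)` and `nlp.vocab` directly, which makes it easy to list exactly which tokens show the mismatch.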