
is_oov in comparison to in nlp.vocab #136

Closed
eliorc opened this issue Jul 20, 2019 · 6 comments


eliorc commented Jul 20, 2019

I'm looking to use scispacy's en_core_sci_md model for various purposes, one of them being using its word vectors as input to a neural network.
While checking the coverage of the existing embeddings, I noticed a strange phenomenon: for a given token, token.is_oov == True even though token.text in nlp.vocab == True. When this happens, token.vector.sum() == 0.
I can't figure out how this makes sense. If the token is in the vocabulary, how can it be OOV and have an all-zero vector? Some basic words are also missing, for example:

import random
import spacy

nlp = spacy.load("en_core_sci_md")
tokens = gather_all_tokens_from_corpus()  # placeholder helper: yields Token objects from my corpus

some_token = random.choice([t for t in tokens if t.is_oov])
print(some_token)
>>> smelling

some_token.text in nlp.vocab
>>> True

some_token.vector
>>> array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)

How come it is OOV yet returns True when checking in nlp.vocab?
Is it expected that basic words like smelling won't have a vector?

@DeNeutoy
Contributor

@eliorc - if you take a look at the spaCy issue I linked to above, you can see that this is a general behaviour of spaCy models and not specific to scispacy. Sorry about that!

It seems that the spaCy vocab is also used as a cache, so new words can be added to it that were not part of the original vocabulary. You might have better luck checking whether the .prob attribute is > -20.
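A minimal sketch of that comparison (assuming spaCy 2.x with en_core_sci_md installed; the example sentence and variable names are my own, not from the thread):

import spacy

nlp = spacy.load("en_core_sci_md")
doc = nlp("The patient reported smelling smoke.")

for token in doc:
    # `in nlp.vocab` only checks that the string has a Vocab entry, which can be
    # created lazily; `.prob > -20` approximates "seen in the unigram counts".
    print(token.text, token.is_oov, token.text in nlp.vocab, token.prob > -20)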


eliorc commented Jul 23, 2019

What is the meaning of .prob > -20? Can you elaborate on that?

@DeNeutoy
Contributor

Ah sorry - spaCy uses log probabilities (for numerical precision) for its .prob attribute, which is calculated from unigram data. If a word is added to the vocabulary after the model was first created (for instance, because the vocab is used as a cache), its probability is set to the "default" value, which is -20. So I think any token with a .prob greater than this value will have been seen at training time. Sorry if that's not super helpful...
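As an illustration of that heuristic, a rough sketch of counting how many vocab entries sit above the default (the -20 cutoff and the model name follow the discussion above; treat this as an assumption and verify against your spaCy version):

import spacy

nlp = spacy.load("en_core_sci_md")

# Lexemes whose unigram log-probability is above the "never counted" default.
trained = [lex.text for lex in nlp.vocab if lex.prob > -20]
print(f"{len(trained)} vocab entries look like they were seen at training time")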


eliorc commented Jul 24, 2019

Seems legit. Do we know for sure that any OOV token will have a prob of -20 or less? Is it safe to use this condition to assert OOV status?

@DeNeutoy
Contributor

Umm, I think so? Not 100% sure about that, but I think it's the default minimum value, so maybe? Sorry I can't be more helpful!


eliorc commented Jul 27, 2019

Thanks, I'll keep following the issue you referenced.

eliorc closed this as completed Jul 27, 2019