
is_oov in comparison to in nlp.vocab #136

Closed
eliorc opened this issue Jul 20, 2019 · 6 comments


eliorc commented Jul 20, 2019

I'm looking to use scispacy's en_core_sci_md model for various purposes, one of them being using its word vectors as input to a neural network.
While checking the coverage of the existing embeddings, I noticed a strange phenomenon: for a given token, token.is_oov == True even though token.text in nlp.vocab == True. When this happens, token.vector.sum() == 0.
I can't figure out how this makes sense. If the token is in the vocabulary, how can it be OOV and have an all-zero vector? Some basic words are also missing, for example:

import random
import spacy

nlp = spacy.load("en_core_sci_md")
tokens = gather_all_tokens_from_corpus()  # placeholder helper: yields Token objects from my corpus

some_token = random.choice([t for t in tokens if t.is_oov])
print(some_token)
>>> smelling

some_token.text in nlp.vocab
>>> True

some_token.vector
>>> array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)

How come it is OOV yet returns True when checking in nlp.vocab?
Is it expected that basic words like smelling won't have a vector?

@DeNeutoy
Contributor

@eliorc - if you take a look at the spaCy issue I linked to above, you can see that this is a general behaviour of spaCy models and not specific to scispacy. Sorry about that!

It seems that the spaCy vocab is also used as a cache, so new words can be added to it that were not part of the original vocabulary. You might have better luck checking whether the .prob attribute is > -20.
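A minimal sketch of that comparison (assuming spaCy 2.x with en_core_sci_md installed; the example sentence and variable names are my own, not from the thread):

import spacy

nlp = spacy.load("en_core_sci_md")
doc = nlp("The patient reported smelling smoke.")

for token in doc:
    # `in nlp.vocab` only checks that the string has a Vocab entry, which can be
    # created lazily; `.prob > -20` approximates "seen in the unigram counts".
    print(token.text, token.is_oov, token.text in nlp.vocab, token.prob > -20)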


eliorc commented Jul 23, 2019

What is the meaning of .prob > -20? Can you elaborate on that?

@DeNeutoy
Contributor

Ah sorry - spaCy uses log probabilities (for numerical precision) for its .prob attribute, which is calculated from unigram data. If a word is added to the vocabulary after the model was first created (for instance, because the vocab is used as a cache), its probability is set to the "default" value, which is -20. So I think any token with a .prob greater than this value will have been seen at training time. Sorry if that's not super helpful...
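As an illustration of that heuristic, a rough sketch of counting how many vocab entries sit above the default (the -20 cutoff and the model name follow the discussion above; treat this as an assumption and verify against your spaCy version):

import spacy

nlp = spacy.load("en_core_sci_md")

# Lexemes whose unigram log-probability is above the "never counted" default.
trained = [lex.text for lex in nlp.vocab if lex.prob > -20]
print(f"{len(trained)} vocab entries look like they were seen at training time")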


eliorc commented Jul 24, 2019

Seems legit. Do we know for sure that any OOV token will have a prob of -20 or less? Is it safe to use this condition to assert OOV status?

@DeNeutoy
Contributor

Umm, I think so? Not 100% sure about that, but I think it's the default minimum value, so maybe? Sorry I can't be more helpful!


eliorc commented Jul 27, 2019

Thanks, I'll keep following the issue you referenced.

eliorc closed this as completed Jul 27, 2019