
How should is_oov be used? #3994

Closed
DeNeutoy opened this issue Jul 20, 2019 · 7 comments
Labels
docs (Documentation and website) · enhancement (Feature requests and improvements) · usage (General spaCy usage)

Comments

@DeNeutoy
Contributor

I had a question in scispacy (allenai/scispacy#136) regarding the usage of is_oov, which left me confused:

How should the is_oov flag be used? Initially, I thought it would correspond to tokens which do not have a vector, but it seems like it should correspond to existence in the nlp.vocab, given this line: https://github.com/explosion/spaCy/blob/master/spacy/cli/init_model.py#L142

```python
In [28]: x = spacy.load("en_core_sci_sm")

In [29]: doc = x("hello this word smelling is oov.")

In [30]: [t.is_oov for t in doc]
Out[30]: [True, False, False, True, False, True, False]

In [31]: x = spacy.load("en_core_web_sm")

In [32]: doc = x("hello this word smelling is oov.")

In [33]: [t.is_oov for t in doc]
Out[33]: [True, True, True, True, True, True, True]
```

I'd seen previously in #3986 that this was an issue with v2.0 models, so I double-checked that the model is fresh:

```python
In [36]: x.meta
Out[36]:
{'accuracy': {'ents_f': 85.8587845242,
  'ents_p': 86.3317889027,
  'ents_r': 85.3909350025,
  'las': 89.6616629074,
  'tags_acc': 96.7783856079,
  'token_acc': 99.0697323163,
  'uas': 91.5287392082},
 'author': 'Explosion AI',
 'description': 'English multi-task CNN trained on OntoNotes. Assigns context-specific token vectors, POS tags, dependency parse and named entities.',
 'email': '[email protected]',
 'lang': 'en',
 'license': 'MIT',
 'name': 'core_web_sm',
 'parent_package': 'spacy',
 'pipeline': ['tagger', 'parser', 'ner'],
 'sources': ['OntoNotes 5'],
 'spacy_version': '>=2.1.0',
 'speed': {'cpu': 6684.8046553827, 'gpu': None, 'nwords': 291314},
 'url': 'https://explosion.ai',
 'version': '2.1.0',
 'vectors': {'width': 0, 'vectors': 0, 'keys': 0, 'name': None}}
```

Additionally, I'm not quite sure how I've managed to get this behaviour in one of the scispacy models:

```python
In [51]: x = spacy.load("en_core_sci_sm")

In [52]: doc = x("hello this word smelling is oov.")

In [53]: for t in doc:
    ...:     print(t.is_oov, t.text in x.vocab)
    ...:
True True
False True
False True
True True
False True
True False
False True

In [54]: x = spacy.load("en_core_web_sm")

In [55]: doc = x("hello this word smelling is oov.")

In [56]: for t in doc:
    ...:     print(t.is_oov, t.text in x.vocab)
    ...:
True True
True True
True True
True True
True True
True True
True True
```

So basically, I'm just wondering what the correct interpretation of Token.is_oov is 😄

Thanks!

Which page or section is this issue related to?

https://spacy.io/api/token

@honnibal
Member

honnibal commented Jul 22, 2019

The intended interpretation is "tokens that don't have a meaningful .prob value", which corresponded to words that weren't in the vocab.

This breaks down in the _sm models, and might not work well in other data packs if no large frequency count was used to build the vocab. So I'm not sure it's always useful.

@DeNeutoy
Contributor Author

Cool, thanks! I'm not quite sure I understand why this is not useful in the small models, though. I thought the small models still use frequency statistics when building the vocab (they just don't include vectors). Is the vocabulary of the small models different (aside from being smaller) as well? I think understanding this would help me get to the bottom of this case:

  • `t in vocab == True, t.is_oov == True` # seems wrong by definition

Is this just an artifact of the way the small models work?

@honnibal
Member

Maybe I'm wrong but I thought the small models didn't have much vocab?

`t in vocab == True, t.is_oov == True`

Agree this is confusing :(. The problem is we do add entries to the vocab during processing, as the vocab also acts as a cache. But if we didn't have an initial probability value for the word, we still mark it as oov. Open to suggestions for how to improve this.
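The vocab-as-cache behaviour described above can be modelled with a small sketch in plain Python. This is purely illustrative: the `ToyVocab` class and the probability values are invented for this example and are not spaCy's actual implementation.

```python
# Toy model of the behaviour described above: the vocab doubles as a
# cache, so looking a word up adds an entry, but only words that had a
# frequency/probability at build time are marked as in-vocabulary.
class ToyVocab:
    def __init__(self, word_probs):
        # Words with known log-probabilities at build time.
        self._lexemes = {w: {"prob": p, "is_oov": False}
                         for w, p in word_probs.items()}

    def lookup(self, word):
        # Acts as a cache: unseen words are added on first lookup,
        # but flagged as OOV because there is no build-time probability.
        if word not in self._lexemes:
            self._lexemes[word] = {"prob": None, "is_oov": True}
        return self._lexemes[word]

    def __contains__(self, word):
        return word in self._lexemes


vocab = ToyVocab({"this": -5.0, "word": -6.2, "is": -4.1})
lex = vocab.lookup("smelling")  # first lookup caches the entry

print("smelling" in vocab)  # True: the lookup cached it
print(lex["is_oov"])        # True: no build-time probability
```

This reproduces the confusing combination from the bullet above: after processing, `t in vocab` is True while `t.is_oov` is also True.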

@honnibal honnibal added the usage General spaCy usage label Jul 23, 2019
@ines ines added docs Documentation and website enhancement Feature requests and improvements labels Jul 23, 2019
@DeNeutoy
Contributor Author

Cool, I think I understand how that could happen now, thanks!

I think the utility of knowing whether a token has a .prob is not huge - perhaps `t.is_oov == t not in vocab` would be a better definition, but that's probably tricky to change now!

@mdgilene

This does seem kind of strange. For my use case, I am training the NER with new entities/labels. After training, the words are not added to the vocab, but the NER does correctly label them. One would assume that if the NER can correctly label a word, it should be in the vocab.

@honnibal
Member

@mdgilene Well... the embedding tables use the hashing trick, so they don't require a fixed-size vocabulary to be computed ahead of time. I understand that it's confusing, but I'll close this, as changing the behaviour would introduce a lot of backwards incompatibilities.
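The hashing trick mentioned here can be sketched as follows. This is an illustration only: the md5 hash and the table size of 1024 are assumptions for the example, not spaCy's actual hash function or dimensions. The point is that any string, whether or not it was seen in training, deterministically maps to some row of a fixed-size embedding table, so no precomputed vocabulary is needed.

```python
import hashlib

N_ROWS = 1024  # fixed embedding-table size, chosen for illustration


def embedding_row(word: str) -> int:
    # Hash the word to a stable integer, then fold it into the table.
    # Every string gets a row, seen or unseen, at the cost of
    # occasional collisions between different words.
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_ROWS


print(embedding_row("smelling"))  # same row on every run
print(0 <= embedding_row("oov") < N_ROWS)  # True
```

This is why the NER can handle words that are "not in the vocab": the model never consults a closed word list, it just hashes whatever it sees into the embedding table.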

@lock

lock bot commented Oct 17, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Oct 17, 2019