
How should is_oov be used? #3994

Closed
DeNeutoy opened this issue Jul 20, 2019 · 7 comments
Labels
docs (Documentation and website) · enhancement (Feature requests and improvements) · usage (General spaCy usage)

Comments

@DeNeutoy
Contributor

I had a question in scispacy (allenai/scispacy#136) regarding the usage of is_oov, which left me confused:

How should the is_oov flag be used? Initially, I thought it would correspond to tokens which do not have a vector, but it seems like it should correspond to existence in the nlp.vocab, given this line: https://github.com/explosion/spaCy/blob/master/spacy/cli/init_model.py#L142

```python
In [28]: x = spacy.load("en_core_sci_sm")

In [29]: doc = x("hello this word smelling is oov.")

In [30]: [t.is_oov for t in doc]
Out[30]: [True, False, False, True, False, True, False]

In [31]: x = spacy.load("en_core_web_sm")

In [32]: doc = x("hello this word smelling is oov.")

In [33]: [t.is_oov for t in doc]
Out[33]: [True, True, True, True, True, True, True]
```

I'd seen previously in #3986 that this was an issue with v2.0 models, so I double-checked that the model is fresh:

```python
In [36]: x.meta
Out[36]:
{'accuracy': {'ents_f': 85.8587845242,
  'ents_p': 86.3317889027,
  'ents_r': 85.3909350025,
  'las': 89.6616629074,
  'tags_acc': 96.7783856079,
  'token_acc': 99.0697323163,
  'uas': 91.5287392082},
 'author': 'Explosion AI',
 'description': 'English multi-task CNN trained on OntoNotes. Assigns context-specific token vectors, POS tags, dependency parse and named entities.',
 'email': '[email protected]',
 'lang': 'en',
 'license': 'MIT',
 'name': 'core_web_sm',
 'parent_package': 'spacy',
 'pipeline': ['tagger', 'parser', 'ner'],
 'sources': ['OntoNotes 5'],
 'spacy_version': '>=2.1.0',
 'speed': {'cpu': 6684.8046553827, 'gpu': None, 'nwords': 291314},
 'url': 'https://explosion.ai',
 'version': '2.1.0',
 'vectors': {'width': 0, 'vectors': 0, 'keys': 0, 'name': None}}
```

Additionally, I'm not quite sure how I've managed to get this behaviour in one of the scispacy models:

```python
In [51]: x = spacy.load("en_core_sci_sm")

In [52]: doc = x("hello this word smelling is oov.")

In [53]: for t in doc:
    ...:     print(t.is_oov, t.text in x.vocab)
    ...:
True True
False True
False True
True True
False True
True False
False True

In [54]: x = spacy.load("en_core_web_sm")

In [55]: doc = x("hello this word smelling is oov.")

In [56]: for t in doc:
    ...:     print(t.is_oov, t.text in x.vocab)
    ...:
True True
True True
True True
True True
True True
True True
True True
```

So basically, I'm just wondering what the correct interpretation of Token.is_oov is 😄

Thanks!

Which page or section is this issue related to?

https://spacy.io/api/token

@honnibal
Member

honnibal commented Jul 22, 2019

The intended interpretation is "tokens that don't have a meaningful .prob value", which corresponded to words that weren't in the vocab.

This breaks down in the _sm models, and might not work well in other data packs if no large frequency count was used to build the vocab. So I'm not sure it's always useful.

@DeNeutoy
Contributor Author

Cool, thanks! I'm not quite sure I understand why this is not useful in the small models, though. I thought the small models still use frequency statistics when building the vocab (they just don't include vectors). Is the vocabulary of the small models different (aside from being smaller) as well? I think understanding this would help me get to the bottom of this case:

  • `t in vocab == True, t.is_oov == True` # seems wrong by definition

Is this just an artifact of the way the small models work?

@honnibal
Member

Maybe I'm wrong but I thought the small models didn't have much vocab?

`t in vocab == True, t.is_oov == True`

Agree this is confusing :(. The problem is we do add entries to the vocab during processing, as the vocab also acts as a cache. But if we didn't have an initial probability value for the word, we still mark it as oov. Open to suggestions for how to improve this.
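The vocab-as-cache behaviour described above can be modelled with a small sketch in plain Python. This is purely illustrative: the `ToyVocab` class and the probability values are invented for this example and are not spaCy's actual implementation.

```python
# Toy model of the behaviour described above: the vocab doubles as a
# cache, so looking a word up adds an entry, but only words that had a
# frequency/probability at build time are marked as in-vocabulary.
class ToyVocab:
    def __init__(self, word_probs):
        # Words with known log-probabilities at build time.
        self._lexemes = {w: {"prob": p, "is_oov": False}
                         for w, p in word_probs.items()}

    def lookup(self, word):
        # Acts as a cache: unseen words are added on first lookup,
        # but flagged as OOV because there is no build-time probability.
        if word not in self._lexemes:
            self._lexemes[word] = {"prob": None, "is_oov": True}
        return self._lexemes[word]

    def __contains__(self, word):
        return word in self._lexemes


vocab = ToyVocab({"this": -5.0, "word": -6.2, "is": -4.1})
lex = vocab.lookup("smelling")  # first lookup caches the entry

print("smelling" in vocab)  # True: the lookup cached it
print(lex["is_oov"])        # True: no build-time probability
```

This reproduces the confusing combination from the bullet above: after processing, `t in vocab` is True while `t.is_oov` is also True.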

@honnibal honnibal added the usage General spaCy usage label Jul 23, 2019
@ines ines added docs Documentation and website enhancement Feature requests and improvements labels Jul 23, 2019
@DeNeutoy
Contributor Author

Cool, I think I understand how that could happen now, thanks!

I think the utility of knowing whether a token has a .prob is not huge - perhaps `t.is_oov == t not in vocab` would be a better definition, but that's probably tricky to change now!

@mdgilene

This does seem kind of strange. For my use case, I am training the NER with new entities/labels. After training, the words are not added to the vocab, but the NER does correctly label them. One would assume that if the NER can correctly label a word, it should be in the vocab.

@honnibal
Member

@mdgilene Well... the embedding tables use the hashing trick, so they don't require a fixed-size vocabulary to be computed ahead of time. I understand that it's confusing, but I'll close this, as changing the behaviour would introduce a lot of backwards incompatibilities.
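The hashing trick mentioned here can be sketched as follows. This is an illustration only: the md5 hash and the table size of 1024 are assumptions for the example, not spaCy's actual hash function or dimensions. The point is that any string, whether or not it was seen in training, deterministically maps to some row of a fixed-size embedding table, so no precomputed vocabulary is needed.

```python
import hashlib

N_ROWS = 1024  # fixed embedding-table size, chosen for illustration


def embedding_row(word: str) -> int:
    # Hash the word to a stable integer, then fold it into the table.
    # Every string gets a row, seen or unseen, at the cost of
    # occasional collisions between different words.
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_ROWS


print(embedding_row("smelling"))  # same row on every run
print(0 <= embedding_row("oov") < N_ROWS)  # True
```

This is why the NER can handle words that are "not in the vocab": the model never consults a closed word list, it just hashes whatever it sees into the embedding table.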

@lock

lock bot commented Oct 17, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Oct 17, 2019