How should is_oov be used? #3994
The intended interpretation is, "tokens that don't have a meaningful […]". This gets non-useful in the small models.
Cool, thanks! I'm not quite sure I understand why this is not useful in the small models, though. I thought that the small models still use frequency statistics when building the vocab (they just don't use vectors). Is the vocabulary of the small models different (aside from being smaller) as well? I think understanding this would help me get to the bottom of this case: is this just an artifact of the way the small models work?
Maybe I'm wrong, but I thought the small models didn't have much vocab?
Agree this is confusing :(. The problem is that we do add entries to the vocab during processing, as the vocab also acts as a cache. But if we didn't have an initial probability value for the word, we still mark it as OOV.
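A pure-Python sketch of the behaviour described above (illustrative only, not spaCy's actual implementation — the `Vocab`/`Lexeme` classes here are stand-ins): the vocab doubles as a cache, so looking up an unseen word inserts an entry for it, but because that entry has no initial probability it stays flagged as OOV.

```python
class Lexeme:
    """Minimal stand-in for a vocab entry (illustrative only)."""

    def __init__(self, text, prob=None):
        self.text = text
        self.prob = prob
        # No initial probability -> marked out-of-vocabulary,
        # even though the entry now lives in the vocab cache.
        self.is_oov = prob is None


class Vocab:
    """A vocab that also acts as a cache: lookups insert new entries."""

    def __init__(self, initial_probs):
        self._entries = {
            text: Lexeme(text, prob) for text, prob in initial_probs.items()
        }

    def __getitem__(self, text):
        # Cache behaviour: an unseen word is added on first lookup...
        if text not in self._entries:
            self._entries[text] = Lexeme(text)  # ...but with no prob
        return self._entries[text]

    def __contains__(self, text):
        return text in self._entries


vocab = Vocab({"the": -3.5, "cat": -8.1})
lex = vocab["flibbertigibbet"]
print("flibbertigibbet" in vocab)  # True: the lookup cached an entry
print(lex.is_oov)                  # True: still OOV, it had no prob
print(vocab["cat"].is_oov)         # False: "cat" had an initial prob
```

This reproduces the surprising combination from the thread: a word can be "in the vocab" and OOV at the same time.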
Cool, I think I understand how that could happen now, thanks! I think the utility of whether a […]
This does seem kind of strange. For my use case, I am training the NER with new entities/labels. However, after the training process the words are not added to the vocab, but the NER does correctly label them. One would assume that if the NER can correctly label a word, it should be in the vocab.
@mdgilene Well... the embedding tables use the hashing trick, so they don't require a fixed-size vocabulary to be computed ahead of time. I understand that it's confusing, but I'll close this, as changing the behaviour would introduce a lot of backwards incompatibilities.
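A minimal sketch of the hashing trick mentioned above (illustrative; this is not spaCy's actual code, and the table size and hash function are arbitrary choices): rows in a fixed-size embedding table are addressed by hashing the word, so any string — seen in training or not — maps to some row, and no vocabulary needs to be computed ahead of time.

```python
import hashlib

NUM_ROWS = 16  # tiny table for illustration; real models use far more rows


def embedding_row(word: str, num_rows: int = NUM_ROWS) -> int:
    """Map any string to a row index in a fixed-size embedding table."""
    digest = hashlib.md5(word.encode("utf8")).digest()
    # Deterministic hash -> the same word always hits the same row.
    return int.from_bytes(digest[:8], "little") % num_rows


# Every word gets a row; collisions are possible, but the table size
# never depends on how many distinct words the model encounters.
for word in ("cat", "dog", "flibbertigibbet"):
    print(word, "->", embedding_row(word))
```

This is why the NER can label words that were never added to the vocab: the features it uses come from hashed rows, not from vocabulary membership.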
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I had a question (allenai/scispacy#136) in scispacy regarding the usage of is_oov, which then confused me: how should the is_oov flag be used? Initially, I thought it would correspond to tokens which do not have a vector, but it seems like it should correspond to existence in the nlp.vocab, given this line: https://github.com/explosion/spaCy/blob/master/spacy/cli/init_model.py#L142

I've seen previously in #3986 that this was an issue from v2.0, so I double-checked that the model is fresh:

Additionally, I'm not quite sure how I've managed to get this behaviour in one of the scispacy models:

So basically, I'm just wondering what the correct interpretation of Token.is_oov is 😄 Thanks!
Which page or section is this issue related to?
https://spacy.io/api/token