
[Bug]: Bert tokenizer is tokenizing some tokens as UNK #11184

Closed
1 task done
maxdebayser opened this issue Dec 13, 2024 · 1 comment · Fixed by #11815
Labels
bug Something isn't working

Comments

@maxdebayser
Contributor

Your current environment

(The output of `python collect_env.py` was not provided.)

Model Input Dumps

No response

🐛 Describe the bug

With some BERT and RoBERTa models, such as sentence-transformers/all-MiniLM-L12-v2, I found that the output differs from the one generated by sentence-transformers. If I place the following print in `_normalize_prompt_text_to_input()` in `serving_engine.py`:

        print(f"{input_ids=}")

I get `[101, 100, 3007, 1997, 100, 2003, 100, 1012, 102]` for the sentence "The capital of France is Paris."; `100` is the UNK token. When I run with sentence-transformers, I get `[101, 1996, 3007, 1997, 2605, 2003, 3000, 1012, 102]`. This problem happens with both `--tokenizer-mode auto` and `--tokenizer-mode slow`.
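To illustrate why the UNK tokens appear, here is a minimal self-contained sketch (using a tiny hypothetical vocabulary, not the real WordPiece vocab): an uncased vocab stores only lowercase entries, so a case-sensitive lookup misses every capitalized word and falls back to UNK.

```python
# Hypothetical toy vocab mimicking an uncased WordPiece vocabulary:
# only lowercase entries exist, so capitalized words can never match
# unless the text is lowercased first.
UNK_ID = 100
vocab = {"the": 1996, "capital": 3007, "of": 1997,
         "france": 2605, "is": 2003, "paris": 3000, ".": 1012}

def lookup(words, do_lower_case):
    # Mirror the tokenizer's do_lower_case switch: lowercase before lookup.
    if do_lower_case:
        words = [w.lower() for w in words]
    return [vocab.get(w, UNK_ID) for w in words]

words = ["The", "capital", "of", "France", "is", "Paris", "."]
print(lookup(words, do_lower_case=False))  # [100, 3007, 1997, 100, 2003, 100, 1012]
print(lookup(words, do_lower_case=True))   # [1996, 3007, 1997, 2605, 2003, 3000, 1012]
```

With `do_lower_case=False` the capitalized words all map to `100` (UNK), matching the token ids reported above (minus the `[CLS]`/`[SEP]` markers 101 and 102).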

cc: @DarkLight1337


maxdebayser commented Dec 13, 2024

In the model's `sentence_bert_config.json`, `do_lower_case` is false:

{
  "max_seq_length": 128,
  "do_lower_case": false
}

but in `tokenizer_config.json` it's true:

{"do_lower_case": true, ....

When we added support for `sentence_bert_config.json`, we assumed this configuration was meant to override the tokenizer configuration, but that's probably wrong:

if (model_config.encoder_config is not None

Disabling this argument makes the tokenizer convert the text to lower case correctly. I think we need to do the same thing as the sentence-transformers library and call `lower()` on the input before passing it to the tokenizer: https://github.com/UKPLab/sentence-transformers/blob/cfb883ce34ad301aa9444bca7462734565456e81/sentence_transformers/models/Transformer.py#L500
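A rough sketch of that approach (hypothetical helper names, not vLLM's actual code): apply the tokenizer's own `do_lower_case` setting to the raw text before encoding, the way sentence-transformers does, rather than letting `sentence_bert_config.json` silently disable it.

```python
# Hypothetical helper sketching the proposed fix: lowercase the raw text
# before handing it to the tokenizer when do_lower_case is set, mirroring
# sentence-transformers' behavior in Transformer.py.

def tokenize_with_casing(text, tokenizer_encode, do_lower_case):
    # Lowercasing happens here, outside the tokenizer, so a conflicting
    # sentence_bert_config.json cannot produce case-mismatched vocab lookups.
    if do_lower_case:
        text = text.lower()
    return tokenizer_encode(text)

# Toy encoder standing in for a real tokenizer: splits on whitespace.
ids = tokenize_with_casing("The capital of France is Paris.",
                           tokenizer_encode=str.split,
                           do_lower_case=True)
print(ids)  # ['the', 'capital', 'of', 'france', 'is', 'paris.']
```

The design point is that the casing decision should come from `tokenizer_config.json` (what the tokenizer was trained with), not from the sentence-transformers module config.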

cc: @flaviabeo
