
[Bug]: Bert tokenizer is tokenizing some tokens as UNK #11184

Closed
1 task done
maxdebayser opened this issue Dec 13, 2024 · 1 comment · Fixed by #11815
Labels
bug Something isn't working

Comments

@maxdebayser
Contributor

Your current environment

(The output of `python collect_env.py` was not provided.)

Model Input Dumps

No response

🐛 Describe the bug

With some BERT and RoBERTa models, such as sentence-transformers/all-MiniLM-L12-v2, I found that the output differs from the one generated by sentence-transformers. If I place the following print in `_normalize_prompt_text_to_input()` in `serving_engine.py`:

        print(f"{input_ids=}")

I get `[101, 100, 3007, 1997, 100, 2003, 100, 1012, 102]` for the sentence "The capital of France is Paris."; `100` is the UNK token. When I run with sentence-transformers, I get `[101, 1996, 3007, 1997, 2605, 2003, 3000, 1012, 102]`. This problem happens with both `--tokenizer-mode auto` and `--tokenizer-mode slow`.
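To illustrate why the UNK tokens appear, here is a minimal self-contained sketch (using a tiny hypothetical vocabulary, not the real WordPiece vocab): an uncased vocab stores only lowercase entries, so a case-sensitive lookup misses every capitalized word and falls back to UNK.

```python
# Hypothetical toy vocab mimicking an uncased WordPiece vocabulary:
# only lowercase entries exist, so capitalized words can never match
# unless the text is lowercased first.
UNK_ID = 100
vocab = {"the": 1996, "capital": 3007, "of": 1997,
         "france": 2605, "is": 2003, "paris": 3000, ".": 1012}

def lookup(words, do_lower_case):
    # Mirror the tokenizer's do_lower_case switch: lowercase before lookup.
    if do_lower_case:
        words = [w.lower() for w in words]
    return [vocab.get(w, UNK_ID) for w in words]

words = ["The", "capital", "of", "France", "is", "Paris", "."]
print(lookup(words, do_lower_case=False))  # [100, 3007, 1997, 100, 2003, 100, 1012]
print(lookup(words, do_lower_case=True))   # [1996, 3007, 1997, 2605, 2003, 3000, 1012]
```

With `do_lower_case=False` the capitalized words all map to `100` (UNK), matching the token ids reported above (minus the `[CLS]`/`[SEP]` markers 101 and 102).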

cc: @DarkLight1337


maxdebayser commented Dec 13, 2024

In the model's `sentence_bert_config.json`, `do_lower_case` is false:

{
  "max_seq_length": 128,
  "do_lower_case": false
}

but in `tokenizer_config.json` it's true:

{"do_lower_case": true, ....

When we added support for `sentence_bert_config.json`, we assumed this configuration was meant to override the tokenizer configuration, but that's probably wrong:

if (model_config.encoder_config is not None

Disabling this argument makes the tokenizer convert the text to lower case correctly. I think we need to do the same thing as the sentence-transformers library and call `lower()` on the input before passing it to the tokenizer: https://github.com/UKPLab/sentence-transformers/blob/cfb883ce34ad301aa9444bca7462734565456e81/sentence_transformers/models/Transformer.py#L500
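A rough sketch of that approach (hypothetical helper names, not vLLM's actual code): apply the tokenizer's own `do_lower_case` setting to the raw text before encoding, the way sentence-transformers does, rather than letting `sentence_bert_config.json` silently disable it.

```python
# Hypothetical helper sketching the proposed fix: lowercase the raw text
# before handing it to the tokenizer when do_lower_case is set, mirroring
# sentence-transformers' behavior in Transformer.py.

def tokenize_with_casing(text, tokenizer_encode, do_lower_case):
    # Lowercasing happens here, outside the tokenizer, so a conflicting
    # sentence_bert_config.json cannot produce case-mismatched vocab lookups.
    if do_lower_case:
        text = text.lower()
    return tokenizer_encode(text)

# Toy encoder standing in for a real tokenizer: splits on whitespace.
ids = tokenize_with_casing("The capital of France is Paris.",
                           tokenizer_encode=str.split,
                           do_lower_case=True)
print(ids)  # ['the', 'capital', 'of', 'france', 'is', 'paris.']
```

The design point is that the casing decision should come from `tokenizer_config.json` (what the tokenizer was trained with), not from the sentence-transformers module config.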

cc: @flaviabeo
