Settings customizing tokenization #3946
Conversation
Hi Maria / ManyTheFish, thanks for taking a look and putting in the effort, I really appreciate it! Two things came to mind:
But thanks again for working on this!
Hello @tobiasnitsche, first of all, this feature is not meant to deactivate tokenization in Meilisearch but to customize the behavior of the tokenizer. Deactivating tokenization would imply giving the user full charge of it on the indexing side, the settings side, and the search side, meaning that on each API we would have to accept an array of tokens instead of raw strings.
No, but you could, if needed, set the exhaustive list of separators via the separatorTokens setting.
Not with this feature; the only thing you can do is use the … setting.

Thank you for your report and your interest in the feature, see you!
Thanks for your honest & detailed feedback on this!
Nice, there are a lot of tests, thanks!

I left a few questions, but overall, I think we'll be able to merge it in no time.
Thanks!
bors merge
Build succeeded:
Hi @ManyTheFish, unfortunately this feature is not that useful for me, as the separators can only be set per index and not per field. :/

My index (in a nutshell): articles with an article-number and a description, e.g. A-12345, "An article description".

To have a nice search experience, I need to adjust / disable the tokenizer only on the article number; the description should keep the normal, awesome Meilisearch experience. Is there any plan to add this feature on a per-field basis? Other settings, like maxTypo, have the option to be set per field.
Hello @tobiasnitsche

Thanks in advance for your feedback, it's really helpful!
Thank you @curquiza for the encouraging comment. After some thinking, an "email" field might also be a good use case for disabling tokenization on a per-field basis. Anyway, I love Meilisearch so far, keep up the good work!
Is there any update on this? Honestly, for me it's a bit embarrassing to tell users that they cannot search for email addresses or article numbers... I would love to have an update. This tokenization customization does not solve my problem, as I have described a few times...
Hello @tobiasnitsche, sorry, but this is a closed PR; we would rather avoid discussing too much in it so we don't lose track of information. PRs are only for implementation and technical details, not for product discussions 😊 Let's focus on the already existing discussion you are interacting in.
Understood, let's move it there :-) My detailed description can be found here as well: #3380
Pull Request
This pull request allows the user to customize Meilisearch tokenization by providing specialized settings.
Small documentation
All the new settings can be set and reset like the other index settings by calling the route /indexes/:name/settings.
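As a quick orientation, and assuming the new sub-routes follow the same per-setting verb convention as the existing Meilisearch settings routes (GET to fetch, PUT to replace, DELETE to reset), usage would look like this sketch:

```
GET    /indexes/articles/settings/dictionary    # fetch the current user dictionary
PUT    /indexes/articles/settings/dictionary    # replace the list (JSON array body)
DELETE /indexes/articles/settings/dictionary    # reset the setting to its default
```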
nonSeparatorTokens

The Meilisearch word segmentation uses a default list of separators to segment words; however, for specific use cases, some of the default separators shouldn't be considered separators. The nonSeparatorTokens setting allows removing some tokens from the default list of separators.

Request payload:

PUT /indexes/articles/settings/non-separator-tokens
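The body of the payload did not survive this extract; a plausible one, assuming you want @ and # (both in the default separator list) to stop splitting words:

```json
["@", "#"]
```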
separatorTokens

Some use cases need to define additional separators: some are related to a specific way of parsing technical documents, others to encodings in documents. The separatorTokens setting allows adding some tokens to the list of separators.

Request payload:

PUT /indexes/articles/settings/separator-tokens
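Again, the body shown here is illustrative, treating the pipe character and a leftover HTML entity as separators:

```json
["|", "&nbsp;"]
```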
dictionary

The Meilisearch word segmentation relies on separators and language-based word dictionaries to segment words; however, this segmentation is inaccurate on technical or use-case-specific vocabulary (like G/Box to say Gear Box), or on proper nouns (like J. R. R. when parsing J. R. R. Tolkien). The dictionary setting allows defining a list of words that will be segmented as described in the list.

Request payload:

PUT /indexes/articles/settings/dictionary
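A body sketched from the examples above, keeping G/Box and J. R. R. as single tokens during segmentation:

```json
["G/Box", "J. R. R."]
```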
This last feature synergizes well with the stopWords setting or the synonyms setting, allowing Meilisearch to segment words and correctly retrieve their synonyms.

Request payload:

PATCH /indexes/articles/settings
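A combined payload along those lines; the G/Box → Gear Box mapping is an illustration built from the earlier example, and the synonyms object shape follows the standard settings API:

```json
{
  "dictionary": ["G/Box"],
  "synonyms": {
    "G/Box": ["Gear Box"]
  }
}
```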
Related specifications:
Try it with Docker
Related issue
Fixes #3610
Fixes #3917
Fixes meilisearch/product#468
Fixes meilisearch/product#160
Fixes meilisearch/product#260
Fixes meilisearch/product#381
Fixes meilisearch/product#131
Related to #2879
Fixes #2760
What does this PR do?
- Adds the new setting nonSeparatorTokens, allowing to remove a token from the default separator tokens
- Adds the new setting separatorTokens, allowing to add a token to the separator tokens
- Adds the new setting dictionary, allowing to override the segmentation of specific words
- Adds the new error code invalid_settings_non_separator_tokens (invalid_request)
- Adds the new error code invalid_settings_separator_tokens (invalid_request)
- Adds the new error code invalid_settings_dictionary (invalid_request)
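For reference, an error response carrying one of these codes would presumably follow the standard Meilisearch error envelope; the message and link below are illustrative, not taken from the PR:

```json
{
  "message": "Invalid value type at `.separatorTokens`: expected an array of strings",
  "code": "invalid_settings_separator_tokens",
  "type": "invalid_request",
  "link": "https://docs.meilisearch.com/errors#invalid_settings_separator_tokens"
}
```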