Settings customizing tokenization #3946
Conversation
Hi Maria / ManyTheFish, thanks for taking a look and putting in the effort, I really appreciate it! Two things came to mind:
But thanks again for working on this!
Hello @tobiasnitsche, first of all, this feature is not meant to deactivate tokenization in Meilisearch but to customize the behavior of the tokenizer. Deactivating tokenization would imply giving the user full charge of it on the indexing side, the settings side, and the search side, meaning that on each API we would have to accept an array of tokens instead of raw strings.
No, but you could, if needed, set the exhaustive list of separators via the separatorTokens setting.
Not with this feature; the only thing you can do is use the … setting.

Thank you for your report and your interest in the feature, see you!
Thanks for your honest & detailed feedback on this!
Nice, there are a lot of tests, thanks!

I left a few questions, but overall, I think we'll be able to merge it in no time.
Thanks!
bors merge
Build succeeded:
Hi @ManyTheFish, unfortunately this feature is not that useful for me, as the separators can only be set per index and not per field. :/

My index (in a nutshell): articles with an article-number and a description, e.g. A-12345, "An article description".

To have a nice search experience, I need to adjust / disable the tokenizer only on the article number; the description should keep the normal, awesome Meilisearch experience. Is there any plan to add this feature on a per-field basis? Other settings, like maxTypo, have the option to be set per field.
Hello @tobiasnitsche

Thanks in advance for your feedback, it's really helpful!
Thank you @curquiza for the encouraging comment. After some thinking, an "email" field might also be a good use case for disabling tokenization on a per-field basis. Anyway, I love Meilisearch so far, keep up the good work!
Is there any update on this? Honestly, for me it's a bit embarrassing to tell users that they cannot search for email addresses or article numbers... I would love to have an update. This tokenization customization does not solve my problem, as I have described a few times...
Hello @tobiasnitsche, sorry, but this is a closed PR; we would rather avoid discussing too much in it so we don't lose track of information. PRs are only for implementation and technical details, not for product discussions 😊 Let's focus on the already existing discussion you are interacting in.
Understood, let's move it there :-) My detailed description can be found here as well: #3380
Pull Request
This pull request allows the user to customize Meilisearch tokenization by providing specialized settings.
Small documentation
All the new settings can be set and reset like the other index settings by calling the route /indexes/:name/settings.
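As a quick orientation, and assuming the new sub-routes follow the same per-setting verb convention as the existing Meilisearch settings routes (GET to fetch, PUT to replace, DELETE to reset), usage would look like this sketch:

```
GET    /indexes/articles/settings/dictionary    # fetch the current user dictionary
PUT    /indexes/articles/settings/dictionary    # replace the list (JSON array body)
DELETE /indexes/articles/settings/dictionary    # reset the setting to its default
```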
nonSeparatorTokens

The Meilisearch word segmentation uses a default list of separators to segment words; however, for specific use cases, some of the default separators shouldn't be considered separators. The nonSeparatorTokens setting allows removing some tokens from the default list of separators.

Request payload:

PUT /indexes/articles/settings/non-separator-tokens
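The body of the payload did not survive this extract; a plausible one, assuming you want @ and # (both in the default separator list) to stop splitting words:

```json
["@", "#"]
```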
separatorTokens

Some use cases need to define additional separators: some are related to a specific way of parsing technical documents, others to encodings in documents. The separatorTokens setting allows adding some tokens to the list of separators.

Request payload:

PUT /indexes/articles/settings/separator-tokens
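Again, the body shown here is illustrative, treating the pipe character and a leftover HTML entity as separators:

```json
["|", "&nbsp;"]
```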
dictionary

The Meilisearch word segmentation relies on separators and language-based word dictionaries to segment words; however, this segmentation is inaccurate on technical or use-case-specific vocabulary (like G/Box to say Gear Box), or on proper nouns (like J. R. R. when parsing J. R. R. Tolkien). The dictionary setting allows defining a list of words that will be segmented as described in the list.

Request payload:

PUT /indexes/articles/settings/dictionary
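A body sketched from the examples above, keeping G/Box and J. R. R. as single tokens during segmentation:

```json
["G/Box", "J. R. R."]
```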
This last feature synergizes well with the stopWords setting or the synonyms setting, allowing Meilisearch to segment words and correctly retrieve their synonyms.

Request payload:

PATCH /indexes/articles/settings
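A combined payload along those lines; the G/Box → Gear Box mapping is an illustration built from the earlier example, and the synonyms object shape follows the standard settings API:

```json
{
  "dictionary": ["G/Box"],
  "synonyms": {
    "G/Box": ["Gear Box"]
  }
}
```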
Related specifications:
Try it with Docker
Related issue
Fixes #3610
Fixes #3917
Fixes meilisearch/product#468
Fixes meilisearch/product#160
Fixes meilisearch/product#260
Fixes meilisearch/product#381
Fixes meilisearch/product#131
Related to #2879
Fixes #2760
What does this PR do?
- Adds the new setting nonSeparatorTokens, allowing to remove a token from the default separator tokens
- Adds the new setting separatorTokens, allowing to add a token to the separator tokens
- Adds the new setting dictionary, allowing to override the segmentation of specific words
- Adds the new error code invalid_settings_non_separator_tokens (invalid_request)
- Adds the new error code invalid_settings_separator_tokens (invalid_request)
- Adds the new error code invalid_settings_dictionary (invalid_request)
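For reference, an error response carrying one of these codes would presumably follow the standard Meilisearch error envelope; the message and link below are illustrative, not taken from the PR:

```json
{
  "message": "Invalid value type at `.separatorTokens`: expected an array of strings",
  "code": "invalid_settings_separator_tokens",
  "type": "invalid_request",
  "link": "https://docs.meilisearch.com/errors#invalid_settings_separator_tokens"
}
```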