Question: Configuring soft spaces for tokenization #131
Replies: 6 comments · 1 reply
-
Hey @Shriram-Balaji, thank you for your issue. For now, the Meilisearch tokenizer is quite naive and is not configurable. We will soon work on enhancing it.
-
Would love this to be supported, too! cc @mishig25
-
Hello @julien-c, I'm currently rewriting the tokenizer to improve this behavior; we plan to use unicode-segmentation to split Latin text. This is at an early stage and we don't plan to release it just yet. Thanks for your feedback! Stay tuned!
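For readers curious what that crate does, here is a minimal, hedged Rust sketch (not Meilisearch code) of splitting a string on Unicode word boundaries with unicode-segmentation; the sample text and crate version are assumptions for illustration:

```rust
// Cargo.toml (assumed): unicode-segmentation = "1"
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // Sample text chosen to mirror the "_" / "-" discussion in this thread;
    // it is not taken from Meilisearch itself.
    let text = "authorization_token api-key";

    // `unicode_words` yields the word-like segments defined by UAX #29.
    // Whether "-" or "_" split a token here follows those rules,
    // independent of any Meilisearch-specific configuration.
    let words: Vec<&str> = text.unicode_words().collect();
    println!("{:?}", words);
}
```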
-
Hello everyone 👋 We just released a 🧪 prototype that allows customizing tokenization and we'd love your feedback.
How to get the prototype? Using Docker, pull the prototype image referenced in the PR; from source, compile Meilisearch on the prototype branch.
How to use the prototype? You can find all the details in the PR. Feedback and bug reporting when using this prototype are encouraged! Thanks in advance for your involvement. It means a lot to us ❤️
-
Hello everyone 👋 We have just released the first RC (release candidate) of Meilisearch containing this new feature, which you can test now!
You are welcome to leave your feedback in this discussion. If you encounter any bugs, please report them here. 🎉 The official, stable release containing this change will be available on September 25th, 2023.
-
Hey folks 👋 v1.4.0 has been released! 🦓 You can now customize tokenization by adding or removing tokens from the list of separator tokens and non-separator tokens. ✨
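To make the thread self-contained, here is a hedged sketch of what updating those settings could look like over the REST settings route, written in Rust with reqwest; the index name api-reference, the localhost:7700 address, and the MASTER_KEY placeholder are illustrative assumptions, not values taken from this discussion:

```rust
// Assumed dependencies: reqwest (with "blocking" and "json" features) and serde_json.
use reqwest::blocking::Client;
use serde_json::json;

fn main() -> Result<(), reqwest::Error> {
    let client = Client::new();

    // Register "_" and "-" as non-separator tokens so that strings like
    // "authorization_token" are kept as a single token (Meilisearch >= v1.4).
    let response = client
        .patch("http://localhost:7700/indexes/api-reference/settings")
        .header("Authorization", "Bearer MASTER_KEY")
        .json(&json!({
            "nonSeparatorTokens": ["_", "-"]
        }))
        .send()?;

    println!("status: {}", response.status());
    Ok(())
}
```

Like any settings change, this is processed as an asynchronous task, so checking the task status afterwards confirms the update; once applied, a query for authorization_token should highlight the whole token rather than authorization and token separately.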
-
Is your feature request related to a problem? Please describe.
I'm currently trying to use MeiliSearch for an API Reference which indexes OpenAPI Specifications. As you might know, the data has a lot of "-" and "_" characters, which the MeiliSearch tokenizer seems to use for splitting strings, according to the Data Types doc.
Because of this, a search query like "authorization_token" highlights 'authorization' and 'token' separately.
Describe the solution you'd like
A way to ignore '-' and '_' during tokenization, at least for specific attributes in a document, similar to what's done with Stop Words.
This might already be possible in MeiliSearch and I may have missed it; I would appreciate it if someone could point me to it.