Question: Configuring soft spaces for tokenization #131
Replies: 6 comments · 1 reply
-
Hey @Shriram-Balaji, thank you for your issue. For now, the Meilisearch tokenizer is quite naive and is not configurable. We will soon work on enhancing it.
-
Would love this to be supported, too! cc @mishig25
-
Hello @julien-c, I'm currently rewriting the tokenizer to improve this behavior; we plan to use unicode-segmentation to split Latin text. This is at an early stage and we don't plan to release it just yet. Thanks for your feedback! Stay tuned!
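For readers curious what that crate does, here is a minimal, hedged Rust sketch (not Meilisearch code) of splitting a string on Unicode word boundaries with unicode-segmentation; the sample text and crate version are assumptions for illustration:

```rust
// Cargo.toml (assumed): unicode-segmentation = "1"
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // Sample text chosen to mirror the "_" / "-" discussion in this thread;
    // it is not taken from Meilisearch itself.
    let text = "authorization_token api-key";

    // `unicode_words` yields the word-like segments defined by UAX #29.
    // Whether "-" or "_" split a token here follows those rules,
    // independent of any Meilisearch-specific configuration.
    let words: Vec<&str> = text.unicode_words().collect();
    println!("{:?}", words);
}
```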
-
Hello everyone 👋 We just released a 🧪 prototype that allows customizing tokenization and we'd love your feedback.
How to get the prototype? Using Docker, pull the prototype image referenced in the PR; from source, compile Meilisearch on the prototype branch.
How to use the prototype? You can find all the details in the PR. Feedback and bug reporting when using this prototype are encouraged! Thanks in advance for your involvement. It means a lot to us ❤️
-
Hello everyone 👋 We have just released the first RC (release candidate) of Meilisearch containing this new feature, which you can test now!
You are welcome to leave your feedback in this discussion. If you encounter any bugs, please report them here. 🎉 The official, stable release containing this change will be available on September 25th, 2023.
-
Hey folks 👋 v1.4.0 has been released! 🦓 You can now customize tokenization by adding or removing tokens from the list of separator tokens and non-separator tokens. ✨
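To make the thread self-contained, here is a hedged sketch of what updating those settings could look like over the REST settings route, written in Rust with reqwest; the index name api-reference, the localhost:7700 address, and the MASTER_KEY placeholder are illustrative assumptions, not values taken from this discussion:

```rust
// Assumed dependencies: reqwest (with "blocking" and "json" features) and serde_json.
use reqwest::blocking::Client;
use serde_json::json;

fn main() -> Result<(), reqwest::Error> {
    let client = Client::new();

    // Register "_" and "-" as non-separator tokens so that strings like
    // "authorization_token" are kept as a single token (Meilisearch >= v1.4).
    let response = client
        .patch("http://localhost:7700/indexes/api-reference/settings")
        .header("Authorization", "Bearer MASTER_KEY")
        .json(&json!({
            "nonSeparatorTokens": ["_", "-"]
        }))
        .send()?;

    println!("status: {}", response.status());
    Ok(())
}
```

Like any settings change, this is processed as an asynchronous task, so checking the task status afterwards confirms the update; once applied, a query for authorization_token should highlight the whole token rather than authorization and token separately.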
-
Is your feature request related to a problem? Please describe.
I'm currently trying to use MeiliSearch for an API Reference which indexes OpenAPI Specifications. As you might know, the data has a lot of "-" and "_" characters, which the MeiliSearch tokenizer seems to use for splitting strings, according to the Data Types doc.
Because of this, a search query like "authorization_token" highlights 'authorization' and 'token' separately.
Describe the solution you'd like
A way to ignore '-' and '_' during tokenization, at least for specific attributes in a document, similar to what's done with Stop Words.
This might already be possible in MeiliSearch and I may have missed it; I would appreciate it if someone could point me to it.