add khmer segmenter #203

xshadowlegendx · 2023-03-29T04:39:59Z

Pull Request

Related issue

Fixes #200

What does this PR do?

add segmenter for khmer language

PR checklist

Please check if your PR fulfills the following requirements:

Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
Have you read the contributing guidelines?
Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

xshadowlegendx · 2023-04-06T09:00:33Z

hello @ManyTheFish, how do I test this with meilisearch on my local machine?

ManyTheFish · 2023-04-13T14:29:29Z

Hello @xshadowlegendx,

don't hesitate to pull the main branch of Meilisearch and update the milli/Cargo.toml file linking your branch instead of a fixed version of Charabia.

This way you will be able to run Meilisearch with your changes!

xshadowlegendx · 2023-04-17T07:26:23Z

@ManyTheFish ok thanks will try it

ManyTheFish · 2023-04-25T11:53:42Z

Hello @xshadowlegendx, do you manage to test the khmer segmenter with Meilisearch?

xshadowlegendx · 2023-05-08T05:31:08Z

hello @ManyTheFish, sorry for the late reply. Currently I am still testing the khmer segmenter with meilisearch, so I am able to compile and run it with local charabia as dependency. Is there anything like text analysis where I can test if the segmenter works? so the way I am testing now is inserting some khmer data and search for word for it but it seems to search it part by part instead of word? I also tried to run a meilisearch docker container to compare and it seems to work the same as well, not sure if the segmenter is working or not yet

ManyTheFish · 2023-05-10T15:32:39Z

@xshadowlegendx, a test you could try is to push a document containing សួស្តីពិភពលោក and try searching for ពិភពលោក. As Meilisearch only supports prefix search then you should not be able to find the document containing សួស្តីពិភពលោក if your segmenter is not used. Otherwise, Meilisearch managed to segment សួស្តីពិភពលោក and so we can conclude with the fact that your segmenter is used.
If you want to activate some normalizer or add a specific one for the Khmer Language, don't hesitate to check the normalizer list, probably removing diacritics is relevant for the Khmer Language, In this case, I suggest adding it to the normalizer's language list.

I hope it helps!

xshadowlegendx · 2023-05-14T18:11:21Z

ok thanks @ManyTheFish let me try to test it again

xshadowlegendx · 2023-06-24T05:18:00Z

hello @ManyTheFish , so I tried pushing document that contains សួស្តីពិភពលោក and searching for ពិភពលោក it is able to find those document that contains it, this is with meilisearch from dockerhub without the khmer segmenter, below are steps I took to test it

docker run -it --rm -p 7800:7700 getmeili/meilisearch:v1.2

# books.json
[
  {
    "id": "ec660236154a4bed88b554cff8fb93a4",
    "title": "ភាសាខ្មែរជាភាសល្អជាងគេនៅពិភពលោក"
  },
  {
    "id": "2dea8651215a4695aaf8721f4b08fa40",
    "title": "ប្រទេសចិនប្រេីភាសាចិន"
  },
  {
    "id": "2dea8651215a4695aaf8721f4b08fa41",
    "title": "សួស្តីពិភពលោក"
  }
]

curl -X POST 'http://localhost:7800/indexes/books/documents?primaryKey=id' -H 'Content-Type: application/json' --data-binary @books.js

this screenshot seems like without khmer segmenter it also able to pick up the word

but if its not a word it also pick it up

so I try to test the khmer segmenter by cloning meilisearch to my local machine, point charabia deps to my local machine one and cargo run to run meilisearch and test just like above. Before the testing, I put a panic!("khmer segmenter says panic"); within the segment_str function to ensure it gets called and it does, then I remove the panic statement and proceed to test like above and got the same result.

below are changes to local meilisearch to test with the khmer segmenter

❯ cat milli/Cargo.toml | grep chara
charabia = { path = "../../charabia/charabia", default-features = false }
default = [ "charabia/default" ]
chinese = ["charabia/chinese"]
hebrew = ["charabia/hebrew"]
japanese = ["charabia/japanese"]
japanese-transliteration = ["charabia/japanese-transliteration"]
korean = ["charabia/korean"]
thai = ["charabia/thai"]
khmer = ["charabia/khmer"]

ManyTheFish

hello @xshadowlegendx,
I asked for small changes in your PR,

Do these changes enhance your search results?
Do you see some search results that are missing?

See you!

charabia/Cargo.toml

charabia/icu4x-khmer-keys

ManyTheFish · 2023-06-26T07:57:51Z

Could you add some sentences in benchmarks, please?

charabia/charabia/benches/bench.rs

Lines 7 to 28 in 366417d

    
           static DATA_SET: &[((usize, Script, Language), &str)] = &[ 
        
               // short texts (~130 bytes) 
        
               ((132, Script::Cj, Language::Cmn), "人人生而自由﹐在尊严和权利上一律平等。他們賦有理性和良心﹐並應以兄弟關係的精神互相對待。"), 
        
               ((132, Script::Cj, Language::Jpn), "詳しくは以下の をご覧下さい。語学ないし文学の立場からの価値判断は一切おこなっていません"), 
        
               ((132, Script::Latin, Language::Eng), "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F! Hello guys, my purpose is to benchmark tokenizer properly."), 
        
               ((132, Script::Latin, Language::Fra), "La ville avait d'abord été nommée « Lutèce » ou « boueuse », ici une tentative d'explication par le latin lŭtum « boue »."), 
        
               ((132, Script::Hebrew, Language::Heb), "הַשּׁוּעָל הַמָּהִיר (״הַחוּם״) לֹא יָכוֹל לִקְפֹּץ 8.94 מֶטְרִים, נָכוֹן?"), 
        
               ((132, Script::Thai, Language::Tha), "ไก่จิกเด็กตายเด็กตายบนปากโอ่งไก่อะไรวะโหดจัง"), 
        
               ((132, Script::Hangul, Language::Kor), "제119조 ① 대한민국의 경제질서는 개인과 기업의 경제상의 자유와 창의를 존중함을 기본으로 한다."), 
        
               ((130, Script::Greek, Language::Ell), "Οι θερμοκρασίες είναι σπάνια υπερβολικές στις παραθαλάσσιες περιοχές."), 
        
               ((132, Script::Arabic, Language::Ara), "اللُّغَةُ العربية هي أكثر اللغات السامية تحدثا، ومن أكثر اللغات انتشارا"), 
        
               // long texts (~365 bytes) 
        
               ((363, Script::Cj, Language::Cmn), "距今60万年-2万年的时间内，北京地区处于旧石器时代，在周口店发现了旧石器时代早期北京直立人、中期新洞人和晚期山顶洞人的典型遗址。北京地区在不晚于1万年前已经开始进入新石器时代。当时该地区人类定居生活固定化，逐渐从山洞中迁徙出来，到平原地区定居[12]。"), 
        
               ((364, Script::Cj, Language::Jpn), "詳しくは以下の をご覧下さい。語学ないし文学の立場からの価値判断は一切おこなっていません。だけど、バラ科の仲間ということでは「すもももももももものうち」は正しいことになります。すももものうち！今日は「すもももももももものうち」について考えます。"), 
        
               ((363, Script::Latin, Language::Eng), "The City of London Corporation is unique in the UK and has some unusual responsibilities for a local council, such as being the police authority. It is also unusual in having responsibilities and ownership beyond its boundaries. The Corporation is headed by the Lord Mayor of the City of London (an office separate from, and much older than, the Mayor of London)."), 
        
               ((363, Script::Latin, Language::Fra), "La position de Lutèce, sur l'île aujourd'hui nommée l'île de la Cité, permettant le franchissement du grand fleuve navigable qu'est la Seine par une voie reliant le Nord et le Sud des Gaules, en fait dès l'Antiquité une cité importante, capitale des Parisii, puis lieu de séjour d'un empereur romain. Le mot Lutèce resulte de la francisation de Lutetia."), 
        
               ((365, Script::Hebrew, Language::Heb), "הַשּׁוּעָל הַמָּהִיר (״הַחוּם״) לֹא יָכוֹל לִקְפֹּץ 8.94 מֶטְרִים, נָכוֹן? תַּכְלֶס, אִם הוּא הָיָה יָכוֹל, הוּא חֲתִיכַת שׁוּעָל הַשּׁוּעָל הַזֶּה.. אֲבָל הַאִם לֹא כֻּלָּנוּ שׁוּעָלִים בְּעֶצֶם? יתכן."), 
        
               ((366, Script::Thai, Language::Tha), "เราจะทำตามสัญญาขอเวลาอีกไม่นานแล้วแผ่นดินที่งดงามจะคืนกลับมาเราจะทำอย่างซื่อตรงขอแค่เธอจงไว้ใจและศรัทธาแผ่นดินจะดีในไม่ช้า"), 
        
               ((364, Script::Hangul, Language::Kor), "제30조 타인의 범죄행위로 인하여 생명·신체에 대한 피해를 받은 국민은 법률이 정하는 바에 의하여 국가로부터 구조를 받을 수 있다. ② 명령·규칙 또는 처분이 헌법이나 법률에 위반되는 여부가 재판의 전제가 된 경우에는 대법원은 이를 최종적으로 심사할 권한을 가진다."), 
        
               ((364, Script::Greek, Language::Ell), "Η άνοιξη έχει μικρή διάρκεια, διότι ο μεν χειμώνας είναι όψιμος, το δε καλοκαίρι αρχίζει πρώιμα. Το φθινόπωρο είναι μακρύ και θερμό και πολλές φορές παρατείνεται στη νότια Ελλάδα και τα νησιά μέχρι τα"), 
        
               ((366, Script::Arabic, Language::Ara), "العربية لغةٌ رسمية في كل دول الوطن العربي (إضافة إلى كونها لغة رسمية في تشاد وإريتريا). وهي إحدى اللغات الرسمية الست في منظمة الأمم المتحدة، ويُحتفل بالعربية في 18 ديسمبر كذكرى اعتمادها في الأمم المتحدة."), 
        
               ];

xshadowlegendx · 2023-07-18T16:49:42Z

hello @xshadowlegendx, I asked for small changes in your PR,
1. Do these changes enhance your search results?

2. Do you see some search results that are missing?

See you!

hello @ManyTheFish, sorry for such lateness

for your questions:

yes
yes

after those changes on my local, now it show some differences, the left one is running with docker and the right one is running with local development. so I tried searching the word លោក it shows empty result which is correct in this case and typing only a character or incomplete word also matches is this expected behavior?

Co-authored-by: Many the fish <[email protected]>

ref: https://github.com/meilisearch/charabia/pull/203\#discussion_r1241774951

…bia into add-khmer-segmenter

ManyTheFish · 2023-07-19T09:26:48Z

Hello @xshadowlegendx, Meilisearch is a prefix search engine, so it's expected that if you tip the start of a word Meilisearch find all the documents containing a word starting with the query.
Is it the case?

xshadowlegendx · 2023-07-19T10:27:19Z

hello @ManyTheFish , yes that is the case, so it is expected behavior then. I will be testing some document that are more close to real world data to see how the segmenter perform next then

ManyTheFish · 2023-09-05T08:04:24Z

Hello @xshadowlegendx,

Any news on this PR, Do you feel enough comfortable with your tests to merge it?

Thanks!

xshadowlegendx · 2023-09-11T04:20:29Z

hello @ManyTheFish, sorry for the late replies, so I tested with about 253 movies that is production data from my workplace weeks ago. the search is very fast, can search movies from titles, actor names and genres, and typo tolerance also works by updating the default since khmer word are smaller size, so I think it is ready to be merged.

but the current icu4x dependency that I am using currently have a bug with using dictionary model so currently I am using the lstm model for the segmentation but it was fixed very soon after I submit the issue but It hasnt been release yet

curquiza · 2023-10-09T08:08:54Z

Hello @xshadowlegendx
Can you fix the git conflicts you have on this PR before we review? 😊

curquiza · 2023-10-09T11:37:23Z

@xshadowlegendx
Looks like clippy and rust fmt CI fail. Can you fix these before we review?

xshadowlegendx · 2023-10-10T15:03:16Z

hello @curquiza, the clippy wants me to fix the invisible_characters such as below

❯ cargo clippy --all-targets -- --deny warnings
    Checking charabia v0.8.4 (/Users/xanonx/projects/meilisearch/charabia/charabia)
error: invisible character detected
  --> charabia/benches/bench.rs:28:43
   |
28 | ...), "រឿងពីរដែលមនុស្សហាមចិត្តខ្លួនឯងមិនបានគឺ សើច និង ស្រឡាញ់។ តែសម្រាប់ខ្ញុំ ប្រាក់ ចន្ទធីតា រឿងមួយទៀតដែលខ្ញុំហាមចិត្តខ្លួនឯងមិនបាននោះ គឺញ៉ាំ គេគ្រប់គ្នាពេលខូចចិត...បាយទឹកមិនបានទេ តែខ្ញុំពេលខូចចិត្តដឹងតែឃ្លាន ញ៉ាំច្រើនឬតិចក៏អាស្រ័យលើថាទំហំនៃការខូចចិត្តខ្លាំងឬខ្សោយ។"),
   |       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: consider replacing the string with: `"រឿង\u{200B}ពីរ\u{200B}ដែល\u{200B}មនុស្ស\u{200B}ហាម\u{200B}ចិត្ត\u{200B}ខ្លួន\u{200B}ឯង\u{200B}មិន\u{200B}បាន\u{200B}គឺ\u{200B} សើច\u{200B} និង\u{200B} ស្រឡាញ់\u{200B}\u{200B}។ តែ\u{200B}សម្រាប់\u{200B}ខ្ញុំ\u{200B} ប្រាក់ ចន្ទធីតា\u{200B} \u{200B}រឿង\u{200B}មួយ\u{200B}ទៀត\u{200B}ដែល\u{200B}ខ្ញុំ\u{200B}ហាម\u{200B}ចិត្ត\u{200B}ខ្លួន\u{200B}ឯង\u{200B}មិន\u{200B}បាន\u{200B}នោះ គឺ\u{200B}ញ៉ាំ\u{200B} គេ\u{200B}គ្រប់\u{200B}គ្នា\u{200B}ពេល\u{200B}ខូច\u{200B}ចិត្ត\u{200B}បាយ\u{200B}ទឹក\u{200B}មិន\u{200B}បាន\u{200B}ទេ តែ\u{200B}ខ្ញុំ\u{200B}ពេល\u{200B}ខូច\u{200B}ចិត្ត\u{200B}ដឹង\u{200B}តែ\u{200B}ឃ្លាន\u{200B} ញ៉ាំ\u{200B}ច្រើន\u{200B}ឬ\u{200B}តិច\u{200B}ក៏\u{200B}អាស្រ័យ\u{200B}\u{200B}លើ\u{200B}ថា\u{200B}ទំហំ\u{200B}នៃការ\u{200B}ខូច\u{200B}\u{200B}ចិត្ត\u{200B}ខ្លាំង\u{200B}ឬ\u{200B}ខ្សោយ\u{200B}\u{200B}។"`
   |
   = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#invisible_characters
   = note: `#[deny(clippy::invisible_characters)]` on by default

error: could not compile `charabia` (bench "bench") due to previous error

should I apply the suggested fix?

curquiza · 2023-10-10T15:16:12Z

@ManyTheFish can answer this

ManyTheFish · 2023-10-16T13:47:54Z

hello @xshadowlegendx,
if you can't remove the invisible character from the code, then you can add the flag to ignore the error, honestly, I'd prefer it if the invisible character is removed. 😄

Thanks!

ManyTheFish · 2023-10-18T09:12:02Z

bors try

ManyTheFish

Looks good to me,
thank you @xshadowlegendx for your work!

bors merge

meili-bors · 2023-10-18T09:20:46Z

try

Build succeeded:

tests

meili-bors · 2023-10-18T09:22:31Z

Build succeeded:

tests

xshadowlegendx added 6 commits March 28, 2023 15:15

add khmer segmenter copy from dummy segmenter

24d7646

import and export khmer segmenter

fd78998

add icu4x deps + khmer feature

7987eac

add icu4x khmer keys

88c4103

add khmer segmenter to segmenter list

42b6598

adding khmer segmenter

91049d6

ManyTheFish self-requested a review March 29, 2023 08:28

ManyTheFish mentioned this pull request May 11, 2023

Implement Vietnamese tokenizer for Meilisearch #174

Open

ManyTheFish requested changes Jun 26, 2023

View reviewed changes

charabia/Cargo.toml Outdated Show resolved Hide resolved

charabia/Cargo.toml Outdated Show resolved Hide resolved

charabia/icu4x-khmer-keys Outdated Show resolved Hide resolved

add some benchmark

577befe

xshadowlegendx and others added 6 commits July 18, 2023 23:50

make icu deps optional

77bbf09

Co-authored-by: Many the fish <[email protected]>

only load icu deps if khmer segmenter feature enabled

c261e2b

Co-authored-by: Many the fish <[email protected]>

move khmer keys to bin dir

69f6fba

ref: https://github.com/meilisearch/charabia/pull/203\#discussion_r1241774951

Merge branch 'add-khmer-segmenter' of github.com:xshadowlegendx/chara…

31d031b

…bia into add-khmer-segmenter

update icu for stable icu_segmenter version

45af1bd

remove unused code + use new icu segmenter api + format code

a22a20a

xshadowlegendx added 3 commits October 8, 2023 22:07

update icu deps to 1.3

3c440b7

update icu4x khmer keys from 1.3

517b022

use static include bytes + use dictionary instead of lstm

b94c54a

xshadowlegendx marked this pull request as ready for review October 8, 2023 15:11

Merge branch 'main' into add-khmer-segmenter

2d345a5

xshadowlegendx added 2 commits October 10, 2023 06:21

forgot comma

acbe5ba

ran cargo fmt

7078896

xshadowlegendx added 2 commits October 16, 2023 21:39

removed invisible chars

b198953

Merge branch 'main' into add-khmer-segmenter

fc3c2cd

meili-bors bot added a commit that referenced this pull request Oct 18, 2023

Try #203:

732667e

ManyTheFish approved these changes Oct 18, 2023

View reviewed changes

meili-bors bot merged commit 6364941 into meilisearch:main Oct 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add khmer segmenter #203

add khmer segmenter #203

xshadowlegendx commented Mar 29, 2023 •

edited

Loading

xshadowlegendx commented Apr 6, 2023

ManyTheFish commented Apr 13, 2023

xshadowlegendx commented Apr 17, 2023

ManyTheFish commented Apr 25, 2023

xshadowlegendx commented May 8, 2023

ManyTheFish commented May 10, 2023

xshadowlegendx commented May 14, 2023

xshadowlegendx commented Jun 24, 2023

ManyTheFish left a comment

ManyTheFish commented Jun 26, 2023

xshadowlegendx commented Jul 18, 2023 •

edited

Loading

ManyTheFish commented Jul 19, 2023

xshadowlegendx commented Jul 19, 2023

ManyTheFish commented Sep 5, 2023

xshadowlegendx commented Sep 11, 2023 •

edited

Loading

curquiza commented Oct 9, 2023

curquiza commented Oct 9, 2023

xshadowlegendx commented Oct 10, 2023

curquiza commented Oct 10, 2023

ManyTheFish commented Oct 16, 2023

ManyTheFish commented Oct 18, 2023

ManyTheFish left a comment

meili-bors bot commented Oct 18, 2023

meili-bors bot commented Oct 18, 2023

add khmer segmenter #203

add khmer segmenter #203

Conversation

xshadowlegendx commented Mar 29, 2023 • edited Loading

Pull Request

Related issue

What does this PR do?

PR checklist

xshadowlegendx commented Apr 6, 2023

ManyTheFish commented Apr 13, 2023

xshadowlegendx commented Apr 17, 2023

ManyTheFish commented Apr 25, 2023

xshadowlegendx commented May 8, 2023

ManyTheFish commented May 10, 2023

xshadowlegendx commented May 14, 2023

xshadowlegendx commented Jun 24, 2023

ManyTheFish left a comment

Choose a reason for hiding this comment

ManyTheFish commented Jun 26, 2023

xshadowlegendx commented Jul 18, 2023 • edited Loading

ManyTheFish commented Jul 19, 2023

xshadowlegendx commented Jul 19, 2023

ManyTheFish commented Sep 5, 2023

xshadowlegendx commented Sep 11, 2023 • edited Loading

curquiza commented Oct 9, 2023

curquiza commented Oct 9, 2023

xshadowlegendx commented Oct 10, 2023

curquiza commented Oct 10, 2023

ManyTheFish commented Oct 16, 2023

ManyTheFish commented Oct 18, 2023

ManyTheFish left a comment

Choose a reason for hiding this comment

meili-bors bot commented Oct 18, 2023

try

meili-bors bot commented Oct 18, 2023

xshadowlegendx commented Mar 29, 2023 •

edited

Loading

xshadowlegendx commented Jul 18, 2023 •

edited

Loading

xshadowlegendx commented Sep 11, 2023 •

edited

Loading