Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add khmer segmenter #203

Merged
merged 21 commits into from
Oct 18, 2023
Merged

Conversation

xshadowlegendx
Copy link
Contributor

@xshadowlegendx xshadowlegendx commented Mar 29, 2023

Pull Request

Related issue

Fixes #200

What does this PR do?

  • add segmenter for khmer language

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

@ManyTheFish ManyTheFish self-requested a review March 29, 2023 08:28
@xshadowlegendx
Copy link
Contributor Author

hello @ManyTheFish, how do I test this with meilisearch on my local machine?

@ManyTheFish
Copy link
Member

Hello @xshadowlegendx,

don't hesitate to pull the main branch of Meilisearch and update the milli/Cargo.toml file linking your branch instead of a fixed version of Charabia.

This way you will be able to run Meilisearch with your changes!

@xshadowlegendx
Copy link
Contributor Author

@ManyTheFish ok thanks will try it

@ManyTheFish
Copy link
Member

Hello @xshadowlegendx, do you manage to test the khmer segmenter with Meilisearch?

@xshadowlegendx
Copy link
Contributor Author

hello @ManyTheFish, sorry for the late reply. Currently I am still testing the khmer segmenter with meilisearch, so I am able to compile and run it with local charabia as dependency. Is there anything like text analysis where I can test if the segmenter works? so the way I am testing now is inserting some khmer data and search for word for it but it seems to search it part by part instead of word? I also tried to run a meilisearch docker container to compare and it seems to work the same as well, not sure if the segmenter is working or not yet

@ManyTheFish
Copy link
Member

@xshadowlegendx, a test you could try is to push a document containing សួស្តីពិភពលោក and try searching for ពិភពលោក. As Meilisearch only supports prefix search then you should not be able to find the document containing សួស្តីពិភពលោក if your segmenter is not used. Otherwise, Meilisearch managed to segment សួស្តីពិភពលោក and so we can conclude with the fact that your segmenter is used.
If you want to activate some normalizer or add a specific one for the Khmer Language, don't hesitate to check the normalizer list, probably removing diacritics is relevant for the Khmer Language, In this case, I suggest adding it to the normalizer's language list.

I hope it helps!

@xshadowlegendx
Copy link
Contributor Author

ok thanks @ManyTheFish let me try to test it again

@xshadowlegendx
Copy link
Contributor Author

hello @ManyTheFish , so I tried pushing document that contains សួស្តីពិភពលោក and searching for ពិភពលោក it is able to find those document that contains it, this is with meilisearch from dockerhub without the khmer segmenter, below are steps I took to test it

docker run -it --rm -p 7800:7700 getmeili/meilisearch:v1.2

# books.json
[
  {
    "id": "ec660236154a4bed88b554cff8fb93a4",
    "title": "ភាសាខ្មែរជាភាសល្អជាងគេនៅពិភពលោក"
  },
  {
    "id": "2dea8651215a4695aaf8721f4b08fa40",
    "title": "ប្រទេសចិនប្រេីភាសាចិន"
  },
  {
    "id": "2dea8651215a4695aaf8721f4b08fa41",
    "title": "សួស្តីពិភពលោក"
  }
]

curl -X POST 'http://localhost:7800/indexes/books/documents?primaryKey=id' -H 'Content-Type: application/json' --data-binary @books.js

this screenshot seems like without khmer segmenter it also able to pick up the word
Screenshot 2023-06-24 at 11 18 19

but if its not a word it also pick it up
Screenshot 2023-06-24 at 11 18 44

so I try to test the khmer segmenter by cloning meilisearch to my local machine, point charabia deps to my local machine one and cargo run to run meilisearch and test just like above. Before the testing, I put a panic!("khmer segmenter says panic"); within the segment_str function to ensure it gets called and it does, then I remove the panic statement and proceed to test like above and got the same result.

below are changes to local meilisearch to test with the khmer segmenter

❯ cat milli/Cargo.toml | grep chara
charabia = { path = "../../charabia/charabia", default-features = false }
default = [ "charabia/default" ]
chinese = ["charabia/chinese"]
hebrew = ["charabia/hebrew"]
japanese = ["charabia/japanese"]
japanese-transliteration = ["charabia/japanese-transliteration"]
korean = ["charabia/korean"]
thai = ["charabia/thai"]
khmer = ["charabia/khmer"]

Copy link
Member

@ManyTheFish ManyTheFish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hello @xshadowlegendx,
I asked for small changes in your PR,

  1. Do these changes enhance your search results?
  2. Do you see some search results that are missing?

See you!

charabia/Cargo.toml Outdated Show resolved Hide resolved
charabia/Cargo.toml Outdated Show resolved Hide resolved
charabia/icu4x-khmer-keys Outdated Show resolved Hide resolved
@ManyTheFish
Copy link
Member

Could you add some sentences in benchmarks, please?

static DATA_SET: &[((usize, Script, Language), &str)] = &[
// short texts (~130 bytes)
((132, Script::Cj, Language::Cmn), "人人生而自由﹐在尊严和权利上一律平等。他們賦有理性和良心﹐並應以兄弟關係的精神互相對待。"),
((132, Script::Cj, Language::Jpn), "詳しくは以下の をご覧下さい。語学ないし文学の立場からの価値判断は一切おこなっていません"),
((132, Script::Latin, Language::Eng), "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F! Hello guys, my purpose is to benchmark tokenizer properly."),
((132, Script::Latin, Language::Fra), "La ville avait d'abord été nommée « Lutèce » ou « boueuse », ici une tentative d'explication par le latin lŭtum « boue »."),
((132, Script::Hebrew, Language::Heb), "הַשּׁוּעָל הַמָּהִיר (״הַחוּם״) לֹא יָכוֹל לִקְפֹּץ 8.94 מֶטְרִים, נָכוֹן?"),
((132, Script::Thai, Language::Tha), "ไก่จิกเด็กตายเด็กตายบนปากโอ่งไก่อะไรวะโหดจัง"),
((132, Script::Hangul, Language::Kor), "제119조 ① 대한민국의 경제질서는 개인과 기업의 경제상의 자유와 창의를 존중함을 기본으로 한다."),
((130, Script::Greek, Language::Ell), "Οι θερμοκρασίες είναι σπάνια υπερβολικές στις παραθαλάσσιες περιοχές."),
((132, Script::Arabic, Language::Ara), "اللُّغَةُ العربية هي أكثر اللغات السامية تحدثا، ومن أكثر اللغات انتشارا"),
// long texts (~365 bytes)
((363, Script::Cj, Language::Cmn), "距今60万年-2万年的时间内,北京地区处于旧石器时代,在周口店发现了旧石器时代早期北京直立人、中期新洞人和晚期山顶洞人的典型遗址。北京地区在不晚于1万年前已经开始进入新石器时代。当时该地区人类定居生活固定化,逐渐从山洞中迁徙出来,到平原地区定居[12]。"),
((364, Script::Cj, Language::Jpn), "詳しくは以下の をご覧下さい。語学ないし文学の立場からの価値判断は一切おこなっていません。だけど、バラ科の仲間ということでは「すもももももももものうち」は正しいことになります。すももものうち!今日は「すもももももももものうち」について考えます。"),
((363, Script::Latin, Language::Eng), "The City of London Corporation is unique in the UK and has some unusual responsibilities for a local council, such as being the police authority. It is also unusual in having responsibilities and ownership beyond its boundaries. The Corporation is headed by the Lord Mayor of the City of London (an office separate from, and much older than, the Mayor of London)."),
((363, Script::Latin, Language::Fra), "La position de Lutèce, sur l'île aujourd'hui nommée l'île de la Cité, permettant le franchissement du grand fleuve navigable qu'est la Seine par une voie reliant le Nord et le Sud des Gaules, en fait dès l'Antiquité une cité importante, capitale des Parisii, puis lieu de séjour d'un empereur romain. Le mot Lutèce resulte de la francisation de Lutetia."),
((365, Script::Hebrew, Language::Heb), "הַשּׁוּעָל הַמָּהִיר (״הַחוּם״) לֹא יָכוֹל לִקְפֹּץ 8.94 מֶטְרִים, נָכוֹן? תַּכְלֶס, אִם הוּא הָיָה יָכוֹל, הוּא חֲתִיכַת שׁוּעָל הַשּׁוּעָל הַזֶּה.. אֲבָל הַאִם לֹא כֻּלָּנוּ שׁוּעָלִים בְּעֶצֶם? יתכן."),
((366, Script::Thai, Language::Tha), "เราจะทำตามสัญญาขอเวลาอีกไม่นานแล้วแผ่นดินที่งดงามจะคืนกลับมาเราจะทำอย่างซื่อตรงขอแค่เธอจงไว้ใจและศรัทธาแผ่นดินจะดีในไม่ช้า"),
((364, Script::Hangul, Language::Kor), "제30조 타인의 범죄행위로 인하여 생명·신체에 대한 피해를 받은 국민은 법률이 정하는 바에 의하여 국가로부터 구조를 받을 수 있다. ② 명령·규칙 또는 처분이 헌법이나 법률에 위반되는 여부가 재판의 전제가 된 경우에는 대법원은 이를 최종적으로 심사할 권한을 가진다."),
((364, Script::Greek, Language::Ell), "Η άνοιξη έχει μικρή διάρκεια, διότι ο μεν χειμώνας είναι όψιμος, το δε καλοκαίρι αρχίζει πρώιμα. Το φθινόπωρο είναι μακρύ και θερμό και πολλές φορές παρατείνεται στη νότια Ελλάδα και τα νησιά μέχρι τα"),
((366, Script::Arabic, Language::Ara), "العربية لغةٌ رسمية في كل دول الوطن العربي (إضافة إلى كونها لغة رسمية في تشاد وإريتريا). وهي إحدى اللغات الرسمية الست في منظمة الأمم المتحدة، ويُحتفل بالعربية في 18 ديسمبر كذكرى اعتمادها في الأمم المتحدة."),
];

@xshadowlegendx
Copy link
Contributor Author

xshadowlegendx commented Jul 18, 2023

hello @xshadowlegendx, I asked for small changes in your PR,

1. Do these changes enhance your search results?

2. Do you see some search results that are missing?

See you!

hello @ManyTheFish, sorry for such lateness

for your questions:

  1. yes
  2. yes

after those changes on my local, now it show some differences, the left one is running with docker and the right one is running with local development. so I tried searching the word លោក it shows empty result which is correct in this case and typing only a character or incomplete word also matches is this expected behavior?

obvious changes between meilisearch with/without khmer segmenter

searches also highlight incomplete word

@ManyTheFish
Copy link
Member

Hello @xshadowlegendx, Meilisearch is a prefix search engine, so it's expected that if you tip the start of a word Meilisearch find all the documents containing a word starting with the query.
Is it the case?

@xshadowlegendx
Copy link
Contributor Author

hello @ManyTheFish , yes that is the case, so it is expected behavior then. I will be testing some document that are more close to real world data to see how the segmenter perform next then

@ManyTheFish
Copy link
Member

Hello @xshadowlegendx,

Any news on this PR, Do you feel enough comfortable with your tests to merge it?

Thanks!

@xshadowlegendx
Copy link
Contributor Author

xshadowlegendx commented Sep 11, 2023

hello @ManyTheFish, sorry for the late replies, so I tested with about 253 movies that is production data from my workplace weeks ago. the search is very fast, can search movies from titles, actor names and genres, and typo tolerance also works by updating the default since khmer word are smaller size, so I think it is ready to be merged.

but the current icu4x dependency that I am using currently have a bug with using dictionary model so currently I am using the lstm model for the segmentation but it was fixed very soon after I submit the issue but It hasnt been release yet

@xshadowlegendx xshadowlegendx marked this pull request as ready for review October 8, 2023 15:11
@curquiza
Copy link
Member

curquiza commented Oct 9, 2023

Hello @xshadowlegendx
Can you fix the git conflicts you have on this PR before we review? 😊

@curquiza
Copy link
Member

curquiza commented Oct 9, 2023

@xshadowlegendx
Looks like clippy and rust fmt CI fail. Can you fix these before we review?

@xshadowlegendx
Copy link
Contributor Author

hello @curquiza, the clippy wants me to fix the invisible_characters such as below

❯ cargo clippy --all-targets -- --deny warnings
    Checking charabia v0.8.4 (/Users/xanonx/projects/meilisearch/charabia/charabia)
error: invisible character detected
  --> charabia/benches/bench.rs:28:43
   |
28 | ...), "រឿង​ពីរ​ដែល​មនុស្ស​ហាម​ចិត្ត​ខ្លួន​ឯង​មិន​បាន​គឺ​ សើច​ និង​ ស្រឡាញ់​​។ តែ​សម្រាប់​ខ្ញុំ​ ប្រាក់ ចន្ទធីតា​ ​រឿង​មួយ​ទៀត​ដែល​ខ្ញុំ​ហាម​ចិត្ត​ខ្លួន​ឯង​មិន​បាន​នោះ គឺ​ញ៉ាំ​ គេ​គ្រប់​គ្នា​ពេល​ខូច​ចិត...បាយ​ទឹក​មិន​បាន​ទេ តែ​ខ្ញុំ​ពេល​ខូច​ចិត្ត​ដឹង​តែ​ឃ្លាន​ ញ៉ាំ​ច្រើន​ឬ​តិច​ក៏​អាស្រ័យ​​លើ​ថា​ទំហំ​នៃការ​ខូច​​ចិត្ត​ខ្លាំង​ឬ​ខ្សោយ​​។"),
   |       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: consider replacing the string with: `"រឿង\u{200B}ពីរ\u{200B}ដែល\u{200B}មនុស្ស\u{200B}ហាម\u{200B}ចិត្ត\u{200B}ខ្លួន\u{200B}ឯង\u{200B}មិន\u{200B}បាន\u{200B}គឺ\u{200B} សើច\u{200B} និង\u{200B} ស្រឡាញ់\u{200B}\u{200B}។ តែ\u{200B}សម្រាប់\u{200B}ខ្ញុំ\u{200B} ប្រាក់ ចន្ទធីតា\u{200B} \u{200B}រឿង\u{200B}មួយ\u{200B}ទៀត\u{200B}ដែល\u{200B}ខ្ញុំ\u{200B}ហាម\u{200B}ចិត្ត\u{200B}ខ្លួន\u{200B}ឯង\u{200B}មិន\u{200B}បាន\u{200B}នោះ គឺ\u{200B}ញ៉ាំ\u{200B} គេ\u{200B}គ្រប់\u{200B}គ្នា\u{200B}ពេល\u{200B}ខូច\u{200B}ចិត្ត\u{200B}បាយ\u{200B}ទឹក\u{200B}មិន\u{200B}បាន\u{200B}ទេ តែ\u{200B}ខ្ញុំ\u{200B}ពេល\u{200B}ខូច\u{200B}ចិត្ត\u{200B}ដឹង\u{200B}តែ\u{200B}ឃ្លាន\u{200B} ញ៉ាំ\u{200B}ច្រើន\u{200B}ឬ\u{200B}តិច\u{200B}ក៏\u{200B}អាស្រ័យ\u{200B}\u{200B}លើ\u{200B}ថា\u{200B}ទំហំ\u{200B}នៃការ\u{200B}ខូច\u{200B}\u{200B}ចិត្ត\u{200B}ខ្លាំង\u{200B}ឬ\u{200B}ខ្សោយ\u{200B}\u{200B}។"`
   |
   = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#invisible_characters
   = note: `#[deny(clippy::invisible_characters)]` on by default

error: could not compile `charabia` (bench "bench") due to previous error

should I apply the suggested fix?

@curquiza
Copy link
Member

@ManyTheFish can answer this

@ManyTheFish
Copy link
Member

hello @xshadowlegendx,
if you can't remove the invisible character from the code, then you can add the flag to ignore the error, honestly, I'd prefer it if the invisible character is removed. 😄

Thanks!

@ManyTheFish
Copy link
Member

bors try

meili-bors bot added a commit that referenced this pull request Oct 18, 2023
Copy link
Member

@ManyTheFish ManyTheFish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me,
thank you @xshadowlegendx for your work!

bors merge

@meili-bors
Copy link
Contributor

meili-bors bot commented Oct 18, 2023

try

Build succeeded:

@meili-bors
Copy link
Contributor

meili-bors bot commented Oct 18, 2023

Build succeeded:

@meili-bors meili-bors bot merged commit 6364941 into meilisearch:main Oct 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add support for khmer language
3 participants