Today, Charabia automatically detects the Language of the provided text and chooses the best tokenization pipeline accordingly.
drawback
Sometimes the detection is not accurate, mainly when the provided text is short, and the user can't manually choose the Languages contained in the provided text.
enhancement
Add a new setting in the TokenizerBuilder that forces the detection to choose from a subset of Languages and, when there is no choice left to make, skips the detection and directly picks the specialized pipeline. Whatlang, the library used to detect the Language, can restrict detection to a subset of Languages through the Detector::with_allowlist method.
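The decision described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Language` is a stand-in enum rather than Charabia's real type, `Resolution` and `resolve` are hypothetical names, and treating an empty allowlist as "no restriction" is a guess that the issue does not settle.

```rust
/// Stand-in for Charabia's Language type (variants are illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    Cmn,
    Jpn,
    Eng,
}

/// Hypothetical outcome of applying an allowlist before detection.
#[derive(Debug, PartialEq, Eq)]
enum Resolution {
    /// Exactly one Language is allowed: detection is skipped.
    Forced(Language),
    /// Several Languages are allowed: detection must choose among them.
    DetectAmong(Vec<Language>),
    /// No usable allowlist: detection runs unrestricted, as it does today.
    DetectAll,
}

/// Decide how to proceed for a given (optional) list of allowed Languages.
fn resolve(allowlist: Option<&[Language]>) -> Resolution {
    match allowlist {
        // A single candidate leaves no choice, so the detector is not needed.
        Some([only]) => Resolution::Forced(*only),
        // Several candidates: restrict the detector to this subset.
        Some(langs) if !langs.is_empty() => Resolution::DetectAmong(langs.to_vec()),
        // Assumption: a missing or empty allowlist means "no restriction".
        _ => Resolution::DetectAll,
    }
}
```

In a real implementation, the `DetectAmong` case is presumably where the subset would be handed to Whatlang through Detector::with_allowlist.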
Technical approach:
Add an optional allowlist parameter to the detect method of the Detect trait in detection/mod.rs
Add segment_with_allowlist and segment_str_with_allowlist methods, each taking an additional allowlist parameter, to the Segment trait in segmenter/mod.rs
Add an allowlist method to the TokenizerBuilder struct in tokenizer.rs
The allowlist should be a hashmap of Script -> [Languages]
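To make the proposed shape more concrete, here is a minimal sketch of a builder holding a Script -> [Languages] map. Everything in it is an assumption for illustration: `Script` and `Language` are stand-in enums, this `TokenizerBuilder` is a toy struct and not Charabia's, and the `allowed_for` helper and the borrowed-map signature are hypothetical choices, not the API the maintainers asked for.

```rust
use std::collections::HashMap;

/// Stand-ins for Charabia's Script and Language types (illustrative variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Script {
    Latin,
    Cj,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    Eng,
    Cmn,
    Jpn,
}

/// Toy builder that only stores the allowlist; the real struct has more fields.
#[derive(Default)]
struct TokenizerBuilder<'a> {
    allowlist: Option<&'a HashMap<Script, Vec<Language>>>,
}

impl<'a> TokenizerBuilder<'a> {
    fn new() -> Self {
        Self::default()
    }

    /// Restrict detection to the given Languages, per Script.
    fn allowlist(&mut self, allowlist: &'a HashMap<Script, Vec<Language>>) -> &mut Self {
        self.allowlist = Some(allowlist);
        self
    }

    /// Hypothetical helper: the Languages allowed for `script`, or None when
    /// no allowlist is set or the Script has no entry (unrestricted detection).
    fn allowed_for(&self, script: Script) -> Option<&[Language]> {
        self.allowlist?.get(&script).map(Vec::as_slice)
    }
}
```

Keying the map by Script means a restriction on one Script (say, forcing a single Language for Cj text) leaves detection for other Scripts untouched.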
Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the usual rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝
Thanks for your interest in this project 🔥 You are definitely more than welcome to open a PR for this!
FYI, we prefer not to assign people to our issues because sometimes people ask to be assigned and never come back, which discourages volunteer contributors from opening a PR to fix the issue.
We will accept and merge the first PR that correctly fixes and properly implements the issue following our contributing guidelines.