create dictionary based WordSegmenter returns MissingDataKey, segmenter/dictionary/w_auto@1 #3545

Closed
xshadowlegendx opened this issue Jun 17, 2023 · 4 comments · Fixed by #3551
Assignees
Labels
A-data Area: Data coverage or quality C-segmentation Component: Segmentation T-bug Type: Bad behavior, security, privacy

Comments

@xshadowlegendx

Hello everyone, I am not able to initialize the word segmenter in dictionary mode, although LSTM mode works fine. In dictionary mode, initialization fails with this error:

word segmenter init error - Data(DataError { kind: MissingDataKey, key: Some(DataKey{segmenter/dictionary/w_auto@1}), str_context: None, silent: true })

I generated the provider data with this command:

icu4x-datagen --keys all --locales km --format blob --out my_data_blob.postcard --overwrite --trie-type fast

Checking the output with grep shows that the dictionary keys are there:

icu4x-datagen ...  | grep segmenter
INFO  [icu_datagen] Writing key: segmenter/lstm/wl_auto@1
INFO  [icu_datagen] Writing key: segmenter/dictionary/wl_ext@1
INFO  [icu_datagen] Writing key: segmenter/dictionary/w_auto@1
INFO  [icu_datagen] Writing key: segmenter/grapheme@1
INFO  [icu_datagen] Writing key: segmenter/word@1
INFO  [icu_datagen] Writing key: segmenter/sentence@1
INFO  [icu_datagen] Writing key: segmenter/line@1

Below are main.rs and Cargo.toml:

use icu::segmenter::WordSegmenter;
use icu_provider_blob::BlobDataProvider;

#[derive(Debug)]
struct CharPositionBound(usize, usize);

#[derive(Debug)]
struct CharPositionBoundConstruct(Option<usize>, Vec<CharPositionBound>);

fn main() {
    let blob = std::fs::read("my_data_blob.postcard").expect("Failed to read file");

    let buffer_provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())
        .expect("Failed to initialize Data Provider.");

    let d = WordSegmenter::try_new_dictionary_with_buffer_provider(&buffer_provider);

    if let Err(err) = d {
        println!("word segmenter init error - {:?}", err);
    } else if let Ok(res) = d {
        //let w = String::from("ប្រទេសកម្ពុជាជាប្រទេសល្អ");
        // let w = "ភាសាខ្មែរ";
        // let w = "កម្ពុជាសប្យាយ";
        let w = "សួស្តីពិភពលោក";

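        // Pair consecutive breakpoints into (start, end) byte ranges.
        let CharPositionBoundConstruct(_, res) = res.segment_str(&w).fold(
            CharPositionBoundConstruct(None, Vec::new()),
            |mut acc, elem| {
                // Every breakpoint after the first closes the previous segment.
                if let Some(start) = acc.0 {
                    acc.1.push(CharPositionBound(start, elem));
                }

                // Remember this breakpoint as the start of the next segment.
                acc.0 = Some(elem);

                acc
            },
        );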

        println!("{:?}", res);

        for CharPositionBound(start, end) in res {
            println!("{}", &w[start..end]);
        }
    }
}
# Cargo.toml

[package]
name = "rust_icu4x_demo"
version = "0.1.0"
edition = "2021"

[dependencies]
icu = { version = "1.2.0", features = ["serde"] }
icu_provider_blob = "1.2.0"
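
The fold in main.rs pairs each breakpoint with the one before it. As a minimal sketch of the same idea without any ICU4X dependency, the pairing can be written with `windows(2)` over a plain slice. The helper name `breakpoints_to_ranges` and the hand-written breakpoints are illustrative assumptions, not segmenter output.

```rust
// Hypothetical helper: turns a sorted list of breakpoints into
// (start, end) byte ranges, one per segment.
fn breakpoints_to_ranges(breakpoints: &[usize]) -> Vec<(usize, usize)> {
    breakpoints
        .windows(2)
        .map(|pair| (pair[0], pair[1]))
        .collect()
}

fn main() {
    let text = "hello world";
    // Hand-written breakpoints standing in for what a segmenter would yield.
    let breakpoints = [0, 5, 6, 11];

    for (start, end) in breakpoints_to_ranges(&breakpoints) {
        println!("{:?}", &text[start..end]);
    }
}
```

With fewer than two breakpoints, `windows(2)` yields nothing, so the result is an empty list, which matches what the fold produces.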
@robertbastian
Member

This is a bug in BlobDataProvider, whose current data model is not capable of storing keys without any values (such as segmenter/dictionary/w_auto@1 for --locales km), and therefore returns MissingDataKey instead of MissingLocale (which try_new_dictionary_with_buffer_provider would handle).
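
To illustrate the failure mode, here is a toy model, not the actual BlobDataProvider schema. It assumes a flat store with one entry per (key, locale) pair, so a key exported with zero locales leaves no trace and cannot be told apart from a key that was never exported. The `FlatStore` and `ToyError` names and the payload bytes are made up for this sketch.

```rust
use std::collections::BTreeMap;

// Toy error type; the variant names mirror the two error kinds above.
#[derive(Debug, PartialEq)]
enum ToyError {
    MissingDataKey,
    MissingLocale,
}

// Hypothetical flat store: one entry per (key, locale) pair.
struct FlatStore {
    entries: BTreeMap<(String, String), Vec<u8>>,
}

impl FlatStore {
    fn new() -> Self {
        FlatStore { entries: BTreeMap::new() }
    }

    // Writes one entry per locale; with no locales, nothing is written.
    fn export_key(&mut self, key: &str, locales: &[(&str, Vec<u8>)]) {
        for (locale, payload) in locales {
            self.entries
                .insert((key.to_string(), locale.to_string()), payload.clone());
        }
    }

    fn load(&self, key: &str, locale: &str) -> Result<&[u8], ToyError> {
        if let Some(payload) = self.entries.get(&(key.to_string(), locale.to_string())) {
            return Ok(payload.as_slice());
        }
        // The store only knows a key exists if some entry mentions it.
        if self.entries.keys().any(|(k, _)| k == key) {
            Err(ToyError::MissingLocale)
        } else {
            Err(ToyError::MissingDataKey)
        }
    }
}

fn main() {
    let mut store = FlatStore::new();
    // Placeholder payload for a key that has data for one locale.
    store.export_key("segmenter/word@1", &[("und", vec![1u8, 2, 3])]);
    // A key exported with no locales, like the dictionary key with --locales km.
    store.export_key("segmenter/dictionary/w_auto@1", &[]);

    println!("{:?}", store.load("segmenter/word@1", "km")); // Err(MissingLocale)
    println!("{:?}", store.load("segmenter/dictionary/w_auto@1", "km")); // Err(MissingDataKey)
}
```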

@robertbastian robertbastian added T-bug Type: Bad behavior, security, privacy C-segmentation Component: Segmentation A-data Area: Data coverage or quality labels Jun 19, 2023
@robertbastian robertbastian added this to the 1.3 Blocking ⟨P1⟩ milestone Jun 19, 2023
@robertbastian robertbastian added the discuss-priority Discuss at the next ICU4X meeting label Jun 19, 2023
@robertbastian
Member

robertbastian commented Jun 19, 2023

There are two workarounds I can think of if you need to use BlobDataProvider:

  • Use --locales km ja to populate the key (it only supports a ja value)
  • Write a thin wrapper around BlobDataProvider that returns the correct error:
use icu_provider::prelude::*;

// Wrapper that always reports MissingLocale for the dictionary key.
struct FixProvider(BlobDataProvider);

impl BufferProvider for FixProvider {
    fn load_buffer(&self, key: DataKey, req: DataRequest) -> Result<DataResponse<BufferMarker>, DataError> {
        if key == icu::segmenter::provider::DictionaryForWordOnlyAutoV1Marker::KEY {
            // The segmenter constructor handles MissingLocale, but not MissingDataKey.
            Err(DataErrorKind::MissingLocale.with_req(key, req))
        } else {
            self.0.load_buffer(key, req)
        }
    }
}

let buffer_provider = FixProvider(
    BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())
        .expect("Failed to initialize Data Provider."),
);

@Manishearth
Member

For such keys can we store some dummy entry that uses an empty-string locale or something similar as a sentinel value?

Another fix would be to fix dictionary segmenter specifically so that it is ok with MissingDataKey for some of these cases.
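
The sentinel idea from the first suggestion could look roughly like the sketch below. It is a toy model under stated assumptions: the store layout, the `SentinelStore` and `ToyError` names, and the use of an empty string as the sentinel are all hypothetical, and a real implementation would need a sentinel that cannot collide with a valid locale.

```rust
use std::collections::BTreeMap;

// Toy error type; the variant names mirror the error kinds in this thread.
#[derive(Debug, PartialEq)]
enum ToyError {
    MissingDataKey,
    MissingLocale,
}

// Hypothetical sentinel: an empty locale string marks "this key exists".
const SENTINEL_LOCALE: &str = "";

struct SentinelStore {
    entries: BTreeMap<(String, String), Vec<u8>>,
}

impl SentinelStore {
    fn new() -> Self {
        SentinelStore { entries: BTreeMap::new() }
    }

    fn export_key(&mut self, key: &str, locales: &[(&str, Vec<u8>)]) {
        // Always record the key itself, even when no locale has data.
        self.entries
            .insert((key.to_string(), SENTINEL_LOCALE.to_string()), Vec::new());
        for (locale, payload) in locales {
            self.entries
                .insert((key.to_string(), locale.to_string()), payload.clone());
        }
    }

    fn load(&self, key: &str, locale: &str) -> Result<&[u8], ToyError> {
        let sentinel = (key.to_string(), SENTINEL_LOCALE.to_string());
        if !self.entries.contains_key(&sentinel) {
            return Err(ToyError::MissingDataKey);
        }
        // The sentinel entry is bookkeeping only and is never served as data.
        if locale == SENTINEL_LOCALE {
            return Err(ToyError::MissingLocale);
        }
        self.entries
            .get(&(key.to_string(), locale.to_string()))
            .map(Vec::as_slice)
            .ok_or(ToyError::MissingLocale)
    }
}

fn main() {
    let mut store = SentinelStore::new();
    // Key exported with no locale data, as in the --locales km case.
    store.export_key("segmenter/dictionary/w_auto@1", &[]);

    println!("{:?}", store.load("segmenter/dictionary/w_auto@1", "km")); // Err(MissingLocale)
    println!("{:?}", store.load("never/exported@1", "km")); // Err(MissingDataKey)
}
```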

@sffc
Member

sffc commented Jun 20, 2023

I'm not actually convinced that failing on MissingDataKey is the best behavior in the segmenter, but the proposed change to BlobDataProvider seems fine in a vacuum.
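
For comparison, the alternative on the segmenter side could be sketched as two policies for optional data. This is a self-contained illustration, not the ICU4X constructor: `ToyError`, `optional_strict`, and `optional_tolerant` are hypothetical names. The strict policy matches the behavior described earlier in the thread, where only MissingLocale is handled.

```rust
// Toy error type; the variant names mirror the error kinds in this thread.
#[derive(Debug, PartialEq)]
enum ToyError {
    MissingDataKey,
    MissingLocale,
    Other(String),
}

// Strict policy: only a missing locale counts as "no dictionary data".
fn optional_strict<T>(loaded: Result<T, ToyError>) -> Result<Option<T>, ToyError> {
    match loaded {
        Ok(payload) => Ok(Some(payload)),
        Err(ToyError::MissingLocale) => Ok(None),
        Err(other) => Err(other),
    }
}

// Tolerant policy (hypothetical): a missing key is also treated as "no data".
fn optional_tolerant<T>(loaded: Result<T, ToyError>) -> Result<Option<T>, ToyError> {
    match loaded {
        Ok(payload) => Ok(Some(payload)),
        Err(ToyError::MissingLocale) | Err(ToyError::MissingDataKey) => Ok(None),
        Err(other) => Err(other),
    }
}

fn main() {
    let strict: Result<Option<Vec<u8>>, ToyError> =
        optional_strict(Err(ToyError::MissingDataKey));
    let tolerant: Result<Option<Vec<u8>>, ToyError> =
        optional_tolerant(Err(ToyError::MissingDataKey));
    let unrelated: Result<Option<Vec<u8>>, ToyError> =
        optional_tolerant(Err(ToyError::Other("corrupt blob".to_string())));

    println!("{:?}", strict); // Err(MissingDataKey)
    println!("{:?}", tolerant); // Ok(None)
    println!("{:?}", unrelated); // Err(Other("corrupt blob"))
}
```

The trade-off is that the tolerant policy can no longer distinguish data that was deliberately left out from a key that was never generated, which is likely why the choice is debated here.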

@robertbastian robertbastian self-assigned this Jun 21, 2023
@robertbastian robertbastian removed the discuss-priority Discuss at the next ICU4X meeting label Jun 21, 2023
@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Jun 21, 2023
@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Jun 22, 2023
4 participants