Issue with fragment coordinates inside DetectionResult #268

Mrodent · 2023-11-06T20:47:04Z

I've identified a slight problem with the DetectionResults produced when using the feature detect_multiple_languages_of. I do appreciate this is an "experimental" feature, but it probably needs to be addressed.

To obtain the text in these fragments I'm doing this:

let mut output_str = format!("document {} self.n_ldocs {} text:\n|{}|\n", self.path_str, self.n_ldocs, ldoc_text);
for detection_result in detection_results {
    let fragment: &str = &ldoc_text[detection_result.start_index()..detection_result.end_index()];
    let confidence = self.handling_framework.language_detector.compute_language_confidence(fragment, detection_result.language());
    if confidence > 0.25 {
        ... // probably sensible language classification
    }
    else {
        ... // may be gibberish, or rather technical or something like that
    }
}

Obviously these are coordinates on the bytes in the String.

With the above, VERY occasionally, I get a panic. Examples:
"thread '<unnamed>' panicked at 'byte index 654 is not a char boundary; it is inside 'ο' (bytes 653..655) of ``ἐυηνεμον: ἐυ \"good\" + ἀνεμος \"wind\". ..."
or
"thread '<unnamed>' panicked at 'byte index 65 is not a char boundary; it is inside '\u{f0fc}' (bytes 64..67) of `` 7.3 ..."

In processing about 2 million words I only get a handful of panics, and these appear to be on VERY exotic and obscure Unicode: accented Ancient Greek or U+F0FC "Private Use Character".

But the trouble is that, currently, I'm not too sure how to "catch" such a thing before it panics: if a char boundary is violated by [result.start_index()..result.end_index()] this does not produce a Result, it just panics! And this in turn means that the whole of the rest of the document I'm parsing never gets processed. (My documents are being parsed in parallel threads so other documents are unaffected).

Naturally, I'm trying to find a way to test each proposed slice to find out whether or not it's "legal". But even if I do, obviously this would mean a lot more processing time.

So maybe you might want to think about providing indices on the chars rather than the bytes: users could then do this:

let text_vec = whole_string.chars().collect::<Vec<_>>();
fragment = text_vec[detection_result.start_index_of_chars()..detection_result.end_index_of_chars()].iter().cloned().collect::<String>();

The text was updated successfully, but these errors were encountered:

Mrodent · 2023-11-06T21:07:56Z

Ah, I have found a workaround: first time I've heard about panic::catch_unwind. Here I'm assuming .66 confidence is good enough ...

let confidence_values = self.handling_framework.language_detector
    .compute_language_confidence_values(ldoc_text)
    .into_iter()
    .map(|(language, confidence)| (language, (confidence * 100.0).round() / 100.0))
    .collect::<Vec<_>>();		
if confidence_values[0].1 > 0.66 {
    // the whole section of text gives a reasonable confidence that it's in one of the languages
    self.handling_framework.n_confident_language_ldocs.fetch_add(1, Ordering::Relaxed);
}
else {
    // not confident enough... let's see whether multiple language analysis makes any sense
    let detection_results: Vec<DetectionResult> = self.handling_framework.language_detector.detect_multiple_languages_of(ldoc_text);
    let mut output_str = format!("document {} self.n_ldocs {} text:\n|{}|\n", self.path_str, self.n_ldocs, ldoc_text);
    for detection_result in detection_results {
	let start_index = detection_result.start_index();
	let end_index = detection_result.end_index();
	let result = std::panic::catch_unwind(||{
	    &ldoc_text[start_index..end_index]
    	});
	let fragment = match result {
	    Ok(fragment) => fragment,
	    Err(e) => {
		error!("got Unicode char boundary error: {:#?}", e);
		output_str.push_str(&format!("[!!! {} bytes with problem Unicode]", end_index - start_index));
		continue
	    }
        };
	let confidence = self.handling_framework.language_detector.compute_language_confidence(fragment, detection_result.language());
	output_str.push_str(&format!("{:?} fragment |{}|\n... confidence {}\n", detection_result, fragment, confidence));
	if confidence > 0.66 {
	    // ... (the fragment is marked and treated as being "probably language X")
	}
	else if confidence > 0.33 {
	    // ... (the fragment is marked and treated as being "possibly language X")
	}
	else {
	    // ... (the fragment is marked and treated as being "language unknown")
	}
    }	
}

With my corpus of documents (.docx files, each one split up into overlapping 10-paragraph subsections, each one constituting an "LDoc", Lucene document), with 1.75 million words, this generates 31727 LDocs, of which 30694 are confident enough to be classified as a given language, without needing multiple language detection.

In terms of processing times, this actually works out at only about 50% more time than doing no language analysis at all, so pretty good really.

pemistahl · 2023-11-07T21:16:57Z

The indices are actually supposed to be character indices, not byte indices. It turns out that this is a bug in the regex crate which my library uses internally. This bug has been fixed since my last release. I'm going to create a new release soon which will use the updated regex crate including this bug fix. Then the errors you are getting now will hopefully be gone.

pemistahl closed this as completed Aug 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue with fragment coordinates inside DetectionResult #268

Issue with fragment coordinates inside DetectionResult #268

Mrodent commented Nov 6, 2023 •

edited

Loading

Mrodent commented Nov 6, 2023 •

edited

Loading

pemistahl commented Nov 7, 2023

Issue with fragment coordinates inside DetectionResult #268

Issue with fragment coordinates inside DetectionResult #268

Comments

Mrodent commented Nov 6, 2023 • edited Loading

Mrodent commented Nov 6, 2023 • edited Loading

pemistahl commented Nov 7, 2023

Mrodent commented Nov 6, 2023 •

edited

Loading

Mrodent commented Nov 6, 2023 •

edited

Loading