Issue with fragment coordinates inside DetectionResult #268
Ah, I have found a workaround.
With my corpus of documents (.docx files, each one split into overlapping 10-paragraph subsections, each constituting an "LDoc", i.e. a Lucene document) totalling 1.75 million words, this generates 31727 LDocs, of which 30694 are classified confidently enough as a single language that multiple-language detection is not needed. In terms of processing time, this works out at only about 50% more than doing no language analysis at all, so pretty good really.
The indices are actually supposed to be character indices, not byte indices. It turns out that this is a bug in the regex crate which my library uses internally. This bug has been fixed since my last release. I'm going to create a new release soon which will use the updated regex crate, including this bug fix. Then the errors you are getting now will hopefully be gone.
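For anyone unfamiliar with the distinction the maintainer draws here: byte offsets and character offsets diverge as soon as the text contains multi-byte UTF-8 characters, such as the Greek word from one of the panic messages in this issue. A quick illustration using only the standard library:

```rust
fn main() {
    let text = "ἀνεμος"; // "wind": 6 characters, 13 bytes in UTF-8
    assert_eq!(text.chars().count(), 6);
    assert_eq!(text.len(), 13); // str::len counts bytes, not chars
    // 'ἀ' (U+1F00) alone occupies 3 bytes:
    assert_eq!("ἀ".len(), 3);
}
```

So an index that is valid as a character offset can land mid-character when misused as a byte offset, which is exactly what produces the panics below.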
I've identified a slight problem with the `DetectionResult`s produced when using the feature `detect_multiple_languages_of`. I do appreciate this is an "experimental" feature, but it probably needs to be addressed.

To obtain the text in these fragments I'm doing this:
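(The original code block did not survive the page extraction. A minimal sketch of the slicing being described, using a hypothetical stand-in for the library's `DetectionResult` that models only the two accessors mentioned in this issue:)

```rust
// Hypothetical stand-in for the library's DetectionResult; only the
// two index accessors referenced in this issue are modelled here.
struct DetectionResult {
    start: usize,
    end: usize,
}

impl DetectionResult {
    fn start_index(&self) -> usize {
        self.start
    }
    fn end_index(&self) -> usize {
        self.end
    }
}

// Slices the text directly with the reported indices as byte offsets;
// this panics if either index falls inside a multi-byte UTF-8 character.
fn fragment<'a>(text: &'a str, result: &DetectionResult) -> &'a str {
    &text[result.start_index()..result.end_index()]
}

fn main() {
    let text = "good wind";
    let result = DetectionResult { start: 5, end: 9 };
    assert_eq!(fragment(text, &result), "wind");
}
```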
Obviously these are coordinates on the bytes in the String.
With the above, VERY occasionally, I get a panic. Examples:
"thread '<unnamed>' panicked at 'byte index 654 is not a char boundary; it is inside 'ο' (bytes 653..655) of ``ἐυηνεμον: ἐυ \"good\" + ἀνεμος \"wind\". ..."
or
"thread '<unnamed>' panicked at 'byte index 65 is not a char boundary; it is inside '\u{f0fc}' (bytes 64..67) of `` 7.3 ..."
In processing about 2 million words I only get a handful of panics, and these appear to be on VERY exotic and obscure Unicode: accented Ancient Greek or U+F0FC "Private Use Character".
But the trouble is that, currently, I'm not too sure how to "catch" such a thing before it panics: if a char boundary is violated by `[result.start_index()..result.end_index()]`, this does not produce a `Result`, it just panics! And this in turn means that the whole of the rest of the document I'm parsing never gets processed. (My documents are being parsed in parallel threads, so other documents are unaffected.)

Naturally, I'm trying to find a way to test each proposed slice to find out whether or not it's "legal". But even if I do, obviously this would mean a lot more processing time.
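For what it's worth, the standard library does offer a non-panicking way to test a proposed slice: `str::get` returns `None` for a range that is out of bounds or splits a character, and `str::is_char_boundary` performs the same check for a single index. A minimal sketch:

```rust
// str::get returns None instead of panicking when the range is out
// of bounds or does not fall on char boundaries.
fn safe_fragment(text: &str, start: usize, end: usize) -> Option<&str> {
    text.get(start..end)
}

fn main() {
    let text = "ἀνεμος"; // 'ἀ' occupies bytes 0..3
    assert_eq!(safe_fragment(text, 0, 3), Some("ἀ"));
    assert_eq!(safe_fragment(text, 0, 1), None); // index 1 is inside 'ἀ'
    assert!(text.is_char_boundary(3));
    assert!(!text.is_char_boundary(1));
}
```

Both checks are constant-time per index, so the extra processing cost should be negligible.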
So maybe you might want to think about providing indices on the chars rather than the bytes: users could then do this:
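(This code block also did not survive extraction; presumably something along these lines, rebuilding the fragment from character offsets. Note that this costs O(n) per fragment, since `chars()` has to walk the string from the start:)

```rust
// Sketch: recover a fragment from character (not byte) offsets.
fn fragment_by_chars(text: &str, start: usize, end: usize) -> String {
    text.chars().skip(start).take(end - start).collect()
}

fn main() {
    let text = "ἀνεμος"; // chars: ἀ(0) ν(1) ε(2) μ(3) ο(4) ς(5)
    assert_eq!(fragment_by_chars(text, 1, 4), "νεμ");
}
```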