Implement UnicodeSegmentation for Iterator<Item = char> #28
Comments
I will work on this.
@CryZe can you explain your
The problem is that unicode-segmentation is built around borrowing str slices from some owned data, like a String. That doesn't work well with char iterators, as those aren't stored anywhere. So to support this at all, you would need to introduce a separate streaming API that provides

    pub enum CharOrBoundary {
        Char(char),
        Boundary,
    }

So you can then iterate over the characters and it'll tell you whether you hit a boundary or not. You could then have other helper functions that help you collect char segments between the boundaries into buffers.
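The helper functions mentioned above could look roughly like the following. This is only a sketch: `collect_segments` is a hypothetical name, not part of unicode-segmentation, and it assumes some streaming segmenter already produces the `CharOrBoundary` items. It only shows how chars between boundaries might be gathered into owned buffers.

```rust
/// Item type proposed in the comment above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CharOrBoundary {
    Char(char),
    Boundary,
}

/// Hypothetical helper: gathers the chars between boundaries into owned Strings.
pub fn collect_segments<I>(items: I) -> Vec<String>
where
    I: IntoIterator<Item = CharOrBoundary>,
{
    let mut segments = Vec::new();
    let mut current = String::new();
    for item in items {
        match item {
            CharOrBoundary::Char(c) => current.push(c),
            CharOrBoundary::Boundary => {
                // Ignore empty segments, e.g. a boundary at the very start.
                if !current.is_empty() {
                    segments.push(std::mem::take(&mut current));
                }
            }
        }
    }
    // Flush trailing chars if the stream did not end on a boundary.
    if !current.is_empty() {
        segments.push(current);
    }
    segments
}
```

Because every segment is copied into its own `String`, nothing borrows from the source iterator, which is the constraint the streaming design has to work around.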
+1 to this idea, would also be useful when working with non-UTF8 strings in legacy APIs.
To clarify, I think that providing the full What we could provide, however, is something that turns an Since
I looked into it further and tried to adapt parts of the
The reason is that in the current API, extra input is "patched together" with the existing input using UTF-8 indices as a unifying abstraction, and AFAIK there is no nice equivalent in an So unless I'm missing something obvious, it seems to me that the least bad option is to stick with the existing code, collect the iterator of chars into a (possibly truncated) UTF-8 string and use
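A minimal sketch of that fallback, under stated assumptions: `collect_truncated` and `segment_collected` are hypothetical names, the byte cap is one possible reading of "possibly truncated", and the segmenter is passed in as a closure so the example needs only the standard library. In practice the closure would call whichever `&str`-based segmentation API is wanted.

```rust
/// Hypothetical helper: collects chars into an owned String, stopping before
/// the buffer would exceed `max_bytes` bytes of UTF-8.
pub fn collect_truncated<I>(chars: I, max_bytes: usize) -> String
where
    I: IntoIterator<Item = char>,
{
    let mut buf = String::new();
    for c in chars {
        // Never split a char: stop if its full encoding would not fit.
        if buf.len() + c.len_utf8() > max_bytes {
            break;
        }
        buf.push(c);
    }
    buf
}

/// Hypothetical helper: buffers the chars, then runs a `&str`-based segmenter.
/// The closure must return owned data, since the buffer is dropped on return.
pub fn segment_collected<I, F, T>(chars: I, max_bytes: usize, segmenter: F) -> T
where
    I: IntoIterator<Item = char>,
    F: FnOnce(&str) -> T,
{
    let buf = collect_truncated(chars, max_bytes);
    segmenter(&buf)
}
```

For example, `segment_collected(iter, 4096, |s| s.split_whitespace().map(String::from).collect::<Vec<_>>())` uses `split_whitespace` purely as a stand-in segmenter. Truncation can cut a segment short at the end of the buffer, so the final segment should be treated as possibly incomplete.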
It would be nice to segment character iterators, especially for interoperability with the unicode-normalization crate. This could provide a solution to #7 when/if io::Chars stabilizes. In particular, I'd like to write a tokenizer like this:

One issue I see is that most of the public structs provide an as_str method that returns "the underlying data (the part yet to be iterated) as a slice of the original string". This obviously won't work with streaming types.