New cursor-based implementation of grapheme clusters #23
Very much work in progress. See unicode-rs#21
Implemented next_boundary and prev_boundary functions in terms of is_boundary (plus fixups to the internal state when moving the cursor). Fixed various problems in previous commit. Still work in progress, not tested yet.
Also, additional state machine work, mostly resume logic in next and prev grapheme cluster boundary methods. Includes some documentation, and also a bit of renaming from the earlier development drafts.
This adds a test case for unicode-rs#19 (which was a mismatch between forward and reverse iterators in the original codebase).
Note: this fixes #19 (which, I believe, had an incorrect forward iterator in the old codebase, because it was left in the Emoji state after the first Fitzpatrick skin tone modifier).
Overall looks good. Would like more docs, especially on API usage for the cursor stuff.
src/grapheme.rs
Extended, // a break if not in extended mode
CheckCrlf, // a break unless it's a CR LF pair
Regional, // a break if preceded by an even number of RIS
Emoji, // a break if preceded by emoji base and extend
not break if
I changed this to read "a break iff not in extended mode" which I think is more precise. What you suggest is the converse of what I originally said, but both are true.
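To illustrate the rule noted on the Regional state in the snippet above (a break if preceded by an even number of RIS), here is a minimal, std-only sketch. It is not the code from this patch: `is_regional_indicator` and `break_between_ris` are hypothetical helpers, and the sketch assumes the characters on both sides of `offset` are regional indicators and that `offset` falls on a char boundary.

```rust
/// Returns true if `c` is a Regional Indicator Symbol (U+1F1E6..=U+1F1FF).
fn is_regional_indicator(c: char) -> bool {
    ('\u{1F1E6}'..='\u{1F1FF}').contains(&c)
}

/// Hypothetical helper (not part of the crate): decides whether a break is
/// allowed at byte `offset`, assuming regional indicators sit on both sides.
/// A flag is a pair of RIS, so a break is allowed only when the run of RIS
/// immediately before `offset` has even length.
fn break_between_ris(s: &str, offset: usize) -> bool {
    let preceding = s[..offset]
        .chars()
        .rev()
        .take_while(|&c| is_regional_indicator(c))
        .count();
    preceding % 2 == 0
}
```

Each RIS occupies 4 bytes in UTF-8, so for the two flags "\u{1F1FA}\u{1F1F8}\u{1F1EC}\u{1F1E7}" the helper allows a break at offset 8 but not at offsets 4 or 12. Note that this stateless version rescans the preceding run on every call.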
src/grapheme.rs
state: GraphemeState,
cat_before: Option<GraphemeCat>, // category of codepoint immediately preceding cursor
cat_after: Option<GraphemeCat>, // category of codepoint immediately after cursor
pre_context_offset: Option<usize>,
nit: needs more comments on how this works
Done
src/grapheme.rs
idx = idx + self.string[idx..].chars().next().unwrap().len_utf8();
None
impl GraphemeCursor {
/// Create a new cursor. The string and initial offset are given at creation
nit: Examples on how the whole cursor API is supposed to be used?
Done.
Some of the doc tests also encouraged me to tweak the implementation.
The existing code treated CR and LF as special cases of the Control grapheme category, for reasons that weren't very good. This patch gets rid of that and just handles GB3 in the pair lookup. That should improve performance in the rope case, as it will cut down on the amount of pre-context requested when a chunk begins with LF.
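As a rough illustration of handling GB3 in a pair lookup, here is a simplified sketch. It is an assumption-laden model rather than the patch's actual table: the `Cat` enum and `is_break` function are hypothetical, and only rules GB3, GB4, GB5, GB9 and the GB999 default from UAX #29 are covered.

```rust
/// Simplified grapheme categories for this sketch only; the crate's real
/// category type has many more variants.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Cat {
    Cr,
    Lf,
    Control,
    Extend,
    Other,
}

/// Hypothetical pair lookup: is there a break between a codepoint of
/// category `before` and one of category `after`?
fn is_break(before: Cat, after: Cat) -> bool {
    use Cat::*;
    match (before, after) {
        (Cr, Lf) => false,                        // GB3: never split a CR LF pair
        (Cr, _) | (Lf, _) | (Control, _) => true, // GB4: break after CR, LF, Control
        (_, Cr) | (_, Lf) | (_, Control) => true, // GB5: break before CR, LF, Control
        (_, Extend) => false,                     // GB9: no break before Extend
        _ => true,                                // GB999: otherwise, break
    }
}
```

With CR and LF as their own categories, the CR LF case is just the first arm of the lookup, and every other pair involving CR or LF falls through to the same arms as Control.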
Merged! Thanks so much for working on this!
This is a draft of the implementation for #21. It should be useful both for random access (determining the previous or next boundary from any starting offset) and for dealing with noncontiguous string representations such as ropes.
I'd probably want to add doc examples. Also, testing of the cursor-based API is inadequate: it's only tested through the existing iterators (which are implemented in terms of cursors). But I think it's complete enough to take a look at.
In its current form, this patch is not suitable for purely streaming use cases (as described in #7), as incomplete chunk input may cause a PreContext return even when enough input has already been provided. It would be possible to add more state to the cursor, at a cost of some additional complexity. However, this implementation should at least avoid quadratic behavior when moving the cursor through a sequence of flags.
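To make the PreContext idea concrete for chunked input such as ropes, here is a toy model. Everything in it is an assumption for illustration: the `Incomplete` enum and this stateless `is_boundary` function are hypothetical, they know only the CR LF rule, and they assume `offset` lies within the chunk on a char boundary. The cursor in this patch carries state and handles the full rule set.

```rust
/// Hypothetical error type, modeled loosely on the PreContext return
/// described above.
#[derive(Debug, PartialEq)]
enum Incomplete {
    /// The caller must supply the chunk that ends at this byte offset.
    PreContext(usize),
}

/// Toy boundary check that knows only the CR LF rule (GB3).
/// `chunk` is one piece of the text and begins at byte `chunk_start`
/// of the whole text; `offset` is an absolute byte offset inside it.
fn is_boundary(chunk: &str, chunk_start: usize, offset: usize) -> Result<bool, Incomplete> {
    let local = offset - chunk_start;
    // Only a following LF can suppress a break in this toy model.
    if chunk[local..].chars().next() != Some('\n') {
        return Ok(true);
    }
    match chunk[..local].chars().next_back() {
        // The preceding codepoint is in this chunk: no break inside CR LF.
        Some(before) => Ok(before != '\r'),
        // Start of the whole text: always a boundary.
        None if chunk_start == 0 => Ok(true),
        // The chunk begins with LF, so the previous chunk is needed.
        None => Err(Incomplete::PreContext(offset)),
    }
}
```

For a text split into the chunks "a\r" and "\nb", asking about offset 2 with only the second chunk yields `Err(Incomplete::PreContext(2))`, while asking with the whole text "a\r\nb" yields `Ok(false)`. A caller over a rope would respond to the error by fetching the chunk that ends at that offset and retrying with it.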