Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New cursor-based implementation of grapheme clusters #23

Merged
merged 7 commits into from
Mar 16, 2017

Conversation

raphlinus
Copy link
Contributor

This is a draft of the implementation for #21. It should be useful both for random access (determining the previous or next boundary from any starting offset) and for dealing with noncontiguous string representations such as ropes.

I'd probably want to add doc examples. Also, testing of the cursor-based API is inadequate, it's only tested through the existing iterators (which are implemented in terms of cursors). But I think it's complete enough to take a look at.

In its current form, this patch is not suitable for purely streaming use cases (as described in #7), as incomplete chunk input may cause a PreContext return even when enough input has already been provided. It would be possible to add more state to the cursor, at a cost of some additional complexity. However, this implementation should at least avoid quadratic behavior when moving the cursor through a sequence of flags.

Implemented next_boundary and prev_boundary functions in terms of
is_boundary (plus fixups to the internal state when moving the
cursor). Fixed various problems in previous commit.

Still work in progress, not tested yet.
Also, additional state machine work, mostly resume logic in next and
prev grapheme cluster boundary methods.

Includes some documentation, and also a bit of renaming from the
earlier development drafts.
This adds a test case for unicode-rs#19 (which was a mismatch between forward
and reverse iterators in the original codebase).
@raphlinus
Copy link
Contributor Author

Note: this fixes #19 (which, I believe, had an incorrect forward iterator in the old codebase, because it was left in the Emoji state after the first fitzpatrick skin tone modifier).

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. Would like more docs, especially on API usage for the cursor stuff.

src/grapheme.rs Outdated
Extended, // a break if not in extended mode
CheckCrlf, // a break unless it's a CR LF pair
Regional, // a break if preceded by an even number of RIS
Emoji, // a break if preceded by emoji base and extend
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not break if

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this to read "a break iff not in extended mode" which I think is more precise. What you suggest is the converse of what I originally said, but both are true.

src/grapheme.rs Outdated
state: GraphemeState,
cat_before: Option<GraphemeCat>, // category of codepoint immediately preceding cursor
cat_after: Option<GraphemeCat>, // category of codepoint immediately after cursor
pre_context_offset: Option<usize>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: needs more comments on how this works

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

src/grapheme.rs Outdated
idx = idx + self.string[idx..].chars().next().unwrap().len_utf8();
None
impl GraphemeCursor {
/// Create a new cursor. The string and initial offset are given at creation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Examples on how the whole cursor API is supposed to be used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@Manishearth
Copy link
Member

Manishearth commented Mar 13, 2017

r? @kwantam @SimonSapin

Some of the doc tests also encouraged me to tweak the implementation.
The existing code treated CR and LF as special cases of the Control
grapheme category, for reasons that weren't very good. This patch gets
rid of that and just handles GB3 in the pair lookup. That should
improve performance in the rope case, as it will cut down on the
amount of pre-context requested when a chunk begins with LF.
@Manishearth Manishearth merged commit 6fc6815 into unicode-rs:master Mar 16, 2017
@Manishearth
Copy link
Member

Merged! Thanks so much for working on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants