-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: split_long_word incorrectly splits multi-codepoint character #170
base: main
Are you sure you want to change the base?
Changes from 3 commits
796cb60
f0e3d96
1b0eb26
02fbcd4
1ed8bbe
bd07291
491457f
69a7415
910ebfc
b7700ae
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,5 @@ | ||
use unicode_width::{UnicodeWidthChar, UnicodeWidthStr}; | ||
use unicode_segmentation::UnicodeSegmentation; | ||
use unicode_width::UnicodeWidthStr; | ||
|
||
/// returns printed length of string | ||
/// if ansi feature enabled, takes into account escape codes | ||
|
@@ -22,12 +23,12 @@ pub fn split_long_word(allowed_width: usize, word: &str) -> (String, String) { | |
let mut current_width = 0; | ||
let mut parts = String::new(); | ||
|
||
let mut char_iter = word.chars().peekable(); | ||
let mut char_iter = word.graphemes(true).peekable(); | ||
// Check if the string might be too long, one character at a time. | ||
// Peek into the next char and check the exit condition. | ||
// That is, pushing the next character would result in the string being too long. | ||
while let Some(c) = char_iter.peek() { | ||
if (current_width + c.width().unwrap_or(1)) > allowed_width { | ||
if (current_width + c.width()) > allowed_width { | ||
break; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm considering if we'd use cc @Manishearth @Jules-Bertholet any downside of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. CJK will sometimes give longer (never shorter) widths than non-CJK. In the current published version of My recommendation would be to default to non-CJK, and then maybe add CJK width calculation as an option. One minor note is that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @Jules-Bertholet Thanks for your information! So, I'd keep using And I'd show my respect for all your efforts in unicode-rs. I have written Java, and it's quite a headache to handle many details in Unicode especially the width calculation (for alignment). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually, looking more at your library’s docs, it seems you do try to support some amount of non-TTY usage (with In that case, one possibility is to, instead of calculating the width if each grapheme and accumulating into There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the insights @Jules-Bertholet. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Regarding the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For the test case @Nukesor proposed, I noticed that
Not sure if unicode-rs has some functions to calculate emoji's display width "correctly" (I don't know what is a proper "correct" in this context, though). |
||
} | ||
|
||
|
@@ -36,15 +37,37 @@ pub fn split_long_word(allowed_width: usize, word: &str) -> (String, String) { | |
|
||
// We default to 1 char, if the character length cannot be determined. | ||
// The user has to live with this, if they decide to add control characters or some fancy | ||
// stuff into their tables. This is considered undefined behavior and we try to handle this | ||
// stuff into their tables. This is considered undefined behavior, and we try to handle this | ||
// to the best of our capabilities. | ||
let character_width = c.width().unwrap_or(1); | ||
let character_width = c.width(); | ||
|
||
current_width += character_width; | ||
parts.push(c); | ||
parts.push_str(c); | ||
} | ||
|
||
// Collect the remaining characters. | ||
let remaining = char_iter.collect(); | ||
(parts, remaining) | ||
} | ||
|
||
#[cfg(test)] | ||
mod tests { | ||
use super::*; | ||
|
||
#[test] | ||
fn test_split_long_word() { | ||
let emoji = "🙂↕️"; // U+1F642 U+200D U+2195 U+FE0F head shaking vertically | ||
assert_eq!(emoji.len(), 13); | ||
assert_eq!(emoji.chars().count(), 4); | ||
assert_eq!(emoji.width(), 2); | ||
|
||
let (word, remaining) = split_long_word(emoji.width(), &emoji); | ||
|
||
assert_eq!(word, "\u{1F642}\u{200D}\u{2195}\u{FE0F}"); | ||
assert_eq!(word.len(), 13); | ||
assert_eq!(word.chars().count(), 4); | ||
assert_eq!(word.width(), 2); | ||
|
||
assert!(remaining.is_empty()); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also please let the options ordered :)
They're auto-ordered via taplot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ansi-str < console in lex order so it should be ordered already?