-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: split_long_word incorrectly splits multi-codepoint character #170
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: tison <[email protected]>
Signed-off-by: tison <[email protected]>
Signed-off-by: tison <[email protected]>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #170 +/- ##
==========================================
+ Coverage 96.59% 96.62% +0.02%
==========================================
Files 16 16
Lines 1793 1806 +13
==========================================
+ Hits 1732 1745 +13
Misses 61 61 ☔ View full report in Codecov by Sentry. |
// Check if the string might be too long, one character at a time. | ||
// Peek into the next char and check the exit condition. | ||
// That is, pushing the next character would result in the string being too long. | ||
while let Some(c) = char_iter.peek() { | ||
if (current_width + c.width().unwrap_or(1)) > allowed_width { | ||
if (current_width + c.width()) > allowed_width { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm considering if we'd use width_cjk
instead of width
here.
cc @Manishearth @Jules-Bertholet any downside of width_cjk
or would it introduces some typically unexpected manner compared to width
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CJK will sometimes give longer (never shorter) widths than non-CJK. In the current published version of unicode-width
, the list of affected characters is approximately (not exactly) https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[%3AEast_Asian_Width%3DA%3A]-[%3AGeneral_Category%3DLetter%3A]-[%3AGeneral_Category%3DModifier_Symbol%3A].
My recommendation would be to default to non-CJK, and then maybe add CJK width calculation as an option.
One minor note is that unicode-width
does not guarantee that the width of a string equals the sum of the widths of its grapheme clusters. The most prominent exception to this is Arabic لا
(2 codepoints and 2 graphemes, but width of 1). However, all terminals I know of render it wrong anyway. So not something you should worry too much about for now, since it will look broken no matter what you do. But be aware that there are still edge cases, and in the future the best practice might change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Jules-Bertholet Thanks for your information!
So, I'd keep using width
until we notice certain use cases that with_cjk
would help.
And I'd show my respect for all your efforts in unicode-rs. I have written Java, and it's quite a headache to handle many details in Unicode especially the width calculation (for alignment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, looking more at your library’s docs, it seems you do try to support some amount of non-TTY usage (with tty
feature disabled). So handling Arabic properly might be worth it for you.
In that case, one possibility is to, instead of calculating the width if each grapheme and accumulating into current_width
, calculate the width of the entire word up to the nth grapheme each time. That is O(n2) though…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the insights @Jules-Bertholet.
Also TIL about CJK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the non-TTY
usage, this lib is still very much for terminal-like environments, even if they don't have tty functions to, for example, detect the current terminal width.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the test case @Nukesor proposed, I noticed that "ab🙂↕️def".width()
or width_cjk
both return 7
, but it actually take more than 7 character width in terminal:
+---------+
| test |
+=========+
| ab🙂↕️def |
+---------+
Not sure if unicode-rs has some functions to calculate emoji's display width "correctly" (I don't know what is a proper "correct" in this context, though).
This patch doesn't fix At there we use UPDATE: It seems we can handle |
Signed-off-by: tison <[email protected]>
Signed-off-by: tison <[email protected]>
Signed-off-by: tison <[email protected]>
Signed-off-by: tison <[email protected]>
Signed-off-by: tison <[email protected]>
5a4428b
to
69a7415
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heyo :)
Thanks a lot for the research and work on this topic! Much appreciated.
My comments are mostly about testing and docs, the rest looks good to me :)
-
Could you go ahead and add this regression test case to the
utf_8_characters.rs
integration test file:/// Handle emojis that're joined via the "zero-width joiner" character U+200D and contain variant /// selectors. /// /// Those composite emojis should be handled as a single grapheme and thereby have their width /// calculated based on the grapheme length instead of the individual chars. /// /// This is also a regression test, as previously emojis were split in the middle of the joiner /// sequence, resulting in two different emojis on different lines. #[test] fn zwj_utf8_word_splitting() { let mut table = Table::new(); table .set_width(8) .set_content_arrangement(ContentArrangement::Dynamic) .set_header(vec!["test"]) .add_row(vec!["ab🙂↕️def"]); println!("{table}"); let expected = " +------+ | test | +======+ | ab🙂↕️ | | def | +------+"; println!("{expected}"); assert_eq!(expected, "\n".to_string() + &table.to_string()); }
Also, it would be great if you could test other edge-cases you can come up with :)
-
I tested the performance and this actually has a pretty hefty impact the more chars are in the table (as expected).
The benchmark for a big table went from 15ms to 25ms so a +~67% performance hit.I think this is worth it, but it's something we need to be aware of.
-
UPDATE: It seems we can handle str_slice there. The AnsiCodeIterator is only related to handle escape characters.
Could you write an integration test case for that as well?
// Check if the string might be too long, one character at a time. | ||
// Peek into the next char and check the exit condition. | ||
// That is, pushing the next character would result in the string being too long. | ||
while let Some(c) = char_iter.peek() { | ||
if (current_width + c.width().unwrap_or(1)) > allowed_width { | ||
if (current_width + c.width()) > allowed_width { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the insights @Jules-Bertholet.
Also TIL about CJK.
// Check if the string might be too long, one character at a time. | ||
// Peek into the next char and check the exit condition. | ||
// That is, pushing the next character would result in the string being too long. | ||
while let Some(c) = char_iter.peek() { | ||
if (current_width + c.width().unwrap_or(1)) > allowed_width { | ||
if (current_width + c.width()) > allowed_width { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the non-TTY
usage, this lib is still very much for terminal-like environments, even if they don't have tty functions to, for example, detect the current terminal width.
Signed-off-by: tison <[email protected]>
Let me think of it. I'm not quite familiar with escape characters, though. |
This seems we pay for unicode grapheme cluster calculation :P |
Fair enough. Just let it be if it's too much of a hassle :) |
unicode-segmentation = { version = "1" } | ||
unicode-width = { version = "0.2" } | ||
|
||
# Optional dependencies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also please let the options ordered :)
They're auto-ordered via taplot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ansi-str < console in lex order so it should be ordered already?
Signed-off-by: tison <[email protected]>
Yeah .. I saw your PR in #171. Perhaps we can merge this patch first and improve later. And I just noticed that the alignment different between text and console output: there seems to be more to learn :P |
This closes #168.
cc @Nukesor @uasi