fix: split_long_word incorrectly splits multi-codepoint character #170

tisonkun · 2025-01-30T03:37:36Z

This closes #168.

cc @Nukesor @uasi

Signed-off-by: tison <[email protected]>

codecov · 2025-01-30T03:38:38Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.62%. Comparing base (6168f9f) to head (b7700ae).
Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #170      +/-   ##
==========================================
+ Coverage   96.59%   96.62%   +0.02%     
==========================================
  Files          16       16              
  Lines        1793     1806      +13     
==========================================
+ Hits         1732     1745      +13     
  Misses         61       61

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

tisonkun · 2025-01-30T03:39:19Z

src/utils/formatting/content_split/normal.rs

    // Check if the string might be too long, one character at a time.
    // Peek into the next char and check the exit condition.
    // That is, pushing the next character would result in the string being too long.
    while let Some(c) = char_iter.peek() {
-        if (current_width + c.width().unwrap_or(1)) > allowed_width {
+        if (current_width + c.width()) > allowed_width {


I'm considering if we'd use width_cjk instead of width here.

cc @Manishearth @Jules-Bertholet any downside of width_cjk or would it introduces some typically unexpected manner compared to width?

CJK will sometimes give longer (never shorter) widths than non-CJK. In the current published version of unicode-width, the list of affected characters is approximately (not exactly) https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[%3AEast_Asian_Width%3DA%3A]-[%3AGeneral_Category%3DLetter%3A]-[%3AGeneral_Category%3DModifier_Symbol%3A].

My recommendation would be to default to non-CJK, and then maybe add CJK width calculation as an option.

One minor note is that unicode-width does not guarantee that the width of a string equals the sum of the widths of its grapheme clusters. The most prominent exception to this is Arabic لا‎ (2 codepoints and 2 graphemes, but width of 1). However, all terminals I know of render it wrong anyway. So not something you should worry too much about for now, since it will look broken no matter what you do. But be aware that there are still edge cases, and in the future the best practice might change.

@Jules-Bertholet Thanks for your information!

So, I'd keep using width until we notice certain use cases that with_cjk would help.

And I'd show my respect for all your efforts in unicode-rs. I have written Java, and it's quite a headache to handle many details in Unicode especially the width calculation (for alignment).

Actually, looking more at your library’s docs, it seems you do try to support some amount of non-TTY usage (with tty feature disabled). So handling Arabic properly might be worth it for you.

In that case, one possibility is to, instead of calculating the width if each grapheme and accumulating into current_width, calculate the width of the entire word up to the nth grapheme each time. That is O(n²) though…

Thanks for the insights @Jules-Bertholet.
Also TIL about CJK.

Regarding the non-TTY usage, this lib is still very much for terminal-like environments, even if they don't have tty functions to, for example, detect the current terminal width.

For the test case @Nukesor proposed, I noticed that "ab🙂‍↕️def".width() or width_cjk both return 7, but it actually take more than 7 character width in terminal:

+---------+ | test | +=========+ | ab🙂‍↕️def | +---------+

Not sure if unicode-rs has some functions to calculate emoji's display width "correctly" (I don't know what is a proper "correct" in this context, though).

tisonkun · 2025-01-30T03:41:31Z

This patch doesn't fix split_long_word under the feature flag custom_styling.

At there we use console::AnsiCodeIterator that takes a &str directly and cannot be composed with UnicodeSegmentation. Perhaps need extra investigation.

UPDATE: It seems we can handle str_slice there. The AnsiCodeIterator is only related to handle escape characters.

Signed-off-by: tison <[email protected]>

.github/workflows/test.yml

Signed-off-by: tison <[email protected]>

Nukesor

Heyo :)

Thanks a lot for the research and work on this topic! Much appreciated.
My comments are mostly about testing and docs, the rest looks good to me :)

Could you go ahead and add this regression test case to the utf_8_characters.rs integration test file:

/// Handle emojis that're joined via the "zero-width joiner" character U+200D and contain variant
/// selectors.
///
/// Those composite emojis should be handled as a single grapheme and thereby have their width
/// calculated based on the grapheme length instead of the individual chars.
///
/// This is also a regression test, as previously emojis were split in the middle of the joiner
/// sequence, resulting in two different emojis on different lines.
#[test]
fn zwj_utf8_word_splitting() {
    let mut table = Table::new();
    table
        .set_width(8)
        .set_content_arrangement(ContentArrangement::Dynamic)
        .set_header(vec!["test"])
        .add_row(vec!["ab🙂‍↕️def"]);

    println!("{table}");
    let expected = "
+------+
| test |
+======+
| ab🙂‍↕️  |
| def  |
+------+";
    println!("{expected}");
    assert_eq!(expected, "\n".to_string() + &table.to_string());
}

Also, it would be great if you could test other edge-cases you can come up with :)

I tested the performance and this actually has a pretty hefty impact the more chars are in the table (as expected).
The benchmark for a big table went from 15ms to 25ms so a +~67% performance hit.

I think this is worth it, but it's something we need to be aware of.
UPDATE: It seems we can handle str_slice there. The AnsiCodeIterator is only related to handle escape characters.

Could you write an integration test case for that as well?

src/utils/formatting/content_split/normal.rs

Nukesor · 2025-02-02T10:37:05Z

src/utils/formatting/content_split/normal.rs

    // Check if the string might be too long, one character at a time.
    // Peek into the next char and check the exit condition.
    // That is, pushing the next character would result in the string being too long.
    while let Some(c) = char_iter.peek() {
-        if (current_width + c.width().unwrap_or(1)) > allowed_width {
+        if (current_width + c.width()) > allowed_width {


Thanks for the insights @Jules-Bertholet.
Also TIL about CJK.

Nukesor · 2025-02-02T10:39:33Z

src/utils/formatting/content_split/normal.rs

    // Check if the string might be too long, one character at a time.
    // Peek into the next char and check the exit condition.
    // That is, pushing the next character would result in the string being too long.
    while let Some(c) = char_iter.peek() {
-        if (current_width + c.width().unwrap_or(1)) > allowed_width {
+        if (current_width + c.width()) > allowed_width {


Regarding the non-TTY usage, this lib is still very much for terminal-like environments, even if they don't have tty functions to, for example, detect the current terminal width.

Signed-off-by: tison <[email protected]>

tisonkun · 2025-02-02T11:32:34Z

Could you write an integration test case for that as well?

Let me think of it. I'm not quite familiar with escape characters, though.

tisonkun · 2025-02-02T11:35:59Z

I tested the performance and this actually has a pretty hefty impact the more chars are in the table (as expected). The benchmark for a big table went from 15ms to 25ms so a +~67% performance hit.

This seems we pay for unicode grapheme cluster calculation :P

Nukesor · 2025-02-02T12:11:44Z

Let me think of it. I'm not quite familiar with escape characters, though.

Fair enough. Just let it be if it's too much of a hassle :)

Nukesor · 2025-02-02T18:28:56Z

Cargo.toml

+unicode-segmentation = { version = "1" }
+unicode-width = { version = "0.2" }
+
+# Optional dependencies


Also please let the options ordered :)

They're auto-ordered via taplot.

ansi-str < console in lex order so it should be ordered already?

Signed-off-by: tison <[email protected]>

tisonkun · 2025-02-03T09:08:28Z

Fair enough. Just let it be if it's too much of a hassle :)

Yeah .. I saw your PR in #171. Perhaps we can merge this patch first and improve later.

And I just noticed that the alignment different between text and console output: there seems to be more to learn :P

tisonkun added 3 commits January 30, 2025 11:30

build: introduce unicode-segmentation

796cb60

Signed-off-by: tison <[email protected]>

add test case

f0e3d96

Signed-off-by: tison <[email protected]>

fix it

1b0eb26

Signed-off-by: tison <[email protected]>

tisonkun commented Jan 30, 2025

View reviewed changes

tisonkun added 4 commits January 30, 2025 11:43

fmt

02fbcd4

Signed-off-by: tison <[email protected]>

for custom_styling

1ed8bbe

Signed-off-by: tison <[email protected]>

fix up comments

bd07291

Signed-off-by: tison <[email protected]>

fixup CI

491457f

Signed-off-by: tison <[email protected]>

tisonkun commented Jan 30, 2025

View reviewed changes

.github/workflows/test.yml Show resolved Hide resolved

try fix GHA matric

69a7415

Signed-off-by: tison <[email protected]>

tisonkun force-pushed the fixup-split-long-word branch from 5a4428b to 69a7415 Compare January 30, 2025 04:00

Nukesor requested changes Feb 2, 2025

View reviewed changes

address comments

910ebfc

Signed-off-by: tison <[email protected]>

Nukesor requested changes Feb 2, 2025

View reviewed changes

Nukesor mentioned this pull request Feb 2, 2025

[Bug] Panic on truncating non-ASCII content #167

Open

add multi_character_cjk_word_splitting case

b7700ae

Signed-off-by: tison <[email protected]>

tisonkun requested a review from Nukesor February 4, 2025 15:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: split_long_word incorrectly splits multi-codepoint character #170

fix: split_long_word incorrectly splits multi-codepoint character #170

tisonkun commented Jan 30, 2025 •

edited

Loading

codecov bot commented Jan 30, 2025 •

edited

Loading

tisonkun Jan 30, 2025

Jules-Bertholet Jan 30, 2025 •

edited

Loading

tisonkun Jan 30, 2025

Jules-Bertholet Jan 30, 2025

Nukesor Feb 2, 2025

Nukesor Feb 2, 2025

tisonkun Feb 2, 2025

tisonkun commented Jan 30, 2025 •

edited

Loading

Nukesor left a comment •

edited

Loading

Nukesor Feb 2, 2025

Nukesor Feb 2, 2025

tisonkun commented Feb 2, 2025

tisonkun commented Feb 2, 2025

Nukesor commented Feb 2, 2025

Nukesor Feb 2, 2025

tisonkun Feb 3, 2025

tisonkun commented Feb 3, 2025 •

edited

Loading

fix: split_long_word incorrectly splits multi-codepoint character #170

Are you sure you want to change the base?

fix: split_long_word incorrectly splits multi-codepoint character #170

Conversation

tisonkun commented Jan 30, 2025 • edited Loading

codecov bot commented Jan 30, 2025 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Jules-Bertholet Jan 30, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tisonkun commented Jan 30, 2025 • edited Loading

Nukesor left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tisonkun commented Feb 2, 2025

tisonkun commented Feb 2, 2025

Nukesor commented Feb 2, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tisonkun commented Feb 3, 2025 • edited Loading

tisonkun commented Jan 30, 2025 •

edited

Loading

codecov bot commented Jan 30, 2025 •

edited

Loading

Jules-Bertholet Jan 30, 2025 •

edited

Loading

tisonkun commented Jan 30, 2025 •

edited

Loading

Nukesor left a comment •

edited

Loading

tisonkun commented Feb 3, 2025 •

edited

Loading