Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: split_long_word incorrectly splits multi-codepoint character #170

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

tisonkun
Copy link
Contributor

@tisonkun tisonkun commented Jan 30, 2025

This closes #168.

cc @Nukesor @uasi

Copy link

codecov bot commented Jan 30, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.62%. Comparing base (6168f9f) to head (b7700ae).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #170      +/-   ##
==========================================
+ Coverage   96.59%   96.62%   +0.02%     
==========================================
  Files          16       16              
  Lines        1793     1806      +13     
==========================================
+ Hits         1732     1745      +13     
  Misses         61       61              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

// Check if the string might be too long, one character at a time.
// Peek into the next char and check the exit condition.
// That is, pushing the next character would result in the string being too long.
while let Some(c) = char_iter.peek() {
if (current_width + c.width().unwrap_or(1)) > allowed_width {
if (current_width + c.width()) > allowed_width {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm considering if we'd use width_cjk instead of width here.

cc @Manishearth @Jules-Bertholet any downside of width_cjk or would it introduces some typically unexpected manner compared to width?

Copy link

@Jules-Bertholet Jules-Bertholet Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CJK will sometimes give longer (never shorter) widths than non-CJK. In the current published version of unicode-width, the list of affected characters is approximately (not exactly) https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[%3AEast_Asian_Width%3DA%3A]-[%3AGeneral_Category%3DLetter%3A]-[%3AGeneral_Category%3DModifier_Symbol%3A].

My recommendation would be to default to non-CJK, and then maybe add CJK width calculation as an option.

One minor note is that unicode-width does not guarantee that the width of a string equals the sum of the widths of its grapheme clusters. The most prominent exception to this is Arabic لا‎ (2 codepoints and 2 graphemes, but width of 1). However, all terminals I know of render it wrong anyway. So not something you should worry too much about for now, since it will look broken no matter what you do. But be aware that there are still edge cases, and in the future the best practice might change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jules-Bertholet Thanks for your information!

So, I'd keep using width until we notice certain use cases that with_cjk would help.

And I'd show my respect for all your efforts in unicode-rs. I have written Java, and it's quite a headache to handle many details in Unicode especially the width calculation (for alignment).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, looking more at your library’s docs, it seems you do try to support some amount of non-TTY usage (with tty feature disabled). So handling Arabic properly might be worth it for you.

In that case, one possibility is to, instead of calculating the width if each grapheme and accumulating into current_width, calculate the width of the entire word up to the nth grapheme each time. That is O(n2) though…

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the insights @Jules-Bertholet.
Also TIL about CJK.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the non-TTY usage, this lib is still very much for terminal-like environments, even if they don't have tty functions to, for example, detect the current terminal width.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the test case @Nukesor proposed, I noticed that "ab🙂‍↕️def".width() or width_cjk both return 7, but it actually take more than 7 character width in terminal:

+---------+
| test    |
+=========+
| ab🙂‍↕️def |
+---------+

Not sure if unicode-rs has some functions to calculate emoji's display width "correctly" (I don't know what is a proper "correct" in this context, though).

@tisonkun
Copy link
Contributor Author

tisonkun commented Jan 30, 2025

This patch doesn't fix split_long_word under the feature flag custom_styling.

At there we use console::AnsiCodeIterator that takes a &str directly and cannot be composed with UnicodeSegmentation. Perhaps need extra investigation.

UPDATE: It seems we can handle str_slice there. The AnsiCodeIterator is only related to handle escape characters.

Signed-off-by: tison <[email protected]>
Signed-off-by: tison <[email protected]>
Signed-off-by: tison <[email protected]>
Signed-off-by: tison <[email protected]>
Signed-off-by: tison <[email protected]>
@tisonkun tisonkun force-pushed the fixup-split-long-word branch from 5a4428b to 69a7415 Compare January 30, 2025 04:00
Copy link
Owner

@Nukesor Nukesor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heyo :)

Thanks a lot for the research and work on this topic! Much appreciated.
My comments are mostly about testing and docs, the rest looks good to me :)

  1. Could you go ahead and add this regression test case to the utf_8_characters.rs integration test file:

    /// Handle emojis that're joined via the "zero-width joiner" character U+200D and contain variant
    /// selectors.
    ///
    /// Those composite emojis should be handled as a single grapheme and thereby have their width
    /// calculated based on the grapheme length instead of the individual chars.
    ///
    /// This is also a regression test, as previously emojis were split in the middle of the joiner
    /// sequence, resulting in two different emojis on different lines.
    #[test]
    fn zwj_utf8_word_splitting() {
        let mut table = Table::new();
        table
            .set_width(8)
            .set_content_arrangement(ContentArrangement::Dynamic)
            .set_header(vec!["test"])
            .add_row(vec!["ab🙂‍↕️def"]);
    
        println!("{table}");
        let expected = "
    +------+
    | test |
    +======+
    | ab🙂‍↕️  |
    | def  |
    +------+";
        println!("{expected}");
        assert_eq!(expected, "\n".to_string() + &table.to_string());
    }

    Also, it would be great if you could test other edge-cases you can come up with :)

  2. I tested the performance and this actually has a pretty hefty impact the more chars are in the table (as expected).
    The benchmark for a big table went from 15ms to 25ms so a +~67% performance hit.

    I think this is worth it, but it's something we need to be aware of.

  3. UPDATE: It seems we can handle str_slice there. The AnsiCodeIterator is only related to handle escape characters.

    Could you write an integration test case for that as well?

src/utils/formatting/content_split/normal.rs Show resolved Hide resolved
// Check if the string might be too long, one character at a time.
// Peek into the next char and check the exit condition.
// That is, pushing the next character would result in the string being too long.
while let Some(c) = char_iter.peek() {
if (current_width + c.width().unwrap_or(1)) > allowed_width {
if (current_width + c.width()) > allowed_width {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the insights @Jules-Bertholet.
Also TIL about CJK.

// Check if the string might be too long, one character at a time.
// Peek into the next char and check the exit condition.
// That is, pushing the next character would result in the string being too long.
while let Some(c) = char_iter.peek() {
if (current_width + c.width().unwrap_or(1)) > allowed_width {
if (current_width + c.width()) > allowed_width {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the non-TTY usage, this lib is still very much for terminal-like environments, even if they don't have tty functions to, for example, detect the current terminal width.

Signed-off-by: tison <[email protected]>
@tisonkun
Copy link
Contributor Author

tisonkun commented Feb 2, 2025

Could you write an integration test case for that as well?

Let me think of it. I'm not quite familiar with escape characters, though.

@tisonkun
Copy link
Contributor Author

tisonkun commented Feb 2, 2025

I tested the performance and this actually has a pretty hefty impact the more chars are in the table (as expected). The benchmark for a big table went from 15ms to 25ms so a +~67% performance hit.

This seems we pay for unicode grapheme cluster calculation :P

@Nukesor
Copy link
Owner

Nukesor commented Feb 2, 2025

Let me think of it. I'm not quite familiar with escape characters, though.

Fair enough. Just let it be if it's too much of a hassle :)

Comment on lines +56 to +59
unicode-segmentation = { version = "1" }
unicode-width = { version = "0.2" }

# Optional dependencies
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please let the options ordered :)

They're auto-ordered via taplot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ansi-str < console in lex order so it should be ordered already?

@tisonkun
Copy link
Contributor Author

tisonkun commented Feb 3, 2025

Fair enough. Just let it be if it's too much of a hassle :)

Yeah .. I saw your PR in #171. Perhaps we can merge this patch first and improve later.

And I just noticed that the alignment different between text and console output: there seems to be more to learn :P

@tisonkun tisonkun requested a review from Nukesor February 4, 2025 15:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] split_long_word incorrectly splits multi-codepoint character
3 participants