fix: split_long_word incorrectly splits multi-codepoint character #170

Open
wants to merge 10 commits into
base: main
Changes from 3 commits
7 changes: 5 additions & 2 deletions Cargo.toml
@@ -53,15 +53,18 @@ debug = []
integration_test = []

[dependencies]
+unicode-segmentation = { version = "1" }
+unicode-width = { version = "0.2" }
+
+# Optional dependencies
Comment on lines +56 to +59
Owner

Also, please keep the options ordered :)

They're auto-ordered via taplo.

Contributor Author

ansi-str < console in lexicographic order, so it should already be ordered?

ansi-str = { version = "0.8", optional = true }
console = { version = "0.15", optional = true }
-unicode-width = "0.2"

[dev-dependencies]
criterion = "0.5"
pretty_assertions = "1"
proptest = "1"
-rand = "0.8"
+rand = "0.9"
rstest = "0.24"

# We don't need any of the default features for crossterm.
8 changes: 4 additions & 4 deletions benches/build_large_table.rs
@@ -2,11 +2,11 @@ use criterion::{criterion_group, criterion_main, Criterion};

use comfy_table::presets::UTF8_FULL;
use comfy_table::*;
-use rand::distributions::Alphanumeric;
+use rand::distr::Alphanumeric;
use rand::Rng;

/// Create a dynamic 10x500 Table with width 300 and unevenly distributed content.
-/// There're no constriant, the content simply has to be formatted to fit as good as possible into
+/// There are no constraint, the content simply has to be formatted to fit as good as possible into
/// the given space.
fn build_huge_table() {
let mut table = Table::new();
@@ -16,12 +16,12 @@ fn build_huge_table() {
.set_width(300)
.set_header(vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);

-let mut rng = rand::thread_rng();
+let mut rng = rand::rng();
// Create a 10x10 grid
for _ in 0..500 {
let mut row = Vec::new();
for _ in 0..10 {
-let string_length = rng.gen_range(2..100);
+let string_length = rng.random_range(2..100);
let random_string: String = (&mut rng)
.sample_iter(&Alphanumeric)
.take(string_length)
35 changes: 29 additions & 6 deletions src/utils/formatting/content_split/normal.rs
@@ -1,4 +1,5 @@
-use unicode_width::{UnicodeWidthChar, UnicodeWidthStr};
+use unicode_segmentation::UnicodeSegmentation;
+use unicode_width::UnicodeWidthStr;

/// returns printed length of string
/// if ansi feature enabled, takes into account escape codes
@@ -22,12 +23,12 @@ pub fn split_long_word(allowed_width: usize, word: &str) -> (String, String) {
let mut current_width = 0;
let mut parts = String::new();

-let mut char_iter = word.chars().peekable();
+let mut char_iter = word.graphemes(true).peekable();
// Check if the string might be too long, one character at a time.
// Peek into the next char and check the exit condition.
// That is, pushing the next character would result in the string being too long.
while let Some(c) = char_iter.peek() {
-if (current_width + c.width().unwrap_or(1)) > allowed_width {
+if (current_width + c.width()) > allowed_width {
break;
Contributor Author

I'm considering whether we should use width_cjk instead of width here.

cc @Manishearth @Jules-Bertholet: is there any downside to width_cjk, or would it introduce unexpected behavior compared to width?

Jules-Bertholet Jan 30, 2025

CJK will sometimes give longer (never shorter) widths than non-CJK. In the current published version of unicode-width, the list of affected characters is approximately (not exactly) https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[%3AEast_Asian_Width%3DA%3A]-[%3AGeneral_Category%3DLetter%3A]-[%3AGeneral_Category%3DModifier_Symbol%3A].

My recommendation would be to default to non-CJK, and then maybe add CJK width calculation as an option.

One minor note is that unicode-width does not guarantee that the width of a string equals the sum of the widths of its grapheme clusters. The most prominent exception to this is Arabic لا‎ (2 codepoints and 2 graphemes, but width of 1). However, all terminals I know of render it wrong anyway. So not something you should worry too much about for now, since it will look broken no matter what you do. But be aware that there are still edge cases, and in the future the best practice might change.
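
One way to read the suggestion above (default to non-CJK, with CJK as an opt-in) is as a small mode switch. The sketch below is std-only and purely illustrative: `WidthMode`, `measure`, and the two-character `SAMPLE_AMBIGUOUS` table are hypothetical names, and the toy rule stands in for a real width library rather than reproducing it.

```rust
/// How ambiguous-width characters are measured (hypothetical option type).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Default)]
enum WidthMode {
    /// Ambiguous characters take 1 column (the suggested default).
    #[default]
    Standard,
    /// Ambiguous characters take 2 columns, as in CJK contexts.
    Cjk,
}

/// Tiny hard-coded sample of East Asian Ambiguous characters (illustration only).
const SAMPLE_AMBIGUOUS: [char; 2] = ['\u{00A7}', '\u{00B1}']; // § and ±

/// Toy measurement: 1 column per char, or 2 for the sample ambiguous chars in CJK mode.
/// A real implementation would call into a width library instead.
fn measure(text: &str, mode: WidthMode) -> usize {
    text.chars()
        .map(|c| match mode {
            WidthMode::Cjk if SAMPLE_AMBIGUOUS.contains(&c) => 2,
            _ => 1,
        })
        .sum()
}
```

Under this toy rule, CJK mode can only make a string wider, never narrower, which matches the behavior described in the comment.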

Contributor Author

@Jules-Bertholet Thanks for the information!

So, I'd keep using width until we find use cases where width_cjk would help.

Also, I'd like to express my respect for all your work on unicode-rs. I have written Java, and handling the many details of Unicode there, especially width calculation (for alignment), is quite a headache.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, looking more at your library’s docs, it seems you do try to support some amount of non-TTY usage (with tty feature disabled). So handling Arabic properly might be worth it for you.

In that case, one possibility is, instead of calculating the width of each grapheme and accumulating into current_width, to calculate the width of the entire word up to the nth grapheme each time. That is O(n²) though…
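
A rough sketch of that prefix re-measuring idea, std-only so it can be read in isolation. Everything here is an assumption for illustration: `split_by_prefix_width`, `toy_width`, and `char_boundaries` are hypothetical helpers, char boundaries stand in for real grapheme boundaries, and `toy_width` only models the lam + alef case mentioned earlier rather than the actual unicode-width algorithm.

```rust
/// Split `word` so that the measured width of the first part stays within `allowed_width`.
///
/// The width of the whole prefix is re-measured at every candidate boundary instead of
/// summing per-unit widths, so the total cost is O(n^2). This assumes the measured
/// width never shrinks as the prefix grows.
///
/// `boundaries` are byte offsets where a split is allowed (in practice these would come
/// from a grapheme segmenter) and `width_of` is the string-width function.
fn split_by_prefix_width(
    allowed_width: usize,
    word: &str,
    boundaries: &[usize],
    width_of: impl Fn(&str) -> usize,
) -> (String, String) {
    let mut split_at = 0;
    for &end in boundaries {
        // Measure the whole prefix so context-dependent widths are respected.
        if width_of(&word[..end]) > allowed_width {
            break;
        }
        split_at = end;
    }
    (word[..split_at].to_string(), word[split_at..].to_string())
}

/// Toy width function (assumption): every char is 1 column, except that an alef
/// (U+0627) directly after a lam (U+0644) adds no column, so the pair measures 1.
fn toy_width(s: &str) -> usize {
    let mut width = 0;
    let mut prev_was_lam = false;
    for c in s.chars() {
        if c == '\u{0627}' && prev_was_lam {
            prev_was_lam = false;
            continue;
        }
        width += 1;
        prev_was_lam = c == '\u{0644}';
    }
    width
}

/// Byte offsets just after each char; a stand-in for real grapheme boundaries.
fn char_boundaries(s: &str) -> Vec<usize> {
    s.char_indices().map(|(i, c)| i + c.len_utf8()).collect()
}
```

With `allowed_width = 2` and the word `"a\u{0644}\u{0627}b"`, the prefix approach keeps lam and alef together in the first part, whereas summing one column per unit would have stopped after the lam.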

Owner

Thanks for the insights @Jules-Bertholet.
Also TIL about CJK.

Owner

Regarding the non-TTY usage, this lib is still very much for terminal-like environments, even if they don't have tty functions to, for example, detect the current terminal width.

Contributor Author

For the test case @Nukesor proposed, I noticed that "ab🙂‍↕️def".width() and width_cjk() both return 7, but the string actually takes up more than 7 columns in the terminal:

+---------+
| test    |
+=========+
| ab🙂‍↕️def |
+---------+

Not sure whether unicode-rs has a function that calculates an emoji's display width "correctly" (though I don't know what "correct" properly means in this context).

}

@@ -36,15 +37,37 @@ pub fn split_long_word(allowed_width: usize, word: &str) -> (String, String) {

// We default to 1 char, if the character length cannot be determined.
// The user has to live with this, if they decide to add control characters or some fancy
-// stuff into their tables. This is considered undefined behavior and we try to handle this
+// stuff into their tables. This is considered undefined behavior, and we try to handle this
// to the best of our capabilities.
-let character_width = c.width().unwrap_or(1);
+let character_width = c.width();

current_width += character_width;
-parts.push(c);
+parts.push_str(c);
}

// Collect the remaining characters.
let remaining = char_iter.collect();
(parts, remaining)
}

#[cfg(test)]
mod tests {
use super::*;

#[test]
fn test_split_long_word() {
let emoji = "🙂‍↕️"; // U+1F642 U+200D U+2195 U+FE0F head shaking vertically
assert_eq!(emoji.len(), 13);
assert_eq!(emoji.chars().count(), 4);
assert_eq!(emoji.width(), 2);

let (word, remaining) = split_long_word(emoji.width(), &emoji);

assert_eq!(word, "\u{1F642}\u{200D}\u{2195}\u{FE0F}");
assert_eq!(word.len(), 13);
assert_eq!(word.chars().count(), 4);
assert_eq!(word.width(), 2);

assert!(remaining.is_empty());
}
}
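
For context on what the test above guards against, here is a std-only sketch of the per-`char` loop that the diff replaces. The names `toy_char_width` and `split_per_char` are hypothetical, and the per-code-point widths are hard-coded assumptions (2 for U+1F642, 0 for U+200D and U+FE0F, 1 otherwise) rather than calls into unicode-width.

```rust
/// Assumed per-code-point widths for this illustration only.
fn toy_char_width(c: char) -> usize {
    match c {
        '\u{1F642}' => 2,             // slightly smiling face
        '\u{200D}' | '\u{FE0F}' => 0, // zero-width joiner, variation selector
        _ => 1,
    }
}

/// Walk the word one `char` at a time and stop before the width limit is exceeded.
/// Because the unit is a code point, the cut can land inside a multi-codepoint emoji.
fn split_per_char(allowed_width: usize, word: &str) -> (String, String) {
    let mut current_width = 0;
    let mut head = String::new();
    let mut chars = word.chars().peekable();
    while let Some(&c) = chars.peek() {
        if current_width + toy_char_width(c) > allowed_width {
            break;
        }
        current_width += toy_char_width(c);
        head.push(c);
        chars.next();
    }
    // Everything not consumed becomes the remainder.
    (head, chars.collect())
}
```

With an allowed width of 2, this loop cuts the four-code-point emoji after the zero-width joiner, leaving the arrow and variation selector in the remainder. Iterating over grapheme clusters, as the diff does, keeps the sequence whole.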