ZWSP (\U200B) is rendered as a space when in 'grapheme clusters' mode #18267

juliannoble · 2024-12-01T07:07:08Z

Testing with latest canary.

For ZWSP (\U200B) there appears to be no way to have it non-printing while using the 'Grapheme clusters' mode.
You have to use wcswidth to fix the behaviour this issue describes and then forgo grapheme clusters.

If grapheme clusters is intended to be (as described) the 'modern' way - then this bug would still seem to be present.

It's unclear to me how

/dup #1472
/dup #8000

address this specific issue.

Originally posted by @juliannoble in #11850

o-sdn-o · 2024-12-01T07:34:34Z

Repro steps:

1. Set WT's text measurement mode to Grapheme clusters.
2. Restart WT if the mode has changed.
3. Run the following in PowerShell:
```
"test`u{200B}test"  
```

Output:

Actual:
```
test test
```
Expected:
```
testtest
```
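For reference, the character's Unicode properties can be checked with Python's standard `unicodedata` module. This is only an illustration of why U+200B would be expected to take no column of its own; it does not show how Windows Terminal measures text internally.

```python
import unicodedata

ZWSP = "\u200b"

# U+200B ZERO WIDTH SPACE is a format character (general category Cf),
# not a space separator (Zs) like U+0020.
print(unicodedata.name(ZWSP))
print(unicodedata.category(ZWSP))
print(unicodedata.category(" "))

s = "test" + ZWSP + "test"
# Nine code points, but only eight of them are expected to take a column.
visible = [ch for ch in s if unicodedata.category(ch) != "Cf"]
print(len(s), len(visible))
```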

Example of using ZWSP

https://en.wikipedia.org/wiki/Thai_script#Orthography

Thai letters do not have upper- and lower-case forms like Latin letters do. Spaces between words are not used, except in certain linguistically motivated cases.

... Because of the absence of spaces, in computer typography line breaks have to be inserted manually, otherwise a long sentence will not break into new lines. Some computer input methods insert a zero-width space as a word break instead, which then allows long sentences to break into multiple lines, ...
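A minimal sketch of the idea in the quote, assuming a greedy wrapper that only breaks where a ZWSP marks a word boundary. The function name `wrap_at_zwsp` is hypothetical, and the one-column-per-character assumption holds for the ASCII demo but not for Thai text with combining marks.

```python
ZWSP = "\u200b"

def wrap_at_zwsp(text: str, columns: int) -> list[str]:
    # Greedy wrapping: ZWSP is the only permitted break opportunity.
    # Assumes every character is exactly one column wide (demo only).
    lines: list[str] = []
    current = ""
    for word in text.split(ZWSP):
        if current and len(current) + len(word) > columns:
            lines.append(current)   # the line is full, start a new one
            current = word
        else:
            current += word         # words are joined without a visible gap
    if current:
        lines.append(current)
    return lines

demo = ZWSP.join(["alpha", "beta", "gamma", "delta"])
print(wrap_at_zwsp(demo, 10))  # ['alphabeta', 'gammadelta']
```

The ZWSP never shows up in the output; it only tells the wrapper where a break is allowed.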

Thai text with tokenized words using ZWSP:

powershell:

"มวยไทยเป็นกีฬาประจำชาติไทย นักมวยไทยมักจะเป็นแชมเปียนระดับไลท์เวทของสมาคมมวยโลกเสมอ ปลายคริสต์ศตวรรษที่ 19 ประเทศไทยรับเอากีฬาจากชาติตะวันตกเข้ามาหลายชนิด โดยเริ่มมีการแข่งขันในโรงเรียนในต้นคริสต์ศตวรรษที่ 20 ตามมาด้วยในระบบการศึกษาสมัยใหม่"

Expected WT output (wcswidth text measurement mode):

Actual WT output in Grapheme clusters mode:

lhecker · 2024-12-03T00:42:51Z

Your example of Thai script makes me believe that we should indeed treat the ZWSP like any other extender. We can't treat it as a standalone zero-width cell, since those cannot exist in a terminal. And we can't leave it as-is, because it's clearly quite important for languages like Thai.
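A rough sketch of what "treat it like an extender" means, assuming a heavily simplified clustering rule. The real segmentation rules are in UAX #29 and are more involved; the helper names `is_extender` and `clusters` are hypothetical and not taken from the Windows Terminal code.

```python
import unicodedata

def is_extender(ch: str) -> bool:
    # Simplification: combining marks (Mn, Me) and format characters (Cf),
    # such as U+200B, attach to the preceding cluster.
    return unicodedata.category(ch) in ("Mn", "Me", "Cf")

def clusters(text: str) -> list[str]:
    out: list[str] = []
    for ch in text:
        if out and is_extender(ch):
            out[-1] += ch    # join with the cluster that is already there
        else:
            out.append(ch)   # start a new cluster
    return out

cells = clusters("test\u200btest")
print(len(cells))               # 8
print(cells[3] == "t\u200b")    # True: the ZWSP rides along with the first "t"
```

With this rule the ZWSP never gets a cell of its own, so the string occupies eight columns.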

o-sdn-o · 2024-12-03T09:06:39Z

> we should indeed treat the ZWSP like any other extender.

FYI, I haven't encountered any inconsistencies yet while leaving ZWSP as part of the preceding grapheme cluster. It seems that DirectWrite rasterization is not affected by the presence of ZWSP at the end of the cluster, although there may be some edge cases that I'm not aware of.

DHowett · 2024-12-03T18:58:31Z

> leaving ZWSP as part of the preceding grapheme cluster.

> we should indeed treat the ZWSP like any other extender.

You two are saying the same thing. :)

juliannoble · 2024-12-09T04:25:53Z

Thanks for the fix.
Just a note for the record regarding a remaining difference from some other terminals.

A leading ZWSP still shows up as a space, so the next character lands in column 2.
Perhaps a leading ZWSP can occur during line wrapping.
Experimentally it seems to: widening the terminal one column at a time (three times in my test) can produce a temporary additional space as the text crosses the point where the ZWSP ends up as the first character on the next line.

This differs from some other Windows terminals I tried, such as WezTerm, but those often seem to misreport the cursor position after emitting a ZWSP, so I'm not necessarily saying they are role models in this area (e.g. they may line-wrap too early because of this, which is even uglier).

It also differs from the default FreeBSD console behaviour.
I don't know whether this edge case is enough of a concern to count as a 'bug', or what the correct behaviour should really be, but it could cause layout anomalies compared to other terminals.

juliannoble · 2024-12-09T04:52:28Z

Re-testing in a different way, it only seems to be an issue if the ZWSP happened to fall at column 1 in the first place.
Then the visible space remains even as the terminal width is changed.
Potentially a very intermittent source of layout surprises.

A Thai sequence joined with ZWSP (taken from o-sdn-o's example) seems to behave OK even at the edge of the screen, so I guess if this is a bug it's probably of fairly low importance.

lhecker · 2024-12-09T18:45:51Z

That's an issue we could fix in the future. Personally, I'm not yet convinced that it's an important edge case to fix, because a lone zero-width character in the first column should be a very rare occurrence.

It happens because, before inserting anything into our text buffer, we check whether it joins with the character that is already there and, if so, merge it with the new input. For everything else, we clamp the width of each grapheme cluster to a value between 1 and 2. Since a zero-width character can't join with anything in the first column, it is measured as its own cluster, which results in a width of 1.
We do this primarily because our architecture assumes O(1) lookups in the text buffer, and the only way to make that work is to assume that each grapheme cluster has a width of 1-2. The secondary reason is that inserting the ZWSP there would put it in front of the first column, assigning it column -1, which should not be possible. Put differently, if you're at column 1 and iterate to the previous grapheme cluster, you would end up at column 0 (the start of the line), but after the ZWSP. It's a problem I don't have a good solution for other than "add a bunch of special conditions", which gave me the gut feeling that "the problem has not been fundamentally solved".
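The clamping described above can be sketched as follows. This is an illustration under stated assumptions, not the actual Windows Terminal code: `natural_width` is a hypothetical measurement that counts wide East Asian characters as 2 and marks and format characters as 0, and `cell_width` applies the 1..2 clamp.

```python
import unicodedata

def natural_width(cluster: str) -> int:
    # Hypothetical measurement: marks and format characters add nothing,
    # wide/fullwidth characters add 2, everything else adds 1.
    total = 0
    for ch in cluster:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

def cell_width(cluster: str) -> int:
    # Every stored cluster is forced into the range 1..2.
    return max(1, min(2, natural_width(cluster)))

print(cell_width("t\u200b"))  # 1: ZWSP merged into the preceding "t"
print(cell_width("\u200b"))   # 1: a lone ZWSP still occupies a column
print(cell_width("\u6f22"))   # 2: a wide CJK character
```

The second call is the first-column case: with nothing to join, the ZWSP has a natural width of 0, and the clamp raises it to 1.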

I filed an issue: #18296

microsoft-github-policy-service bot added the Needs-Triage and Needs-Tag-Fix labels Dec 1, 2024

lhecker mentioned this issue Dec 4, 2024: Fix clustering of gc=Cf, GCB=CN codepoints #18285 (Merged)

microsoft-github-policy-service bot added the In-PR label Dec 4, 2024

carlos-zamora closed this as completed in #18285 (commit 0961a77) Dec 5, 2024

lhecker mentioned this issue Dec 9, 2024: A lone zero-width character in the first column occupies 1 column #18296 (Open)
