ZWSP (\U200B) is rendered as a space when in 'grapheme clusters' mode #18267

Closed
juliannoble opened this issue Dec 1, 2024 · 7 comments · Fixed by #18285
Labels
In-PR This issue has a related PR Needs-Tag-Fix Doesn't match tag requirements Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting

Comments

@juliannoble

Testing with latest canary.

There appears to be no way to make ZWSP (U+200B) non-printing while using the 'Grapheme clusters' mode.
The only way to fix the behaviour this issue describes is to switch to wcswidth, which means forgoing grapheme clusters.

If grapheme clusters mode is intended to be (as described) the 'modern' way, then this bug still seems to be present.

It's unclear to me how "/dup #1472" and "/dup #8000" address this specific issue.

Originally posted by @juliannoble in #11850

@microsoft-github-policy-service microsoft-github-policy-service bot added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Dec 1, 2024
@o-sdn-o

o-sdn-o commented Dec 1, 2024

Repro steps:

  • Set WT's text measurement mode to Grapheme clusters.
  • Restart WT if the mode has changed.
  • Run the following in PowerShell:
    "test`u{200B}test"  

Output:

  • Actual:
    test test
    
  • Expected:
    testtest
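
To make the repro concrete, here is a minimal Python sketch (not part of the original report) that builds the same string and inspects the ZWSP using only the standard `unicodedata` module. The name `expected_columns` is purely illustrative; it encodes the expectation stated above that the ZWSP should take up no column.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE

s = "test" + ZWSP + "test"  # same string as the PowerShell repro

print(unicodedata.name(ZWSP))      # ZERO WIDTH SPACE
print(unicodedata.category(ZWSP))  # Cf (format character)
print(len(s))                      # 9 code points

# Columns the report expects: every code point except the ZWSP.
expected_columns = sum(1 for ch in s if ch != ZWSP)
print(expected_columns)            # 8
```

The string has 9 code points but should occupy 8 columns; the bug shows up as a ninth, visible column.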
    
Example of using ZWSP

https://en.wikipedia.org/wiki/Thai_script#Orthography

Thai letters do not have upper- and lower-case forms like Latin letters do. Spaces between words are not used, except in certain linguistically motivated cases.

... Because of the absence of space, in computer typography, the line breaks have to be inserted manually, otherwise a long sentence will not break into new lines. Some computer input methods have put zero-width space instead for word break, which would then break the long sentences into multiple lines, ...

Thai text with tokenized words using ZWSP:

  • PowerShell:
    "มวยไทย​เป็น​กีฬา​ประจำ​ชาติ​ไทย​ นัก​มวยไทย​มัก​จะ​เป็น​แช​ม​เปีย​นระ​ดับ​ไลท์เวท​ของ​สมาคม​มวย​โลก​เสมอ ​ปลาย​คริสต์​ศตวรรษ​ที่​ 19​ ประเทศไทย​รับ​เอา​กีฬา​จาก​ชาติ​ตะวัน​ตก​เข้า​มา​หลาย​ชนิด​ โดย​เริ่ม​มี​การ​แข่งขัน​ใน​โรงเรียน​ใน​ต้น​คริสต์​ศตวรรษ​ที่​ 20​ ตาม​มา​ด้วย​ใน​ระบบ​การ​ศึกษา​สมัย​ใหม่"

Expected WT output (wcswidth text measurement mode):
Image

Actual WT output in Grapheme clusters mode:
Image
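
As an illustration of why the ZWSP matters for Thai, the following Python sketch wraps text only at ZWSP positions. It is a rough approximation under stated assumptions, not how Windows Terminal wraps text: `display_width` and `wrap_at_zwsp` are hypothetical helpers, width is estimated by skipping combining marks and format characters, wide (East Asian) characters are ignored, and a word longer than the line simply overflows.

```python
import unicodedata

ZWSP = "\u200b"

def display_width(text):
    """Rough column count: skip combining marks (Mn/Me) and format chars (Cf)."""
    return sum(1 for ch in text
               if unicodedata.category(ch) not in ("Mn", "Me", "Cf"))

def wrap_at_zwsp(text, columns):
    """Greedy wrap that only breaks where the text contains a ZWSP."""
    lines, current = [], ""
    for word in text.split(ZWSP):
        if current and display_width(current + word) > columns:
            lines.append(current)
            current = word
        else:
            current += word
    if current:
        lines.append(current)
    return lines

# First three words of the Thai example above, joined with ZWSP.
sample = "มวยไทย" + ZWSP + "เป็น" + ZWSP + "กีฬา"
for line in wrap_at_zwsp(sample, 9):
    print(line, display_width(line))
```

Without the ZWSP hints the sample is one unbreakable run, which is why rendering them as visible spaces changes both the appearance and the layout of the text.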

@lhecker
Member

lhecker commented Dec 3, 2024

Your example of Thai script makes me believe that we should indeed treat the ZWSP like any other extender. We can't treat it as a standalone zero-width cell, since those cannot exist in a terminal. And we can't leave it as-is, because it's clearly quite important for languages like Thai.

@o-sdn-o

o-sdn-o commented Dec 3, 2024

> we should indeed treat the ZWSP like any other extender.

FYI, I haven't encountered any inconsistencies yet while leaving ZWSP as part of the preceding grapheme cluster. It seems that DirectWrite rasterization is not affected by the presence of ZWSP at the end of the cluster, although there may be some edge cases that I'm not aware of.

@DHowett
Member

DHowett commented Dec 3, 2024

> leaving ZWSP as part of the preceding grapheme cluster.

> we should indeed treat the ZWSP like any other extender.

You two are saying the same thing. :)
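
Both phrasings amount to attaching the ZWSP to whatever cluster precedes it. A minimal Python sketch of that idea follows; it is a deliberately simplified stand-in for real grapheme segmentation, and `is_extender` and `clusters` are hypothetical names, not functions from Windows Terminal.

```python
import unicodedata

ZWSP = "\u200b"

def is_extender(ch):
    # Assumption for this sketch: combining marks plus ZWSP count as extenders.
    return ch == ZWSP or unicodedata.category(ch) in ("Mn", "Me")

def clusters(text):
    """Split text into clusters, attaching extenders to the preceding one."""
    out = []
    for ch in text:
        if out and is_extender(ch):
            out[-1] += ch   # joins the existing cluster
        else:
            out.append(ch)  # starts a new cluster
    return out

result = clusters("test" + ZWSP + "test")
print(len(result))                                 # 8
print(result[3].encode("unicode_escape").decode())  # t\u200b
```

The ZWSP ends up inside the fourth cluster, so the string yields 8 clusters instead of 9. An extender with nothing before it still starts its own cluster, which is the leading-ZWSP case discussed later in the thread.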

@juliannoble
Author

Thanks for the fix.
Just a note for the record regarding a remaining difference from some other terminals.

A leading ZWSP still shows up as a space, so the next character ends up in column 2.
A leading ZWSP can presumably also occur as a result of line wrapping.
Experimentally it seems to: incrementing the terminal width by one, three times in a row, can produce a temporary additional space as the width crosses the point where the ZWSP ends up as the first character on the next line.

This differs from some other Windows terminals I tried, such as WezTerm, but those terminals often seem to misreport the cursor position after emitting a ZWSP, so I'm not necessarily saying they are role models in this area (e.g., they may line-wrap too early because of this, which is even uglier).

It also differs from the default FreeBSD console behaviour.
I don't know whether this edge case is enough of a concern to count as a 'bug', or what the correct behaviour should really be, but I suppose it can cause layout anomalies compared to other terminals.

@juliannoble
Author

juliannoble commented Dec 9, 2024

Re-testing in a different way, it only seems to be an issue if the ZWSP fell at column 1 in the first place.
In that case the visible space remains even as the terminal width is changed.
Potentially a very intermittent source of layout surprises.

A Thai sequence joined with ZWSP (taken from o-sdn-o's example) seems to behave OK even at the edge of the screen, so I guess if this is a bug it's probably of fairly low importance.

@lhecker
Copy link
Member

lhecker commented Dec 9, 2024

That's an issue we could fix in the future. Personally, I'm not yet convinced that it's an important edge case to fix, because I feel that a lone zero-width character in the first column should be a very rare occurrence.

It happens because, before inserting anything into our text buffer, we check whether the new input joins with the already existing character and, if so, merge the two. For all other characters we clamp the width of each grapheme cluster to a value between 1 and 2. Since a zero-width character can't join with anything in the first column, it is measured as its own cluster, which results in a width of 1.
We do this primarily because our architecture assumes O(1) lookups in the text buffer, and the only way to make that work is to assume that each grapheme cluster has a width of 1-2. The secondary reason is that inserting the ZWSP there would put it in front of the first column, assigning it column -1, which is not something that should be possible. Put differently, if you're at column 1 and iterate to the previous grapheme cluster, you would end up at column 0 (the start of the line), but after the ZWSP. It's a problem I don't have a good solution for other than "add a bunch of special conditions", which gave me the gut feeling that "the problem has not been fundamentally solved".
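
The join-then-clamp behaviour described above can be sketched in Python as follows. This is a toy model under stated assumptions, not the actual text buffer code: `raw_width`, `layout`, and `total_columns` are hypothetical, only the ZWSP is treated as zero-width, and columns are counted from 1 to match the "column 2" observation earlier in the thread.

```python
ZWSP = "\u200b"

def raw_width(cluster):
    # Hypothetical measurement: ZWSP contributes 0 columns, everything else 1.
    return sum(0 if ch == ZWSP else 1 for ch in cluster)

def layout(text):
    """Return (cluster, start_column, width) tuples, columns counted from 1."""
    cells = []
    for ch in text:
        if ch == ZWSP and cells:
            # Joins with the existing character: merged, no extra column.
            cluster, col, width = cells[-1]
            cells[-1] = (cluster + ch, col, width)
        else:
            # Standalone cluster: width is clamped to the range 1..2.
            col = cells[-1][1] + cells[-1][2] if cells else 1
            cells.append((ch, col, min(max(raw_width(ch), 1), 2)))
    return cells

def total_columns(text):
    return sum(width for _, _, width in layout(text))

print(total_columns("test" + ZWSP + "test"))  # 8
print(total_columns(ZWSP + "test"))           # 5
print(layout(ZWSP + "test")[1])               # ('t', 2, 1)
```

A ZWSP in the middle of a line merges into its neighbour and costs nothing, while a ZWSP in the first column has nothing to join, gets clamped up to width 1, and pushes the following character into column 2.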

I filed an issue: #18296


4 participants