xterm.js lies about Unicode 11, really uses Unicode 12 #4753

Open
apparebit opened this issue Sep 7, 2023 · 8 comments
Labels
area/addon/unicode11 type/proposal A proposal that needs some discussion before proceeding

Comments

@apparebit

I doubt this will make much difference in practice, but xterm.js has been falsely advertising its Unicode 11 chops when in reality it seems to have been implementing Unicode 12 rules for wcwidth from the day the Unicode "11" extension was released.

For background, I'm the author of demicode, a command line tool for exploring the fixed-width rendering of Unicode. One of demicode's built-in test sets is --with-version-oracle, which blots a character that has a single code point, default emoji presentation, and East Asian Width Wide for every Unicode version in which such a character is available. The intuition here is: emoji presentation and East Asian Width should result in double-column rendering only if the tool knows about the code point, i.e., incorporates the corresponding Unicode version's data. Hyper and Visual Studio Code both get all emoji up to and including Unicode 12 right but fail for 13, 14, and 15.

Here's the somewhat cropped output of demicode for Visual Studio Code:

[Screenshot: demicode's version-oracle output running in Visual Studio Code (cropped)]
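The version-oracle intuition can be illustrated outside of demicode with CPython's `unicodedata` module, whose tables correspond to whatever Unicode version the interpreter bundles. This is a hedged sketch of the idea, not demicode's actual implementation:

```python
# Illustration of the "version oracle" idea using CPython's bundled
# Unicode tables (not demicode itself): a runtime can only report
# East Asian Width "W" (wide) for a code point if its Unicode data
# is new enough to include that code point at all.
import unicodedata

FLAMINGO = "\U0001F9A9"  # U+1F9A9, assigned in Unicode 12.0

print(unicodedata.unidata_version)  # Unicode version bundled with Python
# Prints "W" on interpreters shipping Unicode 12.0 or newer data;
# older data would fall back to the default (narrow) classification.
print(unicodedata.east_asian_width(FLAMINGO))
```

A terminal whose width tables predate Unicode 12 would render U+1F9A9 in a single column for the same reason: the code point simply is not in its data.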

So I was wondering: who is right, xterm.js or demicode? As it turns out, it's demicode. If you look at the original pull request that added the Unicode "11" extension, more specifically at its code point ranges for HIGH_WIDE, you'll see the ranges [0x1F9A5, 0x1F9AA] and [0x1F9AE, 0x1F9CA]. I picked the first one because it is just the range covering U+1F9A9, the emoji demicode uses for Unicode 12. The next range served as confirmation. Both ranges appear in EastAsianWidth.txt for v12.0 but not in EastAsianWidth.txt for v11.0.
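For reference, this kind of width classification boils down to a membership test over sorted code point ranges. Here is a minimal sketch of such a lookup; the two ranges are the HIGH_WIDE ranges quoted above, while the helper name is mine, not xterm.js API:

```python
# Minimal sketch of a wide-width lookup over sorted [lo, hi] code
# point ranges, in the spirit of the unicode11 addon's tables.
# HIGH_WIDE holds the two ranges quoted above; is_wide is an
# illustrative helper, not part of xterm.js.
from bisect import bisect_right

HIGH_WIDE = [(0x1F9A5, 0x1F9AA), (0x1F9AE, 0x1F9CA)]

def is_wide(cp: int, ranges=HIGH_WIDE) -> bool:
    # Find the last range starting at or before cp, then check bounds.
    i = bisect_right(ranges, (cp, 0x10FFFF)) - 1
    return i >= 0 and ranges[i][0] <= cp <= ranges[i][1]

assert is_wide(0x1F9A9)      # U+1F9A9 FLAMINGO (Unicode 12.0)
assert not is_wide(0x1F9AB)  # falls in the gap between the two ranges
```

A binary search like this keeps the lookup fast even when the full table has hundreds of ranges.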

The pull request took some time to land and was merged on February 3, 2020. Unicode 12 was released March 5, 2019. So there was plenty of time for grabbing the newer data as well.

In short, xterm.js may call it Unicode "11", but the data really seems to come from Unicode 12. Aren't you glad to finally find out? 🤪

Details

All xterm.js versions with the Unicode "11" extension.

Steps to reproduce

Install demicode, run --with-version-oracle in Hyper or Visual Studio Code.

@jerch
Member

jerch commented Sep 7, 2023

Oha 😅 The addon started at v10, and I remember playing around with v11 and v12. At release time the idea was not to go with v12 yet, because it was still pretty new. Could it be that I mixed up the tables in the end?

@apparebit
Author

LOL. I read through the two prior draft pull requests that were rejected and then yours. I was really impressed by how careful and thorough y'all are in maintaining xterm.js. Also, I'm glad this happened because it gave me something funny to discover. 😎 In any case, if you plan on ingesting more Unicode versions, I'd automate the version and file management. Feel free to reuse the implementation in demicode (though it is in Python). See the mirror module.

@jerch
Member

jerch commented Sep 8, 2023

.. if you plan on ingesting more Unicode versions ..

Yes, that was the initial idea of the Unicode provider: to be able to attach different Unicode versions. So far our fake v11 addon has worked well enough, apart from some newer emoji, so we have not released a newer version yet.

We are currently in the process of restructuring the provider API to also support grapheme segmentation (PR #4519) and to shift to v15. There you can also find a rough outline of how we could deal with multiple Unicode versions in the future: #4519 (comment). As you can see there, we are not keen on super strict version handling, at least for newly defined code points, but of course we would have to split addon versions for any other change affecting behavior we deal with.

Your tool really looks handy, e.g. for checking whether such an addon split would be needed. It can probably speed up the nasty manual checks of new Unicode release changesets by a lot.

@apparebit
Author

Your tool really looks handy, e.g. for checking whether such an addon split would be needed. It can probably speed up the nasty manual checks of new Unicode release changesets by a lot.

Thank you. I sure hope so, and please do let me know if anything can be improved. I'm close to yet another release that adds paging backwards and running non-interactively. I'm certainly slouching towards 1.0.

I did see PR #4519. It sure looks impressive. Three comments though (which I'm happy to make on the PR as well, if you want me to):

  • I have good news: Segmentation into grapheme clusters uses Unicode's Extended_Pictographic property, which is a really good thing because that property includes currently unassigned code points that are reserved for future emoji use. That means that if you render those code points as wide, you can probably make xterm.js a bit more future-proof.
  • I am not convinced that the core logic in _shouldJoin etc. implements the complete logic from the Unicode TR. But that may just be because it's written as piecemeal predicates, whereas I implemented the regex from TR-29 directly. In either case, I highly recommend running your implementation against the full set of Unicode tests in GraphemeBreakTest.txt. It covers a number of boundary conditions I totally missed in my code and hence made all the difference in implementation quality. Sorry, @PerBothner, for potentially adding more to your already extensive PR.
  • Finally, I have bad news: Unicode 15.1 (link to draft TR29), which is about to be released, complicates grapheme breaking quite a bit. It appears that the description for 15.0 was always incomplete and that ICU and CLDR actually implement more complex rules. (15.0 also contains a bug, but that had already been discovered when I reported it.) The changes aren't that complicated per se, but they draw on several more Unicode properties, including Script, Canonical_Combining_Class, and Indic_Syllabic_Category. Since I haven't added them to demicode (yet), I can't even tell you how big they are. Script obviously covers a large part of Unicode, though in this case only a few scripts are of interest. Surprise! 😳

@PerBothner
Contributor

"I highly recommend running your implementation against the full set of Unicode tests in GraphemeBreakTest.txt. It covers a number of boundary conditions I totally missed in my code and hence made all the difference in implementation quality"

That might be reasonable, but note that our goal is not to be pedantically correct in edge cases: we need to do the right thing with valid/meaningful code sequences, plus not blow up in pathological cases. For meaningless/undefined sequences, GraphemeBreakTest.txt tells us how to break them into clusters, but it doesn't tell us what to do with the various questionable clusters; it's basically undefined behavior.

I think it would be desirable to write a test that reads GraphemeBreakTest.txt, parses it, and for each non-comment line calls directly into the grapheme-break API, without trying to display anything. However, I suggest deferring this until the Unicode specification has stabilized and 15.1 or 16 is official, and Unicode has updated GraphemeBreakTest.txt: partly to avoid duplicate work; partly because the longer we wait to merge the pull request, the more useless work resolving conflicts we have to do; partly because my Node/Playwright skills are limited and I need a break before I try to learn more tools and APIs :-)
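For what it's worth, the parsing half of such a test is small. Here is a hedged sketch in Python (a Node version would have the same shape; the function name is mine), assuming the standard GraphemeBreakTest.txt line format of hex code points separated by "÷" (break allowed) and "×" (no break), with "#" starting a comment:

```python
# Sketch of parsing one GraphemeBreakTest.txt line into code points
# and expected break offsets. The file format is standard Unicode;
# parse_break_test_line is an illustrative helper, not xterm.js code.
def parse_break_test_line(line):
    line = line.split("#", 1)[0].strip()  # strip trailing comment
    if not line:
        return None  # blank or comment-only line
    codepoints, breaks, offset = [], [], 0
    for token in line.split():
        if token == "\u00f7":    # ÷ means a break is expected here
            breaks.append(offset)
        elif token == "\u00d7":  # × means no break here
            pass
        else:                    # otherwise a hex code point
            codepoints.append(int(token, 16))
            offset += 1
    return codepoints, breaks

# "a" + COMBINING DIAERESIS forms one cluster; "b" is another.
cps, brks = parse_break_test_line("\u00f7 0061 \u00d7 0308 \u00f7 0062 \u00f7")
assert cps == [0x61, 0x308, 0x62]
assert brks == [0, 2, 3]
```

A test harness would then feed each code point list to the grapheme-break API and compare the reported break positions against the parsed offsets.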

@apparebit
Author

I wouldn't quite go as far as calling some seemingly incomplete sequences "undefined behavior." They may not usually occur in text, but at least some of them occur fairly regularly in the Unicode standard and related documents 😜. Now that I'm writing a blog post about Unicode, I appreciate the need for three notations, i.e., the characters, their names, and their code points. Any one of them alone fails to be readable at times.

Unicode 15.1 was released a few days ago, and I released demicode 1.0 yesterday. It implements grapheme cluster breaking according to 15.1 as well as 15.0 (using the same regex). The final release of Unicode 15.1 introduces a new property, Indic_Conjunct_Break, which is derived from Canonical_Combining_Class, Indic_Syllabic_Category, and Script. Using the derived property, I ran into a small complication: because I'm using a regex, I need to map both Grapheme_Cluster_Break and Indic_Conjunct_Break to letters. The catch is that both properties assign non-default values to the same code points. But the overlap is simple enough that a combined value space is still straightforward. Once I got that right, my implementation passed the new Unicode tests with flying colors (there are about twice as many as for 15.0).

So I imagine that your rule-based implementation should be even easier to update. If you run demicode with --stats, it prints the number of code points per property, the number of ranges in the Unicode data file, and the number of ranges after combining abutting ones for the same value. Add --ucd-version VERSION and you can compare those counts between 15.0 and 15.1.
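To make the combined value space concrete, here is a hedged sketch of the folding described above. The property names and values are real (Grapheme_Cluster_Break from UAX #29, the derived Indic_Conjunct_Break from Unicode 15.1), but the letter assignments and lookup tables are illustrative only, not demicode's actual code:

```python
# Sketch: fold Grapheme_Cluster_Break (GCB) and Indic_Conjunct_Break
# (InCB) into one letter per code point, so a single regex can match
# on both properties at once. Letter choices are illustrative.
from typing import Optional

GCB_LETTER = {"Extend": "E", "ZWJ": "Z", "Other": "O"}

def joint_letter(gcb: str, incb: Optional[str]) -> str:
    # The overlap is benign: InCB=Linker occurs only on GCB=Extend
    # code points and InCB=Consonant only on GCB=Other ones, so each
    # combination still maps to a unique letter.
    if incb == "Linker":
        return "L"
    if incb == "Consonant":
        return "C"
    return GCB_LETTER.get(gcb, "O")

# U+094D DEVANAGARI SIGN VIRAMA: GCB=Extend, InCB=Linker
assert joint_letter("Extend", "Linker") == "L"
# U+0915 DEVANAGARI LETTER KA: GCB=Other, InCB=Consonant
assert joint_letter("Other", "Consonant") == "C"
assert joint_letter("Extend", None) == "E"
```

With every code point reduced to one letter, the 15.1 conjunct rule (consonant, then linker possibly mixed with extenders, then consonant) becomes an ordinary regex alternative in the cluster pattern.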

@PerBothner
Contributor

I think it would be desirable to update xterm.js before the 5.4.0 release, though I don't think it's a show-stopper. Regardless, I'd like to wait until we understand the concerns of @DHowett that @Tyriar mentioned.

I don't know how to turn GraphemeBreakTest.txt into a Playwright test, but it doesn't seem too difficult to write a Node script that parses the file, creates a string for each entry, calls into the API to find break points, and checks that they match those in the file.

Right now I'm focused on a different non-trivial task: I've updated my BufferLine rewrite to use the new Unicode API; next is to replace the _text string with code points inline in the _data array.

@Tyriar
Member

Tyriar commented Sep 25, 2023

@PerBothner the simplest Playwright tests just run code on the web side and then get the result:

```typescript
test('dispose', async () => {
  await ctx.page.evaluate(`
    if ('term' in window) {
      try {
        window.term.dispose();
      } catch {}
    }
    window.term = new Terminal();
    window.term.dispose();
  `);
  strictEqual(await ctx.page.evaluate(`window.term._core._isDisposed`), true);
});
```

The newer ones hide a bunch of the complexity in a proxy object to improve test writing and make it closer to coding against the actual library:

```typescript
test('#4773: block cursor should render when the cell is selected', async () => {
  const theme: ITheme = {
    cursor: '#0000FF',
    selectionBackground: '#FF0000'
  };
  await ctx.value.page.evaluate(`window.term.options.theme = ${JSON.stringify(theme)};`);
  await ctx.value.proxy.focus();
  await ctx.value.proxy.selectAll();
  await pollFor(ctx.value.page, () => getCellColor(ctx.value, 1, 1), [0, 0, 255, 255]);
});
```

Doing the former is fine, that's preferred over a node unit test as it's closer to a real world scenario.

@Tyriar Tyriar added area/addon/unicode11 type/proposal A proposal that needs some discussion before proceeding labels Dec 12, 2024