-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New unicode-graphemes addon. #4519
Conversation
Handle grapheme cluster lookup as well as wcwidth. This is based on Unicode 15, and could replace the unicode11 addon.
There are some |
I fixed a bunch of tests. There are 6 failures like the following. In spite of Googling, I haven't yet figued out to fix these.
Except for these, Still 2 errors in |
You need to set the tsconfigs in .eslintrc.json: Lines 8 to 33 in b7d73f6
|
Thanks. I added the following to
|
Next problem:
I might have to figure this out by myself - but if someone has an insight it would be appreciated. (I'm also seeing some |
The playwright tests use the demo, this is about exposing the addon on client.ts: Lines 204 to 214 in 913cb25
|
At this point I think this is ready for merge - or at least feedback. I'd be curious to run a performance comparison. What is the recommended way to do that? |
Good to see it passing 👍. It's a pretty beefy PR so it might take me a couple of weeks to get to. @jerch would know the best way to benchmark |
No hurry. It does mean I may want to put the BufferLine changes on hold since there is some overlap, mainly in the InputHandler logic. Though maybe not - the overlap is likely to be limited to a small area. |
@PerBothner For perf test you can use/alter https://github.com/xtermjs/xterm.js/blob/master/test/benchmark/Terminal.benchmark.ts, and then run it with
You can also alter the command to run above in before --> spawn (used that to test different stress pattern in the past). |
@PerBothner Imho the idea to have this first as an addon is the right choice, which gives us time to level out raw edges (if there are any, no clue yet). I was really skeptical if this could be put into an addon with a reasonable afford, so thanks for doing the hard work. ❤️ |
@PerBothner The perf test wont run by just altering the script, it is in fact quite involved to get it working (this is mainly because of the repo structure, where addons are not seen by the base path lookups...). Still got it running, if you want I commit the needed changes into your branch. Overall the grapheme segementation performs really well 🚀 :
so overall perf degrades by ~30% for UTF8 and ~40% for JS string input. Thats def. not as bad as I feared, so I wont block this just for perf reasons (yes I would have done that, if the perf penalty was >2 times). There are prolly still some optimizations possible, and I hope we can get these numbers below 30%. So lets not get sidetracked too much by perf ideas here, it is imho already good enough to follow your impl route and polish interfaces and test cases first. On a sidenote (@Tyriar) - Seems we already lost throughput over time for some reason. When I did the initial buffer rework the perf test was at ~38MB/s for UTF8 and ~32MB/s for JS strings, as far as I remember. |
It seems plausible that the table lookup in Another possibly-worthwhile simple optimization might be to bypass the trie-lookup for ASCII (0020-007E) or below 0300 (with a few exceptions, to complicate things). |
@PerBothner Yeah wasm speedup will most likely be limited by the granularity we can call it with, and what additional data moving is needed per step. So while this def. worth a shot, I would not put a priority on that yet, before we settled the inner works and have a reliable test base (wasm is quite cumbersome, when it comes to testing). |
The data for width+category fits in 8 bits. I also note that UnicodeV6 and UnicodeV11 use a 65536-element Uint8Array for the BMP. There is no reason UnicodeGraphemeProvider can't do the same, making a Beyond the BMP we could use unicode-trie or perhaps a binary search in a Uint32Array (23 bits character range start + 8 bits value). But I agree: Lets nail down the API before worrying about further optimization. |
To move things forward I suggest that we should check, if and how we can activate the addon and grapheme handling during browser API tests. This would give us at least a basic regression testing against top level API. This should be possible, if we rework the API tests to boostrap its terminal instance from a common module, where we already injected the grapheme addon. It will still miss several core tests on the unit testing level done on nodeJS side, but I am unsure, if we can get it hooked in there with a reasonable effort. This is a bit of a downside of the addon abstraction - they get mixed in late, so cannot get injected easily to existing tests on unit level (still might be possible for the Why bothering with already existing tests? For mainly 3 reasons:
|
About the API tests - for tests using diff --git a/test/api/TestUtils.ts b/test/api/TestUtils.ts
index 3fd0648e..904c0d12 100644
--- a/test/api/TestUtils.ts
+++ b/test/api/TestUtils.ts
@@ -46,6 +46,11 @@ export async function timeout(ms: number): Promise<void> {
export async function openTerminal(page: playwright.Page, options: ITerminalOptions & ITerminalInitOnlyOptions = {}): Promise<void> {
await page.evaluate(`window.term = new Terminal(${JSON.stringify({ allowProposedApi: true, ...options })})`);
await page.evaluate(`window.term.open(document.querySelector('#terminal-container'))`);
+ await page.evaluate(`
+ window.unicode = new UnicodeGraphemesAddon();
+ window.term.loadAddon(window.unicode);
+ window.term.unicode.activeVersion = '15-graphemes';
+ `);
await page.waitForSelector('.xterm-rows');
} This is just quickly hacked in to see, how it could be done. For better integration testing it prolly will need a conditional here, whether to run API tests with or w'o graphemes (could be done in the CI file as env var). Also have not checked yet, if all API tests call
(It tests unicode provider registration, but does not know about the new provider yet, thus it should have failed, hurray 😸) |
@PerBothner Should I do a PR against your branch, so you can add the perf test and the API test changes? |
No opinion - whatever you think is reasonable. |
Created PR: PerBothner#2 |
How much longer do we wait for @DHowett? (I also sent him an email, which he hasn't responded to. Perhaps @Tyriar could make another attempt to contact him? I updated the branch from upstream. GitHub reports some weird testsuite-failures, including lint running out of memory. ( |
You're right, I can have that discussion internally and unblock merging this if we just mark it as experimental in the readme. |
I'll have a look at the CI failures, OOM is quite surprising. |
@jerch I'll do a once over of the core files, any concerns on your end to get this merged given we have a warning in the readme now? |
@jerch I'll bring it up in our next monthly sync |
@PerBothner fixed the CI issues:
Sorry about all the conflicts, we have a pretty sweet CI setup now though 😅 |
Merging! We'll set up publishing #4797 |
Thanks! Next I will "rebase" the BufferLine re-write to use the grapheme-cluster API. |
@PerBothner 👌, I think I mentioned it before but we'll want to chat about the high level ideas in the proposed buffer line rewrite before investing too much time to make sure we're all on the same page. |
@Tyriar "we'll want to chat about the high level ideas in the proposed buffer line rewrite" I agree. There was some discussion in issue #4513, but mixed in with the clustering issue, This issue is pretty important to me. I am seriously considering switching DomTerm engine to xterms.js for various reasons (mainly performance and community). However, DomTerm has a very ambitious vision and already has a lot of features that xterm.js doesn't have. (Hence a lot of DomTerm features would have to be ported or re-implemented.) The existing BufferLine representation is simple but quite limited, and not a good match for some of the things I want to do, including things that DomTerm currently supports. |
@PerBothner closed, yes please create the issue and we can chat there. It's a bit tough to filter through all this info quickly so it would be good to separate the discussions. The main goals on my end of the buffer system is fast access and tight packing of the data. The major things I'd like that come to mind for the buffer are deferred clean up/reflow of the arrays (#4232) and leveraging less, larger array buffers to store the data (I think this might be something you were after too). We also have the decoration system which adds a lot of extensibility capabilities which are slower but very flexible. |
FWIW DomTerm already does deferred re-flow. (Partly because re-flow is fairly expensive in the general case, which may include non-trivial pretty-printing and variable-width fonts.) I will cover re-flow in my proposal. I'll write up the design, goals, and features in a new Issue, hopefully later today or possibly tomorrow. |
See new issue #4800 about a |
Handle grapheme cluster lookup as well as wcwidth.
This is based on Unicode 15, and could replace the unicode11 addon.
See the discussion in issue #4513 - but not that this pull request does not change the BufferLine implementation. (Except an extra
width
parameter foraddCodepointToCell
.Note this combines the lookup for wcwidth and grapheme-break-property in a single trie lookup, so I'm hoping there won't be a noticable performace cost. Please let me know what you think.
Here is a screenshot of DomTerm with the xterm.js emulator. This is in a Tauri/Wry window (WebKit-based). Firefox also works. Chrome-based browsers (including Electron) mostly work - except they don't seem to handle Korean Hangul properly. (This might be fixable by changing fonts.)