Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New unicode-graphemes addon. #4519

Merged
merged 62 commits into from
Sep 12, 2023
Merged
Show file tree
Hide file tree
Changes from 14 commits
Commits
Show all changes
62 commits
Select commit Hold shift + click to select a range
2f7fe11
New unicode-graphemes addon.
PerBothner May 17, 2023
67e9689
Cleanup - fix various tests.
PerBothner May 18, 2023
41760df
xterm-addon-unicode-graphemes - fix 'yarn test'
PerBothner May 18, 2023
8e730cd
xterm-addon-unicode-graphemes: add api tests and get demo working
PerBothner May 18, 2023
9c34bb4
Merge branch 'master' into clusters
PerBothner May 18, 2023
466501c
patch for working perf test and API test injection
jerch May 20, 2023
284cde8
fix benchmark
jerch May 20, 2023
68abf10
remove spurious .only
jerch May 20, 2023
ebfa201
revert wrong added file
jerch May 20, 2023
3003185
revert wrongly editied files
jerch May 20, 2023
e87fa16
Merge pull request #2 from jerch/clusters-patch
PerBothner May 21, 2023
1a8c7da
Update .eslintrc.json to include xterm-addon-unicode-graphemes/benchmark
PerBothner May 21, 2023
fa89e17
Optimize UnicodeGraphemeProvider for the ASCII case.
PerBothner May 21, 2023
cf403bc
Merge branch 'master' into clusters
jerch May 21, 2023
693f6b2
Merge branch 'master' into clusters
PerBothner Jun 27, 2023
dc6818d
Fix error in charProperties optimization.
PerBothner Jun 28, 2023
5568233
Fix for lint the previous charProperties change.
PerBothner Jun 28, 2023
7baff4e
Merge branch 'master' into clusters
Tyriar Aug 2, 2023
08a0914
Add grapheme cluster test button to demo
Tyriar Aug 2, 2023
6b24593
Null out following columns after grapheme cluster
PerBothner Aug 4, 2023
4204215
Merge branch 'master' into clusters
PerBothner Aug 4, 2023
dccfc11
Tweak to avoid line-ending spaces.
PerBothner Aug 6, 2023
a0dc881
Fix copyright year in new file.
PerBothner Aug 6, 2023
88bfc6b
Tweak `charProperties` doc comment.
PerBothner Aug 6, 2023
988b3b1
Remove tailing space.
PerBothner Aug 6, 2023
7202979
Changed xterm-addon-unicode-graphemes version number to 0.1.0.
PerBothner Aug 6, 2023
f103540
Move precedingJoinState from EscapeSequenceParser to InputHandler
PerBothner Aug 6, 2023
b9abbc0
Split complicated conditional expression.
PerBothner Aug 7, 2023
4fbc623
Expand doc comments for UnicodeCharProperties and UnicodeCharWidth.
PerBothner Aug 7, 2023
302da18
Rename local variable precedingInfo in InputHandler to precedingJoinS…
PerBothner Aug 7, 2023
b9a4760
Move files imported from unicode-properties into subdirectory
PerBothner Aug 9, 2023
3bfe5d8
Add testOptions parameter to openTerminal in TestUtils.
PerBothner Aug 9, 2023
9b21786
Merge branch 'master' into clusters
jerch Aug 10, 2023
fb2c680
Replace comments by assert message in unicode-graphemes tests
PerBothner Aug 10, 2023
aac5d47
Merge branch 'master' into clusters
PerBothner Aug 10, 2023
cf243ad
Merge branch 'master' into clusters
PerBothner Aug 14, 2023
5994e25
Merge branch 'master' into clusters
PerBothner Aug 15, 2023
e3d3cdb
Merge branch 'master' into clusters
PerBothner Aug 17, 2023
f1bc956
Add more grapheme-cluster examples to test and demo
PerBothner Aug 17, 2023
2070cc4
Merge branch 'master' into clusters
PerBothner Aug 17, 2023
6c1eb97
Remove no-longer-applicable comment.
PerBothner Aug 17, 2023
21fd545
Don't load unicode11 addon on demo startup.
PerBothner Aug 18, 2023
00fd2b8
Set unicode.activeVersion when loading UnicodeGraphemeProvider.
PerBothner Aug 18, 2023
e21e525
Merge branch 'master' into clusters
Tyriar Aug 18, 2023
2204375
Merge branch 'master' into clusters
PerBothner Aug 21, 2023
e5a2ebd
Provide with "15" and "15-graphemes" UnicodeProviders
PerBothner Aug 21, 2023
5ccf4d3
Merge remote-tracking branch 'upstream/master' into pr/PerBothner/4519-1
Tyriar Aug 23, 2023
dc6dd6d
Remove precedingCodepoint - use precedingJoinState instead
PerBothner Aug 24, 2023
44e1977
Merge branch 'master' into clusters
PerBothner Aug 27, 2023
630d213
Fix 2 comment typos.
PerBothner Sep 6, 2023
930e5c4
Merge branch 'master' into clusters
PerBothner Sep 7, 2023
9963c58
Two testing-related fixes
PerBothner Sep 7, 2023
427c472
Merge branch 'master' into clusters
PerBothner Sep 7, 2023
bddff60
Merge branch 'master' into clusters
PerBothner Sep 11, 2023
a003034
Increase node memory for lint in CI
Tyriar Sep 11, 2023
1b5756c
Add unicode-grapheme addon files to build artifacts
Tyriar Sep 11, 2023
17c4d09
Add experimental note in readme
Tyriar Sep 11, 2023
9e536ca
Fix warnings in client.ts
Tyriar Sep 11, 2023
24b932d
Update to es2021
Tyriar Sep 11, 2023
9ab0cd6
Set repository to sub-directory
Tyriar Sep 12, 2023
8183d1d
Merge branch 'master' into clusters
Tyriar Sep 12, 2023
32c6165
Final polish
Tyriar Sep 12, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions .eslintrc.json
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,9 @@
"addons/xterm-addon-serialize/benchmark/tsconfig.json",
"addons/xterm-addon-unicode11/src/tsconfig.json",
"addons/xterm-addon-unicode11/test/tsconfig.json",
"addons/xterm-addon-unicode-graphemes/src/tsconfig.json",
"addons/xterm-addon-unicode-graphemes/test/tsconfig.json",
"addons/xterm-addon-unicode-graphemes/benchmark/tsconfig.json",
"addons/xterm-addon-web-links/src/tsconfig.json",
"addons/xterm-addon-web-links/test/tsconfig.json",
"addons/xterm-addon-webgl/src/tsconfig.json",
Expand All @@ -34,6 +37,9 @@
"sourceType": "module"
},
"ignorePatterns": [
"addons/xterm-addon-unicode-graphemes/src/tiny-inflate.ts",
"addons/xterm-addon-unicode-graphemes/src/unicode-trie.ts",
"addons/xterm-addon-unicode-graphemes/src/UnicodeProperties.ts",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of ignoring random files how about we have a src/vendor/ folder and put third party non-node_modules that we don't want to format into that/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree those should be moved. I'm leaning to a sub-directory named unicode-properties. E.g. addons/xterm-addon-unicode-graphemes/src/unicode-properties/UnicodeProperties.ts. However, a more generic/conventional subdirectory-name such as imported or upstream could also make sense. In that case .eslintc.json could use a generic pattern like "addons/*/src/imported/**/*.ts".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vendor/ or third_party/ would be my preference as imported/upstream are more ambiguous.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the 3 files into third-party. I prefer hyphens rather than underscores in file-/directory-names. This may be partly because of my Lisp/Scheme history - but it also seems common in xterm.js. If you prefer third_party with underscore I'll be happy change it,

"**/typings/*.d.ts",
"**/node_modules",
"**/*.js"
Expand Down
3 changes: 3 additions & 0 deletions addons/xterm-addon-unicode-graphemes/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
lib
node_modules
out-benchmark
29 changes: 29 additions & 0 deletions addons/xterm-addon-unicode-graphemes/.npmignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
# Blacklist - exclude everything except npm defaults such as LICENSE, etc
*
!*/

# Whitelist - lib/
!lib/**/*.d.ts

!lib/**/*.js
!lib/**/*.js.map

!lib/**/*.css

# Whitelist - src/
!src/**/*.ts
!src/**/*.d.ts

!src/**/*.js
!src/**/*.js.map

!src/**/*.css

# Blacklist - src/ test files
src/**/*.test.ts
src/**/*.test.d.ts
src/**/*.test.js
src/**/*.test.js.map

# Whitelist - typings/
!typings/*.d.ts
19 changes: 19 additions & 0 deletions addons/xterm-addon-unicode-graphemes/LICENSE
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
Copyright (c) 2019, The xterm.js authors (https://github.com/xtermjs/xterm.js)
Tyriar marked this conversation as resolved.
Show resolved Hide resolved

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
25 changes: 25 additions & 0 deletions addons/xterm-addon-unicode-graphemes/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
## xterm-addon-unicode-graphemes

An addon providing enhanced Unicode support (include grapheme clustering) for xterm.js.

The file `src/UnicodeProperties.ts` is generated and depends on the Unicode version. See [the unicode-properties project](https://github.com/PerBothner/unicode-properties) for credits and re-generation instructions.

### Install

```bash
npm install --save xterm-addon-unicode-graphemes
```

### Usage

```ts
import { Terminal } from 'xterm';
import { UnicodeGraphemeAddon } from 'xterm-addon-unicode-graphemes';

const terminal = new Terminal();
const unicodeGraphemeAddon = new UnicodeGraphemeAddon();
terminal.loadAddon(unicodeGraphemeAddon);

// activate the new version
terminal.unicode.activeVersion = '15';
```
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
/**
* Copyright (c) 2019 The xterm.js authors. All rights reserved.
* @license MIT
*/

import { perfContext, before, ThroughputRuntimeCase } from 'xterm-benchmark';

import { spawn } from 'node-pty';
import { Utf8ToUtf32, stringFromCodePoint } from 'common/input/TextDecoder';
import { Terminal } from 'browser/Terminal';
import { UnicodeGraphemeProvider } from 'UnicodeGraphemeProvider';


function fakedAddonLoad(terminal: any): void {
// resembles what UnicodeGraphemesAddon.activate does under the hood
terminal.unicodeService.register(new UnicodeGraphemeProvider());
terminal.unicodeService.activeVersion = '15-graphemes';
}


perfContext('Terminal: ls -lR /usr/lib', () => {
let content = '';
let contentUtf8: Uint8Array;

before(async () => {
// grab output from "ls -lR /usr"
const p = spawn('ls', ['--color=auto', '-lR', '/usr/lib'], {
name: 'xterm-256color',
cols: 80,
rows: 25,
cwd: process.env.HOME,
env: process.env,
encoding: (null as unknown as string) // needs to be fixed in node-pty
});
const chunks: Buffer[] = [];
let length = 0;
p.on('data', data => {
chunks.push(data as unknown as Buffer);
length += data.length;
});
await new Promise<void>(resolve => p.on('exit', () => resolve()));
contentUtf8 = Buffer.concat(chunks, length);
// translate to content string
const buffer = new Uint32Array(contentUtf8.length);
const decoder = new Utf8ToUtf32();
const codepoints = decoder.decode(contentUtf8, buffer);
for (let i = 0; i < codepoints; ++i) {
content += stringFromCodePoint(buffer[i]);
// peek into content to force flat repr in v8
if (!(i % 10000000)) {
content[i];
}
}
});

perfContext('write/string/async', () => {
let terminal: Terminal;
before(() => {
terminal = new Terminal({ cols: 80, rows: 25, scrollback: 1000 });
fakedAddonLoad(terminal);
});
new ThroughputRuntimeCase('', async () => {
await new Promise<void>(res => terminal.write(content, res));
return { payloadSize: contentUtf8.length };
}, { fork: false }).showAverageThroughput();
});

perfContext('write/Utf8/async', () => {
let terminal: Terminal;
before(() => {
terminal = new Terminal({ cols: 80, rows: 25, scrollback: 1000 });
});
new ThroughputRuntimeCase('', async () => {
await new Promise<void>(res => terminal.write(content, res));
return { payloadSize: contentUtf8.length };
}, { fork: false }).showAverageThroughput();
});
});
19 changes: 19 additions & 0 deletions addons/xterm-addon-unicode-graphemes/benchmark/benchmark.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
{
"APP_PATH": ".benchmark",
"evalConfig": {
"tolerance": {
"*": [0.75, 1.5],
"*.dev": [0.01, 1.5],
"*.cv": [0.01, 1.5],
"EscapeSequenceParser.benchmark.js.*.averageThroughput.mean": [0.9, 5]
},
"skip": [
"*.median",
"*.runs",
"*.dev",
"*.cv",
"EscapeSequenceParser.benchmark.js.*.averageRuntime",
"Terminal.benchmark.js.*.averageRuntime"
]
}
}
23 changes: 23 additions & 0 deletions addons/xterm-addon-unicode-graphemes/benchmark/tsconfig.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
{
"compilerOptions": {
"lib": ["dom", "es6"],
"outDir": "../out-benchmark",
"types": ["../../../node_modules/@types/node"],
"moduleResolution": "node",
"strict": false,
"target": "es2015",
"module": "commonjs",
"baseUrl": ".",
"paths": {
"common/*": ["../../../src/common/*"],
"browser/*": ["../../../src/browser/*"],
"UnicodeGraphemeProvider": ["../src/UnicodeGraphemeProvider"]
}
},
"include": ["../**/*", "../../../typings/xterm.d.ts"],
"exclude": ["../../../**/*test.ts", "../../**/*api.ts"],
"references": [
{ "path": "../../../src/common" },
{ "path": "../../../src/browser" }
]
}
29 changes: 29 additions & 0 deletions addons/xterm-addon-unicode-graphemes/package.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
{
"name": "xterm-addon-unicode-graphemes",
Tyriar marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should stick to the major version of unicode to separate the unicode support. Thats easier to grasp for ppl.

For the grapheme vs not-grapheme support I suggest to expose 2 providers from the addon, one with and one w'o support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do you draw the line? The non-grapheme variant should still handle simple clusters that consist of just a single base character followed by some modifiers. Otherwise it's a regression.

Without having examined the issue, the main differentiator is whether to ignore ZWJ (zero-width-joiner), thus treating true compounds as separate clusters. Handling of Regional Indicators (flags) is another issue. It is certainly possible to add a flag that restores the old non-standard behavior.. I'd rather not (I'm not sure it's needed), but I'll do whatever you and @Tyriar agree.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well the problem here is, that many cmdline apps are simply not segmentation aware, unless the system provides better unicode support for them, and instead just add wcwidth values to guess their content width.

Esp. macos is likely to create issues here, as it delivers a very old wcwidth impl (not sure, which unicode version it is created from, when I tried to deal with it 3ys ago, it still was v6)

So to separate grapheme support from non grapheme support, I would not deactivate certain segmentation rules, but the rule handling completely falling back to old wcwidth measuring. Note that grapheme handling was not even a thing in terminals until recently, so apps are very likely to get this wrong until they adopted it as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the question is: If you paste a string containing a compound cluster into an input field in an application that just uses wcwidth and doesn't know anything about clusters, what will happen? Assume the application knows enough that wcwidth==0 means a combination. What should happen? Is this a use-case we should worry about?

In addition to spacing and alignment issues, consider what happens when editing the input field: The application thinks the cursor is one place, while the terminal thinks it is another place.

Is this something we should support? Is suggesting that people fall back to unicode6 or unicode11 unacceptable?

It is probably not too difficult for the addon to provide two implementations of IUnicodeVersionProvider which would differ in their implementation of charProperties. Instead of calling UC.shouldJoin you just check if the width of the current character is 0. You select one provider by setting activeVersion to "15" and the other to "15-grpahemes". Presumably the latter would the default if you just load the addon.

I haven't looked into how much would be involved.

Copy link
Member

@jerch jerch Aug 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the question is: If you paste a string containing a compound cluster into an input field in an application that just uses wcwidth and doesn't know anything about clusters, what will happen?

Whatever is defined in the selected unicode version + ruleset, e.g. for no-grapheme support the old wcwidth calc.

Is suggesting that people fall back to unicode6 or unicode11 unacceptable?

Yes, it misses a lot of codepoints. V6 is even worse, as codepoints changed metrics in between.

It is probably not too difficult for the addon to provide two implementations...

Isn't it just a bypass for the joining, as the width has to be pulled anyway?


I know the situation with unicode is really messed up in terminals, even worse for xtermjs, as it cannot even determine a sane default from the system it runs on (other desktop TEs can do that). Regarding grapheme support in terminals we are in an "infrastructure trap" - someone has to make the first move (here either TE or app side), and the other side should follow. Thus I think grapheme handling should be on by default. But making it mandatory is imho wrong, as it is still pretty new in TE context.

"version": "0.5.0",
Tyriar marked this conversation as resolved.
Show resolved Hide resolved
"author": {
"name": "The xterm.js authors",
"url": "https://xtermjs.org/"
},
"main": "lib/xterm-addon-unicode-graphemes.js",
"types": "typings/xterm-addon-unicode-graphemes.d.ts",
"repository": "https://github.com/xtermjs/xterm.js",
jerch marked this conversation as resolved.
Show resolved Hide resolved
"license": "MIT",
"keywords": [
"terminal",
"xterm",
"xterm.js"
],
"scripts": {
"build": "../../node_modules/.bin/tsc -p .",
"prepackage": "npm run build",
"package": "../../node_modules/.bin/webpack",
"prepublishOnly": "npm run package",
"benchmark": "NODE_PATH=../../out:./out:./out-benchmark/ ../../node_modules/.bin/xterm-benchmark -r 5 -c benchmark/benchmark.json out-benchmark/benchmark/*benchmark.js",
"benchmark-baseline": "NODE_PATH=../../out:./out:./out-benchmark/ ../../node_modules/.bin/xterm-benchmark -r 5 -c benchmark/benchmark.json --baseline out-benchmark/benchmark/*benchmark.js",
"benchmark-eval": "NODE_PATH=../../out:./out:./out-benchmark/ ../../node_modules/.bin/xterm-benchmark -r 5 -c benchmark/benchmark.json --eval out-benchmark/benchmark/*benchmark.js"
},
"peerDependencies": {
"xterm": "^5.0.0"
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
/**
* Copyright (c) 2023 The xterm.js authors. All rights reserved.
* @license MIT
*/

import { IUnicodeVersionProvider } from 'xterm';
import { UnicodeCharProperties, UnicodeCharWidth } from 'common/services/Services';
import { UnicodeService } from 'common/services/UnicodeService';
import * as UC from './UnicodeProperties';

export class UnicodeGraphemeProvider implements IUnicodeVersionProvider {
public readonly version = '15-graphemes';
public ambiguousCharsAreWide: boolean = false;
constructor() {
}

public charProperties(codepoint: number, preceding: UnicodeCharProperties): UnicodeCharProperties {
// Optimize the simple ASCII case, under the condition that
// UnicodeService.extractCharKind(preceding) === GRAPHEME_BREAK_Other
// (which also covers the case that preceding === 0).

if ((codepoint >= 32 && codepoint < 127) && (preceding >> 3) === 0) {
// Inlined UnicodeService.createPropertyValue(GRAPHEME_BREAK_Other, 1, false)
return 1;
}

let charInfo = UC.getInfo(codepoint);
let w = UC.infoToWidthInfo(charInfo);
let shouldJoin = false;
if (w >= 2) {
// Treat emoji_presentation_selector as WIDE.
w = w === 3 || this.ambiguousCharsAreWide || codepoint === 0xfe0f ? 2 : 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need an explicit handling of VS15 (U+FE0E) as well? Imho VS15 should always turn emojis into text repr with only 1 cell width.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really needed an update to the ancient East Asian Widths TR to deal with grapheme clusters and more. In addition to VS15, it would be nice to have a recommendation on how wide en/em dashes/spaces should be.

@Tyriar If perchance you have a contact that is involved with Unicode standardization, perhaps check with them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PerBothner I think they dont feel obligated to solve the monospace issue for us:

Note: The East_Asian_Width property is not intended for use by modern terminal emulators without appropriate tailoring on a case-by-case basis. Such terminal emulators need a way to resolve the halfwidth/fullwidth dichotomy that is necessary for such environments, but the East_Asian_Width property does not provide an off-the-shelf solution for all situations. The growing repertoire of the Unicode Standard has long exceeded the bounds of East Asian legacy character encodings, and terminal emulations often need to be customized to support edge cases and for changes in typographical behavior over time.

That note got added to TR-11 with unicode version 11. It basically says - "it is much more complicated, and we dont care to spec things out for the terminal". 🙈

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah well, I am not sure anymore, if VS15 should reduce emojis to 1 cell width. For VS16 it is explicitly mentioned that pictograms should be wide. There no such notion for VS15, which makes them somewhat unclear again (if width property should be applied, which can be narrow or wide for different symbols, or whether they still stick to wide). What a nonsense...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No contact 🙁

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To not block on a silly thing like VS15, maybe just add a comment somewhere, that VS15 is not treated special in width handling.

} else {
w = 1;
}
if (preceding !== 0) {
const oldWidth = UnicodeService.extractWidth(preceding);
charInfo = UC.shouldJoin(UnicodeService.extractCharKind(preceding), charInfo);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am somewhat unhappy with extractWidth and extractCharKind being implemented on UnicodeService. It leaks facts about the impl in this addon into the core code. Even worse - it forces anyone implementing that interface to use the same UnicodeCharProperties layout. More about this below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's not the cleanest approach. However, if we support multiple UnicodeCharProperties layouts then these functions can no longer be static, which is likely to have performance implications. A static function can potentially be inlined - though I don't know of a way to have TypeScript actually do inlining.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, as far as I am aware TS does not do that kind of optimization for us. I've never seen it inlining code blocks, I only know of the const enum trick for simple const value inlining. For functions/methods it will always keep the call, which might or might not be inlined by the JS JIT later on.

(@Tyriar plz correct me, if I miss here something).

I have not followed all call traces of these functions in detail, but maybe reshaping UnicodeCharProperties into a ref passed object as described below can help here? It would also omit the additional bit packing & extraction steps, which are known to create instruction pressure (at least in C).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For functions/methods it will always keep the call, which might or might not be inlined by the JS JIT later on.

This is my understanding but I'd need to look up the conditions in which a function is inlined. I know there's some V8 tool that lets you inspect optimizations like this.

shouldJoin = charInfo > 0;
if (shouldJoin) {
if (oldWidth > w) {
w = oldWidth;
} else if (charInfo === 32) { // UC.GRAPHEME_BREAK_SAW_Regional_Pair)
w = 2;
}
}
}
return UnicodeService.createPropertyValue(charInfo, w, shouldJoin);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as comment above.

}

public wcwidth(codepoint: number): UnicodeCharWidth {
const charInfo = UC.getInfo(codepoint);
const w = UC.infoToWidthInfo(charInfo);
const kind = (charInfo & UC.GRAPHEME_BREAK_MASK) >> UC.GRAPHEME_BREAK_SHIFT;
return (kind === UC.GRAPHEME_BREAK_Extend || kind === UC.GRAPHEME_BREAK_Prepend) ? 0
: (w >= 2 && (w === 3 || this.ambiguousCharsAreWide)) ? 2 : 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we split this into ifs with early returns? The nested ternaries is a bit hard to read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this.

(If have mixed feelings about the best style, given JavaScript's limitations - Rust handles this better. I would prefer 'if-else if-else' as being closer to preferred "structured programming" style, but lint complains about "Unnecessary 'else' after 'return' no-else-return".)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer early returns to minimize indentation as you can then to things like this which are very readable imo:

function x() {
  if (!a) {
    return;
  }
  if (!b) {
    return;
  }
  // do the thing
}

This is just a personal preference, it doesn't really matter that much imo just as long as everything is consistent. Generally we try follow similar style to the codebase but we don't strictly enforce unless we set up a lint rule. I started wasting a lot less time when I stopped calling out style issues in PRs and instead spent that effort setting up lint rules 🙂

Copy link
Member

@jerch jerch Aug 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this hot path code? I the past I found ternary to perform better in Javascript, thus I opt for ternary in hot path code normally, when possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not hot-path anymore, as InputHandler no longer calls wcwidth.

}
}
17 changes: 17 additions & 0 deletions addons/xterm-addon-unicode-graphemes/src/UnicodeGraphemesAddon.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
/**
* Copyright (c) 2023 The xterm.js authors. All rights reserved.
* @license MIT
*
* UnicodeVersionProvider for V15 with grapeme cluster handleing.
*/

import { Terminal, ITerminalAddon } from 'xterm';
import { UnicodeGraphemeProvider } from './UnicodeGraphemeProvider';


export class UnicodeGraphemesAddon implements ITerminalAddon {
public activate(terminal: Terminal): void {
terminal.unicode.register(new UnicodeGraphemeProvider());
}
public dispose(): void { }
}
Loading