initTable is slow, can we lazily load parts as they're needed? #1785
@Tyriar 20ms - eww, that's much more than it shows in the tests (I see an average of 8ms for it). Yeah, it can be done in chunks. It might have a negative impact on the input flow, since it would introduce another condition to be evaluated for every single char; I'll have to test it. Btw, what are those really high framerates on the right side, the new thingy? (not letting the cat out of the bag) 😸 |
@Tyriar, can I look at changing this? |
@skprabhanjan It's already highly optimized code; the returned function is used in the hot loop during input for every char. Therefore it's not that easy to chunkify it without reintroducing a runtime penalty. Also it is not directly testable, since v8 typically inlines the function. If you are still with me and still want to do it, these are the steps to be taken:
There are several tricks possible to further "hide" the workload. Oh - and the lookup table is highly packed to avoid wasting memory. Would be good if this does not end up taking much more memory. |
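For context, the packing works roughly like this (a minimal sketch with illustrative names, inferred from the lookup expression quoted further down): each 32-bit entry stores 2-bit widths for 16 codepoints, so the whole BMP fits in 16 KB.

```ts
// Sketch of the 2-bit packing: each 32-bit entry holds the widths (0-2)
// of 16 codepoints, so 65536 BMP codepoints fit into 4096 entries = 16 KB.
const packed = new Uint32Array(65536 >> 4);

// assumes the slot still holds 0 (i.e. each codepoint is set only once)
function setWidth(num: number, width: number): void {
  packed[num >> 4] |= (width & 3) << ((num & 15) << 1);
}

function getWidth(num: number): number {
  return packed[num >> 4] >> ((num & 15) << 1) & 3;
}
```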
@jerch, thanks for the reply.
If I choose to go with the second approach, what exactly would the first occurrence of a char within a part's range mean? |
@skprabhanjan I think the most straightforward way is to do the following:
From there on it's a matter of code golfing to keep the runtime as low as possible. As suggested above you might also want to try a single big table instead of distinct tables, but you would have to track the separate init states as well. Not sure which one turns out to be faster; I think the approach with distinct tables is superior as it might be easier to comprehend. If you do it like above, the main difference in the hot function to the current impl is the extra chunk lookup (see the sketch below). |
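A rough sketch of what the distinct-tables variant could look like (hypothetical names, unpacked to one byte per codepoint for readability; `wcwidthBMP` stands in for the existing slow-path width computation in CharWidth.ts):

```ts
// Hypothetical sketch of lazily initialized chunk tables:
// 32 chunks of 2048 codepoints each cover the whole BMP.
function wcwidthBMP(num: number): number {
  if (num === 0) return 0;                      // stand-in only; the real
  if (num >= 0x1100 && num <= 0x115f) return 2; // version covers all ranges
  return 1;
}

const chunkTables: Uint8Array[] = new Array(32);

function initChunkTable(chunk: number): Uint8Array {
  const t = new Uint8Array(2048);
  const base = chunk << 11; // first codepoint of this chunk
  for (let i = 0; i < 2048; ++i) {
    t[i] = wcwidthBMP(base + i);
  }
  return chunkTables[chunk] = t;
}

function lookup(num: number): number {
  // extra cost vs. the current impl: chunk selection + init check
  const t = chunkTables[num >> 11] || initChunkTable(num >> 11);
  return t[num & 2047];
}
```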
Did some quick tests with not so nice results. Changing that:

```ts
const t = table || initTable();
if (num < 65536) {
  return t[num >> 4] >> ((num & 15) << 1) & 3;
}
```

into this:

```ts
if (num < 65536) {
  const t = chunkTables[num >> 11] || initChunkTable(num >> 11);
  return t[(num & 2047) >> 4] >> ((num & 15) << 1) & 3;
}
```

doubles the runtime of the hot function. Hmm, not sure yet how to solve this; maybe we should try to go with a plain Uint8Array instead. @skprabhanjan Maybe you have a better idea how to get the chunks working in a speedy manner, but 35 MB/s is not acceptable.
Edit: Yepp, with a Uint8Array the chinese char shows 80 MB/s as well (but table creation is a bit slower). |
Next try:

```ts
...
const table = new Uint8Array(65536);
table.fill(255);
...
return (num) => {
  ...
  if (num < 65536) {
    return (table[num] === 255) ? (table[num] = wcwidthBMP(num)) : table[num];
  }
  ...
}
```

It's stable over 70 MB/s for an already seen char, but still drops down to 20 MB/s for fresh chars. Hmm, kinda promising; not sure yet if avoiding one frame hiccup is worth this. |
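Filled in, that writethrough variant could look like this as a self-contained sketch (the `wcwidthBMP`/`wcwidthHigh` stubs stand in for the real slow paths; 255 works as the "not yet computed" sentinel since valid widths are only 0-2):

```ts
// Writethrough cache sketch: all entries start at the sentinel 255;
// a codepoint's width is computed once on first sight and written back.
function wcwidthBMP(num: number): number { return num === 0 ? 0 : 1; } // stand-in
function wcwidthHigh(num: number): number { return 1; }                // stand-in

const table = new Uint8Array(65536);
table.fill(255);

function wcwidth(num: number): number {
  if (num < 65536) {
    return (table[num] === 255) ? (table[num] = wcwidthBMP(num)) : table[num];
  }
  return wcwidthHigh(num); // non-BMP codepoints bypass the cache
}
```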
@jerch, I went through all the comments and followed the complete sequence up to the point where you said the writethrough model was the fastest (via the latest comment). Is xterm-benchmark the right tool to measure this? |
Yes, xterm-benchmark is the right tool for it. Currently there is no test case for it, so I used this one:

```ts
import { before, ThroughputRuntimeCase } from '..';
import * as fs from 'fs';
const getStringWidth = require('xterm/lib/CharWidth').getStringWidth;
let content = fs.readFileSync('./chinese_out', {encoding: 'utf8'});
new ThroughputRuntimeCase('', () => {
  console.log(getStringWidth(content));
  return {payloadSize: content.length};
}, {fork: true}).showRuntime().showThroughput().showAverageRuntime().showAverageThroughput();
```

You can run the test with the benchmark CLI. About the writethrough model - yeah, this looks promising, still it drops throughput badly for new chars.
Edit: './chinese_out' is a 50 MB file with the same chinese full width char repeated. |
Sure, now I understand how to run it, thanks :)
I tried finding this file online but was not successful. |
Ah, I don't remember the char I used; you can use any wide char < 65536 to create such a file, e.g. with a small script like the sketch below.
|
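For example, with a small Node script along these lines (a sketch; U+4E2D is an arbitrary choice of wide char, the sizes are approximate):

```ts
// Writes roughly 50 MB of one repeated full width char (here U+4E2D)
// to ./chinese_out.
import * as fs from 'fs';

const wide = '\u4e2d';                   // any wide BMP char < 65536 works
const chunk = wide.repeat(1024 * 1024);  // ~3 MB of UTF-8 per write for this char
const fd = fs.openSync('./chinese_out', 'w');
for (let i = 0; i < 17; ++i) {           // 17 writes ≈ 50 MB
  fs.writeSync(fd, chunk);
}
fs.closeSync(fd);
```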
@jerch, I generated the chinese_out file but I'm having trouble understanding the command to run it.
|
You have to build xterm-benchmark first. |
@jerch, I'm really sorry that I am still stuck at running the benchmark.
|
@skprabhanjan I think that's an error in xterm.js; I see this too and ignored it. @Tyriar any insights on that error or how to silence it? About the puppeteer error - yeah, it tries to redownload the latest revision, which fails if it already got downloaded. You can also delete the whole puppeteer folder before doing a fresh install. |
@jerch, sorry, my bad; even though it threw an error, the lib folder was created and housed a cli.js.
PS: I tried running the existing perf cases like parser.ts and it threw the same error. |
Ah well, the benchmark tool cannot run TypeScript files directly, thus you need to call it with the compiled JS output. Here is the corrected test file I used:

```ts
import { ThroughputRuntimeCase } from '..';
import * as fs from 'fs';
const getStringCellWidth = require('xterm/lib/CharWidth').getStringCellWidth;
let content = fs.readFileSync('./chinese_out', {encoding: 'utf8'});
new ThroughputRuntimeCase('wcwidth', () => {
  console.log(getStringCellWidth(content));
  return {payloadSize: content.length};
}, {fork: true}).showRuntime().showThroughput().showAverageRuntime().showAverageThroughput();
```

Since the file is TypeScript, it has to be compiled first as well. |
Oops, sorry for that silly error, I realized it after I asked :P
Now I will actually start to do the main part. PS: I can add the steps I followed in xterm-benchmark in order to successfully run the test (which would be helpful for people like me). |
Sure, feel free to make issues/PRs there (I already fixed the chrome-timeline version to 0.0.11). |
@skprabhanjan Sweet, already found a faster variant of wcwidth. I think this cannot go much faster, as directly returning the number from the table is about the cheapest lookup possible. |
@skprabhanjan Another approach - programmatically walk the 0 and 2 width codepoints to fill the table, instead of repeatedly doing the bisect for every single codepoint:

```ts
function iTableProgrammed() {
  table.fill(1);
  table[0] = opts.nul;
  // control chars
  for (let i = 1; i < 32; ++i) {
    table[i] = opts.control;
  }
  for (let i = 0x7f; i < 0xa0; ++i) {
    table[i] = opts.control;
  }
  // combining 0 (ranges are inclusive, hence <=)
  for (let r = 0; r < COMBINING_BMP.length; ++r) {
    for (let i = COMBINING_BMP[r][0]; i <= COMBINING_BMP[r][1]; ++i) {
      table[i] = 0;
    }
  }
  // wide chars
  for (let i = 0x1100; i <= 0x115f; ++i) {
    table[i] = 2;
  }
  table[0x2329] = 2;
  table[0x232a] = 2;
  for (let i = 0x2e80; i <= 0xa4cf; ++i) {
    table[i] = 2;
  }
  table[0x303f] = 1; // wrongly added in loop before
  for (let i = 0xac00; i <= 0xd7a3; ++i) {
    table[i] = 2;
  }
  for (let i = 0xf900; i <= 0xfaff; ++i) {
    table[i] = 2;
  }
  for (let i = 0xfe10; i <= 0xfe19; ++i) {
    table[i] = 2;
  }
  for (let i = 0xfe30; i <= 0xfe6f; ++i) {
    table[i] = 2;
  }
  for (let i = 0xff00; i <= 0xff60; ++i) {
    table[i] = 2;
  }
  for (let i = 0xffe0; i <= 0xffe6; ++i) {
    table[i] = 2;
  }
}
iTableProgrammed();
```

This lowers the creation cost down to 1.5 ms for the whole table and can be done during lib initialization. This, coupled with the slightly more expensive 64k table, gives a really nice speedup at initialization and during runtime. 😸 @Tyriar Mission accomplished? |
The final thing looks like this now:

```ts
export const wcwidth = (function(opts: {nul: number, control: number}): (ucs: number) => number {
  // extracted from https://www.cl.cam.ac.uk/%7Emgk25/ucs/wcwidth.c
  // combining characters
  const COMBINING_BMP = [
    ...
  ];
  const COMBINING_HIGH = [
    ...
  ];
  // binary search
  function bisearch(ucs: number, data: number[][]): boolean {
    ...
  }
  function wcwidthBMP(ucs: number): number {
    ...
  }
  function isWideBMP(ucs: number): boolean {
    ...
  }
  function wcwidthHigh(ucs: number): 0 | 1 | 2 {
    ...
  }
  const control = opts.control | 0;
  // all above unchanged

  // create lookup table for BMP plane
  // TODO: make callable/configurable from UnicodeManager
  const table = new Uint8Array(65536);
  table.fill(1);
  table[0] = opts.nul;
  // control chars
  table.subarray(1, 32).fill(opts.control);
  table.subarray(0x7f, 0xa0).fill(opts.control);
  // combining 0 (+ 1 since the ranges are inclusive)
  for (let r = 0; r < COMBINING_BMP.length; ++r) {
    table.subarray(COMBINING_BMP[r][0], COMBINING_BMP[r][1] + 1).fill(0);
  }
  // wide chars
  table.subarray(0x1100, 0x1160).fill(2);
  table[0x2329] = 2;
  table[0x232a] = 2;
  table.subarray(0x2e80, 0xa4d0).fill(2);
  table[0x303f] = 1; // wrongly added before
  table.subarray(0xac00, 0xd7a4).fill(2);
  table.subarray(0xf900, 0xfb00).fill(2);
  table.subarray(0xfe10, 0xfe1a).fill(2);
  table.subarray(0xfe30, 0xfe70).fill(2);
  table.subarray(0xff00, 0xff61).fill(2);
  table.subarray(0xffe0, 0xffe7).fill(2);

  return function (num: number): number {
    if (num < 32) {
      return control | 0;
    }
    if (num < 127) {
      return 1;
    }
    if (num < 65536) {
      return table[num];
    }
    // do a full search for high codepoints
    return wcwidthHigh(num);
  };
})({nul: 0, control: 0}); // configurable options
```

Status:
Next steps:
|
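To illustrate the fast paths of the final version, a few example calls (a sketch; results assume the default {nul: 0, control: 0} options and the elided helpers behaving as in wcwidth.c):

```ts
console.log(wcwidth(0x41));    // 1 - 'A', printable ASCII fast path
console.log(wcwidth(0x09));    // 0 - control char, returns opts.control
console.log(wcwidth(0x4e2d));  // 2 - full width CJK char, direct table lookup
console.log(wcwidth(0x0300));  // 0 - combining mark, zero width
console.log(wcwidth(0x20000)); // non-BMP codepoint, takes the wcwidthHigh path
```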
@skprabhanjan Those errors are related to the TypeScript types; I'd expect removing the types would fix it.
If you can get the whole table init down dramatically, that's even better 😄 |
@jerch, I looked at the final version (which looks amazing) and ran the tests for it as well; it consistently shows 80.41 MB/s on average on my system. Edit: If chunkifying the table and loading only parts is still worth trying, then I can try that with a Uint8ClampedArray, splitting it into parts of 255 codepoints each. |
Since the table creation cost dropped to almost nothing, I think we should not do this anymore. Being able to directly return the width from the table is the main win here.
Not sure what this would look like, but sure, feel free to test different versions. |
@jerch, okay, I understood the point. |
@skprabhanjan |
@jerch, that sounds good and I agree there is nothing much to be done :) |
Fixed with #1789. |
On my fairly fast gaming desktop it's over a frame:
It looks like `initTable` could be changed to only load chunks of `table` as necessary; splitting it into 32 parts (65536/32 = 2048 codepoints) would probably make it have minimal impact on frames following a load. (xterm.js/src/CharWidth.ts, lines 126 to 131 in cb97ab3)