
initTable is slow, can we lazily load parts as they're needed? #1785

Closed
Tyriar opened this issue Nov 11, 2018 · 29 comments

Tyriar (Member) commented Nov 11, 2018

On my fairly fast gaming desktop it takes over a frame:

[profiler screenshot: initTable taking ~20 ms]

It looks like initTable could be changed to only load chunks of the table as necessary; splitting it into 32 parts (65536 / 32 = 2048 codepoints each) would probably keep the impact on the frames following a load minimal.

xterm.js/src/CharWidth.ts

Lines 126 to 131 in cb97ab3

const CODEPOINTS = 65536; // BMP holds 65536 codepoints
const BITWIDTH = 2; // a codepoint can have a width of 0, 1 or 2
const ITEMSIZE = 32; // using uint32_t
const CONTAINERSIZE = CODEPOINTS * BITWIDTH / ITEMSIZE;
const CODEPOINTS_PER_ITEM = ITEMSIZE / BITWIDTH;
table = (typeof Uint32Array === 'undefined')
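
For orientation, these constants imply CONTAINERSIZE = 65536 * 2 / 32 = 4096 uint32 entries, i.e. a 16 KB table with ITEMSIZE / BITWIDTH = 16 codepoints packed per entry. A worked example of the read path, derived purely from these constants (the same bit math appears in the snippets further down the thread):

// decode the 2-bit width of a BMP codepoint from the packed uint32 table
const num = 0x4E00;                            // a wide CJK char, width 2
const entry = table[num >> 4];                 // 16 codepoints per uint32 entry
const width = entry >> ((num & 15) << 1) & 3;  // shift to the 2-bit slot, mask it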

Tyriar added the type/proposal, area/performance and area/parser labels on Nov 11, 2018
jerch (Member) commented Nov 12, 2018

@Tyriar 20 ms - eww, that's much more than it shows in the tests (I see an average of 8 ms for it).

Yeah, it can be done in chunks. It might have a negative impact on the input flow though, since it would introduce another condition to be evaluated for every single char - have to test it.

Btw, what are those really high framerates on the right side - the new thingy? (not letting the cat out of the bag) 😸

skprabhanjan (Contributor) commented

@Tyriar, can I look at changing this?
Should const CODEPOINTS = 65536; be changed to 2048, and what would be the next steps?
Thanks :)

jerch (Member) commented Nov 14, 2018

@skprabhanjan It's already highly optimized code; the returned function is used in the hot loop during input for every char. Therefore it's not that easy to chunkify it without reintroducing a runtime penalty. Also it is not directly testable, since v8 typically inlines the function. If you are still with me and still want to do it, these are the steps to be taken:

  • See faster wcwidth with lookup table #798 for the original PR of this, with the old impl and some benchmark numbers
  • split the lookup table into either several tables or one table whose parts are filled independently on the first occurrence of a char within a part's range
  • extend the returned function with a condition that inits the appropriate table (part) and fix the used table / offset
  • compare the runtime of InputHandler.print before and after the change with some input data, either via Chrome's timeline view (less reliable) or with https://github.com/xtermjs/xterm-benchmark (more reliable)

There are several tricks possible to further "hide" the workload (maybe use requestIdleCallback in the browser).
The original PR #798 saved ~10% runtime for ASCII chars and >80% for CJK chars in InputHandler.print compared to the old implementation (don't remember the exact values). An acceptable solution should not penalize the current runtime of InputHandler.print by more than 5-10% imho (must be tested with the different input char ranges).

Oh - and the lookup table is highly packed to avoid wasting memory. It would be good if this change does not end up taking much more memory.

skprabhanjan (Contributor) commented Nov 15, 2018

@jerch, thanks for the reply.
I will need more help from your side going forward with this.
I have tried to understand this for quite some time and will keep digging, but for now I need clarification on one question (also, can you tell me what exactly to start off with?):

split the lookup table into either several tables or one table whose parts are filled independently on the first occurrence of a char within a part's range

If I choose to go with the second approach, what exactly does first occurrence of a char within a part's range mean?
Thanks :)

jerch (Member) commented Nov 15, 2018

@skprabhanjan I think the most straightforward way is to do the following:

  • let table: number[] | Uint32Array = null;
    replace the table with a table array that is going to hold the subtables of the chunks, let's call it chunkTables

  • function initTable(): number[] | Uint32Array {
    give initTable some sort of argument that points to the chunk to be initialized

  • in initTable only initialize the range of a single chunk, set the created sub table to the right slot in chunkTables, and return that sub table from initTable to be used in the returned function. The resulting tables should not exceed 65536 entries in total (that's the BMP plane only), as the higher chars rarely occur and are scattered up to 2^21, which would just fill the memory with useless lookup entries. The higher planes are handled at the end of the returned function with the full bisect search.

  • in the returned function grab the chunk table, maybe like this:

    const t = chunkTables[num >> 11] || initTable(num >> 11);

    Here num >> 11 splits the full range of 65536 into 32 chunks, num >> 12 into 16, and so on (the number of chunks is 2^(16 - shifted bits)). The bit operation also gives you the hint that the chunk in initTable has to be filled from the argument upwards (as it automatically floors the number).
    You will also need to change the flow control a bit; especially for num >= 65536 we don't want to create a lookup table, so the line above should maybe go into the if (num < 65536) check.

From there on it's a matter of code golfing to keep the runtime as low as possible. As suggested above you might also want to try a single big table instead of distinct tables, but then you would have to track the separate init states as well. Not sure which one turns out to be faster; I think the approach with distinct tables is superior as it might be easier to comprehend.

If you do it like above, the main difference in the hot function compared to the current impl is the chunkTables[num >> 11] lookup vs. the plain table access, plus some additional modulo calculation to apply num to the smaller table. I think it will not penalize the runtime much (only a few cycles if the cache does not miss).
Also feel free to rename things, maybe createChunkTable and so on. A minimal sketch of how these steps could fit together is shown below.
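
The sketch below assumes the 32-chunk split (num >> 11); chunkTables/initChunkTable are the hypothetical names from above, while wcwidthBMP, wcwidthHigh and control are the existing pieces of the wcwidth IIFE:

// inside the wcwidth IIFE, replacing table/initTable
const chunkTables: Uint32Array[] = new Array(32);

function initChunkTable(chunk: number): Uint32Array {
  // 2048 codepoints * 2 bits = 128 uint32 entries per chunk
  const sub = new Uint32Array(128);
  const base = chunk << 11;  // first codepoint covered by this chunk
  for (let i = 0; i < 2048; ++i) {
    sub[i >> 4] |= wcwidthBMP(base + i) << ((i & 15) << 1);
  }
  chunkTables[chunk] = sub;
  return sub;
}

// the returned hot function
return function (num: number): number {
  if (num < 32) return control | 0;
  if (num < 127) return 1;
  if (num < 65536) {
    const t = chunkTables[num >> 11] || initChunkTable(num >> 11);
    return t[(num & 2047) >> 4] >> ((num & 15) << 1) & 3;
  }
  return wcwidthHigh(num);  // full bisect search for the higher planes
};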

jerch (Member) commented Nov 16, 2018

Did some quick tests, with not so nice results. Changing this:

      const t = table || initTable();
      if (num < 65536) {
        return t[num >> 4] >> ((num & 15) << 1) & 3;
      }

into this:

    if (num < 65536) {
      const t = chunkTables[num >> 11] || initChunkTable(num >> 11);
      return t[(num & 2047) >> 4] >> ((num & 15) << 1) & 3;
    }

doubles the wcwidth runtime for any char > 127 (throughput dropped from ~70 MB/s to ~35 MB/s). I even tried with 5 MB of the same Chinese char to rule out cache misses; it seems the additional chunk table calculations (num >> 11 and num & 2047) alone cause this. ASCII chars are at ~80 MB/s. This hot function is very sensitive to any code additions. 😢

Currently wcwidth accounts for a third of InputHandler.print's runtime, so doubling it here would add a third on top of that runtime for CJK chars. Since many CJK chars already get inserted much more slowly due to their double-cell nature, I think we should not chunkify the lookup table, as it would further degrade performance for East Asian users.

Hmm, not sure yet how to solve this; maybe we should try the requestIdleCallback idea.

@skprabhanjan Maybe you have a better idea how to get the chunks working in a speedy manner, but 35 MB/s is not acceptable for wcwidth alone, as it would have a big impact on the whole input flow. I am currently aiming at >30 MB/s for the whole input flow with the new typed array buffer, and wcwidth is only a tiny part of that. Feel free to try totally different ideas. (I remember that using a Uint8Array instead of a Uint32Array was a bit faster; guess we could try to change that as well.)

Edit: Yep, with a Uint8Array the Chinese char shows 80 MB/s as well (but table creation is a bit slower).

jerch (Member) commented Nov 16, 2018

Next try:
With the sacrifice of some memory (64 KB vs. 16 KB in the current packed version) it is possible to go with a writethrough model that saves single char widths independently and does not need initTable:

...
const table = new Uint8Array(65536);
table.fill(255);
...
return (num) => {
    ...
    if (num < 65536) {
        return (table[num] === 255) ? (table[num] = wcwidthBMP(num)) : table[num];
    }
    ...
}

It's stable over 70 MB/s for an already seen char, but still drops down to 20 MB/s for fresh chars. Hmm, kinda promising; not sure yet if avoiding one frame hiccup is worth this.
Btw, with initTable in place this unpacked table variant with return table[num]; is the fastest (who would have guessed - ~85 MB/s). It even outperforms the ASCII shortcut if the function is branchless (which would not work, since we really don't want to build a table for the higher planes).
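
Filled out, the writethrough variant could look like this (a sketch; wcwidthBMP, wcwidthHigh and control are the existing pieces of the IIFE, and 255 serves as the "not yet computed" marker since real widths are only 0, 1 or 2):

// unpacked 64 KB table, lazily filled per codepoint
const table = new Uint8Array(65536);
table.fill(255);  // 255 = width not computed yet

return function (num: number): number {
  if (num < 32) return control | 0;
  if (num < 127) return 1;
  if (num < 65536) {
    // writethrough: compute on first sight, cache for all later lookups
    return (table[num] === 255) ? (table[num] = wcwidthBMP(num)) : table[num];
  }
  return wcwidthHigh(num);
};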

skprabhanjan (Contributor) commented Nov 16, 2018

@jerch, I went through all the comments and followed the complete sequence up to the point where you said the writethrough model was the fastest (via the latest comment).
I will also check whether anything else is feasible.
But can you please help me set up how to run the tests to get the results you are sharing (e.g. the 35 MB/s figure and so on)? I suppose I should use https://github.com/xtermjs/xterm-benchmark, but I am not able to connect the two and reproduce the numbers you quoted.
PS: Sorry for being slow and thanks for your help - I am just trying to understand each thing in detail.

jerch (Member) commented Nov 16, 2018

Yes, xterm-benchmark is the right tool for it. Currently there is no test case for wcwidth, so you have to write one first, something like this (rewritten from memory, since I am currently not at the machine from the tests above):

import { before, ThroughputRuntimeCase } from '..';
import * as fs from 'fs';
const getStringWidth = require('xterm/lib/CharWidth').getStringWidth;

let content = fs.readFileSync('./chinese_out', {encoding: 'utf8'});

new ThroughputRuntimeCase('', () => {
    console.log(getStringWidth(content));
    return {payloadSize: content.length};
}, {fork: true}).showRuntime().showThroughput().showAverageRuntime().showAverageThroughput();

You can run the test with node lib/cli.js <path_to_testfile>. Note that the numbers you see might be higher if you have a better system (I tested on a quite old Sandy Bridge laptop).

About the writethrough model - yeah, this looks promising, but it still drops throughput badly for new chars.

Edit: './chinese_out' is a 50 MB file with the same Chinese full-width char repeated.

skprabhanjan (Contributor) commented

Sure, now I understand how to run it, thanks :)

Edit: './chinese_out' is a 50 MB file with the same Chinese full-width char repeated.

I tried finding this file online but was not successful.
Can you please let me know where I can find it?
Thanks :)

jerch (Member) commented Nov 16, 2018

Ah, I don't remember the char I used - you can use any wide char < 65536 to create such a file, e.g.:

printf %16666667s | sed 's/ /冾/g' > chinese_out
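
(If printf/sed are not at hand, a Node one-liner should produce an equivalent file - 冾 encodes to 3 bytes in UTF-8, so 16666667 repetitions come out at roughly 50 MB:)

node -e "require('fs').writeFileSync('chinese_out', '冾'.repeat(16666667))"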

skprabhanjan (Contributor) commented

@jerch, I generated the chinese_out file but am having trouble understanding the command to run it.

node lib/cli.js <path_to_testfile> is the command you said to run, but I can't find the lib folder anywhere (I searched both xterm.js and xterm-benchmark; xterm.js has a lib folder, but it doesn't contain cli.js).
Because of the above, I am not able to run the test successfully.
Thanks for your continued help in guiding me :)

jerch (Member) commented Nov 16, 2018

You have to build xterm-benchmark first with npm run tsc, which will create lib/cli.js. And the repo should be cloned next to the xterm.js repo, otherwise it will not find xterm.
Also the test file should be created under xterm-benchmark/src/xterm_perfcases or the imports will not work.
Yeah sorry, the benchmark tool is still alpha.

skprabhanjan (Contributor) commented

@jerch, I'm really sorry that I am still stuck at running the benchmark.
I got a new bunch of errors:

  • I tried npm install; it threw an error saying

No matching version found for [email protected]

Then it threw a Chromium download error regarding PUPPETEER_SKIP_CHROMIUM_DOWNLOAD.

  • I checked for a solution online and set npm config set puppeteer_skip_chromium_download true -g

npm install then worked, skipping the Chromium download.

  • I ran npm run tsc after this

It throws many errors (screenshot attached).

[screenshot: TypeScript build errors from npm run tsc]

jerch (Member) commented Nov 16, 2018

@skprabhanjan I think that's an error in xterm.js; I see it too and have ignored it. @Tyriar, any insights on that error or how to silence it?
The build still works; if not, you can try to run npm run watch or rename all script file endings under src/xterm_perfcases (xterm-benchmark does not need or build xterm itself, just those scripts import the resources).
Sorry for all the hassles.

About the puppeteer error - yeah, it tries to redownload the latest revision, which fails if it already got downloaded. You can also delete the whole puppeteer folder before doing npm install again.

skprabhanjan (Contributor) commented Nov 16, 2018

@jerch, sorry, my bad - even though it threw errors, the lib folder was created and housed a cli.js.
So I went ahead and ran node lib/cli.js src/xterm_perfcases/wcwidth_perf.ts.
Running this throws one more weird error which I am not able to debug:

(function (exports, require, module, __filename, __dirname) { import { perfContext, before, ThroughputRuntimeCase } from '..'; ^^^^^^ SyntaxError: Unexpected token import

PS: I tried running the existing perfcases like parser.ts and it threw the same error.
Am I doing something completely wrong here?

jerch (Member) commented Nov 16, 2018

Ah well - the benchmark tool cannot run TypeScript files directly, thus you need to call it with lib/.../...js instead of src/.../...ts.

Here is the corrected test file I used:

import { ThroughputRuntimeCase } from '..';
import * as fs from 'fs';
const getStringCellWidth = require('xterm/lib/CharWidth').getStringCellWidth;

let content = fs.readFileSync('./chinese_out', {encoding: 'utf8'});

new ThroughputRuntimeCase('wcwidth', () => {
    console.log(getStringCellWidth(content));
    return {payloadSize: content.length};
}, {fork: true}).showRuntime().showThroughput().showAverageRuntime().showAverageThroughput();

Since the file is TypeScript and resides under src/, it also gets transpiled to JS under lib/.

skprabhanjan (Contributor) commented Nov 16, 2018

Oops, sorry for that silly error - I realized it right after I asked :P
I was able to run the tests successfully (without any changes yet) and this was the result:

Context "lib/xterm_perfcases/wcwidth_perf.js"
33333334
Case "wcwidth" : 1 - runtime: 217.83 ms
Case "wcwidth" : 1 - throughput: 72.97 MB/s
Case "wcwidth" : 1 runs - average runtime: 217.83 ms
Case "wcwidth" : 1 runs - average throughput: 72.97 MB/s

Now I will actually start on the main part.
Thanks for the help and sorry for the trouble.

PS: I can add the steps I followed to successfully run the test to xterm-benchmark (which would be helpful for people like me).

jerch (Member) commented Nov 16, 2018

PS: I can add the steps I followed to successfully run the test to xterm-benchmark (which would be helpful for people like me).

Sure, feel free to open issues/PRs there (I already fixed the chrome-timeline version to 0.0.11).

jerch (Member) commented Nov 16, 2018

@skprabhanjan Sweet - I already found a faster variant of getStringCellWidth (also used in InputHandler.print). Also, I think the old variant does not do quite the right thing (as described at https://mathiasbynens.be/notes/javascript-unicode). I am now at 90 MB/s with the writethrough version, which is a big improvement (current master is at 70 MB/s).

I think this cannot go much faster, as directly returning the number from wcwidth, omitting all the lookup table stuff, runs at ~100 MB/s. If we allowed slightly wrong handling of surrogate pairs in getStringCellWidth I could get it to this 100 MB/s limit, but I think we should not do that (it would basically not check the second surrogate half for the correct codepoint range before combining into a UTF-32 codepoint, potentially producing invalid UTF-32 codepoints).
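
For reference, the surrogate pair combination under discussion is the standard UTF-16 decoding sketched below (a generic illustration, not the actual xterm.js code); the "slightly wrong" fast path would skip the range check on the low half:

// combine a UTF-16 surrogate pair into a UTF-32 codepoint
function combineSurrogates(high: number, low: number): number {
  if (high >= 0xD800 && high <= 0xDBFF && low >= 0xDC00 && low <= 0xDFFF) {
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
  }
  return high;  // not a valid pair - treat the first code unit on its own
}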

jerch (Member) commented Nov 16, 2018

@skprabhanjan Another approach: programmatically walk the 0- and 2-width codepoints to fill the table. Instead of repeatedly doing the bisect and isWideBMP eval for every single codepoint from 0 to 65536, we can simply create the lookup table more cheaply like this:

// fills the 64k Uint8Array `table` (from the surrounding closure) with widths 0, 1 or 2
function iTableProgrammed() {
  table.fill(1);
  table[0] = opts.nul;
  // control chars
  for (let i = 1; i < 32; ++i) {
    table[i] = opts.control;
  }
  for (let i = 0x7f; i < 0xa0; ++i) {
    table[i] = opts.control;
  }
  // combining 0
  for (let r = 0; r < COMBINING_BMP.length; ++r) {
    for (let i = COMBINING_BMP[r][0]; i < COMBINING_BMP[r][1]; ++i) {
      table[i] = 0;
    }
  }
  // wide chars
  for (let i = 0x1100; i <= 0x115f; ++i) {
    table[i] = 2;
  }
  table[0x2329] = 2;
  table[0x232a] = 2;
  for (let i = 0x2e80; i <= 0xa4cf; ++i) {
    table[i] = 2;
  }
  table[0x303f] = 1;  // wrongly added in loop before
  for (let i = 0xac00; i <= 0xd7a3; ++i) {
    table[i] = 2;
  }
  for (let i = 0xf900; i <= 0xfaff; ++i) {
    table[i] = 2;
  }
  for (let i = 0xfe10; i <= 0xfe19; ++i) {
    table[i] = 2;
  }
  for (let i = 0xfe30; i <= 0xfe6f; ++i) {
    table[i] = 2;
  }
  for (let i = 0xff00; i <= 0xff60; ++i) {
    table[i] = 2;
  }
  for (let i = 0xffe0; i <= 0xffe6; ++i) {
    table[i] = 2;
  }
}
iTableProgrammed();

This lowers the creation cost down to 1.5 ms for the whole table, and it can be done during lib initialization. Coupled with the slightly more expensive 64 KB table, this gives a really nice speedup at initialization and during runtime. 😸

@Tyriar Mission accomplished?

Edit: With .subarray instead of the loops the above table creation runs in 0.2 ms, lol.

jerch (Member) commented Nov 16, 2018

The final thing looks like this now:

export const wcwidth = (function(opts: {nul: number, control: number}): (ucs: number) => number {
    // extracted from https://www.cl.cam.ac.uk/%7Emgk25/ucs/wcwidth.c
    // combining characters
    const COMBINING_BMP = [
      ...
    ];
    const COMBINING_HIGH = [
      ...
    ];
    // binary search
    function bisearch(ucs: number, data: number[][]): boolean {
      ...
    }
    function wcwidthBMP(ucs: number): number {
      ...
    }
    function isWideBMP(ucs: number): boolean {
      ...
    }
    function wcwidthHigh(ucs: number): 0 | 1 | 2 {
      ...
    }
    const control = opts.control | 0;
    // all above unchanged

    // create lookup table for BMP plane
    // TODO: make callable/configurable from UnicodeManager
    const table = new Uint8Array(65536);
    table.fill(1);
    table[0] = opts.nul;
    // control chars
    table.subarray(1, 32).fill(opts.control);
    table.subarray(0x7f, 0xa0).fill(opts.control);
    // combining 0
    for (let r = 0; r < COMBINING_BMP.length; ++r) {
      table.subarray(COMBINING_BMP[r][0], COMBINING_BMP[r][1]).fill(0);
    }
    // wide chars
    table.subarray(0x1100, 0x1160).fill(2);
    table[0x2329] = 2;
    table[0x232a] = 2;
    table.subarray(0x2e80, 0xa4d0).fill(2);
    table[0x303f] = 1;  // wrongly added before
    table.subarray(0xac00, 0xd7a4).fill(2);
    table.subarray(0xf900, 0xfb00).fill(2);
    table.subarray(0xfe10, 0xfe1a).fill(2);
    table.subarray(0xfe30, 0xfe70).fill(2);
    table.subarray(0xff00, 0xff61).fill(2);
    table.subarray(0xffe0, 0xffe7).fill(2);

    return function (num: number): number {
      if (num < 32) {
        return control | 0;
      }
      if (num < 127) {
        return 1;
      }
      if (num < 65536) {
        return table[num]; 
      }
      // do a full search for high codepoints
      return wcwidthHigh(num);
    };
})({nul: 0, control: 0});  // configurable options

Status:

  • faster lookup table init compared to master (0.2 ms vs. 20 ms)
  • faster at runtime during input flow (85 MB/s vs. 70 MB/s for CJK wide chars)
  • slightly bigger lookup table (64 KB vs. 16 KB)

Next steps:

  • prepare for integration with the upcoming unicode provider (multiple unicode version support, #1714) - should lazily init the lookup table to avoid initializing unused tables with multiple wcwidth versions
  • equality tests (sketched below)
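
Such an equality test could simply compare the table-backed wcwidth against the bisect-based path for every BMP codepoint. A sketch, assuming the repo's mocha/chai test setup; referenceWidth is hypothetical here and stands for the old bisect-based computation:

import { assert } from 'chai';
import { wcwidth } from '../CharWidth';

describe('wcwidth lookup table', () => {
  it('matches the bisect-based reference for all BMP codepoints', () => {
    for (let cp = 0; cp < 65536; ++cp) {
      // referenceWidth: the old wcwidthBMP-based computation, kept for the test
      assert.equal(wcwidth(cp), referenceWidth(cp), `codepoint ${cp}`);
    }
  });
});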

Tyriar (Member, Author) commented Nov 16, 2018

It throws many errors (screenshot attached).

@skprabhanjan Those errors are related to the TypeScript types; I'd expect removing node_modules and running yarn to fix it.

Mission accomplished?

If you can get the whole table init down dramatically that's even better 😄

skprabhanjan (Contributor) commented Nov 17, 2018

@jerch, I looked at the final version (which looks amazing) and ran the tests for it as well; it consistently shows 80.41 MB/s on average on my system.
One thing I would suggest:
create and populate the table only if (num < 65536), as in previous versions :)
Also, what would be the next steps from my end?
Thanks :)

Edit: If chunkifying the table and loading only parts is still worth trying, I could try that with a Uint8ClampedArray, splitting it into parts of 255 codepoints each.

jerch (Member) commented Nov 18, 2018

create and populate the table only if (num < 65536), as in previous versions :)

Since the table creation cost dropped to almost nothing, I think we should not do this anymore. Being able to directly return table[num] gives a ~8 MB/s boost, so I see no reason to stick with the slower createTable || ... thing. return table[num]; is about the shortest hot function version I can think of without pulling the table to the caller side (and the latter probably would not give any advantage, since it will be inlined anyway).
Since wcwidth will be part of the unicode manager, I think the table should instead be initialized there once a specific unicode version is requested (#1714).

If chunkifying the table and loading only parts is still worth trying, I could try that with a Uint8ClampedArray, splitting it into parts of 255 codepoints each

Not sure what this would look like, but sure, feel free to test different versions.

skprabhanjan (Contributor) commented

@jerch, okay, I understand the point.
If you could also specify what should be done next from my end, that would be helpful :)
Thanks for the constant feedback and help :)

jerch (Member) commented Nov 18, 2018

@skprabhanjan
Imho there is nothing left to be done here besides a regression test. Since I already have a PR for this, I am going to add the test and push that PR.
Many thanks for your offer to help, much appreciated. And sorry for the bumpy road with xterm-benchmark. Maybe you want to take on one of the other issues - just look around and pick whatever you feel comfortable with.

skprabhanjan (Contributor) commented

@jerch, that sounds good, and I agree there is not much left to be done :)
It was a very good experience interacting with you and learning from your guidance - thanks a lot :)
And yes, I shall find a suitable issue :)

Tyriar added this to the 3.9.0 milestone on Nov 23, 2018
Tyriar added the type/enhancement label and removed the type/proposal label on Nov 23, 2018
jerch (Member) commented Nov 23, 2018

Fixed with #1789.
