feat(web): optimize the wordbreaker data table for filesize and ease of first-load parsing ⚡ #10692

Merged (10 commits) on Aug 27, 2024
65 changes: 54 additions & 11 deletions common/models/wordbreakers/src/data-compiler/index.ts
@@ -95,6 +95,37 @@ const categoryMap = new Map<string, number>();

for(let cat of categories) {
categoryMap.set(cat, catIndexSeed++);
if(catIndexSeed == '`'.charCodeAt(0)) {
catIndexSeed++; // Skip the back-tick as an encoding symbol.
// Reduces complications, as it's the encoding string start/end char.
}
}

const bmpRanges: typeof ranges = [];
const nonBmpRanges: typeof ranges = [];

// { start: number, property: number}[]
for(let range of ranges) { // already sorted
if(range.start <= 0xFFFF) {
bmpRanges.push(range);
} else {
if(nonBmpRanges.length == 0) {
const finalBmpRange = bmpRanges[bmpRanges.length - 1];
bmpRanges.push({
start: 0xFFFF,
property: range.property,
end: undefined
});

nonBmpRanges.push({
start: 0x10000,
property: finalBmpRange.property,
end: undefined
});
}

nonBmpRanges.push(range);
}
}

//////////////////////// Creating the generated file /////////////////////////
@@ -107,28 +138,40 @@ let stream = fs.createWriteStream(generatedFilename);

// Generate the file!
stream.write(`// Automatically generated file. DO NOT MODIFY.

/**
* Valid values for a word break property.
*/
export const enum WordBreakProperty {
${ /* Create enum values for each word break property */
Array.from(categories)
.map(x => ` ${x}`)
.map(x => ` ${x} = ${categoryMap.get(x)}`)
.join(',\n')
}
};

export const WORD_BREAK_PROPERTY: [number, WordBreakProperty][] = [
${
// TODO: Two versions: one that's BMP-encoded, one that's non-BMP encoded.
ranges.map(({start, property}) => (` [` +
`/*start*/ 0x${start.toString(16).toUpperCase()}, ` +
`WordBreakProperty.${property}],`
)).join('\n')
export const WORD_BREAK_PROPERTY_BMP: string = \`${
// To consider: emit `\uxxxx` codes instead of the raw char?
bmpRanges.map(({start, property}) => {
let codedStart = String.fromCodePoint(start);
if(codedStart == '`') {
// Prevents accidental unescaped use of the string start/end char.
// The backslash gets removed on file-load.
codedStart = '\\`';
}
const codedProp = String.fromCharCode(categoryMap.get(property));
Review comment (Contributor, Author):

A possibly-relevant cross-reference from a different draft pred-text PR: #11088 (comment)

It may help the encoding to shift the enum values up by about 0x20 so that they avoid the ASCII control-code range.

Control characters are "fine" for the base character (codedStart), since each one appears at most once, as a key. As values, however, the control codes would be used very heavily. Assuming esbuild "escapes" control-code characters, frequent use of them would likely eat into much of our file-size savings.
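
A minimal sketch of what that shift could look like in the data compiler. The names `CATEGORY_OFFSET`, `RESERVED`, and `buildCategoryMap` are hypothetical and do not appear in this PR; the 0x20 starting point comes from the suggestion above. Skipping the backtick mirrors the existing check, while also skipping `\` and `$` is an extra assumption, made because both are significant inside a template literal (a `$` value followed by a `{` start character would open an interpolation).

```typescript
// Hypothetical sketch, not the PR's code: assign category codes starting at
// 0x20 (space) so that no property value falls in the ASCII control range.
const CATEGORY_OFFSET = 0x20;

// Characters that are awkward inside a template literal. Skipping the
// backtick matches the existing compiler; '\\' and '$' are assumptions.
const RESERVED = new Set(['`', '\\', '$'].map(c => c.charCodeAt(0)));

function buildCategoryMap(categories: string[]): Map<string, number> {
  const map = new Map<string, number>();
  let seed = CATEGORY_OFFSET;
  for (const cat of categories) {
    // Advance past any reserved code before assigning it.
    while (RESERVED.has(seed)) {
      seed++;
    }
    map.set(cat, seed++);
  }
  return map;
}
```

Because the generated `WordBreakProperty` enum is emitted from the same map, the enum values would shift along with the encoded characters, so code that compares against enum members would not need an extra subtraction at load time.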

return `${codedStart}${codedProp}`;
}).join('')
}\`

export const WORD_BREAK_PROPERTY_NON_BMP: string = \`${
// To consider: emit `\uxxxx` codes instead of the raw char?
nonBmpRanges.map(({start, property}) => {
const codedStart = String.fromCodePoint(start);
const codedProp = String.fromCharCode(categoryMap.get(property));
Review comment (Contributor, Author):

Same point as prior comment.
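
One wrinkle specific to this table: every start character here lies above U+FFFF, so it occupies two UTF-16 code units, while the property character occupies one. The reader for this string is not part of this diff, so the following is only a sketch of how such a string could be unpacked; `decodeRanges` is a hypothetical name, and it assumes the property character's code is the enum value itself.

```typescript
// Hypothetical sketch: unpack a "start char + property char" string into
// [startCodePoint, propertyCode] pairs, stepping by code point for the start.
function decodeRanges(encoded: string): [number, number][] {
  const out: [number, number][] = [];
  let i = 0;
  while (i < encoded.length) {
    // codePointAt combines a surrogate pair into a single code point.
    const start = encoded.codePointAt(i)!;
    i += start > 0xFFFF ? 2 : 1;
    // The property is always a single UTF-16 code unit.
    const prop = encoded.charCodeAt(i);
    i += 1;
    out.push([start, prop]);
  }
  return out;
}
```

The same loop would also handle the BMP string, since it advances by one code unit whenever the start character fits in 16 bits.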

return `${codedStart}${codedProp}`;
}).join('')
}
];
`);
\``);

/**
* Reads a Unicode character property file.