feat(web): optimize the wordbreaker data table for filesize and ease of first-load parsing ⚡ #10692
Merged
Changes from 3 commits
10 commits
0b79b86 feat(common/models): initial pass for encoded-string wordbreaker data… (jahorton)
47d945f fix(web): conditional import path (jahorton)
bbda771 chore(web): Merge branch 'feat/web/wordbreaker-property-data-gen' int… (jahorton)
23c0308 chore(web): Merge branch 'feat/web/wordbreaker-property-data-gen' int… (jahorton)
5045df1 change(common/models): improve encoding format (jahorton)
fdecc60 feat(common/models): connect encoded lookup table (jahorton)
b145adf fix(web): fixes value for end of BMP range (jahorton)
02f721d feat(common/models): add property-lookup unit test set (jahorton)
fef5b3a chore(web): Merge branch 'feat/web/wordbreaker-property-data-gen' int… (jahorton)
4159bab fix(common/models): fix unit-test reference to renamed file (jahorton)
@@ -95,6 +95,37 @@ const categoryMap = new Map<string, number>();

for(let cat of categories) {
  categoryMap.set(cat, catIndexSeed++);
  if(catIndexSeed == '`'.charCodeAt(0)) {
    catIndexSeed++; // Skip the back-tick as an encoding symbol.
                    // Reduces complications, as it's the encoding string start/end char.
  }
}

const bmpRanges: typeof ranges = [];
const nonBmpRanges: typeof ranges = [];

// { start: number, property: number}[]
for(let range of ranges) { // already sorted
  if(range.start <= 0xFFFF) {
    bmpRanges.push(range);
  } else {
    if(nonBmpRanges.length == 0) {
      const finalBmpRange = bmpRanges[bmpRanges.length - 1];
      bmpRanges.push({
        start: 0xFFFF,
        property: range.property,
        end: undefined
      });

      nonBmpRanges.push({
        start: 0x10000,
        property: finalBmpRange.property,
        end: undefined
      });
    }

    nonBmpRanges.push(range);
  }
}

//////////////////////// Creating the generated file /////////////////////////

@@ -107,28 +138,40 @@ let stream = fs.createWriteStream(generatedFilename);

// Generate the file!
stream.write(`// Automatically generated file. DO NOT MODIFY.

/**
 * Valid values for a word break property.
 */
export const enum WordBreakProperty {
${ /* Create enum values for each word break property */
  Array.from(categories)
    .map(x => ` ${x}`)
    .map(x => ` ${x} = ${categoryMap.get(x)}`)
    .join(',\n')
  }
};

export const WORD_BREAK_PROPERTY: [number, WordBreakProperty][] = [
${
  // TODO: Two versions: one that's BMP-encoded, one that's non-BMP encoded.
  ranges.map(({start, property}) => (` [` +
    `/*start*/ 0x${start.toString(16).toUpperCase()}, ` +
    `WordBreakProperty.${property}],`
  )).join('\n')
export const WORD_BREAK_PROPERTY_BMP: string = \`${
  // To consider: emit `\uxxxx` codes instead of the raw char?
  bmpRanges.map(({start, property}) => {
    let codedStart = String.fromCodePoint(start);
    if(codedStart == '`') {
      // Prevents accidental unescaped use of the string start/end char.
      // The backslash gets removed on file-load.
      codedStart = '\\`';
    }
    const codedProp = String.fromCharCode(categoryMap.get(property));
    return `${codedStart}${codedProp}`;
  }).join('')
}\`

export const WORD_BREAK_PROPERTY_NON_BMP: string = \`${
  // To consider: emit `\uxxxx` codes instead of the raw char?
  nonBmpRanges.map(({start, property}) => {
    const codedStart = String.fromCodePoint(start);
    const codedProp = String.fromCharCode(categoryMap.get(property));

Review comment: Same point as prior comment.

    return `${codedStart}${codedProp}`;
  }).join('')
}
];
`);
\``);

/**
 * Reads a Unicode character property file.

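For readers following the diff, here is a minimal sketch of how a consumer might read a string shaped like `WORD_BREAK_PROPERTY_BMP`. It assumes each entry is exactly two UTF-16 code units (the range's start character followed by the property character), which matches how the generator above concatenates `codedStart` and `codedProp` for BMP ranges once the escaping backslash has been dropped on file load. The names `DecodedRange`, `decodeBmpTable`, and `lookupProperty` are hypothetical, and the toy table is made up for illustration; the PR's actual lookup code is not shown in this view.

```typescript
// Hypothetical decoded form of one table entry.
interface DecodedRange {
  start: number;    // first code point covered by the range
  property: number; // numeric word-break property value
}

// Splits the encoded string into (start, property) pairs.
// Assumption: every entry occupies exactly two UTF-16 code units.
function decodeBmpTable(encoded: string): DecodedRange[] {
  const ranges: DecodedRange[] = [];
  for (let i = 0; i + 1 < encoded.length; i += 2) {
    ranges.push({
      start: encoded.charCodeAt(i),
      property: encoded.charCodeAt(i + 1)
    });
  }
  return ranges;
}

// Binary search for the last range whose start is <= codePoint.
// Returns -1 if the table has no range at or below codePoint.
function lookupProperty(ranges: DecodedRange[], codePoint: number): number {
  let lo = 0;
  let hi = ranges.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (ranges[mid].start <= codePoint) {
      found = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found >= 0 ? ranges[found].property : -1;
}

// Toy table (not real Unicode data):
// U+0000 -> property 1, U+0041 'A' -> property 2, U+005B '[' -> property 1.
const toyTable = "\u0000\u0001" + "A\u0002" + "[\u0001";
const toyRanges = decodeBmpTable(toyTable);
```

With the toy table, `lookupProperty(toyRanges, 0x41)` returns 2 and `lookupProperty(toyRanges, 0x5B)` returns 1. A non-BMP table would need a different stride, since each start character there is a surrogate pair.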
A possibly-relevant cross-reference from a different draft pred-text PR: #11088 (comment)
It may help the encoding to shift the enum values by about 0x20 in order to avoid use of the ASCII control-code range.
Those chars are "fine" for the base character (`codedStart`) since they'll only be used once each, if that, as a key. Used for values, however, the control codes would see very high usage. Assuming `esbuild` would "escape" control-code characters, frequent use of them would likely greatly lower our file-size savings.
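As a rough illustration of the suggestion above, the sketch below shifts property values by 0x20 before encoding and counts how many C0 control characters (U+0000 to U+001F) end up in the result. `PROPERTY_OFFSET`, `encodeProperty`, `decodeProperty`, and `countControlChars` are hypothetical names, and the sample values are invented; nothing here measures what `esbuild` actually emits.

```typescript
// Hypothetical offset taken from the review comment ("about 0x20").
const PROPERTY_OFFSET = 0x20;

// Encodes a numeric property value as one character, shifted by `offset`.
function encodeProperty(value: number, offset: number): string {
  return String.fromCharCode(value + offset);
}

// Reverses encodeProperty for a single-character string.
function decodeProperty(ch: string, offset: number): number {
  return ch.charCodeAt(0) - offset;
}

// Counts C0 control characters (U+0000..U+001F) in a string.
function countControlChars(s: string): number {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    if (s.charCodeAt(i) < 0x20) {
      count++;
    }
  }
  return count;
}

// Invented sample of small property values, as an enum seeded near 0 might produce.
const sampleValues = [0, 1, 2, 3, 1, 0, 2];
const unshifted = sampleValues.map(v => encodeProperty(v, 0)).join('');
const shifted = sampleValues.map(v => encodeProperty(v, PROPERTY_OFFSET)).join('');
```

Here `countControlChars(unshifted)` is 7 and `countControlChars(shifted)` is 0, and `decodeProperty` recovers the original values when given the same offset. A shifted scheme would still have to steer clear of characters that are special inside a template literal, such as the backtick, the backslash, and `$` when followed by `{`.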