-
-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(web): optimize the wordbreaker data table for filesize and ease of first-load parsing ⚡ #10692
Conversation
User Test ResultsTest specification and instructions Test Artifacts
|
…o feat/web/wordbreaker-data-optimization
// The backslash gets removed on file-load. | ||
codedStart = '\\`'; | ||
} | ||
const codedProp = String.fromCharCode(categoryMap.get(property)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A possibly-relevant cross-reference from a different draft pred-text PR: #11088 (comment)
It may help the encoding to shift the enum values by about 0x20 in order to avoid use of the ASCII control-code range.
Those chars are "fine" for the base character (codedStart
) since they'll only be used once each, if that, as a key. Used for values, however, the control codes would see very high usage. Assuming esbuild
would "escape" control-code characters, frequent use of them would likely greatly lower our file-size savings.
// To consider: emit `\uxxxx` codes instead of the raw char? | ||
nonBmpRanges.map(({start, property}) => { | ||
const codedStart = String.fromCodePoint(start); | ||
const codedProp = String.fromCharCode(categoryMap.get(property)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same point as prior comment.
…o feat/web/wordbreaker-data-optimization
local RETRY_DELAY=5 # Make curl sleep this amount of time before each retry when a transfer has failed | ||
|
||
echo "Downloading ${SRC} - ${RETRY} attempts" | ||
# local URL_DOWNLOAD_FILE=`curl --retry "$RETRY" --retry-delay "$RETRY_DELAY" --silent "${SRC}" | "$JQ" -r .txt` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# local URL_DOWNLOAD_FILE=`curl --retry "$RETRY" --retry-delay "$RETRY_DELAY" --silent "${SRC}" | "$JQ" -r .txt` |
EMOJI_DATA_SRC_HREF="https://www.unicode.org/Public/$KEYMAN_VERSION_UNICODE/ucd/emoji/emoji-data.txt" | ||
EMOJI_DATA_SRC_LOCAL="./emoji-data.txt" | ||
|
||
function downloadPropertyFile() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I adapted the pattern used here from that seen in the linked block below:
keyman/resources/build/build-download-resources.sh
Lines 22 to 41 in 1d27a10
function downloadKeyboardPackage() { | |
# Check that $KEYBOARDS_TARGET is valid | |
if [ "$#" -ne 2 ]; then | |
builder_die "downloadKeyboardPackage requires KEYBOARD_PACKAGE_ID and KEYBOARDS_TARGET to be set" | |
fi | |
# Default Keyboard | |
local ID="$1" | |
local KEYBOARDS_TARGET="$2" | |
local URL_DOWNLOAD=https://downloads.keyman.com | |
local URL_API_KEYBOARD_VERSION=${URL_DOWNLOAD}/api/keyboard/ | |
local RETRY=5 # Curl retries this number of times before giving up | |
local RETRY_DELAY=5 # Make curl sleep this amount of time before each retry when a transfer has failed | |
echo "Downloading ${ID}.kmp from downloads.keyman.com up to ${RETRY} attempts" | |
local URL_DOWNLOAD_FILE=`curl --retry "$RETRY" --retry-delay "$RETRY_DELAY" --silent "$URL_API_KEYBOARD_VERSION/${ID}" | "$JQ" -r .kmp` | |
curl --fail --retry "$RETRY" --retry-delay "$RETRY_DELAY" --silent "$URL_DOWNLOAD_FILE" --output "$KEYBOARDS_TARGET" || { | |
builder_die "Downloading $KEYBOARDS_TARGET failed with error $?" | |
} | |
} |
Note that the original performed a query for the keyboard's actual URL (the local URL_DOWNLOAD_FILE
line) before doing the actual download.
The linked block is used both in Android builds (run on Windows + Linux) and in iOS builds (run on Mac), so we should be good for cross-platform compatibility here.
Android reference:
keyman/android/KMAPro/build.sh
Line 11 in 1d27a10
. "$KEYMAN_ROOT/resources/build/build-download-resources.sh" |
keyman/android/KMAPro/build.sh
Lines 74 to 85 in 1d27a10
if builder_start_action configure; then | |
KEYBOARD_PACKAGE_ID="sil_euro_latin" | |
KEYBOARDS_TARGET="$KEYMAN_ROOT/android/KMAPro/kMAPro/src/main/assets/${KEYBOARD_PACKAGE_ID}.kmp" | |
MODEL_PACKAGE_ID="nrc.en.mtnt" | |
MODELS_TARGET="$KEYMAN_ROOT/android/KMAPro/kMAPro/src/main/assets/${MODEL_PACKAGE_ID}.model.kmp" | |
downloadKeyboardPackage "$KEYBOARD_PACKAGE_ID" "$KEYBOARDS_TARGET" | |
downloadModelPackage "$MODEL_PACKAGE_ID" "$MODELS_TARGET" | |
builder_finish_action success configure | |
fi |
iOS reference:
Line 11 in 1d27a10
. "$KEYMAN_ROOT/resources/build/build-download-resources.sh" |
Lines 87 to 92 in 1d27a10
function do_packages() { | |
mkdir -p "$BUNDLE_PATH" | |
downloadKeyboardPackage "$DEFAULT_KBD_ID" "$BUNDLE_PATH/$DEFAULT_KBD_ID.kmp" | |
downloadModelPackage "$DEFAULT_LM_ID" "$BUNDLE_PATH/$DEFAULT_LM_ID.model.kmp" | |
} |
…o feat/web/wordbreaker-data-optimization
Test Results
|
Changes in this pull request will be available for download in Keyman version 18.0.99-alpha |
Relates to #7224.
This PR focuses mostly on this comment: #7224 (comment)
I've now worked out the details for encoding the table into string form. Going to a UTF-8 style string-encoding turned out not to be as beastly as I previously thought.
Using our current esbuild settings, this change drops the data table's filesize contribution from 18.4kb to 14.8kb. It's not that much, mostly because of how esbuild auto-escapes non ASCII chars, which re-expands the value for each character. The savings we get is mostly from removal of characters from the old array-based pattern's syntax.
In doing some research, I noted the following conversations that will be quite relevant to any related decisions moving forward:
esbuild
default settings prefer to encode non-ASCII chars as\uxxxx
escape sequences, even in UTF-8 files, due to performance concerns when a file uses the full UTF-8 range instead of pure ASCII encoding. The V8 parser has been noted to default to ASCII, redoing the buffer when it detects the first non ASCII char. Refer to https://v8.dev/blog/scanner#advanceuntil - note the image under that header, which is the last graph on the page. It's not explicitly mentioned in text on the page, but the graph certainly implies that pure ASCII is ~70% faster to load.We don't really have concerns about loading a UTF-8 encoded file; our lexical models already have that enforced within them, encoding with standard UTF-8 instead of escapes.
The minified output for our current data table looks like this:
True string length in code units: 223 (without the whitespace added to emulate word-wrapping)
Our encoded format looks quite different:
Starting and ending with the same codepoints...
String length in code units: 60
These two samples are from the SMP-range. Fun note... with Unicode 13 data, half of the entries in that range would be missing... but even then, the new encoding pattern including all of them would still win over the old encoding pattern with the original set.
This PR also adds unit tests to verify that we can properly search the data tables for both encoded strings.
User Testing
TEST_GENERAL_PREDICTIONS: Do some basic predictive-text testing and verify that things work normally.