feat(web): optimize the wordbreaker data table for filesize and ease of first-load parsing ⚡ #10692

Merged
merged 10 commits into from
Aug 27, 2024

@jahorton jahorton commented Feb 13, 2024

Relates to #7224.

This PR focuses mostly on this comment: #7224 (comment)

I've also thought of "a way" to shrink the size of the backing data table, but that would be its own beast of a side project and would result in a notably less human-readable file. [...]

(The idea: there's little reason we can't compress the table into two coded character strings - one for BMP, one for SMP. One char instead of 4 or 5 [representing the numeric value] would make a big difference.)

I've now worked out the details for encoding the table into string form. Going to a UTF-8 style string-encoding turned out not to be as beastly as I previously thought.

Using our current esbuild settings, this change drops the data table's filesize contribution from 18.4kb to 14.8kb. The reduction is modest, mostly because esbuild auto-escapes non-ASCII chars, which re-expands each such character into an escape sequence. Most of the savings come from dropping the syntax characters that the old array-based pattern required.


In doing some research, I noted the following conversations that will be quite relevant to any related decisions moving forward:

esbuild's default settings encode non-ASCII chars as \uxxxx escape sequences, even in UTF-8 files, because of performance concerns when a file uses the full UTF-8 range instead of pure ASCII. The V8 parser has been noted to default to ASCII, redoing the buffer when it detects the first non-ASCII char. Refer to https://v8.dev/blog/scanner#advanceuntil and note the image under that header, which is the last graph on the page. The page text doesn't state it explicitly, but the graph certainly implies that pure ASCII is ~70% faster to load.

We don't really have concerns about loading a UTF-8 encoded file; our lexical models already have that enforced within them, encoding with standard UTF-8 instead of escapes.
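
As a rough illustration of the size trade-off described above, the sketch below compares one encoded SMP entry stored as raw UTF-8 against the same entry with every non-ASCII code unit escaped. `escapeNonAscii` and `utf8ByteLength` are hypothetical helpers written for this example. The escaping only approximates ASCII-only output; it is not taken from esbuild, whose exact escape format may differ.

```typescript
// Hypothetical helper: approximates ASCII-only output by replacing every
// non-ASCII UTF-16 code unit with a six-character \uXXXX escape.
function escapeNonAscii(text: string): string {
  let out = '';
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit < 0x80) {
      out += text.charAt(i);
    } else {
      out += '\\u' + ('0000' + unit.toString(16).toUpperCase()).slice(-4);
    }
  }
  return out;
}

// Hypothetical helper: number of bytes the text occupies when stored as UTF-8.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const codePoint = text.codePointAt(i) as number;
    if (codePoint < 0x80) {
      bytes += 1;
    } else if (codePoint < 0x800) {
      bytes += 2;
    } else if (codePoint < 0x10000) {
      bytes += 3;
    } else {
      bytes += 4;
      i++; // the code point used two code units; skip the low surrogate
    }
  }
  return bytes;
}

// One SMP entry: range start 110576 (U+1AFF0) followed by a one-char property code.
const entry = String.fromCodePoint(110576) + '1';

const rawBytes = utf8ByteLength(entry);             // 5
const escapedLength = escapeNonAscii(entry).length; // 13
console.log(rawBytes, escapedLength);
```

Under these assumptions, an escaped SMP entry costs more than twice its raw UTF-8 size, which is consistent with the savings coming mainly from dropped array syntax rather than from the characters themselves.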

The minified output for our current data table looks like this:

[94178,0],[94179,11],[94180,14],[94181,0],[94192,14],[94194,0],[110576,17],[110580,0],
[110581,17],[110588,0],[110589,17],[110591,0],[110592,17],[110593,0],[110880,17],[110883,0],
[110933,17],[110934,0],[110948,17],[110952,0]

True string length in code units: 223 (without the whitespace added to emulate word-wrapping)

Our encoded format looks quite different:

Starting and ending with the same codepoints...

𖿢 𖿣+𖿤.𖿥 𖿰.𖿲 𚿰1𚿴 𚿵1𚿼 𚿽1𚿿 𛀀1𛀁 𛄠1𛄣 𛅕1𛅖 𛅤1𛅨 

String length in code units: 60
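
The two lengths quoted above can be reproduced with a short sketch. The function names and the `PROPERTY_SHIFT` constant are assumptions made for this example: the 0x20 offset is inferred from the sample itself (property 17 appears as `1`, 11 as `+`, 0 as a space) and is not confirmed as the generator's actual implementation.

```typescript
// Assumed offset, inferred from the sample: property value + 0x20 = char code.
const PROPERTY_SHIFT = 0x20;

// The twenty [range start, property value] pairs from the minified sample.
const sampleRanges: Array<[number, number]> = [
  [94178, 0], [94179, 11], [94180, 14], [94181, 0], [94192, 14],
  [94194, 0], [110576, 17], [110580, 0], [110581, 17], [110588, 0],
  [110589, 17], [110591, 0], [110592, 17], [110593, 0], [110880, 17],
  [110883, 0], [110933, 17], [110934, 0], [110948, 17], [110952, 0]
];

// Old pattern: minified array-of-pairs syntax, without the outer brackets.
function toArraySyntax(ranges: Array<[number, number]>): string {
  return ranges
    .map(([start, property]) => '[' + start + ',' + property + ']')
    .join(',');
}

// New pattern: one code point for the range start, one char for the property.
function toEncodedString(ranges: Array<[number, number]>): string {
  return ranges
    .map(([start, property]) =>
      String.fromCodePoint(start) + String.fromCharCode(property + PROPERTY_SHIFT))
    .join('');
}

const arrayForm = toArraySyntax(sampleRanges);
const encodedForm = toEncodedString(sampleRanges);
console.log(arrayForm.length);   // 223
console.log(encodedForm.length); // 60
```

Each SMP entry takes three code units (a surrogate pair plus one property char), so twenty entries give 60.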

These two samples are from the SMP-range. Fun note... with Unicode 13 data, half of the entries in that range would be missing... but even then, the new encoding pattern including all of them would still win over the old encoding pattern with the original set.

This PR also adds unit tests to verify that we can properly search the data tables for both encoded strings.
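
The search routine and its tests are not shown in this conversation. As a minimal sketch of what searching such a table could look like, the code below does a binary search over fixed-width entries. It assumes three code units per SMP entry and two per BMP entry, plus the 0x20 property offset inferred from the sample; `searchEncodedTable` is a hypothetical name, not the function in the PR.

```typescript
// Assumed offset between a property value and its encoded char code.
const SEARCH_PROPERTY_SHIFT = 0x20;

// Returns the property of the last range whose start is <= codePoint,
// or undefined when codePoint precedes every range in the table.
// entryWidth: 2 code units per BMP entry, 3 per SMP entry (assumed layout).
function searchEncodedTable(
  table: string,
  codePoint: number,
  entryWidth: 2 | 3
): number | undefined {
  let low = 0;
  let high = table.length / entryWidth - 1;
  let found: number | undefined = undefined;

  while (low <= high) {
    const mid = (low + high) >> 1;
    const offset = mid * entryWidth;
    const start = table.codePointAt(offset) as number;
    if (start <= codePoint) {
      // Candidate match; keep looking for a later range that still fits.
      found = table.charCodeAt(offset + entryWidth - 1) - SEARCH_PROPERTY_SHIFT;
      low = mid + 1;
    } else {
      high = mid - 1;
    }
  }
  return found;
}

// A small SMP table built from six of the sample's entries.
const smpPairs: Array<[number, number]> = [
  [94178, 0], [94179, 11], [94180, 14], [94181, 0], [110576, 17], [110580, 0]
];
const smpTable = smpPairs
  .map(([start, property]) =>
    String.fromCodePoint(start) + String.fromCharCode(property + SEARCH_PROPERTY_SHIFT))
  .join('');

console.log(searchEncodedTable(smpTable, 94179, 3));  // 11
console.log(searchEncodedTable(smpTable, 110578, 3)); // 17
console.log(searchEncodedTable(smpTable, 90000, 3));  // undefined
```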

User Testing

TEST_GENERAL_PREDICTIONS: Do some basic predictive-text testing and verify that things work normally.

@jahorton jahorton added this to the 18.0 milestone Feb 13, 2024
@keymanapp-test-bot keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label Feb 13, 2024
@github-actions github-actions bot added web/ and removed web/ labels Feb 16, 2024
// The backslash gets removed on file-load.
codedStart = '\\`';
}
const codedProp = String.fromCharCode(categoryMap.get(property));

A possibly-relevant cross-reference from a different draft pred-text PR: #11088 (comment)

It may help the encoding to shift the enum values by about 0x20 in order to avoid use of the ASCII control-code range.

Those chars are "fine" for the base character (codedStart) since they'll only be used once each, if that, as a key. Used for values, however, the control codes would see very high usage. Assuming esbuild would "escape" control-code characters, frequent use of them would likely greatly lower our file-size savings.
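
A quick sketch of the effect of that shift, using hypothetical helper names and only the property values 0 through 17 seen in the SMP sample (the real enum may have more members):

```typescript
// Hypothetical check: is this code unit an ASCII control code (C0 range or DEL)?
function isAsciiControlCode(unit: number): boolean {
  return unit < 0x20 || unit === 0x7f;
}

// Encodes a property value as a single char, offset by `shift`.
function encodeProperty(value: number, shift: number): string {
  return String.fromCharCode(value + shift);
}

// Property values 0..17, the range observed in the sample data.
const propertyValues: number[] = [];
for (let value = 0; value <= 17; value++) {
  propertyValues.push(value);
}

// Count how many values land in the control-code range, with and without the shift.
const controlCodesUnshifted = propertyValues.filter(
  (value) => isAsciiControlCode(encodeProperty(value, 0).charCodeAt(0))
).length;
const controlCodesShifted = propertyValues.filter(
  (value) => isAsciiControlCode(encodeProperty(value, 0x20).charCodeAt(0))
).length;

console.log(controlCodesUnshifted, controlCodesShifted); // 18 0
```

Shifted values can still land on characters that need escaping inside a string literal (value 2 maps to `"`, for instance), so the quoting style of the generated file matters as well.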

// To consider: emit `\uxxxx` codes instead of the raw char?
nonBmpRanges.map(({start, property}) => {
const codedStart = String.fromCodePoint(start);
const codedProp = String.fromCharCode(categoryMap.get(property));

Same point as prior comment.

@github-actions github-actions bot added common/resources/ Build infrastructure web/ and removed web/ labels Aug 7, 2024
@jahorton jahorton changed the title feat(web): wordbreaker data table optimization feat(web): optimize the wordbreaker data table for filesize and ease of first-load parsing ⚡ Aug 7, 2024
@keymanapp-test-bot keymanapp-test-bot bot removed the user-test-missing User tests have not yet been defined for the PR label Aug 7, 2024
@keymanapp-test-bot keymanapp-test-bot bot added has-user-test user-test-required User tests have not been completed labels Aug 7, 2024
local RETRY_DELAY=5 # Make curl sleep this amount of time before each retry when a transfer has failed

echo "Downloading ${SRC} - ${RETRY} attempts"
# local URL_DOWNLOAD_FILE=`curl --retry "$RETRY" --retry-delay "$RETRY_DELAY" --silent "${SRC}" | "$JQ" -r .txt`

Suggested change
# local URL_DOWNLOAD_FILE=`curl --retry "$RETRY" --retry-delay "$RETRY_DELAY" --silent "${SRC}" | "$JQ" -r .txt`

EMOJI_DATA_SRC_HREF="https://www.unicode.org/Public/$KEYMAN_VERSION_UNICODE/ucd/emoji/emoji-data.txt"
EMOJI_DATA_SRC_LOCAL="./emoji-data.txt"

function downloadPropertyFile() {

@jahorton jahorton Aug 7, 2024


I adapted the pattern used here from that seen in the linked block below:

function downloadKeyboardPackage() {
  # Check that $KEYBOARDS_TARGET is valid
  if [ "$#" -ne 2 ]; then
    builder_die "downloadKeyboardPackage requires KEYBOARD_PACKAGE_ID and KEYBOARDS_TARGET to be set"
  fi
  # Default Keyboard
  local ID="$1"
  local KEYBOARDS_TARGET="$2"
  local URL_DOWNLOAD=https://downloads.keyman.com
  local URL_API_KEYBOARD_VERSION=${URL_DOWNLOAD}/api/keyboard/
  local RETRY=5 # Curl retries this number of times before giving up
  local RETRY_DELAY=5 # Make curl sleep this amount of time before each retry when a transfer has failed
  echo "Downloading ${ID}.kmp from downloads.keyman.com up to ${RETRY} attempts"
  local URL_DOWNLOAD_FILE=`curl --retry "$RETRY" --retry-delay "$RETRY_DELAY" --silent "$URL_API_KEYBOARD_VERSION/${ID}" | "$JQ" -r .kmp`
  curl --fail --retry "$RETRY" --retry-delay "$RETRY_DELAY" --silent "$URL_DOWNLOAD_FILE" --output "$KEYBOARDS_TARGET" || {
    builder_die "Downloading $KEYBOARDS_TARGET failed with error $?"
  }
}

Note that the original performed a query for the keyboard's actual URL (the local URL_DOWNLOAD_FILE line) before doing the actual download.

The linked block is used both in Android builds (run on Windows + Linux) and in iOS builds (run on Mac), so we should be good for cross-platform compatibility here.

Android reference:

. "$KEYMAN_ROOT/resources/build/build-download-resources.sh"

if builder_start_action configure; then
  KEYBOARD_PACKAGE_ID="sil_euro_latin"
  KEYBOARDS_TARGET="$KEYMAN_ROOT/android/KMAPro/kMAPro/src/main/assets/${KEYBOARD_PACKAGE_ID}.kmp"
  MODEL_PACKAGE_ID="nrc.en.mtnt"
  MODELS_TARGET="$KEYMAN_ROOT/android/KMAPro/kMAPro/src/main/assets/${MODEL_PACKAGE_ID}.model.kmp"
  downloadKeyboardPackage "$KEYBOARD_PACKAGE_ID" "$KEYBOARDS_TARGET"
  downloadModelPackage "$MODEL_PACKAGE_ID" "$MODELS_TARGET"
  builder_finish_action success configure
fi

iOS reference:

. "$KEYMAN_ROOT/resources/build/build-download-resources.sh"

function do_packages() {
  mkdir -p "$BUNDLE_PATH"
  downloadKeyboardPackage "$DEFAULT_KBD_ID" "$BUNDLE_PATH/$DEFAULT_KBD_ID.kmp"
  downloadModelPackage "$DEFAULT_LM_ID" "$BUNDLE_PATH/$DEFAULT_LM_ID.model.kmp"
}

@jahorton jahorton changed the base branch from master to feat/web/wordbreaker-property-data-gen August 7, 2024 06:18
@jahorton jahorton marked this pull request as ready for review August 7, 2024 06:18
@jahorton jahorton requested a review from mcdurdin as a code owner August 7, 2024 06:18
@github-actions github-actions bot added common/resources/ Build infrastructure web/ and removed web/ common/resources/ Build infrastructure labels Aug 8, 2024
@github-actions github-actions bot added common/resources/ Build infrastructure web/ and removed web/ common/resources/ Build infrastructure labels Aug 9, 2024
@dinakaranr

Test Results

  • TEST_GENERAL_PREDICTIONS (Passed):
    I tested this issue with the attached "Keyman 18.0.84-alpha-test-10692" build on Android 14 and iPhone 13. My observations follow.
  1. Installed the "Keyman-18.0.84" build on Android 14 and on iPhone 13 (via the TestFlight app) and gave all permissions to the application.
  2. Checked the "Enable Keyman as system-wide keyboard" and set the keyboard as the default keyboard box on the settings page.
  3. Open the Keyman app. Enable the "Predictions" and install the "Dictionary."
  4. Installed the "EuroLatin (SIL)" keyboard(v 3.0.2) and the standard MTNT dictionary(v0.3.2) model.
  5. Open the keyman notepad.
  6. Enter a word and observe that the word is highlighted on the banner.
  7. Press the "Spacebar" and observe that the word is added.
  8. The word can be selected using the cursor, from left to right and right to left.
  9. The word prediction works on the banner and selection works correctly.
  10. The word correction works well if I enter the wrong word.
    It works well. Thank you.

@keymanapp-test-bot keymanapp-test-bot bot removed the user-test-required User tests have not been completed label Aug 9, 2024
Base automatically changed from feat/web/wordbreaker-property-data-gen to master August 27, 2024 03:19
@jahorton jahorton merged commit 957412a into master Aug 27, 2024
15 checks passed
@jahorton jahorton deleted the feat/web/wordbreaker-data-optimization branch August 27, 2024 03:19
@keyman-server

Changes in this pull request will be available for download in Keyman version 18.0.99-alpha

@darcywong00 darcywong00 modified the milestones: 18.0, A18S9 Aug 31, 2024
5 participants