feat(web): optimize the wordbreaker data table for filesize and ease of first-load parsing ⚡ #10692

jahorton · 2024-02-13T02:16:21Z

Relates to #7224.

This PR focuses mostly on this comment: #7224 (comment)

I've also thought of "a way" to shrink the size of the backing data table, but that would be its own beast of a side project and would result in a notably less human-readable file. [...]

(The idea: there's little reason we can't compress the table into two coded character strings - one for BMP, one for SMP. One char instead of 4 or 5 [representing the numeric value] would make a big difference.)

I've now worked out the details for encoding the table into string form. Going to a UTF-8 style string-encoding turned out not to be as beastly as I previously thought.

Using our current esbuild settings, this change drops the data table's filesize contribution from 18.4kb to 14.8kb. That's not a large drop, mostly because esbuild auto-escapes non-ASCII chars, which re-expands the value for each character. The savings we do get come mostly from removing the syntax characters of the old array-based pattern.

In doing some research, I noted the following conversations that will be quite relevant to any related decisions moving forward:

esbuild's default settings encode non-ASCII chars as \uxxxx escape sequences, even in UTF-8 files, due to performance concerns when a file uses the full UTF-8 range instead of pure ASCII. The V8 parser has been noted to default to ASCII, redoing the buffer when it detects the first non-ASCII char. Refer to https://v8.dev/blog/scanner#advanceuntil and note the image under that header, which is the last graph on the page. The page's text doesn't say so explicitly, but the graph certainly implies that pure ASCII is ~70% faster to load.

We don't really have concerns about loading a UTF-8 encoded file; our lexical models already have that enforced within them, encoding with standard UTF-8 instead of escapes.

The minified output for our current data table looks like this:

[94178,0],[94179,11],[94180,14],[94181,0],[94192,14],[94194,0],[110576,17],[110580,0],
[110581,17],[110588,0],[110589,17],[110591,0],[110592,17],[110593,0],[110880,17],[110883,0],
[110933,17],[110934,0],[110948,17],[110952,0]

True string length in code units: 223 (without the whitespace added to emulate word-wrapping)

Our encoded format looks quite different:

Starting and ending with the same codepoints...

𖿢 𖿣+𖿤.𖿥 𖿰.𖿲 𚿰1𚿴 𚿵1𚿼 𚿽1𚿿 𛀀1𛀁 𛄠1𛄣 𛅕1𛅖 𛅤1𛅨

String length in code units: 60

These two samples are from the SMP-range. Fun note... with Unicode 13 data, half of the entries in that range would be missing... but even then, the new encoding pattern including all of them would still win over the old encoding pattern with the original set.
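
For readers who want to see how the two samples relate, here is a minimal sketch that reproduces both lengths from the same 20 entries. It assumes each entry is stored as the range's starting codepoint followed by one character holding the property value shifted up by 0x20; that offset is inferred from the sample above (0 becomes a space, 11 becomes `+`, 17 becomes `1`) and is not confirmed as the compiler's exact rule. The names `encodeRanges`, `lookupSmpProperty`, and `PROPERTY_OFFSET` are hypothetical and do not come from the PR.

```typescript
// Hypothetical sketch: names and the 0x20 offset are assumptions inferred
// from the sample strings, not taken from the PR's data compiler.
type RangeEntry = [number, number]; // [range start codepoint, property value]

const PROPERTY_OFFSET = 0x20; // assumed shift that keeps values out of the control-code range

// The same 20 entries shown in the minified array sample.
const smpSample: RangeEntry[] = [
  [94178, 0], [94179, 11], [94180, 14], [94181, 0], [94192, 14],
  [94194, 0], [110576, 17], [110580, 0], [110581, 17], [110588, 0],
  [110589, 17], [110591, 0], [110592, 17], [110593, 0], [110880, 17],
  [110883, 0], [110933, 17], [110934, 0], [110948, 17], [110952, 0],
];

// Each entry becomes: the start codepoint as a character, then one property character.
function encodeRanges(ranges: RangeEntry[]): string {
  return ranges
    .map(([start, property]) =>
      String.fromCodePoint(start) + String.fromCharCode(property + PROPERTY_OFFSET))
    .join('');
}

// Binary search for the last range starting at or before `codePoint`.
// Assumes every start lies above U+FFFF, so each entry spans exactly
// three UTF-16 code units (a surrogate pair plus the property character).
function lookupSmpProperty(encoded: string, codePoint: number): number | undefined {
  const ENTRY_WIDTH = 3;
  let low = 0;
  let high = encoded.length / ENTRY_WIDTH - 1;
  let found = -1;
  while (low <= high) {
    const mid = (low + high) >> 1;
    const start = encoded.codePointAt(mid * ENTRY_WIDTH) as number;
    if (start <= codePoint) {
      found = mid;
      low = mid + 1;
    } else {
      high = mid - 1;
    }
  }
  if (found < 0) {
    return undefined; // codePoint precedes the first range in the table
  }
  return encoded.charCodeAt(found * ENTRY_WIDTH + 2) - PROPERTY_OFFSET;
}

const encodedSample = encodeRanges(smpSample);
// Array form without the outer brackets, matching the minified sample's text.
const arrayFormLength = JSON.stringify(smpSample).slice(1, -1).length;

console.log(arrayFormLength, encodedSample.length); // 223 60
```

A BMP table would need a two-code-unit entry width instead, which is one reason to keep the BMP and SMP strings separate as described above.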

This PR also adds unit tests to verify that we can properly search the data tables for both encoded strings.

User Testing

TEST_GENERAL_PREDICTIONS: Do some basic predictive-text testing and verify that things work normally.

keymanapp-test-bot · 2024-02-13T02:16:25Z

User Test Results

Test specification and instructions

✅ TEST_GENERAL_PREDICTIONS (PASSED) (notes)

Test Artifacts

Android
Developer
iOS
- Keyman for iOS (simulator image)
- FirstVoices Keyboards for iOS (simulator image)
- TestFlight internal PR build version - 18.0.84 (0.10692.11700)
Keyboards
- Test Keyboards
Web
- KeymanWeb Test Home
Windows

jahorton · 2024-04-09T05:30:30Z

common/models/wordbreakers/src/data-compiler/index.ts

+      // The backslash gets removed on file-load.
+      codedStart = '\\`';
+    }
+    const codedProp = String.fromCharCode(categoryMap.get(property));


A possibly-relevant cross-reference from a different draft pred-text PR: #11088 (comment)

It may help the encoding to shift the enum values by about 0x20 in order to avoid use of the ASCII control-code range.

Those chars are "fine" for the base character (codedStart) since they'll only be used once each, if that, as a key. Used for values, however, the control codes would see very high usage. Assuming esbuild would "escape" control-code characters, frequent use of them would likely greatly lower our file-size savings.
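
As a rough illustration of that concern, the sketch below compares how many characters a property value costs once escaped, with and without the 0x20 shift. `JSON.stringify` is only a stand-in for a minifier's string escaping here; whether esbuild escapes control codes the same way is an assumption this thread does not confirm. The helper name `escapedLength` is hypothetical.

```typescript
// Hypothetical sketch: JSON.stringify stands in for a minifier's escaping rules.
// esbuild's actual handling of control codes may differ.
function escapedLength(text: string): number {
  // Subtract the two surrounding quote characters that JSON.stringify adds.
  return JSON.stringify(text).length - 2;
}

// Property values that appear in the SMP sample from the PR description.
const sampleProperties = [0, 11, 14, 17];

// Cost when each value is emitted directly as a control-code character.
const rawCost = sampleProperties
  .map((p) => escapedLength(String.fromCharCode(p)))
  .reduce((total, n) => total + n, 0);

// Cost when each value is shifted by 0x20 into printable ASCII first.
const shiftedCost = sampleProperties
  .map((p) => escapedLength(String.fromCharCode(p + 0x20)))
  .reduce((total, n) => total + n, 0);

console.log(rawCost, shiftedCost); // 24 4
```

Under this stand-in, each unshifted value costs six characters against one for the shifted form. A shift would still need care for values that land on characters requiring their own escapes, such as the backtick case handled for `codedStart`.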

jahorton · 2024-04-09T05:30:40Z

common/models/wordbreakers/src/data-compiler/index.ts

+  // To consider:  emit `\uxxxx` codes instead of the raw char?
+  nonBmpRanges.map(({start, property}) => {
+    const codedStart = String.fromCodePoint(start);
+    const codedProp = String.fromCharCode(categoryMap.get(property));


Same point as prior comment.

jahorton · 2024-08-07T04:32:28Z

resources/standards-data/unicode-character-database/build.sh

+  local RETRY_DELAY=5 # Make curl sleep this amount of time before each retry when a transfer has failed
+
+  echo "Downloading ${SRC} - ${RETRY} attempts"
+  # local URL_DOWNLOAD_FILE=`curl --retry "$RETRY" --retry-delay "$RETRY_DELAY" --silent "${SRC}" | "$JQ" -r .txt`


Suggested change

# local URL_DOWNLOAD_FILE=`curl --retry "$RETRY" --retry-delay "$RETRY_DELAY" --silent "${SRC}" | "$JQ" -r .txt`

jahorton · 2024-08-07T04:34:21Z

resources/standards-data/unicode-character-database/build.sh

+EMOJI_DATA_SRC_HREF="https://www.unicode.org/Public/$KEYMAN_VERSION_UNICODE/ucd/emoji/emoji-data.txt"
+EMOJI_DATA_SRC_LOCAL="./emoji-data.txt"
+
+function downloadPropertyFile() {


I adapted the pattern used here from that seen in the linked block below:

keyman/resources/build/build-download-resources.sh

Lines 22 to 41 in 1d27a10

function downloadKeyboardPackage() {
  # Check that $KEYBOARDS_TARGET is valid
  if [ "$#" -ne 2 ]; then
    builder_die "downloadKeyboardPackage requires KEYBOARD_PACKAGE_ID and KEYBOARDS_TARGET to be set"
  fi
  # Default Keyboard
  local ID="$1"
  local KEYBOARDS_TARGET="$2"

  local URL_DOWNLOAD=https://downloads.keyman.com
  local URL_API_KEYBOARD_VERSION=${URL_DOWNLOAD}/api/keyboard/
  local RETRY=5 # Curl retries this number of times before giving up
  local RETRY_DELAY=5 # Make curl sleep this amount of time before each retry when a transfer has failed

  echo "Downloading ${ID}.kmp from downloads.keyman.com up to ${RETRY} attempts"
  local URL_DOWNLOAD_FILE=`curl --retry "$RETRY" --retry-delay "$RETRY_DELAY" --silent "$URL_API_KEYBOARD_VERSION/${ID}" | "$JQ" -r .kmp`
  curl --fail --retry "$RETRY" --retry-delay "$RETRY_DELAY" --silent "$URL_DOWNLOAD_FILE" --output "$KEYBOARDS_TARGET" || {
    builder_die "Downloading $KEYBOARDS_TARGET failed with error $?"
  }
}

Note that the original performed a query for the keyboard's actual URL (the local URL_DOWNLOAD_FILE line) before doing the actual download.

The linked block is used both in Android builds (run on Windows + Linux) and in iOS builds (run on Mac), so we should be good for cross-platform compatibility here.

Android reference:

keyman/android/KMAPro/build.sh

Line 11 in 1d27a10

. "$KEYMAN_ROOT/resources/build/build-download-resources.sh"

keyman/android/KMAPro/build.sh

Lines 74 to 85 in 1d27a10

if builder_start_action configure; then

  KEYBOARD_PACKAGE_ID="sil_euro_latin"
  KEYBOARDS_TARGET="$KEYMAN_ROOT/android/KMAPro/kMAPro/src/main/assets/${KEYBOARD_PACKAGE_ID}.kmp"
  MODEL_PACKAGE_ID="nrc.en.mtnt"
  MODELS_TARGET="$KEYMAN_ROOT/android/KMAPro/kMAPro/src/main/assets/${MODEL_PACKAGE_ID}.model.kmp"

  downloadKeyboardPackage "$KEYBOARD_PACKAGE_ID" "$KEYBOARDS_TARGET"
  downloadModelPackage "$MODEL_PACKAGE_ID" "$MODELS_TARGET"

  builder_finish_action success configure
fi

iOS reference:

keyman/ios/engine/build.sh

Line 11 in 1d27a10

. "$KEYMAN_ROOT/resources/build/build-download-resources.sh"

keyman/ios/engine/build.sh

Lines 87 to 92 in 1d27a10

function do_packages() {
  mkdir -p "$BUNDLE_PATH"

  downloadKeyboardPackage "$DEFAULT_KBD_ID" "$BUNDLE_PATH/$DEFAULT_KBD_ID.kmp"
  downloadModelPackage "$DEFAULT_LM_ID" "$BUNDLE_PATH/$DEFAULT_LM_ID.model.kmp"
}

dinakaranr · 2024-08-09T10:40:38Z

Test Results

TEST_GENERAL_PREDICTIONS (Passed):
I tested this PR with the attached "Keyman 18.0.84-alpha-test-10692" build on Android 14 and an iPhone 13. My observations:

Installed the "Keyman-18.0.84" build on Android 14 and, via the TestFlight app, on the iPhone 13, and gave the application all permissions.
Checked the "Enable Keyman as system-wide keyboard" box and the set-as-default-keyboard box on the settings page.
Opened the Keyman app, enabled "Predictions", and installed the "Dictionary."
Installed the "EuroLatin (SIL)" keyboard (v3.0.2) and the standard MTNT dictionary (v0.3.2) model.
Opened the Keyman notepad.
Entered a word and observed that it was highlighted on the banner.
Pressed the "Spacebar", and the word was then added.
Selected the word with the cursor, both left to right and right to left.
Word prediction works on the banner, and selection works correctly.
Word correction works well when I enter a wrong word.
Everything works well. Thank you.

keyman-server · 2024-08-27T18:02:53Z

Changes in this pull request will be available for download in Keyman version 18.0.99-alpha

feat(common/models): initial pass for encoded-string wordbreaker data…

0b79b86

… tables

jahorton added this to the 18.0 milestone Feb 13, 2024

keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label Feb 13, 2024

github-actions bot added common/ common/models/ common/models/wordbreakers/ feat web/ labels Feb 13, 2024

jahorton added 2 commits February 16, 2024 11:56

fix(web): conditional import path

47d945f

chore(web): Merge branch 'feat/web/wordbreaker-property-data-gen' int…

bbda771

…o feat/web/wordbreaker-data-optimization

github-actions bot added web/ and removed web/ labels Feb 16, 2024

jahorton commented Apr 9, 2024

jahorton mentioned this pull request Aug 7, 2024

feat(web): import the generator for the pred-text wordbreaker's Unicode-property data-table ⚡ #10690

Merged

jahorton added 5 commits August 7, 2024 08:52

chore(web): Merge branch 'feat/web/wordbreaker-property-data-gen' int…

23c0308

…o feat/web/wordbreaker-data-optimization

change(common/models): improve encoding format

5045df1

feat(common/models): connect encoded lookup table

fdecc60

fix(web): fixes value for end of BMP range

b145adf

feat(common/models): add property-lookup unit test set

02f721d

github-actions bot added common/resources/ Build infrastructure web/ and removed web/ labels Aug 7, 2024

jahorton changed the title ~~feat(web): wordbreaker data table optimization~~ feat(web): optimize the wordbreaker data table for filesize and ease of first-load parsing ⚡ Aug 7, 2024

keymanapp-test-bot bot removed the user-test-missing User tests have not yet been defined for the PR label Aug 7, 2024

jahorton mentioned this pull request Aug 7, 2024

feat(web): enable utf8 charset encoding for the build artifacts ⚡ #12115

Merged

keymanapp-test-bot bot added has-user-test user-test-required User tests have not been completed labels Aug 7, 2024

jahorton commented Aug 7, 2024

jahorton changed the base branch from master to feat/web/wordbreaker-property-data-gen August 7, 2024 06:18

jahorton marked this pull request as ready for review August 7, 2024 06:18

jahorton requested a review from mcdurdin as a code owner August 7, 2024 06:18

chore(web): Merge branch 'feat/web/wordbreaker-property-data-gen' int…

fef5b3a

…o feat/web/wordbreaker-data-optimization

github-actions bot added common/resources/ Build infrastructure web/ and removed web/ common/resources/ Build infrastructure labels Aug 8, 2024

fix(common/models): fix unit-test reference to renamed file

4159bab

github-actions bot added common/resources/ Build infrastructure web/ and removed web/ common/resources/ Build infrastructure labels Aug 9, 2024

keymanapp-test-bot bot removed the user-test-required User tests have not been completed label Aug 9, 2024

mcdurdin approved these changes Aug 26, 2024

Base automatically changed from feat/web/wordbreaker-property-data-gen to master August 27, 2024 03:19

jahorton merged commit 957412a into master Aug 27, 2024
15 checks passed

jahorton deleted the feat/web/wordbreaker-data-optimization branch August 27, 2024 03:19

darcywong00 modified the milestones: 18.0, A18S9 Aug 31, 2024
