feat(common/models): wordbreaker customization #7279

jahorton · 2022-09-13T04:04:26Z

Addresses #3347. If nothing else, this should help prompt some nice discussion toward a better solution.

Note the addition of unit tests to cover almost every base case of the standard Unicode word-breaking implementation, then extra ones to verify that customization can work for a few decent sample cases.

To be clear, any custom rules will have priority over all rules after WB4 from the wordbreaker spec. Furthermore, reclassifying any character is absolute - each character may only match a single word-breaking property value.
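
As a rough illustration of that precedence, the sketch below runs rules in order and lets the first match decide. It assumes custom rules are simply consulted before the default rules that follow WB4. `Rule`, `decideBreak`, and the toy `Pair` context are hypothetical names for this sketch, not the actual wordbreaker internals.

```typescript
// Hypothetical rule shape, modeled on the `match` / `breakIfMatch` pair used in this PR.
interface Rule<C> {
  match: (context: C) => boolean;
  breakIfMatch: boolean;
}

// First matching rule wins; custom rules are consulted before the later default rules.
function decideBreak<C>(context: C, customRules: Rule<C>[], laterDefaultRules: Rule<C>[]): boolean {
  for (const rule of customRules.concat(laterDefaultRules)) {
    if (rule.match(context)) {
      return rule.breakIfMatch;
    }
  }
  return true; // WB999: otherwise, break everywhere
}

// Toy context (assumption): just the property names on either side of the candidate boundary.
type Pair = { left: string; right: string };

const keepHyphenated: Rule<Pair> = {
  match: (c) => c.left === 'ALetter' && c.right === 'Hyphen',
  breakIfMatch: false,
};

const defaultRules: Rule<Pair>[] = [
  // Stand-in for WB5: do not break between letters.
  { match: (c) => c.left === 'ALetter' && c.right === 'ALetter', breakIfMatch: false },
];

const boundary: Pair = { left: 'ALetter', right: 'Hyphen' };
const withCustom = decideBreak(boundary, [keepHyphenated], defaultRules);
const withoutCustom = decideBreak(boundary, [], defaultRules);
```

With the custom rule in place the boundary is suppressed; without it, no default rule in this toy set matches and the fallback breaks.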

This PR makes no direct changes to the lexical model compiler, so testing may be needed to confirm that customizations can be specified for the word-breaking function without issue. (That said, it appears to be 100% fine when using command-line compilation!) Alternatively, the model.ts file could provide a friendlier format (syntactic sugar) that compiles down to the actual customization function closure format shown below.

I have tested this with KeymanWeb via an edited lexical model (for a minority language with Khmer-based script):

Before:

wordBreaker: wordBreakers['default']

After:

wordBreaker: (text) => {
  let customization = {
    rules: [{
      match: (context) => {
        if(context.propertyMatch(null, ["ALetter"], ["Hyphen"], ["ALetter"])) {
          return true;
        } else if(context.propertyMatch(null, ["ALetter"], ["Hyphen"], ["eot"])) {
          return true;
        } else if(context.propertyMatch(["ALetter"], ["Hyphen"], ["ALetter"], null)) {
          return true;
        } else {
          return false;
        }
      },
      breakIfMatch: false
    }],
    propertyMapping: (char) => {
      let hyphens = ['\u002d', '\u2010', '\u058a', '\u30a0'];
      if(char >= '\u1780' && char <= '\u17b3') {  // treats Khmer consonants & independent vowels as English chars
        return "ALetter";
      } else if(hyphens.includes(char)) {
        return "Hyphen";
      } else {
        // The other Khmer characters already have useful word-breaking
        // property assignments.
        return null;
      }
    },
    customProperties: ["Hyphen"]
  };
  return wordBreakers['default'](text, customization);
},
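
For readers unfamiliar with the four-argument form, here is a minimal sketch of how such a match could be evaluated. It assumes the arguments line up with (lookbehind, left of boundary, right of boundary, lookahead) and that `null` accepts anything. `Slot`, `BoundaryWindow`, and `slotMatches` are hypothetical names, and this stand-alone `propertyMatch` is not the method added by this PR.

```typescript
// A slot is either null ("accept anything", assumed) or a list of acceptable property names.
type Slot = string[] | null;

// Hypothetical view of the four positions around a candidate boundary,
// already reduced to property names; 'sot' / 'eot' mark start and end of text.
interface BoundaryWindow {
  lookbehind: string;
  left: string;
  right: string;
  lookahead: string;
}

function slotMatches(actual: string, slot: Slot): boolean {
  return slot === null || slot.indexOf(actual) !== -1;
}

function propertyMatch(w: BoundaryWindow, lookbehind: Slot, left: Slot, right: Slot, lookahead: Slot): boolean {
  return slotMatches(w.lookbehind, lookbehind)
    && slotMatches(w.left, left)
    && slotMatches(w.right, right)
    && slotMatches(w.lookahead, lookahead);
}

// Boundary between "a" and "-" in the text "a-b".
const beforeHyphen: BoundaryWindow = { lookbehind: 'sot', left: 'ALetter', right: 'Hyphen', lookahead: 'ALetter' };
// Boundary between "-" and "b" in the same text.
const afterHyphen: BoundaryWindow = { lookbehind: 'ALetter', left: 'Hyphen', right: 'ALetter', lookahead: 'eot' };

// Mirrors the first and third branches of the custom rule above.
const matchesBefore = propertyMatch(beforeHyphen, null, ['ALetter'], ['Hyphen'], ['ALetter']);
const matchesAfter = propertyMatch(afterHyphen, ['ALetter'], ['Hyphen'], ['ALetter'], null);
```

Under those assumptions both boundaries around the hyphen match, so a rule with `breakIfMatch: false` would keep "a-b" together as one word.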

Resulting package: sil.jra-khmr.jarai.model.kmp.zip

The experiment was done by making these edits in https://github.com/keymanapp/lexical-models/tree/master/release/sil/sil.jra-khmr.jarai. Note that the unit tests contain a few other nifty examples for possible customization, though I'll admit the solution for French & Italian article conjunctions isn't the most elegant.

Again, the actual model.ts file compiles well via command-line with the lexical-model compiler - no editing needed! That said, when editing (granted, in VSCode), I did see a minor issue on the final line of the closure - it doesn't know what wordBreakers['default'] is referencing there. I don't know if this will also be an issue in Developer yet.

@keymanapp-test-bot skip

I believe the sizable array of new unit tests is sufficient to cover us here.

Though, if we did want some form of user test, installation of the model from the zipped KMP could be used with one of the build artifacts for a test.

keymanapp-test-bot · 2022-09-13T04:04:45Z

User Test Results

Test specification and instructions

User tests are not required

Test Artifacts

Android
Developer
iOS
- Keyman for iOS (simulator image)
- FirstVoices Keyboards for iOS (simulator image)
- TestFlight internal PR build version - 16.0.61 (0.7279.6916)
Keyboards
- Test Keyboards
Web
- KeymanWeb Test Home
Windows
- Keyman for Windows
- FirstVoices Keyboards for Windows

eddieantonio

What is the plan on documenting this feature? It presents quite a large API surface!

mcdurdin

The design looks quite clean and extensible. Well done.

The file size increase is significant. Are there any ways we can tighten that up?

What is the plan on documenting this feature? It presents quite a large API surface!

This needs to be added to the LM documentation on help.keyman.com. I'd like to see a draft of that before we merge this, because this becomes a permanent API for Keyman lexical models once we merge, and so we need to be confident that we are not going to need any breaking changes to it in the future.

mcdurdin · 2022-09-13T23:45:48Z

common/models/wordbreakers/src/default/data.ts

@@ -4,7 +4,7 @@ export namespace data {
 /**
 * Valid values for a word break property.
 */
-export const enum WordBreakProperty {
+export const enum WordBreakProperty { // Scary bit:  this does not exist as an object at run-time!


Suggested change

-export const enum WordBreakProperty { // Scary bit: this does not exist as an object at run-time!
+export const enum WordBreakProperty {

I don't understand the point of that comment? That's a standard feature of TypeScript.

Right, looks like I left a coding investigation comment fragment in again.

That said...

The file size increase is significant. Are there any ways we can tighten that up?

It's highly related to the answer to this comment of yours. Also, the implications of const enum use in an auto-generated file here motivated a significant amount of the design.

In the examples in the description above, note the use of strings to refer to the character classes. Because of this "standard feature" - the fact that const enums compile down - we cannot use this enum to facilitate that use of strings. Correspondingly, there are a few functions and a data structure that I had to write for the sole purpose of facilitating string-based specification of custom classes and rules.
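
A small sketch of the constraint being described: a const enum is inlined at compile time, so a property name arriving as a run-time string needs its own lookup table. The enum members and the `idFromName` helper here are illustrative assumptions, not the contents of data.ts.

```typescript
// Illustrative subset only; the real WordBreakProperty enum has more members.
// Every use of a const enum member is replaced by its numeric value at compile
// time, and (unless preserveConstEnums is set) no enum object is emitted.
const enum WordBreakProperty {
  Other = 0,
  ALetter = 1,
  Numeric = 2,
}

// Works: the compiler substitutes the literal 1 here.
const fixedId: number = WordBreakProperty.ALetter;

// Does not compile: a const enum cannot be indexed by a run-time string.
// const dynamicId = WordBreakProperty[someName];

// Hence a separate run-time table, ordered so that index === property ID.
const propertyMap: string[] = ['Other', 'ALetter', 'Numeric'];

// Hypothetical helper: string name -> numeric ID, or -1 when the name is unknown.
function idFromName(name: string): number {
  return propertyMap.indexOf(name);
}
```

The table has to be kept in the same order as the enum, which is one reason the shifting-values concern raised further down matters.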

If we were willing to forgo this aspect of the feature:

- which would mean relying solely on the enum above for the default character classes
- thus requiring the model developer to track separate and unique numbers for use with custom classes
  - (Documentation: "We strongly suggest use of negative integers for custom class IDs. Non-negative integers are reserved.")
- and thus specifying the rules with each class's enum value or selected unique number
  - which would either force additional annoying syntax for TS-compiler required typecasting or require us to 'loosen' up the internal typing of wordbreaker functions
- and, finally, ensuring that char class enum values do not shift on rebuilds of data.ts, as they need to remain consistent in all versions of Keyman...
  - (This would be a followup point we'd need to include with feat(common/models): update wordbreaker data #7224 if and when we address that.)

then I could definitely bring the number of changed lines down and reduce the file size impact if we were willing to push the top 3 bullet points above onto model developers.

I noted this in a later comment, but since it's relevant to my comment above: the minified propertyMap used for string -> numeric property ID lookup... is 206 bytes minified by itself.

Then, there are the two methods that exist to map string-based specifications to the internal int/enum-based one:

- propertyMatch: minifies at 260 bytes
- propertyVal: minifies at 250 bytes(!) - I didn't expect this one to be as large as it apparently is; looks like that's due to optional chaining.

jahorton · 2022-09-14T01:19:29Z

This needs to be added to the LM documentation on help.keyman.com. I'd like to see a draft of that before we merge this, because this becomes a permanent API for Keyman lexical models once we merge, and so we need to be confident that we are not going to need any breaking changes to it in the future.

Which means we need to stabilize the final form of it before working on the documentation.

The file size increase is significant. Are there any ways we can tighten that up?

Which, in turn, means figuring out what our answer to that is first.

jahorton · 2022-09-14T01:26:30Z

I've also thought of "a way" to shrink the size of the backing data table, but that would be its own beast of a side project and would result in a notably less human-readable file. Granted, it's supposed to be auto-generated, so that's not the worst thing, but still. Note that this would likely be far more impactful than any optimizations we could do here - that table by itself weighs in at 17698 bytes in its current form when minified.

(The idea: there's little reason we can't compress the table into two coded character strings - one for BMP, one for SMP. One char instead of 4 or 5 [representing the numeric value] would make a big difference. Such work would belong as a followup to suggestions seen in #7224.)
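
A minimal sketch of the string-coding idea, covering BMP-style single code units only. The layout (alternating range-start and property characters), the 0x20 offset, and the names `encodeBmp` and `lookup` are all assumptions for illustration; the table values are made up and are not the real data.ts contents.

```typescript
// Hypothetical miniature table: [first code point of range, property ID].
const ranges: [number, number][] = [
  [0x0000, 0], // Other
  [0x0041, 1], // ALetter (illustrative ID)
  [0x005b, 0],
  [0x0061, 1],
  [0x007b, 0],
];

// Encode each range as two UTF-16 code units: the range start, then the
// property ID offset by 0x20 so that the property character stays printable.
function encodeBmp(table: [number, number][]): string {
  return table
    .map(([start, prop]) => String.fromCharCode(start, prop + 0x20))
    .join('');
}

// Binary search for the last range that starts at or before `code`,
// then decode the property character that follows it.
function lookup(encoded: string, code: number): number {
  let lo = 0;
  let hi = encoded.length / 2 - 1;
  while (lo < hi) {
    const mid = Math.ceil((lo + hi) / 2);
    if (encoded.charCodeAt(mid * 2) <= code) {
      lo = mid;
    } else {
      hi = mid - 1;
    }
  }
  return encoded.charCodeAt(lo * 2 + 1) - 0x20;
}

const encoded = encodeBmp(ranges);
```

Each range costs two characters in the encoded string regardless of how many digits its numeric form would need. Supplementary-plane ranges would need a second string or surrogate-aware indexing, which this sketch does not attempt.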

jahorton · 2022-09-14T01:57:58Z

Assuming I've identified the minified code region properly, the BreakerContext class is responsible for 1518 bytes of the 2663 bytes reported.

It may be possible to trim off a few bytes by trimming some of the internal field names, though they're referenced infrequently enough that it won't be a big savings. Additionally, reverting the simplest rules back to their pre-change form may save a few more bytes. And finally... it's surprising that minification didn't try to optimize all the null uses with a common variable. I might be able to do something there... but in my initial attempts, minification's being too smart for its own good. (It's inlining null automatically.)

Of course, note that the more aggressive we get with this, the less easily-maintainable the result might be.

Hitting the first two points: I can get something a little trickier to follow that saves about 250 bytes. (Note: the format of some default rules within index.ts will be less consistent.) For comparison, the minified propertyMap used for string -> numeric property ID lookup... is 206 bytes minified by itself.

mcdurdin · 2022-09-14T04:52:53Z

Thank you for investigating the potential size optimizations. I agree with you that these smaller optimizations are going to be minimal wins at best, with a maintenance cost that makes them not worthwhile.

When we move to ES6 (iOS 13.4 is the primary roadblock on this) we'll see improvements across the board with optional chaining and other similar tweaks.

I've also thought of "a way" to shrink the size of the backing data table, ... that table by itself weighs in at 17698 bytes in its current form when minified.

(The idea: there's little reason we can't compress the table into two coded character strings - one for BMP, one for SMP. One char instead of 4 or 5 [representing the numeric value] would make a big difference. Such work would belong as a followup to suggestions seen in #7224.)

This is a worthwhile optimization project. It shouldn't be particularly difficult to do that? I look forward to seeing that as a follow-up to #7224 as you say.

mcdurdin · 2022-09-14T05:02:26Z

Again, the actual model.ts file compiles well via command-line with the lexical-model compiler - no editing needed! That said, when editing (granted, in VSCode), I did see a minor issue on the final line of the closure - it doesn't know what wordBreakers['default'] is referencing there. I don't know if this will also be an issue in Developer yet.

Within Developer (and probably VSCode), we don't have access to the default wordbreaker code for the TSServer. So red squiggles will be a reality for now. As long as it compiles, probably all good. Note that the Developer parser for the model editor is pretty limited (once we move away from Delphi and do the editor in TypeScript, this problem should be resolvable).

jahorton · 2022-09-14T08:52:07Z

I've started a draft PR for documentation on the new wordbreaker extensions. I think I've about got the custom rule part covered, but I haven't yet covered the other two aspects.

keymanapp/help.keyman.com#600

jahorton · 2022-09-15T03:43:24Z

common/models/wordbreakers/src/default/index.ts

+  function propertyVal(propName: string, options?: DefaultWordBreakerOptions) {
+    const matcher = (name: string) => name.toLowerCase() == propName.toLowerCase()
+
+    const customIndex = options?.customProperties?.findIndex(matcher) ?? -1;
+    return customIndex != -1 ? -customIndex - 1 : data.propertyMap.findIndex(matcher);
+  }
+
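
To make the ID scheme in this hunk concrete, here is a self-contained sketch with a made-up three-entry `propertyMap`. The diff uses `Array.prototype.findIndex` with a case-insensitive matcher; this sketch substitutes `indexOf` over lower-cased names (the hypothetical `indexOfName` helper) so that it needs no ES2015 library support.

```typescript
interface DefaultWordBreakerOptions {
  customProperties?: string[];
}

// Illustrative subset; the real propertyMap lists every default property name.
const propertyMap: string[] = ['Other', 'ALetter', 'Numeric'];

// Case-insensitive position of `target` within `names`, or -1 if absent.
function indexOfName(names: string[], target: string): number {
  return names.map((n) => n.toLowerCase()).indexOf(target.toLowerCase());
}

function propertyVal(propName: string, options?: DefaultWordBreakerOptions): number {
  const custom = options && options.customProperties ? options.customProperties : [];
  const customIndex = indexOfName(custom, propName);
  // Custom index 0 -> -1, 1 -> -2, ...; default names keep their non-negative index.
  return customIndex !== -1 ? -customIndex - 1 : indexOfName(propertyMap, propName);
}

const sampleOptions: DefaultWordBreakerOptions = { customProperties: ['Hyphen', 'Apostrophe'] };

const hyphenId = propertyVal('Hyphen', sampleOptions);         // first custom property
const apostropheId = propertyVal('apostrophe', sampleOptions); // case-insensitive match
const letterId = propertyVal('ALetter', sampleOptions);        // default property
const typoId = propertyVal('Hypen', sampleOptions);            // unknown name
```

One thing this sketch surfaces: an unrecognized name falls through to the default lookup and yields -1, which is the same value as the first custom property. Whether the real implementation guards against that elsewhere is not visible in this hunk.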


After some further thought, it may be possible to drop the customProperties array and dynamically build it as custom properties are referenced by this function.

- Check whether the property name is in the default list. If so, return its value.
- If not, check whether it's already stored within customProperties (dynamically built when run).
  - If it is already stored within customProperties, return the value as above.
  - If not, add it to the array and return the corresponding value for the new entry.

Of course, this won't help if typos are made in the property name... but there wasn't an existing "catch" for that to begin with, so oh well.

It would make things a little more complex within this function, but it would remove the need for the verbose external boilerplate - the explicit definition of customProperties. (Even when writing it, I thought "this feels like it should be implicit.")
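
The inference steps above could be sketched roughly as follows. This is a hypothetical variant written only to illustrate the idea, not code from the PR; `defaultProps`, `inferredCustom`, and `inferPropertyVal` are made-up names.

```typescript
// Illustrative subset of the default property names.
const defaultProps: string[] = ['Other', 'ALetter', 'Numeric'];

// Grows at run time as previously unseen names are referenced.
const inferredCustom: string[] = [];

function lowerAll(names: string[]): string[] {
  return names.map((n) => n.toLowerCase());
}

function inferPropertyVal(propName: string): number {
  const key = propName.toLowerCase();

  // Default property: return its non-negative index.
  const defaultIndex = lowerAll(defaultProps).indexOf(key);
  if (defaultIndex !== -1) {
    return defaultIndex;
  }

  // Otherwise treat the name as custom, registering it on first sight.
  let customIndex = lowerAll(inferredCustom).indexOf(key);
  if (customIndex === -1) {
    customIndex = inferredCustom.push(propName) - 1;
  }
  return -customIndex - 1; // same negative-ID scheme as propertyVal
}

const firstSeen = inferPropertyVal('Hyphen');
const seenAgain = inferPropertyVal('hyphen');
const misspelled = inferPropertyVal('Hypen');
```

Note how the misspelled name silently becomes a second custom property, which is the typo risk mentioned above.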

Hmm... but trying to infer does run into an issue: early custom rules may not try to match against the custom property, so it's possible for characters to be classified as "Other" before the new property can be inferred.

Suppose the following:

if(context.propertyMatch(null, ["ALetter"], ["Other"], null)) {
  // ...
} else if(context.propertyMatch(null, ["Other"], ["Custom"], null)) {
  // ...
}

Both instances of ["Other"] could be matched to characters that should be mapped to ["Custom"] when relying on run-time inference, as "Custom" would only be inferred - and thus, defined - when the ["Custom"] part is processed. That's... not good.

So... it may be best to rely on an explicit declaration after all, since that can easily avoid such an issue.


mcdurdin

LGTM.

@eddieantonio what do you think of the docs in keymanapp/help.keyman.com#600?

Also approving bypass of the file-size check. We will try and win back bytes in other ways 😁

eddieantonio

I think the docs look good! LGTM!

keyman-server · 2022-09-19T18:02:46Z

Changes in this pull request will be available for download in Keyman version 16.0.67-alpha

darcywong00 · 2022-11-07T00:56:55Z

common/models/wordbreakers/src/default/index.ts

+  function propertyVal(propName: string, options?: DefaultWordBreakerOptions) {
+    const matcher = (name: string) => name.toLowerCase() == propName.toLowerCase()
+
+    const customIndex = options?.customProperties?.findIndex(matcher) ?? -1;


Do we need a polyfill of findIndex for older Android API (Chrome 37)?

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/findIndex
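
For context on what a fallback could look like, here is a minimal prototype-free helper that prefers the native method when the engine provides one. It is only a sketch of the general technique; `findIndexCompat` is a hypothetical name, and this is not the polyfill that was later merged.

```typescript
// Uses the engine's Array.prototype.findIndex when present; otherwise falls
// back to a plain loop. The `any` cast avoids depending on ES2015 lib typings.
function findIndexCompat<T>(items: T[], predicate: (item: T, index: number) => boolean): number {
  const native = (Array.prototype as any).findIndex;
  if (typeof native === 'function') {
    return native.call(items, predicate);
  }
  for (let i = 0; i < items.length; i++) {
    if (predicate(items[i], i)) {
      return i;
    }
  }
  return -1;
}

const names = ['Other', 'ALetter', 'Hyphen'];
const found = findIndexCompat(names, (n) => n.toLowerCase() === 'hyphen');
const missing = findIndexCompat(names, (n) => n === 'Custom');
```

A helper like this avoids patching built-in prototypes, at the cost of changing every call site.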

jahorton added 5 commits September 12, 2022 15:14

refactor(common/models): default wordbreaker state tracking

5bad267

feat(common/models): wordbreaker context .match method

991a5bc

feat(common/models): interface for custom breaker rules

51caf53

feat(common/models): actual wordbreaker customization

8e076fe

docs(common/models): adds minor doc re 'sot', 'eot'

261e252

jahorton added this to the A16S10 milestone Sep 13, 2022

jahorton requested a review from eddieantonio September 13, 2022 04:04

keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label Sep 13, 2022

github-actions bot added common/ common/models/ common/models/wordbreakers/ feat labels Sep 13, 2022

chore(common/models): minor cleanup

94face2

jahorton marked this pull request as ready for review September 13, 2022 05:11

jahorton requested a review from mcdurdin as a code owner September 13, 2022 05:11

eddieantonio reviewed Sep 13, 2022

mcdurdin reviewed Sep 13, 2022

jahorton mentioned this pull request Sep 14, 2022

feat(common/models): update wordbreaker data #7224

Closed

jahorton mentioned this pull request Sep 15, 2022

feat: docs for wordbreaker extension API keymanapp/help.keyman.com#600

Merged

jahorton commented Sep 15, 2022

feat(common/models): adds tokenization tests for wordbreaks near the …

26fb6da

…caret

github-actions bot added the common/models/templates/ label Sep 15, 2022

keymanapp-test-bot bot removed the user-test-missing User tests have not yet been defined for the PR label Sep 15, 2022

mcdurdin approved these changes Sep 15, 2022

jahorton mentioned this pull request Sep 16, 2022

bug(web): unrelated suggestions in banner for languages (non-Roman script case) #4953

Open

2 tasks

eddieantonio approved these changes Sep 16, 2022

mcdurdin modified the milestones: A16S10, A16S11 Sep 17, 2022

jahorton merged commit d0d3ebb into master Sep 19, 2022

jahorton deleted the feat/common/models/wordbreaker-extension branch September 19, 2022 01:32

darcywong00 mentioned this pull request Nov 4, 2022

bug(android): Error in Keyboard 'sil_euro..' for language message appears in Android 5.0 emulator #7613

Closed

8 tasks

darcywong00 reviewed Nov 7, 2022

This was referenced Nov 8, 2022

fix(web): Add polyfill for Array.includes() #7646

Merged

fix(web): Add polyfill for Array.findIndex() #7652

Merged
