Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(common/models): wordbreaker customization #7279

Merged
merged 7 commits into from
Sep 19, 2022

Conversation

jahorton
Copy link
Contributor

@jahorton jahorton commented Sep 13, 2022

Addresses #3347. If nothing else, this should help prompt some nice discussion toward a better solution.

Note the addition of unit tests to cover almost every base case of the standard Unicode word-breaking implementation, then extra ones to verify that customization can work for a few decent sample cases.

To be clear, any custom rules will have priority over all rules after WB4 from the wordbreaker spec. Furthermore, reclassifying any character is absolute - each character may only match a single word-breaking property value.

These changes don't make any direct changes to the lexical model compiler, so testing may be needed to be sure there are no issues with specifying the customizations to the word-breaking function. (That said, it appears to be 100% fine when using command-line compilation!) Alternatively, the model.ts file could provide a more friendly format (syntactic sugar) that compiles down to the actual customization function closure format shown below.

I have tested this with KeymanWeb via an edited lexical model (for a minority language with Khmer-based script):

Before:

wordBreaker: wordBreakers['default']

After:

wordBreaker: (text) => {
    let customization = {
      rules: [{
        match: (context) => {
          if(context.propertyMatch(null, ["ALetter"], ["Hyphen"], ["ALetter"])) {
            return true;
          } else if(context.propertyMatch(null, ["ALetter"], ["Hyphen"], ["eot"])) {
            return true;
          } else if(context.propertyMatch(["ALetter"], ["Hyphen"], ["ALetter"], null)) {
            return true;
          } else {
            return false;
          }
        },
        breakIfMatch: false
      }],
      propertyMapping: (char) => {
        let hyphens = ['\u002d', '\u2010', '\u058a', '\u30a0'];
        if(char >= '\u1780' && char <= '\u17b3') {  // treats Khmer consonants & independent vowels as English chars
            return "ALetter";
        } else if(hyphens.includes(char)) {
            return "Hyphen";
        } else {
          // The other Khmer characters already have useful word-breaking
          // property assignments.
          return null;
        }
      },
      customProperties: ["Hyphen"]
    }
    return wordBreakers['default'](text, customization);
  },

Resulting package: sil.jra-khmr.jarai.model.kmp.zip

The experiment was done by making these edits in https://github.com/keymanapp/lexical-models/tree/master/release/sil/sil.jra-khmr.jarai. Note that the unit tests contain a few other nifty examples for possible customization, though I'll admit the solution for French & Italian article conjunctions isn't the most elegant.

Again, the actual model.ts file compiles well via command-line with the lexical-model compiler - no editing needed! That said, when editing (granted, in VSCode), I did see a minor issue on the final line of the closure - it doesn't know what wordBreakers['default'] is referencing there. I don't know if this will also be an issue in Developer yet.

@keymanapp-test-bot skip

I believe the sizable array of new unit tests are sufficient to cover us here.

Though, if we did want some form of user test, installation of the model from the zipped KMP could be used with one of the build artifacts for a test.

@jahorton jahorton added this to the A16S10 milestone Sep 13, 2022
@keymanapp-test-bot keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label Sep 13, 2022
@jahorton jahorton marked this pull request as ready for review September 13, 2022 05:11
@jahorton jahorton requested a review from mcdurdin as a code owner September 13, 2022 05:11
Copy link
Contributor

@eddieantonio eddieantonio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the plan on documenting this feature? It presents quite a large API surface!

Copy link
Member

@mcdurdin mcdurdin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design looks quite clean and extensible. Well done.

The file size increase is significant. Are there any ways we can tighten that up?

What is the plan on documenting this feature? It presents quite a large API surface!

This needs to be added to the LM documentation on help.keyman.com. I'd like to see a draft of that before we merge this, because this becomes a permanent API for Keyman lexical models once we merge, and so we need to be confident that we are not going to need any breaking changes to it in the future.

@@ -4,7 +4,7 @@ export namespace data {
/**
* Valid values for a word break property.
*/
export const enum WordBreakProperty {
export const enum WordBreakProperty { // Scary bit: this does not exist as an object at run-time!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
export const enum WordBreakProperty { // Scary bit: this does not exist as an object at run-time!
export const enum WordBreakProperty {

I don't understand the point of that comment? That's a standard feature of TypeScript.

Copy link
Contributor Author

@jahorton jahorton Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, looks like I left a coding investigation comment fragment in again.

That said...

The file size increase is significant. Are there any ways we can tighten that up?

It's highly related to the answer to this comment of yours. Also, the implications of const enum use in an auto-generated file here motivated a significant amount of the design.

In the examples in the description above, note the use of strings to refer to the character classes. Because of this "standard feature" - the fact that const enums compile down - we cannot use this enum to facilitate that use of strings. Correspondingly, there are a few functions and a data structure that I had to write for the sole purpose of facilitating string-based specification of custom classes and rules.

If we were willing to forgo this aspect of the feature:

  • which would mean relying solely on the enum above for the default character classes
  • thus requiring the model developer to track separate and unique numbers for use with custom classes
    • (Documentation: "We strongly suggest use of negative integers for custom class IDs. Non-negative integers are reserved.")
  • and thus specifying the rules with each class's enum value or selected unique number
    • which would either force additional annoying syntax for TS-compiler required typecasting or require us to 'loosen' up the internal typing of wordbreaker functions
  • and, finally, ensuring that char class enum values do not shift on rebuilds of data.ts, as they need to remain consistent in all versions of Keyman...

then I could definitely bring the number of changed lines down and reduce the file size impact if we were willing to push the top 3 bullet points above onto model developers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noted this in a later comment, but since it's relevant to my comment above: the minified propertyMap used for string -> numeric property ID lookup... is 206 bytes minified by itself.

Then, there's the two methods that exist to map string-based specifications to the internal int/enum-based one:

  • propertyMatch: minifies at 260 bytes
  • propertyVal: minifies at 250 bytes(!) - I didn't expect this one to be as large as it apparently is; looks like that's due to optional chaining.

@jahorton
Copy link
Contributor Author

jahorton commented Sep 14, 2022

This needs to be added to the LM documentation on help.keyman.com. I'd like to see a draft of that before we merge this, because this becomes a permanent API for Keyman lexical models once we merge, and so we need to be confident that we are not going to need any breaking changes to it in the future.

Which means we need to stabilize the final form of it before working on the documentation.

The file size increase is significant. Are there any ways we can tighten that up?

Which, in turn, means figuring out what our answer to that is first.

@jahorton
Copy link
Contributor Author

jahorton commented Sep 14, 2022

I've also thought of "a way" to shrink the size of the backing data table, but that would be its own beast of a side project and would result in a notably less human-readable file. Granted, it's supposed to be auto-generated, so that's not the worst thing, but still. Note that this would likely be far more impactful than any optimizations we could do here - that table by itself weighs in at 17698 bytes in its current form when minified.

(The idea: there's little reason we can't compress the table into two coded character strings - one for BMP, one for SMP. One char instead of 4 or 5 [representing the numeric value] would make a big difference. Such work would belong as a followup to suggestions seen in #7224.)

@jahorton
Copy link
Contributor Author

jahorton commented Sep 14, 2022

Assuming I've identified the minified code region properly, the BreakerContext class is responsible for 1518 bytes of the 2663 bytes reported.

It may be possible to trim off a few bytes by trimming some of the internal field names, though they're referenced infrequently enough that it won't be a big savings. Additionally, reverting the simplest rules back to their pre-change form may save a few more bytes. And finally... it's surprising that minification didn't try to optimize all the null uses with a common variable. I might be able to do something there... but in my initial attempts, minification's being too smart for its own good. (It's inlining null automatically.)

Of course, note that the more aggressive we get with this, the less easily-maintainable the result might be.

Hitting the first two points: I can get something a little trickier to follow that saves about 250 bytes. (Note: the format of some default rules within index.ts will be less consistent.) For comparison, the minified propertyMap used for string -> numeric property ID lookup... is 206 bytes minified by itself.

@mcdurdin
Copy link
Member

mcdurdin commented Sep 14, 2022

Thank you for investigating the potential size optimizations. I agree with you that these smaller optimizations are going to be minimal wins at best, with a maintenance cost that makes them not worthwhile.

When we move to ES6 (iOS 13.4 is the primary roadblock on this) we'll see improvements across the board with optional chaining and other similar tweaks.

I've also thought of "a way" to shrink the size of the backing data table, ... that table by itself weighs in at 17698 bytes in its current form when minified.

(The idea: there's little reason we can't compress the table into two coded character strings - one for BMP, one for SMP. One char instead of 4 or 5 [representing the numeric value] would make a big difference. Such work would belong as a followup to suggestions seen in #7224.)

This is a worthwhile optimization project. It shouldn't be particularly difficult to do that? I look forward to seeing that as a follow-up to #7224 as you say.

@mcdurdin
Copy link
Member

Again, the actual model.ts file compiles well via command-line with the lexical-model compiler - no editing needed! That said, when editing (granted, in VSCode), I did see a minor issue on the final line of the closure - it doesn't know what wordBreakers['default'] is referencing there. I don't know if this will also be an issue in Developer yet.

Within Developer (and probably VSCode), we don't have access to the default wordbreaker code for the TSServer. So red squiggles will be a reality for now. As long as it compiles, probably all good. Note that the Developer parser for the model editor is pretty limited (once we move away from Delphi and do the editor in TypeScript, this problem should be resolvable).

@jahorton
Copy link
Contributor Author

I've started a draft PR for documentation on the new wordbreaker extensions. I think I've about got the custom rule part covered, but I haven't yet covered the other two aspects.

keymanapp/help.keyman.com#600

Comment on lines +580 to +586
function propertyVal(propName: string, options?: DefaultWordBreakerOptions) {
const matcher = (name: string) => name.toLowerCase() == propName.toLowerCase()

const customIndex = options?.customProperties?.findIndex(matcher) ?? -1;
return customIndex != -1 ? -customIndex - 1 : data.propertyMap.findIndex(matcher);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some further thought, it may be possible to drop the customProperties array and dynamically build it as custom properties are referenced by this function.

  1. Check if the property name is not in the default list. If so, return it.
  2. If not, check if it's already stored within customProperties (dynamically built when run).
  3. If stored within customProperties already, return the value as above.
  4. If not, add it to the array and return the corresponding value for the new entry.

Of course, this won't help if typos are made in the property name... but there wasn't an existing "catch" for that to begin with, so oh well.

It would make things a little more complex within this function, but it would remove the need for the verbose external boilerplate - the explicit definition of customProperties. (Even when writing it, I thought "this feels like it should be implicit.")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... but trying to infer does run into an issue: early custom rules may not try to match against the custom property, so it's possible for characters to be classified as "Other" before the new property can be inferred.

Suppose the following:

if(context.propertyMatch(null, ["ALetter"], ["Other"], null)) {
  // ...
} else if(context.propertyMatch(null, ["Other"], ["Custom"], null) {
  // ...
}

Both instances of ["Other"] could be matched to characters that should be mapped to ["Custom"] when relying on run-time inference, as "Custom" would only be inferred - and thus, defined - when the ["Custom"] part is processed. That's... not good.

So... it may be best to rely on an explicit declaration after all, since that can easily avoid such an issue.

@keymanapp-test-bot keymanapp-test-bot bot removed the user-test-missing User tests have not yet been defined for the PR label Sep 15, 2022
Copy link
Member

@mcdurdin mcdurdin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@eddieantonio what do you think of the docs in keymanapp/help.keyman.com#600?

Also approving bypass of the file-size check. We will try and win back bytes in other ways 😁

Copy link
Contributor

@eddieantonio eddieantonio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the docs look good! LGTM!

@mcdurdin mcdurdin modified the milestones: A16S10, A16S11 Sep 17, 2022
@jahorton jahorton merged commit d0d3ebb into master Sep 19, 2022
@jahorton jahorton deleted the feat/common/models/wordbreaker-extension branch September 19, 2022 01:32
@keyman-server
Copy link
Collaborator

Changes in this pull request will be available for download in Keyman version 16.0.67-alpha

function propertyVal(propName: string, options?: DefaultWordBreakerOptions) {
const matcher = (name: string) => name.toLowerCase() == propName.toLowerCase()

const customIndex = options?.customProperties?.findIndex(matcher) ?? -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a polyfill of findIndex for older Android API (Chrome 37)?

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/findIndex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants