-
-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(common/models): wordbreaker customization #7279
Conversation
User Test ResultsTest specification and instructions User tests are not required Test Artifacts
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the plan on documenting this feature? It presents quite a large API surface!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The design looks quite clean and extensible. Well done.
The file size increase is significant. Are there any ways we can tighten that up?
What is the plan on documenting this feature? It presents quite a large API surface!
This needs to be added to the LM documentation on help.keyman.com. I'd like to see a draft of that before we merge this, because this becomes a permanent API for Keyman lexical models once we merge, and so we need to be confident that we are not going to need any breaking changes to it in the future.
@@ -4,7 +4,7 @@ export namespace data { | |||
/** | |||
* Valid values for a word break property. | |||
*/ | |||
export const enum WordBreakProperty { | |||
export const enum WordBreakProperty { // Scary bit: this does not exist as an object at run-time! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
export const enum WordBreakProperty { // Scary bit: this does not exist as an object at run-time! | |
export const enum WordBreakProperty { |
I don't understand the point of that comment? That's a standard feature of TypeScript.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, looks like I left a coding investigation comment fragment in again.
That said...
The file size increase is significant. Are there any ways we can tighten that up?
It's highly related to the answer to this comment of yours. Also, the implications of const enum
use in an auto-generated file here motivated a significant amount of the design.
In the examples in the description above, note the use of strings to refer to the character classes. Because of this "standard feature" - the fact that const enum
s compile down - we cannot use this enum to facilitate that use of strings. Correspondingly, there are a few functions and a data structure that I had to write for the sole purpose of facilitating string-based specification of custom classes and rules.
If we were willing to forgo this aspect of the feature:
- which would mean relying solely on the enum above for the default character classes
- thus requiring the model developer to track separate and unique numbers for use with custom classes
- (Documentation: "We strongly suggest use of negative integers for custom class IDs. Non-negative integers are reserved.")
- and thus specifying the rules with each class's enum value or selected unique number
- which would either force additional annoying syntax for TS-compiler required typecasting or require us to 'loosen' up the internal typing of wordbreaker functions
- and, finally, ensuring that char class enum values do not shift on rebuilds of data.ts, as they need to remain consistent in all versions of Keyman...
- (This would be a followup point we'd need to include with feat(common/models): update wordbreaker data #7224 if and when we address that.)
then I could definitely bring the number of changed lines down and reduce the file size impact if we were willing to push the top 3 bullet points above onto model developers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noted this in a later comment, but since it's relevant to my comment above: the minified propertyMap used for string -> numeric property ID lookup... is 206 bytes minified by itself.
Then, there's the two methods that exist to map string-based specifications to the internal int/enum-based one:
propertyMatch
: minifies at 260 bytespropertyVal
: minifies at 250 bytes(!) - I didn't expect this one to be as large as it apparently is; looks like that's due to optional chaining.
Which means we need to stabilize the final form of it before working on the documentation.
Which, in turn, means figuring out what our answer to that is first. |
I've also thought of "a way" to shrink the size of the backing data table, but that would be its own beast of a side project and would result in a notably less human-readable file. Granted, it's supposed to be auto-generated, so that's not the worst thing, but still. Note that this would likely be far more impactful than any optimizations we could do here - that table by itself weighs in at 17698 bytes in its current form when minified. (The idea: there's little reason we can't compress the table into two coded character strings - one for BMP, one for SMP. One char instead of 4 or 5 [representing the numeric value] would make a big difference. Such work would belong as a followup to suggestions seen in #7224.) |
Assuming I've identified the minified code region properly, the It may be possible to trim off a few bytes by trimming some of the internal field names, though they're referenced infrequently enough that it won't be a big savings. Additionally, reverting the simplest rules back to their pre-change form may save a few more bytes. And finally... it's surprising that minification didn't try to optimize all the Of course, note that the more aggressive we get with this, the less easily-maintainable the result might be. Hitting the first two points: I can get something a little trickier to follow that saves about 250 bytes. (Note: the format of some default rules within index.ts will be less consistent.) For comparison, the minified |
Thank you for investigating the potential size optimizations. I agree with you that these smaller optimizations are going to be minimal wins at best, with a maintenance cost that makes them not worthwhile. When we move to ES6 (iOS 13.4 is the primary roadblock on this) we'll see improvements across the board with optional chaining and other similar tweaks.
This is a worthwhile optimization project. It shouldn't be particularly difficult to do that? I look forward to seeing that as a follow-up to #7224 as you say. |
Within Developer (and probably VSCode), we don't have access to the default wordbreaker code for the TSServer. So red squiggles will be a reality for now. As long as it compiles, probably all good. Note that the Developer parser for the model editor is pretty limited (once we move away from Delphi and do the editor in TypeScript, this problem should be resolvable). |
I've started a draft PR for documentation on the new wordbreaker extensions. I think I've about got the custom rule part covered, but I haven't yet covered the other two aspects. |
function propertyVal(propName: string, options?: DefaultWordBreakerOptions) { | ||
const matcher = (name: string) => name.toLowerCase() == propName.toLowerCase() | ||
|
||
const customIndex = options?.customProperties?.findIndex(matcher) ?? -1; | ||
return customIndex != -1 ? -customIndex - 1 : data.propertyMap.findIndex(matcher); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some further thought, it may be possible to drop the customProperties
array and dynamically build it as custom properties are referenced by this function.
- Check if the property name is not in the default list. If so, return it.
- If not, check if it's already stored within
customProperties
(dynamically built when run). - If stored within
customProperties
already, return the value as above. - If not, add it to the array and return the corresponding value for the new entry.
Of course, this won't help if typos are made in the property name... but there wasn't an existing "catch" for that to begin with, so oh well.
It would make things a little more complex within this function, but it would remove the need for the verbose external boilerplate - the explicit definition of customProperties
. (Even when writing it, I thought "this feels like it should be implicit.")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... but trying to infer does run into an issue: early custom rules may not try to match against the custom property, so it's possible for characters to be classified as "Other" before the new property can be inferred.
Suppose the following:
if(context.propertyMatch(null, ["ALetter"], ["Other"], null)) {
// ...
} else if(context.propertyMatch(null, ["Other"], ["Custom"], null) {
// ...
}
Both instances of ["Other"]
could be matched to characters that should be mapped to ["Custom"]
when relying on run-time inference, as "Custom"
would only be inferred - and thus, defined - when the ["Custom"]
part is processed. That's... not good.
So... it may be best to rely on an explicit declaration after all, since that can easily avoid such an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
@eddieantonio what do you think of the docs in keymanapp/help.keyman.com#600?
Also approving bypass of the file-size check. We will try and win back bytes in other ways 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the docs look good! LGTM!
Changes in this pull request will be available for download in Keyman version 16.0.67-alpha |
function propertyVal(propName: string, options?: DefaultWordBreakerOptions) { | ||
const matcher = (name: string) => name.toLowerCase() == propName.toLowerCase() | ||
|
||
const customIndex = options?.customProperties?.findIndex(matcher) ?? -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a polyfill of findIndex
for older Android API (Chrome 37)?
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/findIndex
Addresses #3347. If nothing else, this should help prompt some nice discussion toward a better solution.
Note the addition of unit tests to cover almost every base case of the standard Unicode word-breaking implementation, then extra ones to verify that customization can work for a few decent sample cases.
To be clear, any custom rules will have priority over all rules after WB4 from the wordbreaker spec. Furthermore, reclassifying any character is absolute - each character may only match a single word-breaking property value.
These changes don't make any direct changes to the lexical model compiler, so testing may be needed to be sure there are no issues with specifying the customizations to the word-breaking function. (That said, it appears to be 100% fine when using command-line compilation!) Alternatively, the model.ts file could provide a more friendly format (syntactic sugar) that compiles down to the actual customization function closure format shown below.
I have tested this with KeymanWeb via an edited lexical model (for a minority language with Khmer-based script):
Before:
After:
Resulting package: sil.jra-khmr.jarai.model.kmp.zip
The experiment was done by making these edits in https://github.com/keymanapp/lexical-models/tree/master/release/sil/sil.jra-khmr.jarai. Note that the unit tests contain a few other nifty examples for possible customization, though I'll admit the solution for French & Italian article conjunctions isn't the most elegant.
Again, the actual model.ts file compiles well via command-line with the lexical-model compiler - no editing needed! That said, when editing (granted, in VSCode), I did see a minor issue on the final line of the closure - it doesn't know what
wordBreakers['default']
is referencing there. I don't know if this will also be an issue in Developer yet.@keymanapp-test-bot skip
I believe the sizable array of new unit tests are sufficient to cover us here.
Though, if we did want some form of user test, installation of the model from the zipped KMP could be used with one of the build artifacts for a test.