-
-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(common/models): update wordbreaker data #7224
Comments
Notes for a follow-up / further enhancement based on this:
|
A fun note from @srl295: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter provides a standardized implementation for some word-breaking and related functionality. It's not in all the browsers we aim to support - in fact, it's not in Firefox at all yet - but it's a promising detail for the future. |
Yes, except that we can't just use system-supplied (browser or OS) functionality because we are enabling the bleeding edge of language support. We still need to be able to do this ourselves. |
I think I've said this a few times, but it bears repeating: with Keyman, we can never rely on language support that is there in the system, whether that is segmentation, normalization, BCP 47, or anything else. We support languages that have never been supported and which may never be supported. And even if they are eventually supported, we aim to provide the functionality today. |
I know; I just wanted to note that it exists; it may also be of some use for supplying default data during model development, for example. He had some other ideas too, but I'll let him write that comment. |
Really keen to use existing functionality where it helps, so long as we have a way to roll-our-own also 😁 |
Briefly, at the very least this is the api we ought to use even if implementation is something else. |
Adding this as a related note: see #10568 for a reference on a license to copy over if/when implementing this potential update, especially should we copy over and check in the related source files. |
Is your feature request related to a problem? Please describe.
The current data.ts for our predictive-text wordbreaker is based on Unicode 13.0 / https://www.unicode.org/reports/tr41/tr41-26.html#Props0, but there are more recent versions of Unicode available. We may want to consider some mechanism to update the file periodically.
Note that the file is generated from code provided by @eddieantonio @ https://github.com/eddieantonio/unicode-default-word-boundary/tree/master/libexec. (In fact, the rest of the wordbreaker code was developed there first, then replicated here in namespace format instead of the module format seen there!)
Describe the solution you'd like
There are a few different approaches we could consider:
/latest/
part!The text was updated successfully, but these errors were encountered: