feat(common/models): update wordbreaker data #7224

Closed
jahorton opened this issue Sep 7, 2022 · 8 comments · Fixed by #10690

Comments

@jahorton
Contributor

jahorton commented Sep 7, 2022

Is your feature request related to a problem? Please describe.

The current data.ts for our predictive-text wordbreaker is based on Unicode 13.0 / https://www.unicode.org/reports/tr41/tr41-26.html#Props0, but there are more recent versions of Unicode available. We may want to consider some mechanism to update the file periodically.

Note that the file is generated from code provided by @eddieantonio at https://github.com/eddieantonio/unicode-default-word-boundary/tree/master/libexec. (In fact, the rest of the wordbreaker code was developed there first, then replicated here in namespace format rather than the module format used there!)

Describe the solution you'd like

There are a few different approaches we could consider:

  1. Just write up a readme about the process, including links to that repo, and remember to run an update manually once a release cycle or something.
    • For now, I suppose this issue is that "readme", in a sense.
  2. Import the code used to generate the data.ts, tweak it if (and as) necessary, and write up a readme for that.
  3. We could consider writing a tool to automate most, if not all, of the process!
    • Noting the format of the URLs provided by the Unicode reports, they may provide an evergreen link to the most current version of the files:
    • It should be "simple enough" to write up a tool to poll the relevant URLs (there's an extra file that was originally 'baked in'), download 'em, and run the data.ts-generator on 'em.
    • If the URLs are indeed stable and always point to the 'latest', we could, in theory, include the update as a CI step.
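As a rough illustration of the parsing half of option 3, here is a minimal sketch that turns UCD-style property lines (the `XXXX..YYYY ; Property # comment` layout used by Unicode data files) into sorted ranges. The names `PropertyRange` and `parseWordBreakProperties` are hypothetical and are not taken from the existing generator; downloading the files and emitting data.ts are left out.

```typescript
// Sketch only: `PropertyRange` and `parseWordBreakProperties` are
// hypothetical names, not part of the existing data.ts generator.
interface PropertyRange {
  start: number;    // first code point in the range (inclusive)
  end: number;      // last code point in the range (inclusive)
  property: string; // word-break property name, e.g. "ALetter"
}

function parseWordBreakProperties(text: string): PropertyRange[] {
  const ranges: PropertyRange[] = [];
  // Matches "0041..005A ; ALetter" as well as single entries like "0022 ; Double_Quote".
  const entry = /^([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)/;

  for (const rawLine of text.split(/\r?\n/)) {
    // Drop the trailing "# ..." comment, then skip lines with no data.
    const line = rawLine.split('#')[0].trim();
    const match = entry.exec(line);
    if (!match) continue;

    const start = parseInt(match[1], 16);
    const end = match[2] !== undefined ? parseInt(match[2], 16) : start;
    ranges.push({ start, end, property: match[3] });
  }

  // Sort by starting code point so later lookups can binary-search.
  return ranges.sort((a, b) => a.start - b.start);
}
```

A CI step along these lines would still need to fetch the source files and compare the regenerated output against the checked-in data.ts before proposing an update.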
@jahorton
Contributor Author

Notes for a follow-up / further enhancement based on this:

#7279 (comment)

I've also thought of "a way" to shrink the size of the backing data table, but that would be its own beast of a side project and would result in a notably less human-readable file. [...]

(The idea: there's little reason we can't compress the table into two coded character strings - one for BMP, one for SMP. One char instead of 4 or 5 [representing the numeric value] would make a big difference.)
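To make the idea concrete, here is a minimal sketch of one way it could work, assuming the table is a sorted list of range-start code points. Each start is stored as a single UTF-16 code unit, and lookups binary-search the string directly. The helper names and the `offset` parameter (0 for the BMP string, 0x10000 for the SMP string) are assumptions for illustration, not the existing table format.

```typescript
// Sketch only: `encodeStarts` and `findRangeIndex` are hypothetical helpers.
// `offset` is subtracted so each value fits in one 16-bit code unit:
// 0 for BMP starts, 0x10000 for SMP (plane 1) starts.
function encodeStarts(starts: number[], offset = 0): string {
  return String.fromCharCode(...starts.map((start) => start - offset));
}

// Returns the index of the last range whose start is <= codePoint,
// or -1 if codePoint precedes every range in the encoded string.
function findRangeIndex(encoded: string, codePoint: number, offset = 0): number {
  let low = 0;
  let high = encoded.length - 1;
  let result = -1;
  while (low <= high) {
    const mid = (low + high) >>> 1;
    if (encoded.charCodeAt(mid) + offset <= codePoint) {
      result = mid;   // candidate; keep looking to the right
      low = mid + 1;
    } else {
      high = mid - 1;
    }
  }
  return result;
}
```

The returned index would then select the property for that range, which could itself be packed one character per entry in a parallel string. Code points beyond plane 1 would not fit this scheme and would need separate handling.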

@mcdurdin mcdurdin added this to the 17.0 milestone Oct 14, 2022
@mcdurdin mcdurdin modified the milestones: 17.0, Future Dec 9, 2022
@jahorton
Contributor Author

jahorton commented Dec 12, 2022

A fun note from @srl295: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter provides a standardized implementation for some word-breaking and related functionality. It's not in all the browsers we aim to support - in fact, it's not in Firefox at all yet - but it's a promising detail for the future.
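For reference, a minimal sketch of what word segmentation through `Intl.Segmenter` looks like, with a feature check since not every runtime provides it. The wrapper name `segmentWords` is hypothetical, and the cast through `any` is only there so the sketch compiles without the newer Intl type definitions.

```typescript
// Sketch only: `segmentWords` is a hypothetical wrapper, not existing code.
// Returns the word-like segments of `text`, or null when the runtime
// does not provide Intl.Segmenter.
function segmentWords(text: string, locale = 'en'): string[] | null {
  const SegmenterCtor = (Intl as any).Segmenter;
  if (typeof SegmenterCtor !== 'function') {
    return null; // caller should fall back to its own wordbreaker
  }

  const segmenter = new SegmenterCtor(locale, { granularity: 'word' });
  const words: string[] = [];
  for (const part of segmenter.segment(text)) {
    // `isWordLike` is false for whitespace and punctuation segments.
    if (part.isWordLike) {
      words.push(part.segment);
    }
  }
  return words;
}
```

The `null` return keeps the decision about a fallback with the caller, so a custom wordbreaker can still be used wherever the built-in one is missing or unsuitable.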

@mcdurdin
Member

Yes, except that we can't just use system-supplied (browser or OS) functionality because we are enabling the bleeding edge of language support. We still need to be able to do this ourselves.

@mcdurdin
Member

I think I've said this a few times, but it bears repeating: with Keyman, we can never rely on language support that is there in the system, whether that is segmentation, normalization, BCP 47, or anything else. We support languages that have never been supported and which may never be supported. And even if they are eventually supported, we aim to provide the functionality today.

@jahorton
Contributor Author

I know; I just wanted to note that it exists; it may also be of some use for supplying default data during model development, for example. He had some other ideas too, but I'll let him write that comment.

@mcdurdin
Member

Really keen to use existing functionality where it helps, so long as we have a way to roll-our-own also 😁

@srl295
Member

srl295 commented Dec 12, 2022

Briefly, at the very least this is the API we ought to use, even if the implementation is something else.

@jahorton
Contributor Author

Adding this as a related note: see #10568 for a reference on a license to copy over if/when implementing this potential update, especially should we copy over and check in the related source files.
