feat(common/models): update wordbreaker data #7224

jahorton · 2022-09-07T01:07:40Z

Is your feature request related to a problem? Please describe.

The current data.ts for our predictive-text wordbreaker is based on Unicode 13.0 / https://www.unicode.org/reports/tr41/tr41-26.html#Props0, but there are more recent versions of Unicode available. We may want to consider some mechanism to update the file periodically.

Note that the file is generated from code provided by @eddieantonio at https://github.com/eddieantonio/unicode-default-word-boundary/tree/master/libexec. (In fact, the rest of the wordbreaker code was developed there first, then replicated here in namespace format instead of the module format seen there!)

Describe the solution you'd like

There are a few different approaches we could consider:

1. Just write up a readme about the process, including links to that repo, and remember to run an update manually once a release cycle or something.
   - For now, I suppose this issue is that "readme", in a sense.
2. Import the code used to generate the data.ts, tweak it if (and as) necessary, and write up a readme for that.
3. We could consider writing a tool to automate most, if not all, of the process!
   - Noting the format of the URLs provided by the Unicode reports, they may provide an evergreen link to the most current version of the files:
     - https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakProperty.txt - note the /latest/ part!
   - It should be "simple enough" to write up a tool to poll the relevant URLs (there's an extra file that was originally 'baked in'), download 'em, and run the data.ts-generator on 'em.
   - If the URLs are indeed stable and always point to the 'latest', we could, in theory, include the update as a CI step.
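To make the "download and run the generator" idea a little more concrete, here is a minimal sketch of the parsing half only: turning the text of WordBreakProperty.txt into sorted ranges that a data.ts generator could consume. The names `WordBreakRange` and `parseWordBreakProperty` are made up for illustration and are not taken from the existing generator; downloading the file and handling the extra 'baked in' file are left out.

```typescript
// One contiguous run of code points sharing a Word_Break property value.
// (Hypothetical shape, for illustration only.)
interface WordBreakRange {
  start: number;    // first code point in the run (inclusive)
  end: number;      // last code point in the run (inclusive)
  property: string; // e.g. "ALetter", "Numeric"
}

// Parses the text of WordBreakProperty.txt. Data lines look like
//   0041..005A    ; ALetter # L&  [26] LATIN CAPITAL LETTER A..
//   000D          ; CR # Cc       <control-000D>
// Everything after '#' is a comment; blank lines are skipped.
function parseWordBreakProperty(text: string): WordBreakRange[] {
  const ranges: WordBreakRange[] = [];
  const linePattern = /^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)$/;

  for (const rawLine of text.split(/\r?\n/)) {
    // Strip the trailing comment, if any, then surrounding whitespace.
    const hashAt = rawLine.indexOf('#');
    const line = (hashAt >= 0 ? rawLine.slice(0, hashAt) : rawLine).trim();
    if (line.length === 0) {
      continue;
    }

    const match = linePattern.exec(line);
    if (!match) {
      // Fail loudly so a format change upstream is noticed, not silently skipped.
      throw new Error('Unrecognized line: ' + rawLine);
    }

    const start = parseInt(match[1], 16);
    // A single code point (no "..") is a range of length one.
    const end = match[2] ? parseInt(match[2], 16) : start;
    ranges.push({ start, end, property: match[3] });
  }

  // The file groups lines by property, so sort by code point for table lookup.
  ranges.sort((a, b) => a.start - b.start);
  return ranges;
}
```

Throwing on an unrecognized line is a deliberate choice for a CI step: if the upstream format ever changes, the update fails visibly instead of producing a quietly incomplete table.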


jahorton · 2022-09-14T04:56:34Z

Notes for a follow-up / further enhancement based on this:

#7279 (comment)

I've also thought of "a way" to shrink the size of the backing data table, but that would be its own beast of a side project and would result in a notably less human-readable file. [...]

(The idea: there's little reason we can't compress the table into two coded character strings - one for BMP, one for SMP. One char instead of 4 or 5 [representing the numeric value] would make a big difference.)
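A rough sketch of what the BMP half of that could look like, assuming each table entry is a run's starting code point plus a small index into a list of property names. `BmpEntry`, `PROP_OFFSET`, `encodeBmpTable`, and `decodeBmpTable` are hypothetical names for illustration, not the format of the existing data.ts. The SMP string would need code-point-aware handling (surrogate pairs) and is not shown.

```typescript
// Hypothetical entry shape: where a run starts, and which property it has.
interface BmpEntry {
  start: number; // first code point of the run; must be <= 0xFFFF here
  prop: number;  // index into a separate list of Word_Break property names
}

// Shift property indices into printable ASCII (assumption: fewer than ~90
// distinct properties), which keeps that half of the string readable.
const PROP_OFFSET = 0x20;

// Two UTF-16 code units per entry: the start code point, then the property.
function encodeBmpTable(entries: BmpEntry[]): string {
  let encoded = '';
  for (const entry of entries) {
    if (entry.start < 0 || entry.start > 0xFFFF) {
      throw new Error('Not a BMP code point: ' + entry.start);
    }
    encoded += String.fromCharCode(entry.start, PROP_OFFSET + entry.prop);
  }
  return encoded;
}

// Reverses encodeBmpTable by reading the string two code units at a time.
function decodeBmpTable(encoded: string): BmpEntry[] {
  const entries: BmpEntry[] = [];
  for (let i = 0; i < encoded.length; i += 2) {
    entries.push({
      start: encoded.charCodeAt(i),
      prop: encoded.charCodeAt(i + 1) - PROP_OFFSET,
    });
  }
  return entries;
}
```

One caveat with this approach: run starts that fall on control characters or in the surrogate range (U+D800 to U+DFFF) would have to be written as `\u` escapes in the checked-in file, which claws back part of the savings and is part of why the result would be less human-readable.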

jahorton · 2022-12-12T07:07:03Z

A fun note from @srl295: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter provides a standardized implementation for some word-breaking and related functionality. It's not in all the browsers we aim to support - in fact, it's not in Firefox at all yet - but it's a promising detail for the future.

mcdurdin · 2022-12-12T07:09:08Z

Yes, except that we can't just use system-supplied (browser or OS) functionality because we are enabling the bleeding edge of language support. We still need to be able to do this ourselves.

mcdurdin · 2022-12-12T07:10:30Z

I think I've said this a few times, but it bears repeating: with Keyman, we can never rely on language support that is there in the system, whether that is segmentation, normalization, BCP 47, or anything else. We support languages that have never been supported and which may never be supported. And even if they are eventually supported, we aim to provide the functionality today.

jahorton · 2022-12-12T07:15:53Z

I know; I just wanted to note that it exists; it may also be of some use for supplying default data during model development, for example. He had some other ideas too, but I'll let him write that comment.

mcdurdin · 2022-12-12T07:20:21Z

Really keen to use existing functionality where it helps, so long as we have a way to roll-our-own also 😁

srl295 · 2022-12-12T14:05:06Z

Briefly: at the very least, this is the API we ought to use, even if the implementation is something else.
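As a small illustration of "same API, different implementation": the sketch below defines a result shape loosely modelled on the `segment`, `index`, and `isWordLike` fields that `Intl.Segmenter` reports for word granularity, and backs it with a deliberately naive whitespace splitter. `WordSegment`, `WordSegmenterLike`, and `WhitespaceSegmenter` are hypothetical names, and the whitespace rule is only a stand-in, not the Unicode default word-boundary algorithm and not Keyman's wordbreaker.

```typescript
// Result shape loosely mirroring the per-segment data from Intl.Segmenter.
interface WordSegment {
  segment: string;     // the text of this segment
  index: number;       // offset of the segment within the input
  isWordLike: boolean; // false for whitespace-only segments in this sketch
}

// The contract a custom wordbreaker would implement (hypothetical).
interface WordSegmenterLike {
  segment(input: string): WordSegment[];
}

// Stand-in implementation: alternating runs of whitespace and non-whitespace.
class WhitespaceSegmenter implements WordSegmenterLike {
  segment(input: string): WordSegment[] {
    const results: WordSegment[] = [];
    const pattern = /\s+|\S+/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(input)) !== null) {
      results.push({
        segment: match[0],
        index: match.index,
        isWordLike: /\S/.test(match[0]),
      });
    }
    return results;
  }
}
```

Code written against `WordSegmenterLike` would not need to know whether the segments came from a system-supplied segmenter or from a roll-our-own one, which is the point of settling on the API shape first.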

jahorton · 2024-01-31T02:33:44Z

Adding this as a related note: see #10568 for a reference on a license to copy over if/when implementing this potential update, especially should we copy over and check in the related source files.

jahorton added the common/, common/models/, feat, and common/models/wordbreakers/ labels Sep 7, 2022

jahorton mentioned this issue Sep 14, 2022

feat(common/models): wordbreaker customization #7279

Merged

mcdurdin added this to the 17.0 milestone Oct 14, 2022

mcdurdin modified the milestones: 17.0, Future Dec 9, 2022

mcdurdin added the m:normalization label Sep 21, 2023

This was referenced Feb 13, 2024

feat(web): import the generator for the pred-text wordbreaker's Unicode-property data-table ⚡ #10690

Merged

feat(web): optimize the wordbreaker data table for filesize and ease of first-load parsing ⚡ #10692

Merged

jahorton modified the milestones: Future, A18S9, A18S8 Aug 7, 2024

mcdurdin assigned jahorton Aug 8, 2024

darcywong00 modified the milestones: A18S8, A18S9 Aug 17, 2024

jahorton closed this as completed in #10690 Aug 27, 2024
