Add lw, ss options? #23
Comments
@anba - is there an easy way to check the payload difference between supporting |
Does ICU actually support the "lw" Unicode extension key? At least I didn't see any functions mentioning "word break style" in the ICU API docs. I didn't measure the size of the sentence break suppression data, but it's probably negligible. IIUC, sentence break suppression uses a filtered break iterator with locale-specific data, e.g. this file [2] for English (the entries under "exceptions/SentenceBreak" are used). Only German, English, Spanish, French, Italian, Portuguese, and Russian seem to be supported [3], and the overall number of entries seems to be relatively low. [1] https://searchfox.org/mozilla-central/rev/5536f71c3833018c4f4e2c73f37eae635aab63ff/intl/icu/source/common/brkiter.cpp#436-449 |
@anba not sure. it may be that
|
Is lw supported by ICU4J? I don't think so. lw in CSS3 is not that useful as specified and implemented in Blink and other rendering engines. For Korean it's marginally useful, but for Japanese (and Chinese) it's NOT. To implement lw to the fullest, we need not just a dictionary (for word breaking) but also a PoS tagger. lw (word-break: keep-all) is useful when typesetting a multi-line title, a few lines of text in an ad or on a billboard, etc. A PoS tagger is necessary because a particle or case marker should stay together with the word it is associated with. Anyway, what I meant with regard to line breaking is not about lw but about lb={strict,normal,loose}. |
We discussed these options in the March 2018 Intl call and decided not to add either of these two options in this version:
|
I wanted to understand the criteria here:
This is no different from other i18n areas. Using a collator is more expensive than
Can you expand here? The segmentation exceptions are data driven, so they change with locale like everything else. I ask because this exact issue (segmentation exceptions) came up again today. I process a lot of translated content via JavaScript these days. |
Not sure if it was clear from above, but the idea of the current specification is that it's valid to turn ss on all the time; we're just not exposing the flag to developers. If I were to implement Intl.Segmenter now, from what I know, I'd turn on ss.
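To illustrate the point about ss being an implementation choice rather than a developer-facing flag, here is a minimal sketch using the sentence granularity of Intl.Segmenter. The sample text is made up for illustration. An implementation that applies sentence break suppressions may keep "Mr. Smith arrived." as one sentence, while one that does not may break after "Mr. "; the sketch does not assume either behavior.

```javascript
// Sample text chosen so that an abbreviation ("Mr.") precedes a capitalized word.
const text = "Mr. Smith arrived. He sat down.";

// No option controls suppressions here; the engine decides whether to apply them.
const segmenter = new Intl.Segmenter("en", { granularity: "sentence" });

// Each item yielded by segment() carries the matched substring in `.segment`.
const sentences = Array.from(segmenter.segment(text), (s) => s.segment);

// The number of sentences depends on the implementation's suppression data.
console.log(sentences.length, sentences);
```

Whatever the engine decides, the segments always concatenate back to the original string, so code that only needs a lossless split can rely on that and treat the exact boundaries as implementation-defined.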
I thought you said that it could use more performance work. I was probably misunderstanding. There was also a data size concern.
Maybe I am misunderstanding; I thought this was a reason you raised for why someone may want to turn off ss. I'd be fine to put this on the agenda for the next meeting to revisit; I was just trying to record the previous decision. |
Reviewing the notes from the last meeting, it seems like we settled on considering |
OK, closing this issue given the resolution in #23 (comment) |
There are additional options for breaking, specifically:
It's not clear how important these options are. When we discussed this issue in the ECMA 402 VC meeting in January 2018, @srl295 argued that all options should be presented, while @jungshik argued that the line breaking options are the most important ones. Do these options require taking up additional data size? If so, this is an additional argument against them.