Add lw, ss options? #23

Closed
littledan opened this issue Jan 20, 2018 · 9 comments

Comments

@littledan
Member

There are additional options for breaking, specifically:

  • lw -- word break style (normal, keepall, breakall, matching CSS)
  • ss -- sentence break suppression (none, standard -- standard might be the better behavior here)
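For readers unfamiliar with how these keys would be requested: `lw` and `ss` are Unicode extension keywords, so they ride along in the `-u-` part of a BCP 47 locale tag (for instance `en-u-ss-standard`). The following is a minimal sketch of pulling those keywords out of a tag. The function name `unicodeKeywords` is hypothetical, and the parsing is deliberately simplified (no validation, attributes ignored); it is not how any particular engine does it.

```typescript
// Hypothetical helper: extract Unicode extension keywords (the "-u-" part)
// from a BCP 47 locale tag. Simplified sketch with no validation.
function unicodeKeywords(tag: string): Record<string, string> {
  const subtags = tag.toLowerCase().split("-");
  const start = subtags.indexOf("u");
  const keywords: Record<string, string> = {};
  if (start === -1) return keywords;

  let key = "";
  let parts: string[] = [];
  // Store the value collected so far for the current key, if any.
  const flush = (): void => {
    if (key !== "") keywords[key] = parts.join("-");
  };

  for (const subtag of subtags.slice(start + 1)) {
    if (subtag.length === 1) break; // next singleton (e.g. "-x-") ends the extension
    if (subtag.length === 2) {
      // A two-character subtag starts a new key such as "lb", "lw" or "ss".
      flush();
      key = subtag;
      parts = [];
    } else if (key !== "") {
      // Longer subtags belong to the value of the current key.
      parts.push(subtag);
    }
  }
  flush();
  return keywords;
}
```

With this sketch, `unicodeKeywords("ja-u-lb-strict-lw-keepall")` yields `{ lb: "strict", lw: "keepall" }`, and a tag without a `-u-` extension yields an empty object.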

It's not clear how important these options are. When we discussed this issue in the ECMA 402 VC meeting in January 2018, @srl295 argued that all options should be presented, while @jungshik argued that the line breaking options are the most important ones. Do these options require additional data? If so, that is a further argument against them.

@zbraniecki
Member

@anba - is there an easy way to check the payload difference between supporting lw and ss and not?

@anba

anba commented Feb 19, 2018

Does ICU actually support the "lw" Unicode extension key? At least I didn't see any functions mentioning "word break style" in the ICU API docs.

I didn't measure the size for the sentence break suppression data, but it's probably negligible. IIUC for sentence break suppression a filtered break iterator with locale-specific data is used, e.g. this file [2] for English (the entries under "exceptions/SentenceBreak" are used). Only German, English, Spanish, French, Italian, Portuguese, and Russian seem to be supported [3] and the overall number of entries seems to be relatively low.

[1] https://searchfox.org/mozilla-central/rev/5536f71c3833018c4f4e2c73f37eae635aab63ff/intl/icu/source/common/brkiter.cpp#436-449
[2] http://bugs.icu-project.org/trac/browser/trunk/icu4c/source/data/brkitr/en.txt
[3] http://bugs.icu-project.org/trac/browser/trunk/icu4c/source/data/brkitr
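To illustrate the idea of sentence break suppression described above, here is a toy sketch: a naive base segmenter proposes breaks, and a filtering layer removes any break that falls right after a known abbreviation. This is only an illustration of the concept, not ICU's implementation. The names `naiveSentences` and `suppressedSentences` and the short `EXCEPTIONS` list are assumptions made up for this example; the real entries come from the locale data under "exceptions/SentenceBreak".

```typescript
// Hypothetical exception list, standing in for locale data.
const EXCEPTIONS: string[] = ["Mr.", "Mrs.", "Dr.", "e.g.", "etc."];

// Naive base segmenter: break after ".", "!" or "?" when followed by a space.
// Each segment keeps its trailing space.
function naiveSentences(text: string): string[] {
  const out: string[] = [];
  let start = 0;
  for (let i = 0; i < text.length - 1; i++) {
    const ch = text.charAt(i);
    if ((ch === "." || ch === "!" || ch === "?") && text.charAt(i + 1) === " ") {
      out.push(text.slice(start, i + 2));
      start = i + 2;
    }
  }
  if (start < text.length) out.push(text.slice(start));
  return out;
}

// True if `s` ends with `word` and `word` starts at a word boundary.
function endsWithWord(s: string, word: string): boolean {
  if (s.length < word.length) return false;
  const at = s.length - word.length;
  if (s.slice(at) !== word) return false;
  return at === 0 || s.charAt(at - 1) === " ";
}

// Filtering layer: merge a segment into the next one when it ends in an exception.
function suppressedSentences(text: string): string[] {
  const out: string[] = [];
  let pending = "";
  for (const seg of naiveSentences(text)) {
    pending += seg;
    const trimmed = pending.replace(/\s+$/, "");
    const suppressed = EXCEPTIONS.some((e) => endsWithWord(trimmed, e));
    if (!suppressed) {
      out.push(pending);
      pending = "";
    }
  }
  if (pending !== "") out.push(pending);
  return out;
}
```

For the input `"Mr. Smith arrived. He sat down."`, the naive segmenter produces three pieces (`"Mr. "`, `"Smith arrived. "`, `"He sat down."`), while the filtered version produces two (`"Mr. Smith arrived. "`, `"He sat down."`). Because the filter is just a lookup in a small per-locale list, the data cost scales with the number of exceptions.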

@srl295
Member

srl295 commented Feb 19, 2018

@anba Not sure. It may be that lw support is only in ICU4J.

ss does require some data (as you noted), maybe 12k (on disk) or less.

@jungshik

Is lw supported by ICU4J? I don't think so.

lw in CSS3 is not that useful as specified and implemented in Blink and other rendering engines. For Korean, it's marginally useful, but for Japanese (and Chinese), it's NOT.

To implement lw to the fullest, we need not just a dictionary (for word-breaking) but also a part-of-speech (PoS) tagger. ICU does not have that.

lw (word-break: keep-all) is useful when typesetting a multi-line title, a few lines of text in an ad or on a billboard, etc. A PoS tagger is necessary because a particle or case marker should stay together with the word it is attached to.

Anyway, what I meant wrt line-breaking is not about lw but about lb={strict,normal,loose}.

@littledan
Member Author

We discussed these options in the March 2018 Intl call and decided not to add either of these two options in this version:

  • @jungshik explained, as above, that lw doesn't really make sense generally
  • The specification is vague enough that implementations can decide whether to turn on ss themselves. The upside to ss seemed to be accuracy (though it might not be accurate enough), and the downside seemed to be performance and stability. It sounds like performance may improve over time with ss.

@srl295
Member

srl295 commented Apr 17, 2018

I wanted to understand the criteria here:

> (ss) downside seemed to be performance

This is no different from other i18n areas. Using a collator is more expensive than memcmp(), normalizing everything costs, etc. I didn't have anything else off the top of my head in the other meeting.

> and stability

Can you expand here? The segmentation exceptions are data driven, so change with locales like everything else.

I ask because this exact issue (segmentation exceptions) came up again today. I process a lot of translated content via JavaScript these days.

@littledan
Member Author

Not sure if it was clear from above, but the idea of the current specification is that it's valid to turn ss on all the time; we're just not exposing the flag to developers. If I were to implement Intl.Segmenter now, from what I know, I'd turn on ss.

> performance

I thought you said that it could use more performance work. I was probably misunderstanding. There was also a data size concern.

> Can you expand here? The segmentation exceptions are data driven, so change with locales like everything else.

Maybe I am misunderstanding; I thought this was a reason you raised for why someone may want to turn off ss.

I'd be fine to put this on the agenda for the next meeting to revisit; I was just trying to record the previous decision.

littledan reopened this on Apr 17, 2018
@littledan
Member Author

Reviewing the notes from the last meeting, it seems like we settled on considering ss for a follow-on proposal, and leaving it out in the first pass. Are you still happy with that conclusion?

@littledan
Member Author

OK, closing this issue given the resolution in #23 (comment)


5 participants