Add lw, ss options? #23

littledan · 2018-01-20T17:57:51Z

There are additional options for breaking, specifically:

lw -- word break style (normal, keepall, breakall, matching CSS)
ss -- sentence break suppression (none, standard -- standard might be the better behavior here)

It's not clear how important these options are. When we discussed this issue in the ECMA 402 VC meeting in January 2018, @srl295 argued that all options should be presented, while @jungshik argued that the line breaking options are the most important ones. Do these options require taking up additional data size? If so, this is an additional argument against them.

zbraniecki · 2018-02-16T17:48:47Z

@anba - is there an easy way to check the payload difference between supporting lw and ss and not?

anba · 2018-02-19T14:45:51Z

Does ICU actually support the "lw" Unicode extension key? At least I didn't see any functions mentioning "word break style" in the ICU API docs.

I didn't measure the size for the sentence break suppression data, but it's probably negligible. IIUC for sentence break suppression a filtered break iterator with locale-specific data is used, e.g. this file [2] for English (the entries under "exceptions/SentenceBreak" are used). Only German, English, Spanish, French, Italian, Portuguese, and Russian seem to be supported [3] and the overall number of entries seems to be relatively low.

[1] https://searchfox.org/mozilla-central/rev/5536f71c3833018c4f4e2c73f37eae635aab63ff/intl/icu/source/common/brkiter.cpp#436-449
[2] http://bugs.icu-project.org/trac/browser/trunk/icu4c/source/data/brkitr/en.txt
[3] http://bugs.icu-project.org/trac/browser/trunk/icu4c/source/data/brkitr
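To make the filtered-break-iterator idea above concrete, here is a minimal sketch, not ICU's actual implementation. It assumes a naive splitter that breaks after `.`, `!` or `?` followed by whitespace, and a tiny hand-written exception list. `EXCEPTIONS`, `naiveSentences` and `suppressBreaks` are hypothetical names invented for illustration; real exception data would come from locale files such as the ones linked above.

```javascript
// Hypothetical exception list; stands in for locale-specific
// "exceptions/SentenceBreak" data.
const EXCEPTIONS = new Set(["Mr.", "Mrs.", "Dr."]);

// Naive stand-in for a sentence break iterator: break after ., ! or ?
// followed by whitespace. Trailing whitespace stays with each piece,
// so the pieces concatenate back to the input.
function naiveSentences(text) {
  return text.split(/(?<=[.!?]\s+)(?=\S)/);
}

// Filter step: if the previous piece ends with a known exception,
// the break after it is suppressed and the pieces are merged.
function suppressBreaks(pieces, exceptions) {
  const out = [];
  for (const piece of pieces) {
    const prev = out[out.length - 1];
    const lastWord =
      prev === undefined ? "" : prev.trimEnd().split(/\s+/).pop();
    if (prev !== undefined && exceptions.has(lastWord)) {
      out[out.length - 1] = prev + piece;
    } else {
      out.push(piece);
    }
  }
  return out;
}

const text = "Mr. Smith arrived. He sat down.";
const unfiltered = naiveSentences(text);
const filtered = suppressBreaks(unfiltered, EXCEPTIONS);

console.log(unfiltered); // [ 'Mr. ', 'Smith arrived. ', 'He sat down.' ]
console.log(filtered);   // [ 'Mr. Smith arrived. ', 'He sat down.' ]
```

The data cost in this sketch is just the exception set, which fits the guess above that the per-locale payload is small.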

srl295 · 2018-02-19T17:38:35Z

@anba Not sure. It may be that lw support is only in ICU4J.

ss does require some data (as you noted), maybe 12k (on disk) or less.

jungshik · 2018-02-21T01:23:45Z

Is lw supported by ICU4J? I don't think so.

lw in CSS3 is not that useful as it is specified and implemented in Blink and other rendering engines. For Korean, it's marginally useful, but for Japanese (and Chinese), it's NOT.

To implement lw to the fullest, we need not just a dictionary (for word-breaking) but also a part-of-speech (PoS) tagger. ICU does not have that.

lw (word-break: keep-all) is useful when typesetting a multi-line title, a few lines of text in an ad or on a billboard, etc. A PoS tagger is necessary because a particle or case marker should stay together with the word it is associated with.

Anyway, what I meant wrt line-breaking is not about lw but about lb={strict,normal,loose}.

littledan · 2018-04-17T13:32:47Z

We discussed these options in the March 2018 Intl call and decided not to add either of these two options in this version:

@jungshik explained, as above, that lw doesn't really make sense generally
The specification is vague enough that implementations can decide whether to turn on ss themselves. The upside to ss seemed to be accuracy (though it might not be accurate enough), and the downside seemed to be performance and stability. It sounds like performance may improve over time with ss.

srl295 · 2018-04-17T21:10:56Z

I wanted to understand the criteria here:

> (ss) downside seemed to be performance

This is no different from other i18n areas. Using a collator is more expensive than memcmp(), normalizing everything costs, etc. I didn't have anything else off the top of my head in the other meeting.
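As a small illustration of that trade-off (a sketch only; the word list and the German locale are arbitrary choices, not from the discussion), compare a plain code-unit sort, roughly the JavaScript analogue of `memcmp()`, with a locale-aware `Intl.Collator`:

```javascript
const words = ["z", "ä", "a"];

// Code-unit comparison: cheap, but not linguistically meaningful.
const byCodeUnit = [...words].sort();

// Locale-aware comparison: does more work per comparison,
// but orders "ä" next to "a" as a German reader would expect.
const collator = new Intl.Collator("de");
const byLocale = [...words].sort(collator.compare);

console.log(byCodeUnit); // [ 'a', 'z', 'ä' ]
console.log(byLocale);   // [ 'a', 'ä', 'z' ]
```

The extra cost buys correctness, which is the same kind of trade ss makes for sentence segmentation.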

> and stability

Can you expand here? The segmentation exceptions are data driven, so they change with locales like everything else.

I ask because this exact issue (segmentation exceptions) came up again today. I process a lot of translated content via JavaScript these days.

littledan · 2018-04-17T21:18:57Z

Not sure if it was clear from above, but the idea of the current specification is that it's valid to turn ss on all the time; we're just not exposing the flag to developers. If I were to implement Intl.Segmenter now, from what I know, I'd turn on ss.
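For illustration, here is how that looks from the developer's side, using the `Intl.Segmenter` API as it exists in current engines (which may differ from the proposal draft at the time of this thread). There is no option for ss, so whether the break after "Mr." is suppressed is up to the implementation; the sample text is made up, and no expected output is shown because it can legitimately differ between engines.

```javascript
const text = "Mr. Smith arrived. He sat down.";

// Only the locale and the granularity are chosen here; nothing in the
// options controls sentence break suppression.
const segmenter = new Intl.Segmenter("en", { granularity: "sentence" });

// Each element yielded by segment() has a `segment` string and an `index`.
const sentences = Array.from(segmenter.segment(text), (s) => s.segment);

// Depending on the engine and its data, "Mr. " may or may not come out
// as its own segment.
console.log(sentences);
```

Either way, the segments always concatenate back to the original input.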

> performance

I thought you said that it could use more performance work. I was probably misunderstanding. There was also a data size concern.

> Can you expand here? The segmentation exceptions are data driven, so change with locales like everything else.

Maybe I am misunderstanding; I thought this was a reason you raised for why someone may want to turn off ss.

I'd be fine to put this on the agenda for the next meeting to revisit; I was just trying to record the previous decision.

littledan · 2018-04-17T21:30:48Z

Reviewing the notes from the last meeting, it seems like we settled on considering ss for a follow-on proposal, and leaving it out in the first pass. Are you still happy with that conclusion?

littledan · 2018-10-06T09:50:52Z

OK, closing this issue given the resolution in #23 (comment)

littledan closed this as completed Apr 17, 2018

littledan reopened this Apr 17, 2018

littledan closed this as completed Oct 6, 2018
