Using Unicode locale ID vs BCP 47 in our spec #63
Comments
Wow. It seems like the differences are pretty much in line with our pain points (irregular, grandfathered, privateuse without langtag) or things we already talked about supporting on input. I'm cautiously excited about this proposal. |
@zbraniecki Their pain points and yours exist for equivalent reasons: these are notable problems when dealing with language tags in the context of locale-based APIs. What's more, CLDR provides the basis of many underlying implementations, so it makes sense to arrive at very similar choices when dealing with these issues. As such, I support this proposal. One minor nit: the excluded tags at least potentially exist in content, and they need to be addressed, even if that means mapping them all to a single innocuous value. |
+1 Some of BCP 47 + language-subtag-registry seems more geared towards bibliographic use. FYI, the CLDR spec link above is for the latest draft (which will soon be released for CLDR 34). For the definition of Unicode Language Identifier: http://www.unicode.org/reports/tr35/#Unicode_language_identifier |
+1, for reasons already stated. Can follow up tomorrow. |
Well, not quite "tomorrow"... For the reasons stated, it is much cleaner to use the Unicode locale identifiers — the cleanest being the "Unicode BCP 47 locale identifiers" as in Unicode BCP 47 Conformance (draft, but soon to be released). Those are all conformant BCP 47 language tags, but with some additional semantic restrictions and semantic additions. In case it is useful, note that Addison and I are the editors of the main RFC of BCP 47. |
My one caveat/concern with this thread and related ones is: there is a universe of tags, including rubbish ones, that can't be overlooked by Intl. There needs to be a clearly defined mapping or method of handling them, given that someone out there is finding utility in using said tags. Unicode's mapping is helpful, but not round-trip. The constraints provided may not be enough: say what happens with the other tags, even the inconvenient ones. (Saying that rubbish things happen with rubbish tags is fine.) I guess my objection could be summed up as: I don't like the gap UTS 35 leaves for grandfathered tags. Just say they all turn into root or something innocuous or useless (tlh-Cyrl-AQ !). Ditto private use tags. Further, specify that the input tag may not be recoverable later, at least in these cases. Otherwise, +1 to @macchiati. Sorry for brevity: (tablet, airplane) |
I agree that it would be useful to specify what to do with them, rather than "cannot be converted". Simplest: Turn them into und or root, depending on whether root makes sense (the spec already has conditionals for that).
CLDR does say to prepend "und-" in conversion to Unicode lang IDs. (At least in the draft for CLDR 34.) The conversion to BCP 47 could turn an initial "und-x-" into just "x-" to make all-privateuse tags round-trip, but then tags that are "und-x-..." to begin with won't round-trip. You have to choose one or the other. I think it's fair to leave the "und-" prefix alone, especially considering what a pain it is to support privateuse tags "properly". (They are the only case where conceivably a getLanguageSubtag() API would return a string of arbitrary length for a valid tag, rather than a single subtag of at most 8 characters.)
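The two non-round-tripping choices can be sketched with hypothetical helpers (the "und-" rule is from the CLDR draft cited above; the function names are made up for illustration):

```javascript
// Converting BCP 47 to a Unicode language ID prepends "und-" to
// all-privateuse tags, per the CLDR draft discussed above.
function bcp47ToUnicodeLangId(tag) {
  // An all-privateuse tag like "x-foo" becomes "und-x-foo".
  return tag.startsWith("x-") ? "und-" + tag : tag;
}

function unicodeLangIdToBcp47(id, stripUnd = false) {
  // Optionally turn a leading "und-x-" back into "x-". Doing so makes
  // "x-foo" round-trip, but then tags that were "und-x-..." to begin
  // with no longer round-trip: you have to choose one or the other.
  if (stripUnd && id.startsWith("und-x-")) return id.slice("und-".length);
  return id;
}
```

Either choice loses information in one direction, which is the trade-off described above.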
SGTM |
@markusicu You could turn |
Good to see the above discussion. I think this is a really important issue. Switching to referencing Unicode locale identifiers sounds good to me at a high level, but we've discussed some aspects of Unicode locale identifiers and come to different conclusions. For example, we explicitly decided to not support some of the allowed features in Unicode language tags, such as _ instead of -. In Intl v1, it was a particular design decision to not expose the root locale, to avoid misuse. But we can reconsider these things. I have definitely heard feature requests from web developers about accepting various different kinds of tags, as @aphillips mentions, but it's not clear what the API, definition or data sources should be. For some of these tags, we were considering a potential future separate API for their processing. |
The working draft version introduces a term exactly for that usage:
http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html#BCP_47_Conformance
Expected to be final in a week or so, but feedback still welcome.
+1 I'd add to the list disallowing a language tag starting with a script subtag. |
What's used in the current spec is not BCP 47 alone but "BCP 47 + RFC 6067 + IANA Language subtag registry". |
The next version of the spec (due in a few days) separates the backwards
compatibility aspects of Unicode locale identifiers out, and defines a term
for the Unicode locale identifiers that don't have any of those backwards
compatibility features: *Unicode BCP 47 locale identifier*
http://www.unicode.org/reports/tr35/proposed.html#BCP_47_Conformance
So that is what I would recommend for this case.
Mark
|
Related issue: tc39/ecma402#212 |
@macchiati This looks great--if we stick to Unicode BCP 47 locale identifiers, it seems like many annoying edge cases that we've spent a lot of time working through are simply defined away. |
@macchiati : With 'Unicode BCP 47 locale identifier', how are variants like 'preeuro', 'stroke', 'cyrillic', 'direct' and 'pinyin' handled? (see tc39/ecma402#273 ). I hope they're not given any special treatment/mapping. The current ICU implementation results in the following mappings and many others (after going through forLanguageTag and toLanguageTag):
zh-pinyin ==> zh-u-co-pinyin
es-ES-preeuro => es-ES-u-cu-esp
uz-UZ-CYRILLIC => uz-Cyrl-UZ
|
The variants on the left are not allowed in BCP 47 (and thus not in Unicode
BCP 47 locale identifiers), while those on the right are Unicode BCP 47
locale identifiers.
Does that answer your question/concern?
Mark
|
@macchiati That helps, thanks. So, if we say that Intl.Locale (and all of ECMA-402's constructors) supports only Unicode BCP 47 locale identifiers, those would throw a RangeError. Would folks be happy with those semantics? In a follow-on proposal, we could create alternate factory functions on Intl.Locale for various more tolerant/legacy locale identifiers. |
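For reference, engines that ship Intl.Locale already throw a RangeError for structurally ill-formed tags; the question above is only about well-formed-but-legacy forms. A quick probe, assuming an environment where Intl.Locale is implemented:

```javascript
// Returns true if Intl.Locale accepts the tag, false if the constructor
// rejects it with a RangeError (the error type ECMA-402 specifies for
// invalid language tags).
function accepts(tag) {
  try {
    new Intl.Locale(tag);
    return true;
  } catch (e) {
    if (e instanceof RangeError) return false;
    throw e; // anything else is unexpected
  }
}
```

Under the semantics proposed above, tags outside the Unicode BCP 47 locale identifier grammar would make `accepts` return false.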
Yes, that would be fine to throw an exception on anything but
well-formed Unicode BCP 47 locale identifiers. As you say, there could be
more lenient factory methods added later.
Mark
|
@macchiati Thank you for the clarification. My question was whether the canonicalization of bogus/legacy variant subtags currently done by ICU (such as mapping zh-pinyin to zh-u-co-pinyin) is allowed/required by Unicode BCP 47 locale identifier handling. Good to hear that it's not. @littledan wrote: "In a follow-on proposal, we could create alternate factory functions on Intl.Locale for various more tolerant/legacy locale identifiers."
Why do you want to do that? What would we gain from this?
Well, zh-pinyin, es-ES-preeuro etc. are still structurally valid per BCP 47, although pinyin and preeuro are NOT registered, so they are not valid variant subtags per BCP 47. The current spec does not throw a range error for language tags that are structurally valid but (partly) made of unregistered subtags. Instead, it just passes them through. Changing that behavior would put a significant (?) burden on implementations. c.f. ICU does not go beyond the structural validity check (+ canonicalization) either, though it may in the future. BTW, Ecma 402 does require that a given timezone ID is checked against the list of allowed tz IDs. The Spidermonkey implementation has a rather large set of mapping/exception lists on top of ICU's list. For timezone IDs, that's a lot more manageable than for language tags. |
One more clarifying question: which parts of the IANA language subtag registry's deprecated/preferred-value mappings have to be followed in "Unicode BCP 47 locale identifiers", and which should not be? 'Unicode BCP 47 locale identifier' has its own mapping entries for languages and regions. For some subtags, it's more comprehensive (e.g. treating the region subtag 'SU' in a context-dependent manner). For others, it's less so, or different. |
Ok, here are some thoughts; much longer than I'd intended at first.
*Well-formedness. *The first level is to guarantee structural integrity:
that each Unicode BCP 47 Locale Identifier (UBLI?) is well formed,
following the spec. Supporting that requires little code and no substantial
data. At this level, I'd also include mechanical canonicalization. That is,
performing all the steps that also don't require any data: making sure the
casing is right, making sure that the right fields are in the right order
(variants are sorted, extensions are sorted, keys for -u- and -t- are
sorted).
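The data-free "mechanical" steps can be sketched as follows. This toy (hypothetical function, illustration only) handles just a simple language[-script][-region][-variants] shape; real Unicode BCP 47 locale identifiers also have extension sequences whose keys must be sorted:

```javascript
// Mechanical canonicalization needing no registry data: fix subtag
// casing and sort variants.
function mechanicallyCanonicalize(tag) {
  const subtags = tag.split("-");
  const out = [subtags[0].toLowerCase()]; // language: lowercase
  const variants = [];
  for (const st of subtags.slice(1)) {
    if (/^[a-z]{4}$/i.test(st)) {
      // script: titlecase (e.g. "cyrl" -> "Cyrl")
      out.push(st[0].toUpperCase() + st.slice(1).toLowerCase());
    } else if (/^([a-z]{2}|[0-9]{3})$/i.test(st)) {
      out.push(st.toUpperCase()); // region: uppercase
    } else {
      variants.push(st.toLowerCase()); // variants: lowercase, then sorted
    }
  }
  return out.concat(variants.sort()).join("-");
}
```

For example, this turns `UZ-CYRL-uz` into the canonical casing `uz-Cyrl-UZ` with no data tables at all.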
*Validity. *This tests each field in the locale to make sure that it has
acceptable values. Why do this? It is so that you prevent common mistakes
where invalid codes are used. We've seen many of these over the years:
validation helps you tell that the data is bad when another process hands
you de-SW to mean German (Switzerland) — instead of the correct de-CH. General
purpose systems still should allow deprecated codes for backwards
compatibility, so that if you get my-BU for some reason, you can still
treat it as valid. (BU being deprecated).
If your system is always kept up to date, such as in some companies,
validity is very helpful; since your system is always using the latest
validity information, you can prevent these kinds of errors. On the other
hand, if your system may be running on devices that get out of date (say
mobile phones), you really don't want to be that exacting. You don't want
to throw exceptions when a more up-to-date system passes you de-SW, because
that newer system has the new country code for New South Wales (which
seceded from Australia in 2020).
In an ideal world, the validity data would work like the timezone data;
almost all systems update pretty quickly. But in the actual world, general
purpose systems should give the choice as to whether to validate or not.
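The "validate or don't" choice can be sketched as a flag on a lookup against snapshot data. The sets below are a tiny illustrative subset, not real CLDR validity data:

```javascript
// Snapshot of valid and deprecated region codes (illustrative subset).
const VALID_REGIONS = new Set(["CH", "DE", "IL", "MM", "US"]);
const DEPRECATED_REGIONS = new Set(["BU", "DD"]); // BU -> MM, DD -> DE

// Validity check: reject unknown codes like "SW", but (optionally) keep
// accepting deprecated codes like "BU" for backwards compatibility.
function isValidRegion(region, { allowDeprecated = true } = {}) {
  return (
    VALID_REGIONS.has(region) ||
    (allowDeprecated && DEPRECATED_REGIONS.has(region))
  );
}
```

With `allowDeprecated: true`, `my-BU` stays usable on out-of-date devices; with it off, a strictly up-to-date system can flag mistakes like `de-SW` immediately.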
*Canonicalization. *This is to ensure that the most up-to-date codes are
used. Why do it? Because it is crucial for correct comparison and matching.
The key example is he-IL vs iw-IL. These are semantically identical, but
the iw form was deprecated in favor of he*. There are uncounted problems
because some systems use one code and some use the other; and these are
problems not just between different vendors, but also within companies
(speaking from painful experience).
There are two ways to make this work. One is to alter the equals() and
compareTo() methods (or the equivalent in whatever programming language is
being used). That can solve many problems, but has two disadvantages.
First, it makes comparison slower, since there is always an extra check to
see if (for example) a failed comparison between he and iw needs to access
an alias mapping. Second, there are many times when locales are serialized
out into the string format (eg in a database), and the raw string
comparisons would fail.
The other alternative is to have a canonicalization operation. There are
defined alias tables for doing this in CLDR, and they map deprecated forms
to their canonical equivalents. Thus iw => he, BU => MM, etc. By
canonicalizing the locales, you ensure that equals() and compareTo() work
as expected. It does not solve the problem completely; you can still have
the serialized form of a string need to change because of a new
deprecation. However, it massively reduces the problem. The BCP 47 language
codes are far more stable now: we should expect no changes that would
affect significant numbers of users. Region codes are more likely to cause
problems. Suppose that the US split into the Confederate States (CQ) and
the Union (UU), neither keeping the US region code. In that case, stored
strings of es-US would need to be recanonicalized.
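A canonicalization operation over such alias tables can be sketched like this; the tables are a tiny illustrative subset of CLDR's alias data, not the full set:

```javascript
// Deprecated -> canonical subtag aliases (illustrative subset of CLDR).
const LANGUAGE_ALIASES = { iw: "he", in: "id", ji: "yi" };
const REGION_ALIASES = { BU: "MM", DD: "DE", ZR: "CD" };

// Map each deprecated subtag to its canonical replacement so that plain
// string comparison of canonicalized tags works (iw-IL and he-IL compare
// equal after canonicalization).
function canonicalizeAliases(tag) {
  return tag
    .split("-")
    .map((st, i) => {
      if (i === 0) return LANGUAGE_ALIASES[st] ?? st; // language subtag
      if (/^[A-Z]{2}$/.test(st)) return REGION_ALIASES[st] ?? st; // region
      return st;
    })
    .join("-");
}
```

Canonicalizing once at the boundary avoids slowing down every equals()/compareTo() call with alias lookups, at the cost of stored strings needing recanonicalization after a new deprecation.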
Mark
*I'll rant a bit here: this is mostly due to ISO not having had any
stability constraints; even though these are internal codes, they felt no
compunction about changing them. And even worse, they also reused them: the
ISO code CS was reused for two different countries! So a database
identifying country of birth by ISO code would suddenly have incorrect
data. That was one of the driving forces behind BCP 47, which added a
mechanism to ensure that arbitrary changes wouldn't occur and that codes
wouldn't be reused. There still can be deprecations, however.
Note: recently we found that some of the language-code deprecations from
BCP 47 were not being pulled into CLDR. There's a ticket to fix that and
make it part of the automatic update process for each CLDR release, and I
expect that ticket to be fixed in the next release. Luckily, none of the
missing ones would affect any substantial number of users, but it's still
embarrassing!
|
Thank you for the long, detailed reply. The current ICU implementation does the first two (structure check and mechanical canonicalization, along with mapping deprecated subtags to preferred values). So do the spec and implementations of Ecma Intl.Locale and the locale parameter handling in other Intl APIs. What is not done is checking against the list of valid subtags. |
Right. There are internals (LocaleValidityChecker) in ICU4J (but not C)
that will validate, but since that isn't surfaced as public API...
Mark
|
Do we want to start doing this checking? Given that mobile phones are a key use case for us, and we have a long history of not checking in ECMA-402 on the web, maybe we should leave that in the "follow-on proposal" bucket.
I'm not sure if it would be so high priority, but the goal would be to help JS programs deal with legacy/platform-specific locale identifiers. Separating into a separate API keeps the core simple. |
Well, if we don't barf on them or "canonicalize" them to root, it becomes difficult to do things like apply additional tags to them. The current Intl.Locale algorithm is full of special cases for this particular purpose. |
I'm mostly in violent agreement with @macchiati. I guess my position boils down to: don't barf, canonicalize to root to save all the attempts to extract "meaning" from the meaningless. |
@aphillips and I talked in the W3C i18n meeting about this topic further, in particular about the few grandfathered tags that don't canonicalize to anything. @aphillips suggested that CLDR add canonicalizations for them (possibly matching what ICU outputs), and we move our reference for this data from IANA to CLDR. Would anyone be interested in filing these CLDR tickets? @anba wrote up the list of the exceptions in #12 (comment) . |
In the most recent release of LDML spec, they are canonicalized to valid
tags — see
http://unicode.org/reports/tr35/#Language_Tag_to_Locale_Identifier
The ultimate fallback is und-x-<original code>, so cel-gaulish →
und-x-cel-gaulish.
We could also add some specific aliases (such as cel-gaulish →
xtg-x-cel-gaulish) although since these are essentially never used, it
hardly seems worth the effort.
Mark
|
@macchiati Thanks! I missed that change (not sure how, the text is very straightforward). Seems like there's nothing to change in CLDR, just for the spec text in this proposal to be updated. |
Np.
Mark
|
@littledan I drew the action to follow up on this, so thanks for doing this. @macchiati I thought you had done this---and you had. I agree about not bothering mapping the cel-gaulish's of the world. |
Thanks, sounds like we are all in sync.
Mark
|
Clarifying question: Is the canonicalization in step 1 of the BCP 47 Language Tag to Unicode BCP 47 Locale Identifier algorithm intended to sort the |
UTS 35, rather than RFC 5646, provides a more modern and regular normalization algorithm for locales. This standard definition will be implementable in ICU and then shared among implementations, rather than relying on buggy, implementation-specific normalization algorithms. It also provides a more regular and easier-to-manipulate form for Intl.Locale. c.f. tc39/proposal-intl-locale#63
We've concluded that we will reference Unicode BCP 47 Locale Identifiers, which resolves this issue. Thanks for suggesting the simplification here! |
Mark, I have one problem related to this in the test262 suite. Could you explain to me what xtg means in xtg-x-cel-gaulish? Currently https://github.com/tc39/test262/blob/master/test/intl402/Locale/extensions-grandfathered.js fails because of this: cel-gaulish gets turned into xtg-x-cel-gaulish first, then we try to build the locale by replacing fields with these options:
options: {
language: "fr",
script: "Cyrl",
region: "FR",
numberingSystem: "latn",
},
The current expectation in the test is "fr-Cyrl-FR-u-nu-latn", but my implementation gets "fr-Cyrl-FR-u-nu-latn-x-cel-gaulish" because cel-gaulish first became xtg-x-cel-gaulish. I am currently using icu::Locale to parse the language/script/region/variant/other parts, but that causes not only parsing but also canonicalization. Maybe I should just build my own simple parser to do the replacement instead, so I can avoid such "early canonicalization". |
zh-pinyin ==> zh-u-co-pinyin
es-ES-preeuro => es-ES-u-cu-esp
uz-UZ-CYRILLIC => uz-Cyrl-UZ
@macchiati wrote: "The variants on the left are not allowed in BCP 47 (and thus not in Unicode BCP 47 locale identifiers), while those on the right are Unicode BCP 47 locale identifiers."
@macchiati - Do you mean "pinyin", "preeuro" and "CYRILLIC" are not registered under https://tools.ietf.org/html/bcp47#section-3.5 so they are not allowed in BCP 47? Because these are structurally valid variants, right? |
xtg is Transalpine Gaulish; see
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
(search the page for "xtg").
x-.... always comes at the end. So if you are trying to replace
"fr-Cyrl-FR-u-nu-latn" with relevant fields from "cel-gaulish" the process
would be:
canonicalize both:
"fr-Cyrl-FR-u-nu-latn" , "xtg-x-cel-gaulish"
then fr gets replaced by xtg
and (no x) gets replaced by x-cel-gaulish
so you get
xtg-Cyrl-FR-u-nu-latn-x-cel-gaulish
Mark
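The replacement order above can be sketched with deliberately naive string handling (a hypothetical helper for illustration, not the actual Intl.Locale algorithm; it only swaps the language subtag and carries over the -x- sequence):

```javascript
// Merge the fields built from the options bag with a canonicalized base
// tag: the base's language subtag wins, and its private-use (-x-)
// sequence, if any, stays glued to the end.
function mergeTagFields(optionsTag, canonicalBase) {
  // Split off everything after the first "-x-" as the private-use part.
  const [baseMain, basePriv] = canonicalBase.split(/-x-(.*)/s);
  const [optMain, optPriv] = optionsTag.split(/-x-(.*)/s);
  const subtags = optMain.split("-");
  subtags[0] = baseMain.split("-")[0]; // language comes from the base
  const priv = basePriv ?? optPriv;    // base's private-use sequence wins
  return subtags.join("-") + (priv ? "-x-" + priv : "");
}
```

Applied to the example above: the options tag "fr-Cyrl-FR-u-nu-latn" merged with the canonicalized base "xtg-x-cel-gaulish" yields "xtg-Cyrl-FR-u-nu-latn-x-cel-gaulish".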
|
Sorry, the variants in the 2nd and 3rd lines. "pinyin" is valid.
preeuro and CYRILLIC are not in
https://www.unicode.org/repos/cldr/tags/latest/common/validity/variant.xml
(nor
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry).
They are syntactically well-formed, but not valid.
(BTW, I don't remember the context, and the rest of the message has been
omitted.)
Mark
|
The context is this:
In ECMA-402, the Intl.Locale constructor takes a tag and canonicalizes it. The question is, when it gets "zh-pinyin", "es-ES-preeuro" or "uz-UZ-CYRILLIC" as input, should it keep them as-is after the (BCP 47 Locale Identifier) canonicalization process, or should the BCP 47 canonicalization process, as step 1 of the Locale Identifier canonicalization process, turn them into something else? From what I can read, pinyin, preeuro and CYRILLIC are all structurally valid variants, and I see no registered variant value in the IANA registry that has a preferred value, so I do not believe they should be canonicalized into something else.
But currently ICU canonicalizes as below:
zh-pinyin ==> zh-u-co-pinyin
es-ES-preeuro => es-ES-u-cu-esp
uz-UZ-CYRILLIC => uz-Cyrl-UZ
This behavior breaks test262 tests now, and I am trying to figure out what action I should take. Should I
1) find a reasonable standard/spec to justify the current ICU behavior and request that test262 change the expectation, OR
2) consider it an ICU bug and file an ICU ticket to request changing them?
Please advise which of the above actions I should take. I believe I should take 2) because I cannot find information to support 1), but maybe I missed something.
Thanks
|
I'm a bit lost on the technical details here. Is there a change we need to follow up with for tests or the specification? |
that is what I am trying to figure out. |
Here is the issue. Those 3 mappings were added for compatibility with pre-BCP 47 versions of Unicode locale identifiers. I don't know whether it is necessary for ICU to continue to support them (as far as I'm concerned they could be dropped). So I see the following options:
1. No change to the ECMA spec, which thus follows LDML for canonicalization. Then either:
   a. file a ticket in ICU to drop those three mappings, OR
   b. use ICU, but special-case those 3 mappings (ugly but doable; if (a) is going to be done, this could just be a temporary workaround).
2. OR modify the ECMA spec to allow these 3 mappings for backwards compatibility.
|
@macchiati Thanks for your reply. Now I understand it is not that I missed something in UTS 35 or LDML, but that the ICU behavior is simply out of sync with the spec. I have already filed bugs in ICU; I just wanted to make sure this should be treated as a bug instead of a "feature". See https://unicode-org.atlassian.net/browse/ICU-20187 and https://unicode-org.atlassian.net/browse/ICU-20411. |
OK, sounds like there is nothing to do at the specification level then, right? |
We've switched the ECMA-402 spec to Unicode BCP 47 Locale Identifiers, so this issue should be resolved. |
@littledan this is a proposal we could work into our Locale spec, if we can get the group to agree on the change.
The current spec (and most of the constructors) expects a BCP 47 locale ID. A cleaner approach would be to use a Unicode locale ID; see here for the differences:
http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html#BCP_47_Conformance
It does not allow for the full syntax of [BCP47]:
It allows for certain additions:
There are multiple problems with BCP 47 tags, from the mildly annoying grandfathered tags (the source of most Locale bugs in V8) to script mapping.
For example: