Using Unicode locale ID vs BCP 47 in our spec #63
Comments
Wow. It seems like the differences are pretty much in line with our pain points (irregular, grandfathered, privateuse without langtag) or things we already talked about supporting on input. I'm cautiously excited about this proposal. |
@zbraniecki Their pain points and yours exist for equivalent reasons: these are notable problems when dealing with language tags in the context of locale-based APIs. What's more, CLDR provides the basis of many underlying implementations, so it makes sense to arrive at very similar choices when dealing with these issues. As such, I support this proposal. One minor nit: the excluded tags at least potentially exist in content, and they need to be addressed, even if that means mapping them all to a single innocuous value. |
+1 Some of BCP 47 + language-subtag-registry seems more geared towards bibliographic use. FYI, the CLDR spec link above is for the latest draft (which will soon be released for CLDR 34). For the definition of Unicode Language Identifier: http://www.unicode.org/reports/tr35/#Unicode_language_identifier |
+1, for reasons already stated. Can follow up tomorrow. |
Well, not quite "tomorrow"... For the reasons stated, it is much cleaner to use the Unicode locale identifiers — the cleanest being the "Unicode BCP 47 locale identifiers" as in Unicode BCP 47 Conformance (draft, but soon to be released). Those are all conformant BCP 47 language tags, but with some additional semantic restrictions and semantic additions. In case it is useful, note that Addison and I are the editors of the main RFC of BCP 47. |
My one caveat/concern with this thread and related ones is: there is a universe of tags, including rubbish ones, that can't be overlooked by Intl. There needs to be a clearly defined mapping or method of handling them, given that someone out there is finding utility in using said tags. Unicode's mapping is helpful, but not round-trip. The constraints provided may not be enough: say what happens with the other tags, even the inconvenient ones. (Saying that rubbish things happen with rubbish tags is fine.) I guess my objection could be summed up as: I don't like the gap UTS 35 leaves for grandfathered tags. Just say they all turn into root or something innocuous or useless (tlh-Cyrl-AQ !). Ditto private use tags. Further, specify that the input tag may not be recoverable later, at least in these cases. Otherwise, +1 to @macchiati. Sorry for brevity: (tablet, airplane) |
I agree that it would be useful to specify what to do with them, rather than "cannot be converted". Simplest: Turn them into und or root, depending on whether root makes sense (the spec already has conditionals for that).
CLDR does say to prepend "und-" in conversion to Unicode lang IDs. (At least in the draft for CLDR 34.) The conversion to BCP 47 could turn an initial "und-x-" into just "x-" to make all-privateuse tags round-trip, but then tags that are "und-x-..." to begin with won't round-trip. You have to choose one or the other. I think it's fair to leave the "und-" prefix alone, especially considering what a pain it is to support privateuse tags "properly". (They are the only case where conceivably a getLanguageSubtag() API would return a string of arbitrary length for a valid tag, rather than a single subtag of at most 8 characters.)
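The two non-round-tripping choices can be sketched with hypothetical helpers (the "und-" rule is from the CLDR draft cited above; the function names are made up for illustration):

```javascript
// Converting BCP 47 to a Unicode language ID prepends "und-" to
// all-privateuse tags, per the CLDR draft discussed above.
function bcp47ToUnicodeLangId(tag) {
  // An all-privateuse tag like "x-foo" becomes "und-x-foo".
  return tag.startsWith("x-") ? "und-" + tag : tag;
}

function unicodeLangIdToBcp47(id, stripUnd = false) {
  // Optionally turn a leading "und-x-" back into "x-". Doing so makes
  // "x-foo" round-trip, but then tags that were "und-x-..." to begin
  // with no longer round-trip: you have to choose one or the other.
  if (stripUnd && id.startsWith("und-x-")) return id.slice("und-".length);
  return id;
}
```

Either choice loses information in one direction, which is the trade-off described above.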
SGTM |
@markusicu You could turn |
Good to see the above discussion. I think this is a really important issue. Switching to referencing Unicode locale identifiers sounds good to me at a high level, but we've discussed some aspects of Unicode locale identifiers and come to different conclusions. For example, we explicitly decided to not support some of the allowed features in Unicode language tags, such as _ instead of -. In Intl v1, it was a particular design decision to not expose the root locale, to avoid misuse. But we can reconsider these things. I have definitely heard feature requests from web developers about accepting various different kinds of tags, as @aphillips mentions, but it's not clear what the API, definition or data sources should be. For some of these tags, we were considering a potential future separate API for their processing. |
The working draft version introduces a term exactly for that usage:
http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html#BCP_47_Conformance
Expected to be final in a week or so, but feedback still welcome.
+1 I'd add to the list disallowing a language tag starting with a script subtag. |
What's used in the current spec is not BCP 47 alone but "BCP 47 + RFC 6067 + IANA Language subtag registry". |
The next version of the spec (due in a few days) separates the backwards
compatibility aspects of Unicode locale identifiers out, and defines a term
for the Unicode locale identifiers that don't have any of those backwards
compatibility features: *Unicode BCP 47 locale identifier*
http://www.unicode.org/reports/tr35/proposed.html#BCP_47_Conformance
So that is what I would recommend for this case.
Mark
|
Related issue: tc39/ecma402#212 |
@macchiati This looks great--if we stick to Unicode BCP 47 locale identifiers, it seems like many annoying edge cases that we've spent a lot of time working through are simply defined away. |
@macchiati : With 'Unicode BCP 47 locale identifier', how are variants like 'preeuro', 'stroke', 'cyrillic', 'direct' and 'pinyin' handled? (see tc39/ecma402#273 ). I hope they're not given any special treatment/mapping. The current ICU implementation results in the following mappings and many others (after going through forLanguageTag and toLanguageTag):
zh-pinyin ==> zh-u-co-pinyin
es-ES-preeuro => es-ES-u-cu-esp
uz-UZ-CYRILLIC => uz-Cyrl-UZ
|
The variants on the left are not allowed in BCP 47 (and thus not in Unicode
BCP 47 locale identifiers), while those on the right are Unicode BCP 47
locale identifiers.
Does that answer your question/concern?
Mark
|
@macchiati That helps, thanks. So, if we say that Intl.Locale (and all of ECMA-402's constructors) supports only Unicode BCP 47 locale identifiers, those would throw a RangeError. Would folks be happy with those semantics? In a follow-on proposal, we could create alternate factory functions on Intl.Locale for various more tolerant/legacy locale identifiers. |
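For reference, engines that ship Intl.Locale already throw a RangeError for structurally ill-formed tags; the question above is only about well-formed-but-legacy forms. A quick probe, assuming an environment where Intl.Locale is implemented:

```javascript
// Returns true if Intl.Locale accepts the tag, false if the constructor
// rejects it with a RangeError (the error type ECMA-402 specifies for
// invalid language tags).
function accepts(tag) {
  try {
    new Intl.Locale(tag);
    return true;
  } catch (e) {
    if (e instanceof RangeError) return false;
    throw e; // anything else is unexpected
  }
}
```

Under the semantics proposed above, tags outside the Unicode BCP 47 locale identifier grammar would make `accepts` return false.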
Yes, that would be fine to throw an exception on anything but
well-formed Unicode BCP 47 locale identifiers. As you say, there could be
more lenient factory methods added later.
Mark
|
@macchiati Thank you for the clarification. My question was whether the canonicalization of bogus/legacy variant subtags currently done by ICU (such as mapping zh-pinyin to zh-u-co-pinyin) is allowed/required by Unicode BCP 47 locale identifier handling. Good to hear that it's not. @littledan wrote: "In a follow-on proposal, we could create alternate factory functions on Intl.Locale for various more tolerant/legacy locale identifiers."
Why do you want to do that? What would we gain from this?
Well, zh-pinyin, es-ES-preeuro etc. are still structurally valid per BCP 47, although pinyin and preeuro are NOT registered, so they are not valid variant subtags per BCP 47. The current spec does not throw a range error for language tags that are structurally valid but (partly) made of unregistered subtags. Instead, it just passes them through. Changing that behavior would put a significant (?) burden on implementations. c.f. ICU does not go beyond the structural validity check (+ canonicalization) either, though it may in the future. BTW, Ecma 402 does require that a given timezone ID is checked against the list of allowed tz IDs. The Spidermonkey implementation has a rather large set of mapping/exception lists on top of ICU's list. For timezone IDs, that's a lot more manageable than for language tags. |
One more clarifying question: which parts of the IANA language subtag registry's deprecated/preferred-value mappings have to be followed in "Unicode BCP 47 locale identifiers", and which should not be? 'Unicode BCP 47 locale identifier' has its own mapping entries for languages and regions. For some subtags, it's more comprehensive (e.g. treating the region subtag 'SU' in a context-dependent manner). For others, it's less so, or different. |
Ok, here are some thoughts; much longer than I'd intended at first.
*Well-formedness. *The first level is to guarantee structural integrity:
that each Unicode BCP 47 Locale Identifier (UBLI?) is well formed,
following the spec. Supporting that requires little code and no substantial
data. At this level, I'd also include mechanical canonicalization. That is,
performing all the steps that also don't require any data: making sure the
casing is right, making sure that the right fields are in the right order
(variants are sorted, extensions are sorted, keys for -u- and -t- are
sorted).
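The data-free "mechanical" steps can be sketched as follows. This toy (hypothetical function, illustration only) handles just a simple language[-script][-region][-variants] shape; real Unicode BCP 47 locale identifiers also have extension sequences whose keys must be sorted:

```javascript
// Mechanical canonicalization needing no registry data: fix subtag
// casing and sort variants.
function mechanicallyCanonicalize(tag) {
  const subtags = tag.split("-");
  const out = [subtags[0].toLowerCase()]; // language: lowercase
  const variants = [];
  for (const st of subtags.slice(1)) {
    if (/^[a-z]{4}$/i.test(st)) {
      // script: titlecase (e.g. "cyrl" -> "Cyrl")
      out.push(st[0].toUpperCase() + st.slice(1).toLowerCase());
    } else if (/^([a-z]{2}|[0-9]{3})$/i.test(st)) {
      out.push(st.toUpperCase()); // region: uppercase
    } else {
      variants.push(st.toLowerCase()); // variants: lowercase, then sorted
    }
  }
  return out.concat(variants.sort()).join("-");
}
```

For example, this turns `UZ-CYRL-uz` into the canonical casing `uz-Cyrl-UZ` with no data tables at all.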
*Validity. *This tests each field in the locale to make sure that it has
acceptable values. Why do this? It is so that you prevent common mistakes
where invalid codes are used. We've seen many of these over the years:
validation helps you tell that the data is bad when another process hands
you de-SW to mean German (Switzerland) — instead of the correct de-CH. General
purpose systems still should allow deprecated codes for backwards
compatibility, so that if you get my-BU for some reason, you can still
treat it as valid. (BU being deprecated).
If your system is always kept up to date, such as in some companies,
validity is very helpful; since your system is always using the latest
validity information, you can prevent these kinds of errors. On the other
hand, if your system may be running on devices that get out of date (say
mobile phones), you really don't want to be that exacting. You don't want
to throw exceptions when a more up-to-date system passes you de-SW, because
that newer system has the new country code for New South Wales (which
seceded from Australia in 2020).
In an ideal world, the validity data would work like the timezone data;
almost all systems update pretty quickly. But in the actual world, general
purpose systems should give the choice as to whether to validate or not.
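The "validate or don't" choice can be sketched as a flag on a lookup against snapshot data. The sets below are a tiny illustrative subset, not real CLDR validity data:

```javascript
// Snapshot of valid and deprecated region codes (illustrative subset).
const VALID_REGIONS = new Set(["CH", "DE", "IL", "MM", "US"]);
const DEPRECATED_REGIONS = new Set(["BU", "DD"]); // BU -> MM, DD -> DE

// Validity check: reject unknown codes like "SW", but (optionally) keep
// accepting deprecated codes like "BU" for backwards compatibility.
function isValidRegion(region, { allowDeprecated = true } = {}) {
  return (
    VALID_REGIONS.has(region) ||
    (allowDeprecated && DEPRECATED_REGIONS.has(region))
  );
}
```

With `allowDeprecated: true`, `my-BU` stays usable on out-of-date devices; with it off, a strictly up-to-date system can flag mistakes like `de-SW` immediately.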
*Canonicalization. *This is to ensure that the most up-to-date codes are
used. Why do it? Because it is crucial for correct comparison and matching.
The key example is he-IL vs iw-IL. These are semantically identical, but
the iw form was deprecated in favor of he*. There are uncounted problems
because some systems use one code and some use the other; and these are
problems not just between different vendors, but also within companies
(speaking from painful experience).
There are two ways to make this work. One is to alter the equals() and
compareTo() methods (or the equivalent in whatever programming language is
being used). That can solve many problems, but has two disadvantages.
First, it makes comparison slower, since there is always an extra check to
see if (for example) a failed comparison between he and iw needs to access
an alias mapping. Second, there are many times when locales are serialized
out into the string format (eg in a database), and the raw string
comparisons would fail.
The other alternative is to have a canonicalization operation. There are
defined alias tables for doing this in CLDR, and they map deprecated forms
to their canonical equivalents. Thus iw => he, BU => MM, etc. By
canonicalizing the locales, you ensure that equals() and compareTo() work
as expected. It does not solve the problem completely; you can still have
the serialized form of a string need to change because of a new
deprecation. However, it massively reduces the problem. The BCP 47 language
codes are far more stable now: we should expect no changes that would
affect significant numbers of users. Region codes are more likely to cause
problems. Suppose that the US split into the Confederate States (CQ) and
the Union (UU), neither keeping the US region code. In that case, stored
strings of es-US would need to be recanonicalized.
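A canonicalization operation over such alias tables can be sketched like this; the tables are a tiny illustrative subset of CLDR's alias data, not the full set:

```javascript
// Deprecated -> canonical subtag aliases (illustrative subset of CLDR).
const LANGUAGE_ALIASES = { iw: "he", in: "id", ji: "yi" };
const REGION_ALIASES = { BU: "MM", DD: "DE", ZR: "CD" };

// Map each deprecated subtag to its canonical replacement so that plain
// string comparison of canonicalized tags works (iw-IL and he-IL compare
// equal after canonicalization).
function canonicalizeAliases(tag) {
  return tag
    .split("-")
    .map((st, i) => {
      if (i === 0) return LANGUAGE_ALIASES[st] ?? st; // language subtag
      if (/^[A-Z]{2}$/.test(st)) return REGION_ALIASES[st] ?? st; // region
      return st;
    })
    .join("-");
}
```

Canonicalizing once at the boundary avoids slowing down every equals()/compareTo() call with alias lookups, at the cost of stored strings needing recanonicalization after a new deprecation.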
Mark
*I'll rant a bit here: this is mostly due to ISO not having had any
stability constraints; even though these are internal codes, they felt no
compunction about changing them. And even worse, they also reused them: the
ISO code CS was reused for two different countries! So a database
identifying country of birth by ISO code would suddenly have incorrect
data. That was one of the driving forces behind BCP 47, which added a
mechanism to ensure that arbitrary changes wouldn't occur and that codes
wouldn't be reused. There still can be deprecations, however.
Note: recently we found that some of the language-code deprecations from
BCP 47 were not being pulled into CLDR. There's a ticket to fix that and
make it part of the automatic update process for each CLDR release, and I
expect that ticket to be fixed in the next release. Luckily, none of the
missing ones would affect any substantial number of users, but it's still
embarrassing!
|
Thank you for the long, detailed reply. The current ICU implementation does the first two (structure check and mechanical canonicalization, along with mapping deprecated subtags to preferred values). So do the spec and implementations of Ecma Intl.Locale and the locale parameter handling in other Intl APIs. What is not done is checking against the list of valid subtags. |
Right. There are internals (LocaleValidityChecker) in ICU4J (but not C)
that will validate, but since that isn't surfaced as public API...
Mark
|
Do we want to start doing this checking? Given that mobile phones are a key use case for us, and we have a long history of not checking in ECMA-402 on the web, maybe we should leave that in the "follow-on proposal" bucket.
I'm not sure if it would be so high priority, but the goal would be to help JS programs deal with legacy/platform-specific locale identifiers. Separating into a separate API keeps the core simple. |
Well, if we don't barf on them or "canonicalize" them to root, it becomes difficult to do things like apply additional tags to them. The current Intl.Locale algorithm is full of special cases for this particular purpose. |
I'm mostly in violent agreement with @macchiati. I guess my position boils down to: don't barf, canonicalize to root to save all the attempts to extract "meaning" from the meaningless. |
@aphillips and I talked in the W3C i18n meeting about this topic further, in particular about the few grandfathered tags that don't canonicalize to anything. @aphillips suggested that CLDR add canonicalizations for them (possibly matching what ICU outputs), and we move our reference for this data from IANA to CLDR. Would anyone be interested in filing these CLDR tickets? @anba wrote up the list of the exceptions in #12 (comment) . |
In the most recent release of LDML spec, they are canonicalized to valid
tags — see
http://unicode.org/reports/tr35/#Language_Tag_to_Locale_Identifier
The ultimate fallback is und-x-<original code>, so cel-gaulish →
und-x-cel-gaulish.
We could also add some specific aliases (such as cel-gaulish →
xtg-x-cel-gaulish) although since these are essentially never used, it
hardly seems worth the effort.
Mark
|
@macchiati Thanks! I missed that change (not sure how, the text is very straightforward). Seems like there's nothing to change in CLDR, just for the spec text in this proposal to be updated. |
Np.
Mark
|
@littledan I drew the action to follow up on this, so thanks for doing this. @macchiati I thought you had done this---and you had. I agree about not bothering mapping the cel-gaulish's of the world. |
Thanks, sounds like we are all in sync.
Mark
|
Clarifying question: Is the canonicalization in step 1 of the BCP 47 Language Tag to Unicode BCP 47 Locale Identifier algorithm intended to sort the |
UTS 35, rather than RFC 5646, provides a more modern and regular normalization algorithm for locales. This standard definition will be implementable in ICU and then shared among implementations, rather than relying on buggy, implementation-specific normalization algorithms. It also provides a more regular and easier-to-manipulate form for Intl.Locale. c.f. tc39/proposal-intl-locale#63
We've concluded that we will reference Unicode BCP 47 Locale Identifiers, which resolves this issue. Thanks for suggesting the simplification here! |
Mark, I have one problem related to this in the test262 suite. Could you explain to me what xtg means in xtg-x-cel-gaulish? Currently https://github.com/tc39/test262/blob/master/test/intl402/Locale/extensions-grandfathered.js fails because of this: cel-gaulish gets turned into xtg-x-cel-gaulish first, then we try to build the locale by replacing fields with these options:
options: {
language: "fr",
script: "Cyrl",
region: "FR",
numberingSystem: "latn",
},
The current expectation in the test is "fr-Cyrl-FR-u-nu-latn", but my implementation gets "fr-Cyrl-FR-u-nu-latn-x-cel-gaulish" because cel-gaulish first became xtg-x-cel-gaulish. I am currently using icu::Locale to parse the language/script/region/variant/other parts, but that causes not only parsing but also canonicalization. Maybe I should just build my own simple parser to do the replacement instead, so I can avoid such "early canonicalization". |
zh-pinyin ==> zh-u-co-pinyin
es-ES-preeuro => es-ES-u-cu-esp
uz-UZ-CYRILLIC => uz-Cyrl-UZ
@macchiati wrote: "The variants on the left are not allowed in BCP 47 (and thus not in Unicode BCP 47 locale identifiers), while those on the right are Unicode BCP 47 locale identifiers."
@macchiati - Do you mean "pinyin", "preeuro" and "CYRILLIC" are not registered under https://tools.ietf.org/html/bcp47#section-3.5 so they are not allowed in BCP 47? Because these are structurally valid variants, right? |
xtg is Transalpine Gaulish; see
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
(search the page for "xtg").
x-.... always comes at the end. So if you are trying to replace
"fr-Cyrl-FR-u-nu-latn" with relevant fields from "cel-gaulish" the process
would be:
canonicalize both:
"fr-Cyrl-FR-u-nu-latn" , "xtg-x-cel-gaulish"
then fr gets replaced by xtg
and (no x) gets replaced by x-cel-gaulish
so you get
xtg-Cyrl-FR-u-nu-latn-x-cel-gaulish
Mark
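The replacement order above can be sketched with deliberately naive string handling (a hypothetical helper for illustration, not the actual Intl.Locale algorithm; it only swaps the language subtag and carries over the -x- sequence):

```javascript
// Merge the fields built from the options bag with a canonicalized base
// tag: the base's language subtag wins, and its private-use (-x-)
// sequence, if any, stays glued to the end.
function mergeTagFields(optionsTag, canonicalBase) {
  // Split off everything after the first "-x-" as the private-use part.
  const [baseMain, basePriv] = canonicalBase.split(/-x-(.*)/s);
  const [optMain, optPriv] = optionsTag.split(/-x-(.*)/s);
  const subtags = optMain.split("-");
  subtags[0] = baseMain.split("-")[0]; // language comes from the base
  const priv = basePriv ?? optPriv;    // base's private-use sequence wins
  return subtags.join("-") + (priv ? "-x-" + priv : "");
}
```

Applied to the example above: the options tag "fr-Cyrl-FR-u-nu-latn" merged with the canonicalized base "xtg-x-cel-gaulish" yields "xtg-Cyrl-FR-u-nu-latn-x-cel-gaulish".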
|
Sorry, the variants in the 2nd and 3rd lines. "pinyin" is valid.
preeuro and CYRILLIC are not in
https://www.unicode.org/repos/cldr/tags/latest/common/validity/variant.xml
(nor
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry).
They are syntactically well-formed, but not valid.
(BTW, I don't remember the context, and the rest of the message has been
omitted.)
Mark
|
The context is this:
In ECMA-402, the Intl.Locale constructor takes a tag and canonicalizes it. The question is, when it gets "zh-pinyin", "es-ES-preeuro" or "uz-UZ-CYRILLIC" as input, should it keep them as-is after the (BCP 47 Locale Identifier) canonicalization process, or should the BCP 47 canonicalization process, as step 1 of the Locale Identifier canonicalization process, turn them into something else? From what I can read, pinyin, preeuro and CYRILLIC are all structurally valid variants, and I see no registered variant value in the IANA registry that has a preferred value, so I do not believe they should be canonicalized into something else.
But currently ICU canonicalizes as below:
zh-pinyin ==> zh-u-co-pinyin
es-ES-preeuro => es-ES-u-cu-esp
uz-UZ-CYRILLIC => uz-Cyrl-UZ
This behavior breaks test262 tests now, and I am trying to figure out what action I should take. Should I
1) find a reasonable standard/spec to justify the current ICU behavior and request that test262 change the expectation, OR
2) consider it an ICU bug and file an ICU ticket to request changing them?
Please advise which of the above actions I should take. I believe I should take 2) because I cannot find information to support 1), but maybe I missed something.
Thanks
|
I'm a bit lost on the technical details here. Is there a change we need to follow up with for tests or the specification? |
that is what I am trying to figure out. |
Here is the issue. Those 3 mappings were added for compatibility with pre-BCP 47 versions of Unicode locale identifiers. I don't know whether it is necessary for ICU to continue to support them (as far as I'm concerned they could be dropped). So I see the following options:
1. No change to the ECMA spec, which thus follows LDML for canonicalization. Then either:
   a. file a ticket in ICU to drop those three mappings, OR
   b. use ICU, but special-case those 3 mappings (ugly but doable; if (a) is going to be done, this could just be a temporary workaround).
2. OR modify the ECMA spec to allow these 3 mappings for backwards compatibility.
|
@macchiati Thanks for your reply. Now I understand it is not that I missed something in UTS 35 or LDML, but that the ICU behavior is simply out of sync with the spec. I have already filed bugs in ICU; I just wanted to make sure this should be treated as a bug instead of a "feature". See https://unicode-org.atlassian.net/browse/ICU-20187 and https://unicode-org.atlassian.net/browse/ICU-20411. |
OK, sounds like there is nothing to do at the specification level then, right? |
We've switched the ECMA-402 spec to Unicode BCP 47 Locale Identifiers, so this issue should be resolved. |
@littledan this is a proposal we could work into our Locale spec, if we can get the group to agree on the change.
The current spec (and most of the constructors) expects a BCP 47 locale ID. A cleaner approach would be to use a Unicode locale ID; see here for the differences:
http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html#BCP_47_Conformance
It does not allow for the full syntax of [BCP47]:
It allows for certain additions:
There are multiple problems with BCP 47 tags, from the mildly annoying grandfathered tags (the source of most Locale bugs in V8) to script mapping.
For example: