This repository has been archived by the owner on Jan 26, 2022. It is now read-only.

Should we map the code if the type is region ? #81

Closed
FrankYFTang opened this issue Jul 9, 2020 · 10 comments

Comments

@FrankYFTang
Collaborator

In #77 (comment)
@anba suggested
"
Region (and script) subtags should also get canonicalised to replace outdated subtags with their preferred value.
"
This issue tracks only the "region part", since the issue with scripts is different.

I have a concern about canonicalizing the region code: there is no predefined process in UTS 35 for this. The process for the region subtag within unicode_language_id, described in https://unicode-org.github.io/cldr/ldml/tr35.html#Canonical_Unicode_Locale_Identifiers, depends on the language code (and the script code, if present) when multiple territories are listed in the replacement attribute of territoryAlias.

@anba
Collaborator

anba commented Jul 9, 2020

ICU internally already defaults to replacing deprecated region codes and it's not possible to disable this behaviour: https://github.com/unicode-org/icu/blob/f917c43cf153bfca7ffd60fc1cdcbb32360967ce/icu4c/source/common/locresdata.cpp#L97-L102

CLDR also doesn't provide localised names for deprecated region names, e.g. there's no entry for "SU" to localise it to "Soviet Union".

For example, new Intl.DisplayNames("en", {type:"region"}).of("SU") currently returns "Russia" in both V8 and SpiderMonkey.

Furthermore the linked ICU source code shows some issues when we don't properly canonicalise upfront before calling into ICU:

d8> new Intl.DisplayNames("en", {type:"region", style:"narrow"}).of("GB")
"UK"
d8> new Intl.DisplayNames("en", {type:"region", style:"narrow"}).of("UK")
"United Kingdom"

whereas in SpiderMonkey with explicit canonicalisation:

js> new Intl.DisplayNames("en", {type:"region", style:"narrow"}).of("GB")
"UK"
js> new Intl.DisplayNames("en", {type:"region", style:"narrow"}).of("UK")  
"UK"

(ICU only handles the string "Countries", but should also handle the string "Countries%short".)

If we don't want to use some ad-hoc steps to canonicalise a standalone region subtag, we could prepend the language subtag "und" to get a proper Unicode BCP 47 locale identifier and then canonicalise that one. So for example when the region subtag is "SU", we prepend "und" to get "und-SU", canonicalise "und-SU" to get "und-RU" and then extract the region subtag from "und-RU" to get "RU".
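
The "prepend und" idea above can be sketched in JavaScript. This is only an illustration, not spec text: `canonicalizeRegion` is a hypothetical helper name, and the sketch assumes the engine's `Intl.Locale` applies CLDR region aliases during canonicalisation, which may differ between engines and ICU versions.

```javascript
// Hypothetical helper (not part of any spec or library): canonicalise a
// standalone region subtag by round-tripping it through an "und" locale.
function canonicalizeRegion(region) {
  // A region subtag is two letters or three digits.
  if (!/^(?:[A-Za-z]{2}|[0-9]{3})$/.test(region)) {
    throw new RangeError("Invalid region subtag: " + region);
  }
  // Prepend "und" to get a well-formed locale identifier, let the engine
  // canonicalise it, then read the region subtag back out.
  const locale = new Intl.Locale("und-" + region);
  // Fall back to plain case normalisation if no region is reported.
  return locale.region !== undefined ? locale.region : region.toUpperCase();
}

console.log(canonicalizeRegion("us")); // case-normalised region
console.log(canonicalizeRegion("SU")); // alias handling depends on the engine
```

Whether "SU" comes back as "RU" depends on the alias data the engine ships, so the result for deprecated codes should be checked rather than assumed.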


Apart from that, canonicalisation will also help implementations to properly call into ICU, because ICU expects at least case canonicalised inputs. (Too lazy to properly report this bug. 😄)

d8> new Intl.DisplayNames("en", {type:"region", style:"narrow"}).of("su")
"su"
d8> new Intl.DisplayNames("en", {type:"region", style:"narrow"}).of("SU")
"Russia"

@FrankYFTang
Collaborator Author

@anba - OK, that makes sense. Any suggestions on how we should spec out such a mapping process?

@FrankYFTang
Collaborator Author

I dug into this a little more. I do not think ICU performs such mapping for all codes, as @anba suggested. I think ICU has some mappings, but not the mappings in UTS 35.
For example, look at https://github.com/unicode-org/cldr/blame/master/common/supplemental/supplementalMetadata.xml

I believe ICU does not map the following:
FQ, NT, PC, 062, 172, 200, 532, 582, 830, 890

@anba
Collaborator

anba commented Jul 9, 2020

uloc_getTableStringWithFallback from the linked ICU source code calls uloc_getCurrentCountryID, which in turn uses a hard-coded list of deprecated country ids.

(Hmm, does this further strengthen my argument to perform an explicit canonicalisation before calling into ICU, because then we don't need to rely on some hard-coded values in ICU? 😄)

@FrankYFTang
Collaborator Author

Here are the meeting notes from our discussion at the 2020-07-09 ECMA 402 meeting:

FYT: Anba suggested that we canonicalize the code, not just in terms of casing but in terms of aliases. The tricky part is canonicalizing the region code and script code. There isn't a mapping defined for this in UTS 35. If we do this, we need to have some way to spec it out clearly. Do we want to perform this additional mapping or just the casing change?
SFC: Why can't we say that the region code is canonicalized based on the locale? In other words, spec out the current behaviour? Is it because it's not defined in UTS 35?
FYT: It uses additional information (the language code) to canonicalize it. There's no standalone algorithm that we could spec out. It's tricky.
JSW: I think deprecated region tags are listed with a mapping?
SFC: Are there cases in practice where the mapping of a deprecated region code changes based on the language?
JSW: I think it is the case with ab-SU and ru-SU where they will canonicalize to ab-GE and ru-RU respectively.
FYT: I think SU maps to 13 different region codes depending on the language.
ZB: What would happen if we decided not to map at all?
SFC: Anba's point was that ICU doesn't support that behaviour right now.
ZB: Why? If you ask for a display name for SU will it return "Soviet Union"?
JSW: I think Anba said that they don't have any display names for SU at all.
ZB: So we would just return "SU", like any other well-formed region code that we don't have a display name for.
SFC: I'm fine with that behaviour if ICU supports it.
ZB: My thinking is: what will be the result of any attempt to specify this? It will be quirky because it's bound to be. I don't expect it to be of value to users of the modern web. If your system has a display name for SU, it will return it, and if not, not. I'm not sure how ICU supports it.
SFC: I am fine with what ZB suggested.
FYT: I would prefer to stay with not mapping, just case-correcting, unless we can specify exactly what the mapping is.
JSW: I'm a little leery about this but I can't say anything specific just yet.
FYT: Which canonicalization step are we talking about?
JSW: I was assuming it was the same step as on the locale (...???)
SFC: I propose that JSW, FYT, and Anba should sync offline about this. I tend to agree with ZB that canonicalizing a region code in Intl.DisplayNames is fundamentally different from canonicalizing a region code in a locale.

@anba @zbraniecki @jswalden
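
The question raised in the notes, whether a deprecated region maps differently depending on the language, can be probed directly in an engine. The snippet below is a small sketch for doing so; `canonicalRegionFor` is a hypothetical helper name, and no particular output is claimed here because the result depends on the alias data and canonicalisation the engine implements.

```javascript
// Hypothetical probe: canonicalise a full locale identifier and report the
// region subtag the engine ends up with.
function canonicalRegionFor(tag) {
  // Intl.getCanonicalLocales returns an array of canonicalised tags.
  const [canonical] = Intl.getCanonicalLocales(tag);
  return new Intl.Locale(canonical).region;
}

// Compare the same deprecated region under different languages.
for (const tag of ["und-SU", "ab-SU", "ru-SU"]) {
  console.log(tag, "->", canonicalRegionFor(tag));
}
```

If the printed regions differ between "ab-SU" and "ru-SU", the engine is applying the language-dependent replacement discussed above; if they are identical, it is not.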

@anba
Collaborator

anba commented Sep 8, 2020

ZB: So we would just return "SU", like any other well-formed region code that we don't have a display name for.
SFC: I'm fine with that behaviour if ICU supports it.

Nope, ICU doesn't support it: #81 (comment)

For cases like the "UK" one outlined in #81 (comment), I'll probably keep complete canonicalisation in SpiderMonkey, even if the spec only requires case canonicalisation, but return only the case normalised code if no localised name is present. So for example, "su" will still return "Russia" in SpiderMonkey, but in case there's no localised name for "Russia", case normalised "SU" will be returned (instead of "RU").
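
The fallback behaviour described above can be sketched as follows. This is not SpiderMonkey's actual code: the function name and the two small tables are made up for illustration, standing in for the alias and display-name data a real implementation would read from CLDR/ICU.

```javascript
// Illustrative data only (assumption): a real implementation would take
// these from CLDR/ICU rather than hard-coding them.
const REGION_ALIASES = new Map([["SU", "RU"]]);
const REGION_NAMES_EN = new Map([["RU", "Russia"]]);

// Hypothetical helper: look up the name via the fully canonicalised code,
// but fall back to the case-normalised input when no name is available.
function regionDisplayName(code, names = REGION_NAMES_EN, aliases = REGION_ALIASES) {
  const normalized = code.toUpperCase();
  const canonical = aliases.has(normalized) ? aliases.get(normalized) : normalized;
  return names.has(canonical) ? names.get(canonical) : normalized;
}

console.log(regionDisplayName("su"));            // "Russia"
console.log(regionDisplayName("su", new Map())); // "SU" (not "RU")
```

The point of the design is that the alias replacement is used only for the lookup and never leaks into the returned value.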

@FrankYFTang
Collaborator Author

For cases like the "UK" one outlined in #81 (comment), I'll probably keep complete canonicalisation in SpiderMonkey, even if the spec only requires case canonicalisation, but return only the case normalised code if no localised name is present. So for example, "su" will still return "Russia" in SpiderMonkey, but in case there's no localised name for "Russia", case normalised "SU" will be returned (instead of "RU").
There is no standard that governs how to localize the code SU, right? So if we do not canonicalize it in the spec but the implementation returns "Russia", it is NOT a violation of the spec. It just acts as if there is a resource for SU and the name for SU is "Russia".

@anba
Collaborator

anba commented Sep 9, 2020

There is no standard that governs how to localize the code SU, right? So if we do not canonicalize it in the spec but the implementation returns "Russia", it is NOT a violation of the spec. It just acts as if there is a resource for SU and the name for SU is "Russia".

Yes, exactly that.

@sffc
Collaborator

sffc commented Sep 10, 2020

@FrankYFTang
Collaborator Author

This proposal reached Stage 4 at the September 2020 TC39 meeting. If you still feel a need to map the code when the type is region, please file a new issue in the v2 repo. I am closing this issue now.
https://github.com/tc39/intl-displaynames-v2/
