Should we map the code if the type is region? #81

FrankYFTang · 2020-07-09T01:37:17Z

In #77 (comment), @anba suggested:
"
Region (and script) subtags should also get canonicalised to replace outdated subtags with their preferred value.
"
This issue tracks the "region part" only, since the issue with scripts is different.

I have a concern about this (canonicalizing the region code). There is no predefined process in UTS 35 for canonicalizing a standalone region code. The process for the region subtag within unicode_language_id, described in https://unicode-org.github.io/cldr/ldml/tr35.html#Canonical_Unicode_Locale_Identifiers, depends on the language code (and the script code, if present) whenever multiple territories are listed in the replacement attribute of territoryAlias.

anba · 2020-07-09T09:42:34Z

ICU internally already defaults to replacing deprecated region codes and it's not possible to disable this behaviour: https://github.com/unicode-org/icu/blob/f917c43cf153bfca7ffd60fc1cdcbb32360967ce/icu4c/source/common/locresdata.cpp#L97-L102

CLDR also doesn't provide localised names for deprecated region codes, e.g. there's no entry for "SU" to localise it to "Soviet Union".

For example, new Intl.DisplayNames("en", {type:"region"}).of("SU") currently returns "Russia" in both V8 and SpiderMonkey.

Furthermore the linked ICU source code shows some issues when we don't properly canonicalise upfront before calling into ICU:

d8> new Intl.DisplayNames("en", {type:"region", style:"narrow"}).of("GB")
"UK"
d8> new Intl.DisplayNames("en", {type:"region", style:"narrow"}).of("UK")
"United Kingdom"

whereas in SpiderMonkey with explicit canonicalisation:

js> new Intl.DisplayNames("en", {type:"region", style:"narrow"}).of("GB")
"UK"
js> new Intl.DisplayNames("en", {type:"region", style:"narrow"}).of("UK")  
"UK"

(ICU only handles the string "Countries", but should also handle the string "Countries%short".)

If we don't want to use some ad-hoc steps to canonicalise a standalone region subtag, we could prepend the language subtag "und" to get a proper Unicode BCP 47 locale identifier and then canonicalise that one. So for example when the region subtag is "SU", we prepend "und" to get "und-SU", canonicalise "und-SU" to get "und-RU" and then extract the region subtag from "und-RU" to get "RU".
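
The "prepend und" steps above can be sketched in JavaScript. This is only an illustration, not proposed spec text: canonicalizeRegion is a hypothetical helper name, and the sketch assumes an engine whose Intl.Locale constructor replaces deprecated region subtags as part of canonicalisation. The replacement value itself comes from the CLDR data shipped with the engine.

```javascript
// Hypothetical helper (not part of any spec or engine): canonicalise a
// standalone region subtag by round-tripping it through a locale identifier.
function canonicalizeRegion(region) {
  // unicode_region_subtag is either 2 ASCII letters or 3 ASCII digits.
  if (!/^(?:[A-Za-z]{2}|[0-9]{3})$/.test(region)) {
    throw new RangeError(`Invalid region subtag: ${region}`);
  }
  // Prepend "und" to get a well-formed Unicode BCP 47 locale identifier.
  // Assumption: the Intl.Locale constructor canonicalises the identifier,
  // including replacement of deprecated region subtags.
  const locale = new Intl.Locale(`und-${region}`);
  // Extract the (possibly replaced) region subtag again.
  return locale.region;
}

// Deprecated code: the result depends on the engine's CLDR data.
console.log(canonicalizeRegion("SU"));
// Case is canonicalised as a side effect.
console.log(canonicalizeRegion("gb")); // "GB"
```

The regular expression check matters here: without it, an input such as "Latn" would be parsed as a script subtag rather than rejected.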

Apart from that, canonicalisation will also help implementations to properly call into ICU, because ICU expects at least case canonicalised inputs. (Too lazy to properly report this bug. 😄)

d8> new Intl.DisplayNames("en", {type:"region", style:"narrow"}).of("su")
"su"
d8> new Intl.DisplayNames("en", {type:"region", style:"narrow"}).of("SU")
"Russia"

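A minimal sketch of the case-canonicalisation step mentioned above, applied before the code is handed to the implementation. caseCanonicalizeRegion is a hypothetical helper, not an engine or ICU API; it only upper-cases a well-formed region subtag and performs no alias mapping.

```javascript
// Hypothetical helper: case-canonicalise a region subtag before lookup.
function caseCanonicalizeRegion(region) {
  // unicode_region_subtag is either 2 ASCII letters or 3 ASCII digits.
  if (!/^(?:[A-Za-z]{2}|[0-9]{3})$/.test(region)) {
    throw new RangeError(`Invalid region subtag: ${region}`);
  }
  // Region subtags are upper case in canonical form; digits are unaffected.
  return region.toUpperCase();
}

const names = new Intl.DisplayNames("en", { type: "region", style: "narrow" });
// With the explicit step, "su" and "SU" reach the implementation as the same code.
console.log(names.of(caseCanonicalizeRegion("su")) === names.of("SU")); // true
```
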
FrankYFTang · 2020-07-09T16:59:51Z

@anba - OK, that makes sense. Any suggestion for how we should spec out such a mapping process?

FrankYFTang · 2020-07-09T18:26:02Z

I dug into this a little bit more. I do not think ICU performs this mapping for all codes, as @anba said. I think there is some mapping there, but it is not the mapping in UTS 35.
For example, if you look at https://github.com/unicode-org/cldr/blame/master/common/supplemental/supplementalMetadata.xml

I believe ICU does not map the following:
FQ, NT, PC, 062, 172, 200, 532, 582, 830, 890

anba · 2020-07-09T18:39:36Z

uloc_getTableStringWithFallback from the linked ICU source code calls uloc_getCurrentCountryID, which in turn uses a hard-coded list of deprecated country ids.

(Hmm, does this further strengthen my argument to perform an explicit canonicalisation before calling into ICU, because then we don't need to rely on some hard-coded values in ICU? 😄)

FrankYFTang · 2020-08-03T18:15:53Z

Here are the meeting notes from our discussion at the 2020-07-09 ECMA-402 meeting:

FYT: Anba suggested that we canonicalize the code, not just in terms of casing but in terms of aliases. The tricky part is canonicalizing the region code and script code. There isn't a mapping defined for this in UTS 35. If we do this, we need to have some way to spec it out clearly. Do we want to perform this additional mapping or just the casing change?
SFC: Why can't we say that the region code is canonicalized based on the locale? In other words, spec out the current behaviour? Is it because it's not defined in UTS 35?
FYT: It uses additional information (the language code) to canonicalize it. There's no standalone algorithm that we could spec out. It's tricky.
JSW: I think deprecated region tags are listed with a mapping?
SFC: Are there cases in practice where the mapping of a deprecated region code changes based on the language?
JSW: I think it is the case with ab-SU and ru-SU where they will canonicalize to ab-GE and ru-RU respectively.
FYT: I think SU maps to 13 different region codes depending on the language.
ZB: What would happen if we decided not to map at all?
SFC: Anba's point was that ICU doesn't support that behaviour right now.
ZB: Why? If you ask for a display name for SU will it return "Soviet Union"?
JSW: I think Anba said that they don't have any display names for SU at all.
ZB: So we would just return "SU", like any other well-formed region code that we don't have a display name for.
SFC: I'm fine with that behaviour if ICU supports it.
ZB: My thinking is: what will be the result of any attempt to specify this? This will be quirky because it’s bound to be. I don't expect it to be of value to users of the modern web. If your system has a display name for SU, it will return it, and if not, not. I'm not sure how ICU supports it.
SFC: I am fine with what ZB suggested.
FYT: I would prefer to stay with not mapping, just case-correcting, unless we can specify exactly what the mapping is.
JSW: I'm a little leery about this but I can't say anything specific just yet.
FYT: Which canonicalization step are we talking about?
JSW: I was assuming it was the same step as on the locale (...???)
SFC: I propose that JSW, FYT, and Anba should sync offline about this. I tend to agree with ZB that canonicalizing a region code in Intl.DisplayNames is fundamentally different from canonicalizing a region code in a locale.
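
The language dependence discussed in these notes can be observed through locale-level canonicalisation. The following is only a sketch: canonicalRegionOf is a hypothetical helper, and the replacement chosen for each tag comes from the engine's CLDR data, so no specific output is claimed here.

```javascript
// Hypothetical helper: canonicalise a full language tag, then read back
// the region subtag of the canonical form.
function canonicalRegionOf(tag) {
  // Intl.getCanonicalLocales returns an array of canonicalised tags.
  const [canonical] = Intl.getCanonicalLocales(tag);
  return new Intl.Locale(canonical).region;
}

// The same deprecated region subtag, attached to different languages.
for (const tag of ["ab-SU", "ru-SU", "und-SU"]) {
  console.log(`${tag} -> region ${canonicalRegionOf(tag)}`);
}
```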

@anba @zbraniecki @jswalden

anba · 2020-09-08T18:26:14Z

"
ZB: So we would just return "SU", like any other well-formed region code that we don't have a display name for.
SFC: I'm fine with that behaviour if ICU supports it.
"

Nope, ICU doesn't support it: #81 (comment)

For cases like the "UK" one outlined in #81 (comment), I'll probably keep complete canonicalisation in SpiderMonkey, even if the spec only requires case canonicalisation, but return only the case normalised code if no localised name is present. So for example, "su" will still return "Russia" in SpiderMonkey, but in case there's no localised name for "Russia", case normalised "SU" will be returned (instead of "RU").
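
The behaviour described here can be approximated in user-land JavaScript. This is a sketch only, not SpiderMonkey's implementation: regionName is a hypothetical function, and it assumes that Intl.Locale replaces deprecated region subtags and that the fallback: "none" option makes of() return undefined when no localised name exists.

```javascript
// Hypothetical function: look up the name of the fully canonicalised code,
// but fall back to the case-normalised input (not the canonical code).
function regionName(displayLocale, code) {
  if (!/^(?:[A-Za-z]{2}|[0-9]{3})$/.test(code)) {
    throw new RangeError(`Invalid region subtag: ${code}`);
  }
  // Step 1: case normalisation only, e.g. "su" becomes "SU".
  const caseNormalised = code.toUpperCase();
  // Step 2: complete canonicalisation through an "und-" locale identifier.
  const canonical = new Intl.Locale(`und-${caseNormalised}`).region;
  // Step 3: look up the localised name without the built-in code fallback.
  const names = new Intl.DisplayNames(displayLocale, {
    type: "region",
    fallback: "none",
  });
  // If no localised name exists, return the case-normalised input code.
  return names.of(canonical) ?? caseNormalised;
}

console.log(regionName("en", "su"));
console.log(regionName("en", "SU"));
```

Returning the case-normalised input rather than the canonical code keeps the alias mapping invisible to callers whenever no localised name is available.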

FrankYFTang · 2020-09-09T17:46:31Z

"
For cases like the "UK" one outlined in #81 (comment), I'll probably keep complete canonicalisation in SpiderMonkey, even if the spec only requires case canonicalisation, but return only the case normalised code if no localised name is present. So for example, "su" will still return "Russia" in SpiderMonkey, but in case there's no localised name for "Russia", case normalised "SU" will be returned (instead of "RU").
"
There is no standard that governs how to localize the code SU, right? So if we do not canonicalize it in the spec but an implementation returns "Russia", it is NOT a violation of the spec. It just acts as if there is a resource for SU and the name for SU is "Russia".

anba · 2020-09-09T20:02:08Z

"
There is no standard that governs how to localize the code SU, right? So if we do not canonicalize it in the spec but an implementation returns "Russia", it is NOT a violation of the spec. It just acts as if there is a resource for SU and the name for SU is "Russia".
"

Yes, exactly that.

sffc · 2020-09-10T20:49:44Z

Discussion from 2020-09-10:

https://github.com/tc39/ecma402/blob/master/meetings/notes-2020-09-10.md#intldisplaynames-toward-stage-4

FrankYFTang · 2020-09-24T00:24:24Z

This proposal is now at Stage 4 as of the September 2020 TC39 meeting. If you still feel a need to map the code when the type is region, please file a new issue in the v2 repo. I am closing this issue now.
https://github.com/tc39/intl-displaynames-v2/

anba mentioned this issue Jul 9, 2020

Should we map the code if the type is script ? #82

Closed

FrankYFTang closed this as completed Sep 24, 2020
