
Language Negotiation #5

Merged (3 commits) on Feb 27, 2017

Conversation

zbraniecki
Collaborator

I'd like to start discussing our intended language negotiation strategy (I'd also like to share it between Fluent and Gecko).

I looked at rfc4647 and some of the ICU implementations, but I don't think that what they suggest (matching vs filtering) applies to us perfectly.

Thus, I started working on my own algorithm and the JS implementation is imho a good starting point to discuss it.

The patch here is not finalized but it uses two data-driven algorithms:

  • getParent - which in its simplest form cuts off the last subtag ("en-US" -> "en") but will eventually have some exceptions to prevent things like "zh-Hant" -> "zh", and hopefully will land in ECMA402 (see Intl.getParentLocales tc39/ecma402#87)
  • getBestSubtags - which does not have a good equivalent in ICU, so it will likely stay curated by us. It allows us to easily go from more generic to more detailed subtags.

The algorithm currently has four parts. It tries, in order:

  • exact match at weight 1.0
  • likely subtags (tests "en-US" for "en") at weight 0.9
  • parents (tests "en" for "en-US") at weight 0.8
  • siblings (tests "en-GB" for "en-US") at weight 0.6

It stops on the best match for the candidate.
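
For illustration, the scoring could be sketched roughly like this (not the actual patch; `scoreFor` is a made-up name and the `getParent`/`getBestSubtags` stand-ins are naive placeholders for the data-driven helpers above):

```javascript
// Naive stand-ins, just so the sketch runs; the real helpers are data-driven
// and carry exception data.
const getParent = tag => tag.split('-').slice(0, -1).join('-');
const getBestSubtags = tag => ({ en: 'en-US', fr: 'fr-FR' }[tag] || tag);

function scoreFor(requested, available) {
  if (requested === available) {
    return 1.0;                     // exact match
  }
  if (getBestSubtags(requested) === available) {
    return 0.9;                     // likely subtags: tests "en-US" for "en"
  }
  if (getParent(requested) === available) {
    return 0.8;                     // parent: tests "en" for "en-US"
  }
  const parent = getParent(requested);
  if (parent && parent === getParent(available)) {
    return 0.6;                     // sibling: tests "en-GB" for "en-US"
  }
  return 0;
}

scoreFor('en-US', 'en');  // 0.8
scoreFor('en', 'en-US');  // 0.9
```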

I'd like to get feedback and any suggestions at this point from @Pike and @stasm.

@zbraniecki
Collaborator Author

RFC4647 - https://tools.ietf.org/html/rfc4647

@Pike
Contributor

Pike commented Feb 15, 2017

Looking at https://tools.ietf.org/html/rfc4647#section-3.3.2, that's mostly what we want, aside from the problem that, AFAICT, langneg(["de-DE"], ["de"]) === [], and we want that to be ["de"], right?

I think the right way to address that problem is to treat the available languages as language ranges. Which is what we intend to do: fr is used for French spoken across the globe. Not French spoken nowhere. I think that getBestSubtags is the wrong concept.

For langneg(["en"], ["en-US", "en-GB"]) that does leave us with some ambiguity, and I think that's accurate. On our UX side, we should discourage the situation. We can build in a deterministic ordering of sibling language tags, which would do the thing you built getBestSubtags for, in this use case?

For siblings, conceptually, we amend the requested language range with a more generic range, if applicable? ["de-DE"] becomes ["de-DE", "de-*"]? I think that's sound if we keep the "don't drop script ranges for locales where that matters" rule. Are there more constraints to that?

We have two choices here, ["de-DE", "fr-CA"] can either become ["de-DE", "de-*", "fr-CA", "fr-*"] or ["de-DE", "fr-CA", "de-*", "fr-*"]. I think it's the first we want?
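
A sketch of the first ordering (illustrative only; `expandRequested` is a made-up name and the script-preserving exceptions are left out):

```javascript
// Append a region wildcard per requested tag, keeping the per-language
// grouping. Locales where the script matters would need the "don't drop
// script ranges" exception mentioned above.
function expandRequested(requested) {
  const expanded = [];
  for (const tag of requested) {
    expanded.push(tag);
    expanded.push(`${tag.split('-')[0]}-*`);
  }
  return expanded;
}

expandRequested(['de-DE', 'fr-CA']);
// → ['de-DE', 'de-*', 'fr-CA', 'fr-*']
```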

The numerical weights are hard to get right, I think. In particular in the code right now, the weights mix as soon as you have 10 or so requested languages, as you mix order within the requested language array and weight. I'd much rather just use different output lists, and concat and unique them afterwards.

Last but not least, the idea that this algorithm always returns an non-empty result is OK, but not something that can be implemented generically. It's a project setup question on which localization to fall back to in the end. Take the BN, maybe? Site developed in Polish, with a German localization? You'd need to configure your l10n setup to fall back to Polish in this case.

I think the correct way to implement this is to post-fix the requested languages with the site default language, and only then to call into the language negotiation code. By doing that, you guarantee a non-empty result, but that's not part of the algorithm, but part of the setup.
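
In code, that setup could look roughly like this (sketch; `siteDefaultLocale` and `availableLocales` are assumed names, only `negotiateLanguages` comes from the patch):

```javascript
// Guarantee a non-empty result at the setup level, not in the algorithm:
// the site's default locale is appended to the request before negotiating.
const requested = [...navigator.languages, siteDefaultLocale];
const supported = negotiateLanguages(requested, availableLocales);
```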

@zbraniecki
Collaborator Author

Thanks Pike!

I started implementing it and the only problem I encountered so far is the one related to:

langneg(["en"], ["en-US", "en-GB"])

Since the second list is not ordered, there's no way for me to predict which one it will pick.

ICU introduces addLikelySubtags, which I intended to use - http://www.icu-project.org/apiref/icu4c/uloc_8h.html#a0cb2dcd65f745e7a966a729395499770

Do you think it's ok to use that?

I may also have to expand ranges with scripts (so "en" becomes not "en-*" but "en-Latn-*", in order to handle sr, zh etc.).

@zbraniecki
Collaborator Author

I think I'm implementing this: http://www.unicode.org/reports/tr35/#LanguageMatching
so I'll have weights there.

Is that ok, @Pike ?

@Pike
Contributor

Pike commented Feb 17, 2017

The whole weights stuff in tr35 is a bit bewildering, their math is close to using complex numbers?

We'd run the match algorithm as a filter, though, right?

Also, in the gecko version, we can use addLikelySubtags, but we'd still need languageInfo.xml data? And for a vanilla js impl, we'd need that and likelySubtags.xml, too?

@zbraniecki
Collaborator Author

We'd run the match algorithm as a filter, though, right?

yes

but we'd still need languageInfo.xml data? And for a vanilla js impl, we'd need that and likelySubtags.xml, too?

For JS, I'd like to limit the amount of data we carry. I'm thinking about trying to filter the data based on all-locales.

@Pike
Contributor

Pike commented Feb 17, 2017 via email

@zbraniecki
Collaborator Author

@Pike - I updated the patch to mostly follow the RFC4647, with a few bits from LDML likelySubtags and languageMatching.

I think the code is actually pretty good for how small it is (less than 100 lines of code). It doesn't do everything that I'd like it to do (including the ability for each available locale to provide its own fallback chain), but I think it's a good start.

Let me know what you think.

@zbraniecki zbraniecki requested a review from Pike February 19, 2017 10:13
Contributor

@Pike Pike left a comment

I'm a bit unsure what to focus on in this review. I've ended up being weirdly close to the code on big-picture things, so I nag about details where the big picture is involved.

With the bugs I see, I don't yet see the outcome of the algorithm as it'd impact the big picture. Seems the English translation of the German saying is "I can't see the wood for the trees".

Hope this is helpful regardless.

if (loc[2] === '*') {
  loc[2] = loc[0].toUpperCase();
  return loc.join('-');
}
Contributor

We should make this fully data-driven. Assuming that a language tag in upper case was a region code is just wrong. Let's not introduce ca-CA or lij-LIJ as likely subtags?

Collaborator Author

So, here's the thing. LikelySubtags is a 50kb uncompressed JSON table [0].
If we have it (Gecko does), or in cases where the user is willing to pay the price (give me fluent with likelySubtags), then we don't need dummy code like that.

But when the user doesn't want to pay the price, and 50kb is a pretty hefty price to pay on the client side, the lack of this dummy code will leave us unable to catch any kind of 'ab-AB' extrapolation from 'ab'.
Since it's unlikely that anyone will provide 'ca-CA' or 'lij-LIJ' in available locales, there's no harm in testing against them. But in its simplest form, the dummy will at least match 'fr' to 'fr-FR', 'it' to 'it-IT', 'de' to 'de-DE', etc.

My idea was to allow for three "modes":

  • Full CLDR likelySubtags (50kb price for excellent matching)
  • Limited CLDR likelySubtags (say, 10-20 most common ones) + dummy
  • Just dummy

I can see us curating this limited CLDR likelySubtags just to catch the most common ones like 'en' into 'en-US', and 'fr' into 'fr-FR' etc.

I believe that if we remove the dummy altogether, we'll make the algorithm not work for the most common cases and basically require the likelySubtags data.

What do you think?

[0] https://github.com/unicode-cldr/cldr-core/blob/master/supplemental/likelySubtags.json
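
To make the modes concrete, all three could share one lookup path (sketch; `extendToLikelySubtags` and `likelySubtagsData` are made-up names standing in for either the full or the trimmed CLDR table):

```javascript
// Look the locale up in whatever likelySubtags table was bundled, and fall
// back to the "dummy" ab -> ab-AB extrapolation when there's no entry.
function extendToLikelySubtags(locale, likelySubtagsData = {}) {
  const key = locale.toLowerCase();
  if (key in likelySubtagsData) {
    return likelySubtagsData[key];
  }
  // "Just dummy" mode: extrapolate the region from the language code itself.
  return `${locale}-${locale.toUpperCase()}`;
}

extendToLikelySubtags('fr');                        // "fr-FR"
extendToLikelySubtags('zh', { zh: 'zh-Hans-CN' });  // "zh-Hans-CN"
```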

Contributor

I find the line about "a language tag is a region code with bad casing" pretty revolting, tbh. Let's not set a precedent for that; people look at Mozilla for how to handle standards, and this is just not it.

I also start to realize that I actually disagree with the way you use likelySubtags. In that context, I think most of the unfiltered likelySubtags data goes away. I'll take that top-level.

* Replaces the region position with a range to allow for matches
* with the same language/script but from different region.
*/
function getRegionRange(locale) {
Contributor

I'd name this something else, maybe regionRangeFor? getRegionRange sounds like a getter of information, not a transformation of the passed-in data.

* It can also accept a range `*` character on any position.
*/
const localeRe = new RegExp(
`^${languageCodeRe}${scriptCodeRe}?${regionCodeRe}?${variantCodeRe}?$`);
Contributor

The regular expression isn't implementing the spec; I'd recommend just parsing the script tag as a {4} and making it optional. Like

/^([a-z]{2,3})(?:-([a-z]{4}|\*))?/i

I'm not sure if we should allow for trailing junk?

Also, we should probably special-case -mac as something that violates well-formed locale tags, as long as we still have it.

Collaborator Author

I'm following BCP47 in language-script-region-variant.

Collaborator Author

I'll update the regexp, but I think it makes sense to keep the 4 pieces. I may accept and cut off any extension keys (-t-, -x- and -u-).

Contributor

My point was really about

const scriptCodeRe = '(?:-([a-zA-Z]{2,4}|\\*))';

which should be exactly 4,

const scriptCodeRe = '(?:-([a-z]{4}|\\*))';

Also, I'd remove (did here) the upper vs lower case regex ranges and create the regex with the 'i' flag.
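
Put together, the pattern could look like this (illustrative; only the script piece is what's suggested here, the other sub-patterns aren't shown in the diff, so their shapes are guesses):

```javascript
// Case handling comes from the 'i' flag instead of [a-zA-Z] ranges.
const languageCodeRe = '([a-z]{2,3}|\\*)';
const scriptCodeRe = '(?:-([a-z]{4}|\\*))';
const regionCodeRe = '(?:-([a-z]{2}|\\*))';
const variantCodeRe = '(?:-([0-9a-z]{3,8}|\\*))';

const localeRe = new RegExp(
  `^${languageCodeRe}${scriptCodeRe}?${regionCodeRe}?${variantCodeRe}?$`, 'i');

localeRe.exec('sr-Latn-RS');  // matches regardless of casing
```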

}

return [language, script, territory, variant];
}
Contributor

... this code would become easier if the regular expression were more standards-conformant, I think.

Also, we should probably normalize casing here so that we can compare strings later on, because language tags are compared case-insensitively.
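
For instance (illustrative):

```javascript
// Normalize once at parse time so later comparisons are plain string
// equality, since BCP 47 tags are case-insensitive.
const normalize = tag => tag.toLowerCase();

normalize('en-Latn-US') === normalize('EN-latn-us');  // true
```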

Collaborator Author

Yeah, I'll normalize case, either by going all lower-case or running it via Canonicalize.
What do you mean by easier?

function filterMatches(requestedLocales, availableLocales) {
  const supportedLocales = [];

  outer:
Contributor

This is probably up to @stasm, but GOTOs?

Collaborator Author

We can improve the exact code later once we agree on how it's supposed to work and what kind of matches it should produce.

Contributor

I don't have a strong preference about labelled for-loops.

}

for (const availableLocale of availableLocales) {
if (compareLocales(requestedLocale, availableLocale)) {
Contributor

This parses the language tags over and over in a loop, I'd refactor this code to not do that.

Collaborator Author

Agree. I'll improve the exact code once we agree on what the code is supposed to do :)

for (const requestedLocale of requestedLocales) {

for (const availableLocale of availableLocales) {
if (requestedLocale === availableLocale) {
Contributor

... needs to be independent of casing.


export function negotiateLanguages(requestedLocales,
availableLocales,
defaultLocale) {
Contributor

defaultLocale isn't covered in tests, otherwise the bug below would have been caught.

const supportedLocales = filterMatches(requestedLocales, availableLocales);
if (supportedLocales.includes(defaultLocale)) {
  supportedLocales.push(defaultLocale);
}
Contributor

... if undefined is in list, add it to the list.

You want that the other way around.

return supportedLocales;
}

export function negotiateLanguages(requestedLocales,
Contributor

This is the only public method of this module, and the only one without documentation :-/

@stasm, what's the intent for the generated documentation? It seems to be done, but I haven't found it having a prominent role yet, so maybe it doesn't matter. Your call.

Contributor

Thanks for bringing this up. For now, each package in the fluent.js repository has its docstrings extracted using the documentation.js package. The extracted markdown files (one per package) land in the top-level docs/ directory.

It might be better to move the docs into each package's directory as $(PACKAGE)/docs/api.md, in case we want to add more hand-written docs later on.

Contributor

It might be better to move the docs into each package's directory as $(PACKAGE)/docs/api.md, in case we want to add more hand-written docs later on.

Implemented in 6c46100.

@zbraniecki
Collaborator Author

Thank you Like! That's precisely the level of detail I want to iron things out on right now.

Let's get to an algorithm we both like and then I'll add detailed tests for each function and step.

I'll work on applying your feedback today.

@zbraniecki
Collaborator Author

*Pike (autocorrect snafu)

Contributor

@Pike Pike left a comment

I've spent quite some time trying to find out what data backs up likelySubtags in CLDR, and much of it is thin air.

I think it's OK to use the data to order the matching languages in the absence of anything better, but really, we shouldn't exclude a match for anything in that data set.

In particular, for cases where the requested language doesn't specify a script and we have multiple scripts, we should just return all of the matching languages.

Also, I think that when we lack the likelySubtags data, all that happens is that the returned match data might not be deterministically ordered. I think that's a fair outcome, and we don't need to write any kind of funky code to avoid that.

On the topic of modes, how is that supposed to work in practice? As in, I npm install fluent, and then?

assert.deepEqual(
  negotiateLanguages(['sr'], ['sr-Cyrl', 'sr-Latn']),
  ['sr-Cyrl']);
Contributor

I spent half a day trying to find out why cldr says these things about Serbian, and to the best of my findings, this is just made up. Notably, sr-Latn-RS is given a writingPopulation of 5%, because

<reference type="R1017">For languages not customarily written, the writing population is artificially set to 5% in the absence of better information.</reference>

@stasm
Contributor

stasm commented Feb 20, 2017

On the topic of modes, how is that supposed to work in practice? As in, I npm install fluent, and then?

I'd like the language negotiation module to be a separate package. You would then do:

npm install fluent-langneg

And:

import negotiateLanguages from 'fluent-langneg';
negotiateLanguages(requested, available);

Any additional data can be published in a separate package or as a submodule:

import negotiateLanguages from 'fluent-langneg';
import withLikelySubtags from 'fluent-langneg/subtags';
withLikelySubtags(negotiateLanguages)(requested, available);

For this to work, negotiateLanguages should accept likelySubtags as a parameter. Perhaps a better approach overall would be to mimic other Intl objects:

const negotiator = LanguageNegotiation({
  available: [],
  default: [],
  likelySubtags: []
});

And then you import likelySubtags from 'fluent-langneg/subtags';
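
One possible shape for that wrapper (sketch; the data path and the options-based plumbing are assumptions):

```javascript
// fluent-langneg/subtags (illustrative): wrap the negotiator so it always
// receives the bundled likelySubtags table via its options.
import likelySubtags from './data/likely-subtags';

export default function withLikelySubtags(negotiate) {
  return (requested, available, options = {}) =>
    negotiate(requested, available,
              Object.assign({ likelySubtags }, options));
}
```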

Contributor

@stasm stasm left a comment

Just a couple of small changes to align this with the rest of the repo structure.

If you're on the latest master, you can run make html to have jsdoc extract the API docs into html/fluent-langneg in the root of the repo, and verify that everything looks OK.

@@ -0,0 +1,2 @@
fluent-intl-polyfill.js
Contributor

You'll need to change this to fluent-langneg.js.

# fluent-langneg

`fluent-langneg` is an API for language negotiation that is recommended
by the Fluent Team for all language selection and matching.
Contributor

@stasm stasm Feb 22, 2017

I started including a one-sentence summary of what Fluent is at the beginning of all READMEs. Here's how I'd phrase it:

`fluent-langneg` is an API for negotiating languages. It's part of
Project Fluent, a localization framework designed to unleash 
the expressive power of the natural language.


## How to use

Simply `import` or `require` the package somewhere in your code.
Contributor

Remove the line? It was written for fluent-intl-polyfill which doesn't expose any importable API.

Simply `import` or `require` the package somewhere in your code.

```javascript
import { negotiateLanguages } from 'fluent-langneg';
Contributor

With fluent-langneg being so purpose-specific, I think we could make negotiateLanguages the default export:

import negotiateLanguages from 'fluent-langneg';

@@ -0,0 +1 @@

Contributor

Please remove the whole docs directory.

@@ -0,0 +1,267 @@
(function (global, factory) {
Contributor

Please remove this file.

@@ -0,0 +1,282 @@
/*
* @module
Contributor

I learned last night that the @module tag takes the module's name. What we probably want here is:

@module fluent-langneg
@overview

`fluent-intl-polyfill` provides...

return fallback;
}

export function negotiateLanguages(requestedLocales,
Contributor

I'd consider making this export default.

* This means that if `ab` locale is present in the available locales,
* it is treated as matching `ab-*-*-*`.
*/
function compareLocales(loc1, loc2) {
Contributor

Perhaps rename this function to localesEqual to clearly indicate that it returns a bool?

@@ -0,0 +1,144 @@

Contributor

Why are there two blank lines here and in test/setup.js? Do you know if the use strict pragma is still required in node? Can we skip it?

Collaborator Author

I think we can skip it.

Contributor

@stasm stasm left a comment

How can we make the additional JSON data available in browsers? I think a safer approach would be to store the data in a JS file, e.g. data/subtags.js which could be imported if needed.

Or, it looks like there's https://github.com/rollup/rollup-plugin-json but I haven't tried it.

```

The API reference is available at
http://projectfluent.io/fluent.js/fluent-syntax.

```javascript
import negotiateLanguages from 'fluent-langneg';

const supported = negotiateLanguages(requested, available, default);
Contributor

default is a reserved keyword in JS, so I'd suggest changing this to:

const supportedLanguages = negotiateLanguages(
    requestedLanguages, availableLanguages, defaultLanguage
);

Contributor

Also, are these languages or locales?

Collaborator Author

Those are locales. We negotiate on the language part of those locales, so we do negotiate languages, but we pass locales in and the returned IDs are locales as well. I know it's complex :(


export default function negotiateLanguages(requestedLocales,
availableLocales,
options = {}) {
Contributor

Weird indentation here. You can just indent one level like so:

export default function negotiateLanguages(
  requestedLocales,
  availableLocales,
  options = {}
) {

availableLocales,
options = {}) {

const defaultLocale = GetOption(options, 'defaultLocale', 'string');
Contributor

Isn't defaultLocale common enough that it should get its own positional argument?

Collaborator Author

I don't think it is. Since the algorithm works perfectly fine without it, I'd prefer to keep only the necessary arguments as positional - requestedLocales and availableLocales.

Contributor

Perhaps I'm missing something but I don't see it in tests. What happens when there are no common languages between requested and available and no default has been specified?

Collaborator Author

We return an empty list, as per the language matching spec in RFC 4647. Added to tests.
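
Something along these lines (illustrative, not the exact test that was added):

```javascript
import assert from 'assert';
import negotiateLanguages from 'fluent-langneg';

// No overlap between requested and available, and no defaultLocale:
// the result is an empty list.
assert.deepEqual(
  negotiateLanguages(['fr'], ['de', 'it']),
  []);
```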

options, 'likelySubtags', 'object', undefined, {});

const supportedLocales =
filterMatches(requestedLocales, availableLocales, likelySubtags);
Contributor

nit: can you change this to:

const supportedLocales = filterMatches(
  requestedLocales, availableLocales, likelySubtags
);

@@ -0,0 +1,18 @@
{
Contributor

This is so small that I think we should include this in the main package. Can rollup bundle JSON files? If not, I'd just put this in src/subtags.js.

Collaborator Author

I don't know how small it will be yet. I'll need to generate the data for the min variant, and so far it's just a stub. I agree we'll want to bundle it, but I'd like the bundling to be conditional, i.e. when building fluent-langneg you'd be able to choose whether you want the min or the full likelySubtags.

@@ -0,0 +1,266 @@
/*
* @module fluent-langneg
Contributor

Please add the @overview tag here.

@zbraniecki
Collaborator Author

@Pike , @stasm - I think this is ready for another round of reviews.

There are three known things I'd like to add once we agree on the logic:

  1. more tests
  2. ability to build fluent-langneg with either 'min' or full likelySubtags
  3. a script to generate the 'likelySubtagsMin' based on our own threshold.

But first, I'd like to iron out the first version of the algorithm to be stable.

Contributor

@stasm stasm left a comment

This looks great. r+.

I'd suggest using default exports more, especially with well-scoped modules. From https://esdiscuss.org/topic/moduleimport#content-0: ES6 favors the single/default export style.

ability to build fluent-langneg with either 'min' or full likelySubtags

My recommendation right now would be to ship 0.0.1 with 10-20 most likely subtags hardcoded in the JS source and with a dummy algorithmic way of guessing more. It's okay if it makes mistakes at this stage.

I'd also put the 50KB JSON in the data/ folder so that anyone can use it if they wish. They'll need to figure out how to load it themselves, but at least it's there.

I'd also prefer if the JSON file name was all lowercase, as this is the file naming scheme we've been using so far. So perhaps: data/likely-subtags.json. I see that the camelcase name comes directly from the CLDR repo, so keeping it that way is okay, too.

a script to generate the 'likelySubtagsMin' based on our own threshold.

Let's move this to a future milestone, together with finding out how to bundle JSON files using our build system.

"engine": {
"node": ">=6"
},
"eslintConfig": {
Contributor

Feel free to remove this. I followed your suggestion and used a global .eslintrc.json file in the root of the repo (0f1bc1f).


if (strategy === 'lookup') {
  if (supportedLocales.length === 0) {
    supportedLocales.push(defaultLocale);
Contributor

What if the defaultLocale is undefined? Are you okay pushing it to supportedLocales too?

Collaborator Author

If it's undefined it'll throw earlier.

Contributor

Ah, I didn't notice that. Thanks.

* It also allows skipping the script section of the id, so `en-US` is properly
* parsed as `en-*-US-*`.
*/
export function parseLocale(locale, range = false) {
Contributor

export default?

* ['en-AU'] * ['en-US'] = ['en-US']
* ['sr-RU'] * ['sr-Latn-RO'] = ['sr-Latn-RO'] // sr-RU -> sr-Latn-RU
*/
export function filterMatches(
Contributor

export default?

) {
const supportedLocales = new Set();

const availableLocalesCache = {};
Contributor

Use Map? That way you won't have to check with hasOwnProperty in getOrParseLocale.
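
For example (sketch; `getOrParseLocale` is the helper referenced above and `parseLocale` is the module's parser):

```javascript
// Cache parsed locales in a Map instead of a plain object, so lookups
// don't need hasOwnProperty checks.
const availableLocalesCache = new Map();

function getOrParseLocale(locale) {
  if (!availableLocalesCache.has(locale)) {
    availableLocalesCache.set(locale, parseLocale(locale));
  }
  return availableLocalesCache.get(locale);
}
```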

@zbraniecki
Collaborator Author

My recommendation right now would be to ship 0.0.1 with 10-20 most likely subtags hardcoded in the JS source

The challenge here is how we select the 10-20? Is 'it' more important than 'in'?

and with a dummy algorithmic way of guessing more. It's okay if it makes mistakes at this stage.

@Pike would you be ok with this?

Alternatively, I could add a list of locales, curated by us, that are OK to expand:

const localesThatHasDefaultThatMatchRegion = ['it', 'fr', 'ru', 'cs', 'pl', ...];

instead?

@zbraniecki
Collaborator Author

I'd also put the 50KB JSON in the data/ folder so that anyone can use it if they wish. They'll need to figure out how to load it themselves, but at least it's there.

Would you prefer that over external dependency or just a link to cldr-json npm package?

@zbraniecki
Collaborator Author

I'd also prefer if the JSON file name was all lowercase, as this is the file naming scheme we've been using so far.

My hope was to keep a direct link to CLDR and they use camel case.

I'd suggest using default exports more, especially with well-scoped modules. From https://esdiscuss.org/topic/moduleimport#content-0: ES6 favors the single/default export style.

The reason I didn't do that is so that I can expose more internal methods for tests. If each file can only expose one function, I can't expose others for testing.

Let's move this to a future milestone, together with finding out how to bundle JSON files using our build system.

SGTM.

@Pike?

@stasm
Contributor

stasm commented Feb 23, 2017

Would you prefer that over external dependency or just a link to cldr-json npm package?

Ah, if that's an option, then let's go for it: let's link the CLDR repo in the README.

The reason I didn't do that is so that I can expose more internal methods for tests. If each file can only expose one function, I can't expose others for testing.

You sure can :)

import foo, { bar } from './module'

imports the default export as foo as well as the named bar export.
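
Applied to this package, that could look like this (illustrative; it assumes the internals are re-exported from the package entry point):

```javascript
// Public API as the default export, internals still importable in tests.
import negotiateLanguages, { filterMatches, parseLocale } from 'fluent-langneg';
```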

@zbraniecki
Collaborator Author

I want to ❤️️ es6 right now.


## fluent-langneg 0.0.1

- (05d2487c) fluent-langneg 0.0.1
Contributor

This is my bad: I shouldn't have included the SHAs in other changelogs. I've since stopped doing it. Changelogs should describe changes between named versions and they don't have to correspond to commits. Let's remove the SHA of the commit here and squash everything into a single commit.


The API supports three negotiation strategies:

* filtering (defualt)
Contributor

Typo in default.

Contributor

Also, please either use a ### header or indent the text below by 4 spaces to make it part of the bullet point.

user can load into `fluent-langneg` to replace the minimal version.

```javascript
let data = require('./data/likelySubtags.json');
Contributor

We're not shipping this file with the rest of the package (which is good) so maybe link to the CLDR repo and put the following here:

const data = require('cldr-core/supplemental/likelySubtags.json');

@zbraniecki zbraniecki merged commit 053ede7 into projectfluent:master Feb 27, 2017
Contributor

@Pike Pike left a comment

I've got some comments more on details, and some comments on code, and some comments on defaults.

Overall, I wonder how to document this right, too. We have three lengthy blurbs: one in the README, one in index.js, and one in matches.js. I'm not sure why which is where, and it seems that the generated docs are rather confused, too? I ran make html, and the important comments don't seem to make it into the generated docs.


The API supports three negotiation strategies:

### filtering (default)
Contributor

Wouldn't the general use-case within fluent be to use matching instead of filtering? In that case, I'd make that default.

Collaborator Author

That's up to @stasm I guess.

Collaborator Author

I implemented it in C++ as the default; I'll update the JS version once I'm done with C++.

['filtering', 'matching', 'lookup'], 'filtering');

if (strategy === 'lookup' && defaultLocale === undefined) {
throw new Error('defaultLocale cannot be undefined for strategy `lookup`');
Contributor

This condition should be in the docs.
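
An illustrative call for the docs (option names as in the patch; the locale value is just an example):

```javascript
// With strategy 'lookup', defaultLocale must be provided; it's used as the
// last-resort result when nothing matches.
negotiateLanguages(requestedLocales, availableLocales, {
  strategy: 'lookup',
  defaultLocale: 'en-US',
});
```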

function variantRangeFor(locale) {
  locale.variant = '*';
  return locale;
}
Contributor

regionRangeFor and variantRangeFor sound like they'd create new Locale objects, IMHO. Not sure if it's better to actually do a copy, or make this a setter. I could even go as far as to not have this be a function. Or make it a setter on the object?

Collaborator Author

@stasm - what do you think? Is it worth creating a new object over modifying the existing one?

Collaborator Author

I'm working on the C++ port right now for Gecko. Once I'm done with it, I'll update the JS implementation to match it. It does have this change as well.

// Attempt to match against the available range
// This turns `en` into `en-*-*-*` and `en-US` into `en-*-US-*`
// Example: ['en-US'] * ['en'] = ['en']
for (const availableLocale of availableLocales) {
Contributor

It took me quite a while to untangle this code, mostly because the patterns in the code aren't factored in.

Is there a way to make this a second inner loop, between this and the inner

if (matches) {
  add_and_bail_if_not_matching
}

? Might make the code shorter, and easier to digest, I think

Collaborator Author

How would I bail out of the outer loop if I factored it out into a separate function?

Contributor

I was thinking of just loops, not functions.
You could also do functions, and instead of continue :outer, do early returns?
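
A sketch of that early-return shape (illustrative; `findMatchFor` is a made-up helper, `compareLocales` and the `supportedLocales` set come from the patch, and only the exact-match pass is shown):

```javascript
// Instead of `continue outer;`, move the inner scan into a helper that
// returns early; the caller then just records the first match.
function findMatchFor(requestedLocale, availableLocales) {
  for (const availableLocale of availableLocales) {
    if (compareLocales(requestedLocale, availableLocale)) {
      return availableLocale;
    }
  }
  return null;
}

for (const requestedLocale of requestedLocales) {
  const match = findMatchFor(requestedLocale, availableLocales);
  if (match !== null) {
    supportedLocales.add(match);
  }
}
```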

'zh': 'zh-hans-cn',
'zh-gb': 'zh-hant-gb',
'zh-us': 'zh-hant-us',
};
Contributor

It'd be good to add a comment on how you got to this particular sublist.

Collaborator Author

I'm ashamed to admit that I went through likelySubtags and selected the locales that I consider more significant and more likely to have someone specify the minimized version (like ab in requested instead of ab-CD) where there is more than one region/script.

I'm not sure how to document it, and I believe we should eventually do this in a more formal and algorithmic way.

Contributor

I'd go for a comment saying that then. You could even file an issue for the follow-up and link to that from the comment?

Collaborator Author

Will do!

describe('Basic Language Negotiation without likelySubtags', () => {
const nl = negotiateLanguages;

it('exact match', () => {
Contributor

Should these it() statements read as full sentences starting with "it ..."? mocha docs say yes, not sure if there's a special style among the fluent repos.

Contributor

I filed #14.
