-
Notifications
You must be signed in to change notification settings - Fork 387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLDR-16720 json: add transforms #4036
Conversation
- new package cldr-transforms - add manifest file transforms.json at the top level - each transform has a metadata file (transforms/ID.json) and a raw text file (transforms/ID.txt). - metadata has all of the keys from the transform rule - the _rulesFile key formally indicates the textfile's name (in case we need to massage the id for some reason in the future).
7bdacce
to
6d40472
Compare
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
f19ce43
to
4750f88
Compare
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
Deploy will fail because I used my personal fork, so no preview URL. Please review the Markdown change carefully. |
tools/cldr-code/src/main/java/org/unicode/cldr/json/Ldml2JsonConverter.java
Fixed
Show fixed
Hide fixed
The mix of JSON and TXT files in the same directory might make it hard to parse. The file extension is the only way to tell the difference, which usually isn't ideal. Can you split them into separate directories? Also, since the JSON files are basically metadata on the TXT files, I have a mild expectation that there would be a single index.json with all the transform metadata in one place. |
This is per-locale data. Nowhere in CLDR-JSON multiple locales are merged in an index file, they always use the file system structure. |
It's designed so clients don't need to crawl the directory or parse any filenames.
{
"transforms": {
"available": [
"InterIndic-Bengali",
"Oriya-Arabic",
"my-t-my-d0-zawgyi",
"tlh-am",
{
"transforms": {
"BGN": {
"_visibility": "external",
"_alias": "Amharic-Latin/BGN am-Latn-t-am-m0-bgn",
"_source": "am",
"_target": "am_Latn",
"_direction": "forward",
"_rulesFile": "Amharic-Latin-BGN.txt"
}
}
}
There is, it's |
I see a couple of bugs:
|
OK, that looks cool, thanks! I didn't see it the first time. I still think I mildly favor not putting the JSON and TXT in the same directory, but I'll leave that to @robertbastian to weigh in on. |
I don't see why being in the same directory would be a problem. As Mark suggested, we need docs on the formats. That will be a separate effort. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had approved this, but Steven mentioned 3 outstanding bugs:
- _source and _target need to be bcp47, not old IDs.
- There's a bug in _alias that has some corruption.
- the 2nd level key in the metadata .json has a problem (because that ID might have slashes in it)
Shane, are there any other blockers? |
- properly use BCP47 for source/target - fix corruption in alias and slashes in output
- back out bcp47 - broke some source/target ids
Please review sample data in https://github.com/unicode-org/cldr-json/tree/cldr-16720/transforms/cldr-json/cldr-transforms I now think source and target should not be bcp47 as they aren't always locale IDs. The |
Right, as long as the alias data is preserved. Eg,
<transform source="Any" target="Accents" direction="both" alias="
*und-t-d0-accents*" backwardAlias="*und-t-s0-accents*">
However it would probably be better to pull out the bcp47 items into
separate fields. Could be done later, though.
…On Tue, Sep 17, 2024, 15:45 Steven R. Loomis ***@***.***> wrote:
***@***.**** commented on this pull request.
'source' and 'target' aren't strictly locale IDs, so we cant just convert
them to bcp47.
The aliases field includes bcp47 aliases, so perhaps nothing more needs to
be done.
—
Reply to this email directly, view it on GitHub
<#4036 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACJLEMHYG7EPDH45POUJAXTZXCWHJAVCNFSM6AAAAABN7UFAY6VHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDGMJRGIYTKMJRGU>
.
You are receiving this because your review was requested.Message ID:
***@***.***>
|
What is the name of the second level key? In Amharic-Latin-BGN.json it is BGN
But in most other files it is just "transform". I agree that it would be nice to name the files consistently with either the alias name or the BCP-47 name but not a mix as is currently in the branch. |
I'm not a fan of the nesting in the JSON files. {
"transforms": {
"BGN": {
"_value": "Arabic-Latin-BGN.txt",
"_visibility": "external",
"_alias": "Arabic-Latin/BGN ar-Latn-t-ar-m0-bgn",
"_source": "ar",
"_target": "ar_Latn",
"_direction": "forward"
}
}
} could be represented as {
"_value": "Arabic-Latin-BGN.txt",
"_visibility": "external",
"_alias": "Arabic-Latin/BGN ar-Latn-t-ar-m0-bgn",
"_source": "ar",
"_target": "ar_Latn",
"_variant": "BGN",
"_direction": "forward"
} As far as I can tell no JSON file will have multiple values in the |
Sounds reasonable. The _value should be more meaningful, like: rules.
It doesn't need a _ since it doesn't correspond to an attribute in XML.
…On Wed, Sep 18, 2024, 01:14 Robert Bastian ***@***.***> wrote:
I'm not a fan of the nesting in the JSON files.
{
"transforms": {
"BGN": {
"_value": "Arabic-Latin-BGN.txt",
"_visibility": "external",
"_alias": "Arabic-Latin/BGN ar-Latn-t-ar-m0-bgn",
"_source": "ar",
"_target": "ar_Latn",
"_direction": "forward"
}
}
}
could be represented as
{
"_value": "Arabic-Latin-BGN.txt",
"_visibility": "external",
"_alias": "Arabic-Latin/BGN ar-Latn-t-ar-m0-bgn",
"_source": "ar",
"_target": "ar_Latn",
"_variant": "BGN",
"_direction": "forward"
}
As far as I can tell no JSON file will have multiple values in the
transforms map.
—
Reply to this email directly, view it on GitHub
<#4036 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACJLEMBDM2WBJXAP25L2RHTZXEY67AVCNFSM6AAAAABN7UFAY6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNJXHAYDCOBXGQ>
.
You are receiving this because your review was requested.Message ID:
***@***.***>
|
That's a bug. will fix
The files are by the id name, which is neither the bcp47 nor the alias name. As explained. |
Agh. another bug. _value is supposed to be _ruleFile. |
I'll change it to: {
"transform": {
"_source": "am",
...
}
} all of the json files have duck typed content similarly. |
- hoist json content up 2 levels - fix 'BGN' in path
please recheck sample data in https://github.com/unicode-org/cldr-json/tree/cldr-16720/transforms/cldr-json/cldr-transforms |
I still don't understand why half of these are identified by BCP-47 and half are identified by their alias.
|
that's how they are identified in the source data. |
In CLDR, the BCP47 version of the ID is in the _alias list (resp _backwardAlias) In JSON, we could filter these to break:
into
Even better would be for us to do this in the XML source, but that's not something we could do in v46 |
@macchiati filter how? How do I know which alias is which? Maybe we should go with this format, and we can add bcp47? |
I think it is sufficient to parse with a strict BCP47 parser. If it
succeeds without error, it is BCP47, otherwise legacy.
…On Thu, Sep 19, 2024 at 4:27 PM Steven R. Loomis ***@***.***> wrote:
@macchiati <https://github.com/macchiati> filter how? How do I know which
alias is which?
—
Reply to this email directly, view it on GitHub
<#4036 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACJLEMDYW47MJQVYQS4CSVLZXNMX7AVCNFSM6AAAAABN7UFAY6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNRSGM3DSNJUGE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
seems a bit imprecise/mixed but OK |
It will work, but as I wrote earlier we should add better structure |
Can you recommend a parser and options ? |
- split bcp47 and non-bcp47 aliases.
OK, try the latest, same review link. attributes are only absent if non-empty. {
"_backwardAlias": "Latin-Ethiopic/Tekie_Alibekit",
"_visibility": "external",
"_backwardAliasBcp47": "byn-Ethi-t-byn-latn-m0-tekieali",
"_alias": "Ethiopic-Latin/Tekie_Alibekit",
"_aliasBcp47": "byn-Latn-t-byn-ethi-m0-tekieali",
"_source": "byn_Ethi",
"_direction": "both",
"_target": "byn_Latn",
"_rulesFile": "byn-Ethi-t-byn-latn-m0-tekie-alibekit.txt"
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Respot-checked data, looks good to me.
Sample data available at this branch: https://github.com/unicode-org/cldr-json/tree/cldr-16720/transforms/cldr-json/cldr-transforms
CLDR-16720
ALLOW_MANY_COMMITS=true