Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

manual: document support for latin-9 identifier. #13668

Merged
merged 11 commits into from
Dec 19, 2024

Conversation

Octachron
Copy link
Member

This PR document the new lexical convention for identifiers with the addition of support for the latin-9 character subset of unicode. It also adds a brief comment that OCaml source files are now expected to be valid UTF-8 encoded unicode text.

The first two commit of this PR implements a few prerequisite:

  • The first commit switches the latex driver to lualatex for better support of unicode characters (and better font support in general).
  • The second commit improve the support of unicode character in the grammar definition pseudo-environment by not escaping unicode characters beyond the ascii range.

@Octachron Octachron added this to the 5.3.0 milestone Dec 12, 2024
@Octachron
Copy link
Member Author

The compilation issue seems to be an issue with the texlive distribution on Ubuntu 22.04, I am planning to wait for the CI update to 24.04.

@nojb
Copy link
Contributor

nojb commented Dec 12, 2024

xref: #1802 (comment)

Copy link
Contributor

@dbuenzli dbuenzli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nojb. I think a few clarifications are needed

Changes Outdated
@@ -563,6 +563,10 @@ ___________
not currently supported on OCaml 5.
(Jan Midtgaard, review by Gabriel Scherer)

- #????: Document the basic support for unicode identifiers and the switch to
utf-8 encoded unicode text for OCaml source file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UTF-8 and _U_nicode


\subsubsection*{sss:lex:text-encoding}{Source file encoding}

OCaml source files are expected to be valid UTF-8 encoded unicode text.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_U_unicode

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but what do you think about using a more forceful wording:

OCaml source files must be valid UTF-8 encoded Unicode text.
The interpretation of source files which are not UTF-8 encoded is unspecified.
Such source files may be rejected in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a contradiction between the must in the first line and the following sentences. What do you think of:

OCaml source files are expected to be valid UTF-8 encoded Unicode text.
The interpretation of source files which are not UTF-8 encoded is unspecified.
Such source files may be rejected in the future.

where the later sentences explain that the leniency in the first sentence is temporary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe, we could be more explicit that second sentence describes a temporary leniency:

OCaml source files must be valid UTF-8 encoded Unicode text.
Source files which are not UTF-8 encoded may be accepted but their interpretation is unspecified.
Such source files may be rejected in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would go with your first suggestion ("The interpretation of source files ... is unspecified"), which is short and to the point. Thanks!

"a"\ldots"z" || "š" || "ž" || "œ" || "ß" \ldots "ö" || "ø" \dots "ÿ" ;
uppercase-letter:
"A"\ldots"Z" || "Š" || "Ž" || "Œ" || "Ÿ" || "À" \ldots "Ö" || "Ø" \ldots "Þ" || "ẞ"
;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I don't remember well no normalisation is supported so this is confusing. Exact unicode code points should be mentioned.

If normalisation is supported then I would still mention the code points and add a comment that any byte sequence which NFC to the code point is supported.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NFC normalization is supported.

characters 223--246 and 248--255 as lowercase letters). This
feature is deprecated and should be avoided for future compatibility.
letters from the ASCII set, and the 70 lowercase and uppercase letters
from the ISO 8859-15 character set extended with uppercase ẞ.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing.

I think you should say something like "letters contain the UTF-8 encoding of the letters of the ASCII set (U+0041-U+005A and U+0061-U+007A), letter of latin1 character sets (U+…) aswell as upper case (U+1E9E).

Something should also be said about normalisation or the lack thereof.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact it's better to simply stick to Unicode terminology and avoid mentioning other character sets (except perhaps ASCII). So either mention only the code points or the code points and the Unicode blocks they are in, see the names on this page.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My current text is:

Letters contain at least the 52 lowercase and uppercase letters from the ASCII
set, letters from the latin-1 complement block ("U+0080" to "U+00FF"), letters
"ŠšŽžŒœŸ" from the Latin Extended-A block ("U+0100" to "U+017F") and upper case
ẞ ("U+189E").

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Though the block seems to be called Latin-1 Supplement rather than complement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I will also remove the "at least" from "Letters contain at least".
And for the normalization part, I propose:

Any byte sequence which is equivalent to a supported Unicode
character under NFC (Normalization Form C) is supported too.

Copy link
Contributor

@dbuenzli dbuenzli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few remaining nits which could clarify but it's fine with me. Also sorry I though @nojb had authored the PR. So thanks @Octachron and @nojb for the review :–)

characters 223--246 and 248--255 as lowercase letters). This
feature is deprecated and should be avoided for future compatibility.
Letters contain the 52 lowercase and uppercase letters from the ASCII set,
letters from the Latin-1 Supplement block ("U+0080" to "U+00FF"), letters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parens define the block which I find confusing. I think you can drop these.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could demote them to a footnote? My hope was that having the block definition could help people decipher the grammar definition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think mentioning the printed letters like you did for a subset below is more useful. Are there too much of them ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If yes, perhaps mention the ranges as you had them originally in the grammar.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the letters in the block are a bit of a mouthful ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
but still seems reasonable to print them all.

letters from the Latin-1 Supplement block ("U+0080" to "U+00FF"), letters
"ŠšŽžŒœŸ" from the Latin Extended-A block ("U+0100" to "U+017F") and upper case
ẞ ("U+189E"). Any byte sequence which is equivalent to a supported Unicode
character under NFC (Normalization Form C) is supported too.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit unclear what a supported Unicode character is in my opinion. I think this would be clearer: Any byte sequence which is equivalent to a supported one of these Unicode character under NFC (Normalization Form C) is supported too.

@dra27 dra27 closed this Dec 16, 2024
@dra27 dra27 reopened this Dec 16, 2024
@Octachron
Copy link
Member Author

@nojb , do you approve of the current state?

@nojb
Copy link
Contributor

nojb commented Dec 18, 2024

@nojb , do you approve of the current state?

I do!

@Octachron Octachron merged commit 418f333 into ocaml:trunk Dec 19, 2024
21 checks passed
Octachron added a commit that referenced this pull request Dec 19, 2024
manual: document support for latin-9 identifier.
(cherry picked from commit 418f333)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants