-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
manual: document support for latin-9 identifier. #13668
Conversation
The compilation issue seems to be an issue with the texlive distribution on Ubuntu 22.04, I am planning to wait for the CI update to 24.04. |
xref: #1802 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nojb. I think a few clarifications are needed
Changes
Outdated
@@ -563,6 +563,10 @@ ___________ | |||
not currently supported on OCaml 5. | |||
(Jan Midtgaard, review by Gabriel Scherer) | |||
|
|||
- #????: Document the basic support for unicode identifiers and the switch to | |||
utf-8 encoded unicode text for OCaml source file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UTF-8 and _U_nicode
manual/src/refman/lex.etex
Outdated
|
||
\subsubsection*{sss:lex:text-encoding}{Source file encoding} | ||
|
||
OCaml source files are expected to be valid UTF-8 encoded unicode text. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_U_unicode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, but what do you think about using a more forceful wording:
OCaml source files must be valid UTF-8 encoded Unicode text.
The interpretation of source files which are not UTF-8 encoded is unspecified.
Such source files may be rejected in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a contradiction between the must
in the first line and the following sentences. What do you think of:
OCaml source files are expected to be valid UTF-8 encoded Unicode text.
The interpretation of source files which are not UTF-8 encoded is unspecified.
Such source files may be rejected in the future.
where the later sentences explain that the leniency in the first sentence is temporary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe, we could be more explicit that second sentence describes a temporary leniency:
OCaml source files must be valid UTF-8 encoded Unicode text.
Source files which are not UTF-8 encoded may be accepted but their interpretation is unspecified.
Such source files may be rejected in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would go with your first suggestion ("The interpretation of source files ... is unspecified"), which is short and to the point. Thanks!
"a"\ldots"z" || "š" || "ž" || "œ" || "ß" \ldots "ö" || "ø" \dots "ÿ" ; | ||
uppercase-letter: | ||
"A"\ldots"Z" || "Š" || "Ž" || "Œ" || "Ÿ" || "À" \ldots "Ö" || "Ø" \ldots "Þ" || "ẞ" | ||
; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I don't remember well no normalisation is supported so this is confusing. Exact unicode code points should be mentioned.
If normalisation is supported then I would still mention the code points and add a comment that any byte sequence which NFC to the code point is supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NFC normalization is supported.
manual/src/refman/lex.etex
Outdated
characters 223--246 and 248--255 as lowercase letters). This | ||
feature is deprecated and should be avoided for future compatibility. | ||
letters from the ASCII set, and the 70 lowercase and uppercase letters | ||
from the ISO 8859-15 character set extended with uppercase ẞ. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit confusing.
I think you should say something like "letters contain the UTF-8 encoding of the letters of the ASCII set (U+0041-U+005A and U+0061-U+007A), letter of latin1 character sets (U+…) aswell as upper case ẞ
(U+1E9E).
Something should also be said about normalisation or the lack thereof.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact it's better to simply stick to Unicode terminology and avoid mentioning other character sets (except perhaps ASCII). So either mention only the code points or the code points and the Unicode blocks they are in, see the names on this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My current text is:
Letters contain at least the 52 lowercase and uppercase letters from the ASCII
set, letters from the latin-1 complement block ("U+0080" to "U+00FF"), letters
"ŠšŽžŒœŸ" from the Latin Extended-A block ("U+0100" to "U+017F") and upper case
ẞ ("U+189E").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Though the block seems to be called Latin-1 Supplement
rather than complement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I will also remove the "at least" from "Letters contain at least".
And for the normalization part, I propose:
Any byte sequence which is equivalent to a supported Unicode
character under NFC (Normalization Form C) is supported too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few remaining nits which could clarify but it's fine with me. Also sorry I though @nojb had authored the PR. So thanks @Octachron and @nojb for the review :–)
manual/src/refman/lex.etex
Outdated
characters 223--246 and 248--255 as lowercase letters). This | ||
feature is deprecated and should be avoided for future compatibility. | ||
Letters contain the 52 lowercase and uppercase letters from the ASCII set, | ||
letters from the Latin-1 Supplement block ("U+0080" to "U+00FF"), letters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parens define the block which I find confusing. I think you can drop these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could demote them to a footnote? My hope was that having the block definition could help people decipher the grammar definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think mentioning the printed letters like you did for a subset below is more useful. Are there too much of them ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If yes, perhaps mention the ranges as you had them originally in the grammar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the letters in the block are a bit of a mouthful ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
but still seems reasonable to print them all.
manual/src/refman/lex.etex
Outdated
letters from the Latin-1 Supplement block ("U+0080" to "U+00FF"), letters | ||
"ŠšŽžŒœŸ" from the Latin Extended-A block ("U+0100" to "U+017F") and upper case | ||
ẞ ("U+189E"). Any byte sequence which is equivalent to a supported Unicode | ||
character under NFC (Normalization Form C) is supported too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit unclear what a supported Unicode character is in my opinion. I think this would be clearer: Any byte sequence which is equivalent to a supported one of these Unicode character under NFC (Normalization Form C) is supported too.
6480b8b
to
a04d024
Compare
f20f8ac
to
cef9b88
Compare
cef9b88
to
8fd5cab
Compare
@nojb , do you approve of the current state? |
I do! |
manual: document support for latin-9 identifier. (cherry picked from commit 418f333)
This PR document the new lexical convention for identifiers with the addition of support for the latin-9 character subset of unicode. It also adds a brief comment that OCaml source files are now expected to be valid UTF-8 encoded unicode text.
The first two commit of this PR implements a few prerequisite: