manual: document support for latin-9 identifier. #13668

Octachron · 2024-12-12T10:14:51Z

This PR documents the new lexical convention for identifiers, which adds support for the Latin-9 character subset of Unicode. It also adds a brief comment that OCaml source files are now expected to be valid UTF-8 encoded Unicode text.

The first two commits of this PR implement a few prerequisites:

The first commit switches the LaTeX driver to lualatex for better support of Unicode characters (and better font support in general).
The second commit improves the support of Unicode characters in the grammar definition pseudo-environment by not escaping Unicode characters beyond the ASCII range.

Octachron · 2024-12-12T10:38:56Z

The compilation failure seems to be a problem with the TeX Live distribution on Ubuntu 22.04; I am planning to wait for the CI update to 24.04.

nojb · 2024-12-12T10:41:18Z

xref: #1802 (comment)

dbuenzli

Thanks @nojb. I think a few clarifications are needed.

dbuenzli · 2024-12-12T11:00:52Z

Changes

@@ -563,6 +563,10 @@ ___________
  not currently supported on OCaml 5.
  (Jan Midtgaard, review by Gabriel Scherer)

+- #????: Document the basic support for unicode identifiers and the switch to
+   utf-8 encoded unicode text for OCaml source file


UTF-8 and _U_nicode

dbuenzli · 2024-12-12T11:01:20Z

manual/src/refman/lex.etex

+
+\subsubsection*{sss:lex:text-encoding}{Source file encoding}
+
+OCaml source files are expected to be valid UTF-8 encoded unicode text.


Nit, but what do you think about using a more forceful wording:

OCaml source files must be valid UTF-8 encoded Unicode text. The interpretation of source files which are not UTF-8 encoded is unspecified. Such source files may be rejected in the future.

There is a contradiction between the must in the first line and the following sentences. What do you think of:

OCaml source files are expected to be valid UTF-8 encoded Unicode text.
The interpretation of source files which are not UTF-8 encoded is unspecified.
Such source files may be rejected in the future.

where the later sentences explain that the leniency in the first sentence is temporary.

Or maybe, we could be more explicit that the second sentence describes a temporary leniency:

OCaml source files must be valid UTF-8 encoded Unicode text.
Source files which are not UTF-8 encoded may be accepted but their interpretation is unspecified.
Such source files may be rejected in the future.

I would go with your first suggestion ("The interpretation of source files ... is unspecified"), which is short and to the point. Thanks!
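To make the "valid UTF-8" requirement concrete, here is a minimal sketch of a strict validity check. It is written in Python purely for illustration and says nothing about how the OCaml lexer handles its input; the helper name `is_valid_utf8` and the two sample byte strings are assumptions introduced for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same source text encoded two ways: "é" is two bytes in UTF-8
# (0xC3 0xA9) but a single byte (0xE9) in Latin-1.
utf8_source = "let é = 1".encode("utf-8")
latin1_source = "let é = 1".encode("latin-1")

print(is_valid_utf8(utf8_source))    # the UTF-8 bytes decode cleanly
print(is_valid_utf8(latin1_source))  # the lone 0xE9 byte is not valid UTF-8
```

A Latin-1 encoded file is exactly the kind of input whose interpretation the proposed wording leaves unspecified.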

dbuenzli · 2024-12-12T11:19:22Z

manual/src/refman/lex.etex

+"a"\ldots"z" || "š" || "ž" || "œ" || "ß" \ldots "ö" || "ø" \dots "ÿ" ;
+uppercase-letter:
+  "A"\ldots"Z" || "Š" || "Ž" || "Œ" || "Ÿ" || "À" \ldots "Ö" || "Ø" \ldots "Þ" || "ẞ"
+ ;


Unless I misremember, no normalisation is supported, so this is confusing. The exact Unicode code points should be mentioned.

If normalisation is supported then I would still mention the code points and add a comment that any byte sequence which normalises under NFC to the code point is supported.

NFC normalization is supported.

dbuenzli · 2024-12-12T11:19:48Z

manual/src/refman/lex.etex

-characters 223--246 and 248--255 as lowercase letters). This
-feature is deprecated and should be avoided for future compatibility.
+letters from the ASCII set, and the 70 lowercase and uppercase letters
+from the ISO 8859-15 character set extended with uppercase ẞ.


This is a bit confusing.

I think you should say something like "letters contain the UTF-8 encoding of the letters of the ASCII set (U+0041-U+005A and U+0061-U+007A), letters of the Latin-1 character set (U+…) as well as upper case ẞ (U+1E9E)".

Something should also be said about normalisation or the lack thereof.

In fact it's better to simply stick to Unicode terminology and avoid mentioning other character sets (except perhaps ASCII). So either mention only the code points or the code points and the Unicode blocks they are in, see the names on this page.

My current text is:

Letters contain at least the 52 lowercase and uppercase letters from the ASCII
set, letters from the latin-1 complement block ("U+0080" to "U+00FF"), letters
"ŠšŽžŒœŸ" from the Latin Extended-A block ("U+0100" to "U+017F") and upper case
ẞ ("U+1E9E").

Sounds good. Though the block seems to be called Latin-1 Supplement rather than complement.

Indeed, I will also remove the "at least" from "Letters contain at least".
And for the normalization part, I propose:

Any byte sequence which is equivalent to a supported Unicode
character under NFC (Normalization Form C) is supported too.
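As an illustration of the proposed sentence, the sketch below shows two different byte sequences that are equivalent under NFC. It uses Python's standard `unicodedata` module only to demonstrate the Unicode behaviour; it is not the compiler's implementation, and the variable names and the `nfc` helper are assumptions made for this example.

```python
import unicodedata

# "é" as one precomposed code point, and as "e" followed by a combining accent.
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def nfc(text: str) -> str:
    """Normalize `text` to Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


# The UTF-8 byte sequences differ...
print(composed.encode("utf-8") == decomposed.encode("utf-8"))
# ...but both normalize to the same single code point under NFC.
print(nfc(decomposed) == composed)
```

The same holds for the Latin Extended-A letters: for instance, "S" followed by U+030C (combining caron) normalizes to "Š" (U+0160).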

dbuenzli

I left a few remaining nits which could clarify things, but it's fine with me. Also sorry, I thought @nojb had authored the PR. So thanks @Octachron and @nojb for the review :–)

dbuenzli · 2024-12-12T15:07:54Z

manual/src/refman/lex.etex

-characters 223--246 and 248--255 as lowercase letters). This
-feature is deprecated and should be avoided for future compatibility.
+Letters contain the 52 lowercase and uppercase letters from the ASCII set,
+letters from the Latin-1 Supplement block ("U+0080" to "U+00FF"), letters


The parens define the block which I find confusing. I think you can drop these.

I could demote them to a footnote? My hope was that having the block definition could help people decipher the grammar definition.

I think mentioning the printed letters like you did for a subset below is more useful. Are there too many of them?

If yes, perhaps mention the ranges as you had them originally in the grammar.

All the letters in the block are a bit of a mouthful ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
but it still seems reasonable to print them all.
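The letter set discussed in this thread can be written down as code-point ranges, which may help when reading the grammar. The following is a minimal sketch in Python, assuming the ranges given in the grammar excerpt above ("À" to "Ö", "Ø" to "Þ", "ß" to "ö", "ø" to "ÿ"), the letters "ŠšŽžŒœŸ", and capital ẞ at U+1E9E as in the earlier suggestion. All names (`char_range`, `LETTERS`, `is_letter`) are hypothetical, and digits, underscores, quotes, and NFC normalization are deliberately left out.

```python
def char_range(first: int, last: int) -> set:
    """Return the set of characters with code points first..last inclusive."""
    return {chr(cp) for cp in range(first, last + 1)}


ASCII_LETTERS = char_range(0x41, 0x5A) | char_range(0x61, 0x7A)

# Letters of the Latin-1 Supplement block; U+00D7 (×) and U+00F7 (÷)
# fall between the ranges and are not letters.
LATIN1_SUPPLEMENT_LETTERS = (
    char_range(0xC0, 0xD6)    # À .. Ö
    | char_range(0xD8, 0xDE)  # Ø .. Þ
    | char_range(0xDF, 0xF6)  # ß .. ö
    | char_range(0xF8, 0xFF)  # ø .. ÿ
)

LATIN_EXTENDED_A_LETTERS = set("ŠšŽžŒœŸ")
CAPITAL_SHARP_S = {"\u1e9e"}  # ẞ, LATIN CAPITAL LETTER SHARP S

LETTERS = (
    ASCII_LETTERS
    | LATIN1_SUPPLEMENT_LETTERS
    | LATIN_EXTENDED_A_LETTERS
    | CAPITAL_SHARP_S
)


def is_letter(ch: str) -> bool:
    """Return True if `ch` is one of the letters listed above."""
    return ch in LETTERS


print(len(LATIN1_SUPPLEMENT_LETTERS))  # number of Latin-1 Supplement letters
print(is_letter("œ"), is_letter("\u00d7"))
```

Spelling the set out this way also makes it easy to cross-check a code point against the printed letters.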

dbuenzli · 2024-12-12T15:08:45Z

manual/src/refman/lex.etex

+letters from the Latin-1 Supplement block ("U+0080" to "U+00FF"), letters
+"ŠšŽžŒœŸ" from the Latin Extended-A block ("U+0100" to "U+017F") and upper case
+ẞ ("U+1E9E"). Any byte sequence which is equivalent to a supported Unicode
+character under NFC (Normalization Form C) is supported too.


It's a bit unclear what a supported Unicode character is, in my opinion. I think this would be clearer: Any byte sequence which is equivalent to ~~a supported~~ one of these Unicode characters under NFC (Normalization Form C) is supported too.

Octachron · 2024-12-18T10:39:51Z

@nojb , do you approve of the current state?

nojb · 2024-12-18T10:41:56Z

@nojb , do you approve of the current state?

I do!

manual: document support for latin-9 identifier. (cherry picked from commit 418f333)

Octachron added the documentation label Dec 12, 2024

Octachron added this to the 5.3.0 milestone Dec 12, 2024

dbuenzli suggested changes Dec 12, 2024

dbuenzli approved these changes Dec 12, 2024

Octachron force-pushed the manual-latin-9 branch from 6480b8b to a04d024, December 12, 2024 16:40

dra27 closed this Dec 16, 2024

dra27 reopened this Dec 16, 2024

Octachron added 10 commits December 18, 2024 11:10

manual: switch pdf renderer to lualatex

7ad9280

manual: don't escape non-ascii unicode character in grammars

50f2d28

manual: document the switch to basic unicode support

19de0ed

Changes placeholder

1b7797d

review: changes

c55612b

review: Unicode capitalization

670d5ed

manual: transf.mll better escaping

337e0b2

review: use Unicode vocabulary

2d80773

review: stronger encoding wording

2815c01

review: rewording and expand Latin-1 block

c64ad79

Octachron force-pushed the manual-latin-9 branch from f20f8ac to cef9b88, December 18, 2024 10:11

github workflow: add texlive-luatex for the manual

8fd5cab

Octachron force-pushed the manual-latin-9 branch from cef9b88 to 8fd5cab, December 18, 2024 10:19

Octachron merged commit 418f333 into ocaml:trunk Dec 19, 2024
21 checks passed

Octachron added a commit that referenced this pull request Dec 19, 2024

Merge pull request #13668 from Octachron/manual-latin-9

bfd367f

manual: document support for latin-9 identifier. (cherry picked from commit 418f333)

hhugo mentioned this pull request Dec 19, 2024

Add utf8 support for string literal ocaml-community/sedlex#127

Open
