Experimental support for unicode identifiers. #1407
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #1407 +/- ##
==========================================
- Coverage 89.87% 89.76% -0.12%
==========================================
Files 63 65 +2
Lines 10525 10597 +72
==========================================
+ Hits 9459 9512 +53
- Misses 1066 1085 +19
Isn't there sub slicing when we peel off
But, that is definitely one area of this that would need much more testing before it could be merged.
e2bf53a to 49381b7
Likely needs more library support: dbuenzli/uucp#25
49381b7 to 2865cca
There is some prior art now, in the OCaml compiler itself. They define an explicit set of characters they allow, which sidesteps a lot of the issues here (no need for things like UUSeg, etc.): https://github.com/ocaml/ocaml/blob/6c298db0e356d0e04dd45acf6684f693f8baa7db/utils/misc.ml#L265-L272
I know for a fact that this requires a few changes in stan-dev/stan's JSON data handler to recognize Unicode names, which is just one of several reasons this is a draft.

The basic overview:
OCaml strings should be treated mostly like arrays of bytes, and ocamllex handles inputs as sets of bytes. We can define rules that recognize UTF-8-compatible bytes, and then validate them after the fact against the Unicode Standard Annex #31 (UAX31): Unicode Identifiers standard.

For most of the compiler we then pretend the names are just bytes, which is fine because we never do things like subslice variable names.
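The "lex bytes first, validate afterwards" step can be sketched roughly as follows. This is not the code in this PR. It assumes OCaml 4.14 or later (for `String.get_utf_8_uchar`), and the names `validate_identifier`, `is_start`, `is_continue`, and `is_allowed_non_ascii` are hypothetical. The allowlist is a small explicit set, in the spirit of the OCaml compiler approach mentioned above. A real UAX31 check would query the XID_Start and XID_Continue properties through a Unicode library.

```ocaml
(* Hypothetical sketch: validate an identifier that the lexer has already
   matched as a run of bytes. Requires OCaml >= 4.14. *)

let is_ascii_letter c =
  (c >= Char.code 'a' && c <= Char.code 'z')
  || (c >= Char.code 'A' && c <= Char.code 'Z')

(* Illustrative allowlist only (an assumption, not UAX31):
   Latin-1 letters, skipping U+00D7 and U+00F7, and Greek lowercase. *)
let is_allowed_non_ascii c =
  (c >= 0x00C0 && c <= 0x00D6)
  || (c >= 0x00D8 && c <= 0x00F6)
  || (c >= 0x00F8 && c <= 0x00FF)
  || (c >= 0x03B1 && c <= 0x03C9)

let is_start u =
  let c = Uchar.to_int u in
  is_ascii_letter c || is_allowed_non_ascii c

let is_continue u =
  let c = Uchar.to_int u in
  is_start u || (c >= Char.code '0' && c <= Char.code '9') || c = Char.code '_'

(* Walk the bytes, decoding one UTF-8 scalar value at a time. The first
   character must satisfy [is_start], the rest [is_continue]. *)
let validate_identifier (s : string) : (unit, string) result =
  let n = String.length s in
  let rec go i =
    if i >= n then Ok ()
    else
      let d = String.get_utf_8_uchar s i in
      if not (Uchar.utf_decode_is_valid d) then
        Error (Printf.sprintf "invalid UTF-8 at byte %d" i)
      else
        let u = Uchar.utf_decode_uchar d in
        let ok = if i = 0 then is_start u else is_continue u in
        if ok then go (i + Uchar.utf_decode_length d)
        else
          Error
            (Printf.sprintf "U+%04X not allowed at byte %d" (Uchar.to_int u) i)
  in
  if n = 0 then Error "empty identifier" else go 0
```

For example, `validate_identifier "\xCE\xBC_1"` (the bytes of `μ_1`) returns `Ok ()`, while a lone `"\xCE"` is rejected as malformed UTF-8. After this check the rest of the compiler can keep passing the name around as an opaque string.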
Finally, at output time, we already had string escaping (since #952), so most of the code-gen works fine. Recent C++ standards require that compilers support UTF-8 names based on the same UAX31 rules linked above, but older ones may not. For now I've got it generating "universal character names", which seem to be the legacy version of this and which hopefully means older compilers will be happy with it.
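As an illustration of the universal-character-name idea, here is a minimal sketch of the escaping step. It is not the PR's implementation. The function name `escape_ucn` is hypothetical, it assumes OCaml 4.14 or later, and it assumes the identifier has already been validated as UTF-8. C++ spells these escapes as `\uXXXX` (four hex digits) or `\UXXXXXXXX` (eight hex digits), so code points above U+FFFF use the long form.

```ocaml
(* Hypothetical sketch: rewrite non-ASCII characters in an identifier as
   C++ universal character names. ASCII bytes are copied through unchanged.
   Assumes valid UTF-8 input; malformed bytes would decode to U+FFFD. *)
let escape_ucn (s : string) : string =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec go i =
    if i < n then begin
      let d = String.get_utf_8_uchar s i in
      let c = Uchar.to_int (Uchar.utf_decode_uchar d) in
      if c < 0x80 then Buffer.add_char buf (Char.chr c)
      else if c <= 0xFFFF then
        Buffer.add_string buf (Printf.sprintf "\\u%04X" c)
      else Buffer.add_string buf (Printf.sprintf "\\U%08X" c);
      go (i + Uchar.utf_decode_length d)
    end
  in
  go 0;
  Buffer.contents buf
```

With this sketch, the bytes of `μ_1` (`"\xCE\xBC_1"`) come out as the ASCII-only text `\u03BC_1`, which is the form an older compiler would hopefully accept in the generated C++.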
Submission Checklist
Release notes
stanc3 can now accept a flag --allow-unicode which enables the use of non-ASCII characters in Stan files. All files are expected to be encoded in UTF-8. This is experimental and may not work with older C++ compilers.
Copyright and Licensing
By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the BSD 3-clause license (https://opensource.org/licenses/BSD-3-Clause)