-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[EXPERIMENT] Disallow all literal suffixes except the standard numeric ones #103872
[EXPERIMENT] Disallow all literal suffixes except the standard numeric ones #103872
Conversation
…c ones. Partly out of curiosity, and partly because this would significantly simplify parts of the lexer and parser.
@bors try |
⌛ Trying commit 1d0b161 with merge 8330aa5198c56eb493d06cd9c4cc91d00d70e3b6... |
The code is very rough, but should be good enough for the experiment. |
☀️ Try build successful - checks-actions |
@craterbot run mode=check-only |
👌 Experiment ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more |
🚧 Experiment ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more |
🎉 Experiment
|
Plenty of imaginative suffix use in macro-based DSLs. So disallowing them is a non-starter. |
To summarize the crater results: dozens of crates use literal suffixes in all sorts of interesting ways. Here are some examples from many (but not all) of the crates that were broken by the crater run disallowing arbitrary suffixes:
Use cases:
Correctly handling some of these on the macro side must be quite the task. Note that arbitrary literal suffixes go back to #19103 RFC 463, where the idea was introduced to "futureproof" Rust for the possibility of fancier suffixes, e.g. for different kinds of literals. One of the unresolved questions in that RFC was:
The answer chosen by the implementation was "the parser", which allowed arbitrary suffixes to be used as macro inputs. This arguably broke the futureproofing. Can new suffixes can be reasonably added, given that existing macros can (and do) effectively define their own? I'm honestly not sure. |
Integers with arbitrary suffixes are allowed as inputs to proc macros. A number of real-world crates use this capability in interesting ways, as seen in rust-lang#103872. For example: - Suffixes representing units, such as `8bits`, `100px`, `20ns`, `30GB` - CSS hex colours such as `#7CFC00` (LawnGreen) - UUIDs, e.g. `785ada2c-f2d0-11fd-3839-b3104db0cb68` The hex cases may be surprising. - `#7CFC00` is tokenized as a `#` followed by a `7` integer with a `CFC00` suffix. - `785ada2c` is tokenized as a `785` integer with an `ada2c` suffix. - `f2d0` is tokenized as an identifier. - `3839` is tokenized as an integer literal. A proc macro will immediately stringify such tokens and reparse them itself, and so won't care that the token types vary. All suffixes must be consumed by the proc macro, of course; the only suffixes allowed after macro expansion are the numeric ones like `u8`, `i32`, and `f64`. Currently there is an annoying inconsistency in how integer literal suffixes are handled, which is that no suffix starting with `e` is allowed, because that it interpreted as a float literal with an exponent. For example: - Units: `1eV` and `1em` - CSS colours: `#90EE90` (LightGreen) - UUIDs: `785ada2c-f2d0-11ed-3839-b3104db0cb68` In each case, a sequence of digits followed by an 'e' or 'E' followed by a letter results in an "expected at least one digit in exponent" error. This is an annoying inconsistency in general, and a problem in practice. It's likely that some users haven't realized this inconsistency because they've gotten lucky and never used a token with an 'e' that causes problems. Other users *have* noticed; it's causing problems when embedding DSLs into proc macros, as seen in rust-lang#111615, where the CSS colours case is causing problems for two different UI frameworks (Slint and Makepad). We can do better. This commit changes the lexer so that, when it hits a possible exponent, it looks ahead and only produces an exponent if a valid one is present. Otherwise, it produces a non-exponent form, which may be a single token (e.g. `1eV`) or multiple tokens (e.g. `1e+a`). Consequences of this: - All the proc macro problem cases mentioned above are fixed. - The "expected at least one digit in exponent" error is no longer possible. A few tests that only worked in the presence of that error have been removed. - The lexer requires unbounded lookahead due to the presence of '_' chars in exponents. E.g. to distinguish `1e+_______3` (a float literal with exponent) from `1e+_______a` (previously invalid, but now the tokenised as `1e`, `+`, `_______a`). This is a backwards compatible language change: all existing valid programs will be treated in the same way, and some previously invalid programs will become valid. The tokens chapter of the language reference (https://doc.rust-lang.org/reference/tokens.html) will need changing to account for this. In particular, the "Reserved forms similar to number literals" section will need updating, and grammar rules involving the SUFFIX_NO_E nonterminal will need adjusting. Fixes rust-lang#111615.
Partly out of curiosity, and partly because this would significantly simplify parts of the lexer and parser.
r? @ghost