Tracking Issue for RFC 3101: Reserved Prefixes #84978

Closed
2 of 3 tasks
nikomatsakis opened this issue May 6, 2021 · 20 comments
Assignees
Labels
C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC F-reserved_prefixes `#![feature(reserved_prefixes)]` T-lang Relevant to the language team, which will review and decide on the PR/issue.

Comments

@nikomatsakis
Contributor

nikomatsakis commented May 6, 2021

This is a tracking issue for the RFC "Reserved Prefixes" (rust-lang/rfcs#3101).
The feature gate for the issue is #![feature(reserved_prefixes)].

About tracking issues

Tracking issues are used to record the overall progress of implementation.
They are also used as hubs connecting to other relevant issues, e.g., bugs or open design questions.
A tracking issue is however not meant for large scale discussion, questions, or bug reports about a feature.
Instead, open a dedicated issue for the specific matter and add the relevant feature gate label.

Steps

Unresolved Questions

Implementation history

@nikomatsakis nikomatsakis added C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-lang Relevant to the language team, which will review and decide on the PR/issue. labels May 6, 2021
@nikomatsakis nikomatsakis added the F-reserved_prefixes `#![feature(reserved_prefixes)]` label May 6, 2021
@lrh2000
Contributor

lrh2000 commented May 12, 2021

@rustbot claim

I'm interested in implementing this, but I have some questions.

  1. The RFC says foo# bar is treated as three tokens. But currently r# bar fails with the error found invalid character.
  • The error is weird and unnecessary (in my opinion).
  • It may lead to inconsistency. For example, we may reserve the prefix xr# and accept xr# bar as three tokens (that's what the RFC specifies). Once we decide to use xr# to represent something like a raw string, xr# bar may no longer compile (due to the error found invalid character).

Possible solutions:

  • Remove the error found invalid character to improve consistency. I don't think it breaks anything.
  • Treat everything starting with foo# as an error (suggested by @m-ou-se). It is easy to implement, but it does not conform to the requirements of the RFC.
  2. The RFC does not describe some corner cases clearly. I give my expected results for them below. (These follow the RFC, and obviously we get different results if we deny everything starting with foo#.)
  • foo## bar
    4 tokens foo, #, #, bar (NOTE: However, r## bar cannot compile now.)
  • foo#! bar
    4 tokens foo, #, !, bar (NOTE: However, r#! bar cannot compile now.)
  • foo##! bar
    5 tokens foo, #, #, !, bar (NOTE: However, r##! bar cannot compile now.)
  • foo##" bar
    error: reserved prefixes
  • foo##bar
    error: reserved prefixes
  • r#foo#bar
    error: reserved prefixes
  • r"foo"#bar
    3 tokens "foo", #, bar

(See also the discussion at zulip https://rust-lang.zulipchat.com/#narrow/stream/268952-edition-2021/topic/reserved.20prefixes/near/238470099.)
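A minimal sketch of the simpler reservation rule mentioned above (treat any identifier immediately followed by #, ", or ' as reserved, unless it is one of the pre-existing prefixes). This is a toy predicate for illustration only; is_reserved_prefix_2021 is a hypothetical name, not rustc's actual lexer:

```rust
// Toy model of the reservation rule under discussion: in edition 2021,
// an identifier immediately followed by `#`, `"`, or `'` is reserved,
// except for the pre-existing prefixes (r, b, br). Illustrative only;
// `is_reserved_prefix_2021` is a hypothetical helper, not rustc code.
fn is_reserved_prefix_2021(ident: &str, next: char) -> bool {
    let known = matches!(ident, "r" | "b" | "br");
    matches!(next, '#' | '"' | '\'') && !known
}

fn main() {
    // `foo#...` and `foo"..."` would be reserved in edition 2021...
    assert!(is_reserved_prefix_2021("foo", '#'));
    assert!(is_reserved_prefix_2021("foo", '"'));
    // ...but the existing r/b/br prefixes keep their meaning,
    assert!(!is_reserved_prefix_2021("r", '#'));
    // and whitespace between the identifier and the quote avoids it.
    assert!(!is_reserved_prefix_2021("foo", ' '));
}
```

Under this rule, all of the foo#-prefixed corner cases in the list above become errors uniformly, at the cost of diverging from the tokenization the RFC text describes.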


@steffahn
Member

Some relevant examples regarding string (and similarly char) literals – the first two copied from the linked zulip discussion (archive) – include

match"foo" { _ => () }

if"foo".eq("foo") {}

while"foo".ne("foo") {}

if let"foo" = "foo" {}

loop {
    break"foo"
};

for _ in"foo".chars() {}

return"foo"

(playground)

which raises the question of whether we want to exclude all keywords or maybe certain keywords from the disallowed prefixes.

@petrochenkov
Contributor

whether we want to exclude all keywords or maybe certain keywords from the disallowed prefixes

Note that the lexer isn't currently aware of the keyword list (*), and making it aware is a step in the wrong direction, IMO.
I generally support @matklad's idea of working with literal prefixes using jointness (#84979 (comment)); this is the direction in which multi-character punctuation and "compound" float literals are supposed to move in accordance with #71322 (comment).

(*) After all, every left-hand side of a macro is able to introduce its own syntax entirely independent from Rust syntax, with its own keyword list, and the lexer works with macro inputs too.

@nikomatsakis
Contributor Author

nikomatsakis commented May 21, 2021

@petrochenkov I was wondering about "jointness". The idea here would be that we produce the same tokens, but in the Rust parser we check for (e.g.) k#foo and interpret it differently than k# foo? (By inspecting the "is there adjoining whitespace" info?)
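The jointness idea being asked about here could be sketched roughly as follows. This is a toy model: Spacing mirrors the name used in the proc_macro API, but Token and is_prefixed_ident are made-up stand-ins, not rustc's real data structures.

```rust
// Hypothetical sketch: the lexer emits the same tokens for `k#foo` and
// `k# foo`, but records whether each token is glued to the next, and
// the parser consults that flag instead of the lexer erroring out.

enum Spacing {
    Joint, // no whitespace before the next token
    Alone, // whitespace (or end of input) follows
}

struct Token<'a> {
    text: &'a str,
    spacing: Spacing,
}

/// Parser-side check: `ident # ident` only counts as a prefixed form
/// when both gaps are joint.
fn is_prefixed_ident(tokens: &[Token]) -> bool {
    matches!(
        tokens,
        [
            Token { spacing: Spacing::Joint, .. },
            Token { text: "#", spacing: Spacing::Joint },
            Token { .. },
        ]
    )
}

fn main() {
    // k#foo — everything glued together: one prefixed form.
    let joined = [
        Token { text: "k", spacing: Spacing::Joint },
        Token { text: "#", spacing: Spacing::Joint },
        Token { text: "foo", spacing: Spacing::Alone },
    ];
    // k# foo — whitespace after `#`: three ordinary tokens.
    let spaced = [
        Token { text: "k", spacing: Spacing::Joint },
        Token { text: "#", spacing: Spacing::Alone },
        Token { text: "foo", spacing: Spacing::Alone },
    ];
    assert!(is_prefixed_ident(&joined));
    assert!(!is_prefixed_ident(&spaced));
}
```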

@petrochenkov
Contributor

Yeah, it should be applicable to prefix#ident too in some form (with less urgency), but I was talking about prefixLITERAL specifically (as an answer to #84978 (comment)) because it has larger compatibility issues.

@lrh2000
Contributor

lrh2000 commented May 22, 2021

@petrochenkov So do you mean that we should emit errors for reserved prefixes in the parser instead of the lexer?

I think it does not solve the problem about TokenStream::from_str; it only defers it. We may not alter the lexing rules now, but in the future we have to change the lexing rules if we want to introduce another prefix that is also used to denote raw strings (or something like them). Am I missing anything?

But I think it sounds good because it can easily solve almost all the problems pointed out by @steffahn. For example, we can distinguish lifetimes from string literals, and we can also distinguish keywords from identifiers. And we can have better diagnostic messages, like

error: unknown prefix on string literal: bar
 --> file.rs:x:y
  |
1 | foo!(bar"qux");
  |      ^^^ help: try using whitespace here: `bar "qux"`
  |
  = note: prefixed string literals are reserved for future use

But I've found that parsing macro arguments and expanding macros is currently done in rustc_expand, which means we would emit errors for reserved prefixes from both rustc_parse and rustc_expand. Does that sound good?

@petrochenkov
Contributor

petrochenkov commented May 23, 2021

@lrh2000

So do you mean that we should emit errors for reserved prefixes in the parser instead of the lexer?

Yes, at least for breakage-prone things like match"foo".
For anything involving raw strings with #s, the errors can be reported in the lexer, just to save time and defer the final decision; I'm pretty sure that code like match#"foo"# is not used in practice.
What I'd really like to see is a crater result for the most conservative reservation; that would help us make decisions.

I think it does not solve the problem about TokenStream::from_str

Could you remind me what exactly the problem with TokenStream::from_str is?

but in the future we have to change the lexing rules if we want to introduce another prefix that is also used to denote raw strings (or something like them). Am I missing anything?

String escaping is done during parsing (proc macros can do custom escaping in particular), so the lexer doesn't need to discern between raw and non-raw strings unless #s are involved, but #s should be covered by the first item.

But I've found currently parsing the macro arguments and expanding macros is done in rustc_expand, so it means that we will emit errors for reserved prefixes both from rustc_parse and rustc_expand, does it sound good?

rustc_expand should always delegate to rustc_parse for parsing (maybe unless compatibility shims for tt matchers are involved?).
Sorry for the brief answers, I can't spend much time on this right now.

@steffahn
Member

the lexer doesn't need to discern between raw and non-raw strings unless #s are involved

It does need to discern. Both "\"" and r"\" are valid (raw) string literals, whereas "\" is an unterminated string literal and r"\"" is just r"\" followed by an unterminated string literal.
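This point can be made concrete with a toy scanner. It is a deliberate simplification in the spirit of rustc_lexer's double_quoted_string (quoted later in this thread), not the real implementation; find_string_end is a hypothetical name:

```rust
// Toy scanner: finding the *end* of a string literal already depends on
// whether it is raw, because `\"` is an escaped quote in `"..."` but a
// backslash followed by a terminator in `r"..."`. Simplified, not rustc.

/// Returns the byte index just past the closing quote, or None if the
/// literal is unterminated. `src` starts right after the opening `"`.
fn find_string_end(src: &str, raw: bool) -> Option<usize> {
    let mut chars = src.char_indices();
    while let Some((i, c)) = chars.next() {
        match c {
            '"' => return Some(i + 1),
            '\\' if !raw => {
                chars.next(); // skip the escaped character
            }
            _ => {}
        }
    }
    None // end of input reached without a closing quote
}

fn main() {
    // "\""  — escaped quote, then the real terminator.
    assert_eq!(find_string_end("\\\"\"", false), Some(3));
    // r"\"  — the backslash is literal, so the `"` terminates.
    assert_eq!(find_string_end("\\\"", true), Some(2));
    // "\"   — non-raw: `\"` is an escape, nothing terminates it.
    assert_eq!(find_string_end("\\\"", false), None);
}
```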

@nikomatsakis
Contributor Author

We can certainly just accept that any future form of raw string will have to use #.

@lrh2000
Contributor

lrh2000 commented May 24, 2021

@petrochenkov Thanks for your reply first.

Could you remind what is the problem with TokenStream::from_str exactly?

We may get different results (TokenStream) for the same string (from_str) in different versions of Rust, but the version information is not known when TokenStream::from_str is evaluated.

I'm not sure why you ask this. Maybe I misunderstood the problem?
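To illustrate the concern as I understand it: the same source text can lex to a different number of tokens depending on the edition, yet from_str receives only the string. Edition and count_tokens below are hypothetical stand-ins whose hard-coded behavior just mimics the foo"bar" example; this is not how rustc is structured.

```rust
// Toy illustration of the `TokenStream::from_str` concern. `Edition`
// and `count_tokens` are hypothetical stand-ins, not rustc APIs.

enum Edition { E2018, E2021 }

/// Pretend lexer front-end: in 2018, `foo"bar"` is an identifier
/// followed by a string literal (2 tokens); in 2021 it is a reserved
/// prefix and lexing fails.
fn count_tokens(src: &str, edition: Edition) -> Result<usize, &'static str> {
    if src == "foo\"bar\"" {
        return match edition {
            Edition::E2018 => Ok(2),
            Edition::E2021 => Err("prefix `foo` is unknown"),
        };
    }
    Ok(1)
}

fn main() {
    assert_eq!(count_tokens("foo\"bar\"", Edition::E2018), Ok(2));
    assert!(count_tokens("foo\"bar\"", Edition::E2021).is_err());
    // A caller of TokenStream::from_str cannot tell which behavior it
    // should get — the edition is not part of the input string.
}
```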

String escaping is done during parsing (proc macros can do custom escaping in particular), so the lexer doesn't need to discern between raw and non-raw strings unless #s are involved, but #s should be covered by the first item.

Right now the lexer reports the start and end position of every token, so it has to do some (albeit simplified) escape handling; see

/// Eats double-quoted string and returns true
/// if string is terminated.
fn double_quoted_string(&mut self) -> bool {
    debug_assert!(self.prev() == '"');
    while let Some(c) = self.bump() {
        match c {
            '"' => {
                return true;
            }
            '\\' if self.first() == '\\' || self.first() == '"' => {
                // Bump again to skip escaped character.
                self.bump();
            }
            _ => (),
        }
    }
    // End of file reached.
    false
}

rustc_expand should always delegate to rustc_parse for parsing (maybe unless compatibility shims for tt matchers are involved?).

If we check reserved prefixes in the parser, I think the best place is parse_ident. But when parsing macro arguments, things are different: I don't believe it will go through the parse_ident path, especially when it just expects a token. I think this reflects that checking reserved prefixes in the parser is less elegant than checking in the lexer (at least in certain aspects), although it also brings many benefits.

I will conduct more experiments on this, but now I'm curious about your opinions (if any).

@m-ou-se
Member

m-ou-se commented May 31, 2021

Note that the 'jointness' solution won't really work. That's exactly what we wanted to avoid by reserving prefixes in the lexer, to make sure we don't have to decide right now on the lexing rules for every possible prefix. As I wrote in #84979 (comment):

We specifically wanted to make these things a lexer error, so we don't have to decide on the lexing rules for that literal yet. E.g. is fb"\xFF" valid or not? How about fb"❉"? Or fr"\"? Or f"hello {123 + "hey".len()}"?


Also, this doesn't seem to be true:

String escaping is done during parsing (proc macros can do custom escaping in particular)

Proc macros cannot accept string literals with escape sequences or characters that don't match the prefix. These errors are given without parsing:

#[proc_macro]
pub fn hey(_: TokenStream) -> TokenStream {
    TokenStream::new()
}
hey!("\xFF");
hey!(b"å");
error: out of range hex escape
 --> examples/hey.rs:3:7
  |
3 | hey!("\xFF");
  |       ^^^^ must be a character in the range [\x00-\x7f]

error: non-ASCII character in byte constant
 --> examples/hey.rs:4:8
  |
4 | hey!(b"å");
  |        ^
  |        |
  |        byte constant must be ASCII
  |        help: use a \xHH escape for a non-ASCII byte: `\xE5`

@m-ou-se
Member

m-ou-se commented May 31, 2021

As for match"hey" etc., I think it's perfectly fine if that gives an error and a suggestion for match "hey" with a space.

@nikomatsakis
Contributor Author

I wrote up a summary of what I believe to be the issues and concerns raised thus far:

https://hackmd.io/YLe7viGLTu2PfE5sQO4v0w

I think that, on balance, the right path is to continue with a hard lex error (also for keywords). I don't think the lexer should know about the keyword list, but it would have to know about the edition. Is there any critical problem with this, @petrochenkov or @matklad? The downsides of using jointness information (and some unknowns) are covered in the hackmd.

@m-ou-se
Member

m-ou-se commented Aug 2, 2021

This seems to have already been completed. Apart from edition 2021 not being stable yet, this is stable.

@m-ou-se m-ou-se closed this as completed Aug 2, 2021
@nikomatsakis
Contributor Author

Re-opening as we have not made a stabilization decision here. Has there been a Rust reference PR? I drafted a stabilization report in this hackmd.

@nikomatsakis nikomatsakis reopened this Aug 18, 2021
@nikomatsakis
Contributor Author

@TriageBot release-assignment

I don't believe @lrh2000 is still pushing this forward.

@nikomatsakis
Contributor Author

@rustbot release-assignment

@nikomatsakis
Contributor Author

nikomatsakis commented Aug 18, 2021

Stabilization proposed in #88140

@compiler-errors
Member

For the record, T-lang decided to retroactively reserve 'prefix#lifetime lifetimes as an extension to this work, which is implemented in #126452. That PR also implements raw lifetimes, like 'r#async.

Projects
None yet
Development

No branches or pull requests

7 participants