Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

introduce unescape module #60261

Merged
merged 2 commits into from
May 6, 2019
Merged

introduce unescape module #60261

merged 2 commits into from
May 6, 2019

Conversation

matklad
Copy link
Member

@matklad matklad commented Apr 25, 2019

A WIP PR to gauge early feedback

Currently, we deal with escape sequences twice: once when we lex a string, and a second time when we unescape literals. Note that we also produce different sets of diagnostics in these two cases.

This PR aims to remove this duplication, by introducing a new unescape module as a single source of truth for character escaping rules.

I think this would be a useful cleanup by itself, but I also need this for #59706.

In the current state, the PR has unescape module which fully (modulo bugs) deals with string and char literals. I am quite happy about the state of this module

What this PR doesn't have yet are:

  • handling of byte and byte string literals (should be simple to add)
  • good diagnostics
  • actual removal of code from lexer (giant scan_char_or_byte should go away completely)
  • performance check
  • general cleanup of the new code

Diagnostics will be the most labor-consuming bit here, but they are mostly a question of just correctly adjusting spans to sub-tokens. The current setup for diagnostics is that unescape produces a plain old enum with various problems, and they are rendered into Handler separately. This bit is not actually required (it is possible to just pass the Handler in), but I like the separation between diagnostics and logic this approach imposes, and such separation should again be useful for #59706

cc @eddyb , @petrochenkov

@rust-highfive
Copy link
Collaborator

r? @eddyb

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 25, 2019
@petrochenkov petrochenkov self-assigned this Apr 25, 2019
@rust-highfive

This comment has been minimized.

@matklad matklad changed the title introduce unescape module WIP: introduce unescape module Apr 25, 2019
@rust-highfive

This comment has been minimized.

@rust-highfive

This comment has been minimized.

@petrochenkov
Copy link
Contributor

Meta: for reviewing convenience it's better to update UI test outputs and satisfy tidy to make CI green, even if the changes in test results are temporarily wrong / intended to disappear.
This way it's clear how exactly they are wrong and what still needs to be fixed.

@rust-highfive

This comment has been minimized.

@matklad
Copy link
Member Author

matklad commented Apr 28, 2019

Apparently, running ./x.py test src/test/ui --bless is not enough to make tests green :(

@matklad matklad force-pushed the one-escape branch 2 times, most recently from a39c555 to f91fbdc Compare April 28, 2019 20:20
@rust-highfive

This comment has been minimized.

@petrochenkov
Copy link
Contributor

Question: what happens if a literal is lexed, but never "parsed properly"?
For example, if it's passed to a macro that accepts tts and throws them away.
The errors for incorrect escapes, etc, should be reported in that case as well.

(P.S. I haven't reviewed everything yet, will continue tomorrow.)

@petrochenkov
Copy link
Contributor

r? @petrochenkov

@matklad
Copy link
Member Author

matklad commented Apr 29, 2019

Question: what happens if a literal is lexed, but never "parsed properly"?

Good question! Given that diag: Option<(Span, &Handler)> argument to char_lit function, I was under the impression that we always parse literals properly. Turns out that even today we don't do that, so existing code is buggy.

The following compiles, while it shouldn't (the 6F literal is out of range for char):

macro_rules! erase {
    ($($tt:tt)*) => {}
}

fn main() {
    erase! {
        '\u{FFFFFF}'
    }
}

playground

If we pursue the approach in this PR, then we should run unescape_* family of functions twice: once in the lexer, where we just report errors and disregard escaped characters, and once in the parser, where we do the opposite and ignore errors, but collect unescaped literals. That means that we will be able to remove that diag: Option argument (indeed, "optionally" reporting diagnostics seems like a sure way to have bugs)

@matklad
Copy link
Member Author

matklad commented Apr 29, 2019

Hm, or is the above example an expected behavior? We don't check ranges of integer literals, for example:

macro_rules! erase {
    ($($tt:tt)*) => {}
}

fn main() {
    erase!(999u8);
}

for chars, we do check that there are at most six hex digits in the lexer, but we only do precise check for range and surrogates in the parser, which seems somewhat arbitrary.

@rust-highfive

This comment has been minimized.

@eddyb
Copy link
Member

eddyb commented Apr 30, 2019

@matklad I think for out of range integer literals, we warn and truncate later in the compilation, but there's not much we can do for char (there is no meaningful truncation to speak of).

That said, maybe literals should be checked by AST -> HIR lowering, i.e. only unescaped once, to store both versions, or just the unescaped one, in HIR?

One notable exception to "AST literals can be opaque, only HIR needs unescaping" is string literals used in attributes (such as #[path = "foo\\bar\\baz.rs"] mod baz;).

@matklad
Copy link
Member Author

matklad commented Apr 30, 2019

That said, maybe literals should be checked by AST -> HIR lowering, i.e. only unescaped once, to store both versions, or just the unescaped one, in HIR?

I think we should do validation before macro expansion, for two reasons at least:

  • macro_rules that drop tokens can be used to skip AST -> HIR validation, and that seems bad
  • proc_macros should be able to assume that the tokens they get are well-formed

As for actual unescapeing, agree that, abstractly, it makes sense to do it as late as possible. Ideally (not necessary practically), running cargo check should not construct unescapped strings at all. However, changing the place where unescaping occurs is probably out of scope for this PR.

So my current plan is:

  1. run the unescape_* functions twice, once for validation, in lexer, once for actual unescaping, in the parser
  2. treat \u{fff_fff} as a lexical error (that will require lang team sign off and a crate run) (after thinking about this more, I come to the conclusion that chars and integers are really different: to validate ints, we need type inference results. chars we can validate lexically, and something like "hello \u{lone surrogate} world" just doesn't make sense to me)

@rust-highfive

This comment has been minimized.

@bors
Copy link
Contributor

bors commented May 3, 2019

⌛ Trying commit 1835cbe with merge bfdcf6d...

bors added a commit that referenced this pull request May 3, 2019
introduce unescape module

A WIP PR to gauge early feedback

Currently, we deal with escape sequences twice: once when we [lex](https://github.com/rust-lang/rust/blob/112f7e9ac564e2cfcfc13d599c8376a219fde1bc/src/libsyntax/parse/lexer/mod.rs#L928-L1065) a string, and a second time when we [unescape](https://github.com/rust-lang/rust/blob/112f7e9ac564e2cfcfc13d599c8376a219fde1bc/src/libsyntax/parse/mod.rs#L313-L366) literals. Note that we also produce different sets of diagnostics in these two cases.

This PR aims to remove this duplication, by introducing a new `unescape` module as a single source of truth for character escaping rules.

I think this would be a useful cleanup by itself, but I also need this for #59706.

In the current state, the PR has `unescape` module which fully (modulo bugs) deals with string and char literals. I am quite happy about the state of this module

What this PR doesn't have yet are:
* [x] handling of byte and byte string literals (should be simple to add)
* [x] good diagnostics
* [x] actual removal of code from lexer (giant `scan_char_or_byte` should go away completely)
* [ ] performance check
* [x] general cleanup of the new code

Diagnostics will be the most labor-consuming bit here, but they are mostly a question of just correctly adjusting spans to sub-tokens. The current setup for diagnostics is that `unescape` produces a plain old `enum` with various problems, and they are rendered into `Handler` separately. This bit is not actually required (it is possible to just pass the `Handler` in), but I like the separation between diagnostics and logic this approach imposes, and such separation should again be useful for #59706

cc @eddyb , @petrochenkov
@bors
Copy link
Contributor

bors commented May 3, 2019

☀️ Try build successful - checks-travis
Build commit: bfdcf6d

@matklad
Copy link
Member Author

matklad commented May 3, 2019

This probably should be tagged with Breaking Change and Waiting on Crater presumably?

@petrochenkov
Copy link
Contributor

@craterbot run mode=check-only

@craterbot
Copy link
Collaborator

👌 Experiment pr-60261 created and queued.
🤖 Automatically detected try build bfdcf6d
🔍 You can check out the queue and this experiment's details.

ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

@craterbot craterbot added S-waiting-on-crater Status: Waiting on a crater run to be completed. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels May 3, 2019
@craterbot
Copy link
Collaborator

🚧 Experiment pr-60261 is now running on agent aws-2.

ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

@petrochenkov
Copy link
Contributor

@rust-timer build bfdcf6d

@rust-timer
Copy link
Collaborator

Success: Queued bfdcf6d with parent 1891bfa, comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit bfdcf6d

@matklad
Copy link
Member Author

matklad commented May 3, 2019

Looks like there are no significant perf differences, let's wait what crater says

@craterbot
Copy link
Collaborator

🎉 Experiment pr-60261 is completed!
📊 0 regressed and 0 fixed (60951 total)
📰 Open the full report.

⚠️ If you notice any spurious failure please add them to the blacklist!
ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

@craterbot craterbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-crater Status: Waiting on a crater run to be completed. labels May 5, 2019
@petrochenkov
Copy link
Contributor

@bors r+

@bors
Copy link
Contributor

bors commented May 5, 2019

📌 Commit 1835cbe has been approved by petrochenkov

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 5, 2019
@bors
Copy link
Contributor

bors commented May 6, 2019

⌛ Testing commit 1835cbe with merge 46d0ca0...

bors added a commit that referenced this pull request May 6, 2019
introduce unescape module

A WIP PR to gauge early feedback

Currently, we deal with escape sequences twice: once when we [lex](https://github.com/rust-lang/rust/blob/112f7e9ac564e2cfcfc13d599c8376a219fde1bc/src/libsyntax/parse/lexer/mod.rs#L928-L1065) a string, and a second time when we [unescape](https://github.com/rust-lang/rust/blob/112f7e9ac564e2cfcfc13d599c8376a219fde1bc/src/libsyntax/parse/mod.rs#L313-L366) literals. Note that we also produce different sets of diagnostics in these two cases.

This PR aims to remove this duplication, by introducing a new `unescape` module as a single source of truth for character escaping rules.

I think this would be a useful cleanup by itself, but I also need this for #59706.

In the current state, the PR has `unescape` module which fully (modulo bugs) deals with string and char literals. I am quite happy about the state of this module

What this PR doesn't have yet are:
* [x] handling of byte and byte string literals (should be simple to add)
* [x] good diagnostics
* [x] actual removal of code from lexer (giant `scan_char_or_byte` should go away completely)
* [x] performance check
* [x] general cleanup of the new code

Diagnostics will be the most labor-consuming bit here, but they are mostly a question of just correctly adjusting spans to sub-tokens. The current setup for diagnostics is that `unescape` produces a plain old `enum` with various problems, and they are rendered into `Handler` separately. This bit is not actually required (it is possible to just pass the `Handler` in), but I like the separation between diagnostics and logic this approach imposes, and such separation should again be useful for #59706

cc @eddyb , @petrochenkov
@bors
Copy link
Contributor

bors commented May 6, 2019

☀️ Test successful - checks-travis, status-appveyor
Approved by: petrochenkov
Pushing 46d0ca0 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label May 6, 2019
@bors bors merged commit 1835cbe into rust-lang:master May 6, 2019
@matklad matklad deleted the one-escape branch May 6, 2019 07:24
@matklad
Copy link
Member Author

matklad commented May 7, 2019

FWIW, this is now used by rust-analyzer: rust-lang/rust-analyzer#1253

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants